CN113760854A - Method for identifying data in HDFS memory and related equipment - Google Patents


Publication number
CN113760854A
Authority
CN
China
Prior art keywords
memory
metadata
data
hdfs
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111063577.3A
Other languages
Chinese (zh)
Inventor
梁海昆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingsoft Cloud Network Technology Co Ltd
Original Assignee
Beijing Kingsoft Cloud Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kingsoft Cloud Network Technology Co Ltd
Priority to CN202111063577.3A
Publication of CN113760854A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0608 Saving storage space on storage systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061 Improving I/O performance
    • G06F3/0611 Improving I/O performance in relation to response time
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0646 Horizontal data movement in storage systems, i.e. moving data in between storage devices or systems
    • G06F3/0652 Erasing, e.g. deleting, data cleaning, moving of data to a wastebasket
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]

Abstract

The application relates to a method for identifying data in HDFS memory and related equipment, applied in the technical field of data processing. The method comprises the following steps: acquiring the respective access time of at least one piece of metadata in the memory of a name node (NN) in the HDFS; acquiring the current time at which the memory of the name node is running; calculating a time difference value between the access time and the current time; and determining the metadata whose time difference value is larger than a first preset value to be cold data. The method and the equipment address the problem in the prior art that, because the memory capacity of the NN is limited, NN memory consumption grows as the number of directories and files in the HDFS increases, which reduces the available memory capacity of the NN and slows down the system.

Description

Method for identifying data in HDFS memory and related equipment
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method for identifying data in an HDFS memory and a related device.
Background
Hadoop is a distributed cluster computing project led by the Apache Software Foundation and mainly comprises two core modules: the Map/Reduce programming model and HDFS (Hadoop Distributed File System). HDFS achieves high data availability, cluster scalability, and high-speed reading and writing of data mainly through mechanisms such as multi-replica storage of file data blocks and heartbeats. Because of these characteristics, many enterprises currently choose to build cloud storage on HDFS.
An HDFS cluster has two types of nodes and operates in manager-worker mode, i.e., one NameNode (manager) and multiple DataNodes (workers). The NameNode (NN) is mainly responsible for managing the HDFS file system, and the DataNode (DN) is mainly used for storing data files.
In the related art, HDFS is often used as a data storage system, and the metadata of the stored data is indexed and recorded in the NN memory. Because the memory capacity of the NN is limited, NN memory consumption grows as the number of directories and files in the HDFS increases, which reduces the available memory capacity of the NN and slows down the system.
Disclosure of Invention
The application provides a method for identifying data in an HDFS memory and related equipment, which are used for solving the problem that in the prior art, as the capacity of the NN memory is often limited, the capacity of the NN memory is consumed more and more along with the increase of directories and files in the HDFS, the available memory capacity of the NN is reduced, and therefore the running speed of a system is reduced.
In a first aspect, an embodiment of the present application provides a method for identifying data in an HDFS memory, including:
acquiring the respective access time of at least one metadata in the memory of a name node in the HDFS;
acquiring the current time of memory operation of the name node;
calculating a time difference between the access time and the current time;
and determining the metadata corresponding to the access time with the time difference value larger than a first preset value as cold data.
Optionally, after determining that the metadata with the time difference value larger than the first preset value is cold data, the method further includes:
acquiring data elements of the cold data, wherein the data elements comprise the time difference value;
determining a target processing unit of the cold data corresponding to the data element;
and sending the cold data to the target processing unit so as to process the cold data through the target processing unit.
Optionally, the data element further includes a revisitation tendency degree, and the revisitation tendency degree indicates a possibility that the cold data is revisited;
the target processing unit for determining the cold data corresponding to the data element includes:
and if the time difference is greater than a second preset value, or the revisit tendency degree is smaller than a preset tendency degree, determining that the target processing unit is a recycle bin, wherein the second preset value is greater than the first preset value.
Optionally, the obtaining the access time of each of at least one metadata in the memory of the name node in the HDFS includes:
and acquiring the respective access time of each metadata with the storage time greater than the preset time in the memory of the name node.
Optionally, the obtaining the access time of each of at least one metadata in the memory of the name node in the HDFS includes:
and acquiring the respective access time of each metadata which does not carry a specific identifier in the memory of the name node, wherein the specific identifier is an identifier indicating that the metadata meets a preset condition.
Optionally, the determining that the metadata with the time difference value larger than the first preset value is cold data includes:
recording the times that the time difference between the access time of the metadata carrying the specific identification and the current time of the current identification process is greater than the first preset value in the identification process of each data, wherein the current identification process is any one of the identification processes of the data;
and when the times are more than the preset times, determining that the metadata carrying the specific identification is cold data.
Optionally, before obtaining the access time of each of at least one metadata in the memory of the name node in the HDFS, the method further includes:
monitoring the amount of free memory space of the name node;
and when the amount of free memory space is less than a preset memory space amount, performing, at each preset time interval, the step of acquiring the respective access time of each metadata in the memory of the name node.
Optionally, the obtaining the access time of each of at least one metadata in the memory of the name node in the HDFS includes:
scanning a metadata access record table, wherein data identification of metadata in the metadata access record table and access time of the metadata are correspondingly stored;
and extracting the access time from the metadata access record table.
In a second aspect, an embodiment of the present application provides an apparatus for identifying data in an HDFS memory, including:
the first acquisition module is used for acquiring the respective access time of at least one metadata in the memory of a name node in the HDFS;
the second acquisition module is used for acquiring the current time of the memory operation of the name node;
the calculation module is used for calculating a time difference value between the access time and the current time;
and the determining module is used for determining that the metadata corresponding to the access time with the time difference value larger than a first preset value is cold data.
In a third aspect, an embodiment of the present application provides an electronic device, including: the system comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory are communicated with each other through the communication bus;
the memory for storing a computer program;
the processor is configured to execute the program stored in the memory, and implement the method for identifying data in the HDFS memory according to the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the method for identifying data in an HDFS memory according to the first aspect is implemented.
Compared with the prior art, the technical scheme provided by the embodiment of the application has the following advantages: according to the method provided by the embodiment of the application, the respective access time of at least one metadata in the memory of a name node in the HDFS is obtained; the current time of memory operation of the name node is acquired; a time difference value between the access time and the current time is calculated; and the metadata corresponding to the access time with the time difference value larger than the first preset value is determined as cold data. In this way, the metadata in the memory of the NameNode in the HDFS is differentiated: metadata whose last access lies further back from the current time is identified as cold data, which realizes the identification of metadata in the NameNode memory. The identified cold data can subsequently be removed, thereby increasing the available memory capacity and the operating rate.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.
Fig. 1 is an architecture diagram of a method for identifying data in an HDFS memory according to an embodiment of the present disclosure;
Fig. 2 is a flowchart of a method for identifying data in an HDFS memory according to an embodiment of the present disclosure;
Fig. 3 is a data transmission diagram in the method for identifying data in an HDFS memory according to an embodiment of the present disclosure;
Fig. 4 is a structural diagram of an apparatus for identifying data in an HDFS memory according to an embodiment of the present disclosure;
Fig. 5 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Before further detailed description of the embodiments of the present invention, terms and expressions referred to in the embodiments of the present invention are described, and the terms and expressions referred to in the embodiments of the present invention are applicable to the following explanations.
(1) HDFS: the Hadoop Distributed File System, a storage system commonly used for big data and deployed as a cluster consisting of a plurality of servers;
(2) hot data: data that is frequently accessed and viewed by users;
(3) cold data: data that is not frequently accessed;
(4) NameNode, NN for short: a host role in HDFS that is mainly responsible for managing the HDFS cluster, receiving client requests, and allocating storage nodes;
(5) DataNode, DN for short: a host role in HDFS that is mainly responsible for storing data and receiving instructions from the NameNode;
(6) metadata: maintained by the NameNode, it records the basic information of the files stored in the current HDFS and is similar to a directory index of the HDFS; each directory or file is one piece of metadata, and the metadata mainly comprises the following 3 parts according to type:
1. attribute information of the files and directories themselves, such as file name, directory name, modification information, and the like;
2. information related to how file content is stored, such as storage block information, blocking conditions, number of copies, and the like;
3. information about the DataNodes of the HDFS, recorded for the management of the DataNodes.
According to an embodiment of the application, a method for identifying data in an HDFS memory is provided. Optionally, in this embodiment of the present application, the method for identifying data in the HDFS memory may be applied to a hardware environment formed by the terminal 101 and the server 102 as shown in fig. 1. As shown in fig. 1, a server 102 is connected to a terminal 101 through a network, which may be used to provide services (such as video services, application services, etc.) for the terminal or a client installed on the terminal, and a database may be provided on the server or separately from the server for providing data storage services for the server 102, and the network includes but is not limited to: the terminal 101 is not limited to a PC, a mobile phone, a tablet computer, and the like.
The method for identifying data in the HDFS memory according to the embodiment of the present application may be executed by the server 102, by the terminal 101, or by both the server 102 and the terminal 101. When the terminal 101 executes the method, the method may also be executed by a client installed on the terminal.
Taking the example that the terminal executes the method for identifying data in the HDFS memory according to the embodiment of the present application, the method may be applied to the terminal, fig. 2 is a schematic flow chart of an optional method for identifying data in the HDFS memory according to the embodiment of the present application, and as shown in fig. 2, the flow of the method may include the following steps:
Step 201, obtaining the respective access time of at least one metadata in the memory of the name node in the HDFS.
In some embodiments, when the HDFS is started, the NN and the DN are started respectively. After the DN is started, it registers with the NN; after the NameNode is started, the HDFS loads the metadata into the NameNode, so that the metadata is stored in the memory of the NN.
The access time of the metadata in the memory of the NN may be stored together with the corresponding metadata and updated each time the metadata is accessed. Alternatively, the access time may be kept in a separate file and read from that file when it is needed.
In an optional embodiment, obtaining the respective access time of at least one metadata in the memory of the name node in the HDFS specifically includes:
scanning a metadata access record table, wherein the data identification of the metadata in the metadata access record table and the access time of the metadata are correspondingly stored; and extracting the access time from the metadata access record table.
In some embodiments, the data identifier is a unique identifier used for representing metadata identity information, the access time and the data identifier are stored correspondingly, and when the access time of the metadata is obtained, the access time can be obtained by looking up in a metadata access record table through the data identifier of the metadata.
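The record table described above can be illustrated with a minimal Python sketch. The names `AccessRecord`, `scan_access_times`, and `touch`, as well as the in-memory list used as the table, are assumptions made for illustration; the application does not specify how the table is stored.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class AccessRecord:
    data_id: str           # unique identifier of one piece of metadata
    access_time: datetime  # last time that metadata was accessed


def scan_access_times(record_table):
    """Scan the whole table and return a mapping data_id -> access_time."""
    return {rec.data_id: rec.access_time for rec in record_table}


def touch(record_table, data_id, when):
    """Update the stored access time of one entry; return False if it is absent."""
    for rec in record_table:
        if rec.data_id == data_id:
            rec.access_time = when
            return True
    return False


# Hypothetical table with two entries.
table = [
    AccessRecord("inode-1", datetime(2021, 1, 5)),
    AccessRecord("inode-2", datetime(2021, 8, 30)),
]
touch(table, "inode-1", datetime(2021, 9, 1))
times = scan_access_times(table)
```

A linear scan keeps the sketch short; a real table keyed by the data identifier would allow direct lookup.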
In order to find cold data in the metadata in time, the above acquisition of the metadata access time may be performed once every preset time interval, and the metadata is then identified after the access time is obtained. The preset time interval may be set according to actual conditions, for example 1 hour, a day, a week, or a month.
In an alternative embodiment, since the data in the memory changes quickly, the metadata in the NN memory is constantly updated. When determining whether metadata is cold data, newly stored metadata may simply not have been accessed again yet, and determining such metadata to be cold data would cause it to be processed unnecessarily.
Therefore, when the metadata in the memory of the NN is obtained, only the access time of each metadata whose storage time in the memory of the NameNode is longer than a preset time is obtained. Metadata that has been stored in the NN memory for a shorter time is thus filtered out and excluded from cold data identification, which reduces the amount of data to be processed and improves the efficiency of cold data identification. The preset time can be set according to actual conditions, for example 12 or 24 hours.
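As a rough illustration of this pre-filter, the following Python sketch keeps only metadata that has been in memory longer than the preset time. The function name `candidates_for_scan` and the `stored_at` mapping are hypothetical; the 24-hour value is one of the examples given above.

```python
from datetime import datetime, timedelta


def candidates_for_scan(stored_at, now, min_age):
    """Return ids of metadata held in memory for longer than min_age."""
    return [meta_id for meta_id, t in stored_at.items() if now - t > min_age]


now = datetime(2021, 9, 10, 12, 0)
stored_at = {
    "a": now - timedelta(hours=2),   # newly stored, skipped
    "b": now - timedelta(hours=30),  # old enough to be examined
}
eligible = candidates_for_scan(stored_at, now, timedelta(hours=24))
```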
In another optional embodiment, a part of metadata in the metadata of the memory of the NN carries a specific identifier, where the specific identifier is an identifier indicating that the metadata meets a preset condition, for example, the metadata carrying the specific identifier needs to be retained in the memory of the NN to meet a specific requirement.
Therefore, the access time of the acquired at least one piece of metadata may be the access time of acquiring each piece of metadata that does not carry the specific identifier in the memory of the NameNode.
The specific identifier may also be a high-frequency identifier generated after the metadata is accessed for multiple times, and after the metadata carries the high-frequency identifier, the metadata is retained in the memory of the NN so as to avoid identifying the metadata as cold data. The high-frequency identification is configured identification after the access times of the metadata are larger than the preset times.
Further, in order to improve the identification precision of the cold data, for metadata carrying the specific identifier, the number of identification processes in which the time difference between its access time and the current time of that identification process is greater than the first preset value is recorded, where the current identification process is any one of the identification processes of the data; when this number is greater than a preset number of times, the metadata carrying the specific identifier is determined as cold data.
Thus, when metadata carries a specific identifier (e.g., a high-frequency identifier), it is recognized as cold data only after it has been found unaccessed in more than the preset number of identification processes. This prevents the earlier access frequency of metadata carrying the specific identifier from distorting subsequent cold data identification: the metadata is determined to be cold data only after it has repeatedly been found not to be accessed, which improves the identification precision.
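The counting rule for metadata carrying the specific identifier might look like the following Python sketch. The class `MarkedMetadataTracker` and its parameters are hypothetical, and the sketch assumes the counter is never reset, since the application does not say whether a new access clears it.

```python
from datetime import datetime, timedelta


class MarkedMetadataTracker:
    """Counts, per marked metadata id, the passes in which it was found idle."""

    def __init__(self, first_preset, preset_times):
        self.first_preset = first_preset  # idle threshold (timedelta)
        self.preset_times = preset_times  # idle passes tolerated before "cold"
        self.idle_counts = {}

    def observe(self, meta_id, access_time, now):
        """Run one identification pass; return True once meta_id counts as cold."""
        if now - access_time > self.first_preset:
            self.idle_counts[meta_id] = self.idle_counts.get(meta_id, 0) + 1
        return self.idle_counts.get(meta_id, 0) > self.preset_times


tracker = MarkedMetadataTracker(timedelta(days=30), preset_times=2)
last_access = datetime(2021, 1, 1)
# Three monthly passes over the same, never re-accessed, marked metadata.
results = [
    tracker.observe("hot-file", last_access, datetime(2021, month, 1))
    for month in (3, 4, 5)
]
```

With a preset number of 2, the first two idle passes are tolerated and the third one marks the metadata as cold.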
Step 202, obtaining the current time of the memory operation of the name node.
In some embodiments, the current time is the time at which cold data was identified. In practical application, the memory of the name node runs in real time, the metadata is accessed in real time, and the current running time of the memory of the name node can be obtained from a time system of a running terminal.
Step 203, calculating a time difference value between the access time and the current time.
In some embodiments, after the access time of the metadata and the current time are obtained, the access time may be subtracted from the current time to obtain a time difference. The time difference value represents a length of time that the metadata has not been accessed.
Step 204, determining the metadata corresponding to the access time with the time difference value larger than the first preset value as cold data.
In some embodiments, when the difference between the access time of the metadata and the current time is greater than a first preset value, it indicates that the metadata has not been accessed for a long time, and therefore, such metadata is identified as cold data.
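Steps 201 to 204 can be summarized in a minimal Python sketch. The function name `identify_cold_metadata`, the dictionary layout, and the 30-day threshold are assumptions made for illustration and are not taken from the application.

```python
from datetime import datetime, timedelta


def identify_cold_metadata(access_times, now, first_preset):
    """Return ids of metadata whose idle time exceeds first_preset.

    access_times: dict mapping a metadata id to its last access time (step 201).
    now: current time of the name node (step 202).
    first_preset: hypothetical threshold, given as a timedelta.
    """
    cold = []
    for meta_id, accessed in access_times.items():
        idle = now - accessed        # step 203: time difference
        if idle > first_preset:      # step 204: compare with the first preset value
            cold.append(meta_id)
    return cold


now = datetime(2021, 9, 10, 12, 0, 0)
access_times = {
    "/logs/2020": now - timedelta(days=200),
    "/user/tmp": now - timedelta(hours=3),
}
cold_ids = identify_cold_metadata(access_times, now, timedelta(days=30))
```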
Further, in an optional embodiment, after determining that the metadata with the time difference value greater than the first preset value is cold data, the method further includes:
acquiring data elements of cold data, wherein the data elements comprise time difference values; determining a target processing unit of cold data corresponding to the data element; the cold data is sent to the target processing unit.
In some embodiments, to increase the space capacity in the memory, after the cold data in the memory of the NN is identified, the target processing unit corresponding to different cold data may be determined according to the data elements, so that the cold data is processed by the target processing unit.
The target processing unit may be a processing unit that stores the cold data to an external storage unit and deletes the cold data from the NN memory.
In some embodiments, deleting the cold data from the NameNode cleans up part of the NameNode memory, and the cold data no longer needs to be loaded into memory at startup, which improves the startup speed of the NameNode. It also reduces the memory occupied by cold data and improves the response speed of the NameNode: the metadata information corresponding to the cold data is dynamically removed from the NN memory, which reduces the memory pressure on the NN. In addition, storing the cold data of the NameNode in the external storage unit avoids losing that cold data when it is deleted from memory.
The external storage unit may be, but is not limited to, an external file or an external database.
In an optional embodiment, the data element further comprises a revisitation tendency degree, wherein the revisitation tendency degree indicates the possibility that the cold data is revisited; determining the target processing unit of the cold data corresponding to the data element comprises:
and if the time difference is greater than a second preset value, or the revisit tendency degree is smaller than the preset tendency degree, determining that the target processing unit is a recycle bin, wherein the second preset value is greater than the first preset value.
In some embodiments, if the time for which the metadata has not been accessed exceeds the second preset value, the probability that the metadata will be accessed is low; the metadata is therefore sent to the recycle bin to be deleted, so that it does not occupy memory elsewhere. Likewise, when the revisitation tendency degree of the metadata is smaller than the preset tendency degree, the probability that the metadata will be accessed again is small, and the recycle bin is determined as the target processing unit.
The second preset value may be, but is not limited to, 6 months, and the preset tendency degree may be any value less than 10%.
The revisitation tendency degree is positively correlated with the number of times of accessing the metadata, and the smaller the number of times of accessing, the lower the revisitation tendency degree.
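The routing decision can be sketched in Python as follows. The constants, the function name `choose_target_unit`, the 180-day approximation of 6 months, and the 5% tendency threshold are illustrative assumptions; sending all remaining cold data to an external store is also an assumption, based on the external storage unit mentioned earlier.

```python
from datetime import timedelta

RECYCLE_BIN = "recycle_bin"        # hypothetical unit names
EXTERNAL_STORE = "external_store"


def choose_target_unit(idle, revisit_tendency,
                       second_preset=timedelta(days=180),
                       preset_tendency=0.05):
    """Pick the processing unit for one piece of cold data.

    idle: time difference of the cold data (timedelta).
    revisit_tendency: estimated chance of being revisited, in [0, 1].
    """
    if idle > second_preset or revisit_tendency < preset_tendency:
        return RECYCLE_BIN
    # Assumption: everything else is offloaded rather than deleted.
    return EXTERNAL_STORE


long_idle = choose_target_unit(timedelta(days=200), 0.5)
unlikely = choose_target_unit(timedelta(days=40), 0.01)
offloaded = choose_target_unit(timedelta(days=40), 0.5)
```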
In an optional embodiment, after storing the cold data in a preset external storage unit and deleting the cold data in the NameNode, the method further includes:
acquiring a data query request; judging whether target metadata corresponding to the data query request exists in the NameNode; if it exists, returning the storage data corresponding to the target metadata; if it does not exist, obtaining the storage data corresponding to the target metadata based on the preset external storage unit, and processing the cold data through the target processing unit.
In some embodiments, since the cold data has been deleted from the NN and stored in the external storage unit as described above, a data query request that targets cold data cannot be answered from the NN alone. Therefore, after the data query request is acquired, whether target metadata corresponding to the data query request exists in the NameNode is judged. If the target metadata exists in the NN, the storage data corresponding to the target metadata is returned directly; if it does not exist, the target metadata is cold data that has been stored in the external storage unit, so the storage data corresponding to the target metadata can be obtained based on the preset external storage unit.
Specifically, when the target metadata is stored in the NN, the NN sends a query command to the DN, so that the DN queries, in its data storage unit, the target raw data corresponding to the target metadata, thereby returning the target raw data.
Further, obtaining storage data corresponding to the target metadata based on a preset external storage unit includes:
judging whether a preset external storage unit has target metadata corresponding to the data query request or not; and if the preset external storage unit comprises the target metadata, loading the target metadata into the NameNode, and returning the storage data corresponding to the target metadata.
In some embodiments, when the target metadata is not in the NN, the NN may send a data query request to a preset external storage unit, query in the preset external storage unit whether target metadata corresponding to the data query request exists, and load the target metadata in the preset external storage unit into the NN, and further, the NN may query the target metadata, so that the storage data corresponding to the target metadata may be returned. Therefore, when the cold data is required to be inquired, the metadata is automatically supplemented back to the NN again, and the metadata is guaranteed not to be lost.
Fig. 3 is a specific process for obtaining stored data provided in the embodiment of the present application, and referring to fig. 3, the HDFS includes a plurality of DataNode nodes and a NameNode, where the DataNode stores data in the HDFS, and the NameNode is used for storing metadata. The DataNode registers to the NN, the NN stores cold data in the metadata into a preset external storage unit, and deletes the cold data. After a data query request is acquired, corresponding target metadata are queried in the NN, if the NN does not exist, the corresponding target metadata are acquired from a preset external storage unit and are reloaded to the NN, and if the NN exists, corresponding storage data are acquired from the DN based on the target metadata.
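The query path with fallback to the external storage unit might be sketched in Python as below. The class `NameNodeSketch` and the use of plain dictionaries for the NN memory and the external store are assumptions; the sketch also assumes that the external copy is kept after reloading, which the application does not specify.

```python
class NameNodeSketch:
    """Toy model: NN memory plus an external store for offloaded cold metadata."""

    def __init__(self, external_store):
        self.memory = {}                      # path -> metadata held in NN memory
        self.external_store = external_store  # path -> offloaded cold metadata

    def evict(self, path):
        """Store cold metadata externally, then delete it from NN memory."""
        self.external_store[path] = self.memory.pop(path)

    def lookup(self, path):
        """Return metadata for path, reloading it from the external store if needed."""
        if path in self.memory:
            return self.memory[path]
        if path in self.external_store:
            # Reload into NN memory so later queries are served from memory.
            self.memory[path] = self.external_store[path]
            return self.memory[path]
        return None  # unknown to both the NN and the external store


nn = NameNodeSketch(external_store={})
nn.memory["/data/old.csv"] = {"blocks": ["blk_1"]}
nn.evict("/data/old.csv")
evicted = "/data/old.csv" not in nn.memory
found = nn.lookup("/data/old.csv")
```

In the full flow, the metadata returned by `lookup` would then be used to fetch the stored data from the DN.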
In an optional embodiment, before obtaining the respective access time of at least one metadata in the memory of the name node in the HDFS, the method further includes:
monitoring the amount of free memory space of the NameNode; and when the amount of free memory space is less than a preset space storage amount, performing the step of acquiring the respective access time of at least one metadata in the memory of the name node in the HDFS.
In some embodiments, when the NN has ample free memory, the operating rate of the memory is barely affected, so identification is unnecessary. To reduce the amount of computation performed in the NN's memory, cold data in the metadata is identified only when the amount of free memory space of the NN is smaller than the preset space storage amount.
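The gating condition described above can be sketched as follows. This is an illustrative assumption rather than the application's code: `should_scan`, `maybe_identify`, and the byte-based units are hypothetical, and the `identify` callback stands for whatever routine performs the cold data identification.

```python
from typing import Callable, List


def should_scan(free_bytes: int, preset_free_bytes: int) -> bool:
    """True only when free NN memory has dropped below the preset amount."""
    return free_bytes < preset_free_bytes


def maybe_identify(
    free_bytes: int,
    preset_free_bytes: int,
    identify: Callable[[], List[str]],
) -> List[str]:
    """Run the (hypothetical) identification routine only under memory pressure."""
    if should_scan(free_bytes, preset_free_bytes):
        return identify()
    return []  # enough free memory: skip the scan and save computation
```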
In the present application, the metadata in the memory of the NameNode in the HDFS is differentiated: metadata whose access time lies far from the current time is identified as cold data, so that the metadata in the NameNode's memory is classified. The identified cold data can subsequently be removed, which increases the available memory capacity and the operating rate. The metadata information corresponding to cold data in the HDFS is automatically removed from the NN memory, which releases NN memory and improves its utilization; meanwhile, when a user actually needs to access the cold data, the NN can reload the metadata back into memory, so that dynamic optimization of the memory is achieved.
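The removal step can be sketched in Python as below, assuming that both the NN memory and the external storage unit are modeled as dictionaries. The function name `evict_cold` and the persist-before-delete ordering are illustrative assumptions; the description does not prescribe them.

```python
from typing import Dict, Iterable


def evict_cold(
    memory: Dict[str, dict],
    external: Dict[str, dict],
    cold_keys: Iterable[str],
) -> int:
    """Move identified cold metadata out of NN memory; return how many were moved."""
    evicted = 0
    for key in list(cold_keys):
        meta = memory.get(key)
        if meta is None:
            continue                 # already evicted or never loaded
        external[key] = meta         # persist first so the metadata is not lost
        del memory[key]              # then release the NN memory
        evicted += 1
    return evicted
```

Writing to the external store before deleting from memory is one way to honor the stated goal that metadata is never lost, since a failure between the two steps leaves a duplicate rather than a gap.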
Based on the same concept, an embodiment of the present application provides an apparatus for identifying data in an HDFS memory. For the specific implementation of the apparatus, reference may be made to the description of the method embodiment, and repeated details are not described again. As shown in fig. 4, the apparatus mainly includes:
a first obtaining module 401, configured to obtain access time of at least one metadata in a memory of a name node in the HDFS;
a second obtaining module 402, configured to obtain a current time of memory operation of the name node;
a calculating module 403, configured to calculate a time difference between the access time and the current time;
a determining module 404, configured to determine that the metadata corresponding to the access time with the time difference being greater than the first preset value is cold data.
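The four modules above correspond to four steps that can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function `identify_cold`, the mapping from a metadata identifier to its last access time in seconds, and the parameter name `first_preset` are hypothetical.

```python
import time
from typing import Dict, List, Optional


def identify_cold(
    access_times: Dict[str, float],
    first_preset: float,
    now: Optional[float] = None,
) -> List[str]:
    """Return identifiers of metadata whose idle time exceeds the first preset value."""
    # Second obtaining module: current time of the name node's memory operation.
    if now is None:
        now = time.time()
    cold = []
    # First obtaining module: iterate over the access time of each metadata.
    for key, accessed_at in access_times.items():
        # Calculating module: time difference between access time and current time.
        diff = now - accessed_at
        # Determining module: strictly greater than the first preset value means cold.
        if diff > first_preset:
            cold.append(key)
    return cold
```

For example, with `now=1000.0` and a first preset value of 60 seconds, an entry last accessed at 100.0 is reported as cold, while an entry accessed at 990.0 is not.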
Based on the same concept, an embodiment of the present application further provides an electronic device, as shown in fig. 5, the electronic device mainly includes: a processor 501, a memory 502 and a communication bus 503, wherein the processor 501 and the memory 502 communicate with each other through the communication bus 503. The memory 502 stores a program executable by the processor 501, and the processor 501 executes the program stored in the memory 502, so as to implement the following steps:
acquiring the respective access time of at least one metadata in the memory of a name node in the HDFS;
acquiring the current time of memory operation of the name node;
calculating a time difference value between the access time and the current time;
and determining the metadata corresponding to the access time with the time difference value larger than the first preset value as cold data.
The communication bus 503 mentioned in the above electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus 503 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 5, but this is not intended to represent only one bus or type of bus.
The memory 502 may include a Random Access Memory (RAM) or a non-volatile memory, such as at least one disk memory. Alternatively, the memory may be at least one memory device located remotely from the aforementioned processor 501.
The processor 501 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), etc., and may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), or other programmable logic devices, discrete gates or transistor logic devices, and discrete hardware components.
In still another embodiment of the present application, there is also provided a computer-readable storage medium having a computer program stored therein, which, when run on a computer, causes the computer to execute the method for identifying data in an HDFS memory described in the above embodiments.
The above embodiments may be implemented wholly or partially in software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or another programmable device. The computer instructions may be stored on a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available media may be magnetic media (e.g., floppy disks, hard disks, tapes, etc.), optical media (e.g., DVDs), or semiconductor media (e.g., solid state drives), among others.
It is noted that, in this document, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present invention, which enable those skilled in the art to understand or practice the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (11)

1. A method for identifying data in an HDFS memory is characterized by comprising the following steps:
acquiring the respective access time of at least one metadata in the memory of a name node in the HDFS;
acquiring the current time of memory operation of the name node;
calculating a time difference between the access time and the current time;
and determining the metadata corresponding to the access time with the time difference value larger than a first preset value as cold data.
2. The method according to claim 1, wherein after determining that the metadata with the time difference value greater than the first preset value is cold data, the method further comprises:
acquiring data elements of the cold data, wherein the data elements comprise the time difference value;
determining a target processing unit of the cold data corresponding to the data element;
and sending the cold data to the target processing unit so as to process the cold data through the target processing unit.
3. The HDFS in-memory data identification method according to claim 2, wherein the data elements further include a revisitation propensity degree indicating a likelihood that the cold data is revisited;
the determining of a target processing unit of the cold data corresponding to the data element comprises:
and if the time difference is greater than a second preset value, or the revisit tendency degree is smaller than a preset tendency degree, determining that the target processing unit is a recycle bin, wherein the second preset value is greater than the first preset value.
4. The method for identifying data in an HDFS memory according to any of claims 1 to 3, wherein the obtaining of the respective access time of at least one metadata in the memory of a name node in the HDFS comprises:
and acquiring the respective access time of each metadata with the storage time greater than the preset time in the memory of the name node.
5. The method for identifying data in an HDFS memory according to any of claims 1 to 3, wherein the obtaining of the respective access time of at least one metadata in the memory of a name node in the HDFS comprises:
and acquiring the respective access time of each metadata which does not carry a specific identifier in the memory of the name node, wherein the specific identifier is an identifier indicating that the metadata meets a preset condition.
6. The method according to claim 5, wherein the determining that the metadata with the time difference value greater than the first preset value is cold data comprises:
recording, across the data identification processes, the number of times that the time difference between the access time of the metadata carrying the specific identifier and the current time of the current identification process is greater than the first preset value, wherein the current identification process is any one of the data identification processes;
and when the number of times is greater than a preset number of times, determining that the metadata carrying the specific identifier is cold data.
7. The method for identifying data in a memory of an HDFS according to any of claims 1 to 3, wherein before obtaining the respective access time of at least one metadata in the memory of a name node in the HDFS, the method further comprises:
monitoring the amount of free memory space of the name node;
and when the amount of free memory space is less than a preset space storage amount, performing, at every preset time interval, the acquiring of an access parameter value of each metadata in the memory of the name node.
8. The method for identifying data in an HDFS memory according to any of claims 1 to 3, wherein the obtaining of the respective access time of at least one metadata in the memory of a name node in the HDFS comprises:
scanning a metadata access record table, wherein data identification of metadata in the metadata access record table and access time of the metadata are correspondingly stored;
and extracting the access time from the metadata access record table.
9. An apparatus for identifying data in an HDFS memory, comprising:
the first acquisition module is used for acquiring the respective access time of at least one metadata in the memory of a name node in the HDFS;
the second acquisition module is used for acquiring the current time of the memory operation of the name node;
the calculation module is used for calculating a time difference value between the access time and the current time;
and the determining module is used for determining that the metadata corresponding to the access time with the time difference value larger than a first preset value is cold data.
10. An electronic device, comprising: the system comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory are communicated with each other through the communication bus;
the memory for storing a computer program;
the processor is configured to execute the program stored in the memory, and implement the method for identifying data in the HDFS memory according to any one of claims 1 to 8.
11. A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, carries out a method for identifying data in an HDFS memory according to any one of claims 1 to 8.
CN202111063577.3A 2021-09-10 2021-09-10 Method for identifying data in HDFS memory and related equipment Pending CN113760854A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111063577.3A CN113760854A (en) 2021-09-10 2021-09-10 Method for identifying data in HDFS memory and related equipment

Publications (1)

Publication Number Publication Date
CN113760854A true CN113760854A (en) 2021-12-07

Family

ID=78794832

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111063577.3A Pending CN113760854A (en) 2021-09-10 2021-09-10 Method for identifying data in HDFS memory and related equipment

Country Status (1)

Country Link
CN (1) CN113760854A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113760855A (en) * 2021-09-10 2021-12-07 北京金山云网络技术有限公司 Data storage method and device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130013561A1 (en) * 2011-07-08 2013-01-10 Microsoft Corporation Efficient metadata storage
CN107169056A (en) * 2017-04-27 2017-09-15 四川长虹电器股份有限公司 Distributed file system and the method for saving distributed file system memory space
CN107665224A (en) * 2016-07-29 2018-02-06 北京京东尚科信息技术有限公司 Scan the mthods, systems and devices of HDFS cold datas
CN108021585A (en) * 2016-10-28 2018-05-11 腾讯科技(深圳)有限公司 Distributed data storage method and device
CN112286459A (en) * 2020-10-29 2021-01-29 苏州浪潮智能科技有限公司 Data processing method, device, equipment and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
詹玲 (Zhan Ling) et al.: "基于Ceph 文件系统的元数据缓存备份" (Metadata Cache Backup Based on the Ceph File System), 《计算机工程》 (Computer Engineering), vol. 43, no. 4, 30 April 2017 (2017-04-30), pages 67 - 72 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination