CN113760854A - Method for identifying data in HDFS memory and related equipment - Google Patents
- Publication number: CN113760854A
- Application number: CN202111063577.3A
- Authority: CN (China)
- Prior art keywords: memory, metadata, data, hdfs, time
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0602—Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
- G06F3/0608—Saving storage space on storage systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0602—Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
- G06F3/061—Improving I/O performance
- G06F3/0611—Improving I/O performance in relation to response time
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0646—Horizontal data movement in storage systems, i.e. moving data in between storage devices or systems
- G06F3/0652—Erasing, e.g. deleting, data cleaning, moving of data to a wastebasket
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0668—Interfaces specially adapted for storage systems adopting a particular infrastructure
- G06F3/067—Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
Abstract
The application relates to a method for identifying data in an HDFS memory and related equipment, applied to the technical field of data processing. The method comprises the following steps: acquiring the respective access time of at least one metadata in the memory of a name node in the HDFS; acquiring the current time of memory operation of the name node; calculating a time difference value between the access time and the current time; and determining the metadata whose time difference value is larger than a first preset value as cold data. The method and the device address the problem in the prior art that, because the memory capacity of the NN is limited, the NN memory is increasingly consumed as the directories and files in the HDFS increase, which reduces the available memory capacity of the NN and slows down the system.
Description
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method for identifying data in an HDFS memory and a related device.
Background
Hadoop is a distributed cluster project led by the Apache Software Foundation and mainly comprises two core modules: the Map/Reduce programming model and HDFS (Hadoop Distributed File System). HDFS achieves high data availability, cluster scalability, high-speed data reading and writing, and the like through mechanisms such as multiple replicas of file data blocks and heartbeats. Owing to these characteristics, most enterprises currently choose to build cloud storage based on HDFS.
An HDFS cluster has two types of nodes and operates in manager-worker mode, i.e., one NameNode (manager) and multiple DataNodes (workers). The NameNode (NN) is mainly responsible for managing the HDFS file system, and the DataNode (DN) is mainly used for storing data files.
In the related art, HDFS is often used as a data storage system, and the metadata of the stored data is indexed and recorded in the NN memory. Because the memory capacity of the NN is limited, the NN memory is increasingly consumed as the number of directories and files in the HDFS grows, which reduces the available memory capacity of the NN and slows down the system.
Disclosure of Invention
The application provides a method for identifying data in an HDFS memory and related equipment, which are used for solving the problem in the prior art that, because the memory capacity of the NN is limited, the NN memory is increasingly consumed as the directories and files in the HDFS increase, which reduces the available memory capacity of the NN and slows down the system.
In a first aspect, an embodiment of the present application provides a method for identifying data in an HDFS memory, including:
acquiring the respective access time of at least one metadata in the memory of a name node in the HDFS;
acquiring the current time of memory operation of the name node;
calculating a time difference between the access time and the current time;
and determining the metadata corresponding to the access time with the time difference value larger than a first preset value as cold data.
Optionally, after determining that the metadata with the time difference value larger than the first preset value is cold data, the method further includes:
acquiring data elements of the cold data, wherein the data elements comprise the time difference value;
determining a target processing unit of the cold data corresponding to the data element;
and sending the cold data to the target processing unit so as to process the cold data through the target processing unit.
Optionally, the data element further includes a revisitation tendency degree, and the revisitation tendency degree indicates a possibility that the cold data is revisited;
the target processing unit for determining the cold data corresponding to the data element includes:
and if the time difference is greater than a second preset value, or the revisit tendency degree is smaller than a preset tendency degree, determining that the target processing unit is a recycle bin, wherein the second preset value is greater than the first preset value.
Optionally, the obtaining the access time of each of at least one metadata in the memory of the name node in the HDFS includes:
and acquiring the respective access time of each metadata with the storage time greater than the preset time in the memory of the name node.
Optionally, the obtaining the access time of each of at least one metadata in the memory of the name node in the HDFS includes:
and acquiring the respective access time of each metadata which does not carry a specific identifier in the memory of the name node, wherein the specific identifier is an identifier indicating that the metadata meets a preset condition.
Optionally, the determining that the metadata with the time difference value larger than the first preset value is cold data includes:
recording, across the data identification processes, the number of times that the time difference between the access time of the metadata carrying the specific identifier and the current time of the current identification process is greater than the first preset value, wherein the current identification process is any one of the data identification processes;
and when the number of times is greater than a preset number of times, determining that the metadata carrying the specific identifier is cold data.
Optionally, before obtaining the access time of each of at least one metadata in the memory of the name node in the HDFS, the method further includes:
monitoring the amount of free memory space of the name node;
and when the amount of free memory space is less than a preset amount of memory space, acquiring, at every preset time interval, the access parameter value of each metadata in the memory of the name node.
Optionally, the obtaining the access time of each of at least one metadata in the memory of the name node in the HDFS includes:
scanning a metadata access record table, wherein data identification of metadata in the metadata access record table and access time of the metadata are correspondingly stored;
and extracting the access time from the metadata access record table.
In a second aspect, an embodiment of the present application provides an apparatus for identifying data in an HDFS memory, including:
the first acquisition module is used for acquiring the respective access time of at least one metadata in the memory of a name node in the HDFS;
the second acquisition module is used for acquiring the current time of the memory operation of the name node;
the calculation module is used for calculating a time difference value between the access time and the current time;
and the determining module is used for determining that the metadata corresponding to the access time with the time difference value larger than a first preset value is cold data.
In a third aspect, an embodiment of the present application provides an electronic device, including: the system comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory are communicated with each other through the communication bus;
the memory for storing a computer program;
the processor is configured to execute the program stored in the memory, and implement the method for identifying data in the HDFS memory according to the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the method for identifying data in an HDFS memory according to the first aspect is implemented.
Compared with the prior art, the technical scheme provided by the embodiment of the application has the following advantages: according to the method provided by the embodiment of the application, the respective access time of at least one metadata in the memory of a name node in the HDFS is obtained; the current time of memory operation of the name node is acquired; a time difference value between the access time and the current time is calculated; and the metadata whose time difference value is larger than the first preset value is determined as cold data. In this way, the metadata in the memory of the NameNode in the HDFS is distinguished: metadata whose last access lies far from the current time is identified as cold data, which realizes the identification of metadata in the NameNode memory. The identified cold data can subsequently be removed, thereby increasing the available memory capacity and the operating rate.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.
Fig. 1 is an architecture diagram of a method for identifying data in an HDFS memory according to an embodiment of the present disclosure;
fig. 2 is a flowchart of a method for identifying data in an HDFS memory according to an embodiment of the present disclosure;
fig. 3 is a data transmission diagram in the method for identifying data in an HDFS memory according to an embodiment of the present disclosure;
fig. 4 is a structural diagram of an apparatus for identifying data in an HDFS memory according to an embodiment of the present disclosure;
fig. 5 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Before further detailed description of the embodiments of the present invention, terms and expressions referred to in the embodiments of the present invention are described, and the terms and expressions referred to in the embodiments of the present invention are applicable to the following explanations.
(1) HDFS, the Hadoop Distributed File System, a cluster consisting of a plurality of servers and a common storage system for big data;
(2) hot data, data that is frequently accessed and viewed by users;
(3) cold data, data that is not frequently accessed;
(4) NameNode, NN for short, a host role in HDFS, which is mainly responsible for the management of the HDFS cluster, receives client requests and allocates storage nodes;
(5) DataNode, DN for short, a host role in HDFS, which is mainly responsible for data storage and for receiving NameNode instructions;
(6) metadata, maintained by the NameNode, which records the basic information of the files stored in the current HDFS and is similar to a directory index of the HDFS; each directory or file is a piece of metadata, and the metadata mainly comprises the following 3 parts according to type:
1. attribute information of the file and the directory itself, such as a file name, a directory name, modification information, and the like;
2. information related to the storage of the file content, such as storage block information, blocking conditions, copy number, and the like;
3. recording the information of the DataNode of the HDFS for the management of the DataNode.
According to an embodiment of the application, a method for identifying data in an HDFS memory is provided. Optionally, in this embodiment of the present application, the method for identifying data in the HDFS memory may be applied to a hardware environment formed by the terminal 101 and the server 102 as shown in fig. 1. As shown in fig. 1, the server 102 is connected to the terminal 101 through a network and may be used to provide services (such as video services, application services, etc.) for the terminal or a client installed on the terminal. A database may be provided on the server or separately from the server to provide data storage services for the server 102. The network is not limited to any particular type, and the terminal 101 is not limited to a PC, a mobile phone, a tablet computer, and the like.
The method for identifying data in the HDFS memory according to the embodiment of the present application may be executed by the server 102, by the terminal 101, or by both the server 102 and the terminal 101. When the terminal 101 executes the method, the method may also be executed by a client installed on the terminal.
Taking execution by the terminal as an example, fig. 2 is a schematic flow chart of an optional method for identifying data in the HDFS memory according to the embodiment of the present application; as shown in fig. 2, the flow of the method may include the following steps:
In some embodiments, when the HDFS is started, the NN and the DN are started respectively; after the DN is started, it registers with the NN, and after the NameNode is started, the HDFS loads metadata into the NameNode, so that the metadata is stored in the memory of the NN.
The access time of the metadata in the memory of the NN may be stored together with the metadata and updated each time the metadata is accessed; alternatively, the access time may be kept in a separate file and read from that file when it is needed.
In an optional embodiment, obtaining the respective access time of at least one metadata in the memory of the name node in the HDFS specifically includes:
scanning a metadata access record table, wherein the data identification of the metadata in the metadata access record table and the access time of the metadata are correspondingly stored; and extracting the access time from the metadata access record table.
In some embodiments, the data identifier is a unique identifier used for representing metadata identity information, the access time and the data identifier are stored correspondingly, and when the access time of the metadata is obtained, the access time can be obtained by looking up in a metadata access record table through the data identifier of the metadata.
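The record table described above can be sketched as follows. This is a minimal illustration only: it assumes the table is an in-memory mapping from data identifier to last access time in epoch seconds, and the names `record_access` and `extract_access_times` are hypothetical, not taken from the application or from any HDFS API.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical record table: data identifier -> last access time (epoch seconds).
AccessTable = Dict[str, float]


def record_access(table: AccessTable, data_id: str, now: float) -> None:
    """Store (or overwrite) the access time under the metadata's identifier."""
    table[data_id] = now


def extract_access_times(
    table: AccessTable, data_ids: Optional[Iterable[str]] = None
) -> List[Tuple[str, float]]:
    """Scan the table and return (data identifier, access time) pairs.

    With no identifiers given, the whole table is scanned in identifier order;
    otherwise only the requested identifiers that exist in the table are returned.
    """
    if data_ids is None:
        return sorted(table.items())
    return [(data_id, table[data_id]) for data_id in data_ids if data_id in table]
```

Keying the table by data identifier lets a single metadata entry be looked up directly, while a full scan still yields every access time for a periodic identification pass.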
In order to find cold data in the metadata in time, the above acquisition of the metadata access time may be performed once every preset time interval, and the metadata is identified after the access time is obtained. The preset time interval may be set according to actual situations, for example, 1 hour, a day, a week or a month.
In an alternative embodiment, since the data in the memory changes quickly, the metadata in the NN memory is constantly updated. When determining whether metadata is cold data, newly stored metadata may simply not have been accessed again yet, and determining such metadata to be cold data would cause unnecessary processing of that metadata.
Therefore, when the metadata in the memory of the NN is obtained, only the access time of each metadata whose storage time in the NameNode memory is longer than the preset time is obtained. Metadata that has been stored in the memory of the NN for only a short time is thus filtered out and excluded from cold data identification, which reduces the amount of data to be processed and improves the efficiency of cold data identification. The preset time can be set according to actual conditions, for example, 12 or 24 hours.
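The storage-time filter can be sketched as below, under the assumption that each metadata entry carries the time it was stored in the NN memory. The `MetadataEntry` class and the `eligible_for_scan` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MetadataEntry:
    """Hypothetical in-memory metadata record (times in epoch seconds)."""
    data_id: str
    stored_at: float    # when the entry was placed in the NN memory
    accessed_at: float  # last access time


def eligible_for_scan(
    entries: List[MetadataEntry], now: float, preset_time: float
) -> List[MetadataEntry]:
    """Keep only entries whose storage time exceeds the preset time."""
    return [e for e in entries if now - e.stored_at > preset_time]
```

With a preset time of 12 hours, for example, an entry stored one hour ago is skipped by the identification pass even if it has not been accessed since it was stored.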
In another optional embodiment, a part of metadata in the metadata of the memory of the NN carries a specific identifier, where the specific identifier is an identifier indicating that the metadata meets a preset condition, for example, the metadata carrying the specific identifier needs to be retained in the memory of the NN to meet a specific requirement.
Therefore, acquiring the access time of the at least one piece of metadata may consist of acquiring the access time of each piece of metadata in the memory of the NameNode that does not carry the specific identifier.
The specific identifier may also be a high-frequency identifier generated after the metadata has been accessed multiple times. Metadata carrying the high-frequency identifier is retained in the memory of the NN so that it is not identified as cold data. The high-frequency identifier is configured once the number of accesses to the metadata exceeds a preset number of times.
Further, in order to improve the precision of cold data identification, the number of times that the time difference between the access time of the metadata carrying the specific identifier and the current time of the current identification process is greater than the first preset value is recorded across the data identification processes, where the current identification process is any one of the data identification processes; when this number of times is greater than the preset number of times, the metadata carrying the specific identifier is determined to be cold data.
Thus, when metadata carrying a specific identifier (e.g., a high-frequency identifier) is found not to have been accessed in multiple identification processes, it is identified as cold data once the number of such findings exceeds the preset number of times. This prevents the earlier access frequency of metadata carrying the specific identifier from distorting subsequent cold data identification; the metadata is determined to be cold data only after it has repeatedly been found unaccessed, which improves the identification precision.
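The counting rule for metadata carrying the specific identifier might look like the following sketch. The counter dictionary and the function name `update_miss_count` are assumptions for illustration; the application does not state whether the counter is reset when the metadata is accessed again, so this sketch does not reset it.

```python
from typing import Dict


def update_miss_count(
    miss_counts: Dict[str, int],
    data_id: str,
    accessed_at: float,
    now: float,
    first_preset: float,
    preset_times: int,
) -> bool:
    """Run one identification process for a flagged metadata entry.

    Increments the entry's counter when the time difference exceeds the first
    preset value, and returns True once the counter exceeds the preset number
    of times (i.e. the entry is now treated as cold data).
    """
    if now - accessed_at > first_preset:
        miss_counts[data_id] = miss_counts.get(data_id, 0) + 1
    return miss_counts.get(data_id, 0) > preset_times
```

Calling the function once per identification process means a flagged entry is demoted only after repeated passes find it unaccessed, rather than on the first pass.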
In some embodiments, the current time is the time at which cold data was identified. In practical application, the memory of the name node runs in real time, the metadata is accessed in real time, and the current running time of the memory of the name node can be obtained from a time system of a running terminal.
In some embodiments, after the access time of the metadata and the current time are obtained, the access time may be subtracted from the current time to obtain a time difference. The time difference value represents a length of time that the metadata has not been accessed.
Step 204, determining the metadata corresponding to the access time with the time difference value larger than the first preset value as cold data.
In some embodiments, when the difference between the access time of the metadata and the current time is greater than a first preset value, it indicates that the metadata has not been accessed for a long time, and therefore, such metadata is identified as cold data.
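The core identification flow (access time, current time, time difference, comparison with the first preset value) can be sketched as follows. The function name `identify_cold` and the use of epoch seconds are assumptions made for this illustration.

```python
from typing import Dict, List


def identify_cold(
    access_times: Dict[str, float], now: float, first_preset: float
) -> List[str]:
    """Return identifiers of metadata whose time difference exceeds the first preset value.

    The time difference is the current time minus the access time, i.e. the
    length of time for which the metadata has not been accessed.
    """
    return [
        data_id
        for data_id, accessed_at in access_times.items()
        if now - accessed_at > first_preset
    ]
```

The comparison is strict, matching the wording "larger than the first preset value": an entry whose time difference equals the preset value is not classified as cold.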
Further, in an optional embodiment, after determining that the metadata with the time difference value greater than the first preset value is cold data, the method further includes:
acquiring data elements of cold data, wherein the data elements comprise time difference values; determining a target processing unit of cold data corresponding to the data element; the cold data is sent to the target processing unit.
In some embodiments, to increase the space capacity in the memory, after the cold data in the memory of the NN is identified, the target processing unit corresponding to different cold data may be determined according to the data elements, so that the cold data is processed by the target processing unit.
Wherein the target processing unit may be a processing unit that stores the cold data to an external storage unit and deletes the cold data from the memory.
In some embodiments, deleting the cold data from the NameNode cleans up part of the NameNode memory, and the cold data no longer needs to be loaded into memory at startup, which improves the startup speed of the NameNode. Reducing the memory occupied by cold data also improves the response speed of the NameNode, and dynamically removing the metadata corresponding to the cold data from the NN memory reduces the memory pressure of the NN. In addition, storing the cold data in the external storage unit avoids losing that cold data when it is deleted from the memory.
The external storage unit may be, but is not limited to, an external file or an external database.
In an optional embodiment, the data element further comprises a revisitation tendency degree, which indicates the possibility that the cold data is revisited; determining the target processing unit of the cold data corresponding to the data element comprises:
and if the time difference is greater than a second preset value, or the revisit tendency degree is smaller than the preset tendency degree, determining that the target processing unit is a recycle bin, wherein the second preset value is greater than the first preset value.
In some embodiments, if the time for which the metadata has not been accessed exceeds the second preset value, the probability that the metadata will be accessed is low; the metadata is therefore sent to the recycle bin to be deleted, so that it does not occupy memory elsewhere. Likewise, when the revisitation tendency degree of the metadata is smaller than the preset tendency degree, the probability that the metadata will be accessed again is small.
The second preset value may be, but is not limited to, 6 months, and the preset tendency degree may be any value less than 10%.
The revisitation tendency degree is positively correlated with the number of times of accessing the metadata, and the smaller the number of times of accessing, the lower the revisitation tendency degree.
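The routing decision can be sketched as below. The recycle-bin condition follows the text; routing all remaining cold data to an external storage unit is an assumption of this sketch, as are the names `Target` and `choose_target`. The revisitation tendency degree is taken as an input, since the application only states that it is positively correlated with the number of accesses and gives no formula.

```python
from enum import Enum


class Target(Enum):
    """Hypothetical target processing units for cold data."""
    RECYCLE_BIN = "recycle_bin"
    EXTERNAL_STORE = "external_store"


def choose_target(
    time_diff: float,
    revisit_tendency: float,
    second_preset: float,
    preset_tendency: float,
) -> Target:
    """Pick the processing unit for one piece of cold data.

    Cold data goes to the recycle bin when it has been idle longer than the
    second preset value OR its revisitation tendency is below the preset
    tendency; otherwise (an assumption) it is archived to external storage.
    """
    if time_diff > second_preset or revisit_tendency < preset_tendency:
        return Target.RECYCLE_BIN
    return Target.EXTERNAL_STORE
```

Because the two conditions are joined by "or", a recently cooled entry with a very low revisitation tendency is still sent to the recycle bin.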
In an optional embodiment, after storing the cold data in a preset external storage unit and deleting the cold data in the NameNode, the method further includes:
acquiring a data query request; judging whether target metadata corresponding to the data query request exists in the NameNode; if yes, returning storage data corresponding to the target metadata; if not, obtaining storage data corresponding to the target metadata based on the preset external storage unit, and processing the cold data through the target processing unit.
In some embodiments, since the cold data is deleted from the NN and stored in the external storage unit in the above embodiments, a data query request that targets cold data may not find the corresponding metadata in the NN. Therefore, after the data query request is acquired, whether target metadata corresponding to the data query request exists in the NameNode is judged. If the target metadata exists in the NN, the storage data corresponding to the target metadata is returned directly; if the target metadata does not exist in the NN, the target metadata is cold data that has been stored in the external storage unit, so the storage data corresponding to the target metadata can be obtained based on the preset external storage unit.
Specifically, when the target metadata is stored in the NN, the NN sends a query command to the DN, so that the DN queries, in its data storage unit, the target raw data corresponding to the target metadata, thereby returning the target raw data.
Further, obtaining storage data corresponding to the target metadata based on a preset external storage unit includes:
judging whether a preset external storage unit has target metadata corresponding to the data query request or not; and if the preset external storage unit comprises the target metadata, loading the target metadata into the NameNode, and returning the storage data corresponding to the target metadata.
In some embodiments, when the target metadata is not in the NN, the NN may send the data query request to the preset external storage unit, query whether target metadata corresponding to the data query request exists there, and, if so, load the target metadata from the preset external storage unit into the NN. The NN can then find the target metadata and return the storage data corresponding to it. Therefore, when cold data needs to be queried, its metadata is automatically restored to the NN, which guarantees that the metadata is not lost.
Fig. 3 is a specific process for obtaining stored data provided in the embodiment of the present application, and referring to fig. 3, the HDFS includes a plurality of DataNode nodes and a NameNode, where the DataNode stores data in the HDFS, and the NameNode is used for storing metadata. The DataNode registers to the NN, the NN stores cold data in the metadata into a preset external storage unit, and deletes the cold data. After a data query request is acquired, corresponding target metadata are queried in the NN, if the NN does not exist, the corresponding target metadata are acquired from a preset external storage unit and are reloaded to the NN, and if the NN exists, corresponding storage data are acquired from the DN based on the target metadata.
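The query path with fallback to the external storage unit can be sketched as follows. Plain dictionaries stand in for the NN memory and the external storage unit, and the class and method names are hypothetical; whether the external copy is removed after reloading is not stated in the application, so this sketch keeps it.

```python
from typing import Dict, Optional


class NameNodeSketch:
    """Toy model of NN metadata lookup with an external cold-data store."""

    def __init__(self) -> None:
        self.memory: Dict[str, dict] = {}    # metadata held in NN memory
        self.external: Dict[str, dict] = {}  # cold metadata in external storage

    def evict_cold(self, data_id: str) -> None:
        """Store cold metadata externally, then delete it from NN memory."""
        self.external[data_id] = self.memory.pop(data_id)

    def lookup(self, data_id: str) -> Optional[dict]:
        """Return target metadata, reloading it from external storage if needed."""
        if data_id in self.memory:
            return self.memory[data_id]
        if data_id in self.external:
            # Reload the cold metadata into NN memory before answering.
            self.memory[data_id] = self.external[data_id]
            return self.memory[data_id]
        return None  # unknown to both the NN and the external storage unit
```

In a real cluster the returned metadata would then be used to fetch the stored data from the DataNodes; that step is omitted here.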
In an optional embodiment, before obtaining the respective access time of at least one metadata in the memory of the name node in the HDFS, the method further includes:
monitoring the amount of free memory space of the NameNode; and when the amount of free memory space is less than a preset space storage amount, executing the step of acquiring the respective access time of at least one metadata in the memory of the name node in the HDFS.
In some embodiments, when the NN has ample free memory, its operating rate is barely reduced, so identifying cold data at that point would only add computation. To reduce the amount of computation performed by the NN, cold data in the metadata is identified only when the amount of free memory space of the NN is smaller than the preset space storage amount.
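The trigger condition described above can be illustrated with a short Python sketch. The function name `maybe_identify_cold`, its parameters, and the use of byte counts are assumptions made for illustration only; the text does not specify how free memory is measured.

```python
from typing import Callable, List


def maybe_identify_cold(free_bytes: int,
                        preset_free_bytes: int,
                        identify: Callable[[], List[str]]) -> List[str]:
    """Run cold-data identification only when free memory is below the preset amount."""
    if free_bytes < preset_free_bytes:
        return identify()  # memory is tight: scan the metadata for cold entries
    return []              # enough free memory: skip the scan to save computation
```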
In the present application, the metadata in the memory of the NameNode in the HDFS is differentiated: metadata whose access time is far from the current time is identified as cold data, which realizes identification of the metadata in the NameNode memory. The identified cold data can subsequently be removed, which increases the available memory capacity and the operating rate. The metadata information corresponding to cold data in the HDFS is automatically removed from the NN memory, which releases NN memory and improves its utilization; meanwhile, when a user actually needs to access the cold data, the NN can reload the metadata back into memory, thereby achieving dynamic optimization of the memory.
Based on the same concept, an embodiment of the present application provides an apparatus for identifying data in an HDFS memory. For the specific implementation of the apparatus, refer to the description of the method embodiment; repeated details are omitted. As shown in Fig. 4, the apparatus mainly includes:
a first obtaining module 401, configured to obtain access time of at least one metadata in a memory of a name node in the HDFS;
a second obtaining module 402, configured to obtain a current time of memory operation of the name node;
a calculating module 403, configured to calculate a time difference between the access time and the current time;
a determining module 404, configured to determine that the metadata corresponding to the access time with the time difference being greater than the first preset value is cold data.
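The four modules above map onto a simple loop, sketched here in Python. This is a hedged illustration rather than the apparatus itself: the function `identify_cold`, the dictionary of access times, and the use of seconds as the time unit are assumptions not stated in the text.

```python
import time
from typing import Dict, List, Optional


def identify_cold(access_times: Dict[str, float],
                  first_preset: float,
                  now: Optional[float] = None) -> List[str]:
    """Return ids of metadata last accessed more than first_preset seconds ago."""
    if now is None:
        now = time.time()  # second obtaining module: current time
    cold = []
    for meta_id, accessed in access_times.items():  # first obtaining module: access times
        diff = now - accessed                        # calculating module: time difference
        if diff > first_preset:                      # determining module: strictly greater
            cold.append(meta_id)
    return cold
```

For example, with `now=100.0` and `first_preset=50.0`, an entry last accessed at `0.0` is reported as cold, while an entry accessed at `90.0` is not.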
Based on the same concept, an embodiment of the present application further provides an electronic device, as shown in fig. 5, the electronic device mainly includes: a processor 501, a memory 502 and a communication bus 503, wherein the processor 501 and the memory 502 communicate with each other through the communication bus 503. The memory 502 stores a program executable by the processor 501, and the processor 501 executes the program stored in the memory 502, so as to implement the following steps:
acquiring the respective access time of at least one metadata in the memory of a name node in the HDFS;
acquiring the current time of memory operation of the name node;
calculating a time difference value between the access time and the current time;
and determining the metadata corresponding to the access time with the time difference value larger than the first preset value as cold data.
The communication bus 503 mentioned in the above electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus 503 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 5, but this is not intended to represent only one bus or type of bus.
The memory 502 may include a Random Access Memory (RAM) or a non-volatile memory, such as at least one disk memory. Alternatively, the memory may be at least one storage device located remotely from the aforementioned processor 501.
The processor 501 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), etc., and may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), or other programmable logic devices, discrete gates or transistor logic devices, and discrete hardware components.
In still another embodiment of the present application, there is also provided a computer-readable storage medium having a computer program stored therein, which, when run on a computer, causes the computer to execute the method for identifying data in an HDFS memory described in the above embodiments.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or data center, that integrates one or more available media. The available media may be magnetic media (e.g., floppy disks, hard disks, tapes, etc.), optical media (e.g., DVDs), or semiconductor media (e.g., solid state drives), among others.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present invention, which enable those skilled in the art to understand or practice the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (11)
1. A method for identifying data in an HDFS memory is characterized by comprising the following steps:
acquiring the respective access time of at least one metadata in the memory of a name node in the HDFS;
acquiring the current time of memory operation of the name node;
calculating a time difference between the access time and the current time;
and determining the metadata corresponding to the access time with the time difference value larger than a first preset value as cold data.
2. The method according to claim 1, wherein after determining that the metadata with the time difference value greater than the first preset value is cold data, the method further comprises:
acquiring data elements of the cold data, wherein the data elements comprise the time difference value;
determining a target processing unit of the cold data corresponding to the data element;
and sending the cold data to the target processing unit so as to process the cold data through the target processing unit.
3. The HDFS in-memory data identification method according to claim 2, wherein the data elements further include a revisitation propensity degree indicating a likelihood that the cold data is revisited;
the target processing unit for determining the cold data corresponding to the data element includes:
and if the time difference is greater than a second preset value, or the revisit tendency degree is smaller than a preset tendency degree, determining that the target processing unit is a recycle bin, wherein the second preset value is greater than the first preset value.
4. The method for identifying data in an HDFS memory according to any of claims 1 to 3, wherein the obtaining of the respective access time of at least one metadata in the memory of a name node in the HDFS comprises:
and acquiring the respective access time of each metadata with the storage time greater than the preset time in the memory of the name node.
5. The method for identifying data in an HDFS memory according to any of claims 1 to 3, wherein the obtaining of the respective access time of at least one metadata in the memory of a name node in the HDFS comprises:
and acquiring the respective access time of each metadata which does not carry a specific identifier in the memory of the name node, wherein the specific identifier is an identifier indicating that the metadata meets a preset condition.
6. The method according to claim 5, wherein the determining that the metadata with the time difference value greater than the first preset value is cold data comprises:
recording, over the identification processes of the data, the number of times that the time difference between the access time of the metadata carrying the specific identifier and the current time of the current identification process is greater than the first preset value, wherein the current identification process is any one of the identification processes of the data;
and when the number of times is greater than a preset number of times, determining that the metadata carrying the specific identifier is cold data.
7. The method for identifying data in a memory of an HDFS according to any of claims 1 to 3, wherein before obtaining the respective access time of at least one metadata in the memory of a name node in the HDFS, the method further comprises:
monitoring the amount of free memory space of the name node;
and when the amount of free memory space is less than a preset space storage amount, executing, at every preset time interval, the acquiring of the access parameter value of each metadata in the memory of the name node.
8. The method for identifying data in an HDFS memory according to any of claims 1 to 3, wherein the obtaining of the respective access time of at least one metadata in the memory of a name node in the HDFS comprises:
scanning a metadata access record table, wherein data identification of metadata in the metadata access record table and access time of the metadata are correspondingly stored;
and extracting the access time from the metadata access record table.
9. An apparatus for identifying data in an HDFS memory, comprising:
the first acquisition module is used for acquiring the respective access time of at least one metadata in the memory of a name node in the HDFS;
the second acquisition module is used for acquiring the current time of the memory operation of the name node;
the calculation module is used for calculating a time difference value between the access time and the current time;
and the determining module is used for determining that the metadata corresponding to the access time with the time difference value larger than a first preset value is cold data.
10. An electronic device, comprising: the system comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory are communicated with each other through the communication bus;
the memory for storing a computer program;
the processor is configured to execute the program stored in the memory, and implement the method for identifying data in the HDFS memory according to any one of claims 1 to 8.
11. A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, carries out a method for identifying data in an HDFS memory according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111063577.3A CN113760854A (en) | 2021-09-10 | 2021-09-10 | Method for identifying data in HDFS memory and related equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113760854A true CN113760854A (en) | 2021-12-07 |
Family
ID=78794832
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111063577.3A Pending CN113760854A (en) | 2021-09-10 | 2021-09-10 | Method for identifying data in HDFS memory and related equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113760854A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113760855A (en) * | 2021-09-10 | 2021-12-07 | 北京金山云网络技术有限公司 | Data storage method and device, electronic equipment and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130013561A1 (en) * | 2011-07-08 | 2013-01-10 | Microsoft Corporation | Efficient metadata storage |
CN107169056A (en) * | 2017-04-27 | 2017-09-15 | 四川长虹电器股份有限公司 | Distributed file system and the method for saving distributed file system memory space |
CN107665224A (en) * | 2016-07-29 | 2018-02-06 | 北京京东尚科信息技术有限公司 | Scan the mthods, systems and devices of HDFS cold datas |
CN108021585A (en) * | 2016-10-28 | 2018-05-11 | 腾讯科技(深圳)有限公司 | Distributed data storage method and device |
CN112286459A (en) * | 2020-10-29 | 2021-01-29 | 苏州浪潮智能科技有限公司 | Data processing method, device, equipment and medium |
Non-Patent Citations (1)
Title |
---|
Zhan Ling et al.: "Metadata cache backup based on the Ceph file system" (基于Ceph 文件系统的元数据缓存备份), Computer Engineering (《计算机工程》), vol. 43, no. 4, 30 April 2017 (2017-04-30), pages 67 - 72 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109947668B (en) | Method and device for storing data | |
CN110109868B (en) | Method, apparatus and computer program product for indexing files | |
US9514170B1 (en) | Priority queue using two differently-indexed single-index tables | |
CN110958300B (en) | Data uploading method, system, device, electronic equipment and computer readable medium | |
CN111488377A (en) | Data query method and device, electronic equipment and storage medium | |
CN110750211B (en) | Storage space management method and device | |
CN113760854A (en) | Method for identifying data in HDFS memory and related equipment | |
US11429311B1 (en) | Method and system for managing requests in a distributed system | |
CN110427394B (en) | Data operation method and device | |
CN110708361B (en) | System, method and device for determining grade of digital content publishing user and server | |
CN112579633A (en) | Data retrieval method, device, equipment and storage medium | |
CN113779412B (en) | Message touch method, node and system based on blockchain network | |
CN113849482A (en) | Data migration method and device and electronic equipment | |
CN111400327B (en) | Data synchronization method and device, electronic equipment and storage medium | |
CN113760855A (en) | Data storage method and device, electronic equipment and storage medium | |
CN113821166A (en) | Method, device and equipment for aggregating multi-version small objects | |
CN114896215A (en) | Metadata storage method and device | |
CN111399754B (en) | Method and device for releasing storage space and distributed system | |
CN110083509B (en) | Method and device for arranging log data | |
CN114036121A (en) | Log file processing method, device, system, equipment and storage medium | |
CN113742378A (en) | Data query and storage method, related equipment and storage medium | |
CN113779426A (en) | Data storage method and device, terminal equipment and storage medium | |
CN111078643A (en) | Method and device for deleting files in batches and electronic equipment | |
CN112543213B (en) | Data processing method and device | |
CN115718571B (en) | Data management method and device based on multidimensional features |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||