CN113377733A - Storage optimization method for Hadoop distributed file system - Google Patents

Storage optimization method for Hadoop distributed file system

Info

Publication number
CN113377733A
CN113377733A
Authority
CN
China
Prior art keywords
file
information
log
record
access
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110644122.4A
Other languages
Chinese (zh)
Other versions
CN113377733B (en)
Inventor
王周恺
贾乔
马维纲
王怀军
曹霆
李宇昕
王侃
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Technology
Original Assignee
Xian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Technology
Priority to CN202110644122.4A
Publication of CN113377733A
Application granted
Publication of CN113377733B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10 File systems; File servers
    • G06F 16/18 File system types
    • G06F 16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10 File systems; File servers
    • G06F 16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10 File systems; File servers
    • G06F 16/17 Details of further file system functions
    • G06F 16/172 Caching, prefetching or hoarding of files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10 File systems; File servers
    • G06F 16/18 File system types
    • G06F 16/1805 Append-only file systems, e.g. using logs or journals to store data
    • G06F 16/1815 Journaling file systems

Abstract

The invention discloses a storage optimization method for the Hadoop distributed file system, which specifically comprises the following steps: first, INFO-level log files are selected, the selected log files containing specific execution timestamps and file-name information, and their access records and deletion records are acquired; all entries containing the relevant keywords are extracted from the INFO-level logs, then ordered and numbered by timestamp; next, the feature label is determined, features are selected, and feature vectors are constructed to form the sample set for training the file elimination model; the three feature values of the feature vector are selected in turn as the three classification nodes of a decision tree, the decision tree is built with the ID3 algorithm, and the decision tree constitutes the file elimination model; finally, the reusability of files is predicted with the established file elimination model. The method optimizes the storage efficiency of the distributed file system, reduces the data storage scale, and improves the storage efficiency of HDFS.

Description

Storage optimization method for Hadoop distributed file system
Technical Field
The invention belongs to the technical field of data storage, and particularly relates to a storage optimization method for a Hadoop distributed file system.
Background
With the increasingly wide application of big-data computing engines such as Apache Hadoop and Apache Spark, large volumes of new data must be continuously stored in the Hadoop Distributed File System (HDFS), which puts great pressure on HDFS storage. The traditional approach keeps enlarging HDFS capacity through additional hardware investment so as to store the massive data growth, but the cost is high; moreover, most of the data stored in HDFS has low utilization value and a low probability of being used or accessed by other devices again, so a great deal of hardware resources and software cost is wasted.
In the cloud computing era, the storage optimization problem of large-scale distributed file systems is receiving increasingly broad attention. For example, Kirsten et al. propose a load-balancing method for generalized distributed file systems from the perspective of access-rate balancing and data value. Shruthi et al. propose a storage-space clustering algorithm that defines a "distance" between nodes via data similarity and association and places data on appropriate nodes, reducing the access time and the number of node accesses during work. Although such methods can optimize a distributed file system to a certain extent and improve storage-space utilization, they require changing the underlying architecture and core allocation rules of HDFS, so they are difficult to implement and port. In China, Qian et al. propose a virtual-desktop optimization technique based on data deduplication, and Wu Qiping et al. propose an ARC cache-elimination mechanism for fault-tolerant cloud-storage data; the latter uses erasure codes for fault tolerance and traditional replica redundancy only in the cache, obtains good results, and significantly reduces the storage space of the distributed file system. In a real distributed file system, however, the proportion of duplicate data is not large; what dominates is a large amount of low-value data that is used only once or a few times.
Disclosure of Invention
The invention aims to provide a storage optimization method for a Hadoop distributed file system, which is used for optimizing the storage efficiency of the distributed file system, reducing the data storage scale and improving the storage efficiency of an HDFS.
The technical scheme adopted by the invention is a storage optimization method for the Hadoop distributed file system, implemented according to the following steps:
step 1, extracting file operation records, specifically:
step 1.1: selecting an INFO level log file, wherein the selected log file comprises specific execution time stamp and file name information;
step 1.2: obtaining an access record of an INFO level log file;
step 1.3: acquiring a deletion record of an INFO level log file;
step 1.4: extracting and ordering all entries containing the relevant keywords in the INFO-level log, then numbering them by timestamp; the field type denotes the operation type, 1 denoting a deletion operation and 0 an access operation; F denotes the file name, and d denotes the time at which the operation occurred;
step 2: determining a feature label, selecting features, constructing a feature vector, and forming a sample set of a training file elimination model;
step 3: selecting the three feature values of the feature vector in turn as the three classification nodes of the decision tree, establishing the decision tree with the ID3 algorithm, and constructing the file elimination model from the decision tree; using MLlib as a tool, calling the programming interface of MLlib with the sample set as input to train the file elimination model and, after training, saving the model back to HDFS in JSON format for later use in eliminating files;
step 4: predicting the reusability of files with the established file elimination model.
The present invention is also characterized in that,
In step 1.2, for acquisition of the log-file access records, the filter operation combined with a lambda expression is used to find in the log file all entries generated by the NameNode that contain the completeFile keyword; the timestamp keyword and file-name keyword contained therein are extracted and stored in HDFS as file access records.
In step 1.3, a filter function combined with a lambda expression is first used to find in the log file all entries generated by the NameNode that contain the addToInvalidates keyword; the timestamp keyword and block-name keyword contained therein are then extracted; next, the same filter function and lambda expression are used to find in the log file the entries containing the allocateBlock keyword whose block name matches that in the addToInvalidates entry, and the file-name and timestamp keywords contained therein are extracted; finally, the file name contained in the allocateBlock entry and the timestamp in the corresponding addToInvalidates entry are saved as file deletion records and stored in HDFS.
Step 2 specifically comprises the following steps:
step 2.1: defining the label of the sample set as "whether the file can be deleted"; "yes" is the positive label, indicating that the reuse probability of the file is very low and the file can be deleted, and tuples labeled "yes" are positive samples; "no" is the negative label, indicating that the file may still be reused and should not be deleted, and tuples labeled "no" are negative samples;
converting each file access and deletion record into a feature-vector tuple containing a label; for the tuple features, taking type as the label information; the main correspondence rules for the labels are as follows:
for each file operation record r0:
a) if type is 0, the record is an access record, indicating that the corresponding file F was reused at the time d when the file operation occurred; the file has reuse potential at d and cannot be deleted; the tuple is labeled "no" and is a negative sample;
b) if type is 1, the record is a deletion record, indicating that at the time d when the file operation occurred, and afterwards, the corresponding file F has no further reuse potential and can be deleted; the tuple is labeled "yes" and is a positive sample;
step 2.2: selecting characteristics;
all file access records and file deletion records in HDFS are taken out, and for records with the same file name the creation age dc (unit: days), the non-access duration da (unit: days), and the average daily access frequency since creation frq are calculated, expressed by formula (1), formula (2), and formula (3) respectively, forming the features of the sample set; each item composed of the above label and feature values is also called a feature vector;
dc = d0 - dc0 (1);
da = d0 - da0 (2);
frq = n0 / (d0 - dc0) (3);
where dc0 is the file creation date; d0 is the file timestamp; da0 is the date of the last access to the file; and n0 is the number of accesses to the file up to time d0.
In step 4, the method specifically comprises: first reading the current file list from HDFS, then using the trained file elimination model to predict whether each file can be deleted; for each file, extracting all of its operation records from the log by file name, computing its label and the features dc, da, and frq at time d to obtain a feature vector, feeding the vector to the trained decision tree for prediction, and returning the label "can delete (yes)" or "suggest retention (no)".
The method has the advantage that, by establishing an analysis model, the importance and access heat of the files stored in HDFS are periodically analyzed; according to the model's analysis results, files or file copies with low repeat-access frequency in HDFS are selected and recommended to the user for deletion, and the storage space is cleaned, thereby reducing the data storage scale and improving the storage efficiency of HDFS.
Drawings
FIG. 1 is a flow chart of a method of storage optimization for a Hadoop distributed file system of the present invention;
FIG. 2 is a flow chart of creating a file in a storage optimization method for a Hadoop distributed file system according to the present invention;
FIG. 3 is a flowchart of reading a file in a storage optimization method for a Hadoop distributed file system according to the present invention;
FIG. 4 is a flowchart of file deletion in the storage optimization method for a Hadoop distributed file system according to the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
The invention relates to a storage optimization method for a Hadoop distributed file system, which is implemented according to the following steps as shown in figure 1:
step 1, extracting file operation records, specifically:
step 1.1: selecting an INFO level log file, wherein the selected log file comprises specific execution time stamp and file name information;
the HDFS stores massive log file contents, records various operations on a distributed file system, is mainly divided into three levels, namely WARN, INFO and DEBUG, and the recording detail degree of the HDFS is increased in sequence. The DEBUG level log is positioned at the bottom layer, the recorded content is most direct and detailed, but the data volume is large; the top layer of the WARN-level log only records key information and information which possibly causes errors, and the information amount is too small to facilitate the log analysis work. The invention hopes to extract the log file as little as possible, reduces the analysis scale and the calculation amount of the log, and strives to complete the analysis of the log file with the minimum cost. Therefore, selecting the middle INFO level log file for analysis;
with the level determined, acquisition of the file operation records can begin. File operations comprise writing, reading, and deleting files. Correspondingly, the logged operation records comprise write records, read records, and delete records; the write and read records, together also called the access records of the log file, can be obtained in one pass, while the delete records differ from the access records and must be acquired by a separate method;
step 1.2: obtaining an access record of an INFO level log file;
first, the INFO-level log files are read from HDFS into the big-data computing platform using the PySpark API provided by Apache Spark, forming an original resilient distributed dataset (RDD); because the number of log entries is huge, on the basis of the RDD it is necessary to analyze and determine which records are meaningful and usable for analysis and computation, and to extract and store them.
First comes the analysis of file write records: since files on HDFS are write-once (read-only after creation), writing a file means creating it. The specific flow is shown in FIG. 2.
First, the DFS client initiates a create-file request to the NameNode through HDFS; after the NameNode confirms that the file does not exist, it creates the corresponding file in the namespace and allocates blocks for it. HDFS then returns an FSDataOutputStream to the DFS client and initializes a data pipeline between the DataNodes, and the DataNode data-receiving service starts. The client writes data to the FSDataOutputStream as a stream; the FSDataOutputStream divides the data into blocks, stores them in a data queue, and sends them to a DataNode in packets; the DataNode receives the data and forwards it to the other DataNodes through the pipeline, and each DataNode that receives data returns an ack. After the file is written, the client closes the data stream and sends a "file operation complete" message to the NameNode, which verifies the file's INode, lease, operation permissions, and other information, completing the file creation.
According to the above process, the operations in FIG. 2 closely related to storing files are mainly: 1. the create-file request; 2. creating the file and allocating blocks; 4. sending packets; 5. the ack; and 7. completing the file operation. However, the INFO-level log stores no information for "1. create-file request"; furthermore, "4. send packet" and "5. ack" both represent data writes to the distributed file, so the log information corresponding to operation 4 is retained. In summary, the key log information when HDFS writes (creates) a file covers operations 2, 4, and 7; the corresponding log entries can be collected for subsequent analysis and computation, and their formats are shown in Table 1.
TABLE 1 Log information related to HDFS file-write data
[table reproduced as an image in the original publication]
Next comes the acquisition of file-read information. Reading a file is simpler than writing one: fewer operations are involved and, correspondingly, less log information. The process is shown in FIG. 3.
The DFS client initiates an open-file request to the NameNode through HDFS; after confirming the operation permissions and that the file exists, the NameNode returns the information of the file's data blocks. HDFS then returns an FSDataInputStream to the DFS client for reading the data. The client requests data from the FSDataInputStream, which selects the closest node among all DataNodes containing the first data block, connects, and starts reading. After a data block has been read, the FSDataInputStream selects the nearest node containing the next data block and continues. When reading finishes, the client closes the data stream and sends a "file operation complete" message to the NameNode. If a DataNode communication error occurs during this process, the next node containing the data block is connected, and the failed node is recorded and not contacted again.
According to the above process, the operations in FIG. 3 closely related to reading a file are mainly: 1. the open-file request; 2. the block allocation list; 4. reading data; and 6. completing the file operation. However, the INFO-level log stores no information for "1. open-file request"; moreover, the log information related to "2. block allocation list" contains no file-name information, so it cannot be used to determine which files should be eliminated, and the same holds for the log information of the "4. read data" operation. Therefore, in the file-reading process the only log information that can be collected for analysis corresponds to operation 6, as shown in Table 2.
TABLE 2 Log information related to HDFS file-read data
[table reproduced as an image in the original publication]
Combining Table 1 and Table 2, it is clear that whether a file is written or read, the access leaves a log entry with the keyword completeFile, indicating that the write or read operation on the file has completed. The format is as follows:
<timestamp> DIR* completeFile: <filename> is closed by <DFS client number>
This log entry contains a timestamp and a file name, can represent both file writes and file reads, and is representative and exclusive in meaning, so it meets the log-selection requirement well. Therefore, the invention uses the filter, map, and related functions provided by Spark, combined with lambda expressions, to obtain from the resilient distributed dataset all NameNode-generated entries containing the completeFile keyword, and extracts the timestamp and file-name keywords they contain to be saved to HDFS as the file access records.
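For illustration, the access-record extraction just described can be sketched with PySpark as follows; the log path, the line pattern, and the output location are assumptions for the sketch, not values taken from the patent:

```python
import re
from pyspark import SparkContext

sc = SparkContext(appName="ExtractAccessRecords")

# Assumed NameNode INFO log location; adjust to the actual cluster layout.
logs = sc.textFile("hdfs:///logs/namenode-info.log")

# Keep only entries that mark a completed write or read of a file.
access = logs.filter(lambda line: "completeFile" in line)

# Assumed layout: "<timestamp> ... completeFile: <filename> is closed by ..."
pattern = re.compile(r"^(\S+\s\S+).*completeFile:\s*(\S+)\s+is closed by")

def to_record(line):
    m = pattern.search(line)
    return (m.group(2), m.group(1)) if m else None  # (filename, timestamp)

records = access.map(to_record).filter(lambda r: r is not None)
records.saveAsTextFile("hdfs:///records/file_access")  # assumed output path
```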
Step 1.3: acquiring a deletion record of an INFO level log file;
As in step 1.2, the INFO-level log files are first read from HDFS into the big-data computing platform using the PySpark API provided by Apache Spark, forming an original resilient distributed dataset; the useful file deletion records are then analyzed and selected from it. Retrieving file deletion records is, however, more complicated than retrieving file access records, because the INFO log records deletion operations mainly in units of blocks, so it is difficult to learn directly from the log which files have been deleted in the distributed file system.
Specifically, the flow of deleting a file on the HDFS is shown in fig. 4.
The DFS client first initiates a file deletion request to the NameNode. The NameNode then looks up the block allocation information of the file in the namespace, adds the file's blocks to the "invalid block list", and prepares to reclaim them; at the same time, all references to the file and its blocks are deleted. The NameNode and the DataNodes stand in a master-slave relationship, and the NameNode does not actively push information to the DataNodes. Instead, during HDFS operation each DataNode continuously and periodically sends a "heartbeat" to the NameNode, which replies with a "heartbeat" of its own. When a file is to be deleted, the DataNode obtains the invalid block list from the NameNode through the heartbeat and deletes the data of the corresponding blocks on its own node.
The log information related to file deletion is therefore as shown in Table 3. In Table 3, both the NameNode and the DataNode record the name of the block to be deleted, which is duplicate information. To reduce the volume of log information to be processed, the invention collects the file deletion information on the NameNode as the basis for subsequent calculation and analysis.
TABLE 3 Log information related to file deletion on HDFS
[table reproduced as an image in the original publication]
However, this log contains no file name, so the details of deleted files must be obtained with a block-matching method, as follows.
First, the format of the log entry recording file creation (block allocation) is:
<timestamp> BLOCK* allocateBlock: <filename> <block name> {<block replica information>}
The format of the log entry recording file deletion (block invalidation) is:
<timestamp> BLOCK* addToInvalidates: <block name>
Therefore, one only needs to find the block-allocation (allocateBlock) log entry containing the same <block name> as a file-deletion (addToInvalidates) log entry to locate the file to which the reclaimed block belongs, determine the deletion time, conclude that the file has been deleted, and compile the deleted-file record. The procedure is similar to the acquisition of file access records. First, a filter function combined with a lambda expression finds in the log file all NameNode-generated entries containing the addToInvalidates keyword; the timestamp and block-name keywords they contain are then extracted; next, the same filter function and lambda expression find in the log file the entries containing the allocateBlock keyword whose block name matches that of the addToInvalidates entry, and the file-name and timestamp keywords they contain are extracted; finally, the file name from the allocateBlock entry and the timestamp from the corresponding addToInvalidates entry are saved as file deletion records (stored in HDFS). Concretely, when an addToInvalidates entry is encountered, the block indicated by its <block name> part is being deleted, and a deleted block implies that the file it belongs to has been deleted. Under this logic, if the continued search of the log file finds an allocateBlock entry with the same <block name> as the addToInvalidates entry, the <filename> recorded in that allocateBlock entry is the name of the deleted file, and the <timestamp> recorded in the original addToInvalidates entry is recorded as the deletion time of that file.
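Under the same assumptions as the previous sketch, the block-matching step can be written as a join of the two entry types on the block name (paths and patterns again illustrative):

```python
import re

# addToInvalidates entries -> (block name, deletion timestamp)
inval_pat = re.compile(r"^(\S+\s\S+).*addToInvalidates:\s*(blk_\S+)")
invalidated = (logs.filter(lambda l: "addToInvalidates" in l)
                   .map(lambda l: inval_pat.search(l))
                   .filter(lambda m: m is not None)
                   .map(lambda m: (m.group(2), m.group(1))))

# allocateBlock entries -> (block name, file name)
alloc_pat = re.compile(r"allocateBlock:\s*(\S+)\s.*?(blk_\S+)")
allocated = (logs.filter(lambda l: "allocateBlock" in l)
                 .map(lambda l: alloc_pat.search(l))
                 .filter(lambda m: m is not None)
                 .map(lambda m: (m.group(2), m.group(1))))

# Join on block name: each match yields (filename, deletion timestamp).
deletions = allocated.join(invalidated).map(lambda kv: (kv[1][0], kv[1][1]))
deletions.distinct().saveAsTextFile("hdfs:///records/file_delete")
```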
Step 1.4: extract and order all entries containing the selected keywords in the INFO-level log, then number them by timestamp. The field type denotes the operation type, 1 denoting a deletion operation and 0 an access operation (covering both file writes and file reads); F denotes the file name, and d denotes the time the operation occurred. The collected and ordered log information serves as the sample set for constructing and training the file elimination model, as sketched below.
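A minimal sketch of this merge-and-number step, assuming the access and deletion pair RDDs produced by the sketches above:

```python
# Tag each record with its operation type: 0 = access, 1 = deletion.
ops = (records.map(lambda ft: (0, ft[0], ft[1]))
              .union(deletions.map(lambda ft: (1, ft[0], ft[1]))))

# Order by operation time d and attach a sequence number;
# the result is (number, (type, F, d)).
numbered = (ops.sortBy(lambda r: r[2])
               .zipWithIndex()
               .map(lambda ri: (ri[1], ri[0])))
```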
Step 2: determine the feature label (label), select features, and construct feature vectors to form the sample set for training the file elimination model. Specifically:
Step 2.1: the label of the sample set is defined as "whether the file can be deleted". "Yes" is the positive label, indicating that the reuse probability of the file is very low and it can be deleted; tuples labeled "yes" are positive samples. "No" is the negative label, indicating that the file may still be reused and should not be deleted; tuples labeled "no" are negative samples.
Each file access and deletion record is converted into a feature-vector tuple containing a label. For the tuple features, the type field above serves as the label information. The main correspondence rules for the labels are as follows:
For each file operation record r0:
a) if type is 0, the record is an access record, indicating that at the time d when the file operation occurred, the corresponding file F was reused. The file has reuse potential at d and cannot be deleted. The tuple is labeled "no" and is a negative sample.
b) if type is 1, the record is a deletion record, indicating that at the time d when the file operation occurred, and afterwards, the corresponding file F has no further reuse potential and can be deleted. The tuple is labeled "yes" and is a positive sample;
step 2.2: selecting characteristics;
the method adopts a machine learning algorithm of supervised learning to establish a file elimination model, and the selection of characteristics is very important. The features cannot be too many, otherwise the vector space dimension is too high, which can greatly increase the amount of computation.
To extract the features, a record of creation of the file, i.e., the first access to the file, may be extracted from the file access records.
To evaluate the reuse possibility of a file, consider each file access or deletion record: for file F0, take the operation timestamp d0. Denote the creation date of file F0 by dc0 and its last access date by da0; up to time d0, the file has been accessed n0 times. The following three dimensions are selected in turn as the feature-vector space:
(1) creation age dc (unit: days):
dc = d0 - dc0
(2) time since the last access da (unit: days), i.e., the non-access duration:
da = d0 - da0
(3) average daily access frequency since creation frq:
frq = n0 / (d0 - dc0)
In summary, the invention selects the three features dc, da, and frq. Table 4 lists an example of a calibrated sample set.
Table 4 Sample set format example
[table reproduced as an image in the original publication]
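For illustration, the three features can be computed per file from its operation records; the sketch below uses plain Python, assumes timestamps have already been parsed to datetime.date values, and uses the (type, F, d) triple layout defined in step 1.4 (the record values are illustrative):

```python
from datetime import date

def feature_vector(file_ops, d0, n0):
    """file_ops: (type, F, d) triples of one file, sorted by date d;
    d0: timestamp of the record being labeled; n0: accesses up to d0."""
    dc0 = file_ops[0][2]                      # creation date = first access
    accesses = [r[2] for r in file_ops if r[0] == 0 and r[2] < d0]
    da0 = max(accesses) if accesses else dc0  # date of last access before d0
    dc = (d0 - dc0).days                      # formula (1): creation age
    da = (d0 - da0).days                      # formula (2): non-access time
    frq = n0 / float(dc) if dc > 0 else float(n0)  # formula (3)
    return (dc, da, frq)

# Example: created Jan 1, accessed twice, evaluated on Jan 11.
recs = [(0, "/data/a.txt", date(2021, 1, 1)),
        (0, "/data/a.txt", date(2021, 1, 4))]
print(feature_vector(recs, date(2021, 1, 11), n0=2))  # (10, 7, 0.2)
```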
Step 3: select the three feature values of the feature vector in turn as the three classification nodes of the decision tree, establish the decision tree with the ID3 algorithm, and construct the file elimination model from the decision tree. Using MLlib as the tool, the programming interface of MLlib is called directly with the sample set as input to train the file elimination model. Specifically, the information entropy of the samples in the sample set is calculated first; in the following formula, D denotes the data set, m the number of classes, and pi the probability that a tuple in D belongs to class Ci, which can be estimated by |Ci,D| / |D|:
Info(D) = -Σ(i=1..m) pi log2(pi)
On this basis, the information entropy when each feature A is taken as the division node is also calculated, D being split by A into partitions D1, ..., Dv:
InfoA(D) = Σ(j=1..v) (|Dj| / |D|) Info(Dj)
After the information entropy has been calculated, the information gain when the feature value is used as the classification node is calculated as:
Gain(A) = Info(D) - InfoA(D)
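As an illustrative sketch (not the patent's code), the entropy and gain computations above can be written directly:

```python
import math
from collections import Counter

def info(labels):
    """Information entropy Info(D) over a list of class labels."""
    n = float(len(labels))
    return -sum((c / n) * math.log(c / n, 2)
                for c in Counter(labels).values())

def info_a(partitions):
    """Info_A(D): partitions is a list of label-lists, one per D_j."""
    n = float(sum(len(p) for p in partitions))
    return sum(len(p) / n * info(p) for p in partitions)

def gain(labels, partitions):
    return info(labels) - info_a(partitions)

# A perfect binary split of four samples yields gain 1.0.
print(gain(["yes", "yes", "no", "no"], [["yes", "yes"], ["no", "no"]]))
```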
For each round of classification-node generation, the feature value with the maximum information gain is taken as the partition attribute, generating the whole decision tree. After training finishes, the model is saved back to HDFS in JSON format for later use in eliminating files, as sketched below;
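A minimal training sketch with Spark MLlib's RDD-based API (available in the Spark 1.5.1 used in the experiments below). The sample RDD name and the save path are assumptions; MLlib's entropy-impurity DecisionTree stands in for a hand-rolled ID3 tree, and since MLlib saves models in its own format, the JSON export here simply serializes the tree's debug string:

```python
import json
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree

# Assumed RDD of (label, dc, da, frq) tuples; label 1.0 = "yes" (deletable).
samples = sample_set.map(lambda t: LabeledPoint(t[0], [t[1], t[2], t[3]]))

model = DecisionTree.trainClassifier(
    samples, numClasses=2, categoricalFeaturesInfo={},
    impurity="entropy",  # information-gain splits, as in ID3
    maxDepth=3)          # one level per feature: three classification nodes

# Persist a JSON description of the tree back to HDFS (assumed path).
sc.parallelize([json.dumps({"tree": model.toDebugString()})]) \
  .saveAsTextFile("hdfs:///models/file_elimination")
```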
Step 4: predict the reusability of files with the established file elimination model;
specifically: first read the current file list from HDFS, then use the trained file elimination model to predict whether each file can be deleted. For each file, extract all of its operation records from the log by file name, compute its label and the features dc, da, and frq at time d to obtain a feature vector, feed the vector to the trained decision tree for prediction, and return the label "can delete (yes)" or "suggest retention (no)".
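A hedged sketch of this prediction pass, reusing the feature_vector helper and the trained model from the sketches above (the evaluation date and record dictionary are illustrative):

```python
from datetime import date

def recommend(file_records, d0=date(2021, 6, 9)):
    """file_records: {filename: sorted [(type, F, d), ...]} from the logs."""
    out = {}
    for name, recs in file_records.items():
        n0 = sum(1 for r in recs if r[0] == 0)
        fv = feature_vector(recs, d0, n0)
        pred = model.predict(list(fv))  # trained decision tree from step 3
        out[name] = ("can delete (yes)" if pred == 1.0
                     else "suggest retention (no)")
    return out
```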
To verify the effectiveness of the method, a validation experiment was designed to evaluate the function of the proposed file elimination model in terms of precision, recall, F1-measure, and the like, and on that basis to evaluate the performance of the model in terms such as response time.
The experimental environment is a computing cluster of 7 homogeneous PCs connected through a Juniper J6350 enterprise-grade high-performance router, on which a Python 3.5.0 runtime environment is first deployed. Each computing node in the cluster has the same hardware configuration: an Intel Core i5-4590 CPU, 8 GB of memory, and a 380 GB mechanical hard disk. Each machine runs the Ubuntu 15.10 operating system, and the whole cluster is an HDFS built on Spark 1.5.1 and Hadoop 2.6.0. In the cluster, one PC serves as the control node (Master Node), managing the other six computing nodes (Slave Nodes) responsible for executing the specific services.
k-fold cross validation is a common classifier evaluation method that partitions the original data set into k similarly sized disjoint subsets (i.e., folds) D1, D2, ..., Dk. Training and testing are repeated k times: in the i-th round, Di is held out as the prediction set and the remaining data are used as the training set. Averaging the k results gives a single estimate.
k-fold differs from other model evaluation methods such as holdout validation and random subsampling in that, in k-fold cross validation, each subset is used the same number of times for training and for testing, reducing the risk that the data partition affects model validation.
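A small plain-Python illustration of the k-fold split (a sketch, not the patent's code):

```python
def k_fold_indices(n, k):
    """Partition sample indices 0..n-1 into k similarly sized disjoint folds
    and yield (training indices, prediction indices) for each round."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds if f is not test for j in f]
        yield train, test

# Ten samples, five rounds: each sample appears in exactly one prediction set.
for train_idx, test_idx in k_fold_indices(10, 5):
    print(test_idx)
```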
Precision and recall are commonly used classifier evaluation metrics. The invention uses these two criteria, combined with the F1 score, to evaluate the decision tree in the file elimination model and thereby the functionality of the model.
Classifier labels can be divided into positive labels and negative labels. The positive label corresponds to the positive samples, the part of interest that one hopes to retrieve; the negative label corresponds to the negative samples, the implicit part. In the experimental validation of the invention, the positive label "yes" indicates the file can be deleted, and the negative label "no" indicates retention is recommended. Four cases can be distinguished:
1) true Positive (TP): the prediction label is positive, and the prediction result is correct.
2) True Negative (TN): the prediction label is negative, and the prediction result is correct.
3) False Positive (FP): the prediction label is positive, and the prediction result is wrong.
4) False Negative (FN): the prediction tag is negative and the prediction result is wrong.
The confusion matrix composed of these four cases is shown in table 5.
TABLE 5 Confusion matrix

                    Predicted positive    Predicted negative
Actual positive     TP (true positive)    FN (false negative)
Actual negative     FP (false positive)   TN (true negative)
On the basis, the precision, the recall degree and the F1The calculation method of score is described as follows:
the accuracy describes the accuracy of the returned result of the positive tag, and the calculation formula is as follows:
Figure BDA0003108366450000172
Recall describes how many of the positive-label tuples can be accurately retrieved; the formula is:
recall = TP / (TP + FN)
Precision and recall describe the model from two opposing aspects. For example, if one hundred positive results exist and only one is returned, precision is very high but recall very low; if all tuples are returned directly, recall is 100% but precision suffers greatly.
The F1 score takes the harmonic mean of precision and recall, giving a good compromise between these two different aspects, and is a good classifier evaluation indicator; it is therefore also used in the experimental validation of the invention. The formula is:
F1 = 2 × precision × recall / (precision + recall)
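For illustration, the three metrics can be computed from confusion-matrix counts; the counts below are hypothetical, merely chosen to reproduce the percentages reported in Table 6:

```python
def classifier_metrics(tp, tn, fp, fn):
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts giving precision 73.0%, recall 97.7%, F1 83.6%.
print(classifier_metrics(tp=977, tn=800, fp=361, fn=23))
```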
the experimental data come from log files generated from the Name Node from 2018, 10 months to 2019, 4 months on the HDFS of the cluster, the number of the logs is 870 ten thousand, and the size of the logs is 3.4 GB. The extracted samples collected 6 ten thousand. One tenth of the training set is used as a training set, and the rest is used as a prediction set. After a decision tree in a prediction model is trained by using a training set, processing prediction set data by using the trained decision tree, recording the statistical condition of a classifier label, performing detailed analysis and evaluation on the function of the decision tree by calculating an evaluation index according to the three formulas, and further performing functional analysis on a file elimination model based on the decision tree.
The decision-tree evaluation results of the experiment are shown in Table 6.
TABLE 6 Evaluation results for the decision tree

Evaluation metric    Result
Precision            73.0%
Recall               97.7%
F1 score             83.6%
As can be seen from Table 6, the recall of the decision tree is 97.7%, indicating that the prediction model can find most useless files with the trained decision tree; the model is thus very effective. The precision of the decision tree is somewhat lower than the recall, at 73.0%. The main reasons for this result are, first, that users' management of HDFS is not rigorous: files are not deleted immediately after use, which introduces a certain randomness; second, during training the negative samples (file accesses) far outnumber the positive samples (file deletions), so the training set is sparse, which affects the learning of positive samples to a certain extent. However, the prediction model of the invention recommends possibly useless files to the user, who makes the final decision, which guarantees the safety of the prediction system to a certain extent. Finally, the F1 score, which integrates precision and recall, is 83.6%, indicating that the overall performance of the prediction model is good and that files to eliminate can be effectively recommended to the user.
The optimization method proceeds as follows: read the logs of the most recent specified period from HDFS, then extract, filter, and organize them into structured file operation records. Specifically, for the file access records the meaningful log entries must be identified and extracted; for the file deletion records, a file and block-copy matching method is needed that combines the file life cycle with the block-allocation records written at file creation and the block-reclamation records written at deletion.
After log extraction finishes, the label of the sample set is determined and features are extracted from the file operation records. The labeled records are then submitted to a decision tree for training; in the proposed method, the file elimination model is built with an ID3 decision tree. The specific process is to analyze the historical file operation records and file-management habits on HDFS to extract the creation age dc, the time since the last access da, and the average daily access frequency since creation frq; on this basis, the reuse possibility of files is analyzed with the ID3 decision tree, a machine-learning algorithm, and whether a file should be automatically eliminated is judged. Finally, the decision tree is evaluated in terms of accuracy, precision, recall, and so on. If the trained decision tree is qualified, the current file list is imported from HDFS, the reusability of each file is predicted with the trained tree, and a recommended elimination list is produced and returned to the user.

Claims (5)

1. A storage optimization method for a Hadoop distributed file system is characterized by comprising the following steps:
step 1, extracting file operation records, specifically:
step 1.1: selecting an INFO level log file, wherein the selected log file comprises specific execution time stamp and file name information;
step 1.2: obtaining an access record of an INFO level log file;
step 1.3: acquiring a deletion record of an INFO level log file;
step 1.4: extracting and ordering all entries containing the keywords in the INFO-level log, then numbering them by timestamp; the field type denotes the operation type, 1 denoting a deletion operation and 0 an access operation; F denotes the file name, and d denotes the time at which the operation occurred;
step 2: determining a feature label, selecting features, constructing a feature vector, and forming a sample set of a training file elimination model;
step 3: selecting the three feature values of the feature vector in turn as the three classification nodes of the decision tree, establishing the decision tree with the ID3 algorithm, and constructing the file elimination model from the decision tree; using MLlib as a tool, calling the programming interface of MLlib with the sample set as input to train the file elimination model and, after training, saving the model back to HDFS in JSON format for later use in eliminating files;
step 4: predicting the reusability of files with the established file elimination model.
2. The storage optimization method for the Hadoop distributed file system according to claim 1, wherein in step 1.2, for acquisition of the log-file access records, the filter operation combined with a lambda expression is used to find in the log file all entries generated by the NameNode that contain the completeFile keyword, and the timestamp keyword and file-name keyword contained therein are extracted and stored in HDFS as the file access records.
3. The method as claimed in claim 2, wherein in step 1.3, a filter function combined with a lambda expression is first used to find in the log file all entries generated by the NameNode that contain the addToInvalidates keyword; the timestamp keyword and block-name keyword contained therein are then extracted; next, the same filter function and lambda expression are used to find in the log file the entries containing the allocateBlock keyword whose block name matches that in the addToInvalidates entry, and the file-name and timestamp keywords contained therein are extracted; finally, the file name contained in the allocateBlock entry and the timestamp in the corresponding addToInvalidates entry are saved as file deletion records and stored in HDFS.
4. The storage optimization method for the Hadoop distributed file system according to claim 3, wherein step 2 specifically comprises:
step 2.1: defining the label of the sample set as "whether the file can be deleted"; "yes" is the positive label, indicating that the reuse probability of the file is very low and the file can be deleted, and tuples labeled "yes" are positive samples; "no" is the negative label, indicating that the file may still be reused and should not be deleted, and tuples labeled "no" are negative samples;
converting each file access and deletion record into a feature-vector tuple containing a label; for the tuple features, taking type as the label information; the main correspondence rules for the labels are as follows:
for each file operation record r0:
a) if type is 0, the record is an access record, indicating that the corresponding file F was reused at the time d when the file operation occurred; the file has reuse potential at d and cannot be deleted; the tuple is labeled "no" and is a negative sample;
b) if type is 1, the record is a deletion record, indicating that at the time d when the file operation occurred, and afterwards, the corresponding file F has no further reuse potential and can be deleted; the tuple is labeled "yes" and is a positive sample;
step 2.2: selecting characteristics;
all file access records and file deletion records in HDFS are taken out, and for records with the same file name the creation age dc (unit: days), the non-access duration da (unit: days), and the average daily access frequency since creation frq are calculated, expressed by formula (1), formula (2), and formula (3) respectively, forming the features of the sample set; each item composed of the above label and feature values is also called a feature vector;
dc = d0 - dc0 (1);
da = d0 - da0 (2);
frq = n0 / (d0 - dc0) (3);
where dc0 is the file creation date; d0 is the file timestamp; da0 is the date of the last access to the file; and n0 is the number of accesses to the file up to time d0.
5. The storage optimization method for the Hadoop distributed file system according to claim 4, wherein step 4 specifically comprises: first reading the current file list from HDFS, then using the trained file elimination model to predict whether each file can be deleted; for each file, extracting all of its operation records from the log by file name, computing its label and the features dc, da, and frq at time d to obtain a feature vector, feeding the vector to the trained decision tree for prediction, and returning the label "can delete (yes)" or "suggest retention (no)".
CN202110644122.4A 2021-06-09 2021-06-09 Storage optimization method for Hadoop distributed file system Active CN113377733B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110644122.4A CN113377733B (en) 2021-06-09 2021-06-09 Storage optimization method for Hadoop distributed file system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110644122.4A CN113377733B (en) 2021-06-09 2021-06-09 Storage optimization method for Hadoop distributed file system

Publications (2)

Publication Number Publication Date
CN113377733A true CN113377733A (en) 2021-09-10
CN113377733B CN113377733B (en) 2022-12-27

Family

ID=77573292

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110644122.4A Active CN113377733B (en) 2021-06-09 2021-06-09 Storage optimization method for Hadoop distributed file system

Country Status (1)

Country Link
CN (1) CN113377733B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114415978A (en) * 2022-03-29 2022-04-29 维塔科技(北京)有限公司 Multi-cloud cluster data reading and writing method and device, storage medium and electronic equipment
CN115510292A (en) * 2022-11-18 2022-12-23 四川汉唐云分布式存储技术有限公司 Distributed storage system tree search management method, device, equipment and medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040236974A1 (en) * 2003-05-22 2004-11-25 International Business Machines Corporation Advanced computer hibernation functions
US20070276878A1 (en) * 2006-04-28 2007-11-29 Ling Zheng System and method for providing continuous data protection
CN102289524A (en) * 2011-09-26 2011-12-21 深圳市万兴软件有限公司 Data recovery method and system
JP2013025655A (en) * 2011-07-25 2013-02-04 Nidec Sankyo Corp Log file management module and log file management method
CN105868396A (en) * 2016-04-19 2016-08-17 上海交通大学 Multi-version control method of memory file system
CN108052679A (en) * 2018-01-04 2018-05-18 焦点科技股份有限公司 A kind of Log Analysis System based on HADOOP
CN108153804A (en) * 2017-11-17 2018-06-12 极道科技(北京)有限公司 A kind of metadata daily record update method of symmetric distributed file system
CN109522290A (en) * 2018-11-14 2019-03-26 中国刑事警察学院 A kind of HBase data block restores and data record extraction method
JP2019204474A (en) * 2018-05-22 2019-11-28 広東技術師範学院 Storage method using user access preference model
CN111966293A (en) * 2020-08-18 2020-11-20 北京明略昭辉科技有限公司 Cold and hot data analysis method and system

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040236974A1 (en) * 2003-05-22 2004-11-25 International Business Machines Corporation Advanced computer hibernation functions
US20070276878A1 (en) * 2006-04-28 2007-11-29 Ling Zheng System and method for providing continuous data protection
JP2013025655A (en) * 2011-07-25 2013-02-04 Nidec Sankyo Corp Log file management module and log file management method
CN102289524A (en) * 2011-09-26 2011-12-21 深圳市万兴软件有限公司 Data recovery method and system
CN105868396A (en) * 2016-04-19 2016-08-17 上海交通大学 Multi-version control method of memory file system
CN108153804A (en) * 2017-11-17 2018-06-12 极道科技(北京)有限公司 A kind of metadata daily record update method of symmetric distributed file system
CN108052679A (en) * 2018-01-04 2018-05-18 焦点科技股份有限公司 A kind of Log Analysis System based on HADOOP
JP2019204474A (en) * 2018-05-22 2019-11-28 広東技術師範学院 Storage method using user access preference model
CN109522290A (en) * 2018-11-14 2019-03-26 中国刑事警察学院 A kind of HBase data block restores and data record extraction method
CN111966293A (en) * 2020-08-18 2020-11-20 北京明略昭辉科技有限公司 Cold and hot data analysis method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Peng Ying et al.: "A cloud computing and virtual storage platform design for petroleum exploration", Geomatics & Spatial Information Technology *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114415978A (en) * 2022-03-29 2022-04-29 维塔科技(北京)有限公司 Multi-cloud cluster data reading and writing method and device, storage medium and electronic equipment
CN114415978B (en) * 2022-03-29 2022-06-21 维塔科技(北京)有限公司 Multi-cloud cluster data reading and writing method and device, storage medium and electronic equipment
CN115510292A (en) * 2022-11-18 2022-12-23 四川汉唐云分布式存储技术有限公司 Distributed storage system tree search management method, device, equipment and medium

Also Published As

Publication number Publication date
CN113377733B (en) 2022-12-27

Similar Documents

Publication Publication Date Title
CN104756107B (en) Using location information profile data
JP4997856B2 (en) Database analysis program, database analysis apparatus, and database analysis method
CN113377733B (en) Storage optimization method for Hadoop distributed file system
US20040249808A1 (en) Query expansion using query logs
US11487729B2 (en) Data management device, data management method, and non-transitory computer readable storage medium
CN114003791B (en) Depth map matching-based automatic classification method and system for medical data elements
KR101679050B1 (en) Personalized log analysis system using rule based log data grouping and method thereof
KR102046692B1 (en) Method and System for Entity summarization based on multilingual projected entity space
CN110795613B (en) Commodity searching method, device and system and electronic equipment
CN111326236A (en) Medical image automatic processing system
Cheng et al. Supporting entity search: a large-scale prototype search engine
CN116775972A (en) Remote resource arrangement service method and system based on information technology
CN110472659A (en) Data processing method, device, computer readable storage medium and computer equipment
Theron The use of data mining for predicting injuries in professional football players
JP7292235B2 (en) Analysis support device and analysis support method
CN114281989A (en) Data deduplication method and device based on text similarity, storage medium and server
CN106776704A (en) Statistical information collection method and device
Zhang et al. PARROT: pattern-based correlation exploitation in big partitioned data series
Zhang et al. A learning-based framework for improving querying on web interfaces of curated knowledge bases
CN113806190A (en) Method, device and system for predicting performance of database management system
CN116450768B (en) Industrial data processing method, device and equipment oriented to low-code development platform
CN117744784B (en) Medical scientific research knowledge graph construction and intelligent retrieval method and system
Shyamala et al. A survey on online stock forum using subspace clustering
CN117785841A (en) Processing method and device for multi-source heterogeneous data
CN117951118A (en) Geotechnical engineering investigation big data archiving method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant