CN113377733B - Storage optimization method for Hadoop distributed file system - Google Patents

Storage optimization method for Hadoop distributed file system

Info

Publication number
CN113377733B
CN113377733B (application CN202110644122.4A)
Authority
CN
China
Prior art keywords
file
information
log
record
access
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110644122.4A
Other languages
Chinese (zh)
Other versions
CN113377733A (en)
Inventor
王周恺
贾乔
马维纲
王怀军
曹霆
李宇昕
王侃
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Technology
Original Assignee
Xian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Technology filed Critical Xian University of Technology
Priority to CN202110644122.4A priority Critical patent/CN113377733B/en
Publication of CN113377733A publication Critical patent/CN113377733A/en
Application granted granted Critical
Publication of CN113377733B publication Critical patent/CN113377733B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10 - File systems; File servers
    • G06F 16/18 - File system types
    • G06F 16/182 - Distributed file systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10 - File systems; File servers
    • G06F 16/16 - File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10 - File systems; File servers
    • G06F 16/17 - Details of further file system functions
    • G06F 16/172 - Caching, prefetching or hoarding of files
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10 - File systems; File servers
    • G06F 16/18 - File system types
    • G06F 16/1805 - Append-only file systems, e.g. using logs or journals to store data
    • G06F 16/1815 - Journaling file systems

Abstract

The invention discloses a storage optimization method for a Hadoop distributed file system, which specifically comprises the following steps: first, an INFO level log file is selected, where the selected log file contains the specific execution timestamp and file name information, and the access records and deletion records of the INFO level log file are acquired; all information containing the relevant keywords in the INFO level log is extracted and sorted, then ordered and numbered by timestamp; next, a feature label is determined, features are selected, feature vectors are constructed, and the sample set for training the file elimination model is formed; the three feature values of the feature vector are selected in order as the three classification nodes of a decision tree, the decision tree is built with the ID3 algorithm, and the file elimination model is constructed from the decision tree; finally, the established file elimination model is used to predict the reusability of files. The method of the invention optimizes the storage efficiency of the distributed file system, reduces the data storage scale, and improves the storage efficiency of the HDFS.

Description

Storage optimization method for Hadoop distributed file system
Technical Field
The invention belongs to the technical field of data storage, and particularly relates to a storage optimization method for a Hadoop distributed file system.
Background
With the increasingly wide application of big data computing engines (such as Apache Hadoop and Apache Spark), large amounts of new data must continuously be stored in the Hadoop distributed file system HDFS, which places great pressure on HDFS storage. The traditional approach keeps enlarging HDFS capacity through additional hardware investment in order to store the massive influx of data, but this is costly: most of the data stored in the HDFS has low utilization value and a low probability of ever being used or accessed again, so a large amount of hardware resources and software cost is wasted.
In the cloud computing era, the storage optimization problem of large-scale distributed file systems is receiving increasingly broad attention. For example, Kirsten et al. proposed a load balancing method for generalized distributed file systems from the perspective of access-rate balancing and data value. Shruthi et al. proposed a storage space clustering algorithm that defines a "distance" between nodes via data similarity and association and places data on appropriate nodes, reducing access time and the number of node accesses during operation. Although such methods can optimize a distributed file system to some extent and improve storage-space utilization, they require changing the underlying architecture and core allocation rules of the HDFS, so they are hard to implement and poorly portable. In China, Jin et al. proposed a virtual desktop optimization technique based on data deduplication, and Wu et al. proposed an ARC cache eviction mechanism for fault-tolerant cloud storage. The latter method uses erasure-code fault tolerance, retaining traditional replica redundancy only in the cache; it also achieves good results and noticeably reduces the storage footprint of the distributed file system. However, in a real distributed file system the proportion of duplicated data is not large; what dominates is a large amount of low-value data that is used only once or a few times.
Disclosure of Invention
The invention aims to provide a storage optimization method for a Hadoop distributed file system, which is used for optimizing the storage efficiency of the distributed file system, reducing the data storage scale and improving the storage efficiency of an HDFS.
The technical scheme adopted by the invention is that a storage optimization method for a Hadoop distributed file system is implemented according to the following steps:
step 1, extracting file operation records, specifically:
step 1.1: selecting an INFO level log file, wherein the selected log file comprises specific execution time stamp and file name information;
step 1.2: acquiring an access record of an INFO level log file;
step 1.3: acquiring a deletion record of an INFO level log file;
step 1.4: extracting and sorting all information containing the above keywords in the INFO level log, then ordering and numbering by timestamp; a field type denotes the operation type, where 1 denotes a deletion operation and 0 an access operation, F denotes the file name, and d denotes the time at which the operation occurred;
and 2, step: determining a feature label, selecting features, constructing a feature vector, and forming a sample set of a training file elimination model;
and step 3: selecting the three feature values of the feature vector in order as the three classification nodes of the decision tree, building the decision tree with the ID3 algorithm, and constructing the file elimination model from the decision tree; using MLlib as the tool, calling its programming interface with the sample set as input to train the file elimination model, and after training, saving the model back to the HDFS in JSON format for later use in file elimination;
and 4, step 4: and predicting the reusability of the file by using the established file elimination model.
The present invention is also characterized in that,
In step 1.2, to acquire the log file access records, the filter operation combined with a lambda expression is used to find, in the log file, all entries generated by the NameNode node that contain the completeFile keyword; the timestamp keyword and file name keyword they contain are extracted and saved to the HDFS as the file access records.
In step 1.3, a filter function combined with a lambda expression is first used to find, in the log file, all entries generated by the NameNode node that contain the addToInvalidates keyword; the timestamp keyword and block name keyword they contain are then extracted; next, the same filter function and lambda expression are used to find, in the log file, entries containing the allocateBlock keyword whose block name keyword matches the one in the addToInvalidates entry, and the file name keyword and timestamp keyword they contain are extracted; finally, the file name keyword from the allocateBlock entry and the timestamp keyword from the corresponding addToInvalidates entry are saved as file deletion records and stored in the HDFS.
In the step 2, the method specifically comprises the following steps:
step 2.1: defining label of the sample set as 'whether deletion is possible'; "yes" is a positive label, which indicates that the multiplexing possibility of the file is very low, the file can be deleted, and the tuple marked as "yes" belongs to a positive sample; "no" is a negative label, which indicates that the file may be multiplexed and not deleted, and the tuple marked as "no" belongs to a negative sample;
converting each file access and deletion record into a feature vector tuple containing a label; regarding the tuple characteristics, taking type as label information; the main correspondence rules for the tags are as follows:
for each file operation record r0:
a) If type =0, the record is marked as an access record; indicating that the corresponding file F is multiplexed at the time d when the file operation occurs; the file has multiplexing possibility at d and cannot be deleted; the tuple is marked as 'no' and is a negative sample;
b) If type =1, the record is marked as a deleted record; when the file operation occurs at the time d or later, the corresponding file f has no multiplexing possibility any more and can be deleted; the tuple is marked as "yes", and is a positive sample;
step 2.2: selecting characteristics;
All file access records and file deletion records in the HDFS are taken out, and for records with the same file name three quantities are computed: the creation age d_c (unit: days), the idle time d_a (unit: days), and the average daily access frequency since creation f_rq, given by formulas (1), (2) and (3) respectively, which form the features of the sample set; each item composed of the above label and feature values is also called a feature vector;
d_c = d_0 - d_c0    (1);
d_a = d_0 - d_a0    (2);
f_rq = n_0 / (d_0 - d_c0)    (3);
where d_c0 is the creation date of the file; d_0 is the file operation timestamp; d_a0 is the date of the last access to the file; and n_0 is the number of accesses to the file up to time d_0.
In step 4, the method is specifically: first, the current file list is read from the HDFS, and the trained file elimination model is then used to predict whether each file can be deleted; for each file, all of its operation records are extracted from the log by file name, its d_c, d_a and f_rq at time d are computed to obtain a feature vector, the vector enters the trained decision tree for prediction, and the label "can delete (yes)" or "suggest retention (no)" is returned.
The method has the advantages that the importance and the access heat of the files stored in the HDFS are periodically analyzed by establishing an analysis model, the files or file copies with low repeated access frequency in the HDFS are selected according to the analysis result of the model, the files are recommended to be deleted to a user, and the storage space is cleaned, so that the data storage scale is reduced, and the storage efficiency of the HDFS is improved.
Drawings
FIG. 1 is a flow chart of a method of storage optimization for a Hadoop distributed file system of the present invention;
FIG. 2 is a flow chart of creating a file in a storage optimization method for a Hadoop distributed file system according to the present invention;
FIG. 3 is a flowchart of reading a file in a storage optimization method for a Hadoop distributed file system according to the present invention;
FIG. 4 is a flowchart of file deletion in the storage optimization method for a Hadoop distributed file system according to the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
The invention relates to a storage optimization method for a Hadoop distributed file system, which is implemented according to the following steps as shown in figure 1:
step 1, extracting file operation records, specifically:
step 1.1: selecting an INFO level log file, wherein the selected log file comprises a specific execution timestamp and file name information;
the HDFS stores massive log file contents, records various operations on a distributed file system, is mainly divided into three levels, namely WARN, INFO and DEBUG, and the recording detail degree of the HDFS is increased in sequence. The DEBUG level log is positioned at the bottom layer, the recorded content is most direct and detailed, but the data volume is large; the top layer of the WARN-level log only records key information and information which possibly causes errors, and the information amount is too small to facilitate the log analysis work. The invention hopes to extract the log file as little as possible, reduce the analysis scale and the calculation amount of the log and strive to complete the analysis of the log file with the minimum cost. Therefore, selecting the middle INFO level log file for analysis;
With the log level determined, acquisition of the file operation records can begin. File operations comprise writing, reading and deleting files. Correspondingly, the log file operation records include write records, read records and delete records; write records and read records, together called access records of the log file, can be obtained in one pass, while deletion records differ from access records and must be acquired by a separate method;
step 1.2: obtaining an access record of an INFO level log file;
First, the INFO level log files are read from the distributed file system HDFS into the big data computing platform using the PySpark API provided by Apache Spark, forming an original resilient distributed dataset (RDD); because the number of log entries is huge, on the basis of this RDD it is necessary to analyze and determine which records are meaningful and usable for analysis and computation, and to extract and store those.
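For illustration, a minimal PySpark sketch of this loading step is given below; the application name and HDFS log path are hypothetical placeholders rather than values taken from the patent:

# Minimal sketch: load INFO-level NameNode logs from HDFS into an RDD.
# The HDFS path below is a hypothetical placeholder.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("hdfs-log-extraction")
sc = SparkContext(conf=conf)

# Each element of the RDD is one raw log line.
raw_logs = sc.textFile("hdfs:///var/log/hadoop/namenode/*.log")

# Keep only INFO-level lines before any further filtering.
info_logs = raw_logs.filter(lambda line: " INFO " in line)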
The first concern is the analysis of file write records: since files on the HDFS are write-once (read-only after creation), writing a file is equivalent to creating it. The specific flow is shown in fig. 2.
Firstly, a DFS client side initiates a file creating request to a NameNode through an HDFS, and after the NameNode confirms that a file does not exist, a corresponding file is newly built in a name space and blocks are distributed for the file. After the completion, the HDFS returns FSDataOutputStream to the DFS client and initializes a data stream pipe (pipeline) between the datanodes, and the DataNode data receiving service is started. The client writes data to the FSDataOutputStream in a data stream (stream) mode, the FSDataOutputStream divides the data into blocks, stores the blocks into a data queue (data queue), sends the blocks to the DataNode in a packet (packet) mode, the DataNode receives the data and transmits the data to other DataNodes through a data stream pipeline, and the DataNode receiving the data returns ack information. After the file writing is completed, the client closes the data stream, sends information of 'completing file operation' to the NameNode, and the NameNode verifies information of the file INode, the lease, the operation authority and the like, so that the file creation is completed.
According to the above process, the operations closely related to storing a file are mainly, in fig. 2: 1. the create file request; 2. creating a new file and allocating blocks; 4. sending packets; 5. the ack operation; and 7. completing the file operation. However, in the INFO level log, no log information related to "1. create file request" is stored; moreover, "4. send packet" and "5. ack" both represent data writes to the distributed file, so of the two only the log information corresponding to operation 4 is retained. In summary, the key log information when the HDFS writes (creates) a file covers operations 2, 4 and 7; the log entries corresponding to these operations can be collected for subsequent analysis and computation, and their specific formats are shown in Table 1.
TABLE 1 HDFS File write data related Log information
(Table content was published as an image in the original and is not reproduced here.)
Next is the acquisition of file read information. Reading a file is simpler than writing one, involves fewer operations and, correspondingly, less log information, as shown in fig. 3.
The DFS client initiates a file open request to the NameNode through the HDFS, and the NameNode, after confirming the operation permission and that the file exists, returns the information of the data blocks corresponding to the file. After this, the HDFS returns an FSDataInputStream to the DFS client for reading data. The client requests data from the FSDataInputStream, which selects the closest of all DataNodes containing the first data block, connects and starts reading. After one data block is read, the FSDataInputStream selects the closest node containing the next data block and continues reading. When reading finishes, the client closes the data stream and sends a "file operation complete" message to the NameNode. If a communication error occurs with a DataNode during this process, the next node containing the data block is connected, and the failed node is recorded and not contacted again.
According to the above process, the operations closely related to reading a file are mainly, in fig. 3: 1. the open file request; 2. the block allocation list; 4. reading data; and 6. completing the file operation. However, in the INFO level log, no log information related to "1. open file request" is stored; moreover, the log information related to the "2. block allocation list" contains no file name, so analyzing it cannot determine which files should be eliminated, and the same holds for the log information of the "4. read data" operation. Therefore, in the file reading process, the only log information that can be collected and used for analysis is that corresponding to operation 6, shown in Table 2.
Table 2 HDFS reads data-related log information from a file
(Table content was published as an image in the original and is not reproduced here.)
Combining Table 1 and Table 2, it is clear that whether a file is written or read, the access leaves a log entry with the keyword completeFile, indicating that the write or read operation on the file has completed. The format is as follows:
<timestamp> DIR* completeFile: <filename> is closed by <DFS client number>
This log entry contains a timestamp and a file name, represents both file writes and file reads, and is representative and exclusive in meaning, so it satisfies the log selection requirement well. Therefore, the invention uses the filter, map and other functions provided by Spark, combined with lambda expressions, to acquire from the resilient distributed dataset all entries generated by the NameNode node that contain the completeFile keyword, and extracts the timestamp keyword and file name keyword they contain as the file access records to be saved (stored in the HDFS).
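A sketch of this access-record extraction follows, continuing from the info_logs RDD loaded above; the regular expression and the output path are assumptions about the exact log layout, not the patent's code:

import re

# Assumed line shape:
# "<timestamp> INFO ... DIR* completeFile: <filename> is closed by <client>"
COMPLETE_FILE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"      # timestamp
    r".*completeFile:\s+(?P<fname>\S+)\s+is closed by")

def parse_access(line):
    m = COMPLETE_FILE_RE.search(line)
    return (m.group("ts"), m.group("fname")) if m else None

access_records = (info_logs
    .filter(lambda line: "completeFile" in line)   # NameNode access entries
    .map(parse_access)
    .filter(lambda rec: rec is not None))          # drop unparseable lines

# Persist "<timestamp>\t<filename>" pairs as file access records in HDFS.
access_records.map(lambda r: "\t".join(r)).saveAsTextFile(
    "hdfs:///tmp/file_access_records")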
Step 1.3: acquiring a deletion record of an INFO level log file;
As in step 1.2, the INFO level log files are first read from the distributed file system HDFS into the big data computing platform using the PySpark API provided by Apache Spark, forming an original resilient distributed dataset; the useful file deletion records are then analyzed and selected from it. However, acquiring file deletion records is more complicated than acquiring file access records, because the INFO log mainly records deletion operations with the block as the basic unit, making it difficult to learn directly from the log which files have been deleted in the distributed file system.
Specifically, the flow of deleting a file on the HDFS is shown in fig. 4.
The DFS client first initiates a file deletion request to the NameNode. The NameNode then looks up the block allocation information of the file in the namespace, adds the file's blocks to the "invalid block list", and prepares to reclaim them; at the same time, all references to the file and its blocks are deleted. The NameNode and the DataNodes stand in a master-slave relationship, and the NameNode does not actively send information to a DataNode. Instead, while the HDFS runs, each DataNode continuously and periodically sends a "heartbeat" to the NameNode, and the NameNode replies to the heartbeat. When a file is to be deleted, the DataNode obtains the invalid block list from the NameNode through the heartbeat and deletes the data of the corresponding blocks on its own node.
Accordingly, the log information related to file deletion is shown in Table 3. In Table 3, both the NameNode and the DataNode record the name of the block to be deleted, which is duplicated information. Therefore, to reduce the amount of log information to be processed, the invention collects the file deletion information on the NameNode as the basis for subsequent computation and analysis.
TABLE 3 Log information related to File deletion on HDFS
(Table content was published as an image in the original and is not reproduced here.)
However, the log does not include a file name, and therefore, it is necessary to acquire detailed information of the deleted file by using a block matching method, which is as follows.
First, the format of the log entry recording file creation (block allocation) is as follows:
<timestamp> BLOCK* allocateBlock: <filename> <block name> {<block copy information>}
The format of the log entry recording file deletion is as follows:
<timestamp> BLOCK* addToInvalidates: <block name>
Therefore, one only needs to find the block allocation (allocateBlock) log entry containing the same <block name> as a file deletion (addToInvalidates) log entry: this locates the file to which the block being reclaimed belongs, determines the deletion time, establishes that the file has been deleted, and lets the record of the deleted file be sorted out. The procedure is similar to the acquisition of file access records. First, a filter function combined with a lambda expression finds, in the log file, all entries generated by the NameNode node that contain the addToInvalidates keyword; the timestamp keyword and block name keyword they contain are then extracted. Next, the same filter function and lambda expression find, in the log file, the entries containing the allocateBlock keyword whose block name is identical to the one in the addToInvalidates entry, and the file name keyword and timestamp keyword they contain are extracted. Finally, the file name keyword from the allocateBlock entry and the timestamp keyword from the corresponding addToInvalidates entry are saved as file deletion records (stored in the HDFS). Concretely, whenever an addToInvalidates log entry is encountered, the block denoted by its <block name> part has been deleted, and a deleted block implies that the file to which it belongs has also been deleted. Under this logic, if the log files searched contain an allocateBlock entry with the same <block name> as the addToInvalidates entry, then the <filename> recorded in that allocateBlock entry is the name of the deleted file, and the <timestamp> recorded in the original addToInvalidates entry is the time at which the file was deleted, which is recorded together with it.
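A sketch of this block-matching step, again starting from the info_logs RDD; the two regular expressions and the output path are assumptions about the exact log layout:

import re

# Assumed shapes of the two entry types (illustrative, not exact):
ADD_INVALIDATES_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}).*addToInvalidates.*?"
    r"(?P<blk>blk_[\d_-]+)")
ALLOCATE_BLOCK_RE = re.compile(
    r"allocateBlock:\s+(?P<fname>\S+)\s+.*?(?P<blk>blk_[\d_-]+)")

def parse_invalidate(line):
    m = ADD_INVALIDATES_RE.search(line)
    return (m.group("blk"), m.group("ts")) if m else None    # block -> del time

def parse_allocate(line):
    m = ALLOCATE_BLOCK_RE.search(line)
    return (m.group("blk"), m.group("fname")) if m else None  # block -> file

invalidated = (info_logs.filter(lambda l: "addToInvalidates" in l)
               .map(parse_invalidate).filter(lambda r: r is not None))
allocated = (info_logs.filter(lambda l: "allocateBlock" in l)
             .map(parse_allocate).filter(lambda r: r is not None))

# Joining on block name recovers (file name, deletion time) pairs.
deletion_records = (invalidated.join(allocated)
                    .map(lambda kv: (kv[1][1], kv[1][0])))    # (fname, ts)
deletion_records.map(lambda r: "\t".join(r)).saveAsTextFile(
    "hdfs:///tmp/file_deletion_records")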
Step 1.4: extract and sort all information containing the above keywords in the selected INFO level logs, then order and number the entries by timestamp. A field type denotes the operation type, where 1 denotes a deletion operation and 0 an access operation (a file write or read), F denotes the file name, and d denotes the time at which the operation occurred; the collected and ordered log information is used as the sample set for constructing and training the file elimination model.
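For concreteness, one possible structured form of these records is sketched below, using the patent's type/F/d field convention; the namedtuple container and the driver-side merge are illustrative choices, not the patent's code:

from collections import namedtuple

# type: 1 = deletion, 0 = access (write or read); F: file name; d: timestamp.
FileOp = namedtuple("FileOp", ["type", "F", "d"])

def build_operation_records(access_pairs, deletion_pairs):
    """Merge (ts, fname) access pairs and (fname, ts) deletion pairs,
    sort them by timestamp, and number the result sequentially."""
    ops = ([FileOp(0, fname, ts) for ts, fname in access_pairs] +
           [FileOp(1, fname, ts) for fname, ts in deletion_pairs])
    ops.sort(key=lambda op: op.d)        # order by operation time
    return list(enumerate(ops))          # (sequence number, record) pairs

# e.g. on small samples collected to the driver from the RDDs above:
# build_operation_records(access_records.take(1000), deletion_records.take(1000))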
And 2, step: determining a feature label (label), selecting features, constructing a feature vector, and forming a sample set of a training file elimination model; the method specifically comprises the following steps:
step 2.1: the label of the sample set is defined as "whether it can be deleted". "yes" is a positive label, indicating that the file multiplexing probability is low, and the file multiplexing probability can be deleted, and the tuple labeled "yes" belongs to a positive sample. "no" is a negative label, which indicates that the file may be multiplexed and not deleted, and the tuple labeled "no" belongs to a negative sample.
And converting each file access and deletion record into a feature vector tuple containing a label. For tuple characteristics, type above is taken as label information. The main correspondence rules for the tags are as follows:
for each file operation record r0:
a) If type =0, the record is marked as an access record. Indicating that at time d, when the file operation occurs, the corresponding file F is multiplexed. The file has the possibility of multiplexing at d, and cannot be deleted. The tuple is labeled "no", as a negative sample.
b) If type =1, the record is recorded as a delete record. Indicating that the corresponding file f no longer has the multiplexing possibility and can be deleted at the time d when the file operation occurs or later. The tuple is marked as "yes", and is a positive sample;
step 2.2: selecting characteristics;
the method adopts a machine learning algorithm of supervised learning to establish a file elimination model, and the selection of characteristics is very important. The features cannot be too many, otherwise the vector space dimension is too high, which can greatly increase the amount of computation.
To extract the features, a record of creation of the file, i.e., the first access to the file, may be extracted from the file access records.
To judge the multiplexing possibility of files, the access and deletion records of each file are considered: for a file F_0, select an operation timestamp d_0. Let the creation date of F_0 be d_c0 and its last access date be d_a0, and suppose the file has been accessed n_0 times up to time d_0. The following three dimensions are selected, in order, as the feature vector space:
(1) creation age d_c (unit: days):
d_c = d_0 - d_c0
(2) time since the last access d_a (unit: days), i.e., the idle time:
d_a = d_0 - d_a0
(3) average daily access frequency since creation f_rq:
f_rq = n_0 / (d_0 - d_c0)
In summary, the invention selects the three features d_c, d_a and f_rq. Table 4 lists an example of a labeled sample set.
Table 4 sample set format example
(Table content was published as an image in the original and is not reproduced here.)
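A sketch of the per-file feature computation at a chosen operation time d_0 is given below, assuming dates are held as datetime.date values; the helper name and the example dates are hypothetical:

from datetime import date

def feature_vector(d0, d_c0, d_a0, n0):
    """Compute (d_c, d_a, f_rq) for one file at operation time d0.

    d0   -- date of the operation under consideration
    d_c0 -- creation date of the file
    d_a0 -- date of the last access before d0
    n0   -- number of accesses to the file up to d0
    """
    d_c = (d0 - d_c0).days                      # formula (1): creation age
    d_a = (d0 - d_a0).days                      # formula (2): idle time
    f_rq = n0 / d_c if d_c > 0 else float(n0)   # formula (3), guarding day 0
    return (d_c, d_a, f_rq)

# Example: created 2019-01-01, last accessed 2019-03-01, 12 accesses by 2019-04-01.
print(feature_vector(date(2019, 4, 1), date(2019, 1, 1), date(2019, 3, 1), 12))
# -> (90, 31, 0.1333...)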
And step 3: the three feature values of the feature vector are selected, in order, as the three classification nodes of the decision tree; the decision tree is built with the ID3 algorithm, and the file elimination model is constructed from the decision tree. Using MLlib as the tool, its programming interface is called directly, with the sample set as input, to train the file elimination model. Specifically, the information entropy of the samples in the sample set is computed first. The formula is as follows, where D denotes the data set, m the number of classes, and p_i the probability that a tuple in D belongs to class C_i, which can be estimated by |C_i,D| / |D|:
Info(D) = - Σ_{i=1..m} p_i log2(p_i)
On this basis, the information entropy obtained when a feature A is taken as the splitting node is also computed, where A partitions D into v subsets D_1, ..., D_v:
Info_A(D) = Σ_{j=1..v} (|D_j| / |D|) × Info(D_j)
After the information entropy is computed, the information gain obtained when the feature value is used as a classification node is calculated as follows:
Gain(A) = Info(D) - Info_A(D)
For each round of classification node generation, the feature value with the maximum information gain is taken as the splitting attribute, which generates the whole decision tree. After training finishes, the model is saved back to the HDFS in JSON format for later use in file elimination;
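The sketch below illustrates the entropy and information-gain computations behind ID3 together with a corresponding RDD-based MLlib call (the Spark 1.x API). Using MLlib's entropy-impurity decision tree as a stand-in for a literal ID3 implementation is an assumption of this sketch, as are the sample values and model path; note also that MLlib persists models in its own on-disk format rather than the plain JSON mentioned above:

import math
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree

def info(labels):
    """Info(D) = -sum_i p_i * log2(p_i) over the class labels of D."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def gain(rows, attr):
    """Gain(A) = Info(D) - Info_A(D); rows are (features..., label) tuples."""
    labels = [row[-1] for row in rows]
    info_a = sum(
        len(sub) / len(rows) * info(sub)
        for v in set(row[attr] for row in rows)
        for sub in [[row[-1] for row in rows if row[attr] == v]])
    return info(labels) - info_a

# Train with MLlib: each sample is (label, [d_c, d_a, f_rq]).
samples = sc.parallelize([
    LabeledPoint(1.0, [240.0, 200.0, 0.05]),   # "yes": old, idle, rarely used
    LabeledPoint(0.0, [3.0, 1.0, 4.2]),        # "no": fresh, hot file
    # ... the real sample set comes from the extracted log records
])
model = DecisionTree.trainClassifier(
    samples, numClasses=2, categoricalFeaturesInfo={},
    impurity="entropy", maxDepth=3)            # entropy mirrors ID3's criterion
model.save(sc, "hdfs:///tmp/file_elimination_model")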
and 4, step 4: predicting reusability of the files by using the established file elimination model;
The method is specifically: first, the current file list is read from the HDFS, and the trained file elimination model is then used to predict whether each file can be deleted. For each file, all of its operation records are extracted from the log by file name; its d_c, d_a and f_rq at time d are computed to obtain a feature vector; the vector enters the trained decision tree for prediction; and the label "can delete (yes)" or "suggest retention (no)" is returned.
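A sketch of this prediction step under the same assumptions; the model path and the feature values are placeholders:

from pyspark.mllib.tree import DecisionTreeModel

def recommend(model, d_c, d_a, f_rq):
    """Map the model's numeric prediction to the patent's two labels."""
    label = model.predict([float(d_c), float(d_a), float(f_rq)])
    return "can delete (yes)" if label == 1.0 else "suggest retention (no)"

loaded = DecisionTreeModel.load(sc, "hdfs:///tmp/file_elimination_model")
# For each file in the current HDFS listing, compute its features at time d
# from the extracted operation records, then query the model:
print(recommend(loaded, 240, 200, 0.05))   # e.g. "can delete (yes)"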
In order to verify the effectiveness of the method, a verification experiment is designed, and the functions of the file elimination model provided by the invention are evaluated from the aspects of precision, recall, F1-measure and the like. On the basis, the performance of the file elimination model provided by the invention is evaluated from the aspects of response time and the like.
The experimental environment is a computing cluster of 7 homogeneous PCs connected through a Juniper J6350 enterprise-grade high-performance router, on which a Python 3.5.0 runtime environment was first deployed. Every computing node in the cluster has the same hardware configuration: an Intel Core i5-4590 CPU, 8 GB of memory, and a 380 GB mechanical hard disk. The machines run the Ubuntu 15.10 operating system, and the whole cluster is built on the HDFS of Spark 1.5.1 and Hadoop 2.6.0. In the cluster, one PC serves as the control node (Master Node) and manages the other six computing nodes (Slave Nodes), which are responsible for executing the actual services.
k-fold cross-validation is a common classifier evaluation method that partitions the original data set into k similarly sized disjoint subsets (i.e., folds) D_1, D_2, ..., D_k. Training and testing are repeated k times: in the i-th round, D_i is held out as the prediction set and the remaining data serve as the training set. Averaging the k results gives a single estimate.
k-fold differs from other model evaluation methods such as holdout validation and random subsampling in that, in k-fold cross-validation, every subset is used as part of the training set and as the test set the same number of times, which reduces the risk that the data split affects model validation.
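A minimal driver-side sketch of the k-fold split is shown below, with train_and_score standing in for any train-then-evaluate routine; all names here are illustrative:

import random

def k_fold_indices(n_samples, k, seed=42):
    """Split sample indices into k roughly equal disjoint folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def k_fold_evaluate(samples, k, train_and_score):
    """Average the score over k rounds; fold i is the prediction set."""
    folds = k_fold_indices(len(samples), k)
    scores = []
    for i in range(k):
        test = [samples[j] for j in folds[i]]
        train = [samples[j] for f in folds[:i] + folds[i + 1:] for j in f]
        scores.append(train_and_score(train, test))
    return sum(scores) / k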
Precision and recall are commonly used classifier evaluation metrics. In the invention, these two criteria are used, together with the F_1 score, to evaluate the decision tree in the file elimination model and thereby assess the functionality of the model.
Classifier labels can be divided into positive labels (positive) and negative labels (negative). The positive label corresponds to the positive samples: the part of interest that one expects to retrieve. The negative label corresponds to the negative samples: the remaining, implicit part. In the experimental validation of the invention, the positive label "yes" indicates that the file can be deleted, and the negative label "no" indicates that retention is recommended. Four cases can be distinguished:
1) True Positive (TP): the prediction label is positive, and the prediction result is correct.
2) True Negative (TN): the prediction label is negative, and the prediction result is correct.
3) False Positive (FP): the prediction label is positive, and the prediction result is wrong.
4) False Negative (FN): the prediction tag is negative and the prediction result is wrong.
The confusion matrix composed of these four cases is shown in table 5.
TABLE 5 confusion matrix
                     Predicted positive    Predicted negative
Actually positive    TP                    FN
Actually negative    FP                    TN
On the basis, the precision, the recall degree and the F 1 The calculation method of score is described as follows:
Precision describes how accurate the returned positive-label results are; its formula is:
precision = TP / (TP + FP)
Recall describes how many of the positive-label tuples can be accurately retrieved; its formula is:
recall = TP / (TP + FN)
Precision and recall describe the model from two opposite directions. For example, if only one of a hundred positive tuples is returned, precision is very high but recall is very low; if all tuples are returned outright, recall is 100% but precision suffers greatly.
The F_1 score is the harmonic mean of precision and recall; it balances these two opposing aspects well and is a good indicator for classifier evaluation, so it is also used in the experimental validation of the invention. The calculation formula is as follows:
F_1 = 2 × precision × recall / (precision + recall)
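A small helper computing the three metrics from confusion-matrix counts; the example counts below are chosen only so that the output reproduces the percentages reported in the experiment, and are not the experiment's actual counts:

def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=730, fp=270, fn=17)
print(round(p, 3), round(r, 3), round(f1, 3))   # 0.73 0.977 0.836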
The experimental data come from the log files generated by the NameNode on the cluster's HDFS from October 2018 to April 2019: 8.7 million log entries, 3.4 GB in size. From these, 60,000 samples were extracted. One tenth of the samples served as the training set, and the rest as the prediction set. After the decision tree in the prediction model was trained on the training set, the trained tree was used to process the prediction set data, the statistics of the classifier labels were recorded, and the evaluation indices were computed with the three formulas above to analyze and evaluate the function of the decision tree in detail, and in turn the function of the decision-tree-based file elimination model.
The decision tree evaluation results of the experiment are shown in Table 6.
TABLE 6 evaluation results for decision Tree
Precision: 73.0%    Recall: 97.7%    F1-score: 83.6%
As can be seen from Table 6, the recall of the decision tree is 97.7%, indicating that the prediction model can find most useless files using the trained decision tree; the model is very effective. The precision of the decision tree is lower than its recall, at 73.0%. The main reasons for this result are, first, that users do not manage the HDFS rigorously: files are not deleted immediately after use, which introduces a certain randomness; and second, that during training the negative samples (file accesses) far outnumber the positive samples (file deletions), so the training set is sparse, which affects the learning of positive samples to some extent. However, the prediction model of the invention only recommends possibly useless files to the user, and the user makes the final decision, which guarantees the safety of the prediction system to a certain extent. Finally, the F_1 score, which integrates precision and recall, is 83.6%, indicating that the overall performance of the prediction model is good and that files to be eliminated can be effectively recommended to the user.
The optimization method proceeds as follows: the logs of the most recent specified period are read from the HDFS, extracted, and filtered and sorted into structured file operation records. Specifically, for file access records it suffices to extract the meaningful log entries; for file deletion records, a file/block-replica matching method is required, which combines the file life cycle with the block allocation record written at file creation and the block reclamation record written at deletion.
After log extraction, the label of the sample set is determined and features are extracted from the file operation records. The labeled records are then handed to a decision tree for training; in the proposed method, an ID3 decision tree is used to build the file elimination model. Concretely, by analyzing the historical file operation records and file management habits on the HDFS, the creation age d_c, the idle time d_a, and the average daily access frequency since creation f_rq are extracted; on this basis, the ID3 decision tree machine learning algorithm analyzes the multiplexing possibility of each file and judges whether it should be automatically eliminated. Finally, the decision tree is evaluated in terms of accuracy, precision, recall and so on. If the trained decision tree is qualified, the current file list is imported from the HDFS, the reusability of each file is predicted with the trained decision tree, and a list of files recommended for elimination is produced and returned to the user.

Claims (2)

1. A storage optimization method for a Hadoop distributed file system is characterized by comprising the following steps:
step 1, extracting file operation records, specifically:
step 1.1: selecting an INFO level log file, wherein the selected log file comprises specific execution time stamp and file name information;
step 1.2: obtaining an access record of an INFO level log file;
for the acquisition of the log file access records, the filter operation combined with a lambda expression is used to find, in the log file, all entries generated by the NameNode node that contain the completeFile keyword; the timestamp keyword and file name keyword they contain are extracted as the file access records and stored in the HDFS;
step 1.3: acquiring a deletion record of an INFO level log file;
firstly, a filter function combined with a lambda expression is used to find, in the log file, all entries generated by the NameNode node that contain the addToInvalidates keyword; the timestamp keyword and block name keyword they contain are then extracted; next, the same filter function and lambda expression are used to find, in the log file, the entries containing the allocateBlock keyword whose block name is identical to the one in the addToInvalidates entry, and the file name keyword and timestamp keyword they contain are extracted; finally, the file name keyword from the allocateBlock entry and the timestamp keyword from the corresponding addToInvalidates entry are saved as file deletion records and stored in the HDFS;
step 1.4: extracting and sorting all information containing the above keywords in the INFO level log, then ordering and numbering by timestamp; a field type denotes the operation type, where 1 denotes a deletion operation and 0 an access operation, F denotes the file name, and d denotes the time at which the operation occurred;
step 2: determining a feature label, selecting features, constructing a feature vector, and forming a sample set of a training file elimination model; the method specifically comprises the following steps:
step 2.1: defining label of the sample set as 'whether deletion is possible'; "yes" is a positive label, which indicates that the multiplexing possibility of the file is very low, and the file can be deleted, and the tuple marked as "yes" belongs to a positive sample; "no" is a negative label, which indicates that the file may be multiplexed and not deleted, and the tuple marked as "no" belongs to a negative sample;
converting each file access and deletion record into a feature vector tuple containing a feature label; regarding the tuple characteristics, taking type as label information; the main correspondence rules of the feature labels are as follows:
for each file operation record r0:
a) If type =0, the record is marked as an access record; indicating that the corresponding file F is multiplexed at the time d when the file operation occurs; the file has multiplexing possibility at d and cannot be deleted; the tuple is marked as "no" and is a negative sample;
b) If type =1, the record is marked as a deleted record; when the file operation occurs at the time d or later, the corresponding file f has no multiplexing possibility any more and can be deleted; the tuple is marked as "yes" and is a positive sample;
step 2.2: selecting characteristics;
all file access records and file deletion records in the HDFS are taken out, and the creation time length d is calculated for records with the same file name c The unit: day; duration of non-access d a The unit: day; creating an average daily Access frequency to date f rq (ii) a Forming characteristics of a sample set as shown in formulas (1), (2) and (3), wherein each record formed by the label and the characteristics is also called a characteristic vector;
d c =d 0 -d c0 (1);
d a =d 0 -d a0 (2);
Figure FDA0003921950820000031
in the formula (d) c0 A file creation date; d 0 A file timestamp; d a0 Date of last access to the file; n is 0 Is to d 0 The number of times of accessing the file at all times;
and step 3: selecting the three features of the feature vector, d_c, d_a and f_rq, in order as the three classification nodes of the decision tree; building the decision tree with the ID3 algorithm and constructing the file elimination model from the decision tree; using MLlib as the tool, calling its programming interface with the sample set as input to train the file elimination model; and after training, saving the model back to the HDFS in JSON format for later use in file elimination;
and 4, step 4: and predicting the reusability of the file by using the established file elimination model.
2. The storage optimization method for the Hadoop distributed file system according to claim 1, wherein step 4 is specifically: first, the current file list is read from the HDFS, and the trained file elimination model is then used to predict whether each file can be deleted; for each file, all of its operation records are extracted from the log by file name, its d_c, d_a and f_rq at time d are computed to obtain a feature vector, the vector enters the trained decision tree for prediction, and the label "can be deleted (yes)" or "retention suggested (no)" is returned.
CN202110644122.4A 2021-06-09 2021-06-09 Storage optimization method for Hadoop distributed file system Active CN113377733B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110644122.4A CN113377733B (en) 2021-06-09 2021-06-09 Storage optimization method for Hadoop distributed file system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110644122.4A CN113377733B (en) 2021-06-09 2021-06-09 Storage optimization method for Hadoop distributed file system

Publications (2)

Publication Number Publication Date
CN113377733A CN113377733A (en) 2021-09-10
CN113377733B true CN113377733B (en) 2022-12-27

Family

ID=77573292

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110644122.4A Active CN113377733B (en) 2021-06-09 2021-06-09 Storage optimization method for Hadoop distributed file system

Country Status (1)

Country Link
CN (1) CN113377733B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114415978B (en) * 2022-03-29 2022-06-21 维塔科技(北京)有限公司 Multi-cloud cluster data reading and writing method and device, storage medium and electronic equipment
CN115510292B (en) * 2022-11-18 2023-03-24 四川汉唐云分布式存储技术有限公司 Distributed storage system tree search management method, device, equipment and medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109522290A (en) * 2018-11-14 2019-03-26 中国刑事警察学院 A kind of HBase data block restores and data record extraction method
JP2019204474A (en) * 2018-05-22 2019-11-28 広東技術師範学院 Storage method using user access preference model
CN111966293A (en) * 2020-08-18 2020-11-20 北京明略昭辉科技有限公司 Cold and hot data analysis method and system

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7073052B2 (en) * 2003-05-22 2006-07-04 International Business Machines Corporation Method and system for storing active files and programs in a hibernation mode
US7769723B2 (en) * 2006-04-28 2010-08-03 Netapp, Inc. System and method for providing continuous data protection
JP2013025655A (en) * 2011-07-25 2013-02-04 Nidec Sankyo Corp Log file management module and log file management method
CN102289524B (en) * 2011-09-26 2013-01-30 深圳市万兴软件有限公司 Data recovery method and system
CN105868396A (en) * 2016-04-19 2016-08-17 上海交通大学 Multi-version control method of memory file system
CN108153804B (en) * 2017-11-17 2021-03-16 极道科技(北京)有限公司 Metadata log updating method for symmetric distributed file system
CN108052679A (en) * 2018-01-04 2018-05-18 焦点科技股份有限公司 A kind of Log Analysis System based on HADOOP

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2019204474A (en) * 2018-05-22 2019-11-28 広東技術師範学院 Storage method using user access preference model
CN109522290A (en) * 2018-11-14 2019-03-26 中国刑事警察学院 A kind of HBase data block restores and data record extraction method
CN111966293A (en) * 2020-08-18 2020-11-20 北京明略昭辉科技有限公司 Cold and hot data analysis method and system

Also Published As

Publication number Publication date
CN113377733A (en) 2021-09-10

Similar Documents

Publication Publication Date Title
CN104756107B (en) Using location information profile data
CN113377733B (en) Storage optimization method for Hadoop distributed file system
US11487729B2 (en) Data management device, data management method, and non-transitory computer readable storage medium
KR20120113736A (en) Method of searching for document data files based on keywords, and computer system and computer program thereof
CN108804473B (en) Data query method, device and database system
KR101679050B1 (en) Personalized log analysis system using rule based log data grouping and method thereof
CN108536692A (en) A kind of generation method of executive plan, device and database server
KR102046692B1 (en) Method and System for Entity summarization based on multilingual projected entity space
JPWO2011071010A1 (en) Load characteristic estimation system, load characteristic estimation method and program
CN111326236A (en) Medical image automatic processing system
CN114003791A (en) Depth map matching-based automatic classification method and system for medical data elements
CN110472659A (en) Data processing method, device, computer readable storage medium and computer equipment
US20230394015A1 (en) LIST-BASED DATA STORAGE FOR DATA SEARCHPeter
JP7292235B2 (en) Analysis support device and analysis support method
CN114281989A (en) Data deduplication method and device based on text similarity, storage medium and server
CN106776704A (en) Statistical information collection method and device
Zhang et al. PARROT: pattern-based correlation exploitation in big partitioned data series
CN112307133A (en) Security protection method and device, computer equipment and storage medium
CN113806190A (en) Method, device and system for predicting performance of database management system
Zhang et al. A learning-based framework for improving querying on web interfaces of curated knowledge bases
JP7442430B2 (en) Examination support system and examination support method
CN117785841A (en) Processing method and device for multi-source heterogeneous data
Du Research on cloud storage biological data deduplication method based on Simhash algorithm
CN117951118A (en) Geotechnical engineering investigation big data archiving method and system
CN117688124A (en) Data query index creation method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant