CN113760822A - HDFS-based distributed intelligent campus file management system optimization method and device - Google Patents


Info

Publication number
CN113760822A
Authority
CN
China
Prior art keywords
file, fname, next step, defining, small
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202110917880.9A
Other languages
Chinese (zh)
Inventor
朱全银
冯万利
周泓
李翔
刘斌
申奕
马思伟
吴斌
曹猛
朱良生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huaiyin Institute of Technology
Original Assignee
Huaiyin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huaiyin Institute of Technology filed Critical Huaiyin Institute of Technology
Priority to CN202110917880.9A priority Critical patent/CN113760822A/en
Publication of CN113760822A publication Critical patent/CN113760822A/en
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/11File system administration, e.g. details of archiving or snapshots
    • G06F16/122File system administration, e.g. details of archiving or snapshots using management policies
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/18File system types
    • G06F16/182Distributed file systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a device for constructing a distributed smart campus file management system based on HDFS (Hadoop Distributed File System). The method uses a small-file merging association algorithm: by analyzing users' operation log information, it calculates the association probability between different files according to a file association formula and merges the small files that satisfy the conditions. Using data encapsulation, it provides a Web service interface for the smart campus file management system, changing the traditional way colleges and universities store massive data. The invention addresses the ever-growing scale of small educational-resource files, whose storage in HDFS puts increasing load pressure on the NameNode and lowers file storage and access efficiency. It also supports research and development of the smart campus, reduces the memory space occupied by file metadata after large numbers of small files are stored, and improves the system's efficiency when accessing small files.

Description

HDFS-based distributed intelligent campus file management system optimization method and device
Technical Field
The invention belongs to the field of HDFS (Hadoop Distributed File System) storage and optimization, and in particular relates to an HDFS-based distributed smart campus file management system optimization method and device.
Background
In the current era of gradually developing educational informatization, the smart campus provides a comprehensive information service platform. As the number of teachers and students in universities grows, the amount of information grows with it. Building the smart campus is not only an application of current Internet-of-Things technology but also a service of the smart campus data center: it offers a more effective way to share college data, gives teaching management a basis for scientific decision-making, makes the work, study, and life of teachers and students more convenient, and improves colleges' big data management capability.
To process large amounts of data while ensuring high performance and reliability, a distributed file system is the best option. In contrast to a conventional file system, a distributed file system typically consists of several servers that provide storage services and are connected to clients over a network. This architecture offers two advantages over traditional file systems: scalability and parallelism. Since a single server has a limited number of hardware ports for storage devices, a distributed file system can manage thousands of servers on behalf of clients by connecting multiple servers to the network (scalability); and because thousands of servers are connected, many storage devices can be used to access data simultaneously (parallelism). Because a distributed file system uses far more storage devices than a conventional one, understanding its file storage principles and mechanisms is very important for improving file storage efficiency.
Hadoop is applied to big data. Hadoop is an open-source framework developed in Java that rapidly processes unstructured, heterogeneous, massive data from different sources using a parallel computing framework. Developed by Doug Cutting and Mike Cafarella in 2005, it is managed by the Apache Software Foundation. HDFS is characterized by high fault tolerance and is designed for deployment on inexpensive (low-cost) hardware; it provides high-throughput access to application data and suits applications with very large data sets.
The existing research bases of Zhu Quanyin et al. include: Classification and extraction algorithm of Web science and technology news [J]. Journal of Huaiyin Institute of Technology, 2015, 24(5): 18-24; Li Xiang, Zhu Quanyin. Collaborative filtering recommendation with co-clustering and shared scoring matrix [J]. Computer Science and Exploration, 2014, 8(6): 751-; Quanyin Zhu, Suqun Cao. A Novel Classifier-independent Feature Selection Algorithm for Imbalanced Datasets. 2009, p: 77-82; Quanyin Zhu, Yunyang Yan, Jin Ding, Jin Qian. The Case Study for Price Extracting of Mobile Phone Sell Online. 2011, p: 282-285; Quanyin Zhu, Suqun Cao, Pei Zhou, Yunyang Yan, Hong Zhou. Integrated Price Forecast based on Dichotomy Backfilling and Disturbance Factor Algorithm. International Review on Computers and Software, 2011, Vol. 6(6): 1089-; Li Xiang, Zhu Quanyin, Hu Ronglin, Zhou Hong. A cold-chain logistics stowage intelligent recommendation method based on spectral clustering. Chinese patent publication No. CN105654267A, 2016.06.08; Cao Suqun, Zhu Quanyin, Zuo Xiaoming, Gao Shangbing, et al. A feature selection method for pattern classification. Chinese patent publication No. CN103425994A, 2013.12.04; Liu Jinling, Feng Wanli, Zhang Yahong. Chinese text clustering method based on rescaling [J]. Computer Engineering and Applications, 2012, 48(21): 146-; Zhu Quanyin, Li Xiang, et al. A network behavior habit clustering method based on K-means and LDA bidirectional verification. Chinese patent publication No. CN106202480A, 2016.12.07.
A Distributed File System (DFS) takes data resources that could originally be stored only on a fixed local node, stores them separately, and connects the separate data storage devices through a computer network; alternatively, a number of different disk devices or disk paths are connected together to form a file system with a hierarchical structure that ensures data integrity [19]. At present, big data storage solutions require files to be stored separately: traditional single-point storage makes data maintenance difficult and storage capacity hard to expand, while a distributed file system can expand the number of storage devices at any time to achieve high scalability, which is why it can become the core of big data development.
HDFS stands for the Hadoop Distributed File System. A distributed file system, which is in principle a collection of independent computers presented to users as a single system, is one of the strategies for handling large amounts of data in real time; in a distributed file system, common files can be shared between nodes. HDFS is one of the three major components of Hadoop. It stores files across all storage nodes of a computer cluster and is suitable for deployment on inexpensive machines such as commodity servers and personal computers. HDFS provides an extensible, fault-tolerant, economical, and efficient storage mode for big data. Since data replication is supported, high data reliability is achieved; however, the replication strategy requires additional disk storage space.
Using clusters, Hadoop can process massive data efficiently and in a specialized way. It is a software platform for large-scale data storage, computation, analysis, and mining, has advantages such as low cost and high efficiency, and can reliably store and process PB-scale data.
The NameNode is the master server that manages the file system namespace and clients' access to files, such as opening, closing, and renaming files and directories. It manages the file directory, the mapping between files and blocks, and the mapping between blocks and DataNodes; it maintains the directory tree and handles user requests. It keeps two core data structures, FsImage and EditLog. FsImage maintains the metadata of the file system tree and of all files and folders in the tree; every create, delete, or rename operation on a file is recorded in the operation log file EditLog. Both are loaded into memory.
DataNodes (data nodes) manage the storage attached to the nodes they run on and handle read and write requests from file system clients. DataNodes also perform block creation and deletion. Each DataNode communicates with the NameNode continuously to report its state, which makes it convenient for the NameNode to manage and control the whole system.
Among the various forms of learning resources in smart campus construction, most occupy very little storage space compared to current mainstream storage devices, usually at the KB level, yet they account for more than 80% of the total number of files in the system. The scale of small learning-resource files in online learning keeps growing, and when they are stored in HDFS the load pressure on the NameNode grows with them, reducing file storage and access efficiency. In distributed smart campus file management, most current research only merges small files naively, increasing disk consumption: it fails to judge the association relationships among small files effectively, merges only small files with nearby physical addresses or of the same category, ignores potential association rules among small files, and cannot merge small files dynamically according to their association relationships.
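The NameNode pressure described above can be made concrete with a back-of-envelope estimate. The figure of roughly 150 bytes of NameNode heap per file object and per block object is a commonly cited Hadoop approximation, not a number taken from this patent:

```python
# Assumed figure: ~150 bytes of NameNode heap per file object and per block
# object, a widely used rule of thumb for sizing HDFS clusters.
BYTES_PER_OBJECT = 150

def namenode_heap_bytes(num_files: int, blocks_per_file: int = 1) -> int:
    """Approximate NameNode heap consumed by file + block metadata."""
    return num_files * BYTES_PER_OBJECT * (1 + blocks_per_file)

# Ten million KB-sized files already cost about 3 GB of NameNode heap:
# metadata cost scales with the number of files, not with data volume,
# which is why merging small files relieves the NameNode.
```

Merging many KB-scale files into block-sized containers shrinks both the file-object and block-object counts, which is the effect the method below aims at.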
Disclosure of Invention
The purpose of the invention is as follows: aiming at the problems in the background art, the invention provides an HDFS-based distributed smart campus file management system optimization method and device. It addresses the large memory footprint of massive small files under the HDFS distributed file system and the construction and optimization of the smart campus, helping colleges and universities save resource space in the file management system after massive small files are stored.
The technical scheme is as follows: the invention provides a distributed intelligent campus file management system optimization method based on an HDFS (Hadoop distributed File System), which specifically comprises the following steps:
(1) monitor the user's operations on files in the system in real time; when a file download by the user is detected, record the user's operation log and extract the user UID, operation time LogTime, file FID, Path, FileName, and file size to obtain the log file set LogList;
(2) perform data processing on the log file set LogList obtained in step (1): analyze the log information, filter out logs whose file size is larger than the threshold M to obtain the small-file log set SLog, and compute each small file's download count from SLog to obtain the small-file download-count set SFDcount;
(3) compute the set STDL of the number of times a user downloads FName_j within time T after downloading the small file FName_i; then, from the data in STDL and the file association formula, compute the probability of downloading FName_j within time T after downloading FName_i, yielding the small-file relevance set SRL;
(4) determine from SRL which small files to merge; after analyzing the calculation results, when the file association probability P(FName_j | FName_i) > 50% and the number of files meeting the condition is not less than 3, put the files into the initial merged file set BMF;
(5) after obtaining the initial merged file set BMF, calculate the size of each merged file, ensuring the data block size does not exceed the initially configured HDFS block size; files that have already been merged are not merged again, finally yielding the file merging set BMFL;
(6) using data encapsulation, after obtaining the user operation log LogList, compute according to the small-file association algorithm and return the resulting small-file merging result BMFL through the Web service interface to the file management system of the college smart campus.
Further, the step (1) includes the steps of:
(1.1) building a distributed cluster based on HDFS in a Linux system;
(1.2) packaging data according to an interface provided in a Hadoop class library;
(1.3) define a file list FileList = {FileList_1, FileList_2, …, FileList_FN}, where FileList_n is the nth element of the file list, FN is the total number of files, and n ∈ [1, FN];
(1.4) defining the user Id as UID, the file ID as FID default value as 1, the path of the file as FilePath, the file name as FileName and the file size as FileSize;
(1.5) defining a loop variable i1 for traversing all files in the system, i 1E [1, FN ], and the initial value of i1 is 1;
(1.6) when the condition FilePath != NULL is satisfied, the next step jumps to (1.7); otherwise, the next step jumps to (1.14);
(1.7) defining a counter Fcount, wherein the initial value is 0;
(1.8) let Fcount = Fcount + 1;
(1.9) assigning the user Id recorded in the browser Session to the UID;
(1.10) recording file information in the system, including UID (user identifier), FID (file identifier), file path, file name FName and file size FileSize of a user, in a file list FileList;
(1.11)FID=FID+1;
(1.12) i1 = i1 + 1; the next step jumps to (1.6);
(1.13) with FN = Fcount, obtain the file list information set FileList in the system;
(1.14) after the user logs in to the system, the Web server automatically processes the user's requests, generated through external devices such as the mouse and keyboard, producing request information that includes the access address URLPath, request mode RequestWay, request Parameters, operation time LogTime, operation content OperationContent, and file path FilePath;
(1.15) defining that the operation content of the user to the server in the file downloading operation is Download;
(1.16) the interceptor checks each intercepted user request and jumps to (1.17) when OperationContent is Download;
(1.17) define a log list LogList = {LogList_1, LogList_2, …, LogList_LFN}, where LogList_n is the nth element of the log list, LFN is the total number of log entries, and n ∈ [1, LFN];
(1.18) extracting a file name FName, an operation time FTime, a file size FileSize, a user UID and a file FID from the request information, and adding the information into a LogList;
(1.19) finally obtaining the log information LogList after the operation of the user.
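As a rough sketch, step (1)'s log collection reduces to a filter over Web-server request records. The `LogEntry` structure and the request field names below are illustrative assumptions of this sketch, not the patent's actual implementation:

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class LogEntry:
    uid: str        # user id (UID)
    fid: int        # file id (FID)
    fname: str      # file name (FName)
    ftime: float    # operation time (FTime / LogTime)
    filesize: int   # file size in bytes (FileSize)

def collect_download_logs(requests: List[Dict]) -> List[LogEntry]:
    """Mirror steps (1.16)-(1.18): keep only Download requests and
    extract the fields that get recorded in LogList."""
    loglist = []
    for req in requests:
        if req.get("OperationContent") == "Download":
            loglist.append(LogEntry(
                uid=req["UID"], fid=req["FID"], fname=req["FName"],
                ftime=req["LogTime"], filesize=req["FileSize"]))
    return loglist
```

In a real deployment the filtering would sit in the request interceptor of step (1.16); here it is shown as a post-hoc pass for clarity.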
Further, the step (2) comprises the steps of:
(2.1) define a small file list SList = {SList_1, SList_2, …, SList_SFN}, where SList_n is the nth element of the small file list, SFN is the total number of small files, and n ∈ [1, SFN];
(2.2) define a small-file log list SLog = {SLog_1, SLog_2, …, SLog_SLFN}, where SLog_n is the nth element, SLFN is the total number of small-file log entries, and n ∈ [1, SLFN];
(2.3) define a loop variable i2 for traversing FileList, checking each file's FileSize; i2 ∈ [1, FN], with initial value 1;
(2.4) when the condition i2 is not more than FN, jumping to (2.5) next step, otherwise jumping to (2.10) next step;
(2.5) if the FileSize of FileList_i2 < 10M, the next step jumps to (2.6); otherwise, the next step jumps to (2.9);
(2.6) defining the file Id of the small file as SFID and the default value as 1;
(2.7) adding the UID of the user, the SFID of the small file, the path FilePath of the file, the file name FName and the file size FileSize into a small file list SList;
(2.8)SFID=SFID+1;
(2.9) i2 = i2 + 1; the next step jumps to (2.4);
(2.10) defining a loop variable i3 for traversing the LogList, and traversing each file size FileSize in the LogList, wherein i3 belongs to [1, LFN ], and the initial value of i3 is 1;
(2.11) when the condition i3 is not more than LFN, jumping to (2.12) next step, otherwise jumping to (2.17) next step;
(2.12) if the FileSize of LogList_i3 < 10M, the next step jumps to (2.13); otherwise, the next step jumps to (2.16);
(2.13) defining the small file log Id as SLID and the initial value as 1;
(2.14) record the SLID, FName, FTime, and FileSize of LogList_i3 in the list SLog;
(2.15)SLID=SLID+1;
(2.16) i3 = i3 + 1; the next step jumps to (2.11);
(2.17) define the download count of a small file as count, and the set of small-file download counts as SFDcount = {[FName_1, count_1], [FName_2, count_2], …, [FName_SFDN, count_SFDN]}, where SFDN is the total number of entries, SFDcount_n is the nth element of the set, and n ∈ [1, SFDN];
(2.18) define a loop variable j1 for traversing SList, j1 ∈ [1, SFN], with initial value 1;
(2.19) when the condition j1 ≤ SFN is satisfied, the next step jumps to (2.20); otherwise, the next step jumps to (2.28);
(2.20) define a loop variable i4 for traversing SLog, i4 ∈ [1, SLFN], with initial value 1;
(2.21) when the condition i4 ≤ SLFN is satisfied, the next step jumps to (2.22); otherwise, the next step jumps to (2.19);
(2.22) define a counter SFCount and let SFCount = 0;
(2.23) when the FName of SLog_i4 equals the FName of SList_j1, the next step jumps to (2.24);
(2.24) let SFCount = SFCount + 1;
(2.25) i4 = i4 + 1; the next step jumps to (2.21);
(2.26) assign the value of SFCount to count, and record FName and count in the set SFDcount;
(2.27) j1 = j1 + 1; the next step jumps to (2.19);
(2.28) obtain the small-file download count set SFDcount.
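Under the stated 10 MB threshold, step (2) reduces to a size filter plus a frequency count. A minimal sketch, with log entries modeled as dicts (an assumption of this sketch):

```python
from collections import Counter
from typing import Dict, List

SMALL_FILE_LIMIT = 10 * 1024 * 1024  # threshold M = 10M, per steps (2.5)/(2.12)

def small_file_logs(loglist: List[Dict]) -> List[Dict]:
    """SLog: keep only log entries whose FileSize is below the threshold."""
    return [e for e in loglist if e["FileSize"] < SMALL_FILE_LIMIT]

def download_counts(slog: List[Dict]) -> Counter:
    """SFDcount: map each small file's FName to its download count."""
    return Counter(e["FName"] for e in slog)
```

`Counter` replaces the explicit nested loops of steps (2.18)-(2.27) with a single pass, but the result is the same FName-to-count mapping.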
Further, the step (3) includes the steps of:
(3.1) define a constant time T; the set describing downloads of other files within time T after a user downloads FName_i is STDL = {[FName_1, FName_2, Dcount_12], [FName_2, FName_3, Dcount_23], …, [FName_i, FName_j, Dcount_ij]}, where Dcount_ij denotes the number of times the user downloads FName_j within time T after downloading FName_i, Dcount_ij ∈ [0, +∞), with default value 0;
(3.2) define a small-file relevance set SRL = {[FName_1, FName_2, Relevance_12], [FName_2, FName_3, Relevance_23], …, [FName_i, FName_j, Relevance_ij]}, where Relevance_ij denotes the degree of association between FName_i and FName_j, Relevance_ij ∈ (0, +∞);
(3.3) defining a loop variable i5 for traversing SLog, i5 e [1, SLFN ], initial assignment of i5 is 1;
(3.4) when the condition i5 is not more than SLFN, jumping to (3.5) next step, otherwise jumping to (3.7) next step;
(3.5) if FName_j is downloaded within time T after the FTime of file FName_i, then let Dcount_ij = Dcount_ij + 1 and add [FName_i, FName_j, Dcount_ij] to STDL;
(3.6) i5 = i5 + 1; the next step jumps to (3.4);
(3.7) calculating the result of STDL;
(3.8) define the probability P_ij as the probability of downloading FName_j within time T after downloading FName_i;
(3.9) the formula for P is:

P_ij = Dcount_ij / count_i

where count_i is the download count of FName_i in the set SFDcount;
(3.10) define a loop variable i6 for traversing STDL, i6 ∈ [1, len(STDL)], with initial value 1;
(3.11) when the condition i6 ≤ len(STDL) is satisfied, the next step jumps to (3.12); otherwise, the next step jumps to (3.14);
(3.12) substitute the data in the list STDL into the formula in (3.9) to obtain the probability P_ij, and add [FName_i, FName_j, Relevance_ij] to the list SRL;
(3.13) i6 = i6 + 1; the next step jumps to (3.11);
(3.14) obtain the final file association calculation result SRL.
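The STDL/SRL computation of step (3) can be sketched as a time-window pass over per-user, time-ordered download logs. Since the source reproduces the probability formula only as an image, the estimator below (Dcount_ij divided by FName_i's download count) is an assumed reading of it:

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def co_download_counts(slog: List[Dict], t_window: float) -> Dict[Tuple[str, str], int]:
    """STDL: Dcount[(i, j)] = times FName_j was downloaded within t_window
    seconds after FName_i by the same user (step (3.5))."""
    entries = sorted(slog, key=lambda e: (e["UID"], e["FTime"]))
    dcount = defaultdict(int)
    for a in range(len(entries)):
        for b in range(a + 1, len(entries)):
            ei, ej = entries[a], entries[b]
            if ej["UID"] != ei["UID"] or ej["FTime"] - ei["FTime"] > t_window:
                break  # sorted order: no later entry can qualify either
            if ej["FName"] != ei["FName"]:
                dcount[(ei["FName"], ej["FName"])] += 1
    return dict(dcount)

def association_probabilities(dcount, sfdcount):
    """SRL: P_ij = Dcount_ij / count_i (assumed form of the formula in (3.9))."""
    return {(i, j): c / sfdcount[i] for (i, j), c in dcount.items()}
```

Sorting by (UID, FTime) lets the inner loop stop as soon as it leaves the user's window, keeping the pass close to linear for sparse logs.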
Further, the step (4) comprises the steps of:
(4.1) define a loop variable j2 for traversing the sorted relevance set, with initial assignment 1;
(4.2) sort the obtained SRL in descending order of Relevance to obtain a new set SRLS;
(4.3) when the condition j2 ≤ len(SRLS) is satisfied, the next step jumps to (4.4); otherwise, the next step jumps to (4.9);
(4.4) defining a counter C, making C ═ 0;
(4.5) define an initial merged file set BMF = {[BMF_1, C_1], [BMF_2, C_2], …, [BMF_BMFN, C_BMFN]}, where BMF_n is the nth element, BMFN is the total number of elements, and n ∈ [1, BMFN];
(4.6) if the Relevance_ij of FName_i and FName_j ≥ 50%, the next step jumps to (4.7);
(4.7) let C = C + 1, assign FName_i to BMF_i, and add C_i to the set BMF;
(4.8) j2 = j2 + 1; the next step jumps to (4.3);
(4.9) defining a loop variable i7 for traversing the BMF, the initial assignment of i7 being 1;
(4.10) when the condition i7 is not more than BMFN, jumping to (4.11) next step, otherwise jumping to (4.14) next step;
(4.11) if C_i < 3, the next step jumps to (4.12); otherwise, the next step jumps to (4.13);
(4.12) deleting the data in the current BMF set;
(4.13) i7 = i7 + 1; the next step jumps to (4.10);
(4.14) obtain the initial merged file set BMF.
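Step (4)'s selection of the initial set BMF amounts to thresholding the association probability and keeping only files with enough strong associates. The dict-of-lists representation below is an assumption of this sketch:

```python
from typing import Dict, List, Tuple

def initial_merge_set(srl: Dict[Tuple[str, str], float],
                      p_threshold: float = 0.5,
                      min_group: int = 3) -> Dict[str, List[str]]:
    """BMF: for each file FName_i, collect the FName_j whose association
    probability clears the threshold (step (4.6)); drop files with fewer
    than min_group such associates (steps (4.11)-(4.12))."""
    groups: Dict[str, List[str]] = {}
    for (i, j), p in srl.items():
        if p >= p_threshold:
            groups.setdefault(i, []).append(j)
    return {i: js for i, js in groups.items() if len(js) >= min_group}
```

The thresholds 50% and 3 mirror the values stated in the steps above, but both are ordinary parameters a deployment could tune.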
Further, the step (5) includes the steps of:
(5.1) define the file merge set BMFL = {BMFL_1, BMFL_2, …, BMFL_BMFLN}, where BMFL_n is the nth element, BMFLN is the total number of merge lists, and n ∈ [1, BMFLN];
(5.2) define a loop variable j3 for traversing SRLS, with initial value 1;
(5.3) when the condition j3 ≤ len(SRLS) is satisfied, the next step jumps to (5.4); otherwise, the next step jumps to (5.11);
(5.4) when a file in STDL also exists in BMF, the next step jumps to (5.5);
(5.5) count the files with Relevance_ij ≥ 50%;
(5.6) let the total file size be TFSize;
(5.7) TFSize = TFSize + FileSize;
(5.8) when TFSize < 128M, the next step jumps to (5.9);
(5.9) storing the small file information to be combined into the BMFL set;
(5.10) j3 ═ j3+1, the next step jumps to (5.3);
(5.11) obtain the final file merging result BMFL.
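Step (5) then becomes a greedy packing of each BMF group under the 128 MB HDFS block size, merging each file at most once. This is only a sketch; the packing order within a group is an assumption:

```python
from typing import Dict, List

HDFS_BLOCK = 128 * 1024 * 1024  # initial HDFS block size, per step (5.8)

def build_merge_lists(bmf: Dict[str, List[str]],
                      sizes: Dict[str, int],
                      block: int = HDFS_BLOCK) -> List[List[str]]:
    """BMFL: pack each association group into a merge list whose total size
    stays below the block size; already-merged files are skipped (step (5))."""
    merged: set = set()
    bmfl: List[List[str]] = []
    for anchor, associates in bmf.items():
        batch, total = [], 0
        for fname in [anchor] + associates:
            if fname in merged or total + sizes[fname] >= block:
                continue
            batch.append(fname)
            total += sizes[fname]
        if len(batch) > 1:          # a singleton is not a merge
            merged.update(batch)
            bmfl.append(batch)
    return bmfl
```

Keeping each merged container within a single HDFS block is what lets the NameNode track one block instead of one per small file.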
Further, the step (6) comprises the steps of:
(6.1) opening a Web service interface in a data packaging mode, and providing the Web service interface for a college intelligent campus file management system;
(6.2) creating a thread pool ThreadPool;
(6.3) judging whether all the child threads in the thread pool ThreadPool are ended, if so, entering (6.9), and if not, entering (6.4);
(6.4) the system starts to record the operation of the user to obtain a file operation log LogList;
(6.5) defining child thread ChildThread for processing and calculating LogList;
(6.6) defining a small file association degree calculation interface SRAPI, splitting an operation log LogList of a user into different batches, and performing parallel calculation;
(6.7) combining the calculation results of different batches to obtain a final file combination result BMFL;
(6.8) ending the child thread ChildThread, entering (6.3);
(6.9) closing the thread pool ThreadPool;
(6.10) return the calculated file merging result BMFL to the Web interface.
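Step (6)'s batched, thread-pooled computation can be sketched with Python's standard thread pool. The merge of per-batch results shown here (summing counts per key) is an assumption about what the partial results look like:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

def parallel_association(log_batches: List[list],
                         compute_batch: Callable[[list], Dict]) -> Dict:
    """Split the operation log into batches (step (6.6)), compute each batch
    in a child thread, and combine the partial results (step (6.7))."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        partials = list(pool.map(compute_batch, log_batches))
    merged: Dict = {}
    for part in partials:
        for key, value in part.items():
            merged[key] = merged.get(key, 0) + value
    return merged
```

The `with` block plays the role of steps (6.2)-(6.9): the pool is created, child tasks run to completion, and the pool is closed before the combined result is returned to the Web interface.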
Based on the same inventive concept, the invention also provides a distributed intelligent campus file management system optimization device based on the HDFS, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the computer program realizes the optimization method of the distributed intelligent campus file management system based on the HDFS when being loaded to the processor.
Beneficial effects: compared with the prior art, the invention uses a small-file merging association algorithm that analyzes users' operation log information, calculates the association probability between different files according to the file association formula, and merges the small files that satisfy the conditions. Using data encapsulation, the invention provides a Web service interface for the smart campus file management system, changes the traditional way colleges and universities store massive data, reduces the memory space occupied by file metadata after large numbers of small files are stored, and improves the system's efficiency when accessing small files.
Drawings
FIG. 1 is a flow chart of a method for optimizing a distributed intelligent campus file management system based on HDFS;
FIG. 2 is a flowchart of a log collection for obtaining a user downloaded file;
FIG. 3 is a flow chart of calculating the number of times a small file is downloaded;
FIG. 4 is a flow chart of calculating a small file association probability;
FIG. 5 is a flow chart of obtaining an initial merged file set;
FIG. 6 is a flow chart of obtaining a merged set of files BMFL;
fig. 7 is a flowchart illustrating interface call for providing Web services in a data encapsulation manner.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
A large number of variables are involved in the present embodiment, and each variable will now be described as shown in table 1.
Table 1 description of variables
(Table 1 is reproduced only as images in the original publication.)
The invention provides a distributed intelligent campus file management system optimization method based on small file association rules, which specifically comprises the following steps as shown in figure 1:
step 1: monitoring the operation of a user on a file in a system in real time, recording an operation log of the user when the downloading operation of the user on the file is detected, and extracting the UID, the operation time LogTime, the FID, the path FilePath, the FileName and the file size FileSize of the user to obtain a log set LogList; as shown in fig. 2, the method specifically includes the following steps:
step 1.1: building a distributed cluster based on HDFS in a Linux system;
step 1.2: packaging data according to an interface provided in a Hadoop class library;
step 1.3: define a file list FileList = {FileList_1, FileList_2, …, FileList_FN}, where FileList_n is the nth element of the file list, FN is the total number of files, and n ∈ [1, FN];
Step 1.4: defining the user Id as UID, the file ID as FID default value as 1, the path of the file as FilePath, the file name as FileName and the file size as FileSize;
step 1.5: defining a loop variable i1 for traversing all files in the system, i1 epsilon [1, FN ], and the initial assignment of i1 is 1;
step 1.6: when the condition FilePath != NULL is satisfied, the next step goes to step 1.7; otherwise, the next step goes to step 1.14;
step 1.7: defining a counter Fcount, wherein the initial value is 0;
step 1.8: let Fcount = Fcount + 1;
step 1.9: assigning the user Id recorded in the browser Session to the UID;
step 1.10: recording file information in a system, including UID (user identifier), FID (file identifier), file path, FName and file size FileSize of a user in a file list FileList;
step 1.11: FID + 1;
step 1.12: i1 = i1 + 1; the next step jumps to step 1.6;
step 1.13: with FN = Fcount, obtain the file list information set FileList in the system;
step 1.14: after logging in the system, according to the operation of external equipment such as a mouse, a keyboard and the like of a user, the Web server can automatically process the request of the user to generate request information, wherein the request information comprises an access address URLPath, a request mode RequestWay, request Parameters, operation time LogTime, operation content OperationContent and a file path FilePath;
step 1.15: defining the operation content transmitted to the server during a file download operation as Download;
step 1.16: the interceptor checks each intercepted user request and jumps to step 1.17 when OperationContent is Download;
step 1.17: defining a log list LogList, LogList = {LogList_1, LogList_2, …, LogList_LFN}, where LogList_n is the nth element of the log list, LFN is the total number of logs, and n ∈ [1, LFN];
Step 1.18: extracting a file name FName, an operation time FTime, a file size FileSize, a user UID and a file FID from the request information, and adding the information into a LogList;
step 1.19: and finally, obtaining log information LogList after the operation of a user.
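The request interception in steps 1.14 to 1.18 can be sketched as follows. This is a minimal Python illustration, not code from the patent; the LogEntry fields and request-dictionary keys are assumptions that mirror the names used in the steps (UID, FID, FName, LogTime, FileSize, OperationContent).

```python
from dataclasses import dataclass
from typing import List

# Hypothetical log record mirroring step 1.18: UID, FID, FName, FTime, FileSize.
@dataclass
class LogEntry:
    uid: int        # user id (UID)
    fid: int        # file id (FID)
    fname: str      # file name (FName)
    ftime: float    # operation time (FTime / LogTime)
    filesize: int   # file size in bytes (FileSize)

def record_download(loglist: List[LogEntry], request: dict) -> None:
    """Append a log record only when OperationContent is 'Download' (steps 1.15-1.18)."""
    if request.get("OperationContent") == "Download":
        loglist.append(LogEntry(
            uid=request["UID"], fid=request["FID"], fname=request["FName"],
            ftime=request["LogTime"], filesize=request["FileSize"]))

loglist: List[LogEntry] = []
record_download(loglist, {"OperationContent": "Download", "UID": 1, "FID": 7,
                          "FName": "a.pdf", "LogTime": 100.0, "FileSize": 2048})
record_download(loglist, {"OperationContent": "Upload", "UID": 1, "FID": 8,
                          "FName": "b.pdf", "LogTime": 101.0, "FileSize": 4096})
print(len(loglist))  # 1: only the Download request was logged
```

As in step 1.16, the interceptor check filters out all non-download operations before anything reaches LogList.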
step 2: performing data processing on the log set LogList obtained in step 1, analyzing each log, and filtering out logs whose file size is larger than M to obtain a small-file log set SLog; then calculating the download count of each small file from SLog to obtain a small-file download count set SFDcount; as shown in fig. 3, the method specifically includes the following steps:
step 2.1: defining a small-file list SList, SList = {SList_1, SList_2, …, SList_SFN}, where SList_n is the nth element of the small-file list, SFN is the total number of small files, and n ∈ [1, SFN];
step 2.2: defining a small-file log list SLog, SLog = {SLog_1, SLog_2, …, SLog_SLFN}, where SLog_n is the nth element of the small-file log list, SLFN is the total number of small-file logs, and n ∈ [1, SLFN];
step 2.3: defining a loop variable i2 for traversing the file list FileList and checking each FileSize, i2 ∈ [1, FN], with an initial value of 1;
step 2.4: when the condition i2 ≤ FN is satisfied, the next step goes to step 2.5; otherwise, the next step goes to step 2.10;
step 2.5: if the FileSize of FileList_i2 < 10M, the next step goes to step 2.6; otherwise, the next step goes to step 2.9;
step 2.6: defining the file Id of the small file as SFID, and setting the default value as 1;
step 2.7: adding the UID of the user, the SFID of the small file, the path FilePath of the file, the file name FName and the size FileSize of the file into a small file list SList;
step 2.8: let SFID = SFID + 1;
step 2.9: i2 = i2 + 1, and the next step jumps to step 2.4;
step 2.10: defining a loop variable i3 for traversing the log list LogList and checking each FileSize, i3 ∈ [1, LFN], with an initial value of 1;
step 2.11: when the condition i3 ≤ LFN is satisfied, the next step goes to step 2.12; otherwise, the next step goes to step 2.17;
step 2.12: if the FileSize of LogList_i3 < 10M, the next step goes to step 2.13; otherwise, the next step goes to step 2.16;
step 2.13: defining the small file log Id as SLID and an initial value as 1;
step 2.14: recording the SLID, FName, FTime and FileSize of the file LogList_i3 in the list SLog;
step 2.15: let SLID = SLID + 1;
step 2.16: i3 = i3 + 1, and the next step jumps to step 2.11;
step 2.17: defining the download count of a small file as count, and the small-file download count set as SFDcount = {[FName_1, count_1], [FName_2, count_2], …, [FName_SFDN, count_SFDN]}, where SFDN is the total number of elements in the set, SFDcount_n is the nth element, and n ∈ [1, SFDN];
step 2.18: defining a loop variable j1 for traversing SList, j1 ∈ [1, SFN], with an initial value of 1;
step 2.19: when the condition j1 ≤ SFN is satisfied, the next step goes to step 2.20; otherwise, the next step goes to step 2.28;
step 2.20: defining a loop variable i4 for traversing SLog, i4 ∈ [1, SLFN], with an initial value of 1;
step 2.21: when the condition i4 ≤ SLFN is satisfied, the next step goes to step 2.22; otherwise, the next step goes to step 2.19;
step 2.22: defining a counter SFCount, and making SFCount equal to 0;
step 2.23: when the FName in SLog_i4 equals the FName in SList_j1, the next step goes to step 2.24;
step 2.24: let SFCount = SFCount + 1;
step 2.25: i4 = i4 + 1, and the next step jumps to step 2.21;
step 2.26: assigning the value of SFCount to count, and recording FName and count in a set SFDcount;
step 2.27: j1 = j1 + 1, and the next step jumps to step 2.19;
step 2.28: and obtaining a small file download frequency set SFDcount.
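The filtering and counting of step 2 amount to one pass over LogList followed by a frequency count. A minimal Python sketch with illustrative data, using the 10M threshold from steps 2.5 and 2.12:

```python
from collections import Counter

# Assumed simplification: each LogList row is a (FName, FileSize-in-bytes) tuple.
SMALL = 10 * 1024 * 1024  # 10M threshold from steps 2.5 / 2.12

loglist = [("a.pdf", 2_048), ("b.iso", 4_000_000_000),
           ("a.pdf", 2_048), ("c.txt", 512)]

# SLog: only logs of files smaller than 10M survive the filter (steps 2.10-2.16).
slog = [(fname, size) for fname, size in loglist if size < SMALL]

# SFDcount: per-file download counts computed from SLog (steps 2.17-2.28).
sfdcount = Counter(fname for fname, _ in slog)
print(sfdcount)
```

Counter replaces the nested j1/i4 loops of steps 2.18 to 2.27 with a single linear pass, which is the same computation.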
step 3: calculating the set STDL of the numbers of times a user downloads FName_j within time T after downloading a small file FName_i; then calculating, from the data in STDL and the file association formula, the probability that the user downloads FName_j within time T after downloading FName_i, to obtain a small-file association degree set SRL; as shown in fig. 4, the method specifically includes the following steps:
step 3.1: defining a constant time T; the set of other files downloaded within time T after the user downloads FName_i is STDL = {[FName_1, FName_2, Dcount_12], [FName_2, FName_3, Dcount_23], …, [FName_i, FName_j, Dcount_ij]}, where Dcount_ij denotes the number of times the user downloads FName_j within time T after downloading FName_i, Dcount_ij ∈ [0, +∞), with a default value of 0;
step 3.2: defining a small-file association degree set SRL, SRL = {[FName_1, FName_2, Relevance_12], [FName_2, FName_3, Relevance_23], …, [FName_i, FName_j, Relevance_ij]}, where Relevance_ij denotes the degree of association between FName_i and FName_j, Relevance_ij ∈ (0, +∞);
Step 3.3: defining a loop variable i5 for traversing SLog, i5 e [1, SLFN ], the initial assignment of i5 is 1;
step 3.4: when the condition i5 is not more than SLFN, the next step goes to step 3.5, otherwise, the next step goes to step 3.7;
step 3.5: if FName_j is downloaded within time T of the FTime of file FName_i, then let Dcount_ij = Dcount_ij + 1 and add [FName_i, FName_j, Dcount_ij] to STDL;
step 3.6: i5 = i5 + 1, and the next step jumps to step 3.4;
step 3.7: the complete set STDL is obtained;
step 3.8: defining the probability P_ij as the probability of downloading FName_j within time T after downloading FName_i;
step 3.9: P_ij is calculated as P_ij = Dcount_ij / count_i, where count_i is the total download count of FName_i taken from SFDcount;
Step 3.10: defining a loop variable i6 for traversing STDL, i6 ∈ [1, len (STDL) ], i6 having an initial assignment of 1;
step 3.11: when the condition i6 is less than or equal to len (STDL), the next step jumps to the step 3.12, otherwise, the next step jumps to the step 3.14;
step 3.12: substituting the data in the list STDL into the formula in step 3.9 to obtain the probability P_ij, and adding [FName_i, FName_j, Relevance_ij] to the list SRL, where Relevance_ij = P_ij;
step 3.13: i6 = i6 + 1, and the next step jumps to step 3.11;
step 3.14: the final file association calculation result SRL is obtained.
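A sketch of the association computation in step 3, assuming the file association formula is the conditional frequency P_ij = Dcount_ij / count_i, with Dcount_ij taken from STDL and count_i from SFDcount; all data below is illustrative:

```python
# Dcount_ij: times FName_j was downloaded within T after FName_i (step 3.1).
stdl = {("a.pdf", "b.pdf"): 6, ("a.pdf", "c.pdf"): 2}

# count_i: total downloads of each file, from the step-2 set SFDcount.
sfdcount = {"a.pdf": 8, "b.pdf": 6, "c.pdf": 4}

# SRL: Relevance_ij = P_ij = Dcount_ij / count_i (steps 3.9-3.12).
srl = {(fi, fj): dcount / sfdcount[fi] for (fi, fj), dcount in stdl.items()}
print(srl[("a.pdf", "b.pdf")])  # 0.75
```

Under this reading, 6 of the 8 downloads of a.pdf were followed by b.pdf within T, so the pair crosses the 50% threshold used in step 4.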
step 4: determining which small files to merge according to SRL by analyzing the calculation result: a file is put into the initial merged-file set BMF when the file association probability P(FName_j | FName_i) > 50% and the number of files satisfying the condition is greater than 3; as shown in fig. 5, the method specifically includes the following steps:
step 4.1: defining a loop variable j2 for traversing STDL, the initial assignment of j2 is 1;
step 4.2: sorting the obtained set SRL in descending order of Relevance to obtain a new set SRLS;
step 4.3: when the condition j2 is less than or equal to len (SRLS), the next step jumps to the step 4.4, otherwise, the next step jumps to the step 4.10;
step 4.4: defining a counter C, and letting C = 0;
step 4.5: defining an initial merged-file set BMF, BMF = {[BMF_1, C_1], [BMF_2, C_2], …, [BMF_BMFN, C_BMFN]}, where BMF_n is the nth element, BMFN is the total number of elements, and n ∈ [1, BMFN];
step 4.6: if the Relevance_ij of FName_i and FName_j ≥ 50%;
step 4.7: then let C = C + 1, assign FName_i to BMF_i, and add [BMF_i, C_i] to the set BMF;
step 4.8: j2 = j2 + 1, and the next step jumps to step 4.3;
step 4.9: defining a loop variable i7 for traversing the BMF, the initial assignment of i7 being 1;
step 4.10: when the condition i7 is not more than the BMFN, the next step jumps to the step 4.11, otherwise, the next step jumps to the step 4.14;
step 4.11: if C_i < 3, the next step jumps to step 4.12; otherwise, the next step jumps to step 4.13;
step 4.12: deleting the current data from the BMF set;
step 4.13: i7 = i7 + 1, and the next step jumps to step 4.10;
step 4.14: and obtaining an initial file BMF set.
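Step 4 reduces to thresholding the association degree and then discarding files with too few qualifying partners. A minimal Python sketch with illustrative data; following step 4.11, candidates whose counter is below 3 are deleted:

```python
# Illustrative SRL: (FName_i, FName_j) -> Relevance_ij.
srl = {("a", "b"): 0.75, ("a", "c"): 0.6, ("a", "d"): 0.55,
       ("a", "e"): 0.2, ("x", "y"): 0.9}

counts = {}
# Step 4.2: visit pairs in descending order of Relevance.
for (fi, fj), rel in sorted(srl.items(), key=lambda kv: -kv[1]):
    if rel >= 0.5:                       # step 4.6: Relevance_ij >= 50%
        counts[fi] = counts.get(fi, 0) + 1

# Steps 4.10-4.12: drop candidates with fewer than 3 qualifying partners.
bmf = {f: c for f, c in counts.items() if c >= 3}
print(bmf)  # {'a': 3}
```

Here file "a" is strongly associated with three other files and enters BMF, while "x" has only one qualifying partner and is dropped.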
step 5: to avoid file redundancy, after the initial merged-file set BMF is obtained, the size of each merged file is calculated to ensure that a data block does not exceed the initially set HDFS block size; already-merged files are not merged again, and the file merge set BMFL is finally obtained; as shown in fig. 6, the method specifically includes the following steps:
step 5.1: defining a file merge set BMFL, BMFL = {BMFL_1, BMFL_2, …, BMFL_BMFLN}, where BMFL_n is the nth element, BMFLN is the total number of elements, and n ∈ [1, BMFLN];
Step 5.2: defining a loop variable j3 for traversing STDL, the initial assignment of j3 is 1;
step 5.3: when the condition j3 ≤ len(SRLS) is satisfied, the next step goes to step 5.4; otherwise, the next step jumps to step 5.11;
step 5.4: when a file in STDL also exists in BMF;
step 5.5: counting the files with Relevance_ij ≥ 50%;
step 5.6: let the total file size be TFSize;
step 5.7: TFSize = TFSize + FileSize;
step 5.8: when TFSize < 128M;
step 5.9: storing the small file information to be combined into a BMFL set;
step 5.10: j3 = j3 + 1, and the next step jumps to step 5.3;
step 5.11: obtaining a final file merging result BMFL;
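Step 5 is a greedy size check against the 128M HDFS block size of step 5.8. A minimal Python sketch with illustrative file sizes; the input is assumed to be already sorted by relevance:

```python
BLOCK = 128 * 1024 * 1024  # default HDFS block size used in step 5.8

def pack(files):
    """Greedily accumulate files into one merged group while TFSize stays under 128M."""
    bmfl, tfsize = [], 0                  # steps 5.6-5.7: running total TFSize
    for fname, size in files:
        if tfsize + size < BLOCK:         # step 5.8: TFSize < 128M
            bmfl.append(fname)            # step 5.9: keep for merging
            tfsize += size
    return bmfl, tfsize

merged, total = pack([("a", 60 * 2**20), ("b", 60 * 2**20), ("c", 60 * 2**20)])
print(merged)  # ['a', 'b'] -- adding 'c' would exceed the block size
```

Keeping each merged group under one block avoids spilling a merged file across HDFS blocks, which would defeat the purpose of merging.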
step 6: after the user operation log LogList is obtained by data encapsulation, computing according to the small-file association algorithm, and returning the obtained small-file merge result BMFL to the Web service interface for use by the college intelligent campus file management system; as shown in fig. 7, the method specifically includes the following steps:
step 6.1: a data packaging mode is adopted, a Web service interface is opened, and the Web service interface is provided for a college intelligent campus file management system;
step 6.2: creating a thread pool ThreadPool;
step 6.3: judging whether all the sub-threads in the thread pool ThreadPool are finished, if so, entering a step 6.9, otherwise, entering a step 6.4;
step 6.4: the system starts to record the operation of a user to obtain a file operation log LogList;
step 6.5: defining child thread ChildThread for processing and calculating LogList;
step 6.6: defining a small file association degree computing interface SRAPI, splitting an operation log LogList of a user into different batches, and performing parallel computing;
step 6.7: combining the calculation results of different batches to obtain a final file combination result BMFL;
step 6.8: ending the child thread ChildThread and entering step 6.3;
step 6.9: closing the thread pool ThreadPool;
step 6.10: and returning the calculated file merging result BMFL to the Web interface.
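The thread-pool batching of step 6 can be sketched with Python's concurrent.futures; the per-batch computation below is a stand-in for the small-file association algorithm, and all names and data are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def process_batch(batch):
    # Stand-in for the SRAPI computation of step 6.6: here, just pick small files.
    return {fname for fname, size in batch if size < 10 * 2**20}

loglist = [("a", 1024), ("b", 20 * 2**20), ("c", 2048), ("d", 512)]

# Step 6.6: split the operation log into batches for parallel computation.
batches = [loglist[i:i + 2] for i in range(0, len(loglist), 2)]

with ThreadPoolExecutor(max_workers=2) as pool:       # steps 6.2 / 6.9
    results = list(pool.map(process_batch, batches))  # child threads, step 6.5

# Step 6.7: combine the per-batch results into the final result.
bmfl = set().union(*results)
print(sorted(bmfl))  # ['a', 'c', 'd']
```

The `with` block plays the role of steps 6.3 and 6.9: it waits for all child threads to finish and then shuts the pool down.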
Based on the same inventive concept, the invention further provides an HDFS-based distributed intelligent campus file management system optimization device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the computer program, when loaded into the processor, implements the above HDFS-based distributed intelligent campus file management system optimization method.
To illustrate the effectiveness of the method, a large number of randomly generated small files were uploaded to the file system, and random access logs between the files were generated; after the small files were merged, the occupied system memory was reduced by 89% compared with before merging, effectively reducing the waste of memory resources. The invention can be combined with a computer system to complete the construction of a distributed file management system in an intelligent campus.
The distributed intelligent campus file management system optimization method based on small-file association rules can be used for small-file storage optimization of distributed file systems in colleges and universities, and also for small-file storage optimization in other application systems.
The above description is only an example of the present invention and is not intended to limit the present invention. All equivalents which come within the spirit of the invention are therefore intended to be embraced therein. Details not described herein are well within the skill of those in the art.

Claims (8)

1. A distributed intelligent campus file management system optimization method based on an HDFS is characterized by comprising the following steps:
(1) monitoring user operations on files in the system in real time; when a download of a file by a user is detected, recording the user's operation log and extracting the user ID UID, the operation time LogTime, the file ID FID, the file path FilePath, the file name FName and the file size FileSize to obtain a log set LogList;
(2) performing data processing on the log set LogList obtained in step (1), analyzing each log, and filtering out logs whose file size is larger than M to obtain a small-file log set SLog; then calculating the download count of each small file from SLog to obtain a small-file download count set SFDcount;
(3) calculating the set STDL of the numbers of times a user downloads FName_j within time T after downloading a small file FName_i; calculating, from the data in STDL and the file association formula, the probability that the user downloads FName_j within time T after downloading FName_i, to obtain a small-file association degree set SRL;
(4) determining which small files to merge according to SRL by analyzing the calculation result: putting a file into the initial merged-file set BMF when the file association probability P(FName_j | FName_i) > 50% and the number of files satisfying the condition is greater than 3;
(5) after obtaining an initial merged file set BMF, calculating the size of the merged file, and ensuring that the size of the data block does not exceed the size of an initially set HDFS block; the merged files are not merged any more, and a file merging set BMFL is finally obtained;
(6) and after the log information LogList of the user operation is obtained by adopting a data packaging mode, calculation is carried out according to a small file association algorithm, and the obtained small file merging result BMFL is returned to the Web service interface and is provided for the file management system of the intelligent campus of colleges and universities.
2. The HDFS-based distributed intelligent campus file management system optimization method according to claim 1, wherein said step (1) comprises the steps of:
(1.1) building a distributed cluster based on HDFS in a Linux system;
(1.2) packaging data according to an interface provided in a Hadoop class library;
(1.3) defining a file list FileList, FileList = {FileList_1, FileList_2, …, FileList_FN}, where FileList_n is the nth element of the file list, FN is the total number of files, and n ∈ [1, FN];
(1.4) defining the user ID as UID, the file ID as FID with a default value of 1, the file path as FilePath, the file name as FName and the file size as FileSize;
(1.5) defining a loop variable i1 for traversing all files in the system, i1 ∈ [1, FN], with an initial value of 1;
(1.6) when the condition FilePath != NULL is satisfied, the next step jumps to (1.7); otherwise, the next step jumps to (1.14);
(1.7) defining a counter Fcount, wherein the initial value is 0;
(1.8) let Fcount = Fcount + 1;
(1.9) assigning the user Id recorded in the browser Session to the UID;
(1.10) recording the file information in the system, including the user ID UID, the file ID FID, the file path FilePath, the file name FName and the file size FileSize, in the file list FileList;
(1.11) let FID = FID + 1;
(1.12) i1 = i1 + 1, the next step jumps to (1.6);
(1.13) when FN = Fcount, the file list information set FileList of the system is obtained;
(1.14) after the user logs in to the system, the Web server automatically processes the user's requests generated through external devices such as the mouse and keyboard; the request information includes the access address URLPath, the request mode RequestWay, the request Parameters, the operation time LogTime, the operation content OperationContent and the file path FilePath;
(1.15) defining that the operation content of the user to the server in the file downloading operation is Download;
(1.16) the interceptor checks each intercepted user request and jumps to (1.17) when OperationContent is Download;
(1.17) defining a log list LogList, LogList = {LogList_1, LogList_2, …, LogList_LFN}, where LogList_n is the nth element of the log list, LFN is the total number of logs, and n ∈ [1, LFN];
(1.18) extracting a file name FName, an operation time FTime, a file size FileSize, a user UID and a file FID from the request information, and adding the information into a LogList;
(1.19) finally obtaining the log information LogList after the operation of the user.
3. The HDFS-based distributed intelligent campus file management system optimization method according to claim 1, wherein said step (2) comprises the steps of:
(2.1) defining a small-file list SList, SList = {SList_1, SList_2, …, SList_SFN}, where SList_n is the nth element of the small-file list, SFN is the total number of small files, and n ∈ [1, SFN];
(2.2) defining a small-file log list SLog, SLog = {SLog_1, SLog_2, …, SLog_SLFN}, where SLog_n is the nth element of the small-file log list, SLFN is the total number of small-file logs, and n ∈ [1, SLFN];
(2.3) defining a loop variable i2 for traversing the file list FileList and checking each FileSize, i2 ∈ [1, FN], with an initial value of 1;
(2.4) when the condition i2 is not more than FN, jumping to (2.5) next step, otherwise jumping to (2.10) next step;
(2.5) if the FileSize of FileList_i2 < 10M, the next step jumps to (2.6); otherwise, the next step jumps to (2.9);
(2.6) defining the file Id of the small file as SFID and the default value as 1;
(2.7) adding the UID of the user, the SFID of the small file, the path FilePath of the file, the file name FName and the file size FileSize into a small file list SList;
(2.8)SFID=SFID+1;
(2.9) i2 = i2 + 1, the next step jumps to (2.4);
(2.10) defining a loop variable i3 for traversing the log list LogList and checking each FileSize, i3 ∈ [1, LFN], with an initial value of 1;
(2.11) when the condition i3 ≤ LFN is satisfied, the next step jumps to (2.12); otherwise, the next step jumps to (2.17);
(2.12) if the FileSize of LogList_i3 < 10M, the next step jumps to (2.13); otherwise, the next step jumps to (2.16);
(2.13) defining the small file log Id as SLID and the initial value as 1;
(2.14) recording the SLID, FName, FTime and FileSize of the file LogList_i3 in the list SLog;
(2.15)SLID=SLID+1;
(2.16) i3 = i3 + 1, the next step jumps to (2.11);
(2.17) defining the download count of a small file as count, and the small-file download count set as SFDcount = {[FName_1, count_1], [FName_2, count_2], …, [FName_SFDN, count_SFDN]}, where SFDN is the total number of elements in the set, SFDcount_n is the nth element, and n ∈ [1, SFDN];
(2.18) defining a loop variable j1 for traversing SList, j1 ∈ [1, SFN], with an initial value of 1;
(2.19) when the condition j1 ≤ SFN is satisfied, the next step jumps to (2.20); otherwise, the next step jumps to (2.28);
(2.20) defining a loop variable i4 for traversing SLog, i4 ∈ [1, SLFN], with an initial value of 1;
(2.21) when the condition i4 ≤ SLFN is satisfied, the next step jumps to (2.22); otherwise, the next step jumps to (2.19);
(2.22) defining a counter SFCount, and making SFCount equal to 0;
(2.23) when the FName in SLog_i4 equals the FName in SList_j1, the next step jumps to (2.24);
(2.24) let SFCount = SFCount + 1;
(2.25) i4 = i4 + 1, the next step jumps to (2.21);
(2.26) assigning the value of SFCount to count, and recording [FName, count] in the set SFDcount;
(2.27) j1 = j1 + 1, the next step jumps to (2.19);
and (2.28) obtaining a small file download number set SFDcount.
4. The HDFS-based distributed intelligent campus file management system optimization method according to claim 1, wherein said step (3) comprises the steps of:
(3.1) defining a constant time T; the set of other files downloaded within time T after the user downloads FName_i is STDL = {[FName_1, FName_2, Dcount_12], [FName_2, FName_3, Dcount_23], …, [FName_i, FName_j, Dcount_ij]}, where Dcount_ij denotes the number of times the user downloads FName_j within time T after downloading FName_i, Dcount_ij ∈ [0, +∞), with a default value of 0;
(3.2) defining a small-file association degree set SRL, SRL = {[FName_1, FName_2, Relevance_12], [FName_2, FName_3, Relevance_23], …, [FName_i, FName_j, Relevance_ij]}, where Relevance_ij denotes the degree of association between FName_i and FName_j, Relevance_ij ∈ (0, +∞);
(3.3) defining a loop variable i5 for traversing SLog, i5 e [1, SLFN ], initial assignment of i5 is 1;
(3.4) when the condition i5 is not more than SLFN, jumping to (3.5) next step, otherwise jumping to (3.7) next step;
(3.5) if FName_j is downloaded within time T of the FTime of file FName_i, then let Dcount_ij = Dcount_ij + 1 and add [FName_i, FName_j, Dcount_ij] to STDL;
(3.6) i5 = i5 + 1, the next step jumps to (3.4);
(3.7) the complete set STDL is obtained;
(3.8) defining the probability P_ij as the probability of downloading FName_j within time T after downloading FName_i;
(3.9) the formula for P_ij is: P_ij = Dcount_ij / count_i, where count_i is the total download count of FName_i taken from SFDcount;
(3.10) defining a loop variable i6 for traversing STDL, i6 e [1, len (STDL) ], i6 having an initial assignment of 1;
(3.11) when the condition i6 ≤ len (STDL) is met, jumping to (3.12) next step, otherwise, jumping to (3.14) next step;
(3.12) according to the formula in (3.9), substituting the data in the list STDL into the calculation to obtain the probability P_ij, and adding [FName_i, FName_j, Relevance_ij] to the list SRL, where Relevance_ij = P_ij;
(3.13) i6 = i6 + 1, the next step jumps to (3.11);
and (3.14) obtaining the final file association calculation result SRL.
5. The HDFS-based distributed intelligent campus file management system optimization method according to claim 1, wherein said step (4) comprises the steps of:
(4.1) defining a loop variable j2 for traversing STDL, with an initial value of 1;
(4.2) sorting the obtained set SRL in descending order of Relevance to obtain a new set SRLS;
(4.3) when the condition j2 ≤ len(SRLS) is satisfied, the next step jumps to (4.4); otherwise, the next step jumps to (4.10);
(4.4) defining a counter C, and letting C = 0;
(4.5) defining an initial merged-file set BMF, BMF = {[BMF_1, C_1], [BMF_2, C_2], …, [BMF_BMFN, C_BMFN]}, where BMF_n is the nth element, BMFN is the total number of elements, and n ∈ [1, BMFN];
(4.6) if the Relevance_ij of FName_i and FName_j ≥ 50%;
(4.7) then let C = C + 1, assign FName_i to BMF_i, and add [BMF_i, C_i] to the set BMF;
(4.8) j2 = j2 + 1, the next step jumps to (4.3);
(4.9) defining a loop variable i7 for traversing the BMF, the initial assignment of i7 being 1;
(4.10) when the condition i7 ≤ BMFN is satisfied, the next step jumps to (4.11); otherwise, the next step jumps to (4.14);
(4.11) if C_i < 3, the next step jumps to (4.12); otherwise, the next step jumps to (4.13);
(4.12) deleting the current data from the BMF set;
(4.13) i7 = i7 + 1, the next step jumps to (4.10);
and (4.14) obtaining an initial file BMF set.
6. The HDFS-based distributed intelligent campus file management system optimization method according to claim 1, wherein said step (5) comprises the steps of:
(5.1) defining a file merge set BMFL, BMFL = {BMFL_1, BMFL_2, …, BMFL_BMFLN}, where BMFL_n is the nth element, BMFLN is the total number of elements, and n ∈ [1, BMFLN];
(5.2) defining a loop variable j3 for traversing STDL, with an initial value of 1;
(5.3) when the condition j3 ≤ len(SRLS) is satisfied, the next step goes to (5.4); otherwise, the next step jumps to (5.11);
(5.4) when a file in STDL also exists in BMF;
(5.5) counting the files with Relevance_ij ≥ 50%;
(5.6) making the total file size TFSize;
(5.7)TFSize=TFSize+FileSize;
(5.8) when TFSize < 128M;
(5.9) storing the small file information to be combined into the BMFL set;
(5.10) j3 = j3 + 1, the next step jumps to (5.3);
and (5.11) obtaining a final file combination result BMFL.
7. The HDFS-based distributed intelligent campus file management system optimization method according to claim 1, wherein said step (6) comprises the steps of:
(6.1) opening a Web service interface in a data packaging mode, and providing the Web service interface for a college intelligent campus file management system;
(6.2) creating a thread pool ThreadPool;
(6.3) judging whether all the child threads in the thread pool ThreadPool are ended, if so, entering (6.9), and if not, entering (6.4);
(6.4) the system starts to record the operation of the user to obtain a file operation log LogList;
(6.5) defining child thread ChildThread for processing and calculating LogList;
(6.6) defining a small file association degree calculation interface SRAPI, splitting an operation log LogList of a user into different batches, and performing parallel calculation;
(6.7) combining the calculation results of different batches to obtain a final file combination result BMFL;
(6.8) ending the child thread ChildThread, entering (6.3);
(6.9) closing the thread pool ThreadPool;
and (6.10) returning the calculated file merging result BMFL to the Web interface.
8. An HDFS-based distributed intelligent campus file management system optimization apparatus comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the computer program when loaded into the processor implements the HDFS-based distributed intelligent campus file management system optimization method according to any one of claims 1 to 7.
CN202110917880.9A 2021-08-11 2021-08-11 HDFS-based distributed intelligent campus file management system optimization method and device Withdrawn CN113760822A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110917880.9A CN113760822A (en) 2021-08-11 2021-08-11 HDFS-based distributed intelligent campus file management system optimization method and device


Publications (1)

Publication Number Publication Date
CN113760822A true CN113760822A (en) 2021-12-07

Family

ID=78788942

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110917880.9A Withdrawn CN113760822A (en) 2021-08-11 2021-08-11 HDFS-based distributed intelligent campus file management system optimization method and device

Country Status (1)

Country Link
CN (1) CN113760822A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117519608A (en) * 2023-12-27 2024-02-06 泰安北航科技园信息科技有限公司 Big data server with Hadoop as core
CN117519608B (en) * 2023-12-27 2024-03-22 泰安北航科技园信息科技有限公司 Big data server with Hadoop as core


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20211207