CN110018997A - A massive small file storage optimization method based on HDFS - Google Patents

A massive small file storage optimization method based on HDFS Download PDF

Info

Publication number
CN110018997A
CN110018997A (application CN201910175055.9A)
Authority
CN
China
Prior art keywords
small file
file
correlation
access
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910175055.9A
Other languages
Chinese (zh)
Other versions
CN110018997B (en)
Inventor
王健
韩永鹏
崔运鹏
刘娟
胡林
苏航
梁毅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Agricultural Information Institute of CAAS
Original Assignee
Beijing University of Technology
Agricultural Information Institute of CAAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology, Agricultural Information Institute of CAAS filed Critical Beijing University of Technology
Priority to CN201910175055.9A priority Critical patent/CN110018997B/en
Publication of CN110018997A publication Critical patent/CN110018997A/en
Application granted granted Critical
Publication of CN110018997B publication Critical patent/CN110018997B/en
Expired - Fee Related

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A massive small file storage optimization method based on HDFS belongs to the field of storage performance optimization and comprises five steps: initialization, file access analysis, small file classification and staging, small file merging and storage, and looping back. The method analyzes the historical access logs of files, computes the degree of correlation between files, and forms a file correlation mapping set. Based on this mapping set, small files are classified and staged so that highly correlated small files are staged together, while the size distribution of the small files is also taken into account. Finally, the staged small files are merged and stored: the original small files and their replicas are deleted, and the merged large file is written to HDFS. By re-storing the massive small files originally kept in HDFS in merged form, the method fully considers both the correlation among small files and their size distribution, significantly reduces the memory overhead of the NameNode, and improves the access efficiency of HDFS for small files.

Description

A massive small file storage optimization method based on HDFS
Technical field
The invention belongs to the field of storage performance optimization for distributed file systems, and in particular relates to a massive small file storage optimization method based on HDFS.
Background technique
Hadoop is a distributed computing framework developed by Apache and is an open-source implementation of Google's cloud computing cluster design. HDFS (Hadoop Distributed File System) is one of the core modules of Hadoop and provides the underlying mass file storage for the whole framework. HDFS uses a master-slave architecture: the NameNode, as the master node, manages the metadata of the entire distributed file system, while the DataNodes, as slave nodes, store the data locally. The HDFS storage design favors efficient storage of large files, but when handling massive numbers of small files, the storage and access efficiency of HDFS drops sharply. Massive small files cause excessive metadata to occupy the NameNode's memory, so the system's file storage capacity hits a bottleneck. In addition, the access-time overhead of massive small files increases sharply, reducing the file access efficiency of the whole system.
The current solution to the HDFS small file storage problem is to merge small files before storing them. Existing merging schemes merge according to the degree of correlation between small files: following rules established for a particular task scenario, highly correlated small files are merged into a single file, reducing the number of small files. However, existing correlation-based merging schemes depend on a specific task scenario and lack a general correlation computation method, and they do not take the size distribution of small files into account, so the merged result often fails to make full use of storage space. The present method comprehensively considers both the correlation between small files and their size distribution during merging. During correlation computation, it counts the simultaneous access frequency and the correlated access frequency of files under different time thresholds and quantifies the file correlation through a reasonable weighting scheme, so the method does not depend on a specific task scenario and has good generality. In addition, the method takes the file size distribution into account, makes better use of storage space, and lays the foundation for efficient access to the small files after merged storage.
Summary of the invention
Aiming at the low efficiency of massive small file storage in the HDFS distributed file system described above, the method of the present invention statistically analyzes the historical access logs of a large number of files, computes the degree of correlation between files, quantifies this correlation, obtains the correlation mapping set between files, classifies and stages the small files according to the mapping set, and finally merges the small files and stores them in HDFS. The method fully considers the correlation between small files and their size distribution, improves the storage performance of HDFS for massive small files, and to a certain extent also benefits small file access performance.
The massive small file storage optimization method of the present invention is divided into five steps: initialization, file access analysis, small file classification and staging, small file merging and storage, and looping back. The method defines the following basic parameters: the time-interval threshold T_c for simultaneous access of small files, the time-interval threshold T_s for correlated access of small files, the upper capacity limit V_max of a staging set, the minimum capacity utilization R_min required before a staging set is merged, the minimum correlation threshold θ_min of small files, the weight W_r of the correlated access probability in the correlation degree, the weight W_c of the simultaneous access probability in the correlation degree, and the merge execution period T_cycle. The parameter ranges are as follows: T_c takes values in (0s, 1s], T_s in (1s, 5s], V_max in [64MB, 1024MB], R_min in [0.8, 1], and θ_min in [0.3, 1]; W_r and W_c satisfy W_r + W_c = 1, with W_r in [0.6, 0.9] and W_c in [0.1, 0.4], and T_cycle is at least 3600s.
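For illustration only, the parameter set above can be captured in a small configuration object. The sketch below is a minimal, non-normative Python rendering of the parameters and their allowed ranges; the class name MergeConfig and the validation helper are illustrative assumptions and not part of the patented method.

```python
from dataclasses import dataclass

@dataclass
class MergeConfig:
    """Basic parameters of the small-file merge method (illustrative sketch)."""
    t_c: float = 1.0           # T_c: simultaneous-access interval threshold, (0s, 1s]
    t_s: float = 2.0           # T_s: correlated-access interval threshold, (1s, 5s]
    v_max: int = 64 * 2**20    # V_max: staging-set capacity limit, [64MB, 1024MB]
    r_min: float = 0.9         # R_min: minimum capacity utilization before merging, [0.8, 1]
    theta_min: float = 0.5     # θ_min: minimum correlation threshold, [0.3, 1]
    w_r: float = 0.6           # W_r: weight of correlated-access probability, [0.6, 0.9]
    w_c: float = 0.4           # W_c: weight of simultaneous-access probability, [0.1, 0.4]
    t_cycle: int = 86400       # T_cycle: merge execution period, >= 3600s

    def validate(self) -> None:
        # Range checks taken directly from the parameter definitions above.
        assert 0 < self.t_c <= 1 and 1 < self.t_s <= 5
        assert 64 * 2**20 <= self.v_max <= 1024 * 2**20
        assert 0.8 <= self.r_min <= 1 and 0.3 <= self.theta_min <= 1
        assert abs(self.w_r + self.w_c - 1.0) < 1e-9
        assert self.t_cycle >= 3600
```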
The above method is realized on a computer according to the following steps:
(1) Initialization
1.1) Let the set of small files stored in HDFS be F, F = {f_i | 1 ≤ i ≤ M}, where f_i denotes any small file and M is the total number of small files.
1.2) Let the set of small file access logs be L, L = {l_i | 1 ≤ i ≤ N}. Any l_i ∈ L is represented as a triple (f_i, t_i, s_i), where f_i is the small file recorded by the log, t_i is its access time, and s_i is its size.
1.3) Build the individual access frequency set C, C = {(f_i, c_i) | 1 ≤ i ≤ M}, where c_i is the access frequency of small file f_i; C is initialized to the empty set. Similarly, build the co-access frequency set I, I = {(f_i, f_j, u_ij, v_ij) | 1 ≤ i < M, i < j ≤ M}, where u_ij is the frequency with which small files f_i and f_j are accessed within the time interval T_s and v_ij is the frequency with which they are accessed within the time interval T_c; I is initialized to the empty set.
1.4) Build the small file correlation mapping set S, S = {s_ij | 1 ≤ i < M, i < j ≤ M}. Any s_ij ∈ S is represented as a triple (f_i, f_j, w_ij), where f_i and f_j are the two distinct small files in the relation and w_ij is their degree of correlation; S is initialized to the empty set.
1.5) Build the universal staging set Q, Q = {q_i | 1 ≤ i ≤ K}, where K is the number of staging sets in Q. For any q_i ∈ Q, q_i = {f_j | 1 ≤ j ≤ n_i}, where n_i is the number of small files staged in q_i; Q is initialized as a collection of K empty sets.
1.6) Build the universal candidate set H, H = {h_i | 1 ≤ i ≤ K'}, where K' is the number of candidate sets in H. For any h_i ∈ H, h_i = {f_j | 1 ≤ j ≤ n_i}, where n_i is the number of small files in h_i; H is initialized to the empty set.
(2) File access analysis
2.1) Traverse each small file f_i (1 ≤ i ≤ M) in the small file set F and compute the individual access frequency of each f_i as follows:
2.1.1) Define c_i as the access frequency of the current small file f_i and initialize c_i ← 0.
2.1.2) Traverse each log l_k (1 ≤ k ≤ N) in the log set L, where l_k is represented as a triple (f_k, t_k, s_k). If f_i = f_k, then c_i ← c_i + 1.
2.1.3) Add the small file f_i and its individual access frequency c_i to the individual access frequency set C as the pair (f_i, c_i): C ← C ∪ {(f_i, c_i)}.
2.2) For any two small files f_i and f_j in the small file set F, compute their co-access frequencies within the time intervals T_c and T_s as follows:
2.2.1) From the log record set L, select the log record subset L_i = {(f_i, t_k, s_k) | 1 ≤ k ≤ N_i} corresponding to small file f_i; L_i is the log record set of f_i sorted by time in descending order.
2.2.2) From the log record set L, select the log record subset L_j = {(f_j, t_q, s_q) | 1 ≤ q ≤ N_j} corresponding to small file f_j; L_j is the log record set of f_j sorted by time in descending order.
2.2.3) Let u_ij denote the frequency with which f_i and f_j are accessed within the time interval T_s, initialized u_ij ← 0. Let v_ij denote the frequency with which f_i and f_j are accessed within the time interval T_c, initialized v_ij ← 0.
2.2.4) For each log record (f_i, t_k, s_k) in L_i, traverse each log record (f_j, t_q, s_q) in L_j. If |t_q − t_k| < T_c, count it as simultaneous access: v_ij ← v_ij + 1. If |t_q − t_k| < T_s, count it as correlated access: u_ij ← u_ij + 1.
2.2.5) Add f_i, f_j and the co-access frequencies u_ij, v_ij to the co-access frequency set I as the quadruple (f_i, f_j, u_ij, v_ij): I ← I ∪ {(f_i, f_j, u_ij, v_ij)}.
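A minimal sketch of the frequency statistics of steps 2.1) and 2.2), assuming the access log is available as an in-memory list of (file, timestamp, size) triples with numeric timestamps (e.g. epoch seconds). The function names and the use of plain Python tuples are illustrative assumptions and stand in for the log collection that the DataNode-side modules would perform.

```python
from collections import Counter, defaultdict
from itertools import combinations

def individual_frequencies(logs):
    """Step 2.1: count c_i, the individual access frequency of each small file."""
    return dict(Counter(f for f, _t, _s in logs))

def co_access_frequencies(logs, t_c, t_s):
    """Step 2.2: count (u_ij, v_ij), the correlated / simultaneous access frequencies."""
    per_file = defaultdict(list)
    for f, t, _s in logs:
        per_file[f].append(t)
    result = {}
    for fi, fj in combinations(sorted(per_file), 2):
        u = v = 0
        for tk in per_file[fi]:
            for tq in per_file[fj]:
                if abs(tq - tk) < t_c:   # within T_c: simultaneous access
                    v += 1
                if abs(tq - tk) < t_s:   # within T_s: correlated access
                    u += 1
        result[(fi, fj)] = (u, v)
    return result
```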
2.3) For any two small files f_i and f_j in the small file set F, compute the degree of correlation between f_i and f_j and build their correlation mapping as follows:
2.3.1) Let P(f_j|f_i) denote the conditional probability that file f_j is accessed in correlation with f_i after f_i is accessed, and let P(f_i f_j) denote the probability that f_i and f_j are accessed simultaneously. With N the total number of log records, compute P(f_j|f_i) and P(f_i f_j) as:
P(f_j|f_i) = u_ij / c_i
P(f_i f_j) = v_ij / N
2.3.2) Let w_ij denote the degree of correlation between f_i and f_j: w_ij ← W_r × P(f_j|f_i) + W_c × P(f_i f_j).
2.3.3) Add f_i, f_j and w_ij to the file correlation mapping set S as the triple (f_i, f_j, w_ij): S ← S ∪ {(f_i, f_j, w_ij)}.
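The correlation degree of step 2.3) follows directly from the counts above. The short sketch below assumes the outputs of the hypothetical helpers shown earlier and computes P(f_j|f_i) = u_ij / c_i, P(f_i f_j) = v_ij / N and the weighted correlation w_ij = W_r·P(f_j|f_i) + W_c·P(f_i f_j).

```python
def correlation_map(c, co_freq, n_logs, w_r=0.6, w_c=0.4):
    """Step 2.3: build the correlation mapping set S = {(f_i, f_j): w_ij}."""
    s = {}
    for (fi, fj), (u_ij, v_ij) in co_freq.items():
        p_cond = u_ij / c[fi] if c[fi] else 0.0   # P(f_j | f_i): correlated access probability
        p_sim = v_ij / n_logs                     # P(f_i f_j): simultaneous access probability
        s[(fi, fj)] = w_r * p_cond + w_c * p_sim  # w_ij
    return s
```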
(3) Small file classification and staging
3.1) Read one small file f_i from the small file set F and perform the following steps on f_i:
3.2) Traverse each staging set q_k (1 ≤ k ≤ K) in the universal staging set Q:
3.2.1) Compute the average correlation between f_i and the staging set q_k. Let f_j (j ∈ [1, n_k]) denote each small file staged in q_k, where n_k is the number of small files in q_k, and let a_ik denote the average correlation between f_i and q_k. If q_k is empty, then a_ik = 0; otherwise compute a_ik = (Σ_{j=1}^{n_k} w_ij) / n_k.
If a_ik > θ_min, remove q_k from Q and add it to the universal candidate set H.
3.3) Check whether the universal candidate set H is empty. If H is empty, there is no candidate set; jump to step 3.7). If H is not empty, continue with step 3.4).
3.4) Let q_min denote the candidate set with the least used space, initialized to an arbitrary set. Let size_min denote the used space of q_min; initially size_min ← V_max.
3.5) Traverse each candidate set h_j (1 ≤ j ≤ K', where K' is the number of candidate sets) in H and compute its currently used space size_j = Σ_{k=1}^{h_n} s_k, where h_n is the number of small files in h_j and s_k is the size of the k-th small file in h_j.
If size_j < size_min, then q_min ← h_j and size_min ← size_j.
3.6) Check whether f_i can be added to q_min: if s_i + size_min ≤ V_max, add f_i to the candidate set q_min, remove all candidate sets h_j from H and put them back into the universal staging set Q, then execute step 3.8). Otherwise continue with step 3.7).
3.7) Add f_i to any empty staging set in the universal staging set Q. If there is no empty staging set in Q, jump directly to step (6).
3.8) Read the next unprocessed small file in the small file set and repeat from step 3.2) until all small files have been processed.
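A compact sketch of the classification and staging logic of step (3), assuming the correlation map from the previous sketch and a dictionary of file sizes in bytes; staging sets are plain Python lists and the helper names are illustrative. Rather than physically moving sets between Q and H, the sketch filters candidates in place, which is behaviorally equivalent: among all staging sets whose average correlation with the incoming file exceeds θ_min, the one with the least used space that still has room is chosen, mirroring steps 3.2) through 3.7).

```python
def pair_key(a, b):
    return (a, b) if a < b else (b, a)

def classify_and_stage(files, sizes, corr, k, v_max, theta_min):
    """Step 3: distribute small files into K staging sets by average correlation and capacity."""
    staging = [[] for _ in range(k)]
    for f in files:
        # Candidate sets: non-empty staging sets whose average correlation exceeds theta_min.
        candidates = []
        for q in staging:
            if not q:
                continue
            avg = sum(corr.get(pair_key(f, g), 0.0) for g in q) / len(q)
            if avg > theta_min:
                candidates.append(q)
        # Among the candidates, pick the one with the least used space (size_min starts at V_max).
        best, best_size = None, v_max
        for q in candidates:
            used = sum(sizes[g] for g in q)
            if used < best_size:
                best, best_size = q, used
        if best is not None and sizes[f] + best_size <= v_max:
            best.append(f)                 # step 3.6: the file fits into the chosen set
            continue
        empty = next((q for q in staging if not q), None)
        if empty is None:
            break                          # no empty staging set left: stop (step (6))
        empty.append(f)                    # step 3.7: fall back to an empty staging set
    return staging
```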
(4) Small file merging and storage
4.1) Read one staging set q_k from the set Q and perform the following operations:
4.2) Check whether the staging set q_k has reached the minimum capacity utilization R_min for merging: if size_k / V_max ≥ R_min, continue with the following steps; otherwise execute step 4.7) directly.
4.3) Define f_new as a new empty file.
4.4) Traverse each small file f_i in the staging set q_k, append the content of each f_i to the empty file f_new, and delete the original small file f_i and its replicas from HDFS.
4.5) Save the merged file f_new to HDFS.
4.6) Empty the staging set q_k.
4.7) Read the next unprocessed staging set in Q and repeat from step 4.2) until all staging sets have been processed.
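A sketch of the merge-and-store logic of step (4). To stay self-contained it merges local files rather than calling a real HDFS client: the local-filesystem stand-ins (open, os.remove) are an explicit simplification of the append, write and delete operations that the DataNode-side merge execution module would perform through HDFS, and the replica deletion handled by HDFS itself is not modeled.

```python
import os

def merge_staging_set(staging_set, sizes, merged_path, v_max, r_min):
    """Step 4: merge one staging set into a single file if its utilization reaches R_min."""
    used = sum(sizes[f] for f in staging_set)
    if used / v_max < r_min:                        # step 4.2: not full enough, skip this period
        return False
    with open(merged_path, "wb") as f_new:          # step 4.3: new empty file f_new
        for path in staging_set:                    # step 4.4: append each small file's content
            with open(path, "rb") as src:
                f_new.write(src.read())
            os.remove(path)                         # delete the original (replicas not modeled)
    staging_set.clear()                             # step 4.6: empty the staging set
    return True                                     # step 4.5 done: merged file persisted
```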
(5) Loop back: check whether the next merge period has arrived. If it has, jump back to step (2) and re-execute; otherwise go to step (6).
(6) End: stop the whole small file classification and merging process.
To implement this method, the invention adds a file access information collection module on each DataNode to collect the file access activity of that DataNode. The invention also adds a file access information statistical analysis module on the NameNode, corresponding to step (2), which aggregates the file access histories collected by the DataNodes, analyzes them to quantify the degree of correlation between files, and obtains the file correlation mapping set. The invention further adds a file merge decision module on the NameNode, corresponding to step (3), which classifies and stages small files according to the correlation between files so that highly correlated small files are placed in the same staging set, thereby forming the merge decision for the small files. Finally, the invention adds a file merge execution module on each DataNode, corresponding to step (4): after the NameNode completes the small file merge decision, it directs the DataNodes to execute the merge strategy, complete the merge of the small files, write the merged file to HDFS, delete the original small files and their replicas, and update the corresponding metadata.
Description of the drawings
Fig. 1 is the deployment diagram of the HDFS platform on which the method relies.
Fig. 2 shows the software modules added by the method and their interactions.
Fig. 3 is the overall flowchart of the method.
Fig. 4 is the execution flow of the file access analysis.
Fig. 5 is the execution flowchart of small file classification and staging.
Fig. 6 is the execution flowchart of small file merging and storage.
Fig. 7 shows the file size distribution.
Fig. 8 shows the NameNode memory consumption.
Fig. 9 compares the access efficiency of the three storage modes.
Specific embodiment
The invention is described below with reference to the accompanying drawings and a specific embodiment.
The small file storage optimization method proposed by the invention can rely on the existing HDFS architecture and is realized by modifying and adding the corresponding software modules. Fig. 1 is the platform deployment diagram of the method: the platform consists of multiple computer servers (nodes) connected through a network. The nodes fall into two categories: one master node (the NameNode) and multiple slave nodes (DataNodes). On the NameNode, in addition to the original file system metadata management module and DataNode management module, the file access information statistical analysis module and the file merge decision module required by this method are added. On each DataNode, in addition to the original data block storage management module, the file access information collection module and the file merge execution module are added.
Fig. 2 shows the software modules newly added to the HDFS platform on which the method relies and their interactions. The shaded modules are the ones added to implement this method, namely the file access information statistical analysis module and the file merge decision module added on the NameNode, and the file access information collection module and the file merge execution module added on the DataNodes. The unshaded modules are the existing modules of the original HDFS. The interaction flow between the modules is as follows: (1) The DataNode management module on the NameNode sends a message to every DataNode in the cluster requesting the collection of file access information. The DataNode management module is an original module of the NameNode; it manages all DataNodes in the cluster and implements the NameNode's control over them. (2) The file access information collection module gathers the file access information on its DataNode and, once collection is complete, sends it back to the NameNode, where the access information is aggregated. (3) The file access information statistical analysis module statistically analyzes all the collected access information, computes the degree of correlation between files, and forms the file correlation mapping set. (4) Based on the resulting mapping set, the file merge decision module added on the NameNode classifies and stages the small files according to their correlation and forms the classification and merge strategy. (5) Through the DataNode management module, the NameNode directs the DataNodes to execute the merge strategy. (6) The file merge execution module performs the file merge operations on its DataNode; during merging it relies on the DataNode's original data block storage management module to complete data writes and deletions, and it notifies the file system metadata management module so that the metadata is updated accordingly. Because the small files are merged and stored according to this strategy, the method effectively reduces the NameNode memory overhead caused by maintaining the metadata of numerous small files.
Fig. 3 is the overall execution flowchart of the method, divided into five steps: initialization, file access analysis, small file classification and staging, small file merging and storage, and looping back. To better illustrate the execution flow, assume that 4 small files already stored in HDFS have been read, namely file1, file2, file3 and file4, and that the log records of these 4 files have been obtained from the historical access log; assume the 20 simplified access log records are as follows:
The basic parameter values used by the method are as follows: the time-interval threshold for simultaneous access of small files T_c = 1s, i.e. an interval of at most 1s counts as simultaneous access; the time-interval threshold for correlated access T_s = 2s, i.e. an interval greater than 1s and at most 2s counts as correlated access; the staging set capacity upper limit V_max = 64MB; the minimum capacity utilization when a staging set is merged R_min = 0.9; the minimum correlation threshold between small files θ_min = 0.5; the weight of the correlated access probability in the correlation degree W_r = 0.6; the weight of the simultaneous access probability in the correlation degree W_c = 0.4; and the merge period T_cycle = 86400s.
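Using the hypothetical MergeConfig sketch given in the summary section, the parameter values of this embodiment would be instantiated as follows (illustrative only):

```python
config = MergeConfig(t_c=1.0, t_s=2.0, v_max=64 * 2**20,
                     r_min=0.9, theta_min=0.5,
                     w_r=0.6, w_c=0.4, t_cycle=86400)
config.validate()  # all values fall inside the ranges defined for the method
```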
A specific embodiment of the whole method is given below with reference to Fig. 3:
(1) Initialization
1.1) Let the set of small files stored in HDFS be F = {file1, file2, file3, file4}; the total number of small files is M = 4.
1.2) Let the small file access log set be L = {(file1, 2018/01/01 12:10:01, 10240), (file3, 2018/01/01 12:10:02, 30720), …, (file4, 2018/01/01 12:11:51, 40960)}; the total number of log records is N = 20.
1.3) Build the individual access frequency set C; initially C = {}. Build the co-access frequency set I; initially I = {}.
1.4) Build the small file correlation mapping set S; initially S = {}.
1.5) Build the universal staging set Q; initially Q = {{}, {}, …, {}}, and the number of empty sets in Q is K = 10.
1.6) Build the universal candidate set H; initially H = {}.
(2) File access analysis
2.1) Traverse each small file in the small file set F and compute the frequency with which each small file is individually accessed. Assume the small file currently taken is file1; the computation is as follows:
2.1.1) Let c_1 denote the access frequency of the current small file file1 and initialize c_1 = 0.
2.1.2) Traverse each log (f_k, t_k, s_k) in the log set L, where 1 ≤ k ≤ N; if f_k = file1, then c_1 = c_1 + 1.
2.1.3) Add file1 and its individual access frequency c_1 to the individual access frequency set C; at this point C = {(file1, c_1)}.
After the traversal of step 2.1) completes, the individual access frequencies of all files have been counted; the final set C is: C = {(file1, 5), (file2, 5), (file3, 4), (file4, 6)}.
2.2) For any two small files in the small file set F, compute their co-access frequencies within the time intervals T_c and T_s. Assume the two small files currently taken are file1 and file2; the computation is as follows:
2.2.1) From the log record set L, select the log record subset L_1 corresponding to file1, sorted by time in descending order: L_1 = {(file1, 2018/01/01 12:11:50, 10240), (file1, 2018/01/01 12:11:30, 10240), …, (file1, 2018/01/01 12:10:01, 10240)}.
2.2.2) From the log record set L, select the log record subset L_2 corresponding to file2, sorted by time in descending order: L_2 = {(file2, 2018/01/01 12:11:31, 20480), (file2, 2018/01/01 12:11:18, 20480), …, (file2, 2018/01/01 12:10:02, 20480)}.
2.2.3) Let u_12 denote the frequency with which file1 and file2 are accessed within the time interval T_s, initialized u_12 = 0; let v_12 denote the frequency with which they are accessed within the time interval T_c, initialized v_12 = 0.
2.2.4) For each log record (file1, t_k, s_k) in L_1, traverse each log record (file2, t_q, s_q) in L_2; if |t_q − t_k| < T_c, then v_12 = v_12 + 1; if |t_q − t_k| < T_s, then u_12 = u_12 + 1.
2.2.5) Add file1, file2 and the co-access frequencies u_12, v_12 to the co-access frequency set I as the quadruple (file1, file2, u_12, v_12): I = {(file1, file2, u_12, v_12)}.
After step 2.2) completes, the set I records the co-access frequencies of every pair of files:
I = {(file1, file2, 4, 4), (file1, file3, 5, 4), (file1, file4, 1, 1), (file2, file3, 3, 2), (file2, file4, 1, 0), (file3, file4, 1, 1)}.
2.3) For any two small files in the small file set F, compute the degree of correlation between them and build the correlation mapping. Assume the two files taken are file1 and file2; the steps are as follows:
2.3.1) Let P(file2|file1) denote the conditional probability that file2 is accessed in correlation with file1 after file1 is accessed, and let P(file1 file2) denote the probability that file1 and file2 are accessed simultaneously. Compute P(file2|file1) and P(file1 file2), where N = 20 is the total number of log records:
P(file2|file1) = u_12 / c_1 = 0.8
P(file1 file2) = v_12 / N = 0.2
2.3.2) Let w_12 denote the degree of correlation between file1 and file2: w_12 = W_r × 0.8 + W_c × 0.2 = 0.56.
2.3.3) Add file1, file2 and w_12 to the file correlation mapping set S as the triple (file1, file2, w_12): S = {(file1, file2, w_12)}.
After step 2.3) completes, the final file correlation mapping set is: S = {(file1, file2, 0.56), (file1, file3, 0.68), (file1, file4, 0.14), (file2, file3, 0.4), (file2, file4, 0.12), (file3, file4, 0.17)}.
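The example figures for file1 and file2 can be reproduced from the counts in set C and set I. The short check below, assuming the dictionaries produced by the earlier illustrative sketches, confirms w_12 = 0.6 × (4/5) + 0.4 × (4/20) = 0.56.

```python
c = {"file1": 5, "file2": 5, "file3": 4, "file4": 6}
co = {("file1", "file2"): (4, 4)}     # (u_12, v_12) from set I
n_logs = 20

u_12, v_12 = co[("file1", "file2")]
p_cond = u_12 / c["file1"]            # P(file2 | file1) = 0.8
p_sim = v_12 / n_logs                 # P(file1 file2) = 0.2
w_12 = 0.6 * p_cond + 0.4 * p_sim     # W_r * 0.8 + W_c * 0.2
print(round(w_12, 2))                 # 0.56
```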
(3) Small file classification and staging
To simplify the demonstration of the computation, assume that the small files file1 and file2 have already been placed in the staging set q_1, i.e. q_1 = {file1, file2} and Q = {{file1, file2}, {}, …, {}}.
3.1) Take the small file file3 and perform the following steps on it:
3.2) Traverse each staging set q_k in the universal staging set Q:
3.2.1) Compute the average correlation between file3 and q_k. Taking q_1 as an example, the average correlation of file3 with q_1 is a_31 = (w_31 + w_32) / 2 = (0.68 + 0.4) / 2 = 0.54. Since a_31 = 0.54 > θ_min = 0.5, q_1 is removed from Q and added to the universal candidate set H. Similarly, continue computing the average correlation between file3 and the remaining staging sets.
3.3) Check whether the universal candidate set H is empty; if H were empty, the method would jump to step 3.7). Since q_1 is already in H, H is not empty, so step 3.4) is executed.
3.4) Let q_min denote the candidate set with the least used space, initialized to an arbitrary set; let size_min denote the used space of q_min, initially size_min = V_max.
3.5) Traverse each candidate set in the universal candidate set H and compute its currently used space. Taking q_1 as an example, q_1 contains file1 and file2, so its used space is size_1 = s_1 + s_2 = 30MB, which satisfies size_1 < size_min; therefore q_min = q_1 and size_min = size_1.
3.6) Since s_3 + size_min = 60MB ≤ V_max holds, the small file file3 can be added to the candidate set q_min. All candidate sets are removed from H and put back into the universal staging set Q, and step 3.8) is executed; at this point H = {} and Q = {{file1, file2, file3}, {}, …, {}}. If s_3 + size_min ≤ V_max did not hold, file3 could not enter q_min because of insufficient capacity, and step 3.7) would be executed.
3.7) Add the small file file3 to any empty staging set in the universal staging set Q, for example to the empty staging set q_2, so that q_2 = {file3} and Q = {{file1, file2}, {file3}, {}, …, {}}. If no empty staging set remained in Q, the method would jump directly to step (6).
3.8) Take the next small file in the small file set and repeat from step 3.2) until all small files in F have been taken.
(4) Small file merging and storage
4.1) Take one staging set from the set Q; assume the chosen set is q_1. Perform the following operations:
4.2) Check whether the staging set q_1 has reached the minimum capacity utilization R_min = 0.9 required for merging. q_1 contains the small files file1, file2 and file3, so its capacity utilization is size_1 / V_max = (10MB + 20MB + 30MB) / 64MB = 0.94 ≥ R_min; continue with step 4.3). If size_1 / V_max < R_min, step 4.7) is executed directly.
4.3) Define f_new as a new empty file.
4.4) Traverse each small file in the staging set q_1, append the contents of file1, file2 and file3 to the empty file f_new, and delete the original small files file1, file2, file3 and their replicas from HDFS.
4.5) Save the merged file f_new to HDFS.
4.6) Empty the staging set q_1.
4.7) Take the next staging set in Q and repeat from step 4.2) until every staging set has been taken.
(5) Loop back: within this execution period, after the small file merging and storage process completes, check whether the next execution period has arrived; when the next period starts, i.e. when the specified period T_cycle has elapsed, jump back to step (2) and re-execute; otherwise go to step (6).
(6) End: stop the whole small file classification and merging process.
For the HDFS-based massive small file storage optimization method proposed by the present invention (hereinafter the MBDC algorithm), the inventors carried out relevant performance tests. In the tests, 3132 files and their access logs were collected; the file sizes ranged from 100KB to 120MB, and small files of 5MB or below accounted for 96% of the total number of files. The file size distribution is shown in Fig. 7.
Compared with the plain storage mode that applies no merging strategy, both the PS algorithm (an existing classic merging algorithm) and the MBDC algorithm (the method of the present invention), which use merging strategies, greatly reduce the memory overhead of the NameNode, improving small file storage performance by 98% and 97.8% respectively. This is because both PS and MBDC merge the massive small files before storing them. Since MBDC puts more emphasis on file correlation, compared with the PS merging algorithm, which only considers the file size distribution and ignores file correlation, it wastes some residual storage space, so the final number of merged files produced by MBDC is slightly larger than that of PS, making MBDC slightly inferior to PS in terms of NameNode memory overhead. The NameNode memory consumption under the different algorithms is shown in Fig. 8.
Compared with the default HDFS access mode and with access after storage using the PS algorithm, the MBDC algorithm greatly shortens the small file access time and improves small file access efficiency. This is mainly because MBDC places highly correlated files into the same merged file more thoroughly than PS, so that highly correlated small file contents are fetched together during access, effectively reducing the number of communications between the client and the NameNode and improving the overall data access efficiency. The access efficiency comparison is shown in Fig. 9.
Finally, it should be noted that the above example only illustrates the invention and does not limit the described technique; all technical solutions and improvements that do not depart from the spirit and scope of the invention shall be covered by the claims of the invention.

Claims (3)

1. A massive small file storage optimization method based on HDFS, characterized in that it is divided into five steps: initialization, file access analysis, small file classification and staging, small file merging and storage, and looping back; it has the following basic parameters: the time-interval threshold T_c for simultaneous access of small files, the time-interval threshold T_s for correlated access of small files, the upper capacity limit V_max of a staging set, the minimum capacity utilization R_min required before a staging set is merged, the minimum correlation threshold θ_min of small files, the weight W_r of the correlated access probability in the correlation degree, the weight W_c of the simultaneous access probability in the correlation degree, and the merge execution period T_cycle; the parameter ranges are: T_c in (0s, 1s], T_s in (1s, 5s], V_max in [64MB, 1024MB], R_min in [0.8, 1], and θ_min in [0.3, 1]; W_r and W_c satisfy W_r + W_c = 1, with W_r in [0.6, 0.9] and W_c in [0.1, 0.4], and T_cycle is at least 3600s;
The method is realized according to the following steps:
(1) Initialization
1.1) Let the set of small files stored in HDFS be F, F = {f_i | 1 ≤ i ≤ M}, where f_i denotes any small file and M is the total number of small files;
1.2) Let the set of small file access logs be L, L = {l_i | 1 ≤ i ≤ N}; any l_i ∈ L is expressed as a triple (f_i, t_i, s_i), where f_i is the small file recorded by the log, t_i is its access time, and s_i is its size;
1.3) Build the individual access frequency set C, C = {(f_i, c_i) | 1 ≤ i ≤ M}, where c_i is the access frequency of small file f_i; C is initialized to the empty set; similarly, build the co-access frequency set I, I = {(f_i, f_j, u_ij, v_ij) | 1 ≤ i < M, i < j ≤ M}, where u_ij is the frequency with which small files f_i and f_j are accessed within the time interval T_s and v_ij is the frequency with which they are accessed within the time interval T_c; I is initialized to the empty set;
1.4) Build the small file correlation mapping set S, S = {s_ij | 1 ≤ i < M, i < j ≤ M}; any s_ij ∈ S is expressed as a triple (f_i, f_j, w_ij), where f_i and f_j are the two distinct small files in the relation and w_ij is their degree of correlation; S is initialized to the empty set;
1.5) Build the universal staging set Q, Q = {q_i | 1 ≤ i ≤ K}, where K is the number of staging sets in Q; for any q_i ∈ Q, q_i = {f_j | 1 ≤ j ≤ n_i}, where n_i is the number of small files staged in q_i; Q is initialized as a collection of K empty sets;
1.6) Build the universal candidate set H, H = {h_i | 1 ≤ i ≤ K'}, where K' is the number of candidate sets in H; for any h_i ∈ H, h_i = {f_j | 1 ≤ j ≤ n_i}, where n_i is the number of small files in h_i; H is initialized to the empty set;
(2) File access analysis
2.1) Traverse each small file f_i (1 ≤ i ≤ M) in the small file set F and compute the individual access frequency of each f_i as follows:
2.1.1) Define c_i as the access frequency of the current small file f_i and initialize c_i to 0;
2.1.2) Traverse each log l_k (1 ≤ k ≤ N) in the log set L, where l_k is expressed as a triple (f_k, t_k, s_k); if f_i = f_k, increase c_i by 1;
2.1.3) Add the small file f_i and its individual access frequency c_i to the individual access frequency set C as the pair (f_i, c_i);
2.2) For any two small files f_i and f_j in the small file set F, compute their co-access frequencies within the time intervals T_c and T_s as follows:
2.2.1) From the log record set L, select the log record subset L_i = {(f_i, t_k, s_k) | 1 ≤ k ≤ N_i} corresponding to small file f_i; L_i is the log record set of f_i sorted by time in descending order;
2.2.2) From the log record set L, select the log record subset L_j = {(f_j, t_q, s_q) | 1 ≤ q ≤ N_j} corresponding to small file f_j;
L_j is the log record set of f_j sorted by time in descending order;
2.2.3) Let u_ij denote the frequency with which f_i and f_j are accessed within the time interval T_s, initialized u_ij ← 0; let v_ij denote the frequency with which f_i and f_j are accessed within the time interval T_c, initialized v_ij ← 0;
2.2.4) For each log record (f_i, t_k, s_k) in L_i, traverse each log record (f_j, t_q, s_q) in L_j; if |t_q − t_k| < T_c, count it as simultaneous access and increase v_ij by 1; if |t_q − t_k| < T_s, count it as correlated access and increase u_ij by 1;
2.2.5) Add f_i, f_j and the co-access frequencies u_ij, v_ij to the co-access frequency set I as the quadruple (f_i, f_j, u_ij, v_ij);
2.3) For any two small files f_i and f_j in the small file set F, compute the degree of correlation between f_i and f_j and build their correlation mapping as follows:
2.3.1) Let P(f_j|f_i) denote the conditional probability that f_j is accessed in correlation with f_i after f_i is accessed, and let P(f_i f_j) denote the probability that f_i and f_j are accessed simultaneously; with N the total number of log records, compute P(f_j|f_i) and P(f_i f_j) as:
P(f_j|f_i) = u_ij / c_i
P(f_i f_j) = v_ij / N
2.3.2) Let w_ij denote the degree of correlation between f_i and f_j: w_ij ← W_r × P(f_j|f_i) + W_c × P(f_i f_j);
2.3.3) Add f_i, f_j and w_ij to the file correlation mapping set S as the triple (f_i, f_j, w_ij);
(3) Small file classification and staging
3.1) Read one small file f_i from the small file set F and perform the following steps on f_i:
3.2) Traverse each staging set q_k (1 ≤ k ≤ K) in the universal staging set Q:
3.2.1) Compute the average correlation between f_i and the staging set q_k; let f_j denote each small file staged in q_k, with j ∈ [1, n_k] and n_k the number of small files in q_k, and let a_ik denote the average correlation between f_i and q_k; if q_k is empty, then a_ik = 0; otherwise compute a_ik = (Σ_{j=1}^{n_k} w_ij) / n_k;
if a_ik > θ_min, remove q_k from Q and add it to the universal candidate set H;
3.3) Check whether the universal candidate set H is empty; if H is empty, there is no candidate set, and the method jumps to step 3.7); if H is not empty, continue with step 3.4);
3.4) Let q_min denote the candidate set with the least used space, initialized to an arbitrary set; let size_min denote the used space of q_min, initially assigning V_max to size_min;
3.5) Traverse each candidate set h_j (1 ≤ j ≤ K', where K' is the number of candidate sets) in H and compute its currently used space size_j = Σ_{k=1}^{h_n} s_k, where h_n is the number of small files in h_j and s_k is the size of the k-th small file in h_j;
if size_j < size_min, assign h_j to q_min and size_j to size_min;
3.6) Check whether f_i can be added to q_min: if s_i + size_min ≤ V_max, add f_i to the candidate set q_min, remove all candidate sets h_j from H and put them back into the universal staging set Q, then execute step 3.8); otherwise continue with step 3.7);
3.7) Add f_i to any empty staging set in the universal staging set Q; if there is no empty staging set in Q, jump directly to step (6);
3.8) Read the next unprocessed small file in the small file set and repeat from step 3.2) until all small files have been processed;
(4) Small file merging and storage
4.1) Read one staging set q_k from the set Q and perform the following operations:
4.2) Check whether the staging set q_k has reached the minimum capacity utilization R_min for merging: if size_k / V_max ≥ R_min, continue with the following steps; otherwise execute step 4.7) directly;
4.3) Define f_new as a new empty file;
4.4) Traverse each small file f_i in the staging set q_k, append the content of each f_i to the empty file f_new, and delete the original small file f_i and its replicas from HDFS;
4.5) Save the merged file f_new to HDFS;
4.6) Empty the staging set q_k;
4.7) Read the next unprocessed staging set in Q and repeat from step 4.2) until all staging sets have been processed;
(5) Loop back: check whether the next merge period has arrived; if it has, jump back to step (2) and re-execute; otherwise go to step (6);
(6) End: stop the whole small file classification and merging process.
2. The HDFS-based massive small file storage optimization method according to claim 1, characterized in that: a file access information collection module is added on each DataNode to collect the file access activity of that DataNode; a file access information statistical analysis module, corresponding to step (2), is added on the NameNode to aggregate the file access histories collected by the DataNodes, analyze them to quantify the degree of correlation between files, and obtain the file correlation mapping set; a file merge decision module, corresponding to step (3), is added on the NameNode to classify and stage small files according to the correlation between files so that highly correlated small files are placed in the same staging set, thereby forming the merge decision for the small files; and a file merge execution module, corresponding to step (4), is added on each DataNode, so that after the NameNode completes the small file merge decision, the DataNodes are directed to execute the merge strategy, complete the merge of the small files, write the merged file to HDFS, and delete the original small files and their replicas.
3. The HDFS-based massive small file storage optimization method according to claim 1, characterized in that: the method relies on the existing HDFS architecture and is realized by modifying and adding the corresponding software modules; the platform of the massive small file storage optimization method consists of multiple computer servers, i.e. nodes, connected through a network; the platform nodes fall into two categories: one master node, i.e. the NameNode, and multiple slave nodes, i.e. DataNodes; on the NameNode, in addition to the original file system metadata management module and DataNode management module, the file access information statistical analysis module and file merge decision module required by the method are added; on each DataNode, in addition to the original data block storage management module, the file access information collection module and file merge execution module are added;
among the software modules added to the HDFS platform, namely the file access information statistical analysis module and file merge decision module added on the NameNode and the file access information collection module and file merge execution module added on the DataNodes, the interaction flow is as follows: (1) the DataNode management module sends a message requesting the collection of file access information; the DataNode management module is an original module of the NameNode, which manages all DataNodes in the cluster and implements the NameNode's control over them; (2) the file access information collection module gathers the file access information on its DataNode and, once collection is complete, sends it back to the NameNode, where the access information is aggregated; (3) the file access information statistical analysis module statistically analyzes all the collected access information, computes the degree of correlation between files, and forms the file correlation mapping set; (4) based on the resulting file correlation mapping set, the file merge decision module added on the NameNode classifies and stages the small files according to their degree of correlation and forms the classification and merge strategy; (5) through the DataNode management module, the NameNode directs the DataNodes to execute the merge strategy according to the classification and merge strategy; (6) the file merge execution module performs the file merge operations on its DataNode; during merging it relies on the DataNode's original data block storage management module to complete data writes and deletions, and it notifies the file system metadata management module so that the metadata is updated accordingly.
CN201910175055.9A 2019-03-08 2019-03-08 Mass small file storage optimization method based on HDFS Expired - Fee Related CN110018997B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910175055.9A CN110018997B (en) 2019-03-08 2019-03-08 Mass small file storage optimization method based on HDFS

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910175055.9A CN110018997B (en) 2019-03-08 2019-03-08 Mass small file storage optimization method based on HDFS

Publications (2)

Publication Number Publication Date
CN110018997A (en) 2019-07-16
CN110018997B CN110018997B (en) 2021-07-23

Family

ID=67189371

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910175055.9A Expired - Fee Related CN110018997B (en) 2019-03-08 2019-03-08 Mass small file storage optimization method based on HDFS

Country Status (1)

Country Link
CN (1) CN110018997B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101854388A (en) * 2010-05-17 2010-10-06 浪潮(北京)电子信息产业有限公司 Method and system concurrently accessing a large amount of small documents in cluster storage
CN103500089A (en) * 2013-09-18 2014-01-08 北京航空航天大学 Small file storage system suitable for Mapreduce calculation model
US20150125133A1 (en) * 2013-11-06 2015-05-07 Konkuk University Industrial Cooperation Corp. Method for transcoding multimedia, and hadoop-based multimedia transcoding system for performing the method
CN106446079A (en) * 2016-09-08 2017-02-22 中国科学院计算技术研究所 Distributed file system-oriented file prefetching/caching method and apparatus
CN108710639A (en) * 2018-04-17 2018-10-26 桂林电子科技大学 A kind of mass small documents access optimization method based on Ceph

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
XUN CAI ET AL: "An optimization strategy of massive small files storage based on HDFS", 2018 Joint International Advanced Engineering and Technology Research Conference *
周国安 et al.: "A survey of massive small file storage techniques in cloud environments" (云环境下海量小文件存储技术研究综述), 《信息网络安全》 (Information Network Security) *
董其文: "Research on small file storage methods based on HDFS" (基于HDFS的小文件存储方法的研究), 《中国优秀硕士学位论文全文数据库 信息科技辑》 (China Master's Theses Full-text Database, Information Science and Technology) *
邹振宇 et al.: "A small file optimization scheme for HDFS-based cloud storage systems" (基于HDFS的云存储系统小文件优化方案), 《计算机工程》 (Computer Engineering) *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111639054A (en) * 2020-05-29 2020-09-08 中国人民解放军国防科技大学 Data coupling method, system and medium for ocean mode and data assimilation
CN111639054B (en) * 2020-05-29 2023-11-07 中国人民解放军国防科技大学 Data coupling method, system and medium for ocean mode and data assimilation
CN117170590A (en) * 2023-11-03 2023-12-05 沈阳卓志创芯科技有限公司 Computer data storage method and system based on cloud computing
CN117170590B (en) * 2023-11-03 2024-01-26 沈阳卓志创芯科技有限公司 Computer data storage method and system based on cloud computing
CN117519608A (en) * 2023-12-27 2024-02-06 泰安北航科技园信息科技有限公司 Big data server with Hadoop as core
CN117519608B (en) * 2023-12-27 2024-03-22 泰安北航科技园信息科技有限公司 Big data server with Hadoop as core
CN118132520A (en) * 2024-05-08 2024-06-04 济南浪潮数据技术有限公司 Storage system file processing method, electronic device, storage medium and program product

Also Published As

Publication number Publication date
CN110018997B (en) 2021-07-23


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210723