CN110018997A - A massive small file storage optimization method based on HDFS - Google Patents

A massive small file storage optimization method based on HDFS Download PDF

Info

Publication number
CN110018997A
CN110018997A (application CN201910175055.9A)
Authority
CN
China
Prior art keywords
small file
file
correlation
access
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910175055.9A
Other languages
Chinese (zh)
Other versions
CN110018997B (en)
Inventor
王健
韩永鹏
崔运鹏
刘娟
胡林
苏航
梁毅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Agricultural Information Institute of CAAS
Original Assignee
Beijing University of Technology
Agricultural Information Institute of CAAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology, Agricultural Information Institute of CAAS filed Critical Beijing University of Technology
Priority to CN201910175055.9A priority Critical patent/CN110018997B/en
Publication of CN110018997A publication Critical patent/CN110018997A/en
Application granted granted Critical
Publication of CN110018997B publication Critical patent/CN110018997B/en
Expired - Fee Related

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A massive small file storage optimization method based on HDFS belongs to the field of storage performance optimization and comprises five steps: initialization, file access analysis, small file classification and staging, small file merging and storage, and looping back. The method analyzes the historical access logs of files, computes the degree of correlation between files, and forms a file correlation mapping set. Based on this mapping set, small files are classified and staged so that highly correlated small files are staged together, while the size distribution of the small files is also taken into account. Finally, the staged small files are merged and stored: the original small files and their replicas are deleted, and the merged large file is written to HDFS. By re-storing the massive small files originally kept in HDFS in merged form, the method fully considers both the correlation among small files and their size distribution, significantly reduces the memory overhead of the NameNode, and improves the access efficiency of HDFS for small files.

Description

A massive small file storage optimization method based on HDFS
Technical field
The invention belongs to the field of storage performance optimization for distributed file systems, and in particular relates to a massive small file storage optimization method based on HDFS.
Background technique
Hadoop is a distributed computing framework developed by Apache and is an open-source implementation of Google's cloud computing cluster design. HDFS (Hadoop Distributed File System) is one of the core modules of Hadoop and provides the underlying mass file storage for the whole framework. HDFS uses a master-slave architecture: the NameNode, as the master node, manages the metadata of the entire distributed file system, while the DataNodes, as slave nodes, store the data locally. The HDFS storage design favors efficient storage of large files, but when handling massive numbers of small files, the storage and access efficiency of HDFS drops sharply. Massive small files cause excessive metadata to occupy the NameNode's memory, so the system's file storage capacity hits a bottleneck. In addition, the access-time overhead of massive small files increases sharply, reducing the file access efficiency of the whole system.
The current solution to the HDFS small file storage problem is to merge small files before storing them. Existing merging schemes merge according to the degree of correlation between small files: following rules established for a particular task scenario, highly correlated small files are merged into a single file, reducing the number of small files. However, existing correlation-based merging schemes depend on a specific task scenario and lack a general correlation computation method, and they do not take the size distribution of small files into account, so the merged result often fails to make full use of storage space. The present method comprehensively considers both the correlation between small files and their size distribution during merging. During correlation computation, it counts the simultaneous access frequency and the correlated access frequency of files under different time thresholds and quantifies the file correlation through a reasonable weighting scheme, so the method does not depend on a specific task scenario and has good generality. In addition, the method takes the file size distribution into account, makes better use of storage space, and lays the foundation for efficient access to the small files after merged storage.
Summary of the invention
Aiming at the low efficiency of massive small file storage in the HDFS distributed file system described above, the method of the present invention statistically analyzes the historical access logs of a large number of files, computes the degree of correlation between files, quantifies this correlation, obtains the correlation mapping set between files, classifies and stages the small files according to the mapping set, and finally merges the small files and stores them in HDFS. The method fully considers the correlation between small files and their size distribution, improves the storage performance of HDFS for massive small files, and to a certain extent also benefits small file access performance.
The massive small file storage optimization method of the present invention is divided into five steps: initialization, file access analysis, small file classification and staging, small file merging and storage, and looping back. The method defines the following basic parameters: the time-interval threshold T_c for simultaneous access of small files, the time-interval threshold T_s for correlated access of small files, the upper capacity limit V_max of a staging set, the minimum capacity utilization R_min required before a staging set is merged, the minimum correlation threshold θ_min of small files, the weight W_r of the correlated access probability in the correlation degree, the weight W_c of the simultaneous access probability in the correlation degree, and the merge execution period T_cycle. The parameter ranges are as follows: T_c takes values in (0s, 1s], T_s in (1s, 5s], V_max in [64MB, 1024MB], R_min in [0.8, 1], and θ_min in [0.3, 1]; W_r and W_c satisfy W_r + W_c = 1, with W_r in [0.6, 0.9] and W_c in [0.1, 0.4], and T_cycle is at least 3600s.
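For illustration only, the parameter set above can be captured in a small configuration object. The sketch below is a minimal, non-normative Python rendering of the parameters and their allowed ranges; the class name MergeConfig and the validation helper are illustrative assumptions and not part of the patented method.

```python
from dataclasses import dataclass

@dataclass
class MergeConfig:
    """Basic parameters of the small-file merge method (illustrative sketch)."""
    t_c: float = 1.0           # T_c: simultaneous-access interval threshold, (0s, 1s]
    t_s: float = 2.0           # T_s: correlated-access interval threshold, (1s, 5s]
    v_max: int = 64 * 2**20    # V_max: staging-set capacity limit, [64MB, 1024MB]
    r_min: float = 0.9         # R_min: minimum capacity utilization before merging, [0.8, 1]
    theta_min: float = 0.5     # θ_min: minimum correlation threshold, [0.3, 1]
    w_r: float = 0.6           # W_r: weight of correlated-access probability, [0.6, 0.9]
    w_c: float = 0.4           # W_c: weight of simultaneous-access probability, [0.1, 0.4]
    t_cycle: int = 86400       # T_cycle: merge execution period, >= 3600s

    def validate(self) -> None:
        # Range checks taken directly from the parameter definitions above.
        assert 0 < self.t_c <= 1 and 1 < self.t_s <= 5
        assert 64 * 2**20 <= self.v_max <= 1024 * 2**20
        assert 0.8 <= self.r_min <= 1 and 0.3 <= self.theta_min <= 1
        assert abs(self.w_r + self.w_c - 1.0) < 1e-9
        assert self.t_cycle >= 3600
```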
The above method is realized on a computer according to the following steps:
(1) Initialization
1.1) Let the set of small files stored in HDFS be F, F = {f_i | 1 ≤ i ≤ M}, where f_i denotes any small file and M is the total number of small files.
1.2) Let the set of small file access logs be L, L = {l_i | 1 ≤ i ≤ N}. Any l_i ∈ L is represented as a triple (f_i, t_i, s_i), where f_i is the small file recorded by the log, t_i is its access time, and s_i is its size.
1.3) Build the individual access frequency set C, C = {(f_i, c_i) | 1 ≤ i ≤ M}, where c_i is the access frequency of small file f_i; C is initialized to the empty set. Similarly, build the co-access frequency set I, I = {(f_i, f_j, u_ij, v_ij) | 1 ≤ i < M, i < j ≤ M}, where u_ij is the frequency with which small files f_i and f_j are accessed within the time interval T_s and v_ij is the frequency with which they are accessed within the time interval T_c; I is initialized to the empty set.
1.4) Build the small file correlation mapping set S, S = {s_ij | 1 ≤ i < M, i < j ≤ M}. Any s_ij ∈ S is represented as a triple (f_i, f_j, w_ij), where f_i and f_j are the two distinct small files in the relation and w_ij is their degree of correlation; S is initialized to the empty set.
1.5) Build the universal staging set Q, Q = {q_i | 1 ≤ i ≤ K}, where K is the number of staging sets in Q. For any q_i ∈ Q, q_i = {f_j | 1 ≤ j ≤ n_i}, where n_i is the number of small files staged in q_i; Q is initialized as a collection of K empty sets.
1.6) Build the universal candidate set H, H = {h_i | 1 ≤ i ≤ K'}, where K' is the number of candidate sets in H. For any h_i ∈ H, h_i = {f_j | 1 ≤ j ≤ n_i}, where n_i is the number of small files in h_i; H is initialized to the empty set.
(2) File access analysis
2.1) Traverse each small file f_i (1 ≤ i ≤ M) in the small file set F and compute the individual access frequency of each f_i as follows:
2.1.1) Define c_i as the access frequency of the current small file f_i and initialize c_i ← 0.
2.1.2) Traverse each log l_k (1 ≤ k ≤ N) in the log set L, where l_k is represented as a triple (f_k, t_k, s_k). If f_i = f_k, then c_i ← c_i + 1.
2.1.3) Add the small file f_i and its individual access frequency c_i to the individual access frequency set C as the pair (f_i, c_i): C ← C ∪ {(f_i, c_i)}.
2.2) For any two small files f_i and f_j in the small file set F, compute their co-access frequencies within the time intervals T_c and T_s as follows:
2.2.1) From the log record set L, select the log record subset L_i = {(f_i, t_k, s_k) | 1 ≤ k ≤ N_i} corresponding to small file f_i; L_i is the log record set of f_i sorted by time in descending order.
2.2.2) From the log record set L, select the log record subset L_j = {(f_j, t_q, s_q) | 1 ≤ q ≤ N_j} corresponding to small file f_j; L_j is the log record set of f_j sorted by time in descending order.
2.2.3) Let u_ij denote the frequency with which f_i and f_j are accessed within the time interval T_s, initialized u_ij ← 0. Let v_ij denote the frequency with which f_i and f_j are accessed within the time interval T_c, initialized v_ij ← 0.
2.2.4) For each log record (f_i, t_k, s_k) in L_i, traverse each log record (f_j, t_q, s_q) in L_j. If |t_q − t_k| < T_c, count it as simultaneous access: v_ij ← v_ij + 1. If |t_q − t_k| < T_s, count it as correlated access: u_ij ← u_ij + 1.
2.2.5) Add f_i, f_j and the co-access frequencies u_ij, v_ij to the co-access frequency set I as the quadruple (f_i, f_j, u_ij, v_ij): I ← I ∪ {(f_i, f_j, u_ij, v_ij)}.
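A minimal sketch of the frequency statistics of steps 2.1) and 2.2), assuming the access log is available as an in-memory list of (file, timestamp, size) triples with numeric timestamps (e.g. epoch seconds). The function names and the use of plain Python tuples are illustrative assumptions and stand in for the log collection that the DataNode-side modules would perform.

```python
from collections import Counter, defaultdict
from itertools import combinations

def individual_frequencies(logs):
    """Step 2.1: count c_i, the individual access frequency of each small file."""
    return dict(Counter(f for f, _t, _s in logs))

def co_access_frequencies(logs, t_c, t_s):
    """Step 2.2: count (u_ij, v_ij), the correlated / simultaneous access frequencies."""
    per_file = defaultdict(list)
    for f, t, _s in logs:
        per_file[f].append(t)
    result = {}
    for fi, fj in combinations(sorted(per_file), 2):
        u = v = 0
        for tk in per_file[fi]:
            for tq in per_file[fj]:
                if abs(tq - tk) < t_c:   # within T_c: simultaneous access
                    v += 1
                if abs(tq - tk) < t_s:   # within T_s: correlated access
                    u += 1
        result[(fi, fj)] = (u, v)
    return result
```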
2.3) For any two small files f_i and f_j in the small file set F, compute the degree of correlation between f_i and f_j and build their correlation mapping as follows:
2.3.1) Let P(f_j|f_i) denote the conditional probability that file f_j is accessed in correlation with f_i after f_i is accessed, and let P(f_i f_j) denote the probability that f_i and f_j are accessed simultaneously. With N the total number of log records, compute P(f_j|f_i) and P(f_i f_j) as:
P(f_j|f_i) = u_ij / c_i
P(f_i f_j) = v_ij / N
2.3.2) Let w_ij denote the degree of correlation between f_i and f_j: w_ij ← W_r × P(f_j|f_i) + W_c × P(f_i f_j).
2.3.3) Add f_i, f_j and w_ij to the file correlation mapping set S as the triple (f_i, f_j, w_ij): S ← S ∪ {(f_i, f_j, w_ij)}.
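The correlation degree of step 2.3) follows directly from the counts above. The short sketch below assumes the outputs of the hypothetical helpers shown earlier and computes P(f_j|f_i) = u_ij / c_i, P(f_i f_j) = v_ij / N and the weighted correlation w_ij = W_r·P(f_j|f_i) + W_c·P(f_i f_j).

```python
def correlation_map(c, co_freq, n_logs, w_r=0.6, w_c=0.4):
    """Step 2.3: build the correlation mapping set S = {(f_i, f_j): w_ij}."""
    s = {}
    for (fi, fj), (u_ij, v_ij) in co_freq.items():
        p_cond = u_ij / c[fi] if c[fi] else 0.0   # P(f_j | f_i): correlated access probability
        p_sim = v_ij / n_logs                     # P(f_i f_j): simultaneous access probability
        s[(fi, fj)] = w_r * p_cond + w_c * p_sim  # w_ij
    return s
```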
(3) Small file classification and staging
3.1) Read one small file f_i from the small file set F and perform the following steps on f_i:
3.2) Traverse each staging set q_k (1 ≤ k ≤ K) in the universal staging set Q:
3.2.1) Compute the average correlation between f_i and the staging set q_k. Let f_j (j ∈ [1, n_k]) denote each small file staged in q_k, where n_k is the number of small files in q_k, and let a_ik denote the average correlation between f_i and q_k. If q_k is empty, then a_ik = 0; otherwise compute a_ik = (Σ_{j=1}^{n_k} w_ij) / n_k.
If a_ik > θ_min, remove q_k from Q and add it to the universal candidate set H.
3.3) Check whether the universal candidate set H is empty. If H is empty, there is no candidate set; jump to step 3.7). If H is not empty, continue with step 3.4).
3.4) Let q_min denote the candidate set with the least used space, initialized to an arbitrary set. Let size_min denote the used space of q_min; initially size_min ← V_max.
3.5) Traverse each candidate set h_j (1 ≤ j ≤ K', where K' is the number of candidate sets) in H and compute its currently used space size_j = Σ_{k=1}^{h_n} s_k, where h_n is the number of small files in h_j and s_k is the size of the k-th small file in h_j.
If size_j < size_min, then q_min ← h_j and size_min ← size_j.
3.6) Check whether f_i can be added to q_min: if s_i + size_min ≤ V_max, add f_i to the candidate set q_min, remove all candidate sets h_j from H and put them back into the universal staging set Q, then execute step 3.8). Otherwise continue with step 3.7).
3.7) Add f_i to any empty staging set in the universal staging set Q. If there is no empty staging set in Q, jump directly to step (6).
3.8) Read the next unprocessed small file in the small file set and repeat from step 3.2) until all small files have been processed.
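A compact sketch of the classification and staging logic of step (3), assuming the correlation map from the previous sketch and a dictionary of file sizes in bytes; staging sets are plain Python lists and the helper names are illustrative. Rather than physically moving sets between Q and H, the sketch filters candidates in place, which is behaviorally equivalent: among all staging sets whose average correlation with the incoming file exceeds θ_min, the one with the least used space that still has room is chosen, mirroring steps 3.2) through 3.7).

```python
def pair_key(a, b):
    return (a, b) if a < b else (b, a)

def classify_and_stage(files, sizes, corr, k, v_max, theta_min):
    """Step 3: distribute small files into K staging sets by average correlation and capacity."""
    staging = [[] for _ in range(k)]
    for f in files:
        # Candidate sets: non-empty staging sets whose average correlation exceeds theta_min.
        candidates = []
        for q in staging:
            if not q:
                continue
            avg = sum(corr.get(pair_key(f, g), 0.0) for g in q) / len(q)
            if avg > theta_min:
                candidates.append(q)
        # Among the candidates, pick the one with the least used space (size_min starts at V_max).
        best, best_size = None, v_max
        for q in candidates:
            used = sum(sizes[g] for g in q)
            if used < best_size:
                best, best_size = q, used
        if best is not None and sizes[f] + best_size <= v_max:
            best.append(f)                 # step 3.6: the file fits into the chosen set
            continue
        empty = next((q for q in staging if not q), None)
        if empty is None:
            break                          # no empty staging set left: stop (step (6))
        empty.append(f)                    # step 3.7: fall back to an empty staging set
    return staging
```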
(4) Small file merging and storage
4.1) Read one staging set q_k from the set Q and perform the following operations:
4.2) Check whether the staging set q_k has reached the minimum capacity utilization R_min for merging: if size_k / V_max ≥ R_min, continue with the following steps; otherwise execute step 4.7) directly.
4.3) Define f_new as a new empty file.
4.4) Traverse each small file f_i in the staging set q_k, append the content of each f_i to the empty file f_new, and delete the original small file f_i and its replicas from HDFS.
4.5) Save the merged file f_new to HDFS.
4.6) Empty the staging set q_k.
4.7) Read the next unprocessed staging set in Q and repeat from step 4.2) until all staging sets have been processed.
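A sketch of the merge-and-store logic of step (4). To stay self-contained it merges local files rather than calling a real HDFS client: the local-filesystem stand-ins (open, os.remove) are an explicit simplification of the append, write and delete operations that the DataNode-side merge execution module would perform through HDFS, and the replica deletion handled by HDFS itself is not modeled.

```python
import os

def merge_staging_set(staging_set, sizes, merged_path, v_max, r_min):
    """Step 4: merge one staging set into a single file if its utilization reaches R_min."""
    used = sum(sizes[f] for f in staging_set)
    if used / v_max < r_min:                        # step 4.2: not full enough, skip this period
        return False
    with open(merged_path, "wb") as f_new:          # step 4.3: new empty file f_new
        for path in staging_set:                    # step 4.4: append each small file's content
            with open(path, "rb") as src:
                f_new.write(src.read())
            os.remove(path)                         # delete the original (replicas not modeled)
    staging_set.clear()                             # step 4.6: empty the staging set
    return True                                     # step 4.5 done: merged file persisted
```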
(5) Loop back: check whether the next merge period has arrived. If it has, jump back to step (2) and re-execute; otherwise go to step (6).
(6) End: stop the whole small file classification and merging process.
To implement this method, the invention adds a file access information collection module on each DataNode to collect the file access activity of that DataNode. The invention also adds a file access information statistical analysis module on the NameNode, corresponding to step (2), which aggregates the file access histories collected by the DataNodes, analyzes them to quantify the degree of correlation between files, and obtains the file correlation mapping set. The invention further adds a file merge decision module on the NameNode, corresponding to step (3), which classifies and stages small files according to the correlation between files so that highly correlated small files are placed in the same staging set, thereby forming the merge decision for the small files. Finally, the invention adds a file merge execution module on each DataNode, corresponding to step (4): after the NameNode completes the small file merge decision, it directs the DataNodes to execute the merge strategy, complete the merge of the small files, write the merged file to HDFS, delete the original small files and their replicas, and update the corresponding metadata.
Description of the drawings
Fig. 1 is the deployment diagram of the HDFS platform on which the method relies.
Fig. 2 shows the software modules added by the method and their interactions.
Fig. 3 is the overall flowchart of the method.
Fig. 4 is the execution flow of the file access analysis.
Fig. 5 is the execution flowchart of small file classification and staging.
Fig. 6 is the execution flowchart of small file merging and storage.
Fig. 7 shows the file size distribution.
Fig. 8 shows the NameNode memory consumption.
Fig. 9 compares the access efficiency of the three storage modes.
Specific embodiment
The invention is described below with reference to the accompanying drawings and a specific embodiment.
The small file storage optimization method proposed by the invention can rely on the existing HDFS architecture and is realized by modifying and adding the corresponding software modules. Fig. 1 is the platform deployment diagram of the method: the platform consists of multiple computer servers (nodes) connected through a network. The nodes fall into two categories: one master node (the NameNode) and multiple slave nodes (DataNodes). On the NameNode, in addition to the original file system metadata management module and DataNode management module, the file access information statistical analysis module and the file merge decision module required by this method are added. On each DataNode, in addition to the original data block storage management module, the file access information collection module and the file merge execution module are added.
Fig. 2 shows the software modules newly added to the HDFS platform on which the method relies and their interactions. The shaded modules are the ones added to implement this method, namely the file access information statistical analysis module and the file merge decision module added on the NameNode, and the file access information collection module and the file merge execution module added on the DataNodes. The unshaded modules are the existing modules of the original HDFS. The interaction flow between the modules is as follows: (1) The DataNode management module on the NameNode sends a message to every DataNode in the cluster requesting the collection of file access information. The DataNode management module is an original module of the NameNode; it manages all DataNodes in the cluster and implements the NameNode's control over them. (2) The file access information collection module gathers the file access information on its DataNode and, once collection is complete, sends it back to the NameNode, where the access information is aggregated. (3) The file access information statistical analysis module statistically analyzes all the collected access information, computes the degree of correlation between files, and forms the file correlation mapping set. (4) Based on the resulting mapping set, the file merge decision module added on the NameNode classifies and stages the small files according to their correlation and forms the classification and merge strategy. (5) Through the DataNode management module, the NameNode directs the DataNodes to execute the merge strategy. (6) The file merge execution module performs the file merge operations on its DataNode; during merging it relies on the DataNode's original data block storage management module to complete data writes and deletions, and it notifies the file system metadata management module so that the metadata is updated accordingly. Because the small files are merged and stored according to this strategy, the method effectively reduces the NameNode memory overhead caused by maintaining the metadata of numerous small files.
Fig. 3 is the overall execution flowchart of the method, divided into five steps: initialization, file access analysis, small file classification and staging, small file merging and storage, and looping back. To better illustrate the execution flow, assume that 4 small files already stored in HDFS have been read, namely file1, file2, file3 and file4, and that the log records of these 4 files have been obtained from the historical access log; assume the 20 simplified access log records are as follows:
The basic parameter values used by the method are as follows: the time-interval threshold for simultaneous access of small files T_c = 1s, i.e. an interval of at most 1s counts as simultaneous access; the time-interval threshold for correlated access T_s = 2s, i.e. an interval greater than 1s and at most 2s counts as correlated access; the staging set capacity upper limit V_max = 64MB; the minimum capacity utilization when a staging set is merged R_min = 0.9; the minimum correlation threshold between small files θ_min = 0.5; the weight of the correlated access probability in the correlation degree W_r = 0.6; the weight of the simultaneous access probability in the correlation degree W_c = 0.4; and the merge period T_cycle = 86400s.
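Using the hypothetical MergeConfig sketch given in the summary section, the parameter values of this embodiment would be instantiated as follows (illustrative only):

```python
config = MergeConfig(t_c=1.0, t_s=2.0, v_max=64 * 2**20,
                     r_min=0.9, theta_min=0.5,
                     w_r=0.6, w_c=0.4, t_cycle=86400)
config.validate()  # all values fall inside the ranges defined for the method
```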
A specific embodiment of the whole method is given below with reference to Fig. 3:
(1) Initialization
1.1) Let the set of small files stored in HDFS be F = {file1, file2, file3, file4}; the total number of small files is M = 4.
1.2) Let the small file access log set be L = {(file1, 2018/01/01 12:10:01, 10240), (file3, 2018/01/01 12:10:02, 30720), …, (file4, 2018/01/01 12:11:51, 40960)}; the total number of log records is N = 20.
1.3) Build the individual access frequency set C; initially C = {}. Build the co-access frequency set I; initially I = {}.
1.4) Build the small file correlation mapping set S; initially S = {}.
1.5) Build the universal staging set Q; initially Q = {{}, {}, …, {}}, and the number of empty sets in Q is K = 10.
1.6) Build the universal candidate set H; initially H = {}.
(2) File access analysis
2.1) Traverse each small file in the small file set F and compute the frequency with which each small file is individually accessed. Assume the small file currently taken is file1; the computation is as follows:
2.1.1) Let c_1 denote the access frequency of the current small file file1 and initialize c_1 = 0.
2.1.2) Traverse each log (f_k, t_k, s_k) in the log set L, where 1 ≤ k ≤ N; if f_k = file1, then c_1 = c_1 + 1.
2.1.3) Add file1 and its individual access frequency c_1 to the individual access frequency set C; at this point C = {(file1, c_1)}.
After the traversal of step 2.1) completes, the individual access frequencies of all files have been counted; the final set C is: C = {(file1, 5), (file2, 5), (file3, 4), (file4, 6)}.
2.2) For any two small files in the small file set F, compute their co-access frequencies within the time intervals T_c and T_s. Assume the two small files currently taken are file1 and file2; the computation is as follows:
2.2.1) From the log record set L, select the log record subset L_1 corresponding to file1, sorted by time in descending order: L_1 = {(file1, 2018/01/01 12:11:50, 10240), (file1, 2018/01/01 12:11:30, 10240), …, (file1, 2018/01/01 12:10:01, 10240)}.
2.2.2) From the log record set L, select the log record subset L_2 corresponding to file2, sorted by time in descending order: L_2 = {(file2, 2018/01/01 12:11:31, 20480), (file2, 2018/01/01 12:11:18, 20480), …, (file2, 2018/01/01 12:10:02, 20480)}.
2.2.3) Let u_12 denote the frequency with which file1 and file2 are accessed within the time interval T_s, initialized u_12 = 0; let v_12 denote the frequency with which they are accessed within the time interval T_c, initialized v_12 = 0.
2.2.4) For each log record (file1, t_k, s_k) in L_1, traverse each log record (file2, t_q, s_q) in L_2; if |t_q − t_k| < T_c, then v_12 = v_12 + 1; if |t_q − t_k| < T_s, then u_12 = u_12 + 1.
2.2.5) Add file1, file2 and the co-access frequencies u_12, v_12 to the co-access frequency set I as the quadruple (file1, file2, u_12, v_12): I = {(file1, file2, u_12, v_12)}.
After step 2.2) completes, the set I records the co-access frequencies of every pair of files:
I = {(file1, file2, 4, 4), (file1, file3, 5, 4), (file1, file4, 1, 1), (file2, file3, 3, 2), (file2, file4, 1, 0), (file3, file4, 1, 1)}.
2.3) For any two small files in the small file set F, compute the degree of correlation between them and build the correlation mapping. Assume the two files taken are file1 and file2; the steps are as follows:
2.3.1) Let P(file2|file1) denote the conditional probability that file2 is accessed in correlation with file1 after file1 is accessed, and let P(file1 file2) denote the probability that file1 and file2 are accessed simultaneously. Compute P(file2|file1) and P(file1 file2), where N = 20 is the total number of log records:
P(file2|file1) = u_12 / c_1 = 0.8
P(file1 file2) = v_12 / N = 0.2
2.3.2) Let w_12 denote the degree of correlation between file1 and file2: w_12 = W_r × 0.8 + W_c × 0.2 = 0.56.
2.3.3) Add file1, file2 and w_12 to the file correlation mapping set S as the triple (file1, file2, w_12): S = {(file1, file2, w_12)}.
After step 2.3) completes, the final file correlation mapping set is: S = {(file1, file2, 0.56), (file1, file3, 0.68), (file1, file4, 0.14), (file2, file3, 0.4), (file2, file4, 0.12), (file3, file4, 0.17)}.
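The example figures for file1 and file2 can be reproduced from the counts in set C and set I. The short check below, assuming the dictionaries produced by the earlier illustrative sketches, confirms w_12 = 0.6 × (4/5) + 0.4 × (4/20) = 0.56.

```python
c = {"file1": 5, "file2": 5, "file3": 4, "file4": 6}
co = {("file1", "file2"): (4, 4)}     # (u_12, v_12) from set I
n_logs = 20

u_12, v_12 = co[("file1", "file2")]
p_cond = u_12 / c["file1"]            # P(file2 | file1) = 0.8
p_sim = v_12 / n_logs                 # P(file1 file2) = 0.2
w_12 = 0.6 * p_cond + 0.4 * p_sim     # W_r * 0.8 + W_c * 0.2
print(round(w_12, 2))                 # 0.56
```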
(3) Small file classification and staging
To simplify the demonstration of the computation, assume that the small files file1 and file2 have already been placed in the staging set q_1, i.e. q_1 = {file1, file2} and Q = {{file1, file2}, {}, …, {}}.
3.1) Take the small file file3 and perform the following steps on it:
3.2) Traverse each staging set q_k in the universal staging set Q:
3.2.1) Compute the average correlation between file3 and q_k. Taking q_1 as an example, the average correlation of file3 with q_1 is a_31 = (w_31 + w_32) / 2 = (0.68 + 0.4) / 2 = 0.54. Since a_31 = 0.54 > θ_min = 0.5, q_1 is removed from Q and added to the universal candidate set H. Similarly, continue computing the average correlation between file3 and the remaining staging sets.
3.3) Check whether the universal candidate set H is empty; if H were empty, the method would jump to step 3.7). Since q_1 is already in H, H is not empty, so step 3.4) is executed.
3.4) Let q_min denote the candidate set with the least used space, initialized to an arbitrary set; let size_min denote the used space of q_min, initially size_min = V_max.
3.5) Traverse each candidate set in the universal candidate set H and compute its currently used space. Taking q_1 as an example, q_1 contains file1 and file2, so its used space is size_1 = s_1 + s_2 = 30MB, which satisfies size_1 < size_min; therefore q_min = q_1 and size_min = size_1.
3.6) Since s_3 + size_min = 60MB ≤ V_max holds, the small file file3 can be added to the candidate set q_min. All candidate sets are removed from H and put back into the universal staging set Q, and step 3.8) is executed; at this point H = {} and Q = {{file1, file2, file3}, {}, …, {}}. If s_3 + size_min ≤ V_max did not hold, file3 could not enter q_min because of insufficient capacity, and step 3.7) would be executed.
3.7) Add the small file file3 to any empty staging set in the universal staging set Q, for example to the empty staging set q_2, so that q_2 = {file3} and Q = {{file1, file2}, {file3}, {}, …, {}}. If no empty staging set remained in Q, the method would jump directly to step (6).
3.8) Take the next small file in the small file set and repeat from step 3.2) until all small files in F have been taken.
(4) Small file merging and storage
4.1) Take one staging set from the set Q; assume the chosen set is q_1. Perform the following operations:
4.2) Check whether the staging set q_1 has reached the minimum capacity utilization R_min = 0.9 required for merging. q_1 contains the small files file1, file2 and file3, so its capacity utilization is size_1 / V_max = (10MB + 20MB + 30MB) / 64MB = 0.94 ≥ R_min; continue with step 4.3). If size_1 / V_max < R_min, step 4.7) is executed directly.
4.3) Define f_new as a new empty file.
4.4) Traverse each small file in the staging set q_1, append the contents of file1, file2 and file3 to the empty file f_new, and delete the original small files file1, file2, file3 and their replicas from HDFS.
4.5) Save the merged file f_new to HDFS.
4.6) Empty the staging set q_1.
4.7) Take the next staging set in Q and repeat from step 4.2) until every staging set has been taken.
(5) Loop back: within this execution period, after the small file merging and storage process completes, check whether the next execution period has arrived; when the next period starts, i.e. when the specified period T_cycle has elapsed, jump back to step (2) and re-execute; otherwise go to step (6).
(6) End: stop the whole small file classification and merging process.
For the HDFS-based massive small file storage optimization method proposed by the present invention (hereinafter the MBDC algorithm), the inventors carried out relevant performance tests. In the tests, 3132 files and their access logs were collected; the file sizes ranged from 100KB to 120MB, and small files of 5MB or below accounted for 96% of the total number of files. The file size distribution is shown in Fig. 7.
Compared with the plain storage mode that applies no merging strategy, both the PS algorithm (an existing classic merging algorithm) and the MBDC algorithm (the method of the present invention), which use merging strategies, greatly reduce the memory overhead of the NameNode, improving small file storage performance by 98% and 97.8% respectively. This is because both PS and MBDC merge the massive small files before storing them. Since MBDC puts more emphasis on file correlation, compared with the PS merging algorithm, which only considers the file size distribution and ignores file correlation, it wastes some residual storage space, so the final number of merged files produced by MBDC is slightly larger than that of PS, making MBDC slightly inferior to PS in terms of NameNode memory overhead. The NameNode memory consumption under the different algorithms is shown in Fig. 8.
Compared with the default HDFS access mode and with access after storage using the PS algorithm, the MBDC algorithm greatly shortens the small file access time and improves small file access efficiency. This is mainly because MBDC places highly correlated files into the same merged file more thoroughly than PS, so that highly correlated small file contents are fetched together during access, effectively reducing the number of communications between the client and the NameNode and improving the overall data access efficiency. The access efficiency comparison is shown in Fig. 9.
Finally, it should be noted that the above example only illustrates the invention and does not limit the described technique; all technical solutions and improvements that do not depart from the spirit and scope of the invention shall be covered by the claims of the invention.

Claims (3)

1. A massive small file storage optimization method based on HDFS, characterized in that it is divided into five steps: initialization, file access analysis, small file classification and staging, small file merging and storage, and looping back; it has the following basic parameters: the time-interval threshold T_c for simultaneous access of small files, the time-interval threshold T_s for correlated access of small files, the upper capacity limit V_max of a staging set, the minimum capacity utilization R_min required before a staging set is merged, the minimum correlation threshold θ_min of small files, the weight W_r of the correlated access probability in the correlation degree, the weight W_c of the simultaneous access probability in the correlation degree, and the merge execution period T_cycle; the parameter ranges are: T_c in (0s, 1s], T_s in (1s, 5s], V_max in [64MB, 1024MB], R_min in [0.8, 1], and θ_min in [0.3, 1]; W_r and W_c satisfy W_r + W_c = 1, with W_r in [0.6, 0.9] and W_c in [0.1, 0.4], and T_cycle is at least 3600s;
The method is realized according to the following steps:
(1) Initialization
1.1) Let the set of small files stored in HDFS be F, F = {f_i | 1 ≤ i ≤ M}, where f_i denotes any small file and M is the total number of small files;
1.2) Let the set of small file access logs be L, L = {l_i | 1 ≤ i ≤ N}; any l_i ∈ L is expressed as a triple (f_i, t_i, s_i), where f_i is the small file recorded by the log, t_i is its access time, and s_i is its size;
1.3) Build the individual access frequency set C, C = {(f_i, c_i) | 1 ≤ i ≤ M}, where c_i is the access frequency of small file f_i; C is initialized to the empty set; similarly, build the co-access frequency set I, I = {(f_i, f_j, u_ij, v_ij) | 1 ≤ i < M, i < j ≤ M}, where u_ij is the frequency with which small files f_i and f_j are accessed within the time interval T_s and v_ij is the frequency with which they are accessed within the time interval T_c; I is initialized to the empty set;
1.4) Build the small file correlation mapping set S, S = {s_ij | 1 ≤ i < M, i < j ≤ M}; any s_ij ∈ S is expressed as a triple (f_i, f_j, w_ij), where f_i and f_j are the two distinct small files in the relation and w_ij is their degree of correlation; S is initialized to the empty set;
1.5) Build the universal staging set Q, Q = {q_i | 1 ≤ i ≤ K}, where K is the number of staging sets in Q; for any q_i ∈ Q, q_i = {f_j | 1 ≤ j ≤ n_i}, where n_i is the number of small files staged in q_i; Q is initialized as a collection of K empty sets;
1.6) Build the universal candidate set H, H = {h_i | 1 ≤ i ≤ K'}, where K' is the number of candidate sets in H; for any h_i ∈ H, h_i = {f_j | 1 ≤ j ≤ n_i}, where n_i is the number of small files in h_i; H is initialized to the empty set;
(2) File access analysis
2.1) Traverse each small file f_i (1 ≤ i ≤ M) in the small file set F and compute the individual access frequency of each f_i as follows:
2.1.1) Define c_i as the access frequency of the current small file f_i and initialize c_i to 0;
2.1.2) Traverse each log l_k (1 ≤ k ≤ N) in the log set L, where l_k is expressed as a triple (f_k, t_k, s_k); if f_i = f_k, increase c_i by 1;
2.1.3) Add the small file f_i and its individual access frequency c_i to the individual access frequency set C as the pair (f_i, c_i);
2.2) For any two small files f_i and f_j in the small file set F, compute their co-access frequencies within the time intervals T_c and T_s as follows:
2.2.1) From the log record set L, select the log record subset L_i = {(f_i, t_k, s_k) | 1 ≤ k ≤ N_i} corresponding to small file f_i; L_i is the log record set of f_i sorted by time in descending order;
2.2.2) From the log record set L, select the log record subset L_j = {(f_j, t_q, s_q) | 1 ≤ q ≤ N_j} corresponding to small file f_j;
L_j is the log record set of f_j sorted by time in descending order;
2.2.3) Let u_ij denote the frequency with which f_i and f_j are accessed within the time interval T_s, initialized u_ij ← 0; let v_ij denote the frequency with which f_i and f_j are accessed within the time interval T_c, initialized v_ij ← 0;
2.2.4) For each log record (f_i, t_k, s_k) in L_i, traverse each log record (f_j, t_q, s_q) in L_j; if |t_q − t_k| < T_c, count it as simultaneous access and increase v_ij by 1; if |t_q − t_k| < T_s, count it as correlated access and increase u_ij by 1;
2.2.5) Add f_i, f_j and the co-access frequencies u_ij, v_ij to the co-access frequency set I as the quadruple (f_i, f_j, u_ij, v_ij);
2.3) For any two small files f_i and f_j in the small file set F, compute the degree of correlation between f_i and f_j and build their correlation mapping as follows:
2.3.1) Let P(f_j|f_i) denote the conditional probability that f_j is accessed in correlation with f_i after f_i is accessed, and let P(f_i f_j) denote the probability that f_i and f_j are accessed simultaneously; with N the total number of log records, compute P(f_j|f_i) and P(f_i f_j) as:
P(f_j|f_i) = u_ij / c_i
P(f_i f_j) = v_ij / N
2.3.2) Let w_ij denote the degree of correlation between f_i and f_j: w_ij ← W_r × P(f_j|f_i) + W_c × P(f_i f_j);
2.3.3) Add f_i, f_j and w_ij to the file correlation mapping set S as the triple (f_i, f_j, w_ij);
(3) Small file classification and staging
3.1) Read one small file f_i from the small file set F and perform the following steps on f_i:
3.2) Traverse each staging set q_k (1 ≤ k ≤ K) in the universal staging set Q:
3.2.1) Compute the average correlation between f_i and the staging set q_k; let f_j denote each small file staged in q_k, with j ∈ [1, n_k] and n_k the number of small files in q_k, and let a_ik denote the average correlation between f_i and q_k; if q_k is empty, then a_ik = 0; otherwise compute a_ik = (Σ_{j=1}^{n_k} w_ij) / n_k;
if a_ik > θ_min, remove q_k from Q and add it to the universal candidate set H;
3.3) Check whether the universal candidate set H is empty; if H is empty, there is no candidate set, and the method jumps to step 3.7); if H is not empty, continue with step 3.4);
3.4) Let q_min denote the candidate set with the least used space, initialized to an arbitrary set; let size_min denote the used space of q_min, initially assigning V_max to size_min;
3.5) Traverse each candidate set h_j (1 ≤ j ≤ K', where K' is the number of candidate sets) in H and compute its currently used space size_j = Σ_{k=1}^{h_n} s_k, where h_n is the number of small files in h_j and s_k is the size of the k-th small file in h_j;
if size_j < size_min, assign h_j to q_min and size_j to size_min;
3.6) Check whether f_i can be added to q_min: if s_i + size_min ≤ V_max, add f_i to the candidate set q_min, remove all candidate sets h_j from H and put them back into the universal staging set Q, then execute step 3.8); otherwise continue with step 3.7);
3.7) Add f_i to any empty staging set in the universal staging set Q; if there is no empty staging set in Q, jump directly to step (6);
3.8) Read the next unprocessed small file in the small file set and repeat from step 3.2) until all small files have been processed;
(4) Small file merging and storage
4.1) Read one staging set q_k from the set Q and perform the following operations:
4.2) Check whether the staging set q_k has reached the minimum capacity utilization R_min for merging: if size_k / V_max ≥ R_min, continue with the following steps; otherwise execute step 4.7) directly;
4.3) Define f_new as a new empty file;
4.4) Traverse each small file f_i in the staging set q_k, append the content of each f_i to the empty file f_new, and delete the original small file f_i and its replicas from HDFS;
4.5) Save the merged file f_new to HDFS;
4.6) Empty the staging set q_k;
4.7) Read the next unprocessed staging set in Q and repeat from step 4.2) until all staging sets have been processed;
(5) Loop back: check whether the next merge period has arrived; if it has, jump back to step (2) and re-execute; otherwise go to step (6);
(6) End: stop the whole small file classification and merging process.
2. The HDFS-based massive small file storage optimization method according to claim 1, characterized in that: a file access information collection module is added on each DataNode to collect the file access activity of that DataNode; a file access information statistical analysis module, corresponding to step (2), is added on the NameNode to aggregate the file access histories collected by the DataNodes, analyze them to quantify the degree of correlation between files, and obtain the file correlation mapping set; a file merge decision module, corresponding to step (3), is added on the NameNode to classify and stage small files according to the correlation between files so that highly correlated small files are placed in the same staging set, thereby forming the merge decision for the small files; and a file merge execution module, corresponding to step (4), is added on each DataNode, so that after the NameNode completes the small file merge decision, the DataNodes are directed to execute the merge strategy, complete the merge of the small files, write the merged file to HDFS, and delete the original small files and their replicas.
3. The HDFS-based massive small file storage optimization method according to claim 1, characterized in that: the method relies on the existing HDFS architecture and is realized by modifying and adding the corresponding software modules; the platform of the massive small file storage optimization method consists of multiple computer servers, i.e. nodes, connected through a network; the platform nodes fall into two categories: one master node, i.e. the NameNode, and multiple slave nodes, i.e. DataNodes; on the NameNode, in addition to the original file system metadata management module and DataNode management module, the file access information statistical analysis module and file merge decision module required by the method are added; on each DataNode, in addition to the original data block storage management module, the file access information collection module and file merge execution module are added;
among the software modules added to the HDFS platform, namely the file access information statistical analysis module and file merge decision module added on the NameNode and the file access information collection module and file merge execution module added on the DataNodes, the interaction flow is as follows: (1) the DataNode management module sends a message requesting the collection of file access information; the DataNode management module is an original module of the NameNode, which manages all DataNodes in the cluster and implements the NameNode's control over them; (2) the file access information collection module gathers the file access information on its DataNode and, once collection is complete, sends it back to the NameNode, where the access information is aggregated; (3) the file access information statistical analysis module statistically analyzes all the collected access information, computes the degree of correlation between files, and forms the file correlation mapping set; (4) based on the resulting file correlation mapping set, the file merge decision module added on the NameNode classifies and stages the small files according to their degree of correlation and forms the classification and merge strategy; (5) through the DataNode management module, the NameNode directs the DataNodes to execute the merge strategy according to the classification and merge strategy; (6) the file merge execution module performs the file merge operations on its DataNode; during merging it relies on the DataNode's original data block storage management module to complete data writes and deletions, and it notifies the file system metadata management module so that the metadata is updated accordingly.
CN201910175055.9A 2019-03-08 2019-03-08 Mass small file storage optimization method based on HDFS Expired - Fee Related CN110018997B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910175055.9A CN110018997B (en) 2019-03-08 2019-03-08 Mass small file storage optimization method based on HDFS

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910175055.9A CN110018997B (en) 2019-03-08 2019-03-08 Mass small file storage optimization method based on HDFS

Publications (2)

Publication Number Publication Date
CN110018997A (en) 2019-07-16
CN110018997B CN110018997B (en) 2021-07-23

Family

ID=67189371

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910175055.9A Expired - Fee Related CN110018997B (en) 2019-03-08 2019-03-08 Mass small file storage optimization method based on HDFS

Country Status (1)

Country Link
CN (1) CN110018997B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101854388A (en) * 2010-05-17 2010-10-06 浪潮(北京)电子信息产业有限公司 Method and system concurrently accessing a large amount of small documents in cluster storage
CN103500089A (en) * 2013-09-18 2014-01-08 北京航空航天大学 Small file storage system suitable for Mapreduce calculation model
US20150125133A1 (en) * 2013-11-06 2015-05-07 Konkuk University Industrial Cooperation Corp. Method for transcoding multimedia, and hadoop-based multimedia transcoding system for performing the method
CN106446079A (en) * 2016-09-08 2017-02-22 中国科学院计算技术研究所 Distributed file system-oriented file prefetching/caching method and apparatus
CN108710639A (en) * 2018-04-17 2018-10-26 桂林电子科技大学 A kind of mass small documents access optimization method based on Ceph

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
XUN CAI ET AL: "An optimization strategy of massive small files storage based on HDFS", 2018 Joint International Advanced Engineering and Technology Research Conference *
周国安 et al.: "A survey of massive small file storage techniques in cloud environments" (云环境下海量小文件存储技术研究综述), 《信息网络安全》 (Information Network Security) *
董其文: "Research on small file storage methods based on HDFS" (基于HDFS的小文件存储方法的研究), 《中国优秀硕士学位论文全文数据库 信息科技辑》 (China Master's Theses Full-text Database, Information Science and Technology) *
邹振宇 et al.: "A small file optimization scheme for HDFS-based cloud storage systems" (基于HDFS的云存储系统小文件优化方案), 《计算机工程》 (Computer Engineering) *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111639054A (en) * 2020-05-29 2020-09-08 中国人民解放军国防科技大学 Data coupling method, system and medium for ocean mode and data assimilation
CN111639054B (en) * 2020-05-29 2023-11-07 中国人民解放军国防科技大学 Data coupling method, system and medium for ocean mode and data assimilation
CN117170590A (en) * 2023-11-03 2023-12-05 沈阳卓志创芯科技有限公司 Computer data storage method and system based on cloud computing
CN117170590B (en) * 2023-11-03 2024-01-26 沈阳卓志创芯科技有限公司 Computer data storage method and system based on cloud computing
CN117519608A (en) * 2023-12-27 2024-02-06 泰安北航科技园信息科技有限公司 Big data server with Hadoop as core
CN117519608B (en) * 2023-12-27 2024-03-22 泰安北航科技园信息科技有限公司 Big data server with Hadoop as core
CN118132520A (en) * 2024-05-08 2024-06-04 济南浪潮数据技术有限公司 Storage system file processing method, electronic device, storage medium and program product

Also Published As

Publication number Publication date
CN110018997B (en) 2021-07-23


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210723