CN110018997A - An HDFS-based storage optimization method for massive small files - Google Patents
- Publication number: CN110018997A
- Authority: CN (China)
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
An HDFS-based storage optimization method for massive small files, belonging to the field of storage performance optimization, comprises: initialization, file access analysis, small-file classification and staging, small-file merging and storage, and backtracking. The method analyzes the historical access logs of the files, determines the file access patterns, computes the degree of correlation between files, and forms a file-correlation mapping set. Based on this mapping set, the small files are classified and staged, with highly correlated small files staged together while the size distribution of the small files is also taken into account. Finally, the staged small files are merged and stored: the original small files and their replicas are deleted, and the merged large file is written into HDFS. By re-storing the massive small files originally held in HDFS in merged form, the method fully considers both the correlation between small files and their size distribution, significantly reduces the memory overhead of the NameNode, and improves the access efficiency of HDFS for small files.
Description
Technical field
The invention belongs to the field of storage performance optimization for distributed file systems, and in particular relates to an HDFS-based storage optimization method for massive small files.
Background art
Hadoop is a distributed computing framework developed by Apache as an open-source realization of Google's cloud-computing cluster design. HDFS (Hadoop Distributed File System), one of Hadoop's core modules, provides the underlying mass-file storage for the whole framework. HDFS uses a master-slave architecture: the NameNode, as the master node, manages the metadata of the entire distributed file system, while the DataNodes, as slave nodes, store the data locally. The design of the HDFS storage structure favors the efficient storage of large files, but when handling massive numbers of small files, the storage and access efficiency of HDFS drop sharply. The excessive metadata generated by massive small files occupies the NameNode's memory, so the system's file-storage capacity reaches a bottleneck. In addition, massive small files sharply increase the time overhead of small-file accesses, reducing the file access efficiency of the whole system.
The current solution to the HDFS small-file storage problem is to merge small files before storing them. Existing merging schemes merge by the degree of correlation between small files: according to rules established for a particular task scenario, highly correlated small files are merged into one file, reducing the number of small files. However, the existing correlation-based merging schemes depend on a specific task scenario and lack a general method for computing the correlation, and they do not account for the size distribution of the small files, so the merged result often fails to make full use of the storage space. During merging, the present method considers both the correlation between small files and their size distribution. In the correlation computation, the method counts file access frequencies under two different time-interval thresholds, covering both simultaneous access and correlated access, and quantifies the degree of file correlation through a reasonable weighting scheme, so that the method does not depend on a specific task scenario and has good generality. In addition, by also considering the file size distribution, the method makes fuller use of the storage space and lays the foundation for efficient access to the small files after merged storage.
Summary of the invention
Aiming at the low storage efficiency of the above HDFS distributed file storage system for massive small files, the present method statistically analyzes the historical access logs of the many files, computes the degree of correlation between files, quantifies that degree of correlation, and obtains the correlation mapping set between files. According to this mapping set, the small files are classified and staged, and finally merged and stored into HDFS. The method fully considers the correlation between small files and their size distribution, improves the storage performance of HDFS for massive small files, and to a certain extent also benefits small-file access performance.
The storage optimization method of the present invention is divided into five steps: initialization, file access analysis, small-file classification and staging, small-file merging and storage, and backtracking. The method defines the following basic parameters: the time-interval threshold Tc for simultaneous access to small files; the time-interval threshold Ts for correlated access to small files; the staging-set capacity upper limit Vmax; the minimum capacity utilization Rmin a staging set must reach before merging; the minimum correlation threshold θmin between small files; the weight Wr of the correlated-access probability in the degree of correlation; the weight Wc of the simultaneous-access probability in the degree of correlation; and the merge execution period Tcycle. The parameter ranges are as follows: Tc lies in (0s, 1s]; Ts lies in (1s, 5s]; Vmax lies in [64MB, 1024MB]; Rmin lies in [0.8, 1]; θmin lies in [0.3, 1]. Wr and Wc satisfy Wr + Wc = 1, with Wr in [0.6, 0.9] and Wc in [0.1, 0.4]; Tcycle is at least 3600s.
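The parameter set above can be collected into one configuration object. The following is a minimal sketch under assumed names (the class and its fields are illustrative, not part of the patent); the `validate` method encodes the stated ranges and the constraint Wr + Wc = 1.

```python
from dataclasses import dataclass

MB = 2 ** 20  # bytes per megabyte

@dataclass
class MergeParams:
    """Illustrative parameter set; symbols follow the text."""
    Tc: float = 1.0          # simultaneous-access window, (0s, 1s]
    Ts: float = 2.0          # correlated-access window, (1s, 5s]
    Vmax: int = 64 * MB      # staging-set capacity cap, [64MB, 1024MB]
    Rmin: float = 0.9        # minimum utilization before merging, [0.8, 1]
    theta_min: float = 0.5   # minimum correlation threshold, [0.3, 1]
    Wr: float = 0.6          # weight of correlated-access probability, [0.6, 0.9]
    Wc: float = 0.4          # weight of simultaneous-access probability, [0.1, 0.4]
    Tcycle: int = 86400      # merge execution period, at least 3600s

    def validate(self) -> None:
        assert 0 < self.Tc <= 1 and 1 < self.Ts <= 5
        assert 64 * MB <= self.Vmax <= 1024 * MB
        assert 0.8 <= self.Rmin <= 1 and 0.3 <= self.theta_min <= 1
        assert 0.6 <= self.Wr <= 0.9 and 0.1 <= self.Wc <= 0.4
        assert abs(self.Wr + self.Wc - 1.0) < 1e-9  # Wr + Wc = 1
        assert self.Tcycle >= 3600
```

The defaults shown match the values used in the embodiment later in the text.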
The above method is realized on a computer according to the following steps:
(1) Initialization
1.1) Let the set of small files stored in HDFS be F, F = {fi | 1 ≤ i ≤ M}, where fi denotes any small file and M the total number of small files.
1.2) Let the small-file access log set be L, L = {li | 1 ≤ i ≤ N}. Any li ∈ L is represented as a triple (fi, ti, si), where fi is the small file recorded by the log entry, ti the access time, and si the small-file size.
1.3) Build the individual-access frequency set C, C = {(fi, ci) | 1 ≤ i ≤ M}, where ci is the access count of small file fi; C is initialized to the empty set. Similarly, build the simultaneous-access frequency set I, I = {(fi, fj, uij, vij) | 1 ≤ i < M, i < j ≤ M}, where uij is the number of times fi and fj are accessed within time interval Ts of each other and vij the number of times within Tc; I is initialized to the empty set.
1.4) Build the small-file correlation mapping set S, S = {sij | 1 ≤ i < M, i < j ≤ M}. Any sij ∈ S is represented as a triple (fi, fj, wij), where fi and fj are the two related small files and wij their degree of correlation; S is initialized to the empty set.
1.5) Build the staging-set universe Q, Q = {qi | 1 ≤ i ≤ K}, where K is the number of staging sets in Q. For any qi ∈ Q, qi = {fj | 1 ≤ j ≤ ni}, where ni is the number of small files staged in qi; Q is initialized as a collection of K empty sets.
1.6) Build the candidate-set universe H, H = {hi | 1 ≤ i ≤ K′}, where K′ is the number of candidate sets in H. For any hi ∈ H, hi = {fj | 1 ≤ j ≤ ni}, where ni is the number of small files in hi; H is initialized to the empty set.
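The bookkeeping structures of steps 1.1-1.6 can be sketched directly in Python. The variable names below mirror the symbols in the text and are purely illustrative; none of them come from the HDFS API.

```python
# Step 1.1: the small-file set F (M = 4 in the later example).
F = ["file1", "file2", "file3", "file4"]

# Step 1.2: the access log set L, a list of (file, access_time, size) triples.
L = []

# Step 1.3: C maps file -> individual access count; I maps a file pair
# (fi, fj) -> (u_ij, v_ij), the correlated/simultaneous pair counts.
C = {}
I = {}

# Step 1.4: S maps a file pair (fi, fj) -> correlation degree w_ij.
S = {}

# Step 1.5: the staging universe Q starts as K empty sets.
K = 10
Q = [set() for _ in range(K)]

# Step 1.6: the candidate universe H starts empty.
H = []
```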
(2) File access analysis
2.1) Traverse each small file fi (1 ≤ i ≤ M) in set F and compute its individual access frequency, as follows:
2.1.1) Let ci denote the access count of the current small file fi; initialize ci ← 0.
2.1.2) Traverse each log entry lk (1 ≤ k ≤ N) in log set L, where lk is the triple (fk, tk, sk); if fi = fk holds, then ci ← ci + 1.
2.1.3) Add the small file fi and its individual access count ci to the individual-access frequency set C as the pair (fi, ci): C ← C ∪ {(fi, ci)}.
2.2) For any two small files fi and fj in F, compute the number of times they are accessed within time intervals Tc and Ts of each other, as follows:
2.2.1) From log set L, select the log subset Li = {(fi, tk, sk) | 1 ≤ k ≤ Ni} for small file fi; Li is the log set of fi sorted by time in descending order.
2.2.2) From log set L, select the log subset Lj = {(fj, tq, sq) | 1 ≤ q ≤ Nj} for small file fj; Lj is the log set of fj sorted by time in descending order.
2.2.3) Let uij denote the number of times fi and fj are accessed within Ts of each other, initialized uij ← 0, and let vij denote the number of times within Tc, initialized vij ← 0.
2.2.4) For each log entry (fi, tk, sk) in Li, traverse each log entry (fj, tq, sq) in Lj: if |tq − tk| < Tc holds, count a simultaneous access, vij ← vij + 1; if |tq − tk| < Ts holds, count a correlated access, uij ← uij + 1.
2.2.5) Add the small files fi, fj and their counts as the quadruple (fi, fj, uij, vij) to the simultaneous-access frequency set I: I ← I ∪ {(fi, fj, uij, vij)}.
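Step 2.2.4 can be sketched as a nested loop over the two files' access timestamps. Note that because Tc < Ts, a pair counted as simultaneous is also counted as correlated, so uij ≥ vij always holds. The helper name is an assumption for illustration.

```python
def pair_counts(times_i, times_j, Tc=1.0, Ts=2.0):
    """Count v_ij (simultaneous, |dt| < Tc) and u_ij (correlated, |dt| < Ts)
    over all pairs of access timestamps, as in step 2.2.4."""
    u = v = 0
    for tk in times_i:
        for tq in times_j:
            dt = abs(tq - tk)
            if dt < Tc:   # simultaneous access within Tc
                v += 1
            if dt < Ts:   # correlated access within Ts (superset of the above)
                u += 1
    return u, v
```

For example, with accesses at t = 0 and t = 10 for one file and t = 0.5 and t = 11.5 for the other, the 0/0.5 pair is both simultaneous and correlated, while the 10/11.5 pair (gap 1.5s) is correlated only.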
2.3) For any two small files fi and fj in F, compute the degree of correlation between fi and fj and build the correlation mapping, as follows:
2.3.1) Let P(fj|fi) denote the conditional probability that fj is accessed in correlation with fi after fi has been accessed, and let P(fifj) denote the probability that fi and fj are accessed simultaneously. With N the total number of log records, compute P(fj|fi) and P(fifj):
P(fj|fi) = uij / ci
P(fifj) = vij / N
2.3.2) Let wij denote the degree of correlation between fi and fj: wij ← Wr × P(fj|fi) + Wc × P(fifj).
2.3.3) Add the small files fi, fj and their degree of correlation wij as the triple (fi, fj, wij) to the file-correlation mapping set S: S ← S ∪ {(fi, fj, wij)}.
(3) Small-file classification and staging
3.1) Read a small file fi from set F and apply the following steps to it:
3.2) Traverse each staging set qk (1 ≤ k ≤ K) in the staging universe Q:
3.2.1) Compute the average degree of correlation between fi and the staging set qk. Let fj (j ∈ [1, nk]) denote each small file staged in qk, nk being the number of small files in the set, and let aik denote the average correlation between fi and qk. If qk is empty, then aik = 0; otherwise aik is the mean of the correlation degrees wij over all fj in qk. If the condition aik > θmin is met, remove qk from Q and add it to the candidate universe H.
3.3) Check whether the candidate universe H is empty. If H is empty there is no candidate set; jump to step 3.7). If H is not empty, continue with step 3.4).
3.4) Let qmin denote the candidate set with the least used space, initialized to an arbitrary set, and let sizemin denote the size of qmin, with initially sizemin ← Vmax.
3.5) Traverse each candidate set hj (1 ≤ j ≤ nj, nj being the number of candidate sets) in H and compute its currently used space sizej as the sum of the sizes sk of the small files in hj. If sizej < sizemin is met, then qmin ← hj and sizemin ← sizej.
3.6) Check whether fi can be added to qmin: if si + sizemin ≤ Vmax, add fi to the candidate set qmin, return all candidate sets hj from H to the staging universe Q, and execute step 3.8). Otherwise, continue with step 3.7).
3.7) Add fi to any empty staging set in Q. If no empty staging set remains in Q, jump directly to step (6).
3.8) Read the next unprocessed small file in the small-file set and repeat from step 3.2) until all small files have been processed.
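The staging loop of steps 3.2-3.7 can be sketched as follows. The data shapes are assumptions for illustration: `S` maps an unordered file pair to its correlation degree, `Q` is the list of staging sets, and `sizes` maps file to size (in MB here). For simplicity the candidate sets are not physically removed from Q and returned, which does not change the outcome since the sets are mutated in place.

```python
def stage_file(f, size_f, sizes, Q, S, theta_min=0.5, Vmax=64):
    """Place small file f into a staging set per steps 3.2-3.7; returns
    True if f was staged, False if no empty set remained (step 6)."""
    # Steps 3.2-3.2.1: non-empty sets whose average correlation with f
    # exceeds theta_min become candidates.
    candidates = []
    for q in Q:
        if not q:
            continue
        avg = sum(S.get(frozenset({f, g}), 0.0) for g in q) / len(q)
        if avg > theta_min:
            candidates.append(q)
    # Steps 3.4-3.5: pick the candidate with the least used space
    # (sizemin starts at Vmax).
    best, best_size = None, Vmax
    for q in candidates:
        used = sum(sizes[g] for g in q)
        if used < best_size:
            best, best_size = q, used
    # Step 3.6: add f if it fits under the Vmax cap.
    if best is not None and size_f + best_size <= Vmax:
        best.add(f)
        return True
    # Step 3.7: otherwise fall back to any empty staging set.
    for q in Q:
        if not q:
            q.add(f)
            return True
    return False
```

With the worked example's numbers (file3 of 30MB versus q1 = {file1, file2} holding 30MB, correlations 0.68 and 0.40), file3 lands in q1.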
(4) Small-file merging and storage
4.1) Read a staging set qk from Q and perform the following operations:
4.2) Check whether qk has reached the minimum capacity utilization Rmin for merging: if sizek / Vmax ≥ Rmin, continue with the subsequent steps; otherwise execute step 4.7) directly.
4.3) Define fnew as a new empty file.
4.4) Traverse each small file fi in qk, append the content of each fi into the empty file fnew, and delete the original small file fi and its replicas from HDFS.
4.5) Save the merged file fnew into HDFS.
4.6) Empty the staging set qk.
4.7) Read the next unprocessed staging set in Q and repeat from step 4.2) until all staging sets have been processed.
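Step (4) can be sketched with an in-memory dictionary standing in for HDFS (the real method would use the HDFS client to append, write, and delete; the helper below is an assumption for illustration).

```python
def merge_staged(q, store, Vmax=64, Rmin=0.9):
    """Merge staging set q if its capacity utilization reaches Rmin
    (step 4.2); returns the merged content, or None if q stays staged.
    `store` maps file name -> bytes and stands in for HDFS."""
    used = sum(len(store[f]) for f in q)
    if used / Vmax < Rmin:
        return None                                  # below Rmin: skip this cycle
    merged = b"".join(store[f] for f in sorted(q))   # step 4.4: append into f_new
    for f in sorted(q):
        del store[f]                                 # delete originals and replicas
    q.clear()                                        # step 4.6: empty the staging set
    return merged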
(5) Backtrack: check whether the next merge period has arrived; if so, jump to step (2) and execute again, otherwise go to step (6).
(6) End: stop the entire small-file classification and merging process.
To realize this method, the present invention adds a file-access information collection module on each DataNode, which collects the access information of the files on that DataNode. To realize this method, the present invention adds a file-access statistics and analysis module on the NameNode, corresponding to step 2), which aggregates the historical file-access information collected by the DataNodes and, by analyzing it, quantifies the degree of correlation between files to obtain the file-correlation mapping set. To realize this method, the present invention adds a file-merge decision module on the NameNode, corresponding to step 3), which classifies and stages the small files by their degree of correlation so that highly correlated small files fall into the same staging set, thereby forming the merge decision for the small files. To realize this method, the present invention adds a file-merge execution module on the DataNodes, corresponding to step 4): after the NameNode completes the merge decision for the small files, it directs the DataNodes to execute the merge strategy, complete the merge operation on the small files, write the merged file into HDFS, delete the original small files and their replicas, and modify the corresponding metadata.
Brief description of the drawings
Fig. 1 is the deployment diagram of the HDFS platform on which the method of the present invention relies.
Fig. 2 shows the software modules newly added by the method and their interactions.
Fig. 3 is the overall flow chart of the method.
Fig. 4 is the execution flow chart of file access analysis.
Fig. 5 is the execution flow chart of small-file classification and staging.
Fig. 6 is the execution flow chart of small-file merging and storage.
Fig. 7 shows the file size distribution.
Fig. 8 shows the NameNode memory consumption.
Fig. 9 compares the access efficiency of the three storage modes.
Detailed description of the embodiments
The present invention is described below with reference to the accompanying drawings and specific embodiments.
The small-file storage optimization method proposed by the invention can rely on the existing HDFS architecture and is realized by modifying and adding the corresponding software modules. Fig. 1 is the platform deployment diagram of the method: the platform consists of multiple computer servers (nodes) connected by a network. The nodes fall into two classes: one master node (the NameNode) and multiple slave nodes (the DataNodes). On the NameNode, besides the original file-system metadata management module and DataNode management module, the file-access statistics and analysis module and the file-merge decision module required by this method are added. On each DataNode, besides the original data-block storage management module, the file-access information collection module and the file-merge execution module are added.
Fig. 2 shows the software modules newly added to the HDFS platform on which the method relies and their interactions. The shaded modules are those added to the HDFS platform to realize this method: the file-access statistics and analysis module and the file-merge decision module added on the NameNode, and the file-access information collection module and file-merge execution module added on the DataNodes. The unshaded modules are existing software modules of the original HDFS. The interaction flow between the modules is as follows: (1) The DataNode management module on the NameNode sends a message to every DataNode in the cluster to collect file-access information. The DataNode management module is an original NameNode module; it manages all DataNodes in the cluster and carries out the NameNode's control over them. (2) The file-access information collection module on each DataNode collects that node's file-access information; once collection is complete, it sends the collected access information back to the NameNode for aggregation. (3) The file-access statistics and analysis module statistically analyzes all the collected access information, computes the degree of correlation between files, and forms the file-correlation mapping set. (4) Based on the obtained file-correlation mapping set, the newly added file-merge decision module on the NameNode classifies and stages the small files by their degree of correlation, forming the classified merge strategy. (5) Through the DataNode management module, the NameNode directs the DataNodes to execute the merge strategy. (6) The file-merge execution module performs the file merge operation on the DataNode; during the merge it relies on the DataNode's original data-block storage management module to complete functions such as data writes and deletions, and it is responsible for sending messages to the file-system metadata management module so that the metadata is updated accordingly. Since the small files have been merged and stored according to the established strategy, the method effectively reduces the NameNode memory overhead caused by maintaining the metadata of numerous small files.
Fig. 3 is the overall execution flow chart of the method, divided into five steps: initialization, file access analysis, small-file classification and staging, small-file merging and storage, and backtracking. To better illustrate the execution flow, assume that 4 small files already stored in HDFS have been read: file1, file2, file3 and file4, and that 20 simplified access log entries for these 4 files have been obtained from the historical access log.
The basic parameter values used by the method are as follows: simultaneous-access time interval Tc = 1s (intervals of at most 1s count as simultaneous access); correlated-access time interval Ts = 2s (intervals greater than 1s and at most 2s count as correlated access); staging-set capacity upper limit Vmax = 64MB; minimum capacity utilization for merging Rmin = 0.9; minimum correlation threshold between small files θmin = 0.5; weight of the correlated-access probability Wr = 0.6; weight of the simultaneous-access probability Wc = 0.4; merge period Tcycle = 86400s.
The specific embodiment of the entire inventive method is given below with reference to Fig. 3:
(1) Initialization
1.1) Let the set of small files stored in HDFS be F = {file1, file2, file3, file4}; the total number of small files is M = 4.
1.2) Let the small-file access log set be L = {(file1, 2018/01/01 12:10:01, 10240), (file3, 2018/01/01 12:10:02, 30720), …, (file4, 2018/01/01 12:11:51, 40960)}; the total number of log records is N = 20.
1.3) Build the individual-access frequency set C, initially C = {}, and the simultaneous-access frequency set I, initially I = {}.
1.4) Build the small-file correlation mapping set S, initially S = {}.
1.5) Build the staging universe Q, initially Q = {{}, {}, …, {}}; the number of empty sets in Q is K = 10.
1.6) Build the candidate universe H, initially H = {}.
(2) File access analysis
2.1) Traverse each small file in set F and compute the individual access frequency of each. Suppose the small file currently taken is file1; then the computation proceeds as follows:
2.1.1) Let c1 denote the access count of the current small file file1; initialize c1 = 0.
2.1.2) Traverse each log entry (fk, tk, sk) in log set L, where 1 ≤ k ≤ N; if fk = file1 is met, then c1 = c1 + 1.
2.1.3) Add file1 and its individual access count c1 to the individual-access frequency set C, so that C = {(file1, c1)}.
After the traversal of step 2.1) ends, the individual access counts of all files are complete, and the final set C is: C = {(file1, 5), (file2, 5), (file3, 4), (file4, 6)}.
2.2) For any two small files in set F, compute the number of times the two are accessed within time intervals Tc and Ts of each other. Suppose the two small files currently taken are file1 and file2; then the computation proceeds as follows:
2.2.1) From log set L, select the log subset L1 corresponding to file1; L1 is file1's log set sorted by time in descending order: L1 = {(file1, 2018/01/01 12:11:50, 10240), (file1, 2018/01/01 12:11:30, 10240), …, (file1, 2018/01/01 12:10:01, 10240)}.
2.2.2) From log set L, select the log subset L2 corresponding to file2; L2 is file2's log set sorted by time in descending order: L2 = {(file2, 2018/01/01 12:11:31, 20480), (file2, 2018/01/01 12:11:18, 20480), …, (file2, 2018/01/01 12:10:02, 20480)}.
2.2.3) Let u12 denote the number of times file1 and file2 are accessed within Ts of each other, initialized u12 = 0. Let v12 denote the number of times within Tc, initialized v12 = 0.
2.2.4) For each log entry (file1, tk, sk) in L1, traverse each log entry (file2, tq, sq) in L2: if |tq − tk| < Tc holds, then v12 = v12 + 1; if |tq − tk| < Ts holds, then u12 = u12 + 1.
2.2.5) Add file1, file2 and their counts u12, v12 as the quadruple (file1, file2, u12, v12) to the simultaneous-access frequency set I: I = {(file1, file2, u12, v12)}.
After step 2.2) executes, set I records the simultaneous-access counts of every pair of files, as follows:
I = {(file1, file2, 4, 4), (file1, file3, 5, 4), (file1, file4, 1, 1), (file2, file3, 3, 2), (file2, file4, 1, 0), (file3, file4, 1, 1)}
2.3) For any two small files in set F, compute the degree of correlation between the two and build the correlation mapping. Suppose the two files taken are file1 and file2; then the steps are as follows:
2.3.1) Let P(file2|file1) denote the conditional probability that file2 is accessed in correlation with file1 after file1 has been accessed, and let P(file1file2) denote the probability that file1 and file2 are accessed simultaneously. Compute P(file2|file1) and P(file1file2), with N the total number of log records, N = 20:
P(file2|file1) = u12 / c1 = 0.8
P(file1file2) = v12 / N = 0.2
2.3.2) Let w12 denote the degree of correlation between file1 and file2: w12 = Wr × 0.8 + Wc × 0.2 = 0.56.
2.3.3) Add file1, file2 and the degree of correlation w12 as the triple (file1, file2, w12) to the file-correlation mapping set S: S = {(file1, file2, w12)}.
After step 2.3) completes, the final file-correlation mapping set S is: S = {(file1, file2, 0.56), (file1, file3, 0.68), (file1, file4, 0.14), (file2, file3, 0.4), (file2, file4, 0.12), (file3, file4, 0.17)}.
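As a cross-check, the final mapping set S above can be reproduced from the counts already given in the example (C, I, N = 20, Wr = 0.6, Wc = 0.4) using only the formula wij = Wr·uij/ci + Wc·vij/N:

```python
# Counts taken verbatim from the example's sets C and I.
C = {"file1": 5, "file2": 5, "file3": 4, "file4": 6}
I = {("file1", "file2"): (4, 4), ("file1", "file3"): (5, 4),
     ("file1", "file4"): (1, 1), ("file2", "file3"): (3, 2),
     ("file2", "file4"): (1, 0), ("file3", "file4"): (1, 1)}
N, Wr, Wc = 20, 0.6, 0.4

# w_ij = Wr * u / c_i + Wc * v / N, rounded to two decimals as in the text.
S = {pair: round(Wr * u / C[pair[0]] + Wc * v / N, 2)
     for pair, (u, v) in I.items()}
# S reproduces the text's values: 0.56, 0.68, 0.14, 0.40, 0.12, 0.17
```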
(3) Small-file classification and staging
To simplify the demonstration of the computation, assume file1 and file2 have already been placed in staging set q1, i.e. q1 = {file1, file2} and Q = {{file1, file2}, {}, …, {}}.
3.1) Take small file file3 and apply the following steps to it:
3.2) Traverse each staging set qk in the staging universe Q:
3.2.1) Compute the average degree of correlation between file3 and qk. Taking q1 as an example, the average correlation of file3 with q1 is: a31 = (w31 + w32) / 2 = (0.68 + 0.4) / 2 = 0.54. Since a31 = 0.54 > θmin = 0.5, q1 is removed from Q and added to the candidate universe H. Continue likewise with the average correlation between file3 and the other staging sets.
3.3) Check whether the candidate universe H is empty; if H is empty, jump to step 3.7). Since q1 is now in H, H is not empty, so continue with step 3.4).
3.4) Let qmin denote the candidate set with the least used space, initialized to an arbitrary set, and let sizemin denote the size of qmin, with initially sizemin = Vmax.
3.5) Traverse each candidate set in H and compute its currently used space. Taking q1 as an example, q1 already holds file1 and file2, so its used space is size1 = s1 + s2 = 30MB, which satisfies the condition size1 < sizemin; therefore qmin = q1 and sizemin = size1.
3.6) Since s3 + sizemin = 60MB ≤ Vmax holds, file3 can be added to the candidate set qmin. All candidate sets are removed from H and returned to the staging universe Q, and step 3.8) is executed; at this point H = {} and Q = {{file1, file2, file3}, {}, …, {}}. If s3 + sizemin ≤ Vmax did not hold, file3 could not enter qmin for lack of capacity, and step 3.7) would be executed instead.
3.7) Add file3 to any empty staging set in the staging universe Q, for example the empty set q2, giving q2 = {file3} and Q = {{file1, file2}, {file3}, {}, …, {}}. If no empty staging set remains in Q, jump directly to step (6).
3.8) Take the next small file in the small-file set and repeat from step 3.2) until all small files in set F have been taken.
(4) Small-file merging and storage
4.1) Take a staging set from Q; suppose the chosen set is q1. Perform the following operations:
4.2) Check whether q1 has reached the minimum capacity utilization Rmin = 0.9 for merging. q1 holds the small files file1, file2 and file3, so its capacity utilization is: size1 / Vmax = (10MB + 20MB + 30MB) / 64MB ≈ 0.94 ≥ Rmin; continue with step 4.3). If size1 / Vmax < Rmin held, step 4.7) would be executed directly.
4.3) Define fnew as a new empty file.
4.4) Traverse each small file in q1, append the file contents of file1, file2 and file3 into the empty file fnew, and delete the original file1, file2, file3 and their replicas from HDFS.
4.5) Save the merged file fnew into HDFS.
4.6) Empty the staging set q1.
4.7) Take the next staging set in Q and repeat from step 4.2) until every staging set has been taken.
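The utilization figure in step 4.2 is a rounded value; the exact arithmetic, checked below, is 60/64 = 0.9375, which the text rounds to 0.94:

```python
# Sizes from the example, in MB; 10 + 20 + 30 = 60 MB against a 64 MB cap.
sizes_mb = {"file1": 10, "file2": 20, "file3": 30}
utilization = sum(sizes_mb.values()) / 64  # exactly 0.9375
assert utilization >= 0.9  # Rmin = 0.9, so q1 qualifies for merging
```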
(5) Backtrack: within the current execution period, after the small-file merging and storage completes, check whether the next execution period has arrived. When the next period starts, i.e. when the specified period Tcycle has elapsed, jump to step (2) and execute again; otherwise go to step (6).
(6) End: stop the entire small-file classification and merging process.
For the HDFS-based storage optimization method for massive small files proposed by the present invention (hereinafter the MBDC algorithm), the inventors carried out performance tests. For the tests, 3132 files and their access logs were collected; the file sizes ranged from 100KB to 120MB, with small files below 5MB accounting for 96% of the total. The file size distribution is shown in Fig. 7.
Compared with the plain storage mode that applies no merge strategy, both the PS algorithm (an existing classical merge algorithm) and the MBDC algorithm (the method of the present invention) greatly reduce the memory overhead of the NameNode, improving small-file storage performance by 98% and 97.8% respectively. This is because both the PS and the MBDC algorithm merge the massive small files before storage. Since the MBDC algorithm puts more emphasis on file correlation, compared with the PS merge algorithm, which considers only the file size distribution and not the correlation of files, it can waste some residual storage space, so the final number of merged files of MBDC is slightly larger than that of PS, making MBDC slightly inferior to PS in NameNode memory overhead. The NameNode memory consumption under the different algorithms is shown in Fig. 8.
Compared with the default HDFS access mode and the access mode after storage with the PS algorithm, the MBDC algorithm greatly shortens small-file access times and improves small-file access efficiency. This is mainly because the algorithm places highly correlated files in the same merged file more fully than the PS algorithm, so that during access the contents of highly correlated small files are fetched together, effectively reducing the number of communications between the client and the NameNode and improving the overall efficiency of data access. The access-efficiency comparison is shown in Fig. 9.
Finally, it should be noted that the above examples only illustrate, and do not limit, the technology described in the present invention; all technical solutions and improvements that do not depart from the spirit and scope of the invention shall be covered by the claims of the present invention.
Claims (3)
1. An HDFS-based mass small-file storage optimization method, characterized by being divided into five steps: initialization, file-access status analysis, small-file classification and staging, small-file merge storage, and backtracking; with the following basic parameters: the time-interval threshold T_c for simultaneous access of small files, the time-interval threshold T_s for correlated access of small files, the upper capacity threshold V_max of a small-file staging set, the minimum capacity utilization R_min of a staging set at merge time, the minimum correlation threshold θ_min of small files, the weight W_r of the correlated-access probability on the correlation degree, the weight W_c of the simultaneous-access probability on the correlation degree, and the small-file merge execution cycle T_cycle; the parameter value ranges are: T_c in (0s, 1s], T_s in (1s, 5s], V_max in [64MB, 1024MB], R_min in [0.8, 1], θ_min in [0.3, 1]; W_r and W_c satisfy W_r + W_c = 1, with W_r in [0.6, 0.9] and W_c in [0.1, 0.4]; T_cycle takes a value of 3600s or more;
the method is realized according to the following steps:
(1) Initialization
1.1) Let the set of small files stored in HDFS be F, F = {f_i | 1 ≤ i ≤ M}, where f_i denotes any small file and M is the total number of small files;
1.2) Let the small-file access-log set be L, L = {l_i | 1 ≤ i ≤ N}; any l_i ∈ L is expressed as a triple (f_i, t_i, s_i), where f_i is the small file recorded by the log, t_i is the small-file access time, and s_i is the small-file size;
1.3) Construct the individual-access frequency set C, C = {(f_i, c_i) | 1 ≤ i ≤ M}, where c_i is the access frequency of small file f_i; C is initialized to the empty set. Similarly, construct the simultaneous-access frequency set I, I = {(f_i, f_j, u_ij, v_ij) | 1 ≤ i < M, i < j ≤ M}, where u_ij is the frequency with which small files f_i and f_j are both accessed within the time interval T_s and v_ij is the frequency with which they are accessed simultaneously within the time interval T_c; I is initialized to the empty set;
1.4) Construct the small-file correlation mapping set S, S = {s_ij | 1 ≤ i < M, i < j ≤ M}; any s_ij ∈ S can be expressed as a triple (s_i, s_j, w_ij), where s_i and s_j denote the two distinct small files in the correlation relation and w_ij denotes the correlation degree of the two small files; S is initialized to the empty set;
1.5) Construct the complete staging-set collection Q, Q = {q_i | 1 ≤ i ≤ K}, where K is the number of staging sets in Q; for any q_i ∈ Q, q_i = {f_j | 1 ≤ j ≤ n_i}, where n_i is the number of small files staged in q_i; Q is initialized as a collection of K empty sets;
1.6) Construct the complete candidate-set collection H, H = {h_i | 1 ≤ i ≤ K'}, where K' is the number of candidate sets in H; for any h_i ∈ H, h_i = {f_j | 1 ≤ j ≤ n_i}, where n_i is the number of small files in h_i; H is initialized as the empty set;
(2) File-access status analysis
2.1) Traverse each small file f_i in the small-file set F, 1 ≤ i ≤ M, and compute the individual access frequency of each small file f_i as follows:
2.1.1) Define c_i as the access frequency of the current small file f_i; initialize c_i to 0;
2.1.2) Traverse each log l_k in the log set L, 1 ≤ k ≤ N, where l_k is expressed as a triple (f_k, t_k, s_k); if f_i = f_k is satisfied, increment c_i by 1;
2.1.3) Add the small file f_i with its individual access frequency c_i, as the pair (f_i, c_i), to the individual-access frequency set C;
2.2) For any two small files f_i and f_j in the small-file set F, compute the simultaneous access frequencies of f_i and f_j under the time intervals T_c and T_s as follows:
2.2.1) From the log-record set L, select the log-record subset L_i = {(f_i, t_k, s_k) | 1 ≤ k ≤ N_i} corresponding to small file f_i; L_i is the log-record set of file f_i in descending time order;
2.2.2) From the log-record set L, select the log-record subset L_j = {(f_j, t_q, s_q) | 1 ≤ q ≤ N_j} corresponding to small file f_j; L_j is the log-record set of file f_j in descending time order;
2.2.3) Let u_ij denote the frequency with which small files f_i and f_j are both accessed within the time interval T_s; initialize u_ij ← 0. Let v_ij denote the frequency with which f_i and f_j are accessed simultaneously within the time interval T_c; initialize v_ij ← 0;
2.2.4) For each log record (f_i, t_k, s_k) in L_i, traverse each log record (f_j, t_q, s_q) in L_j; if |t_q - t_k| < T_c holds, it is regarded as simultaneous access: increment v_ij by 1; if |t_q - t_k| < T_s holds, it is regarded as correlated access: increment u_ij by 1;
2.2.5) Add the small files f_i, f_j and their simultaneous access frequencies u_ij, v_ij, as the quadruple (f_i, f_j, u_ij, v_ij), to the simultaneous-access frequency set I;
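Steps 2.2.1) through 2.2.5) amount to a pairwise scan over the per-file log timestamps. A direct, unoptimized sketch under assumed defaults T_c = 1s and T_s = 5s, with logs given as (file, time, size) triples:

```python
def pairwise_access_freq(logs, fi, fj, t_c=1.0, t_s=5.0):
    """Count simultaneous (|dt| < T_c) and correlated (|dt| < T_s)
    accesses between files fi and fj from (file, time, size) log triples."""
    ti = sorted((t for f, t, _ in logs if f == fi), reverse=True)  # L_i, time-descending
    tj = sorted((t for f, t, _ in logs if f == fj), reverse=True)  # L_j, time-descending
    u_ij = v_ij = 0
    for tk in ti:                    # step 2.2.4: compare every pair of records
        for tq in tj:
            if abs(tq - tk) < t_c:   # simultaneous access
                v_ij += 1
            if abs(tq - tk) < t_s:   # correlated access
                u_ij += 1
    return u_ij, v_ij
```

Note that since T_c < T_s, every simultaneous access also counts toward the correlated-access frequency u_ij, consistent with step 2.2.4).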
2.3) For any two small files f_i and f_j in the small-file set F, compute the correlation degree of f_i and f_j, and construct the correlation mapping of f_i and f_j, as follows:
2.3.1) Let P(f_j|f_i) denote the conditional probability that file f_j is accessed through correlation after file f_i is accessed; let P(f_i f_j) denote the probability that files f_i and f_j are accessed simultaneously; compute P(f_j|f_i) and P(f_i f_j), where N is the total number of log records:
P(f_j|f_i) = u_ij / c_i
P(f_i f_j) = v_ij / N
2.3.2) Let w_ij denote the correlation degree between small files f_i and f_j: w_ij ← W_r × P(f_j|f_i) + W_c × P(f_i f_j);
2.3.3) Add the small files f_i, f_j and their correlation degree w_ij, in the form of the triple (f_i, f_j, w_ij), to the file correlation mapping set S;
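Step 2.3) combines the two access probabilities into a weighted correlation degree. A sketch with example weights drawn from the ranges in claim 1 (W_r = 0.7, W_c = 0.3 are assumptions, not values the patent fixes):

```python
W_R, W_C = 0.7, 0.3          # example weights satisfying W_r + W_c = 1

def correlation_degree(u_ij, v_ij, c_i, n_logs, w_r=W_R, w_c=W_C):
    """w_ij = W_r * P(f_j|f_i) + W_c * P(f_i f_j), per step 2.3.
    P(f_j|f_i) = u_ij / c_i ; P(f_i f_j) = v_ij / N."""
    p_cond = u_ij / c_i       # conditional correlated-access probability
    p_simul = v_ij / n_logs   # simultaneous-access probability
    return w_r * p_cond + w_c * p_simul
```

For example, with u_ij = 4, v_ij = 2, c_i = 8 and N = 100 logs, w_ij = 0.7 × 0.5 + 0.3 × 0.02 = 0.356, which exceeds the minimum correlation threshold θ_min = 0.3.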
(3) Small-file classification and staging
3.1) Read a small file f_i from the small-file set F and execute the following steps for f_i:
3.2) Traverse each staging set q_k in the complete staging-set collection Q, 1 ≤ k ≤ K:
3.2.1) Compute the average correlation degree between small file f_i and staging set q_k. Let f_j denote each small file staged in q_k, where j ∈ [1, n_k] and n_k is the number of small files in the staging set. Let a_ik denote the average correlation degree between small file f_i and staging set q_k. If staging set q_k is empty, then a_ik = 0; otherwise, compute the average correlation degree using the following formula.
If the condition a_ik > θ_min is satisfied, remove staging set q_k from Q and add it to the candidate-set collection H;
3.3) Judge whether the candidate-set collection H is empty. If H is empty, there is no candidate set; jump to step 3.7). If H is not empty, continue with step 3.4);
3.4) Let q_min denote the candidate set with the least used space; q_min is initialized to an arbitrary set. Let size_min denote the size of set q_min; initially, assign V_max to size_min;
3.5) Traverse each candidate set h_j in the candidate-set collection H, 1 ≤ j ≤ n_j, where n_j is the number of candidate sets. Compute the currently used space size_j of h_j as the sum of the sizes s_k of the h_n small files in h_j, using the following formula:
If size_j < size_min is satisfied, assign h_j to q_min and size_j to size_min;
3.6) Judge whether small file f_i can be added to q_min: if s_i + size_min ≤ V_max, add small file f_i to the candidate set q_min, remove all candidate sets h_j from H, place them back into the staging-set collection Q, and execute step 3.8); otherwise, continue with step 3.7);
3.7) Add small file f_i to any empty staging set in the staging-set collection Q; if no empty staging set exists in Q, jump directly to step (6);
3.8) Read the next unprocessed small file in the small-file set and re-execute from step 3.2), until all small files have been processed;
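The staging logic of step (3) is essentially a correlation-filtered best-fit placement. A simplified sketch under stated assumptions: correlation degrees are looked up from a dict keyed by unordered file pairs, staging sets are in-memory lists, and the constants come from the ranges in claim 1.

```python
V_MAX = 64 * 1024 * 1024      # staging-set capacity upper bound
THETA_MIN = 0.3               # minimum average correlation threshold

def stage_file(f, size, corr, staging_sets, sizes):
    """Place file f (of the given size) into a staging set, per step (3).
    corr maps frozenset({a, b}) -> correlation degree w_ab (symmetric).
    staging_sets is a list of lists; sizes[i] is the used space of set i."""
    # 3.2: candidate sets are those whose average correlation with f
    # exceeds THETA_MIN (an empty set has average correlation 0)
    candidates = []
    for i, q in enumerate(staging_sets):
        avg = sum(corr.get(frozenset({f, g}), 0.0) for g in q) / len(q) if q else 0.0
        if avg > THETA_MIN:
            candidates.append(i)
    # 3.4-3.5: among the candidates, pick the one with the least used space
    best, best_size = None, V_MAX
    for i in candidates:
        if sizes[i] < best_size:
            best, best_size = i, sizes[i]
    # 3.6: if the file fits, stage it there
    if best is not None and size + best_size <= V_MAX:
        staging_sets[best].append(f)
        sizes[best] += size
        return best
    # 3.7: otherwise fall back to any empty staging set
    for i, q in enumerate(staging_sets):
        if not q:
            staging_sets[i].append(f)
            sizes[i] = size
            return i
    return None  # no empty set left; the method would terminate (step (6))
```

Files with no sufficiently correlated staging set fall through to an empty set, while correlated files accumulate together until the capacity check of step 3.6) fails.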
(4) Small-file merge storage
4.1) Read a staging set q_k from set Q and perform the following operations:
4.2) Judge whether staging set q_k has reached the minimum capacity utilization R_min for merging: if size_k/V_max ≥ R_min, continue with the subsequent steps; otherwise, execute step 4.7) directly;
4.3) Define f_new to denote a new empty file;
4.4) Traverse each small file f_i in staging set q_k, appending the content of each small file f_i into the empty file f_new; delete the original small file f_i and its replicas from HDFS;
4.5) Save the merged file f_new into HDFS;
4.6) Empty staging set q_k;
4.7) Read the next unprocessed staging set in Q and execute again from step 4.2), until all staging sets have been processed;
(5) Backtracking: judge whether the next merge period has arrived; if it has, jump to step (2) and re-execute, otherwise go to step (6);
(6) End: stop the entire small-file classification and merging process.
2. The HDFS-based mass small-file storage optimization method according to claim 1, characterized in that: each data node is additionally provided with a file-access information collection module for collecting the access status of the files on that data node; the name node is additionally provided with a file-access information statistical-analysis module, corresponding to step 2), which aggregates the file-history access information collected by each data node and quantifies the correlation degree between files by studying the history access information, obtaining the file correlation mapping set; the name node is additionally provided with a file-merge decision module, corresponding to step 3), which classifies and stages the small files according to the correlation degree between files, so that highly correlated small files are classified into the same staging set, thereby forming the merge decision for the small files; each data node is additionally provided with a file-merge execution module, corresponding to step 4); after the name node completes the small-file merge decision, it controls the data node to execute the merge strategy, complete the small-file merge operation, write the merged file into HDFS, and delete the original small files and their replicas.
3. The HDFS-based mass small-file storage optimization method according to claim 1, characterized in that: the realization of the method depends on the existing HDFS architecture and is implemented by modifying and adding the corresponding software modules; the mass small-file storage optimization platform is composed of multiple computer servers, i.e. nodes, connected to each other by a network; the platform nodes are divided into two categories: one master node, i.e. the name node, and multiple slave nodes, i.e. data nodes; on the name node, in addition to the original file-system metadata management module and data-node management module, the file-access information statistical-analysis module and file-merge decision module required by this method are newly added; on each data node, in addition to the original data-block storage management module, the file-access information collection module and file-merge execution module are newly added;
the software modules newly added to the HDFS platform are thus the file-access information statistical-analysis module and file-merge decision module on the name node, and the file-access information collection module and file-merge execution module on the data nodes; the interaction flow between the modules is as follows: (1) the data-node management module sends a message to collect file-access information; the data-node management module is an original module of the name node, whose role is to manage all data nodes in the cluster and complete the name node's control functions over the data nodes; (2) the file-access information collection module collects the file-access information of the data node; after collection is completed, it sends the collected access information back to the name node for aggregation of the file-access information; (3) the file-access information statistical-analysis module performs statistical analysis on all the collected file-access information, calculates the correlation degree between files, and forms the file correlation mapping set; (4) according to the obtained file correlation mapping set, the newly added file-merge decision module on the name node classifies and stages the small files by correlation degree and forms the classification-and-merge strategy; (5) the name node, through the data-node management module, controls the data nodes to execute the merge strategy according to the classification-and-merge strategy; (6) the file-merge execution module on the data node performs the file-merge operation; during file merging, it relies on the original data-block storage management module of the data node to assist in completing data writing and deletion, and is responsible for sending messages to the file-system metadata management module so as to perform the corresponding metadata updates.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910175055.9A CN110018997B (en) | 2019-03-08 | 2019-03-08 | Mass small file storage optimization method based on HDFS |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110018997A true CN110018997A (en) | 2019-07-16 |
CN110018997B CN110018997B (en) | 2021-07-23 |
Family
ID=67189371
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910175055.9A Expired - Fee Related CN110018997B (en) | 2019-03-08 | 2019-03-08 | Mass small file storage optimization method based on HDFS |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110018997B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111639054A (en) * | 2020-05-29 | 2020-09-08 | 中国人民解放军国防科技大学 | Data coupling method, system and medium for ocean mode and data assimilation |
CN117170590A (en) * | 2023-11-03 | 2023-12-05 | 沈阳卓志创芯科技有限公司 | Computer data storage method and system based on cloud computing |
CN117519608A (en) * | 2023-12-27 | 2024-02-06 | 泰安北航科技园信息科技有限公司 | Big data server with Hadoop as core |
CN118132520A (en) * | 2024-05-08 | 2024-06-04 | 济南浪潮数据技术有限公司 | Storage system file processing method, electronic device, storage medium and program product |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101854388A (en) * | 2010-05-17 | 2010-10-06 | 浪潮(北京)电子信息产业有限公司 | Method and system concurrently accessing a large amount of small documents in cluster storage |
CN103500089A (en) * | 2013-09-18 | 2014-01-08 | 北京航空航天大学 | Small file storage system suitable for Mapreduce calculation model |
US20150125133A1 (en) * | 2013-11-06 | 2015-05-07 | Konkuk University Industrial Cooperation Corp. | Method for transcoding multimedia, and hadoop-based multimedia transcoding system for performing the method |
CN106446079A (en) * | 2016-09-08 | 2017-02-22 | 中国科学院计算技术研究所 | Distributed file system-oriented file prefetching/caching method and apparatus |
CN108710639A (en) * | 2018-04-17 | 2018-10-26 | 桂林电子科技大学 | A kind of mass small documents access optimization method based on Ceph |
Non-Patent Citations (4)
Title |
---|
XUN CAI ET AL: "An optimization strategy of massive small files storage based on HDFS", 2018 Joint International Advanced Engineering and Technology Research Conference *
ZHOU Guo'an et al.: "A Survey of Mass Small-File Storage Techniques in Cloud Environments", 《信息网络安全》 (Information Network Security) *
DONG Qiwen: "Research on Small-File Storage Methods Based on HDFS", China Master's Theses Full-text Database, Information Science and Technology *
ZOU Zhenyu et al.: "Small-File Optimization Scheme for HDFS-Based Cloud Storage Systems", 《计算机工程》 (Computer Engineering) *
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111639054A (en) * | 2020-05-29 | 2020-09-08 | 中国人民解放军国防科技大学 | Data coupling method, system and medium for ocean mode and data assimilation |
CN111639054B (en) * | 2020-05-29 | 2023-11-07 | 中国人民解放军国防科技大学 | Data coupling method, system and medium for ocean mode and data assimilation |
CN117170590A (en) * | 2023-11-03 | 2023-12-05 | 沈阳卓志创芯科技有限公司 | Computer data storage method and system based on cloud computing |
CN117170590B (en) * | 2023-11-03 | 2024-01-26 | 沈阳卓志创芯科技有限公司 | Computer data storage method and system based on cloud computing |
CN117519608A (en) * | 2023-12-27 | 2024-02-06 | 泰安北航科技园信息科技有限公司 | Big data server with Hadoop as core |
CN117519608B (en) * | 2023-12-27 | 2024-03-22 | 泰安北航科技园信息科技有限公司 | Big data server with Hadoop as core |
CN118132520A (en) * | 2024-05-08 | 2024-06-04 | 济南浪潮数据技术有限公司 | Storage system file processing method, electronic device, storage medium and program product |
Also Published As
Publication number | Publication date |
---|---|
CN110018997B (en) | 2021-07-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20210374610A1 (en) | Efficient duplicate detection for machine learning data sets | |
CN109739849B (en) | Data-driven network sensitive information mining and early warning platform | |
US11169710B2 (en) | Method and apparatus for SSD storage access | |
CN110018997A (en) | A kind of mass small documents storage optimization method based on HDFS | |
CN109740037B (en) | Multi-source heterogeneous flow state big data distributed online real-time processing method and system | |
CA2953826C (en) | Machine learning service | |
Gupta et al. | Scalable machine‐learning algorithms for big data analytics: a comprehensive review | |
CN103838617A (en) | Method for constructing data mining platform in big data environment | |
CN109740038A (en) | Network data distributed parallel computing environment and method | |
CN116089414B (en) | Time sequence database writing performance optimization method and device based on mass data scene | |
CN112799597A (en) | Hierarchical storage fault-tolerant method for stream data processing | |
CN115941696A (en) | Heterogeneous Big Data Distributed Cluster Storage Optimization Method | |
Elmeiligy et al. | An efficient parallel indexing structure for multi-dimensional big data using spark | |
CN114153545B (en) | Multi-page-based management system and method | |
CN114168084B (en) | File merging method, file merging device, electronic equipment and storage medium | |
CN108256694A (en) | Based on Fuzzy time sequence forecasting system, the method and device for repeating genetic algorithm | |
CN115203133A (en) | Data processing method and device, reduction server and mapping server | |
Bai et al. | An efficient skyline query algorithm in the distributed environment | |
CN118503807B (en) | Multi-dimensional cross-border commodity matching method and system | |
CN114610721B (en) | Multi-level distributed storage system and storage method | |
WO2023066248A1 (en) | Data processing method and apparatus, device, and system | |
Wu et al. | Storage and Query Indexing Methods on Big Data | |
Liu et al. | AVPS: Automatic Vertical Partitioning for Dynamic Workload | |
CN116126209A (en) | Data storage method, system, device, storage medium and program product | |
CN116975053A (en) | Data processing method, device, equipment, medium and program product |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |
| CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20210723 |