CN110515920A

CN110515920A - Hadoop-based method and system for accessing massive numbers of small files

Info

Publication number: CN110515920A
Application number: CN201910816503.9A
Authority: CN
Inventors: 孙伟源
Original assignee: Beijing Inspur Data Technology Co Ltd
Current assignee: Beijing Inspur Data Technology Co Ltd
Priority date: 2019-08-30
Filing date: 2019-08-30
Publication date: 2019-11-29

Abstract

The invention discloses a Hadoop-based method and system for accessing massive numbers of small files. The method comprises: step 1, determining whether a small file needs to be saved; if so, step 2, classifying the small file according to a predetermined feature and placing the small file and its small-file index into a small-file queue; step 3, determining whether the length of the small-file queue has reached a threshold; if so, step 4, merging the small files in the queue into a large file, building a global index, storing the correspondence in the file index, and then sending a storage request to the NameNode; step 5, after the NameNode splits the large file into data blocks of a preset block size, storing the large file on at least one DataNode and writing the DataNodes holding the data blocks, together with their state, to the namespace. Because small files are classified by a predetermined feature and merged into a large file in the small-file queue before they are stored, and both a small-file index and a global index are built, memory consumption and system load are reduced and access efficiency is improved.

Description

Hadoop-based method and system for accessing massive numbers of small files

Technical field

The present invention relates to the technical field of big data processing, and in particular to a Hadoop-based method and system for accessing massive numbers of small files.

Background art

At present, Internet applications are ubiquitous, and the massive data they generate places enormous pressure on storage and processing. Big data technology uses a range of unconventional tools to process large volumes of structured, unstructured, and semi-structured data and to obtain analysis and prediction results.

Applying the Hadoop framework to massive data not only provides a carrier for storing that data but also offers a new way to process it efficiently. Hadoop provides a distributed file storage system, HDFS. HDFS can hold massive data that is accessed largely sequentially, and it provides a mechanism for quickly accessing specific data.

However, HDFS was designed to handle large files, and it runs into problems when processing small files such as pictures and documents. A small file generally means a file smaller than 10 MB. If a system contains a large number of such files, they consume a great deal of NameNode memory and thereby degrade the performance of the whole HDFS cluster.

There is currently no very good solution to the problem of HDFS handling small files. The SequenceFile solution that HDFS itself provides reduces NameNode memory consumption as far as possible by merging small files. A SequenceFile is a flat storage file made up of binary-serialized key/value bytes, and each key/value pair in it is treated as one record. Typically, the file name and file content of a small file form one key-value pair, so that a set of key-value pairs built from many small files can be packed into one SequenceFile. SequenceFile supports compression and can compress several records together. This approach reduces NameNode memory consumption, but merging the files takes a long time, and because the keys are not sorted, finding one small file requires traversing the entire SequenceFile, which lowers access efficiency.

Summary of the invention

The object of the present invention is to provide a Hadoop-based method and system for accessing massive numbers of small files that reduce NameNode memory consumption, improve access efficiency, and reduce system load.

To solve the above technical problem, an embodiment of the present invention provides a Hadoop-based method for accessing massive numbers of small files, comprising:

Step 1: determining whether a small file needs to be saved;

If so, step 2: after classifying the small file according to a predetermined feature, placing the small file and the small-file index of the small file into a small-file queue;

Step 3: determining whether the length of the small-file queue has reached a threshold;

If so, step 4: merging the plurality of small files in the small-file queue into a large file, building a global index, and, after storing the correspondence in the file index, sending a storage request to the NameNode;

Step 5: after the NameNode splits the large file into data blocks of a preset block size, storing the large file on at least one DataNode, and writing the DataNodes holding the data blocks and the state of those DataNodes to the namespace.

Step 2 comprises:

classifying the small files according to their file type or creation time.

Step 4 comprises:

merging the plurality of small files in the small-file queue into a large file using MapFile, wherein a MapFile comprises an index part and a data part; the data part stores the data, and the index part serves as the data index of the file, recording the key of each record and the offset of that record within the file.

After step 5, the method further comprises:

Step 6: determining whether a small-file read request has been received;

Step 7: pre-reading, from the large file that contains the small file named in the read request, the small files related to that small file.

After step 7, the method further comprises:

Step 8: determining whether the number of times the small file is accessed within a predetermined time has reached a threshold;

If so, step 9: storing the small file in a cache.

After step 9, the method further comprises:

Step 10: determining whether the time elapsed since the small file in the cache was last accessed has reached a predetermined duration T;

If so, step 11: deleting the small file from the cache.

In addition, an embodiment of the present invention further provides a Hadoop-based system for accessing massive numbers of small files, comprising:

a small-file storage request module, configured to output a preprocessing command upon detecting that a small file is to be stored;

a small-file preprocessing module, connected to the small-file storage request module and configured to: receive the preprocessing command; classify the small file according to a predetermined feature and then place the small file and its small-file index into a small-file queue; once the length of the small-file queue reaches a threshold, merge the plurality of small files in the queue into a large file and build a global index; after storing the correspondence in the file index, send a storage request to the NameNode; and cause the NameNode to split the large file into data blocks of a preset block size, after which the large file is stored on at least one DataNode and the DataNodes holding the data blocks, together with their state, are written to the namespace.

The system further comprises a small-file read request module and an index prefetch module connected to the small-file preprocessing module. The small-file read request module is configured to send a pre-read request to the index prefetch module upon detecting a small-file read request, and the index prefetch module pre-reads, from the large file that contains the small file named in the read request, the small files related to that small file.

The system further comprises a cache module connected to the index prefetch module, the cache module being configured to store small files whose access frequency within a predetermined time reaches a threshold.

The system further comprises a cache cleanup module connected to the cache module. When the cache cleanup module detects that the time elapsed since a small file in the cache was last accessed has reached a predetermined duration T, it deletes that small file from the cache.

Compared with the prior art, the Hadoop-based method and system for accessing massive numbers of small files provided by the embodiments of the present invention have the following advantages:

Before small files are stored, they are first classified according to a predetermined feature and merged into a large file in the small-file queue, and both a small-file index and a global index are built. The two together form a double index, so that a read first consults the global index and then the small-file index. Lookups are therefore faster and small files can be located quickly. Because fewer index files are needed, memory consumption and system load are reduced and access efficiency is improved. In addition, because files are stored according to a predetermined feature, storage is more efficient and read efficiency improves accordingly.

Brief description of the drawings

To explain the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed to describe the embodiments or the prior art are briefly introduced below. The drawings described below are evidently only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a flow diagram of the steps of one specific embodiment of the Hadoop-based method for accessing massive numbers of small files provided by an embodiment of the present invention;

Fig. 2 is a flow diagram of the steps of another specific embodiment of the Hadoop-based method for accessing massive numbers of small files provided by an embodiment of the present invention;

Fig. 3 is a schematic diagram of the connection structure of one specific embodiment of the Hadoop-based system for accessing massive numbers of small files provided by an embodiment of the present invention;

Fig. 4 is a schematic diagram of the connection structure of another specific embodiment of the Hadoop-based system for accessing massive numbers of small files provided by an embodiment of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in those embodiments. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

Refer to Figs. 1 to 4. Fig. 1 is a flow diagram of the steps of one specific embodiment of the Hadoop-based method for accessing massive numbers of small files provided by an embodiment of the present invention; Fig. 2 is a flow diagram of the steps of another specific embodiment of that method; Fig. 3 is a schematic diagram of the connection structure of one specific embodiment of the Hadoop-based system for accessing massive numbers of small files provided by an embodiment of the present invention; Fig. 4 is a schematic diagram of the connection structure of another specific embodiment of that system.

In one specific embodiment, the Hadoop-based method for accessing massive numbers of small files comprises:

Step 1: determining whether a small file needs to be saved. Here the method checks whether a small-file storage request exists so that the subsequent steps can be started, which saves memory. The check may run at fixed intervals or at random intervals; for example, the system may assign a random interval of 1 to 3 s, or it may set the interval from how frequently small files have been arriving. If that frequency has been high, large-scale small-file storage is under way, so the detection frequency should be raised and the detection interval shortened; conversely, the detection interval can be lengthened.
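The adaptive detection interval described for step 1 can be sketched as follows. This is only an illustration: the function name `next_interval`, the halving and doubling factors, and the `busy_threshold` of 100 arrivals are assumptions, while the 1 to 3 s bounds follow the example in the text.

```python
def next_interval(current: float, arrivals: int, busy_threshold: int = 100,
                  shortest: float = 1.0, longest: float = 3.0) -> float:
    """Return the next detection interval in seconds.

    Shorten the interval when many small files arrived in the last window,
    lengthen it otherwise, and clamp the result to [shortest, longest].
    """
    if arrivals >= busy_threshold:
        proposed = current / 2      # busy: detect more often
    else:
        proposed = current * 2      # quiet: detect less often
    return max(shortest, min(longest, proposed))
```

Clamping keeps the interval inside a fixed range, so a long quiet period cannot delay detection indefinitely.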

If so, step 2: after classifying the small file according to a predetermined feature, placing the small file and the small-file index of the small file into a small-file queue. The purpose of classification is to simplify storage and subsequent reading: files that share a feature are generally stored and read together, which makes later reads easier. Otherwise, merely looking up the location of a small file takes considerable time, which increases both memory consumption and read time and greatly reduces access efficiency.

Step 3: determining whether the length of the small-file queue has reached a threshold. The purpose of this check is to give the large files that are subsequently merged a uniform length, so that different large files are roughly equal in length and occupy roughly equal storage space. This is similar to packing goods into containers and greatly improves space utilization. The present invention does not limit the length threshold of the small-file queue, which may change automatically with the size of the large-file storage space. For example, suppose the large-file storage space holds 100 GB and allows 100 large files, each no longer than 1 GB. If the storage space grows to 200 GB and the permitted number of large files remains 100, the length limit of each large file becomes 2 GB. Alternatively, each large file may be limited to 1 GB regardless of the storage space, with only the permitted number of files changing with the size of that space. The present invention does not limit this.
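The two sizing policies in this example reduce to simple integer division. The helper names below are hypothetical, and the sketch assumes sizes are expressed in bytes; dividing 200 GB by 100 files gives a per-file limit of 2 GB.

```python
GB = 1024 ** 3

def size_limit_per_large_file(capacity_bytes: int, large_file_count: int) -> int:
    """Policy A: fixed number of large files, per-file limit scales with capacity."""
    return capacity_bytes // large_file_count

def large_file_count_limit(capacity_bytes: int, large_file_bytes: int) -> int:
    """Policy B: fixed per-file limit, number of large files scales with capacity."""
    return capacity_bytes // large_file_bytes
```

Under policy A the queue threshold would be recomputed whenever the storage space is resized; under policy B it stays constant.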

If so, step 4: merging the plurality of small files in the small-file queue into a large file, building a global index, and, after storing the correspondence in the file index, sending a storage request to the NameNode. The global index built here and the small-file index formed earlier in the small-file queue together make up a double index. Subsequent reads can use this double-index structure to locate small files more quickly, which speeds up both reading and storing small files.

Before small files are stored, they are first classified according to a predetermined feature and merged into a large file in the small-file queue, and both a small-file index and a global index are built. The two together form a double index, so that a read first consults the global index and then the small-file index. Lookups are therefore faster and small files can be located quickly. Because fewer index files are needed, memory consumption and system load are reduced and access efficiency is improved. In addition, because files are stored according to a predetermined feature, storage is more efficient and read efficiency improves accordingly.
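The queue, merge, and double-index flow of steps 2 to 4 can be illustrated with an in-memory sketch. It is not the patented implementation: the `Store` class and `merge` function are hypothetical names, dictionaries stand in for HDFS and the NameNode, and the threshold is counted in files purely for simplicity.

```python
def merge(pending):
    """Concatenate (name, data) pairs; return the blob and a per-file index."""
    blob, local_index = bytearray(), {}
    for name, data in sorted(pending):
        local_index[name] = (len(blob), len(data))   # (offset, length)
        blob += data
    return bytes(blob), local_index


class Store:
    def __init__(self, threshold):
        self.threshold = threshold   # queue length that triggers a merge
        self.queue = []              # small files waiting to be merged
        self.large_files = {}        # large file id -> (blob, small-file index)
        self.global_index = {}       # small file name -> large file id

    def put(self, name, data):
        self.queue.append((name, data))
        if len(self.queue) >= self.threshold:
            large_id = f"large-{len(self.large_files)}"
            self.large_files[large_id] = merge(self.queue)
            for queued_name, _ in self.queue:
                self.global_index[queued_name] = large_id
            self.queue = []

    def get(self, name):
        # First hop: global index. Second hop: small-file index inside the large file.
        blob, local_index = self.large_files[self.global_index[name]]
        offset, length = local_index[name]
        return blob[offset:offset + length]
```

A read touches two small lookups instead of scanning the merged data, which is the point of the double index.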

In the present invention, small files undergo some preprocessing before they are stored: they are first classified and then merged into a large file. The present invention does not limit how the classification is done or what it requires. In one embodiment of the invention, step 2 comprises:

classifying the small files according to their file type or creation time.

It should be noted that the present invention generally uses a single classification mode, for example classifying only by file type or only by creation time. Mixing classification modes could place one small file in both a file-type class and a creation-time class, making it hard to assign. The present invention may of course use other modes, and no limitation is imposed here.
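A minimal sketch of single-mode classification follows. The function name `classify`, the use of the file extension as the "file type", and day-level granularity for creation time are all assumptions made for illustration.

```python
import os
from datetime import datetime, timezone

def classify(name: str, created: float, mode: str = "type") -> str:
    """Return the class key of a small file under exactly one classification mode.

    mode="type" groups by lowercase file extension;
    mode="date" groups by UTC creation day (created is a POSIX timestamp).
    """
    if mode == "type":
        return os.path.splitext(name)[1].lower() or "<none>"
    if mode == "date":
        return datetime.fromtimestamp(created, tz=timezone.utc).strftime("%Y-%m-%d")
    raise ValueError(f"unknown classification mode: {mode}")
```

Because the mode is chosen once for the whole system, every small file maps to exactly one class, which avoids the ambiguity of mixed classification.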

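After classification, the small files must be merged into a large file, which in effect standardizes small-file storage. Once they have been merged, a read first accesses the large file and then reads the small file within it. The present invention does not limit how small files are merged. In one embodiment, step 4 comprises:

merging the plurality of small files in the small-file queue into a large file using MapFile, wherein a MapFile comprises an index part and a data part; the data part stores the data, and the index part serves as the data index of the file, recording the key of each record and the offset of that record within the file.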

When a MapFile is accessed, its index file is loaded into memory, and the index mapping quickly locates the position in the file of the specified record. This greatly improves retrieval efficiency and, in turn, access efficiency.
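The index-then-data lookup can be sketched in plain Python. This is not the Hadoop MapFile API: `build_index` and `lookup` are hypothetical helpers, list positions stand in for byte offsets, and indexing every second key is an arbitrary choice for the example.

```python
import bisect

def build_index(records, interval=2):
    """records is a key-sorted list of (key, value); index every interval-th key."""
    return [(records[i][0], i) for i in range(0, len(records), interval)]

def lookup(records, index, key):
    """Binary-search the in-memory index, then scan forward in the data part."""
    index_keys = [k for k, _ in index]
    pos = bisect.bisect_right(index_keys, key) - 1
    if pos < 0:
        return None                      # key sorts before the first record
    start = index[pos][1]
    for k, v in records[start:]:
        if k == key:
            return v
        if k > key:
            break                        # passed the spot where key would be
    return None
```

Because the keys are sorted, a lookup needs one binary search plus a short scan, in contrast to the full traversal required for an unsorted SequenceFile.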

The present invention preprocesses small files, changing how they are stored so that they are easier to read. A better read mechanism can improve access efficiency further. In one embodiment of the invention, the Hadoop-based method for accessing massive numbers of small files further comprises, after step 5:

Step 6: determining whether a small-file read request has been received;

Step 7: pre-reading, from the large file that contains the small file named in the read request, the small files related to that small file.

After a small-file read request is received, the small file and the files related to it are pre-read, which saves the steps and time of interacting with the NameNode. Under this prefetch mechanism, the number of accesses to the NameNode drops sharply, which markedly improves the NameNode's operating efficiency.

In one embodiment, when HDFS attempts to read a small file in a MapFile, the metadata of the other related small files in the same MapFile is prefetched from the NameNode. Small files in the same MapFile are correlated, and a user who reads one file often goes on to access related files. When the metadata of the related small files is held in the HDFS client cache, the client saves the steps and time of interacting with the NameNode. The number of accesses to the NameNode therefore drops sharply, which markedly improves the NameNode's operating efficiency, reduces the memory the NameNode consumes, and lowers system load.
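The effect of metadata prefetching on NameNode traffic can be shown with a toy model. `FakeNameNode` and `Client` are hypothetical stand-ins rather than HDFS classes, and the metadata is reduced to an (offset, length) tuple.

```python
class FakeNameNode:
    """Stand-in for the NameNode that counts how often it is contacted."""
    def __init__(self, large_files):
        self.large_files = large_files   # large file id -> {small file name: metadata}
        self.calls = 0

    def metadata_for_large_file_of(self, name):
        self.calls += 1
        for entries in self.large_files.values():
            if name in entries:
                return dict(entries)     # metadata of every file in the same large file
        raise KeyError(name)


class Client:
    """Client that caches prefetched metadata of related small files."""
    def __init__(self, namenode):
        self.namenode = namenode
        self.cache = {}

    def metadata(self, name):
        if name not in self.cache:
            self.cache.update(self.namenode.metadata_for_large_file_of(name))
        return self.cache[name]
```

Reading one file warms the client cache for its neighbours, so later reads of related files do not contact the NameNode at all.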

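To further improve read efficiency, in one embodiment of the present invention, the method further comprises, after step 7:

Step 8: determining whether the number of times the small file is accessed within a predetermined time has reached a threshold;

If so, step 9: storing the small file in a cache.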

Some files are queried repeatedly, and access frequency differs from file to file. To improve read speed, an access record is written each time a user reads a file and is used to count accesses. Files with a high access frequency are placed on a cache server. When a user reads the same file again, it only needs to be read from the cache server, which greatly reduces the time spent reading such files and improves access efficiency.
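Frequency-based cache admission along these lines might look like the sketch below. The class name `HotFileCache`, the sliding-window counting, and the explicit `now` argument are illustrative assumptions; the source specifies only that access records are counted and that frequently accessed files are cached.

```python
from collections import defaultdict, deque

class HotFileCache:
    """Admit a file to the cache once it is read min_hits times within window seconds."""
    def __init__(self, min_hits: int, window: float):
        self.min_hits = min_hits
        self.window = window
        self.hits = defaultdict(deque)   # file name -> timestamps of recent reads
        self.cache = {}                  # file name -> cached content

    def record_access(self, name: str, data: bytes, now: float) -> None:
        hits = self.hits[name]
        hits.append(now)
        while hits and now - hits[0] > self.window:
            hits.popleft()               # forget reads that fell out of the window
        if len(hits) >= self.min_hits:
            self.cache[name] = data
```

Passing the timestamp in explicitly keeps the sketch deterministic; a real deployment would read a clock instead.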

However, users' access behavior changes over time. If too many files are kept in the cache and they are files that are not used often, the cache becomes bloated. Because cache space is limited, the number of small files it can hold is limited, and the cache, a premium storage resource, is not used to full effect. To solve this technical problem, in one embodiment of the present invention, the method further comprises, after step 9:

Step 10: determining whether the time elapsed since the small file in the cache was last accessed has reached a predetermined duration T;

If so, step 11: deleting the small file from the cache.

The time since a small file was last used is checked. If that interval exceeds the threshold T, the file is less likely to be used again, and its value has fallen below the lower limit for keeping a file in the cache, so retaining it would lower the efficiency of the cache. Deleting it lets the cache hold more files that are likely to be accessed, which helps considerably in improving file read speed. By using a double-index mechanism together with a caching mechanism, the present invention raises the probability that a requested file is hit, on both the client side and the server side, and enhances the robustness of the system.

In addition, an embodiment of the present invention further provides a Hadoop-based system for accessing massive numbers of small files, comprising:

a small-file storage request module 10, configured to output a preprocessing command upon detecting that a small file is to be stored;

a small-file preprocessing module 20, connected to the small-file storage request module 10 and configured to: receive the preprocessing command; classify the small file according to a predetermined feature and then place the small file and its small-file index into a small-file queue; once the length of the small-file queue reaches a threshold, merge the plurality of small files in the queue into a large file and build a global index; after storing the correspondence in the file index, send a storage request to the NameNode; and cause the NameNode to split the large file into data blocks of a preset block size, after which the large file is stored on at least one DataNode and the DataNodes holding the data blocks, together with their state, are written to the namespace.

Because the Hadoop-based system for accessing massive numbers of small files is built on the Hadoop-based method described above, it has the same beneficial effects, and the present invention places no further limitation on this.

To further improve file read efficiency, in one embodiment of the present invention, the Hadoop-based system for accessing massive numbers of small files further comprises a small-file read request module 30 and an index prefetch module 40 connected to the small-file preprocessing module 20. The small-file read request module 30 is configured to send a pre-read request to the index prefetch module 40 upon detecting a small-file read request, and the index prefetch module 40 pre-reads, from the large file that contains the small file named in the read request, the small files related to that small file.

With pre-reading, the index prefetch module is placed between the HDFS client and the NameNode. When HDFS attempts to read a small file in a large file (for example, a MapFile produced by merging with the MapFile technique), the metadata of the other related small files in the same large file is prefetched from the NameNode. Small files in the same MapFile are correlated, and a user who reads one file often goes on to access related files. When the metadata of the related small files is held in the HDFS client cache, the client saves the steps and time of interacting with the NameNode. Under this prefetch mechanism, the number of accesses to the NameNode drops sharply, which markedly improves the NameNode's operating efficiency.

To further improve file read efficiency, in one embodiment of the present invention, the Hadoop-based system for accessing massive numbers of small files further comprises a cache module 50 connected to the index prefetch module 40. The cache module 50 is configured to store small files whose access frequency within a predetermined time reaches a threshold.

Thus, by adding a cache module, the present invention places files that are read and used frequently in the cache and exploits the inherently fast reads of a cache to improve file read efficiency.

However, users' access behavior changes over time, and a high access frequency holds only for a certain period. If the cache is clogged or filled with many files that are no longer used often, then, because its space is very limited, fewer useful files can be stored and its utilization falls. To solve this technical problem, in one embodiment of the present invention, the Hadoop-based system for accessing massive numbers of small files further comprises a cache cleanup module connected to the cache module 50. When the cache cleanup module detects that the time elapsed since a small file in the cache was last accessed has reached a predetermined duration T, it deletes that small file from the cache.

A timer is set up on the cache server to record the time elapsed since each file was last accessed. Once that interval exceeds the predetermined duration T, the system automatically deletes the file. This keeps the files in the cache regularly refreshed, so that the cache's usage frequency and efficiency stay at a consistently high level.
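The cleanup rule can be sketched as a single pass over last-access timestamps. The function name `evict_idle` and the dictionary-based cache are assumptions for illustration; `max_idle` plays the role of the predetermined duration T, and a file is removed once its idle time reaches that value.

```python
def evict_idle(cache: dict, last_access: dict, now: float, max_idle: float) -> list:
    """Delete cached files whose idle time has reached max_idle; return their names."""
    stale = [name for name, seen in last_access.items()
             if name in cache and now - seen >= max_idle]
    for name in stale:
        del cache[name]
        del last_access[name]
    return stale
```

A cache server could call this from a periodic timer, passing the current time as `now`.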

In conclusion, with the Hadoop-based method and system for accessing massive numbers of small files provided by the embodiments of the present invention, small files are first classified according to a predetermined feature and merged into a large file in the small-file queue before they are stored, and both a small-file index and a global index are built. The two together form a double index, so that a read first consults the global index and then the small-file index. Lookups are therefore faster and small files can be located quickly. Because fewer index files are needed, memory consumption and system load are reduced and access efficiency is improved. In addition, because files are stored according to a predetermined feature, storage is more efficient and read efficiency improves accordingly.

The Hadoop-based method and system for accessing massive numbers of small files provided by the present invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the above embodiments is intended only to aid understanding of the method of the invention and its core idea. It should be noted that a person of ordinary skill in the art may make several improvements and modifications to the present invention without departing from its principle, and such improvements and modifications also fall within the scope of protection of the claims of the present invention.

Claims

1. A Hadoop-based method for accessing massive numbers of small files, characterized by comprising:

Step 1: determining whether a small file needs to be saved;

If so, step 2: after classifying the small file according to a predetermined feature, placing the small file and the small-file index of the small file into a small-file queue;

Step 3: determining whether the length of the small-file queue has reached a threshold;

If so, step 4: merging the plurality of small files in the small-file queue into a large file, building a global index, and, after storing the correspondence in the file index, sending a storage request to the NameNode;

Step 5: after the NameNode splits the large file into data blocks of a preset block size, storing the large file on at least one DataNode, and writing the DataNodes holding the data blocks and the state of those DataNodes to the namespace.

2. The Hadoop-based method for accessing massive numbers of small files according to claim 1, characterized in that step 2 comprises:

classifying the small files according to their file type or creation time.

3. The Hadoop-based method for accessing massive numbers of small files according to claim 1, characterized in that step 4 comprises:

merging the plurality of small files in the small-file queue into a large file using MapFile, wherein a MapFile comprises an index part and a data part; the data part stores the data, and the index part serves as the data index of the file, recording the key of each record and the offset of that record within the file.

4. The Hadoop-based method for accessing massive numbers of small files according to claim 3, characterized by further comprising, after step 5:

Step 6: determining whether a small-file read request has been received;

Step 7: pre-reading, from the large file that contains the small file named in the read request, the small files related to that small file.

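5. The Hadoop-based method for accessing massive numbers of small files according to claim 4, characterized by further comprising, after step 7:

Step 8: determining whether the number of times the small file is accessed within a predetermined time has reached a threshold;

If so, step 9: storing the small file in a cache.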

6. The Hadoop-based method for accessing massive numbers of small files according to claim 5, characterized by further comprising, after step 9:

Step 10: determining whether the time elapsed since the small file in the cache was last accessed has reached a predetermined duration T;

If so, step 11: deleting the small file from the cache.

7. A Hadoop-based system for accessing massive numbers of small files, characterized by comprising:

a small-file storage request module, configured to output a preprocessing command upon detecting that a small file is to be stored;

a small-file preprocessing module, connected to the small-file storage request module and configured to: receive the preprocessing command; classify the small file according to a predetermined feature and then place the small file and its small-file index into a small-file queue; once the length of the small-file queue reaches a threshold, merge the plurality of small files in the queue into a large file and build a global index; after storing the correspondence in the file index, send a storage request to the NameNode; and cause the NameNode to split the large file into data blocks of a preset block size, after which the large file is stored on at least one DataNode and the DataNodes holding the data blocks, together with their state, are written to the namespace.

8. The Hadoop-based system for accessing massive numbers of small files according to claim 7, characterized by further comprising a small-file read request module and an index prefetch module connected to the small-file preprocessing module, wherein the small-file read request module is configured to send a pre-read request to the index prefetch module upon detecting a small-file read request, and the index prefetch module pre-reads, from the large file that contains the small file named in the read request, the small files related to that small file.

9. The Hadoop-based system for accessing massive numbers of small files according to claim 8, characterized by further comprising a cache module connected to the index prefetch module, wherein the cache module is configured to store small files whose access frequency within a predetermined time reaches a threshold.

10. The Hadoop-based system for accessing massive numbers of small files according to claim 9, characterized by further comprising a cache cleanup module connected to the cache module, wherein, when the cache cleanup module detects that the time elapsed since a small file in the cache was last accessed has reached a predetermined duration T, it deletes that small file from the cache.