CN110515920A - Hadoop-based method and system for accessing massive numbers of small files - Google Patents

Hadoop-based method and system for accessing massive numbers of small files

Publication number
CN110515920A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910816503.9A
Other languages
Chinese (zh)
Inventor
孙伟源
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Inspur Data Technology Co Ltd
Original Assignee
Beijing Inspur Data Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Inspur Data Technology Co Ltd
Priority to CN201910816503.9A
Publication of CN110515920A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/172 Caching, prefetching or hoarding of files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a Hadoop-based method and system for accessing massive numbers of small files. The method comprises: step 1, determining whether a small file needs to be saved; if so, step 2, classifying the small file according to a predetermined feature and placing the small file and its small-file index into a small-file queue; step 3, determining whether the length of the small-file queue has reached a threshold; if so, step 4, merging the multiple small files in the small-file queue into a large file, building a global index, storing the correspondence in the file index, and then sending a storage request to the NameNode; step 5, after the NameNode splits the large file into data blocks according to a preset block size, storing the large file on at least one DataNode, and writing the DataNode on which each data block resides and the state of that DataNode into the namespace. Because small files are classified according to a predetermined feature and merged into a large file in the small-file queue before they are stored, and both a small-file index and a global index are built, memory consumption and system load are reduced and access efficiency is improved.

Description

Hadoop-based method and system for accessing massive numbers of small files
Technical field
The present invention relates to the field of big data processing, and in particular to a Hadoop-based method and system for accessing massive numbers of small files.
Background art
Internet applications are now ubiquitous, and the massive amounts of data they generate put enormous pressure on storage and processing. Big data technology uses a set of non-conventional tools to process large volumes of structured, unstructured, and semi-structured data in order to obtain analysis and prediction results.
Applying the Hadoop framework to massive data not only provides a carrier for storing that data but also opens a new route to processing it efficiently. Hadoop provides a distributed file storage system, HDFS. HDFS can be used to store massive data that is accessed largely sequentially, and it provides a mechanism for quickly accessing specific data.
However, HDFS was designed to handle large files and runs into problems when handling small files such as pictures and documents. A small file is generally a file smaller than 10 MB. If a system holds a large number of such small files, they consume a great deal of the NameNode's memory and thereby degrade the performance of the entire HDFS cluster.
At present there is no satisfactory solution to the small-file access problem in HDFS. The SequenceFile solution that HDFS itself provides reduces NameNode memory consumption as far as possible by merging small files. A SequenceFile is a flat storage file made up of binary-serialized key/value pairs, and each key/value pair in it is treated as one record. Typically, the file name of a small file and its content form one key/value pair, so that the set of key/value pairs built from multiple small files can be packed into a single SequenceFile. SequenceFile supports compression and can compress several records together. This approach reduces NameNode memory consumption, but merging the files takes a long time, and because the keys are not sorted, looking up one small file requires traversing the entire SequenceFile, which lowers access efficiency.
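To make the lookup cost described above concrete, the following is a minimal Python sketch, not the Hadoop SequenceFile API. It models a packed file as an unsorted list of (file name, content) records; the names `pack` and `find` are hypothetical and introduced only for illustration. Finding one small file means scanning records until the key matches.

```python
from typing import Dict, List, Optional, Tuple

Record = Tuple[str, bytes]  # (small-file name, small-file content)


def pack(files: Dict[str, bytes]) -> List[Record]:
    """Pack small files into one record list, in arrival order (unsorted)."""
    return [(name, content) for name, content in files.items()]


def find(records: List[Record], name: str) -> Tuple[Optional[bytes], int]:
    """Linear scan: return the content and the number of records examined."""
    for examined, (key, value) in enumerate(records, start=1):
        if key == name:
            return value, examined
    return None, len(records)
```

In this model a lookup for a file stored last, or for a file that is absent, examines every record, which is the traversal cost the background section attributes to unsorted keys.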
Summary of the invention
The object of the present invention is to provide a Hadoop-based method and system for accessing massive numbers of small files that reduce NameNode memory consumption, improve access efficiency, and reduce system load.
To solve the above technical problem, an embodiment of the present invention provides a Hadoop-based method for accessing massive numbers of small files, comprising:
Step 1, determining whether a small file needs to be saved;
If so, step 2, classifying the small file according to a predetermined feature, and then placing the small file and the small-file index of the small file into a small-file queue;
Step 3, determining whether the length of the small-file queue has reached a threshold;
If so, step 4, merging the multiple small files in the small-file queue into a large file, building a global index, storing the correspondence in the file index, and then sending a storage request to the NameNode;
Step 5, after the NameNode splits the large file into data blocks according to a preset block size, storing the large file on at least one DataNode, and writing the DataNode on which each data block resides and the state of that DataNode into the namespace.
Wherein step 2 comprises:
Classifying the small files according to their file type or creation time.
Wherein step 4 comprises:
Merging the multiple small files in the small-file queue into a large file using MapFile, wherein a MapFile comprises an index part and a data part; the data part stores the data, and the index part serves as the data index of the file, recording the key of each record and the offset of that record in the file.
Wherein, after step 5, the method further comprises:
Step 6, determining whether a small-file read request has been received;
Step 7, pre-reading the small files related to the requested small file from the large file in which the small file named in the read request resides.
Wherein, after step 7, the method further comprises:
Step 8, determining whether the frequency with which the small file is accessed within a predetermined time has reached a threshold;
If so, step 9, storing the small file in a cache.
Wherein, after step 9, the method further comprises:
Step 10, determining whether the time elapsed since the small file in the cache was last accessed has reached a predetermined duration T;
If so, deleting the small file from the cache.
In addition, an embodiment of the present invention provides a Hadoop-based system for accessing massive numbers of small files, comprising:
A small-file storage request module, configured to output a preprocessing command upon detecting that a small file is to be stored;
A small-file preprocessing module, connected to the small-file storage request module and configured to: receive the preprocessing command; classify the small file according to a predetermined feature and then place the small file and its small-file index into a small-file queue; once the length of the small-file queue reaches a threshold, merge the multiple small files in the queue into a large file, build a global index, store the correspondence in the file index, and send a storage request to the NameNode; and control the NameNode so that, after splitting the large file into data blocks according to a preset block size, it stores the large file on at least one DataNode and writes the DataNode on which each data block resides and the state of that DataNode into the namespace.
Wherein the system further comprises a small-file read request module and an index prefetch module connected to the small-file preprocessing module. The small-file read request module is configured to send a pre-read request to the index prefetch module upon detecting a small-file read request, and the index prefetch module pre-reads the small files related to the requested small file from the large file in which that small file resides.
Wherein the system further comprises a cache module connected to the index prefetch module, the cache module being configured to store small files whose access frequency within a predetermined time reaches a threshold.
Wherein the system further comprises a cache cleanup module connected to the cache module; when the cache cleanup module detects that the time elapsed since a small file in the cache was last accessed has reached a predetermined duration T, it deletes that small file from the cache.
Compared with the prior art, the Hadoop-based method and system for accessing massive numbers of small files provided by the embodiments of the present invention have the following advantages:
Before small files are stored, they are classified according to a predetermined feature and merged into a large file in the small-file queue, and both a small-file index and a global index are built. The two indexes together form a double index, so that a read first consults the global index and then the small-file index. Queries are therefore faster and small files can be located quickly. Because fewer index files are needed, memory consumption and system load are reduced and access efficiency is improved. In addition, because files are stored according to the predetermined feature, storage efficiency is higher and read efficiency improves accordingly.
Description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are evidently only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flow diagram of the steps of one specific embodiment of the Hadoop-based method for accessing massive numbers of small files provided by an embodiment of the present invention;
Fig. 2 is a flow diagram of the steps of another specific embodiment of the Hadoop-based method for accessing massive numbers of small files provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of the connection structure of one specific embodiment of the Hadoop-based system for accessing massive numbers of small files provided by an embodiment of the present invention;
Fig. 4 is a schematic diagram of the connection structure of another specific embodiment of the Hadoop-based system for accessing massive numbers of small files provided by an embodiment of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments of the present invention. Evidently, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Referring to Figs. 1 to 4: Fig. 1 is a flow diagram of the steps of one specific embodiment of the Hadoop-based method for accessing massive numbers of small files provided by an embodiment of the present invention; Fig. 2 is a flow diagram of the steps of another specific embodiment of that method; Fig. 3 is a schematic diagram of the connection structure of one specific embodiment of the Hadoop-based system for accessing massive numbers of small files provided by an embodiment of the present invention; Fig. 4 is a schematic diagram of the connection structure of another specific embodiment of that system.
In one specific embodiment, the Hadoop-based method for accessing massive numbers of small files comprises:
Step 1, determining whether a small file needs to be saved. This step checks whether a small-file storage request exists, so that the subsequent steps are started only when needed, which saves memory. The check may be performed at fixed intervals or at random intervals, for example at an interval that the system assigns randomly within 1 to 3 s. The interval may also follow the frequency at which small files have been arriving: a high recent frequency indicates that small files are being stored on a large scale, so the detection frequency should be raised and the detection interval shortened; conversely, the detection interval can be lengthened.
If so, step 2, classifying the small file according to a predetermined feature, and then placing the small file and the small-file index of the small file into a small-file queue. The purpose of classification is to make storage and later reading easier: files that share a feature are generally stored and read together, which simplifies subsequent reads. Without classification, merely looking up the position of a small file would take considerable time, which both increases memory consumption and lengthens reads, greatly reducing access efficiency.
Step 3, determining whether the length of the small-file queue has reached a threshold. The purpose of this check is to give the large files synthesized later a uniform length: different large files are then of roughly equal length and occupy roughly equal storage space, much like packing goods into boxes, which greatly improves space utilization. The present invention does not limit the length threshold of the small-file queue; it can be adjusted automatically according to the storage space available for large files. For example, if the storage space for large files is 100 GB and 100 large files are allowed, each large file is no longer than 1 GB. If the storage space grows to 200 GB and the allowed number remains 100, each large file may be up to 2 GB. Alternatively, each large file may be kept to at most 1 GB regardless of the storage space, with only the number of large files changing with the size of the storage space. The present invention does not limit this.
If so, step 4, merging the multiple small files in the small-file queue into a large file, building a global index, storing the correspondence in the file index, and then sending a storage request to the NameNode. The global index built here and the small-file index built earlier in the small-file queue together form a double index. Subsequent reads can use this double-index structure to locate small files more quickly, which speeds up both reading and storing small files.
Step 5, after the NameNode splits the large file into data blocks according to a preset block size, storing the large file on at least one DataNode, and writing the DataNode on which each data block resides and the state of that DataNode into the namespace.
Before small files are stored, they are classified according to a predetermined feature and merged into a large file in the small-file queue, and both a small-file index and a global index are built. The two indexes together form a double index, so that a read first consults the global index and then the small-file index. Queries are therefore faster and small files can be located quickly. Because fewer index files are needed, memory consumption and system load are reduced and access efficiency is improved. In addition, because files are stored according to the predetermined feature, storage efficiency is higher and read efficiency improves accordingly.
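Steps 2 to 4 and the double-index read path can be sketched in a few lines of Python. This is an illustrative in-memory model under stated assumptions, not the patented implementation and not Hadoop code: the class `SmallFileMerger`, the threshold value, the naming of merged files, and the layout of both indexes (global index: small-file name to large-file name; small-file index: name to offset and length) are all hypothetical, and the storage request to the NameNode is only marked by a comment.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

QUEUE_THRESHOLD = 3  # assumed value; the text leaves the threshold open


class SmallFileMerger:
    """Queues classified small files and merges each full queue into one large file."""

    def __init__(self, threshold: int = QUEUE_THRESHOLD) -> None:
        self.threshold = threshold
        self.queues: Dict[str, List[Tuple[str, bytes]]] = defaultdict(list)
        self.global_index: Dict[str, str] = {}  # small-file name -> large-file name
        # large-file name -> (merged bytes, small-file index: name -> (offset, length))
        self.large_files: Dict[str, Tuple[bytes, Dict[str, Tuple[int, int]]]] = {}
        self._merged = 0

    def save(self, name: str, content: bytes) -> None:
        category = name.rsplit(".", 1)[-1]   # step 2: classify by file type
        queue = self.queues[category]
        queue.append((name, content))
        if len(queue) >= self.threshold:     # step 3: queue length check
            self._merge(category)            # step 4: merge and index

    def _merge(self, category: str) -> None:
        large_name = f"{category}-{self._merged}.merged"
        self._merged += 1
        data = bytearray()
        local_index: Dict[str, Tuple[int, int]] = {}
        for name, content in self.queues.pop(category):
            local_index[name] = (len(data), len(content))
            data += content
            self.global_index[name] = large_name
        # A real system would now send a storage request to the NameNode.
        self.large_files[large_name] = (bytes(data), local_index)

    def read(self, name: str) -> bytes:
        large_name = self.global_index[name]            # first level: global index
        data, local_index = self.large_files[large_name]
        offset, length = local_index[name]              # second level: small-file index
        return data[offset:offset + length]
```

In this sketch a file becomes readable only after its queue has been merged; how reads of still-queued files are served is not specified in the text and is left out here.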
In the present invention, small files undergo preprocessing before storage: they are first classified and then merged into a large file. The present invention does not limit the manner of or requirements for classification. In one embodiment of the present invention, step 2 comprises:
Classifying the small files according to their file type or creation time.
It should be noted that the present invention generally uses a single classification criterion, for example only the file type or only the creation time. With mixed criteria, one small file could belong to both a file-type class and a creation-time class and would be hard to assign. The present invention may of course use other criteria and is not limited in this respect.
After classification, the small files are merged into a large file, which in effect standardizes small-file storage. Once they have been merged, a read first accesses the large file in the usual way and then reads the small file within it. The present invention does not limit the manner of merging. In one embodiment, step 4 comprises:
Merging the multiple small files in the small-file queue into a large file using MapFile, wherein a MapFile comprises an index part and a data part; the data part stores the data, and the index part serves as the data index of the file, recording the key of each record and the offset of that record in the file.
When a MapFile is accessed, its index file is loaded into memory, and the index mapping quickly locates the position in the file of the specified record, which greatly improves retrieval efficiency and hence access efficiency.
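A single-criterion classifier of the kind described here might look like the following Python sketch. The function name `classify`, the `by` parameter, the "unknown" fallback, and the choice of one calendar day as the creation-time bucket are assumptions made for illustration; the text does not fix any of them.

```python
from datetime import datetime
from pathlib import PurePosixPath


def classify(path: str, created: datetime, by: str = "type") -> str:
    """Return one category key for a small file, using exactly one criterion."""
    if by == "type":
        # File extension without the dot, lower-cased; "unknown" if absent.
        return PurePosixPath(path).suffix.lstrip(".").lower() or "unknown"
    if by == "created":
        # Bucket by creation day (the granularity is an assumption).
        return created.strftime("%Y-%m-%d")
    raise ValueError(f"unsupported criterion: {by}")
```

Because each call applies only one criterion, every file maps to exactly one category, which avoids the ambiguity of mixed classification noted above.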
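The index-plus-data idea can be illustrated with a small in-memory Python model. It is a sketch under stated assumptions and does not use the `org.apache.hadoop.io.MapFile` API: records are written in sorted key order, the index keeps a key and an (offset, length) pair for every record, and lookup is a binary search. The function names `build_mapfile` and `lookup` are hypothetical.

```python
import bisect
from typing import Dict, List, Optional, Tuple


def build_mapfile(files: Dict[str, bytes]) -> Tuple[bytes, List[str], List[Tuple[int, int]]]:
    """Return (data part, sorted keys, per-key (offset, length)) for the given files."""
    data = bytearray()
    keys: List[str] = []
    positions: List[Tuple[int, int]] = []
    for key in sorted(files):  # records are written in key order
        keys.append(key)
        positions.append((len(data), len(files[key])))
        data += files[key]
    return bytes(data), keys, positions


def lookup(data: bytes, keys: List[str], positions: List[Tuple[int, int]],
           key: str) -> Optional[bytes]:
    """Binary-search the in-memory index, then read the record at its offset."""
    i = bisect.bisect_left(keys, key)
    if i == len(keys) or keys[i] != key:
        return None
    offset, length = positions[i]
    return data[offset:offset + length]
```

Compared with scanning an unsorted record list, the binary search touches only a logarithmic number of index entries before a single read from the data part.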
The present invention preprocesses small files, changing the way they are stored so that they are easier to read. A better reading mechanism can improve access efficiency further. In one embodiment of the present invention, the Hadoop-based method for accessing massive numbers of small files further comprises, after step 5:
Step 6, determining whether a small-file read request has been received;
Step 7, pre-reading the small files related to the requested small file from the large file in which the small file named in the read request resides.
After a small-file read request is received, the small file and the files related to it are pre-read, which saves the steps and time of interacting with the NameNode. Under this prefetch mechanism, the number of accesses to the NameNode drops considerably, which noticeably improves the operating efficiency of the NameNode.
In one embodiment, when HDFS attempts to read a small file in a MapFile, the metadata of the other related small files in the same MapFile is prefetched from the NameNode. Small files in the same MapFile are correlated, and a user reading one file often goes on to access related files. Once the metadata of the related small files is held in the HDFS client cache, the client no longer needs to interact with the NameNode for them. The number of accesses to the NameNode therefore drops considerably, which noticeably improves the operating efficiency of the NameNode, reduces the memory it consumes, and lowers system load.
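The effect of metadata prefetching on NameNode traffic can be shown with a toy Python model. Everything here is hypothetical and stands in for components the text only names: `FakeNameNode` is not an HDFS class, the metadata is reduced to one integer offset per small file, and small-file names are assumed to be unique across large files.

```python
from typing import Dict


class FakeNameNode:
    """Stand-in for the NameNode: serves the metadata of all small files in a large file."""

    def __init__(self, layout: Dict[str, Dict[str, int]]) -> None:
        self.layout = layout  # large file -> {small file -> offset}
        self.requests = 0     # number of metadata round trips served

    def metadata_for(self, large_file: str) -> Dict[str, int]:
        self.requests += 1
        return dict(self.layout[large_file])


class PrefetchingClient:
    """Client that caches metadata for every small file sharing a large file."""

    def __init__(self, namenode: FakeNameNode, global_index: Dict[str, str]) -> None:
        self.namenode = namenode
        self.global_index = global_index      # small file -> large file
        self.meta_cache: Dict[str, int] = {}  # client-side metadata cache

    def locate(self, small_file: str) -> int:
        if small_file not in self.meta_cache:
            large_file = self.global_index[small_file]
            # One round trip prefetches metadata for all related small files.
            self.meta_cache.update(self.namenode.metadata_for(large_file))
        return self.meta_cache[small_file]
```

In this model, reading a second file from the same large file costs no further NameNode request, which is the saving the paragraph above describes.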
To further improve read efficiency, in one embodiment of the present invention, the method further comprises, after step 7:
Step 8, determining whether the frequency with which the small file is accessed within a predetermined time has reached a threshold;
If so, step 9, storing the small file in a cache.
Some files are queried repeatedly, and access frequency differs from file to file. To improve read speed, an access record is written each time a user reads a file and is used to count accesses. Files with a high access frequency are placed on a cache server that acts as the cache. When a user reads the same file again, it only needs to be read from the cache server, which greatly shortens the time needed to read such files and improves access efficiency.
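One way to realize the admission rule of steps 8 and 9 is a sliding-window counter, sketched below in Python. The class `FrequencyCache` and its parameters `min_hits` and `window` are assumptions for illustration; the text specifies only that a file is cached once its access frequency within a predetermined time reaches a threshold. Timestamps are passed in explicitly so that the behavior is deterministic.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class FrequencyCache:
    """Caches a file once it has been read `min_hits` times within `window` seconds."""

    def __init__(self, min_hits: int, window: float) -> None:
        self.min_hits = min_hits
        self.window = window
        self.access_log: Dict[str, Deque[float]] = defaultdict(deque)
        self.cache: Dict[str, bytes] = {}

    def record_read(self, name: str, content: bytes, now: float) -> bool:
        """Log one read at time `now`; return True if the file is cached afterwards."""
        log = self.access_log[name]
        log.append(now)
        while log and now - log[0] > self.window:  # drop reads outside the window
            log.popleft()
        if len(log) >= self.min_hits:
            self.cache[name] = content
        return name in self.cache
```

A production system would typically take `now` from a monotonic clock and bound the size of the access log; both are omitted to keep the sketch short.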
However, users' access behavior changes over time. If too many files are held in the cache and they are no longer frequently used, the cache becomes bloated. Because cache space is limited, so is the number of small files it can hold, and the cache, a high-quality storage resource, is then not used to full effect. To solve this technical problem, in one embodiment of the present invention, the method further comprises, after step 9:
Step 10, determining whether the time elapsed since the small file in the cache was last accessed has reached a predetermined duration T;
If so, step 11, deleting the small file from the cache.
If the time since a small file was last used exceeds the threshold T, the probability that it will be used again has fallen and so has its value; it has dropped below the minimum value that justifies keeping a file in the cache, and keeping it would lower the utilization of the cache. Deleting it lets the cache hold more files with a high probability of being accessed, which helps greatly in raising file read speed. By using the double-index mechanism and the caching mechanism, the present invention raises the probability that file accesses are served quickly, on both the client side and the server side, and enhances the robustness of the system.
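The idle-time rule of steps 10 and 11 can be sketched as follows in Python. The class `IdleEvictingCache`, the method names, and the decision to run eviction as an explicit `evict_idle` call are assumptions for illustration; the text states only that a file is deleted once the time since its last access reaches the predetermined duration T (here `max_idle`).

```python
from typing import Dict, List, Optional, Tuple


class IdleEvictingCache:
    """Removes entries whose last access is at least `max_idle` seconds (T) old."""

    def __init__(self, max_idle: float) -> None:
        self.max_idle = max_idle
        self.entries: Dict[str, Tuple[bytes, float]] = {}  # name -> (content, last access)

    def put(self, name: str, content: bytes, now: float) -> None:
        self.entries[name] = (content, now)

    def get(self, name: str, now: float) -> Optional[bytes]:
        entry = self.entries.get(name)
        if entry is None:
            return None
        self.entries[name] = (entry[0], now)  # refresh the last-access time
        return entry[0]

    def evict_idle(self, now: float) -> List[str]:
        """Delete and return the names of entries idle for at least `max_idle`."""
        stale = [n for n, (_, last) in self.entries.items()
                 if now - last >= self.max_idle]
        for n in stale:
            del self.entries[n]
        return stale
```

In a running system `evict_idle` might be driven by a periodic timer and `now` taken from `time.monotonic()`; passing the time in explicitly keeps the example testable.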
In addition, an embodiment of the present invention provides a Hadoop-based system for accessing massive numbers of small files, comprising:
A small-file storage request module 10, configured to output a preprocessing command upon detecting that a small file is to be stored;
A small-file preprocessing module 20, connected to the small-file storage request module 10 and configured to: receive the preprocessing command; classify the small file according to a predetermined feature and then place the small file and its small-file index into a small-file queue; once the length of the small-file queue reaches a threshold, merge the multiple small files in the queue into a large file, build a global index, store the correspondence in the file index, and send a storage request to the NameNode; and control the NameNode so that, after splitting the large file into data blocks according to a preset block size, it stores the large file on at least one DataNode and writes the DataNode on which each data block resides and the state of that DataNode into the namespace.
Because the Hadoop-based system for accessing massive numbers of small files is based on the Hadoop-based method described above, it has the same beneficial effects; the present invention does not limit this.
To further improve file read efficiency, in one embodiment of the present invention, the Hadoop-based system for accessing massive numbers of small files further comprises a small-file read request module 30 and an index prefetch module 40 connected to the small-file preprocessing module 20. The small-file read request module 30 is configured to send a pre-read request to the index prefetch module 40 upon detecting a small-file read request, and the index prefetch module 40 pre-reads the small files related to the requested small file from the large file in which that small file resides.
With this pre-reading approach, the index prefetch module is placed between the HDFS client and the NameNode. When HDFS attempts to read a small file in a large file (for example, a large file merged using the MapFile technique), the metadata of the other related small files in the same large file is prefetched from the NameNode. Small files in the same MapFile are correlated, and a user reading one file often goes on to access related files. Once the metadata of the related small files is held in the HDFS client cache, the client no longer needs to interact with the NameNode for them. Under this prefetch mechanism, the number of accesses to the NameNode drops considerably, which noticeably improves the operating efficiency of the NameNode.
To further improve file read efficiency, in one embodiment of the present invention, the Hadoop-based system for accessing massive numbers of small files further comprises a cache module 50 connected to the index prefetch module 40, the cache module 50 being configured to store small files whose access frequency within a predetermined time reaches a threshold.
Some files are queried repeatedly, and access frequency differs from file to file. To improve read speed, an access record is written each time a user reads a file and is used to count accesses. Files with a high access frequency are placed on a cache server that acts as the cache. When a user reads the same file again, it only needs to be read from the cache server, which greatly shortens the time needed to read such files and improves access efficiency.
Thus, by adding a cache module, the present invention places files that are read and used with high frequency in the cache and exploits the cache's inherently fast reads to improve file read efficiency.
However, users' access behavior changes over time, and a high access frequency holds only for a certain period. If the cache is clogged or flooded with many files that are no longer frequently used, then, because its space is very limited, it can hold fewer useful files and its utilization falls. To solve this technical problem, in one embodiment of the present invention, the Hadoop-based system for accessing massive numbers of small files further comprises a cache cleanup module connected to the cache module 50; when the cache cleanup module detects that the time elapsed since a small file in the cache was last accessed has reached a predetermined duration T, it deletes that small file from the cache.
A timer is set up in the cache server to record the time elapsed since each file was last accessed. When this interval exceeds the predetermined duration T, the system automatically deletes the file. The files in the cache are thus refreshed regularly, so that the cache's usage frequency and utilization stay at a consistently high level.
In summary, in the Hadoop-based method and system for accessing massive numbers of small files provided by the embodiments of the present invention, small files are classified according to a predetermined feature and merged into a large file in the small-file queue before they are stored, and both a small-file index and a global index are built. The two indexes together form a double index, so that a read first consults the global index and then the small-file index. Queries are therefore faster and small files can be located quickly. Because fewer index files are needed, memory consumption and system load are reduced and access efficiency is improved. In addition, because files are stored according to the predetermined feature, storage efficiency is higher and read efficiency improves accordingly.
The Hadoop-based method and system for accessing massive numbers of small files provided by the present invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the present invention, and the description of the above embodiments is only intended to help in understanding the method of the present invention and its core idea. It should be noted that a person of ordinary skill in the art may make several improvements and modifications to the present invention without departing from its principle, and these improvements and modifications also fall within the protection scope of the claims of the present invention.

Claims (10)

1. A Hadoop-based method for accessing massive small files, characterized by comprising:
Step 1: determining whether a small file needs to be saved;
If so, Step 2: after classifying the small file according to a predetermined feature, placing the small file and the small file index of the small file into a small file queue;
Step 3: determining whether the length of the small file queue has reached a threshold;
If so, Step 4: merging the multiple small files in the small file queue into a large file, establishing a global index, and, after storing the correspondence in the file index, sending a storage request to the NameNode;
Step 5: the NameNode splitting the large file into data blocks according to a preset block size, storing the large file on at least one DataNode, and writing the DataNode on which each data block resides and the state of that DataNode into the namespace.
2. The Hadoop-based method for accessing massive small files according to claim 1, characterized in that Step 2 comprises:
classifying the small files according to the file type or the creation time of the small files.
3. The Hadoop-based method for accessing massive small files according to claim 1, characterized in that Step 4 comprises:
merging the multiple small files in the small file queue into a large file using MapFile, wherein the MapFile comprises an index part and a data part; the data part is used to store the data, and the index part serves as the data index of the file, recording the key of each record and the offset of that record within the file.
4. The Hadoop-based method for accessing massive small files according to claim 3, characterized in that, after Step 5, the method further comprises:
Step 6: determining whether a small file read request has been received;
Step 7: pre-reading, from the large file containing the small file targeted by the small file read request, the small files related to that small file.
5. The Hadoop-based method for accessing massive small files according to claim 4, characterized in that, after Step 7, the method further comprises:
Step 8: determining whether the frequency with which the small file is accessed within a predetermined time reaches a threshold;
If so, Step 9: storing the small file in a cache.
6. The Hadoop-based method for accessing massive small files according to claim 5, characterized in that, after Step 9, the method further comprises:
Step 10: determining whether the time interval since the small file in the cache was last accessed has reached a predetermined duration T;
If so, Step 11: deleting the small file from the cache.
7. A Hadoop-based system for accessing massive small files, characterized by comprising:
a small file storage request module, configured to output a preprocessing command after detecting that a small file is to be stored;
a small file preprocessing module, connected to the small file storage request module and configured to: receive the preprocessing command; after classifying the small file according to a predetermined feature, place the small file and the small file index of the small file into a small file queue; after the length of the small file queue reaches a threshold, merge the multiple small files in the small file queue into a large file, establish a global index, and, after storing the correspondence in the file index, send a storage request to the NameNode; and control the NameNode to split the large file into data blocks according to a preset block size, store the large file on at least one DataNode, and write the DataNode on which each data block resides and the state of that DataNode into the namespace.
8. The Hadoop-based system for accessing massive small files according to claim 7, characterized by further comprising a small file read request module and an index prefetch module connected to the small file preprocessing module, wherein the small file read request module is configured to send a pre-read request to the index prefetch module after detecting a small file read request, and the index prefetch module pre-reads, from the large file containing the small file targeted by the small file read request, the small files related to that small file.
9. The Hadoop-based system for accessing massive small files according to claim 8, characterized by further comprising a cache module connected to the index prefetch module, wherein the cache module is configured to store the small files whose access frequency within a predetermined time reaches a threshold.
10. The Hadoop-based system for accessing massive small files according to claim 9, characterized by further comprising a cache cleanup module connected to the cache module, wherein the cache cleanup module deletes a small file from the cache after detecting that the time interval since that small file was last accessed has reached the predetermined duration T.
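The read path recited in claims 4, 5, 8 and 9 (pre-reading related small files, then caching frequently accessed ones) can be illustrated with a short sketch. Several points are assumptions made only for illustration: `PrefetchingReader` and its fields are hypothetical names, "related" small files are taken to be all small files merged into the same large file, nested dictionaries stand in for the merged files on HDFS, and the "predetermined time" window for counting accesses is omitted, so accesses are simply counted over the lifetime of the object.

```python
from collections import Counter


class PrefetchingReader:
    """Hypothetical reader that pre-reads sibling files and caches hot ones."""

    def __init__(self, big_files, hot_threshold):
        # big_files: big file id -> {small file name: content}
        self.big_files = big_files
        # Reverse map playing the role of the global index.
        self.owner = {
            name: big_id
            for big_id, files in big_files.items()
            for name in files
        }
        self.hot_threshold = hot_threshold
        self.hits = Counter()  # small file name -> number of reads so far
        self.prefetched = {}   # small files pre-read with the last big-file access
        self.cache = {}        # small files promoted after frequent access

    def read(self, name):
        self.hits[name] += 1
        if name in self.cache:
            return self.cache[name]
        if name in self.prefetched:
            content = self.prefetched[name]
        else:
            # Pre-read every small file stored in the same big file.
            siblings = self.big_files[self.owner[name]]
            self.prefetched = dict(siblings)
            content = siblings[name]
        if self.hits[name] >= self.hot_threshold:
            self.cache[name] = content  # frequency reached the threshold
        return content
```

A fuller version would reset the access counters at the end of each counting window and pair the cache with the idle-time cleanup of claims 6 and 10.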
CN201910816503.9A 2019-08-30 2019-08-30 A kind of mass small documents access method and system based on Hadoop Pending CN110515920A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910816503.9A CN110515920A (en) 2019-08-30 2019-08-30 A kind of mass small documents access method and system based on Hadoop

Publications (1)

Publication Number Publication Date
CN110515920A true CN110515920A (en) 2019-11-29

Family

ID=68629643

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910816503.9A Pending CN110515920A (en) 2019-08-30 2019-08-30 A kind of mass small documents access method and system based on Hadoop

Country Status (1)

Country Link
CN (1) CN110515920A (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102332029A (en) * 2011-10-15 2012-01-25 西安交通大学 Hadoop-based mass classifiable small file association storage method
CN102332027A (en) * 2011-10-15 2012-01-25 西安交通大学 Mass non-independent small file associated storage method based on Hadoop
CN102902716A (en) * 2012-08-27 2013-01-30 苏州两江科技有限公司 Storage system based on Hadoop distributed computing platform
CN103559229A (en) * 2013-10-22 2014-02-05 西安电子科技大学 Small file management service (SFMS) system based on MapFile and use method thereof
CN103856567A (en) * 2014-03-26 2014-06-11 西安电子科技大学 Small file storage method based on Hadoop distributed file system
CN105183839A (en) * 2015-09-02 2015-12-23 华中科技大学 Hadoop-based storage optimizing method for small file hierachical indexing
CN105956183A (en) * 2016-05-30 2016-09-21 广东电网有限责任公司电力调度控制中心 Method and system for multi-stage optimization storage of a lot of small files in distributed database
CN106909651A (en) * 2017-02-23 2017-06-30 郑州云海信息技术有限公司 A kind of method for being write based on HDFS small documents and being read
CN107045531A (en) * 2017-01-20 2017-08-15 郑州云海信息技术有限公司 A kind of system and method for optimization HDFS small documents access
CN109800208A (en) * 2019-01-18 2019-05-24 湖南友道信息技术有限公司 Network traceability system and its data processing method, computer storage medium

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110968272B (en) * 2019-12-16 2021-01-01 华中科技大学 Time sequence prediction-based method and system for optimizing storage performance of mass small files
CN110968272A (en) * 2019-12-16 2020-04-07 华中科技大学 Time sequence prediction-based method and system for optimizing storage performance of mass small files
CN113127548A (en) * 2019-12-31 2021-07-16 奇安信科技集团股份有限公司 File merging method, device, equipment and storage medium
CN113127548B (en) * 2019-12-31 2023-10-31 奇安信科技集团股份有限公司 File merging method, device, equipment and storage medium
CN113407620B (en) * 2020-03-17 2023-04-21 北京信息科技大学 Data block placement method and system based on heterogeneous Hadoop cluster environment
CN113407620A (en) * 2020-03-17 2021-09-17 北京信息科技大学 Data block placement method and system based on heterogeneous Hadoop cluster environment
CN111475469A (en) * 2020-03-19 2020-07-31 中山大学 Virtual file system-based small file storage optimization system in KUBERNETES user mode application
CN111475469B (en) * 2020-03-19 2021-12-14 中山大学 Virtual file system-based small file storage optimization system in KUBERNETES user mode application
CN113590566B (en) * 2021-06-23 2023-10-27 河海大学 Method, device, equipment and storage medium for optimizing sequence file storage based on heap structure
CN113590566A (en) * 2021-06-23 2021-11-02 河海大学 Stack structure-based sequence File storage optimization method, device, equipment and storage medium
CN114116612A (en) * 2021-11-15 2022-03-01 长沙理工大学 B + tree index-based access method for archived files
CN114116612B (en) * 2021-11-15 2024-06-07 长沙理工大学 Access method for index archive file based on B+ tree
CN115269524B (en) * 2022-09-26 2023-03-24 创云融达信息技术(天津)股份有限公司 Integrated system and method for end-to-end small file collection transmission and storage
CN115269524A (en) * 2022-09-26 2022-11-01 创云融达信息技术(天津)股份有限公司 Integrated system and method for end-to-end small file collection transmission and storage
CN115858249A (en) * 2022-12-30 2023-03-28 北京迪艾尔软件技术有限公司 Backup method for massive unstructured data files
CN115858249B (en) * 2022-12-30 2024-07-09 北京迪艾尔软件技术有限公司 Backup method for massive unstructured data files

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191129