CN110515920A - A Hadoop-based method and system for accessing massive numbers of small files
- Publication number
- CN110515920A (application number CN201910816503.9A)
- Authority
- CN
- China
- Prior art keywords
- small files
- file
- index
- small
- hadoop
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/13—File access structures, e.g. distributed indices
- G06F16/17—Details of further file system functions
- G06F16/172—Caching, prefetching or hoarding of files
- G06F16/18—File system types
- G06F16/182—Distributed file systems
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a Hadoop-based method and system for accessing massive numbers of small files. The method includes: step 1, determining whether a small file needs to be saved; if so, step 2, classifying the small file according to a predetermined feature and placing the small file and its small file index into a small file queue; step 3, determining whether the length of the small file queue has reached a threshold; if so, step 4, merging the multiple small files in the small file queue into a big file, establishing a global index, storing the correspondence in the file index, and then sending a storage request to the NameNode; step 5, after the NameNode splits the big file into data blocks of a preset block size, storing the big file on at least one DataNode and writing the DataNode where each data block is located, together with the state of that DataNode, into the namespace. Because small files are classified according to a predetermined feature and merged into big files in the small file queue before they are stored, and because a small file index and a global index are established, memory consumption and system load are reduced and access efficiency is improved.
Description
Technical field
The present invention relates to the technical field of big data processing, and in particular to a Hadoop-based method and system for accessing massive numbers of small files.
Background art
At present, Internet applications are ubiquitous, and the massive data they generate puts enormous pressure on storage and processing. Big data technology refers to the use of a range of unconventional tools to process large amounts of structured, unstructured, and semi-structured data in order to obtain analysis and prediction results.
Applying the Hadoop framework to big data processing not only provides a carrier for storing massive data but also offers a new way to process that data efficiently. Hadoop provides a distributed file storage system, HDFS. HDFS can be used to store massive data that is mostly accessed sequentially, and it provides a mechanism for quickly accessing specific data.
However, HDFS was designed to handle big files, and problems arise when it handles small files such as pictures and documents. A small file generally means a file smaller than 10 MB. If a system contains a large number of such small files, they consume a great deal of the NameNode's memory and thereby degrade the performance of the entire HDFS cluster.
At present there is no very good solution to the problem of small file access in HDFS. The SequenceFile solution that HDFS itself provides minimizes NameNode memory consumption by merging small files. A SequenceFile is a storage file made up of binary-serialized key/value bytes, and each key/value pair in it is treated as one record. In general, the file name and the file content of a small file can be used to construct a key-value pair, so that a set of key-value pairs built from multiple small files can be packed into one SequenceFile. SequenceFile supports compression and can compress several records together. This approach reduces the memory consumption of the NameNode, but merging the files takes a long time, and because the keys are not sorted, looking up one small file requires traversing the entire SequenceFile, which reduces access efficiency.
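The lookup cost described above can be illustrated with a minimal Python sketch. It is only a model, assuming a plain list of (file name, content) records kept in arrival order; it does not use the Hadoop SequenceFile API, and the names `pack` and `linear_lookup` are hypothetical.

```python
# Illustrative model only: a list of (key, value) records in arrival order,
# standing in for an unsorted key/value container. Not the Hadoop API.
from typing import Dict, List, Optional, Tuple

Record = Tuple[str, bytes]


def pack(files: Dict[str, bytes]) -> List[Record]:
    """Pack small files as (file name, content) records in insertion order."""
    return [(name, content) for name, content in files.items()]


def linear_lookup(records: List[Record], name: str) -> Tuple[Optional[bytes], int]:
    """Scan records from the start; return (content, number of records examined)."""
    examined = 0
    for key, value in records:
        examined += 1
        if key == name:
            return value, examined
    return None, examined


records = pack({"c.png": b"C", "a.png": b"A", "b.png": b"B"})
content, examined = linear_lookup(records, "b.png")
```

Because the keys are unordered, a lookup for the last record, or for a missing name, has to examine every record in the container.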
Summary of the invention
The object of the present invention is to provide a Hadoop-based method and system for accessing massive numbers of small files that reduce NameNode memory consumption, improve access efficiency, and reduce system load.
To solve the above technical problem, an embodiment of the present invention provides a Hadoop-based method for accessing massive numbers of small files, comprising:
Step 1, determine whether a small file needs to be saved;
If so, step 2, after classifying the small file according to a predetermined feature, place the small file and the small file index of the small file into a small file queue;
Step 3, determine whether the length of the small file queue has reached a threshold;
If so, step 4, merge the multiple small files in the small file queue into a big file, establish a global index, store the correspondence in the file index, and then send a storage request to the NameNode;
Step 5, after the NameNode splits the big file into data blocks of a preset block size, store the big file on at least one DataNode, and write the DataNode where each data block is located and the state of that DataNode into the namespace.
Wherein step 2 comprises:
classifying the small files according to their file type or creation time.
Wherein step 4 comprises:
merging the multiple small files in the small file queue into a big file using MapFile, wherein a MapFile includes an index part and a data part; the data part stores the data, and the index part serves as the data index of the file, recording the key of each record and the offset of that record in the file.
Wherein, after step 5, the method further comprises:
Step 6, determine whether a small file read request has been received;
Step 7, prefetch, from the big file that contains the small file named in the small file read request, the small files related to that small file.
Wherein, after step 7, the method further comprises:
Step 8, determine whether the frequency with which the small file is accessed within a predetermined time has reached a threshold;
If so, step 9, store the small file in a cache.
Wherein, after step 9, the method further comprises:
Step 10, determine whether the time elapsed since a small file in the cache was last accessed has reached a predetermined duration T;
If so, step 11, delete the small file from the cache.
In addition, an embodiment of the present invention further provides a Hadoop-based system for accessing massive numbers of small files, comprising:
a small file storage request module, configured to output a preprocessing command after detecting that a small file is to be stored;
a small file preprocessing module, connected to the small file storage request module and configured to: receive the preprocessing command; classify the small file according to a predetermined feature and then place the small file and its small file index into a small file queue; after the length of the small file queue reaches a threshold, merge the multiple small files in the small file queue into a big file, establish a global index, store the correspondence in the file index, and then send a storage request to the NameNode; and control the NameNode so that, after splitting the big file into data blocks of a preset block size, it stores the big file on at least one DataNode and writes the DataNode where each data block is located and the state of that DataNode into the namespace.
Wherein the system further includes a small file read request module connected to the small file preprocessing module, and an index prefetching module. The small file read request module is configured to send a prefetch request to the index prefetching module after detecting a small file read request, and the index prefetching module prefetches, from the big file that contains the small file named in the small file read request, the small files related to that small file.
Wherein the system further includes a cache module connected to the index prefetching module, the cache module being configured to store the small files whose access frequency within a predetermined time reaches a threshold.
Wherein the system further includes a cache cleaning module connected to the cache module; the cache cleaning module deletes a small file from the cache after detecting that the time elapsed since the small file in the cache was last accessed has reached a predetermined duration T.
Compared with the prior art, the Hadoop-based method and system for accessing massive numbers of small files provided by the embodiments of the present invention have the following advantages:
Before small files are stored, they are first classified according to a predetermined feature and merged into big files in the small file queue, and a small file index and a global index are established. During reading, these two indexes form a double index: a lookup goes first through the global index and then through the small file index, so queries are faster and small files can be located quickly. Because fewer index files are needed, memory consumption and system load are reduced and access efficiency is improved. In addition, because files are stored according to a predetermined feature, storage is more efficient and reading efficiency improves accordingly.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flow diagram of the steps of one specific embodiment of the Hadoop-based method for accessing massive numbers of small files provided by an embodiment of the present invention;
Fig. 2 is a flow diagram of the steps of another specific embodiment of the Hadoop-based method for accessing massive numbers of small files provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of the connection structure of one specific embodiment of the Hadoop-based system for accessing massive numbers of small files provided by an embodiment of the present invention;
Fig. 4 is a schematic diagram of the connection structure of another specific embodiment of the Hadoop-based system for accessing massive numbers of small files provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
Please refer to Fig. 1 to Fig. 4. Fig. 1 is a flow diagram of the steps of one specific embodiment of the Hadoop-based method for accessing massive numbers of small files provided by an embodiment of the present invention; Fig. 2 is a flow diagram of the steps of another specific embodiment of that method; Fig. 3 is a schematic diagram of the connection structure of one specific embodiment of the Hadoop-based system for accessing massive numbers of small files provided by an embodiment of the present invention; Fig. 4 is a schematic diagram of the connection structure of another specific embodiment of that system.
In one specific embodiment, the Hadoop-based method for accessing massive numbers of small files comprises:
Step 1, determine whether a small file needs to be saved. Here the method checks whether there is a small file storage request; the subsequent steps are started only when there is one, which saves memory. Detection of small file storage requests may be performed at fixed intervals or at random intervals, for example with the system assigning a random interval of 1 to 3 s. The interval may also be set according to how frequently small files have appeared previously: if the frequency was high, a large-scale small file storage operation is under way, so the detection frequency should be raised and the detection interval shortened; conversely, the detection interval can be lengthened.
If so, step 2, after classifying the small file according to a predetermined feature, place the small file and the small file index of the small file into a small file queue. The purpose of classification is to make storage and subsequent reading easier: files with the same feature can generally be stored and read together, which simplifies later reads. Without classification, merely locating a small file would take considerable time, which both increases memory consumption and lengthens reads, greatly reducing access efficiency.
Step 3, determine whether the length of the small file queue has reached a threshold. The purpose of this check is to give the big files synthesized later a uniform length: different big files are then of roughly equal length and occupy roughly equal storage space, much like packing goods into boxes of one size, which greatly improves space utilization. The present invention does not limit the length threshold of the small file queue; it may be adjusted automatically according to the size of the storage space for big files. For example, if the storage space for big files is 100 GB and 100 big files are allowed, each big file is no larger than 1 GB. If the storage space grows to 200 GB and the allowed number is still 100, each big file becomes no larger than 2 GB. Alternatively, each big file may be kept at no more than 1 GB regardless of the storage space, with only the number of big files changing according to the size of the storage space. The present invention does not limit this.
If so, step 4, merge the multiple small files in the small file queue into a big file, establish a global index, store the correspondence in the file index, and then send a storage request to the NameNode. The global index established here, together with the small file index formed earlier in the small file queue, makes up a double index. Subsequent reads can use this double-index structure to locate small files more quickly, so both reading and storing small files become faster.
Step 5, after the NameNode splits the big file into data blocks of a preset block size, store the big file on at least one DataNode, and write the DataNode where each data block is located and the state of that DataNode into the namespace.
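Steps 2 to 5 can be pictured with the following Python sketch. It is a simplified in-memory model under stated assumptions, not the disclosed implementation: the queue length is taken to be the number of queued files, the big file is a plain byte concatenation, and the names `SmallFileQueue`, `MergedFile`, and `split_into_blocks` are hypothetical. No HDFS, NameNode, or DataNode API is involved.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class MergedFile:
    file_id: str
    data: bytes
    # Small file index: name -> (offset, length) inside `data`.
    small_file_index: Dict[str, Tuple[int, int]]


@dataclass
class SmallFileQueue:
    threshold: int  # assumption: queue length counted in files
    pending: List[Tuple[str, bytes]] = field(default_factory=list)

    def put(self, name: str, content: bytes) -> bool:
        """Queue a small file; return True once the threshold is reached."""
        self.pending.append((name, content))
        return len(self.pending) >= self.threshold

    def merge(self, file_id: str, global_index: Dict[str, str]) -> MergedFile:
        """Concatenate the queued files, recording offsets and the global mapping."""
        buf = bytearray()
        index: Dict[str, Tuple[int, int]] = {}
        for name, content in self.pending:
            index[name] = (len(buf), len(content))
            buf.extend(content)
            global_index[name] = file_id  # global index: small file -> big file
        self.pending.clear()
        return MergedFile(file_id, bytes(buf), index)


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Split a merged file into fixed-size blocks; the last one may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


global_index: Dict[str, str] = {}
queue = SmallFileQueue(threshold=3)
merged = None
for name, content in [("a.txt", b"aaaa"), ("b.txt", b"bb"), ("c.txt", b"cccccc")]:
    if queue.put(name, content):
        merged = queue.merge("big-0001", global_index)

blocks = split_into_blocks(merged.data, block_size=5)
```

In this model the merge is triggered by the third file, and the 12-byte big file is then cut into blocks of at most 5 bytes, standing in for the preset block size.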
Before small files are stored, they are first classified according to a predetermined feature and merged into big files in the small file queue, and a small file index and a global index are established. During reading, these two indexes form a double index: a lookup goes first through the global index and then through the small file index, so queries are faster and small files can be located quickly. Because fewer index files are needed, memory consumption and system load are reduced and access efficiency is improved. In addition, because files are stored according to a predetermined feature, storage is more efficient and reading efficiency improves accordingly.
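A read through the double index can be sketched as follows. The dictionaries and the `read_small_file` helper are hypothetical stand-ins chosen for illustration; the text does not specify how either index is laid out.

```python
from typing import Dict, Tuple

# Global index: small file name -> ID of the big file that contains it.
global_index: Dict[str, str] = {
    "a.txt": "big-0001",
    "b.txt": "big-0001",
    "x.log": "big-0002",
}

# Small file index, one per big file: name -> (offset, length).
small_file_indexes: Dict[str, Dict[str, Tuple[int, int]]] = {
    "big-0001": {"a.txt": (0, 4), "b.txt": (4, 2)},
    "big-0002": {"x.log": (0, 3)},
}

# Stand-in for the contents of the big files as stored in HDFS.
big_files: Dict[str, bytes] = {"big-0001": b"aaaabb", "big-0002": b"log"}


def read_small_file(name: str) -> bytes:
    """Resolve the big file first, then the byte range inside it."""
    big_id = global_index[name]                         # level 1: global index
    offset, length = small_file_indexes[big_id][name]   # level 2: small file index
    return big_files[big_id][offset:offset + length]
```

Each read costs two dictionary lookups and one slice, regardless of how many small files were merged.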
In the present invention, small files are preprocessed before storage: they are first classified and then merged into big files. The present invention does not limit the classification method or the classification requirements. In one embodiment of the present invention, step 2 comprises:
classifying the small files according to their file type or creation time.
It should be noted that the present invention generally uses a single classification criterion, for example only the file type or only the creation time. With mixed criteria, a small file could belong to both one category and another, making it hard to assign. Of course, the present invention may also use other classification methods; this is not limited.
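The single-criterion classification can be sketched in Python as below. The choice of file extension as the "file type" and of calendar day as the creation-time bucket are assumptions made for the example, as are the function names.

```python
from collections import defaultdict
from datetime import datetime
from pathlib import PurePosixPath
from typing import Callable, Dict, List, Tuple

SmallFile = Tuple[str, datetime]  # (file name, creation time)


def by_file_type(f: SmallFile) -> str:
    """Category = lower-cased extension (assumption: extension identifies the type)."""
    return PurePosixPath(f[0]).suffix.lower() or "(none)"


def by_creation_day(f: SmallFile) -> str:
    """Category = calendar day of creation (assumption: one bucket per day)."""
    return f[1].strftime("%Y-%m-%d")


def classify(files: List[SmallFile], key: Callable[[SmallFile], str]) -> Dict[str, List[str]]:
    """Group file names by exactly one criterion, so each file has one category."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for f in files:
        groups[key(f)].append(f[0])
    return dict(groups)


files = [
    ("cat.JPG", datetime(2019, 8, 30, 9, 0)),
    ("dog.jpg", datetime(2019, 8, 31, 10, 0)),
    ("notes.txt", datetime(2019, 8, 30, 11, 0)),
]
by_type = classify(files, by_file_type)
by_day = classify(files, by_creation_day)
```

Passing a single key function per run mirrors the point made above: applying both criteria at once would place `cat.JPG` in two categories.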
After classification, the small files need to be merged into a big file, which in effect standardizes small file storage. Once files have been merged into a big file, a read first accesses the big file and then reads the small file within it. The present invention does not limit how small files are merged. In one embodiment, step 4 comprises:
merging the multiple small files in the small file queue into a big file using MapFile, wherein a MapFile includes an index part and a data part; the data part stores the data, and the index part serves as the data index of the file, recording the key of each record and the offset of that record in the file.
When a MapFile is accessed, its index file is loaded into memory, and the index mapping makes it possible to quickly locate the position in the file of a specified record. This greatly improves retrieval efficiency and, in turn, access efficiency.
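The index-part and data-part idea can be modelled in a few lines of Python. This is a simplified sketch, not the Hadoop MapFile format or API: it assumes an in-memory index that holds every key in sorted order with its offset and length, and the names `build_mapfile` and `lookup` are hypothetical.

```python
import bisect
from typing import Dict, List, Optional, Tuple


def build_mapfile(files: Dict[str, bytes]) -> Tuple[List[str], List[Tuple[int, int]], bytes]:
    """Return (sorted keys, (offset, length) per key, data part)."""
    keys: List[str] = []
    positions: List[Tuple[int, int]] = []
    data = bytearray()
    for name in sorted(files):  # records are written in key order
        keys.append(name)
        positions.append((len(data), len(files[name])))
        data.extend(files[name])
    return keys, positions, bytes(data)


def lookup(keys: List[str], positions: List[Tuple[int, int]],
           data: bytes, name: str) -> Optional[bytes]:
    """Binary-search the index part, then slice the record out of the data part."""
    i = bisect.bisect_left(keys, name)
    if i == len(keys) or keys[i] != name:
        return None
    offset, length = positions[i]
    return data[offset:offset + length]


keys, positions, data = build_mapfile({"c.png": b"CCC", "a.png": b"A", "b.png": b"BB"})
found = lookup(keys, positions, data, "b.png")
```

Because the keys are sorted, a lookup takes a logarithmic number of comparisons against the in-memory index followed by one read from the data part, rather than a scan of every record.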
The present invention preprocesses small files, changing how they are stored so that they are easier to read. If a better reading mechanism is also available during reading, access efficiency can be improved further. In one embodiment of the present invention, after step 5 the Hadoop-based method for accessing massive numbers of small files further comprises:
Step 6, determine whether a small file read request has been received;
Step 7, prefetch, from the big file that contains the small file named in the small file read request, the small files related to that small file.
After a small file read request is received, prefetching the small file and the files related to it saves the steps and time of interacting with the NameNode. With this prefetching mechanism, the number of accesses to the NameNode is greatly reduced and the operating efficiency of the NameNode improves markedly.
In one embodiment, when HDFS attempts to read a small file in a MapFile, the metadata of the other related small files in the same MapFile can be prefetched from the NameNode. Because the small files in one MapFile are correlated, a user who reads one file often goes on to access related files. When the metadata of the related small files is stored in the HDFS client cache, the client saves the steps and time of interacting with the NameNode, so the number of accesses to the NameNode is greatly reduced, the operating efficiency of the NameNode improves markedly, the memory consumed by the NameNode is reduced, and the system load is lowered.
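The effect of metadata prefetching on NameNode traffic can be sketched with a toy model. `FakeNameNode` and `PrefetchingClient` are hypothetical classes written for this illustration; they only count requests and do not reflect the real HDFS client or NameNode interfaces.

```python
from typing import Dict, Tuple

Metadata = Tuple[int, int]  # (offset, length) inside the MapFile


class FakeNameNode:
    """Stand-in for the NameNode that counts the metadata requests it serves."""

    def __init__(self, mapfiles: Dict[str, Dict[str, Metadata]]):
        self.mapfiles = mapfiles  # mapfile id -> {small file name: metadata}
        self.requests = 0

    def metadata_for(self, name: str) -> Dict[str, Metadata]:
        """Return metadata for `name` plus every other file in the same MapFile."""
        self.requests += 1
        for entries in self.mapfiles.values():
            if name in entries:
                return dict(entries)
        raise FileNotFoundError(name)


class PrefetchingClient:
    """Client that caches prefetched metadata and asks the NameNode only on a miss."""

    def __init__(self, namenode: FakeNameNode):
        self.namenode = namenode
        self.metadata_cache: Dict[str, Metadata] = {}

    def locate(self, name: str) -> Metadata:
        if name not in self.metadata_cache:
            self.metadata_cache.update(self.namenode.metadata_for(name))
        return self.metadata_cache[name]


namenode = FakeNameNode({
    "big-0001": {"a.txt": (0, 4), "b.txt": (4, 2)},
    "big-0002": {"x.log": (0, 3)},
})
client = PrefetchingClient(namenode)
first = client.locate("a.txt")   # miss: one NameNode request, siblings prefetched
second = client.locate("b.txt")  # hit: served from the client cache
requests_after_two_reads = namenode.requests
```

In this model, reading two files from the same MapFile costs a single NameNode request, because the metadata of the sibling file arrived with the first response.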
To further improve reading efficiency, in one embodiment of the present invention, after step 7 the method further comprises:
Step 8, determine whether the frequency with which the small file is accessed within a predetermined time has reached a threshold;
If so, step 9, store the small file in a cache.
Some files are queried repeatedly, and access frequency differs from file to file. To improve read speed, an access record is written each time a user reads a file, and these records are used to count accesses. Files with a high access frequency are placed on a cache server, which acts as the cache. When a user reads the same file again, it only needs to be read from the cache server, which greatly reduces the time spent reading these files and improves access efficiency.
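Steps 8 and 9 can be sketched as a small admission policy. The sliding-window counting shown here is one possible reading of "frequency within a predetermined time", and the class name `FrequencyCache` and its parameters are hypothetical. Timestamps are passed in explicitly so the behaviour is deterministic.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class FrequencyCache:
    """Cache a file once it is read `threshold` times within `window_seconds`."""

    def __init__(self, window_seconds: float, threshold: int):
        self.window = window_seconds
        self.threshold = threshold
        self.accesses: Dict[str, Deque[float]] = defaultdict(deque)  # access records
        self.cache: Dict[str, bytes] = {}

    def record_access(self, name: str, content: bytes, now: float) -> bool:
        """Log one read at time `now`; return True if the file is in the cache."""
        times = self.accesses[name]
        times.append(now)
        while times and now - times[0] > self.window:
            times.popleft()  # drop access records that fell out of the window
        if len(times) >= self.threshold:
            self.cache[name] = content
        return name in self.cache


fc = FrequencyCache(window_seconds=60, threshold=3)
hot = [fc.record_access("hot.png", b"H", t) for t in (0, 10, 20)]
cold = [fc.record_access("cold.png", b"C", t) for t in (0, 100, 200)]
```

The file read three times within 20 seconds is admitted on its third read, while the file read once every 100 seconds never accumulates enough accesses inside the window.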
However, users' access behavior changes frequently. If too many files are stored in the cache and they are not frequently used, the cache becomes bloated. Because cache space is limited, the number of small files it can hold is limited, and the cache, a high-quality storage resource, cannot be used to full effect. To solve this technical problem, in one embodiment of the present invention, after step 9 the method further comprises:
Step 10, determine whether the time elapsed since a small file in the cache was last accessed has reached a predetermined duration T;
If so, step 11, delete the small file from the cache.
The check looks at how long a small file has gone unused. If that interval exceeds the threshold T, the probability that the file will be used again has dropped, and its value has fallen below the minimum worth caching, so keeping it lowers the efficiency of the cache. Deleting it allows the cache to hold more files with a high probability of being accessed, which helps greatly in raising file read speed. By using a double-index mechanism and a caching mechanism, the present invention raises the probability that an accessed file is hit, on both the client side and the server side, and enhances the robustness of the system.
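Steps 10 and 11 amount to idle-time eviction, which can be sketched as follows. The class `IdleEvictingCache` and its method names are hypothetical, and timestamps are supplied by the caller rather than read from a clock so that the example is reproducible.

```python
from typing import Dict, List, Optional, Tuple


class IdleEvictingCache:
    """Cache that drops entries not accessed for at least `max_idle_seconds` (T)."""

    def __init__(self, max_idle_seconds: float):
        self.max_idle = max_idle_seconds
        # name -> (content, time of last access)
        self.entries: Dict[str, Tuple[bytes, float]] = {}

    def put(self, name: str, content: bytes, now: float) -> None:
        self.entries[name] = (content, now)

    def get(self, name: str, now: float) -> Optional[bytes]:
        if name not in self.entries:
            return None
        content, _ = self.entries[name]
        self.entries[name] = (content, now)  # refresh the last-access time
        return content

    def evict_idle(self, now: float) -> List[str]:
        """Delete every entry whose idle time has reached T; return their names."""
        stale = [n for n, (_, last) in self.entries.items()
                 if now - last >= self.max_idle]
        for n in stale:
            del self.entries[n]
        return stale


cache = IdleEvictingCache(max_idle_seconds=100)
cache.put("a.txt", b"A", now=0)
cache.put("b.txt", b"B", now=0)
cache.get("a.txt", now=80)           # a.txt is used again, b.txt is not
evicted = cache.evict_idle(now=120)  # b.txt idle for 120 s, a.txt for 40 s
```

Only the entry that went unused for the full duration T is removed; the entry that was read in the meantime keeps its place in the cache.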
In addition, an embodiment of the present invention further provides a Hadoop-based system for accessing massive numbers of small files, comprising:
a small file storage request module 10, configured to output a preprocessing command after detecting that a small file is to be stored;
a small file preprocessing module 20, connected to the small file storage request module 10 and configured to: receive the preprocessing command; classify the small file according to a predetermined feature and then place the small file and its small file index into a small file queue; after the length of the small file queue reaches a threshold, merge the multiple small files in the small file queue into a big file, establish a global index, store the correspondence in the file index, and then send a storage request to the NameNode; and control the NameNode so that, after splitting the big file into data blocks of a preset block size, it stores the big file on at least one DataNode and writes the DataNode where each data block is located and the state of that DataNode into the namespace.
Because the Hadoop-based system for accessing massive numbers of small files is the system corresponding to the Hadoop-based method described above, it has the same beneficial effects, which are not repeated here.
To further improve file reading efficiency, in one embodiment of the present invention, the Hadoop-based system for accessing massive numbers of small files further includes a small file read request module 30 connected to the small file preprocessing module 20, and an index prefetching module 40. The small file read request module 30 is configured to send a prefetch request to the index prefetching module 40 after detecting a small file read request, and the index prefetching module 40 prefetches, from the big file that contains the small file named in the small file read request, the small files related to that small file.
To use prefetching, an index prefetching module is placed between the HDFS client and the NameNode. When HDFS attempts to read a small file in a big file (for example, a MapFile produced by merging with the MapFile technique), the metadata of the other related small files in the same big file can be prefetched from the NameNode. Because the small files in one MapFile are correlated, a user who reads one file often goes on to access related files. When the metadata of the related small files is stored in the HDFS client cache, the client saves the steps and time of interacting with the NameNode. With this prefetching mechanism, the number of accesses to the NameNode is greatly reduced and the operating efficiency of the NameNode improves markedly.
To further improve file reading efficiency, in one embodiment of the present invention, the Hadoop-based system for accessing massive numbers of small files further includes a cache module 50 connected to the index prefetching module 40, the cache module 50 being configured to store the small files whose access frequency within a predetermined time reaches a threshold.
Some files are queried repeatedly, and access frequency differs from file to file. To improve read speed, an access record is written each time a user reads a file, and these records are used to count accesses. Files with a high access frequency are placed on a cache server, which acts as the cache. When a user reads the same file again, it only needs to be read from the cache server, which greatly reduces the time spent reading these files and improves access efficiency.
Thus, by adding a cache module, the present invention places frequently read and frequently used files in the cache and uses the cache's inherently fast reads to improve file reading efficiency.
However, users' access behavior changes frequently, and a high access frequency holds only for a certain period. If the cache is clogged or filled with many infrequently used files, then, because its space is very limited, fewer useful files can be held and its effective utilization falls. To solve this technical problem, in one embodiment of the present invention, the Hadoop-based system for accessing massive numbers of small files further includes a cache cleaning module connected to the cache module 50; the cache cleaning module deletes a small file from the cache after detecting that the time elapsed since the small file in the cache was last accessed has reached the predetermined duration T.
A timer is set up in the cache server to record the time elapsed since each file was last accessed. When that interval exceeds the predetermined duration T, the system automatically deletes the files whose interval exceeds T. In this way the files in the cache are refreshed regularly, so that the usage frequency and efficiency of the cache stay at a high level.
In summary, in the Hadoop-based method and system for accessing massive numbers of small files provided by the embodiments of the present invention, small files are first classified according to a predetermined feature and merged into big files in the small file queue before they are stored, and a small file index and a global index are established. During reading, these two indexes form a double index: a lookup goes first through the global index and then through the small file index, so queries are faster and small files can be located quickly. Because fewer index files are needed, memory consumption and system load are reduced and access efficiency is improved. In addition, because files are stored according to a predetermined feature, storage is more efficient and reading efficiency improves accordingly.
The Hadoop-based method and system for accessing massive numbers of small files provided by the present invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the above embodiments is intended only to help in understanding the method of the invention and its core idea. It should be noted that those of ordinary skill in the art can make several improvements and modifications to the present invention without departing from its principle, and such improvements and modifications also fall within the scope of protection of the claims of the present invention.
Claims (10)
1. A Hadoop-based method for accessing massive numbers of small files, characterized by comprising:
Step 1, determine whether a small file needs to be saved;
If so, step 2, after classifying the small file according to a predetermined feature, place the small file and the small file index of the small file into a small file queue;
Step 3, determine whether the length of the small file queue has reached a threshold;
If so, step 4, merge the multiple small files in the small file queue into a big file, establish a global index, store the correspondence in the file index, and then send a storage request to the NameNode;
Step 5, after the NameNode splits the big file into data blocks of a preset block size, store the big file on at least one DataNode, and write the DataNode where each data block is located and the state of that DataNode into the namespace.
2. as claim 1 is based on the mass small documents access method of Hadoop, which is characterized in that the step 2 includes:
The small documents are sorted out according to the file type or creation time of the small documents.
3. The Hadoop-based method for accessing massive numbers of small files according to claim 1, characterized in that Step 4 comprises:
merging the multiple small files in the small-file queue into a large file using MapFile, wherein the MapFile comprises an index part and a data part; the data part stores the data, and the index part serves as the data index of the file, recording the key of each record and the offset of that record in the file.
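The index/data split described here can be sketched in a few lines. This is a simplified stand-in written in Python, not Hadoop's `MapFile` Java API: the functions `build_map_file` and `lookup` and the length-prefixed record layout are hypothetical, and every key is indexed, whereas Hadoop's MapFile normally indexes only a sample of the keys.

```python
import bisect
import struct


def build_map_file(records):
    """Build (data, index) from a dict of key -> bytes.

    data:  length-prefixed records written in sorted key order.
    index: sorted list of (key, offset of the record in data).
    """
    data = bytearray()
    index = []
    for key in sorted(records):
        value = records[key]
        index.append((key, len(data)))
        data += struct.pack(">I", len(value)) + value
    return bytes(data), index


def lookup(data, index, key):
    """Binary-search the index part, then read one record from the data part."""
    keys = [k for k, _ in index]
    pos = bisect.bisect_left(keys, key)
    if pos == len(keys) or keys[pos] != key:
        return None
    offset = index[pos][1]
    (length,) = struct.unpack_from(">I", data, offset)
    return data[offset + 4:offset + 4 + length]
```

Keeping the keys sorted is what allows a binary search over the index part, so a single small file can be located without scanning the merged data.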
4. The Hadoop-based method for accessing massive numbers of small files according to claim 3, characterized in that, after Step 5, the method further comprises:
Step 6: determining whether a small-file read request has been received;
Step 7: prefetching, from the large file containing the small file that corresponds to the small-file read request, the small files related to that small file.
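A rough sketch of the prefetch in Step 7 follows. It assumes that "related" means "stored in the same large file", which the claim does not state explicitly; the function `read_with_prefetch` and the shape of `global_index` (name mapped to large-file id, offset, and length) are likewise hypothetical.

```python
def read_with_prefetch(name, global_index, big_files, prefetched):
    """Return the requested small file and prefetch its neighbours.

    global_index: name -> (big_file_id, offset, length)
    big_files:    big_file_id -> merged bytes
    prefetched:   dict that receives the neighbouring small files
    """
    big_file_id, offset, length = global_index[name]
    blob = big_files[big_file_id]
    # Assumption: every other small file in the same large file counts as related.
    for other, (other_id, other_offset, other_length) in global_index.items():
        if other_id == big_file_id and other != name:
            prefetched.setdefault(other, blob[other_offset:other_offset + other_length])
    return blob[offset:offset + length]
```

Because files are classified before merging, neighbours within one large file are likely to be requested together, which is the motivation for prefetching them in one pass.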
5. The Hadoop-based method for accessing massive numbers of small files according to claim 4, characterized in that, after Step 7, the method further comprises:
Step 8: determining whether the frequency with which the small file is accessed within a predetermined time reaches a threshold;
if so, Step 9: storing the small file in a cache.
6. The Hadoop-based method for accessing massive numbers of small files according to claim 5, characterized in that, after Step 9, the method further comprises:
Step 10: determining whether the time elapsed since a small file in the cache was last accessed has reached a predetermined duration T;
if so, Step 11: deleting the small file from the cache.
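Steps 8 to 11 amount to a frequency-gated cache with idle-time eviction, which might be sketched as below. The class `HotFileCache`, its parameter names, and the sliding-window interpretation of "frequency within a predetermined time" are assumptions; `idle_limit` plays the role of the duration T.

```python
import time
from collections import defaultdict, deque


class HotFileCache:
    """Caches a small file once it is accessed often enough; evicts it when idle."""

    def __init__(self, freq_threshold, window, idle_limit, clock=time.monotonic):
        self.freq_threshold = freq_threshold  # accesses needed within the window
        self.window = window                  # length of the observation window
        self.idle_limit = idle_limit          # the duration T from Step 10
        self.clock = clock                    # injectable for testing
        self.accesses = defaultdict(deque)    # name -> recent access times
        self.last_access = {}                 # name -> time of last access
        self.cache = {}                       # name -> data

    def record_access(self, name, data):
        now = self.clock()
        times = self.accesses[name]
        times.append(now)
        # Drop accesses that fall outside the observation window.
        while times and now - times[0] > self.window:
            times.popleft()
        self.last_access[name] = now
        # Steps 8 and 9: cache the file once its access count reaches the threshold.
        if len(times) >= self.freq_threshold:
            self.cache[name] = data

    def evict_idle(self):
        # Steps 10 and 11: remove cached files not accessed for at least idle_limit.
        now = self.clock()
        stale = [n for n in self.cache if now - self.last_access[n] >= self.idle_limit]
        for n in stale:
            del self.cache[n]
        return stale
```

`evict_idle` would typically be called periodically, for example from a background timer; how often is a tuning decision the claims leave open.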
7. A Hadoop-based system for accessing massive numbers of small files, characterized by comprising:
a small-file storage request module, configured to output a preprocessing command upon detecting that a small file is to be stored;
a small-file preprocessing module, connected to the small-file storage request module and configured to: receive the preprocessing command; after classifying the small file according to a predetermined feature, place the small file and its small-file index into a small-file queue; once the length of the small-file queue reaches a threshold, merge the multiple small files in the queue into a large file, establish a global index, and, after storing the correspondence in the file index, send a storage request to the NameNode; and control the NameNode so that, after dividing the large file into data blocks according to a preset block size, it stores the large file on at least one DataNode and writes the DataNode on which the data blocks reside and the state of that DataNode into the namespace.
8. The Hadoop-based system for accessing massive numbers of small files according to claim 7, characterized by further comprising a small-file read request module and an index prefetch module connected to the small-file preprocessing module, wherein the small-file read request module is configured to send a prefetch request to the index prefetch module upon detecting a small-file read request, and the index prefetch module prefetches, from the large file containing the small file that corresponds to the read request, the small files related to that small file.
9. The Hadoop-based system for accessing massive numbers of small files according to claim 8, characterized by further comprising a cache module connected to the index prefetch module, the cache module being configured to store small files whose access frequency within a predetermined time reaches a threshold.
10. The Hadoop-based system for accessing massive numbers of small files according to claim 9, characterized by further comprising a cache cleaning module connected to the cache module, wherein the cache cleaning module deletes a small file from the cache upon detecting that the time elapsed since that small file was last accessed has reached a predetermined duration T.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910816503.9A CN110515920A (en) | 2019-08-30 | 2019-08-30 | A kind of mass small documents access method and system based on Hadoop |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110515920A true CN110515920A (en) | 2019-11-29 |
Family
ID=68629643
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910816503.9A Pending CN110515920A (en) | 2019-08-30 | 2019-08-30 | A kind of mass small documents access method and system based on Hadoop |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110515920A (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110968272A (en) * | 2019-12-16 | 2020-04-07 | 华中科技大学 | Time sequence prediction-based method and system for optimizing storage performance of mass small files |
CN111475469A (en) * | 2020-03-19 | 2020-07-31 | 中山大学 | Virtual file system-based small file storage optimization system in KUBERNETES user mode application |
CN113127548A (en) * | 2019-12-31 | 2021-07-16 | 奇安信科技集团股份有限公司 | File merging method, device, equipment and storage medium |
CN113407620A (en) * | 2020-03-17 | 2021-09-17 | 北京信息科技大学 | Data block placement method and system based on heterogeneous Hadoop cluster environment |
CN113590566A (en) * | 2021-06-23 | 2021-11-02 | 河海大学 | Stack structure-based sequence File storage optimization method, device, equipment and storage medium |
CN114116612A (en) * | 2021-11-15 | 2022-03-01 | 长沙理工大学 | B + tree index-based access method for archived files |
CN115269524A (en) * | 2022-09-26 | 2022-11-01 | 创云融达信息技术(天津)股份有限公司 | Integrated system and method for end-to-end small file collection transmission and storage |
CN115858249A (en) * | 2022-12-30 | 2023-03-28 | 北京迪艾尔软件技术有限公司 | Backup method for massive unstructured data files |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102332029A (en) * | 2011-10-15 | 2012-01-25 | 西安交通大学 | Hadoop-based mass classifiable small file association storage method |
CN102332027A (en) * | 2011-10-15 | 2012-01-25 | 西安交通大学 | Mass non-independent small file associated storage method based on Hadoop |
CN102902716A (en) * | 2012-08-27 | 2013-01-30 | 苏州两江科技有限公司 | Storage system based on Hadoop distributed computing platform |
CN103559229A (en) * | 2013-10-22 | 2014-02-05 | 西安电子科技大学 | Small file management service (SFMS) system based on MapFile and use method thereof |
CN103856567A (en) * | 2014-03-26 | 2014-06-11 | 西安电子科技大学 | Small file storage method based on Hadoop distributed file system |
CN105183839A (en) * | 2015-09-02 | 2015-12-23 | 华中科技大学 | Hadoop-based storage optimizing method for small file hierachical indexing |
CN105956183A (en) * | 2016-05-30 | 2016-09-21 | 广东电网有限责任公司电力调度控制中心 | Method and system for multi-stage optimization storage of a lot of small files in distributed database |
CN106909651A (en) * | 2017-02-23 | 2017-06-30 | 郑州云海信息技术有限公司 | A kind of method for being write based on HDFS small documents and being read |
CN107045531A (en) * | 2017-01-20 | 2017-08-15 | 郑州云海信息技术有限公司 | A kind of system and method for optimization HDFS small documents access |
CN109800208A (en) * | 2019-01-18 | 2019-05-24 | 湖南友道信息技术有限公司 | Network traceability system and its data processing method, computer storage medium |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110968272B (en) * | 2019-12-16 | 2021-01-01 | 华中科技大学 | Time sequence prediction-based method and system for optimizing storage performance of mass small files |
CN110968272A (en) * | 2019-12-16 | 2020-04-07 | 华中科技大学 | Time sequence prediction-based method and system for optimizing storage performance of mass small files |
CN113127548A (en) * | 2019-12-31 | 2021-07-16 | 奇安信科技集团股份有限公司 | File merging method, device, equipment and storage medium |
CN113127548B (en) * | 2019-12-31 | 2023-10-31 | 奇安信科技集团股份有限公司 | File merging method, device, equipment and storage medium |
CN113407620B (en) * | 2020-03-17 | 2023-04-21 | 北京信息科技大学 | Data block placement method and system based on heterogeneous Hadoop cluster environment |
CN113407620A (en) * | 2020-03-17 | 2021-09-17 | 北京信息科技大学 | Data block placement method and system based on heterogeneous Hadoop cluster environment |
CN111475469A (en) * | 2020-03-19 | 2020-07-31 | 中山大学 | Virtual file system-based small file storage optimization system in KUBERNETES user mode application |
CN111475469B (en) * | 2020-03-19 | 2021-12-14 | 中山大学 | Virtual file system-based small file storage optimization system in KUBERNETES user mode application |
CN113590566B (en) * | 2021-06-23 | 2023-10-27 | 河海大学 | Method, device, equipment and storage medium for optimizing sequence file storage based on heap structure |
CN113590566A (en) * | 2021-06-23 | 2021-11-02 | 河海大学 | Stack structure-based sequence File storage optimization method, device, equipment and storage medium |
CN114116612A (en) * | 2021-11-15 | 2022-03-01 | 长沙理工大学 | B + tree index-based access method for archived files |
CN114116612B (en) * | 2021-11-15 | 2024-06-07 | 长沙理工大学 | Access method for index archive file based on B+ tree |
CN115269524B (en) * | 2022-09-26 | 2023-03-24 | 创云融达信息技术(天津)股份有限公司 | Integrated system and method for end-to-end small file collection transmission and storage |
CN115269524A (en) * | 2022-09-26 | 2022-11-01 | 创云融达信息技术(天津)股份有限公司 | Integrated system and method for end-to-end small file collection transmission and storage |
CN115858249A (en) * | 2022-12-30 | 2023-03-28 | 北京迪艾尔软件技术有限公司 | Backup method for massive unstructured data files |
CN115858249B (en) * | 2022-12-30 | 2024-07-09 | 北京迪艾尔软件技术有限公司 | Backup method for massive unstructured data files |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20191129 |