CN105843841A - Small file storing method and system - Google Patents
Small file storing method and system
- Publication number
- CN105843841A CN201610127995.7A
- Authority
- CN
- China
- Prior art keywords
- small files
- file
- keyword
- files
- large file
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/13—File access structures, e.g. distributed indices
- G06F16/134—Distributed indices
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a small file storing method and system. Association relations among the small files are determined based on the semantic description information of each small file to be stored, and the small files are then merged and stored based on the determined association relations. The invention thus provides a small file merging and storing strategy based on semantic association; applying it merges closely associated small files and thereby effectively increases the read efficiency of small files.
Description
Technical field
The invention belongs to the technical field of distributed computer storage, and in particular relates to a small file storage method and system.
Background
With the rapid development of the Internet, cloud storage has come into wide use for storing massive amounts of Internet data. Cloud storage integrates a large number of storage devices of different types in a network into a single system that provides storage and service access to the outside. Many distributed systems can currently provide cloud storage, such as HDFS (Hadoop Distributed File System) and Google's GFS (Google File System).
In the current Internet environment, small files account for a large proportion of data. Research on cloud storage of small files is essentially all based on file merging strategies: merging a large number of small files reduces the number of files on the platform and relieves the memory pressure that storing file metadata places on the distributed file system. In addition, once small files are merged into a large file, disk read/write speed improves significantly and the time consumed by file storage is reduced. However, current small file merging mostly ignores the relations between files: it simply merges the uploaded small files directly and provides no merging strategy aimed at improving the read efficiency of small files.
Summary of the invention
In view of this, the object of the present invention is to provide a small file storage method and system that improve the read efficiency of small files by merging and storing closely related small files together.
To this end, the present invention discloses the following technical solution:
A small file storage method, comprising:
obtaining semantic description information of a plurality of small files to be stored;
determining association relations among the small files based on the semantic description information of each small file;
performing file merging on the small files based on the association relations among them, to obtain at least one large file;
storing the at least one large file.
Preferably, in the above method, obtaining the semantic description information of the plurality of small files to be stored comprises:
obtaining keywords of the plurality of small files to be stored.
Preferably, in the above method, determining the association relations among the small files based on the semantic description information of each small file comprises:
calculating the semantic similarity between the keywords of the small files;
performing preset clustering on the keywords of the small files based on the semantic similarity, wherein, in the clustering result, keywords in the same cluster have a relatively high semantic similarity to one another;
determining the association relations among the small files according to the clustering result.
Preferably, the above method further comprises the following preprocessing:
performing small file identification on the uploaded files to be stored according to a preset small file identification criterion.
Preferably, in the above method, before storing the at least one large file, the method further comprises:
building an inverted index for the small files using the keywords of each small file;
determining, according to the result of the file merging, the mapping relation between each small file and its corresponding large file, and the position information of each small file within its corresponding large file.
Preferably, the above method further comprises:
reading a required small file based on the inverted index, the mapping relations between the small files and their corresponding large files, and the position information of each small file within its corresponding large file.
A small file storage system, comprising:
a description information obtaining module, configured to obtain semantic description information of a plurality of small files to be stored;
an association relation determining module, configured to determine association relations among the small files based on the semantic description information of each small file;
a small file merging module, configured to perform file merging on the small files based on the association relations among them, to obtain at least one large file;
a storage module, configured to store the at least one large file.
Preferably, in the above system, the description information obtaining module comprises:
a keyword obtaining unit, configured to obtain keywords of the plurality of small files to be stored.
Preferably, in the above system, the association relation determining module comprises:
a calculating unit, configured to calculate the semantic similarity between the keywords of the small files;
a clustering unit, configured to perform preset clustering on the keywords of the small files based on the semantic similarity, wherein, in the clustering result, keywords in the same cluster have a relatively high semantic similarity to one another;
an association relation determining unit, configured to determine the association relations among the small files according to the clustering result.
Preferably, the above system further comprises:
a small file identification module, configured to perform small file identification on the uploaded files to be stored according to a preset small file identification criterion.
Preferably, the above system further comprises an index creation module, the index creation module comprising:
a first index creation unit, configured to build an inverted index for the small files using the keywords of each small file;
a second index creation unit, configured to determine, according to the result of the file merging, the mapping relation between each small file and its corresponding large file, and the position information of each small file within its corresponding large file.
Preferably, the above system further comprises:
a small file reading module, configured to read a required small file based on the inverted index, the mapping relations between the small files and their corresponding large files, and the position information of each small file within its corresponding large file.
As can be seen from the above solution, the small file storage method and system disclosed in the present application determine the association relations among the small files based on the semantic description information of each small file to be stored, and then merge and store the small files to be stored based on the determined association relations. The present application thus proposes a small file merging and storage strategy based on semantic association; applying it merges closely related small files and can therefore effectively improve the read efficiency of small files.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below are merely embodiments of the present invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of the small file storage method provided by embodiment one of the present invention;
Fig. 2 is an overall architecture diagram of the distributed file system provided by embodiment one of the present invention;
Fig. 3 is a flowchart of the small file storage method provided by embodiment two of the present invention;
Fig. 4 is a flowchart of the small file storage method provided by embodiment three of the present invention;
Fig. 5 to Fig. 8 are schematic structural diagrams of the small file storage system provided by embodiment four of the present invention.
Detailed description of the invention
For ease of reference and clarity, the technical terms, shorthand, and abbreviations used below are explained as follows:
Lucene: an open-source full-text search engine toolkit. It is not a complete full-text search engine but a framework for one, providing a complete query engine and index engine, and part of a text analysis engine. The purpose of Lucene is to give software developers a simple, easy-to-use toolkit with which to add full-text search to a target system, or on which to build a complete full-text search engine. Lucene is an open-source library for full-text indexing and search, supported and provided by the Apache Software Foundation. It offers a simple yet powerful application interface for full-text indexing and search, and in the Java development environment it is a mature, free, open-source tool.
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Evidently, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Embodiment one
Embodiment one of the present invention discloses a small file storage method. Referring to Fig. 1, the method may comprise the following steps:
S101: obtain the semantic description information of a plurality of small files to be stored.
Most small files stored on a cloud platform are interrelated and logically associated with one another. To exploit this, the present application provides a semantics-based file merging strategy: closely related small files are merged, and the size of the resulting large file is controlled so that it does not exceed the default storage object size of the distributed framework, which avoids storing a small file across blocks (a block here refers to a large file).
Fig. 2 shows a typical application scenario of distributed file storage, and the present embodiment describes the method using the distributed file system in Fig. 2 as an example. When a user needs to store a file, the user uploads it to the Web server and describes it with keywords, which serve as file metadata; the keywords should be concise and summarize the subject of the file well. The keywords and other file information, such as file name, size, type, and upload time, are saved together as the file's metadata in the corresponding file record of the database.
The Web server identifies the size of each file uploaded by the user, and can then handle files of different sizes with different strategies. The bottleneck that small files impose on Ceph (a distributed file system) shows up mainly in file writing, reading, and data migration. The present application therefore measures the average rate at which files of different sizes are uploaded to Ceph or downloaded from Ceph to local storage, and uses these measurements to define the criterion that separates large files from small files. The Web server applies this criterion to each uploaded file: if the file is identified as a large file, it is uploaded directly to the Ceph cluster for cloud storage; if it is identified as a small file, it is cached and then merged and stored using the small file merging strategy provided by the present application.
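The routing decision described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the names `Upload`, `route_upload`, and `SMALL_FILE_THRESHOLD_BYTES` are hypothetical, and the 1 MiB cutoff is a placeholder, since the text derives the criterion from measured Ceph transfer rates and states no value.

```python
from dataclasses import dataclass

# Hypothetical cutoff: the source derives it from measured transfer rates
# and does not give a number.
SMALL_FILE_THRESHOLD_BYTES = 1 * 1024 * 1024


@dataclass
class Upload:
    name: str
    size: int        # size in bytes
    keywords: list   # user-supplied keyword description


def route_upload(upload, buffer, direct_uploads,
                 threshold=SMALL_FILE_THRESHOLD_BYTES):
    """Cache small files for later merging; pass large files straight through."""
    if upload.size < threshold:
        buffer.append(upload)          # waits in the buffer for semantic merging
        return "buffered"
    direct_uploads.append(upload)      # stand-in for a direct upload to the cluster
    return "direct"


buffer, direct = [], []
r1 = route_upload(Upload("note.txt", 4_096, ["Beijing"]), buffer, direct)
r2 = route_upload(Upload("video.mp4", 50 * 1024 * 1024, ["travel"]), buffer, direct)
```

Here the two lists stand in for the Web server's cache and for the upload call to the storage cluster, which a real deployment would replace with its own client.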
For each cached small file to be stored, the Web server first obtains its semantic description information. This information may specifically be the keyword description of the small file, which can be extracted from the metadata of the file the user uploaded.
S102: determine the association relations among the small files based on the semantic description information of each small file.
After the keywords of the small files to be stored have been obtained, this step determines the association relations among the small files from the semantic similarity between their keywords: the higher the semantic similarity between two keywords, the closer the association between the corresponding small files.
The present application uses a trained corpus and the word2vec tool to calculate the semantic similarity between keywords, supplemented by HowNet, and defines the relevance between words from the combined result of the two. With the weight of word2vec set to α and the weight of HowNet set to β, the semantic similarity between keywords W1 and W2 is defined as:
$$Sim(W_1, W_2) = \alpha \, Sim_{w2v}(W_1, W_2) + \beta \, Sim_{HN}(W_1, W_2) \qquad (1)$$
In formula (1), Sim(W1, W2) is the relevance, that is, the semantic similarity, between keywords W1 and W2; Sim_w2v(W1, W2) is the relevance between W1 and W2 calculated with word2vec; and Sim_HN(W1, W2) is the relevance between W1 and W2 calculated with HowNet.
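Formula (1) is a weighted sum of two scorers, which might be sketched as below. The function names, the toy lookup tables, and the weights 0.6 and 0.4 are all assumptions for illustration; the source does not give values for α and β, and the two toy functions merely stand in for a real word2vec model and a HowNet-based similarity.

```python
def combined_similarity(w1, w2, sim_w2v, sim_hownet, alpha=0.6, beta=0.4):
    """Formula (1): Sim = alpha * Sim_w2v + beta * Sim_HN."""
    return alpha * sim_w2v(w1, w2) + beta * sim_hownet(w1, w2)


# Toy lookup tables standing in for the two real scorers (assumed values).
W2V_SCORES = {frozenset(("Beijing", "Tian An-men")): 0.8}
HOWNET_SCORES = {frozenset(("Beijing", "Tian An-men")): 0.5}


def toy_w2v(a, b):
    return 1.0 if a == b else W2V_SCORES.get(frozenset((a, b)), 0.0)


def toy_hownet(a, b):
    return 1.0 if a == b else HOWNET_SCORES.get(frozenset((a, b)), 0.0)


score = combined_similarity("Beijing", "Tian An-men", toy_w2v, toy_hownet)
```

Passing the scorers in as parameters keeps the weighting logic independent of whichever similarity backends are actually used.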
With the semantic similarity between words defined, the present application performs preset clustering on the keywords of the small files to be stored, dividing the keywords into clusters such that keywords in the same cluster have a relatively high semantic similarity to one another. Then, on the basis of this keyword clustering and the correspondence between small files and keywords, the small files to be stored in the buffer are logically partitioned: small files whose keywords fall in the same cluster are placed in the same logical unit, so the small files in one logical unit have a relatively high semantic relevance to one another.
The clustering of the small files' keywords is described next. The process comprises:
1) Preliminary clustering of keywords (nearest-neighbor clustering)
The n keywords of the small files (each small file corresponds to one keyword, and the same keyword may correspond to several small files, so the relation between small files and keywords is N:1) are mapped to a group of disjoint sets D = {D1, D2, …, Dn}, where each set is a cluster. The semantic similarity between every pair of keywords is calculated, and the resulting similarity values are sorted in descending order. Similarity values sim(Wi, Wj) are then taken out one by one until sim(Wi, Wj) falls below a preset threshold. For each value sim(Wi, Wj) taken out, the sets Di and Dj to which Wi and Wj belong are found and merged into a new cluster, that is, a new set, and the original sets Di and Dj are deleted. Because the relatedness relation between files is symmetric and transitive, an undirected, disconnected graph can be built from the new set distribution; a depth-first traversal of this graph from each node then places all related nodes (keywords) in one set, which completes one iteration of keyword clustering. After one iteration, the disjoint sets have the following structure:
D1 = {W1, W2, …, Wi};
D2 = {W3, W6, …, Wj};
……
Dm = {Wn, Wt, …, Wk}.
After this process the number of sets has changed and many keywords have been assigned to their corresponding sets (clusters), which achieves the preliminary clustering of keywords.
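The preliminary clustering above can be sketched as follows. The text describes building an undirected graph and traversing it depth-first; this sketch swaps in a union-find structure, which produces the same connected components. The function name, the toy similarity table, and the threshold 0.6 are assumptions for illustration only.

```python
from itertools import combinations


def preliminary_clusters(keywords, sim, threshold):
    """Merge keyword sets for every pair whose similarity reaches the threshold."""
    parent = {w: w for w in keywords}          # each keyword starts in its own set

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]      # path halving keeps lookups short
            w = parent[w]
        return w

    # Score every pair and visit the pairs in descending order of similarity.
    pairs = sorted(combinations(keywords, 2), key=lambda p: sim(*p), reverse=True)
    for a, b in pairs:
        if sim(a, b) < threshold:
            break                              # all remaining pairs score lower
        parent[find(a)] = find(b)              # merge the two sets

    clusters = {}
    for w in keywords:
        clusters.setdefault(find(w), set()).add(w)
    return list(clusters.values())


# Toy similarity table (assumed values); unlisted pairs score 0.1.
TOY = {
    frozenset(("Beijing", "Tian An-men")): 0.9,
    frozenset(("Tian An-men", "scenic spot")): 0.7,
    frozenset(("ruins", "Yuanmingyuan Park")): 0.8,
}


def toy_sim(a, b):
    return TOY.get(frozenset((a, b)), 0.1)


words = ["Beijing", "Tian An-men", "scenic spot", "ruins", "Yuanmingyuan Park"]
clusters = preliminary_clusters(words, toy_sim, threshold=0.6)
```

With these toy scores, "scenic spot" joins the cluster of "Beijing" through "Tian An-men" even though the pair ("Beijing", "scenic spot") is below the threshold, which is the transitivity the text relies on.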
2) Iterative clustering between clusters
Preliminary clustering of the keywords yields an initial division into clusters, and usually a large number of them. This step therefore continues by iteratively clustering those clusters so that closely associated clusters are merged.
Iterative clustering between clusters is carried out in the following steps:
Step one: extract the feature representation of each cluster.
After preliminary clustering, some clusters contain many keywords. For such clusters, the present application reduces the dimensionality of the cluster's features (each keyword in a cluster represents one feature of the cluster), describing the cluster with a few representative keywords.
Specifically, the feature of a cluster is extracted by clustering the keywords inside the cluster once more. For each cluster, the sub-clusters obtained from this intra-cluster clustering are sorted in descending order of the number of keywords they contain, and the sub-cluster with the most keywords is taken as the feature representation of the cluster (the maximum number of keywords in the feature sub-cluster must also be limited). For example, suppose that cluster D1 = {W1, W2, …, Wi} is divided by one round of intra-cluster clustering into the sub-clusters D1 = {(W1, W7, W11), …, (Wi, Wj)}; then (W1, W7, W11) can be selected as the feature representation of cluster D1. This is equivalent to updating the cluster center, which is used in the next step of clustering between clusters.
Step two: calculate and save the relevance value between every pair of clusters.
After features have been extracted for each cluster produced by preliminary clustering, the extracted features are used to calculate the similarity between clusters, as follows:
Suppose cluster Di = {Wi1, Wi2, …, Win} and cluster Dj = {Wj1, Wj2, …, Wjm}. The similarity between clusters Di and Dj is then given by formula (2):
[Formula (2) is not reproduced in this text.]
In formula (2), Sim(Di, Dj) is the similarity between clusters Di and Dj, and Sim(Wip, Wjq) is the similarity between feature keywords of the two clusters. In other words, the relevance values calculated between the pairs of feature keywords of the two clusters are combined to represent the relevance between the clusters, where the feature keywords of a cluster are the keywords contained in its feature representation.
Step three: sort the inter-cluster relevance values in descending order.
Step four: take values sim(Di, Dj) from the descending sequence one by one until sim(Di, Dj) falls below a preset threshold; for each value taken out, merge the corresponding clusters Di and Dj into one cluster Dk, so that all keywords of the two clusters are placed in one set.
Step five: determine whether the number of clusters has changed or whether the number of iterations has reached a set count. If the condition is met, the iteration is complete; otherwise, continue from step one.
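Steps one to five can be sketched as below, under several loudly stated assumptions. Formula (2) is not available in this text, so `cluster_similarity` uses the mean over all feature-keyword pairs as a stand-in. The stopping rule is read as "stop when a round merges nothing or a round cap is reached". Merging every pair above the threshold is implemented as a connected-components pass, and all names, thresholds, and toy scores are hypothetical.

```python
def components(items, sim, threshold):
    """Group items into connected components of the 'sim >= threshold' graph."""
    items = list(items)
    seen, groups = set(), []
    for start in items:
        if start in seen:
            continue
        stack, group = [start], []
        seen.add(start)
        while stack:                                   # depth-first traversal
            node = stack.pop()
            group.append(node)
            for other in items:
                if other not in seen and sim(node, other) >= threshold:
                    seen.add(other)
                    stack.append(other)
        groups.append(group)
    return groups


def feature(cluster, sim, inner_threshold, max_size):
    """Step one: the largest sub-cluster, capped at max_size keywords."""
    largest = max(components(sorted(cluster), sim, inner_threshold), key=len)
    return largest[:max_size]


def cluster_similarity(fa, fb, sim):
    """Step two (assumed form): mean similarity over all feature-keyword pairs."""
    return sum(sim(a, b) for a in fa for b in fb) / (len(fa) * len(fb))


def iterate_clusters(clusters, sim, threshold,
                     inner_threshold=0.8, max_size=3, max_rounds=10):
    clusters = [set(c) for c in clusters]
    for _ in range(max_rounds):                        # step five: round cap
        feats = [feature(c, sim, inner_threshold, max_size) for c in clusters]

        def csim(i, j):
            return cluster_similarity(feats[i], feats[j], sim)

        # Steps three and four: merge every group of clusters linked by
        # an inter-cluster similarity at or above the threshold.
        groups = components(range(len(clusters)), csim, threshold)
        merged = [set().union(*(clusters[i] for i in g)) for g in groups]
        if len(merged) == len(clusters):               # step five: nothing merged
            break
        clusters = merged
    return clusters


# Toy similarity table (assumed values); unlisted pairs score 0.1.
TOY = {
    frozenset(("Beijing", "Tian An-men")): 0.9,
    frozenset(("Tian An-men", "scenic spot")): 0.7,
    frozenset(("Beijing", "scenic spot")): 0.7,
    frozenset(("ruins", "Yuanmingyuan Park")): 0.8,
}


def toy_sim(a, b):
    return TOY.get(frozenset((a, b)), 0.1)


start = [{"Beijing", "Tian An-men"}, {"scenic spot"}, {"ruins", "Yuanmingyuan Park"}]
final = iterate_clusters(start, toy_sim, threshold=0.5)
```

In this toy run the first round merges the first two clusters, and the second round finds nothing further to merge, so the loop stops.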
After preliminary keyword clustering followed by iterative clustering between clusters, the keywords of the small files to be stored have been divided into clusters in which keywords of the same cluster have a relatively high semantic similarity. Because keywords correspond to small files, the small files to be stored in the buffer can be logically partitioned according to the keyword clustering result: small files whose keywords fall in the same cluster are placed in the same logical unit, and the files in one logical unit are highly related to one another, with a relatively high degree of semantic association.
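The logical partitioning just described amounts to a lookup from each file's keyword to its cluster. A minimal sketch follows, assuming one keyword per file as stated earlier; the function name and the sample file names are hypothetical.

```python
def partition_files(file_keywords, clusters):
    """Group file names into logical units, one unit per keyword cluster."""
    # Map every keyword to the index of the cluster that contains it.
    cluster_of = {kw: i for i, cluster in enumerate(clusters) for kw in cluster}
    units = {}
    for name, keyword in file_keywords.items():
        units.setdefault(cluster_of[keyword], []).append(name)
    return units


clusters = [{"Beijing", "Tian An-men"}, {"ruins"}]
file_keywords = {"a.jpg": "Beijing", "b.jpg": "ruins", "c.jpg": "Tian An-men"}
units = partition_files(file_keywords, clusters)
```

Files `a.jpg` and `c.jpg` land in the same logical unit because their keywords share a cluster, so they become candidates for merging into the same large file.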
S103: perform file merging on the small files based on the association relations among them, to obtain at least one large file.
After the cached small files to be stored have been logically partitioned into one or more logical units, this step merges the small files that belong to the same logical unit, according to the partition result.
To avoid excessive read times, a merged file should not be too large. That is, for a logical unit that contains many small files or a large volume of data, the small files in the unit should not all be merged into one large file; instead, they are merged into several large files, so the relation between a logical unit and merged files may be one-to-many.
Let BS denote the dividing line between large and small files. Because every file in the file buffer has passed small file identification, every file in the buffer is smaller than BS. To achieve high file prefetching efficiency, the small files in the same logical unit are first sorted by file size and then merged one after another in that order. When the file produced by the current merge exceeds BS, that merge ends and the next merge begins. This strategy ensures that the merged file has the same transfer and read/write efficiency as a large file, while also giving the highest hit rate when small files are prefetched.
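The size-ordered merging rule can be sketched as a simple batching pass over one logical unit. Two points are assumptions rather than statements from the source: the sort direction (ascending here) and the handling of leftover files that never push a batch past BS (flushed as a final, smaller batch here). The function name and sample sizes are hypothetical.

```python
def plan_merges(sizes, bs):
    """sizes: {name: bytes} for one logical unit. Returns batches of file names."""
    batches, current, total = [], [], 0
    # Visit files in ascending order of size (assumed direction).
    for name, size in sorted(sizes.items(), key=lambda kv: kv[1]):
        current.append(name)
        total += size
        if total > bs:                # merged file has passed the cutoff: close it
            batches.append(current)
            current, total = [], 0
    if current:                       # leftover files form a final, smaller batch
        batches.append(current)
    return batches


sizes = {"a": 30, "b": 10, "c": 50, "d": 40, "e": 20}
batches = plan_merges(sizes, bs=60)
```

Each returned batch corresponds to one large file, which shows how a single logical unit can map to several merged files.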
Specifically, when merging files, the system converts the small file data to binary data and, on each merge, appends the data after the tail of the previous small file. It also records the start position of each merged small file and the length it occupies, so that when a small file is needed it can be restored by extracting data of the specified length from that position.
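The append-and-record scheme above can be illustrated with plain byte strings. This is a sketch only; the names `pack` and `extract` and the in-memory index are assumptions, whereas the text stores the position information in a database.

```python
def pack(files):
    """files: {name: bytes}. Returns (blob, index), index[name] = (offset, length)."""
    blob, index = bytearray(), {}
    for name, data in files.items():
        index[name] = (len(blob), len(data))   # start position and occupied length
        blob += data                           # append after the previous file's tail
    return bytes(blob), index


def extract(blob, index, name):
    """Restore one small file by slicing its recorded range out of the large file."""
    offset, length = index[name]
    return blob[offset:offset + length]


blob, index = pack({"a.txt": b"hello", "b.txt": b"world!"})
restored = extract(blob, index, "b.txt")
```

Because only an offset and a length are needed per file, the large file itself carries no internal headers or delimiters.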
S104: store the at least one large file.
After the small files have been merged into large files, the resulting large files can be placed in distributed storage as required. Specifically, referring to Fig. 2, after merging small files into a large file, the Web server can upload the large file through the corresponding interface service to a designated Bucket (data bucket) of the Ceph storage cluster, thereby completing the cloud storage of the small files.
As can be seen from the above solution, the small file storage method disclosed in the present application determines the association relations among the small files based on the semantic description information of each small file to be stored, and then merges and stores the small files to be stored based on the determined association relations. The present application thus proposes a small file merging and storage strategy based on semantic association; applying it merges closely related small files and can therefore effectively improve the read efficiency of small files.
Embodiment two
This embodiment supplements the solution of embodiment one. Referring to Fig. 3, in this embodiment the following step may also be performed before step S104:
S105: build an inverted index for the small files using the keywords of each small file; and determine, according to the result of the file merging, the mapping relation between each small file and its corresponding large file and the position information of each small file within its corresponding large file.
File retrieval is indispensable on any platform, so file storage must also support file search; that is, it must be possible to locate a required file from the search terms a user enters. To support file retrieval, this embodiment uses the keywords of each small file to build an inverted index for the small files, forming a mapping from keyword to small file ID (identity). Specifically, this embodiment uses Lucene to build the inverted index over the keyword descriptions of the small files. Referring to Fig. 2, once the inverted index has been built, an index file library can be generated to store it, so that small files can later be retrieved and located quickly from the search terms entered.
The creation of an inverted index is briefly illustrated below with an example that gives short descriptions of three pictures (equivalent to three small files):
A: Tian An-men in Beijing.
B: The ruins of Yuanmingyuan Park.
C: Tian An-men and Yuanmingyuan Park are both scenic spots in Beijing.
Word segmentation preprocessing of these descriptions yields:
A: [Beijing] [Tian An-men].
B: [Yuanmingyuan Park] [ruins].
C: [Tian An-men] [Yuanmingyuan Park] [Beijing] [scenic spot].
Inverted indexing of the segmented words of the pictures above then yields the inverted list shown in Table 1:
Table 1
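As an illustration of how such an inverted list is derived, the sketch below builds one from the segmented example with a plain dictionary. It does not use Lucene, which the embodiment relies on; the function name is hypothetical and the tokens are taken from the three picture descriptions.

```python
from collections import defaultdict

# Segmented descriptions of the three example pictures.
docs = {
    "A": ["Beijing", "Tian An-men"],
    "B": ["Yuanmingyuan Park", "ruins"],
    "C": ["Tian An-men", "Yuanmingyuan Park", "Beijing", "scenic spot"],
}


def build_inverted_index(docs):
    """Map each term to the list of document IDs whose description contains it."""
    index = defaultdict(list)
    for doc_id, terms in docs.items():
        for term in terms:
            if doc_id not in index[term]:      # avoid duplicate postings
                index[term].append(doc_id)
    return dict(index)


inverted = build_inverted_index(docs)
```

Looking up `inverted["Beijing"]` then returns the IDs of pictures A and C, which is the keyword to small file ID mapping the embodiment describes.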
In addition to building an inverted index for the small files, this embodiment establishes the mapping relation between small files and large files from how the small files were merged, records the position information of each small file within its corresponding large file, and stores the mapping relations and position information in the corresponding database server. The inverted index, the mapping relations between small files and large files, and the position information of small files within large files together form the index information of the small files, supporting subsequent retrieval, location, and reading of small files.
Embodiment three
On the basis of embodiment two scheme, with reference to Fig. 4, in the present embodiment, described small documents storage method
Can also comprise the following steps:
S106: based on the mapping relations between described inverted index, described small documents and corresponding big file and
Each described small documents positional information in corresponding big file carries out required small documents and reads.
On the basis of above example, the present embodiment provides the read schemes of small documents, the reading of small documents
Taking and can be divided into document retrieval and file download two step, wherein, document retrieval specifically can be to receive user defeated
After the term entered, use Lucene that the inverted index created is retrieved, thus obtain meeting retrieval
The results list of word, this list includes one or more small documents ID, afterwards, obtains after continuing with retrieval
To small documents ID carry out data base querying, obtain the biggest file ID of merging file at small documents place, so
After, send file read request according to the big file ID that inquiry obtains to Ceph cluster.
After the Ceph cluster receives the request, it locates the corresponding Bucket and obtains
the required big file. Because each merged big file is stored on Ceph as an object, every
small-file read requires downloading and caching the entire big file that contains the
small file. The downloaded and cached big file can then be split, and the required small
file is extracted according to its position information within the big file.
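Extracting a small file from a downloaded big file can be sketched as below. The source does
not specify the format of the position information; this sketch assumes it is a byte offset
plus a length, and the names Position and extract_small_file are hypothetical.

```python
from typing import NamedTuple


class Position(NamedTuple):
    offset: int  # assumed: byte offset of the small file inside the big file
    length: int  # assumed: size of the small file in bytes


def extract_small_file(big_file: bytes, pos: Position) -> bytes:
    """Slice one small file out of an already downloaded big file."""
    end = pos.offset + pos.length
    if pos.offset < 0 or pos.length < 0 or end > len(big_file):
        raise ValueError("position record does not fit inside the big file")
    return big_file[pos.offset:end]


# Three small files concatenated: b"alpha" + b"beta" + b"gamma".
big = b"alphabetagamma"
beta = extract_small_file(big, Position(offset=5, length=4))
```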
It should be noted that, because the present application merges and stores closely associated
small files together based on small-file semantics when storing small files in the cloud,
such files tend to reside in the same big file. The efficiency of downloading and reading
small files with a high degree of association can therefore be effectively improved.
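Why co-locating associated small files helps can be illustrated with a small cache sketch.
It assumes a simple dictionary cache keyed by big file ID and a fake download function in
place of a real Ceph request; BigFileCache and fake_download are hypothetical names.

```python
from typing import Callable, Dict


class BigFileCache:
    """Caches whole big files so repeated reads skip the download."""

    def __init__(self, download: Callable[[str], bytes]) -> None:
        self._download = download
        self._cache: Dict[str, bytes] = {}
        self.downloads = 0  # number of actual downloads performed

    def get(self, big_id: str) -> bytes:
        if big_id not in self._cache:
            self._cache[big_id] = self._download(big_id)
            self.downloads += 1
        return self._cache[big_id]


def fake_download(big_id: str) -> bytes:
    # Stand-in for fetching an object from the storage cluster.
    return {"big-001": b"alphabeta", "big-002": b"gamma"}[big_id]


cache = BigFileCache(fake_download)
# Two associated small files live in big-001, a third in big-002.
for big_id in ["big-001", "big-001", "big-002"]:
    cache.get(big_id)
```

In this sketch the second read of big-001 is served from the cache, so three small-file reads
cost two downloads.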
Embodiment four
This embodiment discloses a small file storage system, which corresponds to the small file
storage method disclosed in the above embodiments.
Corresponding to embodiment one, and with reference to Fig. 5, the system may include a
description information acquisition module 100, an association relationship determination
module 200, a small file merging module 300, and a storage module 400.
The description information acquisition module 100 is configured to obtain semantic
description information of a plurality of small files to be stored.
The description information acquisition module 100 includes a keyword acquisition unit,
configured to obtain keywords of the plurality of small files to be stored.
The association relationship determination module 200 is configured to determine the
association relationship among the small files based on the semantic description information
of each small file.
The association relationship determination module 200 includes a computing unit, a clustering
unit, and an association relationship determination unit.
The computing unit is configured to calculate the semantic similarity between the keywords of
the small files;
The clustering unit is configured to perform preset clustering on the keywords of the small
files based on the semantic similarity, wherein, in the clustering result, keywords in the
same cluster have relatively high semantic similarity to one another;
The association relationship determination unit is configured to determine the association
relationship among the small files according to the clustering result.
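The three units above can be sketched as one small pipeline. The source names neither the
similarity measure nor the clustering algorithm, so this sketch substitutes a hand-written
similarity table and threshold-based single-link clustering; the table, the 0.5 threshold,
and all function names are assumptions made for illustration.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set

# Hypothetical pairwise semantic similarity scores between keywords.
SIMILARITY: Dict[FrozenSet[str], float] = {
    frozenset({"cat", "kitten"}): 0.9,
    frozenset({"cat", "invoice"}): 0.1,
    frozenset({"kitten", "invoice"}): 0.1,
}


def similarity(a: str, b: str) -> float:
    return 1.0 if a == b else SIMILARITY.get(frozenset({a, b}), 0.0)


def cluster_keywords(keywords: List[str], threshold: float) -> Dict[str, str]:
    """Single-link clustering; returns keyword -> cluster representative."""
    parent = {k: k for k in keywords}

    def find(k: str) -> str:
        while parent[k] != k:
            parent[k] = parent[parent[k]]  # path halving
            k = parent[k]
        return k

    for a, b in combinations(keywords, 2):
        if similarity(a, b) >= threshold:
            parent[find(a)] = find(b)
    return {k: find(k) for k in keywords}


def associated_pairs(file_keywords: Dict[str, List[str]],
                     threshold: float = 0.5) -> Set[FrozenSet[str]]:
    """Two files are treated as associated if their keywords share a cluster."""
    all_keywords = sorted({k for ks in file_keywords.values() for k in ks})
    cluster_of = cluster_keywords(all_keywords, threshold)
    pairs: Set[FrozenSet[str]] = set()
    for f1, f2 in combinations(sorted(file_keywords), 2):
        clusters1 = {cluster_of[k] for k in file_keywords[f1]}
        clusters2 = {cluster_of[k] for k in file_keywords[f2]}
        if clusters1 & clusters2:
            pairs.add(frozenset({f1, f2}))
    return pairs


files = {"a.txt": ["cat"], "b.txt": ["kitten"], "c.txt": ["invoice"]}
pairs = associated_pairs(files)
```

A real system would replace the lookup table with an actual semantic similarity measure; the
clustering and association steps would keep the same shape.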
The small file merging module 300 is configured to merge the small files based on the
association relationship among them, obtaining at least one big file.
The storage module 400 is configured to store the at least one big file.
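The merging performed by module 300 can be sketched as a simple concatenation of one group of
associated small files. The source does not describe the on-disk layout of a big file, so
plain byte concatenation with an (offset, length) record per small file is an assumption,
and merge_small_files is a hypothetical name.

```python
from typing import Dict, List, Tuple


def merge_small_files(
    group: List[Tuple[str, bytes]],
) -> Tuple[bytes, Dict[str, Tuple[int, int]]]:
    """Concatenate one group of associated small files into a big file.

    Returns the big file content and, for each small file ID, its assumed
    position record as (offset, length).
    """
    big = bytearray()
    positions: Dict[str, Tuple[int, int]] = {}
    for file_id, data in group:
        positions[file_id] = (len(big), len(data))
        big.extend(data)
    return bytes(big), positions


big_file, positions = merge_small_files([("s1", b"alpha"), ("s2", b"beta")])
```

Recording the positions during the merge is what later allows a single small file to be cut
out of the big file without scanning it.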
The functions of the above modules or units rely on a preprocessing step of small-file
identification. Therefore, with reference to Fig. 6, the system further includes a small file
identification module 500, which is configured to perform small-file identification on the
uploaded files to be stored according to a preset small-file identification criterion.
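The identification step can be sketched as a size check. This passage does not state what
the preset identification criterion is, so the size-based rule and the 1 MiB threshold below
are purely hypothetical, as are the function names.

```python
from typing import Dict, List, Tuple

# Hypothetical criterion: files under 1 MiB count as small files.
SMALL_FILE_THRESHOLD_BYTES = 1 * 1024 * 1024


def is_small_file(size_bytes: int,
                  threshold: int = SMALL_FILE_THRESHOLD_BYTES) -> bool:
    return size_bytes < threshold


def partition_uploads(sizes: Dict[str, int]) -> Tuple[List[str], List[str]]:
    """Split uploaded files into (small, other) by the assumed size criterion."""
    small = sorted(name for name, size in sizes.items() if is_small_file(size))
    other = sorted(name for name, size in sizes.items() if not is_small_file(size))
    return small, other


small, other = partition_uploads({"note.txt": 2_048, "video.mp4": 50_000_000})
```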
Corresponding to embodiment two, and with reference to Fig. 7, the system may further include
an index creation module 600, which includes a first index creation unit and a second index
creation unit.
The first index creation unit is configured to create an inverted index for each small file
using the keywords of that small file;
The second index creation unit is configured to determine, according to the result of the
file merging, the mapping relationship between each small file and its corresponding big file
and the position information of each small file within its corresponding big file.
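The inverted index built by the first index creation unit can be sketched as a mapping from
keyword to the IDs of the small files that carry it. The embodiments use Lucene for this; the
dictionary-based version below is only an assumed stand-in, and build_inverted_index is a
hypothetical name.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, List, Set


def build_inverted_index(file_keywords: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each lowercased keyword to the sorted IDs of files containing it."""
    index: DefaultDict[str, Set[str]] = defaultdict(set)
    for file_id, keywords in file_keywords.items():
        for keyword in keywords:
            index[keyword.lower()].add(file_id)
    return {keyword: sorted(ids) for keyword, ids in index.items()}


index = build_inverted_index({
    "s1": ["Invoice", "2016"],
    "s2": ["invoice"],
    "s3": ["contract"],
})
```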
Corresponding to embodiment three, and with reference to Fig. 8, the system may further
include a small file reading module 700, configured to read a required small file based on
the inverted index, the mapping relationships between the small files and their corresponding
big files, and the position information of each small file within its corresponding big file.
Because the small file storage system disclosed in embodiment four of the present invention
corresponds to the small file storage method disclosed in embodiments one to three, its
description is relatively brief. For related or similar details, refer to the description of
the small file storage method in embodiments one to three, which is not repeated here.
It should be noted that the embodiments in this specification are described in a progressive
manner. Each embodiment focuses on its differences from the other embodiments, and for
identical or similar parts the embodiments may be referred to one another.
For convenience of description, the above system or apparatus is described as being divided
by function into various modules or units. Of course, when the present application is
implemented, the functions of the units may be realized in one or more pieces of software
and/or hardware.
From the above description of the embodiments, those skilled in the art can clearly
understand that the present application can be implemented by means of software plus a
necessary general-purpose hardware platform. Based on this understanding, the technical
solution of the present application, in essence or in the part that contributes over the
prior art, can be embodied in the form of a software product. The computer software product
can be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and
includes several instructions that cause a computer device (which may be a personal computer,
a server, a network device, or the like) to perform the methods described in the embodiments
of the present application or in certain parts of the embodiments.
Finally, it should also be noted that, in this document, relational terms such as first,
second, third, and fourth are used only to distinguish one entity or operation from another,
and do not necessarily require or imply any actual relationship or order between those
entities or operations. Moreover, the terms "include", "comprise", and any other variants
are intended to cover non-exclusive inclusion, so that a process, method, article, or device
that includes a series of elements includes not only those elements but also other elements
not expressly listed, or elements inherent to such a process, method, article, or device.
Without further limitation, an element defined by the phrase "including a ..." does not
exclude the presence of other identical elements in the process, method, article, or device
that includes the element.
The above is only the preferred embodiment of the present invention. It should be pointed
out that those of ordinary skill in the art may make several improvements and modifications
without departing from the principles of the present invention, and these improvements and
modifications shall also be regarded as falling within the protection scope of the present
invention.
Claims (12)
1. A small file storage method, characterized by comprising:
obtaining semantic description information of a plurality of small files to be stored;
determining an association relationship among the small files based on the semantic
description information of each small file;
merging the small files based on the association relationship among the small files to
obtain at least one big file; and
storing the at least one big file.
2. The method according to claim 1, characterized in that obtaining the semantic description
information of the plurality of small files to be stored comprises:
obtaining keywords of the plurality of small files to be stored.
3. The method according to claim 2, characterized in that determining the association
relationship among the small files based on the semantic description information of each
small file comprises:
calculating the semantic similarity between the keywords of the small files;
performing preset clustering on the keywords of the small files based on the semantic
similarity, wherein, in the clustering result, keywords in the same cluster have relatively
high semantic similarity to one another; and
determining the association relationship among the small files according to the clustering
result.
4. The method according to any one of claims 1-3, characterized by further comprising the
following preprocessing step:
performing small-file identification on uploaded files to be stored according to a preset
small-file identification criterion.
5. The method according to any one of claims 2-3, characterized in that, before storing the
at least one big file, the method further comprises:
creating an inverted index for each small file using the keywords of that small file; and
determining, according to the result of the file merging, the mapping relationship between
each small file and its corresponding big file and the position information of each small
file within its corresponding big file.
6. The method according to claim 5, characterized by further comprising:
reading a required small file based on the inverted index, the mapping relationships between
the small files and their corresponding big files, and the position information of each
small file within its corresponding big file.
7. A small file storage system, characterized by comprising:
a description information acquisition module, configured to obtain semantic description
information of a plurality of small files to be stored;
an association relationship determination module, configured to determine an association
relationship among the small files based on the semantic description information of each
small file;
a small file merging module, configured to merge the small files based on the association
relationship among the small files to obtain at least one big file; and
a storage module, configured to store the at least one big file.
8. The system according to claim 7, characterized in that the description information
acquisition module comprises:
a keyword acquisition unit, configured to obtain keywords of the plurality of small files to
be stored.
9. The system according to claim 8, characterized in that the association relationship
determination module comprises:
a computing unit, configured to calculate the semantic similarity between the keywords of the
small files;
a clustering unit, configured to perform preset clustering on the keywords of the small files
based on the semantic similarity, wherein, in the clustering result, keywords in the same
cluster have relatively high semantic similarity to one another; and
an association relationship determination unit, configured to determine the association
relationship among the small files according to the clustering result.
10. The system according to any one of claims 7-9, characterized by further comprising:
a small file identification module, configured to perform small-file identification on
uploaded files to be stored according to a preset small-file identification criterion.
11. The system according to any one of claims 8-9, characterized by further comprising an
index creation module, the index creation module comprising:
a first index creation unit, configured to create an inverted index for each small file
using the keywords of that small file; and
a second index creation unit, configured to determine, according to the result of the file
merging, the mapping relationship between each small file and its corresponding big file and
the position information of each small file within its corresponding big file.
12. The system according to claim 11, characterized by further comprising:
a small file reading module, configured to read a required small file based on the inverted
index, the mapping relationships between the small files and their corresponding big files,
and the position information of each small file within its corresponding big file.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610127995.7A CN105843841A (en) | 2016-03-07 | 2016-03-07 | Small file storing method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610127995.7A CN105843841A (en) | 2016-03-07 | 2016-03-07 | Small file storing method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105843841A true CN105843841A (en) | 2016-08-10 |
Family
ID=56587046
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610127995.7A Pending CN105843841A (en) | 2016-03-07 | 2016-03-07 | Small file storing method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105843841A (en) |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106446079A (en) * | 2016-09-08 | 2017-02-22 | 中国科学院计算技术研究所 | Distributed file system-oriented file prefetching/caching method and apparatus |
CN106528451A (en) * | 2016-11-14 | 2017-03-22 | 哈尔滨工业大学(威海) | Cloud storage framework for second level cache prefetching for small files and construction method thereof |
CN106776370A (en) * | 2016-12-05 | 2017-05-31 | 哈尔滨工业大学(威海) | Cloud storage method and device based on the assessment of object relevance |
CN106776967A (en) * | 2016-12-05 | 2017-05-31 | 哈尔滨工业大学(威海) | Mass small documents real-time storage method and device based on sequential aggregating algorithm |
CN106897587A (en) * | 2017-02-27 | 2017-06-27 | 百度在线网络技术(北京)有限公司 | The method and apparatus of reinforcement application, loading reinforcement application |
CN107341267A (en) * | 2017-07-24 | 2017-11-10 | 郑州云海信息技术有限公司 | A kind of distributed file system access method and platform |
CN109766318A (en) * | 2018-12-17 | 2019-05-17 | 新华三大数据技术有限公司 | File reading and device |
CN109947721A (en) * | 2017-12-01 | 2019-06-28 | 北京安天网络安全技术有限公司 | A kind of small documents treating method and apparatus |
CN110069455A (en) * | 2017-09-21 | 2019-07-30 | 北京华为数字技术有限公司 | A kind of file mergences method and device |
CN110069466A (en) * | 2019-04-15 | 2019-07-30 | 武汉大学 | A kind of the small documents storage method and device of Based on Distributed file system |
CN110297810A (en) * | 2019-07-05 | 2019-10-01 | 联想(北京)有限公司 | A kind of stream data processing method, device and electronic equipment |
CN111475469A (en) * | 2020-03-19 | 2020-07-31 | 中山大学 | Virtual file system-based small file storage optimization system in KUBERNETES user mode application |
CN111930684A (en) * | 2020-07-28 | 2020-11-13 | 苏州亿歌网络科技有限公司 | Small file processing method, device and equipment based on HDFS (Hadoop distributed File System) and storage medium |
CN112241396A (en) * | 2020-10-27 | 2021-01-19 | 浪潮云信息技术股份公司 | Spark-based method and Spark-based system for merging small files of Delta |
CN112422448A (en) * | 2020-08-21 | 2021-02-26 | 苏州浪潮智能科技有限公司 | FPGA accelerator card network data transmission method and related components |
CN118132520A (en) * | 2024-05-08 | 2024-06-04 | 济南浪潮数据技术有限公司 | Storage system file processing method, electronic device, storage medium and program product |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7225207B1 (en) * | 2001-10-10 | 2007-05-29 | Google Inc. | Server for geospatially organized flat file data |
CN103577123A (en) * | 2013-11-12 | 2014-02-12 | 河海大学 | Small file optimization storage method based on HDFS |
CN103678491A (en) * | 2013-11-14 | 2014-03-26 | 东南大学 | Method based on Hadoop small file optimization and reverse index establishment |
CN104765876A (en) * | 2015-04-24 | 2015-07-08 | 中国人民解放军信息工程大学 | Massive GNSS small file cloud storage method |
CN105183839A (en) * | 2015-09-02 | 2015-12-23 | 华中科技大学 | Hadoop-based storage optimizing method for small file hierachical indexing |
2016-03-07: CN application CN201610127995.7A (publication CN105843841A), status: Pending
Non-Patent Citations (5)
Title |
---|
李海生 等: "Hadoop环境下三维模型的存储及形状分布特征提取", 《计算机研究与发展》 * |
王晓明: "基于HDFS的移动超声探测小文件高效存储研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
王涛: "云存储中面向访问任务的小文件合并与预取策略", 《武汉大学学报(信息科学版)》 * |
章成志 等: "《文本自动标引与自动分类研究》", 31 December 2009, 东南大学出版社 * |
马刚: "《基于语义的Web数据挖掘》", 31 January 2014, 东北财经大学出版社 * |
Cited By (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106446079B (en) * | 2016-09-08 | 2019-06-18 | 中国科学院计算技术研究所 | A kind of file of Based on Distributed file system prefetches/caching method and device |
CN106446079A (en) * | 2016-09-08 | 2017-02-22 | 中国科学院计算技术研究所 | Distributed file system-oriented file prefetching/caching method and apparatus |
CN106528451A (en) * | 2016-11-14 | 2017-03-22 | 哈尔滨工业大学(威海) | Cloud storage framework for second level cache prefetching for small files and construction method thereof |
CN106528451B (en) * | 2016-11-14 | 2019-09-03 | 哈尔滨工业大学(威海) | The cloud storage frame and construction method prefetched for the L2 cache of small documents |
CN106776370A (en) * | 2016-12-05 | 2017-05-31 | 哈尔滨工业大学(威海) | Cloud storage method and device based on the assessment of object relevance |
CN106776967A (en) * | 2016-12-05 | 2017-05-31 | 哈尔滨工业大学(威海) | Mass small documents real-time storage method and device based on sequential aggregating algorithm |
CN106776967B (en) * | 2016-12-05 | 2020-03-27 | 哈尔滨工业大学(威海) | Method and device for storing massive small files in real time based on time sequence aggregation algorithm |
CN106897587A (en) * | 2017-02-27 | 2017-06-27 | 百度在线网络技术(北京)有限公司 | The method and apparatus of reinforcement application, loading reinforcement application |
CN107341267A (en) * | 2017-07-24 | 2017-11-10 | 郑州云海信息技术有限公司 | A kind of distributed file system access method and platform |
CN110069455A (en) * | 2017-09-21 | 2019-07-30 | 北京华为数字技术有限公司 | A kind of file mergences method and device |
CN110069455B (en) * | 2017-09-21 | 2021-12-14 | 北京华为数字技术有限公司 | File merging method and device |
CN109947721A (en) * | 2017-12-01 | 2019-06-28 | 北京安天网络安全技术有限公司 | A kind of small documents treating method and apparatus |
CN109947721B (en) * | 2017-12-01 | 2021-08-17 | 北京安天网络安全技术有限公司 | Small file processing method and device |
CN109766318A (en) * | 2018-12-17 | 2019-05-17 | 新华三大数据技术有限公司 | File reading and device |
CN109766318B (en) * | 2018-12-17 | 2021-03-02 | 新华三大数据技术有限公司 | File reading method and device |
CN110069466A (en) * | 2019-04-15 | 2019-07-30 | 武汉大学 | A kind of the small documents storage method and device of Based on Distributed file system |
CN110069466B (en) * | 2019-04-15 | 2021-02-19 | 武汉大学 | Small file storage method and device for distributed file system |
CN110297810A (en) * | 2019-07-05 | 2019-10-01 | 联想(北京)有限公司 | A kind of stream data processing method, device and electronic equipment |
CN110297810B (en) * | 2019-07-05 | 2022-01-18 | 联想(北京)有限公司 | Stream data processing method and device and electronic equipment |
CN111475469A (en) * | 2020-03-19 | 2020-07-31 | 中山大学 | Virtual file system-based small file storage optimization system in KUBERNETES user mode application |
CN111475469B (en) * | 2020-03-19 | 2021-12-14 | 中山大学 | Virtual file system-based small file storage optimization system in KUBERNETES user mode application |
CN111930684A (en) * | 2020-07-28 | 2020-11-13 | 苏州亿歌网络科技有限公司 | Small file processing method, device and equipment based on HDFS (Hadoop distributed File System) and storage medium |
CN112422448A (en) * | 2020-08-21 | 2021-02-26 | 苏州浪潮智能科技有限公司 | FPGA accelerator card network data transmission method and related components |
CN112241396A (en) * | 2020-10-27 | 2021-01-19 | 浪潮云信息技术股份公司 | Spark-based method and Spark-based system for merging small files of Delta |
CN112241396B (en) * | 2020-10-27 | 2023-05-23 | 浪潮云信息技术股份公司 | Spark-based method and system for merging small files of Delta |
CN118132520A (en) * | 2024-05-08 | 2024-06-04 | 济南浪潮数据技术有限公司 | Storage system file processing method, electronic device, storage medium and program product |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105843841A (en) | Small file storing method and system | |
CN106649455A (en) | Big data development standardized systematic classification and command set system | |
CN104346438B (en) | Based on big data data management service system | |
CN110019616A (en) | A kind of POI trend of the times state acquiring method and its equipment, storage medium, server | |
CN103812939A (en) | Big data storage system | |
CN105159971B (en) | A kind of cloud platform data retrieval method | |
CN107103032A (en) | The global mass data paging query method sorted is avoided under a kind of distributed environment | |
CN104063376A (en) | Multi-dimensional grouping operation method and system | |
CN106294595A (en) | A kind of document storage, search method and device | |
Pallickara et al. | Efficient metadata generation to enable interactive data discovery over large-scale scientific data collections | |
CN104391908B (en) | Multiple key indexing means based on local sensitivity Hash on a kind of figure | |
Gupta et al. | Faster as well as early measurements from big data predictive analytics model | |
Cary et al. | Leveraging cloud computing in geodatabase management | |
Kim et al. | Efficient distributed selective search | |
CN110795613B (en) | Commodity searching method, device and system and electronic equipment | |
CN113254630B (en) | Domain knowledge map recommendation method for global comprehensive observation results | |
CN109783441A (en) | Mass data inquiry method based on Bloom Filter | |
JP2020512651A (en) | Search method, device, and non-transitory computer-readable storage medium | |
KR20180129001A (en) | Method and System for Entity summarization based on multilingual projected entity space | |
CN104573082B (en) | Space small documents distributed data storage method and system based on access log information | |
Khodaei et al. | Temporal-textual retrieval: Time and keyword search in web documents | |
Ravichandran | Big Data processing with Hadoop: a review | |
Huang et al. | Design a batched information retrieval system based on a concept-lattice-like structure | |
CN104794237A (en) | Web page information processing method and device | |
Shah et al. | Big data analytics framework for spatial data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20160810 |