CN105843841A - Small file storing method and system - Google Patents

Small file storing method and system

Info

Publication number
CN105843841A
Authority
CN
China
Prior art keywords
small documents
file
key word
documents
big file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610127995.7A
Other languages
Chinese (zh)
Inventor
王金龙
段良涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao University of Technology
Original Assignee
Qingdao University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao University of Technology
Priority to CN201610127995.7A
Publication of CN105843841A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G06F16/134 Distributed indices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a small file storage method and system. Association relations among the small files to be stored are determined from the semantic description information of each small file, and the small files are then merged and stored according to those association relations. The invention thus provides a small file merging and storage strategy based on semantic association; applying it allows closely associated small files to be merged together, which effectively improves the reading efficiency of small files.

Description

A small file storage method and system
Technical field
The invention belongs to the technical field of distributed computer storage, and in particular relates to a small file storage method and system.
Background
With the rapid development of the Internet, cloud storage has come to be widely used for storing massive Internet data. Cloud storage integrates a large number of different types of storage devices in a network into a system that jointly provides external storage and business access services. Many distributed systems can currently provide cloud storage, such as HDFS (Hadoop Distributed File System) and Google's GFS (Google File System).
In the current Internet environment, small files account for a large proportion of stored data. Research on cloud storage of small files is essentially all based on file merging: merging a large number of small files reduces the number of files on the platform and relieves the memory pressure that storing file metadata places on the distributed file system. In addition, once small files are merged into a big file, disk read/write speed improves significantly and the time consumed by file storage is reduced. However, current small file merging does not consider the relationships among files; the uploaded small files are simply merged directly, and no merging strategy is provided to improve the reading efficiency of small files.
Summary of the invention
In view of this, the object of the invention is to provide a small file storage method and system that improve the reading efficiency of small files by merging and storing closely associated small files together.
To this end, the invention discloses the following technical solution:
A small file storage method, comprising:
obtaining semantic description information of a plurality of small files to be stored;
determining association relations among the small files based on the semantic description information of each small file;
merging the small files based on the association relations among them to obtain at least one big file; and
storing the at least one big file.
In the above method, preferably, obtaining the semantic description information of the plurality of small files to be stored comprises:
obtaining keywords of the plurality of small files to be stored.
In the above method, preferably, determining the association relations among the small files based on the semantic description information of each small file comprises:
calculating the semantic similarity between the keywords of the small files;
performing preset clustering on the keywords of the small files based on the semantic similarity, wherein, in the clustering result, keywords in the same cluster have a relatively high semantic similarity to one another; and
determining the association relations among the small files according to the clustering result.
The above method preferably further comprises the following preprocessing step:
performing small file identification on uploaded files to be stored according to a preset small file identification standard.
The above method preferably further comprises, before storing the at least one big file:
building an inverted index for the small files using the keywords of each small file; and
determining, according to the result of the file merging, the mapping relations between each small file and its corresponding big file and the position information of each small file within its corresponding big file.
The above method preferably further comprises:
reading a required small file based on the inverted index, the mapping relations between the small files and their corresponding big files, and the position information of each small file within its corresponding big file.
A small file storage system, comprising:
a description information obtaining module, configured to obtain semantic description information of a plurality of small files to be stored;
an association relation determining module, configured to determine association relations among the small files based on the semantic description information of each small file;
a small file merging module, configured to merge the small files based on the association relations among them to obtain at least one big file; and
a storage module, configured to store the at least one big file.
In the above system, preferably, the description information obtaining module comprises:
a keyword obtaining unit, configured to obtain keywords of the plurality of small files to be stored.
In the above system, preferably, the association relation determining module comprises:
a computing unit, configured to calculate the semantic similarity between the keywords of the small files;
a clustering unit, configured to perform preset clustering on the keywords of the small files based on the semantic similarity, wherein, in the clustering result, keywords in the same cluster have a relatively high semantic similarity to one another; and
an association relation determining unit, configured to determine the association relations among the small files according to the clustering result.
The above system preferably further comprises:
a small file identification module, configured to perform small file identification on uploaded files to be stored according to a preset small file identification standard.
The above system preferably further comprises an index creation module, the index creation module comprising:
a first index creation unit, configured to build an inverted index for the small files using the keywords of each small file; and
a second index creation unit, configured to determine, according to the result of the file merging, the mapping relations between each small file and its corresponding big file and the position information of each small file within its corresponding big file.
The above system preferably further comprises:
a small file reading module, configured to read a required small file based on the inverted index, the mapping relations between the small files and their corresponding big files, and the position information of each small file within its corresponding big file.
As can be seen from the above solution, the small file storage method and system disclosed in this application determine the association relations among the small files from the semantic description information of each small file to be stored, and on that basis merge and store the small files according to the determined association relations. This application thus proposes a small file merging and storage strategy based on semantic association; applying it allows closely associated small files to be merged together, which effectively improves the reading efficiency of small files.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are merely embodiments of the invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of the small file storage method provided by Embodiment One of the invention;
Fig. 2 is an overall architecture diagram of the distributed file system provided by Embodiment One of the invention;
Fig. 3 is a flow chart of the small file storage method provided by Embodiment Two of the invention;
Fig. 4 is a flow chart of the small file storage method provided by Embodiment Three of the invention;
Figs. 5 to 8 are structural schematic diagrams of the small file storage system provided by Embodiment Four of the invention.
Detailed description
For ease of reference and clarity, the technical terms and abbreviations used below are explained as follows:
Lucene: an open-source full-text search engine toolkit. It is not a complete full-text search engine but rather a full-text search engine framework, providing a complete query engine and indexing engine as well as part of a text analysis engine. The purpose of Lucene is to give software developers a simple, easy-to-use toolkit for conveniently implementing full-text search in a target system, or for building a complete full-text search engine on top of it. Lucene is a set of open-source libraries for full-text indexing and search, supported and provided by the Apache Software Foundation. It offers a simple yet powerful application interface for full-text indexing and searching, and is a mature, free, open-source tool in the Java development environment.
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the invention.
Embodiment One
Embodiment One of the invention discloses a small file storage method. With reference to Fig. 1, the method may comprise the following steps:
S101: Obtain the semantic description information of a plurality of small files to be stored.
Most small files stored on a cloud platform are related to one another and logically associated. In view of this, the application provides a semantics-based file merging strategy: closely associated small files are merged, and the size of the resulting big file is controlled so that it does not exceed the default storage object size of the distributed framework, which avoids storing a small file across blocks (a block here refers to a big file).
With reference to the typical distributed file storage scenario shown in Fig. 2, this embodiment describes the method using the distributed file system in Fig. 2 as an example. When a user needs to store a file, the user can upload it on the Web server and describe it with keywords, which serve as metadata of the file. The keywords should be concise and should summarize the subject of the file well. The keywords, together with other file information such as file name, size, type, and upload time, are saved as the file's metadata in the corresponding file record of the database.
The Web server identifies the size of each file uploaded by the user and then handles files of different sizes with different strategies. The small file bottleneck of Ceph (a distributed file system) shows mainly in file writing, reading, and data migration. On this basis, the application determines the standard for distinguishing big and small files by measuring the average rate at which files of different sizes are uploaded to Ceph or downloaded from Ceph to local storage. The Web server then uses this standard to classify each uploaded file as big or small. If the file is identified as a big file, it is uploaded directly to the Ceph cluster for cloud storage; if it is identified as a small file, it is cached, and the small file merging and storage strategy provided by the application is used to merge and store it.
For each cached small file to be stored, the Web server first obtains its semantic description information. This information may specifically be the keyword description of the small file, which can be extracted from the metadata of the file uploaded by the user.
S102: Determine the association relations among the small files based on the semantic description information of each small file.
After the keywords of each small file to be stored have been obtained, this step determines the association relations among the small files from the semantic similarity between their keywords: the higher the semantic similarity between two keywords, the closer the association between the small files they describe.
The application uses a trained corpus and the word2vec tool to calculate the semantic similarity between keywords, and uses HowNet as a supplement, so that the relatedness between words is defined by the combined result of the two. Let the weight of word2vec be α and the weight of HowNet be β; the semantic similarity between keywords W1 and W2 is then defined as follows:
$$\mathrm{Sim}(W_1, W_2) = \alpha\,\mathrm{Sim}_{w2v}(W_1, W_2) + \beta\,\mathrm{Sim}_{HN}(W_1, W_2) \tag{1}$$
In formula (1), Sim(W1, W2) is the relatedness, i.e., the semantic similarity, between keywords W1 and W2; Sim_w2v(W1, W2) is the relatedness between W1 and W2 calculated with the word2vec tool; and Sim_HN(W1, W2) is the relatedness between W1 and W2 calculated with HowNet.
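Formula (1) is a weighted sum of two similarity scores, which can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: the function names, the weights 0.6 and 0.4, and the toy score tables are all assumptions, and the two lookup functions merely stand in for real word2vec and HowNet similarity computations.

```python
def combined_similarity(w1, w2, sim_w2v, sim_hn, alpha=0.6, beta=0.4):
    """Formula (1): alpha * Sim_w2v(w1, w2) + beta * Sim_HN(w1, w2).

    alpha and beta are illustrative values; the source does not fix them.
    """
    return alpha * sim_w2v(w1, w2) + beta * sim_hn(w1, w2)


def lookup(table):
    """Build a symmetric similarity function from a table of made-up scores."""
    def sim(a, b):
        if a == b:
            return 1.0
        return table.get(frozenset((a, b)), 0.0)
    return sim


# Hypothetical scores standing in for word2vec and HowNet outputs.
W2V = {frozenset(("Beijing", "Tiananmen")): 0.8}
HN = {frozenset(("Beijing", "Tiananmen")): 0.5}

score = combined_similarity("Beijing", "Tiananmen", lookup(W2V), lookup(HN))
```

Passing the two similarity functions in as parameters keeps the weighting logic independent of how each score is actually computed.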
With the semantic similarity between words defined, the application performs preset clustering on the keywords of the small files to be stored, dividing the keywords into clusters so that keywords in the same cluster have a relatively high semantic similarity to one another. Once the keywords have been clustered, the small files to be stored in the buffer can be logically partitioned according to the correspondence between small files and keywords: the small files corresponding to keywords in the same cluster are placed in the same logical unit, and the small files in one logical unit have a relatively high semantic relatedness to one another.
The clustering of the small file keywords is described next. The process specifically includes:
1) Preliminary clustering of keywords (nearest neighbor classifier)
The n keywords of the small files (each small file corresponds to one keyword, and the same keyword may correspond to several small files, i.e., the correspondence between small files and keywords is N:1) are mapped to a group of disjoint sets D = {D1, D2, …, Dn}, each set being one cluster. The pairwise semantic similarity between the keywords in the sets is calculated, and the resulting similarity values are sorted in descending order. Similarity values sim(Wi, Wj) are then taken out in turn until sim(Wi, Wj) falls below a preset threshold. For each sim(Wi, Wj) taken out, the sets Di and Dj to which Wi and Wj belong are found and merged into a new cluster, i.e., a new set, and the original sets Di and Dj are deleted. Because the relatedness of files is symmetric and transitive, an undirected, disconnected graph can be built from the new set distribution, and a depth-first traversal from each node places all related nodes (keywords) into one set, which completes one iteration of keyword clustering. After one iteration, the disjoint sets have the following structure:
D1 = {W1, W2, …, Wi};
D2 = {W3, W6, …, Wj};
……
Dm = {Wn, Wt, …, Wk}.
After this process the number of sets has changed and many keywords have been assigned to their corresponding sets (clusters), which completes the preliminary clustering of the keywords.
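The preliminary clustering described above amounts to grouping keywords into the connected components of a graph whose edges are the keyword pairs at or above the threshold. The sketch below uses a union-find structure in place of the depth-first traversal mentioned in the text (both yield the same components); the function name, the threshold, and the toy scores are illustrative assumptions.

```python
from itertools import combinations


def preliminary_clusters(keywords, sim, threshold):
    """Merge the sets of every keyword pair whose similarity is >= threshold."""
    parent = {w: w for w in keywords}  # each keyword starts in its own set

    def find(w):
        # Follow parent links to the set representative (with path halving).
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    # Pairwise similarities, sorted in descending order.
    pairs = sorted(
        ((sim(a, b), a, b) for a, b in combinations(keywords, 2)),
        key=lambda t: t[0],
        reverse=True,
    )
    for score, a, b in pairs:
        if score < threshold:
            break  # every remaining pair is below the threshold
        parent[find(a)] = find(b)  # merge the two sets

    clusters = {}
    for w in keywords:
        clusters.setdefault(find(w), set()).add(w)
    return list(clusters.values())


# Made-up similarity scores for illustration only.
SCORES = {
    frozenset(("Beijing", "Tiananmen")): 0.9,
    frozenset(("Tiananmen", "Yuanmingyuan")): 0.7,
    frozenset(("Yuanmingyuan", "ruins")): 0.3,
}


def toy_sim(a, b):
    return SCORES.get(frozenset((a, b)), 0.0)


result = preliminary_clusters(
    ["Beijing", "Tiananmen", "Yuanmingyuan", "ruins", "cat"], toy_sim, 0.5
)
```

With these scores, "Beijing" and "Yuanmingyuan" end up in the same set through "Tiananmen" even though their direct score is zero, which is the transitivity the text relies on.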
2) Iterative clustering between clusters
Preliminary clustering of the keywords yields an initial division into clusters, and at this point there will be many clusters. This step therefore continues with iterative clustering of the clusters obtained from preliminary clustering, so that closely associated clusters are merged.
Iterative clustering between clusters is carried out with the following steps:
Step 1: extract the feature representation of each cluster;
After preliminary clustering, some clusters contain many keywords. For clusters with many keywords, the application reduces the dimensionality of the cluster features (each keyword in a cluster represents one feature of the cluster), describing the cluster's features with a few relatively representative keywords.
Specifically, the application extracts cluster features by clustering the keywords within each cluster once more. For each cluster, the sub-clusters obtained by this intra-cluster clustering are sorted in descending order of the number of keywords they contain, and the sub-cluster with the most keywords is taken as the feature representation of the cluster (the maximum number of keywords in a feature sub-cluster must also be limited). For example, suppose cluster D1 = {W1, W2, …, Wi} is divided by intra-cluster clustering into the sub-clusters D1 = {(W1, W7, W11), …, (Wi, Wj)}; then (W1, W7, W11) can be selected as the feature representation of D1, which is equivalent to updating the cluster center for the next step of cluster merging.
Step 2: calculate and save the relatedness between every pair of clusters;
After cluster features have been extracted for each cluster obtained from preliminary clustering, the extracted features are used to calculate the similarity between clusters. The inter-cluster similarity is calculated as follows:
Suppose cluster Di = {Wi1, Wi2, …, Win} and cluster Dj = {Wj1, Wj2, …, Wjm}. The similarity between Di and Dj is then:
$$\mathrm{Sim}(D_i, D_j) = \frac{1}{m \cdot n} \sum_{p=1}^{n} \sum_{q=1}^{m} \mathrm{Sim}(W_{ip}, W_{jq}) \tag{2}$$
In formula (2), Sim(Di, Dj) is the similarity between clusters Di and Dj, and Sim(Wip, Wjq) is the similarity between feature keywords of the two clusters. That is, the relatedness values calculated between every pair of feature keywords of the two clusters are combined to represent the relatedness between the clusters, where the feature keywords of a cluster are the keywords contained in the cluster's feature representation.
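Formula (2) is the mean of all pairwise similarities between the feature keywords of two clusters. A minimal sketch follows; the function name and the toy scores are assumptions, and `sim` can be any keyword similarity function such as the weighted score of formula (1).

```python
def cluster_similarity(features_i, features_j, sim):
    """Formula (2): average of sim(p, q) over all feature keyword pairs."""
    if not features_i or not features_j:
        raise ValueError("both clusters need at least one feature keyword")
    total = sum(sim(p, q) for p in features_i for q in features_j)
    return total / (len(features_i) * len(features_j))


# Made-up keyword similarities for illustration only.
PAIR_SCORES = {
    frozenset(("Beijing", "Yuanmingyuan")): 0.8,
    frozenset(("Tiananmen", "Yuanmingyuan")): 0.4,
}


def pair_sim(a, b):
    return PAIR_SCORES.get(frozenset((a, b)), 0.0)


value = cluster_similarity(["Beijing", "Tiananmen"], ["Yuanmingyuan"], pair_sim)
```

Here n = 2 and m = 1, so the result is the mean of the two scores 0.8 and 0.4.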
Step 3: sort the inter-cluster relatedness values in descending order;
Step 4: take out sim(Di, Dj) from the descending sequence in turn until sim(Di, Dj) falls below a preset threshold; then merge the clusters Di and Dj corresponding to each sim(Di, Dj) taken out into one cluster Dk, so that all keywords of the two clusters are placed in one set;
Step 5: determine whether the number of clusters has changed or whether the number of iterations has reached a certain count. If the condition is met, the iteration is complete; otherwise, continue from Step 1.
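Steps 2 to 5 can be sketched as a loop that repeatedly merges cluster pairs whose average similarity is at or above a threshold. Several points here are assumptions rather than statements from the source: the sketch uses all keywords of a cluster as its features (skipping the feature reduction of Step 1), and it reads the termination condition as "stop when a pass produces no merge or when `max_iter` passes have run". All names and scores are illustrative.

```python
from itertools import combinations


def avg_similarity(ci, cj, sim):
    """Mean pairwise similarity between two keyword sets, as in formula (2)."""
    return sum(sim(p, q) for p in ci for q in cj) / (len(ci) * len(cj))


def merge_clusters(clusters, sim, threshold, max_iter=10):
    clusters = [set(c) for c in clusters]
    for _ in range(max_iter):
        # Steps 2 and 3: score every cluster pair, sort in descending order.
        scored = sorted(
            ((avg_similarity(clusters[i], clusters[j], sim), i, j)
             for i, j in combinations(range(len(clusters)), 2)),
            reverse=True,
        )
        group = list(range(len(clusters)))  # union-find over cluster indices

        def find(i):
            while group[i] != i:
                group[i] = group[group[i]]
                i = group[i]
            return i

        # Step 4: merge pairs until the score drops below the threshold.
        merged_any = False
        for score, i, j in scored:
            if score < threshold:
                break
            ri, rj = find(i), find(j)
            if ri != rj:
                group[ri] = rj
                merged_any = True
        # Step 5 (assumed reading): stop once a pass changes nothing.
        if not merged_any:
            break
        merged = {}
        for i, c in enumerate(clusters):
            merged.setdefault(find(i), set()).update(c)
        clusters = list(merged.values())
    return clusters


# Made-up scores for illustration only.
TOY = {frozenset(("a", "b")): 0.9}


def toy_pair_sim(x, y):
    return TOY.get(frozenset((x, y)), 0.1)


final = merge_clusters([{"a"}, {"b"}, {"c"}], toy_pair_sim, 0.5)
```

In this toy run the first pass merges the clusters of "a" and "b"; the second pass finds no pair at or above 0.5 and stops.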
After preliminary keyword clustering and iterative clustering between clusters, the keywords of the small files to be stored have been divided into clusters, with keywords in the same cluster having a relatively high semantic similarity to one another. Because keywords correspond to small files, the small files to be stored in the buffer can be logically partitioned according to the keyword clustering result: the small files corresponding to keywords in the same cluster are placed in the same logical unit, and the files in one logical unit are highly related to one another, i.e., they have a relatively high degree of semantic association.
S103: Merge the small files based on the association relations among them to obtain at least one big file.
After the cached small files to be stored have been logically partitioned into one or more logical units, this step merges the small files that belong to the same logical unit, according to the logical partitioning result.
To avoid excessive read time, a merged file should not be too large. That is, for a logical unit that contains a large number of small files or a large volume of data, the small files in that unit should not all be merged into a single big file; instead they are merged into multiple big files, so the relation between a logical unit and merged files can be one-to-many.
Let the dividing line between big and small files be BS. Since every file in the file buffer is a small file that has passed small file identification, every file in the buffer is smaller than BS. To achieve high file prefetching efficiency, all small files in the same logical unit are first sorted by file size and then merged one after another in that order; when the file produced by the current merge exceeds BS, that merge ends and the next merge begins. This strategy ensures that the merged file has the same transfer and read/write efficiency as a big file, while also giving the highest hit rate when small files are prefetched.
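The size-ordered merging rule can be sketched as a simple grouping pass over one logical unit. The sketch assumes ascending size order (the text says only that files are sorted by size) and closes a group as soon as its total size exceeds BS; the function name and the sample sizes are hypothetical.

```python
def plan_merges(files, bs):
    """Group (name, size) pairs from one logical unit into merge batches.

    Files are visited in ascending size order (an assumption); a batch is
    closed as soon as its accumulated size exceeds the threshold bs.
    """
    groups, current, current_size = [], [], 0
    for name, size in sorted(files, key=lambda f: f[1]):
        current.append(name)
        current_size += size
        if current_size > bs:
            groups.append(current)
            current, current_size = [], 0
    if current:
        groups.append(current)  # leftover files form a final, smaller batch
    return groups


# Hypothetical file sizes (arbitrary units) with bs = 5.
unit = [("e", 5), ("a", 1), ("c", 3), ("b", 2), ("d", 4)]
batches = plan_merges(unit, 5)
```

Each batch then becomes one big file, so a logical unit with many files maps to several big files, matching the one-to-many relation described above.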
Specifically, when merging files the system converts the small file data to binary data, and each merge appends data at the tail of the previous small file while recording the start position of the merged small file and the length it occupies. When the small file is needed later, it can be restored by extracting data of the recorded length from that position.
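The append-and-record scheme can be illustrated with in-memory byte strings. This is a minimal sketch under the assumption that the whole merged file fits in memory; the function names and sample file names are hypothetical.

```python
def merge_files(small_files):
    """Concatenate small files; record (offset, length) for each one.

    small_files maps a file name to its bytes. Returns the merged bytes and
    an index mapping each name to its start offset and length.
    """
    blob = bytearray()
    index = {}
    for name, data in small_files.items():
        index[name] = (len(blob), len(data))  # start position, length
        blob += data                          # append at the current tail
    return bytes(blob), index


def extract(blob, index, name):
    """Restore one small file by slicing the recorded range."""
    offset, length = index[name]
    return blob[offset:offset + length]


big, positions = merge_files({"a.txt": b"hello", "b.txt": b"world!"})
restored = extract(big, positions, "b.txt")
```

Only the offset and length need to be stored per small file, so the index stays small relative to the data.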
S104: Store the at least one big file.
After the small files have been merged into big files, the merged big files can be placed in distributed storage as required. Specifically, with reference to Fig. 2, after merging small files into a big file, the Web server can upload the big file through the corresponding interface service to a designated Bucket (data bucket) of the Ceph storage cluster, which completes the cloud storage of the small files.
As can be seen from the above solution, the small file storage method disclosed in this application determines the association relations among the small files from the semantic description information of each small file to be stored, and on that basis merges and stores the small files according to the determined association relations. This application thus proposes a small file merging and storage strategy based on semantic association; applying it allows closely associated small files to be merged together, which effectively improves the reading efficiency of small files.
Embodiment Two
This embodiment supplements the solution of Embodiment One. With reference to Fig. 3, in this embodiment the following step may also be performed before step S104:
S105: Build an inverted index for the small files using the keywords of each small file; and, according to the result of the file merging, determine the mapping relations between each small file and its corresponding big file and the position information of each small file within its corresponding big file.
File retrieval is indispensable for any platform: while storing files, the platform must also support searching for them, that is, it must be able to locate the required file from a search term entered by the user. To support file retrieval, this embodiment uses the keywords of each small file to build an inverted index for the small files, forming a mapping from keyword to small file ID (Identity, identification number). Specifically, this embodiment uses Lucene to build the inverted index from the keyword descriptions of the small files. With reference to Fig. 2, once the inverted index has been built, an index file library can be generated to store it, so that small files can later be quickly retrieved and located from an input search term.
The creation of an inverted index is illustrated briefly with an example. The example gives the brief descriptions of three pictures (equivalent to three small files):
A: Tiananmen in Beijing.
B: The ruins of Yuanmingyuan.
C: Tiananmen and Yuanmingyuan are both scenic spots in Beijing.
After word segmentation preprocessing, the following is obtained:
A: [Beijing] [Tiananmen].
B: [Yuanmingyuan] [ruins].
C: [Tiananmen] [Yuanmingyuan] [Beijing] [scenic spot].
Inverted index processing of the segmented words of each picture then yields the inverted list shown in Table 1:
Table 1
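The inverted list for pictures A, B, and C follows mechanically from the segmented words given above. The sketch below builds it with a plain Python dictionary; it illustrates the keyword to file ID mapping only and is not Lucene's API, and the function name is hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the list of document IDs that contain it."""
    index = defaultdict(list)
    for doc_id, terms in docs.items():
        for term in terms:
            if doc_id not in index[term]:
                index[term].append(doc_id)
    return dict(index)


# Segmented descriptions of the three pictures from the example.
docs = {
    "A": ["Beijing", "Tiananmen"],
    "B": ["Yuanmingyuan", "ruins"],
    "C": ["Tiananmen", "Yuanmingyuan", "Beijing", "scenic spot"],
}
inverted = build_inverted_index(docs)
for term, ids in inverted.items():
    print(term, "->", ", ".join(ids))
```

Looking up a term such as "Beijing" in `inverted` returns the IDs of pictures A and C, which is the lookup the retrieval step relies on.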
In addition to building an inverted index for the small files, this embodiment establishes the mapping relations between small files and big files according to how the small files were merged, records the position of each small file within its corresponding big file, and stores the mapping relations and position information in the corresponding database server. The inverted index, the mapping relations between small files and big files, and the position information of small files within big files together serve as the index information of the small files, supporting subsequent retrieval, locating, and reading of small files.
Embodiment three
Building on the scheme of Embodiment two, and with reference to Fig. 4, the small file storage method of this embodiment may further include the following step:
S106: read the required small file based on the inverted index, the mapping relations between the small files and their corresponding big files, and the position information of each small file within its corresponding big file.
Building on the above embodiments, this embodiment provides a scheme for reading small files. Reading a small file can be divided into two steps: file retrieval and file download. In file retrieval, after a search term entered by the user is received, Lucene is used to search the inverted index that was created, yielding a result list that matches the search term; the list contains one or more small file IDs. The retrieved small file IDs are then used to query the database and obtain the ID of the merged big file in which each small file resides. Finally, a file read request is sent to the Ceph cluster according to the big file ID returned by the query.
After the Ceph cluster receives the request, it locates the corresponding Bucket and obtains the required big file. Because the merged big file is stored on Ceph as an object, each small file read requires the big file containing that small file to be downloaded in full and cached. The downloaded and cached big file is then split apart, and the required small file is extracted according to its position information within the big file.
It should be noted that, because the present application merges and stores closely associated small files together on the basis of small file semantics when placing them in cloud storage, the efficiency of downloading and reading highly associated small files can be effectively improved.
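The read path described above can be sketched as follows, under stated assumptions: plain dictionaries stand in for the Lucene index, the database mapping table, and the Ceph object store, and all IDs and byte contents are made up for illustration. The sketch also shows why merging associated files helps: two matching small files in the same big file cost only one whole-object download.

```python
def read_small_files(term, inverted_index, mapping, object_store, cache):
    """Return {small_id: bytes} for every small file matching the search term.

    inverted_index: keyword -> list of small-file IDs (stand-in for Lucene)
    mapping:        small_id -> (big_id, offset, length) (stand-in for the database)
    object_store:   big_id -> bytes (stand-in for the Ceph cluster)
    cache:          big_id -> bytes, filled as big files are downloaded
    """
    results = {}
    for small_id in inverted_index.get(term, []):
        big_id, offset, length = mapping[small_id]
        if big_id not in cache:
            # The big file is stored as one object, so it is fetched whole.
            cache[big_id] = object_store[big_id]
        # Split the cached big file using the recorded position information.
        results[small_id] = cache[big_id][offset:offset + length]
    return results


# Hypothetical data: img1 and img2 were merged into the same big file.
object_store = {"big-0": b"AAAABBBBBB", "big-1": b"CC"}
mapping = {"img1": ("big-0", 0, 4), "img2": ("big-0", 4, 6), "img3": ("big-1", 0, 2)}
inverted_index = {"beach": ["img1", "img2"], "city": ["img3"]}

cache = {}
found = read_small_files("beach", inverted_index, mapping, object_store, cache)
```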
Embodiment four
This embodiment discloses a small file storage system that corresponds to the small file storage method disclosed in the above embodiments.
Corresponding to Embodiment one, and with reference to Fig. 5, the system may include a description information acquisition module 100, an association relation determination module 200, a small file merging module 300, and a storage module 400.
The description information acquisition module 100 is configured to obtain semantic description information of a plurality of small files to be stored.
The description information acquisition module 100 includes a keyword acquisition unit, configured to obtain the keywords of the plurality of small files to be stored.
The association relation determination module 200 is configured to determine the association relations among the small files based on the semantic description information of each small file.
The association relation determination module 200 includes a computing unit, a clustering unit, and an association relation determination unit.
The computing unit is configured to calculate the semantic similarity between the keywords of the small files;
The clustering unit is configured to perform preset clustering on the keywords of the small files based on the semantic similarity; in the clustering result, keywords in the same cluster have a higher semantic similarity to one another;
The association relation determination unit is configured to determine the association relations among the small files according to the clustering result.
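The three units of module 200 can be sketched as follows. This is only an illustration under loud assumptions: the hand-written similarity table is a placeholder for whatever semantic similarity measure the system actually uses, single-linkage clustering with a threshold is a swapped-in stand-in for the "preset clustering", and every keyword, file ID, and the 0.6 threshold is hypothetical.

```python
from itertools import combinations

# Hypothetical pairwise keyword similarities (placeholder for a real semantic
# similarity measure). Pairs not listed are treated as similarity 0.0.
SIMILARITY = {
    frozenset(("beach", "sea")): 0.9,
    frozenset(("sea", "sunset")): 0.7,
    frozenset(("city", "street")): 0.8,
}


def similarity(a, b):
    """Computing unit: semantic similarity between two keywords."""
    return 1.0 if a == b else SIMILARITY.get(frozenset((a, b)), 0.0)


def cluster_keywords(keywords, threshold):
    """Clustering unit: single-linkage clustering via union-find.

    Returns a dict mapping each keyword to its cluster representative.
    """
    parent = {k: k for k in keywords}

    def find(k):
        while parent[k] != k:
            parent[k] = parent[parent[k]]  # path compression
            k = parent[k]
        return k

    for a, b in combinations(keywords, 2):
        if similarity(a, b) >= threshold:
            parent[find(a)] = find(b)
    return {k: find(k) for k in keywords}


def associate_files(keywords_by_file, threshold=0.6):
    """Association unit: files whose keywords share a cluster are associated.

    A file with keywords in several clusters would appear in several groups.
    """
    all_keywords = sorted({k for ks in keywords_by_file.values() for k in ks})
    cluster_of = cluster_keywords(all_keywords, threshold)
    groups = {}
    for file_id, ks in keywords_by_file.items():
        for k in ks:
            groups.setdefault(cluster_of[k], set()).add(file_id)
    return sorted(sorted(g) for g in groups.values())


groups = associate_files({
    "img1": ["beach"], "img2": ["sea"], "img3": ["sunset"],
    "img4": ["city"], "img5": ["street"],
})
```

Note that "beach" and "sunset" end up in one cluster even though they are not directly similar in the table, because single linkage chains them through "sea"; a different clustering method could split them.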
The small file merging module 300 is configured to merge the small files based on the association relations among them, obtaining at least one big file.
The storage module 400 is configured to store the at least one big file.
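A minimal sketch of the merging step, assuming that merging means simple byte concatenation of one group of associated small files; the source does not specify the big file's internal layout, so the format, function name, and IDs here are illustrative only.

```python
def merge_group(big_id, small_files):
    """Concatenate associated small files into one big file.

    small_files: list of (small_id, data) pairs, data being bytes.
    Returns (big_file_bytes, layout), where layout holds one
    (small_id, big_id, offset, length) row per small file.
    """
    buffer = bytearray()
    layout = []
    for small_id, data in small_files:
        # The current buffer length is where this small file will start.
        layout.append((small_id, big_id, len(buffer), len(data)))
        buffer.extend(data)
    return bytes(buffer), layout


big_file, layout = merge_group("big-0", [("img1", b"AAAA"), ("img2", b"BBBBBB")])
```

The layout rows are exactly the mapping and position information that a later read needs in order to slice a small file back out of the big file.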
The functions of the above modules and units depend on small file identification as a preprocessing step. Therefore, with reference to Fig. 6, the system further includes a small file identification module 500, which is configured to perform small file identification on the uploaded files to be stored according to a preset small file identification criterion.
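As an illustration of the identification step, the sketch below assumes that the preset criterion is a size threshold. The 1 MiB value, the function names, and the file names are hypothetical; the actual criterion is whatever the system is configured with.

```python
SMALL_FILE_THRESHOLD = 1 * 1024 * 1024  # hypothetical criterion: 1 MiB


def is_small_file(size_bytes, threshold=SMALL_FILE_THRESHOLD):
    """Return True when a file is below the configured size threshold."""
    return size_bytes < threshold


def partition_uploads(uploads, threshold=SMALL_FILE_THRESHOLD):
    """Split uploaded files into small files and all other files.

    uploads: dict mapping file name -> size in bytes.
    Returns (small, other) as two sorted lists of file names.
    """
    small = sorted(n for n, s in uploads.items() if is_small_file(s, threshold))
    other = sorted(n for n, s in uploads.items() if not is_small_file(s, threshold))
    return small, other


small, other = partition_uploads({"a.jpg": 200_000, "b.iso": 700_000_000})
```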
Corresponding to Embodiment two, and with reference to Fig. 7, the system may further include an index creation module 600, which includes a first index creation unit and a second index creation unit.
The first index creation unit is configured to build an inverted index for each small file using the keywords of the small files;
The second index creation unit is configured to determine, according to the result of the file merging, the mapping relations between each small file and its corresponding big file and the position information of each small file within its corresponding big file.
Corresponding to Embodiment three, and with reference to Fig. 8, the system may further include a small file reading module 700, configured to read the required small file based on the inverted index, the mapping relations between the small files and their corresponding big files, and the position information of each small file within its corresponding big file.
Because the small file storage system disclosed in Embodiment four of the present invention corresponds to the small file storage method disclosed in Embodiments one to three, its description is relatively brief. For related and similar details, refer to the description of the small file storage method in Embodiments one to three, which is not repeated here.
It should be noted that the embodiments in this specification are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar among the embodiments, reference may be made to one another.
For convenience of description, the above system or apparatus is described as divided by function into various modules or units. Of course, when the present application is implemented, the functions of the units may be realized in the same one or more pieces of software and/or hardware.
From the description of the above embodiments, those skilled in the art can clearly understand that the present application may be implemented by software together with a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product. The computer software product may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present application or in certain parts of the embodiments.
Finally, it should also be noted that, in this document, relational terms such as first, second, third, and fourth are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relation or order between these entities or operations. Moreover, the terms "include", "comprise", and any other variants are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
The above are only preferred embodiments of the present invention. It should be pointed out that those of ordinary skill in the art may make several improvements and refinements without departing from the principles of the present invention, and these improvements and refinements shall also be regarded as falling within the protection scope of the present invention.

Claims (12)

1. A small file storage method, characterized by comprising:
obtaining semantic description information of a plurality of small files to be stored;
determining association relations among the small files based on the semantic description information of each small file;
merging the small files based on the association relations among them to obtain at least one big file;
storing the at least one big file.
2. The method according to claim 1, characterized in that obtaining the semantic description information of the plurality of small files to be stored comprises:
obtaining keywords of the plurality of small files to be stored.
3. The method according to claim 2, characterized in that determining the association relations among the small files based on the semantic description information of each small file comprises:
calculating the semantic similarity between the keywords of the small files;
performing preset clustering on the keywords of the small files based on the semantic similarity, wherein, in the clustering result, keywords in the same cluster have a higher semantic similarity to one another;
determining the association relations among the small files according to the clustering result.
4. The method according to any one of claims 1-3, characterized by further comprising the following preprocessing procedure:
performing small file identification on uploaded files to be stored according to a preset small file identification criterion.
5. The method according to any one of claims 2-3, characterized in that, before storing the at least one big file, the method further comprises:
building an inverted index for each small file using the keywords of the small files;
determining, according to the result of the file merging, the mapping relations between each small file and its corresponding big file and the position information of each small file within its corresponding big file.
6. The method according to claim 5, characterized by further comprising:
reading the required small file based on the inverted index, the mapping relations between the small files and their corresponding big files, and the position information of each small file within its corresponding big file.
7. A small file storage system, characterized by comprising:
a description information acquisition module, configured to obtain semantic description information of a plurality of small files to be stored;
an association relation determination module, configured to determine association relations among the small files based on the semantic description information of each small file;
a small file merging module, configured to merge the small files based on the association relations among them to obtain at least one big file;
a storage module, configured to store the at least one big file.
8. The system according to claim 7, characterized in that the description information acquisition module comprises:
a keyword acquisition unit, configured to obtain keywords of the plurality of small files to be stored.
9. The system according to claim 8, characterized in that the association relation determination module comprises:
a computing unit, configured to calculate the semantic similarity between the keywords of the small files;
a clustering unit, configured to perform preset clustering on the keywords of the small files based on the semantic similarity, wherein, in the clustering result, keywords in the same cluster have a higher semantic similarity to one another;
an association relation determination unit, configured to determine the association relations among the small files according to the clustering result.
10. The system according to any one of claims 7-9, characterized by further comprising:
a small file identification module, configured to perform small file identification on uploaded files to be stored according to a preset small file identification criterion.
11. The system according to any one of claims 8-9, characterized by further comprising an index creation module, the index creation module comprising:
a first index creation unit, configured to build an inverted index for each small file using the keywords of the small files;
a second index creation unit, configured to determine, according to the result of the file merging, the mapping relations between each small file and its corresponding big file and the position information of each small file within its corresponding big file.
12. The system according to claim 11, characterized by further comprising:
a small file reading module, configured to read the required small file based on the inverted index, the mapping relations between the small files and their corresponding big files, and the position information of each small file within its corresponding big file.
CN201610127995.7A 2016-03-07 2016-03-07 Small file storing method and system Pending CN105843841A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610127995.7A CN105843841A (en) 2016-03-07 2016-03-07 Small file storing method and system

Publications (1)

Publication Number Publication Date
CN105843841A true CN105843841A (en) 2016-08-10

Family

ID=56587046

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610127995.7A Pending CN105843841A (en) 2016-03-07 2016-03-07 Small file storing method and system

Country Status (1)

Country Link
CN (1) CN105843841A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7225207B1 (en) * 2001-10-10 2007-05-29 Google Inc. Server for geospatially organized flat file data
CN103577123A (en) * 2013-11-12 2014-02-12 河海大学 Small file optimization storage method based on HDFS
CN103678491A (en) * 2013-11-14 2014-03-26 东南大学 Method based on Hadoop small file optimization and reverse index establishment
CN104765876A (en) * 2015-04-24 2015-07-08 中国人民解放军信息工程大学 Massive GNSS small file cloud storage method
CN105183839A (en) * 2015-09-02 2015-12-23 华中科技大学 Hadoop-based storage optimizing method for small file hierachical indexing

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
李海生 等: "Hadoop环境下三维模型的存储及形状分布特征提取", 《计算机研究与发展》 *
王晓明: "基于HDFS的移动超声探测小文件高效存储研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *
王涛: "云存储中面向访问任务的小文件合并与预取策略", 《武汉大学学报(信息科学版)》 *
章成志 等: "《文本自动标引与自动分类研究》", 31 December 2009, 东南大学出版社 *
马刚: "《基于语义的Web数据挖掘》", 31 January 2014, 东北财经大学出版社 *

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106446079B (en) * 2016-09-08 2019-06-18 中国科学院计算技术研究所 A kind of file of Based on Distributed file system prefetches/caching method and device
CN106446079A (en) * 2016-09-08 2017-02-22 中国科学院计算技术研究所 Distributed file system-oriented file prefetching/caching method and apparatus
CN106528451A (en) * 2016-11-14 2017-03-22 哈尔滨工业大学(威海) Cloud storage framework for second level cache prefetching for small files and construction method thereof
CN106528451B (en) * 2016-11-14 2019-09-03 哈尔滨工业大学(威海) The cloud storage frame and construction method prefetched for the L2 cache of small documents
CN106776370A (en) * 2016-12-05 2017-05-31 哈尔滨工业大学(威海) Cloud storage method and device based on the assessment of object relevance
CN106776967A (en) * 2016-12-05 2017-05-31 哈尔滨工业大学(威海) Mass small documents real-time storage method and device based on sequential aggregating algorithm
CN106776967B (en) * 2016-12-05 2020-03-27 哈尔滨工业大学(威海) Method and device for storing massive small files in real time based on time sequence aggregation algorithm
CN106897587A (en) * 2017-02-27 2017-06-27 百度在线网络技术(北京)有限公司 The method and apparatus of reinforcement application, loading reinforcement application
CN107341267A (en) * 2017-07-24 2017-11-10 郑州云海信息技术有限公司 A kind of distributed file system access method and platform
CN110069455A (en) * 2017-09-21 2019-07-30 北京华为数字技术有限公司 A kind of file mergences method and device
CN110069455B (en) * 2017-09-21 2021-12-14 北京华为数字技术有限公司 File merging method and device
CN109947721A (en) * 2017-12-01 2019-06-28 北京安天网络安全技术有限公司 A kind of small documents treating method and apparatus
CN109947721B (en) * 2017-12-01 2021-08-17 北京安天网络安全技术有限公司 Small file processing method and device
CN109766318A (en) * 2018-12-17 2019-05-17 新华三大数据技术有限公司 File reading and device
CN109766318B (en) * 2018-12-17 2021-03-02 新华三大数据技术有限公司 File reading method and device
CN110069466A (en) * 2019-04-15 2019-07-30 武汉大学 A kind of the small documents storage method and device of Based on Distributed file system
CN110069466B (en) * 2019-04-15 2021-02-19 武汉大学 Small file storage method and device for distributed file system
CN110297810A (en) * 2019-07-05 2019-10-01 联想(北京)有限公司 A kind of stream data processing method, device and electronic equipment
CN110297810B (en) * 2019-07-05 2022-01-18 联想(北京)有限公司 Stream data processing method and device and electronic equipment
CN111475469A (en) * 2020-03-19 2020-07-31 中山大学 Virtual file system-based small file storage optimization system in KUBERNETES user mode application
CN111475469B (en) * 2020-03-19 2021-12-14 中山大学 Virtual file system-based small file storage optimization system in KUBERNETES user mode application
CN111930684A (en) * 2020-07-28 2020-11-13 苏州亿歌网络科技有限公司 Small file processing method, device and equipment based on HDFS (Hadoop distributed File System) and storage medium
CN112422448A (en) * 2020-08-21 2021-02-26 苏州浪潮智能科技有限公司 FPGA accelerator card network data transmission method and related components
CN112241396A (en) * 2020-10-27 2021-01-19 浪潮云信息技术股份公司 Spark-based method and Spark-based system for merging small files of Delta
CN112241396B (en) * 2020-10-27 2023-05-23 浪潮云信息技术股份公司 Spark-based method and system for merging small files of Delta
CN118132520A (en) * 2024-05-08 2024-06-04 济南浪潮数据技术有限公司 Storage system file processing method, electronic device, storage medium and program product

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160810
