CN105843841A

CN105843841A - Small file storing method and system

Info

Publication number: CN105843841A
Application number: CN201610127995.7A
Authority: CN
Inventors: 王金龙; 段良涛
Original assignee: Qingdao University of Technology
Current assignee: Qingdao University of Technology
Priority date: 2016-03-07
Filing date: 2016-03-07
Publication date: 2016-08-10

Abstract

The invention discloses a small file storage method and system. Association relations among the small files to be stored are determined from the semantic description information of each file, and the files are then merged and stored according to those relations. The invention thus provides a merge-and-store strategy for small files from the perspective of semantic association: closely associated small files are merged together, which effectively improves the read efficiency of small files.

Description

Small file storage method and system

Technical field

The invention belongs to the technical field of distributed computer storage and relates in particular to a small file storage method and system.

Background art

With the rapid development of the Internet, cloud storage has come into wide use for storing massive Internet data. Cloud storage integrates a large number of different types of storage devices in a network into a system that jointly provides external storage and service access. At present, many distributed systems can provide cloud storage, such as HDFS (Hadoop Distributed File System) and Google's GFS (Google File System).

In the current Internet environment, small files account for a large proportion of data. Research on cloud storage of small files is essentially all based on file merging: merging a large number of small files reduces the number of files on the platform and relieves the memory pressure that the distributed file system incurs for storing file metadata; moreover, once small files are merged into a large file, disk read/write speed improves significantly, which reduces the time consumed by file storage. However, current small file merging mostly ignores the relationships between files: it simply merges the uploaded small files directly and provides no merging strategy aimed at improving the read efficiency of small files.

Summary of the invention

In view of this, an object of the invention is to provide a small file storage method and system that improve the read efficiency of small files by merging and storing closely related small files together.

To this end, the invention discloses the following technical solution:

A small file storage method, comprising:

obtaining semantic description information of a plurality of small files to be stored;

determining association relations among the small files based on the semantic description information of each small file;

performing file merging on the small files based on the association relations among them, to obtain at least one large file; and

storing the at least one large file.

Preferably, in the above method, obtaining the semantic description information of the plurality of small files to be stored comprises:

obtaining keywords of the plurality of small files to be stored.

Preferably, in the above method, determining the association relations among the small files based on the semantic description information of each small file comprises:

calculating the semantic similarity between the keywords of the small files;

performing preset clustering on the keywords of the small files based on the semantic similarity, wherein keywords in the same cluster of the clustering result have high semantic similarity to one another; and

determining the association relations among the small files according to the clustering result.

Preferably, the above method further comprises the following preprocessing step:

performing small file identification on the uploaded files to be stored according to a preset small file identification criterion.

Preferably, before storing the at least one large file, the above method further comprises:

building an inverted index for the small files using the keywords of each small file; and

determining, according to the result of the file merging, the mapping relation between each small file and its corresponding large file and the position information of each small file within its corresponding large file.

Preferably, the above method further comprises:

reading a required small file based on the inverted index, the mapping relation between the small files and their corresponding large files, and the position information of each small file within its corresponding large file.

A small file storage system, comprising:

a description information obtaining module, configured to obtain semantic description information of a plurality of small files to be stored;

an association relation determining module, configured to determine association relations among the small files based on the semantic description information of each small file;

a small file merging module, configured to perform file merging on the small files based on the association relations among them, to obtain at least one large file; and

a storage module, configured to store the at least one large file.

Preferably, in the above system, the description information obtaining module comprises:

a keyword obtaining unit, configured to obtain keywords of the plurality of small files to be stored.

Preferably, in the above system, the association relation determining module comprises:

a computing unit, configured to calculate the semantic similarity between the keywords of the small files;

a clustering unit, configured to perform preset clustering on the keywords of the small files based on the semantic similarity, wherein keywords in the same cluster of the clustering result have high semantic similarity to one another; and

an association relation determining unit, configured to determine the association relations among the small files according to the clustering result.

Preferably, the above system further comprises:

a small file identification module, configured to perform small file identification on the uploaded files to be stored according to a preset small file identification criterion.

Preferably, the above system further comprises an index creation module, the index creation module comprising:

a first index creation unit, configured to build an inverted index for the small files using the keywords of each small file; and

a second index creation unit, configured to determine, according to the result of the file merging, the mapping relation between each small file and its corresponding large file and the position information of each small file within its corresponding large file.

Preferably, the above system further comprises:

a small file reading module, configured to read a required small file based on the inverted index, the mapping relation between the small files and their corresponding large files, and the position information of each small file within its corresponding large file.

As the above solution shows, the small file storage method and system disclosed in this application determine the association relations among the small files from the semantic description information of each small file to be stored and, on that basis, merge and store the small files according to the determined association relations. The application thus proposes a merge-and-store strategy for small files from the perspective of semantic association; applying it merges closely related small files and thereby effectively improves the read efficiency of small files.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a flowchart of the small file storage method provided by embodiment one of the invention;

Fig. 2 is an overall architecture diagram of the distributed file system provided by embodiment one of the invention;

Fig. 3 is a flowchart of the small file storage method provided by embodiment two of the invention;

Fig. 4 is a flowchart of the small file storage method provided by embodiment three of the invention;

Figs. 5 to 8 are schematic structural diagrams of the small file storage system provided by embodiment four of the invention.

Detailed description of the invention

For ease of reference and clarity, the technical terms, shorthand, and abbreviations used below are explained as follows:

Lucene: an open-source full-text search engine toolkit. It is not a complete full-text search engine but a framework for one, providing a complete query engine and indexing engine and a partial text analysis engine. The purpose of Lucene is to give software developers a simple, easy-to-use toolkit with which to add full-text search to a target system or to build a complete full-text search engine on top of it. Lucene is an open-source library for full-text indexing and search, supported and provided by the Apache Software Foundation. It offers a simple yet powerful application interface for full-text indexing and search, and in Java development environments it is a mature, free, open-source tool.

The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are evidently only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.

Embodiment one

Embodiment one of the invention discloses a small file storage method. Referring to Fig. 1, the small file storage method may comprise the following steps:

S101: Obtain semantic description information of a plurality of small files to be stored.

Most small files stored on a cloud platform are related to one another and logically interconnected. To exploit this relatedness, the application proposes a semantics-based file merging strategy that merges closely related small files while controlling the size of the resulting large file, so that a merged large file does not exceed the default storage object size of the distributed framework and small files are not stored across blocks (a block here refers to a large file).

Fig. 2 shows a typical application scenario for distributed file storage, and this embodiment describes the method of the application using the distributed file system in Fig. 2 as an example. When a user needs to store a file, the user uploads it to the Web server and describes it with keywords, which serve as part of the file's metadata. The keywords should be concise and should summarize the subject of the file well. The keywords and the file's other related information, such as file name, size, type, and upload time, are saved together as the file's metadata in the corresponding file record in the database.

The Web server must identify the size of each file uploaded by the user, after which it can handle files of different sizes with different strategies. For Ceph (a distributed file system), the small file bottleneck shows up mainly in file writing, reading, and data migration. Accordingly, the application determines the criterion that defines and identifies large and small files by measuring the average rate at which files of different sizes are uploaded to Ceph or downloaded from Ceph to the local machine. The Web server then uses this criterion to classify each uploaded file as large or small. If the file is large, it is uploaded directly to the Ceph cluster for cloud storage; if it is small, it is cached and then merged and stored using the small file merge-and-store strategy provided by the application.

For the cached small files to be stored, the Web server first obtains the semantic description information of each small file. The semantic description information may specifically be the keyword description of the small file, which can be extracted from the metadata of the file uploaded by the user.

S102: Determine the association relations among the small files based on the semantic description information of each small file.

After the keywords of the small files to be stored are obtained, this step determines the association relations among the small files from the semantic similarity between their keywords: the higher the semantic similarity between two keywords, the closer the relation between the small files they describe.

The application uses a trained corpus and the word2vec tool to calculate the semantic similarity between keywords, supplemented by HowNet, so that the relatedness between words is defined by the combined result of the two. With the weight of word2vec set to α and the weight of HowNet set to β, the semantic similarity between keywords W_1 and W_2 is defined as follows:

$$\mathrm{Sim}(W_1, W_2) = \alpha \cdot \mathrm{Sim}_{w2v}(W_1, W_2) + \beta \cdot \mathrm{Sim}_{HN}(W_1, W_2) \tag{1}$$

In formula (1), Sim(W_1, W_2) denotes the relatedness, i.e., the semantic similarity, between keywords W_1 and W_2; Sim_w2v(W_1, W_2) denotes the relatedness between W_1 and W_2 calculated with the word2vec tool; and Sim_HN(W_1, W_2) denotes the relatedness between W_1 and W_2 calculated with HowNet.
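Formula (1) can be sketched in a few lines. This is a minimal illustration, not the application's implementation: the function name `combined_similarity`, the default weights 0.6 and 0.4, and the two toy similarity functions are assumptions made for the example; in practice the two callables would wrap a trained word2vec model and a HowNet-based measure.

```python
from typing import Callable

Similarity = Callable[[str, str], float]


def combined_similarity(w1: str, w2: str,
                        sim_w2v: Similarity, sim_hn: Similarity,
                        alpha: float = 0.6, beta: float = 0.4) -> float:
    """Formula (1): weighted sum of the word2vec and HowNet similarities.

    alpha and beta are illustrative defaults; the source does not fix them.
    """
    return alpha * sim_w2v(w1, w2) + beta * sim_hn(w1, w2)


# Toy stand-ins for the two similarity sources (hypothetical values).
def toy_w2v(a: str, b: str) -> float:
    return 1.0 if a == b else 0.5


def toy_hn(a: str, b: str) -> float:
    return 1.0 if a == b else 0.25


score = combined_similarity("Beijing", "Tian An-men", toy_w2v, toy_hn)
```

Passing the two similarity sources in as callables keeps the weighting logic independent of how each source is computed.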

With the semantic similarity between words defined, the application applies preset clustering to the keywords of the small files to be stored, dividing the keywords into clusters such that keywords in the same cluster have high semantic similarity. On that basis, the small files waiting in the buffer can be partitioned logically according to the correspondence between small files and keywords: small files whose keywords fall in the same cluster are assigned to the same logical unit, so that small files in one logical unit have high semantic relatedness to one another.

The clustering of the small file keywords is described next. The process specifically comprises:

1) Preliminary clustering of keywords (nearest-neighbor clustering)

The n keywords of the small files are mapped to a group of disjoint sets D = {D_1, D_2, …, D_n}, each set being one cluster. (Each small file corresponds to one keyword, and one keyword may correspond to several small files, so the relation between small files and keywords is N:1.) The semantic similarity between every pair of keywords is calculated, and the resulting values are sorted in descending order. The similarity values sim(W_i, W_j) are then taken out in order until sim(W_i, W_j) falls below a preset threshold. For each sim(W_i, W_j) taken out, the sets D_i and D_j to which W_i and W_j belong are found and merged into a new cluster, i.e., a new set, and the original sets D_i and D_j are deleted. Because the relatedness relation between files is symmetric and transitive, an undirected, disconnected graph can be built from the new set distribution; a depth-first traversal of this graph from each node then places all related nodes (keywords) in one set, which completes one iteration of keyword clustering. After one iteration, the disjoint sets have the following structure:

D_1 = {W_1, W_2, …, W_i};

D_2 = {W_3, W_6, …, W_j};

…

D_m = {W_n, W_t, …, W_k}.

After this process the number of sets has changed, and many keywords have been assigned to their corresponding sets (clusters), which completes the preliminary clustering of keywords.
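The preliminary clustering can be sketched as follows. This is a minimal illustration under stated assumptions: it uses a union-find (disjoint-set) structure in place of the graph construction and depth-first traversal described above, which yields the same connected groups; the function name `preliminary_clusters`, the toy similarity table, and the threshold 0.7 are hypothetical, and the keywords are assumed to be unique.

```python
from itertools import combinations
from typing import Callable, Dict, List


def preliminary_clusters(keywords: List[str],
                         sim: Callable[[str, str], float],
                         threshold: float) -> List[List[str]]:
    """Merge keyword sets whose pairwise similarity reaches the threshold."""
    parent: Dict[str, str] = {w: w for w in keywords}  # each keyword starts in its own set

    def find(w: str) -> str:
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path halving
            w = parent[w]
        return w

    # Score every pair and visit the pairs in descending order of similarity.
    scored = sorted(((sim(a, b), a, b) for a, b in combinations(keywords, 2)),
                    key=lambda t: t[0], reverse=True)
    for score, a, b in scored:
        if score < threshold:
            break  # all remaining pairs are below the threshold
        parent[find(a)] = find(b)  # merge the sets containing a and b

    groups: Dict[str, List[str]] = {}
    for w in keywords:
        groups.setdefault(find(w), []).append(w)
    return list(groups.values())


# Toy similarity table (hypothetical values); unlisted pairs score 0.1.
pairs = {frozenset(("Beijing", "Tian An-men")): 0.9,
         frozenset(("Tian An-men", "scenic spot")): 0.8}


def toy_sim(a: str, b: str) -> float:
    return pairs.get(frozenset((a, b)), 0.1)


clusters = preliminary_clusters(
    ["Beijing", "Tian An-men", "scenic spot", "ruins"], toy_sim, 0.7)
```

In this toy run "Beijing" and "scenic spot" end up in the same cluster through "Tian An-men", which reflects the transitivity noted above.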

2) Iterative clustering between clusters

The preliminary clustering of keywords yields an initial partition into clusters, and at this point there are many of them. This step therefore continues by iteratively clustering the clusters obtained from the preliminary keyword clustering, so that closely associated clusters are merged.

Iterative clustering between clusters is implemented with the following steps:

Step 1: Extract the feature representation of each cluster.

After the preliminary clustering of keywords, some clusters contain many keywords. For clusters with many keywords, the application reduces the dimensionality of the cluster's features (each keyword in a cluster represents one feature of the cluster), so that the cluster is described by a few representative keywords.

Specifically, the application extracts cluster features by clustering the keywords within each cluster once more. For each cluster, the sub-clusters obtained from this intra-cluster clustering are sorted in descending order of the number of keywords they contain, and the sub-cluster with the most keywords is taken as the feature representation of the cluster (the maximum number of keywords in the feature sub-cluster must also be limited). For example, suppose cluster D_1 = {W_1, W_2, …, W_i} is divided by one round of intra-cluster keyword clustering into the sub-clusters D_1 = {(W_1, W_7, W_11), …, (W_i, W_j)}; then (W_1, W_7, W_11) can be selected as the feature representation of cluster D_1. This is equivalent to updating the cluster center for the next clustering step between clusters.
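The selection of a feature sub-cluster can be sketched as below. The sketch assumes the intra-cluster sub-clusters have already been computed and only shows the sort-and-cap step; the function name `cluster_feature` and the cap of three keywords are hypothetical choices for illustration.

```python
from typing import List


def cluster_feature(sub_clusters: List[List[str]], max_features: int = 3) -> List[str]:
    """Pick the largest sub-cluster and cap its size to serve as the cluster's feature."""
    ranked = sorted(sub_clusters, key=len, reverse=True)  # descending by keyword count
    return ranked[0][:max_features] if ranked else []


# Mirrors the example in the text: the largest sub-cluster represents D_1.
feature = cluster_feature([["W1", "W7", "W11"], ["Wi", "Wj"]])
```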

Step 2: Calculate and save the relatedness between every pair of clusters.

After features have been extracted for each cluster obtained from the preliminary keyword clustering, the extracted cluster features are used to calculate the similarity between clusters, as follows:

Suppose clusters D_i = {W_i1, W_i2, …, W_in} and D_j = {W_j1, W_j2, …, W_jm}; the similarity between D_i and D_j is then:

$$\mathrm{Sim}(D_i, D_j) = \frac{1}{m \cdot n} \sum_{p=1}^{n} \sum_{q=1}^{m} \mathrm{Sim}(W_{ip}, W_{jq}) \tag{2}$$

In formula (2), Sim(D_i, D_j) denotes the similarity between clusters D_i and D_j, and Sim(W_ip, W_jq) is the similarity between feature keywords of the two clusters. That is, the relatedness values calculated between every pair of feature keywords across the two clusters are combined to represent the relatedness between the clusters, where the feature keywords of a cluster are the keywords contained in its feature representation.
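Formula (2) is the mean of all cross-cluster keyword similarities. A minimal sketch follows; the function name `cluster_similarity`, the handling of empty clusters, and the toy similarity function are assumptions made for the example.

```python
from typing import Callable, Sequence


def cluster_similarity(d_i: Sequence[str], d_j: Sequence[str],
                       sim: Callable[[str, str], float]) -> float:
    """Formula (2): mean pairwise similarity between the feature keywords of two clusters."""
    if not d_i or not d_j:
        return 0.0  # assumption: an empty cluster is treated as unrelated
    total = sum(sim(p, q) for p in d_i for q in d_j)
    return total / (len(d_i) * len(d_j))


# Toy similarity (hypothetical): words sharing a first letter are fully similar.
def same_initial(a: str, b: str) -> float:
    return 1.0 if a[0] == b[0] else 0.0


value = cluster_similarity(["apple", "avocado"], ["apricot", "banana"], same_initial)
```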

Step 3: Sort the inter-cluster relatedness values in descending order.

Step 4: Take sim(D_i, D_j) out of the descending sequence in order until sim(D_i, D_j) falls below a preset threshold. For each sim(D_i, D_j) taken out, merge the corresponding clusters D_i and D_j into one cluster D_k, so that all keywords of the two clusters are placed in one set.

Step 5: Check whether the number of clusters has changed and whether the number of iterations has reached a set count. If the condition is met, the iteration is complete; otherwise, return to step 1.
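Steps 1 to 5 can be sketched as one loop. This is a simplified illustration under stated assumptions: the first few keywords of each cluster stand in for the feature sub-cluster of step 1, the loop stops when a pass leaves the number of clusters unchanged or when `max_iters` passes have run (one reading of step 5), and the names `merge_clusters`, `avg_sim`, and the toy similarity are hypothetical. Clusters are assumed to be non-empty.

```python
from typing import Callable, Dict, List

Sim = Callable[[str, str], float]


def avg_sim(a: List[str], b: List[str], sim: Sim) -> float:
    """Mean pairwise similarity between two non-empty keyword lists."""
    return sum(sim(p, q) for p in a for q in b) / (len(a) * len(b))


def merge_clusters(clusters: List[List[str]], sim: Sim, threshold: float,
                   max_iters: int = 10, max_features: int = 3) -> List[List[str]]:
    clusters = [list(c) for c in clusters]
    for _ in range(max_iters):
        # Step 1 (simplified): the first few keywords stand in for the feature sub-cluster.
        feats = [c[:max_features] for c in clusters]
        # Steps 2-3: score every pair of clusters, then sort in descending order.
        scored = sorted(((avg_sim(feats[i], feats[j], sim), i, j)
                         for i in range(len(clusters))
                         for j in range(i + 1, len(clusters))),
                        reverse=True)
        # Step 4: merge pairs until the similarity drops below the threshold.
        owner = list(range(len(clusters)))  # owner[i]: index of the cluster that absorbed i

        def root(i: int) -> int:
            while owner[i] != i:
                i = owner[i]
            return i

        for score, i, j in scored:
            if score < threshold:
                break
            ri, rj = root(i), root(j)
            if ri != rj:
                owner[rj] = ri
        merged: Dict[int, List[str]] = {}
        for idx, c in enumerate(clusters):
            merged.setdefault(root(idx), []).extend(c)
        new_clusters = list(merged.values())
        # Step 5 (assumed reading): stop once a pass no longer changes the cluster count.
        if len(new_clusters) == len(clusters):
            break
        clusters = new_clusters
    return clusters


animals = {"cat", "kitten", "dog", "puppy"}


def toy_sim(a: str, b: str) -> float:
    return 0.8 if a in animals and b in animals else 0.1


result = merge_clusters([["cat", "kitten"], ["dog", "puppy"], ["car"]], toy_sim, 0.5)
```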

After the preliminary keyword clustering and the iterative clustering between clusters, the keywords of the small files to be stored have been divided into clusters in which keywords of the same cluster have high semantic similarity. Because keywords correspond to small files, the small files waiting in the buffer can be partitioned logically according to the keyword clustering result: small files whose keywords fall in the same cluster are assigned to the same logical unit, and the files in one logical unit are highly related to one another, i.e., they have high semantic relatedness.
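The logical partition of buffered files follows directly from the keyword clusters. The sketch below assumes the N:1 file-to-keyword relation stated earlier; the function name `logical_units` and the file IDs are hypothetical.

```python
from typing import Dict, List


def logical_units(file_keyword: Dict[str, str],
                  keyword_clusters: List[List[str]]) -> List[List[str]]:
    """Group buffered small files by the cluster their keyword belongs to."""
    cluster_of = {kw: idx
                  for idx, cluster in enumerate(keyword_clusters)
                  for kw in cluster}
    units: Dict[int, List[str]] = {}
    for file_id, kw in file_keyword.items():
        units.setdefault(cluster_of[kw], []).append(file_id)
    return [units[idx] for idx in sorted(units)]


units = logical_units(
    {"f1": "Beijing", "f2": "Tian An-men", "f3": "ruins", "f4": "Beijing"},
    [["Beijing", "Tian An-men"], ["ruins"]],
)
```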

S103: Perform file merging on the small files based on the association relations among them, to obtain at least one large file.

After the cached small files to be stored have been partitioned into one or more logical units, this step merges the small files belonging to the same logical unit according to the logical partition.

To avoid excessive read times, a merged file should not be too large. That is, for a logical unit that contains a large number of small files or a large volume of data, the small files in that logical unit should not all be merged into a single large file; instead, they should be merged into several large files. The relation between a logical unit and its merged files can therefore be one-to-many.

Let the dividing line between large and small files be BS. Because every file in the file buffer has passed small file identification, every file in the buffer is smaller than BS. To achieve high file prefetching efficiency, small file merging first sorts all small files in the same logical unit by file size and then merges them one by one in that order. As soon as the file produced by the current merge exceeds BS, that merge ends and the next merge begins. This strategy ensures that the merged files have the same transfer and read/write efficiency as large files while also maximizing the hit rate when small files are prefetched.
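The batching rule can be sketched as follows. The sketch assumes ascending size order, which the text does not specify, and closes a batch as soon as its accumulated size exceeds BS; the function name `plan_merges` and the sizes are hypothetical.

```python
from typing import List, Tuple


def plan_merges(files: List[Tuple[str, int]], bs: int) -> List[List[str]]:
    """Split one logical unit into merge batches; files are (file_id, size_in_bytes)."""
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for file_id, size in sorted(files, key=lambda f: f[1]):  # assumption: ascending size
        current.append(file_id)
        current_size += size
        if current_size > bs:  # the merged file now exceeds BS: close this batch
            batches.append(current)
            current, current_size = [], 0
    if current:
        batches.append(current)  # leftover files form a final, smaller batch
    return batches


batches = plan_merges([("a", 40), ("b", 10), ("c", 30), ("d", 50), ("e", 20)], bs=64)
```

One logical unit can produce several batches, which matches the one-to-many relation between a logical unit and its merged files.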

Specifically, when merging files the system converts the small file data to binary data and, at each merge, appends the data after the end of the previous small file, recording the start position of the newly merged small file and the length it occupies. When the small file is needed later, it can be restored by extracting data of the specified length from that position.
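The append-and-record scheme can be sketched in memory as below. The function names `merge_files` and `extract` and the sample file names are hypothetical; a real system would write to storage rather than build the large file in a byte string.

```python
from typing import Dict, List, Tuple


def merge_files(small_files: List[Tuple[str, bytes]]) -> Tuple[bytes, Dict[str, Tuple[int, int]]]:
    """Concatenate small files and record each one's (start offset, length)."""
    blob = bytearray()
    index: Dict[str, Tuple[int, int]] = {}
    for file_id, data in small_files:
        index[file_id] = (len(blob), len(data))  # position before appending, and length
        blob.extend(data)                        # append after the previous small file
    return bytes(blob), index


def extract(blob: bytes, index: Dict[str, Tuple[int, int]], file_id: str) -> bytes:
    """Restore one small file from the merged data using its recorded position."""
    offset, length = index[file_id]
    return blob[offset:offset + length]


big, positions = merge_files([("a.txt", b"hello"), ("b.txt", b"cloud")])
restored = extract(big, positions, "b.txt")
```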

S104: Store the at least one large file.

After the small files have been merged into large files, the merged large files can be placed in distributed storage as required. Specifically, referring to Fig. 2, once the Web server has merged small files into a large file, it can upload the large file to a designated Bucket of the Ceph storage cluster through the corresponding interface service, which completes the cloud storage of the small files.

As the above solution shows, the small file storage method disclosed in this application determines the association relations among the small files from the semantic description information of each small file to be stored and, on that basis, merges and stores the small files according to the determined association relations. The application thus proposes a merge-and-store strategy for small files from the perspective of semantic association; applying it merges closely related small files and thereby effectively improves the read efficiency of small files.

Embodiment two

This embodiment supplements the solution of embodiment one. Referring to Fig. 3, in this embodiment the following step may also be included before step S104:

S105: Build an inverted index for the small files using the keywords of each small file; and determine, according to the result of the file merging, the mapping relation between each small file and its corresponding large file and the position information of each small file within its corresponding large file.

File retrieval is indispensable on any platform: while storing files, the platform must also support file search, that is, it must be able to locate the required file from the search terms a user enters. To support file retrieval, this embodiment uses the keywords of each small file to build an inverted index for the small files, forming a mapping from keyword to small file ID (identity). This embodiment specifically uses Lucene to build the inverted index over the keyword descriptions of the small files. Referring to Fig. 2, once the inverted index has been built, an index file library can be generated to store it, so that small files can later be retrieved and located quickly from the search terms entered.

The creation of an inverted index is illustrated below with a simple example, which gives brief descriptions of three pictures (equivalent to three small files):

A. Tian An-men in Beijing.

B. The ruins of Yuanmingyuan Park.

C. Tian An-men and Yuanmingyuan Park are both scenic spots in Beijing.

Word segmentation preprocessing yields:

A. [Beijing] [Tian An-men].

B. [Yuanmingyuan Park] [ruins].

C. [Tian An-men] [Yuanmingyuan Park] [Beijing] [scenic spot].

Building an inverted index over the segmented words of the pictures above gives the inverted list shown in Table 1:

Table 1
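The inverted list for the three segmented descriptions can be derived mechanically. The sketch below uses a plain Python dictionary rather than Lucene, purely to show the keyword-to-file mapping; the function name `build_inverted_index` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

docs: Dict[str, List[str]] = {
    "A": ["Beijing", "Tian An-men"],
    "B": ["Yuanmingyuan Park", "ruins"],
    "C": ["Tian An-men", "Yuanmingyuan Park", "Beijing", "scenic spot"],
}


def build_inverted_index(documents: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each term to the IDs of the documents that contain it."""
    index: Dict[str, List[str]] = defaultdict(list)
    for doc_id, terms in documents.items():
        for term in terms:
            if doc_id not in index[term]:  # record each document once per term
                index[term].append(doc_id)
    return dict(index)


inverted = build_inverted_index(docs)
```

For these three descriptions the mapping is: Beijing → A, C; Tian An-men → A, C; Yuanmingyuan Park → B, C; ruins → B; scenic spot → C.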

In addition to building an inverted index for the small files, this embodiment establishes the mapping relation between small files and large files according to how the small files were merged, records the position of each small file within its corresponding large file, and stores the mapping relations and position information in the corresponding database server. The inverted index, the small-file-to-large-file mapping relations, and the position information of small files within large files together constitute the index information of the small files and support subsequent retrieval, location, and reading of small files.

Embodiment three

On the basis of the solution of embodiment two, and referring to Fig. 4, the small file storage method in this embodiment may further comprise the following step:

S106: Read the required small file based on the inverted index, the mapping relation between the small files and their corresponding large files, and the position information of each small file within its corresponding large file.

Building on the above embodiments, this embodiment provides a scheme for reading small files. Reading a small file can be divided into two steps, file retrieval and file download. In file retrieval, after the search terms entered by the user are received, Lucene is used to search the inverted index that was created, yielding a result list that matches the search terms and contains one or more small file IDs. The small file IDs obtained from retrieval are then used to query the database for the ID of the merged large file that contains each small file, and a file read request is sent to the Ceph cluster according to the large file ID obtained from the query.

After receiving the request, the Ceph cluster locates the corresponding Bucket and obtains the required large file. Because merged large files are stored on Ceph as objects, every read of a small file requires downloading and caching the whole large file that contains it. The downloaded and cached large file can then be split apart, and the required small file is obtained according to its position information within the large file.

It should be noted that, because the application merges and stores closely associated small files on the basis of small file semantics when placing small files in cloud storage, the read efficiency of highly related small files is effectively improved when small files are downloaded and read.
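The read path of this embodiment can be sketched end to end. All names here are hypothetical: `read_small_files` is an illustrative function, the dictionaries stand in for the Lucene index and the database records, and `fake_fetch` stands in for downloading a whole object from the Ceph cluster.

```python
from typing import Callable, Dict, List, Tuple


def read_small_files(term: str,
                     inverted_index: Dict[str, List[str]],
                     location: Dict[str, Tuple[str, int, int]],
                     fetch_large_file: Callable[[str], bytes]) -> Dict[str, bytes]:
    """Retrieve every small file matching a search term."""
    cache: Dict[str, bytes] = {}
    results: Dict[str, bytes] = {}
    for small_id in inverted_index.get(term, []):          # file retrieval
        large_id, offset, length = location[small_id]      # mapping and position lookup
        if large_id not in cache:
            cache[large_id] = fetch_large_file(large_id)   # whole-object download, cached
        results[small_id] = cache[large_id][offset:offset + length]
    return results


# Toy stand-ins for the object store and the index records.
store = {"big-1": b"hellocloud"}
calls: List[str] = []


def fake_fetch(large_id: str) -> bytes:
    calls.append(large_id)
    return store[large_id]


found = read_small_files(
    "Beijing",
    {"Beijing": ["A", "C"]},
    {"A": ("big-1", 0, 5), "C": ("big-1", 5, 5)},
    fake_fetch,
)
```

In this toy run both matching small files sit in the same large file, so a single download serves both reads, which is the effect the semantic merging aims for.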

Embodiment four

The open a kind of small documents storage system of the present embodiment, described small documents storage system and above each enforcement Disclosed in example, small documents storage method is corresponding.

Corresponding to embodiment one, with reference to Fig. 5, described system can include describing data obtaining module 100, Incidence relation determines that module 200, small documents merge module 300 and memory module 400.

Data obtaining module 100 is described, for obtaining the semantic description information of multiple small documents to be stored.

Wherein, described description data obtaining module 100 includes key word acquiring unit, is used for obtaining to be stored The key word of multiple small documents.

Incidence relation determines module 200, for based on the semantic description information of small documents each described, determines Incidence relation between each described small documents.

Described incidence relation determines that module 200 includes that computing unit, clustering processing unit and incidence relation determine Unit.

The small file merging module 300 is configured to merge the small files based on the association relationships among them, obtaining at least one big file.

The storage module 400 is configured to store the at least one big file.
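The flow through modules 100 to 300 can be sketched as follows. This is only an illustrative sketch: Jaccard similarity over keyword sets and a greedy threshold grouping are stand-ins chosen for brevity, not the computation or clustering method of the computing and clustering processing units, and all function names and the 0.5 threshold are assumptions.

```python
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Keyword overlap between two small files (illustrative similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_by_association(keywords: Dict[str, Set[str]],
                         threshold: float = 0.5) -> List[List[str]]:
    """Greedily place each file in the first group whose first member is similar enough."""
    groups: List[List[str]] = []
    for name, words in keywords.items():
        for group in groups:
            if jaccard(words, keywords[group[0]]) >= threshold:
                group.append(name)
                break
        else:
            # No sufficiently similar group exists, so start a new one.
            groups.append([name])
    return groups


def merge_groups(groups: List[List[str]], contents: Dict[str, bytes]) -> List[bytes]:
    """Concatenate the files of each group into one big file per group."""
    return [b"".join(contents[name] for name in group) for group in groups]
```

The resulting big files would then be handed to the storage module; a real implementation would also record each small file's position within its big file.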

The functions of the above modules or units depend on a preprocessing step of small file identification. Therefore, with reference to Fig. 6, the system further includes a small file identification module 500, which performs small file identification on uploaded files to be stored according to a preset small file identification criterion.
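A minimal sketch of such an identification step is shown below. It assumes, purely for illustration, that the preset criterion is a size threshold; the threshold value and the function name `identify_small_files` are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical preset criterion: files below this size count as small files.
SMALL_FILE_THRESHOLD_BYTES = 1024 * 1024


def identify_small_files(uploads: Dict[str, bytes],
                         threshold: int = SMALL_FILE_THRESHOLD_BYTES
                         ) -> Tuple[List[str], List[str]]:
    """Split uploaded files into (small files, other files) by size."""
    small: List[str] = []
    other: List[str] = []
    for name, data in uploads.items():
        if len(data) < threshold:
            small.append(name)
        else:
            other.append(name)
    return small, other
```

Only the files in the first list would be passed on to the merging pipeline.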

Corresponding to embodiment two, and with reference to Fig. 7, the system may further include an index creation module 600, which includes a first index creation unit and a second index creation unit.

Corresponding to embodiment three, and with reference to Fig. 8, the system may further include a small file reading module 700, configured to read a required small file based on the inverted index, the mapping relationships between the small files and their corresponding big files, and the position information of each small file within its corresponding big file.
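The lookup chain used by the reading module can be sketched as below. The data shapes are assumptions: the inverted index is modeled as a keyword-to-file-names dictionary, the mapping relationship as a small-file-to-big-file dictionary, and the position information as (offset, length) pairs. The class name `SmallFileReader` and the `fetch_big_file` callback, which stands in for downloading an object from the store, are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


class SmallFileReader:
    """Resolve keyword -> small files -> big file -> byte range, caching big files."""

    def __init__(self,
                 inverted_index: Dict[str, List[str]],
                 file_to_big: Dict[str, str],
                 positions: Dict[str, Tuple[int, int]],
                 fetch_big_file: Callable[[str], bytes]) -> None:
        self.inverted_index = inverted_index
        self.file_to_big = file_to_big
        self.positions = positions
        self.fetch_big_file = fetch_big_file
        self._cache: Dict[str, bytes] = {}
        self.downloads = 0  # counts whole-object downloads, for illustration

    def _big_file(self, big_name: str) -> bytes:
        # Download the whole big file only on the first request, then reuse it.
        if big_name not in self._cache:
            self._cache[big_name] = self.fetch_big_file(big_name)
            self.downloads += 1
        return self._cache[big_name]

    def read_by_keyword(self, keyword: str) -> Dict[str, bytes]:
        """Return every small file indexed under the keyword."""
        result: Dict[str, bytes] = {}
        for name in self.inverted_index.get(keyword, []):
            big = self._big_file(self.file_to_big[name])
            offset, length = self.positions[name]
            result[name] = big[offset:offset + length]
        return result
```

When associated small files sit in the same big file, a single download serves all of them, as the `downloads` counter shows in a quick check.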

Because the small file storage system disclosed in embodiment four corresponds to the small file storage method disclosed in embodiments one to three, its description is relatively brief. For related and similar details, refer to the description of the small file storage method in embodiments one to three, which is not repeated here.

It should be noted that the embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another.

For convenience of description, the above system or device is described as being divided by function into various modules or units. Of course, when the present application is implemented, the functions of the units may be realized in one or more pieces of software and/or hardware.

From the above description of the embodiments, those skilled in the art can clearly understand that the present application can be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present application or in certain parts of the embodiments.

Finally, it should also be noted that, in this document, relational terms such as first, second, third, and fourth are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the statement "including a ..." does not exclude the existence of other identical elements in the process, method, article, or device that includes the element.

The above are only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art may make several improvements and refinements without departing from the principles of the present invention, and these improvements and refinements shall also be regarded as falling within the protection scope of the present invention.

Claims

1. A small file storage method, characterized by comprising:

storing the at least one big file.

2. The method according to claim 1, characterized in that obtaining the semantic description information of the multiple small files to be stored comprises:

obtaining keywords of the multiple small files to be stored.

3. The method according to claim 2, characterized in that determining the association relationships among the small files based on the semantic description information of each small file comprises:

4. The method according to any one of claims 1-3, characterized by further comprising the following preprocessing procedure:

5. The method according to any one of claims 2-3, characterized in that, before storing the at least one big file, the method further comprises:

6. The method according to claim 5, characterized by further comprising:

7. A small file storage system, characterized by comprising:

a storage module, configured to store the at least one big file.

8. The system according to claim 7, characterized in that the description information acquisition module comprises:

9. The system according to claim 8, characterized in that the association relationship determination module comprises:

10. The system according to any one of claims 7-9, characterized by further comprising:

11. The system according to any one of claims 8-9, characterized by further comprising an index creation module, the index creation module comprising:

12. The system according to claim 11, characterized by further comprising: