A cloud storage optimization method and system for massive small files
Technical field
The present invention relates to the technical field of cloud computing platforms, and more particularly to a cloud storage optimization method for massive small files and its cloud storage optimization system.
Background art
In recent years, under the HDFS file system, each file is given corresponding metadata such as an inode. Under a massive-file scenario, however, traditional HDFS cannot carry so large a metadata I/O volume and so huge a metadata-search computation load. The only way out is to reduce the amount of metadata, which in turn necessarily reduces the number of file entities. These file systems therefore all adopt the same compromise, namely creating files inside a file, and they cannot store massive small files efficiently.
Metadata is dispersed in each file, and if the returned metadata contains metadata beyond the four items mentioned above, that metadata is useless to users, yet it is read into memory on every picture request, which has a tremendous impact on performance in massive-picture scenarios.
Summary of the invention
The purpose of this section is to summarize some aspects of the embodiments of the present invention and to briefly introduce some preferred embodiments. Some simplifications or omissions may be made in this section, in the abstract, and in the title of the invention to avoid obscuring their purpose; such simplifications or omissions cannot be used to limit the scope of the present invention.
In view of the above-mentioned existing problems, the present invention is proposed.
Therefore, the technical problem solved by the present invention is: to provide a cloud storage optimization method for massive small files, addressing the inability of existing file systems to store massive small files efficiently.
In order to solve the above technical problems, the present invention provides the following technical scheme: a cloud storage optimization method for massive small files, comprising the following steps: deploying a Haystack engine to a cloud platform; using the Haystack engine to package each small file already written to the cloud disk and to create a corresponding needle model; constructing the set of needle models into a data file module, the small files being appended to the data file module in the order in which they were written; using the Haystack engine to generate an index file and write it to the cloud disk; and realizing file search according to the index file.
As a preferred embodiment of the cloud storage optimization method for massive small files of the present invention: deploying the Haystack engine to the cloud platform comprises the following steps: the cloud platform provides a database operation interface, and the Haystack engine is installed; a search engine is configured, with the target configuration added in the settings module; an index class is created and block mapping relations are added; and a template is added, with a search column created in the template.
As a preferred embodiment of the cloud storage optimization method for massive small files of the present invention: based on the Haystack engine, users and programs realize the reading and writing of objects and the access of storage resources through a web services protocol, including creating needle models and generating the index file, wherein each needle model contains the key, size, and data information of a small file.
As a preferred embodiment of the cloud storage optimization method for massive small files of the present invention: the index file saves the key, offset, and size information of each needle model, and the index file saves only the first four bytes of the key; the needle models in the data file module are stored according to the lexicographic order of their keys.
As a preferred embodiment of the cloud storage optimization method for massive small files of the present invention: a corresponding offset is assigned during the creation of each needle model; if needle models with the same key are encountered while constructing or updating the in-memory mapping, the one with the higher offset value overwrites the one with the lower value.
As a preferred embodiment of the cloud storage optimization method for massive small files of the present invention, the method further comprises the step of searching according to the index file: reading a small file; looking up, in the in-memory lexicographic order, the first 4 bytes of the key of the requested file; obtaining the offset and size values, and obtaining the key value of the needle model from the store; judging whether the model key value is equal to the key value of the file, and if so, returning the data to the user according to the size in the needle model; if not, judging whether the model key value is equal to the first 4 bytes of the file key; if so, calculating the position of the next needle model and reading its key value; if not, returning no data.
As a preferred embodiment of the cloud storage optimization method for massive small files of the present invention: the method includes a definition of small files, set as different thresholds of the framework block according to the different needs of users; all files smaller than the framework-block threshold size are defined as small files, and the threshold is 75%.
As a preferred embodiment of the cloud storage optimization method for massive small files of the present invention: realizing file search according to the index file comprises loading the index into the memory of the cloud platform server when the Haystack engine starts, and locating the offset and size in the data file by searching the index in memory, including finding the initial position and size of a small file in the data file module.
As a preferred embodiment of the cloud storage optimization method for massive small files of the present invention: the small-file objects of the Haystack engine are image files, and the method includes image read and write steps: an upload layer receives the image uploaded by a user, measures the original image size, and saves it into a storage layer; an image service layer receives HTTP image requests and provides the user with images stored in the storage layer; a user request is first dispatched to the nearest cloud platform node; on a cache hit, the picture content is returned directly to the user, otherwise the back-end storage system is requested and the image content is returned to the user through the cache.
Another technical problem solved by the present invention is: to provide a cloud storage optimization system for massive small files, addressing the inability of existing file systems to store massive small files efficiently.
In order to solve the above technical problems, the present invention provides the following technical scheme: a cloud storage optimization system for massive small files, comprising a Haystack engine, needle models, a data file module, and an index file. The Haystack engine is an open-source search framework that can be deployed to a cloud platform and used directly; through a web services protocol, the object-based storage device realizes the reading and writing of objects and the access of storage resources. A needle model is a model created on the basis of the Haystack engine, used for saving the data information of a small file. The data file module is the set of needle models stored in sequence, saved in the cloud. The index file is the index table generated on the basis of the Haystack engine.
Beneficial effects of the present invention: the method reduces resource consumption in massive-small-file storage scenarios and reduces invalid input/output; scattered small files are spliced into a large file maintaining a small amount of metadata, and the metadata can be cached in memory, reducing a large amount of invalid input/output.
Description of the drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without any creative labor. Among them:
Fig. 1 is an overall flow diagram of the cloud storage optimization method for massive small files described in the first embodiment of the present invention;
Fig. 2 is an overall flow diagram of the indexed search described in the first embodiment of the present invention;
Fig. 3 is a structural diagram of the data file module described in the first embodiment of the present invention;
Fig. 4 is a structural diagram of the file information held by the data file module described in the first embodiment of the present invention;
Fig. 5 is a structural diagram of the index file described in the first embodiment of the present invention;
Fig. 6 is a flow diagram of information search matching described in the first embodiment of the present invention;
Fig. 7 is a structural diagram of the framework and the request processing flow described in the first embodiment of the present invention;
Fig. 8 is an overall schematic diagram of the cloud storage optimization system for massive small files described in the second embodiment of the present invention;
Fig. 9 is a schematic diagram of the test results for files described in the second embodiment of the present invention stored respectively under the traditional scheme and under the present scheme.
Specific embodiment
In order to make the above objects, features, and advantages of the present invention clearer and easier to understand, specific embodiments of the present invention are described in detail below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative labor shall fall within the protection scope of the present invention.
In the following description, numerous specific details are set forth to facilitate a full understanding of the present invention, but the present invention can also be implemented in ways other than those described here; those skilled in the art can make similar generalizations without departing from the spirit of the present invention, and the present invention is therefore not limited by the specific embodiments disclosed below.
Secondly, "one embodiment" or "an embodiment" as referred to herein means a particular feature, structure, or characteristic that may be included in at least one implementation of the present invention. "In one embodiment" appearing in different places in this specification does not always refer to the same embodiment, nor to individual or alternative embodiments that are mutually exclusive with other embodiments.
The present invention is described in detail in combination with schematic diagrams. When describing the embodiments of the present invention, for convenience of explanation, sectional views showing the device structure may be partially enlarged out of general proportion, and the schematic diagrams are only examples, which should not limit the protection scope of the present invention. In addition, the three dimensions of length, width, and depth should be included in actual fabrication.
Meanwhile, in the description of the present invention, it should be noted that orientation or positional relationships indicated by terms such as "upper, lower, inner, and outer" are based on the orientations or positional relationships shown in the drawings, are only for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the device or element referred to must have a particular orientation or be constructed and operated in a particular orientation; they therefore cannot be understood as limiting the present invention. In addition, the terms "first, second, or third" are used for descriptive purposes only and cannot be understood as indicating or implying relative importance.
In the present invention, unless otherwise expressly specified and limited, the terms "installed, connected, and coupled" shall be understood in a broad sense; for example, a connection may be fixed, detachable, or integral; it may likewise be mechanical, electrical, or direct, or indirect through an intermediary, or internal between two elements. For those of ordinary skill in the art, the specific meanings of the above terms in the present invention can be understood according to the specific circumstances.
As used in this application, the terms "component", "module", and "system" are intended to refer to a computer-related entity, which may be hardware, firmware, a combination of hardware and software, software, or running software. For example, a component may be, but is not limited to, a process running on a processor, a processor, an object, an executable file, a thread of execution, a program, and/or a computer. As an illustration, both an application running on a computing device and the computing device itself can be components. One or more components may reside within a process and/or thread of execution, and a component may be located on one computer and/or distributed between two or more computers. In addition, these components can execute from various computer-readable media having various data structures stored thereon. These components may communicate by way of local and/or remote processes, such as by signals comprising one or more data packets (for example, data from one component interacting with another component in a local system or distributed system, and/or interacting with other systems by way of signals over a network such as the Internet).
Embodiment 1
Referring to the schematic of Fig. 1, the present embodiment proposes an overall flow diagram of a cloud storage optimization method for massive small files. The method is implemented on a cloud platform deployed under the Hadoop distributed file system. The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has much in common with existing distributed file systems, but the differences from other distributed file systems are also apparent. HDFS is a fault-tolerant system suitable for deployment on cheap machines; it provides high-throughput data access and is well suited to applications on large-scale datasets. HDFS relaxes a part of the POSIX constraints in order to achieve the goal of streaming reads of file system data. HDFS was originally developed as the infrastructure of the Apache Nutch search engine project. It adopts a master-slave structural model: an HDFS cluster is composed of a NameNode and DataNodes, where the NameNode, as the primary server, manages the namespace of the file system and client access operations on files, and the DataNodes manage the stored data in the cluster. A user or a program can create directories and store files among many directories; the namespace hierarchy of the file system is similar to that of other file systems, and files can be created, moved from one directory to another, or renamed.
In early implementations, metadata was dispersed in each file and was useless, yet it was read into memory on every picture request, which therefore had a tremendous impact on performance in massive-picture scenarios. Under the Hadoop distributed file system, the Haystack engine 100 is deployed on the cloud platform for direct use, and users or programs can realize the reading and writing of objects and the access of storage resources through a web services protocol.
Metadata, also known as intermediary data or relay data, is data that describes data, mainly describing the attributes of data and supporting functions such as indicating storage location, historical data, resource lookup, and file recording. For example, a picture is a piece of data, and the picture may also carry some standard data, such as its size and date; these belong to the metadata.
Under the HDFS file system, this traditional design leads to excessive disk operations due to metadata lookups. The present embodiment therefore carefully reduces the metadata of each photo so that a Haystack storage machine can perform all metadata lookups in main memory. This choice saves the disk operations needed to read the real data, thereby improving overall throughput and reducing resource consumption in massive-small-file storage scenarios. More specifically, the method includes the following steps.
S1: deploy the Haystack engine 100 to the cloud platform. Based on the Haystack engine 100, users and programs realize the reading and writing of objects and the access of storage resources through a web services protocol, including creating needle models 200 and generating the index file 400, wherein each needle model 200 contains the key, size, and data information of a small file.
S2: use the Haystack engine 100 to package each small file already written to the cloud disk and create its corresponding needle model 200. In this step a corresponding offset is assigned during the creation of each needle model 200; if needle models 200 with the same key are encountered while constructing or updating the in-memory mapping, the one with the higher offset overwrites the one with the lower offset.
A needle generates a corresponding offset when it is created. Haystack does not allow a needle to be overwritten, so modifying a picture can only add a new needle possessing the same key. Haystack uses a very simple means to distinguish duplicate needles: it compares their offsets (the needle of the newest version is certainly the one with the highest offset); if identical needles are encountered while constructing or updating the in-memory mapping, the high offset overwrites the low one.
For example, suppose a file a.png exists and Haystack assigns offset 1 when creating a needle for it, and another file, also named a.png, is assigned offset 2 when Haystack creates its needle. If this file is later looked up purely by filename, two results will appear; filtering once more by offset, with the high value overwriting the low value, yields a single result.
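The duplicate-resolution rule above can be sketched in a few lines of Python. This is a minimal illustration; the (key, offset) tuple layout is an assumption for clarity, not a Haystack format.

```python
# Sketch of "highest offset wins" when building the in-memory needle mapping.
def build_mapping(needles):
    """needles: iterable of (key, offset) pairs, e.g. repeated writes of a.png."""
    mapping = {}
    for key, offset in needles:
        # A modified picture re-appends a needle with the same key and a
        # larger offset; the high offset overwrites the low one.
        if key not in mapping or offset > mapping[key]:
            mapping[key] = offset
    return mapping
```

Iterating in any order still yields the newest version of each key, which is why a simple offset comparison suffices to deduplicate.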
S3: construct the set of needle models 200 into the data file module 300, the small files being appended to the data file module 300 in the order in which they were written.
S4: use the Haystack engine 100 to generate the index file 400 and write it to the cloud disk. The index file 400 saves the key, offset, and size information of each needle model 200, and saves only the first four bytes of each key, so its resource consumption is only a quarter of the traditional approach (16 bytes). The needle models 200 in the data file module 300 are stored according to the lexicographic order of their keys.
S5: realize file search according to the index file 400. When the Haystack engine 100 starts, the index is loaded into the memory of the cloud platform server, and the offset and size in the data file are located by searching the index in memory, including finding the initial position and size of a small file in the data file module 300.
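The index record described in S4 (first four bytes of the key, plus offset and size) can be packed into a fixed 16-byte record. The struct layout below (4-byte prefix, 8-byte offset, 4-byte size, big-endian) is an assumption for illustration only, not the format used by the source:

```python
# Hypothetical 16-byte index record: 4-byte key prefix, 8-byte offset,
# 4-byte size (layout assumed for illustration; not the source's format).
import struct

RECORD = struct.Struct(">4sQI")          # prefix, offset, size = 16 bytes

def write_index(entries):
    """entries: (key, offset, size) triples, pre-sorted by key as in S4."""
    return b"".join(RECORD.pack(key[:4], off, size) for key, off, size in entries)

def load_index(blob):
    """Load the index into memory as {prefix: (offset, size)}, as in S5."""
    return {p: (o, s) for p, o, s in RECORD.iter_unpack(blob)}
```

Loading the whole index as a flat dictionary mirrors S5: once it is in memory, locating a small file's start position and size requires no disk I/O at all.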
Further, in the present embodiment, deploying the Haystack engine 100 to the cloud platform includes the following steps.
The cloud platform provides a database operation interface, and the Haystack engine 100 is installed from PyPI. It supports four full-text search engine back ends — whoosh, solr, Xapian, and Elasticsearch — and belongs to a full-text search framework. For a concrete implementation of this step, refer, for example, to the following:
A search engine is configured, and the target configuration is added in the settings module; for example, the following configuration is added in settings, to which reference may be made:
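The configuration itself is elided in the source. As a hedged illustration only (the app names and index path below are hypothetical placeholders, not taken from the source), a django-haystack settings fragment selecting the whoosh back end named above might look like:

```python
# settings.py -- hypothetical django-haystack configuration (illustrative only)
import os

INSTALLED_APPS = [
    # ... existing apps ...
    'haystack',          # the full-text search framework installed from PyPI
    'news',              # hypothetical example app to be indexed
]

HAYSTACK_CONNECTIONS = {
    'default': {
        # whoosh is one of the four supported back ends named above
        'ENGINE': 'haystack.backends.whoosh_backend.WhooshEngine',
        'PATH': os.path.join(os.path.dirname(__file__), 'whoosh_index'),
    },
}

# rebuild the index automatically when models change
HAYSTACK_SIGNAL_PROCESSOR = 'haystack.signals.RealtimeSignalProcessor'
```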
An index class is created and block mapping relations are added: a search_index.py file is created, and then the index class is created; for example, to create an index for News, reference may be made to the following:
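The index class is elided in the source. A sketch under the usual django-haystack conventions (the News model and its fields are hypothetical assumptions) could be:

```python
# search_index.py -- hypothetical index class for a News model (illustrative only)
from haystack import indexes
from .models import News


class NewsIndex(indexes.SearchIndex, indexes.Indexable):
    # document=True marks the main field searched by the engine;
    # use_template=True pulls its content from the _text.txt template
    text = indexes.CharField(document=True, use_template=True)

    def get_model(self):
        return News

    def index_queryset(self, using=None):
        # index every News row when the index is (re)built
        return self.get_model().objects.all()
```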
A url mapping is added in url.py; a concrete implementation may refer to the following:
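The mapping itself is elided in the source. Under django-haystack conventions (the route name is an assumption), it might look like:

```python
# urls.py -- hypothetical url mapping for the search view (illustrative only)
from django.urls import include, path

urlpatterns = [
    # delegate /search/ to django-haystack's bundled search view
    path('search/', include('haystack.urls')),
]
```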
A template is added, and a search column is created in the template; for example, the following directory structure is created under the template directory:
templates -- search -- indexes -- news (the name of the app) -- news_text.txt. The fields that need to be indexed are added in the news_text.txt file; reference may be made to the following:
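The template content is elided in the source. Under django-haystack conventions, news_text.txt lists the model fields to be fed into the full-text index; a hypothetical example (the title and body fields are assumptions, not taken from the source):

```
{{ object.title }}
{{ object.body }}
```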
Referring to the schematic of Fig. 2, the present embodiment further includes the step of searching according to the index file 400, specifically as follows:
read the small file;
according to the search request, look up the first 4 bytes of the file's key in the in-memory lexicographic order;
obtain the offset and size values, and obtain the key value of the needle model 200 from the store;
judge whether the model key value is equal to the key value of the file; if so, return the data to the user according to the size in the needle model 200; if not, judge whether the model key value is equal to the first 4 bytes of the file key;
if so, calculate the position of the next needle model 200 and read its key value; if not, return no data.
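The lookup steps above can be sketched in Python. This is a minimal illustration assuming a fixed 8-byte key, a 4-byte index prefix, and needles laid out back-to-back in key order; none of these constants come from the source.

```python
# Minimal sketch of the prefix-indexed lookup described above (illustrative only).
# Each needle is stored as (key, data); the index maps the first 4 bytes of a
# key to the position of the first needle whose key carries that prefix.

def find_needle(needles, index, want_key):
    """needles: list of (key, data) sorted by key; index: {prefix: position}."""
    pos = index.get(want_key[:4])
    if pos is None:
        return None                       # prefix absent: no data returned
    while pos < len(needles):
        key, data = needles[pos]
        if key == want_key:               # full key matches: return the data
            return data
        if key[:4] == want_key[:4]:       # prefix collision: step to next needle
            pos += 1
            continue
        return None                       # left the prefix run: miss
    return None


needles = [(b"abcd11ef", b"img-1"), (b"abcd22ef", b"img-2"), (b"zzzz0000", b"img-3")]
index = {b"abcd": 0, b"zzzz": 2}
```

Because needles are stored in lexicographic key order, all keys sharing a 4-byte prefix are contiguous, so a prefix collision is resolved by stepping forward through at most a few neighbors.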
Referring to the schematics of Figs. 3–4, the structure of the data file module 300 in the present embodiment is shown; Fig. 5 shows the composition of the index file 400, and Fig. 6 shows the composition of search matching. Here, size marks the physical size of the file, key the file identifier, date the file information, and id the file id.
For example: the key of the file the user reads is ab cd ef 2a, and the prefix is matched at ab cd ef ac; at this point the offset points to the needle ab cd ef 1a, so the first match misses. Through the size stored in the needle header, the position of ab cd ef 2a can be located, the correct needle is matched, and the data is read and returned to the user.
In order to reduce invalid IO (such as the metadata of directory entries and the permission information of files), the scattered small files are spliced into a large file maintaining a small amount of metadata (id, offset, size, cookie, etc.), which effectively reduces IO, and the metadata can be cached in memory, reducing a large amount of invalid IO. By reorganizing the file structure and caching, the metadata of an average picture needs only 10 B of memory, making it feasible to cache the metadata of all pictures, so a read operation needs only 1 IO.
Scenario: memory and time consumed to store 1,000,000 files in the present embodiment:

            HDFS        This embodiment
  Memory    >300 MB     <100 MB
  Time      >2 hours    <0.5 hour
From the above it is evident that, whether in time consumed or in storage amount, the present embodiment has a considerable advantage over HDFS. Referring additionally to the schematic of Fig. 9, the present embodiment stored files at a population size of 1,000,000,000 both under the traditional scheme and under the method of the present scheme. Traditional cloud storage HDFS includes the open-source scheme provided by Apache; the present embodiment uses the Apache open-source scheme as the comparison test against the storage test of the present scheme. The study found, as can be seen from Fig. 9, that with a file count of 1,000,000,000 the traditional scheme consumes more than 26 GB of memory, while the present scheme consumes only a little over 9 GB, a memory reduction of about two thirds. The present embodiment also tested the file upload transfer speed, with the cloud storage of the traditional scheme and with the cloud storage optimization method of the present embodiment respectively; the test results may be consulted in the table below, from which it is not difficult to see that the present embodiment has a clear advantage over the conventional method both in storage space consumption and in storage speed.
It should be noted that the optimization method of the present embodiment is most effective for processing pictures; the small files in the present embodiment therefore refer to images. Understandably, the process of requesting and reading/writing images is certainly included. Specifically, a definition of small files is first included, set as different thresholds of the framework block according to the different needs of users: all files smaller than the framework-block threshold size are defined as small files, and the threshold is 75%. For example, the hadoop block size is usually set to 128 MB, with a trend toward 256 MB and larger. According to different requirements, the specific decision rule for small files may also differ; here it is assumed to be 75% of the hadoop block size, i.e. every file whose size is less than 75% of the size of a hadoop block is a small file.
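The threshold rule above can be stated in a few lines. A minimal sketch, assuming the 128 MB block size and 75% ratio given in the text:

```python
# Small-file test from the 75%-of-block-size rule described above.
BLOCK_SIZE = 128 * 1024 * 1024      # hadoop block size used in the example (bytes)
SMALL_FILE_RATIO = 0.75             # threshold named in the embodiment

def is_small_file(size_bytes, block_size=BLOCK_SIZE, ratio=SMALL_FILE_RATIO):
    """A file is 'small' when it is below ratio * block_size."""
    return size_bytes < ratio * block_size
```

With a 128 MB block the cut-off is 96 MB; changing the block size or ratio adapts the rule to the different user needs the text mentions.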
The small-file objects handled on the basis of the Haystack engine 100 are image files, and the method includes image read and write steps: the upload layer receives the image uploaded by a user, measures the original image size, and saves it into the storage layer; the image service layer receives HTTP image requests and provides the user with images stored in the storage layer. A user request is first dispatched to the nearest cloud platform node; on a cache hit, the picture content is returned directly to the user, otherwise the back-end storage system is requested and the image content is returned to the user through the cache.
Further, referring to the schematic of Fig. 7, the Haystack image basic framework and processing flow are as follows:
Haystack is an HTTP-based object store; it contains pointers mapped to storage objects. In Haystack, images are stored with pointers, and hundreds of thousands of images are gathered into one Haystack store file, eliminating the metadata load. This makes the metadata overhead very small and enables storing, in the store file and in the memory index, the position of each pointer, so that the retrieval of image data can be completed with a small number of I/O operations, eliminating some unnecessary metadata overhead.
In the present embodiment the Haystack framework mainly includes three parts: Haystack Directory, Haystack Store, and Haystack Cache. Haystack Store is the physical storage node; it organizes storage space in the form of physical volumes, each physical volume corresponding to one physical file, so the physical-file meta-information on each storage node is very small. The physical volumes managed on multiple object storage nodes form one logical volume, used for backup. Haystack Directory stores the correspondence between logical volumes and physical volumes. Haystack Cache is mainly used to solve the problem of over-reliance on cloud providers, providing a caching service for recently added pictures.
The write-request (picture upload) processing flow of Haystack is as follows: the Web Server first requests an Image ID and a writable logical volume from Haystack Directory, and then writes the data into each corresponding physical volume. The functions of Haystack Directory are as follows:
providing the mapping from logical volumes to physical volumes and allocating Image IDs for write requests;
providing load balancing, selecting a logical volume for write operations and a physical volume for read operations;
shielding the cloud server, allowing certain picture requests to go directly to Haystack Cache;
marking certain logical volumes as read-only.
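As a hedged sketch of the Directory role listed above (the data layout and names are assumptions, not taken from the source), a minimal logical-to-physical volume mapping with read-only marking might look like:

```python
# Minimal sketch of a Haystack-Directory-style volume mapping (illustrative only).
import itertools

class Directory:
    def __init__(self, volume_map):
        # logical volume id -> list of replica physical volume ids
        self.volume_map = volume_map
        self.read_only = set()
        self._ids = itertools.count(1)   # simple Image ID allocator

    def assign_write(self):
        """Pick a writable logical volume and allocate an image id."""
        for lv, replicas in self.volume_map.items():
            if lv not in self.read_only:
                return next(self._ids), lv, replicas  # write to every replica
        raise RuntimeError("no writable logical volume")

    def pick_read(self, logical_volume):
        """For reads, any one physical replica suffices."""
        return self.volume_map[logical_volume][0]

    def mark_read_only(self, logical_volume):
        self.read_only.add(logical_volume)
```

Writes fan out to all replicas of a logical volume (the backup role described above), while a read picks a single physical volume, which is where the Directory's load balancing would plug in.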
In the Haystack store file, each pointer has a corresponding index record, and the order of the pointer index records must match the order of the pointers in the associated Haystack store file. The index file provides the minimum metadata needed to find a particular pointer in the Haystack store file. For fast lookup, the index records are loaded into and organized in a data structure; this is the responsibility of the Haystack application. The main purpose of the index is to load pointer metadata into memory quickly without traversing the huge Haystack store file, since the size of the index is typically less than 1% of the store file.
Embodiment 2
Referring to the schematic of Fig. 8, the present embodiment proposes a cloud storage system for massive small files; the method of the above embodiment relies on the present embodiment for its realization. The cloud storage system is based on the HDFS file system. In modern corporate environments, single-machine capacity often cannot store massive data, so storage across machines is needed; a file system managed uniformly and distributed over a cluster is called a distributed file system. Once a network is introduced into a system, all the complexity of network programming is inevitably introduced as well — one challenge, for example, is how to guarantee that data is not lost when a node becomes unavailable. This system is deployed for use on the cloud platform.
Although the traditional Network File System is also called a distributed file system, it has some limitations. Because files are stored on a single machine, no reliability guarantee can be provided, and when many clients access it simultaneously it is easy to put pressure on the server and cause performance bottlenecks. In addition, to operate on a file one must first synchronize it locally, and before these modifications are synchronized back to the server they are invisible to other clients. In a certain sense it is not a typical distributed system.
HDFS provides various modes of interaction, for example through the Java API, HTTP, or the shell command line. Command-line interaction is mainly operated through hadoop fs; the file system concept of Hadoop is abstract, and HDFS is one implementation of it. The command line can be used to interact with HDFS, and there are many other ways to operate on the file system; for example, a Java application can operate through org.apache.hadoop.fs.FileSystem, and other forms of operation are also encapsulated on the basis of FileSystem. The present embodiment uses the HTTP mode of interaction.
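The HTTP interaction itself is not shown in the source. As a hedged illustration, HDFS's standard HTTP interface is the WebHDFS REST API, in which each operation is expressed as a URL of the form /webhdfs/v1/&lt;path&gt;?op=OP; a small helper that builds such URLs (the host, port, and paths below are placeholders) could be:

```python
# Hypothetical helper for HDFS's WebHDFS REST interface (illustrative only;
# host/port/path values are placeholders, not taken from the source).
from urllib.parse import urlencode

def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS URL like http://host:port/webhdfs/v1/<path>?op=OP&..."""
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{path}?{query}"

# e.g. reading a file would be an HTTP GET of:
read_url = webhdfs_url("namenode.example", 9870, "/images/a.png", "OPEN")
```

An HTTP client would then GET or PUT these URLs; building them separately keeps the sketch independent of any particular client library.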
Further, the system of the present embodiment includes the Haystack engine 100, needle models 200, the data file module 300, and the index file 400. More specifically:
The Haystack engine 100 is an open-source search framework that can be deployed to the cloud platform for direct use; through a web services protocol, the object-based storage device realizes the reading and writing of objects and the access of storage resources. A needle model 200 is a model created on the basis of the Haystack engine 100, used for saving the data information of a small file. The data file module 300 is the set of needle models 200 stored in sequence, saved in the cloud. The index file 400 is the index table generated on the basis of the Haystack engine 100.
The Haystack engine 100 can be deployed on the cloud platform through the provided interface and used directly; Haystack is Facebook's picture storage solution system, and the concrete deployment mode may refer to the method of the above embodiment. The needle models 200 are created through the methods built into the Haystack engine 100 and belong to the Haystack file system, and the Haystack engine 100 can generate the index file 400; the above modules run on the cloud platform.
It should be noted that the above embodiments are only used to illustrate the technical scheme of the present invention and are not limiting. Although the present invention has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that the technical scheme of the present invention may be modified or equivalently replaced without departing from the spirit and scope of the technical scheme of the present invention, all of which should be covered by the scope of the claims of the present invention.