CN110502472A - Cloud storage optimization method and system for massive small files - Google Patents

Cloud storage optimization method and system for massive small files

Info

Publication number
CN110502472A
Authority
CN
China
Prior art keywords
file
small files
haystack
engine
massive small files
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910735729.6A
Other languages
Chinese (zh)
Inventor
王任之
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tibet ningsuan Technology Group Co.,Ltd.
Original Assignee
Beijing Ningsuan Technology Co Ltd
Tibet Ningbo Information Technology Co Ltd
Tibet Ningsuan Technology Group Co Ltd
Dilu Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Ningsuan Technology Co Ltd, Tibet Ningbo Information Technology Co Ltd, Tibet Ningsuan Technology Group Co Ltd, and Dilu Technology Co Ltd
Priority to CN201910735729.6A
Publication of CN110502472A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10 - File systems; File servers
    • G06F 16/11 - File system administration, e.g. details of archiving or snapshots
    • G06F 16/13 - File access structures, e.g. distributed indices
    • G06F 16/18 - File system types
    • G06F 16/182 - Distributed file systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a cloud storage optimization method and system for massive small files. The method comprises the following steps: deploying a Haystack engine to a cloud platform; packaging each small file already written to the cloud disk with the Haystack engine and creating a corresponding needle model for each; building the set of needle models into a data file module, with the small files appended to the data file module in the order in which they were written; generating an index file with the Haystack engine and writing it to the cloud disk; and performing file retrieval according to the index file. Beneficial effects of the present invention: resource consumption in massive-small-file storage scenarios is reduced and invalid input/output is avoided, because scattered small files are spliced into large files that maintain only a small amount of metadata, and that metadata can be cached in memory, eliminating a large number of invalid input/output operations.

Description

Cloud storage optimization method and system for massive small files
Technical field
The present invention relates to the technical field of cloud computing platforms, and more particularly to a cloud storage optimization method and a cloud storage optimization system for massive small files.
Background art
In recent years, under the HDFS file system, each file has corresponding metadata such as an inode created for it. In massive-file scenarios, however, traditional HDFS cannot carry so much metadata I/O or such an enormous amount of metadata-search computation. The only way out is to reduce the amount of metadata, which in turn means reducing the number of file entities, so existing file systems all adopt the same workaround of creating files inside files; they cannot store massive small files efficiently.
Metadata is scattered across the individual files, and any metadata returned beyond the few fields actually needed is useless to the user, yet it must be read into memory on every picture request; in massive-picture scenarios this has a severe impact on performance.
Summary of the invention
The purpose of this section is to summarize some aspects of the embodiments of the present invention and to briefly introduce some preferred embodiments. Some simplifications or omissions may be made in this section, in the abstract and in the title of the invention to avoid obscuring their purpose; such simplifications or omissions cannot be used to limit the scope of the invention.
In view of the above-mentioned existing problems, the present invention is proposed.
Therefore, the technical problem solved by the present invention is: to provide a cloud storage optimization method for massive small files, addressing the inability of existing file systems to store massive small files efficiently.
In order to solve the above technical problem, the invention provides the following technical scheme: a cloud storage optimization method for massive small files, comprising the following steps: deploying a Haystack engine to a cloud platform; packaging each small file already written to the cloud disk with the Haystack engine and creating a corresponding needle model; building the set of needle models into a data file module, the small files being appended to the data file module in the order in which they were written; generating an index file with the Haystack engine and writing it to the cloud disk; and performing file retrieval according to the index file.
As a preferred embodiment of the cloud storage optimization method for massive small files according to the present invention: deploying the Haystack engine to the cloud platform comprises the following steps: the cloud platform provides a database operation interface and the Haystack engine is installed; the search engine is configured by adding the target configuration to the settings module; an index class is created and the model mapping relationship is added; a template is added, and a search field is created in the template.
As a preferred embodiment of the cloud storage optimization method for massive small files according to the present invention: based on the Haystack engine, users and programs read and write objects and access storage resources through a web services protocol, including creating the needle models and generating the index file, wherein each needle model contains the key, size and data information of a small file.
As a preferred embodiment of the cloud storage optimization method for massive small files according to the present invention: the index file saves the key, offset and size information of each needle model but stores only the first four bytes of the key, and the needle models in the data file module are stored in the lexicographic order of their keys.
As a preferred embodiment of the cloud storage optimization method for massive small files according to the present invention: a corresponding offset is assigned during creation of each needle model, and if needle models with the same key are encountered while building or updating the in-memory map, the one with the lower offset is overwritten by the one with the higher offset.
As a preferred embodiment of the cloud storage optimization method for massive small files according to the present invention: the method further comprises a step of searching according to the index file to read a small file: according to the search request, the first 4 bytes of the file key are located in the in-memory ordering; the offset and size values are obtained and the key value of the needle model is read from the store; it is judged whether the model key value is equal to the key value of the file, and if so, data is returned to the user according to the size in the needle model; otherwise it is judged whether the model key value is equal to the first 4 bytes of the file key; if so, the search continues by computing the position of the next needle model and reading its key value, otherwise no data is returned.
As a preferred embodiment of the cloud storage optimization method for massive small files according to the present invention: a definition of a small file is included, in which a threshold relative to the framework block size is set according to the different needs of the user, all files smaller than that fraction of the framework block size are defined as small files, and the threshold is 75%.
As a preferred embodiment of the cloud storage optimization method for massive small files according to the present invention: performing file retrieval according to the index file comprises: when the Haystack engine starts, the index is loaded into the memory of the cloud platform server, and the offset and size within the data file are located by searching the index in memory, including finding the start position and size of a small file within the data file module.
As a preferred embodiment of the cloud storage optimization method for massive small files according to the present invention: the small-file objects of the Haystack engine are image files, and the method includes image read and write steps: the upload layer receives the image uploaded by the user, measures the original image size and saves it into the storage layer; the image service layer receives HTTP image requests and provides the user with the images stored in the storage layer; the user request is first dispatched to the nearest cloud platform node, and if the cache hits, the picture content is returned to the user directly, otherwise the back-end storage system is requested and the image content is returned to the user through the cache.
Another technical problem solved by the present invention is: to provide a cloud storage optimization system for massive small files, addressing the inability of existing file systems to store massive small files efficiently.
In order to solve the above technical problem, the invention provides the following technical scheme: a cloud storage optimization system for massive small files, including a Haystack engine, needle models, a data file module and an index file. The Haystack engine is an open-source search framework that can be deployed to a cloud platform and used directly; with object-based storage devices, it can realize object read/write and access to storage resources through a web services protocol. The needle model is a model created on the basis of the Haystack engine and is used to save the data information of a small file. The data file module is the set of needle models stored in order and kept in the cloud. The index file is an index table generated on the basis of the Haystack engine.
Beneficial effects of the present invention: resource consumption in massive-small-file storage scenarios is reduced and invalid input/output is avoided, because scattered small files are spliced into large files that maintain only a small amount of metadata, and that metadata can be cached in memory, eliminating a large number of invalid input/output operations.
Detailed description of the invention
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention, and a person of ordinary skill in the art can obtain other drawings from them without any creative labor. In the drawings:
Fig. 1 is an overall flow diagram of the cloud storage optimization method for massive small files described in the first embodiment of the invention;
Fig. 2 is an overall flow diagram of the index search described in the first embodiment of the invention;
Fig. 3 is a structural diagram of the data file module described in the first embodiment of the invention;
Fig. 4 is a structural diagram of the file information held by the data file module described in the first embodiment of the invention;
Fig. 5 is a structural diagram of the index file described in the first embodiment of the invention;
Fig. 6 is a flow diagram of the information search matching described in the first embodiment of the invention;
Fig. 7 is a structural diagram of the architecture and the request processing flow described in the first embodiment of the invention;
Fig. 8 is an overall schematic diagram of the cloud storage optimization system for massive small files described in the second embodiment of the invention;
Fig. 9 is a schematic diagram of the test results obtained when the files described in the second embodiment of the invention are stored with the traditional scheme and with the scheme of this application, respectively.
Specific embodiment
In order to make the above objects, features and advantages of the present invention clearer and easier to understand, specific embodiments of the invention are described in detail below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the invention and not all of them. Based on these embodiments, all other embodiments obtained by a person of ordinary skill in the art without creative labor shall fall within the scope of protection of the invention.
In the following description, numerous specific details are set forth to facilitate a full understanding of the invention, but the invention can also be implemented in ways other than those described here, and those skilled in the art can make similar generalizations without departing from its spirit; the invention is therefore not limited by the specific embodiments disclosed below.
Furthermore, "one embodiment" or "an embodiment" herein refers to a particular feature, structure or characteristic that may be included in at least one implementation of the invention. "In one embodiment" appearing in different places in this specification does not always refer to the same embodiment, nor to a separate or alternative embodiment that is mutually exclusive with other embodiments.
The invention is described in detail with reference to the schematic diagrams. When describing the embodiments, for convenience of explanation, the sectional views showing the device structure may be partially enlarged out of proportion, and the diagrams are only examples that should not limit the scope of protection of the invention. In addition, the three dimensions of length, width and depth should be included in actual fabrication.
In the description of the invention it should also be noted that orientation or positional relationships indicated by terms such as "upper, lower, inner and outer" are based on the orientations or positional relationships shown in the drawings, are only for convenience and simplification of description, and do not indicate or imply that the device or element referred to must have a particular orientation or be constructed and operated in a particular orientation; they should therefore not be understood as limiting the invention. In addition, the terms "first, second or third" are used for descriptive purposes only and cannot be understood as indicating or implying relative importance.
Unless otherwise clearly specified and limited in the invention, the terms "mounted, connected, connection" should be understood in a broad sense; for example, a connection may be fixed, detachable or integral, mechanical, electrical or direct, indirect through an intermediate medium, or internal between two elements. For a person of ordinary skill in the art, the specific meaning of the above terms in the invention can be understood according to the specific situation.
As used in this application, the terms "component", "module" and "system" are intended to refer to computer-related entities, which may be hardware, firmware, a combination of hardware and software, software, or software in execution. For example, a component may be, but is not limited to, a process running on a processor, a processor, an object, an executable file, a thread of execution, a program and/or a computer. As an illustration, both an application running on a computing device and the computing device itself can be components. One or more components can reside within a process and/or thread of execution, and a component can be localized on one computer and/or distributed between two or more computers. In addition, these components can execute from various computer-readable media having various data structures stored on them. The components can communicate by way of local and/or remote processes, for example according to a signal having one or more data packets (such as data from one component interacting with another component in a local or distributed system, and/or interacting with other systems by way of the signal across a network such as the Internet).
Embodiment 1
Referring to the illustration of Fig. 1, this embodiment proposes an overall flow diagram of a cloud storage optimization method for massive small files. The method is realized on the basis of the Hadoop distributed file system deployed on a cloud platform. The Hadoop distributed file system (HDFS) is designed as a distributed file system suitable for running on commodity hardware. It has much in common with existing distributed file systems, but the differences from other distributed file systems are also obvious. HDFS is a fault-tolerant system suitable for deployment on cheap machines; it provides high-throughput data access and is well suited to applications on large data sets. HDFS relaxes part of the POSIX constraints in order to achieve streaming reads of file system data. HDFS was originally developed as the infrastructure of the Apache Nutch search engine project. It adopts a master/slave structural model: an HDFS cluster consists of a NameNode and DataNodes, where the NameNode acts as the primary server that manages the namespace of the file system and client access operations on files, while the DataNodes manage the stored data in the cluster. A user or a program can create directories and store files in many directories; the namespace hierarchy of the file system is similar to that of other file systems, and files can be created, moved from one directory to another, or renamed.
In earlier implementations, metadata is scattered across the individual files and is useless for the request, yet it must be read into memory on every picture request, which has a severe impact on performance in massive-picture scenarios. Under the Hadoop distributed file system, the Haystack engine 100 is deployed on the cloud platform and used directly, and users or programs can read and write objects and access storage resources through a web services protocol.
Metadata, also known as intermediary or relay data, is data that describes data; it mainly describes the attributes of data and supports functions such as indicating storage location, historical data, resource lookup and file records. For example, a picture is a piece of data, and the picture may also carry some standard data such as its size and date; these belong to the metadata.
Under the HDFS file system, this traditional design leads to excessive disk operations because of metadata lookups. This embodiment therefore carefully reduces the metadata of every photo so that the Haystack storage machine can perform all metadata lookups in main memory. This choice saves the disk operations needed to read the real data and thereby improves overall throughput, reducing resource consumption in massive-small-file storage scenarios. More specifically, the method includes the following steps.
S1: the Haystack engine 100 is deployed to the cloud platform. Based on the Haystack engine 100, users and programs read and write objects and access storage resources through a web services protocol, which includes creating the needle models 200 and generating the index file 400, where each needle model 200 contains the key, size and data information of a small file.
S2: each small file already written to the cloud disk is packaged with the Haystack engine 100 and a corresponding needle model 200 is created for each. In this step a corresponding offset is assigned during creation of the needle model 200; if needle models 200 with the same key are encountered while building or updating the in-memory map, the one with the lower offset is overwritten by the one with the higher offset.
A needle is given a corresponding offset when it is created. Haystack does not allow a needle to be overwritten, so a modification of a picture can only add a new needle with the same key. Haystack uses a very simple means to distinguish duplicate needles: it compares their offsets (the needle of the newest version naturally has the highest offset), and if identical needles are encountered while building or updating the in-memory map, the one with the higher offset overwrites the one with the lower offset.
For example, when a file a.png has a needle created for it, Haystack records offset 1; when another file that is also a.png has a needle created for it, the offset is 2. If this file were looked up by file name alone, two results would appear; filtering once more by offset, with the higher value overwriting the lower, leaves exactly one result.
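The in-memory map described above can be kept as an ordinary dictionary keyed by the needle key. The following is a minimal sketch of that idea; names such as NeedleMeta and build_needle_map are illustrative and do not come from the patent. A later needle with a higher offset simply overwrites an earlier needle with the same key while the map is built.

```python
from dataclasses import dataclass

@dataclass
class NeedleMeta:
    key: bytes      # file identifier
    offset: int     # position of the needle in the big data file
    size: int       # payload size in bytes

def build_needle_map(needles):
    """Build key -> NeedleMeta, keeping only the needle with the highest offset per key."""
    index = {}
    for n in needles:
        current = index.get(n.key)
        # a duplicate key means the file was modified; the newer copy has the larger offset
        if current is None or n.offset > current.offset:
            index[n.key] = n
    return index

# usage: two versions of "a.png"; the map keeps the one written later, at offset 2
needles = [NeedleMeta(b"a.png", 1, 1024), NeedleMeta(b"a.png", 2, 2048)]
assert build_needle_map(needles)[b"a.png"].offset == 2
```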
S3: the set of needle models 200 is built into a data file module 300, and the small files are appended to the data file module 300 in the order in which they were written.
S4: the index file 400 is generated with the Haystack engine 100 and written to the cloud disk. The index file 400 saves the key, offset and size information of each needle model 200, but keeps only the first four bytes of each key, so the resource consumption is only a quarter of the 16-byte key of the traditional approach, and the needle models 200 in the data file module 300 are stored in the lexicographic order of their keys.
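To make this layout concrete, the following sketch packs and unpacks one index record with Python's struct module. The exact field widths (a 4-byte key prefix, an 8-byte offset and a 4-byte size) are an assumption for illustration; the patent only states that the key is truncated to its first four bytes and stored together with the offset and size.

```python
import struct

# assumed on-disk layout of one index record: 4-byte key prefix, 8-byte offset, 4-byte size
INDEX_RECORD = struct.Struct(">4sQI")

def pack_index_record(key: bytes, offset: int, size: int) -> bytes:
    """Serialize one index entry, keeping only the first 4 bytes of the key."""
    return INDEX_RECORD.pack(key[:4], offset, size)

def unpack_index_record(buf: bytes):
    """Deserialize one index entry back into (key_prefix, offset, size)."""
    return INDEX_RECORD.unpack(buf)

record = pack_index_record(b"\xab\xcd\xef\x2a\x11", offset=4096, size=2048)
print(unpack_index_record(record))   # (b'\xab\xcd\xef*', 4096, 2048)
```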
S5: file retrieval is performed according to the index file 400. When the Haystack engine 100 starts, it loads the index into the memory of the cloud platform server, and the offset and size within the data file are located by searching the index in memory, including finding the start position and size of a small file within the data file module 300.
Further, in this embodiment, deploying the Haystack engine 100 to the cloud platform comprises the following steps.
The cloud platform provides a database operation interface and the Haystack engine 100 is installed. It is installed as an application from PyPI and supports four full-text search back ends, whoosh, solr, Xapian and Elasticsearch; it belongs to a class of full-text search frameworks. A concrete implementation of this step can, for example, refer to the configuration sketches given under the following steps.
The search engine is configured by adding the target configuration to the settings module, for example by adding the following kind of configuration in the settings, which can refer to the sketch below:
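A minimal sketch of such a settings entry, assuming the engine being deployed is the django-haystack package from PyPI with the whoosh back end; the app name and index path below are illustrative and not taken from the patent:

```python
# settings.py -- illustrative configuration, assuming django-haystack with the whoosh back end
import os

BASE_DIR = os.path.dirname(os.path.abspath(__file__))

INSTALLED_APPS = [
    # ... existing Django apps ...
    'haystack',   # installed from PyPI, e.g. pip install django-haystack whoosh
    'news',       # hypothetical app whose model will be indexed
]

HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'haystack.backends.whoosh_backend.WhooshEngine',
        'PATH': os.path.join(BASE_DIR, 'whoosh_index'),
    },
}
```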
An index class is created and the model mapping is added: a search_index.py file is created, and the index class is created in it; for example, to create an index for a News model, one can refer to the sketch below:
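A sketch of such an index class, again assuming django-haystack; the News model and its fields are hypothetical:

```python
# news/search_indexes.py (the patent text calls the file search_index.py;
# django-haystack discovers search_indexes.py by default)
from haystack import indexes
from news.models import News

class NewsIndex(indexes.SearchIndex, indexes.Indexable):
    # the document field is rendered from the template referenced in the later step
    text = indexes.CharField(document=True, use_template=True)

    def get_model(self):
        return News

    def index_queryset(self, using=None):
        # which objects are picked up when the index is (re)built
        return self.get_model().objects.all()
```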
A url mapping is added in url.py; a concrete implementation can refer to the sketch below:
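A sketch of the url mapping, assuming django-haystack's bundled search view:

```python
# urls.py -- illustrative url mapping for the haystack search view
from django.urls import include, path

urlpatterns = [
    # ... existing url patterns ...
    path('search/', include('haystack.urls')),
]
```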
A template is added and a search field is created in it, for example under the template directory with the following structure: templates/search/indexes/news (name of the app)/news_text.txt. The fields that need to be indexed are added in the news_text.txt file, which can refer to the sketch below:
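The template file itself only lists the model fields to be rendered into the document field; a sketch for the hypothetical News model could look like this (Django template syntax, field names assumed):

```
{# templates/search/indexes/news/news_text.txt -- fields rendered into the "text" document field #}
{{ object.title }}
{{ object.content }}
```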
Referring to the illustration of Fig. 2, this embodiment further includes a step of searching according to the index file 400, as follows (a minimal sketch follows the list):
a small file is to be read;
according to the search request, the first 4 bytes of the file key are located in the in-memory ordering;
the offset and size values are obtained, and the key value of the needle model 200 is read from the store;
it is judged whether the model key value is equal to the key value of the file; if so, data is returned to the user according to the size in the needle model 200; otherwise it is judged whether the model key value is equal to the first 4 bytes of the file key;
if so, the search continues by computing the position of the next needle model 200 and reading its key value; otherwise no data is returned.
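A minimal sketch of this lookup, assuming the in-memory index is a list of (key_prefix, offset, size) records sorted by the 4-byte key prefix, and that the helpers read_needle_key and read_needle_data (illustrative names, not from the patent) read the full key and the payload stored at a given offset in the data file:

```python
import bisect

def find_file(index, full_key: bytes, read_needle_key, read_needle_data):
    """index: list of (key_prefix, offset, size) tuples sorted by key_prefix."""
    prefix = full_key[:4]
    prefixes = [entry[0] for entry in index]
    i = bisect.bisect_left(prefixes, prefix)         # first entry whose prefix is >= ours
    while i < len(index) and index[i][0] == prefix:  # scan entries sharing the 4-byte prefix
        _, offset, size = index[i]
        stored_key = read_needle_key(offset)         # full key kept in the needle header
        if stored_key == full_key:
            return read_needle_data(offset, size)    # hit: return `size` bytes of payload
        i += 1                                       # same prefix, different file: keep scanning
    return None                                      # miss: no data is returned
```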
Referring to the illustrations of Figs. 3 and 4, which show the structure of the data file module 300 in this embodiment, Fig. 5 shows the composition of the index file 400 and Fig. 6 shows the composition of search matching. Here size marks the physical size of the file, key is the file identifier, date is the file information and id is the file id.
For example: the key of the file the user wants to read is ab cd ef 2a. The index matches the prefix ab cd ef, and the offset at this point points to the needle ab cd ef 1a, so the first match misses. Using the size stored in the needle header, the position of ab cd ef 2a can be located, the correct needle is matched, and its data is read back to the user.
In order to reduce invalid IO (for example the metadata of directory entries or the permission information of files), scattered small files are spliced into large files that maintain only a small amount of metadata (id, offset, size, cookie and the like), which effectively reduces IO, and this metadata can be cached in memory, eliminating a large number of invalid IOs. By reorganizing the file structure and caching, the metadata of an average picture needs only 10 B of memory, so caching the metadata of all pictures becomes feasible and a read operation needs only 1 IO.
Scenario: memory and time consumed to store 1,000,000 files, HDFS versus the present embodiment:
                    HDFS         Present embodiment
Memory consumed     >300 MB      <100 MB
Time consumed       >2 hours     <0.5 hour
It can obviously be seen from the above that, whether in time consumed or in amount of storage, the present embodiment has a considerable advantage over HDFS.
Referring additionally to the illustration of Fig. 9, in this embodiment files at a scale of 1,000,000,000 were stored with the traditional scheme and with the method of this scheme as a storage test. Traditional HDFS cloud storage includes the open-source scheme provided by Apache, and this embodiment uses that Apache open-source scheme as the comparison against the storage test of this scheme. The study found, as can be seen from Fig. 9, that with 1,000,000,000 files the traditional scheme consumes more than 26 GB of memory while this scheme consumes only a little more than 9 GB, a reduction of about two thirds in memory use. This embodiment also tested the file upload speed, with the cloud storage of the traditional scheme and with the cloud storage optimization method of this embodiment respectively; the test results can be referred to in the table below, and it is not difficult to see that this embodiment has a clear advantage over the conventional method both in the space consumed by storage and in the speed of storage.
It should be noted that this optimization method is most effective for processing pictures, so the small files in this embodiment refer to images; understandably, this also includes the process of requesting, reading and writing images. Specifically, a definition of a small file is included first: a threshold relative to the framework block size is set according to the different needs of the user, and all files smaller than that fraction of the framework block size are defined as small files; the threshold is 75%. For example, the hadoop block size is usually set to 128 MB or 256 MB and tends to grow. The concrete decision rule for small files can differ under different requirements; here it is assumed to be 75% of the hadoop block size, i.e. every file whose size is less than 75% of the hadoop block size is a small file, as sketched below.
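A minimal sketch of this decision rule, with the block size and the 75% threshold as configurable assumptions:

```python
DEFAULT_BLOCK_SIZE = 128 * 1024 * 1024   # hadoop block size, e.g. 128 MB
SMALL_FILE_THRESHOLD = 0.75              # fraction of the block size used in this embodiment

def is_small_file(file_size: int,
                  block_size: int = DEFAULT_BLOCK_SIZE,
                  threshold: float = SMALL_FILE_THRESHOLD) -> bool:
    """A file smaller than threshold * block_size counts as a small file."""
    return file_size < threshold * block_size

# usage: a 10 MB image is a small file for a 128 MB block
print(is_small_file(10 * 1024 * 1024))   # True
```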
The small-file objects handled by the Haystack engine 100 are image files, and the method includes image read and write steps: the upload layer receives the image uploaded by the user, measures the original image size and saves it into the storage layer; the image service layer receives HTTP image requests and provides the user with the images stored in the storage layer; a user request is first dispatched to the nearest cloud platform node, and if the cache hits, the picture content is returned to the user directly, otherwise the back-end storage system is requested and the image content is returned to the user through the cache.
Further, referring to the illustration of Fig. 7, the Haystack image infrastructure and processing flow are as follows:
Haystack is an HTTP-based object store; it contains pointers that map to storage objects. In Haystack, images are stored with these pointers, and hundreds of thousands of images are gathered into a single Haystack store file, eliminating the metadata load. The metadata overhead therefore becomes very small, and the position of every pointer can be kept in the store file and in an in-memory index, so that retrieval of image data can be completed with a small number of I/O operations and unnecessary metadata overhead is eliminated.
In this embodiment the Haystack architecture mainly comprises three parts: Haystack Directory, Haystack Store and Haystack Cache. Haystack Store is the physical storage node; it organizes the storage space in the form of physical volumes, with each physical volume corresponding to one physical file, so the physical file metadata on each storage node is very small. The physical volumes managed on multiple object storage nodes form one logical volume, which is used for backup. Haystack Directory stores the correspondence between logical volumes and physical volumes. Haystack Cache is mainly used to reduce reliance on the cloud provider and provides a cache service for recently added pictures.
The write request (picture upload) flow of Haystack is as follows: the Web Server first requests Haystack Directory to obtain an Image ID and a writable logical volume, and the data is then written to each corresponding physical volume. The functions of Haystack Directory are as follows (a sketch is given after this list):
it provides the mapping from logical volumes to physical volumes and allocates Image IDs for write requests;
it provides load balancing, selecting a logical volume for write operations and a physical volume for read operations;
it shields the cloud server, and can choose to route certain picture requests directly to Haystack Cache;
it marks certain logical volumes as read-only.
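A minimal sketch of such a directory, assuming a simple in-memory mapping and random selection for load balancing; the class and method names are illustrative and not taken from the patent:

```python
import itertools
import random

class HaystackDirectory:
    """Illustrative logical-to-physical volume directory with simple load balancing."""

    def __init__(self, volume_map):
        # volume_map: logical volume id -> list of physical volume ids (its replicas)
        self.volume_map = dict(volume_map)
        self.read_only = set()
        self._image_ids = itertools.count(1)

    def assign_write(self):
        """Pick a writable logical volume and allocate an Image ID for a write request."""
        writable = [v for v in self.volume_map if v not in self.read_only]
        logical = random.choice(writable)
        return next(self._image_ids), logical, self.volume_map[logical]

    def pick_read_replica(self, logical_volume):
        """Load-balance a read over the physical replicas of a logical volume."""
        return random.choice(self.volume_map[logical_volume])

    def mark_read_only(self, logical_volume):
        self.read_only.add(logical_volume)

# usage: one logical volume backed by two physical volumes
directory = HaystackDirectory({"lv1": ["pv1", "pv2"]})
image_id, logical, replicas = directory.assign_write()
```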
In the Haystack store file, each pointer has a corresponding index record, and the order of the index records must match the order of the pointers in the corresponding Haystack store file. The index file provides the minimum metadata needed to find a particular pointer in the Haystack store file. For fast lookup, the index records are loaded into and organized in a data structure; this is the responsibility of the Haystack application. The main purpose of the index is to load pointer metadata into memory quickly, without traversing the huge Haystack store file, because the size of the index is typically less than 1% of the store file.
Embodiment 2
Referring to the illustration of Fig. 8, this embodiment proposes a cloud storage system for massive small files; the method of the above embodiment relies on this embodiment for its realization. The cloud storage system is based on the HDFS file system. In a modern corporate environment a single machine often cannot store massive data, so storage across machines is needed, and a file system managed uniformly across a cluster is called a distributed file system. Once a network is introduced into the system, all the complexity of network programming is inevitably introduced as well; one of the challenges, for example, is how to guarantee that data is not lost when a node becomes unavailable. This system is deployed on the cloud platform for use.
Although the traditional Network File System is also called a distributed file system, it has some limitations. Because files are stored on a single machine, it cannot provide reliability guarantees, and when many clients access it at the same time it easily puts pressure on the server and causes a performance bottleneck. In addition, if a file is to be operated on, it first needs to be synchronized locally, and before these modifications are synchronized back to the server side they are invisible to other clients. In a sense it is not a typical distributed system.
HDFS provides various modes of interaction, for example through the Java API, HTTP, or the shell command line. Command-line interaction is mainly performed through hadoop fs; Hadoop's file system concept is abstract, and HDFS is one implementation of it. Besides the command line there are many other ways to interact with HDFS and operate on the file system; for example, a Java application can operate through org.apache.hadoop.fs.FileSystem, and operations in other forms are also encapsulated on the basis of FileSystem. This embodiment uses the HTTP mode of interaction.
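A minimal sketch of such HTTP interaction, assuming the standard WebHDFS REST interface is enabled on the NameNode; the host, port and paths below are illustrative:

```python
# illustrative WebHDFS calls; assumes WebHDFS is enabled and the requests package is installed
import requests

NAMENODE = "http://namenode.example.com:9870"   # hypothetical NameNode address

def list_directory(path: str, user: str = "hadoop"):
    """List a directory over HTTP (WebHDFS LISTSTATUS)."""
    resp = requests.get(f"{NAMENODE}/webhdfs/v1{path}",
                        params={"op": "LISTSTATUS", "user.name": user})
    resp.raise_for_status()
    return resp.json()["FileStatuses"]["FileStatus"]

def read_file(path: str, user: str = "hadoop") -> bytes:
    """Read a file over HTTP (WebHDFS OPEN); the redirect to a DataNode is followed automatically."""
    resp = requests.get(f"{NAMENODE}/webhdfs/v1{path}",
                        params={"op": "OPEN", "user.name": user})
    resp.raise_for_status()
    return resp.content
```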
Further, the system of this embodiment includes the Haystack engine 100, the needle models 200, the data file module 300 and the index file 400. More specifically:
The Haystack engine 100 is an open-source search framework that can be deployed to the cloud platform and used directly; with object-based storage devices, it can realize object read/write and access to storage resources through a web services protocol. The needle model 200 is a model created on the basis of the Haystack engine 100 and is used to save the data information of a small file. The data file module 300 is the set of needle models 200 stored in order and kept in the cloud. The index file 400 is an index table generated on the basis of the Haystack engine 100.
The Haystack engine 100 can be deployed on the cloud platform through the interface provided and used directly; Haystack is Facebook's picture storage solution, and the concrete deployment can refer to the method of the above embodiment. The needle models 200 are created with the methods provided by the Haystack engine 100 and belong to the Haystack file system, the Haystack engine 100 generates the index file 400, and the above modules run on the cloud platform.
It should be noted that the above embodiments are only intended to illustrate the technical solutions of the invention and not to limit them. Although the invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that the technical solutions of the invention can be modified or equivalently replaced without departing from their spirit and scope, all of which should be covered by the scope of the claims of the invention.

Claims (10)

1. A cloud storage optimization method for massive small files, characterized by comprising the following steps:
deploying a Haystack engine (100) to a cloud platform;
packaging each small file already written to the cloud disk using the Haystack engine (100) and creating a corresponding needle model (200) for each;
building the set of needle models (200) into a data file module (300), the small files being appended to the data file module (300) in the order in which they were written;
generating an index file (400) with the Haystack engine (100) and writing it to the cloud disk;
performing file retrieval according to the index file (400).
2. The cloud storage optimization method for massive small files according to claim 1, characterized in that deploying the Haystack engine (100) to the cloud platform comprises the following steps:
the cloud platform provides a database operation interface, and the Haystack engine (100) is installed;
the search engine is configured by adding the target configuration to the settings module;
an index class is created and the model mapping relationship is added;
a template is added, and a search field is created in the template.
3. The cloud storage optimization method for massive small files according to claim 1 or 2, characterized in that: based on the Haystack engine (100), users and programs read and write objects and access storage resources through a web services protocol, including creating the needle models (200) and generating the index file (400), wherein each needle model (200) contains the key, size and data information of a small file.
4. The cloud storage optimization method for massive small files according to claim 3, characterized in that: the index file (400) saves the key, offset and size information of each needle model (200) but stores only the first four bytes of the key, and the needle models (200) in the data file module (300) are stored in the lexicographic order of their keys.
5. The cloud storage optimization method for massive small files according to claim 4, characterized in that: a corresponding offset is assigned during creation of each needle model (200), and if needle models (200) with the same key are encountered while building or updating the in-memory map, the one with the lower offset is overwritten by the one with the higher offset.
6. The cloud storage optimization method for massive small files according to claim 4 or 5, characterized by further comprising a step of searching according to the index file (400):
a small file is to be read;
according to the search request, the first 4 bytes of the file key are located in the in-memory ordering;
the offset and size values are obtained, and the key value of the needle model (200) is read from the store;
it is judged whether the model key value is equal to the key value of the file; if so, data is returned to the user according to the size in the needle model (200); otherwise it is judged whether the model key value is equal to the first 4 bytes of the file key;
if so, the search continues by computing the position of the next needle model (200) and reading its key value; otherwise no data is returned.
7. The cloud storage optimization method for massive small files according to claim 6, characterized by including a definition of a small file: a threshold relative to the framework block size is set according to the different needs of the user, all files smaller than that fraction of the framework block size are defined as small files, and the threshold is 75%.
8. The cloud storage optimization method for massive small files according to claim 7, characterized in that performing file retrieval according to the index file (400) comprises:
when the Haystack engine (100) starts, the index is loaded into the memory of the cloud platform server, and the offset and size within the data file are located by searching the index in memory, including finding the start position and size of a small file within the data file module (300).
9. The cloud storage optimization method for massive small files according to claim 7 or 8, characterized in that the small-file objects of the Haystack engine (100) are image files and the method includes image read and write steps:
the upload layer receives the image uploaded by the user, measures the original image size and saves it into the storage layer;
the image service layer receives HTTP image requests and provides the user with the images stored in the storage layer;
the user request is first dispatched to the nearest cloud platform node, and if the cache hits, the picture content is returned to the user directly; otherwise the back-end storage system is requested and the image content is returned to the user through the cache.
10. A cloud storage system for massive small files, characterized by comprising a Haystack engine (100), needle models (200), a data file module (300) and an index file (400);
the Haystack engine (100) is an open-source search framework that can be deployed to a cloud platform and used directly, and with object-based storage devices it can realize object read/write and access to storage resources through a web services protocol; the needle model (200) is a model created on the basis of the Haystack engine (100) and is used to save the data information of a small file; the data file module (300) is the set of needle models (200) stored in order and kept in the cloud; the index file (400) is an index table generated on the basis of the Haystack engine (100).
CN201910735729.6A 2019-08-09 2019-08-09 Cloud storage optimization method and system for massive small files Pending CN110502472A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910735729.6A CN110502472A (en) Cloud storage optimization method and system for massive small files

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910735729.6A CN110502472A (en) Cloud storage optimization method and system for massive small files

Publications (1)

Publication Number Publication Date
CN110502472A true CN110502472A (en) 2019-11-26

Family

ID=68586366

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910735729.6A Pending CN110502472A (en) Cloud storage optimization method and system for massive small files

Country Status (1)

Country Link
CN (1) CN110502472A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104778270A (en) * 2015-04-24 2015-07-15 成都汇智远景科技有限公司 Storage method for multiple files
CN106874348A (en) * 2016-12-26 2017-06-20 贵州白山云科技有限公司 File is stored and the method for indexing means, device and reading file
CN109063192A (en) * 2018-08-29 2018-12-21 广州洪荒智能科技有限公司 A kind of high-performance mass file storage system working method

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112380383A (en) * 2020-11-11 2021-02-19 北京中电兴发科技有限公司 Efficient fault-tolerant indexing method for real-time video stream data
CN112380383B (en) * 2020-11-11 2021-06-18 北京中电兴发科技有限公司 Fault-tolerant indexing method for real-time video stream data
CN112765113A (en) * 2021-01-31 2021-05-07 云知声智能科技股份有限公司 Index compression method and device, computer readable storage medium and electronic equipment
CN112765113B (en) * 2021-01-31 2024-04-09 云知声智能科技股份有限公司 Index compression method, index compression device, computer readable storage medium and electronic equipment
CN114356230A (en) * 2021-12-22 2022-04-15 天津南大通用数据技术股份有限公司 Method and system for improving reading performance of column storage engine
CN114356230B (en) * 2021-12-22 2024-04-23 天津南大通用数据技术股份有限公司 Method and system for improving read performance of column storage engine

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20210207

Address after: 11 / F, Liuwu building, Liuwu New District, Lhasa City, Tibet Autonomous Region, 850000

Applicant after: Tibet ningsuan Technology Group Co.,Ltd.

Address before: 11 / F, Liuwu building, Liuwu New District, Lhasa City, Tibet Autonomous Region, 850000

Applicant before: Tibet ningsuan Technology Group Co.,Ltd.

Applicant before: DILU TECHNOLOGY Co.,Ltd.

Applicant before: TIBET NINGSUAN INFORMATION TECHNOLOGY Co.,Ltd.

Applicant before: Beijing ningsuan Technology Co.,Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20191126