CN104679898A - Big data access method - Google Patents

Big data access method

Info

Publication number
CN104679898A
CN104679898A (application CN201510118185.0A)
Authority
CN
China
Prior art keywords
file
resource
index
user
entries
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510118185.0A
Other languages
Chinese (zh)
Inventor
Liu Ying (刘颖)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Hui Zhi Distant View Science And Technology Ltd
Original Assignee
Chengdu Hui Zhi Distant View Science And Technology Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Hui Zhi Distant View Science And Technology Ltd filed Critical Chengdu Hui Zhi Distant View Science And Technology Ltd
Priority to CN201510118185.0A priority Critical patent/CN104679898A/en
Publication of CN104679898A publication Critical patent/CN104679898A/en
Pending legal-status Critical Current


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a big data access method for accessing file resources in a cloud storage platform. The method comprises the following steps: merging files smaller than a preset size in a distributed file system to form new storage files; establishing a primary index and a secondary index for the merged files; and using a prefetch mechanism to cache the indexes before the user requests them. The method maintains the response speed of the cloud storage platform and the overall performance of the distributed file system when large numbers of small files are read and written.

Description

A big data access method
Technical field
The present invention relates to data storage, and in particular to a file access method for big data.
Background technology
With the rapid development of intelligent healthcare and the emergence of massive medical data, correspondingly large databases are needed as carriers to preserve these data, yet storing big data remains a major problem. The volume of document retrieval in the medical field grows exponentially with network resources; in particular, both the update rate and the accumulated volume of small files keep rising, which has become an urgent problem for medical cloud storage. Distributed file systems are widely used for large-scale data storage and analysis, and many institutions have adopted them to handle rapidly growing data. However, existing distributed file systems are designed mainly for reading and writing large files; storing large numbers of small files degrades overall file system performance, so such systems cannot be applied well to small-file-dominated workloads such as medical retrieval. No effective solution to the above problems has yet been proposed in the related art.
Summary of the invention
To solve the above problems in the prior art, the present invention proposes a big data access method, comprising:
merging files smaller than a preset size in the distributed file system to form new storage files;
establishing a primary index and a secondary index for the merged files; and
using a prefetch mechanism to cache the indexes before the user requests them.
Preferably, merging the files smaller than the preset size further comprises:
dividing each file into blocks, persisting the namespace of the distributed file system in an image file, and having the NameNode load this image file into memory at startup;
after a user submits a file resource to the cloud storage platform, first screening it with a file filter according to filter conditions; for qualifying files, merging a predetermined number of files with the same attribute to generate a new storage file, and updating the index file while writing the new storage file to the system; and
for each file read or write, first querying the namespace to locate the file's block address, file size and other information, and then retrieving the data in the DataNode space.
Preferably, in establishing the primary index and the secondary index for the merged files, the primary index is the resource collection to which a file belongs, and the secondary index is the concrete resource entry. A resource collection is a set of correlated resource entries; each resource entry belongs to exactly one resource collection, and files may be partitioned by attribute domain. When a file needs to be read, the primary index and then the secondary index are queried in turn. After filtering, the cloud storage platform merges the filtered files into block files, one block per resource collection of the file entries. During data processing, the resource entries in a new block file can then be assigned to the same MapReduce task.
Preferably, the prefetch mechanism further comprises: predicting, from the data a user is currently accessing, the data the user will access next, and loading its index into the cache so that the system responds faster when the user accesses it. Before downloading or browsing a resource entry, the user obtains an intermediate result set by retrieval or directory browsing, and then selects the needed resource entries from it for further access. In the interval between the user seeing the result set page and performing a download or browse, the indexes of the resource entries in the intermediate result set are cached in advance, so that when the user clicks download or browse, no file metadata query is executed and the file is transferred directly.
Preferably, caching the indexes further comprises: after a user issues a retrieval request, the Web service queries, according to the user's search conditions, the resource entry result set that meets the user's needs and returns it to the user, while simultaneously creating an asynchronous thread to update the cache; the cache contents are updated in the interval between returning the result set and the user deciding to click download or browse. On receiving a cache-update request, the index module is called to retrieve and load the metadata of the current result set entries into the cache. When the user issues a download or browse request, the Web service calls the distributed file system client, which finds the metadata in the cache and starts reading data and transmitting it to the client. The server maintains a thread pool with a fixed number of threads; each cache-update request is handled by one thread, and if no thread in the pool is idle the caching task waits. A cache pool is established with a FIFO algorithm and a configured pool size; key/value pairs are kept in the pool, with the file name as the key and the combination of the file's DataNode ID, start position and length as the value, and the oldest cache entries are evicted. The cache pool provides two operations, put and get: put places data into the pool and, if the existing data has reached the pool's upper limit, replaces data according to the FIFO replacement algorithm; get returns the value corresponding to a key.
Preferably, the primary index data are stored in a relational database, accessed through the relational database access interface, and held in a Java Map data structure. On the basis of writing resource collections to the database, a system-generated value field is added whenever a resource entry is added. The data in the primary index use a key/value structure held in a Java Map; this Map object is initialized from the database contents when the service starts and kept resident, and it is updated whenever a resource collection is added or deleted. The secondary index is built with the open-source project Lucene and supports small-file metadata retrieval; the index file is updated in real time whenever a user adds a resource entry, and concurrency control of file writes is applied when multiple users add resource entries under the same resource collection at the same time.
Merging the files further comprises: filtering with the filter and merging the files that meet the preset conditions; searching the primary index according to the resource collection of the resource entry; after finding the file path corresponding to the resource collection, creating a SequenceFile object, obtaining and configuring its Writer object, and preparing to write the file; opening a new thread while the file write executes to write the file position and length information corresponding to the resource entry into the resource entry secondary index; and, when the resource entry is written successfully, closing the output stream and returning success, and otherwise returning failure.
Compared with the prior art, the present invention has the following advantages:
for the reading and writing of small files used in full-text search, retrieval efficiency is improved through indexing and prefetching, maintaining the response speed of the cloud storage platform and the overall performance of the distributed file system when large numbers of small files are stored and read.
Accompanying drawing explanation
Fig. 1 is a flowchart of the big data access method according to an embodiment of the present invention.
Embodiment
A detailed description of one or more embodiments of the present invention is provided below together with the accompanying drawing that illustrates the principles of the invention. The invention is described in conjunction with such embodiments, but is not restricted to any embodiment; its scope is defined only by the claims, and the invention covers many alternatives, modifications and equivalents. Many specific details are set forth in the following description to provide a thorough understanding of the invention; these details are provided for exemplary purposes, and the invention may also be realized according to the claims without some or all of them.
One aspect of the present invention provides a big data access method for small-file reads and writes, offering a new solution for big data access and utilization. Fig. 1 is the flowchart of the big data access method according to an embodiment of the present invention.
After a user submits a resource to the cloud storage platform, the resource is first screened by a file filter; the filter conditions include file size, type and the like, and files that pass the screening are referred to as small files. A merging strategy is then applied to these small files: a number of small files are merged to generate a new storage file, generally merging small files belonging to the same attribute. The index file is updated while the new storage file is written to the system. The indexes in the cloud storage platform comprise a primary index, which is the resource collection to which a file belongs (for example, its type), and a secondary index, which is the concrete resource entry. When a file needs to be read, the primary index and the secondary index are queried in turn, narrowing the query range and ensuring a fast read response.
The core of the storage layer design of the cloud storage platform system of the present invention comprises: first merging small files to generate storage files, then establishing a two-level index for the merged files based on the storage characteristics of the database, and improving file read response speed through index prefetching. The details of the storage layer are introduced below.
1. Storage file generation strategy based on small-file merging
Files are divided into blocks, with a default block size of 64 MB. The namespace of the distributed file system is persisted in an image file, which the NameNode loads into memory at startup. Large numbers of small files cause the NameNode to run short of memory, and the oversized image file thus generated reduces lookup efficiency when files are read. For each file read or write, the namespace is queried first to locate the file's block address, file size and other information, and the data are then retrieved in the DataNode space. When the files being read are very small, most of the time in the read/write process is consumed by retrieval and lookup rather than by transferring file data, which hurts the processing efficiency of the server cluster.
The cloud storage platform merges small files to generate storage files. First, a filter is implemented to screen files by type and size, selecting the document files that can undergo full-text search; the file size threshold here is set to 10 MB, and a file larger than 10 MB is regarded as a large file and does not need merging. After filtering, the cloud storage platform merges the filtered small files into block files, one block per resource collection of the file entries. A resource collection is a set of resource entries with a certain correlation, and each resource entry belongs to exactly one resource collection. Collections are usually divided by attribute range, time and the like, and files may be divided by attribute domain. The resource entries in a new block file are highly related, so the block file can later be assigned to a single MapReduce task, avoiding time wasted on task assignment and switching when the computation per task is very small, reducing data movement in the cluster, and satisfying the Hadoop principle that moving computation is cheaper than moving data.
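The filter step above can be sketched in plain Java. The 10 MB threshold comes from the description; the set of document extensions and the method names are illustrative assumptions, since the text only says "document files":

```java
import java.util.Set;

public class SmallFileFilter {
    // Threshold from the description: files at or above 10 MB are treated
    // as large files and bypass merging.
    static final long THRESHOLD_BYTES = 10L * 1024 * 1024;

    // Assumed set of full-text-searchable document types (illustrative only).
    static final Set<String> DOC_TYPES = Set.of("txt", "doc", "pdf", "xml");

    // A file is a merge candidate when it is both small enough and a
    // document type eligible for full-text search.
    static boolean isMergeCandidate(String fileName, long sizeBytes) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase();
        return sizeBytes < THRESHOLD_BYTES && DOC_TYPES.contains(ext);
    }
}
```

For example, a 4 KB `report.txt` would pass the filter, while a 20 MB PDF or a small video file would not.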
2. Establishing a two-level index to optimize file read speed
After small files are merged, the NameNode memory is the performance bottleneck of the whole file system, because all file metadata must be stored in its memory. Merging small files reduces the number of files and saves considerable memory, but reading the merged files would otherwise be very inefficient.
A preferred embodiment of the invention uses a hierarchical index for the small-file metadata, partitioning the large index file into small index files by reasonable rules. The resource collection serves as the primary index, and the resource entry content under each resource collection serves as the secondary index; a lookup first queries by the collection that the resource entry belongs to, and then searches in the corresponding secondary index file. Although this adds a lookup in the primary index, the number of resource collections cannot be too large, so that lookup takes very little time, and the partitioned secondary index files are much smaller than a global index file, so lookup efficiency improves overall. Moreover, not all secondary index files are loaded into memory; they are scheduled flexibly according to memory usage and the caching strategy, which solves the problem of insufficient memory.
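A minimal in-memory model of the two-level lookup is sketched below; types and field names are assumptions for illustration, and in the real design the secondary index files live on disk and are loaded on demand rather than all held in one Map:

```java
import java.util.HashMap;
import java.util.Map;

public class HierarchicalIndex {
    // Metadata for one resource entry inside a merged block file
    // (field names are illustrative, not from the patent text).
    record EntryMeta(String blockFile, long offset, long length) {}

    // Primary index: resource collection -> secondary index for that collection.
    // Secondary index: entry name -> location metadata.
    private final Map<String, Map<String, EntryMeta>> primary = new HashMap<>();

    void add(String collection, String entry, EntryMeta meta) {
        primary.computeIfAbsent(collection, k -> new HashMap<>()).put(entry, meta);
    }

    // Two-stage lookup: first the (small) primary index, then only the
    // secondary index belonging to that collection.
    EntryMeta lookup(String collection, String entry) {
        Map<String, EntryMeta> secondary = primary.get(collection);
        return secondary == null ? null : secondary.get(entry);
    }
}
```

The point of the split is that the first stage touches only a handful of collection keys, so each lookup scans a small partitioned index instead of one global index file.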
3. Index-prefetch-based optimization of file response speed
The index prefetching proposed here means predicting, from the data a user is currently accessing, the data the user will access next, and loading its index into the cache. If the prediction is accurate, the data the user is about to access can be loaded into the cache in advance, and the system responds faster when the user accesses it.
Before downloading or browsing a resource entry, a user usually must obtain an "intermediate result set" by retrieval or directory browsing, and only then can the needed resource entries be selected from it for further access. An interval of several seconds exists between the user seeing the result set page and performing a download or browse; during this period the indexes of the resource entries in the intermediate result set are cached in advance, so that when the user clicks download or browse, the series of file metadata queries need not be executed again and the file is transferred directly, which greatly improves the response time of these file requests. This improvement does not require much memory: supposing 100,000 users perform retrieval simultaneously, each result set page shows 20 resource entries, and caching one file's metadata takes 150 B, only about 0.3 GB of memory is needed.
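The memory estimate above can be checked directly; the figures (100,000 concurrent users, 20 entries per result page, 150 B per cached metadata record) are the ones given in the text:

```java
public class CacheMemoryEstimate {
    // Estimated prefetch-cache footprint in bytes: each concurrent user
    // holds one result page of entriesPerPage entries, and each cached
    // file-metadata record costs bytesPerEntry.
    static long estimateBytes(long concurrentUsers, long entriesPerPage, long bytesPerEntry) {
        return concurrentUsers * entriesPerPage * bytesPerEntry;
    }
}
```

With the stated figures, `estimateBytes(100_000, 20, 150)` gives 300,000,000 bytes, i.e. the 0.3 GB claimed in the description.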
The storage layer architecture of the cloud storage platform of the present invention is described in detail below. Besides the strategies above, the storage layer architecture is the foundation of the system implementation. The storage layer is built on a distributed storage system on a Hadoop cluster and provides basic file storage and read services.
The cloud storage platform storage layer adopts a three-tier design: a user interface layer, a business logic layer and a storage layer; to improve performance, the Web server is separated from the server cluster. The user interface layer provides the user interface through which users send requests and receive feedback. The business logic layer implements the small-file read/write functions, including file merging, index construction and cache construction.
The business logic layer comprises functional modules such as file merging, the retrieval system, the small-file index, the cache and the distributed file system client. Each module is implemented as follows:
(1) File merging
The file merging function comprises two stages: creating a SequenceFile object, and merging the small files. After filtering by the filter, the files that meet the merging requirements are merged: the primary index is first searched by the resource collection of the resource entry; after the file path corresponding to the resource collection is found, a SequenceFile object is created, its Writer object is obtained and configured, and the write is prepared. While the file write executes, a new thread is opened to write metadata such as the file position and length corresponding to the resource entry into the resource entry secondary index. When the resource entry is written successfully, the output stream is closed and success is returned; otherwise failure is returned.
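A stdlib stand-in for this merge flow is sketched below, with an in-memory byte container in place of Hadoop's SequenceFile. The (offset, length) bookkeeping mirrors the secondary-index write described above, but the class and method names are invented for illustration:

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class FileMerger {
    // Position of one small file inside the merged container.
    record Location(long offset, int length) {}

    private final ByteArrayOutputStream container = new ByteArrayOutputStream();
    private final Map<String, Location> secondaryIndex = new HashMap<>();

    // Append one small file; the index entry is written in the same step,
    // mirroring the "write file, then write offset/length to the
    // secondary index" flow of the description.
    void append(String name, byte[] data) {
        long offset = container.size();
        container.writeBytes(data);
        secondaryIndex.put(name, new Location(offset, data.length));
    }

    // Read one small file back via its (offset, length) index entry,
    // without ever splitting the merged container into separate files.
    byte[] read(String name) {
        Location loc = secondaryIndex.get(name);
        byte[] all = container.toByteArray();
        return Arrays.copyOfRange(all, (int) loc.offset(), (int) loc.offset() + loc.length());
    }
}
```

In the real system the container would be a SequenceFile on HDFS and the index write would happen on a separate thread, but the invariant is the same: every appended entry is recoverable from its recorded offset and length.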
(2) Retrieval system
Provides the document retrieval function; the "intermediate result set"-based read optimization of the distributed file system relies on this module.
(3) Small-file index
Builds the small-file indexes, comprising the resource collection primary index and the resource entry secondary index, and provides functions such as index file creation and record addition and deletion.
The primary index data are stored in a relational database, accessed through the relational database access interface, and held in a Java Map data structure. Because resource collections are stored in the database, only a system-generated value field needs to be added when a resource entry is added, so the index can be kept in the relational database without affecting processing efficiency. The data in the primary index use a key/value structure, and a Java Map can be used to improve query efficiency. In addition, to guarantee retrieval efficiency, this Map object is initialized from the database contents when the service starts and kept resident; because the primary index holds few records, the Map object occupies very little memory and the system overhead is limited. The Map object must be updated whenever a resource collection is added or deleted.
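A sketch of keeping the in-memory Map in step with the database follows; the database is simulated by rows passed to an init method, and the names are assumptions (real code would go through JDBC):

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PrimaryIndexCache {
    // In-memory mirror of the primary index: collection name -> file path.
    // The authoritative copy lives in the relational database.
    private final Map<String, String> collectionToPath = new ConcurrentHashMap<>();

    // On service start, initialize the Map from the database contents
    // (each row is {collectionName, filePath} here).
    void initFrom(List<String[]> dbRows) {
        for (String[] row : dbRows) collectionToPath.put(row[0], row[1]);
    }

    // Keep the Map in step with the database whenever a resource
    // collection is added or deleted, as the description requires.
    void addCollection(String name, String path) { collectionToPath.put(name, path); }
    void removeCollection(String name) { collectionToPath.remove(name); }

    String pathOf(String collection) { return collectionToPath.get(collection); }
}
```

Because the number of collections is small, this resident Map costs little memory while letting every read skip a database round-trip.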
The secondary index is built with the open-source project Lucene and supports small-file metadata retrieval. Lucene has a complete set of index construction, update and search facilities; its search efficiency is very high when the index file is under 1 GB, and it can be used to build commercial search engines. The index the cloud storage platform creates needs some special functions, such as updating the index file in real time whenever a user adds a resource entry, concurrency control of file writes when multiple users add resource entries under the same resource collection simultaneously, and compressing the index file to reduce memory usage.
(4) Prefetching
To further improve response speed, cache management for the "intermediate result set" the user is interested in is provided here, including cache space maintenance, cache updates, and replacement algorithm maintenance.
After a user issues a retrieval request, the Web service queries, according to the user's search conditions, the resource entry result set that meets the user's needs and returns it to the user, while simultaneously creating an asynchronous thread to update the cache; the cache contents are updated in the interval between returning the result set and the user deciding to click download or browse. When the cache module receives a cache-update request, it calls the index module to retrieve and load the metadata of the current result set entries into the cache. When the user issues a download or browse request, the Web service calls the distributed file system client, which finds the metadata in the cache and starts reading data and transmitting it to the client.
The system maintains a thread pool with a fixed number of threads; each cache-update request is handled by one thread, and if no thread in the pool is idle the caching task waits. In this way the system resources occupied by cache-update tasks stay within a reasonable range without affecting overall system performance. The present invention selects a FIFO algorithm to implement the cache module's scheduling, evicting the oldest cache entries in the most efficient manner. Specifically: a cache pool is established whose size is configurable and defaults to 32 MB, which can hold 200,000 file metadata records. The pool stores key/value pairs, with the file name as the key and the combination of the file's DataNode ID, start position and length as the value. The cache pool provides two operations, put and get: put places data into the pool and, if the existing data has reached the pool's upper limit, replaces the corresponding data according to the cache replacement algorithm, or inserts directly if space remains; get returns the value corresponding to a key, or empty if the key is absent.
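The FIFO cache pool can be sketched with a `LinkedHashMap` in insertion order, whose `removeEldestEntry` hook evicts the oldest record once the configured capacity is exceeded. As a simplification, capacity here counts entries rather than the 32 MB byte budget in the text, and the type names are illustrative:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FifoCachePool {
    // Cached value: file name -> (DataNode ID, start position, length),
    // as described for the cache pool above.
    record Value(String dataNodeId, long offset, long length) {}

    private final int capacity;
    private final LinkedHashMap<String, Value> pool;

    FifoCachePool(int capacity) {
        this.capacity = capacity;
        // accessOrder = false keeps insertion order, so the eldest entry
        // is the first one inserted: FIFO replacement, not LRU.
        this.pool = new LinkedHashMap<>(16, 0.75f, false) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Value> eldest) {
                return size() > FifoCachePool.this.capacity;
            }
        };
    }

    // put: insert; when the pool is full, the oldest entry is replaced.
    synchronized void put(String fileName, Value v) { pool.put(fileName, v); }

    // get: return the value for a key, or null when absent.
    synchronized Value get(String fileName) { return pool.get(fileName); }
}
```

With a capacity of two, inserting a third entry evicts the first one inserted, regardless of how recently it was read — the FIFO behavior the description calls for.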
(5) Distributed file system client
The distributed file system client encapsulates the APIs through which the outside world operates on the file system, including reading and writing files and querying file positions. When the file system receives a file read request, it first passes the request through the file filter; for a file that has been merged, the file's metadata is located first in the cache, and if absent, the index file is searched; if it is still not found, the client communicates with the NameNode. After the file metadata is found, a SequenceFile object is constructed, its Reader object is obtained, and a read request is sent to the DataNode; after the data are transferred to the user, the input stream is closed and the result is returned.
Users make two kinds of requests: write requests that submit files, and read requests that query, browse or obtain resources.
When the Web server receives a user's resource submission request, it first judges whether small-file merging is needed; if so, file merging is performed, and if not, the distributed file system write interface is used directly. After merging, the file is written to the distributed file system through the distributed file system client; while the client writes the file, the small-file index update module is called to update the small-file index. Because the Web server host is separated from the server cluster, the write and the update can be executed by different threads simultaneously without affecting each other. When the distributed file system write succeeds, the Web service returns a submission success message to the client.
A user sends a file read request to browse a document's details or to download a file; such requests are frequent and consume the most system resources. When the Web server receives a user's read request, the retrieval system first performs retrieval according to the conditions the user submitted, and the resource entry result set the user needs is returned for browsing; at the same time, the entry set shown on the first page of the user interface (20 by default) is sent to the cache module, and an independent thread is opened to update the cache. When the user, having browsed the returned result set page, requests a download or a detailed view, the Web service calls the distributed file system client to read the file content: the client first locates the file position information in the cache, and if it is not found, searches the small-file index; once the position information is found, it reads the data directly from the DataNode and returns it to the user.
In summary, the present invention proposes a big data access method in which, for the reading and writing of small files used in full-text search, retrieval efficiency is improved through indexing and prefetching, maintaining the response speed of the cloud storage platform and the overall performance of the distributed file system when large numbers of small files are stored and read.
Obviously, those skilled in the art should appreciate that the modules or steps of the present invention described above can be realized with a general-purpose computing system; they may be concentrated on a single computing system or distributed over a network formed by multiple computing systems; optionally, they may be realized with program code executable by a computing system, and thus may be stored in a storage system and executed by a computing system. The present invention is therefore not restricted to any specific combination of hardware and software.
It should be understood that the above embodiments of the present invention are only for exemplary illustration or explanation of the principles of the invention and do not limit it. Therefore, any modification, equivalent replacement, improvement and the like made without departing from the spirit and scope of the invention shall be included within its protection scope. Furthermore, the appended claims are intended to cover all changes and modifications that fall within the scope and boundary of the claims, or the equivalents of such scope and boundary.

Claims (6)

1. A big data access method for accessing file resources in a cloud storage platform, characterized by comprising:
merging files smaller than a preset size in a distributed file system to form new storage files, and establishing a primary index and a secondary index for the merged files; and
using a prefetch mechanism to cache the indexes before a user requests a resource.
2. The method according to claim 1, characterized in that merging the files smaller than the preset size further comprises:
dividing each file into blocks, persisting the namespace of the distributed file system in an image file, and having the NameNode load this image file into memory at startup;
after a user submits a file resource to the cloud storage platform, first screening it with a file filter according to filter conditions; for qualifying files, merging a predetermined number of files with the same attribute to generate a new storage file, and updating the index file while writing the new storage file to the system; and
for each file read or write, first querying the namespace to locate the file's block address, file size and other information, and then retrieving the data in the DataNode space.
3. The method according to claim 2, characterized in that, in establishing the primary index and the secondary index for the merged files, the primary index is the resource collection to which a file belongs and the secondary index is the concrete resource entry; the resource collection is a set of correlated resource entries, each resource entry belongs to exactly one resource collection, and files may be partitioned by attribute domain; when a file needs to be read, the primary index and the secondary index are queried in turn; after filtering, the cloud storage platform merges the filtered files into block files, one block per resource collection of the file entries; and during data processing, the resource entries in a new block file can be assigned to the same MapReduce task.
4. The method according to claim 1, characterized in that the prefetch mechanism further comprises: predicting, from the data a user is currently accessing, the data the user will access next, and loading its index into the cache so that the system responds faster when the user accesses it; before downloading or browsing a resource entry, the user obtains an intermediate result set by retrieval or directory browsing, and then selects the needed resource entries from it for further access; in the interval between the user seeing the result set page and performing a download or browse, the indexes of the resource entries in the intermediate result set are cached in advance; and when the user clicks download or browse, no file metadata query is executed and the file is transferred directly.
5. The method according to claim 4, characterized in that caching the indexes further comprises: after a user issues a retrieval request, the Web service queries, according to the user's search conditions, the resource entry result set that meets the user's needs and returns it to the user, while simultaneously creating an asynchronous thread to update the cache, the cache contents being updated in the interval between returning the result set and the user deciding to click download or browse; on receiving a cache-update request, calling the index module to retrieve and load the metadata of the current result set entries into the cache; when the user issues a download or browse request, the Web service calling the distributed file system client, which finds the metadata in the cache and starts reading data and transmitting it to the client; the server maintaining a thread pool with a fixed number of threads, each cache-update request being handled by one thread, and the caching task waiting if no thread in the pool is idle; establishing a cache pool with a FIFO algorithm and configuring the pool size, keeping key/value pairs in the pool with the file name as the key and the combination of the file's DataNode ID, start position and length as the value, and evicting the oldest cache entries; the cache pool providing two operations, put and get, where put places data into the pool and, if the existing data has reached the pool's upper limit, replaces data according to the FIFO replacement algorithm, and get returns the value corresponding to a key.
6. The method according to claim 3, wherein the primary index data are stored in a relational database, accessed through the relational database access interface, and held in a Java Map data structure. In addition to writing resource collections into the database, a system-generated value field is added whenever a resource entry is added. The data in the primary index adopt a key/value structure using a Java Map; this Map object is initialized from the database contents when the service starts, and is updated whenever a resource collection is added or deleted. The secondary index is created with the open-source project Lucene and supports small-file metadata retrieval; the index file is updated in real time whenever a user adds a resource entry, and concurrency control of file writes is applied when multiple users add resource entries under the same resource collection simultaneously.

Merging the files further comprises: creating a SequenceFile object and merging the files that meet the preset condition after passing through a filter. The resource collection to which a resource entry belongs is looked up in the primary index; after the file path corresponding to the resource collection is found, a SequenceFile object is created, its Writer object is obtained and configured, and the file write is prepared. While the file write is executed, a new thread is opened to write the file position value and length information corresponding to the resource entry into the resource-entry secondary index. If the resource entry is written successfully, the output stream is closed and success is returned; otherwise failure is returned.
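The merge-and-index flow of claim 6 uses Hadoop's `SequenceFile`, which requires a running HDFS deployment; as a dependency-free illustration of the same idea — append files below a size threshold into one container and record each entry's (offset, length) for the secondary index — a simplified stand-in might look like the following. The class name, the in-memory container, and the read helper are assumptions for illustration, not the patented implementation.

```java
import java.io.ByteArrayOutputStream;
import java.util.HashMap;
import java.util.Map;

// Simplified, dependency-free stand-in for claim 6's SequenceFile merge: files
// below a size threshold pass the filter and are appended to one container,
// and each entry's (offset, length) is recorded in a secondary index.
public class SmallFileMerger {
    public static final class Extent {
        public final long offset;
        public final long length;

        public Extent(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }
    }

    private final ByteArrayOutputStream container = new ByteArrayOutputStream();
    private final Map<String, Extent> secondaryIndex = new HashMap<>();
    private final int threshold; // only files smaller than this are merged

    public SmallFileMerger(int threshold) {
        this.threshold = threshold;
    }

    // Returns true when the entry is committed ("submit succeeded"),
    // false when the filter rejects the file.
    public boolean append(String name, byte[] data) {
        if (data.length >= threshold) {
            return false; // filter: only small files are merged
        }
        long offset = container.size();
        container.write(data, 0, data.length);
        secondaryIndex.put(name, new Extent(offset, data.length));
        return true;
    }

    // Reading back uses only the secondary index, mirroring how the merged
    // store is addressed by (offset, length) in the claims.
    public byte[] read(String name) {
        Extent e = secondaryIndex.get(name);
        byte[] all = container.toByteArray();
        byte[] out = new byte[(int) e.length];
        System.arraycopy(all, (int) e.offset, out, 0, (int) e.length);
        return out;
    }
}
```

In the claimed system the container would be an HDFS SequenceFile written through its Writer object, and the (offset, length) pairs would be written to the Lucene secondary index by a separate thread.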
CN201510118185.0A 2015-03-18 2015-03-18 Big data access method Pending CN104679898A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510118185.0A CN104679898A (en) 2015-03-18 2015-03-18 Big data access method

Publications (1)

Publication Number Publication Date
CN104679898A true CN104679898A (en) 2015-06-03

Family

ID=53314940

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510118185.0A Pending CN104679898A (en) 2015-03-18 2015-03-18 Big data access method

Country Status (1)

Country Link
CN (1) CN104679898A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100162230A1 (en) * 2008-12-24 2010-06-24 Yahoo! Inc. Distributed computing system for large-scale data handling
US20100179855A1 (en) * 2009-01-09 2010-07-15 Ye Chen Large-Scale Behavioral Targeting for Advertising over a Network
CN103577339A (en) * 2012-07-27 2014-02-12 深圳市腾讯计算机系统有限公司 Method and system for storing data
CN103856567A (en) * 2014-03-26 2014-06-11 西安电子科技大学 Small file storage method based on Hadoop distributed file system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
卞艺杰: "Small File Storage in the Hdspace Distributed Institutional Repository System", Computer Systems & Applications *
张春明等: "A Method for Storing and Reading Small Files in Hadoop", Computer Applications and Software *
陈光景: "Research and Implementation of Hadoop Small File Processing Technology", China Masters' Theses Full-text Database, Information Science and Technology Series *

Cited By (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105005622B (en) * 2015-07-24 2018-12-07 肖华 Method for high-speed storage of high-fidelity continuous-frame queries and image output method thereof
CN105005622A (en) * 2015-07-24 2015-10-28 肖华 Method for high-speed storage of high-fidelity continuous-frame queries and image output method thereof
CN105677904A (en) * 2016-02-04 2016-06-15 杭州数梦工场科技有限公司 Distributed file system based small file storage method and device
CN105677904B (en) * 2016-02-04 2019-07-12 杭州数梦工场科技有限公司 Small file storage method and device based on distributed file system
CN106095832A (en) * 2016-06-01 2016-11-09 东软集团股份有限公司 Distributed parallel processing method and device
CN108460054A (en) * 2017-02-22 2018-08-28 北京京东尚科信息技术有限公司 Method, system and device for improving cloud storage system performance
CN107040596A (en) * 2017-04-17 2017-08-11 山东辰华科技信息有限公司 The construction method of science service ecosystem platform based on big data cloud computing
CN107103095A (en) * 2017-05-19 2017-08-29 成都四象联创科技有限公司 Method for computing data based on high performance network framework
CN107391555A (en) * 2017-06-07 2017-11-24 中国科学院信息工程研究所 A kind of metadata real time updating method towards Spark Sql retrievals
CN107391555B (en) * 2017-06-07 2020-08-04 中国科学院信息工程研究所 Spark-Sql retrieval-oriented metadata real-time updating method
CN108053863A (en) * 2017-12-22 2018-05-18 中国人民解放军第三军医大学第一附属医院 Mass medical data storage system and data storage method suitable for large and small files
CN108053863B (en) * 2017-12-22 2020-09-11 中国人民解放军第三军医大学第一附属医院 Mass medical data storage system and data storage method suitable for large and small files
CN108427590A (en) * 2018-02-09 2018-08-21 福建星网锐捷通讯股份有限公司 A kind of implementation method of UI Dynamic Distributions
CN108427590B (en) * 2018-02-09 2021-02-05 福建星网锐捷通讯股份有限公司 Method for realizing UI dynamic layout
CN108647193B (en) * 2018-04-20 2021-11-19 河南中烟工业有限责任公司 Unique identifier generation method and device applicable to distributed system
CN108647193A (en) * 2018-04-20 2018-10-12 河南中烟工业有限责任公司 A kind of unique identifier generation method can be applied to distributed system and device
CN109766318A (en) * 2018-12-17 2019-05-17 新华三大数据技术有限公司 File reading and device
CN109947718A (en) * 2019-02-25 2019-06-28 全球能源互联网研究院有限公司 A kind of date storage method, storage platform and storage device
CN110377562B (en) * 2019-07-23 2022-11-01 安徽朵朵云网络科技有限公司 Big data safe storage method based on Hadoop open source platform
CN110377562A (en) * 2019-07-23 2019-10-25 宿州星尘网络科技有限公司 Big data method for secure storing based on Hadoop Open Source Platform
CN110688361A (en) * 2019-08-16 2020-01-14 平安普惠企业管理有限公司 Data migration method, electronic device and computer equipment
CN110995799A (en) * 2019-11-22 2020-04-10 山东九州信泰信息科技股份有限公司 Data interaction method based on Fetch and springMVC
CN111400247A (en) * 2020-04-13 2020-07-10 杭州九州方园科技有限公司 User behavior auditing method and file storage method
CN111818021A (en) * 2020-06-20 2020-10-23 深圳市众创达企业咨询策划有限公司 Configuration information safety protection system and method based on new generation information technology
CN111818021B (en) * 2020-06-20 2021-02-09 深圳市众创达企业咨询策划有限公司 Configuration information safety protection system and method based on new generation information technology
CN112347044A (en) * 2020-11-10 2021-02-09 北京赛思信安技术股份有限公司 Object storage optimization method based on SPDK
CN112347044B (en) * 2020-11-10 2024-04-12 北京赛思信安技术股份有限公司 Object storage optimization method based on SPDK
WO2022141650A1 (en) * 2021-01-04 2022-07-07 Alibaba Group Holding Limited Memory-frugal index design in storage engine
CN114356230A (en) * 2021-12-22 2022-04-15 天津南大通用数据技术股份有限公司 Method and system for improving reading performance of column storage engine
CN114356230B (en) * 2021-12-22 2024-04-23 天津南大通用数据技术股份有限公司 Method and system for improving read performance of column storage engine
CN114461146A (en) * 2022-01-26 2022-05-10 北京百度网讯科技有限公司 Cloud storage data processing method, device, system, equipment, medium and product
CN114461146B (en) * 2022-01-26 2024-05-07 北京百度网讯科技有限公司 Cloud storage data processing method, device, system, equipment, medium and product
CN118069589A (en) * 2024-04-17 2024-05-24 济南浪潮数据技术有限公司 File access method, device, computer equipment and program product

Similar Documents

Publication Publication Date Title
CN104679898A (en) Big data access method
CN104778270A (en) Storage method for multiple files
JP7113040B2 (en) Versioned hierarchical data structure for distributed data stores
CN107247808B (en) Distributed NewSQL database system and picture data query method
Wei et al. Xstore: Fast rdma-based ordered key-value store using remote learned cache
EP2973018B1 (en) A method to accelerate queries using dynamically generated alternate data formats in flash cache
Cambazoglu et al. Scalability challenges in web search engines
CN102169507B (en) Implementation method of distributed real-time search engine
US8364751B2 (en) Automated client/server operation partitioning
US8356050B1 (en) Method or system for spilling in query environments
WO2015094179A1 (en) Abstraction layer between a database query engine and a distributed file system
JP2003006036A (en) Clustered application server and web system having database structure
US9148329B1 (en) Resource constraints for request processing
CN103595797B (en) Caching method for distributed storage system
CN103530387A (en) Improved method aimed at small files of HDFS
CN101184106A (en) Associated transaction processing method of mobile database
US20100274795A1 (en) Method and system for implementing a composite database
CN103023982A (en) Low-latency metadata access method of cloud storage client
US11080207B2 (en) Caching framework for big-data engines in the cloud
CN116108057B (en) Distributed database access method, device, equipment and storage medium
CN106709010A (en) High-efficient HDFS uploading method based on massive small files and system thereof
Durner et al. Crystal: a unified cache storage system for analytical databases
US7743333B2 (en) Suspending a result set and continuing from a suspended result set for scrollable cursors
Marcu KerA: A Unified Ingestion and Storage System for Scalable Big Data Processing
US10387384B1 (en) Method and system for semantic metadata compression in a two-tier storage system using copy-on-write

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20150603