CN106570113A

CN106570113A - Cloud storage method and system for mass vector slice data

Info

Publication number: CN106570113A
Application number: CN201610939884.6A
Authority: CN
Inventors: 马潇; 王景朝; 费香泽; 王宪
Original assignee: State Grid Corp of China SGCC; China Electric Power Research Institute Co Ltd CEPRI; State Grid Anhui Electric Power Co Ltd
Current assignee: State Grid Corp of China SGCC; China Electric Power Research Institute Co Ltd CEPRI; State Grid Anhui Electric Power Co Ltd
Priority date: 2016-10-25
Filing date: 2016-10-25
Publication date: 2017-04-19
Anticipated expiration: 2036-10-25
Also published as: CN106570113B

Abstract

The invention discloses a cloud storage method for mass vector slice data. The method comprises the following steps of: establishing a distributed file system directory tree file; establishing all the metadata nodes corresponding to a distributed file system directory tree file; aggregating the mass vector slice data under the same directories in a distributed file system so as to generate a mass vector slice data packet; storing the mass vector slice data packet in the metadata nodes; establishing indexes for the mass vector slice data, and correlating the mass vector slice data through the indexes so as to form a net-structure data index table of the mass vector slice data, wherein the index table is used for recording paths of the mass vector slice data in the mass vector slice data packets; and providing mass vector slice data indexing services through a mass vector slice data packet index table.

Description

A cloud storage method and system for mass vector slice data

Technical field

The present invention relates to the field of mass data storage, and more particularly to a cloud storage method and system for mass vector slice data.

Background

With the continuous development of science and technology, the era of mass data has arrived. How to optimize file system load and improve load balance has therefore become an important requirement. When the size of a data set exceeds the storage capacity of a single physical computer, the data set must be partitioned and stored across several independent computers. Major international companies such as Google, Amazon, IBM and Microsoft have invested substantial research effort in this field and have proposed a variety of innovative mass data management technologies. Current research focuses mainly on three layers: the storage layer, the computing layer and the interface layer. In the prior art, the Hadoop project implements the Hadoop Distributed File System (HDFS) and the parallel programming framework Hadoop MapReduce. Because a distributed file system is built on a network, it introduces the complexity of network programming and is therefore more complex than an ordinary disk file system. The goal of a distributed file system is to share resources so that programs can store and access remote files in much the same way as local files; typical examples are the Google File System (GFS), the Hadoop file system HDFS, Dynamo and TFS. Present distributed file systems generally keep an access interface and object model almost identical to those of a local file system, mainly to provide users with backward compatibility.

The prior art mainly uses distributed file systems to store and read very large data files (file sizes of hundreds of MB, GB or TB). For large numbers of small files, however, storage based on a distributed file system is slow and cannot meet the storage demand. There is currently no technical solution for storing and reading large numbers of small files based on a distributed file system.

Summary of the invention

To solve the speed problem that arises when large numbers of small files are stored in a distributed file system, the present invention provides a method, the method comprising:

establishing all metadata nodes corresponding to the distributed file system directory tree;

aggregating the mass vector slice data under the same directory in the distributed file system to generate a mass vector slice data packet;

storing the mass vector slice data packet in the metadata node;

establishing an index for the mass vector slice data, the mass vector slice data being associated through the index to form a net-structured data index of the mass vector slice data;

providing a mass vector slice data index service through the mass vector slice data packet index table.

Preferably, the method includes:

the mass vector slice data index includes the path and name of the mass vector slice data and its offset in the mass vector slice data packet;

the mass vector slice data path includes a metadata node position, a mass vector slice data row position and a mass vector slice data column position.

Preferably, the method includes:

presetting a metadata node for each layer, the index table being stored in the preset metadata node of each layer;

transmitting the mass vector slice data index table stored in the metadata to the client, and establishing a persistent mapping table of the mass vector slice data index table.

Preferably, the mass vector slice data packet includes a file header and at least one record;

the file header includes a file type, a version number, a file keyword, a file name and the position corresponding to each record;

each record corresponds to one piece of vector slice data and includes the length of the vector slice data, a key length, a key and a value.

Preferably, the mass vector slice data packet is stored using a data file serialization method.

Preferably, the method further includes: performing append storage at the tail of the mass vector slice data packet.

Preferably, the method includes: caching the mass vector slice data index table on the client, which reduces the number of accesses to the metadata node and so improves access to the mass vector slice data.

Preferably, the method further includes a method for reading the mass vector slice data:

determining, through the mass vector slice data index table, the shortest path to the metadata node corresponding to the mass vector slice data packet;

determining the position of the vector slice data from the file header of the data packet file in the determined metadata node.

Based on an embodiment of the present invention, the present invention provides a cloud storage system for mass vector slice data, the system comprising:

a first generating unit, configured to establish a distributed file system directory tree file;

a second generating unit, configured to establish all metadata nodes corresponding to the distributed file system directory tree;

an aggregation unit, configured to aggregate, based on the distributed file system, the mass vector slice data under the same directory to generate a mass vector slice data packet;

a storage unit, configured to store the mass vector slice data packet in the metadata node;

a third generating unit, configured to generate the mass vector slice data index table and to establish, through the index table, the net structure of the mass vector slice data packet, the index table recording the path of the mass vector slice data in the mass vector slice data packet;

an index unit, configured to provide the mass vector slice data index service through the mass vector slice data index.

The beneficial effects of the present invention are as follows. The mass vector slice data under the same directory in the distributed file system is aggregated to generate a mass vector slice data packet, so that the mass vector slice data can be stored quickly. The invention also proposes establishing an index for the mass vector slice data; the data is associated through the index to form a net-structured data index of the mass vector slice data. Through the net-structured data index, the corresponding metadata node is found by the shortest path, which speeds up data access.

Brief description of the drawings

Exemplary embodiments of the present invention can be more fully understood by reference to the following drawings:

Fig. 1 is a flow chart of a cloud storage method for mass vector slice data according to an embodiment of the present invention; and

Fig. 2 is a structural diagram of a cloud storage system for mass vector slice data according to an embodiment of the present invention.

Detailed description

Exemplary embodiments of the present invention are now described with reference to the accompanying drawings. The present invention may, however, be implemented in many different forms and is not limited to the embodiments described herein; these embodiments are provided so that the disclosure is thorough and complete and fully conveys the scope of the invention to those of ordinary skill in the art. The terms used in the exemplary embodiments illustrated in the drawings do not limit the invention. In the drawings, identical units/elements are denoted by identical reference numerals.

Unless otherwise stated, the terms used herein (including scientific and technical terms) have the meanings commonly understood by those of ordinary skill in the art. It will further be understood that terms defined in commonly used dictionaries should be interpreted as having meanings consistent with the context of the relevant field, and should not be interpreted in an idealized or overly formal sense.

Fig. 1 is a flow chart of a cloud storage method for mass vector slice data according to an embodiment of the present invention. The present invention proposes a method for storing mass vector slice data based on a distributed file system. The solution builds on the existing distributed file system directory tree structure: the multiple pieces of mass vector slice data in one directory are packed into a mass vector slice data packet for storage, and the resulting packet is a large data file of more than 100 MB. At the same time, the technical solution establishes an index for the mass vector slice data, records the path of the mass vector slice data within the packet, and provides an interface through which clients access the mass vector slice data. The method makes full use of the high fault tolerance, scalability and distribution of a master-slave distributed file system and, on the basis of a distributed file system designed for files larger than 100 MB, achieves efficient storage of mass vector data. The proposed method stores mass vector data in a distributed file system and indexes it at the same time, which solves the current problem of slow storage of mass vector data and improves access speed through the index.

Preferably, method 100 starts at step 101: establishing a distributed file system directory tree file. Building the distributed file system directory tree structure file makes full use of the high fault tolerance, scalability and distribution of a distributed file system.

Preferably, step 102: establishing all metadata nodes corresponding to the distributed file system directory tree. The metadata nodes are used to store data.

Preferably, step 103: aggregating the mass vector slice data under the same directory in the distributed file system to generate a mass vector slice data packet. The file structure of the mass vector slice data packet is designed so that the packet includes a file header and at least one record. The file header includes a file type, a version number, a file keyword, a file name and the position corresponding to each record. Each record corresponds to one piece of vector slice data and includes the length of the vector slice data, a key length, a key and a value. Additional mass vector slice data is stored by appending it at the tail of the packet. The packet is stored using a data file serialization method.

In the embodiment proposed by the present invention, the method for cloud storage of mass vector data is based on a distributed system architecture consisting of one metadata node and data nodes arranged in a multi-level structure beneath it. The embodiment saves all the mass vector slice data under the same directory into one data file under that directory; this data file is the mass vector slice data packet of the present invention and is a file in the distributed file system. In the embodiment, the key to the aggregated storage technique is the design of the mass vector slice data packet file. The packet file is a distributed file system file that uses a persistent binary key/value (Key/Value) data structure and consists of a file header followed by one or more records. The first three bytes of the packet file header are SEQ, denoting the file type, followed by one byte representing the version number of the file data structure. The file header also contains several other fields, including the names of the key and value types. When mass vector slice data is stored, it is appended directly at the tail of the packet file. Each record represents one piece of vector slice data. A record consists of four parts: the record length, the key length, the key and the value. The key is the file name of the vector slice data, and the value is the content of the vector slice data.
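The packet file layout described above can be sketched in a few lines of Python. This is a minimal illustration only: the text fixes the three-byte SEQ marker, the one-byte version and the four record fields, while the field widths, the version value, the type-name strings and the slice file names used here are assumptions.

```python
import io
import struct

MAGIC = b"SEQ"   # three-byte file type marker named in the text
VERSION = 1      # one-byte version; the value 1 is an assumption

def _put_str(buf, text):
    # Length-prefixed UTF-8 string (2-byte length is an assumed width).
    raw = text.encode("utf-8")
    buf.write(struct.pack(">H", len(raw)))
    buf.write(raw)

def write_header(buf, key_type="str", value_type="bytes"):
    # Header: file type, version, then the key and value type names.
    buf.write(MAGIC)
    buf.write(bytes([VERSION]))
    _put_str(buf, key_type)
    _put_str(buf, value_type)

def append_record(buf, name, content):
    """Append one slice at the tail and return the offset of its record."""
    buf.seek(0, io.SEEK_END)
    offset = buf.tell()
    key = name.encode("utf-8")
    record_len = 4 + len(key) + len(content)  # key-length field + key + value
    buf.write(struct.pack(">II", record_len, len(key)))
    buf.write(key)      # key: file name of the vector slice
    buf.write(content)  # value: content of the vector slice
    return offset

def read_record(buf, offset):
    """Read the record that starts at the given offset."""
    buf.seek(offset)
    record_len, key_len = struct.unpack(">II", buf.read(8))
    key = buf.read(key_len).decode("utf-8")
    value = buf.read(record_len - 4 - key_len)
    return key, value

packet = io.BytesIO()  # stands in for a file in the distributed file system
write_header(packet)
off_a = append_record(packet, "05_06.pbf", b"tile-a")
off_b = append_record(packet, "05_07.pbf", b"tile-bb")
```

The offsets returned by `append_record` are the values an index record would store, so a later read can seek straight to one slice without scanning the packet.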

Preferably, step 104: storing the mass vector slice data packet in the metadata node. The packet storage method is implemented on top of the distributed file system, and access operations on mass vector slice data depend on the distributed file system. Additional mass vector slice data is stored by appending at the tail of the packet, and the packet is stored using a data file serialization method. When a client writes vector slice data under a directory, it performs a write operation on the data file of that directory; the distributed file system records the occupancy right (Lease) on the data file, which can be regarded as a write lock on the file. If another client then needs to store its own vector slice data under the same directory, it likewise applies to write to the mass vector slice data packet file under that directory. Because the packet file already holds a write lock, and the distributed file system does not maintain a queue of transaction requests, the system directly returns an operation-failure result to the client. From the user's point of view, creating different mass vector slice data packet files under the same directory would not conflict; in fact, however, the back end performs these operations on the same packet file, and this lock mechanism therefore gives rise to write conflicts when multiple users write different vector slice data under the same directory. The packet file is implemented mainly with serialization and deserialization of data files. Serialization means converting a structured object into a byte stream so that it can be transmitted over a network or written to disk for permanent storage. Deserialization is the inverse process of converting a byte stream back into a structured object.
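The write conflict described above can be illustrated with a small Python sketch. It is not the file system's actual lease implementation: `LeaseTable`, the packet path and the client identifiers are hypothetical names, and the only behaviour modelled is that a second writer is rejected immediately rather than queued.

```python
class LeaseTable:
    """Hypothetical in-memory stand-in for the file system's lease records."""

    def __init__(self):
        self._holders = {}  # packet file path -> client holding the write lease

    def acquire(self, packet_path, client_id):
        holder = self._holders.get(packet_path)
        if holder is not None and holder != client_id:
            # No transaction request queue is kept: fail right away.
            return False
        self._holders[packet_path] = client_id
        return True

    def release(self, packet_path, client_id):
        # Only the current holder can give the lease back.
        if self._holders.get(packet_path) == client_id:
            del self._holders[packet_path]

leases = LeaseTable()
first = leases.acquire("/tiles/18/packet.seq", "client-a")   # granted
second = leases.acquire("/tiles/18/packet.seq", "client-b")  # rejected
leases.release("/tiles/18/packet.seq", "client-a")
retry = leases.acquire("/tiles/18/packet.seq", "client-b")   # granted after release
```

Because both clients target the same packet file, the second write fails even though the two slices have different names, which is the conflict the paragraph describes.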

Preferably, step 105: establishing an index for the mass vector slice data, the data being associated through the index to form a net-structured data index of the mass vector slice data; the index table records the path of the mass vector slice data in the mass vector slice data packet. The index includes the path and name of the mass vector slice data and its offset in the packet, and the path includes a metadata node position, a row position and a column position. For example, a mass vector slice data path may be <18,0506>, where 18 is the metadata node position, 05 is the row position and 06 is the column position. To look up this piece of data, metadata node position 18 is located first, then the corresponding row 05, and then the corresponding column 06. In the index table, all mass vector slice data forms a spatial net-like index structure according to the metadata node position, row position and column position in its path. Embodiments of the present invention can thereby achieve shortest-path lookup of mass vector slice data.
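The three-step lookup in the <18,0506> example can be sketched as a nested mapping keyed by node position, row and column. This is an illustrative reading of the text, not the patented index structure itself; the two-digit row and column widths are inferred from the single example, and the slice names and offsets are made up.

```python
from typing import NamedTuple

class SlicePath(NamedTuple):
    node: int  # metadata node position
    row: int   # slice row position
    col: int   # slice column position

def parse_path(text):
    # "<18,0506>" -> node 18, row 05, column 06
    # (two-digit row/column fields are an assumption based on the example).
    node_part, grid_part = text.strip("<>").split(",")
    return SlicePath(int(node_part), int(grid_part[:2]), int(grid_part[2:]))

def add_entry(index, path, name, offset):
    # node -> row -> column -> (slice name, offset in the packet)
    index.setdefault(path.node, {}).setdefault(path.row, {})[path.col] = (name, offset)

def lookup(index, path):
    # Locate the node first, then the row, then the column.
    return index[path.node][path.row][path.col]

index = {}
add_entry(index, parse_path("<18,0506>"), "05_06.pbf", 16)
add_entry(index, parse_path("<18,0507>"), "05_07.pbf", 39)
hit = lookup(index, parse_path("<18,0506>"))
```

Each lookup touches exactly three levels regardless of how many slices are indexed, which is one way to read the "shortest path" property claimed in the text.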

Each layer of metadata nodes presets one metadata node for storing the data index table, and the mass vector slice data index is stored in the corresponding metadata node. The mass vector slice data index table recorded in the metadata is transmitted to the directory file, and a persistent mapping table of the mass vector slice data index is established on the client.

The index of vector slice data records the specific position of the vector slice data in the mass vector slice data packet file together with other attributes of the vector slice data; the client must create it after the mass vector slice data has been stored. An index record contains the name of the mass vector slice data, the path of the packet file in which it is located, and its offset within that packet file. The number of bits occupied by the packet file name determines the number of data files under a directory, and the number of bits occupied by the offset determines the size of a data file; the capacity of data that can be stored under one directory is therefore limited.
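The capacity limit mentioned above follows from simple arithmetic. The sketch below assumes a 16-bit packet file name field and a 32-bit byte offset purely for illustration; the source does not state either width.

```python
def directory_capacity(name_bits, offset_bits):
    """Return (max packet files, max bytes per packet, max bytes per directory)."""
    max_files = 2 ** name_bits         # distinct packet file names
    max_file_bytes = 2 ** offset_bits  # largest addressable offset range
    return max_files, max_file_bytes, max_files * max_file_bytes

# Assumed widths: 16 bits for the packet file name, 32 bits for the offset.
files, size, total = directory_capacity(16, 32)
```

Under these assumed widths a directory could hold 65,536 packet files of up to 4 GiB each, or 256 TiB in total; widening either field raises the limit at the cost of larger index records.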

Preferably, the mass vector slice data index is distributed to the data nodes for management. Although the index data is very large in total, once it is distributed over the metadata nodes the index data on any single metadata node is relatively small, and the cluster's ability to store mass vector slice data depends on the scale of the cluster. The cluster scale determines not only the storage capacity but, more importantly, the number of pieces of mass vector slice data that can be stored. A metadata node maintains the index of vector slice data and provides an index service to clients. The index position of vector slice data describes the metadata node that maintains the index of that vector slice data.

Preferably, the index of mass vector slice data is classified according to its parent directory, so that the indexes of mass vector slice data under the same directory are managed by metadata nodes of the same level. In view of this feature, the embodiment creates an index position mapping table to record the mapping between directories and metadata nodes. The index position mapping table is managed by the metadata node. When a client queries the index of mass vector slice data, it first needs to know the position of the metadata node that maintains that index. It passes the path of the mass vector slice data to the metadata node, which then searches the index position mapping table by the parent directory of the path to find the metadata node position. The present invention provides an index position maintenance module in the metadata node, dedicated to assigning data nodes to directories and maintaining the index position mapping table.
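The parent-directory lookup described above can be sketched as follows. `IndexPositionTable`, the directory names and the node identifiers are hypothetical, and POSIX-style paths are assumed.

```python
import posixpath

class IndexPositionTable:
    """Hypothetical directory -> node mapping kept by the metadata node."""

    def __init__(self):
        self._by_dir = {}

    def assign(self, directory, node_id):
        self._by_dir[directory] = node_id

    def locate(self, slice_path):
        # The table is searched by the parent directory of the slice path.
        parent = posixpath.dirname(slice_path)
        return self._by_dir.get(parent)

table = IndexPositionTable()
table.assign("/tiles/z12", "node-3")
owner = table.locate("/tiles/z12/05_06.pbf")
missing = table.locate("/tiles/z13/01_01.pbf")
```

Since every slice in a directory resolves to the same entry, the table grows with the number of directories rather than the number of slices.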

Preferably, the index position maintenance module selects, from all the data nodes maintained by the metadata node, nodes to assign to a directory. The index position mapping table is persisted on the local disk, and whenever its data changes, the content on disk is updated as well. If the index position maintenance module cannot find enough metadata nodes when assigning metadata nodes to a directory, the module places the unassigned directory directly into an allocation waiting queue. The content of this queue is also persisted to disk, and the disk copy is updated whenever a directory is added to or removed from the queue. When the metadata node starts, the queue data on disk is read into memory. The purpose of the queue is to wait until a new data node registers with and joins the distributed file system, at which point the index position maintenance module performs allocation again for the directories in the queue. Likewise, every update to the queue must be persisted.
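A minimal sketch of the maintenance module's behaviour follows, assuming a JSON state file and a simplified policy in which each directory consumes exactly one free node. The class name, file name and node identifiers are hypothetical; the source does not specify the allocation policy or the on-disk format.

```python
import json
import os
import tempfile
from collections import deque

class IndexPositionMaintainer:
    def __init__(self, state_path, nodes):
        self.state_path = state_path
        self.free_nodes = list(nodes)
        self.mapping = {}       # directory -> node
        self.waiting = deque()  # directories still waiting for a node
        if os.path.exists(state_path):
            # On start-up, read the persisted table and queue back into memory.
            with open(state_path, encoding="utf-8") as fh:
                state = json.load(fh)
            self.mapping = state["mapping"]
            self.waiting = deque(state["waiting"])

    def _persist(self):
        # Write to a temporary file, then rename, so the file is never half-written.
        tmp = self.state_path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as fh:
            json.dump({"mapping": self.mapping, "waiting": list(self.waiting)}, fh)
        os.replace(tmp, self.state_path)

    def assign(self, directory):
        if self.free_nodes:
            self.mapping[directory] = self.free_nodes.pop(0)
        else:
            self.waiting.append(directory)  # no node available: queue it
        self._persist()

    def register_node(self, node_id):
        # A newly registered node triggers allocation for queued directories.
        self.free_nodes.append(node_id)
        while self.waiting and self.free_nodes:
            self.mapping[self.waiting.popleft()] = self.free_nodes.pop(0)
        self._persist()

state_file = os.path.join(tempfile.mkdtemp(), "index_positions.json")
maintainer = IndexPositionMaintainer(state_file, ["node-1"])
maintainer.assign("/tiles/a")  # takes node-1
maintainer.assign("/tiles/b")  # no node left, so it is queued
restarted = IndexPositionMaintainer(state_file, [])  # simulated restart
restarted.register_node("node-2")
```

After the simulated restart the queued directory is still present, and registering a new node assigns it, which mirrors the purpose of persisting the queue.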

In the embodiment, a vector slice data index module is provided on the data node to maintain and manage the index of vector slice data and to provide an index service to clients. The module maintains the index records in memory, the index file, and the log file corresponding to the index file. The metadata node sorts the index records with a B-tree to speed up index lookup. An update to an index record first modifies the in-memory data structure and is not immediately synchronized to the corresponding index file; instead, the update is recorded in the Log file corresponding to the index file. After the data node starts, it reads the index file into memory in sorted order, updates the index data structure according to the Log, stores the index records now in memory back on the data node to replace the old index file, and empties the Log. The purpose of this is to prevent the index data in memory from being lost if the data node suddenly loses power.
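The log-and-replay recovery described above can be sketched as below. Note the substitutions: a plain dictionary written with sorted keys stands in for the B-tree, and JSON files stand in for the index and Log files; `SliceIndex` and the file names are hypothetical.

```python
import json
import os
import tempfile

class SliceIndex:
    def __init__(self, index_path, log_path):
        self.index_path, self.log_path = index_path, log_path
        self.records = {}  # slice name -> offset (stand-in for the B-tree)
        if os.path.exists(index_path):
            with open(index_path, encoding="utf-8") as fh:
                self.records = json.load(fh)
        if os.path.exists(log_path):
            # Replay updates that never reached the index file.
            with open(log_path, encoding="utf-8") as fh:
                for line in fh:
                    name, offset = json.loads(line)
                    self.records[name] = offset
        self._checkpoint()

    def _checkpoint(self):
        # Replace the old index file with the in-memory state, then empty the log.
        tmp = self.index_path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as fh:
            json.dump(self.records, fh, sort_keys=True)
        os.replace(tmp, self.index_path)
        open(self.log_path, "w").close()

    def put(self, name, offset):
        # Update memory, and record the change in the log instead of the index file.
        self.records[name] = offset
        with open(self.log_path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps([name, offset]) + "\n")
            fh.flush()
            os.fsync(fh.fileno())

workdir = tempfile.mkdtemp()
idx_file = os.path.join(workdir, "slices.idx")
log_file = os.path.join(workdir, "slices.log")
live = SliceIndex(idx_file, log_file)
live.put("05_06.pbf", 16)
live.put("05_07.pbf", 39)
# Simulated power loss: only the files survive, then the node starts again.
recovered = SliceIndex(idx_file, log_file)
```

The index file itself is rewritten only at start-up in this sketch, so each update costs one small append to the log, which is the trade-off the paragraph describes.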

Preferably, the mass vector slice data index table is cached on the client, which reduces the number of accesses to the metadata node and so improves access to the mass vector slice data. In the embodiment, caching the indexes of the mass vector slice data that a user accesses frequently on the client reduces the number of times the client accesses the metadata node and improves the efficiency of access to mass vector slice data.
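Client-side caching of frequently used index entries can be sketched with a small least-recently-used cache. The eviction policy, the capacity and the `fetch` callback standing in for a request to the metadata node are all assumptions; the source only states that frequently used index entries are cached on the client.

```python
from collections import OrderedDict

class CachedIndexClient:
    def __init__(self, fetch, capacity=1024):
        self._fetch = fetch          # hypothetical call to the metadata node
        self._cache = OrderedDict()  # slice path -> index entry, oldest first
        self._capacity = capacity
        self.remote_calls = 0

    def lookup(self, slice_path):
        if slice_path in self._cache:
            self._cache.move_to_end(slice_path)  # mark as recently used
            return self._cache[slice_path]
        self.remote_calls += 1
        entry = self._fetch(slice_path)
        self._cache[slice_path] = entry
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)      # evict the least recently used
        return entry

remote_index = {"/tiles/z12/05_06.pbf": ("packet.seq", 16)}
client = CachedIndexClient(remote_index.__getitem__, capacity=2)
results = [client.lookup("/tiles/z12/05_06.pbf") for _ in range(3)]
```

Three lookups of the same slice reach the metadata node only once here; the remaining two are answered from the client's cache.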

Preferably, step 106: providing the mass vector slice data index service through the mass vector slice data packet index table. The shortest path to the metadata node corresponding to the mass vector slice data packet is determined through the index table. The position of the vector slice data is then determined from the file header of the data packet file in the determined metadata node.

Fig. 2 is a structural diagram of a cloud storage system for mass vector slice data according to an embodiment of the present invention. System 200 includes:

a first generating unit 201, configured to establish a distributed file system directory tree file;

a second generating unit 202, configured to establish all metadata nodes corresponding to the distributed file system directory tree;

an aggregation unit 203, configured to aggregate, based on the distributed file system, the mass vector slice data under the same directory to generate a mass vector slice data packet;

a storage unit 204, configured to store the mass vector slice data packet in the metadata node;

a third generating unit 205, configured to generate the mass vector slice data index table and to establish, through the index table, the net structure of the mass vector slice data packet, the index table recording the path of the mass vector slice data in the mass vector slice data packet;

an index unit 206, configured to provide the mass vector slice data index service through the mass vector slice data index.

The cloud storage system 200 for mass vector slice data according to this embodiment corresponds to the cloud storage method 100 for mass vector slice data according to another embodiment of the present invention, and is not described again here.

The present invention has been described with reference to a small number of embodiments. However, as is known to those skilled in the art and as defined by the appended patent claims, embodiments other than those disclosed above equally fall within the scope of the invention.

Generally, all terms used in the claims are to be interpreted according to their ordinary meaning in the technical field, unless explicitly defined otherwise herein. All references to "a/an/the [device, component, etc.]" are to be interpreted openly as referring to at least one instance of said device, component, etc., unless explicitly stated otherwise. The steps of any method disclosed herein need not be performed in the exact order disclosed, unless explicitly stated.

In addition, those skilled in the art should understand that embodiments of the present application may be provided as a method, a system or a computer program product. Accordingly, the application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM and optical storage) containing computer-usable program code.

The application is described with reference to flow charts and/or block diagrams of methods, devices (systems) and computer program products according to embodiments of the application. It should be understood that each flow and/or block in the flow charts and/or block diagrams, and combinations of flows and/or blocks, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flow chart and/or one or more blocks of the block diagram.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flow chart and/or one or more blocks of the block diagram.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flow chart and/or one or more blocks of the block diagram.

Claims

1. A cloud storage method for mass vector slice data, the method comprising:

aggregating the mass vector slice data under the same directory in a distributed file system to generate a mass vector slice data packet;

storing the mass vector slice data packet in the metadata node;

establishing an index for the mass vector slice data, the mass vector slice data being associated through the index to form a net-structured data index of the mass vector slice data;

2. The method according to claim 1, the method comprising:

the mass vector slice data index includes the path and name of the mass vector slice data and its offset in the mass vector slice data packet;

3. The method according to claim 1, the method comprising:

transmitting the mass vector slice data index table stored in the metadata to the client, and establishing a persistent mapping table of the mass vector slice data index table.

4. The method according to claim 1, wherein the mass vector slice data packet includes a file header and at least one record;

the file header includes a file type, a version number, a file keyword, a file name and the position corresponding to each record;

each record corresponds to one piece of vector slice data and includes the length of the vector slice data, a key length, a key and a value.

5. The method according to claim 1, wherein the mass vector slice data packet is stored using a data file serialization method.

6. The method according to claim 1, further comprising: performing append storage at the tail of the mass vector slice data packet.

7. The method according to claim 1, the method comprising: caching the mass vector slice data index table on the client.

8. The method according to claim 4, further comprising a method for reading the mass vector slice data:

determining, through the mass vector slice data index table, the shortest path to the metadata node corresponding to the mass vector slice data packet;

determining the position of the vector slice data from the file header of the data packet file in the determined metadata node.

9. A cloud storage system for mass vector slice data, the system comprising:

an aggregation unit, configured to aggregate, based on a distributed file system, the mass vector slice data under the same directory to generate a mass vector slice data packet;

a third generating unit, configured to generate the mass vector slice data index table and to establish, through the index table, the net structure of the mass vector slice data packet, the index table recording the path of the mass vector slice data in the mass vector slice data packet;

an index unit, configured to provide the mass vector slice data index service through the mass vector slice data index.