CN106095796A

CN106095796A - Distributed data storage method, apparatus and system

Info

Publication number: CN106095796A
Application number: CN201610371832.3A
Authority: CN
Inventors: 吕家进; 徐朝辉; 胡军锋; 段永政; 张振山; 戚翯; 刘博闻; 崔金虎; 瞿红来; 钟亮
Original assignee: Postal Savings Bank of China Ltd
Current assignee: Postal Savings Bank of China Ltd
Priority date: 2016-05-30
Filing date: 2016-05-30
Publication date: 2016-11-09

Abstract

The invention discloses a distributed data storage method, apparatus and system. The method includes: screening the acquired basic data and determining the type of the basic data, where the type includes at least a structured type and an unstructured type; and storing the basic data to a first sub-storage cluster and/or a second sub-storage cluster according to the type. The invention solves the technical problem of high data access latency in existing distributed data storage systems.

Description

Distributed data storage method, apparatus and system

Technical field

The present invention relates to the Internet field, and in particular to a distributed data storage method, apparatus and system.

Background art

Hadoop originated from Apache Nutch. Hadoop technology is widely used in the Internet field and has also attracted broad attention from the research community. For example, Yahoo! runs Hadoop on a cluster of 4000 nodes to support research on its advertising system and Web search; Facebook runs Hadoop on a cluster of 1000 nodes to store log data and to support data analysis and machine learning on top of it; Baidu uses Hadoop to process 200 TB of data per week for search log analysis and web data mining; the China Mobile Research Institute developed the "Big Cloud" (BigCloud) system based on Hadoop, which is used for related data analysis and also provides services externally; and Taobao's Hadoop system stores and processes data related to e-commerce transactions.

Furthermore, domestic universities and research institutes have also carried out Hadoop-based research on data storage, resource management, job scheduling, performance optimization, system high availability and security.

However, existing Hadoop technology has the following problems:

1. Data access latency is high, so Hadoop is not suitable for low-latency data access operations.

2. Because data access latency is high, large numbers of small files cannot be stored efficiently.

3. Multi-user management is not supported, so writes and modifications by multiple users cannot be performed.

No effective solution has yet been proposed for the above problem of high data access latency in existing distributed data storage systems.

Summary of the invention

Embodiments of the present invention provide a distributed data storage method, apparatus and system, to at least solve the technical problem of high data access latency in existing distributed data storage systems.

According to one aspect of the embodiments of the present invention, a distributed data storage system is provided, including: a data acquisition server, configured to collect basic data; a data processing server, connected to the data acquisition server and configured to classify the basic data and determine its type, where the type includes at least a structured type and an unstructured type; and a distributed storage cluster, connected to the data processing server and configured to store basic data of the structured type to a first sub-storage cluster and basic data of the unstructured type to a second sub-storage cluster.

Further, the distributed storage cluster also includes: an index server, connected to the first sub-storage cluster and configured to generate data index information from the basic data of the structured type.

Further, the system also includes: a cache server, connected to the data processing server and configured to cache the basic data collected by the data acquisition server.

Further, the second sub-storage cluster uses the Hadoop HDFS distributed file storage architecture.

Further, the system also includes: an application server, connected to the distributed storage cluster and configured to provide a data interface for accessing the basic data stored in the distributed storage cluster.

According to another aspect of the embodiments of the present invention, a distributed data storage method is also provided, including: screening the acquired basic data and determining the type of the basic data, where the type includes at least a structured type and an unstructured type; and storing the basic data to a first sub-storage cluster and/or a second sub-storage cluster according to the type.

Further, after screening the acquired basic data and determining its type, the method also includes: generating metadata corresponding to the basic data from the basic data of the unstructured type; and storing the metadata to the first sub-storage cluster as basic data of the structured type.

Further, after storing the basic data to the first sub-storage cluster and/or the second sub-storage cluster according to the type, the method also includes: generating data index information from the basic data, where the data index information includes at least description information and storage location information of the basic data; and storing the data index information to an index server.

Further, storing the basic data to the first sub-storage cluster and/or the second sub-storage cluster according to the type includes: storing the basic data to a cache server according to the type; and, according to a preset storage policy, storing the basic data of the structured type to the first sub-storage cluster and the basic data of the unstructured type to the second sub-storage cluster.

According to yet another aspect of the embodiments of the present invention, a distributed data storage apparatus is also provided, including: a screening module, configured to screen the acquired basic data and determine the type of the basic data, where the type includes at least a structured type and an unstructured type; and a first storage module, configured to store the basic data to a first sub-storage cluster and/or a second sub-storage cluster according to the type.

Further, the apparatus also includes: a first generation module, configured to generate metadata corresponding to the basic data from the basic data of the unstructured type; and a second storage module, configured to store the metadata to the first sub-storage cluster as basic data of the structured type.

Further, the apparatus also includes: a second generation module, configured to generate data index information from the basic data of the structured type, where the data index information includes at least description information and storage location information of the basic data; and a third storage module, configured to store the data index information to an index server.

In the embodiments of the present invention, the acquired basic data is screened to determine its type, where the type includes at least a structured type and an unstructured type, and the basic data is stored to the first sub-storage cluster and/or the second sub-storage cluster according to the type. This improves the overall storage efficiency of the distributed storage cluster, achieves the technical effect of reducing its latency, and thereby solves the technical problem of high data access latency in existing distributed data storage systems.

Brief description of the drawings

The drawings described here are provided for further understanding of the present invention and form a part of this application. The illustrative embodiments of the present invention and their descriptions are used to explain the present invention and do not unduly limit it. In the drawings:

Fig. 1 is a system architecture diagram of a distributed data storage system according to an embodiment of the present invention;

Fig. 2 is a system architecture diagram of an optional distributed data storage system according to an embodiment of the present invention;

Fig. 3 is a system architecture diagram of an optional distributed data storage system according to an embodiment of the present invention;

Fig. 4 is a system architecture diagram of an optional distributed data storage system according to an embodiment of the present invention;

Fig. 5 is a flowchart of a distributed data storage method according to an embodiment of the present invention;

Fig. 6 is a schematic diagram of an optional distributed data storage apparatus according to an embodiment of the present invention;

Fig. 7 is a schematic diagram of an optional distributed data storage apparatus according to an embodiment of the present invention; and

Fig. 8 is a schematic diagram of an optional distributed data storage apparatus according to an embodiment of the present invention.

Detailed description of the invention

To help those skilled in the art better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the scope of protection of the present invention.

It should be noted that the terms "first", "second" and the like in the description, the claims and the drawings of the present invention are used to distinguish similar objects and do not necessarily describe a specific order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the present invention described here can be implemented in orders other than those illustrated or described here. In addition, the terms "include" and "have" and any variants of them are intended to cover non-exclusive inclusion; for example, a process, method, system, product or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product or device.

According to an embodiment of the present invention, a system embodiment of a distributed data storage system is provided. Fig. 1 is a system architecture diagram of the distributed data storage system according to an embodiment of the present invention. As shown in Fig. 1, the system includes: a data acquisition server 21, a data processing server 23 and a distributed storage cluster 25.

The data acquisition server 21 is configured to collect basic data. The data processing server 23 is connected to the data acquisition server 21 and is configured to classify the basic data and determine its type, where the type includes at least a structured type and an unstructured type. The distributed storage cluster 25 is connected to the data processing server 23 and is configured to store basic data of the structured type to a first sub-storage cluster 251 and basic data of the unstructured type to a second sub-storage cluster 253.

Specifically, with the data acquisition server 21, the data processing server 23 and the distributed storage cluster 25, the data processing server 23 classifies the collected basic data by type before it is stored in a distributed manner, and the basic data is then stored in different sub-storage clusters of the distributed storage cluster according to its type. Each type of basic data is thus stored in a sub-storage cluster whose storage format suits it.

The type of basic data is divided into at least the structured type and the unstructured type. Basic data of the structured type is row data that can be stored directly in a database and is logically represented with a two-dimensional table structure. Basic data of the unstructured type is defined relative to the structured type: it is data that is inconvenient to represent with the two-dimensional logical tables of a database, and it includes office documents in all formats, text, pictures, XML, HTML, all kinds of reports, images, and audio/video information.
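The type determination described above can be sketched as follows. This is a minimal illustration and not the classification rule of the embodiment: the `DataType` enum, the `classify` function and the choice of file suffixes treated as row data are all assumptions made for the example.

```python
from enum import Enum
from pathlib import PurePath


class DataType(Enum):
    STRUCTURED = "structured"
    UNSTRUCTURED = "unstructured"


# Hypothetical rule: delimited row data counts as structured; everything else
# (office documents, text, pictures, XML, HTML, audio/video) is unstructured.
STRUCTURED_SUFFIXES = {".csv", ".tsv"}


def classify(filename: str) -> DataType:
    """Determine the type of a piece of basic data from its file suffix."""
    suffix = PurePath(filename).suffix.lower()
    if suffix in STRUCTURED_SUFFIXES:
        return DataType.STRUCTURED
    return DataType.UNSTRUCTURED
```

A real data processing server would more likely inspect content or source-system tags than file names; the suffix rule only keeps the sketch short.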

As an optional embodiment, after classifying the basic data, the data processing server 23 may further screen the basic data whose type is unstructured: the content of files that record text information is extracted, content in picture formats is recognized by optical character recognition (OCR), the corresponding metadata is extracted, and the metadata is stored to the first sub-storage cluster as data of the structured type.
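A minimal sketch of deriving a structured metadata row from an unstructured item is given below. The `Metadata` fields and the `extract_metadata` function are hypothetical, and text extraction or OCR is assumed to have been done by some other component, whose result is passed in as `extracted_text`.

```python
import hashlib
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Metadata:
    name: str
    size_bytes: int
    sha256: str
    preview: str  # leading characters of the extracted text, if any


def extract_metadata(name: str, payload: bytes, extracted_text: str = "") -> Metadata:
    """Build a structured metadata record for one unstructured item."""
    return Metadata(
        name=name,
        size_bytes=len(payload),
        sha256=hashlib.sha256(payload).hexdigest(),
        preview=extracted_text[:40],
    )


# The record flattens to a row that a relational store could hold directly.
row = asdict(extract_metadata("voucher.txt", b"pay 100 CNY", "pay 100 CNY"))
```

Because the resulting row has fixed columns, it can be handled like any other structured basic data.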

With the data acquisition server 21, the data processing server 23 and the distributed storage cluster 25, basic data can be stored according to its type in the storage mode that suits it. This improves the overall storage efficiency of the distributed storage cluster, achieves the technical effect of reducing its latency, and solves the technical problem of high data access latency in existing distributed data storage systems.

As an optional embodiment, as shown in Fig. 2, the distributed storage cluster 25 may also include an index server 255.

The index server 255 is connected to the first sub-storage cluster 251 and is configured to generate data index information from the basic data of the structured type.

With the index server 255, index data can be generated from the storage locations of basic data of the structured type, and also from the storage locations of basic data of the unstructured type and of the corresponding metadata. On top of the existing exact index queries, metadata queries and structured data queries, the index server 255 makes it possible to combine multiple indexes for high-speed retrieval of unstructured data.
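One way to picture the combination of multiple indexes is a set intersection over per-field indexes that resolves to storage locations. The sketch below is an assumption-laden illustration: the `CombinedIndex` class, its field names and the example paths are hypothetical and are not taken from the embodiment.

```python
from collections import defaultdict


class CombinedIndex:
    """Toy index server: per-field indexes plus a storage-location table."""

    def __init__(self):
        self._by_field = defaultdict(set)  # (field, value) -> record ids
        self._location = {}                # record id -> storage location

    def add(self, record_id, location, **fields):
        self._location[record_id] = location
        for field, value in fields.items():
            self._by_field[(field, value)].add(record_id)

    def query(self, **criteria):
        """Return sorted storage locations of records matching all criteria."""
        if not criteria:
            return []
        id_sets = [self._by_field.get(item, set()) for item in criteria.items()]
        matched = set.intersection(*id_sets)
        return sorted(self._location[r] for r in matched)
```

Each extra criterion narrows the candidate set before any storage node is contacted, which is the point of keeping the index apart from the data.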

As an optional embodiment, as shown in Fig. 3, the system also includes a cache server 27.

The cache server 27 is connected to the data processing server 23 and is configured to cache the basic data collected by the data acquisition server.

Specifically, the cache server 27 can temporarily store the basic data collected by the data acquisition server 21 and, according to a preset storage policy, upload the basic data in batches to the distributed storage cluster 25.

As an optional embodiment, the cache servers can be arranged in levels according to data scale. Basic data is obtained level by level and, according to the preset storage policy, aggregated and uploaded level by level, so that the basic data is collected and organized.

In practical applications, the cache server 27 may include at least: provincial front-end cache servers (level-1 cache servers), a national-center front-end cache server (level-2 cache server), and a background processing server that interacts with the system (level-3 cache server).

Basic data of the unstructured type that is scanned and uploaded through a plug-in can be uploaded level by level through the cache servers to the second sub-storage cluster, which stores unstructured data. The unstructured data management platform in the second sub-storage cluster stores the unstructured data in basic storage units and feeds the association information back to the corresponding business system. The basic storage units in the second sub-storage cluster can be cut to a file block size according to the requirements of the business system or a preset storage policy.
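Cutting data into basic storage units of a configured block size can be sketched as below. The function name `split_into_units` and the way the block size is supplied are assumptions; the embodiment only states that the size follows business requirements or a preset storage policy.

```python
from typing import List


def split_into_units(payload: bytes, block_size: int) -> List[bytes]:
    """Cut a payload into consecutive units of at most block_size bytes."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [payload[i:i + block_size] for i in range(0, len(payload), block_size)]
```

Concatenating the units in order restores the original payload, so the association information fed back to the business system only needs the ordered list of unit locations.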

When basic data is retrieved, the front-end application service can send a retrieval request message directly to the unstructured data management platform in the second sub-storage cluster. The platform parses the request message, uses a retrieval engine to extract the unstructured data needed by the business system, and promptly feeds it back to the retrieval front-end server, where it is integrated and displayed in the business system.

As an optional embodiment, the second sub-storage cluster 253 uses the Hadoop HDFS distributed file storage architecture.

In practical applications, the existing storage architecture is replaced with the Hadoop HDFS distributed file storage architecture, mainly in consideration of the characteristics of Hadoop HDFS, in order to manage basic data better and to provide basic data support to business systems.

Hadoop HDFS supports linear scaling and multi-replica backup, which fully meets the unstructured data management platform's requirements for horizontal scaling of national-center data storage, for security, and for dynamic balancing of stored data across nodes. Hadoop can build a highly available (HA) NameNode, and the industry has many mature and reliable solutions for Hadoop HA, which provide guidance for the Master HA deployment mode of the national center. The rich functions provided by Hadoop can store and manage massive unstructured and structured data of diverse types, which provides a basis for the unstructured data management platform to store unstructured data by category. With Hadoop, MapReduce can be used to implement flexible cloud computing, which provides a foundation for extending to cloud computing while meeting future needs for distributed storage. With Hadoop, third-party tools or components such as HBase, Hive and ZooKeeper can be integrated more easily, enabling more powerful analysis functions and self-management capabilities and providing an environment for big data statistics as a next step.

As an optional embodiment, the second sub-storage cluster 253 may further use a Master HA storage architecture.

In practical applications, the distributed storage of the unstructured data management platform, which manages data of the unstructured type, may use a Master-Slave mode to perform work such as node analysis and data management on the storage nodes, so that the Master service becomes the processing core of the platform. Further, a mature existing Hadoop HA scheme can be combined with the actual application deployment so that the dual-machine Master is highly available, keeping the platform robust and stable in the event of an accident.

As an optional embodiment, as shown in Fig. 4, the distributed data storage system may also include an application server 29.

The application server 29 is connected to the distributed storage cluster 25 and is configured to provide a data interface for accessing the basic data stored in the distributed storage cluster.

In practical applications, to ensure that business systems are connected comprehensively and efficiently, system access is standardized through standard interface services. The application server 29 provides a unified interface service that supports access by external systems over multiple protocols, and through a series of access calls it exposes the various service components in the basic service framework of the unstructured data management platform. Combinations of access interface services are customized according to the business logic and needs of different systems, which yields a simple system access mode and saves time, investment and other costs.

As can be seen from the above, compared with the prior art, the distributed data storage system has the following characteristics:

An open-source distributed system is used to build a unified distributed data storage system for storing and managing massive data. Bank-type enterprises hold huge volumes of basic data of the unstructured type: the vouchers and archive data generated every day reach 2 TB, and the data to be stored and managed reaches the PB level. In this situation, Hadoop, an open-source project framework released by the Apache organization that adopts Google's ideas for storing and managing massive data, fits the design requirements. The distributed data storage system uses the Hadoop framework to build the distributed environment, merges massive numbers of small files for storage, and uses ZooKeeper to manage the resulting cluster.
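The idea of merging many small files for storage can be illustrated with a container blob plus an offset index. This is a generic sketch under stated assumptions: it is not the merge format of the embodiment and not a Hadoop file format, and the names `merge_small_files` and `read_merged` are hypothetical.

```python
import io
from typing import Dict, Tuple


def merge_small_files(files: Dict[str, bytes]) -> Tuple[bytes, Dict[str, Tuple[int, int]]]:
    """Pack small files into one blob and record (offset, length) per file."""
    container = io.BytesIO()
    index = {}
    for name, payload in files.items():
        index[name] = (container.tell(), len(payload))
        container.write(payload)
    return container.getvalue(), index


def read_merged(blob: bytes, index: Dict[str, Tuple[int, int]], name: str) -> bytes:
    """Read one original file back out of the merged blob."""
    offset, length = index[name]
    return blob[offset:offset + length]
```

Storing one large blob instead of many tiny files reduces the number of objects the file system has to track, while the index keeps each original file individually addressable.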

A large number of inexpensive PC server clusters and low-end arrays replace the system hardware architecture of traditional high-end storage solutions. By using an open-source technical architecture, the distributed data storage system based on Hadoop open-source technology not only meets the bank's requirements for nationwide centralized storage and management of massive data and for providing loosely coupled services to business systems, but also lays an architectural foundation for future in-depth mining of the value of unstructured and semi-structured basic data. This saves the enterprise substantial capital investment and reduces the cost of data infrastructure, while access efficiency is in no way inferior to professional high-end storage and is even higher when storing massive unstructured data, which increases the value of unstructured data in a big data environment.

The distributed data storage system based on Hadoop open-source technology has very good scalability and stability. The distributed storage architecture not only relieves the performance pressure brought by expansion, but also makes it easy to add, debug and deploy equipment. This saves the enterprise the large human and material costs of upgrades, reduces the potential risks of system upgrades, and keeps the platform running stably in production over the long term.

Based on big data management, the distributed data storage system based on Hadoop open-source technology provides bank-type enterprises with a solution for storing and sharing massive unstructured data. It also provides full life-cycle management of basic data of the unstructured type and has a complete security authentication mechanism, so it can provide a complete process implementation for the content-driven business of bank-type enterprises.

Distributed full-text indexing complements relational database queries to meet the requirement for efficient data retrieval. Metadata stored in a relational database faces problems such as huge data volume and low retrieval efficiency. A distributed full-text index solves the problem that the relational database cannot perform fuzzy search, while batch exact retrieval relies on the strengths of the traditional database. The two document retrieval modes thus complement each other and meet the bank's requirements for using unstructured data.
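The complementary retrieval modes can be pictured with a single in-memory toy that offers an exact key lookup next to a keyword lookup. Everything here is an assumption for illustration: the `DocumentSearch` class is hypothetical, the exact path stands in for the relational database, and a plain token index stands in for a distributed full-text index (it matches whole words only, not true fuzzy matches).

```python
import re
from collections import defaultdict


class DocumentSearch:
    def __init__(self):
        self._by_id = {}                   # exact lookup, the relational-style path
        self._postings = defaultdict(set)  # token -> document ids, the full-text path

    def add(self, doc_id, text):
        self._by_id[doc_id] = text
        for token in re.findall(r"\w+", text.lower()):
            self._postings[token].add(doc_id)

    def get_exact(self, doc_id):
        """Exact retrieval by primary key."""
        return self._by_id.get(doc_id)

    def search_keyword(self, word):
        """Keyword retrieval: ids of all documents containing the word."""
        return sorted(self._postings.get(word.lower(), set()))
```

A caller that knows the identifier uses `get_exact`; a caller that only remembers a word in the document uses `search_keyword` and then fetches the hits by identifier.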

The distributed data storage system based on Hadoop open-source technology collects, manages and shares basic data of the unstructured type across the business systems. It enables business processes to be optimized and reengineered and makes the management of unstructured data such as archives more scientific and reasonable. It provides a strong basic platform for centralized control and standardized management of image files and data files in support of the future business development of bank-type enterprises. It turns internal control into system processes and embeds rules and regulations into business processes, ultimately achieving workflow optimization and reengineering and laying a foundation for moving from a traditional department-oriented bank toward a functional bank.

According to an embodiment of the present invention, a method embodiment of a distributed data storage method is provided. It should be noted that the steps shown in the flowchart of the drawings can be executed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown in the flowchart, in some cases the steps shown or described can be performed in a different order.

Fig. 5 is a flowchart of the distributed data storage method according to an embodiment of the present invention. As shown in Fig. 5, the method includes the following steps:

Step S21: screen the acquired basic data and determine the type of the basic data, where the type includes at least a structured type and an unstructured type.

Step S23: store the basic data to the first sub-storage cluster and/or the second sub-storage cluster according to the type.

Specifically, in steps S21 to S23, the type of the acquired basic data is determined by data screening, and the basic data is stored to the predetermined storage cluster in the corresponding storage format according to its type. This improves the overall storage efficiency of the distributed storage cluster, achieves the technical effect of reducing its latency, and solves the technical problem of high data access latency in existing distributed data storage systems.
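Steps S21 and S23 can be sketched as a small dispatcher. The `StorageRouter` class is hypothetical, the two lists merely stand in for the sub-storage clusters, and the screening rule (a `dict` is a structured row, anything else is unstructured) is an assumption chosen to keep the example short.

```python
STRUCTURED, UNSTRUCTURED = "structured", "unstructured"


class StorageRouter:
    def __init__(self):
        self.first_sub_cluster = []   # stand-in for the structured store
        self.second_sub_cluster = []  # stand-in for the unstructured store

    def screen(self, item):
        """Step S21: determine the type of one item of basic data."""
        return STRUCTURED if isinstance(item, dict) else UNSTRUCTURED

    def store(self, item):
        """Step S23: store the item in the sub-cluster matching its type."""
        data_type = self.screen(item)
        if data_type == STRUCTURED:
            self.first_sub_cluster.append(item)
        else:
            self.second_sub_cluster.append(item)
        return data_type
```

Keeping screening and storing as separate methods mirrors the two steps and lets the screening rule change without touching the storage dispatch.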

As an optional embodiment, after the acquired basic data is screened and its type is determined in step S21, the method also includes:

Step S221: generate metadata corresponding to the basic data from the basic data of the unstructured type.

Step S223: store the metadata to the first sub-storage cluster as basic data of the structured type.

Specifically, in steps S221 to S223, after the basic data is classified, the content of the basic data whose type is unstructured is extracted to obtain metadata that describes that basic data. The metadata is then stored to the first sub-storage cluster as basic data of the structured type, to improve read and write efficiency.

As an optional embodiment, after the basic data is stored to the first sub-storage cluster and/or the second sub-storage cluster according to the type in step S23, the method also includes:

Step S25: generate data index information from the basic data, where the data index information includes at least description information and storage location information of the basic data.

Step S27: store the data index information to the index server.

Specifically, in steps S25 to S27, data index information is generated from the content description information, the storage location and/or the association relationships of the basic data, and the data index information is stored to the index server. This reduces the load on the distributed storage cluster and improves the efficiency of the distributed storage system as a whole.

As an optional embodiment, storing the basic data to the first sub-storage cluster and/or the second sub-storage cluster according to the type in step S23 includes:

Step S231: store the basic data to the cache server according to the type.

Step S233: according to the preset storage policy, store the basic data of the structured type to the first sub-storage cluster and the basic data of the unstructured type to the second sub-storage cluster.

Specifically, cache servers can be provided in the distributed data storage system, and these cache servers can be arranged in levels. The cache servers are used to temporarily store the basic data collected by the data acquisition server. According to the preset storage policy, the basic data is aggregated by type and uploaded level by level to the first sub-storage cluster and the second sub-storage cluster.
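Level-by-level caching with a storage policy can be sketched as a chain of tiers that each forward a batch upstream once a threshold is reached. The `CacheTier` class, the count-based threshold and the tier names are assumptions; the embodiment does not specify what the preset storage policy is.

```python
class CacheTier:
    """One cache level that batches items and pushes them to the next level."""

    def __init__(self, name, flush_threshold, upstream):
        self.name = name
        self.flush_threshold = flush_threshold  # hypothetical policy: flush by count
        self.upstream = upstream  # next CacheTier, or a list standing in for a sub-cluster
        self.pending = []

    def put(self, item):
        self.pending.append(item)
        if len(self.pending) >= self.flush_threshold:
            self.flush()

    def flush(self):
        batch, self.pending = self.pending, []
        for item in batch:
            if isinstance(self.upstream, CacheTier):
                self.upstream.put(item)
            else:
                self.upstream.append(item)


# Two cache levels in front of one sub-storage cluster (modelled as a list).
cluster = []
national = CacheTier("national", flush_threshold=4, upstream=cluster)
provincial = CacheTier("provincial", flush_threshold=2, upstream=national)
```

With these thresholds the provincial tier forwards every two items and the national tier writes to the cluster every four, so the cluster sees fewer, larger writes than the collection side produces.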

According to an embodiment of the present invention, an apparatus embodiment of a distributed data storage apparatus is also provided. As shown in Fig. 6, the distributed data storage apparatus includes a screening module 31 and a first storage module 33.

The screening module 31 is configured to screen the acquired basic data and determine the type of the basic data, where the type includes at least a structured type and an unstructured type. The first storage module 33 is configured to store the basic data to the first sub-storage cluster and/or the second sub-storage cluster according to the type.

Specifically, with the screening module 31 and the first storage module 33, the type of the acquired basic data is determined by data screening, and the basic data is stored to the predetermined storage cluster in the corresponding storage format according to its type. This improves the overall storage efficiency of the distributed storage cluster, achieves the technical effect of reducing its latency, and solves the technical problem of high data access latency in existing distributed data storage systems.

As an optional embodiment, as shown in Fig. 7, the apparatus may also include a first generation module 321 and a second storage module 323.

The first generation module 321 is configured to generate metadata corresponding to the basic data from the basic data of the unstructured type. The second storage module 323 is configured to store the metadata to the first sub-storage cluster as basic data of the structured type.

Specifically, with the first generation module 321 and the second storage module 323, after the basic data is classified, the content of the basic data whose type is unstructured is extracted to obtain metadata that describes that basic data. The metadata is then stored to the first sub-storage cluster as basic data of the structured type, to improve read and write efficiency.

As an optional embodiment, as shown in Fig. 8, the apparatus may also include a second generation module 35 and a third storage module 37.

The second generation module 35 is configured to generate data index information from the basic data of the structured type, where the data index information includes at least description information and storage location information of the basic data. The third storage module 37 is configured to store the data index information to the index server.

Specifically, with the second generation module 35 and the third storage module 37, data index information is generated from the content description information, the storage location and/or the association relationships of the basic data, and the data index information is stored to the index server. This reduces the load on the distributed storage cluster and improves the efficiency of the distributed storage system as a whole.

Further, as an optional embodiment, the first storage module 33 may perform the following steps:

Store the basic data to the cache server according to the type; and, according to the preset storage policy, store the basic data of the structured type to the first sub-storage cluster and the basic data of the unstructured type to the second sub-storage cluster.

The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.

In the above embodiments of the present invention, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, refer to the related descriptions of the other embodiments.

In the several embodiments provided in this application, it should be understood that the disclosed technical content may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into units may be a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, units, or modules, and may be electrical or take other forms.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.

The above is only the preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make improvements and modifications without departing from the principles of the present invention, and such improvements and modifications shall also be regarded as falling within the protection scope of the present invention.

Claims

1. A distributed data storage system, characterized by comprising:

a data acquisition server, configured to collect basic data;

a data processing server, connected to the data acquisition server and configured to classify the basic data and determine the type of the basic data, wherein the type at least includes: a structured type and an unstructured type;

a distributed storage cluster, connected to the data processing server and configured to store the basic data of the structured type to a first sub-storage cluster and to store the basic data of the unstructured type to a second sub-storage cluster.

2. The system according to claim 1, characterized in that the distributed storage cluster comprises:

an index server, connected to the first sub-storage cluster and configured to generate data indexing information according to the basic data of the structured type.

3. The system according to claim 2, characterized in that the system further comprises:

a caching server, connected to the data processing server and configured to cache the basic data collected by the data acquisition server.

4. The system according to claim 1, characterized in that the second sub-storage cluster uses the Hadoop HDFS distributed file storage architecture.

5. The system according to any one of claims 1 to 4, characterized in that the system further comprises:

an application server, connected to the distributed storage cluster and configured to provide a data interface for accessing the basic data stored in the distributed storage cluster.

6. A distributed data storage method applied to the system according to any one of claims 1 to 5, characterized by comprising:

screening the acquired basic data and determining the type of the basic data, wherein the type at least includes: a structured type and an unstructured type;

storing the basic data to a first sub-storage cluster and/or a second sub-storage cluster according to the type.

7. The method according to claim 6, characterized in that, after screening the acquired basic data and determining the type of the basic data, the method further comprises:

generating, according to the basic data of the unstructured type, metadata corresponding to the basic data;

storing the metadata to the first sub-storage cluster as basic data of the structured type.

8. The method according to claim 7, characterized in that, after storing the basic data to the first sub-storage cluster and/or the second sub-storage cluster according to the type, the method further comprises:

generating data indexing information according to the basic data, wherein the data indexing information at least includes: description information and storage position information of the basic data;

storing the data indexing information to an index server.

9. The method according to claim 8, characterized in that storing the basic data to the first sub-storage cluster and/or the second sub-storage cluster according to the type comprises:

storing the basic data to a caching server according to the type;

according to a preset storage policy, storing the basic data of the structured type to the first sub-storage cluster, and storing the basic data of the unstructured type to the second sub-storage cluster.

10. A distributed data storage apparatus, characterized by comprising:

a screening module, configured to screen the acquired basic data and determine the type of the basic data, wherein the type at least includes: a structured type and an unstructured type;

a first memory module, configured to store the basic data to a first sub-storage cluster and/or a second sub-storage cluster according to the type.