CN103916459A - Big data filing and storing system

Info

Publication number: CN103916459A
Application number: CN201410077302.9A
Authority: CN
Inventors: 孙知信; 胡燕平; 宫婧; 王攀
Original assignee: Nanjing Post and Telecommunication University
Current assignee: Nanjing Post and Telecommunication University; Nanjing University of Posts and Telecommunications
Priority date: 2014-03-04
Filing date: 2014-03-04
Publication date: 2014-07-09

Abstract

The invention discloses a big data filing and storing system formed by connecting, in sequence, a data source, a standard interface, a cloud database, a management module, a scalable management framework, an operating system, and a storage medium. The system is characterized in that a monitor module is connected between the standard interface and the cloud database, an access grouping module is connected between the scalable management framework and the operating system, and the scalable management framework is further connected with a RAID stripe optimization module. These new modules are built on a basic filing system framework so that the filing system supports cloud computing, can process a large number of IO operations from clients, and provides low-latency data access. The error detection mechanism of the HDD is also optimized, which improves the error detection efficiency of the RAID system. Together, these three aspects increase the data throughput of the filing system so that it can meet the challenge of big data filing.

Description

A big data filing and storage system

Technical field

The present invention relates to the field of data storage, and in particular to a big data filing and storage system.

Background technology

In the big data environment, cloud computing technology has matured and large IT enterprises are advancing the deployment of cloud storage, so a variety of intelligent cloud storage systems have emerged. The cloud storage system (CSS, Cloud Storage System) is also evolving into an operable cloud backup system, and backup and data filing are converging. Traditional filing technology faces new challenges: the database and the pooled hardware and software resources of a cloud computing environment require the filing system to offer new interfaces, so filing is no longer simple data ingestion and access. Long-term filing of massive data must take the efficiency of data retrieval into account, and hierarchical storage management is a well-suited filing mode. Tape is normally used as the medium for data that is seldom used (write once, read never or maybe: written once and never or rarely read), while disk can be used to file data that is expected to be retrieved. In a filing system extended to the cloud computing environment, the lower layer needs a filing module that performs tiering. It needs to predict data retrieval and, at the same time, analyze the retrieval history of the filed data so that data is tiered in time. This is similar to a computer memory hierarchy: from the hard disk at the bottom to the high-performance cache next to the CPU, capacity decreases while access speed increases. Tier management also involves media migration, which must consider the characteristics of both the filed data and the media, so that data is preserved and the media remain stable throughout data migration and media migration.

To cope with information-based public administration and the electronic operation of enterprises and institutions, an energy-efficient filing system is bound to become an important guarantee for supporting the multidimensional growth of data. Within Information Lifecycle Management, filing has always been a neglected link, mainly because the tape-based filing mode has been disrupted by the cloud era and by big data. In the past, researchers focused on data storage, mainly because of the pressure that the data acquisition capacity of sensor resources placed on storage systems. After long-term research and practice, distributed storage systems and cloud storage have gradually achieved effective storage of big data, and the focus of research is shifting to filing systems under the new computing environment.

Holographic storage media, organometallic composite films, DNA (deoxyribonucleic acid, the main chemical component of chromosomes), and quartz glass plates are expected to displace tape and optical disc as the main long-term filing media. Until industry-standard storage interfaces for them appear, filing will still rely mainly on filing systems that use hard disk as the first-tier storage medium. There has been much research on and design of filing systems at home and abroad. You et al. proposed a deep archival storage system that adopts a virtual content-addressable storage framework and multi-mode inter-file and intra-file compression mechanisms. It effectively handles data compression under varying data dependency, measures the efficiency of content and metadata storage, shows the need for a replication model of variable degree, and provides preliminary results on storage performance. In that framework, MD5 (Message Digest Algorithm 5) or SHA-1 (Secure Hash Algorithm 1) is used to compute the main part of the virtual directory address of each file. In a big data environment, computing a hash value for every file adds load to the system.

Sensing technology has made stream data ubiquitous. Stream data is produced without interruption, which tests the current storage and filing capabilities of enterprises. Abe et al. proposed an operation-merging mechanism for filing stream data. Most operations are access or modification operations; because the merge may introduce high latency for the accessing party, which may be unable to read data that has logically already been written, the time window of the merge operation must be controlled. Wildani et al. place filed data according to semantics: based on the semantics of the access history, an indexer builds a semantics-based access catalog. When repeated accesses contradict the semantic logic, the index faces a serious challenge.

The prior art does not yet treat the current state of cloud storage as a key difficulty of big data filing. Filing system designs lack seamless integration with the cloud database, and data is often migrated from the data source directly into the filing medium, which increases the pressure on later data access. Bursts of access place a very heavy load on the system, and the response to such concentrated access still relies on caching and multiplexing mechanisms. As for error detection on filing disks, Klein proposed a scheme for improving filing from the perspective of the RAID (Redundant Arrays of Independent Disks) system, in which the unit block at the greatest distance is checked first. However, the scheme simply assumes that the farthest unit block has a higher error probability, which lacks a theoretical basis, and it gives no concrete method for optimizing stripes.

Summary of the invention

To solve the above technical problems, the present invention adopts the following technical solution:

A big data filing and storage system is formed by connecting, in sequence, a data source, a cloud database, a management module, a scalable management framework, an operating system, and a storage medium. It is characterized in that a monitor module is connected between the data source and the cloud database. The monitor is connected to the cloud database, records how the database is accessed, formulates different filing policies according to the access pattern, and transfers data from the cloud database to the lower-layer filing storage system. An access grouping module is also connected between the scalable management framework and the operating system. The access grouping module uses ontologies to train a support vector machine (SVM) and groups accesses according to the ontologies, which reduces the number of disk rotations. The scalable management framework is also connected to a RAID stripe optimization module, which uses stripe region exchange to swap an erroneous block on a stripe with a block on the other stripe that has the smallest access volume, thereby improving the protection of the data on disk.

The scalable management framework coordinates the concurrent operations of the functional modules.

The management module comprises index and metadata query, policy management, and metadata generation.

The storage medium is disk or tape.

The present invention builds new modules on a basic filing system framework. It ensures that the filing system supports cloud computing, allows it to process a large number of IO (input/output) operations from clients, and guarantees low-latency data access. For error detection, the error detection mechanism of the HDD (hard disk drive) is optimized, which improves the error detection efficiency of the RAID system. These three aspects together raise the data throughput of the filing system, so that the system can meet the challenge of big data filing.

Description of the drawings

Fig. 1 is a schematic diagram of the overall framework of the big data filing and storage system.

Fig. 2 is a schematic diagram of cloud database monitoring.

Fig. 3 is a diagram of the ontology-based access grouping method.

Fig. 4 is a schematic diagram of RAID stripe optimization.

Embodiment

The technical solution is further described below with reference to the accompanying drawings.

Fig. 1 shows the overall framework of the big data filing and storage system, which is formed by connecting, in sequence, a data source, a standard interface, a cloud database, a management module, a scalable management framework, an operating system, and a storage medium.

1) Data source: the data sources are mainly the storage systems and user management programs served by the filing system. Through the standard interface, all heterogeneous data is stored in the cloud database, and the data in the cloud database can be accessed in real time.

2) Standard interface: the standard interface is mainly used to ingest and access filed data. It combines hardware and software technology and lets the filing system interact with the upper-layer storage and with users. The system supports many industry-standard interfaces and application programming interfaces (APIs): one kind of interface for document ingestion and one for document retrieval, for example Extensible Access Method (XAM), a data interface that supports complex data types and semantics.

3) Monitor module: records data newly ingested into the cloud database and builds block-based access records. Both data ingestion and data access wake the monitor at the periphery of the cloud data. The monitor module records the storage location of new data and the cumulative access count of accessed data. When the access count of data falls below a threshold and the data filing policy is satisfied, the data is migrated to the lower-layer filing system.

Functions of the monitor module: 1) read the settings of the cloud database and, on a cloud database that has no data classification or data blocking, partition virtual blocks according to policy; 2) distinguish between public cloud and private cloud: a private cloud can interact directly with the target filing layer, whereas a public cloud must additionally establish a connection with the target filing system so that enterprise data can be filed independently; 3) monitor how the database is accessed and set a threshold: data whose access and modification count within the specified time period is below the threshold is deduplicated and then moved automatically into the filing system through the network port.
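The threshold rule in function 3) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the names `Monitor`, `BlockRecord`, `on_ingest`, `on_access`, and `end_of_window`, the block identifiers, and the policy callback are all hypothetical, and deduplication and the network transfer are left out.

```python
from dataclasses import dataclass


@dataclass
class BlockRecord:
    location: str      # where the block lives in the cloud database
    accesses: int = 0  # accesses and modifications counted in the current window


class Monitor:
    """Hypothetical monitor: counts block accesses and selects cold blocks."""

    def __init__(self, threshold, policy):
        self.threshold = threshold  # assumed per-window access threshold
        self.policy = policy        # callable(block_id, record) -> bool
        self.records = {}

    def on_ingest(self, block_id, location):
        # Ingestion wakes the monitor, which records the new block's location.
        self.records[block_id] = BlockRecord(location)

    def on_access(self, block_id):
        # Every access or modification increments the block's counter.
        self.records[block_id].accesses += 1

    def end_of_window(self):
        """Return the blocks to migrate, then start a new counting window."""
        cold = [b for b, r in self.records.items()
                if r.accesses < self.threshold and self.policy(b, r)]
        for b in cold:
            del self.records[b]  # the block now belongs to the filing layer
        for r in self.records.values():
            r.accesses = 0
        return cold


monitor = Monitor(threshold=2, policy=lambda block_id, record: True)
monitor.on_ingest("b1", "clouddb1/0")
monitor.on_ingest("b2", "clouddb2/7")
for _ in range(3):
    monitor.on_access("b1")
to_archive = monitor.end_of_window()
```

In this run only `b2` falls below the threshold, so it is the only block handed to the filing layer; `b1` stays in the cloud database with its counter reset.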

Fig. 2 shows the proxy filing scheme under a public cloud. Data may be stored at random across multiple cloud databases, such as cloud database 1, cloud database 2, and cloud database 3 in Fig. 2. The monitor in the figure is connected to the cloud databases over the network. Based on the recorded access to data blocks and the defined filing policy, the monitor transfers data from the cloud databases over the network to the lower-layer filing storage system. The filing system in the figure embodies the idea of cloud computing: a small enterprise with limited storage resources can store its data through the public cloud according to the index of the filing system, and caching can also speed up client access.

4) Cloud database: the cloud-hosted database that is monitored;

5) Index and metadata query: builds an index for the filed data and provides a query channel;

6) Policy management and audit: provides the filing policies and the log audit function;

7) Metadata generation and lookup service: generates the corresponding metadata for the filed data and, together with the data itself, supports locating and looking up the data;

Index and metadata query, policy management, and metadata generation are parallel management modules. The index and metadata query module builds the index of the filed data set and provides fast query service. The metadata generation module generates the metadata of the data and works with the index and metadata query module to store the filed data. The policy management module lets the administrator of the filing system constrain the system with data management rules that carry specific requirements. The administrator updates, deletes, and monitors policies through the standard entry point of the policy management module. The modules communicate with one another: the policy management module mainly sends policy packets to the index and metadata query module and to the metadata generation module. After receiving a policy packet, those two modules sort and file data according to the new policy.

8) Scalable storage architecture: provides the core management capability for filed data, coordinates the concurrent operations of the functional modules, and supports system expansion. The access grouping module and the RAID stripe optimization module are shown in Fig. 3 and Fig. 4. The operating system module is the connecting layer between hardware and software. It is a large-scale basic resource management module formed by combining the operating systems of the various hardware, and it manages storage media such as disk, optical disc, and tape.

9) Access grouping module: uses ontologies to train a support vector machine (SVM) and groups accesses according to the ontologies, which reduces the number of disk rotations;

In the access grouping module, ontologies are used to train the SVM, so the index must be able to generate semantic ontologies. An ontology represents one type of resource. Resources of that type are stored in physically concentrated locations and are semantically related. After semantic training, bursts of accesses are classified into access groups, and each access group corresponds to one class of ontology, as shown in Fig. 3. When the system catalog changes by the target proportion, new ontologies must be generated and the SVM retrained.

The method of training the SVM with ontologies is summarized as follows: (1) several ontology libraries are generated from the catalog in the index service module of Fig. 3, such as ontology library 1, ontology library 2, and ontology library 3 in Fig. 3; (2) sample sets of equal size are chosen at random from each ontology library to train the SVM classifier shown in Fig. 3; (3) the keywords or words of the IO query accesses that enter the index service are fed to the classifier, which assigns them to access group 1, access group 2, or access group 3 in Fig. 3; (4) after the index catalog changes substantially, step (1) is carried out again to obtain new ontologies and a new classifier.
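Steps (1) to (3) can be illustrated with a small, self-contained Python sketch. It is not the classifier of the invention: to stay dependency-free it trains a one-vs-rest linear classifier by subgradient descent on the hinge loss (the SVM loss without the regularization term), whereas a real deployment would use a full SVM library. The ontology library names, the keywords, and the function names are hypothetical, and the samples are fixed rather than drawn at random so that the run is reproducible.

```python
def featurize(keywords, vocab):
    """Binary bag-of-words vector over a fixed vocabulary."""
    return [1.0 if word in keywords else 0.0 for word in vocab]


def train_hinge(X, y, epochs=20, lr=1.0):
    """Subgradient descent on the hinge loss for labels +1/-1 (no regularizer)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if label * score < 1:  # inside the margin: take a subgradient step
                w = [wi + lr * label * xi for wi, xi in zip(w, x)]
                b += lr * label
    return w, b


def train_groups(libraries):
    """Step (2): train one one-vs-rest classifier per ontology library."""
    vocab = sorted({w for samples in libraries.values() for s in samples for w in s})
    rows = [(featurize(s, vocab), name)
            for name, samples in libraries.items() for s in samples]
    X = [x for x, _ in rows]
    models = {name: train_hinge(X, [1 if n == name else -1 for _, n in rows])
              for name in libraries}
    return vocab, models


def classify(keywords, vocab, models):
    """Step (3): route an IO query to the access group with the highest score."""
    x = featurize(keywords, vocab)

    def score(name):
        w, b = models[name]
        return sum(wi * xi for wi, xi in zip(w, x)) + b

    return max(models, key=score)


# Step (1): hypothetical ontology libraries derived from the index catalog.
libraries = {
    "finance": [{"invoice", "tax"}, {"tax", "audit"}, {"invoice", "audit"}],
    "medical": [{"patient", "scan"}, {"scan", "report"}, {"patient", "report"}],
    "sensor": [{"temperature", "humidity"}, {"humidity", "pressure"},
               {"temperature", "pressure"}],
}
vocab, models = train_groups(libraries)

# Group a queue of IO queries so that each group can be served together.
queue = [{"invoice", "audit"}, {"humidity"}, {"report"}, {"audit"}]
groups = {}
for query in queue:
    groups.setdefault(classify(query, vocab, models), []).append(sorted(query))
```

Queries that land in the same access group target physically concentrated data, so serving them as a batch is what lets the disk handle more requests per rotation. Step (4) would correspond to calling `train_groups` again on the regenerated libraries.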

Ontologies are commonly used for database modeling: data that becomes an instance of an ontology is attached to the corresponding ontology library. Using the ontology method together with the document indexing server achieves classification and load balancing of access IO operations. As data sets grow by orders of magnitude, the volume of data access grows with them. The filing system must manage this much data while still guaranteeing a fast response to accesses, so the layout of data on disk must be optimized so that the disk serves more accesses in each rotation.

Only a small fraction of filed data is active. Fig. 3 shows access grouping implemented by generating several ontologies from the semantic features of the index and of the accesses. The SVM method classifies the accesses that arrive within a specified time interval. The process can be described in three stages: the index catalog generates the semantic ontology libraries; the trained SVM classifies access IO activity according to the ontology libraries; and the ontology libraries, following the access volume and the semantic complexity of the accesses, dynamically and quickly locate the disk to which the target ontology points, which improves processing efficiency.

10) RAID stripe optimization module: adjusts the detection order of the fixed units and adjusts stripe regions that contain multiple errors.

For the sake of system scalability, long-term data storage needs to be integrated with the big data storage that precedes filing, such as HDFS (Hadoop Distributed File System) storage solutions. The storage solutions of the IT giants can help enterprises back up and file data, including tiered storage and the rotation and reuse of media, but their performance in the face of big data filing is still unknown. There is an urgent need for the system to support the cloud database and to expand the storage capacity of the media, and it also needs mechanisms for coping with big data as access volume grows and in error detection.

Disk is still the primary storage medium of the filing system. It sits at the first tier of the media hierarchy and can be regarded as the cache of the filing system. Optimizing disk storage technology, and detecting and repairing disk errors in time, prevents erroneous data from being rotated into other media with poorer access capability and thus reduces system load.

The optimized RAID system can effectively reduce disk errors. A layered monitoring model monitors the whole RAID system and its performance metrics, locates faults, and prevents data loss. The layered model starts at the top from the RAID controller. Each layer represents actual data rather than a different medium, because RAID technology itself does not apply to media such as tape. Data in which errors have occurred and data that has been scrubbed must undergo redundancy checks and be stored separately. The ability to correct errors depends on the redundant information of each code, and the extended method corrects errors by changing the RAID layout and the stripe length.

For error checking, HDD error detection divides each disk into regions of fixed length and then divides each region into smaller units. The region size is 128 MB and the unit size is 1 MB. During detection, a fixed unit in each region is checked first, and the process then repeats in cycles. In the optimized RAID system, disks that are approaching the end of their lifetime are scrubbed continuously. For disks that have not reached the end of their lifetime, each subsequent round gives priority to the unit that is farthest, within the region, from the unit checked in the previous round, because the error probability near a unit that was found to be error-free is far lower than the error probability of units far from it. On a RAID stripe, when multiple erroneous regions are found, the corresponding region of another stripe with a smaller access volume is chosen for the exchange.
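The farthest-unit-first detection order for one region can be sketched as follows. The region and unit sizes come from the text; everything else is an assumption made for illustration: the function name `scrub_order`, the choice of unit 0 as the first fixed unit, the rule that a unit is not revisited until every unit in the region has been checked once, and breaking distance ties in favor of the lower unit index.

```python
REGION_MB = 128  # region size given in the description
UNIT_MB = 1      # unit size given in the description


def scrub_order(units_per_region, first_unit=0):
    """Order in which the units of one region are checked, one unit per round.

    Each round picks the not-yet-checked unit farthest from the unit checked
    in the previous round (assumption: ties go to the lower index).
    """
    remaining = set(range(units_per_region))
    remaining.discard(first_unit)
    order = [first_unit]
    while remaining:
        last = order[-1]
        nxt = max(sorted(remaining), key=lambda unit: abs(unit - last))
        order.append(nxt)
        remaining.remove(nxt)
    return order


small = scrub_order(8)                     # toy region with 8 units
full = scrub_order(REGION_MB // UNIT_MB)   # 128 units per 128 MB region
```

For the toy region the order alternates between the two ends and works inward: `[0, 7, 1, 6, 2, 5, 3, 4]`. The early rounds therefore sample the parts of the region that are farthest from units already known to be healthy.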

Fig. 4 shows a mapping table obtained from stripe error detection. It is a virtual table that remaps the regions on the RAID stripes. There are three disks in total: disk 0, disk 1, and disk 2. Region ij denotes the (j+1)-th region on the (i+1)-th stripe; for example, region 11 is the second region on the second stripe (stripe 1). Regions with diagonal hatching are faulty regions, and regions with vertical lines are healthy regions. When several regions on one stripe have related faults, the stripe with multiple faults is adjusted so that the number of erroneous regions on the stripe drops to a minimum, which prevents cross errors from causing new errors. The earlier method defined no rule for region exchange, and an exchange without rules does not necessarily reduce the occurrence of errors. Here each region is tagged with a parameter a[ij], its fault-free detection count over the time period, and the adjustment method is based on the fault-free detection counts of the tagged regions. The definitions are as follows:

Definition 1: a[ij] records the number of fault-free accesses to the region on disk j of stripe i;

Definition 2: S[i] = max{a[ij]}, taken over all disks j, which gives the maximum access value on stripe i;

Definition 3: T(ij) = min{S[0], ..., S[i-1], S[i+1], ..., S[n]}, which identifies the target stripe with which region ij is to be exchanged, namely the stripe other than i whose S value is smallest. After the optimization finishes, all a[ij] are reset to zero.

Adjusting the stripes that have more erroneous regions by comparing their healthy detection records avoids unknown cross errors, and moving the erroneous regions to the stripe with the smallest access volume further reduces the probability of unknown errors.
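Definitions 1 to 3 translate almost directly into code. The sketch below is illustrative only: the function names and sample counts are hypothetical, the mapping table is reduced to a list of labels, and it assumes that a faulty region is swapped with the region on the same disk of the target stripe, which the text does not state explicitly. It also does not check whether the target region is itself healthy.

```python
def stripe_max(a, i):
    """S[i]: largest fault-free access count among the regions of stripe i."""
    return max(a[i])


def target_stripe(a, i):
    """T(ij): index of the stripe other than i with the smallest S value."""
    candidates = [k for k in range(len(a)) if k != i]
    return min(candidates, key=lambda k: stripe_max(a, k))


def exchange_region(mapping, a, i, j):
    """Swap region (i, j) with the region on disk j of the target stripe."""
    t = target_stripe(a, i)
    mapping[i][j], mapping[t][j] = mapping[t][j], mapping[i][j]
    return t


def reset_counts(a):
    """After the optimization finishes, every a[ij] is cleared."""
    for row in a:
        for j in range(len(row)):
            row[j] = 0


# a[i][j]: fault-free access count of the region on disk j of stripe i
a = [[9, 4, 7],
     [2, 3, 1],
     [5, 8, 6]]
mapping = [[f"region{i}{j}" for j in range(3)] for i in range(3)]

t = exchange_region(mapping, a, 0, 1)  # region 01 is assumed to be faulty
reset_counts(a)
```

Here S = [9, 3, 8], so the target for stripe 0 is stripe 1, the least accessed of the other stripes, and the remapping table swaps region 01 with region 11.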

11) Tiered storage, media management, and backup.

The main function of the tiered storage module is the tiered storage of data and the rotation of media. According to how the data on the filing disk is accessed and according to the storage policy, data is stored in different ways on storage devices of different performance, and data migrates automatically between storage devices. Media management mainly manages the various media resources in the media pool and handles management tasks such as performance evaluation and the dynamic addition and removal of media. The backup module creates copies of data in case the hardware or media of the system fail.
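The automatic migration between tiers can be pictured with a short Python sketch. It is a simplified assumption rather than the module's actual policy: the tier names reuse the media mentioned in the description (disk, optical disc, tape), while the thresholds `hot` and `warm` and the function names are hypothetical.

```python
def choose_tier(accesses_in_window, hot=10, warm=1):
    """Map an access count to a storage tier (thresholds are assumptions)."""
    if accesses_in_window >= hot:
        return "disk"
    if accesses_in_window >= warm:
        return "optical"
    return "tape"


def plan_migrations(placement, access_counts):
    """Return (item, current_tier, target_tier) for every item that should move."""
    moves = []
    for item, current in placement.items():
        target = choose_tier(access_counts.get(item, 0))
        if target != current:
            moves.append((item, current, target))
    return moves


placement = {"a": "disk", "b": "disk", "c": "tape"}
access_counts = {"a": 25, "b": 0, "c": 3}
moves = plan_migrations(placement, access_counts)
```

In this example item `a` stays on disk, item `b` is demoted from disk to tape, and item `c` is promoted from tape to optical disc.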

Claims

1. A big data filing and storage system, formed by connecting, in sequence, a data source, a cloud database, a management module, a scalable management framework, an operating system, and a storage medium, characterized in that: a monitor module is connected between the data source and the cloud database, the monitor being connected to the cloud database, recording how the database is accessed, formulating different filing policies according to the access pattern, and transferring data from the cloud database to the lower-layer filing storage system; an access grouping module is also connected between the scalable management framework and the operating system, the access grouping module using ontologies to train a support vector machine (SVM) and grouping accesses according to the ontologies, thereby reducing the number of disk rotations; and the scalable management framework is also connected to a RAID stripe optimization module, the RAID stripe optimization module using stripe region exchange to swap an erroneous block on a stripe with a block on the other stripe that has the smallest access volume, thereby improving the protection of the data on disk.

2. The big data filing and storage system according to claim 1, characterized in that the scalable management framework coordinates the concurrent operations of the functional modules.

3. The big data filing and storage system according to claim 1, characterized in that the management module comprises index and metadata query, policy management, and metadata generation.

4. The big data filing and storage system according to claim 1, characterized in that the storage medium is disk or tape.