CN104657459A - Massive data storage method based on file granularity - Google Patents


Info

Publication number
CN104657459A
Authority
CN
China
Prior art keywords
file
data
state
record
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510066822.4A
Other languages
Chinese (zh)
Other versions
CN104657459B (en)
Inventor
王振宇
王树鹏
王勇
王曦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS
Priority to CN201510066822.4A
Publication of CN104657459A
Application granted
Publication of CN104657459B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/18: File system types
    • G06F16/182: Distributed file systems
    • G06F16/11: File system administration, e.g. details of archiving or snapshots
    • G06F16/122: File system administration, e.g. details of archiving or snapshots using management policies
    • G06F16/13: File access structures, e.g. distributed indices

Abstract

The invention discloses a massive data storage method based on file granularity. The method comprises the following steps: (1) dividing a data storage cluster into a plurality of partitions, each partition having a partition value; (2) creating a business data table for the records of each department, and setting a partitioning rule for the records of each business data table; (3) for each record of the business data to be stored, storing the record in a file of the corresponding partition according to its number and the partitioning rule, and creating an index file; and storing the number of the record, the path of the file, the number of the storage node, and the number of the storage device in a metadata file. Furthermore, views between the business data tables are created and, according to the metadata file, the business data tables, views, record partitions, and index information belonging to the same business scenario are grouped into the same database, yielding a massive metadata management model. The method improves the fineness of data management and the flexibility of data division and organization.

Description

A massive data storage method based on file granularity
Technical field
The present invention relates to a massive data storage method based on file granularity, and in particular to an implementation of a metadata management model for massive data that, within the Hadoop ecosystem, is compatible with the Hive metadata model, uses the file as the basic underlying unit of management, and supports state management. It belongs to the field of massive data storage management.
Background art
According to IDC research over the past five years, the global data volume roughly doubles every two years. In 2010 the global data volume entered the ZB era, and it is expected to reach a staggering 35 ZB by 2020. As internet users participate ever more deeply in internet products and applications, the internet will become more intelligent, its data volume is growing explosively, and the era of big data has arrived. Such a huge data volume poses a great challenge to data storage systems. Traditional single-machine data storage systems are no longer feasible, and distributed storage systems have become the inevitable trend for future data storage.
In a storage system, metadata can be managed in two ways: centralized management and distributed management. Centralized management sets up a dedicated metadata server in the storage system, which alone handles all metadata-related transactions. All metadata of the file system is stored on the storage devices of this server. Every file request sent by a client must first go to the metadata server to obtain the relevant metadata before any subsequent operation can be performed. Distributed metadata management stores metadata on multiple nodes of the system and allows the data to migrate dynamically between nodes, so the responsibility for metadata management is likewise spread across the nodes. By accessing the metadata on multiple nodes in parallel, the distributed mode achieves higher metadata access bandwidth and better metadata access performance, but it also incurs the overhead of keeping the metadata consistent.
At present, the Parallel Virtual File System (PVFS) uses the centralized management model. PVFS is designed with one metadata server and multiple I/O servers. The metadata server distributes all I/O requests sent by the compute nodes evenly across the I/O nodes, which balances the I/O access load of the system and increases I/O concurrency, and thus greatly improves the performance of network storage. However, as the number of nodes grows and the system scales up, a single centralized metadata server can no longer meet the metadata transfer bandwidth requirement and becomes a system bottleneck.
xFS, by contrast, uses distributed metadata management. Distributing the data storage and metadata management functions reduces the centralized bottleneck, and data storage is also separated from metadata management. xFS has no dedicated metadata server, which reduces performance bottlenecks and eliminates the single point of failure; it also has a certain degree of fault tolerance and a certain static expansion capability. Distributed metadata management, however, requires a good metadata distribution algorithm; a sound distribution algorithm brings higher performance and better scalability to the storage system.
Hive is a data warehouse application developed by Facebook and implemented on Hadoop clusters. It provides HQL, a query language with SQL-like syntax, as its data access interface. It can map structured data files to database tables, provides complete SQL query functionality, and converts SQL statements into MapReduce tasks for execution. At its core, Hive implements a centralized metadata management method that abstracts elements such as database, table, view, and index, and organizes and manages data on top of the HDFS distributed file system. The metadata in Hive does not describe individual files; it only goes down to the data directory level.
Summary of the invention
The object of the present invention is to provide a massive data storage method based on file granularity. The scheme refines the granularity of underlying data management from the folder to the file. On the one hand, this improves the fineness of massive data management and, in a distributed scenario, adds the association between node and disk information and files, achieving unified data organization, management, and scheduling. On the other hand, a state management method provides system fault tolerance and ensures data consistency and balanced access load. At the same time, the model remains compatible with the Hive metadata model, so that it can be switched to seamlessly within the Hadoop ecosystem.
The technical solution adopted by the present invention is as follows:
A massive data storage method based on file granularity, comprising the steps of:
1) dividing the data storage cluster into multiple partitions, each partition having a partition value;
2) creating a business data table for the records of each department, and setting a partitioning rule for the records in each business data table;
3) for each record of the business data to be stored, storing the record in a file of the corresponding partition according to its number and the partitioning rule, and creating an index file; then storing the number of the record, the path of the file containing it, the number of the storage node, and the number of the storage device in a metadata file.
Further, views between business data tables are created; according to the metadata file, the business data tables, views, record partitions, and index information that belong to the same business scenario are grouped into the same database, yielding a massive metadata management model.
Further, the massive metadata management model comprises physical elements, logical elements, and business elements; the physical elements comprise storage device, storage node, and file; the logical elements comprise database, business data table, view, index, and partition; and the business elements comprise user and region.
Further, the states of the storage device comprise online and offline; the states of the storage node comprise online and offline; and the states of the file comprise writing, closed, stable, and to-be-deleted.
Further, a file size or a total number of stored records is configured for each file, and when the file reaches the configured condition, its state is set to stable; a lifetime is set for each file, and when the file has been kept longer than this lifetime, its state is set to to-be-deleted.
Further, several replicas are generated for the file; the replicas are generated as follows:
61) check whether the current file is in the stable state; if so, go to 62), otherwise go to 63);
62) determine whether the current number of replicas of the file meets the configured requirement; if it does, continue scanning the next file; if it does not, generate replicas of the file, and once all required replicas have been generated, mark the replica count of the file as the configured replica count;
63) determine whether the current file is in the closed state and has been closed for longer than the configured value; if not, continue scanning the next file; otherwise mark the current file as stable, generate replicas according to the configured replica count, and continue scanning the next file.
Further, each storage node is provided with a daemon thread that checks the files on all storage devices of the storage node, as follows:
71) initialize the list of files to be checked and obtain the file path information;
72) if the current file is in the writing state and the time since its creation does not exceed the configured timeout, continue checking the next file; otherwise go to 73);
73) if the file is in the writing state, mark it as closed; if not, go to 74);
74) determine whether the current file is readable; if so, continue checking the next file; otherwise go to 75);
75) delete the current unreadable file and decrement the corresponding replica count by 1; if the replica count is 0, delete the metadata information of the file.
Further, data is queried according to the massive metadata management model as follows: first, the business data table A to be queried is determined from the department specified in the query; then, according to the query request, the list of files under all partitions of business data table A that satisfy the query condition is obtained from the massive metadata management model; then a task list is generated according to the replica distribution of the files in this list, the storage node states, and the storage device states, and the task list is dispatched to the storage nodes, which execute the query request concurrently.
Further, the record is stored in a file of the corresponding partition as follows: first, the list of online nodes of the current storage cluster is obtained; then a storage node is selected from this list according to the configured data loading policy, and an available storage device under this storage node is selected dynamically according to the storage load of the node to perform the data loading operation; a new record is created in the file element and the state of the file is marked as writing; after the load has finished writing, the state of the file is marked as closed or stable.
Compared with the prior art, the beneficial effects of the present invention are as follows:
In the metadata model of the present invention, the file element is exposed to external systems as an important component. On the one hand, this improves the fineness of massive data management and, in a distributed scenario, strengthens the association between node and disk information and files, achieving unified data organization, management, and scheduling. On the other hand, a state management method provides system fault tolerance and ensures data consistency and balanced access load. At the same time, the model remains compatible with the Hive metadata model, so that it can be switched to seamlessly within the Hadoop ecosystem.
Brief description of the drawings
Fig. 1 is a schematic diagram of the elements of the file-granularity metadata management model for massive data;
Fig. 2 is the data loading flow chart;
Fig. 3 is the data retrieval flow chart;
Fig. 4 is the flow chart of replica consistency management based on the state mechanism;
Fig. 5 is the file state check flow chart.
Detailed description of the embodiments
In order to make the objects, technical solutions, and advantages of the present invention clearer, the hierarchical, segmented backup data organization and management method according to an embodiment of the present invention is further described below with reference to the drawings.
According to a first aspect of the present invention, a massive metadata management model based on file granularity is provided. As shown in Fig. 1, the data model comprises three kinds of elements: physical elements, logical elements, and business elements. The physical elements are storage device, node, and file: a storage device is an abstraction of a physical storage device such as a disk or a disk array, a node is an abstraction of a physical or virtual host, and a file is a data file stored on a particular storage device of a particular node. The logical elements are database, table, view, index, and partition: a database is an abstraction of a business scenario, a table is an abstraction of how different data within the same business scenario is organized and managed, a view is a logical combination of different tables, an index is an organizational structure that accelerates access to the data in a table, and a partition is the object that organizes and manages where the data of a table is stored according to a given partitioning rule. The business elements are user and region, which represent the management of users at the system level and the division by deployment location. In general, the physical elements map and abstract the underlying hardware resources and data, and every physical element carries state information. Storage devices and nodes have two states, online and offline. Files have four states: writing, closed, stable, and to-be-deleted. The stable and to-be-deleted states are specific to the present invention. The stable state marks a file as being in a steady state to which no more data can be written; the user configures it in terms of record count or file size. The to-be-deleted state means that the file has been kept beyond its lifetime; it is the state before physical deletion, and users can no longer access the file. The logical elements abstract the organization and management of data, so that data can be managed logically by database, by table, and by partition according to the business logic, indexes can be built on that basis, and views can be used to express logical organization across tables, giving better and more flexible data division and organization. The business elements abstract and map the upper-layer business systems or different data centers, and represent a higher-level form of data division and organization.
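The two file-state rules described above (a file becomes stable once it reaches a configured size or record count, and becomes to-be-deleted once it outlives its lifetime) can be illustrated with a minimal Python sketch. The names FileState, FileMeta, and next_state, and all threshold parameters, are hypothetical and chosen only for illustration; the patent does not specify how the rules are implemented.

```python
from dataclasses import dataclass
from enum import Enum


class FileState(Enum):
    WRITING = "writing"
    CLOSED = "closed"
    STABLE = "stable"
    TO_BE_DELETED = "to_be_deleted"


@dataclass
class FileMeta:
    path: str
    state: FileState
    size_bytes: int
    record_count: int
    age_seconds: float  # time elapsed since the file was created


def next_state(f, max_bytes, max_records, lifetime_seconds):
    """Return the state a file should move to under the two configured rules."""
    # Lifetime rule: a file kept longer than its lifetime becomes to-be-deleted.
    if f.age_seconds > lifetime_seconds:
        return FileState.TO_BE_DELETED
    # Size rule: a file that reached the configured size or record count becomes stable.
    if f.state in (FileState.WRITING, FileState.CLOSED) and (
        f.size_bytes >= max_bytes or f.record_count >= max_records
    ):
        return FileState.STABLE
    return f.state
```

In this sketch the lifetime rule takes precedence over the size rule; that ordering is an assumption, since the text does not say which rule wins when both apply.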
In practice, the metadata model of the present invention uses a traditional relational database to store the metadata: each element corresponds to one table in the database, and the associations between elements are shown in Fig. 1. The top layer consists of the business elements, the bottom layer of the physical elements, and the logical elements in the middle connect the two. Through this division into three classes of elements, the node, storage device (disk), and file objects of a distributed scenario are associated with one another, achieving unified data organization, management, and scheduling.
A typical flow for implementing this model is as follows:
1.1 In this example, the data stored by the system is e-commerce order data. Each order comprises: order creation time, order number (unique), and order content. Order data is produced by multiple departments and must be logically separated; queries are mostly made per department, but the data of all departments, or of several specified departments, must also be presented together. The business scenario requires that all orders be stored persistently. Each order is one structured record; the daily volume is about 5.5 billion orders, roughly 700 GB, with fluctuations of no more than 20%. On this basis the system must provide data loading (storing into the distributed storage cluster) and retrieval (reading from the distributed storage cluster), with response times on the order of seconds and a maximum delay of no more than 5 minutes.
1.2 Creation of the metadata model
At the start of system design, the metadata was defined as a core component and interface of the system, and the metadata information to be stored was determined. The information to be stored includes business table information (the order format definitions of the different departments), views (for presenting the orders of several departments together), partition information (for organizing the order data under a given table), and indexes (for improving retrieval efficiency). For extensibility, the concept of a database is introduced: the tables, views, partitions, and index information of the same business scenario are grouped into one database (the concept here is the same as a database in an RDBMS). To organize data on top of these logical elements, physical elements are added below the partition granularity: the concept of a file is used to gather multiple order records (in the storage format of the storage engine used, multiple orders are stored one after another in a single file, so the file contains multiple records, all with the same partition attribute value). Because recording file information requires node and storage device information, node and storage device are also added as physical elements.
In addition, the business scenario has the concepts of region and department user, so business elements are abstracted at the upper layer to manage access control and regional ownership.
Accordingly, the associations between the above elements are as follows:
1) a database contains tables and views; one database contains multiple tables and views;
2) a table belongs to one database and may belong to multiple views; for example, given two tables A and B, each containing the fields order_id, order_time, and order_content, if all order information of these two departments must be queried, a view V over table A and table B can be created, and later retrievals can go through this view;
3) a view belongs to one database and contains at least one table;
4) a partition belongs to one table, and a table has multiple partitions; partitions are divided by a division rule, the rule is recorded in the table, and each concrete partition records its partition value;
5) a table has zero, one, or more indexes; an index may have no data, since index creation is deferred: only the index rule is recorded under the table, and the index data is bound to files;
6) a file belongs to one partition, and a partition may have multiple files; files fall into two classes, source files and index files; a source file may have one or more index files, which are governed by the index information, and an index file belongs to exactly one source file;
7) the replica information of a file is recorded in each file; that is, if a file F1 has two replicas (F2, F3), the replica information (F1, F2, F3) is recorded in each of F1, F2, and F3;
8) a file is stored on one disk of one node;
9) each node has a unique identifier and has one or more disks; each disk has a unique identifier and belongs to exactly one node.
Under the above constraints, the present invention uses MySQL as the persistent storage container for the metadata. Database (DBS), table (TBLS), view (VIEWS), partition (PARTITIONS), index (INDEXS), file (FILES), node (NODES), and storage device (DEVICES) each correspond to one MySQL table. The constraints are as described above and are not repeated here.
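To make the element-to-table mapping more concrete, the following is a minimal sketch of a subset of the metadata tables. It uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server. Only the table names DBS, TBLS, PARTITIONS, FILES, NODES, and DEVICES come from the text; every column name, the omission of VIEWS and INDEXS, and the helper files_of_table are assumptions made for illustration.

```python
import sqlite3

# Hypothetical columns; only the table names are taken from the description.
SCHEMA = """
CREATE TABLE DBS (db_id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE TBLS (
    tbl_id INTEGER PRIMARY KEY,
    db_id INTEGER NOT NULL REFERENCES DBS(db_id),
    name TEXT NOT NULL,
    partition_rule TEXT NOT NULL   -- the rule is recorded on the table, e.g. 'hash,4'
);
CREATE TABLE PARTITIONS (
    part_id INTEGER PRIMARY KEY,
    tbl_id INTEGER NOT NULL REFERENCES TBLS(tbl_id),
    part_value INTEGER NOT NULL    -- each concrete partition records its value
);
CREATE TABLE NODES (node_id INTEGER PRIMARY KEY, state TEXT NOT NULL);
CREATE TABLE DEVICES (
    dev_id INTEGER PRIMARY KEY,
    node_id INTEGER NOT NULL REFERENCES NODES(node_id),  -- a disk belongs to one node
    state TEXT NOT NULL
);
CREATE TABLE FILES (
    file_id INTEGER PRIMARY KEY,
    part_id INTEGER NOT NULL REFERENCES PARTITIONS(part_id),
    dev_id INTEGER NOT NULL REFERENCES DEVICES(dev_id),  -- one disk of one node
    path TEXT NOT NULL,
    state TEXT NOT NULL
);
"""


def create_metadata_store():
    """Create an in-memory metadata store with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def files_of_table(conn, table_name):
    """Return (path, node_id) for every file under all partitions of a table."""
    rows = conn.execute(
        """SELECT f.path, d.node_id FROM FILES f
           JOIN PARTITIONS p ON p.part_id = f.part_id
           JOIN TBLS t ON t.tbl_id = p.tbl_id
           JOIN DEVICES d ON d.dev_id = f.dev_id
           WHERE t.name = ? ORDER BY f.path""",
        (table_name,),
    )
    return rows.fetchall()
```

The join in files_of_table follows the chain file, partition, table on the logical side and file, device, node on the physical side, which is the association the constraints above describe.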
2. Use of the model
The abstract description is not repeated here. Suppose that the order data of two departments, A and B, needs to be stored in the distributed storage cluster; the following focuses on the role the metadata model plays in this process and how it is used.
2.1 Data loading flow
First, before loading, the external interface provided by the metadata must be used to create two tables, say with table names A and B, and partitions must be created for table A and table B (what is created here is the partitioning rule; the order number is used to compute the partition, so only "order number" is stored here). If the tables and partitions have not been created, loading is not possible. Because index creation is deferred, an index can be created now or after the data has been loaded (the index here is also a rule and does not produce index data immediately).
The loader receives the order records passed in from the front end and, according to the department that produced the order, applies the partitioning rule to the order data to determine which machine in the back-end storage cluster the data is sent to. Suppose the rule is a hash, the cluster has only 2 storage machines with identifiers 0 and 1, and the partitioning rule is (hash, 4) (partitioning is by hash, and the partition values given by the hash rule are distributed evenly over the machines in the cluster). Under this rule, the loader sends records with hash result 0 or 2 to machine 0 and records with hash result 1 or 3 to machine 1. If the partitioning rule becomes (hash, 5), records with hash result 0, 2, or 4 are sent to machine 0 and the rest to machine 1. If there are 3 machines and the hash partition count is still 4, the rule can send hash results 0 and 3 to machine 0, result 1 to machine 1, and result 2 to machine 2. In general, the mapping from partition values to machines is specified by the loader and can be changed flexibly. Order numbers in all scenarios are generated as UUIDs, which guarantees that no order ID is repeated. The data of one department may be stored in different partitions, and one or more files are produced under each partition.
Now suppose the partitioning rule is (hash, 4), there are 2 machines, and the rule in use sends hash results 0 and 2 to machine 0 and hash results 1 and 3 to machine 1.
For each order, the hash code of the order number is computed and the remainder modulo 4 is taken as the partition value. For example, for order number 1, 1 % 4 = 1, so this order is sent to a file of table A whose partition value is 1; for order number 5, 5 % 4 = 1, so this order is also sent to a file of table A whose partition value is 1.
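The partition computation and the machine assignment in the examples above can be sketched as follows. The function names are hypothetical. Using the integer itself as the hash code matches the worked example (1 % 4 = 1); the use of zlib.crc32 for string order numbers such as UUIDs is an assumption, since the text does not name a hash function. Assigning partition values to machines by a second modulo reproduces all three mappings given in the text, but the text stresses that this mapping is chosen by the loader and may differ.

```python
import zlib


def partition_value(order_no, num_partitions):
    """Partition value = hash code of the order number modulo the partition count."""
    if isinstance(order_no, int):
        code = order_no  # matches the worked example: 1 % 4 == 1
    else:
        # Assumed stable hash for string order numbers such as UUIDs.
        code = zlib.crc32(str(order_no).encode("utf-8"))
    return code % num_partitions


def machine_for(part_value, num_machines):
    """Spread partition values evenly over the machines (one possible loader rule)."""
    return part_value % num_machines
```

With 4 partitions and 2 machines, partition values 0 and 2 go to machine 0 and values 1 and 3 go to machine 1; with 3 machines, values 0 and 3 go to machine 0, value 1 to machine 1, and value 2 to machine 2, as in the text.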
The loader writes the record with order number 1 into a file, writes the path of this file, its node number, and its disk number into the metadata, and at the same time marks the file as being in the writing state. Once the requirement of the close rule is met, the file is closed and its state is marked as closed or stable.
2.2 Data retrieval flow
For a query, the department to retrieve from must be specified, that is, the table to query is set or a view is used. Suppose the order data in table A is queried. The user submits a query written in SQL-like syntax to the search engine. The search engine first obtains from the metadata the information of all files under all partitions of table A, prunes it according to the index information and partition information to drop the files that need not be scanned for this query, and then sends the remaining files to the nodes, where they are scanned according to the query condition.
Because the file element in the metadata records node and disk information, file state information, file index information, and replica information, it is sufficient to support the requirements of load balancing and faster retrieval.
2.3 Index generation and recovery flow
When deferred index creation is triggered, the index information of the file is generated and recorded in the source file. If multiple sets of index data are generated, each is recorded in the source file. After any file is closed or becomes stable, its MD5 value must be computed to safeguard the data.
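The MD5 step can be sketched with Python's standard hashlib module. The function name file_md5 and the chunk size are illustrative choices; the text only states that an MD5 value is computed once a file is closed or stable, not how or where it is stored.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large data files need not fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing the stored digest with a freshly computed one is one way a later check could detect a corrupted replica.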
2.4 Replica generation and recovery flow
Replica generation runs as an offline thread of the system. It continuously traverses the local file information and checks whether the number of replicas meets the configured requirement. If the number does not match (too few or too many), it triggers the replica recovery or deletion flow and then records the latest information in the metadata; if the operation fails, nothing is recorded.
As shown in Fig. 2, the loading flow is as follows. The loader obtains the list of online nodes of the current storage cluster (that is, the list of available nodes), selects one of the currently available nodes according to the data loading policy (the policy can be chosen by the user, for example round-robin or hashing), and dynamically selects an available storage device under this node according to the node's storage load to perform the data loading operation. If a node or disk goes down, is damaged, or loses service during loading, it is marked as offline. When the loader next queries the metadata for available node and disk information, the abnormally offline nodes and disks are excluded, which ensures that loading does not lose data. After an abnormal node or disk recovers, it is marked as online in the metadata, so that on the next load it appears again in the list of available nodes and disks and data can once more be written to it. The file information produced during data loading is all published to the metadata model described in the present invention: a new record is created in the file element and marked as writing, and after the load has finished writing, the file is marked as closed or stable.
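The node and device selection in the loading flow can be sketched as below, assuming a round-robin loading policy and taking "storage load" to mean bytes already used on a device. The classes Device and Node, the function pick_target, and the counter parameter are all hypothetical names for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    dev_id: int
    online: bool
    used_bytes: int


@dataclass
class Node:
    node_id: int
    online: bool
    devices: list = field(default_factory=list)


def pick_target(nodes, counter):
    """Round-robin over online nodes, then take the least-loaded online device."""
    # Offline nodes, and nodes without any online device, are excluded up front.
    candidates = [n for n in nodes if n.online and any(d.online for d in n.devices)]
    if not candidates:
        raise RuntimeError("no online node with an online device")
    node = candidates[counter % len(candidates)]
    device = min((d for d in node.devices if d.online), key=lambda d: d.used_bytes)
    return node.node_id, device.dev_id
```

Because the candidate list is rebuilt on every call, a node or disk that is marked offline drops out of selection immediately, and one that is marked online again is picked up on the next load, which mirrors the behavior described above.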
As shown in Fig. 3, the retrieval flow is as follows. During retrieval, the metadata is accessed first. The search condition is pruned, the files to be retrieved are screened according to the table and partition data organization, and the index files are used to decide whether a file satisfies the search condition; files that do not satisfy it are excluded, which yields the list of files that meet the search condition. Files on abnormally offline nodes and disks are excluded in this process. The returned file list is tallied according to the replica distribution of the files, and access load balancing is then optimized according to how the files hit nodes and disks, that is, requests are spread over different nodes and different storage devices as far as possible. Finally, the optimized task list is dispatched to the nodes, and the service on each node executes the retrieval requests concurrently across multiple disks. When a node goes down abnormally, it is marked as offline and retrieval requests are no longer sent to it; instead, the replica information recorded in the metadata for this file on other available nodes is returned to the retrieval, which ensures the correctness of the retrieval result.
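The task list generation can be illustrated with a small greedy sketch: each file is assigned to the online replica holder that currently has the fewest tasks. The greedy rule, the function name build_task_list, and the input shape are assumptions; the text only requires that requests be spread over different nodes and devices as far as possible and that offline nodes be skipped. Device-level balancing is left out for brevity.

```python
def build_task_list(files, online_nodes):
    """Map {file name: [node ids holding a replica]} to {node id: [file names]}."""
    tasks = {}
    for name in sorted(files):
        # Replicas on offline nodes are excluded; another replica serves the request.
        replicas = [n for n in files[name] if n in online_nodes]
        if not replicas:
            continue  # no reachable replica; a real system would have to report this
        # Pick the reachable replica holder with the fewest tasks so far.
        target = min(replicas, key=lambda n: (len(tasks.get(n, [])), n))
        tasks.setdefault(target, []).append(name)
    return tasks
```

For example, with replicas {"f1": [0, 1], "f2": [0, 1], "f3": [1, 2]} and nodes 0 and 1 online, f1 goes to node 0, f2 to node 1, and f3 to node 1 because node 2 is offline.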
According to a second aspect of the invention, a state-based management method is provided: every physical element is assigned a state, and the states are managed in a unified way.
As shown in Figure 4, replicas are generated in the following steps:
1) The system checks whether the current file is in the stable state. If so, go to step 2); otherwise go to step 4).
2) Check whether the file's current replica count meets the configured requirement. If it does, continue scanning the next file; if it does not, go to step 3).
3) Generate replicas for the file, producing all of the required replicas in one pass. When the operation completes, the file's replica count is marked as the configured replica count; if an exception occurs partway, the replica count is marked according to the number actually completed. When the replica count exceeds the configured number, the excess replicas are deleted using a random algorithm and the replica count is updated; if an exception occurs during deletion, the count is marked according to the number of replicas actually deleted. After these checks and operations, continue scanning the next file.
4) Check whether the current file is in the closed state and has been closed for longer than the configured value. If not, continue scanning the next file; otherwise go to step 5). This rule exists because exceptions and similar causes can leave some files unable to ever reach the stable state; so that replicas are also generated for these files, a file that has exceeded the configured time is switched to the stable state even though it was not stable.
5) Mark the file as stable, generate replicas according to the configured replica count, and continue scanning the next file.
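Steps 1) to 5) above can be sketched as a single per-file scan function. This is a minimal sketch under stated assumptions, not the patent's implementation: `FileMeta`, `scan_file`, and the callbacks `make_replica` and `drop_replica` are hypothetical names, a callback returning `False` stands in for "an exception occurred", and `replica_count` is assumed to count every stored copy of the file.

```python
import random
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FileState(Enum):
    WRITING = "writing"
    CLOSED = "closed"
    STABLE = "stable"
    TO_DELETE = "to_delete"


@dataclass
class FileMeta:
    path: str
    state: FileState
    replica_count: int = 1
    closed_at: Optional[float] = None  # time at which the file was closed


def scan_file(meta, now, required, close_timeout,
              make_replica, drop_replica, rng=random):
    """Apply the replica-generation steps to one file."""
    if meta.state is not FileState.STABLE:
        # Step 4: only a file closed for longer than the timeout is promoted.
        closed_long_enough = (
            meta.state is FileState.CLOSED
            and meta.closed_at is not None
            and now - meta.closed_at > close_timeout
        )
        if not closed_long_enough:
            return
        meta.state = FileState.STABLE  # step 5
    # Steps 2 and 3: create missing replicas; on failure the recorded count
    # stays at the number actually completed.
    while meta.replica_count < required:
        if not make_replica(meta):
            break
        meta.replica_count += 1
    # Step 3, second half: delete randomly chosen excess replicas.
    while meta.replica_count > required:
        victim = rng.randrange(meta.replica_count)
        if not drop_replica(meta, victim):
            break
        meta.replica_count -= 1
```

Because the count is updated after each successful operation, an interrupted pass leaves the metadata reflecting the replicas that really exist, and the next scan resumes from there.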
To achieve better fault tolerance, each node also runs a dedicated daemon thread that inspects the files on all storage devices of the node. As shown in Figure 5, the inspection proceeds as follows:
1) Initialize the list of files to be inspected and obtain the file path information;
2) If the current file is in the writing state and its creation time has not exceeded the configured timeout, continue with the next file; otherwise go to step 3);
3) If the file is in the writing state, mark it as closed; if not, go to step 4);
4) Check whether the current file is readable. If so, continue with the next file; otherwise go to step 5);
5) Delete the unreadable file and decrement the corresponding replica count by 1; if the replica count reaches 0, delete the file's metadata.
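The inspection steps above can be sketched as one pass of the daemon over a node's files. The names `LocalFile` and `inspect_files`, the string state values, and the `is_readable` and `delete_file` callbacks are assumptions made for illustration. The source does not say what happens after a timed-out file is marked closed in step 3); this sketch reads it literally and moves on to the next file.

```python
from dataclasses import dataclass


@dataclass
class LocalFile:
    path: str
    state: str          # "writing", "closed", "stable" or "to_delete"
    created_at: float
    replica_count: int


def inspect_files(files, metadata, now, write_timeout, is_readable, delete_file):
    """Run one inspection pass.

    files:    list of LocalFile found on this node's storage devices
    metadata: dict mapping path -> LocalFile (stand-in for the metadata store)
    """
    for f in files:
        if f.state == "writing":
            if now - f.created_at <= write_timeout:
                continue              # step 2: still within the write timeout
            f.state = "closed"        # step 3: writer timed out, close the file
            continue
        if is_readable(f.path):
            continue                  # step 4: file is healthy
        # Step 5: drop the broken copy and update the bookkeeping.
        delete_file(f.path)
        f.replica_count -= 1
        if f.replica_count == 0:
            metadata.pop(f.path, None)
```

In a real deployment this function would run periodically in the daemon thread, with `is_readable` performing an actual read check against the storage device.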
According to a third aspect of the invention, compatibility with the Hive metadata schema of the Hadoop ecosystem is provided. To integrate better with the Hadoop ecosystem, the invention resolves the compatibility problem with the Hive metadata schema, so that Hive programs can be deployed directly on a system that uses the metadata management method of the invention and can query its data. The compatibility design covers the following points:
1) Compatibility of the main model elements: the Hive metadata schema mainly comprises databases, tables, views, indexes, partitions and so on. The logical elements of the invention are identical to these, and the underlying physical elements (nodes, storage devices, and files) are added on top of them, which allows finer-grained management while keeping the system compatible;
2) Compatibility of the data organization method: in the Hive metadata schema, data access is managed by recording folder paths directly in the table, partition, and index objects. In the invention, files are managed by keeping the paths defined in the file objects consistent with the folders of the table, partition, and index objects, which makes data access compatible.
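The path-consistency rule in point 2) can be illustrated with a small check. This is only a sketch of the idea: the function name `files_outside_location` and the example paths are hypothetical, and the patent does not describe how the consistency is enforced.

```python
from pathlib import PurePosixPath


def files_outside_location(location, file_paths):
    """Return the file paths that do not lie under the folder recorded in a
    table, partition, or index object (the `location` argument)."""
    root = PurePosixPath(location)
    # Compare path components rather than raw string prefixes, so that
    # ".../dept=ab/x" is not mistaken for a child of ".../dept=a".
    return [p for p in file_paths if root not in PurePosixPath(p).parents]
```

For example, with a partition folder `/warehouse/orders/dept=a`, a file object whose path is `/warehouse/orders/dept=b/f2.dat` would be reported as inconsistent, while `/warehouse/orders/dept=a/f1.dat` would not.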
The above description is only an overview of the technical solution of the invention and is not intended to limit it. A person of ordinary skill in the art to which the invention belongs may make minor changes and modifications without departing from the spirit and scope of the invention, and these fall within its scope of protection. The scope of protection of the invention is therefore defined by the claims.

Claims (9)

1. A massive data storage method based on file granularity, comprising the steps of:
1) dividing the data storage cluster into multiple partitions, each partition having a partition value;
2) creating a business data table for the records of each department, and setting the partitioning rule for the records in each business data table;
3) for each record of the business data to be stored, storing the record in a file of the corresponding partition according to its number and the partitioning rule, and creating an index file; then storing the number of the record, the path of the file containing it, the number of the storage node, and the number of the storage device in a metadata file.
2. The method of claim 1, characterized in that views between the business data tables are created, and, according to the metadata file, the business data tables, views, record partitions, and index information belonging to the same business scenario are placed in the same database, yielding a massive metadata management model.
3. The method of claim 2, characterized in that the massive metadata management model comprises physical elements, logical elements, and business elements; wherein the physical elements comprise storage devices, storage nodes, and files, the logical elements comprise databases, business data tables, views, indexes, and partitions, and the business elements comprise users and regions.
4. The method of claim 3, characterized in that the states of the storage device comprise online and offline, the states of the storage node comprise online and offline, and the states of the file comprise writing, closed, stable, and to be deleted.
5. The method of claim 4, characterized in that a file size or a total number of stored records is configured for each file, and when the file reaches the set condition, the state of the file is set to stable; a life cycle is set for each file, and when the retention time of the file exceeds the life cycle, the state of the file is set to to be deleted.
6. The method of claim 4 or 5, characterized in that several replicas are generated for the file; the replicas are generated as follows:
61) detecting whether the current file is in the stable state; if so, going to step 62), otherwise going to step 63);
62) checking whether the current replica count of the file meets the configured requirement; if it does, continuing to scan the next file; if it does not, generating replicas for the file, producing all of the required replicas in one pass, and then marking the replica count of the file as the configured replica count;
63) checking whether the current file is in the closed state and has been closed for longer than the configured value; if not, continuing to scan the next file; otherwise marking the current file as stable, generating replicas according to the configured replica count, and continuing to scan the next file.
7. The method of claim 6, characterized in that each storage node is provided with a daemon thread for inspecting the files on all storage devices of the storage node, the inspection being performed as follows:
71) initializing the list of files to be inspected and obtaining the file path information;
72) if the current file is in the writing state and its creation time has not exceeded the configured timeout, continuing with the next file; otherwise going to step 73);
73) if the file is in the writing state, marking it as closed; if not, going to step 74);
74) checking whether the current file is readable; if so, continuing with the next file; otherwise going to step 75);
75) deleting the unreadable file and decrementing the corresponding replica count by 1; if the replica count is 0, deleting the metadata of the file.
8. The method of claim 6, characterized in that data is queried according to the massive metadata management model as follows: first, the business data table A to be queried is determined according to the input query department; then, in the massive metadata management model, the list of files that satisfy the query condition under all partitions of business data table A is obtained according to the query request; then a task list is generated according to the replica distribution of the files in the file list, the storage node states, and the storage device states; and the task list is then issued to the storage nodes, which execute the query request concurrently.
9. The method of claim 4, characterized in that the record is stored in a file of the corresponding partition as follows: first, the list of nodes of the current storage cluster that are in the online state is obtained; a storage node is then selected from the node list according to the configured data loading policy, and an available storage device under that storage node is selected dynamically according to the storage load of the storage node to perform the data loading operation; a new record is produced in the file and the state of the file is marked as writing; after the loading has finished writing, the state of the file is marked as closed or stable.
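As an informal illustration of the loading procedure recited in claim 9 and the stable-state condition of claim 5, the selection of a target and the resulting file state can be sketched as below. Everything here is an assumption made for illustration and forms no part of the claims: the `Node` and `Device` types, the round-robin choice standing in for "the configured data loading policy", and the least-used-device rule standing in for the dynamic selection by storage load.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    online: bool
    used_bytes: int


@dataclass
class Node:
    name: str
    online: bool
    devices: list


def choose_target(nodes, next_index):
    """Pick an online node (round-robin, an assumed policy) and its
    least-loaded online device."""
    usable = [n for n in nodes if n.online and any(d.online for d in n.devices)]
    if not usable:
        raise RuntimeError("no online storage node with an available device")
    node = usable[next_index % len(usable)]
    device = min((d for d in node.devices if d.online),
                 key=lambda d: d.used_bytes)
    return node, device


def state_after_load(size_bytes, records, max_bytes=None, max_records=None):
    """State of a file once a load has finished writing: 'stable' if the
    configured size or record limit is reached, otherwise 'closed'."""
    full = (
        (max_bytes is not None and size_bytes >= max_bytes)
        or (max_records is not None and records >= max_records)
    )
    return "stable" if full else "closed"
```

While the load is in progress the file would carry the writing state; `state_after_load` only models the transition made once writing completes.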
CN201510066822.4A 2015-02-09 2015-02-09 Massive data storage method based on file granularity Active CN104657459B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510066822.4A CN104657459B (en) 2015-02-09 2015-02-09 Massive data storage method based on file granularity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510066822.4A CN104657459B (en) 2015-02-09 2015-02-09 Massive data storage method based on file granularity

Publications (2)

Publication Number Publication Date
CN104657459A true CN104657459A (en) 2015-05-27
CN104657459B CN104657459B (en) 2018-02-16

Family

ID=53248587

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510066822.4A Active CN104657459B (en) 2015-02-09 2015-02-09 Massive data storage method based on file granularity

Country Status (1)

Country Link
CN (1) CN104657459B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101354726A (en) * 2008-09-17 2009-01-28 中国科学院计算技术研究所 Method for managing memory metadata of cluster file system
US20120054182A1 (en) * 2010-08-24 2012-03-01 International Business Machines Corporation Systems and methods for massive structured data management over cloud aware distributed file system
CN102332004A (en) * 2011-07-29 2012-01-25 中国科学院计算技术研究所 Data processing method and system for managing mass data
CN103795811A (en) * 2014-03-06 2014-05-14 焦点科技股份有限公司 Information storage and data statistical management method based on meta data storage

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
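Wu Guangjun (吴广君) et al.: "Massive structured data storage and retrieval system" ("海量结构化数据存储检索系统"), 《计算机研究与发展》 *
Wang Zhengye (王正也) et al.: "A big data storage optimization method based on Hive log analysis" ("一种基于Hive日志分析的大数据存储优化方法"), 《软件》 *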

Also Published As

Publication number Publication date
CN104657459B (en) 2018-02-16

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant