CN105677826A

CN105677826A - Resource management method for massive unstructured data

Info

Publication number: CN105677826A
Application number: CN201610003635.6A
Authority: CN
Inventors: 张善海; 熊贵喜; 蔡朝辉; 杜博文; 凌萍; 谢志普
Original assignee: BOCOM SMART NETWORK TECHNOLOGIES Inc
Current assignee: BOCOM SMART NETWORK TECHNOLOGIES Inc
Priority date: 2016-01-04
Filing date: 2016-01-04
Publication date: 2016-06-15

Abstract

The invention provides a resource management method for massive unstructured data. The method comprises the following steps: a, the storage mode of an unstructured data file is determined according to the file's size, and the file is stored on HDFS or in HBase; b, the metadata of the data is stored in HBase, and index tables of the metadata are built according to the themes, tags and other information in the metadata to speed up queries; c, when metadata is queried, the metadata index tables are searched according to the theme or tag of the metadata sought, so that the data table is located quickly; d, when unstructured data records are queried, the data index table corresponding to the data table is found according to the naming rule of data index tables, the semantic tags of the data are looked up in that data index table to obtain the primary keys of the records sought, and the data is then located quickly in the data table by those primary keys. With this resource management method, massive unstructured data can be effectively organized and managed and queried quickly and efficiently.

Description

A resource management method for massive unstructured data

Technical field

The present invention relates to the field of the distributed database HBase and the distributed file system HDFS, and in particular to a resource management method for massive unstructured data.

Background technology

HDFS, the Hadoop Distributed File System, is the flagship file system of Hadoop. Its design derives from the Google File System (GFS). It suits a write-once, read-many access pattern and fits application scenarios involving multi-source urban data. It is a distributed file system suited to storing large files and can serve as a data source for Hadoop and Spark.

HBase is an open-source distributed database developed on the basis of Google Bigtable. It is not a traditional relational database; its original goal was to address the theoretical and practical shortcomings of traditional relational databases in processing massive data. Because HBase's underlying data is stored on HDFS, HBase is likewise highly fault tolerant. The main features of HBase are:

1) High scalability. In storage capacity, HBase scales linearly. When the data volume reaches a certain threshold, HBase splits the data horizontally and distributes the resulting blocks across the thousands of servers in the cluster. When the data size reaches the limit of the cluster, HBase also supports enlarging the cluster, achieving seamless dynamic expansion without downtime.

2) High performance. HBase was designed to satisfy highly concurrent queries over massive data. Two mechanisms ensure efficient concurrent queries. The first is data partitioning: HBase divides data among the nodes of the cluster, so that when a user queries data, each node can return its data blocks at the same time, achieving concurrent queries. The second is caching: HBase provides an efficient caching mechanism, in particular the MemStore unit, which serves as a cache for reading and writing data and significantly raises the hit rate of data access.

3) High availability. HBase stores its data on HDFS, and HDFS is itself highly fault tolerant. When the data on a machine is lost, HBase can find a backup of that data through HDFS, replicate it again, and update the system log table. This ensures the high availability of the HBase system.

As the volume of urban data grows ever larger and the kinds of unstructured data multiply, how to store and manage massive unstructured data in a distributed system has become a direction of research.

Summary of the invention

The present invention addresses the technical problem that urban unstructured data is growing ever larger and ever more time-consuming to process, and proposes a resource management method for massive unstructured data that can organize and manage such data effectively.

A resource management method for massive unstructured data comprises the following steps:

Step a: determine the storage mode of an unstructured data file according to its size: when the file size exceeds a given threshold, store the file in the HDFS file system and record its basic information and its HDFS path in a data table created on HBase; when the file size is less than or equal to the given threshold, serialize the file and store it directly in the HBase database;

Step b: build a metadata table and a data index table from the unstructured data, and use the metadata table to build a metadata index table;

Step c: when querying metadata, search the metadata index table by the theme or tag of the metadata sought to obtain the corresponding data tables; and

Step d: when querying unstructured data records, find the data index table corresponding to a data table according to the naming rule of data index tables, look up the semantic tag of the unstructured data records in that data index table to obtain the primary keys of the records sought, and then quickly locate the data in the data table by those primary keys.

With the resource management method according to the invention, massive unstructured data can be effectively organized and managed and queried quickly and efficiently, which greatly improves data processing efficiency.

Brief description of the drawings

Fig. 1 is a data processing flow chart of the method according to the invention.

Fig. 2 is an example diagram of unified storage of unstructured data.

Fig. 3 is a hierarchy diagram of the data tables in the database.

Fig. 4 is a schematic diagram of data resource querying.

Detailed description of the invention

The present invention is described in detail below with reference to the accompanying drawings. The following embodiments do not limit the invention. Changes and advantages that those skilled in the art can conceive of without departing from the spirit and scope of the inventive concept are all included in the invention.

Fig. 1 is the data processing flow chart of the method according to the invention. It comprises three parts: the data storage process; the creation of the metadata table, the data index table and the metadata index table; and the processing of data requests.

The data storage process is described in detail first. It consists of creating a data table on HBase and storing the raw data on HBase or HDFS as required. As shown in Fig. 1, the steps are as follows:

Step a1: according to the data to be uploaded, first create the corresponding data table on HBase. The data table mainly holds basic information about the unstructured data, such as table name, file size, access mode, content and semantic tags.

Step a2: the user then selects the file to upload and calls the upload interface to transfer the file.

Step a3: determine the size of the data file.

Step a4: if the file is larger than 1MB, store it on HDFS and record its basic information and its HDFS path in the HBase data table; otherwise go to step a5.

Step a5: serialize the file and store it directly in the HBase database.
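Steps a3 to a5 amount to a size check followed by one of two writes. The following Python sketch illustrates that routing under stated assumptions: plain dictionaries stand in for the HBase data table and for HDFS, and the function name `store_object`, the `/data/` path layout and the `"hbase"`/`"hdfs"` flag values are hypothetical and do not come from the patent.

```python
THRESHOLD_BYTES = 1 * 1024 * 1024  # the 1MB dividing line stated in the text


def store_object(name, payload, fmt, tags, hbase_table, hdfs_files):
    """Store one unstructured object, choosing the backend by size.

    hbase_table and hdfs_files are plain dicts standing in for the HBase
    data table and for HDFS (an assumption made for illustration only).
    """
    row = {"Name": name, "Size": len(payload), "Format": fmt, "Tags": tags}
    if len(payload) > THRESHOLD_BYTES:
        # Step a4: large file goes to HDFS, only its path is kept in HBase.
        path = f"/data/{name}"  # hypothetical path layout
        hdfs_files[path] = payload
        row["Access"] = "hdfs"
        row["Content"] = path
    else:
        # Step a5: small file is stored directly as bytes in HBase.
        row["Access"] = "hbase"
        row["Content"] = payload
    hbase_table[name] = row  # the file name serves as the row key here
    return row
```

A file of exactly 1MB takes the HBase branch, matching the "less than or equal to" wording of step a.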

HBase is a distributed database system offering high reliability, high performance, column-oriented storage, scalability and real-time reads and writes, and it can meet the basic storage needs of urban unstructured data. However, because of HBase's own design, its performance suffers when large data objects are stored in it directly. For example, when an HBase region grows to a certain size (256MB by default), it is split automatically, and all writes to that region are blocked while the split is in progress. In addition, heavy writes to the same region cause repeated flush operations, which frequently trigger compaction and consume the cluster's I/O.

One solution to these problems is to store the unstructured data object on HDFS and write the file path into a designated column family of HBase; the object is then accessed by reading the corresponding file on HDFS. Although this scheme solves the HBase performance problem caused by storing large data objects, applying it indiscriminately, without checking the size of each object, produces too many small files on HDFS, and too many small files degrade HDFS performance. The reasonable approach is therefore to choose the storage mode according to the size of the unstructured data object. Based on the above, the invention uses 1MB as the dividing line between the two modes. When an unstructured data object is less than or equal to 1MB, it is serialized into a byte array and stored in a designated column family of HBase. Otherwise, it is stored on HDFS as a file, and the file's path is stored in a designated column family of HBase. HBase also sets aside a dedicated column that identifies the storage mode of the data.

Fig. 2 is an example diagram of unified storage of unstructured data. Suppose two unstructured data objects need to be stored: a is a file in txt format and b is a picture in png format. Object a is small, 3KB, so its content can be converted directly into a byte array and stored in a column family of HBase. Object b is larger, 5MB, so it is stored on HDFS and its path is written into HBase. An identifier column indicates the storage mode of each data object.

As shown in Fig. 2, HBase stores unstructured data objects in one column family (ColumnFamily). This column family contains six columns: Name, Size, Format, Access, Content and Tags, which hold the file name, size, file type, storage mode, content and semantic tags of the unstructured data object, respectively. When Access indicates storage directly in HBase, Content holds the byte array of the unstructured data. Otherwise, Content holds the path where the unstructured data is stored on HDFS.
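Reading an object back follows the Access column. The sketch below models the two rows of the Fig. 2 example (a 3KB txt file and a 5MB png picture) as Python dictionaries and resolves each to its bytes. The dictionaries stand in for HBase and HDFS, and the flag values `"hbase"` and `"hdfs"`, the path `/data/b.png`, the tag values and the helper name `read_object` are assumptions made for illustration.

```python
def read_object(row, hdfs_files):
    """Return the bytes of an object, following its Access column."""
    if row["Access"] == "hbase":
        return row["Content"]          # Content holds the byte array itself
    return hdfs_files[row["Content"]]  # Content holds the HDFS path


# Stand-ins for HDFS and for the HBase column family described in the text.
hdfs_files = {"/data/b.png": b"\x89PNG" + b"\0" * (5 * 1024 * 1024 - 4)}
rows = {
    "a.txt": {"Name": "a.txt", "Size": 3 * 1024, "Format": "txt",
              "Access": "hbase", "Content": b"t" * (3 * 1024),
              "Tags": "notes"},
    "b.png": {"Name": "b.png", "Size": 5 * 1024 * 1024, "Format": "png",
              "Access": "hdfs", "Content": "/data/b.png",
              "Tags": "picture"},
}

for name, row in rows.items():
    data = read_object(row, hdfs_files)
    print(name, row["Access"], len(data))
```

Keeping the storage mode in its own column lets the caller use one read path for both kinds of object without inspecting the Content value.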

After the data is stored, the data index table, the metadata table and the metadata index table are built, as described below with reference to Fig. 1 and Fig. 3.

First, the data index table and the metadata table are built (step b1).

For all unstructured data, a data index table is created in the HBase database. The data index table uses the semantic tag field of the data information as its row key, and its content is the primary keys of all unstructured data records related to that semantic tag. Multiple primary keys are separated by "#", in the form <"semantic tag", "primary key 1#primary key 2#primary key 3#...">. The semantic tag field describes each unstructured data record; the primary key of an unstructured data record is its file name (filename).
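The data index table is in effect an inverted index from semantic tag to a "#"-joined list of file names. A minimal Python sketch of maintaining such a row follows; the dictionary stands in for the HBase data index table, and the function name `add_to_data_index`, the sample tags and file names, and the choice to join without a trailing "#" are assumptions.

```python
def add_to_data_index(data_index, semantic_tags, primary_key):
    """Add one record's primary key under each of its semantic tags.

    data_index maps a semantic tag (row key) to a "#"-separated string of
    primary keys, mirroring the <tag, "key1#key2#..."> form in the text.
    """
    for tag in semantic_tags:
        cell = data_index.get(tag, "")
        keys = cell.split("#") if cell else []
        if primary_key not in keys:  # keep each key at most once per tag
            keys.append(primary_key)
        data_index[tag] = "#".join(keys)


data_index = {}
add_to_data_index(data_index, ["car", "night"], "img_001.png")
add_to_data_index(data_index, ["car"], "img_002.png")
print(data_index)
```

Because the separator is "#", this layout assumes file names never contain that character.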

In addition, for all unstructured data, a metadata table is created in the HBase database. The metadata table holds the metadata of the unstructured data, with each row corresponding to one piece of unstructured data. Metadata is information that describes the attributes of unstructured data and supports functions such as indicating storage location, resource lookup and file recording. It includes the fields table name (name), theme (subject), tag (tags) and file format (format), as shown in Fig. 3. The metadata table uses the table name as its row key (rowkey), which makes lookup by table name convenient.

The theme field classifies urban data at a coarse level, for example traffic, environment and people's livelihood.

The tag field further subdivides the data of each urban theme; for example, the traffic theme contains data such as traffic surveillance video and checkpoint pictures.

When a user uploads data, the system updates the metadata table.

Because clients query by keyword, the invention creates a metadata index table (step b2) from the metadata in the metadata table in order to speed up queries. The metadata index table uses the theme or tag field of the metadata as its row key, and its content is the table names of all data tables related to that theme or tag. Each row key thus corresponds to a list: the metadata-table row keys of the data resources that carry the queried theme or tag. In the content, multiple table names are separated by "#", in the form <"theme or tag", "table name 1#table name 2#table name 3#...">, as shown in Fig. 3.
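The relationship between the metadata table and the metadata index table can be sketched as follows. This is an illustration only: dictionaries stand in for the two HBase tables, tags are held as a Python list rather than a serialized cell, and the function name `register_table` and the sample table names are hypothetical.

```python
def register_table(metadata_table, metadata_index, name, theme, tags, fmt):
    """Record a data table's metadata and index it by theme and by tag."""
    # Metadata table: the row key is the table name.
    metadata_table[name] = {
        "name": name, "subject": theme, "tags": list(tags), "format": fmt,
    }
    # Metadata index table: the row key is a theme or a tag, and the
    # content is the "#"-separated names of all related data tables.
    for key in [theme, *tags]:
        cell = metadata_index.get(key, "")
        names = cell.split("#") if cell else []
        if name not in names:
            names.append(name)
        metadata_index[key] = "#".join(names)


metadata_table, metadata_index = {}, {}
register_table(metadata_table, metadata_index,
               "traffic_video", "traffic", ["surveillance video"], "mp4")
register_table(metadata_table, metadata_index,
               "checkpoint_pictures", "traffic", ["checkpoint picture"], "png")
print(metadata_index["traffic"])
```

A single lookup on the row key "traffic" then yields every table name under that theme without scanning the metadata table.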

In the data tables and index tables shown in Fig. 3, the fields in bold boxes are row keys.

Finally, the processing of data requests is described with reference to Fig. 1 and Fig. 4.

When a user queries metadata by theme or tag, the metadata index table is searched by the tag or theme sought. This comprises the following steps:

Step c1: determine the theme or tag of the data to be found;

Step c2: using the theme or tag obtained in the previous step as the row key, search the metadata index table in HBase;

Step c3: determine whether the requested theme or tag exists; if it does, return the table names of all data tables in the metadata index table that correspond to the requested theme or tag;

Step c4: using the table names obtained in the previous step as row keys, find the corresponding metadata in the metadata table and return it. The returned information includes table name, city, theme, tag and file format. If there is no match, return empty.
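Steps c1 to c4 reduce to two key lookups: one in the metadata index table, then one per table name in the metadata table. The Python sketch below assumes both tables are available as dictionaries in the layouts described above; the function name `query_metadata` and the sample rows are hypothetical.

```python
def query_metadata(keyword, metadata_index, metadata_table):
    """Return the metadata rows of all tables indexed under a theme or tag."""
    cell = metadata_index.get(keyword)       # step c2: row-key lookup
    if not cell:                             # step c3: nothing indexed
        return []                            # "return empty"
    table_names = cell.split("#")
    # Step c4: use each table name as a row key in the metadata table.
    return [metadata_table[n] for n in table_names if n in metadata_table]


metadata_index = {"traffic": "traffic_video#checkpoint_pictures"}
metadata_table = {
    "traffic_video": {"name": "traffic_video", "subject": "traffic",
                      "tags": "surveillance video", "format": "mp4"},
    "checkpoint_pictures": {"name": "checkpoint_pictures",
                            "subject": "traffic",
                            "tags": "checkpoint picture", "format": "png"},
}

for row in query_metadata("traffic", metadata_index, metadata_table):
    print(row["name"], row["format"])
```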

After the table names of all data tables holding the unstructured data records sought have been obtained, the unstructured data records themselves are queried. This comprises the following steps:

Step d1: first use the table name of the data table obtained in step c4 to find the data index table for the unstructured data records. The name of a data index table is formed by joining the data table's name with "Index", for example "table name_Index", so the data index table can be found quickly.

Step d2: then look up the semantic tag in the data index table to obtain the primary keys of all data records corresponding to that semantic tag, that is, the file names of the unstructured data. The semantic tag is supplied by the user: whatever semantic information the user wants the data to contain determines the semantic tag.

Step d3: look up the obtained primary keys in the data table and return the matching unstructured data records.
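Steps d1 to d3 can be sketched in the same style. Here a dictionary of dictionaries stands in for the HBase namespace, so that a table is found by name; the `_Index` suffix follows the naming example in step d1, while the function name `query_records` and the sample data are hypothetical.

```python
def query_records(table_name, semantic_tag, tables):
    """Return the records of one data table that carry a semantic tag."""
    # Step d1: derive the data index table's name from the data table's name.
    index = tables.get(f"{table_name}_Index", {})
    # Step d2: the semantic tag is the row key; the cell lists primary keys.
    cell = index.get(semantic_tag)
    if not cell:
        return []
    data_table = tables.get(table_name, {})
    # Step d3: fetch each record from the data table by its primary key.
    return [data_table[k] for k in cell.split("#") if k in data_table]


tables = {
    "checkpoint_pictures": {
        "img_001.png": {"Name": "img_001.png", "Tags": "car#night"},
        "img_002.png": {"Name": "img_002.png", "Tags": "car"},
    },
    "checkpoint_pictures_Index": {
        "car": "img_001.png#img_002.png",
        "night": "img_001.png",
    },
}

for record in query_records("checkpoint_pictures", "car", tables):
    print(record["Name"])
```

Because the index table's name is derived from the data table's name, no extra lookup is needed to find which index belongs to which table.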

Through the above steps, the unstructured data records sought can be located quickly in the data table.

Obviously, those of ordinary skill in the art will appreciate that the above embodiments merely illustrate the invention and do not limit it; any change or modification to the above embodiments that stays within the spirit of the invention falls within the scope of the claims.

Claims

1. A resource management method for massive unstructured data, characterized by comprising the following steps:

2. The method according to claim 1, characterized in that, in step a, the given threshold is 1MB, and step a further comprises the following steps:

Step a1: first create on HBase the data table for the data to be uploaded;

Step a2: select the data file to upload;

Step a3: determine the size of the data file;

Step a4: if the file is larger than 1MB, store it on HDFS and record its basic information and its HDFS path in the HBase table; otherwise go to step a5; and

Step a5: serialize the file and store it directly in the HBase database.

3. The method according to claim 2, characterized in that step b further comprises:

Step b1: for all unstructured data, create a metadata table and a data index table in the HBase database, the metadata table containing the metadata of the unstructured data;

Step b2: create a metadata index table from the metadata in the metadata table.

4. The method according to claim 3, characterized in that the metadata is information that describes the attributes of the unstructured data and supports functions including indicating storage location, resource lookup and file recording; the metadata includes the following fields: table name, theme, tag and file format; and the metadata table uses the table name as its row key, for lookup by table name.

5. The method according to claim 4, characterized in that the theme field classifies urban data at a coarse level, including traffic, environment and people's livelihood.

6. The method according to claim 5, characterized in that the tag field further subdivides the data of each urban theme, the traffic theme containing traffic surveillance video and checkpoint picture data.

7. The method according to claim 6, characterized in that the data index table uses the semantic tag of the unstructured data records as its row key and the primary keys of all unstructured data records related to that semantic tag as its content; multiple primary keys are separated by "#", in the form <"semantic tag", "primary key 1#primary key 2#primary key 3#...">; the semantic tag field describes each unstructured data record, and the primary key of an unstructured data record is its file name.

8. The method according to claim 7, characterized in that the metadata index table uses the theme or tag field of the metadata as its row key, its content is the table names of all data tables related to that theme or tag, and multiple table names are separated by "#", in the form <"theme or tag", "table name 1#table name 2#table name 3#...">.

9. The method according to claim 8, characterized in that, in step c, when metadata is queried, the metadata index table is searched by the tag or theme sought, comprising the following steps:

Step c1: determine the theme or tag of the data to be found;

Step c3: determine whether the requested theme or tag exists; if it does, return the table names of all data tables in the metadata index table that correspond to the requested theme or tag; and

10. The method according to claim 9, characterized in that, in step d, querying unstructured data records further comprises the following steps:

Step d1: first use the table name of the data table obtained in step c4 to find the data index table for the unstructured data records, the name of the data index table being formed by joining the data table's name with "Index";

Step d2: look up the semantic tag in the data index table to obtain the primary keys of all data records corresponding to that semantic tag, that is, the file names of the unstructured data; and

Step d3: look up the obtained primary keys in the data table and return the unstructured data records.