CN110196888A

CN110196888A - Data-updating method, device, system and medium based on Hadoop

Info

Publication number: CN110196888A
Application number: CN201910448948.6A
Authority: CN
Inventors: 彭陈成; 张阳
Original assignee: WeBank Co Ltd
Current assignee: WeBank Co Ltd
Priority date: 2019-05-27
Filing date: 2019-05-27
Publication date: 2019-09-03
Anticipated expiration: 2039-05-27
Also published as: WO2020238597A1; CN110196888B

Abstract

The invention discloses a Hadoop-based data updating method, device, system and medium. The method comprises the steps of: after detecting that a Hadoop cluster has received a batch job sent by a client, compiling the batch job in the Hadoop cluster to obtain the task statement corresponding to the batch job; parsing the task statement in a data governance system to obtain the logical relationships among the database tables that the task statement involves; and updating, according to the logical relationships, the lineage (blood relationship) of the corresponding database tables in a preset graph database. When a batch job is executed, the invention thus updates the lineage of the corresponding database tables in the graph database according to the logical relationships among the tables that the batch job involves, which improves the accuracy of the lineage between database tables in the graph database.

Description

Data-updating method, device, system and medium based on Hadoop

Technical field

The present invention relates to the technical field of data processing in financial technology (Fintech), and in particular to a Hadoop-based data updating method, device, system and medium.

Background technique

With the continuous development of financial technology (Fintech), especially internet finance, more and more technologies (such as distributed computing, blockchain and artificial intelligence) are being applied in the financial field, and the financial industry in turn places higher demands on these technologies.

At present, many enterprises have built their own Hadoop big data platforms. The data of every application system in the enterprise is shared on the platform, forming a data warehouse and several data marts built for different subjects: the data warehouse stores the data of individual application systems, while a data mart stores the data on the same subject across the application systems. When the data of an upstream database table on which a data mart depends is rerun, the downstream database tables must be notified so that they rerun their batch jobs in response to the change in the upstream table. This notification usually relies on a global distributed task scheduling platform. When an upstream database table changes, the Hadoop big data platform informs the task scheduling platform, which generates a notification and sends it to the clients of the downstream database tables to trigger reprocessing of their data. When the data lineage between upstream and downstream database tables is complicated, the task scheduling platform cannot determine all the affected downstream tables. The lineage between upstream and downstream tables is then updated incompletely and becomes inaccurate; that is, the lineage between database tables in the graph database is inaccurate.

Summary of the invention

The main object of the present invention is to provide a Hadoop-based data updating method, device, system and medium, aiming to solve the existing technical problem that, when batch jobs are executed, the lineage between database tables in the graph database is inaccurate.

To achieve the above object, the present invention provides a Hadoop-based data updating method, comprising the steps of:

After detecting that a Hadoop cluster has received a batch job sent by a client, compiling the batch job in the Hadoop cluster to obtain the task statement corresponding to the batch job;

Parsing the task statement in a data governance system to obtain the logical relationships among the database tables that the task statement involves;

Updating the lineage of the corresponding database tables in a preset graph database according to the logical relationships.

Preferably, if the task statement is a data update statement, then after the step of compiling the batch job in the Hadoop cluster upon detecting that the Hadoop cluster has received the batch job sent by the client, to obtain the task statement corresponding to the batch job, the method further comprises:

Processing the data corresponding to the batch job in the Hadoop cluster to obtain processed data;

Updating a metadata database according to the processed data to obtain the updated metadata database;

Obtaining, by the data governance system, the updated metadata from the updated metadata database, and obtaining the processed data and the table name of the database table that holds the processed data;

Updating the database table corresponding to the table name in the graph database according to the updated metadata and the processed data, and determining the updated database table as an upstream database table;

After the step of updating the lineage of the corresponding database tables in the preset graph database according to the logical relationships, determining the downstream database table corresponding to the upstream database table according to the lineage;

Updating the downstream database table according to the updated metadata and the processed data.

Preferably, after the step of determining the downstream database table corresponding to the upstream database table according to the lineage, the method further comprises:

Sending prompt information to the client corresponding to the downstream database table, so that the client prompts the user, according to the prompt information, that the upstream database table of the downstream database table has been updated;

If an update instruction sent by the client corresponding to the downstream database table is received, updating the downstream database table according to the updated metadata and the processed data.

Preferably, the step of obtaining, by the data governance system, the updated metadata from the updated metadata database comprises:

Obtaining the monitoring log of the updated metadata database through a preset listener in the data governance system;

Parsing the monitoring log to obtain the target keywords in the monitoring log;

Obtaining the updated metadata from the updated metadata database according to the target keywords.

Preferably, if the task statement is a data query statement, then after the step of compiling the batch job in the Hadoop cluster upon detecting that the Hadoop cluster has received the batch job sent by the client, to obtain the task statement corresponding to the batch job, the method further comprises:

Obtaining the target data corresponding to the data query statement in the Hadoop cluster;

Sending the target data to the client corresponding to the batch job.

Preferably, the step of parsing the task statement in the data governance system to obtain the logical relationships among the database tables that the task statement involves comprises:

Parsing the task statement in the data governance system to obtain the database tables corresponding to the task statement;

Determining the source table and the target table among the database tables corresponding to the task statement, and determining the logical relationships among those database tables according to the source table and the target table.

Preferably, the step of compiling the batch job in the Hadoop cluster after detecting that the Hadoop cluster has received the batch job sent by the client, to obtain the task statement corresponding to the batch job, comprises:

After detecting that the Hadoop cluster has received the batch job sent by the client, calling the Hive compiler of the Hadoop cluster to compile the batch job, to obtain the HQL statement corresponding to the batch job.

In addition, to achieve the above object, the present invention also provides a Hadoop-based data updating device, comprising:

A compiling module, configured to compile the batch job in the Hadoop cluster after detecting that the Hadoop cluster has received the batch job sent by a client, to obtain the task statement corresponding to the batch job;

A parsing module, configured to parse the task statement in the data governance system to obtain the logical relationships among the database tables that the task statement involves;

An updating module, configured to update the lineage of the corresponding database tables in the preset graph database according to the logical relationships.

In addition, to achieve the above object, the present invention also provides a Hadoop-based data updating system, comprising a memory, a processor, and a Hadoop-based data updating program that is stored on the memory and can run on the processor; when executed by the processor, the program implements the steps of the Hadoop-based data updating method described above.

In addition, to achieve the above object, the present invention also provides a computer-readable storage medium on which a Hadoop-based data updating program is stored; when executed by a processor, the program implements the steps of the Hadoop-based data updating method described above.

In the present invention, after it is detected that the Hadoop cluster has received a batch job sent by a client, the batch job is compiled in the Hadoop cluster to obtain the corresponding task statement; the task statement is parsed in the data governance system to obtain the logical relationships among the database tables that the statement involves; and the lineage of the corresponding database tables in the preset graph database is updated according to those logical relationships. The lineage in the graph database is thus updated whenever a batch job is executed, which improves the accuracy of the lineage between database tables in the graph database.

Brief description of the drawings

Fig. 1 is a flow diagram of the first embodiment of the Hadoop-based data updating method of the present invention;

Fig. 2 is a flow diagram of the second embodiment of the Hadoop-based data updating method of the present invention;

Fig. 3 is a functional module diagram of a preferred embodiment of the Hadoop-based data updating device of the present invention;

Fig. 4 is a structural diagram of the hardware operating environment involved in the embodiments of the present invention.

The realization of the object, the functional features and the advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.

Detailed description of the embodiments

It should be understood that the specific embodiments described herein merely illustrate the present invention and are not intended to limit it.

The present invention provides a Hadoop-based data updating method. Referring to Fig. 1, Fig. 1 is a flow diagram of the first embodiment of the Hadoop-based data updating method of the present invention.

The embodiments of the present invention provide embodiments of the Hadoop-based data updating method. It should be noted that although a logical order is shown in the flow diagrams, in some cases the steps shown or described may be executed in a different order.

The technical terms used in the embodiments of the present invention are explained below:

1. JanusGraph: an open-source distributed graph database with good scalability; on a multi-machine cluster it can store and query graph data with tens of billions of vertices and edges. JanusGraph is a transactional database that supports complex real-time graph traversals executed by large numbers of concurrent users.

2. Hadoop: a distributed system infrastructure developed by the Apache Foundation; a software framework for the distributed processing of large amounts of data in a reliable, efficient and scalable way.

3. HDFS: the Hadoop Distributed File System. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware; it provides high-throughput access to application data and suits applications with very large data sets.

4. Metadata: also called intermediary data or relay data; data that describes data (data about data). It mainly describes the properties of data and supports functions such as indicating storage location, recording history, resource lookup and file records.

5. Binlog: the binlog log records all statements that have updated data or could potentially have updated data. The statements are saved as "events" that describe every data change occurring in a database.

6. HQL: short for HiveQL, a language similar to SQL (Structured Query Language). It is compatible with most SQL syntax but does not fully support the SQL standard.

7. CANAL: an open-source project from Alibaba, developed in pure Java. It parses database incremental logs to provide incremental data subscription and consumption; at present it mainly supports MySQL, a relational database management system (MariaDB is also supported).

8. MapReduce: a programming model on the Hadoop platform for parallel computation over large data sets (larger than 1 TB). Its role is to group a mass of unordered data according to certain features and then process it to obtain the final result. Map faces unordered, unrelated data; it parses each record and extracts a key and a value. Reduce then aggregates the data produced by Map to obtain the final result.

9. Graph database: a type of NoSQL (non-relational) database that applies graph theory to store relationship information between entities.

10. Hive: a Hadoop-based data warehouse tool that can map structured data files to database tables and provides simple SQL query functionality; it can convert SQL statements into MapReduce tasks for execution.

The Hadoop-based data updating method includes:

Step S10: after detecting that the Hadoop cluster has received a batch job sent by a client, compile the batch job in the Hadoop cluster to obtain the task statement corresponding to the batch job.

The Hadoop cluster checks, in real time or at regular intervals, whether it has received a batch job sent by a client. The interval is set as needed and is not specifically limited in this embodiment. When a client needs to view or update data, the user of the client can trigger a batch job manually, or a scheduled batch job can be configured in the client. Updating data includes, but is not limited to, modifying existing data, adding new data and deleting existing data. After the Hadoop cluster receives a batch job sent by the client, it compiles the received batch job to obtain the corresponding task statement and sends the task statement to the data governance system. Further, while the Hadoop cluster has not received a batch job from a client, it continues to check whether one has arrived.

Further, step S10 includes:

Step a: after detecting that the Hadoop cluster has received a batch job sent by a client, call the Hive compiler of the Hadoop cluster to compile the batch job, to obtain the HQL statement corresponding to the batch job.

Specifically, the Hadoop cluster is provided with a Hive compiler, an execution engine and a listener; the listener is a hook listener. After the Hadoop cluster receives the batch job sent by the client, it compiles the batch job with its built-in Hive compiler to obtain the HQL (HiveQL) statement corresponding to the batch job. It can be understood that the HQL statement is the task statement corresponding to the batch job. It should be noted that the Hadoop cluster can also compile the batch job with its built-in Hive compiler to obtain a corresponding SQL statement.

After the Hadoop cluster obtains the HQL statement corresponding to the batch job through the Hive compiler, it submits the HQL statement to the execution engine. The listener of the Hadoop cluster intercepts each submitted HQL statement and sends the intercepted statement to the data governance system.
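The hook mechanism described above can be illustrated with a minimal sketch. It does not use the Hive hook API; `ExecutionEngine`, `register_hook` and `submit` are hypothetical names introduced here only to show how a listener might observe every statement submitted to the execution engine and forward it elsewhere.

```python
class ExecutionEngine:
    """Hypothetical stand-in for the execution engine.

    Registered hooks are called with each statement before it is executed.
    """

    def __init__(self):
        self.hooks = []
        self.executed = []

    def register_hook(self, hook):
        # hook: any callable taking the statement text
        self.hooks.append(hook)

    def submit(self, statement):
        for hook in self.hooks:
            hook(statement)              # e.g. forward to data governance
        self.executed.append(statement)  # stand-in for real execution


# Usage: the list plays the role of the data governance system's inbox.
governance_inbox = []
engine = ExecutionEngine()
engine.register_hook(governance_inbox.append)
engine.submit("INSERT INTO c SELECT * FROM d")
```

In this sketch the hook runs synchronously; whether the real listener forwards statements synchronously or asynchronously is not stated in the text.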

Step S20: parse the task statement in the data governance system to obtain the logical relationships among the database tables that the task statement involves.

After the data governance system receives the task statement sent by the Hadoop cluster, it parses the task statement to obtain the logical relationships among the database tables that the statement involves. It should be noted that the task statement refers to the data relevant to the batch job; this data may reside in different database tables, and certain logical relationships exist between those tables. For example, if data a resides in database table A but is obtained by processing data from database table B, then a logical relationship exists between database table A and database table B.

Further, step S20 includes:

Step b: parse the task statement in the data governance system to obtain the database tables corresponding to the task statement.

Step c: determine the source table and the target table among the database tables corresponding to the task statement, and determine the logical relationships among those database tables according to the source table and the target table.

Specifically, the data governance system parses the task statement to obtain the database tables it involves, determines which of them are source tables and which are target tables, and determines the logical relationships among the tables accordingly. It can be understood that a source table is an upstream table and a target table is a downstream table; that is, the data in the target table is derived from the source table. It should be noted that the task statement contains the names of the database tables involved, together with logic keywords that express the logical relationships between those table names; the source table and the target table can be determined from these logic keywords. For example, if the task statement contains "database table C from database table D", it can be determined that database table C is the target table and database table D is the source table.
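As an illustration of steps b and c, the following sketch extracts source and target tables from an INSERT ... SELECT style statement. It is a simplification under stated assumptions: the regular expressions and the function names `extract_tables` and `logical_relations` are introduced here for illustration, only `INSERT INTO/OVERWRITE`, `FROM` and `JOIN` are treated as logic keywords, and subqueries, CTEs and comments are not handled. A production system would more plausibly rely on a full HiveQL parser.

```python
import re

# Simplified keyword patterns (assumption: table names are word characters and dots).
TARGET_RE = re.compile(
    r"\binsert\s+(?:into|overwrite)\s+(?:table\s+)?([\w.]+)", re.IGNORECASE
)
SOURCE_RE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def extract_tables(statement):
    """Return (source_tables, target_tables) named in one statement."""
    targets = list(dict.fromkeys(TARGET_RE.findall(statement)))
    sources = [t for t in SOURCE_RE.findall(statement) if t not in targets]
    # dict.fromkeys removes duplicates while preserving first-seen order.
    return list(dict.fromkeys(sources)), targets


def logical_relations(statement):
    """Return (source, target) pairs: each target is derived from each source."""
    sources, targets = extract_tables(statement)
    return [(source, target) for target in targets for source in sources]


# Usage
stmt = ("INSERT OVERWRITE TABLE mart.c "
        "SELECT a.id FROM dw.d a JOIN dw.e b ON a.id = b.id")
relations = logical_relations(stmt)
```

A pure query such as `SELECT * FROM dw.d` yields a source table but no target table, so it produces no (source, target) pair in this sketch.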

Step S30: update the lineage of the corresponding database tables in the preset graph database according to the logical relationships.

After the data governance system obtains the logical relationships among the database tables that the task statement involves, it updates the lineage of the corresponding database tables according to those logical relationships. It should be noted that the logical relationships identify how the database tables involved in the task statement relate to one another. In this embodiment, the graph database may be a JanusGraph graph database, or a graph database such as Neo4j, ImageNet or HugeGraph. Neo4j is a high-performance NoSQL graph database that stores structured data in a network rather than in tables. The ImageNet project is a large visual database for research on visual object recognition software. HugeGraph is an easy-to-use, efficient, general-purpose open-source graph database system; it implements the Apache TinkerPop3 framework, is fully compatible with the Gremlin query language, and has a complete toolchain that helps users easily build applications and products on top of a graph database.

It can be understood that a task statement may or may not change the lineage of database tables in the graph database. For example, when the task statement is a data query statement, the lineage of database tables in the graph database does not change. The graph database can still update the lineage of the corresponding database tables according to the logical relationships among the tables involved in the query statement; the lineage of each database table simply remains the same before and after the update.
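The behaviour described above, where reapplying known relationships leaves the lineage unchanged, corresponds to an idempotent upsert of edges. The sketch below uses an in-memory set as a stand-in for the graph database; `LineageGraph` and its `update` method are hypothetical names, and no JanusGraph or Gremlin API is used.

```python
class LineageGraph:
    """In-memory stand-in for the graph database.

    Tables are vertices; each (source, target) pair is a directed
    'target is derived from source' edge.
    """

    def __init__(self):
        self.edges = set()

    def update(self, relations):
        """Upsert edges and return only those that were actually new."""
        new_edges = set(relations) - self.edges
        self.edges |= new_edges
        return new_edges


# Usage: the second call changes nothing because the edge already exists.
graph = LineageGraph()
first = graph.update([("table_d", "table_c")])
second = graph.update([("table_d", "table_c")])
```

This sketch only adds edges. Removing lineage that is no longer valid is not described in the text and is left out here.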

Further, to avoid updating the lineage of database tables in the graph database when the batch job is a data query task, the Hadoop cluster determines, on receiving a batch job, whether it is a data query task. If it is, the Hadoop cluster does not send the corresponding task statement to the data governance system, and the lineage of database tables in the graph database therefore does not need to be updated.

In this embodiment, after it is detected that the Hadoop cluster has received a batch job sent by a client, the batch job is compiled in the Hadoop cluster to obtain the corresponding task statement; the task statement is parsed in the data governance system to obtain the logical relationships among the database tables that the statement involves; and the lineage of the corresponding database tables in the preset graph database is updated according to those logical relationships. The lineage in the graph database is thus updated whenever a batch job is executed, which improves the accuracy of the lineage between database tables in the graph database.

Further, a second embodiment of the Hadoop-based data updating method of the present invention is proposed.

The second embodiment of the Hadoop-based data updating method differs from the first embodiment in that, if the task statement is a data update statement, then referring to Fig. 2, the Hadoop-based data updating method further includes:

Step S40: process the data corresponding to the batch job in the Hadoop cluster to obtain the processed data.

It should be noted that the task statement contains type keywords, from which the statement type can be determined. For example, when the task statement contains a type keyword that expresses an update, such as update, add or delete, the task statement can be determined to be a data update statement. When the task statement contains a type keyword that expresses a query or retrieval, such as search or gain, the task statement can be determined to be a data query statement.
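The keyword-based classification above could be sketched as follows. The keyword sets are copied from the examples in the text; the function name `classify_statement`, the whole-word matching rule and the choice to let update keywords take precedence over query keywords are assumptions made for illustration. A real HiveQL deployment would likely match actual statement keywords instead.

```python
import re

# Example type keywords taken from the description above.
UPDATE_KEYWORDS = {"update", "add", "delete"}
QUERY_KEYWORDS = {"search", "gain"}


def classify_statement(statement):
    """Return 'update', 'query' or 'unknown' based on whole-word keywords."""
    words = set(re.findall(r"[a-z_]+", statement.lower()))
    if words & UPDATE_KEYWORDS:   # assumption: update keywords win if both occur
        return "update"
    if words & QUERY_KEYWORDS:
        return "query"
    return "unknown"


# Usage
kind = classify_statement("UPDATE account SET balance = 0")
```

Matching whole words avoids misclassifying identifiers that merely contain a keyword, such as a column named `updated_at`.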

After determining that the task statement is a data update statement, the Hadoop cluster processes the data corresponding to the batch job to obtain the processed data. Specifically, the Hadoop cluster can use MapReduce computation to process the data corresponding to the batch job into a specific format, for example into fixed-length data or into data of a specific data type. The data corresponding to the batch job may be newly added data, a modification to the metadata stored in the relational database associated with the Hadoop cluster, or a modification to the data stored in HDFS. It should be noted that in a Hadoop cluster the data itself is stored in HDFS, whereas the metadata of that data is not stored on Hadoop's own HDFS but in a traditional relational database such as MySQL.

Step S50: update the metadata database according to the processed data to obtain the updated metadata database.

After the Hadoop cluster obtains the processed data, it can update the data stored in HDFS according to the processed data, and update the metadata database that stores the metadata according to the processed data, to obtain the updated metadata database. If the database that stores the metadata is MySQL, the Hadoop cluster updates MySQL according to the processed data.

Step S60: obtain, by the data governance system, the updated metadata from the updated metadata database, and obtain the processed data and the table name of the database table that holds the processed data.

After the data governance system detects that the metadata database has been updated, it obtains the updated metadata from the updated metadata database, obtains the processed data from the HDFS of the Hadoop cluster, and obtains the table name of the database table that holds the processed data. It should be noted that in the Hadoop cluster every piece of data is stored in some database table, and each database table has a table name that uniquely identifies it.

Further, the step of obtaining, by the data governance system, the updated metadata from the updated metadata database includes:

Step d: obtain the monitoring log of the updated metadata database through a preset listener in the data governance system.

Step e: parse the monitoring log to obtain the target keywords in the monitoring log.

Step f: obtain the updated metadata from the updated metadata database according to the target keywords.

Further, the data governance system deploys a Binlog listener on the metadata database; specifically, it uses the CANAL framework to deploy the Binlog listener. The data governance system uses the Binlog listener to obtain the monitoring log of the updated metadata database, which is the Binlog log, parses the monitoring log to obtain the target keywords in it, and obtains the updated metadata from the updated metadata database according to the target keywords. The target keywords are keywords such as update, add and delete. In this embodiment, the target keywords and the type keywords may be the same or different.

CANAL works as follows: 1. CANAL simulates the interaction protocol of a MySQL slave, presents itself as a MySQL slave, and sends a dump request to the MySQL master; the MySQL master receives the dump request and starts pushing the binary log (Binlog) to the slave (that is, CANAL); CANAL parses the binary log objects (originally a byte stream). 2. The binary log is parsed with the open-source open-replicator; Open Replicator is a MySQL binlog parser written in Java. 3. CANAL maintains an event store, which can be kept in memory, in a file or in ZooKeeper. 4. CANAL maintains the state of its clients; at any one time, an instance can be consumed by only one consumer.

Further, the table name of the database table corresponding to the updated metadata can also be parsed from the monitoring log. The table name corresponding to the updated metadata is the table name of the processed data that corresponds to that metadata. That is, through the monitoring log the data governance system can obtain the updated metadata and determine the table name of the database table that holds the processed data.
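Steps d to f, together with the table-name lookup, can be sketched as a filter over parsed log events. The event layout used here (dictionaries with `type`, `table` and `row` fields) is a hypothetical simplification, not the actual Binlog or CANAL message format, and `changed_tables` is a name introduced for illustration.

```python
# Target keywords as listed in the description above.
TARGET_KEYWORDS = {"update", "add", "delete"}


def changed_tables(events):
    """Group changed rows by table name for events whose type is a target keyword.

    events: iterable of dicts such as
            {"type": "update", "table": "tbl_meta", "row": {"id": 1}}
    (hypothetical layout; a real listener would decode Binlog events first).
    """
    changes = {}
    for event in events:
        if event.get("type", "").lower() in TARGET_KEYWORDS:
            changes.setdefault(event["table"], []).append(event.get("row", {}))
    return changes


# Usage
log = [
    {"type": "update", "table": "tbl_meta", "row": {"id": 1}},
    {"type": "query", "table": "tbl_other", "row": {"id": 9}},
    {"type": "DELETE", "table": "tbl_meta", "row": {"id": 2}},
]
updated = changed_tables(log)
```

The keys of the returned dictionary are the table names of the changed tables, and the values are the corresponding changed rows.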

Step S70: update the database table corresponding to the table name in the graph database according to the updated metadata and the processed data, and determine the updated database table as the upstream database table.

It can be understood that every piece of metadata corresponds to a database table and therefore to a table name. After the data governance system obtains the processed data, the table name and the updated metadata, it updates the database table corresponding to the table name in the graph database according to the updated metadata and the processed data, and determines the updated database table as the upstream database table.

It should be noted that steps S40, S50, S60 and S70 may be executed before steps S20 and S30, after steps S20 and S30, or between step S20 and step S30.

Step S80: determine the downstream database table corresponding to the upstream database table according to the lineage.

Step S90: update the downstream database table according to the updated metadata and the processed data.

After the data governance system determines the upstream database table in the graph database, it determines the corresponding downstream database table according to the lineage in the graph database, and updates the downstream database table according to the updated metadata and the processed data. It should be noted that the data in database tables that share a lineage relationship is interdependent. Therefore, when data in the upstream database table changes, the downstream database tables that have a lineage relationship with it are affected. To keep the data in the upstream and downstream database tables consistent, the downstream database table needs to be updated according to the updated metadata and the processed data.
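Determining the downstream tables from the lineage amounts to a graph traversal starting at the upstream table. The sketch below uses a breadth-first search over (source, target) edges. Following the lineage transitively, so that tables derived from downstream tables are also returned, is an assumption made here; the text only states that the downstream table is determined from the lineage. The function name `downstream_tables` is hypothetical.

```python
from collections import deque


def downstream_tables(edges, upstream):
    """Return all tables reachable from `upstream`, in breadth-first order.

    edges: iterable of (source, target) lineage pairs.
    """
    children = {}
    for source, target in edges:
        children.setdefault(source, []).append(target)

    seen = {upstream}          # guards against cycles in the lineage
    order = []
    queue = deque([upstream])
    while queue:
        for child in children.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# Usage
lineage = [("a", "b"), ("b", "c"), ("a", "d"), ("c", "a")]
affected = downstream_tables(lineage, "a")
```

Breadth-first order returns direct dependants before indirect ones, which suits updating tables closest to the change first.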

The present embodiment is by corresponding to table name and updated metadata more according to the data after processing, the data after processing Database table in new graphic data base, obtains updated database table, and updated database table is determined as upstream Database table, according to after processing data and updated metadata updates there are the downstreams of genetic connection with upstream data library table Database table accurately maintains the data consistency in upstream data library table and downstream data library table in real time.

Further, a third embodiment of the Hadoop-based data update method of the present invention is proposed.

The third embodiment of the Hadoop-based data update method differs from the second embodiment in that the Hadoop-based data update method further includes:

Step g: send prompt information to the client corresponding to the downstream database table, so that the client prompts the user, according to the prompt information, that the upstream database table corresponding to the downstream database table has been updated.

Step h: if an update instruction sent by the client corresponding to the downstream database table is received, update the downstream database table according to the updated metadata and the processed data.

After the data governance system determines the downstream database table, it generates prompt information, sends the prompt information to the client corresponding to the downstream database table, and detects whether an update instruction sent by that client is received. After the client corresponding to the downstream database table receives the prompt information, it outputs the prompt information to notify the user of the downstream database table that the corresponding upstream database table has been updated. This embodiment does not limit the manner in which the prompt information is output. The user of the downstream database table can then trigger an update instruction in the display interface of the client. After the client detects the update instruction, it sends the update instruction to the data governance system. When the data governance system receives the update instruction sent by the client corresponding to the downstream database table, it updates the downstream database table according to the updated metadata and the processed data.

In this embodiment, after the upstream database table is updated, prompt information is sent to the client corresponding to the downstream database table, and the downstream database table is updated only after an update instruction triggered by its user is received. The user of the downstream database table can thus decide independently whether to update it.
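The prompt-then-confirm flow of steps g and h can be sketched as follows. This is only an illustrative Python model under stated assumptions: the class `DownstreamUpdater`, its method names, and the in-memory lists and dictionaries that stand in for the client connection and the downstream tables are hypothetical and do not appear in the specification.

```python
class DownstreamUpdater:
    """Holds upstream changes until the downstream table's user confirms them."""

    def __init__(self):
        self.pending = {}  # downstream table -> (metadata, data) awaiting confirmation
        self.sent = []     # prompt messages "sent" to clients (stand-in for a real channel)
        self.tables = {}   # downstream table -> last applied (metadata, data)

    def notify(self, downstream, upstream, metadata, data):
        """Step g: record the pending change and send prompt information."""
        self.pending[downstream] = (metadata, data)
        self.sent.append(f"Upstream table {upstream} of {downstream} has been updated")

    def on_update_instruction(self, downstream):
        """Step h: apply the pending change once an update instruction arrives."""
        change = self.pending.pop(downstream, None)
        if change is None:
            return False  # nothing to apply for this table
        self.tables[downstream] = change
        return True


updater = DownstreamUpdater()
updater.notify("dws_daily_sales", "dwd_orders", {"columns": ["day", "amount"]}, [("2019-05-27", 10)])
applied = updater.on_update_instruction("dws_daily_sales")
print(updater.sent[0], applied)
```

Keeping the change in a pending map until the instruction arrives is one simple way to ensure the downstream table is left untouched if the user never confirms.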

Further, a fourth embodiment of the Hadoop-based data update method of the present invention is proposed.

The fourth embodiment of the Hadoop-based data update method differs from the first, second, or third embodiment in that, if the task statement is a data query statement, the Hadoop-based data update method further includes:

Step i: obtain the target data corresponding to the data query statement in the Hadoop cluster.

Step j: send the target data to the client corresponding to the batch-run task.

If the task statement is determined to be a data query statement, the Hadoop cluster parses the data query statement to obtain the target table name of the database table corresponding to the data query statement, obtains the target data corresponding to the data query statement from HDFS according to the target table name, and sends the obtained target data to the client corresponding to the batch-run task.

In this embodiment, the target data corresponding to the data query statement is obtained in the Hadoop cluster and sent to the client corresponding to the batch-run task, without waiting for the task scheduling platform to forward the client's data query request, which improves the efficiency of querying data in the Hadoop cluster.

In addition, referring to Fig. 3, the present invention also provides a Hadoop-based data update device, which includes:

a compiling module 10, configured to compile, in the Hadoop cluster, the batch-run task sent by a client after detecting that the Hadoop cluster has received it, to obtain the task statement corresponding to the batch-run task;

a parsing module 20, configured to parse the task statement in the data governance system to obtain the logical relationship between the database tables corresponding to the task statement;

an update module 30, configured to update the lineage relationship of the corresponding database tables in the preset graph database according to the logical relationship.

Further, if the task statement is a data update statement, the Hadoop-based data update device further includes:

a processing module, configured to process, in the Hadoop cluster, the data corresponding to the batch-run task to obtain the processed data;

The update module 30 is further configured to update the metadata database according to the processed data to obtain the updated metadata database; to update the database table corresponding to the table name in the graph database according to the updated metadata and the processed data; and to update the downstream database table according to the updated metadata and the processed data;

The Hadoop-based data update device further includes:

a first obtaining module, configured to obtain the updated metadata from the updated metadata database through the data governance system, and to obtain the processed data and the table name of the database table in which the processed data is located;

a determining module, configured to determine the updated database table to be the upstream database table, and to determine the downstream database table corresponding to the upstream database table according to the lineage relationship.

Further, the Hadoop-based data update device further includes:

a first sending module, configured to send prompt information to the client corresponding to the downstream database table, so that the client prompts the user, according to the prompt information, that the upstream database table corresponding to the downstream database table has been updated;

The update module 30 is further configured to update the downstream database table according to the updated metadata and the processed data if an update instruction sent by the client corresponding to the downstream database table is received.

Further, the first obtaining module includes:

an obtaining unit, configured to obtain the listening log of the updated metadata database through a listener program preset in the data governance system;

a first parsing unit, configured to parse the listening log;

The obtaining unit is further configured to obtain the target keywords in the listening log, and to obtain the updated metadata from the updated metadata database according to the target keywords.
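The work of the obtaining unit and the first parsing unit can be pictured with a short Python sketch. The specification does not define the format of the listening log or the set of target keywords, so the whitespace-separated log format with `key=value` fields, the keyword names in `TARGET_KEYWORDS`, and the function `extract_changes` are all assumptions made purely for illustration.

```python
# Assumed keywords marking metadata changes; the specification does not list them.
TARGET_KEYWORDS = ("CREATE_TABLE", "ALTER_TABLE", "DROP_TABLE")


def extract_changes(log_lines, keywords=TARGET_KEYWORDS):
    """Return (keyword, table name) pairs for log lines that contain a target keyword."""
    changes = []
    for line in log_lines:
        tokens = line.split()
        keyword = next((kw for kw in keywords if kw in tokens), None)
        if keyword is None:
            continue  # the line does not describe a metadata change
        # Collect key=value fields such as "table=daily_sales" (assumed layout).
        fields = dict(tok.split("=", 1) for tok in tokens if "=" in tok)
        changes.append((keyword, fields.get("table")))
    return changes


# Hypothetical listening log of the metadata database.
log = [
    "2019-05-27 10:00:01 ALTER_TABLE db=dw table=daily_sales",
    "2019-05-27 10:00:02 HEARTBEAT",
    "2019-05-27 10:00:03 CREATE_TABLE db=dw table=customer_stats",
]
changes = extract_changes(log)
print(changes)
```

The table names recovered this way would then be used to look up the updated metadata in the metadata database.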

Further, if the task statement is a data query statement, the Hadoop-based data update device further includes:

a second obtaining module, configured to obtain the target data corresponding to the data query statement in the Hadoop cluster;

a second sending module, configured to send the target data to the client corresponding to the batch-run task.

Further, the parsing module 20 further includes:

a second parsing unit, configured to parse the task statement in the data governance system to obtain the database tables corresponding to the task statement;

a determining unit, configured to determine the source table and the target table among the database tables corresponding to the task statement, and to determine the logical relationship between the database tables corresponding to the task statement according to the source table and the target table.
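To make the source-table and target-table idea concrete, the following Python sketch extracts them from a simple `INSERT ... SELECT` statement and returns the resulting (source, target) pairs. It is a simplified illustration, not the parsing method of the invention: the regular expressions, the function name `table_lineage`, and the example statement are assumptions, and subqueries, CTEs, and comments are deliberately not handled.

```python
import re

# Target table: the table written by INSERT INTO / INSERT OVERWRITE [TABLE].
TARGET_RE = re.compile(r"\binsert\s+(?:overwrite|into)\s+(?:table\s+)?([\w.]+)", re.IGNORECASE)
# Source tables: names that directly follow FROM or JOIN.
SOURCE_RE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def table_lineage(statement):
    """Return a list of (source table, target table) pairs for one statement."""
    target = TARGET_RE.search(statement)
    if target is None:
        return []  # e.g. a pure query writes no table, so it adds no lineage edge
    sources = []
    for name in SOURCE_RE.findall(statement):
        if name not in sources:  # keep first-seen order, drop duplicates
            sources.append(name)
    return [(src, target.group(1)) for src in sources]


# Hypothetical task statement.
hql = """
INSERT OVERWRITE TABLE dw.daily_sales
SELECT o.day, SUM(o.amount)
FROM ods.orders o
JOIN ods.customers c ON o.customer_id = c.id
GROUP BY o.day
"""
edges = table_lineage(hql)
print(edges)
```

Each returned pair corresponds to one edge that could be written to the graph database as a lineage relationship from the source table to the target table.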

Further, the compiling module 10 is further configured to call the Hive compiler of the Hadoop cluster to compile the batch-run task after detecting that the Hadoop cluster has received the batch-run task sent by a client, to obtain the HQL statement corresponding to the batch-run task.

It should be noted that the embodiments of the Hadoop-based data update device are essentially the same as the embodiments of the Hadoop-based data update method described above, and are not described in detail here.

In addition, the present invention also provides a Hadoop-based data update system. As shown in Fig. 4, Fig. 4 is a structural schematic diagram of the hardware operating environment involved in an embodiment of the present invention.

It should be noted that Fig. 4 may be a structural schematic diagram of the hardware operating environment of the Hadoop-based data update system. The Hadoop-based data update system of the embodiment of the present invention may be a terminal device such as a PC or a portable computer.

As shown in Fig. 4, the Hadoop-based data update system may include a processor 1001 (such as a CPU), a memory 1005, a user interface 1003, a network interface 1004, and a communication bus 1002. The communication bus 1002 provides the connections and communication between these components. The user interface 1003 may include a display and an input unit such as a keyboard; optionally, the user interface 1003 may also include standard wired and wireless interfaces. The network interface 1004 may optionally include standard wired and wireless interfaces (such as a Wi-Fi interface). The memory 1005 may be a high-speed RAM, or a non-volatile memory such as a disk storage. The memory 1005 may optionally also be a storage device independent of the aforementioned processor 1001.

Optionally, the Hadoop-based data update system may also include a camera, an RF (Radio Frequency) circuit, sensors, an audio circuit, a WiFi module, and the like.

Those skilled in the art will understand that the structure of the Hadoop-based data update system shown in Fig. 4 does not limit the system, which may include more or fewer components than illustrated, combine certain components, or use a different arrangement of components.

As shown in Fig. 4, the memory 1005, as a computer storage medium, may include an operating system, a network communication module, a user interface module, and a Hadoop-based data update program. The operating system is a program that manages and controls the hardware and software resources of the Hadoop-based data update system, and supports the running of the Hadoop-based data update program and other software or programs.

In the Hadoop-based data update system shown in Fig. 4, the user interface 1003 is mainly used to connect to a client and exchange data with it; the network interface 1004 is mainly used to connect to a background server and exchange data with it; and the processor 1001 may be used to call the Hadoop-based data update program stored in the memory 1005 and execute the steps of the Hadoop-based data update method described above.

The specific embodiments of the Hadoop-based data update system of the present invention are essentially the same as the embodiments of the Hadoop-based data update method described above, and are not repeated here.

In addition, an embodiment of the present invention also proposes a computer-readable storage medium on which a Hadoop-based data update program is stored. When the Hadoop-based data update program is executed by a processor, the steps of the Hadoop-based data update method described above are implemented.

The specific embodiments of the computer-readable storage medium of the present invention are essentially the same as the embodiments of the Hadoop-based data update method described above, and are not repeated here.

It should be noted that, in this document, the terms "include", "comprise", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or device that includes that element.

The serial numbers of the above embodiments of the present invention are for description only and do not represent the relative merits of the embodiments.

Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods described in the embodiments of the present invention.

The above are only preferred embodiments of the present invention and do not thereby limit the patent scope of the invention. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, or applied directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims

1. A Hadoop-based data update method, characterized in that the Hadoop-based data update method includes the following steps:

After detecting that the Hadoop cluster has received a batch-run task sent by a client, compiling the batch-run task in the Hadoop cluster to obtain the task statement corresponding to the batch-run task;

Parsing the task statement in the data governance system to obtain the logical relationship between the database tables corresponding to the task statement;

2. The Hadoop-based data update method according to claim 1, characterized in that, if the task statement is a data update statement, then after the step of compiling the batch-run task in the Hadoop cluster after detecting that the Hadoop cluster has received the batch-run task sent by a client, to obtain the task statement corresponding to the batch-run task, the method further includes:

Obtaining updated metadata from the updated metadata database through the data governance system, and obtaining the processed data and the table name of the database table in which the processed data is located;

Updating the database table corresponding to the table name in the graph database according to the updated metadata and the processed data, and determining the updated database table to be the upstream database table;

After the step of updating the lineage relationship of the corresponding database tables in the preset graph database according to the logical relationship, determining the downstream database table corresponding to the upstream database table according to the lineage relationship;

3. The Hadoop-based data update method according to claim 2, characterized in that, after the step of determining the downstream database table corresponding to the upstream database table according to the lineage relationship, the method further includes:

Sending prompt information to the client corresponding to the downstream database table, so that the client prompts the user, according to the prompt information, that the upstream database table corresponding to the downstream database table has been updated;

If an update instruction sent by the client corresponding to the downstream database table is received, updating the downstream database table according to the updated metadata and the processed data.

4. The Hadoop-based data update method according to claim 2, characterized in that the step of obtaining updated metadata from the updated metadata database through the data governance system includes:

Obtaining the listening log of the updated metadata database through a listener program preset in the data governance system;

5. The Hadoop-based data update method according to claim 1, characterized in that, if the task statement is a data query statement, then after the step of compiling the batch-run task in the Hadoop cluster after detecting that the Hadoop cluster has received the batch-run task sent by a client, to obtain the task statement corresponding to the batch-run task, the method further includes:

6. The Hadoop-based data update method according to claim 1, characterized in that the step of parsing the task statement in the data governance system to obtain the logical relationship between the database tables corresponding to the task statement includes:

Parsing the task statement in the data governance system to obtain the database tables corresponding to the task statement;

Determining the source table and the target table among the database tables corresponding to the task statement, and determining the logical relationship between the database tables corresponding to the task statement according to the source table and the target table.

7. The Hadoop-based data update method according to any one of claims 1 to 6, characterized in that the step of compiling the batch-run task in the Hadoop cluster after detecting that the Hadoop cluster has received the batch-run task sent by a client, to obtain the task statement corresponding to the batch-run task, includes:

After detecting that the Hadoop cluster has received the batch-run task sent by a client, calling the Hive compiler of the Hadoop cluster to compile the batch-run task to obtain the HQL statement corresponding to the batch-run task.

8. A Hadoop-based data update device, characterized in that the Hadoop-based data update device includes:

A compiling module, configured to compile, in the Hadoop cluster, the batch-run task sent by a client after detecting that the Hadoop cluster has received it, to obtain the task statement corresponding to the batch-run task;

A parsing module, configured to parse the task statement in the data governance system to obtain the logical relationship between the database tables corresponding to the task statement;

An update module, configured to update the lineage relationship of the corresponding database tables in the preset graph database according to the logical relationship.

9. The Hadoop-based data update device according to claim 8, characterized in that, if the task statement is a data update statement, the Hadoop-based data update device further includes:

A processing module, configured to process, in the Hadoop cluster, the data corresponding to the batch-run task to obtain the processed data;

The update module is further configured to update the metadata database according to the processed data to obtain the updated metadata database; to update the database table corresponding to the table name in the graph database according to the updated metadata and the processed data; and to update the downstream database table according to the updated metadata and the processed data;

The Hadoop-based data update device further includes:

A first obtaining module, configured to obtain the updated metadata from the updated metadata database through the data governance system, and to obtain the processed data and the table name of the database table in which the processed data is located;

A determining module, configured to determine the updated database table to be the upstream database table, and to determine the downstream database table corresponding to the upstream database table according to the lineage relationship.

10. The Hadoop-based data update device according to claim 9, characterized in that the Hadoop-based data update device further includes:

A first sending module, configured to send prompt information to the client corresponding to the downstream database table, so that the client prompts the user, according to the prompt information, that the upstream database table corresponding to the downstream database table has been updated;

The update module is further configured to update the downstream database table according to the updated metadata and the processed data if an update instruction sent by the client corresponding to the downstream database table is received.

11. The Hadoop-based data update device according to claim 9, characterized in that the first obtaining module includes:

An obtaining unit, configured to obtain the listening log of the updated metadata database through a listener program preset in the data governance system;

A first parsing unit, configured to parse the listening log;

The obtaining unit is further configured to obtain the target keywords in the listening log, and to obtain the updated metadata from the updated metadata database according to the target keywords.

12. A Hadoop-based data update system, characterized in that the Hadoop-based data update system includes a memory, a processor, and a Hadoop-based data update program that is stored in the memory and can run on the processor, wherein the Hadoop-based data update program, when executed by the processor, implements the steps of the Hadoop-based data update method according to any one of claims 1 to 7.

13. A computer-readable storage medium, characterized in that a Hadoop-based data update program is stored on the computer-readable storage medium, and the Hadoop-based data update program, when executed by a processor, implements the steps of the Hadoop-based data update method according to any one of claims 1 to 7.