WO2020238597A1 - Hadoop-based data updating method, device, system and medium - Google Patents


Info

Publication number
WO2020238597A1
Authority
WO
WIPO (PCT)
Prior art keywords
hadoop
task
database table
database
data
Prior art date
Application number
PCT/CN2020/089637
Other languages
French (fr)
Chinese (zh)
Inventor
彭陈成 (Peng Chencheng)
张阳 (Zhang Yang)
Original Assignee
深圳前海微众银行股份有限公司 (Shenzhen Qianhai WeBank Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Qianhai WeBank Co., Ltd. (深圳前海微众银行股份有限公司)
Publication of WO2020238597A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23: Updating
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28: Databases characterised by their database models, e.g. relational or object models
    • G06F16/283: Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP

Definitions

  • This application relates to the field of financial technology (Fintech) data processing technology, and in particular to a data update method, device, system and medium based on Hadoop.
  • Financial technology (Fintech)
  • the Hadoop big data platform will notify the task scheduling platform, and the task scheduling platform will generate a corresponding notification and send it to the client corresponding to the downstream database table to trigger the reprocessing of the data in the downstream database table.
  • the task scheduling platform cannot determine all of the affected downstream database tables, resulting in an incomplete update of the lineage relationship between the upstream database table and the downstream database tables.
  • the lineage relationship between the upstream database table and the downstream database table is inaccurate, that is, the lineage relationships between the database tables in the graph database are inaccurate.
  • the main purpose of this application is to provide a Hadoop-based data update method, device, system and medium, aiming to solve the existing technical problem that the lineage relationships between database tables in the graph database are inaccurate when batch running tasks are executed.
  • the Hadoop-based data update method includes the following steps:
  • the lineage relationship of the corresponding database table in the preset graph database is updated according to the logical relationship.
  • after the batch running task is compiled in the Hadoop cluster and the task statement corresponding to the batch running task is obtained, the method further includes:
  • the downstream database table is updated according to the updated metadata and the processed data.
  • the method further includes:
  • the downstream database table is updated according to the updated metadata and the processed data.
  • the step of obtaining updated metadata in the updated metadata database through the data management system includes:
  • the task statement is a data query statement
  • after it is detected that the Hadoop cluster receives the batch running task sent by the client, and the batch running task is compiled in the Hadoop cluster to obtain the task statement corresponding to the batch running task, the method further includes:
  • the target data is sent to the client corresponding to the batch running task.
  • the step of parsing the task statement in the data management system to obtain the logical relationship between the task statement and each database table includes:
  • the step of compiling the batch running task in the Hadoop cluster to obtain the task statement corresponding to the batch running task includes:
  • after it is detected that the Hadoop cluster receives the batch running task sent by the client, the Hive compiler of the Hadoop cluster is called to compile the batch running task, and the HQL statement corresponding to the batch running task is obtained.
  • the present application also provides a Hadoop-based data update device, the Hadoop-based data update device includes:
  • the compiling module is used to compile the batch running task in the Hadoop cluster after detecting that the Hadoop cluster receives the batch running task sent by the client to obtain the task statement corresponding to the batch running task;
  • the parsing module is used to parse the task statement in the data management system to obtain the logical relationship between the task statement and each database table;
  • the update module is used to update the lineage relationship of the corresponding database table in the preset graph database according to the logical relationship.
  • this application also provides a Hadoop-based data update system.
  • the Hadoop-based data update system includes a memory, a processor, and a Hadoop-based data update program that is stored on the memory and can run on the processor,
  • and the Hadoop-based data update program implements the steps of the Hadoop-based data update method as described above when it is executed by the processor.
  • this application also provides a computer-readable storage medium that stores a Hadoop-based data update program, and the Hadoop-based data update program implements the steps of the Hadoop-based data update method as described above when it is executed by a processor.
  • In this application, the batch running task is compiled in the Hadoop cluster to obtain the task statement corresponding to the batch running task; the task statement is parsed in the data management system to obtain the logical relationship between the task statement and each database table; and the lineage relationship of the corresponding database table in the preset graph database is updated according to the logical relationship, which improves the accuracy of the lineage relationships between the database tables in the graph database.
  • Fig. 1 is a schematic flowchart of a first embodiment of a Hadoop-based data update method according to the present application
  • FIG. 2 is a schematic flowchart of a second embodiment of the Hadoop-based data update method of this application.
  • Fig. 3 is a functional schematic block diagram of a preferred embodiment of the Hadoop-based data update device of the present application
  • Fig. 4 is a schematic structural diagram of a hardware operating environment involved in a solution of an embodiment of the present application.
  • FIG. 1 is a schematic flowchart of a first embodiment of the data update method based on Hadoop in this application.
  • the embodiment of the application provides an embodiment of a Hadoop-based data update method. It should be noted that although a logical sequence is shown in the flowchart, in some cases the steps shown or described may be executed in an order different from the one here.
  • JanusGraph: an open-source distributed graph database. It has good scalability and, through a multi-machine cluster, can support the storage and querying of graph data with tens of billions of vertices and edges.
  • JanusGraph is a transactional database that supports a large number of users concurrently executing complex real-time graph traversals.
  • Hadoop: a distributed system infrastructure developed by the Apache Foundation. It is a software framework that can perform distributed processing of large amounts of data in a reliable, efficient, and scalable manner.
  • HDFS: the Hadoop Distributed File System.
  • HDFS is highly fault-tolerant, is designed to be deployed on low-cost hardware, and provides high-throughput access to application data, making it suitable for applications with large data sets.
  • Metadata: also known as intermediary data or relay data, it is data that describes data (data about data), mainly describing data properties, and is used to support functions such as indicating storage location, historical data, resource search, and file recording.
  • Binlog: the binlog records all statements that have updated data or may have updated data. The statements are stored in the form of "events", which describe the data changes that occur in a database.
  • HQL: short for HiveQL, a language similar to SQL (Structured Query Language) that is compatible with most SQL syntax but does not fully support the SQL standard.
  • SQL: Structured Query Language.
  • MapReduce: a programming model on the Hadoop platform for parallel operations on large-scale data sets (greater than 1 TB). Its function is to aggregate a mass of disordered data according to certain characteristics and then process it to obtain a final result. Map works on the disordered, unrelated input, parsing each record and extracting a key and a value from it; Reduce then aggregates the data produced by Map to obtain the final result.
  • Hive: a data warehouse tool based on Hadoop, which can map structured data files to database tables, provides simple SQL query functions, and can convert SQL statements into MapReduce tasks for execution.
  • The Hadoop-based data update method includes:
  • Step S10: after it is detected that the Hadoop cluster receives the batch running task sent by the client, compile the batch running task in the Hadoop cluster to obtain the task statement corresponding to the batch running task.
  • the Hadoop cluster detects in real time or regularly whether it receives batch tasks sent by the client.
  • the duration corresponding to the timing can be set according to specific needs, and there is no specific limitation on the duration corresponding to the timing in this embodiment.
  • the user of the client can manually trigger the batch running task, or set a regular batch running task in the client. Updating data includes but is not limited to modifying existing data, adding new data, or deleting existing data.
  • After the Hadoop cluster receives the batch running task sent by the client, the Hadoop cluster compiles the received batch running task to obtain the task statement corresponding to the batch running task, and sends the task statement to the data management system. Further, when the Hadoop cluster does not receive a batch running task sent by the client, the Hadoop cluster continues to detect whether it receives a batch running task sent by the client.
  • step S10 includes:
  • Step a: after it is detected that the Hadoop cluster receives the batch running task sent by the client, call the Hive compiler of the Hadoop cluster to compile the batch running task, and obtain the HQL statement corresponding to the batch running task.
  • In the Hadoop cluster, a Hive compiler, an execution engine, and a listener are provided, where the listener is a hook listener.
  • After the Hadoop cluster receives the batch running task sent by the client, the Hadoop cluster compiles the batch running task through its built-in Hive compiler and obtains the HQL (Hive Query Language) statement corresponding to the batch running task. It can be understood that the HQL statement is the task statement corresponding to the batch running task.
  • the Hadoop cluster can also compile the batch running task through its built-in hive compiler to obtain the SQL statement corresponding to the batch running task.
  • After the Hadoop cluster obtains the HQL statement corresponding to the batch running task through the Hive compiler, the Hadoop cluster submits the HQL statement to the execution engine. At this time, the listener of the Hadoop cluster can monitor each HQL statement, and the monitored HQL statements are sent to the data management system.
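  • The listener behaviour described above can be sketched as follows. This is an illustrative Python sketch only: the class, method names, and the list standing in for the data management system are assumptions, and a real Hive hook would be written in Java against Hive's hook interfaces.

```python
# Illustrative stand-in for the hook listener: it intercepts each
# compiled statement and forwards it to the data management system
# (forwarding is simulated here by appending to a list).
class StatementHook:
    def __init__(self, sink):
        self.sink = sink  # stand-in for the data management system

    def on_statement(self, hql: str) -> None:
        # Pure SELECT statements are queries and do not change lineage,
        # so they are not forwarded, matching the behaviour in the text.
        if not hql.lstrip().upper().startswith("SELECT"):
            self.sink.append(hql)

captured = []
hook = StatementHook(captured)
hook.on_statement("INSERT OVERWRITE TABLE c SELECT * FROM d")
hook.on_statement("SELECT * FROM c")  # query only: not forwarded
```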
  • Step S20: parse the task statement in the data management system to obtain the logical relationship between the task statement and each database table.
  • After the data management system receives the task statement sent by the Hadoop cluster, the data management system parses the task statement to obtain the logical relationship between the task statement and each database table. It should be noted that the task statement involves data related to the batch running task; these data may exist in different database tables, and there are certain logical relationships between those database tables. For example, a certain piece of data a is located in database table A, but data a needs to be processed into database table B; this indicates that database table A and database table B have a logical relationship.
  • step S20 includes:
  • Step b: parse the task statement in the data management system to obtain the database tables corresponding to the task statement.
  • Step c: determine the source table and the target table among the database tables corresponding to the task statement, and determine the logical relationship between the task statement and each database table according to the source table and the target table.
  • That is, the data management system parses the task statement, obtains the database tables corresponding to the task statement, determines the source table and the target table among those database tables, and determines the logical relationship between the task statement and each database table according to the source table and the target table.
  • The source table is an upstream table and the target table is a downstream table; that is, the data in the target table comes from the source table.
  • For example, if "database table C from database table D" appears in the task statement, it can be determined that database table C is the target table and database table D is the source table.
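  • As an illustrative sketch of steps b and c (the function name, the regular expressions, and the single-statement scope are assumptions for illustration, not the actual parser), the target and source tables can be recovered from an INSERT ... SELECT statement like this:

```python
import re

def source_and_target(hql: str):
    """Illustrative only: the table written to is the target (downstream)
    table; the tables read from are the source (upstream) tables."""
    target = re.search(r"INSERT\s+(?:INTO|OVERWRITE)\s+TABLE\s+(\w+)",
                       hql, re.IGNORECASE)
    sources = re.findall(r"(?:FROM|JOIN)\s+(\w+)", hql, re.IGNORECASE)
    return (target.group(1) if target else None, sources)

# Database table c is the target table; d and e are source tables.
tgt, srcs = source_and_target(
    "INSERT OVERWRITE TABLE c SELECT d.id FROM d JOIN e ON d.id = e.id")
```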
  • Step S30: update the lineage relationship of the corresponding database table in the preset graph database according to the logical relationship.
  • After the data management system obtains the logical relationship between the task statement and each database table, the data management system updates the lineage relationship of the corresponding database table according to the logical relationship. It should be noted that the logical relationships among the database tables involved in the task statement can be determined through the parsed logical relationship.
  • the graph database can be a JanusGraph graph database, or can be a graph database such as neo4j, ImageNet, and HugeGraph.
  • Neo4j is a high-performance NoSQL graph database that stores structured data in a graph network rather than in tables.
  • the ImageNet project is a large-scale visualization database for the research of visual object recognition software.
  • HugeGraph is an easy-to-use, efficient, and general-purpose open-source graph database system that implements the Apache TinkerPop3 framework and is fully compatible with the Gremlin query language. It has complete tool-chain components that help users easily build applications and products on top of graph databases.
  • The task statement may or may not update the lineage relationships of the database tables in the graph database.
  • If the task statement is a data query statement, the lineage relationships of the database tables in the graph database will not change; the graph database can still update the lineage relationship of the corresponding database table according to the logical relationship of the database tables involved in the data query statement, but in this case the lineage relationship of each database table in the graph database before the update is consistent with that after the update.
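  • A minimal in-memory sketch of step S30 follows. A production system would issue Gremlin traversals against JanusGraph; the dict-of-sets graph and the class name here are assumptions for illustration. Because the update is idempotent, replaying the logical relationship of a pure query statement leaves the lineage unchanged, as discussed above.

```python
class LineageGraph:
    """Stand-in for the preset graph database: maps each source table to
    the set of target tables directly downstream of it."""
    def __init__(self):
        self.downstream = {}

    def update(self, sources, target):
        # Record one lineage edge per source table; adding an existing
        # edge again is a no-op, so the update is idempotent.
        for s in sources:
            self.downstream.setdefault(s, set()).add(target)

g = LineageGraph()
g.update(["d"], "c")
g.update(["d"], "c")  # replay: lineage before and after is consistent
```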
  • After the Hadoop cluster receives the batch running task, it judges whether the batch running task is a data query task.
  • If the batch running task is a data query task, the Hadoop cluster does not send the task statement corresponding to the batch running task to the data management system, and in this case there is no need to update the lineage relationships of the database tables in the graph database.
  • In this embodiment, the batch running task is compiled in the Hadoop cluster to obtain the task statement corresponding to the batch running task; the task statement is parsed in the data management system to obtain the logical relationship between the task statement and each database table; and the lineage relationship of the corresponding database table in the preset graph database is updated according to the logical relationship. In this way, when the batch running task is executed, the lineage relationships of the corresponding database tables in the graph database are updated according to the logical relationships of the database tables corresponding to the batch running task, which improves the accuracy of the lineage relationships between the database tables in the graph database.
  • the Hadoop-based data update method further includes:
  • Step S40: process the data corresponding to the batch running task in the Hadoop cluster to obtain processed data.
  • The statement type of the task statement can be determined by type keywords. For example, when type keywords such as update, add, or delete appear in the task statement, it can be determined that the task statement is a data update statement; when type keywords indicating query and acquisition, such as search or get, appear in the task statement, it can be determined that the task statement is a data query statement.
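  • The type-keyword check described above can be sketched as follows (the keyword sets are illustrative assumptions, not an exhaustive list):

```python
def statement_type(stmt: str) -> str:
    # Classify a task statement by its leading type keyword.
    head = stmt.lstrip().split(None, 1)[0].upper()
    if head in {"UPDATE", "INSERT", "DELETE", "ALTER", "CREATE", "DROP"}:
        return "data update statement"
    if head in {"SELECT", "SHOW", "DESCRIBE"}:
        return "data query statement"
    return "unknown"
```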
  • the Hadoop cluster processes the data corresponding to the batch task to obtain the processed data.
  • the Hadoop cluster uses MapReduce calculations to process the data corresponding to the batch task into data in a specific format.
  • the data corresponding to the batch running task is processed into fixed-length data, or processed into data of a certain data type.
  • The data corresponding to the batch running task may be newly added data, or it may be a modification of the metadata stored in the relational database corresponding to the Hadoop cluster, or a modification of the data stored in HDFS.
  • The data itself is stored in HDFS.
  • The metadata of the data in HDFS is not stored in HDFS itself, but in a traditional relational database, such as MySQL.
  • Step S50 Update the metadata database according to the processed data to obtain the updated metadata database.
  • After the Hadoop cluster obtains the processed data, the Hadoop cluster updates the data stored in HDFS according to the processed data, and updates the metadata database storing the metadata according to the processed data to obtain the updated metadata database. If the database storing the metadata is MySQL, the Hadoop cluster updates MySQL based on the processed data.
  • Step S60 Obtain updated metadata from the updated metadata database through the data management system, and obtain the processed data and the table name of the database table where the processed data is located.
  • the data management system monitors updates to the metadata database.
  • the data management system obtains the updated metadata from the updated metadata database, obtains the processed data in the HDFS of the Hadoop cluster, and obtains the table name of the database table where the processed data is located. It should be noted that in a Hadoop cluster, every piece of data is stored in a certain database table, and each database table has a table name that uniquely identifies that database table.
  • the step of obtaining updated metadata from the updated metadata database through the data management system includes:
  • Step d Obtain the monitoring log of the updated metadata database through a preset monitoring program in the data management system.
  • Step e Parse the monitoring log to obtain the target keyword in the monitoring log.
  • Step f Obtain updated metadata in the updated metadata database according to the target keyword.
  • the data management system deploys the Binlog listener in the metadata database.
  • the data management system uses the CANAL framework to deploy the Binlog listener in the metadata database.
  • the data management system uses the Binlog listener to obtain the updated metadata.
  • the monitoring log is a Binlog log
  • the monitoring log is parsed, the target keyword in the monitoring log is obtained, and the updated metadata is obtained in the updated metadata database according to the target keyword.
  • the target keywords are keywords such as update, add, and delete.
  • the target keyword and the type keyword may be the same or different.
  • The working principle of CANAL is as follows: (1) CANAL imitates the interaction protocol of a MySQL slave, pretends to be a MySQL slave, and sends the dump protocol to the MySQL master; the MySQL master receives the dump request and starts to push the binary log (Binlog) to the slave (that is, CANAL), which parses the binary log objects (originally a byte stream). (2) CANAL uses open-replicator to parse the binary log; Open Replicator is a MySQL Binlog analysis program written in Java. (3) CANAL needs to maintain an Event Store, which can be held in Memory, File, or Zookeeper. (4) CANAL needs to maintain the state of the client, and an instance (process) can only have one consumer at a time.
  • The table name of the database table corresponding to the updated metadata can also be parsed out, and the table name corresponding to the updated metadata is the table name of the database table where the processed data is located.
  • That is, the data management system can obtain the updated metadata through the monitoring log and determine the table name of the database table where the processed data is located.
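  • Steps d to f can be sketched as follows. The log format and function name are invented for illustration; a real deployment would consume parsed Binlog events from the CANAL client rather than scan raw text.

```python
import re

def parse_monitoring_log(log_lines):
    """Scan monitoring-log lines for the target keywords and recover,
    per event, the operation and the database table it touched."""
    event_re = re.compile(
        r"\b(UPDATE|INSERT|DELETE)\b\s+(?:INTO\s+|FROM\s+)?(\w+)",
        re.IGNORECASE)
    events = []
    for line in log_lines:
        m = event_re.search(line)
        if m:
            events.append((m.group(1).upper(), m.group(2)))
    return events

events = parse_monitoring_log([
    "2020-05-14 10:00:01 UPDATE metadata_tbl SET location='/warehouse/c'",
    "2020-05-14 10:00:02 SELECT * FROM metadata_tbl",  # no target keyword
])
```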
  • Step S70 Update the database table corresponding to the table name in the graph database according to the updated metadata and the processed data, and determine the updated database table as an upstream database table.
  • each metadata has a corresponding database table, and therefore, each metadata has a corresponding table name.
  • After the data management system obtains the processed data, the table name, and the updated metadata, the data management system updates the database table corresponding to the table name in the graph database according to the updated metadata and the processed data, and determines the updated database table as the upstream database table.
  • step S40, step S50, step S60, and step S70 may be executed before step S20 and step S30, or may be executed after step S20 and step S30, or executed between step S20 and step S30.
  • Step S80: determine the downstream database table corresponding to the upstream database table according to the lineage relationship.
  • Step S90 Update the downstream database table according to the updated metadata and the processed data.
  • That is, the data management system determines the downstream database table corresponding to the upstream database table according to the lineage relationship in the graph database, and updates the downstream database table according to the updated metadata and the processed data.
  • The data in database tables with a lineage relationship has dependencies; therefore, when the upstream database table is updated, the downstream database tables that have a lineage relationship with the upstream database table will be affected.
  • The updated database table is obtained by updating the database table in the graph database according to the processed data, the table name corresponding to the processed data, and the updated metadata; the updated database table is determined as the upstream database table, and the downstream database tables that have a lineage relationship with the upstream database table are updated according to the processed data and the updated metadata, so that data consistency between the upstream database table and the downstream database tables is accurately maintained in real time.
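  • Step S80 can be sketched as a breadth-first walk over the lineage relationships (the dict representation standing in for the graph database is an assumption for illustration):

```python
from collections import deque

def downstream_tables(lineage, upstream):
    """Collect every downstream table transitively affected when the
    given upstream table is updated."""
    seen, queue = set(), deque([upstream])
    while queue:
        table = queue.popleft()
        for child in lineage.get(table, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# a feeds b and c; b feeds d, so updating a affects b, c and d.
lineage = {"a": ["b", "c"], "b": ["d"]}
affected = downstream_tables(lineage, "a")
```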
  • Step g Send prompt information to the client corresponding to the downstream database table, so that the client prompts the user according to the prompt information that the upstream database table corresponding to the downstream database table has been updated.
  • Step h If an update instruction sent by the client corresponding to the downstream database table is received, the downstream database table is updated according to the updated metadata and the processed data.
  • After the data management system determines the downstream database table, the data management system generates prompt information, sends the prompt information to the client corresponding to the downstream database table, and detects whether an update instruction sent by the client corresponding to the downstream database table is received.
  • After the client corresponding to the downstream database table receives the prompt information, it outputs the prompt information to inform the corresponding user, according to the prompt information, that the upstream database table corresponding to the downstream database table has been updated. In this embodiment, the output form of the prompt information is not restricted.
  • the user corresponding to the downstream database table can trigger an update instruction on the display interface of the client.
  • the client corresponding to the downstream database table detects the update instruction, it sends the update instruction to the data management system.
  • When the data management system receives the update instruction sent by the client corresponding to the downstream database table, the data management system updates the downstream database table according to the updated metadata and the processed data.
  • Prompt information is sent to the client corresponding to the downstream database table so that the user of that client can decide whether to update the downstream database table; after the update instruction triggered by that user is received, the downstream database table is updated. In this way, after the upstream database table is updated, the user corresponding to the downstream database table can independently decide whether to update the downstream database table.
  • the difference between the fourth embodiment of the Hadoop-based data update method and the first, second, or third embodiment is that, if the task statement is a data query statement, the Hadoop-based data update method further includes:
  • Step i Obtain target data corresponding to the data query sentence in the Hadoop cluster.
  • Step j Send the target data to the client corresponding to the batch running task.
  • the Hadoop cluster parses the data query statement to obtain the target table name of the database table corresponding to the data query statement, obtains the target data corresponding to the data query statement in HDFS according to the target table name, and sends the obtained target data to the client corresponding to the batch running task.
  • the target data corresponding to the data query statement is obtained in the Hadoop cluster and sent to the client corresponding to the batch running task, without waiting for the task scheduling platform to forward the client's data query request, which improves the efficiency of querying data in the Hadoop cluster.
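  • Steps i and j can be sketched as follows (the single-table FROM parsing and the dict standing in for HDFS are simplifying assumptions for illustration):

```python
import re

def run_query(statement, hdfs_tables):
    """Resolve the target table name from the data query statement and
    return its rows directly, without routing the request through a
    task scheduling platform."""
    m = re.search(r"\bFROM\s+(\w+)", statement, re.IGNORECASE)
    if not m:
        raise ValueError("no target table in query statement")
    return hdfs_tables.get(m.group(1), [])

tables = {"c": [(1, "x"), (2, "y")]}  # stand-in for data stored in HDFS
rows = run_query("SELECT * FROM c", tables)
```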
  • the Hadoop-based data update device includes:
  • the compiling module 10 is configured to, after detecting that the Hadoop cluster receives the batch running task sent by the client, compile the batch running task in the Hadoop cluster to obtain the task statement corresponding to the batch running task;
  • the parsing module 20 is used to parse the task statement in the data management system to obtain the logical relationship between the task statement and each database table;
  • the update module 30 is configured to update the lineage relationship of the corresponding database table in the preset graph database according to the logical relationship.
  • the Hadoop-based data update device further includes:
  • the processing module is used to process the data corresponding to the batch running task in the Hadoop cluster to obtain processed data;
  • the update module 30 is further configured to: update the metadata database according to the processed data to obtain the updated metadata database; update the database table corresponding to the table name in the graph database according to the updated metadata and the processed data; and update the downstream database table according to the updated metadata and the processed data;
  • the Hadoop-based data update device further includes:
  • the first obtaining module is used to obtain updated metadata in the updated metadata database through the data management system, and obtain the processed data and the table name of the database table where the processed data is located ;
  • the determining module is configured to determine the updated database table as an upstream database table, and to determine the downstream database table corresponding to the upstream database table according to the lineage relationship.
  • the Hadoop-based data update device further includes:
  • the first sending module is configured to send prompt information to the client corresponding to the downstream database table, so that the client prompts the user according to the prompt information that the upstream database table corresponding to the downstream database table has been updated;
  • the update module 30 is further configured to update the downstream database table according to the updated metadata and the processed data if an update instruction sent by the client corresponding to the downstream database table is received.
  • the first obtaining module includes:
  • the obtaining unit is configured to obtain the monitoring log of the updated metadata database through a preset monitoring program in the data management system;
  • the first parsing unit is used for parsing the monitoring log
  • the obtaining unit is further configured to obtain a target keyword in the monitoring log; and obtain updated metadata in the updated metadata database according to the target keyword.
  • the Hadoop-based data update device further includes:
  • the second obtaining module is configured to obtain target data corresponding to the data query statement in the Hadoop cluster
  • the second sending module is configured to send the target data to the client corresponding to the batch running task.
  • analysis module 20 further includes:
  • the second parsing unit is used to parse the task sentence in the data management system to obtain the database table corresponding to the task sentence;
  • the determining unit is configured to determine the source table and the target table in the database table corresponding to the task sentence, and determine the logical relationship of each database table corresponding to the task sentence according to the source table and the target table.
  • the compilation module 10 is also configured to, after detecting that the Hadoop cluster receives the batch running task sent by the client, invoke the hive compiler of the Hadoop cluster to compile the batch running task to obtain the batch running task.
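The obtaining and parsing units above monitor the metadata database's log and pull out updated table names by a target keyword. A minimal sketch of that idea follows; the plain-text log format and the keyword `ALTER TABLE` are assumptions for illustration only, since a real deployment would consume binlog events (e.g. via CANAL) rather than text lines.

```python
# Minimal sketch of the obtaining/parsing units: scan a monitoring log of the
# metadata database for a target keyword and collect the affected table names.
# The log format and keyword are illustrative assumptions, not a real format.

TARGET_KEYWORD = "ALTER TABLE"  # hypothetical keyword marking metadata updates

def parse_monitoring_log(log_lines):
    """Return the table names mentioned in lines containing the target keyword."""
    updated_tables = []
    for line in log_lines:
        if TARGET_KEYWORD in line:
            # take the token right after the keyword as the table name
            rest = line.split(TARGET_KEYWORD, 1)[1].strip()
            updated_tables.append(rest.split()[0].rstrip(";"))
    return updated_tables

log = [
    "2019-05-27 10:00:01 ALTER TABLE mart.user_profile ADD COLUMN age INT;",
    "2019-05-27 10:00:02 SELECT * FROM mart.orders;",
]
print(parse_monitoring_log(log))  # ['mart.user_profile']
```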
  • FIG. 4 is a schematic structural diagram of the hardware operating environment involved in the solution of the embodiment of the present application.
• FIG. 4 may depict the hardware operating environment of the Hadoop-based data update system.
• in the embodiments of the present application, the Hadoop-based data update system may be a terminal device such as a PC or a portable computer.
  • the Hadoop-based data update system may include: a processor 1001, such as a CPU, a memory 1005, a user interface 1003, a network interface 1004, and a communication bus 1002.
  • the communication bus 1002 is used to implement connection and communication between these components.
• the user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard); optionally, the user interface 1003 may also include a standard wired interface and a wireless interface.
  • the network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface).
• the memory 1005 may be a high-speed RAM, or a non-volatile memory such as a magnetic disk memory.
  • the memory 1005 may also be a storage device independent of the foregoing processor 1001.
  • the Hadoop-based data update system may also include a camera, an RF (Radio Frequency) circuit, a sensor, an audio circuit, a WiFi module, and so on.
• the structure of the Hadoop-based data update system shown in FIG. 4 does not constitute a limitation on the system, which may include more or fewer components than shown, combine certain components, or adopt a different component arrangement.
  • the memory 1005 as a computer storage medium may include an operating system, a network communication module, a user interface module, and a data update program based on Hadoop.
  • the operating system is a program that manages and controls the hardware and software resources of the Hadoop-based data update system, and supports the operation of the Hadoop-based data update program and other software or programs.
• the user interface 1003 is mainly used to connect to a client and exchange data with the client;
• the network interface 1004 is mainly used to connect to a background server and communicate with the background server;
• the processor 1001 may be used to invoke the Hadoop-based data update program stored in the memory 1005 and execute the steps of the Hadoop-based data update method described above.
• an embodiment of the present application further provides a computer-readable storage medium storing a Hadoop-based data update program which, when executed by a processor, implements the steps of the Hadoop-based data update method described above.
• the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform; they can of course also be implemented by hardware, but in many cases the former is the better implementation.
• based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the method described in each embodiment of the present application.

Abstract

Disclosed are a Hadoop-based data update method, device, and system, as well as a medium. The method comprises the following steps: after detecting that a Hadoop cluster has received a batch task sent by a client, compiling the batch task in the Hadoop cluster to obtain the task statement corresponding to the batch task (S10); parsing the task statement in a data governance system to obtain the logical relationships among the database tables involved in the task statement (S20); and updating, according to the logical relationships, the lineage relationships of the corresponding database tables in a preset graph database (S30).

Description

Hadoop-based data update method, device, system and medium
This application claims priority to Chinese patent application No. 201910448948.6, filed on May 27, 2019 and entitled "Hadoop-based data update method, device, system and medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of financial technology (Fintech) data processing, and in particular to a Hadoop-based data update method, device, system and medium.
Background
With the continuous development of financial technology, especially Internet finance (Fintech), more and more technologies (such as distributed computing, blockchain, and artificial intelligence) are being applied in the financial field, and the financial industry in turn places higher requirements on these technologies.
At present, many enterprises have built their own Hadoop big data platforms. On such a platform, the data of the various application systems within an enterprise is shared, forming data warehouses and a number of data marts built around different subjects, where a data warehouse stores the data of a particular application system and a data mart stores the data of the same subject drawn from the various application systems. When the data of an upstream database table on which a data mart depends is rerun, the downstream database tables are notified so that they can passively rerun their batches in response to the change in the upstream table. This notification is typically implemented by a global distributed task scheduling platform: when an upstream database table changes, the Hadoop big data platform informs the task scheduling platform, which generates a corresponding notification and sends it to the clients of the downstream database tables to trigger reprocessing of their data. When the lineage relationships between upstream and downstream database tables are complex, however, the task scheduling platform cannot determine all of the affected downstream tables, so the lineage relationships between upstream and downstream tables are updated incompletely and therefore become inaccurate; that is, the lineage relationships between the database tables in the graph database become inaccurate.
Summary of the Invention
Technical Problem
Solution to the Problem
Technical Solution
The main purpose of this application is to provide a Hadoop-based data update method, device, system and medium, aiming to solve the existing technical problem that the lineage relationships between database tables in a graph database are inaccurate when batch tasks are executed.
To achieve the above objective, this application provides a Hadoop-based data update method, which includes the following steps:
after detecting that a Hadoop cluster has received a batch task sent by a client, compiling the batch task in the Hadoop cluster to obtain the task statement corresponding to the batch task;
parsing the task statement in a data governance system to obtain the logical relationships among the database tables involved in the task statement;
updating, according to the logical relationships, the lineage relationships of the corresponding database tables in a preset graph database.
In an embodiment, if the task statement is a data update statement, after the step of compiling the batch task in the Hadoop cluster to obtain the task statement corresponding to the batch task, the method further includes:
processing the data corresponding to the batch task in the Hadoop cluster to obtain processed data;
updating the metadata database according to the processed data to obtain an updated metadata database;
obtaining, through the data governance system, the updated metadata from the updated metadata database, and obtaining the processed data and the table name of the database table in which the processed data resides;
updating the database table corresponding to the table name in the graph database according to the updated metadata and the processed data, and determining the updated database table as an upstream database table;
after the step of updating the lineage relationships of the corresponding database tables in the preset graph database according to the logical relationships, determining the downstream database tables corresponding to the upstream database table according to the lineage relationships;
updating the downstream database tables according to the updated metadata and the processed data.
In an embodiment, after the step of determining the downstream database tables corresponding to the upstream database table according to the lineage relationships, the method further includes:
sending prompt information to the client corresponding to a downstream database table, so that the client notifies the user, according to the prompt information, that the upstream database table on which the downstream database table depends has been updated;
if an update instruction sent by the client corresponding to the downstream database table is received, updating the downstream database table according to the updated metadata and the processed data.
In an embodiment, the step of obtaining the updated metadata from the updated metadata database through the data governance system includes:
obtaining the monitoring log of the updated metadata database through a listener preset in the data governance system;
parsing the monitoring log to obtain a target keyword in the monitoring log;
obtaining the updated metadata from the updated metadata database according to the target keyword.
In an embodiment, if the task statement is a data query statement, after the step of compiling the batch task in the Hadoop cluster to obtain the task statement corresponding to the batch task, the method further includes:
obtaining, in the Hadoop cluster, the target data corresponding to the data query statement;
sending the target data to the client corresponding to the batch task.
In an embodiment, the step of parsing the task statement in the data governance system to obtain the logical relationships among the database tables involved in the task statement includes:
parsing the task statement in the data governance system to obtain the database tables corresponding to the task statement;
determining the source tables and target tables among the database tables corresponding to the task statement, and determining the logical relationships among those tables according to the source tables and target tables.
In an embodiment, the step of compiling the batch task in the Hadoop cluster after detecting that the Hadoop cluster has received the batch task sent by the client includes:
after detecting that the Hadoop cluster has received the batch task sent by the client, invoking the Hive compiler of the Hadoop cluster to compile the batch task to obtain the HQL statement corresponding to the batch task.
In addition, to achieve the above objective, this application further provides a Hadoop-based data update device, which includes:
a compiling module, configured to compile the batch task in the Hadoop cluster after detecting that the Hadoop cluster has received the batch task sent by a client, to obtain the task statement corresponding to the batch task;
a parsing module, configured to parse the task statement in the data governance system to obtain the logical relationships among the database tables involved in the task statement;
an update module, configured to update, according to the logical relationships, the lineage relationships of the corresponding database tables in the preset graph database.
In addition, to achieve the above objective, this application further provides a Hadoop-based data update system, which includes a memory, a processor, and a Hadoop-based data update program stored on the memory and runnable on the processor, where the program, when executed by the processor, implements the steps of the Hadoop-based data update method described above.
In addition, to achieve the above objective, this application further provides a computer-readable storage medium storing a Hadoop-based data update program which, when executed by a processor, implements the steps of the Hadoop-based data update method described above.
In this application, after it is detected that the Hadoop cluster has received a batch task sent by a client, the batch task is compiled in the Hadoop cluster to obtain the task statement corresponding to the batch task; the task statement is parsed in the data governance system to obtain the logical relationships among the database tables involved in the task statement; and the lineage relationships of the corresponding database tables in the preset graph database are updated according to these logical relationships. In this way, whenever a batch task is executed, the lineage relationships of the corresponding tables in the graph database are updated according to the logical relationships among the tables involved in the batch task, which improves the accuracy of the lineage relationships between the database tables in the graph database.
Beneficial Effects of the Invention
Brief Description of the Drawings
Description of the Drawings
FIG. 1 is a schematic flowchart of a first embodiment of the Hadoop-based data update method of this application;
FIG. 2 is a schematic flowchart of a second embodiment of the Hadoop-based data update method of this application;
FIG. 3 is a functional module diagram of a preferred embodiment of the Hadoop-based data update device of this application;
FIG. 4 is a schematic structural diagram of the hardware operating environment involved in the solutions of the embodiments of this application.
The realization of the objectives, functional characteristics, and advantages of this application will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Embodiments of the Invention
Detailed Description of the Embodiments
It should be understood that the specific embodiments described here are only used to explain this application and are not intended to limit it.
This application provides a Hadoop-based data update method. Referring to FIG. 1, FIG. 1 is a schematic flowchart of a first embodiment of the Hadoop-based data update method of this application.
The embodiments of this application provide embodiments of the Hadoop-based data update method. It should be noted that although a logical order is shown in the flowchart, in some cases the steps shown or described may be executed in an order different from the one here.
The technical terms used in the embodiments of this application are explained below:
① JanusGraph: an open-source distributed graph database with good scalability; through a multi-machine cluster it can store and query graph data with tens of billions of vertices and edges. JanusGraph is a transactional database that supports a large number of users concurrently executing complex real-time graph traversals.
② Hadoop: a distributed system infrastructure developed by the Apache Foundation; a software framework capable of distributed processing of large amounts of data in a reliable, efficient, and scalable manner.
③ HDFS: the Hadoop Distributed File System. HDFS is highly fault-tolerant, is designed to be deployed on low-cost hardware, and provides high throughput for accessing application data, making it suitable for applications with very large data sets.
④ Metadata: data about data, mainly information describing the properties of data, used to support functions such as indicating storage locations, recording historical data, resource lookup, and file records.
⑤ Binlog: the binlog records all statements that have updated data or may have updated data. Statements are stored in the form of "events", which describe all data changes occurring in a database.
⑥ HQL: short for HiveQL, a language similar to SQL (Structured Query Language). It is compatible with most SQL syntax but does not fully support the SQL standard.
⑦ CANAL: an open-source project from Alibaba, developed in pure Java. Based on incremental database log parsing, it provides incremental data subscription and consumption; it currently mainly supports MySQL (a relational database management system) and also supports MariaDB.
⑧ MapReduce: a programming model on the Hadoop platform for parallel computation over large-scale data sets (larger than 1 TB). Its purpose is to organize a mass of unordered data according to some characteristic and then process it to obtain a final result. Map handles the unordered, unrelated data: it parses each record and extracts a key and a value from it; Reduce then aggregates the data produced by Map to obtain the final result.
⑨ Graph database: a type of NoSQL (non-relational) database that applies graph theory to store the relationship information between entities.
⑩ Hive: a Hadoop-based data warehouse tool that can map structured data files to database tables, provides simple SQL query functions, and can convert SQL statements into MapReduce tasks for execution.
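The Map and Reduce phases described in item ⑧ can be sketched in a few lines. This is a single-process illustration of the programming model only, not Hadoop's distributed implementation; the word-count example and names are chosen for illustration.

```python
# Single-process sketch of the MapReduce model from item ⑧: Map extracts
# (key, value) pairs from unordered records; Reduce aggregates them per key.
from collections import defaultdict

def map_phase(records):
    """Parse each record and emit (key, value) pairs."""
    for record in records:
        for word in record.split():
            yield word, 1  # key = word, value = a count of 1

def reduce_phase(pairs):
    """Aggregate the mapped pairs by key to obtain the final result."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

records = ["hadoop hive", "hive sql hive"]
print(reduce_phase(map_phase(records)))  # {'hadoop': 1, 'hive': 3, 'sql': 1}
```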
The Hadoop-based data update method includes the following steps.
Step S10: after it is detected that the Hadoop cluster has received a batch task sent by a client, compile the batch task in the Hadoop cluster to obtain the task statement corresponding to the batch task.
The Hadoop cluster detects, in real time or periodically, whether a batch task sent by a client has been received. The period can be set according to specific needs and is not limited in this embodiment. When a client needs to view or update data, the user of the client can trigger a batch task manually or configure a scheduled batch task in the client. Updating data includes, but is not limited to, modifying existing data, adding new data, and deleting existing data. After the Hadoop cluster receives a batch task sent by a client, it compiles the task to obtain the corresponding task statement and sends that statement to the data governance system. Further, if no batch task has been received, the Hadoop cluster continues to monitor for batch tasks sent by clients.
Further, step S10 includes:
Step a: after it is detected that the Hadoop cluster has received the batch task sent by the client, invoke the Hive compiler of the Hadoop cluster to compile the batch task to obtain the HQL statement corresponding to the batch task.
Specifically, the Hadoop cluster is provided with a Hive compiler, an execution engine, and a listener, where the listener is a hook listener. After the Hadoop cluster receives a batch task sent by a client, it compiles the task with its built-in Hive compiler to obtain the corresponding HQL (HiveQL) statement; it can be understood that this HQL statement is the task statement corresponding to the batch task. It should be noted that the Hadoop cluster may also compile the batch task with its built-in Hive compiler to obtain the SQL statement corresponding to the batch task.
After the Hadoop cluster obtains the HQL statement corresponding to the batch task through the Hive compiler, it submits the statement to the execution engine. At this point, the listener of the Hadoop cluster can monitor each HQL statement, obtain the monitored statements, and send them to the data governance system.
Step S20: parse the task statement in the data governance system to obtain the logical relationships among the database tables involved in the task statement.
After the data governance system receives the task statement sent by the Hadoop cluster, it parses the statement to obtain the logical relationships among the database tables involved. It should be noted that the task statement concerns data related to the batch task, and this data may reside in different database tables, between which certain logical relationships exist. For example, if a piece of data a resides in database table A but is obtained by processing database table B, then database tables A and B have a logical relationship.
Further, step S20 includes:
Step b: parse the task statement in the data governance system to obtain the database tables corresponding to the task statement.
Step c: determine the source tables and target tables among the database tables corresponding to the task statement, and determine the logical relationships among those tables according to the source tables and target tables.
Specifically, the data governance system parses the task statement to obtain the database tables it refers to, determines the source tables and target tables among them, and determines the logical relationships among the tables according to the source and target tables. It can be understood that a source table is an upstream table and a target table is a downstream table; that is, the data in the target table originates from the source table. It should be noted that the task statement contains the names of the corresponding database tables as well as logical keywords that indicate the logical relationships between those table names; through these logical keywords, the source tables and target tables of the statement can be determined. For example, if the task statement contains "database table C from database table D", it can be determined that database table C is the target table and database table D is the source table.
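The keyword-based determination in step c can be sketched as follows. The regular expressions below cover only `INSERT ... FROM`/`JOIN`-style statements and are an assumption for illustration; a production system would walk a full HQL abstract syntax tree rather than use regexes.

```python
# Sketch of step c: determine the target table and source tables from the
# logical keywords in a task statement. Only INSERT/FROM/JOIN keywords are
# handled here; real HQL parsing is considerably more involved.
import re

def extract_tables(task_statement):
    """Return the target (downstream) table and source (upstream) tables."""
    sql = task_statement.strip().rstrip(";")
    target = re.search(r"insert\s+(?:overwrite|into)\s+table\s+(\S+)", sql, re.I)
    sources = re.findall(r"\b(?:from|join)\s+(\S+)", sql, re.I)
    return {
        "target": target.group(1) if target else None,
        "sources": sources,  # the tables the target's data originates from
    }

stmt = "INSERT OVERWRITE TABLE mart.c SELECT * FROM dw.d JOIN dw.e ON d.id = e.id"
print(extract_tables(stmt))
# {'target': 'mart.c', 'sources': ['dw.d', 'dw.e']}
```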
Step S30: update the lineage relationships of the corresponding database tables in the preset graph database according to the logical relationships.
After the data governance system obtains the logical relationships among the database tables involved in the task statement, it updates the lineage relationships of the corresponding database tables accordingly. It should be noted that these logical relationships determine the relationships among all of the database tables involved in the task statement. In this embodiment, the graph database may be JanusGraph, or another graph database such as Neo4j or HugeGraph. Neo4j is a high-performance NoSQL graph database that stores structured data in a graph rather than in tables. HugeGraph is an easy-to-use, efficient, and general-purpose open-source graph database system that implements the Apache TinkerPop3 framework, is fully compatible with the Gremlin query language, and provides a complete tool-chain to help users easily build applications and products on top of a graph database.
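A minimal in-memory stand-in for the lineage update in step S30 is sketched below. The real embodiment would write vertices and edges to a graph database such as JanusGraph (e.g. through Gremlin traversals); the dictionary-based graph and the table names here are simplifications for illustration.

```python
# In-memory stand-in for the graph database in step S30: lineage is a set of
# directed edges from each source (upstream) table to its target (downstream)
# table. A real deployment would persist these edges in JanusGraph instead.
from collections import defaultdict

class LineageGraph:
    def __init__(self):
        self.downstream = defaultdict(set)  # table -> tables derived from it

    def update(self, sources, target):
        """Record that `target` is produced from each table in `sources`."""
        for source in sources:
            self.downstream[source].add(target)

    def affected_tables(self, table):
        """All tables transitively downstream of `table` (for rerun notification)."""
        seen, stack = set(), [table]
        while stack:
            for nxt in self.downstream[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

graph = LineageGraph()
graph.update(["dw.d", "dw.e"], "mart.c")    # from one parsed task statement
graph.update(["mart.c"], "report.summary")  # from another
print(sorted(graph.affected_tables("dw.d")))  # ['mart.c', 'report.summary']
```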
可以理解的是,该任务语句可能会更新图形数据库中数据库表的血缘关系,也可能不会更新图像数据库中数据库表的血缘关系。如当任务语句为数据查询语句时,此时图形数据库中数据库表的血缘关系是不会改变的,但是图像数据库依旧可以根据数据查询语句中所涉及的数据库表的逻辑关系更新对应数据库表的血缘关系,只是此时图数据库中各个数据库表更新之前的血缘关系和更新后的血缘关系是一致的。It is understandable that the task statement may update the blood relationship of the database table in the graph database, or may not update the blood relationship of the database table in the image database. For example, when the task statement is a data query statement, the blood relationship of the database table in the graph database will not change at this time, but the image database can still update the blood relationship of the corresponding database table according to the logical relationship of the database table involved in the data query statement Relationship, but at this time the blood relationship of each database table in the graph database before the update is consistent with the blood relationship after the update.
进一步地，为了在跑批任务为数据查询任务时，不会更新图形数据库中数据库表的血缘关系，Hadoop集群在接收到跑批任务时，判断该跑批任务是否为数据查询任务，若该跑批任务为数据查询任务，Hadoop集群则不将跑批任务对应的任务语句发送给数据治理系统，此时，也就不需要更新图形数据库中数据库表的血缘关系了。Further, so that the blood relationship of the database tables in the graph database is not updated when the batch running task is a data query task, the Hadoop cluster, upon receiving a batch running task, judges whether the batch running task is a data query task. If the batch running task is a data query task, the Hadoop cluster does not send the task statement corresponding to the batch running task to the data management system, and in this case the blood relationship of the database tables in the graph database does not need to be updated.
本实施例通过当检测到Hadoop集群接收到客户端发送的跑批任务后，在Hadoop集群中对跑批任务进行编译，得到跑批任务对应的任务语句，在数据治理系统中对任务语句进行解析，得到任务语句对应各个数据库表的逻辑关系，根据逻辑关系更新预设图形数据库中对应数据库表的血缘关系，实现了当执行跑批任务时，根据跑批任务对应各个数据库表的逻辑关系更新图形数据库中对应数据库表的血缘关系，提高了图形数据库中数据库表之间的血缘关系的准确性。In this embodiment, after it is detected that the Hadoop cluster has received the batch running task sent by the client, the batch running task is compiled in the Hadoop cluster to obtain the task statement corresponding to the batch running task; the task statement is parsed in the data management system to obtain the logical relationships between the task statement and the database tables it involves; and the blood relationship of the corresponding database tables in the preset graph database is updated according to those logical relationships. In this way, when a batch running task is executed, the blood relationship of the corresponding database tables in the graph database is updated according to the logical relationships of the database tables involved in the task, which improves the accuracy of the blood relationships between database tables in the graph database.
进一步地,提出本申请基于Hadoop的数据更新方法第二实施例。Further, a second embodiment of the Hadoop-based data update method of this application is proposed.
所述基于Hadoop的数据更新方法第二实施例与所述基于Hadoop的数据更新方法第一实施例的区别在于，若所述任务语句为数据更新语句，参照图2，基于Hadoop的数据更新方法还包括：The difference between the second embodiment of the Hadoop-based data update method and the first embodiment of the Hadoop-based data update method is that, if the task statement is a data update statement, referring to Figure 2, the Hadoop-based data update method further includes:
步骤S40,在所述Hadoop集群中对所述跑批任务对应数据进行加工,得到加工后的数据。Step S40, processing the data corresponding to the batch running task in the Hadoop cluster to obtain processed data.
需要说明的是,任务语句中存在类型关键字,通过该类型关键字即可确定任务语句的语句类型。如当任务语句中存在update、add和delete等表示更新的类型关键字时,可确定任务语句为数据更新语句。当任务语句中存在search和gain等表示查询、获取的类型关键字时,可确定任务语句为数据查询语句。It should be noted that there is a type keyword in the task statement, and the statement type of the task statement can be determined by the type keyword. For example, when there are update, add, and delete type keywords in the task statement, it can be determined that the task statement is a data update statement. When there are type keywords such as search and gain in the task statement that indicate query and acquisition, the task statement can be determined to be a data query statement.
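A minimal sketch of the keyword-based classification described above; the keyword sets are exactly those named in this paragraph, and the function name and token-level matching are hypothetical simplifications:

```python
UPDATE_KEYWORDS = {"update", "add", "delete"}
QUERY_KEYWORDS = {"search", "gain"}

def classify_statement(task_statement: str) -> str:
    """Decide the statement type from the type keywords the statement contains."""
    tokens = task_statement.lower().split()
    if any(token in UPDATE_KEYWORDS for token in tokens):
        return "data_update"
    if any(token in QUERY_KEYWORDS for token in tokens):
        return "data_query"
    return "unknown"

print(classify_statement("update t_user set balance = 0"))  # data_update
print(classify_statement("search t_user where id = 1"))     # data_query
```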
当确定任务语句为数据更新语句后，Hadoop集群对跑批任务对应的数据进行加工，得到加工后的数据。具体地，Hadoop集群会通过MapReduce计算，将跑批任务对应的数据加工成特定格式的数据。如将跑批任务对应的数据加工成固定长度大小的数据，或者加工成某种特定数据类型的数据。此时，跑批任务对应的数据可为新增加的数据，也可为修改Hadoop集群对应关系型数据库所存储的元数据，或者修改HDFS中存储的数据等。需要说明的是，在Hadoop集群中，其数据是存储在HDFS中的，而HDFS中数据的元数据并不存放在HDFS本身上，而是存放在传统的关系型数据库中，如存储在MySQL中。When it is determined that the task statement is a data update statement, the Hadoop cluster processes the data corresponding to the batch running task to obtain the processed data. Specifically, the Hadoop cluster uses MapReduce computation to process the data corresponding to the batch running task into data of a specific format, for example into data of a fixed length, or into data of a particular data type. Here, the data corresponding to the batch running task may be newly added data, a modification of the metadata stored in the relational database associated with the Hadoop cluster, or a modification of the data stored in HDFS. It should be noted that in a Hadoop cluster the data is stored in HDFS, while the metadata of the data in HDFS is not stored in HDFS itself but in a traditional relational database, such as MySQL.
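The "specific format" produced by the processing step is not fixed by the embodiment; as one hedged example, records could be normalized to a fixed length. The single-process sketch below only illustrates that transformation (the real job would run as MapReduce, and all names are hypothetical):

```python
RECORD_WIDTH = 32  # hypothetical fixed length

def process_record(record: str, width: int = RECORD_WIDTH) -> str:
    """Truncate or right-pad a raw record so every output record has the same length."""
    return record[:width].ljust(width)

processed = [process_record(r) for r in ["alice,100", "bob,25"]]
print([len(r) for r in processed])  # [32, 32]
```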
步骤S50,根据所述加工后的数据更新元数据库,得到所述更新后的元数据库。Step S50: Update the metadata database according to the processed data to obtain the updated metadata database.
当Hadoop集群得到加工后的数据后，Hadoop集群会根据该加工后的数据更新HDFS中存储的数据，并根据加工后的数据更新存储元数据的元数据库，得到更新后的元数据库。若存储元数据的数据库为MySQL，Hadoop集群则根据加工后的数据更新MySQL。After the Hadoop cluster obtains the processed data, the Hadoop cluster updates the data stored in HDFS according to the processed data, and updates the metadata database storing the metadata according to the processed data, to obtain the updated metadata database. If the database storing the metadata is MySQL, the Hadoop cluster updates MySQL based on the processed data.
步骤S60,通过所述数据治理系统在所述更新后的元数据库中获取更新后的元数据,并获取所述加工后的数据和所述加工后的数据所在数据库表的表名称。Step S60: Obtain updated metadata from the updated metadata database through the data management system, and obtain the processed data and the table name of the database table where the processed data is located.
当数据治理系统监听到元数据库更新后，数据治理系统在更新后的元数据库中获取更新后的元数据，以及在Hadoop集群的HDFS中获取加工后的数据，并获取加工后的数据所在数据库表的表名称。需要说明的是，在Hadoop集群中，每一数据都会存储在某个数据库表中，每一个数据库表都有一个表名称，该表名称可唯一标识某一个数据库表。When the data management system detects that the metadata database has been updated, the data management system obtains the updated metadata from the updated metadata database, obtains the processed data from the HDFS of the Hadoop cluster, and obtains the table name of the database table where the processed data is located. It should be noted that in a Hadoop cluster every piece of data is stored in some database table, and each database table has a table name that uniquely identifies that table.
进一步地,所述通过所述数据治理系统在所述更新后的元数据库中获取更新后的元数据的步骤包括:Further, the step of obtaining updated metadata in the updated metadata database through the data management system includes:
步骤d,通过所述数据治理系统中预设的监听程序获取所述更新后的元数据库的监听日志。Step d: Obtain the monitoring log of the updated metadata database through a preset monitoring program in the data management system.
步骤e,解析所述监听日志,获取所述监听日志中的目标关键字。Step e: Parse the monitoring log to obtain the target keyword in the monitoring log.
步骤f,根据所述目标关键字获取所述更新后的元数据库中更新后的元数据。Step f: Obtain updated metadata in the updated metadata database according to the target keyword.
进一步地，数据治理系统在元数据库中部署了Binlog监听程序，具体地，数据治理系统采用CANAL的框架将Binlog监听程序部署在元数据库中，数据治理系统采用Binlog监听程序获取更新后的元数据库的监听日志，该监听日志为Binlog日志，解析该监听日志，获取该监听日志中的目标关键字，根据该目标关键字在更新后的元数据库中获取更新后的元数据。其中，目标关键字为update、add和delete等关键字。在本实施例中，目标关键字和类型关键字可以相同，也可以不相同。Further, the data management system deploys a Binlog listener in the metadata database; specifically, the data management system uses the CANAL framework to deploy the Binlog listener in the metadata database. The data management system uses the Binlog listener to obtain the monitoring log of the updated metadata database; this monitoring log is a Binlog log. The data management system parses the monitoring log, obtains the target keyword in the monitoring log, and obtains the updated metadata from the updated metadata database according to the target keyword. The target keywords are keywords such as update, add, and delete. In this embodiment, the target keyword and the type keyword may be the same or different.
CANAL的工作原理为：①模拟MySQL slave(从MySQL)的交互协议，伪装自己为MySQL slave，向MySQL master(主MySQL)发送dump协议；MySQL master收到dump请求，开始推送Binary log(Binlog，二进制日志)给slave(也就是CANAL)，CANAL解析Binary log对象(原始为byte流)；②采用开源的open-replicator来解析Binary log，其中，Open Replicator是一个用Java编写的MySQL binlog分析程序；③CANAL需要维护Event Store(存储)，可以存取在Memory、File、Zookeeper；④CANAL需要维护客户端的状态，同一时刻一个instance(进程)只能有一个消费端消费。The working principle of CANAL is as follows: ① It simulates the interaction protocol of a MySQL slave, pretends to be a MySQL slave, and sends the dump protocol to the MySQL master; the MySQL master receives the dump request and starts to push the Binary log (Binlog, binary log) to the slave (that is, CANAL), which parses the Binary log object (originally a byte stream). ② The open-source open-replicator is used to parse the Binary log; Open Replicator is a MySQL binlog analysis program written in Java. ③ CANAL maintains an Event Store, which can be kept in Memory, File, or Zookeeper. ④ CANAL maintains the state of its clients; at any moment an instance (process) can have only one consumer.
进一步地，在监听日志中，也可以解析到更新后的元数据对应的数据库表的表名称，更新后的元数据对应的表名称为更新后的元数据对应加工后的数据的表名称，即数据治理系统通过监听日志可获取到更新后的元数据，以及确定加工后的数据所在数据库表的表名称。Further, the table name of the database table corresponding to the updated metadata can also be parsed from the monitoring log, and this table name is the table name of the database table where the processed data corresponding to the updated metadata is located. In other words, through the monitoring log the data management system can obtain the updated metadata and determine the table name of the database table where the processed data is located.
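Steps d to f above can be sketched as follows. Canal actually delivers structured row-change events; the single-line log format `KEYWORD table=<name> ...` below is a hypothetical simplification used only to show the keyword filtering and table-name extraction:

```python
import re

TARGET_KEYWORDS = {"update", "add", "delete"}

def parse_monitor_log(log_lines):
    """Return (keyword, table_name) pairs for lines whose keyword is a target keyword."""
    hits = []
    for line in log_lines:
        match = re.match(r"(\w+)\s+table=(\S+)", line)
        if match and match.group(1).lower() in TARGET_KEYWORDS:
            hits.append((match.group(1).lower(), match.group(2)))
    return hits

log = ["UPDATE table=t_user cols=balance", "SELECT table=t_user"]
print(parse_monitor_log(log))  # [('update', 't_user')]
```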
步骤S70,根据所述更新后的元数据和所述加工后的数据更新所述图形数据库中所述表名称对应的数据库表,并将更新后的数据库表确定为上游数据库表。Step S70: Update the database table corresponding to the table name in the graph database according to the updated metadata and the processed data, and determine the updated database table as an upstream database table.
可以理解的是，每一个元数据都存在对应数据库表，因此，每一个元数据都存在对应的表名称。当数据治理系统获取到加工后的数据、表名称和更新后的元数据后，数据治理系统根据更新后的元数据和加工后的数据对应更新图形数据库中表名称对应的数据库表，并将更新后的数据库表确定为上游数据库表。It is understandable that each piece of metadata has a corresponding database table, and therefore each piece of metadata has a corresponding table name. After the data management system obtains the processed data, the table name, and the updated metadata, the data management system updates the database table corresponding to that table name in the graph database according to the updated metadata and the processed data, and determines the updated database table as the upstream database table.
需要说明的是,步骤S40、步骤S50、步骤S60、步骤S70可以在步骤S20和步骤S30之前执行,也可在步骤S20和步骤S30之后执行,或者在步骤S20和步骤S30之间执行。It should be noted that step S40, step S50, step S60, and step S70 may be executed before step S20 and step S30, or may be executed after step S20 and step S30, or executed between step S20 and step S30.
步骤S80,根据所述血缘关系确定所述上游数据库表对应的下游数据库表。Step S80: Determine the downstream database table corresponding to the upstream database table according to the blood relationship.
步骤S90,根据所述更新后的元数据和所述加工后的数据更新所述下游数据库表。Step S90: Update the downstream database table according to the updated metadata and the processed data.
当数据治理系统确定图形数据库中的上游数据库表后，数据治理系统根据图形数据库中的血缘关系确定上游数据库表对应的下游数据库表，并根据更新后的元数据和加工后的数据更新下游数据库表。需要说明的是，因为存在血缘关系的数据库表中的数据是存在依赖关系的，因此，当上游数据库表中的某个数据发生变化后，与上游数据库表存在血缘关系的下游数据库表会受到影响，为了保持上游数据库表和下游数据库表中数据的一致性，所以需要根据更新后的元数据和加工后的数据更新下游数据库表。After the data management system determines the upstream database table in the graph database, the data management system determines the downstream database table corresponding to the upstream database table according to the blood relationship in the graph database, and updates the downstream database table according to the updated metadata and the processed data. It should be noted that the data in database tables linked by a blood relationship are interdependent; when some data in the upstream database table changes, the downstream database tables that have a blood relationship with the upstream database table are affected. To keep the data in the upstream database table and the downstream database tables consistent, the downstream database tables therefore need to be updated according to the updated metadata and the processed data.
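Whether the propagation stops at direct children or continues transitively is not spelled out here; the sketch below assumes transitive propagation, collecting every table reachable from the upstream table through the blood-relationship edges (all table names are hypothetical):

```python
from collections import deque

def downstream_tables(lineage, upstream):
    """All tables that depend, directly or transitively, on `upstream`.
    `lineage` maps a table name to the set of its direct downstream tables."""
    seen, queue = set(), deque([upstream])
    while queue:
        table = queue.popleft()
        for child in lineage.get(table, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

lineage = {"t_raw": {"t_daily"}, "t_daily": {"t_report"}}
print(downstream_tables(lineage, "t_raw"))  # {'t_daily', 't_report'}
```

The breadth-first traversal also terminates on cyclic lineage because each table is visited at most once.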
本实施例通过根据加工后的数据、加工后的数据对应表名称和更新后的元数据更新图形数据库中的数据库表，得到更新后的数据库表，并将更新后的数据库表确定为上游数据库表，根据加工后的数据和更新后的元数据更新与上游数据库表存在血缘关系的下游数据库表，实时、准确地保持了上游数据库表和下游数据库表中的数据一致性。In this embodiment, the database table in the graph database is updated according to the processed data, the table name corresponding to the processed data, and the updated metadata; the updated database table is determined as the upstream database table; and the downstream database tables that have a blood relationship with the upstream database table are updated according to the processed data and the updated metadata. This keeps the data in the upstream database table and the downstream database tables consistent accurately and in real time.
进一步地,提出本申请基于Hadoop的数据更新方法第三实施例。Further, a third embodiment of the Hadoop-based data update method of this application is proposed.
所述基于Hadoop的数据更新方法第三实施例与所述基于Hadoop的数据更新方法第二实施例的区别在于,基于Hadoop的数据更新方法还包括:The difference between the third embodiment of the Hadoop-based data update method and the second embodiment of the Hadoop-based data update method is that the Hadoop-based data update method further includes:
步骤g,发送提示信息给所述下游数据库表对应的客户端,以供所述客户端根据所述提示信息提示用户,所述下游数据库表对应的上游数据库表已更新。Step g: Send prompt information to the client corresponding to the downstream database table, so that the client prompts the user according to the prompt information that the upstream database table corresponding to the downstream database table has been updated.
步骤h,若接收到所述下游数据库表对应客户端发送的更新指令,则根据所述更新后的元数据和所述加工后的数据更新所述下游数据库表。Step h: If an update instruction sent by the client corresponding to the downstream database table is received, the downstream database table is updated according to the updated metadata and the processed data.
当数据治理系统确定下游数据库表后，数据治理系统生成提示信息，并将该提示信息发送给下游数据库表对应的客户端中，并检测是否接收到下游数据库对应客户端发送的更新指令。当下游数据库表对应的客户端接收到提示信息后，输出该提示信息，以根据该提示信息提示下游数据库表对应的用户，该下游数据库表对应的上游数据库表已更新，在本实施例中，不限制提示信息的输出方式。此时，下游数据库表对应的用户可在该客户端的显示界面中触发更新指令。当下游数据库表对应的客户端侦测到更新指令后，将该更新指令发送给数据治理系统。当数据治理系统接收到下游数据库表对应客户端发送的更新指令后，数据治理系统根据更新后的元数据和加工后的数据更新下游数据库表。After the data management system determines the downstream database table, the data management system generates prompt information, sends the prompt information to the client corresponding to the downstream database table, and detects whether an update instruction sent by that client is received. After the client corresponding to the downstream database table receives the prompt information, it outputs the prompt information to remind the corresponding user that the upstream database table corresponding to the downstream database table has been updated; this embodiment does not restrict how the prompt information is output. The user corresponding to the downstream database table can then trigger an update instruction on the display interface of the client. When the client corresponding to the downstream database table detects the update instruction, it sends the update instruction to the data management system. After the data management system receives the update instruction sent by the client corresponding to the downstream database table, the data management system updates the downstream database table according to the updated metadata and the processed data.
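A hedged sketch of this confirm-before-update flow; the three callables stand in for the messaging channel between the data management system and the clients, and all names are hypothetical:

```python
def notify_and_update(downstream, send_prompt, receive_instruction, apply_update):
    """Prompt each downstream table's client, then update only the tables whose
    client answers with an update instruction. Returns the tables that were updated."""
    updated = []
    for table in downstream:
        send_prompt(table, "upstream database table has been updated")
        if receive_instruction(table) == "update":
            apply_update(table)
            updated.append(table)
    return updated

# toy wiring: the client of t_report agrees to update, the client of t_audit does not
result = notify_and_update(
    ["t_report", "t_audit"],
    send_prompt=lambda table, message: None,
    receive_instruction=lambda table: "update" if table == "t_report" else "skip",
    apply_update=lambda table: None,
)
print(result)  # ['t_report']
```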
本实施例通过在上游数据库表更新后，发送提示信息给下游数据库表对应客户端，让下游数据库表对应客户端的用户自己决定是否更新下游数据库表，并在接收到下游数据库表用户触发的更新指令后，更新下游数据表，实现了在上游数据库表更新后，让下游数据库表对应的用户自主决定是否更新下游数据库表。In this embodiment, after the upstream database table is updated, prompt information is sent to the client corresponding to the downstream database table, so that the user of that client can decide whether to update the downstream database table; after the update instruction triggered by the user is received, the downstream database table is updated. In this way, after the upstream database table is updated, the user corresponding to the downstream database table can independently decide whether to update the downstream database table.
进一步地,提出本申请基于Hadoop的数据更新方法第四实施例。Further, a fourth embodiment of the Hadoop-based data update method of this application is proposed.
所述基于Hadoop的数据更新方法第四实施例与所述基于Hadoop的数据更新方法第一、第二或第三实施例的区别在于，若所述任务语句为数据查询语句，则基于Hadoop的数据更新方法还包括：The difference between the fourth embodiment of the Hadoop-based data update method and the first, second, or third embodiment of the Hadoop-based data update method is that, if the task statement is a data query statement, the Hadoop-based data update method further includes:
步骤i,在所述Hadoop集群中获取所述数据查询语句对应的目标数据。Step i: Obtain target data corresponding to the data query sentence in the Hadoop cluster.
步骤j,将所述目标数据发送给所述跑批任务对应的客户端。Step j: Send the target data to the client corresponding to the batch running task.
若确定任务语句为数据查询语句，Hadoop集群则解析数据查询语句，得到数据查询语句对应的数据库表的目标表名称，并根据目标表名称在HDFS中获取数据查询语句对应的目标数据，并将所获取的目标数据发送给跑批任务对应的客户端。If it is determined that the task statement is a data query statement, the Hadoop cluster parses the data query statement to obtain the target table name of the database table corresponding to the data query statement, obtains the target data corresponding to the data query statement from HDFS according to the target table name, and sends the obtained target data to the client corresponding to the batch running task.
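A toy sketch of this query path; a real implementation would rely on the Hive parser and read from HDFS, while here a dictionary stands in for the storage and the FROM-clause regex is a deliberate simplification (all names hypothetical):

```python
import re

def target_table(query: str):
    """Pull the table name that follows FROM out of a simple query statement."""
    match = re.search(r"\bfrom\s+(\w+)", query, re.IGNORECASE)
    return match.group(1) if match else None

def run_query(query: str, hdfs_store: dict):
    """`hdfs_store` maps a table name to its rows, standing in for HDFS."""
    return hdfs_store.get(target_table(query), [])

store = {"t_user": [("alice", 100), ("bob", 25)]}
print(run_query("SELECT name FROM t_user", store))  # [('alice', 100), ('bob', 25)]
```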
本实施例通过在Hadoop集群中获取数据查询语句对应的目标数据，将目标数据发送给跑批任务对应的客户端，不需要等待任务调度平台转发客户端的数据查询请求，提高了查询Hadoop集群中数据的查询效率。In this embodiment, the target data corresponding to the data query statement is obtained in the Hadoop cluster and sent to the client corresponding to the batch running task, without waiting for a task scheduling platform to forward the client's data query request, which improves the efficiency of querying data in the Hadoop cluster.
此外,参照图3,本申请还提供一种基于Hadoop的数据更新装置,所述基于Hadoop的数据更新装置包括:In addition, referring to Figure 3, the present application also provides a Hadoop-based data update device, the Hadoop-based data update device includes:
编译模块10,用于当检测到Hadoop集群接收到客户端发送的跑批任务后,在所述Hadoop集群中对所述跑批任务进行编译,得到所述跑批任务对应的任务语句;The compiling module 10 is configured to, after detecting that the Hadoop cluster receives the batch running task sent by the client, compile the batch running task in the Hadoop cluster to obtain the task statement corresponding to the batch running task;
解析模块20,用于在数据治理系统中对所述任务语句进行解析,得到所述任务语句对应各个数据库表的逻辑关系;The parsing module 20 is used to analyze the task sentence in the data management system to obtain the logical relationship of the task sentence corresponding to each database table;
更新模块30,用于根据所述逻辑关系更新预设图形数据库中对应数据库表的血缘关系。The update module 30 is configured to update the blood relationship of the corresponding database table in the preset graph database according to the logical relationship.
进一步地,若所述任务语句为数据更新语句,则所述基于Hadoop的数据更新装置还包括:Further, if the task sentence is a data update sentence, the Hadoop-based data update device further includes:
加工模块,用于在所述Hadoop集群中对所述跑批任务对应数据进行加工,得到加工后的数据;The processing module is used to process the data corresponding to the batch running task in the Hadoop cluster to obtain processed data;
所述更新模块30还用于根据所述加工后的数据更新元数据库，得到所述更新后的元数据库；根据所述更新后的元数据和所述加工后的数据更新所述图形数据库中所述表名称对应的数据库表；根据所述更新后的元数据和所述加工后的数据更新所述下游数据库表；The update module 30 is further configured to update the metadata database according to the processed data to obtain the updated metadata database; update the database table corresponding to the table name in the graph database according to the updated metadata and the processed data; and update the downstream database table according to the updated metadata and the processed data;
所述基于Hadoop的数据更新装置还包括:The Hadoop-based data update device further includes:
第一获取模块，用于通过所述数据治理系统在所述更新后的元数据库中获取更新后的元数据，并获取所述加工后的数据和所述加工后的数据所在数据库表的表名称；The first obtaining module is configured to obtain updated metadata in the updated metadata database through the data management system, and obtain the processed data and the table name of the database table where the processed data is located;
确定模块,用于将更新后的数据库表确定为上游数据库表;根据所述血缘关系确定所述上游数据库表对应的下游数据库表。The determining module is configured to determine the updated database table as an upstream database table; determine the downstream database table corresponding to the upstream database table according to the blood relationship.
进一步地,所述基于Hadoop的数据更新装置还包括:Further, the Hadoop-based data update device further includes:
第一发送模块,用于发送提示信息给所述下游数据库表对应的客户端,以供所述客户端根据所述提示信息提示用户,所述下游数据库表对应的上游数据库表已更新;The first sending module is configured to send prompt information to the client corresponding to the downstream database table, so that the client prompts the user according to the prompt information that the upstream database table corresponding to the downstream database table has been updated;
所述更新模块30还用于若接收到所述下游数据库表对应客户端发送的更新指令,则根据所述更新后的元数据和所述加工后的数据更新所述下游数据库表。The update module 30 is further configured to update the downstream database table according to the updated metadata and the processed data if an update instruction sent by the client corresponding to the downstream database table is received.
进一步地,所述第一获取模块包括:Further, the first obtaining module includes:
获取单元,用于通过所述数据治理系统中预设的监听程序获取所述更新后的元数据库的监听日志;The obtaining unit is configured to obtain the monitoring log of the updated metadata database through a preset monitoring program in the data management system;
第一解析单元,用于解析所述监听日志;The first parsing unit is used for parsing the monitoring log;
所述获取单元还用于获取所述监听日志中的目标关键字;根据所述目标关键字获取所述更新后的元数据库中更新后的元数据。The obtaining unit is further configured to obtain a target keyword in the monitoring log; and obtain updated metadata in the updated metadata database according to the target keyword.
进一步地,若所述任务语句为数据查询语句,所述基于Hadoop的数据更新装置还包括:Further, if the task sentence is a data query sentence, the Hadoop-based data update device further includes:
第二获取模块,用于在所述Hadoop集群中获取所述数据查询语句对应的目标数据;The second obtaining module is configured to obtain target data corresponding to the data query statement in the Hadoop cluster;
第二发送模块,用于将所述目标数据发送给所述跑批任务对应的客户端。The second sending module is configured to send the target data to the client corresponding to the batch running task.
进一步地,所述解析模块20还包括:Further, the analysis module 20 further includes:
第二解析单元,用于在数据治理系统中对所述任务语句进行解析,得到所述任务语句对应的数据库表;The second parsing unit is used to parse the task sentence in the data management system to obtain the database table corresponding to the task sentence;
确定单元,用于确定所述任务语句对应数据库表中的源表和目标表,根据所述源表和所述目标表确定所述任务语句对应各个数据库表的逻辑关系。The determining unit is configured to determine the source table and the target table in the database table corresponding to the task sentence, and determine the logical relationship of each database table corresponding to the task sentence according to the source table and the target table.
进一步地，所述编译模块10还用于当检测到Hadoop集群接收到客户端发送的跑批任务后，调用所述Hadoop集群的hive编译器对所述跑批任务进行编译，得到所述跑批任务对应的HQL语句。Further, the compiling module 10 is further configured to, after detecting that the Hadoop cluster receives the batch running task sent by the client, invoke the hive compiler of the Hadoop cluster to compile the batch running task to obtain the HQL statement corresponding to the batch running task.
需要说明的是,基于Hadoop的数据更新装置的各个实施例与上述基于Hadoop的数据更新方法的各实施例基本相同,在此不再详细赘述。It should be noted that the various embodiments of the Hadoop-based data update device are basically the same as the foregoing embodiments of the Hadoop-based data update method, and will not be described in detail here.
此外,本申请还提供一种基于Hadoop的数据更新系统。如图4所示,图4是本申请实施例方案涉及的硬件运行环境的结构示意图。In addition, this application also provides a data update system based on Hadoop. As shown in FIG. 4, FIG. 4 is a schematic structural diagram of the hardware operating environment involved in the solution of the embodiment of the present application.
需要说明的是,图4即可为基于Hadoop的数据更新系统的硬件运行环境的结构示意图。本申请实施例基于Hadoop的数据更新系统可以是PC,便携计算机等终端设备。It should be noted that Fig. 4 can be a structural diagram of the hardware operating environment of the Hadoop-based data update system. The Hadoop-based data update system in the embodiment of the present application may be a terminal device such as a PC and a portable computer.
如图4所示,该基于Hadoop的数据更新系统可以包括:处理器1001,例如CPU,存储器1005,用户接口1003,网络接口1004,通信总线1002。其中,通信总线1002用于实现这些组件之间的连接通信。用户接口1003可以包括显示屏(Display)、输入单元比如键盘(Keyboard),可选用户接口1003还可以包括标准的有线接口、无线接口。网络接口1004可选的可以包括标准的有线接口、无线接口(如WI-FI接口)。存储器1005可以是高速RAM存储器,也可以是稳定的存储器(non-volatile memory),例如磁盘存储器。存储器1005可选的还可以是独立于前述处理器1001的存储装置。As shown in FIG. 4, the Hadoop-based data update system may include: a processor 1001, such as a CPU, a memory 1005, a user interface 1003, a network interface 1004, and a communication bus 1002. Among them, the communication bus 1002 is used to implement connection and communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and the optional user interface 1003 may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface). The memory 1005 may be a high-speed RAM memory, or a non-volatile memory (non-volatile memory), such as a magnetic disk memory. Optionally, the memory 1005 may also be a storage device independent of the foregoing processor 1001.
可选地,基于Hadoop的数据更新系统还可以包括摄像头、RF(Radio Frequency,射频)电路,传感器、音频电路、WiFi模块等等。Optionally, the Hadoop-based data update system may also include a camera, an RF (Radio Frequency) circuit, a sensor, an audio circuit, a WiFi module, and so on.
本领域技术人员可以理解，图4中示出的基于Hadoop的数据更新系统结构并不构成对基于Hadoop的数据更新系统的限定，可以包括比图示更多或更少的部件，或者组合某些部件，或者不同的部件布置。Those skilled in the art can understand that the structure of the Hadoop-based data update system shown in Figure 4 does not constitute a limitation on the Hadoop-based data update system, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components.
如图4所示,作为一种计算机存储介质的存储器1005中可以包括操作系统、网络通信模块、用户接口模块以及基于Hadoop的数据更新程序。其中,操作系统是管理和控制基于Hadoop的数据更新系统硬件和软件资源的程序,支持基于Hadoop的数据更新程序以及其它软件或程序的运行。As shown in FIG. 4, the memory 1005 as a computer storage medium may include an operating system, a network communication module, a user interface module, and a data update program based on Hadoop. Among them, the operating system is a program that manages and controls the hardware and software resources of the Hadoop-based data update system, and supports the operation of the Hadoop-based data update program and other software or programs.
在图4所示的基于Hadoop的数据更新系统中，用户接口1003主要用于连接客户端，与客户端进行数据通信；网络接口1004主要用于连接后台服务器，与其他服务器进行数据通信；处理器1001可以用于调用存储器1005中存储的基于Hadoop的数据更新程序，并执行如上所述的基于Hadoop的数据更新方法的步骤。In the Hadoop-based data update system shown in Figure 4, the user interface 1003 is mainly used to connect to a client and perform data communication with the client; the network interface 1004 is mainly used to connect to a background server and perform data communication with other servers; the processor 1001 can be used to call the Hadoop-based data update program stored in the memory 1005 and execute the steps of the Hadoop-based data update method described above.
本申请基于Hadoop的数据更新系统具体实施方式与上述基于Hadoop的数据更新方法各实施例基本相同,在此不再赘述。The specific implementation of the Hadoop-based data update system of the present application is basically the same as the foregoing embodiments of the Hadoop-based data update method, and will not be repeated here.
此外，本申请实施例还提出一种计算机可读存储介质，所述计算机可读存储介质上存储有基于Hadoop的数据更新程序，所述基于Hadoop的数据更新程序被处理器执行时实现如上所述的基于Hadoop的数据更新方法的步骤。In addition, an embodiment of the present application further proposes a computer-readable storage medium on which a Hadoop-based data update program is stored; when the Hadoop-based data update program is executed by a processor, the steps of the Hadoop-based data update method described above are implemented.
本申请计算机可读存储介质具体实施方式与上述基于Hadoop的数据更新方法各实施例基本相同,在此不再赘述。The specific implementation of the computer-readable storage medium of the present application is basically the same as the foregoing embodiments of the Hadoop-based data update method, and will not be repeated here.
需要说明的是，在本文中，术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含，从而使得包括一系列要素的过程、方法、物品或者装置不仅包括那些要素，而且还包括没有明确列出的其他要素，或者是还包括为这种过程、方法、物品或者装置所固有的要素。在没有更多限制的情况下，由语句“包括一个......”限定的要素，并不排除在包括该要素的过程、方法、物品或者装置中还存在另外的相同要素。It should be noted that, as used herein, the terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a..." does not exclude the existence of other identical elements in the process, method, article, or device that includes the element.
上述本申请实施例序号仅仅为了描述,不代表实施例的优劣。The serial numbers of the foregoing embodiments of the present application are only for description, and do not represent the advantages and disadvantages of the embodiments.
通过以上的实施方式的描述，本领域的技术人员可以清楚地了解到上述实施例方法可借助软件加必需的通用硬件平台的方式来实现，当然也可以通过硬件，但很多情况下前者是更佳的实施方式。基于这样的理解，本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来，该计算机软件产品存储在一个存储介质(如ROM/RAM、磁碟、光盘)中，包括若干指令用以使得一台终端设备(可以是手机，计算机，服务器，空调器，或者网络设备等)执行本申请各个实施例所述的方法。Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, or of course by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions to enable a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the method described in each embodiment of this application.
以上仅为本申请的优选实施例，并非因此限制本申请的专利范围，凡是利用本申请说明书及附图内容所作的等效结构或等效流程变换，或直接或间接运用在其他相关的技术领域，均同理包括在本申请的专利保护范围内。The above are only preferred embodiments of this application and do not limit the patent scope of this application. Any equivalent structure or equivalent process transformation made using the content of the description and drawings of this application, whether used directly or indirectly in other related technical fields, is likewise included in the patent protection scope of this application.

Claims (13)

  1. 一种基于Hadoop的数据更新方法,其中,所述基于Hadoop的数据更新方法包括以下步骤:A Hadoop-based data update method, wherein the Hadoop-based data update method includes the following steps:
    当检测到Hadoop集群接收到客户端发送的跑批任务后,在所述Hadoop集群中对所述跑批任务进行编译,得到所述跑批任务对应的任务语句;After detecting that the Hadoop cluster receives the batch running task sent by the client, compiling the batch running task in the Hadoop cluster to obtain the task statement corresponding to the batch running task;
    在数据治理系统中对所述任务语句进行解析,得到所述任务语句对应各个数据库表的逻辑关系;Analyze the task statement in the data management system to obtain the logical relationship between the task statement and each database table;
    根据所述逻辑关系更新预设图形数据库中对应数据库表的血缘关系。The blood relationship of the corresponding database table in the preset graph database is updated according to the logical relationship.
  2. 如权利要求1所述的基于Hadoop的数据更新方法，其中，若所述任务语句为数据更新语句，则所述当检测到Hadoop集群接收到客户端发送的跑批任务后，在所述Hadoop集群中对所述跑批任务进行编译，得到所述跑批任务对应的任务语句的步骤之后，还包括：The Hadoop-based data update method according to claim 1, wherein, if the task statement is a data update statement, after the step of, upon detecting that the Hadoop cluster has received the batch running task sent by the client, compiling the batch running task in the Hadoop cluster to obtain the task statement corresponding to the batch running task, the method further includes:
    在所述Hadoop集群中对所述跑批任务对应数据进行加工,得到加工后的数据;Processing the data corresponding to the batch running task in the Hadoop cluster to obtain processed data;
    根据所述加工后的数据更新元数据库,得到所述更新后的元数据库;Update the metadata database according to the processed data to obtain the updated metadata database;
    通过所述数据治理系统在所述更新后的元数据库中获取更新后的元数据,并获取所述加工后的数据和所述加工后的数据所在数据库表的表名称;Obtaining updated metadata in the updated metadata database through the data management system, and obtaining the processed data and the table name of the database table where the processed data is located;
    根据所述更新后的元数据和所述加工后的数据更新所述图形数据库中所述表名称对应的数据库表,并将更新后的数据库表确定为上游数据库表;Update the database table corresponding to the table name in the graph database according to the updated metadata and the processed data, and determine the updated database table as an upstream database table;
    在所述根据所述逻辑关系更新预设图形数据库中对应数据库表的血缘关系的步骤之后,根据所述血缘关系确定所述上游数据库表对应的下游数据库表;After the step of updating the blood relationship of the corresponding database table in the preset graph database according to the logical relationship, determine the downstream database table corresponding to the upstream database table according to the blood relationship;
    根据所述更新后的元数据和所述加工后的数据更新所述下游数据 库表。The downstream database table is updated according to the updated metadata and the processed data.
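As a sketch of the upstream/downstream propagation described in claim 2, the walk below finds every table that transitively depends on an updated upstream table, given lineage edges stored as a mapping from each table to the set of its source tables. The dict is again only a stand-in for the graph database, and the table names are made up.

```python
def downstream_tables(lineage, upstream):
    """Breadth-first walk over lineage edges (table -> set of its source
    tables), collecting every table that transitively reads from `upstream`."""
    found, queue = set(), [upstream]
    while queue:
        current = queue.pop()
        for table, sources in lineage.items():
            if current in sources and table not in found:
                found.add(table)
                queue.append(table)
    return found

# t_report reads t_user and t_order; t_summary reads t_report.
lineage = {"t_report": {"t_user", "t_order"}, "t_summary": {"t_report"}}
deps = downstream_tables(lineage, "t_user")  # {'t_report', 't_summary'}
```

After an update to `t_user`, both `t_report` and `t_summary` would then be refreshed (or, per claim 3, their owners notified) in dependency order.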
  3. The Hadoop-based data update method according to claim 2, wherein, after the step of determining the downstream database table corresponding to the upstream database table according to the lineage relationships, the method further comprises:
    sending notification information to the client corresponding to the downstream database table, so that the client informs the user, according to the notification information, that the upstream database table corresponding to the downstream database table has been updated;
    if an update instruction sent by the client corresponding to the downstream database table is received, updating the downstream database table according to the updated metadata and the processed data.
  4. The Hadoop-based data update method according to claim 2, wherein the step of obtaining, through the data governance system, updated metadata from the updated metadata database comprises:
    obtaining the listener log of the updated metadata database through a listener preset in the data governance system;
    parsing the listener log to obtain target keywords in the listener log;
    obtaining the updated metadata from the updated metadata database according to the target keywords.
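A minimal illustration of the listener flow in claim 4: scan listener-log lines for target keywords and pull out the tables whose metadata changed. The log format and keyword names here are invented for the example; they are not actual Hive metastore event names.

```python
import re

SAMPLE_LOG = [
    "2020-05-11 10:02:01 INFO metastore: ALTER_TABLE db1.t_report",
    "2020-05-11 10:02:02 INFO metastore: GET_TABLE db1.t_user",
    "2020-05-11 10:02:03 INFO metastore: CREATE_TABLE db1.t_summary",
]

def changed_tables(log_lines, keywords=("ALTER_TABLE", "CREATE_TABLE", "DROP_TABLE")):
    """Return the tables named on log lines containing a target keyword,
    i.e. the tables whose metadata should be re-read from the metastore."""
    pattern = re.compile(r"(?:%s)\s+(\S+)" % "|".join(keywords))
    hits = []
    for line in log_lines:
        match = pattern.search(line)
        if match:
            hits.append(match.group(1))
    return hits

changed = changed_tables(SAMPLE_LOG)  # ['db1.t_report', 'db1.t_summary']
```

Read-only events (`GET_TABLE` in this toy log) carry no metadata change and are skipped; only the keyword hits trigger a metadata fetch.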
  5. The Hadoop-based data update method according to claim 1, wherein, if the task statement is a data query statement, after the step of compiling the batch task in the Hadoop cluster upon detecting that the Hadoop cluster has received the batch task sent by the client to obtain the task statement corresponding to the batch task, the method further comprises:
    obtaining, in the Hadoop cluster, the target data corresponding to the data query statement;
    sending the target data to the client corresponding to the batch task.
  6. The Hadoop-based data update method according to claim 1, wherein the step of parsing the task statement in the data governance system to obtain the logical relationships between the task statement and the database tables it involves comprises:
    parsing the task statement in the data governance system to obtain the database tables corresponding to the task statement;
    determining the source tables and the target tables among the database tables corresponding to the task statement, and determining the logical relationships between the task statement and the database tables according to the source tables and the target tables.
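The source/target split in claim 6 can be sketched as follows: tables the statement writes become targets, tables it reads become sources, and each source/target pair yields one logical-relationship triple. This is a toy regex parser under the same assumptions as before; a real implementation would walk the compiler's parse tree rather than pattern-match text.

```python
import re

def logical_relationships(hql):
    """Classify tables into targets (written) and sources (read), then emit
    one (source, 'feeds', target) triple per source/target pair."""
    targets = re.findall(r"INSERT\s+(?:OVERWRITE|INTO)\s+TABLE\s+(\w+)", hql, re.IGNORECASE)
    sources = re.findall(r"(?:FROM|JOIN)\s+(\w+)", hql, re.IGNORECASE)
    return [(s, "feeds", t) for t in targets for s in sorted(set(sources))]

edges = logical_relationships(
    "INSERT INTO TABLE t_target SELECT * FROM t_src1 JOIN t_src2 ON t_src1.k = t_src2.k"
)
# [('t_src1', 'feeds', 't_target'), ('t_src2', 'feeds', 't_target')]
```

Each emitted triple maps directly onto one directed edge in the graph database when the lineage relationships are updated.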
  7. The Hadoop-based data update method according to any one of claims 1 to 6, wherein the step of, after detecting that the Hadoop cluster has received the batch task sent by the client, compiling the batch task in the Hadoop cluster to obtain the task statement corresponding to the batch task comprises:
    after detecting that the Hadoop cluster has received the batch task sent by the client, invoking the Hive compiler of the Hadoop cluster to compile the batch task, thereby obtaining the HQL statement corresponding to the batch task.
  8. A Hadoop-based data update apparatus, wherein the Hadoop-based data update apparatus comprises:
    a compiling module, configured to compile the batch task in the Hadoop cluster after detecting that the Hadoop cluster has received the batch task sent by a client, so as to obtain the task statement corresponding to the batch task;
    a parsing module, configured to parse the task statement in a data governance system to obtain the logical relationships between the task statement and the database tables it involves;
    an updating module, configured to update the lineage relationships of the corresponding database tables in a preset graph database according to the logical relationships.
  9. The Hadoop-based data update apparatus according to claim 8, wherein, if the task statement is a data update statement, the Hadoop-based data update apparatus further comprises:
    a processing module, configured to process the data corresponding to the batch task in the Hadoop cluster to obtain processed data;
    wherein the updating module is further configured to update a metadata database according to the processed data to obtain an updated metadata database, to update the database table corresponding to the table name in the graph database according to the updated metadata and the processed data, and to update the downstream database table according to the updated metadata and the processed data;
    the Hadoop-based data update apparatus further comprising:
    a first obtaining module, configured to obtain, through the data governance system, updated metadata from the updated metadata database, and to obtain the processed data and the table name of the database table in which the processed data resides;
    a determining module, configured to determine the updated database table as an upstream database table, and to determine the downstream database table corresponding to the upstream database table according to the lineage relationships.
  10. The Hadoop-based data update apparatus according to claim 9, wherein the Hadoop-based data update apparatus further comprises:
    a first sending module, configured to send notification information to the client corresponding to the downstream database table, so that the client informs the user, according to the notification information, that the upstream database table corresponding to the downstream database table has been updated;
    wherein the updating module is further configured to update the downstream database table according to the updated metadata and the processed data if an update instruction sent by the client corresponding to the downstream database table is received.
  11. The Hadoop-based data update apparatus according to claim 9, wherein the first obtaining module comprises:
    an obtaining unit, configured to obtain the listener log of the updated metadata database through a listener preset in the data governance system;
    a first parsing unit, configured to parse the listener log;
    wherein the obtaining unit is further configured to obtain the target keywords in the listener log, and to obtain the updated metadata from the updated metadata database according to the target keywords.
  12. A Hadoop-based data update system, wherein the Hadoop-based data update system comprises a memory, a processor, and a Hadoop-based data update program stored on the memory and executable on the processor, wherein the Hadoop-based data update program, when executed by the processor, implements the steps of the Hadoop-based data update method according to any one of claims 1 to 7.
  13. A computer-readable storage medium, wherein a Hadoop-based data update program is stored on the computer-readable storage medium, and the Hadoop-based data update program, when executed by a processor, implements the steps of the Hadoop-based data update method according to any one of claims 1 to 7.
PCT/CN2020/089637 2019-05-27 2020-05-11 Hadoop-based data updating method, device, system and medium WO2020238597A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910448948.6A CN110196888A (en) 2019-05-27 2019-05-27 Data-updating method, device, system and medium based on Hadoop
CN201910448948.6 2019-05-27

Publications (1)

Publication Number Publication Date
WO2020238597A1 true WO2020238597A1 (en) 2020-12-03

Family

ID=67753181

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/089637 WO2020238597A1 (en) 2019-05-27 2020-05-11 Hadoop-based data updating method, device, system and medium

Country Status (2)

Country Link
CN (1) CN110196888A (en)
WO (1) WO2020238597A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110196888A (en) * 2019-05-27 2019-09-03 深圳前海微众银行股份有限公司 Data-updating method, device, system and medium based on Hadoop
CN110555032A (en) * 2019-09-09 2019-12-10 北京搜狐新媒体信息技术有限公司 Data blood relationship analysis method and system based on metadata
CN111415199A (en) * 2020-03-20 2020-07-14 重庆锐云科技有限公司 Customer prediction updating method and device based on big data and storage medium
CN111563123B (en) * 2020-05-07 2023-08-22 北京首汽智行科技有限公司 Real-time synchronization method for hive warehouse metadata
CN112783871A (en) * 2021-03-16 2021-05-11 广东核电合营有限公司 Label data processing method, label data processing device, computer equipment and storage medium
CN113590386B (en) * 2021-07-30 2023-03-03 深圳前海微众银行股份有限公司 Disaster recovery method, system, terminal device and computer storage medium for data

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060235835A1 (en) * 2005-04-14 2006-10-19 International Business Machines Corporation Model entity operations in query results
CN104424269A (en) * 2013-08-30 2015-03-18 中国电信股份有限公司 Data linage analysis method and device
CN106997369A (en) * 2016-01-26 2017-08-01 阿里巴巴集团控股有限公司 Data clearing method and device
CN109213826A (en) * 2017-06-30 2019-01-15 华为技术有限公司 Data processing method and equipment
CN109446279A (en) * 2018-10-15 2019-03-08 顺丰科技有限公司 Based on neo4j big data genetic connection management method, system, equipment and storage medium
CN109582660A (en) * 2018-12-06 2019-04-05 深圳前海微众银行股份有限公司 Data consanguinity analysis method, apparatus, equipment, system and readable storage medium storing program for executing
CN110196888A (en) * 2019-05-27 2019-09-03 深圳前海微众银行股份有限公司 Data-updating method, device, system and medium based on Hadoop

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018236886A1 (en) * 2017-06-21 2018-12-27 Opera Solutions Usa, Llc System and method for code and data versioning in computerized data modeling and analysis
CN107908672B (en) * 2017-10-24 2022-01-14 深圳前海微众银行股份有限公司 Application report realization method, device and storage medium based on Hadoop platform
CN108959564B (en) * 2018-07-04 2020-11-27 玖富金科控股集团有限责任公司 Data warehouse metadata management method, readable storage medium and computer device

Also Published As

Publication number Publication date
CN110196888A (en) 2019-09-03


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20813539

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20813539

Country of ref document: EP

Kind code of ref document: A1