WO2020238597A1 - Hadoop-based data updating method, device, system and medium - Google Patents


Info

Publication number
WO2020238597A1
Authority
WO
WIPO (PCT)
Prior art keywords
hadoop
task
database table
database
data
Prior art date
Application number
PCT/CN2020/089637
Other languages
French (fr)
Chinese (zh)
Inventor
彭陈成 (Peng Chencheng)
张阳 (Zhang Yang)
Original Assignee
深圳前海微众银行股份有限公司 (Shenzhen Qianhai WeBank Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Qianhai WeBank Co., Ltd. (深圳前海微众银行股份有限公司)
Publication of WO2020238597A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23: Updating
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28: Databases characterised by their database models, e.g. relational or object models
    • G06F16/283: Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP

Definitions

  • This application relates to the field of financial technology (Fintech) data processing technology, and in particular to a data update method, device, system and medium based on Hadoop.
  • Financial technology (Fintech)
  • the Hadoop big data platform will notify the task scheduling platform, and the task scheduling platform will generate a corresponding notification and send it to the client corresponding to the downstream database table to trigger the reprocessing of the data in the downstream database table.
  • the task scheduling platform cannot determine all of the affected downstream database tables, resulting in an incomplete update of the lineage relationship between the upstream database table and the downstream database tables.
  • the lineage relationship between the upstream database table and the downstream database table is inaccurate, that is, the lineage relationships between the database tables in the graph database are inaccurate.
  • the main purpose of this application is to provide a Hadoop-based data update method, device, system and medium, aiming to solve the existing technical problem that the lineage relationships between database tables in the graph database are inaccurate when batch running tasks are executed.
  • the Hadoop-based data update method includes the following steps:
  • the lineage relationship of the corresponding database table in the preset graph database is updated according to the logical relationship.
  • after the batch running task is compiled in the Hadoop cluster and the task statement corresponding to the batch running task is obtained, the method further includes:
  • the downstream database table is updated according to the updated metadata and the processed data.
  • the method further includes:
  • the downstream database table is updated according to the updated metadata and the processed data.
  • the step of obtaining updated metadata in the updated metadata database through the data management system includes:
  • the task statement is a data query statement
  • after it is detected that the Hadoop cluster receives the batch running task sent by the client, and the batch running task is compiled in the Hadoop cluster to obtain the task statement corresponding to the batch running task, the method further includes:
  • the target data is sent to the client corresponding to the batch running task.
  • the step of parsing the task statement in the data management system to obtain the logical relationship between the task statement and each database table includes:
  • the step of compiling the batch running task in the Hadoop cluster to obtain the task statement corresponding to the batch running task includes:
  • after it is detected that the Hadoop cluster receives the batch running task sent by the client, the Hive compiler of the Hadoop cluster is called to compile the batch running task, and the HQL statement corresponding to the batch running task is obtained.
  • the present application also provides a Hadoop-based data update device, the Hadoop-based data update device includes:
  • the compiling module is used to compile the batch running task in the Hadoop cluster after detecting that the Hadoop cluster receives the batch running task sent by the client to obtain the task statement corresponding to the batch running task;
  • the parsing module is used to parse the task statement in the data management system to obtain the logical relationship between the task statement and each database table;
  • the update module is used to update the lineage relationship of the corresponding database table in the preset graph database according to the logical relationship.
  • this application also provides a Hadoop-based data update system.
  • the Hadoop-based data update system includes a memory, a processor, and a Hadoop-based data update program that is stored on the memory and can run on the processor,
  • and the Hadoop-based data update program implements the steps of the Hadoop-based data update method as described above when it is executed by the processor.
  • this application also provides a computer-readable storage medium that stores a Hadoop-based data update program, and the Hadoop-based data update program implements the steps of the Hadoop-based data update method as described above when it is executed by a processor.
  • In this application, the batch running task is compiled in the Hadoop cluster to obtain the task statement corresponding to the batch running task; the task statement is parsed in the data management system to obtain the logical relationship between the task statement and each database table; and the lineage relationship of the corresponding database table in the preset graph database is updated according to the logical relationship, which improves the accuracy of the lineage relationships between the database tables in the graph database.
  • Fig. 1 is a schematic flowchart of a first embodiment of a Hadoop-based data update method according to the present application
  • FIG. 2 is a schematic flowchart of a second embodiment of the Hadoop-based data update method of this application.
  • Fig. 3 is a functional schematic block diagram of a preferred embodiment of the Hadoop-based data update device of the present application
  • Fig. 4 is a schematic structural diagram of a hardware operating environment involved in a solution of an embodiment of the present application.
  • FIG. 1 is a schematic flowchart of a first embodiment of the data update method based on Hadoop in this application.
  • the embodiment of the application provides an embodiment of a Hadoop-based data update method. It should be noted that although a logical sequence is shown in the flowchart, in some cases the steps shown or described may be executed in an order different from the one here.
  • JanusGraph: an open-source distributed graph database. It has good scalability and, through a multi-machine cluster, can support the storage and querying of graph data with tens of billions of vertices and edges.
  • JanusGraph is a transactional database that supports a large number of users concurrently executing complex real-time graph traversals.
  • Hadoop: a distributed system infrastructure developed by the Apache Foundation. It is a software framework that can perform distributed processing of large amounts of data in a reliable, efficient, and scalable manner.
  • HDFS: the Hadoop Distributed File System.
  • HDFS is highly fault-tolerant, is designed to be deployed on low-cost hardware, and provides high-throughput access to application data, making it suitable for applications with large data sets.
  • Metadata: also known as intermediary data or relay data, it is data that describes data (data about data), mainly describing data properties, and is used to support functions such as indicating storage location, historical data, resource search, and file recording.
  • Binlog: the binlog records all statements that have updated data or may have updated data. The statements are stored in the form of "events", which describe the data changes that occur in a database.
  • HQL: short for HiveQL, a language similar to SQL (Structured Query Language) that is compatible with most SQL syntax but does not fully support the SQL standard.
  • SQL: Structured Query Language.
  • MapReduce: a programming model on the Hadoop platform for parallel operations on large-scale data sets (greater than 1 TB). Its function is to aggregate a mass of disordered data according to certain characteristics and then process it to obtain a final result. Map works on the disordered, unrelated input, parsing each record and extracting a key and a value from it; Reduce then aggregates the data produced by Map to obtain the final result.
  • Hive: a data warehouse tool based on Hadoop, which can map structured data files to database tables, provides simple SQL query functions, and can convert SQL statements into MapReduce tasks for execution.
  • The Hadoop-based data update method includes:
  • Step S10: after it is detected that the Hadoop cluster receives the batch running task sent by the client, compile the batch running task in the Hadoop cluster to obtain the task statement corresponding to the batch running task.
  • the Hadoop cluster detects in real time or regularly whether it receives batch tasks sent by the client.
  • the duration corresponding to the timing can be set according to specific needs, and there is no specific limitation on the duration corresponding to the timing in this embodiment.
  • the user of the client can manually trigger the batch running task, or set a regular batch running task in the client. Updating data includes but is not limited to modifying existing data, adding new data, or deleting existing data.
  • After the Hadoop cluster receives the batch running task sent by the client, the Hadoop cluster compiles the received batch running task to obtain the task statement corresponding to the batch running task, and sends the task statement to the data management system. Further, when the Hadoop cluster does not receive a batch running task sent by the client, the Hadoop cluster continues to detect whether it receives a batch running task sent by the client.
  • step S10 includes:
  • Step a: after it is detected that the Hadoop cluster receives the batch running task sent by the client, call the Hive compiler of the Hadoop cluster to compile the batch running task, and obtain the HQL statement corresponding to the batch running task.
  • In the Hadoop cluster, a Hive compiler, an execution engine, and a listener are provided, where the listener is a hook listener.
  • After the Hadoop cluster receives the batch running task sent by the client, the Hadoop cluster compiles the batch running task through its built-in Hive compiler and obtains the HQL (Hive Query Language) statement corresponding to the batch running task. It can be understood that the HQL statement is the task statement corresponding to the batch running task.
  • the Hadoop cluster can also compile the batch running task through its built-in hive compiler to obtain the SQL statement corresponding to the batch running task.
  • After the Hadoop cluster obtains the HQL statement corresponding to the batch running task through the Hive compiler, the Hadoop cluster submits the HQL statement to the execution engine. At this time, the listener of the Hadoop cluster can monitor each HQL statement, and the monitored HQL statements are sent to the data management system.
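  • The listener behaviour described above can be sketched as follows. This is an illustrative Python sketch only: the class, method names, and the list standing in for the data management system are assumptions, and a real Hive hook would be written in Java against Hive's hook interfaces.

```python
# Illustrative stand-in for the hook listener: it intercepts each
# compiled statement and forwards it to the data management system
# (forwarding is simulated here by appending to a list).
class StatementHook:
    def __init__(self, sink):
        self.sink = sink  # stand-in for the data management system

    def on_statement(self, hql: str) -> None:
        # Pure SELECT statements are queries and do not change lineage,
        # so they are not forwarded, matching the behaviour in the text.
        if not hql.lstrip().upper().startswith("SELECT"):
            self.sink.append(hql)

captured = []
hook = StatementHook(captured)
hook.on_statement("INSERT OVERWRITE TABLE c SELECT * FROM d")
hook.on_statement("SELECT * FROM c")  # query only: not forwarded
```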
  • Step S20: parse the task statement in the data management system to obtain the logical relationship between the task statement and each database table.
  • After the data management system receives the task statement sent by the Hadoop cluster, the data management system parses the task statement to obtain the logical relationship between the task statement and each database table. It should be noted that the task statement involves data related to the batch running task; these data may exist in different database tables, and there are certain logical relationships between those database tables. For example, a certain piece of data a is located in database table A, but data a needs to be processed into database table B; this indicates that database table A and database table B have a logical relationship.
  • step S20 includes:
  • Step b: parse the task statement in the data management system to obtain the database tables corresponding to the task statement.
  • Step c: determine the source table and the target table among the database tables corresponding to the task statement, and determine the logical relationship between the task statement and each database table according to the source table and the target table.
  • That is, the data management system parses the task statement, obtains the database tables corresponding to the task statement, determines the source table and the target table among those database tables, and determines the logical relationship between the task statement and each database table according to the source table and the target table.
  • The source table is an upstream table and the target table is a downstream table; that is, the data in the target table comes from the source table.
  • For example, if "database table C from database table D" appears in the task statement, it can be determined that database table C is the target table and database table D is the source table.
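  • As an illustrative sketch of steps b and c (the function name, the regular expressions, and the single-statement scope are assumptions for illustration, not the actual parser), the target and source tables can be recovered from an INSERT ... SELECT statement like this:

```python
import re

def source_and_target(hql: str):
    """Illustrative only: the table written to is the target (downstream)
    table; the tables read from are the source (upstream) tables."""
    target = re.search(r"INSERT\s+(?:INTO|OVERWRITE)\s+TABLE\s+(\w+)",
                       hql, re.IGNORECASE)
    sources = re.findall(r"(?:FROM|JOIN)\s+(\w+)", hql, re.IGNORECASE)
    return (target.group(1) if target else None, sources)

# Database table c is the target table; d and e are source tables.
tgt, srcs = source_and_target(
    "INSERT OVERWRITE TABLE c SELECT d.id FROM d JOIN e ON d.id = e.id")
```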
  • Step S30: update the lineage relationship of the corresponding database table in the preset graph database according to the logical relationship.
  • After the data management system obtains the logical relationship between the task statement and each database table, the data management system updates the lineage relationship of the corresponding database table according to the logical relationship. It should be noted that the logical relationships among the database tables involved in the task statement can be determined through the parsed logical relationship.
  • the graph database can be a JanusGraph graph database, or can be a graph database such as neo4j, ImageNet, and HugeGraph.
  • Neo4j is a high-performance NoSQL graph database that stores structured data in a graph network rather than in tables.
  • the ImageNet project is a large-scale visualization database for the research of visual object recognition software.
  • HugeGraph is an easy-to-use, efficient, and general-purpose open-source graph database system that implements the Apache TinkerPop3 framework and is fully compatible with the Gremlin query language. It has complete tool-chain components that help users easily build applications and products on top of graph databases.
  • The task statement may or may not update the lineage relationships of the database tables in the graph database.
  • If the task statement is a data query statement, the lineage relationships of the database tables in the graph database will not change; the graph database can still update the lineage relationship of the corresponding database table according to the logical relationship of the database tables involved in the data query statement, but in this case the lineage relationship of each database table in the graph database before the update is consistent with that after the update.
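  • A minimal in-memory sketch of step S30 follows. A production system would issue Gremlin traversals against JanusGraph; the dict-of-sets graph and the class name here are assumptions for illustration. Because the update is idempotent, replaying the logical relationship of a pure query statement leaves the lineage unchanged, as discussed above.

```python
class LineageGraph:
    """Stand-in for the preset graph database: maps each source table to
    the set of target tables directly downstream of it."""
    def __init__(self):
        self.downstream = {}

    def update(self, sources, target):
        # Record one lineage edge per source table; adding an existing
        # edge again is a no-op, so the update is idempotent.
        for s in sources:
            self.downstream.setdefault(s, set()).add(target)

g = LineageGraph()
g.update(["d"], "c")
g.update(["d"], "c")  # replay: lineage before and after is consistent
```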
  • After the Hadoop cluster receives the batch running task, it judges whether the batch running task is a data query task.
  • If the batch running task is a data query task, the Hadoop cluster does not send the task statement corresponding to the batch running task to the data management system, and in this case there is no need to update the lineage relationships of the database tables in the graph database.
  • In this embodiment, the batch running task is compiled in the Hadoop cluster to obtain the task statement corresponding to the batch running task; the task statement is parsed in the data management system to obtain the logical relationship between the task statement and each database table; and the lineage relationship of the corresponding database table in the preset graph database is updated according to the logical relationship. In this way, when the batch running task is executed, the lineage relationships of the corresponding database tables in the graph database are updated according to the logical relationships of the database tables corresponding to the batch running task, which improves the accuracy of the lineage relationships between the database tables in the graph database.
  • the Hadoop-based data update method further includes:
  • Step S40: process the data corresponding to the batch running task in the Hadoop cluster to obtain processed data.
  • The statement type of the task statement can be determined by type keywords. For example, when type keywords such as update, add, or delete appear in the task statement, it can be determined that the task statement is a data update statement; when type keywords indicating query and acquisition, such as search or get, appear in the task statement, it can be determined that the task statement is a data query statement.
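  • The type-keyword check described above can be sketched as follows (the keyword sets are illustrative assumptions, not an exhaustive list):

```python
def statement_type(stmt: str) -> str:
    # Classify a task statement by its leading type keyword.
    head = stmt.lstrip().split(None, 1)[0].upper()
    if head in {"UPDATE", "INSERT", "DELETE", "ALTER", "CREATE", "DROP"}:
        return "data update statement"
    if head in {"SELECT", "SHOW", "DESCRIBE"}:
        return "data query statement"
    return "unknown"
```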
  • the Hadoop cluster processes the data corresponding to the batch task to obtain the processed data.
  • the Hadoop cluster uses MapReduce calculations to process the data corresponding to the batch task into data in a specific format.
  • the data corresponding to the batch running task is processed into fixed-length data, or processed into data of a certain data type.
  • The data corresponding to the batch running task may be newly added data, or it may be a modification of the metadata stored in the relational database corresponding to the Hadoop cluster, or a modification of the data stored in HDFS.
  • The data itself is stored in HDFS.
  • The metadata of the data in HDFS is not stored in HDFS itself, but in a traditional relational database, such as MySQL.
  • Step S50 Update the metadata database according to the processed data to obtain the updated metadata database.
  • After the Hadoop cluster obtains the processed data, the Hadoop cluster updates the data stored in HDFS according to the processed data, and updates the metadata database storing the metadata according to the processed data to obtain the updated metadata database. If the database storing the metadata is MySQL, the Hadoop cluster updates MySQL based on the processed data.
  • Step S60 Obtain updated metadata from the updated metadata database through the data management system, and obtain the processed data and the table name of the database table where the processed data is located.
  • the data management system monitors updates to the metadata database.
  • the data management system obtains the updated metadata from the updated metadata database, obtains the processed data in the HDFS of the Hadoop cluster, and obtains the table name of the database table where the processed data is located. It should be noted that in a Hadoop cluster, every piece of data is stored in a certain database table, and each database table has a table name that uniquely identifies that database table.
  • the step of obtaining updated metadata from the updated metadata database through the data management system includes:
  • Step d Obtain the monitoring log of the updated metadata database through a preset monitoring program in the data management system.
  • Step e Parse the monitoring log to obtain the target keyword in the monitoring log.
  • Step f Obtain updated metadata in the updated metadata database according to the target keyword.
  • the data management system deploys the Binlog listener in the metadata database.
  • the data management system uses the CANAL framework to deploy the Binlog listener in the metadata database.
  • the data management system uses the Binlog listener to obtain the updated metadata.
  • the monitoring log is a Binlog log
  • the monitoring log is parsed, the target keyword in the monitoring log is obtained, and the updated metadata is obtained in the updated metadata database according to the target keyword.
  • the target keywords are keywords such as update, add, and delete.
  • the target keyword and the type keyword may be the same or different.
  • The working principle of CANAL is as follows: (1) CANAL imitates the interaction protocol of a MySQL slave, pretends to be a MySQL slave, and sends the dump protocol to the MySQL master; the MySQL master receives the dump request and starts to push the binary log (Binlog) to the slave (that is, CANAL), which parses the binary log objects (originally a byte stream). (2) CANAL uses open-replicator to parse the binary log; Open Replicator is a MySQL Binlog analysis program written in Java. (3) CANAL needs to maintain an Event Store, which can be held in Memory, File, or Zookeeper. (4) CANAL needs to maintain the state of the client, and an instance (process) can only have one consumer at a time.
  • The table name of the database table corresponding to the updated metadata can also be parsed out, and the table name corresponding to the updated metadata is the table name of the database table where the processed data is located.
  • That is, the data management system can obtain the updated metadata through the monitoring log and determine the table name of the database table where the processed data is located.
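  • Steps d to f can be sketched as follows. The log format and function name are invented for illustration; a real deployment would consume parsed Binlog events from the CANAL client rather than scan raw text.

```python
import re

def parse_monitoring_log(log_lines):
    """Scan monitoring-log lines for the target keywords and recover,
    per event, the operation and the database table it touched."""
    event_re = re.compile(
        r"\b(UPDATE|INSERT|DELETE)\b\s+(?:INTO\s+|FROM\s+)?(\w+)",
        re.IGNORECASE)
    events = []
    for line in log_lines:
        m = event_re.search(line)
        if m:
            events.append((m.group(1).upper(), m.group(2)))
    return events

events = parse_monitoring_log([
    "2020-05-14 10:00:01 UPDATE metadata_tbl SET location='/warehouse/c'",
    "2020-05-14 10:00:02 SELECT * FROM metadata_tbl",  # no target keyword
])
```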
  • Step S70 Update the database table corresponding to the table name in the graph database according to the updated metadata and the processed data, and determine the updated database table as an upstream database table.
  • each metadata has a corresponding database table, and therefore, each metadata has a corresponding table name.
  • After the data management system obtains the processed data, the table name, and the updated metadata, the data management system updates the database table corresponding to the table name in the graph database according to the updated metadata and the processed data, and determines the updated database table as the upstream database table.
  • step S40, step S50, step S60, and step S70 may be executed before step S20 and step S30, or may be executed after step S20 and step S30, or executed between step S20 and step S30.
  • Step S80: determine the downstream database table corresponding to the upstream database table according to the lineage relationship.
  • Step S90 Update the downstream database table according to the updated metadata and the processed data.
  • That is, the data management system determines the downstream database table corresponding to the upstream database table according to the lineage relationship in the graph database, and updates the downstream database table according to the updated metadata and the processed data.
  • The data in database tables with a lineage relationship has dependencies; therefore, when the upstream database table is updated, the downstream database tables that have a lineage relationship with the upstream database table will be affected.
  • The updated database table is obtained by updating the database table in the graph database according to the processed data, the table name corresponding to the processed data, and the updated metadata; the updated database table is determined as the upstream database table, and the downstream database tables that have a lineage relationship with the upstream database table are updated according to the processed data and the updated metadata, so that data consistency between the upstream database table and the downstream database tables is accurately maintained in real time.
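  • Step S80 can be sketched as a breadth-first walk over the lineage relationships (the dict representation standing in for the graph database is an assumption for illustration):

```python
from collections import deque

def downstream_tables(lineage, upstream):
    """Collect every downstream table transitively affected when the
    given upstream table is updated."""
    seen, queue = set(), deque([upstream])
    while queue:
        table = queue.popleft()
        for child in lineage.get(table, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# a feeds b and c; b feeds d, so updating a affects b, c and d.
lineage = {"a": ["b", "c"], "b": ["d"]}
affected = downstream_tables(lineage, "a")
```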
  • Step g Send prompt information to the client corresponding to the downstream database table, so that the client prompts the user according to the prompt information that the upstream database table corresponding to the downstream database table has been updated.
  • Step h If an update instruction sent by the client corresponding to the downstream database table is received, the downstream database table is updated according to the updated metadata and the processed data.
  • After the data management system determines the downstream database table, the data management system generates prompt information, sends the prompt information to the client corresponding to the downstream database table, and detects whether an update instruction sent by the client corresponding to the downstream database table is received.
  • After the client corresponding to the downstream database table receives the prompt information, it outputs the prompt information to inform the corresponding user, according to the prompt information, that the upstream database table corresponding to the downstream database table has been updated. In this embodiment, the output form of the prompt information is not restricted.
  • the user corresponding to the downstream database table can trigger an update instruction on the display interface of the client.
  • the client corresponding to the downstream database table detects the update instruction, it sends the update instruction to the data management system.
  • When the data management system receives the update instruction sent by the client corresponding to the downstream database table, the data management system updates the downstream database table according to the updated metadata and the processed data.
  • Prompt information is sent to the client corresponding to the downstream database table so that the user of that client can decide whether to update the downstream database table; after the update instruction triggered by that user is received, the downstream database table is updated. In this way, after the upstream database table is updated, the user corresponding to the downstream database table can independently decide whether to update the downstream database table.
  • the difference between the fourth embodiment of the Hadoop-based data update method and the first, second, or third embodiment is that, if the task statement is a data query statement, the Hadoop-based data update method further includes:
  • Step i Obtain target data corresponding to the data query sentence in the Hadoop cluster.
  • Step j Send the target data to the client corresponding to the batch running task.
  • the Hadoop cluster parses the data query statement to obtain the target table name of the database table corresponding to the data query statement, obtains the target data corresponding to the data query statement in HDFS according to the target table name, and sends the obtained target data to the client corresponding to the batch running task.
  • the target data corresponding to the data query statement is obtained in the Hadoop cluster and sent to the client corresponding to the batch running task, without waiting for the task scheduling platform to forward the client's data query request, which improves the efficiency of querying data in the Hadoop cluster.
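  • Steps i and j can be sketched as follows (the single-table FROM parsing and the dict standing in for HDFS are simplifying assumptions for illustration):

```python
import re

def run_query(statement, hdfs_tables):
    """Resolve the target table name from the data query statement and
    return its rows directly, without routing the request through a
    task scheduling platform."""
    m = re.search(r"\bFROM\s+(\w+)", statement, re.IGNORECASE)
    if not m:
        raise ValueError("no target table in query statement")
    return hdfs_tables.get(m.group(1), [])

tables = {"c": [(1, "x"), (2, "y")]}  # stand-in for data stored in HDFS
rows = run_query("SELECT * FROM c", tables)
```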
  • the Hadoop-based data update device includes:
  • the compiling module 10 is configured to, after detecting that the Hadoop cluster receives the batch running task sent by the client, compile the batch running task in the Hadoop cluster to obtain the task statement corresponding to the batch running task;
  • the parsing module 20 is used to parse the task statement in the data management system to obtain the logical relationship between the task statement and each database table;
  • the update module 30 is configured to update the lineage relationship of the corresponding database table in the preset graph database according to the logical relationship.
  • the Hadoop-based data update device further includes:
  • the processing module is used to process the data corresponding to the batch running task in the Hadoop cluster to obtain processed data;
  • the update module 30 is further configured to: update the metadata database according to the processed data to obtain the updated metadata database; update the database table corresponding to the table name in the graph database according to the updated metadata and the processed data; and update the downstream database table according to the updated metadata and the processed data;
  • the Hadoop-based data update device further includes:
  • the first obtaining module is used to obtain updated metadata in the updated metadata database through the data management system, and obtain the processed data and the table name of the database table where the processed data is located ;
  • the determining module is configured to determine the updated database table as an upstream database table, and to determine the downstream database table corresponding to the upstream database table according to the lineage relationship.
  • the Hadoop-based data update device further includes:
  • the first sending module is configured to send prompt information to the client corresponding to the downstream database table, so that the client prompts the user according to the prompt information that the upstream database table corresponding to the downstream database table has been updated;
  • the update module 30 is further configured to update the downstream database table according to the updated metadata and the processed data if an update instruction sent by the client corresponding to the downstream database table is received.
  • the first obtaining module includes:
  • the obtaining unit is configured to obtain the monitoring log of the updated metadata database through a preset monitoring program in the data management system;
  • the first parsing unit is used for parsing the monitoring log
  • the obtaining unit is further configured to obtain a target keyword in the monitoring log; and obtain updated metadata in the updated metadata database according to the target keyword.
  • the Hadoop-based data update device further includes:
  • the second obtaining module is configured to obtain target data corresponding to the data query statement in the Hadoop cluster
  • the second sending module is configured to send the target data to the client corresponding to the batch running task.
  • analysis module 20 further includes:
  • the second parsing unit is used to parse the task sentence in the data management system to obtain the database table corresponding to the task sentence;
  • the determining unit is configured to determine the source table and the target table in the database table corresponding to the task sentence, and determine the logical relationship of each database table corresponding to the task sentence according to the source table and the target table.
  • the compilation module 10 is also configured to, after detecting that the Hadoop cluster receives the batch running task sent by the client, invoke the hive compiler of the Hadoop cluster to compile the batch running task to obtain the batch running task.
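The obtaining and parsing units above monitor the metadata database's log and pull out updated table names by a target keyword. A minimal sketch of that idea follows; the plain-text log format and the keyword `ALTER TABLE` are assumptions for illustration only, since a real deployment would consume binlog events (e.g. via CANAL) rather than text lines.

```python
# Minimal sketch of the obtaining/parsing units: scan a monitoring log of the
# metadata database for a target keyword and collect the affected table names.
# The log format and keyword are illustrative assumptions, not a real format.

TARGET_KEYWORD = "ALTER TABLE"  # hypothetical keyword marking metadata updates

def parse_monitoring_log(log_lines):
    """Return the table names mentioned in lines containing the target keyword."""
    updated_tables = []
    for line in log_lines:
        if TARGET_KEYWORD in line:
            # take the token right after the keyword as the table name
            rest = line.split(TARGET_KEYWORD, 1)[1].strip()
            updated_tables.append(rest.split()[0].rstrip(";"))
    return updated_tables

log = [
    "2019-05-27 10:00:01 ALTER TABLE mart.user_profile ADD COLUMN age INT;",
    "2019-05-27 10:00:02 SELECT * FROM mart.orders;",
]
print(parse_monitoring_log(log))  # ['mart.user_profile']
```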
  • FIG. 4 is a schematic structural diagram of the hardware operating environment involved in the solution of the embodiment of the present application.
• FIG. 4 may depict the hardware operating environment of the Hadoop-based data update system.
• in the embodiments of the present application, the Hadoop-based data update system may be a terminal device such as a PC or a portable computer.
  • the Hadoop-based data update system may include: a processor 1001, such as a CPU, a memory 1005, a user interface 1003, a network interface 1004, and a communication bus 1002.
  • the communication bus 1002 is used to implement connection and communication between these components.
• the user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard); optionally, the user interface 1003 may also include a standard wired interface and a wireless interface.
  • the network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface).
• the memory 1005 may be a high-speed RAM, or a non-volatile memory such as a magnetic disk memory.
  • the memory 1005 may also be a storage device independent of the foregoing processor 1001.
  • the Hadoop-based data update system may also include a camera, an RF (Radio Frequency) circuit, a sensor, an audio circuit, a WiFi module, and so on.
• the structure of the Hadoop-based data update system shown in FIG. 4 does not constitute a limitation on the system, which may include more or fewer components than shown, combine certain components, or adopt a different component arrangement.
  • the memory 1005 as a computer storage medium may include an operating system, a network communication module, a user interface module, and a data update program based on Hadoop.
  • the operating system is a program that manages and controls the hardware and software resources of the Hadoop-based data update system, and supports the operation of the Hadoop-based data update program and other software or programs.
• the user interface 1003 is mainly used to connect to a client and exchange data with the client;
• the network interface 1004 is mainly used to connect to a background server and communicate with the background server;
• the processor 1001 may be used to invoke the Hadoop-based data update program stored in the memory 1005 and execute the steps of the Hadoop-based data update method described above.
• an embodiment of the present application further provides a computer-readable storage medium storing a Hadoop-based data update program which, when executed by a processor, implements the steps of the Hadoop-based data update method described above.
• the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform; they can of course also be implemented by hardware, but in many cases the former is the better implementation.
• based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the method described in each embodiment of the present application.

Abstract

Disclosed are a Hadoop-based data update method, device, and system, as well as a medium. The method comprises the following steps: after detecting that a Hadoop cluster has received a batch task sent by a client, compiling the batch task in the Hadoop cluster to obtain the task statement corresponding to the batch task (S10); parsing the task statement in a data governance system to obtain the logical relationships among the database tables involved in the task statement (S20); and updating, according to the logical relationships, the lineage relationships of the corresponding database tables in a preset graph database (S30).

Description

Hadoop-based data update method, device, system and medium
This application claims priority to Chinese patent application No. 201910448948.6, filed on May 27, 2019 and entitled "Hadoop-based data update method, device, system and medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of financial technology (Fintech) data processing, and in particular to a Hadoop-based data update method, device, system and medium.
Background
With the continuous development of financial technology, especially Internet finance (Fintech), more and more technologies (such as distributed computing, blockchain, and artificial intelligence) are being applied in the financial field, and the financial industry in turn places higher requirements on these technologies.
At present, many enterprises have built their own Hadoop big data platforms. On such a platform, the data of the various application systems within an enterprise is shared, forming data warehouses and a number of data marts built around different subjects, where a data warehouse stores the data of a particular application system and a data mart stores the data of the same subject drawn from the various application systems. When the data of an upstream database table on which a data mart depends is rerun, the downstream database tables are notified so that they can passively rerun their batches in response to the change in the upstream table. This notification is typically implemented by a global distributed task scheduling platform: when an upstream database table changes, the Hadoop big data platform informs the task scheduling platform, which generates a corresponding notification and sends it to the clients of the downstream database tables to trigger reprocessing of their data. When the lineage relationships between upstream and downstream database tables are complex, however, the task scheduling platform cannot determine all of the affected downstream tables, so the lineage relationships between upstream and downstream tables are updated incompletely and therefore become inaccurate; that is, the lineage relationships between the database tables in the graph database become inaccurate.
Summary of the Invention
Technical Problem
Solution to the Problem
Technical Solution
The main purpose of this application is to provide a Hadoop-based data update method, device, system and medium, aiming to solve the existing technical problem that the lineage relationships between database tables in a graph database are inaccurate when batch tasks are executed.
To achieve the above objective, this application provides a Hadoop-based data update method, which includes the following steps:
after detecting that a Hadoop cluster has received a batch task sent by a client, compiling the batch task in the Hadoop cluster to obtain the task statement corresponding to the batch task;
parsing the task statement in a data governance system to obtain the logical relationships among the database tables involved in the task statement;
updating, according to the logical relationships, the lineage relationships of the corresponding database tables in a preset graph database.
In an embodiment, if the task statement is a data update statement, after the step of compiling the batch task in the Hadoop cluster to obtain the task statement corresponding to the batch task, the method further includes:
processing the data corresponding to the batch task in the Hadoop cluster to obtain processed data;
updating the metadata database according to the processed data to obtain an updated metadata database;
obtaining, through the data governance system, the updated metadata from the updated metadata database, and obtaining the processed data and the table name of the database table in which the processed data resides;
updating the database table corresponding to the table name in the graph database according to the updated metadata and the processed data, and determining the updated database table as an upstream database table;
after the step of updating the lineage relationships of the corresponding database tables in the preset graph database according to the logical relationships, determining the downstream database tables corresponding to the upstream database table according to the lineage relationships;
updating the downstream database tables according to the updated metadata and the processed data.
In an embodiment, after the step of determining the downstream database tables corresponding to the upstream database table according to the lineage relationships, the method further includes:
sending prompt information to the client corresponding to a downstream database table, so that the client notifies the user, according to the prompt information, that the upstream database table on which the downstream database table depends has been updated;
if an update instruction sent by the client corresponding to the downstream database table is received, updating the downstream database table according to the updated metadata and the processed data.
In an embodiment, the step of obtaining the updated metadata from the updated metadata database through the data governance system includes:
obtaining the monitoring log of the updated metadata database through a listener preset in the data governance system;
parsing the monitoring log to obtain a target keyword in the monitoring log;
obtaining the updated metadata from the updated metadata database according to the target keyword.
In an embodiment, if the task statement is a data query statement, after the step of compiling the batch task in the Hadoop cluster to obtain the task statement corresponding to the batch task, the method further includes:
obtaining, in the Hadoop cluster, the target data corresponding to the data query statement;
sending the target data to the client corresponding to the batch task.
In an embodiment, the step of parsing the task statement in the data governance system to obtain the logical relationships among the database tables involved in the task statement includes:
parsing the task statement in the data governance system to obtain the database tables corresponding to the task statement;
determining the source tables and target tables among the database tables corresponding to the task statement, and determining the logical relationships among those tables according to the source tables and target tables.
In an embodiment, the step of compiling the batch task in the Hadoop cluster after detecting that the Hadoop cluster has received the batch task sent by the client includes:
after detecting that the Hadoop cluster has received the batch task sent by the client, invoking the Hive compiler of the Hadoop cluster to compile the batch task to obtain the HQL statement corresponding to the batch task.
In addition, to achieve the above objective, this application further provides a Hadoop-based data update device, which includes:
a compiling module, configured to compile the batch task in the Hadoop cluster after detecting that the Hadoop cluster has received the batch task sent by a client, to obtain the task statement corresponding to the batch task;
a parsing module, configured to parse the task statement in the data governance system to obtain the logical relationships among the database tables involved in the task statement;
an update module, configured to update, according to the logical relationships, the lineage relationships of the corresponding database tables in the preset graph database.
In addition, to achieve the above objective, this application further provides a Hadoop-based data update system, which includes a memory, a processor, and a Hadoop-based data update program stored on the memory and runnable on the processor, where the program, when executed by the processor, implements the steps of the Hadoop-based data update method described above.
In addition, to achieve the above objective, this application further provides a computer-readable storage medium storing a Hadoop-based data update program which, when executed by a processor, implements the steps of the Hadoop-based data update method described above.
In this application, after it is detected that the Hadoop cluster has received a batch task sent by a client, the batch task is compiled in the Hadoop cluster to obtain the task statement corresponding to the batch task; the task statement is parsed in the data governance system to obtain the logical relationships among the database tables involved in the task statement; and the lineage relationships of the corresponding database tables in the preset graph database are updated according to these logical relationships. In this way, whenever a batch task is executed, the lineage relationships of the corresponding tables in the graph database are updated according to the logical relationships among the tables involved in the batch task, which improves the accuracy of the lineage relationships between the database tables in the graph database.
Beneficial Effects of the Invention
Brief Description of the Drawings
Description of the Drawings
FIG. 1 is a schematic flowchart of a first embodiment of the Hadoop-based data update method of this application;
FIG. 2 is a schematic flowchart of a second embodiment of the Hadoop-based data update method of this application;
FIG. 3 is a functional module diagram of a preferred embodiment of the Hadoop-based data update device of this application;
FIG. 4 is a schematic structural diagram of the hardware operating environment involved in the solutions of the embodiments of this application.
The realization of the objectives, functional characteristics, and advantages of this application will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Embodiments of the Invention
Detailed Description of the Embodiments
It should be understood that the specific embodiments described here are only used to explain this application and are not intended to limit it.
This application provides a Hadoop-based data update method. Referring to FIG. 1, FIG. 1 is a schematic flowchart of a first embodiment of the Hadoop-based data update method of this application.
The embodiments of this application provide embodiments of the Hadoop-based data update method. It should be noted that although a logical order is shown in the flowchart, in some cases the steps shown or described may be executed in an order different from the one here.
The technical terms used in the embodiments of this application are explained below:
① JanusGraph: an open-source distributed graph database with good scalability; through a multi-machine cluster it can store and query graph data with tens of billions of vertices and edges. JanusGraph is a transactional database that supports a large number of users concurrently executing complex real-time graph traversals.
② Hadoop: a distributed system infrastructure developed by the Apache Foundation; a software framework capable of distributed processing of large amounts of data in a reliable, efficient, and scalable manner.
③ HDFS: the Hadoop Distributed File System. HDFS is highly fault-tolerant, is designed to be deployed on low-cost hardware, and provides high throughput for accessing application data, making it suitable for applications with very large data sets.
④ Metadata: data about data, mainly information describing the properties of data, used to support functions such as indicating storage locations, recording historical data, resource lookup, and file records.
⑤ Binlog: the binlog records all statements that have updated data or may have updated data. Statements are stored in the form of "events", which describe all data changes occurring in a database.
⑥ HQL: short for HiveQL, a language similar to SQL (Structured Query Language). It is compatible with most SQL syntax but does not fully support the SQL standard.
⑦ CANAL: an open-source project from Alibaba, developed in pure Java. Based on incremental database log parsing, it provides incremental data subscription and consumption; it currently mainly supports MySQL (a relational database management system) and also supports MariaDB.
⑧ MapReduce: a programming model on the Hadoop platform for parallel computation over large-scale data sets (larger than 1 TB). Its purpose is to organize a mass of unordered data according to some characteristic and then process it to obtain a final result. Map handles the unordered, unrelated data: it parses each record and extracts a key and a value from it; Reduce then aggregates the data produced by Map to obtain the final result.
⑨ Graph database: a type of NoSQL (non-relational) database that applies graph theory to store the relationship information between entities.
⑩ Hive: a Hadoop-based data warehouse tool that can map structured data files to database tables, provides simple SQL query functions, and can convert SQL statements into MapReduce tasks for execution.
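The Map and Reduce phases described in item ⑧ can be sketched in a few lines. This is a single-process illustration of the programming model only, not Hadoop's distributed implementation; the word-count example and names are chosen for illustration.

```python
# Single-process sketch of the MapReduce model from item ⑧: Map extracts
# (key, value) pairs from unordered records; Reduce aggregates them per key.
from collections import defaultdict

def map_phase(records):
    """Parse each record and emit (key, value) pairs."""
    for record in records:
        for word in record.split():
            yield word, 1  # key = word, value = a count of 1

def reduce_phase(pairs):
    """Aggregate the mapped pairs by key to obtain the final result."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

records = ["hadoop hive", "hive sql hive"]
print(reduce_phase(map_phase(records)))  # {'hadoop': 1, 'hive': 3, 'sql': 1}
```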
The Hadoop-based data update method includes the following steps.
Step S10: after it is detected that the Hadoop cluster has received a batch task sent by a client, compile the batch task in the Hadoop cluster to obtain the task statement corresponding to the batch task.
The Hadoop cluster detects, in real time or periodically, whether a batch task sent by a client has been received. The period can be set according to specific needs and is not limited in this embodiment. When a client needs to view or update data, the user of the client can trigger a batch task manually or configure a scheduled batch task in the client. Updating data includes, but is not limited to, modifying existing data, adding new data, and deleting existing data. After the Hadoop cluster receives a batch task sent by a client, it compiles the task to obtain the corresponding task statement and sends that statement to the data governance system. Further, if no batch task has been received, the Hadoop cluster continues to monitor for batch tasks sent by clients.
Further, step S10 includes:
Step a: after it is detected that the Hadoop cluster has received the batch task sent by the client, invoke the Hive compiler of the Hadoop cluster to compile the batch task to obtain the HQL statement corresponding to the batch task.
Specifically, the Hadoop cluster is provided with a Hive compiler, an execution engine, and a listener, where the listener is a hook listener. After the Hadoop cluster receives a batch task sent by a client, it compiles the task with its built-in Hive compiler to obtain the corresponding HQL (HiveQL) statement; it can be understood that this HQL statement is the task statement corresponding to the batch task. It should be noted that the Hadoop cluster may also compile the batch task with its built-in Hive compiler to obtain the SQL statement corresponding to the batch task.
After the Hadoop cluster obtains the HQL statement corresponding to the batch task through the Hive compiler, it submits the statement to the execution engine. At this point, the listener of the Hadoop cluster can monitor each HQL statement, obtain the monitored statements, and send them to the data governance system.
Step S20: parse the task statement in the data governance system to obtain the logical relationships among the database tables involved in the task statement.
After the data governance system receives the task statement sent by the Hadoop cluster, it parses the statement to obtain the logical relationships among the database tables involved. It should be noted that the task statement concerns data related to the batch task, and this data may reside in different database tables, between which certain logical relationships exist. For example, if a piece of data a resides in database table A but is obtained by processing database table B, then database tables A and B have a logical relationship.
Further, step S20 includes:
Step b: parse the task statement in the data governance system to obtain the database tables corresponding to the task statement.
Step c: determine the source tables and target tables among the database tables corresponding to the task statement, and determine the logical relationships among those tables according to the source tables and target tables.
Specifically, the data governance system parses the task statement to obtain the database tables it refers to, determines the source tables and target tables among them, and determines the logical relationships among the tables according to the source and target tables. It can be understood that a source table is an upstream table and a target table is a downstream table; that is, the data in the target table originates from the source table. It should be noted that the task statement contains the names of the corresponding database tables as well as logical keywords that indicate the logical relationships between those table names; through these logical keywords, the source tables and target tables of the statement can be determined. For example, if the task statement contains "database table C from database table D", it can be determined that database table C is the target table and database table D is the source table.
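The keyword-based determination in step c can be sketched as follows. The regular expressions below cover only `INSERT ... FROM`/`JOIN`-style statements and are an assumption for illustration; a production system would walk a full HQL abstract syntax tree rather than use regexes.

```python
# Sketch of step c: determine the target table and source tables from the
# logical keywords in a task statement. Only INSERT/FROM/JOIN keywords are
# handled here; real HQL parsing is considerably more involved.
import re

def extract_tables(task_statement):
    """Return the target (downstream) table and source (upstream) tables."""
    sql = task_statement.strip().rstrip(";")
    target = re.search(r"insert\s+(?:overwrite|into)\s+table\s+(\S+)", sql, re.I)
    sources = re.findall(r"\b(?:from|join)\s+(\S+)", sql, re.I)
    return {
        "target": target.group(1) if target else None,
        "sources": sources,  # the tables the target's data originates from
    }

stmt = "INSERT OVERWRITE TABLE mart.c SELECT * FROM dw.d JOIN dw.e ON d.id = e.id"
print(extract_tables(stmt))
# {'target': 'mart.c', 'sources': ['dw.d', 'dw.e']}
```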
Step S30: update the lineage relationships of the corresponding database tables in the preset graph database according to the logical relationships.
After the data governance system obtains the logical relationships among the database tables involved in the task statement, it updates the lineage relationships of the corresponding database tables accordingly. It should be noted that these logical relationships determine the relationships among all of the database tables involved in the task statement. In this embodiment, the graph database may be JanusGraph, or another graph database such as Neo4j or HugeGraph. Neo4j is a high-performance NoSQL graph database that stores structured data in a graph rather than in tables. HugeGraph is an easy-to-use, efficient, and general-purpose open-source graph database system that implements the Apache TinkerPop3 framework, is fully compatible with the Gremlin query language, and provides a complete tool-chain to help users easily build applications and products on top of a graph database.
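A minimal in-memory stand-in for the lineage update in step S30 is sketched below. The real embodiment would write vertices and edges to a graph database such as JanusGraph (e.g. through Gremlin traversals); the dictionary-based graph and the table names here are simplifications for illustration.

```python
# In-memory stand-in for the graph database in step S30: lineage is a set of
# directed edges from each source (upstream) table to its target (downstream)
# table. A real deployment would persist these edges in JanusGraph instead.
from collections import defaultdict

class LineageGraph:
    def __init__(self):
        self.downstream = defaultdict(set)  # table -> tables derived from it

    def update(self, sources, target):
        """Record that `target` is produced from each table in `sources`."""
        for source in sources:
            self.downstream[source].add(target)

    def affected_tables(self, table):
        """All tables transitively downstream of `table` (for rerun notification)."""
        seen, stack = set(), [table]
        while stack:
            for nxt in self.downstream[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

graph = LineageGraph()
graph.update(["dw.d", "dw.e"], "mart.c")    # from one parsed task statement
graph.update(["mart.c"], "report.summary")  # from another
print(sorted(graph.affected_tables("dw.d")))  # ['mart.c', 'report.summary']
```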
可以理解的是,该任务语句可能会更新图形数据库中数据库表的血缘关系,也可能不会更新图像数据库中数据库表的血缘关系。如当任务语句为数据查询语句时,此时图形数据库中数据库表的血缘关系是不会改变的,但是图像数据库依旧可以根据数据查询语句中所涉及的数据库表的逻辑关系更新对应数据库表的血缘关系,只是此时图数据库中各个数据库表更新之前的血缘关系和更新后的血缘关系是一致的。It is understandable that the task statement may update the blood relationship of the database table in the graph database, or may not update the blood relationship of the database table in the image database. For example, when the task statement is a data query statement, the blood relationship of the database table in the graph database will not change at this time, but the image database can still update the blood relationship of the corresponding database table according to the logical relationship of the database table involved in the data query statement Relationship, but at this time the blood relationship of each database table in the graph database before the update is consistent with the blood relationship after the update.
进一步地，为了在跑批任务为数据查询任务时，不会更新图形数据库中数据库表的血缘关系，Hadoop集群在接收到跑批任务时，判断该跑批任务是否为数据查询任务，若该跑批任务为数据查询任务，Hadoop集群则不将跑批任务对应的任务语句发送给数据治理系统，此时，也就不需要更新图形数据库中数据库表的血缘关系了。Further, so that the blood relationship of the database tables in the graph database is not updated when the batch running task is a data query task, the Hadoop cluster, upon receiving a batch running task, judges whether the batch running task is a data query task. If the batch running task is a data query task, the Hadoop cluster does not send the task statement corresponding to the batch running task to the data management system, and in this case the blood relationship of the database tables in the graph database does not need to be updated.
本实施例通过当检测到Hadoop集群接收到客户端发送的跑批任务后，在Hadoop集群中对跑批任务进行编译，得到跑批任务对应的任务语句，在数据治理系统中对任务语句进行解析，得到任务语句对应各个数据库表的逻辑关系，根据逻辑关系更新预设图形数据库中对应数据库表的血缘关系，实现了当执行跑批任务时，根据跑批任务对应各个数据库表的逻辑关系更新图形数据库中对应数据库表的血缘关系，提高了图形数据库中数据库表之间的血缘关系的准确性。In this embodiment, after it is detected that the Hadoop cluster has received the batch running task sent by the client, the batch running task is compiled in the Hadoop cluster to obtain the task statement corresponding to the batch running task; the task statement is parsed in the data management system to obtain the logical relationships between the task statement and the database tables it involves; and the blood relationship of the corresponding database tables in the preset graph database is updated according to those logical relationships. In this way, when a batch running task is executed, the blood relationship of the corresponding database tables in the graph database is updated according to the logical relationships of the database tables involved in the task, which improves the accuracy of the blood relationships between database tables in the graph database.
进一步地,提出本申请基于Hadoop的数据更新方法第二实施例。Further, a second embodiment of the Hadoop-based data update method of this application is proposed.
所述基于Hadoop的数据更新方法第二实施例与所述基于Hadoop的数据更新方法第一实施例的区别在于，若所述任务语句为数据更新语句，参照图2，基于Hadoop的数据更新方法还包括：The difference between the second embodiment of the Hadoop-based data update method and the first embodiment of the Hadoop-based data update method is that, if the task statement is a data update statement, referring to Figure 2, the Hadoop-based data update method further includes:
步骤S40,在所述Hadoop集群中对所述跑批任务对应数据进行加工,得到加工后的数据。Step S40, processing the data corresponding to the batch running task in the Hadoop cluster to obtain processed data.
需要说明的是,任务语句中存在类型关键字,通过该类型关键字即可确定任务语句的语句类型。如当任务语句中存在update、add和delete等表示更新的类型关键字时,可确定任务语句为数据更新语句。当任务语句中存在search和gain等表示查询、获取的类型关键字时,可确定任务语句为数据查询语句。It should be noted that there is a type keyword in the task statement, and the statement type of the task statement can be determined by the type keyword. For example, when there are update, add, and delete type keywords in the task statement, it can be determined that the task statement is a data update statement. When there are type keywords such as search and gain in the task statement that indicate query and acquisition, the task statement can be determined to be a data query statement.
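A minimal sketch of the keyword-based classification described above; the keyword sets are exactly those named in this paragraph, and the function name and token-level matching are hypothetical simplifications:

```python
UPDATE_KEYWORDS = {"update", "add", "delete"}
QUERY_KEYWORDS = {"search", "gain"}

def classify_statement(task_statement: str) -> str:
    """Decide the statement type from the type keywords the statement contains."""
    tokens = task_statement.lower().split()
    if any(token in UPDATE_KEYWORDS for token in tokens):
        return "data_update"
    if any(token in QUERY_KEYWORDS for token in tokens):
        return "data_query"
    return "unknown"

print(classify_statement("update t_user set balance = 0"))  # data_update
print(classify_statement("search t_user where id = 1"))     # data_query
```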
当确定任务语句为数据更新语句后，Hadoop集群对跑批任务对应的数据进行加工，得到加工后的数据。具体地，Hadoop集群会通过MapReduce计算，将跑批任务对应的数据加工成特定格式的数据。如将跑批任务对应的数据加工成固定长度大小的数据，或者加工成某种特定数据类型的数据。此时，跑批任务对应的数据可为新增加的数据，也可为修改Hadoop集群对应关系型数据库所存储的元数据，或者修改HDFS中存储的数据等。需要说明的是，在Hadoop集群中，其数据是存储在HDFS中的，而HDFS中数据的元数据并不存放在HDFS本身上，而是存放在传统的关系型数据库中，如存储在MySQL中。When it is determined that the task statement is a data update statement, the Hadoop cluster processes the data corresponding to the batch running task to obtain the processed data. Specifically, the Hadoop cluster uses MapReduce computation to process the data corresponding to the batch running task into data of a specific format, for example into data of a fixed length, or into data of a particular data type. Here, the data corresponding to the batch running task may be newly added data, a modification of the metadata stored in the relational database associated with the Hadoop cluster, or a modification of the data stored in HDFS. It should be noted that in a Hadoop cluster the data is stored in HDFS, while the metadata of the data in HDFS is not stored in HDFS itself but in a traditional relational database, such as MySQL.
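The "specific format" produced by the processing step is not fixed by the embodiment; as one hedged example, records could be normalized to a fixed length. The single-process sketch below only illustrates that transformation (the real job would run as MapReduce, and all names are hypothetical):

```python
RECORD_WIDTH = 32  # hypothetical fixed length

def process_record(record: str, width: int = RECORD_WIDTH) -> str:
    """Truncate or right-pad a raw record so every output record has the same length."""
    return record[:width].ljust(width)

processed = [process_record(r) for r in ["alice,100", "bob,25"]]
print([len(r) for r in processed])  # [32, 32]
```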
步骤S50,根据所述加工后的数据更新元数据库,得到所述更新后的元数据库。Step S50: Update the metadata database according to the processed data to obtain the updated metadata database.
当Hadoop集群得到加工后的数据后，Hadoop集群会根据该加工后的数据更新HDFS中存储的数据，并根据加工后的数据更新存储元数据的元数据库，得到更新后的元数据库。若存储元数据的数据库为MySQL，Hadoop集群则根据加工后的数据更新MySQL。After the Hadoop cluster obtains the processed data, the Hadoop cluster updates the data stored in HDFS according to the processed data, and updates the metadata database storing the metadata according to the processed data, to obtain the updated metadata database. If the database storing the metadata is MySQL, the Hadoop cluster updates MySQL based on the processed data.
步骤S60,通过所述数据治理系统在所述更新后的元数据库中获取更新后的元数据,并获取所述加工后的数据和所述加工后的数据所在数据库表的表名称。Step S60: Obtain updated metadata from the updated metadata database through the data management system, and obtain the processed data and the table name of the database table where the processed data is located.
当数据治理系统监听到元数据库更新后，数据治理系统在更新后的元数据库中获取更新后的元数据，以及在Hadoop集群的HDFS中获取加工后的数据，并获取加工后的数据所在数据库表的表名称。需要说明的是，在Hadoop集群中，每一数据都会存储在某个数据库表中，每一个数据库表都有一个表名称，该表名称可唯一标识某一个数据库表。When the data management system detects that the metadata database has been updated, the data management system obtains the updated metadata from the updated metadata database, obtains the processed data from the HDFS of the Hadoop cluster, and obtains the table name of the database table where the processed data is located. It should be noted that in a Hadoop cluster every piece of data is stored in some database table, and each database table has a table name that uniquely identifies that table.
进一步地,所述通过所述数据治理系统在所述更新后的元数据库中获取更新后的元数据的步骤包括:Further, the step of obtaining updated metadata in the updated metadata database through the data management system includes:
步骤d,通过所述数据治理系统中预设的监听程序获取所述更新后的元数据库的监听日志。Step d: Obtain the monitoring log of the updated metadata database through a preset monitoring program in the data management system.
步骤e,解析所述监听日志,获取所述监听日志中的目标关键字。Step e: Parse the monitoring log to obtain the target keyword in the monitoring log.
步骤f,根据所述目标关键字获取所述更新后的元数据库中更新后的元数据。Step f: Obtain updated metadata in the updated metadata database according to the target keyword.
进一步地，数据治理系统在元数据库中部署了Binlog监听程序，具体地，数据治理系统采用CANAL的框架将Binlog监听程序部署在元数据库中，数据治理系统采用Binlog监听程序获取更新后的元数据库的监听日志，该监听日志为Binlog日志，解析该监听日志，获取该监听日志中的目标关键字，根据该目标关键字在更新后的元数据库中获取更新后的元数据。其中，目标关键字为update、add和delete等关键字。在本实施例中，目标关键字和类型关键字可以相同，也可以不相同。Further, the data management system deploys a Binlog listener in the metadata database; specifically, the data management system uses the CANAL framework to deploy the Binlog listener in the metadata database. The data management system uses the Binlog listener to obtain the monitoring log of the updated metadata database; this monitoring log is a Binlog log. The data management system parses the monitoring log, obtains the target keyword in the monitoring log, and obtains the updated metadata from the updated metadata database according to the target keyword. The target keywords are keywords such as update, add, and delete. In this embodiment, the target keyword and the type keyword may be the same or different.
CANAL的工作原理为：①模拟MySQL slave(从MySQL)的交互协议，伪装自己为MySQL slave，向MySQL master(主MySQL)发送dump协议；MySQL master收到dump请求，开始推送Binary log(Binlog，二进制日志)给slave(也就是CANAL)，CANAL解析Binary log对象(原始为byte流)；②采用开源的open-replicator来解析Binary log，其中，Open Replicator是一个用Java编写的MySQL binlog分析程序；③CANAL需要维护Event Store(存储)，可以存取在Memory、File、Zookeeper；④CANAL需要维护客户端的状态，同一时刻一个instance(进程)只能有一个消费端消费。The working principle of CANAL is as follows: ① It simulates the interaction protocol of a MySQL slave, pretends to be a MySQL slave, and sends the dump protocol to the MySQL master; the MySQL master receives the dump request and starts to push the Binary log (Binlog, binary log) to the slave (that is, CANAL), which parses the Binary log object (originally a byte stream). ② The open-source open-replicator is used to parse the Binary log; Open Replicator is a MySQL binlog analysis program written in Java. ③ CANAL maintains an Event Store, which can be kept in Memory, File, or Zookeeper. ④ CANAL maintains the state of its clients; at any moment an instance (process) can have only one consumer.
进一步地，在监听日志中，也可以解析到更新后的元数据对应的数据库表的表名称，更新后的元数据对应的表名称为更新后的元数据对应加工后的数据的表名称，即数据治理系统通过监听日志可获取到更新后的元数据，以及确定加工后的数据所在数据库表的表名称。Further, the table name of the database table corresponding to the updated metadata can also be parsed from the monitoring log, and this table name is the table name of the database table where the processed data corresponding to the updated metadata is located. In other words, through the monitoring log the data management system can obtain the updated metadata and determine the table name of the database table where the processed data is located.
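Steps d to f above can be sketched as follows. Canal actually delivers structured row-change events; the single-line log format `KEYWORD table=<name> ...` below is a hypothetical simplification used only to show the keyword filtering and table-name extraction:

```python
import re

TARGET_KEYWORDS = {"update", "add", "delete"}

def parse_monitor_log(log_lines):
    """Return (keyword, table_name) pairs for lines whose keyword is a target keyword."""
    hits = []
    for line in log_lines:
        match = re.match(r"(\w+)\s+table=(\S+)", line)
        if match and match.group(1).lower() in TARGET_KEYWORDS:
            hits.append((match.group(1).lower(), match.group(2)))
    return hits

log = ["UPDATE table=t_user cols=balance", "SELECT table=t_user"]
print(parse_monitor_log(log))  # [('update', 't_user')]
```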
步骤S70,根据所述更新后的元数据和所述加工后的数据更新所述图形数据库中所述表名称对应的数据库表,并将更新后的数据库表确定为上游数据库表。Step S70: Update the database table corresponding to the table name in the graph database according to the updated metadata and the processed data, and determine the updated database table as an upstream database table.
可以理解的是，每一个元数据都存在对应数据库表，因此，每一个元数据都存在对应的表名称。当数据治理系统获取到加工后的数据、表名称和更新后的元数据后，数据治理系统根据更新后的元数据和加工后的数据对应更新图形数据库中表名称对应的数据库表，并将更新后的数据库表确定为上游数据库表。It is understandable that each piece of metadata has a corresponding database table, and therefore each piece of metadata has a corresponding table name. After the data management system obtains the processed data, the table name, and the updated metadata, the data management system updates the database table corresponding to that table name in the graph database according to the updated metadata and the processed data, and determines the updated database table as the upstream database table.
需要说明的是,步骤S40、步骤S50、步骤S60、步骤S70可以在步骤S20和步骤S30之前执行,也可在步骤S20和步骤S30之后执行,或者在步骤S20和步骤S30之间执行。It should be noted that step S40, step S50, step S60, and step S70 may be executed before step S20 and step S30, or may be executed after step S20 and step S30, or executed between step S20 and step S30.
步骤S80,根据所述血缘关系确定所述上游数据库表对应的下游数据库表。Step S80: Determine the downstream database table corresponding to the upstream database table according to the blood relationship.
步骤S90,根据所述更新后的元数据和所述加工后的数据更新所述下游数据库表。Step S90: Update the downstream database table according to the updated metadata and the processed data.
当数据治理系统确定图形数据库中的上游数据库表后，数据治理系统根据图形数据库中的血缘关系确定上游数据库表对应的下游数据库表，并根据更新后的元数据和加工后的数据更新下游数据库表。需要说明的是，因为存在血缘关系的数据库表中的数据是存在依赖关系的，因此，当上游数据库表中的某个数据发生变化后，与上游数据库表存在血缘关系的下游数据库表会受到影响，为了保持上游数据库表和下游数据库表中数据的一致性，所以需要根据更新后的元数据和加工后的数据更新下游数据库表。After the data management system determines the upstream database table in the graph database, the data management system determines the downstream database table corresponding to the upstream database table according to the blood relationship in the graph database, and updates the downstream database table according to the updated metadata and the processed data. It should be noted that the data in database tables linked by a blood relationship are interdependent; when some data in the upstream database table changes, the downstream database tables that have a blood relationship with the upstream database table are affected. To keep the data in the upstream database table and the downstream database tables consistent, the downstream database tables therefore need to be updated according to the updated metadata and the processed data.
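Whether the propagation stops at direct children or continues transitively is not spelled out here; the sketch below assumes transitive propagation, collecting every table reachable from the upstream table through the blood-relationship edges (all table names are hypothetical):

```python
from collections import deque

def downstream_tables(lineage, upstream):
    """All tables that depend, directly or transitively, on `upstream`.
    `lineage` maps a table name to the set of its direct downstream tables."""
    seen, queue = set(), deque([upstream])
    while queue:
        table = queue.popleft()
        for child in lineage.get(table, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

lineage = {"t_raw": {"t_daily"}, "t_daily": {"t_report"}}
print(downstream_tables(lineage, "t_raw"))  # {'t_daily', 't_report'}
```

The breadth-first traversal also terminates on cyclic lineage because each table is visited at most once.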
本实施例通过根据加工后的数据、加工后的数据对应表名称和更新后的元数据更新图形数据库中的数据库表，得到更新后的数据库表，并将更新后的数据库表确定为上游数据库表，根据加工后的数据和更新后的元数据更新与上游数据库表存在血缘关系的下游数据库表，实时、准确地保持了上游数据库表和下游数据库表中的数据一致性。In this embodiment, the database table in the graph database is updated according to the processed data, the table name corresponding to the processed data, and the updated metadata; the updated database table is determined as the upstream database table; and the downstream database tables that have a blood relationship with the upstream database table are updated according to the processed data and the updated metadata. This keeps the data in the upstream database table and the downstream database tables consistent accurately and in real time.
进一步地,提出本申请基于Hadoop的数据更新方法第三实施例。Further, a third embodiment of the Hadoop-based data update method of this application is proposed.
所述基于Hadoop的数据更新方法第三实施例与所述基于Hadoop的数据更新方法第二实施例的区别在于,基于Hadoop的数据更新方法还包括:The difference between the third embodiment of the Hadoop-based data update method and the second embodiment of the Hadoop-based data update method is that the Hadoop-based data update method further includes:
步骤g,发送提示信息给所述下游数据库表对应的客户端,以供所述客户端根据所述提示信息提示用户,所述下游数据库表对应的上游数据库表已更新。Step g: Send prompt information to the client corresponding to the downstream database table, so that the client prompts the user according to the prompt information that the upstream database table corresponding to the downstream database table has been updated.
步骤h,若接收到所述下游数据库表对应客户端发送的更新指令,则根据所述更新后的元数据和所述加工后的数据更新所述下游数据库表。Step h: If an update instruction sent by the client corresponding to the downstream database table is received, the downstream database table is updated according to the updated metadata and the processed data.
当数据治理系统确定下游数据库表后，数据治理系统生成提示信息，并将该提示信息发送给下游数据库表对应的客户端中，并检测是否接收到下游数据库对应客户端发送的更新指令。当下游数据库表对应的客户端接收到提示信息后，输出该提示信息，以根据该提示信息提示下游数据库表对应的用户，该下游数据库表对应的上游数据库表已更新，在本实施例中，不限制提示信息的输出方式。此时，下游数据库表对应的用户可在该客户端的显示界面中触发更新指令。当下游数据库表对应的客户端侦测到更新指令后，将该更新指令发送给数据治理系统。当数据治理系统接收到下游数据库表对应客户端发送的更新指令后，数据治理系统根据更新后的元数据和加工后的数据更新下游数据库表。After the data management system determines the downstream database table, the data management system generates prompt information, sends the prompt information to the client corresponding to the downstream database table, and detects whether an update instruction sent by that client is received. After the client corresponding to the downstream database table receives the prompt information, it outputs the prompt information to remind the corresponding user that the upstream database table corresponding to the downstream database table has been updated; this embodiment does not restrict how the prompt information is output. The user corresponding to the downstream database table can then trigger an update instruction on the display interface of the client. When the client corresponding to the downstream database table detects the update instruction, it sends the update instruction to the data management system. After the data management system receives the update instruction sent by the client corresponding to the downstream database table, the data management system updates the downstream database table according to the updated metadata and the processed data.
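A hedged sketch of this confirm-before-update flow; the three callables stand in for the messaging channel between the data management system and the clients, and all names are hypothetical:

```python
def notify_and_update(downstream, send_prompt, receive_instruction, apply_update):
    """Prompt each downstream table's client, then update only the tables whose
    client answers with an update instruction. Returns the tables that were updated."""
    updated = []
    for table in downstream:
        send_prompt(table, "upstream database table has been updated")
        if receive_instruction(table) == "update":
            apply_update(table)
            updated.append(table)
    return updated

# toy wiring: the client of t_report agrees to update, the client of t_audit does not
result = notify_and_update(
    ["t_report", "t_audit"],
    send_prompt=lambda table, message: None,
    receive_instruction=lambda table: "update" if table == "t_report" else "skip",
    apply_update=lambda table: None,
)
print(result)  # ['t_report']
```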
本实施例通过在上游数据库表更新后，发送提示信息给下游数据库表对应客户端，让下游数据库表对应客户端的用户自己决定是否更新下游数据库表，并在接收到下游数据库表用户触发的更新指令后，更新下游数据表，实现了在上游数据库表更新后，让下游数据库表对应的用户自主决定是否更新下游数据库表。In this embodiment, after the upstream database table is updated, prompt information is sent to the client corresponding to the downstream database table, so that the user of that client can decide whether to update the downstream database table; after the update instruction triggered by the user is received, the downstream database table is updated. In this way, after the upstream database table is updated, the user corresponding to the downstream database table can independently decide whether to update the downstream database table.
进一步地,提出本申请基于Hadoop的数据更新方法第四实施例。Further, a fourth embodiment of the Hadoop-based data update method of this application is proposed.
所述基于Hadoop的数据更新方法第四实施例与所述基于Hadoop的数据更新方法第一、第二或第三实施例的区别在于，若所述任务语句为数据查询语句，则基于Hadoop的数据更新方法还包括：The difference between the fourth embodiment of the Hadoop-based data update method and the first, second, or third embodiment of the Hadoop-based data update method is that, if the task statement is a data query statement, the Hadoop-based data update method further includes:
步骤i,在所述Hadoop集群中获取所述数据查询语句对应的目标数据。Step i: Obtain target data corresponding to the data query sentence in the Hadoop cluster.
步骤j,将所述目标数据发送给所述跑批任务对应的客户端。Step j: Send the target data to the client corresponding to the batch running task.
若确定任务语句为数据查询语句，Hadoop集群则解析数据查询语句，得到数据查询语句对应的数据库表的目标表名称，并根据目标表名称在HDFS中获取数据查询语句对应的目标数据，并将所获取的目标数据发送给跑批任务对应的客户端。If it is determined that the task statement is a data query statement, the Hadoop cluster parses the data query statement to obtain the target table name of the database table corresponding to the data query statement, obtains the target data corresponding to the data query statement from HDFS according to the target table name, and sends the obtained target data to the client corresponding to the batch running task.
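A toy sketch of this query path; a real implementation would rely on the Hive parser and read from HDFS, while here a dictionary stands in for the storage and the FROM-clause regex is a deliberate simplification (all names hypothetical):

```python
import re

def target_table(query: str):
    """Pull the table name that follows FROM out of a simple query statement."""
    match = re.search(r"\bfrom\s+(\w+)", query, re.IGNORECASE)
    return match.group(1) if match else None

def run_query(query: str, hdfs_store: dict):
    """`hdfs_store` maps a table name to its rows, standing in for HDFS."""
    return hdfs_store.get(target_table(query), [])

store = {"t_user": [("alice", 100), ("bob", 25)]}
print(run_query("SELECT name FROM t_user", store))  # [('alice', 100), ('bob', 25)]
```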
本实施例通过在Hadoop集群中获取数据查询语句对应的目标数据，将目标数据发送给跑批任务对应的客户端，不需要等待任务调度平台转发客户端的数据查询请求，提高了查询Hadoop集群中数据的查询效率。In this embodiment, the target data corresponding to the data query statement is obtained in the Hadoop cluster and sent to the client corresponding to the batch running task, without waiting for a task scheduling platform to forward the client's data query request, which improves the efficiency of querying data in the Hadoop cluster.
此外,参照图3,本申请还提供一种基于Hadoop的数据更新装置,所述基于Hadoop的数据更新装置包括:In addition, referring to Figure 3, the present application also provides a Hadoop-based data update device, the Hadoop-based data update device includes:
编译模块10,用于当检测到Hadoop集群接收到客户端发送的跑批任务后,在所述Hadoop集群中对所述跑批任务进行编译,得到所述跑批任务对应的任务语句;The compiling module 10 is configured to, after detecting that the Hadoop cluster receives the batch running task sent by the client, compile the batch running task in the Hadoop cluster to obtain the task statement corresponding to the batch running task;
解析模块20,用于在数据治理系统中对所述任务语句进行解析,得到所述任务语句对应各个数据库表的逻辑关系;The parsing module 20 is used to analyze the task sentence in the data management system to obtain the logical relationship of the task sentence corresponding to each database table;
更新模块30,用于根据所述逻辑关系更新预设图形数据库中对应数据库表的血缘关系。The update module 30 is configured to update the blood relationship of the corresponding database table in the preset graph database according to the logical relationship.
进一步地,若所述任务语句为数据更新语句,则所述基于Hadoop的数据更新装置还包括:Further, if the task sentence is a data update sentence, the Hadoop-based data update device further includes:
加工模块,用于在所述Hadoop集群中对所述跑批任务对应数据进行加工,得到加工后的数据;The processing module is used to process the data corresponding to the batch running task in the Hadoop cluster to obtain processed data;
所述更新模块30还用于根据所述加工后的数据更新元数据库，得到所述更新后的元数据库；根据所述更新后的元数据和所述加工后的数据更新所述图形数据库中所述表名称对应的数据库表；根据所述更新后的元数据和所述加工后的数据更新所述下游数据库表；The update module 30 is further configured to update the metadata database according to the processed data to obtain the updated metadata database; update the database table corresponding to the table name in the graph database according to the updated metadata and the processed data; and update the downstream database table according to the updated metadata and the processed data;
所述基于Hadoop的数据更新装置还包括:The Hadoop-based data update device further includes:
第一获取模块，用于通过所述数据治理系统在所述更新后的元数据库中获取更新后的元数据，并获取所述加工后的数据和所述加工后的数据所在数据库表的表名称；The first obtaining module is configured to obtain updated metadata in the updated metadata database through the data management system, and obtain the processed data and the table name of the database table where the processed data is located;
确定模块,用于将更新后的数据库表确定为上游数据库表;根据所述血缘关系确定所述上游数据库表对应的下游数据库表。The determining module is configured to determine the updated database table as an upstream database table; determine the downstream database table corresponding to the upstream database table according to the blood relationship.
进一步地,所述基于Hadoop的数据更新装置还包括:Further, the Hadoop-based data update device further includes:
第一发送模块,用于发送提示信息给所述下游数据库表对应的客户端,以供所述客户端根据所述提示信息提示用户,所述下游数据库表对应的上游数据库表已更新;The first sending module is configured to send prompt information to the client corresponding to the downstream database table, so that the client prompts the user according to the prompt information that the upstream database table corresponding to the downstream database table has been updated;
所述更新模块30还用于若接收到所述下游数据库表对应客户端发送的更新指令,则根据所述更新后的元数据和所述加工后的数据更新所述下游数据库表。The update module 30 is further configured to update the downstream database table according to the updated metadata and the processed data if an update instruction sent by the client corresponding to the downstream database table is received.
进一步地,所述第一获取模块包括:Further, the first obtaining module includes:
获取单元,用于通过所述数据治理系统中预设的监听程序获取所述更新后的元数据库的监听日志;The obtaining unit is configured to obtain the monitoring log of the updated metadata database through a preset monitoring program in the data management system;
第一解析单元,用于解析所述监听日志;The first parsing unit is used for parsing the monitoring log;
所述获取单元还用于获取所述监听日志中的目标关键字;根据所述目标关键字获取所述更新后的元数据库中更新后的元数据。The obtaining unit is further configured to obtain a target keyword in the monitoring log; and obtain updated metadata in the updated metadata database according to the target keyword.
进一步地,若所述任务语句为数据查询语句,所述基于Hadoop的数据更新装置还包括:Further, if the task sentence is a data query sentence, the Hadoop-based data update device further includes:
第二获取模块,用于在所述Hadoop集群中获取所述数据查询语句对应的目标数据;The second obtaining module is configured to obtain target data corresponding to the data query statement in the Hadoop cluster;
第二发送模块,用于将所述目标数据发送给所述跑批任务对应的客户端。The second sending module is configured to send the target data to the client corresponding to the batch running task.
进一步地,所述解析模块20还包括:Further, the analysis module 20 further includes:
第二解析单元,用于在数据治理系统中对所述任务语句进行解析,得到所述任务语句对应的数据库表;The second parsing unit is used to parse the task sentence in the data management system to obtain the database table corresponding to the task sentence;
确定单元,用于确定所述任务语句对应数据库表中的源表和目标表,根据所述源表和所述目标表确定所述任务语句对应各个数据库表的逻辑关系。The determining unit is configured to determine the source table and the target table in the database table corresponding to the task sentence, and determine the logical relationship of each database table corresponding to the task sentence according to the source table and the target table.
进一步地，所述编译模块10还用于当检测到Hadoop集群接收到客户端发送的跑批任务后，调用所述Hadoop集群的hive编译器对所述跑批任务进行编译，得到所述跑批任务对应的HQL语句。Further, the compiling module 10 is further configured to, after detecting that the Hadoop cluster receives the batch running task sent by the client, invoke the hive compiler of the Hadoop cluster to compile the batch running task to obtain the HQL statement corresponding to the batch running task.
需要说明的是,基于Hadoop的数据更新装置的各个实施例与上述基于Hadoop的数据更新方法的各实施例基本相同,在此不再详细赘述。It should be noted that the various embodiments of the Hadoop-based data update device are basically the same as the foregoing embodiments of the Hadoop-based data update method, and will not be described in detail here.
此外,本申请还提供一种基于Hadoop的数据更新系统。如图4所示,图4是本申请实施例方案涉及的硬件运行环境的结构示意图。In addition, this application also provides a data update system based on Hadoop. As shown in FIG. 4, FIG. 4 is a schematic structural diagram of the hardware operating environment involved in the solution of the embodiment of the present application.
需要说明的是,图4即可为基于Hadoop的数据更新系统的硬件运行环境的结构示意图。本申请实施例基于Hadoop的数据更新系统可以是PC,便携计算机等终端设备。It should be noted that Fig. 4 can be a structural diagram of the hardware operating environment of the Hadoop-based data update system. The Hadoop-based data update system in the embodiment of the present application may be a terminal device such as a PC and a portable computer.
如图4所示,该基于Hadoop的数据更新系统可以包括:处理器1001,例如CPU,存储器1005,用户接口1003,网络接口1004,通信总线1002。其中,通信总线1002用于实现这些组件之间的连接通信。用户接口1003可以包括显示屏(Display)、输入单元比如键盘(Keyboard),可选用户接口1003还可以包括标准的有线接口、无线接口。网络接口1004可选的可以包括标准的有线接口、无线接口(如WI-FI接口)。存储器1005可以是高速RAM存储器,也可以是稳定的存储器(non-volatile memory),例如磁盘存储器。存储器1005可选的还可以是独立于前述处理器1001的存储装置。As shown in FIG. 4, the Hadoop-based data update system may include: a processor 1001, such as a CPU, a memory 1005, a user interface 1003, a network interface 1004, and a communication bus 1002. Among them, the communication bus 1002 is used to implement connection and communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and the optional user interface 1003 may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface). The memory 1005 may be a high-speed RAM memory, or a non-volatile memory (non-volatile memory), such as a magnetic disk memory. Optionally, the memory 1005 may also be a storage device independent of the foregoing processor 1001.
可选地,基于Hadoop的数据更新系统还可以包括摄像头、RF(Radio Frequency,射频)电路,传感器、音频电路、WiFi模块等等。Optionally, the Hadoop-based data update system may also include a camera, an RF (Radio Frequency) circuit, a sensor, an audio circuit, a WiFi module, and so on.
本领域技术人员可以理解，图4中示出的基于Hadoop的数据更新系统结构并不构成对基于Hadoop的数据更新系统的限定，可以包括比图示更多或更少的部件，或者组合某些部件，或者不同的部件布置。Those skilled in the art can understand that the structure of the Hadoop-based data update system shown in Figure 4 does not constitute a limitation on the Hadoop-based data update system, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components.
如图4所示,作为一种计算机存储介质的存储器1005中可以包括操作系统、网络通信模块、用户接口模块以及基于Hadoop的数据更新程序。其中,操作系统是管理和控制基于Hadoop的数据更新系统硬件和软件资源的程序,支持基于Hadoop的数据更新程序以及其它软件或程序的运行。As shown in FIG. 4, the memory 1005 as a computer storage medium may include an operating system, a network communication module, a user interface module, and a data update program based on Hadoop. Among them, the operating system is a program that manages and controls the hardware and software resources of the Hadoop-based data update system, and supports the operation of the Hadoop-based data update program and other software or programs.
在图4所示的基于Hadoop的数据更新系统中，用户接口1003主要用于连接客户端，与客户端进行数据通信；网络接口1004主要用于连接后台服务器，与其他服务器进行数据通信；处理器1001可以用于调用存储器1005中存储的基于Hadoop的数据更新程序，并执行如上所述的基于Hadoop的数据更新方法的步骤。In the Hadoop-based data update system shown in Figure 4, the user interface 1003 is mainly used to connect to a client and perform data communication with the client; the network interface 1004 is mainly used to connect to a background server and perform data communication with other servers; the processor 1001 can be used to call the Hadoop-based data update program stored in the memory 1005 and execute the steps of the Hadoop-based data update method described above.
本申请基于Hadoop的数据更新系统具体实施方式与上述基于Hadoop的数据更新方法各实施例基本相同,在此不再赘述。The specific implementation of the Hadoop-based data update system of the present application is basically the same as the foregoing embodiments of the Hadoop-based data update method, and will not be repeated here.
此外，本申请实施例还提出一种计算机可读存储介质，所述计算机可读存储介质上存储有基于Hadoop的数据更新程序，所述基于Hadoop的数据更新程序被处理器执行时实现如上所述的基于Hadoop的数据更新方法的步骤。In addition, an embodiment of the present application further proposes a computer-readable storage medium on which a Hadoop-based data update program is stored; when the Hadoop-based data update program is executed by a processor, the steps of the Hadoop-based data update method described above are implemented.
本申请计算机可读存储介质具体实施方式与上述基于Hadoop的数据更新方法各实施例基本相同,在此不再赘述。The specific implementation of the computer-readable storage medium of the present application is basically the same as the foregoing embodiments of the Hadoop-based data update method, and will not be repeated here.
需要说明的是，在本文中，术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含，从而使得包括一系列要素的过程、方法、物品或者装置不仅包括那些要素，而且还包括没有明确列出的其他要素，或者是还包括为这种过程、方法、物品或者装置所固有的要素。在没有更多限制的情况下，由语句“包括一个......”限定的要素，并不排除在包括该要素的过程、方法、物品或者装置中还存在另外的相同要素。It should be noted that, as used herein, the terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a..." does not exclude the existence of other identical elements in the process, method, article, or device that includes the element.
上述本申请实施例序号仅仅为了描述,不代表实施例的优劣。The serial numbers of the foregoing embodiments of the present application are only for description, and do not represent the advantages and disadvantages of the embodiments.
通过以上的实施方式的描述，本领域的技术人员可以清楚地了解到上述实施例方法可借助软件加必需的通用硬件平台的方式来实现，当然也可以通过硬件，但很多情况下前者是更佳的实施方式。基于这样的理解，本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来，该计算机软件产品存储在一个存储介质(如ROM/RAM、磁碟、光盘)中，包括若干指令用以使得一台终端设备(可以是手机，计算机，服务器，空调器，或者网络设备等)执行本申请各个实施例所述的方法。Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, or of course by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions to enable a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the method described in each embodiment of this application.
以上仅为本申请的优选实施例，并非因此限制本申请的专利范围，凡是利用本申请说明书及附图内容所作的等效结构或等效流程变换，或直接或间接运用在其他相关的技术领域，均同理包括在本申请的专利保护范围内。The above are only preferred embodiments of this application and do not limit the patent scope of this application. Any equivalent structure or equivalent process transformation made using the content of the description and drawings of this application, whether used directly or indirectly in other related technical fields, is likewise included in the patent protection scope of this application.

Claims (13)

  1. 一种基于Hadoop的数据更新方法,其中,所述基于Hadoop的数据更新方法包括以下步骤:A Hadoop-based data update method, wherein the Hadoop-based data update method includes the following steps:
    当检测到Hadoop集群接收到客户端发送的跑批任务后,在所述Hadoop集群中对所述跑批任务进行编译,得到所述跑批任务对应的任务语句;After detecting that the Hadoop cluster receives the batch running task sent by the client, compiling the batch running task in the Hadoop cluster to obtain the task statement corresponding to the batch running task;
    在数据治理系统中对所述任务语句进行解析,得到所述任务语句对应各个数据库表的逻辑关系;Analyze the task statement in the data management system to obtain the logical relationship between the task statement and each database table;
    根据所述逻辑关系更新预设图形数据库中对应数据库表的血缘关系。The blood relationship of the corresponding database table in the preset graph database is updated according to the logical relationship.
  2. 如权利要求1所述的基于Hadoop的数据更新方法，其中，若所述任务语句为数据更新语句，则所述当检测到Hadoop集群接收到客户端发送的跑批任务后，在所述Hadoop集群中对所述跑批任务进行编译，得到所述跑批任务对应的任务语句的步骤之后，还包括：The Hadoop-based data update method according to claim 1, wherein, if the task statement is a data update statement, after the step of, upon detecting that the Hadoop cluster has received the batch running task sent by the client, compiling the batch running task in the Hadoop cluster to obtain the task statement corresponding to the batch running task, the method further includes:
    在所述Hadoop集群中对所述跑批任务对应数据进行加工,得到加工后的数据;Processing the data corresponding to the batch running task in the Hadoop cluster to obtain processed data;
    根据所述加工后的数据更新元数据库,得到所述更新后的元数据库;Update the metadata database according to the processed data to obtain the updated metadata database;
    通过所述数据治理系统在所述更新后的元数据库中获取更新后的元数据,并获取所述加工后的数据和所述加工后的数据所在数据库表的表名称;Obtaining updated metadata in the updated metadata database through the data management system, and obtaining the processed data and the table name of the database table where the processed data is located;
    根据所述更新后的元数据和所述加工后的数据更新所述图形数据库中所述表名称对应的数据库表,并将更新后的数据库表确定为上游数据库表;Update the database table corresponding to the table name in the graph database according to the updated metadata and the processed data, and determine the updated database table as an upstream database table;
    在所述根据所述逻辑关系更新预设图形数据库中对应数据库表的血缘关系的步骤之后,根据所述血缘关系确定所述上游数据库表对应的下游数据库表;After the step of updating the blood relationship of the corresponding database table in the preset graph database according to the logical relationship, determine the downstream database table corresponding to the upstream database table according to the blood relationship;
    根据所述更新后的元数据和所述加工后的数据更新所述下游数据 库表。The downstream database table is updated according to the updated metadata and the processed data.
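As a sketch of the upstream/downstream propagation described in claim 2, the walk below finds every table that transitively depends on an updated upstream table, given lineage edges stored as a mapping from each table to the set of its source tables. The dict is again only a stand-in for the graph database, and the table names are made up.

```python
def downstream_tables(lineage, upstream):
    """Breadth-first walk over lineage edges (table -> set of its source
    tables), collecting every table that transitively reads from `upstream`."""
    found, queue = set(), [upstream]
    while queue:
        current = queue.pop()
        for table, sources in lineage.items():
            if current in sources and table not in found:
                found.add(table)
                queue.append(table)
    return found

# t_report reads t_user and t_order; t_summary reads t_report.
lineage = {"t_report": {"t_user", "t_order"}, "t_summary": {"t_report"}}
deps = downstream_tables(lineage, "t_user")  # {'t_report', 't_summary'}
```

After an update to `t_user`, both `t_report` and `t_summary` would then be refreshed (or, per claim 3, their owners notified) in dependency order.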
  3. The Hadoop-based data update method according to claim 2, wherein, after the step of determining the downstream database table corresponding to the upstream database table according to the lineage relationships, the method further comprises:
    sending notification information to the client corresponding to the downstream database table, so that the client informs the user, according to the notification information, that the upstream database table corresponding to the downstream database table has been updated;
    if an update instruction sent by the client corresponding to the downstream database table is received, updating the downstream database table according to the updated metadata and the processed data.
  4. The Hadoop-based data update method according to claim 2, wherein the step of obtaining, through the data governance system, updated metadata from the updated metadata database comprises:
    obtaining the listener log of the updated metadata database through a listener preset in the data governance system;
    parsing the listener log to obtain target keywords in the listener log;
    obtaining the updated metadata from the updated metadata database according to the target keywords.
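A minimal illustration of the listener flow in claim 4: scan listener-log lines for target keywords and pull out the tables whose metadata changed. The log format and keyword names here are invented for the example; they are not actual Hive metastore event names.

```python
import re

SAMPLE_LOG = [
    "2020-05-11 10:02:01 INFO metastore: ALTER_TABLE db1.t_report",
    "2020-05-11 10:02:02 INFO metastore: GET_TABLE db1.t_user",
    "2020-05-11 10:02:03 INFO metastore: CREATE_TABLE db1.t_summary",
]

def changed_tables(log_lines, keywords=("ALTER_TABLE", "CREATE_TABLE", "DROP_TABLE")):
    """Return the tables named on log lines containing a target keyword,
    i.e. the tables whose metadata should be re-read from the metastore."""
    pattern = re.compile(r"(?:%s)\s+(\S+)" % "|".join(keywords))
    hits = []
    for line in log_lines:
        match = pattern.search(line)
        if match:
            hits.append(match.group(1))
    return hits

changed = changed_tables(SAMPLE_LOG)  # ['db1.t_report', 'db1.t_summary']
```

Read-only events (`GET_TABLE` in this toy log) carry no metadata change and are skipped; only the keyword hits trigger a metadata fetch.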
  5. The Hadoop-based data update method according to claim 1, wherein, if the task statement is a data query statement, after the step of compiling the batch task in the Hadoop cluster upon detecting that the Hadoop cluster has received the batch task sent by the client to obtain the task statement corresponding to the batch task, the method further comprises:
    obtaining, in the Hadoop cluster, the target data corresponding to the data query statement;
    sending the target data to the client corresponding to the batch task.
  6. The Hadoop-based data update method according to claim 1, wherein the step of parsing the task statement in the data governance system to obtain the logical relationships between the task statement and the database tables it involves comprises:
    parsing the task statement in the data governance system to obtain the database tables corresponding to the task statement;
    determining the source tables and the target tables among the database tables corresponding to the task statement, and determining the logical relationships between the task statement and the database tables according to the source tables and the target tables.
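The source/target split in claim 6 can be sketched as follows: tables the statement writes become targets, tables it reads become sources, and each source/target pair yields one logical-relationship triple. This is a toy regex parser under the same assumptions as before; a real implementation would walk the compiler's parse tree rather than pattern-match text.

```python
import re

def logical_relationships(hql):
    """Classify tables into targets (written) and sources (read), then emit
    one (source, 'feeds', target) triple per source/target pair."""
    targets = re.findall(r"INSERT\s+(?:OVERWRITE|INTO)\s+TABLE\s+(\w+)", hql, re.IGNORECASE)
    sources = re.findall(r"(?:FROM|JOIN)\s+(\w+)", hql, re.IGNORECASE)
    return [(s, "feeds", t) for t in targets for s in sorted(set(sources))]

edges = logical_relationships(
    "INSERT INTO TABLE t_target SELECT * FROM t_src1 JOIN t_src2 ON t_src1.k = t_src2.k"
)
# [('t_src1', 'feeds', 't_target'), ('t_src2', 'feeds', 't_target')]
```

Each emitted triple maps directly onto one directed edge in the graph database when the lineage relationships are updated.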
  7. The Hadoop-based data update method according to any one of claims 1 to 6, wherein the step of, after detecting that the Hadoop cluster has received the batch task sent by the client, compiling the batch task in the Hadoop cluster to obtain the task statement corresponding to the batch task comprises:
    after detecting that the Hadoop cluster has received the batch task sent by the client, invoking the Hive compiler of the Hadoop cluster to compile the batch task, thereby obtaining the HQL statement corresponding to the batch task.
  8. A Hadoop-based data update apparatus, wherein the Hadoop-based data update apparatus comprises:
    a compiling module, configured to compile the batch task in the Hadoop cluster after detecting that the Hadoop cluster has received the batch task sent by a client, so as to obtain the task statement corresponding to the batch task;
    a parsing module, configured to parse the task statement in a data governance system to obtain the logical relationships between the task statement and the database tables it involves;
    an updating module, configured to update the lineage relationships of the corresponding database tables in a preset graph database according to the logical relationships.
  9. The Hadoop-based data update apparatus according to claim 8, wherein, if the task statement is a data update statement, the Hadoop-based data update apparatus further comprises:
    a processing module, configured to process the data corresponding to the batch task in the Hadoop cluster to obtain processed data;
    wherein the updating module is further configured to update a metadata database according to the processed data to obtain an updated metadata database, to update the database table corresponding to the table name in the graph database according to the updated metadata and the processed data, and to update the downstream database table according to the updated metadata and the processed data;
    the Hadoop-based data update apparatus further comprising:
    a first obtaining module, configured to obtain, through the data governance system, updated metadata from the updated metadata database, and to obtain the processed data and the table name of the database table in which the processed data resides;
    a determining module, configured to determine the updated database table as an upstream database table, and to determine the downstream database table corresponding to the upstream database table according to the lineage relationships.
  10. The Hadoop-based data update apparatus according to claim 9, wherein the Hadoop-based data update apparatus further comprises:
    a first sending module, configured to send notification information to the client corresponding to the downstream database table, so that the client informs the user, according to the notification information, that the upstream database table corresponding to the downstream database table has been updated;
    wherein the updating module is further configured to update the downstream database table according to the updated metadata and the processed data if an update instruction sent by the client corresponding to the downstream database table is received.
  11. The Hadoop-based data update apparatus according to claim 9, wherein the first obtaining module comprises:
    an obtaining unit, configured to obtain the listener log of the updated metadata database through a listener preset in the data governance system;
    a first parsing unit, configured to parse the listener log;
    wherein the obtaining unit is further configured to obtain the target keywords in the listener log, and to obtain the updated metadata from the updated metadata database according to the target keywords.
  12. A Hadoop-based data update system, wherein the Hadoop-based data update system comprises a memory, a processor, and a Hadoop-based data update program stored on the memory and executable on the processor, wherein the Hadoop-based data update program, when executed by the processor, implements the steps of the Hadoop-based data update method according to any one of claims 1 to 7.
  13. A computer-readable storage medium, wherein a Hadoop-based data update program is stored on the computer-readable storage medium, and the Hadoop-based data update program, when executed by a processor, implements the steps of the Hadoop-based data update method according to any one of claims 1 to 7.
PCT/CN2020/089637 2019-05-27 2020-05-11 Hadoop-based data updating method, device, system and medium WO2020238597A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910448948.6A CN110196888A (en) 2019-05-27 2019-05-27 Data-updating method, device, system and medium based on Hadoop
CN201910448948.6 2019-05-27

Publications (1)

Publication Number Publication Date
WO2020238597A1 true WO2020238597A1 (en) 2020-12-03

Family

ID=67753181

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/089637 WO2020238597A1 (en) 2019-05-27 2020-05-11 Hadoop-based data updating method, device, system and medium

Country Status (2)

Country Link
CN (1) CN110196888A (en)
WO (1) WO2020238597A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110196888A (en) * 2019-05-27 2019-09-03 深圳前海微众银行股份有限公司 Data-updating method, device, system and medium based on Hadoop
CN110555032A (en) * 2019-09-09 2019-12-10 北京搜狐新媒体信息技术有限公司 Data blood relationship analysis method and system based on metadata
CN111415199A (en) * 2020-03-20 2020-07-14 重庆锐云科技有限公司 Customer prediction updating method and device based on big data and storage medium
CN111563123B (en) * 2020-05-07 2023-08-22 北京首汽智行科技有限公司 Real-time synchronization method for hive warehouse metadata
CN112783871A (en) * 2021-03-16 2021-05-11 广东核电合营有限公司 Label data processing method, label data processing device, computer equipment and storage medium
CN113590386B (en) * 2021-07-30 2023-03-03 深圳前海微众银行股份有限公司 Disaster recovery method, system, terminal device and computer storage medium for data

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060235835A1 (en) * 2005-04-14 2006-10-19 International Business Machines Corporation Model entity operations in query results
CN104424269A (en) * 2013-08-30 2015-03-18 中国电信股份有限公司 Data linage analysis method and device
CN106997369A (en) * 2016-01-26 2017-08-01 阿里巴巴集团控股有限公司 Data clearing method and device
CN109213826A (en) * 2017-06-30 2019-01-15 华为技术有限公司 Data processing method and equipment
CN109446279A (en) * 2018-10-15 2019-03-08 顺丰科技有限公司 Based on neo4j big data genetic connection management method, system, equipment and storage medium
CN109582660A (en) * 2018-12-06 2019-04-05 深圳前海微众银行股份有限公司 Data consanguinity analysis method, apparatus, equipment, system and readable storage medium storing program for executing
CN110196888A (en) * 2019-05-27 2019-09-03 深圳前海微众银行股份有限公司 Data-updating method, device, system and medium based on Hadoop

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018236886A1 (en) * 2017-06-21 2018-12-27 Opera Solutions Usa, Llc System and method for code and data versioning in computerized data modeling and analysis
CN107908672B (en) * 2017-10-24 2022-01-14 深圳前海微众银行股份有限公司 Application report realization method, device and storage medium based on Hadoop platform
CN108959564B (en) * 2018-07-04 2020-11-27 玖富金科控股集团有限责任公司 Data warehouse metadata management method, readable storage medium and computer device

Also Published As

Publication number Publication date
CN110196888A (en) 2019-09-03


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20813539

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20813539

Country of ref document: EP

Kind code of ref document: A1