CN116346920A - Distributed database real-time acquisition lake entering method and system - Google Patents

Info

Publication number
CN116346920A
Authority
CN
China
Prior art keywords
lake
entering
database
acquisition
database server
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310232125.6A
Other languages
Chinese (zh)
Inventor
彭晓君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianyi Cloud Technology Co Ltd
Original Assignee
Tianyi Cloud Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianyi Cloud Technology Co Ltd filed Critical Tianyi Cloud Technology Co Ltd
Priority to CN202310232125.6A priority Critical patent/CN116346920A/en
Publication of CN116346920A publication Critical patent/CN116346920A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2477 Temporal data queries
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/12 Protocols specially adapted for proprietary or special-purpose networking environments, e.g. medical networks, sensor networks, networks in vehicles or remote metering networks
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/56 Provisioning of proxy services
    • H04L67/563 Data redirection of data network streams
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/56 Provisioning of proxy services
    • H04L67/566 Grouping or aggregating service requests, e.g. for unified processing
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/56 Provisioning of proxy services
    • H04L67/568 Storing data temporarily at an intermediate stage, e.g. caching
    • H04L67/5681 Pre-fetching or pre-delivering data based on network characteristics
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/56 Provisioning of proxy services
    • H04L67/568 Storing data temporarily at an intermediate stage, e.g. caching
    • H04L67/5682 Policies or rules for updating, deleting or replacing the stored data
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Fuzzy Systems (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a system for real-time acquisition and lake entry for a distributed database, belonging to the technical field of data transmission. The method comprises the following steps: defining a database server node and an acquisition lake-entry task node in a configuration center; the CDC acquisition service program initiating, at a preset time interval, a request to modify the heartbeat time of the database server node, thereby updating the heartbeat field in the database server node; adding the database to be acquired into the lake, together with the corresponding data table information, to the acquisition lake-entry task node; the CDC acquisition service program accessing the acquisition lake-entry task node and looking up the task information to be acquired into the lake; the CDC acquisition service program performing the acquisition task on the database and corresponding data table to be acquired into the lake and sending the acquired data to a distributed message platform; and the lake-entry service program accessing each acquisition lake-entry task node, reading the data of the database and corresponding data table from the distributed message platform, and performing lake-entry processing.

Description

Distributed database real-time acquisition lake entering method and system
Technical Field
The invention belongs to the technical field of data transmission, and particularly relates to a method and a system for real-time acquisition and lake entering of a distributed database.
Background
Change data capture (CDC) has become the dominant technique for database synchronization and database acquisition. CDC captures incremental changes to data and data structures in the source database and propagates these changes to other databases or applications in near real time. In this way, CDC provides efficient, low-latency data transfer to the data warehouse, so that information is converted and delivered to applications for analysis in a timely manner.
Data lake technology, a newer approach to data warehouse storage, has been adopted in the big data platforms of many companies by virtue of its efficient real-time ingestion and read performance and its good compatibility with traditional big data warehouse technology. A typical data lake technology is Hudi.
Acquiring database data into the lake by means of CDC has become the mainstream database acquisition scheme. Typically, a CDC tool writes the change data into Kafka or Pulsar, and a Flink or Spark streaming job then consumes the data and writes it into Hudi.
However, the current mainstream approach has two drawbacks. On the one hand, because CDC acquisition runs in real time, the CDC tools in the acquisition link all run as long-lived services. Since the acquisition sources are usually scattered across the network, one database often corresponds to one independent CDC service, and changing the acquisition configuration of, and maintaining, a large number of such independent services is very cumbersome. On the other hand, the CDC services and the lake-entry service are relatively independent of one another, and the schema of the lake table is usually defined in advance and hard-coded into the lake-entry process, so the lake-entry process cannot perceive changes to the schema of the acquired source data table.
Disclosure of Invention
The embodiments of the invention aim to provide a method and a system for real-time acquisition and lake entry for a distributed database, which address the following technical problems in existing data lake ingestion technology: changing the acquisition configuration of, and maintaining, a large number of independent CDC services is very cumbersome; and the CDC services and the lake-entry service are relatively independent, so that changes to the schema of the source data table cannot be perceived during lake entry. As a result, the existing data lake ingestion process is redundant and inefficient.
In order to solve the technical problems, the invention is realized as follows:
First aspect
An embodiment of the invention provides a method for real-time acquisition and lake entry for a distributed database, comprising the following steps:
S101: defining a database server node and an acquisition lake-entry task node in a configuration center;
S102: deploying a change data capture (CDC) acquisition service program on an acquisition database server;
S103: the CDC acquisition service program initiating, according to the configured configuration center information, a request to the configuration center to create the database server node;
S104: once the database server node has been created, the CDC acquisition service program initiating, at a preset time interval, a request to modify the heartbeat time of the database server node, thereby updating the heartbeat field in the database server node;
S105: adding the database to be acquired into the lake and the corresponding data table information to the acquisition lake-entry task node;
S106: the CDC acquisition service program accessing the acquisition lake-entry task node, looking up the task information to be acquired into the lake, comparing the task information with the data stored on the acquisition database server, and determining whether a database and corresponding data table to be acquired into the lake exist;
S107: if such a database and corresponding data table exist, the CDC acquisition service program obtaining the schema information of the data table to be acquired and sending it to the acquisition lake-entry task node, so that the schema information is written into that node;
S108: the CDC acquisition service program performing the acquisition task on the database and corresponding data table to be acquired into the lake;
S109: the CDC acquisition service program sending the acquired data of the database and corresponding data table to a distributed message platform;
S110: the lake-entry service program accessing each acquisition lake-entry task node, creating a lake-entry task thread according to the schema information, reading the data of the database and corresponding data table from the distributed message platform, and performing lake-entry processing.
Further, the name of the database server node is the address of the acquisition database server, and the content of the database server node is the connection time and the heartbeat time of the acquisition database server.
Further, the name of the acquisition lake-entry task node is the name of the acquired database and of the corresponding acquired data table, and the content of the acquisition lake-entry task node is the schema information of the acquired data table, where the schema information comprises the name, type, length and order of each field, together with the database and table of the lake-entry task.
Further, after S104, the method further comprises:
S104A: an alarm program traversing all database server nodes, computing the difference between the current time and the last heartbeat time of each acquisition database server, and determining, by comparing the difference with a preset timeout, whether the service on that acquisition database server is abnormal.
Further, S108 is specifically:
the CDC acquisition service program acquiring the database and corresponding data table to be acquired into the lake by means of a full snapshot plus binlog increments.
Second aspect
An embodiment of the invention provides a system for real-time acquisition and lake entry for a distributed database, comprising the following modules:
a definition module, configured to define a database server node and an acquisition lake-entry task node in a configuration center;
a deployment module, configured to deploy a change data capture (CDC) acquisition service program on an acquisition database server;
a first request module, configured for the CDC acquisition service program to initiate, according to the configured configuration center information, a request to the configuration center to create the database server node;
a second request module, configured for the CDC acquisition service program, once the database server node has been created, to initiate at a preset time interval a request to modify the heartbeat time of the database server node, thereby updating the heartbeat field in the database server node;
an adding module, configured to add the database to be acquired into the lake and the corresponding data table information to the acquisition lake-entry task node;
a comparison module, configured for the CDC acquisition service program to access the acquisition lake-entry task node, look up the task information to be acquired into the lake, compare the task information with the data stored on the acquisition database server, and determine whether a database and corresponding data table to be acquired into the lake exist;
a writing module, configured for the CDC acquisition service program, if such a database and corresponding data table exist, to obtain the schema information of the data table to be acquired and send it to the acquisition lake-entry task node, so that the schema information is written into that node;
an acquisition module, configured for the CDC acquisition service program to perform the acquisition task on the database and corresponding data table to be acquired into the lake;
a sending module, configured for the CDC acquisition service program to send the acquired data of the database and corresponding data table to a distributed message platform;
and a lake-entry module, configured for the lake-entry service program to access each acquisition lake-entry task node, create a lake-entry task thread according to the schema information, read the data of the database and corresponding data table from the distributed message platform, and perform lake-entry processing.
Further, the name of the database server node is the address of the acquisition database server, and the content of the database server node is the connection time and the heartbeat time of the acquisition database server.
Further, the name of the acquisition lake-entry task node is the name of the acquired database and of the corresponding acquired data table, and the content of the acquisition lake-entry task node is the schema information of the acquired data table, where the schema information comprises the name, type, length and order of each field, together with the database and table of the lake-entry task.
Further, the system for real-time acquisition and lake entry for a distributed database further comprises:
an alarm module, configured for an alarm program to traverse all database server nodes, compute the difference between the current time and the last heartbeat time of each acquisition database server, and determine, by comparing the difference with a preset timeout, whether the service on that acquisition database server is abnormal.
Further, the acquisition module is specifically configured for:
the CDC acquisition service program to acquire the database and corresponding data table to be acquired into the lake by means of a full snapshot plus binlog increments.
The invention has at least the following beneficial effects:
In the embodiments of the invention, a database server node and an acquisition lake-entry task node are defined in the configuration center, and all acquisition and lake-entry services are then managed uniformly through these nodes. This makes changing the acquisition configuration and performing maintenance simpler and reduces operation and maintenance costs. In addition, changes to the schema of the source data table are perceived dynamically through the acquisition lake-entry task node during lake entry, and the lake-entry strategy is adjusted accordingly, which simplifies the data lake ingestion flow and improves its efficiency.
Drawings
FIG. 1 is a schematic flow chart of a method for real-time collection and lake entering of a distributed database according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a distributed database real-time acquisition lake-entering system according to an embodiment of the present invention.
The realization of the objects, the functional features and the advantages of the present invention are further described below in connection with the embodiments and with reference to the accompanying drawings.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The method and the system for real-time acquisition and lake entering of the distributed database provided by the embodiment of the invention are described in detail through specific embodiments and application scenes thereof by combining the attached drawings.
Example 1
Referring to FIG. 1, a flow diagram of a method for real-time acquisition and lake entry for a distributed database according to an embodiment of the present invention is shown.
An embodiment of the invention provides a method for real-time acquisition and lake entry for a distributed database, comprising the following steps:
S101: Define database server nodes and acquisition lake-entry task nodes in the configuration center.
ZooKeeper can be used as the configuration center in which the database server nodes and the acquisition lake-entry task nodes are defined.
Optionally, a root node /datacloud is configured in ZooKeeper, together with a CDC acquisition source root node /datacloud/sources and an acquisition synchronization task root node /datacloud/jobs.
The database server node is mainly used for registering and maintaining database servers and for managing data sources.
The acquisition lake-entry task node serves two main purposes: on the one hand, the management program uses it to configure the names of the databases and tables to be acquired into the lake; on the other hand, it synchronizes the data table information (schema) of the data source to the lake-entry service, thereby enabling management of the lake-entry fields.
Optionally, the name of the database server node is the address of the acquisition database server, and the content of the database server node is the connection time and the heartbeat time of the acquisition database server.
Optionally, the name of the acquisition lake-entry task node is the name of the acquired database and of the corresponding acquired data table, and the content of the acquisition lake-entry task node is the schema information of the acquired data table, where the schema information comprises the name, type, length and order of each field, together with the database and table of the lake-entry task.
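The node layout described for S101 can be sketched as follows. This is a minimal illustration only: the JSON encoding of the node content, the `database.table` naming of task nodes, and the names `ServerNode`, `TaskNode` and `Field` are assumptions made for the example and are not specified by the embodiment.

```python
import json
from dataclasses import asdict, dataclass
from typing import List

SOURCES_ROOT = "/datacloud/sources"  # root of the database server nodes
JOBS_ROOT = "/datacloud/jobs"        # root of the acquisition lake-entry task nodes


@dataclass
class ServerNode:
    address: str           # node name: address of the acquisition database server
    connect_time: float    # epoch seconds (assumed representation)
    heartbeat_time: float  # epoch seconds (assumed representation)

    def path(self) -> str:
        return f"{SOURCES_ROOT}/{self.address}"

    def content(self) -> bytes:
        # JSON is an assumed encoding; the embodiment only lists the two times.
        payload = {"connect_time": self.connect_time,
                   "heartbeat_time": self.heartbeat_time}
        return json.dumps(payload).encode("utf-8")


@dataclass
class Field:
    name: str
    type: str
    length: int
    order: int


@dataclass
class TaskNode:
    database: str       # acquired source database
    table: str          # acquired source table
    fields: List[Field]
    lake_database: str  # database of the lake-entry task
    lake_table: str     # table of the lake-entry task

    def path(self) -> str:
        # "database.table" as the node name is a hypothetical convention.
        return f"{JOBS_ROOT}/{self.database}.{self.table}"

    def content(self) -> bytes:
        payload = {"fields": [asdict(f) for f in self.fields],
                   "lake_database": self.lake_database,
                   "lake_table": self.lake_table}
        return json.dumps(payload).encode("utf-8")
```

With this layout, every acquisition source and every task is addressable by a predictable path, which is what allows the services to be managed from one place.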
S102: Deploy a change data capture (CDC) acquisition service program on the acquisition database server.
After deployment, the CDC acquisition service program can access the acquisition database server.
Optionally, the configuration center of the CDC acquisition service program is set to the service hosting the ZooKeeper configuration center, and the CDC acquisition service program is deployed and started on each server to be acquired.
S103: The CDC acquisition service program initiates, according to the configured configuration center information, a request to the configuration center to create the database server node.
S104: Once the database server node has been created, the CDC acquisition service program initiates, at a preset time interval, a request to modify the heartbeat time of the database server node, thereby updating the heartbeat field in the database server node.
The preset time interval can be set by a person skilled in the art according to actual needs and is not limited by the embodiments of the invention.
Optionally, the CDC acquisition service program automatically registers its local IP address under /data_source in the configuration center and periodically sends heartbeat information to update the heartbeat field in the node.
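The registration and heartbeat behaviour of S103 and S104 might look like the sketch below. It is written against an in-memory stand-in rather than a real ZooKeeper client, and it assumes the /datacloud/sources root named in S101 and a JSON node body; `InMemoryConfigCenter`, `register_server`, `send_heartbeat` and `run_heartbeats` are hypothetical names.

```python
import json
import time
from typing import Callable, Dict


class InMemoryConfigCenter:
    """Hypothetical stand-in for the configuration center (e.g. ZooKeeper)."""

    def __init__(self) -> None:
        self._nodes: Dict[str, bytes] = {}

    def exists(self, path: str) -> bool:
        return path in self._nodes

    def create(self, path: str, data: bytes) -> None:
        if path in self._nodes:
            raise KeyError(f"node already exists: {path}")
        self._nodes[path] = data

    def set(self, path: str, data: bytes) -> None:
        if path not in self._nodes:
            raise KeyError(f"no such node: {path}")
        self._nodes[path] = data

    def get(self, path: str) -> bytes:
        return self._nodes[path]


def register_server(center: InMemoryConfigCenter, address: str, now: float) -> str:
    """S103: create the database server node if it does not exist yet."""
    path = f"/datacloud/sources/{address}"
    if not center.exists(path):
        body = {"connect_time": now, "heartbeat_time": now}
        center.create(path, json.dumps(body).encode("utf-8"))
    return path


def send_heartbeat(center: InMemoryConfigCenter, path: str, now: float) -> None:
    """S104: overwrite only the heartbeat field of the node."""
    record = json.loads(center.get(path))
    record["heartbeat_time"] = now
    center.set(path, json.dumps(record).encode("utf-8"))


def run_heartbeats(center: InMemoryConfigCenter, path: str, interval_s: float,
                   beats: int,
                   clock: Callable[[], float] = time.time,
                   sleep: Callable[[float], None] = time.sleep) -> None:
    """Send `beats` heartbeats, one per preset interval."""
    for _ in range(beats):
        send_heartbeat(center, path, clock())
        sleep(interval_s)
```

Injecting `clock` and `sleep` keeps the loop testable without waiting for real time to pass; a long-running service would simply loop indefinitely.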
S105: Add the database to be acquired into the lake and the corresponding data table information to the acquisition lake-entry task node.
S106: The CDC acquisition service program accesses the acquisition lake-entry task node, looks up the task information to be acquired into the lake, compares the task information with the data stored on the acquisition database server, and determines whether a database and corresponding data table to be acquired into the lake exist.
S107: If such a database and corresponding data table exist, the CDC acquisition service program obtains the schema information of the data table to be acquired and sends it to the acquisition lake-entry task node, so that the schema information is written into that node.
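Steps S106 and S107 amount to matching the configured tasks against the tables that actually exist on the local server and publishing the schema of each match. The sketch below assumes task nodes are named `database.table` and hold JSON, and uses plain dictionaries in place of the configuration center and the source database; `find_local_tasks` and `publish_schemas` are hypothetical names.

```python
import json
from typing import Dict, List, Tuple

TableKey = Tuple[str, str]  # (database, table)


def find_local_tasks(task_names: List[str],
                     local_tables: Dict[TableKey, List[dict]]) -> List[TableKey]:
    """S106: keep only the tasks whose database and table exist on this server."""
    matched = []
    for name in task_names:
        database, _, table = name.partition(".")
        if (database, table) in local_tables:
            matched.append((database, table))
    return matched


def publish_schemas(task_nodes: Dict[str, bytes],
                    local_tables: Dict[TableKey, List[dict]],
                    matched: List[TableKey]) -> None:
    """S107: write the schema of each matched table into its task node."""
    for database, table in matched:
        schema = {"fields": local_tables[(database, table)]}
        task_nodes[f"{database}.{table}"] = json.dumps(schema).encode("utf-8")
```

Tasks that refer to tables hosted elsewhere are left untouched, so each CDC acquisition service program only claims the work it can actually perform.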
S108: The CDC acquisition service program performs the acquisition task on the database and corresponding data table to be acquired into the lake.
In one possible implementation, S108 is specifically: the CDC acquisition service program acquires the database and corresponding data table to be acquired into the lake by means of a full snapshot plus binlog increments.
In the embodiments of the invention, acquiring by means of a full snapshot plus binlog increments improves the efficiency of data acquisition while ensuring its accuracy.
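The full-snapshot-plus-binlog pattern can be illustrated with a small simulation. The sketch assumes the binlog position is recorded when the snapshot is taken, so that only later events are replayed; `ChangeEvent`, `capture` and `apply_events` are hypothetical names, and no real binlog client is involved.

```python
from typing import Dict, Iterable, Iterator, NamedTuple


class ChangeEvent(NamedTuple):
    position: int  # hypothetical binlog offset
    op: str        # "insert", "update" or "delete"
    key: int       # primary key of the affected row
    row: dict      # row image after the change (empty for deletes)


def capture(snapshot_rows: Dict[int, dict],
            snapshot_position: int,
            binlog: Iterable[ChangeEvent]) -> Iterator[ChangeEvent]:
    """Emit the full snapshot first, then only the later binlog events."""
    # Phase 1: every row of the snapshot is emitted as an insert.
    for key, row in snapshot_rows.items():
        yield ChangeEvent(snapshot_position, "insert", key, row)
    # Phase 2: events at or before the snapshot position are already
    # reflected in the snapshot, so they are skipped.
    for event in binlog:
        if event.position > snapshot_position:
            yield event


def apply_events(events: Iterable[ChangeEvent]) -> Dict[int, dict]:
    """Replay events into a dictionary to show the resulting table state."""
    table: Dict[int, dict] = {}
    for event in events:
        if event.op == "delete":
            table.pop(event.key, None)
        else:
            table[event.key] = event.row
    return table
```

The snapshot delivers the bulk of the data in one pass, and the position filter keeps changes from being applied twice, which is one way to read the efficiency and accuracy benefits stated above.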
S109: The CDC acquisition service program sends the acquired data of the database and corresponding data table to the distributed message platform.
The distributed message platform may be Kafka or Pulsar.
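The hand-off in S109 can be sketched with an in-memory stand-in for the message platform. The one-topic-per-table naming (`cdc.<database>.<table>`), the JSON message body and the use of the primary key as message key are assumptions for illustration; `InMemoryMessagePlatform` and `publish_change` are hypothetical and do not represent the Kafka or Pulsar client APIs.

```python
import json
from collections import defaultdict
from typing import DefaultDict, List, Tuple


class InMemoryMessagePlatform:
    """Hypothetical stand-in for a distributed message platform."""

    def __init__(self) -> None:
        self.topics: DefaultDict[str, List[Tuple[bytes, bytes]]] = defaultdict(list)

    def send(self, topic: str, key: bytes, value: bytes) -> None:
        self.topics[topic].append((key, value))


def topic_for(database: str, table: str) -> str:
    # One topic per acquired table is an assumed convention.
    return f"cdc.{database}.{table}"


def publish_change(platform: InMemoryMessagePlatform, database: str, table: str,
                   op: str, key: int, row: dict) -> str:
    """Serialize one change record and send it to the table's topic."""
    topic = topic_for(database, table)
    value = json.dumps({"op": op, "row": row}, sort_keys=True).encode("utf-8")
    platform.send(topic, str(key).encode("utf-8"), value)
    return topic
```

Keying messages by primary key is a common choice on partitioned platforms because it keeps all changes to one row in order.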
S110: The lake-entry service program accesses each acquisition lake-entry task node, creates a lake-entry task thread according to the schema information, reads the data of the database and corresponding data table from the distributed message platform, and performs lake-entry processing.
Optionally, the configuration center of the data lake-entry service is set to the service hosting the ZooKeeper configuration center, and the data lake-entry service is started. The lake-entry service automatically monitors /datacloud/jobs for newly added tasks. The user adds acquisition lake-entry tasks under /datacloud/jobs in ZooKeeper through the configuration center management interface or the command line.
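A minimal sketch of S110 follows: for each task node whose schema has been written, the lake-entry service starts one worker thread that projects incoming messages onto the schema's fields. Queues stand in for the message platform and a list stands in for the lake table; the JSON schema layout and the names `lake_entry_worker` and `start_new_tasks` are assumptions, and no Hudi, Flink or Spark API is shown.

```python
import json
import queue
import threading
from typing import Dict, List


def lake_entry_worker(schema: dict, messages: "queue.Queue", sink: List[dict]) -> None:
    """Consume rows and keep only the schema's fields, in schema order."""
    ordered = sorted(schema["fields"], key=lambda f: f["order"])
    columns = [f["name"] for f in ordered]
    while True:
        message = messages.get()
        if message is None:  # sentinel: stop the worker
            break
        sink.append({column: message.get(column) for column in columns})


def start_new_tasks(task_nodes: Dict[str, bytes],
                    topics: Dict[str, "queue.Queue"],
                    running: Dict[str, threading.Thread],
                    sinks: Dict[str, List[dict]]) -> List[str]:
    """Start one lake-entry thread per task that has a schema and is not running."""
    started = []
    for name, raw in task_nodes.items():
        if name in running or not raw:
            continue  # already running, or schema not written yet
        schema = json.loads(raw)
        sinks[name] = []
        worker = threading.Thread(target=lake_entry_worker,
                                  args=(schema, topics[name], sinks[name]),
                                  daemon=True)
        worker.start()
        running[name] = worker
        started.append(name)
    return started
```

Calling `start_new_tasks` periodically, or from a watch callback on the task root, picks up newly added tasks without restarting the service. Because the column list comes from the task node rather than from hard-coded definitions, the schema used for lake entry follows what the CDC side has published.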
In one possible implementation, after S104, the method further comprises:
S104A: An alarm program traverses all database server nodes, computes the difference between the current time and the last heartbeat time of each acquisition database server, and determines, by comparing the difference with a preset timeout, whether the service on that acquisition database server is abnormal.
Optionally, if the difference between the current time and the last heartbeat time of an acquisition database server is greater than the preset timeout, the service on that acquisition database server is determined to be abnormal, and alarm information should be issued promptly.
Optionally, the alarm information may be a voice alarm or a pop-up alarm window.
In the embodiments of the invention, comparing the difference with the preset timeout makes it possible to determine promptly and accurately whether the service on an acquisition database server is abnormal, so that corresponding handling can be carried out. Because the acquisition and lake-entry services are managed uniformly, maintenance becomes more timely and simpler.
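The timeout check of S104A reduces to a single comparison per server node. The sketch below assumes each node body is JSON with a `heartbeat_time` field in epoch seconds and uses a dictionary keyed by server address in place of the configuration center; `find_stale_servers` is a hypothetical name.

```python
import json
from typing import Dict, List


def find_stale_servers(server_nodes: Dict[str, bytes], now: float,
                       timeout_s: float) -> List[str]:
    """Return the addresses whose last heartbeat is older than the timeout."""
    stale = []
    for address, raw in server_nodes.items():
        last_heartbeat = json.loads(raw)["heartbeat_time"]
        # Abnormal only when the gap strictly exceeds the preset timeout.
        if now - last_heartbeat > timeout_s:
            stale.append(address)
    return stale
```

The returned addresses are the servers for which the alarm program would issue alarm information; how the alarm is delivered is left open by the embodiment.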
Example 2
Referring to fig. 2, a schematic structural diagram of a distributed database real-time collection lake entering system according to an embodiment of the present invention is shown.
The embodiment of the invention provides a distributed database real-time acquisition lake entering system 20, which comprises the following components:
a definition module 201, configured to define a database server node and an acquisition lake-entering task node in a configuration center;
a deployment module 202 for deploying a change data capture CDC acquisition service program on an acquisition database server;
a first request module 203, configured to initiate a request for creating a database server node to a configuration center by using a CDC acquisition service program according to the configured configuration center information;
the second request module 204 is configured to initiate, by the CDC acquisition service program, a request for modifying the heartbeat time to the database server node according to a preset time interval when the database server node is created, and update a heartbeat field in the database server node;
the adding module 205 is configured to add the database to be collected into the lake and the corresponding data table information to the collecting and entering task node;
the comparison module 206 is used for accessing the task node for collecting the lake entering by the CDC collection service program, searching the task information for collecting the lake entering, comparing the task information with the data stored in the collection database server, and judging whether a database for collecting the lake entering and a corresponding data table exist;
a writing module 207, configured to, in the case that there are databases and corresponding data tables that need to be collected into a lake, obtain schema information of the data tables to be collected by the CDC collection service program, and send the schema information to a task node for collecting into the lake, so as to write the schema information in the task node for collecting into the lake;
the collection module 208 is used for collecting the database entering the lake and the corresponding data table according to the need by the CDC collection service program to perform a collection task;
a sending module 209, configured to send the collected database and the corresponding data table to the distributed message platform by using the CDC collection service program;
the lake entering module 210 is configured to access each collecting lake entering task node by using the lake entering service program, create a lake entering task thread according to the schema information, read a database and a corresponding data table from the distributed message platform, and perform lake entering processing.
Further, the name of the database server node is the address of the acquisition database server, and the content of the database server node is the connection time and the heartbeat time of the acquisition database server.
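A minimal sketch of this node layout follows. It uses an in-memory dictionary as a stand-in for the configuration center, since the source does not name a specific product; the field names `connect_time` and `heartbeat_time` and the helpers `register_server` and `update_heartbeat` are assumptions made for illustration.

```python
import time
from typing import Dict, Optional

# Hypothetical in-memory stand-in for the configuration center:
# node name (server address) -> node content.
config_center: Dict[str, Dict[str, float]] = {}


def register_server(address: str, now: Optional[float] = None) -> None:
    """Create the database server node, named after the server address."""
    ts = time.time() if now is None else now
    config_center[address] = {"connect_time": ts, "heartbeat_time": ts}


def update_heartbeat(address: str, now: Optional[float] = None) -> None:
    """Overwrite only the heartbeat field of an existing node."""
    ts = time.time() if now is None else now
    config_center[address]["heartbeat_time"] = ts


# Fixed timestamps are passed in so the example is deterministic.
register_server("10.0.0.5:3306", now=1000.0)
update_heartbeat("10.0.0.5:3306", now=1030.0)
```

In a deployment, `update_heartbeat` would be called on a timer at the preset interval, leaving the connection time untouched.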
Further, the name of each acquisition lake-entering task node consists of the names of the acquisition database and the corresponding acquisition data table, and the content of the acquisition lake-entering task node is the schema information of the acquisition data table, which comprises the name, type, length, and order of each field, together with the database and table of the lake-entering task.
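One way to picture the schema information is as a small record serialized into the task node. The sketch below is an assumption-laden illustration: the class names `FieldInfo` and `SchemaInfo`, the `database.table` node-name format, and JSON as the storage encoding are all hypothetical choices, not details given in the source.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class FieldInfo:
    name: str
    type: str
    length: int
    position: int  # order of the field within the table


@dataclass
class SchemaInfo:
    fields: List[FieldInfo]
    target_database: str  # database ("library") of the lake-entering task
    target_table: str     # table of the lake-entering task


def task_node_name(database: str, table: str) -> str:
    """Build a node name from the source database and table names."""
    return f"{database}.{table}"


schema = SchemaInfo(
    fields=[
        FieldInfo("id", "bigint", 20, 1),
        FieldInfo("name", "varchar", 64, 2),
    ],
    target_database="lake_db",
    target_table="orders",
)
node_name = task_node_name("shop", "orders")
node_content = json.dumps(asdict(schema))  # what would be written to the node
```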
Further, the distributed database real-time acquisition lake-entering system further comprises an alarm module 211, configured to compare the difference between the current time and the last heartbeat time of each acquisition database server with a preset timeout, and to determine from the result whether the acquisition database server has a service abnormality.
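The timeout check can be sketched in a few lines of Python. The node layout (a `heartbeat_time` field per server address) and the function name `find_abnormal_servers` are illustrative assumptions, as is the choice to treat a server as abnormal only when the difference strictly exceeds the timeout; the source does not specify the boundary case.

```python
from typing import Dict, List


def find_abnormal_servers(
    nodes: Dict[str, Dict[str, float]], now: float, timeout: float
) -> List[str]:
    """Return addresses whose last heartbeat is older than `timeout` seconds."""
    abnormal = []
    for address, content in nodes.items():
        # Assumption: strictly greater than the timeout counts as abnormal.
        if now - content["heartbeat_time"] > timeout:
            abnormal.append(address)
    return abnormal


nodes = {
    "10.0.0.5:3306": {"connect_time": 1000.0, "heartbeat_time": 1090.0},
    "10.0.0.6:3306": {"connect_time": 1000.0, "heartbeat_time": 1010.0},
}
stale = find_abnormal_servers(nodes, now=1100.0, timeout=60.0)
```

With a 60-second timeout, the first server (10 seconds since its last heartbeat) is healthy, while the second (90 seconds) would be reported.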
Further, the acquisition module 208 is specifically configured so that the CDC acquisition service program acquires the databases and corresponding data tables that need to be acquired into the lake by means of a full snapshot plus binlog increments.
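The idea of a full snapshot followed by binlog increments can be shown with a toy Python model. The message shape (`op` plus `row`), the `id` primary key, and the functions `capture` and `replay` are assumptions for illustration only; the sketch does not parse a real binlog.

```python
from typing import Dict, Iterable, Iterator


def capture(
    snapshot_rows: Iterable[Dict], binlog_events: Iterable[Dict]
) -> Iterator[Dict]:
    """Yield snapshot rows first, then incremental change events in order."""
    for row in snapshot_rows:
        yield {"op": "read", "row": row}
    for event in binlog_events:
        yield event


def replay(messages: Iterable[Dict]) -> Dict[int, Dict]:
    """Replay messages into a table keyed by the primary key `id`."""
    table: Dict[int, Dict] = {}
    for msg in messages:
        if msg["op"] == "delete":
            table.pop(msg["row"]["id"], None)
        else:  # read, insert, and update all upsert the row
            table[msg["row"]["id"]] = msg["row"]
    return table


snapshot = [{"id": 1, "qty": 2}, {"id": 2, "qty": 5}]
binlog = [
    {"op": "update", "row": {"id": 1, "qty": 3}},
    {"op": "delete", "row": {"id": 2}},
    {"op": "insert", "row": {"id": 3, "qty": 1}},
]
final_state = replay(capture(snapshot, binlog))
```

A real implementation would also need to record the binlog position at the moment of the snapshot so that no change is missed or applied twice; that bookkeeping is omitted here.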
The distributed database real-time acquisition lake-entering system 20 provided by the embodiment of the invention can implement the steps and achieve the effects of the distributed database real-time acquisition lake-entering method described above, which are not repeated here.
The invention has at least the following beneficial effects:
In the embodiment of the invention, database server nodes and acquisition lake-entering task nodes are defined in the configuration center, and all acquisition and lake-entering services are managed uniformly through these nodes. This simplifies acquisition configuration changes and maintenance and reduces operation and maintenance costs. During lake entering, changes to the schema of the source data table are perceived dynamically through the acquisition lake-entering task nodes, and the lake-entering strategy is adjusted accordingly, which simplifies the data lake-entering flow and improves its efficiency.
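Perceiving a schema change amounts to comparing the schema last seen by the lake-entering service with the one currently stored in the task node. The Python sketch below shows one possible comparison; the `{field name: type}` representation and the function `diff_schema` are assumptions, and how the lake-entering strategy reacts to each kind of change is not specified in the source.

```python
from typing import Dict, List


def diff_schema(old: Dict[str, str], new: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare two {field name: type} mappings and report the differences."""
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "changed": sorted(f for f in set(old) & set(new) if old[f] != new[f]),
    }


old_schema = {"id": "bigint", "name": "varchar(64)"}
new_schema = {"id": "bigint", "name": "varchar(128)", "email": "varchar(255)"}
changes = diff_schema(old_schema, new_schema)
```

In this example the `email` field is reported as added and `name` as changed, which a lake-entering task could use to decide, for instance, whether to alter the target table before writing further rows.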
The virtual system provided by the invention may be a standalone system, or may be a component, an integrated circuit, or a chip in a terminal.
The foregoing is merely exemplary of the present invention and is not intended to limit the present invention. Various modifications and variations of the present invention will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the invention are to be included in the scope of the claims of the present invention.

Claims (10)

1. A method for real-time acquisition and lake entering of a distributed database, characterized by comprising the following steps:
S101: defining a database server node and an acquisition lake-entering task node in a configuration center;
S102: deploying a change data capture (CDC) acquisition service program on an acquisition database server;
S103: the CDC acquisition service program sends the configuration center a request to create the database server node, according to the configured configuration center information;
S104: once the database server node has been created, the CDC acquisition service program sends the database server node a request to modify the heartbeat time at a preset time interval, thereby updating the heartbeat field in the database server node;
S105: adding the information of the databases and corresponding data tables that need to be acquired into the lake to the acquisition lake-entering task node;
S106: the CDC acquisition service program accesses the acquisition lake-entering task node, looks up the task information that requires acquisition into the lake, compares it with the data stored on the acquisition database server, and determines whether any database and corresponding data table needs to be acquired into the lake;
S107: when a database and corresponding data table that need to be acquired into the lake exist, the CDC acquisition service program obtains the schema information of the data table to be acquired and sends it to the acquisition lake-entering task node, so that the schema information is written into that node;
S108: the CDC acquisition service program performs the acquisition task on the databases and corresponding data tables that need to be acquired into the lake;
S109: the CDC acquisition service program sends the acquired databases and corresponding data tables to a distributed message platform;
S110: the lake-entering service program accesses each acquisition lake-entering task node, creates a lake-entering task thread according to the schema information, reads the databases and corresponding data tables from the distributed message platform, and performs lake-entering processing.
2. The method for real-time acquisition and lake entering of a distributed database according to claim 1, wherein the name of the database server node is the address of the acquisition database server, and the content of the database server node is the connection time and the heartbeat time of the acquisition database server.
3. The method for real-time acquisition and lake entering of a distributed database according to claim 1, wherein the name of each acquisition lake-entering task node consists of the names of the acquisition database and the corresponding acquisition data table, the content of the acquisition lake-entering task node is the schema information of the acquisition data table, and the schema information comprises the name, type, length, and order of each field, together with the database and table of the lake-entering task.
4. The method for real-time acquisition and lake entering of a distributed database according to claim 1, further comprising, after S104:
S104A: an alarm program traverses all the database server nodes, compares the difference between the current time and the last heartbeat time of each acquisition database server with a preset timeout, and determines from the result whether the acquisition database server has a service abnormality.
5. The method for real-time acquisition and lake entering of a distributed database according to claim 1, wherein S108 is specifically:
the CDC acquisition service program acquires the databases and corresponding data tables that need to be acquired into the lake by means of a full snapshot plus binlog increments.
6. A distributed database real-time acquisition lake-entering system, comprising:
a definition module, configured to define a database server node and an acquisition lake-entering task node in a configuration center;
a deployment module, configured to deploy a change data capture (CDC) acquisition service program on an acquisition database server;
a first request module, configured for the CDC acquisition service program to send the configuration center a request to create the database server node, according to the configured configuration center information;
a second request module, configured for the CDC acquisition service program, once the database server node has been created, to send the database server node a request to modify the heartbeat time at a preset time interval, thereby updating the heartbeat field in the database server node;
an adding module, configured to add the information of the databases and corresponding data tables that need to be acquired into the lake to the acquisition lake-entering task node;
a comparison module, configured for the CDC acquisition service program to access the acquisition lake-entering task node, look up the task information that requires acquisition into the lake, compare it with the data stored on the acquisition database server, and determine whether any database and corresponding data table needs to be acquired into the lake;
a writing module, configured for the CDC acquisition service program, when a database and corresponding data table that need to be acquired into the lake exist, to obtain the schema information of the data table to be acquired and send it to the acquisition lake-entering task node, so that the schema information is written into that node;
an acquisition module, configured for the CDC acquisition service program to perform the acquisition task on the databases and corresponding data tables that need to be acquired into the lake;
a sending module, configured for the CDC acquisition service program to send the acquired databases and corresponding data tables to a distributed message platform;
and a lake-entering module, configured for a lake-entering service program to access each acquisition lake-entering task node, create a lake-entering task thread according to the schema information, read the databases and corresponding data tables from the distributed message platform, and perform lake-entering processing.
7. The distributed database real-time acquisition lake-entering system according to claim 6, wherein the name of the database server node is the address of the acquisition database server, and the content of the database server node is the connection time and the heartbeat time of the acquisition database server.
8. The distributed database real-time acquisition lake-entering system according to claim 6, wherein the name of each acquisition lake-entering task node consists of the names of the acquisition database and the corresponding acquisition data table, the content of the acquisition lake-entering task node is the schema information of the acquisition data table, and the schema information comprises the name, type, length, and order of each field, together with the database and table of the lake-entering task.
9. The distributed database real-time acquisition lake-entering system according to claim 6, further comprising:
an alarm module, configured for an alarm program to traverse all the database server nodes, compare the difference between the current time and the last heartbeat time of each acquisition database server with a preset timeout, and determine from the result whether the acquisition database server has a service abnormality.
10. The distributed database real-time acquisition lake-entering system according to claim 6, wherein the acquisition module is specifically configured so that:
the CDC acquisition service program acquires the databases and corresponding data tables that need to be acquired into the lake by means of a full snapshot plus binlog increments.
CN202310232125.6A 2023-03-12 2023-03-12 Distributed database real-time acquisition lake entering method and system Pending CN116346920A (en)

Publications (1)

Publication Number Publication Date
CN116346920A true CN116346920A (en) 2023-06-27

Family

ID=86885111

Country Status (1)

Country Link
CN (1) CN116346920A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination