CN115374102A - Data processing method and system - Google Patents

Data processing method and system

Info

Publication number
CN115374102A
CN115374102A (application CN202110875696.2A)
Authority
CN
China
Prior art keywords
data
synchronization
kafka
service
task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110875696.2A
Other languages
Chinese (zh)
Inventor
周志燕 (Zhou Zhiyan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Daxing Technology Co ltd
Original Assignee
Beijing Daxing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Daxing Technology Co ltd filed Critical Beijing Daxing Technology Co ltd
Priority to CN202110875696.2A
Publication of CN115374102A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 - Indexing; Data structures therefor; Storage structures
    • G06F16/2282 - Tablespace storage structures; Management thereof
    • G06F16/24 - Querying
    • G06F16/245 - Query processing
    • G06F16/25 - Integrating or interfacing systems involving database management systems
    • G06F16/254 - Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G06F16/27 - Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F16/28 - Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 - Relational databases

Abstract

The application discloses a data processing method and system in the field of big data. The method includes: disguising a service instance as a MySQL slave through the canal service, and acquiring the data manipulation language (DML) and data definition language (DDL) logs of the corresponding master library from the slave nodes; outputting the acquired DML and DDL logs to kafka; and synchronously outputting the binlog data in kafka to a set data storage component in a set manner according to a synchronization task configured for the usage scenario of a given service. The method and system provided by the application can help business department personnel access data quickly, save the development and maintenance labor cost of traditional data synchronization, greatly improve working efficiency, support the repair of abnormal data, meet the data synchronization requirements among various data components, and ensure the timeliness of business decisions made from data.

Description

Data processing method and system
Technical Field
The invention relates to the technical field of big data and Internet, in particular to a data processing method and a data processing system.
Background
Traditional MySQL cannot support the analysis of large volumes of data across varied and complex service scenarios, and even with a database and table sharding scheme it is difficult to generate decision reports in real time for a large number of services. In the traditional data synchronization scheme, the schema corresponding to a data table is maintained in excel and the data is written in one batch by a hand-written program; this management mode can only import T+1 data, cannot achieve real-time data, and lags behind the overall service progress. An excessively large single import seriously affects the performance of the data component and, in extreme cases, makes the component unavailable during the import. The storage space of the big data HDFS cluster also keeps growing, and with tens of thousands of data tables and ETL tasks, manual processing is impossible, so a life cycle management system based on the HIVE table is provided for automatic processing.
In view of the above, it is desirable to provide a data synchronization method supporting multiple types of data and complex service scenarios to solve the above problems.
Disclosure of Invention
The technical problem to be solved by the embodiments of the present invention is how to provide a data synchronization processing method and system supporting multiple types of data and complex service scenarios to solve the problems in the prior art.
To address the foregoing problems, in a first aspect, an embodiment of the present application provides a data processing method, including: disguising a service instance as a MySQL slave through the canal service, and acquiring the data manipulation language DML and data definition language DDL logs of the corresponding master library from the slave nodes; outputting the acquired data manipulation language DML and data definition language DDL logs to kafka; and synchronously outputting the binlog data in kafka to a set data storage component in a set manner according to a synchronization task configured for the usage scenario of a given service.
With reference to the first aspect, in some embodiments, the synchronization task configured for the usage scenario of the service includes: selecting the library table to be synchronized when the task is created, and specifying the synchronization type and synchronization method, wherein the synchronization type includes: stock plus real-time synchronization, stock-only synchronization and real-time-only synchronization.
In combination with the first aspect, in some embodiments, an index or table is created to support automatic creation of a tidb table or es index template from a schema of data to be synchronized.
With reference to the first aspect, in some embodiments, an index name and a mapping relationship between a MySQL field and an es field are specified when synchronizing to es; and when the synchronization is carried out to the tidb, a target cluster, a library and a table are specified.
With reference to the first aspect, in some embodiments, said synchronously outputting the binlog data in kafka to a set data storage component in a set manner includes: after a task is started, a set notification is published in zookeeper; after the data synchronization service monitors the set notification message published by zookeeper, it starts a new consumer thread according to the configured kafka topic information and pulls data at a set time interval or in a set batch size; when an exception occurs during processing, the data that was not successfully processed is sent to a temporary topic of kafka, and the task consumes the exceptional data again during retry, wherein the data exception includes but is not limited to a dependent-component service exception, a network exception and a cloud service exception.
With reference to the first aspect, in some embodiments, the output logs are monitored in real time, and the number and content of the abnormal logs are sent to a set client when an abnormality occurs; and when consuming kafka messages, if blocking occurs due to a sudden increase in traffic, the parameters are dynamically adjusted and optimized according to the blocking condition.
With reference to the first aspect, in some embodiments, whether the preparation work of each part of the data processing is complete is checked during task approval to ensure the accuracy of data synchronization; during verification of the data processing, illegal configurations are rejected and tasks that are not suitable for synchronization are refused.
With reference to the first aspect, in some embodiments, writing data into topic corresponding to kafka by a canal service synchronization tool includes: the method comprises the steps of database authority application, white list configuration, topic mapping configuration, data format conversion and data writing.
With reference to the first aspect, in some embodiments, a user applies for a read permission of a corresponding master library according to a library in which a table to be synchronized is located, so as to open a permission for canal to pull MySQL binlog.
With reference to the first aspect, in some embodiments, a white list of a corresponding table is opened according to actual business requirements, and unnecessary table data is masked.
With reference to the first aspect, in some embodiments, a mapping relationship is automatically generated according to the white list configuration; the mapping relationship specifies which set topic of kafka each piece of database table data finally falls into, and may be specified per library or per table; once the configuration is completed, automatic creation of the kafka topic is supported.
With reference to the first aspect, in some embodiments, binlog data for canal access is converted to json format according to an object mode.
With reference to the first aspect, in some embodiments, data is written to a specified topic according to a white list and topic mapping relationship.
With reference to the first aspect, in some embodiments, the method further comprises: the storage components supported for data output include, but are not limited to, kudu, Elasticsearch, tidb and hdfs; and/or the data output supports write and/or data repair functions.
In combination with the first aspect, in some embodiments, the method further comprises: according to the verification result, performing data repair on the data in the file by specifying the conditions of the data to be repaired.
With reference to the first aspect, in some embodiments, the method further comprises: checking the data of the previous set time period at a set time within each time period; and outputting the inconsistent data and the reasons to a corresponding file to facilitate problem location and data repair.
With reference to the first aspect, in some embodiments, the method further includes: after a synchronization task is created on the data deep learning di platform, calling the approval interface to write the data into the distributed application program coordination service zookeeper; the data synchronization service monitors changes of the distributed application program coordination service zookeeper nodes and determines whether to start a new synchronization task according to the change of the nodes; the data monitoring service performs timed monitoring of the data consistency of the new task; after the new synchronization task is started, data is pulled from the data source according to the input and output modes specified during configuration; the acquired data is synchronized to the data storage component in real time, and if data inconsistency occurs while the data synchronization is being executed, a stock-data synchronization task is created through the change flow to repair the data.
In a second aspect, an embodiment of the present application provides a data processing method, further including: constructing a data processing service by using a springboot, and constructing a metadata bin by relying on a DI dispatching system;
receiving, through zookeeper, the task queue sent by the DI dispatching system;
dynamically monitoring the changes of the zookeeper node to realize Data Lifecycle Management (DLM) distributed deployment;
and the data lifecycle management DLM performs related processing according to the set configuration rule to realize the full link analysis processing of the HIVE table data, wherein the processing comprises but is not limited to synchronous processing.
In combination with the second aspect, in some embodiments, the method further comprises:
adopting springboot to construct synchronous service;
the add and delete operations on zookeeper nodes are dynamically monitored, and data synchronization tasks are published and taken offline;
and the source data is consumed from kafka according to set rules to synchronize data, and the consumed data is output to a specified database component.
In combination with the second aspect, in some embodiments, further comprising: the setting of the configuration rule comprises:
adding life cycle management configuration based on Hive data table into a cluster table operation page of the deep learning DI system, and carrying out rule configuration by a user according to a set requirement condition in table building operation;
and automatically configuring the life cycle rule of the HIVE data table according to a preset rule by a configuration timing task of the deep learning DI system.
In combination with the second aspect, in some embodiments, further comprising: the construction of the metadata bin includes data acquisition from the data terminals and data cleaning according to the tag rules.
In combination with the second aspect, in some embodiments, further comprising: the data acquisition warehousing comprises:
For data extract-transform-load (ETL) tasks, the tasks are collected through task analysis and SQL parsing and put into the warehouse by scheduled tasks; for the HIVE hook, the SQL executed by users is intercepted by HIVE, parsed, and reported to the warehouse by scheduled tasks; for the mail system, the mail SQL is parsed, mail open-rate data is reported through buried points and put into the warehouse by consuming kafka, and is also reported and put into the warehouse by scheduled tasks; for the report system, the SQL in the report system is parsed, the logs are analyzed, and the results are reported to the warehouse by scheduled tasks.
in combination with the second aspect, in some embodiments, further comprising: and (4) establishing a metadata bin for scheduling and configuring related ETL tasks, and cleaning data according to the tags.
In combination with the second aspect, in some embodiments, further comprising: distributed deployment is achieved through the zookeeper, the zookeeper is used as a notification medium of message consistency, and issuing and execution of task calculation are executed.
In a third aspect, an embodiment of the present application provides a data synchronization processing system, including: an acquisition module configured to disguise a service instance as a MySQL slave through the canal service and acquire the data manipulation language (DML) and data definition language (DDL) logs of the corresponding master library from the slave node; a sending module, connected with the acquisition module, configured to output the acquired DML and DDL logs to kafka; and a processing module, connected with the sending module, configured to synchronously output the binlog data in kafka to a set data storage component in a set manner according to a synchronization task configured for the usage scenario of a given service.
With reference to the third aspect, in some embodiments, further comprising: the task creation module is configured to select a base table needing synchronization, and specify a synchronization type and a synchronization method, wherein the synchronization type comprises: stock and real-time synchronization, stock synchronization and real-time synchronization; the creating module is configured to create an index or a table to support automatic creation of a tidb table or an es index template according to the schema of the data to be synchronized; when the fields are synchronized to es, an index name is appointed and the mapping relation between the MySQL field and the es field is appointed; and when the synchronization is to the tidb, a target cluster, a library and a table are specified.
With reference to the third aspect, in some embodiments, further comprising: a task running module configured to publish a set notification in the program coordination service zookeeper after a task is started; after monitoring the set notification message published by zookeeper, the data synchronization service starts a new consumer thread according to the configured kafka topic information to pull data at a set time interval or in a set batch size; when an exception occurs during processing, the data that was not successfully processed is sent to a temporary topic of kafka, and the task consumes the exceptional data again during retry, wherein the data exception includes but is not limited to a dependent-component service exception, a network exception and a cloud service exception; and a real-time monitoring module for monitoring the output logs in real time and sending the number and content of exceptions to the set client when an abnormality occurs; when consuming kafka messages, if a sudden increase in traffic causes blocking, the parameters are dynamically adjusted and optimized according to the blocking condition.
With reference to the third aspect, in some embodiments, the task approval module is configured to check whether the preparation work of each part of the data processing is complete so as to ensure the accuracy of data synchronization; and/or the task approval module is further configured, during data verification, to reject illegal configurations or refuse tasks that are not suitable for synchronization.
With reference to the third aspect, in some embodiments, the right application module, configured to write data into topic corresponding to kafka through the canal service synchronization tool, includes: the method comprises the steps of database permission application, white list configuration, topic mapping configuration, data format conversion and data writing.
With reference to the third aspect, in some embodiments, the system further includes a permission application module configured to apply for a read permission of the corresponding main library according to the library in which the table to be synchronized is located, so as to open a permission that canal draws the MySQL binlog.
With reference to the third aspect, in some embodiments, the system further includes a white list module configured to open a white list of the corresponding table according to actual business requirements, and shield unnecessary table data.
With reference to the third aspect, in some embodiments, the system further includes a mapping configuration module configured to automatically generate a mapping relationship according to the white list configuration, where the mapping relationship specifies which set topic of kafka the database table data finally falls into and may be specified per library or per table, and once the configuration is completed, automatic creation of the kafka topic is supported.
With reference to the third aspect, in some embodiments, the method further includes a data format conversion module configured to convert binlog data accessed by canal into json format according to an object mode.
With reference to the third aspect, in some embodiments, the method further includes a data writing module configured to write data into the specified topic according to the white list and topic mapping relationship.
With reference to the third aspect, in some embodiments, further comprising: a data output module configured to output data, where the supported output storage components include, but are not limited to, kudu, Elasticsearch, tidb and hdfs; and/or the data output module supports write and/or data repair functions;
with reference to the third aspect, in some embodiments, the method further includes an abnormal data repairing module configured to repair data by specifying a condition of data to be repaired for the data in the file according to the verification result.
With reference to the third aspect, in some embodiments, further comprising: the data checking module is configured to check the data of the last set time period at the set time of the set time period; and outputting the inconsistent data and reasons to a corresponding file so as to facilitate problem location and data repair.
The data synchronization method and system provided herein, based on big data technology and internet technology, form an intelligent automatic resource optimization system. They can help business department personnel access data quickly, save the development and maintenance labor cost of traditional data synchronization, greatly improve working efficiency, support the repair of abnormal data, meet the data synchronization requirements among various data components, and guarantee the timeliness of business decisions made from data.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.
The invention will be more clearly understood from the following detailed description, taken with reference to the accompanying drawings, in which:
fig. 1 shows a flow chart of a data processing method according to an embodiment of the present application.
FIG. 2 shows a block diagram of a data processing system according to an embodiment of the present application.
Fig. 3 shows a flow chart of a data processing method according to another embodiment of the present application.
Fig. 4 shows a flow chart of a data processing method of a further embodiment of the present application.
Fig. 5 is a system state diagram illustrating a data processing method according to an embodiment of the present application.
Fig. 6 shows a flowchart of a data processing method according to still another embodiment of the present application.
FIG. 7 is a block diagram of a data processing system according to an embodiment of the present application.
Detailed Description
The present application is further described with reference to the following detailed description and the accompanying drawings. It is to be understood that the illustrative embodiments of the present disclosure include, but are not limited to, related methods, devices, and systems, and that the specific embodiments described herein are merely illustrative of and not restrictive on the broad application. In addition, for convenience of description, only a part of structures or processes related to the present application, not all of them, is illustrated in the drawings.
The following description of the embodiments of the present application is provided by way of specific examples, and other advantages and effects of the present application will be readily apparent to those skilled in the art from the disclosure herein. While the description of the present application will be described in conjunction with the preferred embodiments, it is not intended to limit the features of the present invention to that embodiment. Rather, the invention as described in connection with the embodiments is intended to cover alternatives or modifications as may be extended based on the claims of the present application. In the following description, numerous specific details are included to provide a thorough understanding of the present application. The present application may be practiced without these particulars. Moreover, some of the specific details have been omitted from the description in order to avoid obscuring or obscuring the focus of the present application. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
Further, various operations will be described as multiple discrete operations, in a manner that is most helpful in understanding the illustrative embodiments; however, the order of description should not be construed as to imply that these operations are necessarily order dependent. In particular, these operations need not be performed in the order of presentation.
The terms "comprising," "having," and "including" are synonymous, unless the context dictates otherwise.
The term "and/or" herein is merely an association relationship describing an associated object, and means that there may be three relationships, for example, a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship. The phrase "A/B" means "A or B".
In this application, "at least one" means one or more, "a plurality" means two or more. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of the singular or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, a-b, a-c, b-c, or a-b-c, wherein a, b, c may be single or multiple.
In the embodiments of the present application, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
In some cases, the disclosed embodiments may be implemented in hardware, firmware, software, or any combination thereof. The disclosed embodiments may also be implemented as instructions carried by or stored on one or more transitory or non-transitory machine-readable (e.g., computer-readable) storage media, which may be read and executed by one or more processors. For example, the instructions may be distributed via a network or via other computer readable media. Thus, a machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer), including, without limitation, a floppy diskette, an optical disk, a compact disc read-only memory (CD-ROM), a magneto-optical disk, a read-only memory (ROM), a random access memory (RAM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), a magnetic or optical card, flash memory, or a tangible machine-readable memory for transmitting information over the Internet via electrical, optical, acoustical or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.). Thus, a machine-readable medium includes any type of machine-readable medium suitable for storing or transmitting electronic instructions or information in a form readable by a machine (e.g., a computer).
In the drawings, some features of the structures or methods are shown in a particular arrangement and/or order. However, it is to be understood that such specific arrangement and/or ordering may not be required. In some embodiments, these features may be arranged in a manner and/or order different from that shown in the illustrative figures. In addition, the inclusion of a structural or methodological feature in a particular figure is not meant to imply that such feature is required in all embodiments, and in some embodiments may not be included or may be combined with other features.
It will be understood that, although the terms "first", "second", etc. may be used herein to describe various elements or data, these elements or data should not be limited by these terms. These terms are only used to distinguish one feature from another. For example, a first feature may be termed a second feature, and, similarly, a second feature may be termed a first feature, without departing from the scope of example embodiments.
It should be noted that in this specification, like reference numerals and letters refer to like items in the following drawings, and thus, once an item is defined in one drawing, it need not be further defined and explained in subsequent drawings.
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
As used herein, the term module or unit may refer to, be part of, or include an application specific integrated circuit (ASIC), an electronic circuit, a processor (shared, dedicated, or group) and/or memory (shared, dedicated, or group) that executes one or more software or firmware programs, a combinational logic circuit, and/or other suitable components that provide the described functionality.
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless it is specifically stated otherwise.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses.
Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be discussed further in subsequent figures.
Embodiments of the invention are applicable to computer systems/servers operable with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with the computer system/server include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network personal computers, small computer systems, mainframe computer systems, and distributed cloud computing environments that include any of the above, and the like.
The computer system/server may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. The computer system/server may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
In the prior art, data synchronization processing is iterated according to business requirements; developing a data synchronization task takes a long time and easily produces a large amount of repetitive work. In particular, the hand-over between stock data and real-time data involved in synchronization makes it difficult to guarantee data consistency before and after synchronization, so manual intervention is required to a great extent, which increases the maintenance cost.
Fig. 1 shows a flowchart of a data processing method according to an embodiment of the present application, and as shown in fig. 1, the method includes:
step 101, disguising the service instance as MySQL through canal service, and acquiring data manipulation language DML and data definition language DDL logs of the corresponding main library from the library nodes.
And 102, outputting the acquired data manipulation language DML and data definition language DDL logs to kafka.
And 103, synchronously outputting binlog data in the kafka to a set data storage component in a setting mode according to a synchronization task configured according to the use scene of the set service.
In a specific embodiment, according to the access configuration information corresponding to the source storage component, writing the source data corresponding to each service instance in the source storage component into the corresponding topic of the distributed log system Kafka in real time; when the existence of a synchronous task to be processed is monitored, configuration data of the synchronous task are obtained and comprise address information of a source storage component, a target theme of Kafka and a target storage component; and executing the synchronization task according to the configuration data.
In one particular embodiment, the MySQL languages are divided into four major categories: the data query language DQL, the data manipulation language DML, the data definition language DDL and the data control language DCL.
The basic structure of the data query language DQL is a query block consisting of a SELECT clause, a FROM clause and a WHERE clause: SELECT <field list> FROM <table or view name> WHERE <query condition>.
The data manipulation language DML has three main forms: 1) insert: INSERT; 2) update: UPDATE; 3) delete: DELETE.
The data definition language DDL is used to create various objects in a database, such as tables, views, indexes, synonyms and clusters, for example: CREATE TABLE, CREATE VIEW, CREATE INDEX, CREATE SYNONYM, CREATE CLUSTER.
The data control language DCL is used to grant or revoke certain privileges for accessing the database, to control when and how database transactions occur, to monitor the database, and the like. For example: 1) GRANT: authorization. 2) ROLLBACK [WORK] TO [SAVEPOINT]: roll back to a savepoint; the ROLLBACK command returns the database to the state of the last commit, in the format: SQL> ROLLBACK; 3) COMMIT [WORK]: commit.
In database insert, delete and modify operations, a transaction takes effect only when it is committed to the database. Before the transaction is committed, only the person operating the database can see what has been done; others can see it only after the commit is completed. There are three types of commit: explicit commit, implicit commit and automatic commit, which are described below. (1) Explicit commit: a commit completed directly with the COMMIT command, in the format: SQL> COMMIT; (2) Implicit commit: a commit completed indirectly by an SQL command; these commands are ALTER, AUDIT, COMMENT, CONNECT, CREATE, DISCONNECT, DROP, EXIT, GRANT, NOAUDIT, QUIT, REVOKE and RENAME. (3) Automatic commit: if AUTOCOMMIT is set to ON, the system commits automatically after insert, modify and delete statements are executed, in the format: SQL> SET AUTOCOMMIT ON.
Fig. 2 is a schematic structural diagram of a data processing system according to an embodiment of the present application. As shown in fig. 2, the system includes: the MySQL database, the canal synchronization tool, the real-time synchronization framework, data access, data processing, data output, data verification and data repair, as illustrated by the architecture diagram.
in an embodiment, with reference to fig. 1 and fig. 2, a mysql synchronization tool canal masquerades a service instance as a mysql and acquires DML and DDL logs corresponding to a main library from a library node in real time, outputs the acquired logs to kafka, then outputs binlog data in kafka to different data storage components according to service usage scenarios, currently supports output to kudu, elastic search, tidb, and hdfs, and a service party can self-configure a synchronization task according to an analysis scenario.
In one embodiment, the data access phase writes data into the topic corresponding to kafka through the canal sync tool, including: the method comprises the steps of database permission application, white list configuration, topic mapping configuration, data format conversion and data writing.
In one embodiment, database permission application management may be set, specifically, a user applies for a read permission of a corresponding main library according to a library in which a table to be synchronized is located, so as to open a permission for canal to pull mysql binlog.
In one embodiment, a white list may be set, specifically, according to actual service requirements, the white list of the corresponding base table is opened, unnecessary base table data is shielded, and resource consumption in subsequent data processing is reduced.
In one embodiment, a topic mapping configuration can be set: a mapping relationship is automatically generated according to the white list configuration; the relationship mainly specifies which kafka topic the database table data finally falls into and can be specified per library or per table, and after the configuration is completed, automatic creation of the kafka topic can be supported.
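As an illustrative sketch only (class, method and parameter names are assumptions, not taken from the patent), such a mapping can be resolved by preferring a per-table rule over a per-library rule and creating the resolved kafka topic on first use:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.Collections;
import java.util.Map;

/** Resolves which kafka topic a binlog event for db.table should be written to (illustrative). */
public class TopicMappingResolver {
    private final Map<String, String> tableTopics;   // "db.table" -> topic (per-table rule)
    private final Map<String, String> libraryTopics; // "db"       -> topic (per-library rule)
    private final AdminClient admin;

    public TopicMappingResolver(Map<String, String> tableTopics,
                                Map<String, String> libraryTopics,
                                AdminClient admin) {
        this.tableTopics = tableTopics;
        this.libraryTopics = libraryTopics;
        this.admin = admin;
    }

    public String resolve(String database, String table) {
        // A table-level rule is more specific than a library-level rule.
        String topic = tableTopics.getOrDefault(database + "." + table, libraryTopics.get(database));
        if (topic == null) {
            throw new IllegalStateException("table not in white list: " + database + "." + table);
        }
        // Automatic creation of the topic; partition and replication values are placeholders, and
        // "topic already exists" errors surface only when the returned futures are inspected.
        admin.createTopics(Collections.singleton(new NewTopic(topic, 3, (short) 2)));
        return topic;
    }
}

A real deployment would cache the set of already-created topics instead of issuing a create request on every call.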
In one embodiment, a data format conversion may be provided to convert binlog data accessed by canal into json format according to object mode for better compatibility with the use of subsequent components.
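A minimal sketch of the json shape such a conversion might produce for one row change; the field names (database, table, type, data, old) are assumptions chosen to mirror common binlog-to-json layouts, not the patent's exact format:

import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.List;
import java.util.Map;

/** One binlog row change flattened into a json-friendly object; field names are illustrative. */
public class BinlogJsonConverter {
    public static class RowChangeEvent {
        public String database;
        public String table;
        public String type;                    // INSERT / UPDATE / DELETE / DDL
        public long executeTime;               // binlog execution timestamp, in milliseconds
        public List<Map<String, Object>> data; // column -> value after the change
        public List<Map<String, Object>> old;  // column -> value before the change (updates only)
    }

    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static String toJson(RowChangeEvent event) throws Exception {
        return MAPPER.writeValueAsString(event); // the string written to the kafka topic
    }
}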
In one embodiment, data writing may be supported, and specifically, the system writes data into a specified topic according to a mapping relationship between a white list and the topic.
In one embodiment, the data processing stage prepares for data output, and the specific processing flow includes: task creation, index or table creation, task approval, task starting and task running.
Step 201, selecting the library table to be synchronized and specifying the synchronization type, where the types include: stock plus real-time synchronization, stock-only synchronization and real-time-only synchronization. When synchronizing to es, an index name needs to be specified along with the mapping relationship between the mysql fields and the es fields, so that a relatively accurate mapper can be built, business query-by-index scenarios are supported, and query performance is improved; when synchronizing to tidb, the target cluster, library and table need to be specified.
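A hedged sketch of the configuration such a synchronization task might capture at creation time, covering the synchronization type, the es index name and mysql-to-es field mapping, and the tidb target; all names are illustrative assumptions:

/** Illustrative configuration captured when a synchronization task is created; all names are assumptions. */
public class SyncTaskConfig {
    public enum SyncType { STOCK_AND_REALTIME, STOCK_ONLY, REALTIME_ONLY }

    public String sourceDatabase;
    public String sourceTable;
    public SyncType syncType;

    // Target is es: the index name plus an explicit mysql-column -> es-field mapping,
    // so that a relatively accurate mapper can be built for query-by-index scenarios.
    public String esIndexName;
    public java.util.Map<String, String> mysqlToEsFieldMapping;

    // Target is tidb: the target cluster, library and table.
    public String tidbCluster;
    public String tidbDatabase;
    public String tidbTable;
}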
Step 202, creating an index or table, specifically, automatically creating a tidb table or an es index template according to the schema of the data to be synchronized.
And step 203, checking whether all preparation works of all parts are complete during data processing during task approval so as to ensure the accuracy of data synchronization, and rejecting illegal configuration or rejecting tasks which are not suitable for synchronization.
Step 204, after the task is approved and started, a set notification is published in zookeeper; after the data synchronization service monitors the set notification message published by zookeeper, it starts a new consumer thread according to the configured kafka topic information and pulls data at a set time interval or in a set batch size. When an exception occurs during processing, the data that was not successfully processed is sent to a temporary topic of kafka, and the task consumes the exceptional data again during retry, where the data exception includes but is not limited to a dependent-component service exception, a network exception and a cloud service exception. Eventual consistency of the data can be ensured in this way.
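The notification flow of step 204 can be sketched with the plain ZooKeeper client: the data synchronization service watches a task root node and starts a dedicated consumer thread for each newly published task. This is an assumption-laden illustration (the path and method names are not from the patent); the per-task consumer loop itself is sketched after the next paragraph.

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import java.util.List;

/** Watches a task root node and starts one consumer thread per published synchronization task (sketch). */
public class TaskWatcher implements Watcher {
    private final ZooKeeper zk;
    private final String taskRoot; // e.g. "/sync/tasks" (the path is an assumption)

    public TaskWatcher(ZooKeeper zk, String taskRoot) {
        this.zk = zk;
        this.taskRoot = taskRoot;
    }

    public void watch() throws Exception {
        // Registering this object as the watcher makes process() fire on child changes.
        List<String> tasks = zk.getChildren(taskRoot, this);
        for (String task : tasks) {
            byte[] config = zk.getData(taskRoot + "/" + task, false, null);
            // A real service would track tasks that are already running before starting a thread.
            new Thread(() -> startConsumerForTask(new String(config)), "sync-" + task).start();
        }
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeChildrenChanged) {
            try {
                watch(); // re-register the watch and pick up newly published tasks
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    private void startConsumerForTask(String taskConfigJson) {
        // Parse the configured kafka topic information and run the consumer loop
        // (see the poll-loop sketch after the next paragraph).
    }
}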
In one embodiment, the set time interval or batch size for pulling data may be, for example, a batch every 500 milliseconds or a batch of 2000 records; as will be appreciated by those skilled in the art, the time interval and the batch size can be adjusted accordingly.
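A sketch of the per-task consumer loop using the figures mentioned above (a 500 millisecond poll interval and batches of up to 2000 records): records that fail to be written to the target component are parked in a temporary retry topic so a retry pass can consume them again later. The topic names, the sink writer and the error handling are assumptions.

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

/** Pulls batches from the task topic and parks failed records in a temporary retry topic (sketch). */
public class SyncConsumer implements Runnable {
    private final KafkaConsumer<String, String> consumer;
    private final KafkaProducer<String, String> producer;
    private final String topic;
    private final String retryTopic;

    public SyncConsumer(String bootstrap, String groupId, String topic, String retryTopic) {
        Properties c = new Properties();
        c.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
        c.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
        c.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 2000); // batch size from the embodiment
        c.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        c.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        this.consumer = new KafkaConsumer<>(c);

        Properties p = new Properties();
        p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
        p.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        p.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        this.producer = new KafkaProducer<>(p);

        this.topic = topic;
        this.retryTopic = retryTopic;
    }

    @Override
    public void run() {
        consumer.subscribe(Collections.singletonList(topic));
        while (!Thread.currentThread().isInterrupted()) {
            // Poll roughly every 500 milliseconds, matching the interval mentioned above.
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> record : records) {
                try {
                    writeToTargetComponent(record.value()); // kudu / es / tidb / hdfs sink
                } catch (Exception e) {
                    // Dependent-component, network or cloud-service failures: park the record
                    // in the temporary topic so a retry pass can consume it again later.
                    producer.send(new ProducerRecord<>(retryTopic, record.key(), record.value()));
                }
            }
            consumer.commitSync();
        }
    }

    private void writeToTargetComponent(String json) {
        // Placeholder for the actual sink writer.
    }
}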
Step 205, monitoring the output logs in real time, and sending the number and content of exceptions to the set client when an abnormality occurs; when consuming kafka messages, if a sudden increase in traffic causes blocking, the parameters are dynamically adjusted and optimized according to the blocking condition.
In one embodiment, the storage components supported for data output include, but are not limited to, kudu, Elasticsearch, tidb and hdfs; the data output supports write and/or data repair functions.
In one embodiment, the data checking process may include: step a), checking the data of the previous set time period at the set time of each time period; and step b), outputting the inconsistent data and the reasons to a corresponding file to facilitate problem location and data repair. Specifically, the hourly data can be checked at the 5th minute of each hour, with the inconsistent data and reasons output to the corresponding files; the data of the previous time period may also be checked every half hour or every 40 minutes.
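Assuming a springboot service with scheduling enabled, the hourly check at the 5th minute can be expressed with a cron trigger; the comparison logic is left as a placeholder:

import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;
import java.time.LocalDateTime;

/** Checks the previous hour's data at the 5th minute of every hour (sketch; requires @EnableScheduling). */
@Component
public class HourlyConsistencyCheck {

    @Scheduled(cron = "0 5 * * * *") // second, minute, hour, day, month, weekday
    public void checkLastHour() {
        LocalDateTime end = LocalDateTime.now().withMinute(0).withSecond(0).withNano(0);
        LocalDateTime start = end.minusHours(1);
        // Compare source and target data for [start, end) and write mismatching
        // records plus the suspected reason to the corresponding repair file.
        compareAndDump(start, end);
    }

    private void compareAndDump(LocalDateTime start, LocalDateTime end) {
        // Placeholder: query mysql and the target component, diff, write to file.
    }
}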
In one embodiment, kafka may provide data query and consumption query functions. Specifically, the data query can look up the data corresponding to an offset through the offset, or look up the data in a segment through the approximate time at which it entered Kafka, so that the partition, offset, ingestion time, AVRO version information and so on of each message can be seen. The consumption query can show, for a given message, which consumer groups have consumed it and which have not; it can also show which IP is currently consuming it, making it easy to locate the machine on which a consumer has not been shut down. The consumption delay of each consumer group can also be seen, accurate to the number of messages and to the per-partition delay, together with the total number of messages in each partition, which helps diagnose uneven message distribution.
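The offset-based and time-based lookups can be sketched with the standard Kafka consumer API (offsetsForTimes translates an approximate ingestion time into an offset); the group id, partition and timeout values are illustrative:

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import java.util.UUID;

/** Looks up messages in a partition either by explicit offset or by approximate ingestion time (sketch). */
public class KafkaMessageLookup {

    public static ConsumerRecords<String, String> fetch(String bootstrap, String topic, int partition,
                                                        long offsetOrTimestamp, boolean byTimestamp) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "lookup-" + UUID.randomUUID());
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition(topic, partition);
            consumer.assign(Collections.singletonList(tp));

            long offset = offsetOrTimestamp;
            if (byTimestamp) {
                // Translate "roughly when the record entered Kafka" into an offset.
                Map<TopicPartition, OffsetAndTimestamp> found =
                        consumer.offsetsForTimes(Collections.singletonMap(tp, offsetOrTimestamp));
                OffsetAndTimestamp oat = found.get(tp);
                if (oat == null) {
                    return ConsumerRecords.empty(); // nothing at or after that time
                }
                offset = oat.offset();
            }
            consumer.seek(tp, offset);
            return consumer.poll(Duration.ofSeconds(2));
        }
    }
}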
In one embodiment, real-time monitoring and exception alarms can be used to learn the inflow and outflow of a Topic: how many messages are written per second, their size, and the outflow per second. For real-time monitoring alarms, traffic alarms or delay alarms are built for the Topic; the user only needs to subscribe once they are built, which makes it very convenient to operate and manage the handling of abnormal data.
In one embodiment, the abnormal data repair includes: according to the verification result, performing data repair on the data in the file by specifying the conditions of the data to be repaired.
In one embodiment, the synchronization services other than the canal open-source service are built with springboot; by dynamically monitoring the add and delete operations on zookeeper nodes, data synchronization tasks are published and taken offline, the source data is consumed from kafka, and the consumed data is output to the specified database component.
Fig. 3 is a flowchart of a data processing method according to another embodiment of the present application. As shown in fig. 3, the method includes:
step 301, after a synchronous task is created on a data deep learning di platform, calling an approval interface to write data into a distributed application program coordination service zookeeper;
step 302, the data synchronization service coordinates the change of a zookeeper node by monitoring a distributed application program, and determines whether to start a new synchronization task according to the change condition of the node;
step 303, the data monitoring service carries out timing monitoring on the data consistency of the new task;
step 304, after a new synchronization task is started, pulling data from a data source according to the input and output modes specified during configuration;
and 305, synchronizing the acquired data to the data storage component in real time, and creating stock data synchronization task repair data through a flow if data inconsistency occurs in the process of executing data synchronization.
The DI platform mainly serves big-data and cross-department data query services; a query is configured on the platform without writing query code, by providing the data table, query conditions, query fields, sorting fields and so on. The platform can determine the deployment node of the service according to the client's QPS requirement and security guarantee level, and provides rate limiting and circuit breaking to guarantee the stability of the service to the greatest extent.
The DI platform is mainly developed on a spring cloud system and deployed with k8s. The system mainly consists of three modules: a core module, a web management module and a client module. The client module is mainly responsible for providing the invocation interface and setting traffic-splitting parameters; the web management module is mainly responsible for creating, approving and authorizing the api services; the core module is mainly responsible for processing requests from clients, obtaining the definition of the client's apikey, splicing it into sql, and sending the sql through a self-developed data access component to each engine (mysql, tidb, clickhouse and the like) to query the data. The data processing system is used for synchronization: the original data is synchronized to the other engines in real time through the data synchronization platform, avoiding pressure on the service library. An interface is applied for directly on the page provided by the DI system, the service is published after approval, and the interface service becomes available once publication is finished. When the interface is created, approval guarantees that the call is legal; meanwhile, the common module has an auditing function that records the sql of each query and the returned data, ensuring that each query can be accounted for.
In one embodiment, Kafka is used to replicate data: a piece of data is transferred from the data source to Kafka and then from Kafka to other destinations, supporting both streaming and batch processing, for example synchronizing once every 5 minutes. The system supports copying between multiple systems; the data source can be MySQL, SQL Server or Oracle, and the sink can be HBase, Hive and so on. A pluggable interface is defined so that additional data sources and not-yet-supported sinks can be written by the user. The system runs distributed and in parallel, supports complete HA and load balancing, and provides a convenient RESTful interface.
Before the Kafka plug-in was available, operating and maintaining the ETL was very troublesome. For canal, both the canal servers and clients need to be deployed manually; if there are 1000 databases on 100 canal nodes, the administrator has to know which library tables run on which machines and on which machines newly added tasks should run. In addition, if MySQL modifies a field, a programmer has to check on the machine how the table was modified, and all downstream components have to complete the table-structure modification before the task can run again, so the response speed is very slow.
In one embodiment, if the upstream data is modified in a compatible way, Kafka will also make the corresponding compatible modification downstream and automatically change the downstream table structure, thereby reducing the operation and maintenance burden. Kafka stores all of this information in Kafka itself: the config topic stores metadata, the status topic indicates which jobs are currently running on which nodes, and the offset topic indicates which data of which partitions of a given topic is currently being consumed. The workers are stateless and run many tasks; a task1, for example, may correspond to 5 partitions, and if it is given three concurrent instances they will be distributed across three machines. If one machine goes down, these jobs are distributed to the other two machines and synchronized in real time.
In one embodiment, the Kafka plugin can be built on canal; native canal and Maxwell do not support AVRO, so Maxwell is modified following the Debezium idea to support the AVRO format, to manage metadata with MySQL, and to support MySQL database switching.
In one embodiment, for HDFS data, the Confluent HDFS plug-in is used, but it has quite a few problems of its own. For example, when writing hive, the column treated as the partition is also written into the main-table data; although this does not affect the use of hive, it does affect presto reading hive, so the source code is changed here to remove those columns from the main table. On plug-in restart, hdfs reads all files from HDFS to determine from which offset to continue, which has two problems: switching clusters takes too long, and the offset cannot be continued. The plugin supports partitioning by the Kafka timestamp when writing into hive, and also supports partitioning by certain columns in the data, but it does not support using both at the same time, so this is modified as well.
In one embodiment, the HBase plugin only supports the most basic export, while there are some special requirements. For example, when customizing the rowkey: the mysql primary key is usually an auto-increment ID, and HBase does not recommend using an auto-increment ID as the rowkey; there are requirements for reversing it, for combining multiple columns as the rowkey, and so on, so the source code is changed to support generating a customized rowkey through configuration. The original plugin does not support kerberos, while the online HBase is access-controlled; all types can be converted into strings and stored, delete is supported, json is supported, and so on.
In one embodiment, KUDU is used in many scenarios, and some bugs of the open-source Kudu plugin are addressed. The data source of Kudu is mysql, but mysql is frequently used for bulk library refreshes with a large data volume, so the KUDU sink had a large delay; the plugin is therefore changed to add adaptive flow control, automatically scaling the multithreaded processing up and scaling it back down when the traffic is small.
For data operations that need real-time updating, the data enters Kafka through the canal service and the maxwell service, and the processes are shared. Incremental data can be written into kudu in real time through the kudu plug-in, ETL is then done through impala, and the generated data provides T+0.1 or T+0.01 queries to the outside; the data in Kafka is also read directly through Flink for real-time ETL, improving real-time performance.
The data synchronization method based on big data technology and internet technology can help business department personnel access data quickly, saves the development and maintenance labor cost of traditional data synchronization, greatly improves working efficiency, can support the repair of abnormal data, meets the data synchronization requirements among various data components, and guarantees the timeliness of business decisions made from data.
In the prior art, the storage space of the big data HDFS cluster is huge, there are tens of thousands of data tables and ETL tasks, and manual processing is impossible, so a life cycle management system based on the HIVE table is provided for automatic processing. Resources based on the HIVE table are currently divided into storage resources and computing resources. The storage resources are stored on HDFS; space monitoring and alarms rely on the built-in management system, after which parameters are chosen manually and some of the tens of thousands of tables are deleted to release space. The computing resources are mainly concentrated in scheduled tasks, with about 6000+ tasks executed repeatedly every day; alarms rely on the resource monitoring system, followed by manual intervention, task analysis and cleaning of certain expired tasks. For storage-resource optimization, manual deletion selected from tens of thousands of tables has low coverage, releases limited space, occupies manpower, and the effect is not obvious. For computing-resource optimization, manually optimizing among tens of thousands of tasks requires a great deal of effort for screening and for resolving task dependencies, so the cost is high.
Fig. 4 shows a system state diagram of the data processing method according to an embodiment of the present application. As shown in fig. 4, the system further includes life cycle management configuration, rule engine translation, metadata warehouse construction and DLM processing. Specifically, by means of the DI-ETL system and a review of the entire data flow chain (analyzing ETL tasks, parsing mail SQL and adding buried-point logs, collecting and parsing the SQL executed by users through the HIVE hook, collecting and analyzing the logs of each report system, and so on), monitoring of the data terminals is achieved and full-link analysis of the entire data is realized.
Fig. 5 shows a flow chart of a data processing method according to another embodiment of the present application, and as shown in fig. 5, the method includes:
step 501, lifecycle management configuration.
Step 502, rules engine translation.
Step 503, constructing the element bin.
Step 504, data Lifecycle Management (DLM) processing.
In one embodiment, the lifecycle management configuration phase is configured in two steps, including: user page configuration and system timing scanning unified configuration.
The specific steps may include: step 5011, adding life cycle management configuration in a cluster table operation page of the deep learning DI system, and carrying out rule configuration according to the set requirement condition during table building operation by a user. Step 5012, automatically configuring the life cycle rule of the HIVE data table according to the preset rule in the configuration timing task of the deep learning DI system.
Specifically, user page configuration takes place in the cluster table operation page of the DI-ETL (DI data conversion) system, where life cycle management configuration is added; the user can configure rules as required during the table building operation, and otherwise default rules are applied. Default rule - offline task: the number of accesses in the last three months is 0 && the number of depended-on tasks is 0. For the timed system scan, the life cycle rules of the HIVE data tables are automatically configured in the configuration timing task of the DI-ETL system according to the rules agreed upon with the data warehouse.
In one embodiment, the following rule settings of table 1 may be made:
[Table 1 is provided as an image in the original publication and is not reproduced here.]
In one embodiment, the rule engine translation may include rule management and rule translation. Rule management means that when a user configures a life cycle rule, various tags and calculation expressions can be automatically combined, providing convenient operation and recording the user's configuration content.
In one embodiment, the rule engine provides the comparison operators equal to, greater than, less than, greater than or equal to, less than or equal to, not equal to, and IN, and provides the AND and OR relational operators and brackets, so that multiple expressions can be combined and related.
In one embodiment, rule translation means that the rule information configured by the user on the page is translated into the task syntax processed by subsequent modules, for example:
[[{"tag":"last_three_month_visit_num","op":"=","other":0,"tagType":"bigint","tagDict":"","enumValueList":[]},{"condition":"&&"},{"tag":"depended_task_num","op":"=","other":0,"tagType":"bigint","tagDict":"","enumValueList":[]}]]
In one embodiment, the metadata warehouse construction accesses business metadata for the first time. In the prior-art representation of data tables, only technical metadata is accessed, such as: the number of partitions in the table, partition size, number of data rows, and so on. Here, business metadata is also accessed, for example: how many ETL tasks use the table, whether those tasks are alive, and the frequency of use in mails and reports. Through a point-burying technical means, the opening rate of mails and reports is monitored, the survival of mails and reports is sensed, and the activity of the data table is further confirmed. Through the capture of business metadata and the link analysis of ETL tasks, a complete link portrait of each data table is built from data output to data conversion to data display at the terminal, finally identifying tables with zero utilization.
In one embodiment, the construction of the metadata warehouse is divided into two parts: data acquisition from the data terminals, and data cleaning according to the tag rules.
Table 2 shows the rules for data collection and warehousing for the data terminals.
[Table 2 is provided as an image in the original publication and is not reproduced here.]
In one embodiment, data cleaning is performed according to the tag rules: a metadata warehouse schedule (executed daily) is created in the DI-ETL system, the related ETL tasks are configured, and the data is cleaned according to the tags.
Fig. 6 shows a flowchart of a data processing method according to an embodiment of the present application. As shown in fig. 6, the DLM processing module serves as another core function of the architecture: distributed deployment is implemented through zookeeper, and zookeeper is used as the notification medium for message consistency to complete the issuing and execution of task computation. Because the core HDFS file deletion function is involved, in order to prevent mistaken deletion, files first enter the HDFS recycle bin and can be recovered within 7 days. For offline tasks, multiple judgments are made before a task is taken offline: whether the task has downstream dependencies and whether its portrait tags have been used recently; the task is taken offline only when both conditions are satisfied. A snapshot of the task's current blood-relationship (lineage) dependencies is taken, so that the dependency DAG can be quickly recovered when problems occur. A complete set of timely alarm notifications is provided, and for tables and tasks that the system cannot handle automatically, the data warehouse team is notified for manual processing. In the prior art, these contents are processed manually by the data warehouse team at regular intervals, which carries high risk, easily breaks task dependencies, and leaves HDFS files incompletely deleted. The DLM module is fully automatic, currently covering about 60% of HIVE tables, saving a large amount of manpower, and keeping searchable records.
In one embodiment, after being processed by the metadata warehouse, the data enters DLM for final processing; relevant operations are carried out according to the execution plan translated by the rule engine, and alarm prompts are provided. Besides the alarms, the historical execution status can be viewed on a corresponding page in the system.
In a specific embodiment, the whole service is built with springboot, and DLM distributed deployment is realized by dynamically monitoring the addition and deletion of zookeeper nodes. The metadata warehouse as a whole is built by relying on the internal DI-ETL scheduling system. The DLM executor performs the related processing according to the rules, finally realizing full-link analysis of the HIVE table data.
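The dynamic monitoring idea can be pictured with the rough sketch below, which uses the standard zookeeper client API to watch a task znode for added and deleted children; the znode path, connection string handling, and session timeout are assumptions made for illustration rather than the actual configuration of the system.

import java.util.List;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class DlmNodeWatcher {
    private static final String TASK_PATH = "/dlm/tasks"; // assumed znode path
    private final ZooKeeper zk;

    public DlmNodeWatcher(String connectString) throws Exception {
        this.zk = new ZooKeeper(connectString, 30_000, event -> { });
        watchTasks();
    }

    // Re-registers the children watch each time the task list changes,
    // so task nodes that are added or deleted are picked up dynamically.
    private void watchTasks() throws Exception {
        List<String> tasks = zk.getChildren(TASK_PATH, event -> {
            if (event.getType() == Watcher.Event.EventType.NodeChildrenChanged) {
                try {
                    watchTasks();
                } catch (Exception ignored) {
                }
            }
        });
        tasks.forEach(t -> System.out.println("task node: " + t)); // dispatch to a DLM executor here
    }
}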
The automatic resource optimization system for data processing, constructed on big data and internet technology, greatly reduces the cost of manual maintenance, further reduces cluster storage cost by 50%, automatically takes about 15% of tasks offline, and greatly relieves cluster computing pressure.
Fig. 7 is a block diagram of a data processing system according to an embodiment of the present application, and as shown in fig. 7, the system includes:
the obtaining module 401 is configured to disguise a service instance as a MySQL slave-library node through the canal service, so as to obtain the data manipulation language DML and data definition language DDL logs of the corresponding master library.
A sending module 402, connected to the obtaining module, configured to output the obtained data manipulation language DML and data definition language DDL logs to kafka.
And the processing module 403, connected to the sending module, is configured to synchronously output the binlog data in kafka to a set data storage component in a set manner according to a synchronization task configured for the usage scenario of a set service.
In one embodiment, the system further comprises: a task creation module configured to select the base tables that need to be synchronized and specify the synchronization type and synchronization method, wherein the synchronization type includes: stock plus real-time synchronization, stock-only synchronization, and real-time-only synchronization; when synchronizing to es, an index name is specified and the mapping relation between MySQL fields and es fields is specified; and when synchronizing to tidb, a target cluster, library, and table are specified.
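For illustration, a synchronization task created this way might be described by a configuration similar to the sketch below; the field names and values are hypothetical examples rather than the system's actual configuration format. A task targeting tidb would, under the same assumption, replace the es-specific fields with the target cluster, library, and table.

{
  "syncType": "stock_and_realtime",
  "sourceTable": "order_db.order_info",
  "target": "es",
  "esIndex": "order_info_idx",
  "fieldMapping": { "order_id": "orderId", "create_time": "createTime" }
}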
In one embodiment, the system further comprises a creation module configured to create an index or table to support automatic creation of a tidb table or es index template from the schema of the data to be synchronized.
In one embodiment, the system further comprises a task running module configured to: issue a setting notification in zookeeper after a task is started; after the data synchronization service monitors the setting notification message issued by zookeeper, start a new consumer thread according to the configured kafka topic information and pull data at a set time interval or in a set batch size; and when an exception occurs during processing, send the data that was not successfully processed to a temporary topic of kafka, so that the task re-consumes the exception data during retry, wherein the exception includes but is not limited to a dependent component service exception, a network exception, and a cloud service exception.
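The task running flow sketched above can be illustrated roughly as follows with the standard kafka client: a consumer loop pulls batches from the source topic and forwards records that fail processing to a temporary retry topic. The broker address, topic names, consumer group, batch size, and the writeToTarget stub are assumptions made for the example only.

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SyncWorker {
    public static void main(String[] args) {
        Properties cp = new Properties();
        cp.put("bootstrap.servers", "kafka:9092");   // assumed broker address
        cp.put("group.id", "sync-task-1");           // assumed consumer group per task
        cp.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        cp.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        cp.put("max.poll.records", "500");           // batch size pulled per poll

        Properties pp = new Properties();
        pp.put("bootstrap.servers", "kafka:9092");
        pp.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        pp.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cp);
             KafkaProducer<String, String> producer = new KafkaProducer<>(pp)) {
            consumer.subscribe(List.of("binlog_topic"));   // assumed source topic
            while (true) {
                for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(1))) {
                    try {
                        writeToTarget(rec.value());        // write to kudu/es/tidb/hdfs
                    } catch (Exception e) {
                        // records that fail are sent to a temporary topic and re-consumed on retry
                        producer.send(new ProducerRecord<>("binlog_topic_retry", rec.value()));
                    }
                }
            }
        }
    }

    static void writeToTarget(String json) {
        // target-specific write omitted in this sketch
    }
}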
In one embodiment, the system further comprises a real-time monitoring module for monitoring the output logs in real time and, when abnormal logs occur, sending the number and content of the abnormal logs to the set client; and, when consuming kafka messages, if blocking occurs due to a sudden increase in traffic, dynamically adjusting and optimizing the parameters according to the blocking condition.
In one embodiment, the system further comprises a task approval module which is configured to check whether all parts of preparation work are complete during data processing so as to ensure the accuracy of data synchronization.
In one embodiment, the task approval module is further configured to reject, upon review, data processing with an illegal configuration or a task that is not suitable for synchronization.
In one embodiment, the system further comprises a rights application module configured to write data into the corresponding kafka topic through the canal service synchronization tool, including: database permission application, white list configuration, topic mapping configuration, data format conversion, and data writing.
In one embodiment, the system further comprises a permission application module configured to apply for a read permission of the corresponding main library according to the library in which the table to be synchronized is located so as to open a permission that canal draws MySQL binlog.
In one embodiment, the system further comprises a white list module configured to open a white list of the corresponding table according to actual business requirements and shield unnecessary table data.
In one embodiment, the system further comprises a mapping configuration module configured to automatically generate a mapping relationship according to the white list configuration, the mapping relationship specifying which set topic of kafka the database table data falls into, where the mapping can be specified per library or per table; once the configuration is completed, automatic creation of the kafka topic is supported.
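As a rough illustration of this mapping, the sketch below resolves a library and table to a kafka topic, falling back from a per-table entry to a per-library entry; the mapping entries and the topic naming convention are hypothetical examples, not the system's actual configuration.

import java.util.Map;

public class TopicMapper {
    // Hypothetical mapping entries: "db.table -> topic" and "db.* -> topic"
    private static final Map<String, String> TABLE_TO_TOPIC = Map.of(
            "order_db.order_info", "binlog_order_db_order_info", // specified per table
            "user_db.*", "binlog_user_db");                      // specified per library

    static String topicFor(String db, String table) {
        String byTable = TABLE_TO_TOPIC.get(db + "." + table);
        return byTable != null ? byTable : TABLE_TO_TOPIC.get(db + ".*");
    }

    public static void main(String[] args) {
        System.out.println(topicFor("order_db", "order_info")); // binlog_order_db_order_info
        System.out.println(topicFor("user_db", "profile"));     // binlog_user_db
    }
}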
In one embodiment, the system further comprises a data format conversion module configured to convert the binlog data accessed through canal into json format according to an agreed object schema.
In one embodiment, the system further comprises a data writing module configured to write data into the specified topic according to the white list and the topic mapping relationship.
In one embodiment, the system further comprises a data output module configured to output data to storage components including, but not limited to, kudu, elastic search, tidb, and hdfs; the data output module supports write and/or data repair functionality.
In one embodiment, the system further includes an abnormal data repair module configured to perform data repair on the data in the file according to the verification result by specifying a condition of the data to be repaired.
In one embodiment, the system further comprises a data checking module configured to check the data of a previous set time period at a set time of the set time period; and outputting the inconsistent data and reasons to a corresponding file so as to facilitate problem location and data repair.
The present application further provides a data processing system comprising: a metadata warehouse module configured to construct the data processing service with springboot and build the metadata warehouse by relying on the DI dispatching system; a zookeeper module for receiving the task queue sent by the DI dispatching system and monitoring the changes of the nodes; and a data lifecycle management module configured to dynamically monitor changes to the zookeeper nodes so as to implement data lifecycle management (DLM) distributed deployment, the data lifecycle management DLM performing the related processing according to the set configuration rules to realize full-link analysis processing of the HIVE table data, wherein the processing includes but is not limited to synchronization processing.
In one embodiment, the lifecycle management configuration module comprises: the system comprises a page configuration unit and a timing scanning unit, wherein the page configuration unit is configured to add life cycle management configuration based on a Hive data table into a cluster table operation page of a deep learning DI system, and a user performs rule configuration according to a set requirement condition in table building operation; the timing scanning unit is configured to automatically configure the life cycle rule of the HIVE data table according to a preset rule in a configuration timing task of the deep learning DI system.
In one embodiment, the metadata warehouse module comprises a data acquisition unit and a data cleaning unit. The data acquisition unit, for data extraction-transformation-loading (ETL) tasks, collects them through task analysis and SQL parsing and reports them for warehousing via scheduled tasks; for the HIVE-HOOK, the SQL executed by users is intercepted by HIVE and parsed, and reported for warehousing via scheduled tasks; for the mail system, mail SQL is parsed, mail buried points report the opening data, which is warehoused by consuming kafka, and scheduled tasks report it for warehousing; for the report system, the SQL inside the report system is parsed, the log records are analyzed, and the results are reported for warehousing via scheduled tasks. And/or, the data cleaning unit is used to create the metadata warehouse schedule, configure the related ETL tasks, and clean the data according to the tags.
The data lifecycle management module is configured to implement distributed deployment through zookeeper and perform the issuing and execution of task computation by using zookeeper as a notification medium for message consistency.
And the rule engine translation module comprises a rule management unit and a rule translation unit. Specifically, the rule management unit is used to automatically combine various tags and calculation expressions when the user configures the life cycle rule, providing convenient operation and recording the user's configuration content; the rule translation unit is used to translate the rule information configured by the user on the page into the task syntax processed by subsequent modules.
The system also comprises a synchronization processing module configured to: build the synchronization service with springboot; dynamically monitor the addition and deletion of zookeeper nodes, and bring data synchronization tasks online and offline; and consume the source data from kafka according to the set rules for data synchronization, outputting the consumed data to the designated database component.
The execution flow and execution method of the system may refer to the above method steps and are not repeated in this section. The data synchronization system based on big data and internet technology helps business department personnel access data quickly, saves the labor, development, and maintenance costs of traditional data synchronization, greatly improves working efficiency, supports repair of abnormal data, meets data synchronization requirements among various data components, and guarantees the timeliness of business decisions made from the data.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the application are produced, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via a wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) connection. The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device such as a server or data center integrating one or more available media. The usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a Solid State Disk (SSD)), among others.
The various illustrative logical units and circuits described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a digital signal processor, an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but, in the alternative, the processor may be any conventional processor, control device, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a digital signal processor and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a digital signal processor core, or any other similar configuration.
The steps of a method or algorithm described in the embodiments herein may be embodied directly in hardware, in a software element executed by a processor, or in a combination of the two. The software elements may be stored in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. For example, a storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts in the embodiments are referred to each other. For the system embodiment, since it basically corresponds to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The method and system of the present invention may be implemented in a number of ways. For example, the methods and systems of the present invention may be implemented in software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustrative purposes only, and the steps of the method of the present invention are not limited to the order specifically described above unless specifically indicated otherwise. Furthermore, in some embodiments, the present invention may also be embodied as a program recorded in a recording medium, the program including machine-readable instructions for implementing a method according to the present invention. Thus, the present invention also covers a recording medium storing a program for executing the method according to the present invention.
The description of the present invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to practitioners skilled in this art. The embodiment was chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.

Claims (10)

1. A data processing method, comprising:
disguising a service instance as a MySQL slave-library node through the canal service, and acquiring the data manipulation language DML and data definition language DDL logs of the corresponding master library;
outputting the acquired data manipulation language DML and data definition language DDL logs to kafka;
and synchronously outputting binlog data in the kafka to a set data storage component in a setting mode according to a synchronization task configured according to the use scene of the setting service.
2. The method of claim 1, wherein the task of synchronizing the configuration of the usage scenario of the setting service comprises:
selecting a base table needing synchronization when a task is created, and specifying a synchronization type and a synchronization method, wherein the synchronization type comprises: stock plus real-time synchronization, stock-only synchronization, and real-time-only synchronization;
creating an index or a table to support automatic creation of a tidb table or an es index template according to a schema of data to be synchronized;
wherein,
when the fields are synchronized to es, an index name is appointed and the mapping relation between the MySQL field and the es field is appointed;
and when the synchronization is carried out to the tidb, a target cluster, a library and a table are specified.
3. The method of claim 2, wherein synchronously outputting binlog data in the kafka to a configured data storage component in a configured manner comprises:
after a task is started, a setting notification is issued in a zookeeper, a new consumer thread is started according to configured kafka topic information after a data synchronization service monitors the setting notification message issued by the zookeeper, data is pulled at a set time interval or in a set batch size, when an exception occurs in the processing process, data that was not successfully processed is sent to a temporary topic of kafka, and the task re-consumes the exception data during retry, wherein the data exception includes but is not limited to a dependent component service exception, a network exception, and a cloud service exception;
monitoring the output logs in real time, and sending the abnormal number and content to a set client when abnormality occurs; and when the kafka message is consumed and if the traffic volume suddenly increases to cause the blocking, dynamically adjusting and optimizing the parameters according to the blocking condition.
4. The method according to any one of claims 1-3, further comprising:
writing data into the corresponding kafka topic through the canal service synchronization tool, comprising the following steps: applying for database permission, configuring a white list, configuring topic mapping, converting data formats, and writing data;
and/or
The user applies for the read permission of the corresponding main library according to the library in which the table to be synchronized is located so as to open the permission of canal for pulling the MySQL binlog;
and/or
Opening a white list of a corresponding base table according to actual business requirements, and shielding unnecessary base table data;
and/or
Automatically generating a mapping relation according to white list configuration, wherein the mapping relation comprises a set theme which the database table data falls into kafka, the set theme can be respectively specified according to a library or a table, and the kafka topic can be automatically created after the configuration is finished;
and/or
Converting binlog data accessed by canal into json format according to an object mode;
and/or
And writing the data into the designated topic according to the mapping relation between the white list and the topic.
5. The method of claim 4, further comprising:
the data output supports storage components including, but not limited to, kudu, elastic search, tidb, and hdfs;
and/or
The data output supports the functions of writing and/or data repair;
and/or
The method further comprises the following steps:
according to the verification result, data restoration is carried out on the data in the file through the condition of appointing the data to be restored;
and/or
The method further comprises the following steps: checking the data of the last set time period at the set time of the set time period; outputting inconsistent data and reasons to corresponding files so as to facilitate problem location and data repair;
and/or
When the task is approved, whether preparation work of each part is complete or not is checked during data processing so as to ensure the accuracy of data synchronization;
and/or
Upon review, data processing with an illegal configuration or a task that is not suitable for synchronization is rejected.
6. A data processing method, characterized by further comprising:
adopting springboot to construct data processing service, and constructing a metadata bin by relying on a DI dispatching system;
receiving, by the zookeeper, the task queue sent by the DI dispatching system;
dynamically monitoring the changes of the zookeeper node to realize Data Lifecycle Management (DLM) distributed deployment;
and the data lifecycle management DLM performs related processing according to the set configuration rule to realize full link analysis processing of the HIVE table data, wherein the processing comprises but is not limited to synchronous processing.
7. The method of claim 6, wherein the processing includes, but is not limited to, synchronization processing, further comprising:
adopting springboot to construct synchronous service;
the addition and deletion of zookeeper nodes are dynamically monitored, and data synchronization tasks are brought online and offline;
the source data is consumed from kafka according to set rules for data synchronization and the consumed data is output to a designated database component.
8. The method of claim 6, wherein setting the configuration rule comprises:
adding life cycle management configuration based on the Hive data table into a cluster table operation page of the deep learning DI system, and carrying out rule configuration by a user according to a set requirement condition in table building operation;
and automatically configuring the life cycle rule of the HIVE data table according to a preset rule by a configuration timing task of the deep learning DI system.
9. The method according to any one of claims 6-8, comprising:
the construction of the metadata bin comprises data acquisition of the data terminals and data cleaning according to the tag rules;
wherein,
the data acquisition binning comprises:
for data extraction-transformation-loading (ETL) tasks, collecting them through task analysis and SQL parsing, and reporting them for warehousing by means of scheduled tasks;
for the HIVE-HOOK, intercepting and parsing the SQL executed by users via HIVE, and reporting it for warehousing by means of scheduled tasks;
for the mail system, parsing mail SQL, reporting the opening data through mail buried points, warehousing by consuming kafka, and reporting for warehousing by means of scheduled tasks;
for the report system, parsing the SQL in the report system, analyzing the log records, and reporting them for warehousing by means of scheduled tasks;
and/or
Creating a metadata bin for scheduling, configuring related ETL tasks, and cleaning data according to the tags;
and/or
Distributed deployment is achieved through the zookeeper, the zookeeper is used as a notification medium of message consistency, and issuing and execution of task calculation are executed.
10. A data processing system, comprising:
the acquisition module is configured to disguise a service instance as a MySQL slave-library node through the canal service, and acquire the data manipulation language DML and data definition language DDL logs of the corresponding master library;
the sending module is connected with the acquiring module and is configured to output the acquired data manipulation language DML and data definition language DDL logs to kafka;
and a processing module, connected to the sending module, configured to synchronously output the binlog data in kafka to a set data storage component in a set manner according to a synchronization task configured for the usage scenario of a set service.
CN202110875696.2A 2021-07-30 2021-07-30 Data processing method and system Pending CN115374102A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110875696.2A CN115374102A (en) 2021-07-30 2021-07-30 Data processing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110875696.2A CN115374102A (en) 2021-07-30 2021-07-30 Data processing method and system

Publications (1)

Publication Number Publication Date
CN115374102A true CN115374102A (en) 2022-11-22

Family

ID=84060524

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110875696.2A Pending CN115374102A (en) 2021-07-30 2021-07-30 Data processing method and system

Country Status (1)

Country Link
CN (1) CN115374102A (en)


Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115952228A (en) * 2023-03-09 2023-04-11 山东浪潮超高清智能科技有限公司 Method and system for connecting Impala to database connection pool
CN115952228B (en) * 2023-03-09 2023-06-20 山东浪潮超高清智能科技有限公司 Method and system for connecting database connection pool with Impala
CN116881371A (en) * 2023-09-07 2023-10-13 北京逐风科技有限公司 Data synchronization method, device, equipment and storage medium
CN116881371B (en) * 2023-09-07 2023-11-14 北京逐风科技有限公司 Data synchronization method, device, equipment and storage medium
CN116955427A (en) * 2023-09-18 2023-10-27 北京长亭科技有限公司 Method and device for processing real-time multi-rule dynamic expression data based on Flink frame
CN116955427B (en) * 2023-09-18 2023-12-15 北京长亭科技有限公司 Method and device for processing real-time multi-rule dynamic expression data based on Flink frame
CN117198474A (en) * 2023-11-06 2023-12-08 天河超级计算淮海分中心 Medical image data real-time acquisition method, system, electronic equipment and storage medium
CN117198474B (en) * 2023-11-06 2024-03-01 天河超级计算淮海分中心 Medical image data real-time acquisition method, system, electronic equipment and storage medium
CN117290384A (en) * 2023-11-27 2023-12-26 同方赛威讯信息技术有限公司 Graphic and text retrieval system and method based on combination of big data and computer vision
CN117290384B (en) * 2023-11-27 2024-02-02 同方赛威讯信息技术有限公司 Graphic and text retrieval system and method based on combination of big data and computer vision
CN117390030B (en) * 2023-12-12 2024-03-08 北京仁科互动网络技术有限公司 Multidimensional parameter mapping configuration method and device and electronic equipment

Similar Documents

Publication Publication Date Title
CN115374102A (en) Data processing method and system
CA2977042C (en) System and method for generating an effective test data set for testing big data applications
US20200151202A1 (en) Data relationships storage platform
CN108536761B (en) Report data query method and server
WO2018178641A1 (en) Data replication system
CN105243528A (en) Financial IT system graphical centralized reconciliation system and method under big data environment
CN107103064B (en) Data statistical method and device
CN103514223A (en) Data synchronism method and system of database
US20210081358A1 (en) Background dataset maintenance
CN111190892B (en) Method and device for processing abnormal data in data backfilling
US11615076B2 (en) Monolith database to distributed database transformation
CN114925045A (en) PaaS platform for large data integration and management
CN111966692A (en) Data processing method, medium, device and computing equipment for data warehouse
GB2534374A (en) Distributed System with accelerator-created containers
US9489423B1 (en) Query data acquisition and analysis
CN109213826A (en) Data processing method and equipment
US11567957B2 (en) Incremental addition of data to partitions in database tables
CN110555065A (en) Data processing method and device
CN113568892A (en) Method and equipment for carrying out data query on data source based on memory calculation
CN113407601A (en) Data acquisition method and device, storage medium and electronic equipment
CN113553320B (en) Data quality monitoring method and device
CN112561368B (en) Visual performance calculation method and device for OA approval system
Chen et al. Design and implementation of digital big data analysis platform based on substation maintenance full link
CN116450719A (en) Data processing system and method
Curtis A Comparison of Real Time Stream Processing Frameworks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination