CN111522805B - Distributed batch data cleaning method and system - Google Patents

Publication number
CN111522805B
Authority
CN
China
Prior art keywords
cleaning
cleaned
message queue
data
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010325609.1A
Other languages
Chinese (zh)
Other versions
CN111522805A (en
Inventor
肖慧闵
孙中军
周宝琛
林楷坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bank of China Ltd
Original Assignee
Bank of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bank of China Ltd filed Critical Bank of China Ltd
Priority to CN202010325609.1A priority Critical patent/CN111522805B/en
Publication of CN111522805A publication Critical patent/CN111522805A/en
Application granted granted Critical
Publication of CN111522805B publication Critical patent/CN111522805B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2282Tablespace storage structures; Management thereof
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2457Query processing with adaptation to user needs
    • G06F16/24578Query processing with adaptation to user needs using ranking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a distributed batch data cleaning method and system. The method comprises the following steps: a master node sends a cleaning instruction for each table to be cleaned to a first message queue; each remote child node corresponding to the master node reads the cleaning instruction of one table to be cleaned from the first message queue, generates one or more single-table cleaning instructions according to how that table is split across databases and tables, and sends them to a second message queue; the one or more cleaning nodes corresponding to each remote child node read a single-table cleaning instruction from the second message queue, clean the data of that single table, and return the single-table cleaning result to a third message queue; the remote child node reads the single-table cleaning results from the third message queue, summarizes them into a cleaning result for each table to be cleaned, and returns that result to a fourth message queue; and the master node reads the cleaning result of each table to be cleaned from the fourth message queue. Because a plurality of nodes clean data at the same time, data cleaning efficiency is greatly improved.

Description

Distributed batch data cleaning method and system
Technical Field
The invention relates to the field of data cleaning, in particular to a distributed batch data cleaning method and system.
Background
This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
For business systems, growing business volume leads to an increasing number of data tables stored in databases. A large number of historical data tables occupies too much storage space and degrades the performance of the business system, so historical data tables need to be cleaned regularly. In business systems with complex scenarios, tables with larger data volumes are often split across multiple databases and multiple tables.
In the existing data cleaning method, an application program is deployed on a single machine and cleans each data table serially, without considering tables that are split across databases and tables. This approach is inefficient, cannot clean associated data across databases, and cannot cope with business data cleaning in complex scenarios.
In view of the above problems, no effective solution has been proposed at present.
Disclosure of Invention
The embodiment of the invention provides a distributed batch data cleaning method, which is used for solving the technical problems that the data cleaning method adopted in the prior art is deployed on a single machine and cleans serially, has low efficiency, and cannot clean associated data across databases. The method comprises the following steps: the master node sends a cleaning instruction for each table to be cleaned to a first message queue, wherein the master node corresponds to a plurality of remote child nodes; the remote child node reads the cleaning instruction of one table to be cleaned from the first message queue, generates one or more single-table cleaning instructions according to how that table is split across databases and tables, and sends the one or more single-table cleaning instructions to the second message queue, wherein each remote child node corresponds to one or more cleaning nodes; the cleaning node reads a single-table cleaning instruction of the table to be cleaned from the second message queue, cleans the data of that single table, and returns the single-table cleaning result to the third message queue; the remote child node reads the cleaning results of each single table from the third message queue, summarizes the cleaning result of each table to be cleaned, and returns it to the fourth message queue; and the master node reads the cleaning result of each table to be cleaned from the fourth message queue.
The embodiment of the invention also provides a distributed batch data cleaning system, which is used for solving the technical problems that the data cleaning method adopted in the prior art is deployed on a single machine and cleans serially, has low efficiency, and cannot clean associated data across databases. The system comprises: a master node module, remote child node modules and cleaning node modules; the master node module corresponds to a plurality of remote child node modules; each remote child node module corresponds to one or more cleaning node modules; each remote child node module comprises a main table cleaning module; each cleaning node module comprises a data cleaning module; the master node module is used for sending the cleaning instruction of each table to be cleaned to the first message queue; the remote child node module is used for reading the cleaning instruction of one table to be cleaned from the first message queue, generating one or more single-table cleaning instructions according to how that table is split across databases and tables, and sending the one or more single-table cleaning instructions to the second message queue; the cleaning node module is used for reading a single-table cleaning instruction of the table to be cleaned from the second message queue, cleaning the data of that single table, and returning the single-table cleaning result to the third message queue; the remote child node module is further used for reading the cleaning results of the single tables from the third message queue, summarizing the cleaning result of each table to be cleaned, and returning it to the fourth message queue; the master node module is further used for reading the cleaning result of each table to be cleaned from the fourth message queue.
The embodiment of the invention also provides computer equipment which is used for solving the technical problems that the data cleaning method adopted in the prior art is single-machine deployment and serial cleaning, has low efficiency and can not realize the cleaning of the related data of the cross database.
The embodiment of the invention also provides a computer readable storage medium for solving the technical problems that the data cleaning method adopted in the prior art is single-machine deployment and serial cleaning, has low efficiency and can not realize the cleaning of the related data of the cross database, and the computer readable storage medium stores a computer program for executing the distributed batch data cleaning method.
In the embodiment of the invention, the master node corresponds to a plurality of remote child nodes, and each remote child node corresponds to one or more cleaning nodes. After the master node sends the cleaning instruction of each table to be cleaned to the first message queue, each remote child node corresponding to the master node reads the cleaning instruction of one table to be cleaned from the first message queue, generates one or more single-table cleaning instructions according to how that table is split across databases and tables, and sends them to the second message queue. Each cleaning node corresponding to a remote child node then reads a single-table cleaning instruction from the second message queue, cleans the data of that single table, and returns the single-table cleaning result to the third message queue. The remote child node reads the single-table cleaning results from the third message queue, summarizes the cleaning result of each table to be cleaned, and returns it to the fourth message queue. Finally, the master node reads the cleaning result of each table to be cleaned from the fourth message queue.
According to the embodiment of the invention, the data is cleaned by the plurality of nodes, so that the data cleaning efficiency is greatly improved; the method is applicable to data cleaning under different service scenes.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art. In the drawings:
FIG. 1 is a flowchart of a method for cleaning distributed batch data according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a specific implementation architecture of a distributed batch data cleaning method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a process for cleaning data of a table to be cleaned according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a distributed batch data cleaning system according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the embodiments of the present invention will be described in further detail with reference to the accompanying drawings. The exemplary embodiments of the present invention and their descriptions herein are for the purpose of explaining the present invention, but are not to be construed as limiting the invention.
In the description of the present specification, the terms "comprising," "including," "having," "containing," and the like are open-ended terms, meaning including, but not limited to. Reference to the terms "one embodiment," "a particular embodiment," "some embodiments," "for example," etc., means that a particular feature, structure, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. The sequence of steps involved in the embodiments is used to schematically illustrate the practice of the present application, and is not limited thereto and may be appropriately adjusted as desired.
An embodiment of the present invention provides a distributed batch data cleaning method. FIG. 1 is a flowchart of the method; as shown in FIG. 1, the method includes the following steps:
S101, the master node sends the cleaning instruction of each table to be cleaned to a first message queue, wherein the master node corresponds to a plurality of remote child nodes.
Optionally, the first message queue in the embodiment of the present invention adopts a Redis message queue.
S102, the remote child nodes read a cleaning instruction of a table to be cleaned from the first message queue, generate one or more single table cleaning instructions of the table to be cleaned according to the database and table division conditions of the table to be cleaned, and send the one or more single table cleaning instructions to the second message queue, wherein each remote child node corresponds to one or more cleaning nodes.
Optionally, the second message queue in the embodiment of the present invention adopts a Redis message queue.
It should be noted that, when executing the above S102, the method for cleaning distributed batch data provided in the embodiment of the present invention may further include the following steps: the remote child node judges whether the table to be cleaned has an associated table according to a cleaning instruction of the table to be cleaned, and performs data cleaning on the associated table of the table to be cleaned before performing data cleaning on the table to be cleaned under the condition that the table to be cleaned has the associated table.
It should be noted that when the data is cleaned up for the association table, one or more single table cleaning instructions of the association table are generated according to the database and table division conditions of the association table and sent to the second message queue, where each remote child node corresponds to one or more cleaning nodes.
In the embodiment of the invention, database and table splitting refers to the situation in which one data table is divided into a plurality of sub-tables that are stored in different databases.
S103, the cleaning node reads a single-table cleaning instruction of the table to be cleaned from the second message queue, cleans the data of that single table, and returns the single-table cleaning result to the third message queue.
It should be noted that, in the embodiment of the present invention, a single-table cleaning instruction is an instruction for cleaning one single table in one database. For example, suppose tables A and B are to be cleaned, table A is split into 10 single tables, and table B is not split by database or table. The master node sends 2 table cleaning instructions to the remote child nodes (a first instruction for cleaning the data of table A and a second instruction for cleaning the data of table B); the remote child node that receives the first instruction sends 10 single-table cleaning instructions to the cleaning nodes, and the remote child node that receives the second instruction sends 1 single-table cleaning instruction to the cleaning nodes. When SQL is used to clean database tables, the single-table cleaning instruction may consist of read and write SQL statements: the data to be cleaned is queried with the read SQL statement, and the queried data is deleted with the write SQL statement.
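The fan-out in the table A / table B example can be sketched as follows. This is a minimal illustration rather than the patented implementation: the class names, the `SHARD_MAP` configuration, the `db0`/`db1` database names and the `A_0` ... `A_9` single-table names are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableInstruction:
    """Instruction the master node sends for one logical table (hypothetical shape)."""
    table: str


@dataclass(frozen=True)
class SingleTableInstruction:
    """Instruction for one single table in one database (hypothetical shape)."""
    database: str
    table: str


# Hypothetical split configuration: logical table -> (database, single table) pairs.
# Table A is split into 10 single tables; table B is not split, so it has no entry.
SHARD_MAP = {
    "A": [(f"db{i % 2}", f"A_{i}") for i in range(10)],
}


def expand(instr: TableInstruction, default_db: str = "db0") -> list[SingleTableInstruction]:
    """Turn one table instruction into one instruction per single table."""
    shards = SHARD_MAP.get(instr.table, [(default_db, instr.table)])
    return [SingleTableInstruction(db, tbl) for db, tbl in shards]


instructions_a = expand(TableInstruction("A"))  # 10 single-table instructions
instructions_b = expand(TableInstruction("B"))  # 1 single-table instruction
```

In a deployment along the lines described here, each returned instruction would be serialized and pushed to the second message queue; that step is omitted to keep the sketch self-contained.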
Optionally, the third message queue in the embodiment of the present invention adopts a Redis message queue.
In the case that the table to be cleaned has an association table, the step S103 may further include the following steps: when the association table and the table to be cleaned are in the same database, the cleaning node directly inquires the data to be deleted in the association table in the database where the table to be cleaned is located; and when the association table and the table to be cleaned are not in the same database, the cleaning node inquires the association field from the database where the table to be cleaned is located, and inquires the data to be deleted in the association table from the database where the association table is located according to the association field.
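The two lookup paths can be sketched with two in-memory SQLite databases standing in for the real databases. This is an illustration under stated assumptions: the function name is hypothetical, the table and column names are borrowed from the SQL example in this description, and the `'Y'` filter conditions are left out to keep the sketch short.

```python
import sqlite3


def find_association_rows(main_db, assoc_db, cutoff):
    """Return primary keys of TABLE_R rows associated with expired TABLE_M rows."""
    if main_db is assoc_db:
        # Same database: a single joined query finds the rows to delete.
        rows = main_db.execute(
            "SELECT r.COL_D2 FROM TABLE_M m JOIN TABLE_R r ON m.COL_A1 = r.COL_A2 "
            "WHERE m.COL_F < ?", (cutoff,)).fetchall()
        return sorted(r[0] for r in rows)
    # Different databases: first read the association field from the main table ...
    keys = [r[0] for r in main_db.execute(
        "SELECT COL_A1 FROM TABLE_M WHERE COL_F < ?", (cutoff,))]
    # ... then look up the matching rows in the association table's database.
    ids = []
    for key in keys:
        ids += [r[0] for r in assoc_db.execute(
            "SELECT COL_D2 FROM TABLE_R WHERE COL_A2 = ?", (key,))]
    return sorted(ids)


# Demo data: TABLE_M and TABLE_R live in different databases.
main_db = sqlite3.connect(":memory:")
assoc_db = sqlite3.connect(":memory:")
main_db.execute("CREATE TABLE TABLE_M (COL_A1 INTEGER, COL_F TEXT)")
main_db.executemany("INSERT INTO TABLE_M VALUES (?, ?)",
                    [(1, "2019-01-01"), (2, "2021-01-01")])
assoc_db.execute("CREATE TABLE TABLE_R (COL_D2 INTEGER PRIMARY KEY, COL_A2 INTEGER)")
assoc_db.executemany("INSERT INTO TABLE_R VALUES (?, ?)", [(10, 1), (11, 1), (12, 2)])

cross_db_ids = find_association_rows(main_db, assoc_db, "2020-01-01")
```

Dates are stored as ISO-8601 text here so that string comparison matches date order; a production database would normally use a native date type.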
S104, the remote sub-node reads the cleaning results of each single table from the third message queue, gathers the cleaning results of each table to be cleaned, and returns the cleaning results of each table to be cleaned to the fourth message queue.
Optionally, the fourth message queue in the embodiment of the present invention adopts a Redis message queue.
In one embodiment, the method for cleaning distributed batch data provided in the embodiment of the present invention may further include the following steps: and the remote child node records and updates the data cleaning log of each table to be cleaned.
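The summarizing step performed by the remote child node might look like the sketch below. The result tuple layout `(table, single_table, rows_deleted, ok)` and the shape of the returned report are assumptions; the description does not specify a message format.

```python
from collections import defaultdict


def summarize(single_results, expected):
    """Group single-table results by logical table and sum the deleted rows.

    single_results: iterable of (table, single_table, rows_deleted, ok)
    expected: dict mapping table -> number of single tables it was split into
    """
    summary = defaultdict(lambda: {"reported": 0, "rows_deleted": 0, "failed": []})
    for table, single_table, rows_deleted, ok in single_results:
        entry = summary[table]
        entry["reported"] += 1
        entry["rows_deleted"] += rows_deleted
        if not ok:
            entry["failed"].append(single_table)
    # A table counts as complete only when every single table reported success.
    return {
        table: {
            "rows_deleted": entry["rows_deleted"],
            "complete": entry["reported"] == expected[table] and not entry["failed"],
            "failed": entry["failed"],
        }
        for table, entry in summary.items()
    }


results = [("A", "A_0", 5, True), ("A", "A_1", 7, True), ("B", "B", 3, False)]
report = summarize(results, {"A": 2, "B": 1})
```

The per-table entries of `report` are what would be written to the cleaning log and returned through the fourth message queue in this sketch.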
S105, the master node reads the cleaning result of each table to be cleaned from the fourth message queue.
As can be seen from the above, in the distributed batch data cleaning method provided in the embodiment of the present invention, after the master node sends the cleaning instruction of each table to be cleaned to the first message queue, each remote child node corresponding to the master node reads the cleaning instruction of one table to be cleaned from the first message queue, generates one or more single-table cleaning instructions according to how that table is split across databases and tables, and sends them to the second message queue. Each cleaning node corresponding to a remote child node then reads a single-table cleaning instruction from the second message queue, cleans the data of that single table, and returns the single-table cleaning result to the third message queue. The remote child node reads the single-table cleaning results from the third message queue, summarizes the cleaning result of each table to be cleaned, and returns it to the fourth message queue. Finally, the master node reads the cleaning result of each table to be cleaned from the fourth message queue.
According to the distributed batch data cleaning method provided by the embodiment of the invention, the data is cleaned by a plurality of nodes, so that the data cleaning efficiency is greatly improved; the method is applicable to data cleaning under different service scenes.
In one embodiment, the distributed batch data cleaning method provided in the embodiment of the present invention may further include the following steps: the cleaning node judges, according to the single-table cleaning instruction, whether data of the single table to be cleaned needs to be backed up; if so, it reads the data to be backed up from the single table according to the pre-configured backup fields and sends that data to a fifth message queue. Because some data in some tables of some databases needs to be backed up, configuring backup fields allows the corresponding field data to be backed up before the single-table data is cleaned.
Optionally, the fifth message queue in the embodiment of the present invention adopts a Kafka message queue. The purpose of flexibly backing up data can be realized through the Kafka message queue.
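A minimal sketch of the read, back up, then delete sequence on one single table is given below. It uses SQLite and an in-process `queue.Queue` as stand-ins for the real database and the Kafka queue; the function name is hypothetical, and the table and column names follow the TableA example in this description.

```python
import queue
import sqlite3


def clean_single_table(conn, backup_queue, cutoff, backup=True):
    """Read expired rows, back up the configured fields, then delete by primary key."""
    rows = conn.execute(
        "SELECT colA, colB, colC, colD FROM TableA WHERE colA < ?", (cutoff,)
    ).fetchall()
    for col_a, col_b, col_c, col_d in rows:
        if backup:
            # Stand-in for publishing to the fifth (Kafka) message queue.
            backup_queue.put({"colA": col_a, "colB": col_b, "colC": col_c})
        conn.execute("DELETE FROM TableA WHERE colD = ?", (col_d,))
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TableA (colA TEXT, colB TEXT, colC TEXT, colD INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO TableA VALUES (?, ?, ?, ?)",
                 [("2019-05-01", "x", "y", 1), ("2021-05-01", "p", "q", 2)])
backups = queue.Queue()
deleted = clean_single_table(conn, backups, "2020-01-01")
```

Backing up each row before deleting it means a failure between the two steps can leave a duplicate backup record but not a lost row; whether that trade-off suits a given system is a design decision outside this sketch.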
FIG. 2 is a schematic diagram of a specific implementation architecture of a distributed batch data cleaning method according to an embodiment of the present invention. As shown in FIG. 2, the architecture mainly includes: a master node unit, a remote child node unit, an association table data cleaning unit and a single table data cleaning unit. It should be noted that these four units may be deployed on the same machine or on different machines. The units are described one by one below:
(I) Master node unit:
This unit configures the cleaning rules (including but not limited to whether the table is split by table or by database, whether backup is required, the cleaning frequency, the cleaning conditions, etc.). According to the configuration, the master node partitions the work by the names of the tables to be cleaned (i.e., divides the overall cleaning task into one cleaning task per table) and sends data cleaning instructions to a message queue (e.g., a Redis message queue).
(II) Remote child node unit:
This unit further analyzes and splits the table to be cleaned and schedules the flow; no actual cleaning operation is executed in this unit. After receiving a data cleaning instruction from the master node unit, the remote child node unit divides the cleaning of the table to be cleaned into three steps: (1) cleaning the association tables: according to the single tables obtained after the table to be cleaned is split by database and table, query all association relations, partition again for remote execution with each association table in a single database as the granularity, wait for the cleaning result message of each association table data cleaning operation, and enter the next step after all of them are finished; (2) cleaning the table data: according to the single tables obtained after the split, partition again for remote execution with a single table in a single database as the granularity, wait for the cleaning result of each table data cleaning operation, and enter the next step after all of them are finished; (3) updating the cleaning log: count the results, then summarize and record the process log to the database.
(III) Association table data cleaning unit:
This unit cleans the association table data. After receiving an association table data cleaning instruction from the remote child node unit, it parses the configured association relation and judges whether the association table and the table to be cleaned are in the same database. If they are, it directly splices the SQL of the association relation within that database and then generates the query and delete SQL (also called read and write SQL) according to the configured filtering and backup settings. If they are not in the same database, the association relation must be parsed further: the association field of the data to be cleaned is queried from the main table, and the retrieved association field values together with the parsed conditions are then used to delete the data from the database where the association table is located. If backup is configured, the field content required for backup is queried according to the configuration and sent to the Kafka message queue before the data is deleted.
(IV) Single table data cleaning unit:
After receiving a data cleaning instruction for a single table of the table to be cleaned, this unit generates the query and delete SQL according to the configured filtering and backup settings and completes the cleaning; the content of the backup fields is sent to the Kafka message queue.
Taking SQL as an example, the data cleaning of the table to be cleaned and the associated table is described below:
(1) Cleaning table data:
Requirement: clear the data in table TableA whose colA is earlier than date1;
the backup fields are colA, colB and colC, and the primary key is colD;
the cleaning node performs the following procedure:
Read: fetch the rows from the database through a cursor by executing
select colA, colB, colC, colD from TableA where colA < date1
Process: send the data of the backup fields colA, colB and colC to Kafka
Write: delete the data by primary key colD:
delete from TableA where colD = :id
(2) Cleaning the association table in the same database:
Requirement: clear the data in TABLE_M whose COL_F is earlier than date1;
the associated data in its association table is cleaned up first.
The association between TABLE_M and TABLE_R is TABLE_M.COL_A1 = TABLE_R.COL_A2 AND TABLE_M.COL_B1 = 'Y' AND TABLE_R.COL_B2 = 'Y'; the backup fields of TABLE_R are COL_B2 and COL_C2, and the primary key is COL_D2;
the cleaning node performs the following procedure:
Read:
select TABLE_R.COL_B2, TABLE_R.COL_C2, TABLE_R.COL_D2 from TABLE_M, TABLE_R where TABLE_M.COL_F < date1 AND TABLE_M.COL_A1 = TABLE_R.COL_A2 AND TABLE_M.COL_B1 = 'Y' AND TABLE_R.COL_B2 = 'Y'
Process: send the data of the backup fields COL_B2 and COL_C2 to Kafka
Write: delete the data by primary key COL_D2:
delete from TABLE_R where COL_D2 = :id
(3) Cleaning the association table across databases:
The requirement is the same as in (2);
the SQL of the association relation is split on AND and the resulting conditions are classified:
(1) association SQL, e.g., TABLE_M.COL_A1 = TABLE_R.COL_A2
(2) filter SQL for the main table, e.g., TABLE_M.COL_B1 = 'Y'
(3) filter SQL for the association table, e.g., TABLE_R.COL_B2 = 'Y'
The cleaning node performs the following procedure:
Read:
select TABLE_M.COL_A1 from TABLE_M where TABLE_M.COL_F < date1 AND TABLE_M.COL_B1 = 'Y'
Process: query
select COL_B2, COL_C2, COL_D2 from TABLE_R where COL_A2 = :COL_A1 AND TABLE_R.COL_B2 = 'Y'
and send the data of the backup fields COL_B2 and COL_C2 to Kafka
Write: delete the data by primary key COL_D2:
delete from TABLE_R where COL_D2 = :id
It should be noted that ":id" in the code is a bind placeholder of a prepared SQL statement; during execution it is replaced with the specific value read in the preceding step.
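The classification of the association relation into join, main-table filter and association-table filter conditions can be sketched as below. The function name and the returned dictionary keys are hypothetical, and the split is deliberately naive: it assumes that no string literal contains the word AND and that neither table name is a suffix of another identifier.

```python
import re


def classify_conditions(relation, main_table, assoc_table):
    """Split an association expression on AND and sort the parts into three groups."""
    groups = {"join": [], "main_filter": [], "assoc_filter": []}
    for part in re.split(r"\s+AND\s+", relation.strip(), flags=re.IGNORECASE):
        uses_main = f"{main_table}." in part
        uses_assoc = f"{assoc_table}." in part
        if uses_main and uses_assoc:
            groups["join"].append(part)          # links the two tables
        elif uses_main:
            groups["main_filter"].append(part)   # applied when reading the main table
        elif uses_assoc:
            groups["assoc_filter"].append(part)  # applied in the association table's database
        else:
            raise ValueError(f"condition references neither table: {part!r}")
    return groups


relation = ("TABLE_M.COL_A1 = TABLE_R.COL_A2 AND TABLE_M.COL_B1 = 'Y' "
            "AND TABLE_R.COL_B2 = 'Y'")
groups = classify_conditions(relation, "TABLE_M", "TABLE_R")
```

With the conditions grouped this way, the main-table filters go into the read query against the main table's database, and the join column plus the association-table filters go into the query against the association table's database.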
In fig. 2, when the master node unit, the remote sub node unit, the association table data cleaning unit, and the single table data cleaning unit perform data cleaning on the table to be cleaned, the following steps are performed:
(1) The master node sends a data cleaning instruction: according to the configured cleaning rules, the work is partitioned by the name of each table to be cleaned, and a data cleaning instruction for each table to be cleaned is sent to a remote child node through a message queue.
(2) The remote child node receives the data cleaning instruction and executes the following data processing: A. it judges whether association table data needs to be cleaned; if so, it partitions the association tables related to the table a second time according to the database and table splitting rules, and sends instructions for cleaning the association table data to the cleaning nodes through the message queue; B. after the association table data has been cleaned, it sends instructions for cleaning the main table data to the cleaning nodes through the message queue.
(3) The cleaning node receives a data cleaning instruction and executes the following data processing: the cleaning node reads the message from the request queue and parses it. If it is an instruction to clean association table data, the association table data cleaning operation is called to clean the data in the association table. If it is an instruction to clean table data, the table data cleaning operation is called to clean the data in the table. In either cleaning operation, if backup is needed, the backup data is acquired according to the configured backup fields and sent to the Kafka message queue.
(4) And the cleaning node returns a processing result: and after the cleaning node completes the relevant cleaning operation step, returning a processing result to the remote child node through the message queue.
(5) The remote child node returns the processing result: after the remote child node finishes the table cleaning operation, the processing result is returned to the master node through the message queue.
FIG. 3 is a schematic flow chart of cleaning the data of a table to be cleaned in an embodiment of the present invention. As shown in FIG. 3, for a table A to be cleaned, it is first determined whether an association table exists. If no association table exists, table A is cleaned directly; if an association table exists, the data of the association table is cleaned first, and table A is cleaned afterwards. When table A or an association table is cleaned, it is determined whether that table is divided across databases and tables, and the corresponding data is cleaned according to that division. For a main table with association tables, the main table is cleaned after all of its association tables have been cleaned; for a main table divided into sub-tables, cleaning of the table is complete once all of its sub-tables have been cleaned.
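The ordering implied by FIG. 3 can be illustrated with a short Python sketch. The function name `cleaning_order` and the two configuration dictionaries are assumptions made for illustration only; the embodiment does not prescribe a data structure for the association or division configuration.

```python
def cleaning_order(table, associations, shards):
    """Return the physical tables to clean, in the order FIG. 3 implies.

    associations: main table -> list of association tables (hypothetical config)
    shards: logical table -> list of physical sub-tables (hypothetical config);
            a table without an entry is treated as a single, undivided table.
    """
    order = []
    # Association tables come first, each expanded into its sub-tables.
    for assoc in associations.get(table, []):
        order.extend(shards.get(assoc, [assoc]))
    # The main table (or its sub-tables) is cleaned last.
    order.extend(shards.get(table, [table]))
    return order


plan = cleaning_order("A", {"A": ["B"]}, {"B": ["B_0", "B_1"]})
print(plan)  # ['B_0', 'B_1', 'A']
```

A table with neither association tables nor sub-tables yields a one-element plan containing only the table itself, which corresponds to the "clean table A directly" branch.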
Based on the same inventive concept, a distributed batch data cleaning system is also provided in the embodiments of the present invention, as described in the following embodiments. Because the system embodiment solves the problem on a principle similar to that of the distributed batch data cleaning method, the implementation of the system embodiment can refer to the implementation of the method, and repeated details are omitted.
FIG. 4 is a schematic diagram of a distributed batch data cleaning system according to an embodiment of the present invention. As shown in FIG. 4, the system includes: a master node module 41, a remote sub-node module 42, and a cleaning node module 43. The master node module 41 corresponds to a plurality of remote sub-node modules 42, and each remote sub-node module 42 corresponds to one or more cleaning node modules 43. Each remote sub-node module 42 includes a master table cleaning module 421, and each cleaning node module 43 includes a data cleaning module 431, wherein:
the master node module 41 is configured to send a cleaning instruction of each table to be cleaned to the first message queue, where the master node corresponds to a plurality of remote child nodes;
the remote sub-node module 42 is configured to read a cleaning instruction of a table to be cleaned from the first message queue, generate one or more single table cleaning instructions of the table to be cleaned according to the database and table division conditions of the table to be cleaned, and send the one or more single table cleaning instructions to the second message queue, where each remote sub-node corresponds to one or more cleaning nodes;
the cleaning node module 43 is configured to read a single table cleaning instruction of the table to be cleaned from the second message queue, clean the data of the single table of the table to be cleaned, and return a cleaning result of the single table of the table to be cleaned to the third message queue;
the remote sub-node module 42 is further configured to read the cleaning result of each single table from the third message queue, aggregate the cleaning result of each table to be cleaned, and return the cleaning result of each table to be cleaned to the fourth message queue; the master node module 41 is further configured to read the cleaning result of each table to be cleaned from the fourth message queue.
As can be seen from the above, in the distributed batch data cleaning system provided in the embodiment of the present invention, the master node module 41 sends the cleaning instruction of each table to be cleaned to the first message queue; each remote sub-node module 42 corresponding to the master node module 41 reads a cleaning instruction of a table to be cleaned from the first message queue, generates one or more single table cleaning instructions according to the database and table division conditions of that table, and sends them to the second message queue; each cleaning node module 43 corresponding to a remote sub-node module 42 reads a single table cleaning instruction from the second message queue, cleans the data of that single table, and returns the cleaning result of the single table to the third message queue; each remote sub-node module 42 reads the cleaning results of the single tables from the third message queue, summarizes them into a cleaning result for each table to be cleaned, and returns that result to the fourth message queue; finally, the master node module 41 reads the cleaning result of each table to be cleaned from the fourth message queue.
In the distributed batch data cleaning system provided by the embodiment of the invention, the data is cleaned by a plurality of nodes in parallel, which greatly improves data cleaning efficiency; the system is applicable to data cleaning in different service scenarios.
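The result path through the third and fourth message queues can be sketched as follows. This is a minimal illustration, not the implementation of the remote sub-node module 42: `queue.Queue` stands in for the message queues, and the message keys (`table`, `single_table`, `deleted`, `ok`) and the `expected_counts` bookkeeping are assumptions.

```python
import queue
from collections import defaultdict


def aggregate_results(third_queue, fourth_queue, expected_counts):
    """Summarize single table results into one result per table to be cleaned.

    expected_counts: table -> number of single table instructions that were sent.
    Each message on third_queue is a dict with the hypothetical keys
    'table', 'single_table', 'deleted', and 'ok'.
    """
    received = defaultdict(list)
    while not third_queue.empty():
        msg = third_queue.get()
        received[msg["table"]].append(msg)

    for table, expected in expected_counts.items():
        parts = received.get(table, [])
        summary = {
            "table": table,
            "deleted": sum(p["deleted"] for p in parts),
            # Complete only if every single table reported and none failed.
            "complete": len(parts) == expected and all(p["ok"] for p in parts),
        }
        fourth_queue.put(summary)
```

Tracking the expected number of single table results per table lets the remote child node distinguish a table whose sub-tables are all cleaned from one that is still in progress or has partially failed.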
In one embodiment, the cleaning node module 43 may further include a data backup module 432, configured to determine, according to a single table cleaning instruction, whether the data of the single table to be cleaned needs to be backed up and, if so, to read the backup data from the single table to be cleaned according to a pre-configured backup field and send the backup data to the fifth message queue.
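The backup behavior of the data backup module 432 can be illustrated as follows. The sketch assumes that the single table cleaning instruction carries a boolean `backup` flag and that backup fields are configured per table name; both are hypothetical. The fifth message queue, which the claims describe as a Kafka message queue, is replaced here by a standard-library `queue.Queue` so that the example stays self-contained.

```python
import json
import queue


def backup_rows(rows, instruction, backup_fields, backup_queue):
    """Send the configured fields of each row to the backup queue, if requested.

    rows: rows about to be deleted, as dicts
    instruction: single table cleaning instruction; 'backup' is a hypothetical flag
    backup_fields: table name -> columns to back up (hypothetical configuration)
    Returns the number of rows that were backed up.
    """
    if not instruction.get("backup", False):
        return 0
    table = instruction["single_table"]
    fields = backup_fields[table]
    for row in rows:
        # Keep only the pre-configured backup fields of the row.
        record = {field: row[field] for field in fields}
        backup_queue.put(json.dumps({"table": table, "data": record}))
    return len(rows)
```

Backing up only the configured fields keeps the backup messages small and avoids copying columns that do not need to be retained.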
In one embodiment, the remote sub-node module 42 may further include an association table cleaning module 422, configured to determine, according to the cleaning instruction of the table to be cleaned, whether the table to be cleaned has an association table and, if so, to clean the data of the association table before cleaning the data of the table to be cleaned.
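Claim 6 distinguishes the case in which the association table is located in a different database from the table to be cleaned: the association field is first queried from the database of the table to be cleaned and is then used to locate the data to be deleted in the database of the association table. A minimal sketch of that two-step lookup is given below using the Python standard-library `sqlite3` module. The table names, column names, and the date-based cleaning condition are purely hypothetical.

```python
import sqlite3


def clean_association_cross_db(main_db, assoc_db, cutoff):
    """Delete association rows whose main table rows are due for cleaning.

    main_db and assoc_db are connections to two separate databases.
    Table and column names (orders, order_items, order_id, created_at)
    are hypothetical and chosen only for illustration.
    """
    # 1. Query the association field from the database of the table to be cleaned.
    keys = [row[0] for row in main_db.execute(
        "SELECT order_id FROM orders WHERE created_at < ?", (cutoff,))]
    if not keys:
        return 0
    # 2. Use those values to delete from the association table in the other
    #    database. A large key set would need batching, which is omitted here.
    placeholders = ",".join("?" for _ in keys)
    cur = assoc_db.execute(
        f"DELETE FROM order_items WHERE order_id IN ({placeholders})", keys)
    assoc_db.commit()
    return cur.rowcount
```

When both tables are in the same database, the two steps can collapse into a single statement with a subquery, which corresponds to the "directly queries" branch of claim 6.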
Further, based on the above embodiment, the remote sub-node module 42 may further include a log recording module 423, configured to record and update the data cleaning log of each table to be cleaned.
Based on the same inventive concept, the embodiment of the invention also provides a computer device, which addresses the technical problems that the data cleaning methods of the prior art are deployed on a single machine and clean serially, are inefficient, and cannot clean associated data across databases.
Based on the same inventive concept, the embodiment of the invention also provides a computer-readable storage medium, which addresses the same technical problems of single-machine deployment, serial cleaning, low efficiency, and the inability to clean associated data across databases. The computer-readable storage medium stores a computer program for executing the distributed batch data cleaning method.
In summary, the embodiments of the invention provide a distributed batch data cleaning method, a system, a computer device, and a computer-readable storage medium that use multi-level remote partitioning to clean table data and associated data on multiple nodes in parallel, thereby improving data cleaning efficiency and meeting complex and diverse cleaning requirements.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing description of the embodiments is provided to illustrate the general principles of the invention. It is not intended to limit the scope of the invention or to restrict the invention to the particular embodiments described; any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (13)

1. A distributed batch data cleaning method, comprising:
a master node sends a cleaning instruction of each table to be cleaned to a first message queue, wherein the master node corresponds to a plurality of remote child nodes;
a remote child node reads a cleaning instruction of a table to be cleaned from the first message queue, generates one or more single table cleaning instructions of the table to be cleaned according to the database and table division conditions of the table to be cleaned, and sends the one or more single table cleaning instructions to a second message queue, wherein each remote child node corresponds to one or more cleaning nodes;
a cleaning node reads a single table cleaning instruction of the table to be cleaned from the second message queue, performs data cleaning on the single table of the table to be cleaned, and returns a cleaning result of the single table of the table to be cleaned to a third message queue;
the remote child node reads the cleaning results of the single tables from the third message queue, summarizes the cleaning result of each table to be cleaned, and returns the cleaning result of each table to be cleaned to a fourth message queue;
and the master node reads the cleaning result of each table to be cleaned from the fourth message queue.
2. The method of claim 1, wherein the first message queue, the second message queue, the third message queue, and the fourth message queue are Redis message queues.
3. The method of claim 1, wherein the method further comprises:
the cleaning node determines, according to the single table cleaning instruction, whether the data of the single table to be cleaned needs to be backed up, and, in the case that a backup is needed, reads the backup data from the single table to be cleaned according to a pre-configured backup field and sends the backup data to a fifth message queue.
4. The method of claim 3, wherein the fifth message queue is a Kafka message queue.
5. The method of claim 1, wherein the method further comprises:
the remote child node judges whether the table to be cleaned has an associated table according to a cleaning instruction of the table to be cleaned, and performs data cleaning on the associated table of the table to be cleaned before performing data cleaning on the table to be cleaned under the condition that the table to be cleaned has the associated table.
6. The method of claim 5, wherein the method further comprises:
when the association table and the table to be cleaned are in the same database, the cleaning node directly queries the data to be deleted in the association table in the database where the table to be cleaned is located;
and when the association table and the table to be cleaned are not in the same database, the cleaning node queries the association field from the database where the table to be cleaned is located, and, according to the association field, queries the data to be deleted in the association table from the database where the association table is located.
7. The method of claim 1, wherein the method further comprises:
and the remote child node records and updates the data cleaning log of each table to be cleaned.
8. A distributed batch data cleaning system, comprising: a master node module, a remote sub-node module, and a cleaning node module; the master node module corresponds to a plurality of remote sub-node modules; each of the remote sub-node modules corresponds to one or more cleaning node modules, each of the remote sub-node modules comprising: a master table cleaning module; each cleaning node module comprises: a data cleaning module;
the master node module is configured to send a cleaning instruction of each table to be cleaned to a first message queue;
the remote sub-node module is configured to read a cleaning instruction of a table to be cleaned from the first message queue, generate one or more single table cleaning instructions of the table to be cleaned according to the database and table division conditions of the table to be cleaned, and send the one or more single table cleaning instructions to a second message queue;
the cleaning node module is configured to read a single table cleaning instruction of the table to be cleaned from the second message queue, clean the data of the single table of the table to be cleaned, and return a cleaning result of the single table of the table to be cleaned to a third message queue;
the remote sub-node module is further configured to read the cleaning results of the single tables from the third message queue, summarize the cleaning result of each table to be cleaned, and return the cleaning result of each table to be cleaned to a fourth message queue; the master node module is further configured to read the cleaning result of each table to be cleaned from the fourth message queue.
9. The system of claim 8, wherein the cleaning node module further comprises: a data backup module, configured to determine, according to the single table cleaning instruction, whether the data of the single table to be cleaned needs to be backed up, and, in the case that a backup is needed, to read the backup data from the single table to be cleaned according to a pre-configured backup field and send the backup data to a fifth message queue.
10. The system of claim 8, wherein the remote sub-node module further comprises: an association table cleaning module, configured to determine, according to the cleaning instruction of the table to be cleaned, whether the table to be cleaned has an association table, and, in the case that the table to be cleaned has an association table, to clean the data of the association table before cleaning the data of the table to be cleaned.
11. The system of claim 8, wherein the remote sub-node module further comprises:
a log recording module, configured to record and update the data cleaning log of each table to be cleaned.
12. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the distributed batch data cleaning method of any of claims 1 to 7 when executing the computer program.
13. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program for executing the distributed batch data cleaning method of any one of claims 1 to 7.
CN202010325609.1A 2020-04-23 2020-04-23 Distributed batch data cleaning method and system Active CN111522805B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010325609.1A CN111522805B (en) 2020-04-23 2020-04-23 Distributed batch data cleaning method and system

Publications (2)

Publication Number Publication Date
CN111522805A CN111522805A (en) 2020-08-11
CN111522805B true CN111522805B (en) 2023-05-02

Family

ID=71910982

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010325609.1A Active CN111522805B (en) 2020-04-23 2020-04-23 Distributed batch data cleaning method and system

Country Status (1)

Country Link
CN (1) CN111522805B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112463799A (en) * 2020-12-11 2021-03-09 天冕信息技术(深圳)有限公司 Data extraction method, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107479829A (en) * 2017-08-03 2017-12-15 杭州铭师堂教育科技发展有限公司 A kind of Redis cluster mass datas based on message queue quickly clear up system and method
CN107783975A (en) * 2016-08-24 2018-03-09 北京京东尚科信息技术有限公司 The method and apparatus of distributed data base synchronization process
CN109753531A (en) * 2018-12-26 2019-05-14 深圳市麦谷科技有限公司 A kind of big data statistical method, system, computer equipment and storage medium
CN110287181A (en) * 2019-07-01 2019-09-27 网联清算有限公司 Data clearing method, device, electronic equipment and medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107967117B (en) * 2016-10-20 2020-10-20 杭州海康威视数字技术股份有限公司 Data storage, reading and cleaning method and device and cloud storage system

Similar Documents

Publication Publication Date Title
US20220171781A1 (en) System And Method For Analyzing Data Records
US11650971B2 (en) System and method for large-scale data processing using an application-independent framework
Marcu et al. Spark versus flink: Understanding performance in big data analytics frameworks
CN108564470B (en) Transaction distribution method for parallel building blocks in block chain
EP2831767B1 (en) Method and system for processing data queries
US10769147B2 (en) Batch data query method and apparatus
US20130138731A1 (en) Automated client/server operation partitioning
CN111752959B (en) Real-time database cross-database SQL interaction method and system
US20070250517A1 (en) Method and Apparatus for Autonomically Maintaining Latent Auxiliary Database Structures for Use in Executing Database Queries
CN111061788A (en) Multi-source heterogeneous data conversion integration system based on cloud architecture and implementation method thereof
CN109033109B (en) Data processing method and system
CN101996102A (en) Method and system for mining data association rule
CN102063490A (en) Database partition method and device
CN109885642B (en) Hierarchical storage method and device for full-text retrieval
CN110928851B (en) Method, device and equipment for processing log information and storage medium
CN110941602B (en) Database configuration method and device, electronic equipment and storage medium
CN105989163A (en) Data real-time processing method and system
CN103765381A (en) Parallel operation on B+ trees
CN111522805B (en) Distributed batch data cleaning method and system
US20060143206A1 (en) Interval tree for identifying intervals that intersect with a query interval
CN111984625A (en) Database load characteristic processing method, device, medium and electronic equipment
Raïssi et al. Need for speed: Mining sequential patterns in data streams
CN114756629A (en) Multi-source heterogeneous data interaction analysis engine and method based on SQL
CN115794783A (en) Data deduplication method, device, equipment and medium
CN111881323B (en) Table separation method based on sequencing field and time routing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant