CN116225822A - Data processing method, computing device and computer storage medium - Google Patents

Info

Publication number
CN116225822A
Authority
CN
China
Prior art keywords
modification
log
update log
data
data table
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211551865.8A
Other languages
Chinese (zh)
Inventor
张高迪
姜伟华
蒋光然
林俊浩
王华峰
胡一博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd filed Critical Alibaba China Co Ltd
Priority to CN202211551865.8A priority Critical patent/CN116225822A/en
Publication of CN116225822A publication Critical patent/CN116225822A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/302 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a software system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3051 Monitoring arrangements for monitoring the configuration of the computing system or of the computing system component, e.g. monitoring the presence of processing resources, peripherals, I/O links, software programs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3089 Monitoring arrangements determined by the means or processing involved in sensing the monitored data, e.g. interfaces, connectors, sensors, probes, agents
    • G06F11/3093 Configuration details thereof, e.g. installation, enabling, spatial arrangement of the probes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/54 Interprogram communication
    • G06F9/546 Message passing systems or structures, e.g. queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00 Indexing scheme relating to G06F9/00
    • G06F2209/54 Indexing scheme relating to G06F9/54
    • G06F2209/548 Queue
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Quality & Reliability (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The embodiment of the invention provides a data processing method, computing equipment and a computer storage medium. The data processing method comprises the following steps: determining a target data table to which the data processing request requests access; generating an update log based on modification information generated by a modification operation when the modification operation for the target data table is monitored; and sending the update log to a log receiver by utilizing a pre-configured data interface so that the log receiver consumes the update log. The technical scheme provided by the embodiment of the invention shortens the calculation link of real-time calculation and reduces the cost of real-time calculation.

Description

Data processing method, computing device and computer storage medium
Technical Field
The embodiment of the invention relates to the technical field of data processing, in particular to a data processing method, computing equipment and a computer storage medium.
Background
Fields of real-time computing, such as real-time big data analysis, risk-control early warning, real-time prediction and financial transactions, require highly timely data. The business value of data decreases rapidly over time, so data must be computed and processed as soon as possible after it is generated. A software system therefore needs real-time computing capability: it must shorten the latency of the full-link data stream and compute business logic in real time, so as to meet the business need of processing big data in real time.
An update log may serve as a data source that drives a real-time computing link; the update log may be implemented, for example, as a Binlog (Binary Log).
When a Binlog generated by a database in the related art is consumed, a data collection tool, for example Debezium or Canal, is generally required to collect the Binlog. The data collection tool then sends the collected Binlog to a message queue, and a consumer obtains the Binlog by subscribing to the message queue.
In the process of realizing the inventive concept, the inventor found that this way of consuming the update log in the related art cannot meet the real-time requirement.
Disclosure of Invention
The embodiment of the invention provides a data processing method, a data processing device, computing equipment and a computer storage medium.
In a first aspect, an embodiment of the present invention provides a method for processing data, including:
determining a target data table for requesting access;
when the modification operation aiming at the target data table is monitored, creating an update log uniquely corresponding to the target data table;
writing modification information generated by the modification operation into the update log;
and sending the update log to a log receiving party by using a pre-configured data interface, so that the log receiving party starts a data processing thread in response to receiving the update log, and performing real-time calculation processing on the target data table by using the data processing thread.
In a second aspect, an embodiment of the present invention provides an apparatus for data processing, including:
the first determining module is used for determining a target data table for requesting access;
the log creation module is used for generating an update log based on modification information generated by the modification operation when the modification operation for the target data table is monitored;
and the log sending module is used for sending the update log to a log receiving party by utilizing a pre-configured data interface so that the log receiving party consumes the update log.
In a third aspect, another embodiment of the present invention provides a data processing method, including:
obtaining an update log by using a preset interface, wherein the update log is generated by writing modification information generated by modification operation under the condition that the modification operation for a target data table is monitored, and the target data table is determined by a data processing request;
and carrying out consumption processing on the update log.
In a fourth aspect, an embodiment of the present invention provides an apparatus for data processing, including:
the log acquisition module is used for obtaining an update log by using a preset interface, wherein the update log is generated by writing modification information generated by a modification operation under the condition that the modification operation for a target data table is monitored, and the target data table is determined by a data processing request;
and the log consumption module is used for carrying out consumption processing on the update log.
In a fifth aspect, in an embodiment of the present invention, a computing device is provided, including a processing component and a storage component;
the storage component stores one or more computer instructions; the one or more computer instructions are used for being called and executed by the processing component to realize the data processing method provided by the embodiment of the invention.
In a sixth aspect, in an embodiment of the present invention, there is provided a computer storage medium storing a computer program, where the computer program is executed by a computer to implement a data processing method provided in the embodiment of the present invention.
The embodiment of the invention provides a data processing method in which a target data table that a data processing request requests to access is determined; when a modification operation for the target data table is monitored, an update log uniquely corresponding to the target data table is created; the modification information generated by the modification operation is written into the update log; and the update log is sent to a log receiver by using a pre-configured data interface, so that the log receiver consumes the update log. With this technical scheme, once the update log is generated it can be sent directly to the log receiver through the pre-configured interface, without being collected by a data acquisition tool and transmitted through a message queue. This shortens the computing link of real-time computing, reduces the cost of real-time computing, and improves the real-time performance of update log consumption.
These and other aspects of the invention will be more readily apparent from the following description of the embodiments.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 schematically illustrates a flow chart of a data processing method provided by one embodiment of the present invention;
FIG. 2 schematically illustrates a flow chart of a method for determining a modification type provided by an embodiment of the present invention;
FIG. 3 schematically illustrates writing data to a data storage system provided by an embodiment of the present invention;
FIG. 4 schematically illustrates a flow chart of a data processing method according to another embodiment of the present invention;
FIG. 5 schematically illustrates a block diagram of a data processing apparatus provided by one embodiment of the present invention;
FIG. 6 schematically illustrates a block diagram of a data processing apparatus provided by one embodiment of the present invention;
FIG. 7 schematically illustrates a block diagram of a computing device provided by one embodiment of the invention.
Detailed Description
In order to enable those skilled in the art to better understand the present invention, the following description will make clear and complete descriptions of the technical solutions according to the embodiments of the present invention with reference to the accompanying drawings.
In some of the flows described in the specification and claims of the present invention and in the foregoing figures, a plurality of operations occurring in a particular order are included, but it should be understood that the operations may be performed out of order or performed in parallel, with the order of operations such as 101, 102, etc., being merely used to distinguish between the various operations, the order of the operations themselves not representing any order of execution. In addition, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel. It should be noted that, the descriptions of "first" and "second" herein are used to distinguish different messages, devices, modules, etc., and do not represent a sequence, and are not limited to the "first" and the "second" being different types.
In this context, it is to be understood that the terms so referred to may be technical means or other summarizing terms used to implement a portion of the invention. For example, the term may include:
Binlog: a binary log that records all events that may change data in a database, such as table structure changes (e.g., CREATE, ALTER TABLE) and table data modification (INSERT, UPDATE, DELETE) events.
WAL: the write-ahead log (Write-Ahead Log) is a standard method of ensuring data integrity. The central idea is that modifications to data files may occur only after those modifications have been logged, i.e. after the log records describing them have been written to disk. Following this procedure, data pages need not be flushed to disk every time a transaction commits, since the database can be recovered from the log in the event of a crash: any modifications not yet applied to the data pages can be redone from the log records.
Storage format: the underlying storage format of data in a database is mainly divided into row storage and column storage. Column-store tables suit OLAP scenarios, such as complex queries, data association, scanning, filtering and statistics; row-store tables suit KV (key-value) scenarios, such as point lookups and scans based on the primary key.
Real-time computing: the business value of data decreases rapidly over time, so data must be computed and processed as soon as possible after it is generated. Highly timely information is required in particular in fields such as real-time big data analysis, risk-control early warning, real-time prediction and financial transactions. A software system therefore needs real-time computing capability: it must shorten the latency of the full-link data stream and compute business logic in real time, so as to meet the business need of processing big data in real time.
Shard: in a distributed storage engine, the data of a data table (Table) is divided into a plurality of shards (Table Group Shard, Shard for short), which are stored on different worker nodes (Worker nodes).
Index: a data table (Table) in a database consists of multiple indexes (Index), each of which indexes the rows of the table in a certain sort order.
Tablet: the partition of one Index of a data table on the current Shard.
LSN: i.e., log sequence number (Log Sequence Number) for maintaining the order of the logs.
Binlog (Binary Log) is commonly used in MySQL databases to record all logs that occur in the database that may cause data change events, such as table data modification (INSERT, UPDATE, DELETE) events.
MySQL Binlog originally had two main uses:
master-slave replication: the master server transmits the events contained in its Binlog file to the slave server, and the slave server executes the events to make the same data change as the master server, thereby ensuring data consistency between the master server and the slave server.
Data recovery: the database is restored to the latest state before the problem occurs by re-executing the event recorded in the Binlog file.
Fields of real-time computing, such as real-time big data analysis, risk-control early warning, real-time prediction and financial transactions, require highly timely data. The business value of data decreases rapidly over time, so data must be computed and processed as soon as possible after it is generated. A software system therefore needs real-time computing capability: it must shorten the latency of the full-link data stream and compute business logic in real time, so as to meet the business need of processing big data in real time.
The inventor finds that the Binlog can be used as an update log, so that in the field of real-time computing the Binlog can also serve as a data source that drives a real-time computing link.
Further, in implementing the present inventive concept, it was found that when an update log generated by a database in the related art is consumed, a data collection tool, for example Debezium or Canal, is generally required to collect the Binlog; the data collection tool then sends the collected update log to a message queue, and a consumer obtains the update log by subscribing to the message queue. After the update log is generated, it reaches the destination only after passing through the data collection tool and the message queue in turn, so consuming the update log involves a long computing link and a high computing cost, and the real-time requirement cannot be met.
In order to solve the technical problems existing in the related art, the embodiment of the invention provides a data processing method in which a target data table that a data processing request requests to access is determined; when a modification operation for the target data table is monitored, an update log uniquely corresponding to the target data table is created; the modification information generated by the modification operation is written into the update log; and the update log is sent to a log receiver by using a pre-configured data interface, so that the log receiver consumes the update log. With this technical scheme, once the update log is generated it can be sent directly to the log receiver through the pre-configured interface, without being collected by a data acquisition tool and transmitted through a message queue. This shortens the computing link of real-time computing, reduces the cost of real-time computing, and improves the real-time performance of update log consumption.
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
Fig. 1 schematically illustrates a flowchart of a data processing method according to an embodiment of the present invention, where, as shown in fig. 1, the data processing method may include the following steps:
101, determining a target data table to which a data processing request requests access;
102, when a modification operation for the target data table is monitored, generating an update log based on modification information generated by the modification operation;
103, sending the update log to a log receiver by using a pre-configured data interface so that the log receiver consumes the update log.
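Steps 101 to 103 can be illustrated with a minimal in-memory sketch. It is not the patented implementation: the class `DataStorageSystem`, its methods, the dictionary-based tables and the `send` callable standing in for the pre-configured data interface are all hypothetical names introduced only for illustration.

```python
from typing import Callable, Dict, List, Set


class DataStorageSystem:
    """Hypothetical in-memory stand-in for the data storage system."""

    def __init__(self, send: Callable[[dict], None]) -> None:
        # table name -> {primary key -> row}
        self.tables: Dict[str, Dict[int, dict]] = {}
        # tables for which update-log generation has been enabled
        self.logged: Set[str] = set()
        # assumed stand-in for the pre-configured data interface
        self.send = send

    def handle_request(self, table_name: str) -> None:
        # Step 101: determine the target data table named by the request.
        if table_name not in self.tables:
            raise KeyError(f"unknown table: {table_name}")
        self.logged.add(table_name)

    def modify(self, table_name: str, key: int, row: dict) -> None:
        self.tables[table_name][key] = row
        if table_name in self.logged:
            # Step 102: generate an update-log entry from the modification.
            entry = {"table": table_name, "key": key, "row": dict(row)}
            # Step 103: hand the entry directly to the log receiver.
            self.send(entry)


received: List[dict] = []
system = DataStorageSystem(send=received.append)
system.tables["orders"] = {}
system.tables["users"] = {}
system.handle_request("orders")                      # only "orders" is monitored
system.modify("orders", 1, {"id": 1, "amount": 10})  # logged and sent
system.modify("users", 7, {"id": 7, "name": "a"})    # not logged
```

In this sketch the receiver is just a Python list; the point is only that the entry goes from the storage system to the receiver in one call, with no collection tool or message queue in between.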
The data processing method provided by the embodiment of the invention can be executed by a data storage system, and the data storage system can comprise a data warehouse and a database, and a plurality of data tables can be stored in the data storage system.
According to an embodiment of the present invention, a request initiator may send a data processing request to a data storage system, where the data processing request may include identification information of the data table requested to be processed, and the identification information may include, for example, a table name or a UUID (Universally Unique Identifier).
According to the embodiment of the invention, after the data storage system receives the data processing request, the data processing request can be analyzed to obtain the identification information carried by the data processing request, and then the target data table corresponding to the identification information can be determined from a plurality of data tables stored in the data storage system according to the identification information.
According to the embodiment of the invention, the data processing request can be used for requesting the data storage system to start the update log generation function aiming at the target data table, after the update log generation function is started, the data storage system can monitor the target data table and generate the update log uniquely corresponding to the target data table under the condition that the modification operation aiming at the target data table is monitored.
According to an embodiment of the present invention, in the case where a modification operation for the target data table is monitored, generating the update log based on modification information generated by the modification operation may be specifically implemented as:
in response to the modifying operation, creating a log data table uniquely corresponding to the target data table;
acquiring modification information generated by modification operation;
writing the modification information into a log data table to obtain an update log.
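The three sub-steps above can be sketched with SQLite from the Python standard library. The naming rule `<table>__update_log`, the `lsn` and `info` columns and the JSON encoding are assumptions made for illustration; the source only states that the log data table uniquely corresponds to the target data table.

```python
import json
import sqlite3


def log_table_name(target_table: str) -> str:
    # Hypothetical naming rule: exactly one log table per target table.
    if not target_table.isidentifier():
        raise ValueError(f"unsafe table name: {target_table!r}")
    return f"{target_table}__update_log"


def write_update_log(conn: sqlite3.Connection, target_table: str,
                     modification_info: dict) -> int:
    name = log_table_name(target_table)
    # Create the log data table on first use; later calls reuse it.
    conn.execute(
        f'CREATE TABLE IF NOT EXISTS "{name}" ('
        "lsn INTEGER PRIMARY KEY AUTOINCREMENT, info TEXT NOT NULL)"
    )
    # Write the modification information as one log row.
    cur = conn.execute(
        f'INSERT INTO "{name}" (info) VALUES (?)',
        (json.dumps(modification_info),),
    )
    conn.commit()
    return cur.lastrowid  # acts as an assumed log sequence number


conn = sqlite3.connect(":memory:")
first = write_update_log(conn, "orders", {"id": 1, "amount": 10})
second = write_update_log(conn, "orders", {"id": 1, "amount": 12})
rows = conn.execute('SELECT info FROM "orders__update_log" ORDER BY lsn').fetchall()
```

The auto-incrementing `lsn` column is one simple way to keep log rows ordered; a distributed engine would need a different sequencing scheme.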
According to an embodiment of the present invention, the modification operation may include a modification operation that may cause the target data table to generate a data change, and the modification operation may include, for example, a write operation, a delete operation, and an update operation.
In the data processing method provided by the embodiment of the invention, the data storage system can, in response to the data processing request, independently enable the update log generation function for any target data table stored in the data storage system, so that update logs can be generated at fine granularity and the flexibility of generating update logs is improved.
According to a preferred embodiment of the present invention, column-store tables suit OLAP (Online Analytical Processing) scenarios, such as complex queries, data association, scanning, filtering and statistics, whereas row-store tables suit KV (key-value) scenarios, such as point lookups and scans based on the primary key; therefore, in the embodiment of the present invention, both the target data table and the update log may be row-store tables.
According to the embodiment of the invention, after a modification operation on the target data table is monitored, an update log uniquely corresponding to the target data table is created. That is, when the update log is generated, a new update log is created rather than a column being added to the target data table, and the target data table and the new update log exist at the same time. Therefore, enabling the update log generation function occupies only one additional table-level storage space.
According to embodiments of the present invention, because MySQL Binlog is intended for master-slave replication and data recovery, MySQL itself does not provide a component that parses the Binlog into a message stream. In addition, the format of MySQL Binlog is complex and stores many kinds of information, much of which is useless for consumption in real-time computing. Therefore, when MySQL Binlog needs to be consumed, a data acquisition tool is required to parse the MySQL Binlog file, extract the logs relevant to the real-time computing scenario, and parse them into a byte stream. Because this byte stream contains changes to a plurality of data tables and reading the Binlog with the data acquisition tool is costly, the byte stream obtained from MySQL Binlog has to be put into a message queue so that downstream consumers can select a table and consume it repeatedly.
The data storage system provided by the embodiment of the invention can integrate a module that parses the Binlog into a byte stream, and because the update log in the embodiment of the invention is generated for a single target data table and uniquely corresponds to that table, the update log can be transferred through a pre-provided data interface.
According to an embodiment of the present invention, the preset interface may include, for example, a PostgreSQL logical replication interface.
According to the embodiment of the invention, the preset interface can connect to data processing engines such as Flink and JDBC; the update log can be sent to the data processing engine through the preset interface, and the data processing engine processes the update log and then sends it to the log receiver.
According to embodiments of the present invention, the log receiver may include, for example, a database, a data lake, a real-time data warehouse, an offline data warehouse, and the like.
According to an embodiment of the present invention, the modification information includes modification contents and modification types.
According to an embodiment of the present invention, the modification type may include a modification type corresponding to the modification operation, and the modification content may include modification content caused by the modification operation.
According to the embodiment of the invention, writing the modification information into the log data table to obtain the update log can be specifically realized as follows:
Determining modification content corresponding to the modification operation and a modification type corresponding to the modification operation;
writing the modification type and the modification content into a log data table to obtain an update log.
According to an embodiment of the present invention, the modification type is determined by:
acquiring a primary key in the modification content;
retrieving the target data table to determine whether a target primary key corresponding to the primary key is included in the target data table;
if yes, determining that the modification type is a first modification type, wherein the first modification type represents modifying the data record of the target primary key in the target data table by using the modification content;
if not, determining that the modification type is a second modification type, wherein the second modification type represents writing the modification content into the target data table.
According to an embodiment of the present invention, the modification operation may be a modification operation performed on any data row in the target data table, and thus the modification content may include at least one primary key.
According to the embodiment of the present invention, after the primary key included in the modified content is determined, whether or not there is the same target primary key as the primary key included in the modified content among the plurality of data records stored in the target data table can be determined by retrieving the target data table.
According to an embodiment of the present invention, if there is a target primary key that is the same as a primary key included in the modification content in a plurality of data records stored in the target data table, the modification operation is characterized as modification to one existing data record in the target data table, so that it is determined that the modification type is the first modification type.
According to an embodiment of the present invention, if the same target primary key as the primary key included in the modification content does not exist in the plurality of data records stored in the target data table, the modification operation is characterized in that the purpose is to add a new data record to the target data table, and the modification type may be determined to be the second modification type.
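The primary-key lookup described above can be sketched as follows. The enum `ModificationType`, the function `classify` and the `id` primary-key column are hypothetical names, and the target data table is modeled as a dictionary keyed by primary key.

```python
from enum import Enum
from typing import Dict


class ModificationType(Enum):
    FIRST = "modify an existing data record"
    SECOND = "write a new data record"


def classify(target_table: Dict[int, dict], modification_content: dict,
             pk_column: str = "id") -> ModificationType:
    # Acquire the primary key carried by the modification content.
    key = modification_content[pk_column]
    # Retrieve the target data table to see whether that key already exists.
    if key in target_table:
        return ModificationType.FIRST
    return ModificationType.SECOND


orders = {1: {"id": 1, "amount": 10}}
existing_kind = classify(orders, {"id": 1, "amount": 12})
new_kind = classify(orders, {"id": 2, "amount": 5})
```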
According to an embodiment of the present invention, in the case where the modification type is determined to be the first modification type, the data processing method further includes:
determining whether the data record corresponding to the primary key and the modification content have the same field;
if yes, determining that the modification type is a third modification type, wherein the third modification type represents modifying only the fields in the data record that differ from the modification content;
if not, determining that the modification type is a fourth modification type, wherein the fourth modification type represents replacing the data record corresponding to the primary key with the modification content.
According to the embodiment of the invention, if the data record corresponding to the primary key and the modification content have the same field, the modification operation is intended to update only part of the fields in the data record corresponding to the primary key, that is, only the fields in the data record that differ from the modification content are modified, so the modification type can be determined as the third modification type. If the data record corresponding to the primary key and the modification content do not have the same field, the modification operation is intended to replace an existing data record in the target data table with the modification content, so the modification type can be determined as the fourth modification type.
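The source does not define precisely what "have the same field" means. The sketch below assumes one reading: a field other than the primary key counts as "the same" when it appears in both the existing data record and the modification content with an equal value. The function names and the `id` primary-key column are hypothetical.

```python
from typing import Optional


def refine_modification_type(existing: dict, content: Optional[dict],
                             pk: str = "id") -> str:
    # Empty or null content cannot share a field: treat it as a replacement.
    if not content:
        return "fourth"
    # Assumed reading: a shared field has the same name and the same value.
    shared = [f for f, v in content.items()
              if f != pk and f in existing and existing[f] == v]
    return "third" if shared else "fourth"


def changed_fields(existing: dict, content: dict, pk: str = "id") -> dict:
    # For the third type, only fields that differ need to be modified.
    return {f: v for f, v in content.items()
            if f != pk and existing.get(f) != v}


record = {"id": 1, "name": "a", "city": "x"}
partial = {"id": 1, "name": "a", "city": "y"}   # "name" is unchanged
full = {"id": 1, "name": "b", "city": "y"}      # every field differs
partial_kind = refine_modification_type(record, partial)
full_kind = refine_modification_type(record, full)
delta = changed_fields(record, partial)
```

The primary key is excluded from the comparison because, in the first modification type, it is equal in both records by construction and would otherwise make every modification a third-type one.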
According to an embodiment of the present invention, the third modification type includes a first field characterizing the corresponding data record as a pre-update data record and a second field characterizing the corresponding data record as an updated data record.
According to an embodiment of the present invention, in the case where the modification type is determined to be the third modification type, writing the modification type and the modification content into the update log may be specifically implemented as:
splicing the first field with the data record corresponding to the target main key to generate a first data record;
splicing the second field with the modified content to generate a second data record;
and writing the first data record and the second data record into the update log respectively.
According to an embodiment of the present invention, when the modification type is the fourth modification type and the modification content is null, replacing the original data record in the target data table with the null modification content can be understood as deleting the original data record from the target data table. In this case, when the modification content and the modification type are written into the update log, only the first field needs to be spliced with the data record corresponding to the target primary key to generate the first data record, and the first data record is then written into the update log.
According to the embodiment of the invention, different field values can be configured for different modification types, so that when the modification content and the modification type are written into the update log, the field value corresponding to the modification type can be directly written into the update log, so that the memory occupation of the update log is reduced.
According to an embodiment of the present invention, the third modification type may be a field update type, wherein a field value of the first field may be configured to be 3 and a field value of the second field may be configured to be 7. The second modification type may be an insert modification type, and a field value corresponding to the second modification type may be 5. The fourth modification type may be a delete modification type and the field value of the fourth modification type may be 2.
Fig. 2 schematically shows a flowchart of a method for determining a modification type according to an embodiment of the present invention.
In a preferred embodiment of the present invention, it may be first determined whether the modify operation belongs to a write operation or a delete operation.
If the modification operation is a delete operation, the modification type may be determined directly to be the fourth modification type, namely the delete type (delete), and the field value of the delete type may be configured to be 2.
If the modification operation belongs to the write operation, it may be first determined whether the target primary key that is the same as the primary key of the modification content is recorded in the target data table:
if not, the modification type is determined to be the second modification type, namely the insert type (insert), and the field value of the insert type may be configured to be 5;
if so, the modification type is determined to be the third modification type, namely the update type (update), the field value of the first field (update before) is configured to be 3, and the field value of the second field (update after) is configured to be 7.
In an embodiment of the present invention, the target data table may be an empty table. In that case no primary key exists in the target data table, so it may be determined that the target data table contains no target primary key identical to the primary key of the modification content.
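The decision flow of Fig. 2 can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the implementation of the embodiment: the names `EventType` and `classify`, the operation strings "write" and "delete", and the modelling of the target data table as a dictionary keyed by primary key are all hypothetical. Only the field values 2, 3, 5 and 7 come from the description above.

```python
from enum import IntEnum
from typing import Dict, List, Tuple


class EventType(IntEnum):
    """Field values as configured in the description above."""
    DELETE = 2          # fourth modification type (delete)
    BEFORE_UPDATE = 3   # first field of the third modification type
    INSERT = 5          # second modification type (insert)
    AFTER_UPDATE = 7    # second field of the third modification type


def classify(table: Dict[str, Tuple[str, ...]], op: str, key: str) -> List[EventType]:
    """Return the event type(s) produced by one modification operation.

    `table` maps primary key -> record (hypothetical model of the target
    data table); `op` is "write" or "delete".
    """
    if op == "delete":
        return [EventType.DELETE]
    if op != "write":
        raise ValueError(f"unknown operation: {op}")
    if key in table:
        # An existing record is updated: one pre-update and one post-update row.
        return [EventType.BEFORE_UPDATE, EventType.AFTER_UPDATE]
    # Also covers the empty-table case: an empty table never contains the key.
    return [EventType.INSERT]
```

Because the empty table is just a dictionary without keys, the empty-table case needs no special branch in this sketch.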
According to an embodiment of the present invention, the data processing method further includes:
acquiring a system time in response to receiving the modification operation;
generating a timestamp based on the system time;
distributing a log serial number for the update log;
the log sequence number and the time stamp are written into the update log.
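A minimal sketch of these four steps is given below. The names `LogStamper` and `write_header`, the use of a simple incrementing integer as the log sequence number, the microsecond resolution, and the dictionary keys `lsn` and `timestamp_us` are assumptions made for illustration.

```python
import itertools
import time


class LogStamper:
    """Hands out a (log sequence number, timestamp) pair per modification."""

    def __init__(self, clock_ns=time.time_ns):
        self._clock_ns = clock_ns             # source of the system time
        self._next_lsn = itertools.count(1)   # incrementing sequence number

    def stamp(self):
        timestamp_us = self._clock_ns() // 1_000   # nanoseconds -> microseconds
        return next(self._next_lsn), timestamp_us


def write_header(update_log, stamper):
    """Append a new log entry carrying the sequence number and the timestamp."""
    lsn, timestamp_us = stamper.stamp()
    update_log.append({"lsn": lsn, "timestamp_us": timestamp_us})
    return lsn
```

Passing the clock in as a parameter keeps the sketch testable with a fixed system time.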
Table 1 below schematically shows the structure of the update log provided by an embodiment of the present invention.
TABLE 1
hg_binlog_lsn user_table_column hg_binlog_event_type hg_binlog_timestamp_us
Table 2 below schematically shows field descriptions of update logs provided by embodiments of the present invention.
TABLE 2
Figure BDA0003981514150000091
The data processing method provided by examples of the present invention is schematically shown below in conjunction with tables 3 and 4.
TABLE 3
Key Value
Id1 Title1,body1
Id2 Title2,body2
TABLE 4
hg_binlog_lsn_1 Id3,Title3,body3 5 1656408346430986
hg_binlog_lsn_2 Id2,Title2,new_body2 7 1656408346430975
hg_binlog_lsn_3 Id2,Title2,body2 3 1656408346430975
Table 3 may be a target data table and table 4 may be an update log uniquely corresponding to the target data table.
As shown in table 3, there are two data records in the target data table, the primary key of the first data record is Id1, and the fields include Title1 and body1; the primary key of the second data record is Id2, and the field includes Title2, body2.
There may be two modification operations for the target data table, the first modification operation may be writing a data record with Id3 as a primary key and Title3 and body3 as fields into the target data table, and the second modification operation may be updating field body2 to new_body2 in the data record with Id2 as a primary key in the target data table.
For the first modification operation, the modification content is first determined to be Id3, Title3 and body3. Since a new data record is written, the field value corresponding to this modification operation is 5. A log sequence number, for example hg_binlog_lsn_1, may then be allocated to the first modification operation, the system time at which the data storage system received the modification operation is acquired, and a timestamp, for example 1656408346430986, is generated from the system time. Based on this, the log sequence number, modification content, modification type and timestamp may be written into the update log.
For the second modification operation, it is first determined that the operation changes the data record in the target data table whose primary key is Id2 and whose fields are Title2 and body2, with body2 being changed to new_body2. The modification operation therefore generates two data records in the update log. One has hg_binlog_lsn_2 as the log sequence number, Id2, Title2, new_body2 as the modification content, 7 as the modification type and 1656408346430975 as the timestamp, where modification type 7 indicates that the record is the existing record after the update. The other has hg_binlog_lsn_3 as the log sequence number, Id2, Title2, body2 as the modification content, 3 as the modification type and 1656408346430975 as the timestamp, where modification type 3 indicates that the record is the existing record before the update. It can be seen that the timestamps of the two data records are identical.
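The worked example of Tables 3 and 4 can be reproduced with the short sketch below. The function name `apply_write` and the representation of the target data table as a dictionary and of the update log as a list of tuples are assumptions; the timestamps are passed in explicitly so that the rows match Table 4, and the post-update row is emitted before the pre-update row only because Table 4 lists them in that order.

```python
def apply_write(table, log, key, fields, ts_us):
    """Apply one write to `table` and append the matching rows to `log`."""
    rows = []
    if key in table:
        old = table[key]
        rows.append(((key, *fields), 7))   # record after the update
        rows.append(((key, *old), 3))      # record before the update
    else:
        rows.append(((key, *fields), 5))   # newly inserted record
    for content, event_type in rows:
        lsn = f"hg_binlog_lsn_{len(log) + 1}"
        log.append((lsn, ",".join(content), event_type, ts_us))
    table[key] = fields


# Table 3: the target data table before the two modification operations.
table = {"Id1": ("Title1", "body1"), "Id2": ("Title2", "body2")}
log = []
apply_write(table, log, "Id3", ("Title3", "body3"), 1656408346430986)
apply_write(table, log, "Id2", ("Title2", "new_body2"), 1656408346430975)
for row in log:
    print(row)
```

The three printed rows correspond to the three rows of Table 4, and both rows generated by the second operation share one timestamp.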
According to an embodiment of the present invention, the data processing method further includes:
setting a survival time value for the update log;
storing the time-to-live value and the update log into a cache;
monitoring a lifecycle of the update log based on the time-to-live value;
when the lifecycle ends, the update log is deleted.
According to an embodiment of the present invention, a TTL (Time To Live) value can be set independently for each update log through DDL (Data Definition Language). Compared with MySQL Binlog, for which the time-to-live value can only be set for a server instance as a whole, the update log can thus be managed with finer granularity and more flexibility.
According to an embodiment of the present invention, because each update log that is generated corresponds uniquely to a target data table, storing the update log requires only one additional table-level storage space.
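The per-log time-to-live handling can be sketched as follows. This is a simplified in-memory illustration, not the storage engine of the embodiment: the class name `UpdateLogCache`, its methods, and the choice to expire a whole update log at once are assumptions.

```python
import time


class UpdateLogCache:
    """Keeps each update log together with its own time-to-live value."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._logs = {}   # table name -> (expires_at, update log)

    def put(self, table_name, update_log, ttl_seconds):
        """Store an update log and the end of its lifecycle."""
        expires_at = self._clock() + ttl_seconds
        self._logs[table_name] = (expires_at, update_log)

    def get(self, table_name):
        entry = self._logs.get(table_name)
        return None if entry is None else entry[1]

    def sweep(self):
        """Delete every update log whose lifecycle has ended; return their names."""
        now = self._clock()
        expired = [name for name, (expires_at, _) in self._logs.items()
                   if expires_at <= now]
        for name in expired:
            del self._logs[name]
        return expired
```

A monitoring task would call `sweep()` periodically; because each table's log carries its own expiry, two logs can have different lifetimes within the same instance.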
According to an embodiment of the present invention, writing the log sequence number and the time stamp into the update log may be specifically implemented as:
writing the log sequence number and the timestamp into the update log as separate fields.
According to an embodiment of the present invention, the data processing method further includes:
Acquiring a query request, wherein the query request is generated according to a log sequence number field;
analyzing the query request to obtain a log serial number field carried by the query request;
the update log is queried based on the log sequence number field.
According to an embodiment of the present invention, when the update log is generated, a new row-store table is created in which the hg_binlog_lsn field is the key and the combination of the modification content, hg_binlog_event_type and hg_binlog_timestamp_us fields is the value. The fields of the update log are therefore fixed, that is, the update log has a strong schema. When the stored update log needs to be queried, a user can query it by including the hg_binlog_lsn, hg_binlog_event_type or hg_binlog_timestamp_us field in the query request. When any one of these three fields is present in the query request, the data storage system automatically routes the query request to the update log.
According to a preferred embodiment of the present invention, the bottom layer of the data storage system may store data in the form of a row-store table, so the update log is preferably queried by hg_binlog_lsn. If the query request filters on the hg_binlog_event_type or hg_binlog_timestamp_us field instead, the query may degrade into a full table scan, which results in low query efficiency.
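The routing rule and the difference between a key lookup and a full scan can be illustrated with the sketch below. The function names `route` and `query_log`, the return strings, and the dictionary model of the row-store table are assumptions; only the three field names come from the description.

```python
BINLOG_FIELDS = {"hg_binlog_lsn", "hg_binlog_event_type", "hg_binlog_timestamp_us"}


def route(requested_fields):
    """Route to the update log if any of the three log fields is requested."""
    if BINLOG_FIELDS & set(requested_fields):
        return "update_log"
    return "target_table"


def query_log(log, *, lsn=None, event_type=None):
    """Query a log stored as {lsn: (content, event_type, timestamp_us)}."""
    if lsn is not None:
        row = log.get(lsn)   # the key is hg_binlog_lsn: direct lookup
        return [] if row is None else [(lsn, *row)]
    # Filtering on a value field has to visit every row (full scan).
    return [(k, *v) for k, v in log.items()
            if event_type is None or v[1] == event_type]


log = {
    1: ("Id3,Title3,body3", 5, 1656408346430986),
    2: ("Id2,Title2,new_body2", 7, 1656408346430975),
    3: ("Id2,Title2,body2", 3, 1656408346430975),
}
```

In this model, `query_log(log, lsn=2)` touches a single entry, whereas `query_log(log, event_type=7)` iterates over the whole log, which mirrors why querying by hg_binlog_lsn is preferred.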
FIG. 3 schematically illustrates writing data to a data storage system provided by an embodiment of the present invention.
In an embodiment of the present invention, the basic abstraction of the data storage system is a distributed table, which may be split into shards (Table Group Shards) for storage in order to make the system scalable. Each shard constitutes a unit of storage management and recovery (Recovery Unit). Fig. 3 shows the basic architecture of a shard. A shard is made up of multiple tablets that share a Write-Ahead Log (WAL), which the data storage system uses to ensure the atomicity and durability of data. When a modification operation such as INSERT, UPDATE or DELETE occurs, the data storage system first writes the WAL and then writes the data to the MemTable corresponding to the table. Once the MemTable has grown to a certain size or a certain time has elapsed, the MemTable is switched to an immutable flushing MemTable, and a new MemTable is opened to receive new write requests. The immutable flushing MemTable may be written to disk to become an immutable file; once the immutable file has been generated, the data is persisted. If the system crashes because of an error, the WAL is read when the system restarts and the data that had not yet been persisted is recovered.
FIG. 3 illustrates the process of writing to a single shard of the data storage system, which comprises the following steps: (1) an LSN (Log Sequence Number) is assigned to the write request, the LSN consisting of a timestamp and an incrementing sequence number; (2) a new WAL entry is created and persisted in the file system; the WAL entry contains the information needed to recover the write operation, and the write is committed to the tablet only after the entry has been fully persisted; (3) the write operation is performed in the MemTable of the corresponding tablet and made visible to new read requests. Notably, updates on different tablets can be parallelized. (4) After a MemTable is full, it is flushed to the file system and a new MemTable is initialized; (5) the multiple shard files are merged asynchronously in the background (compaction).
The update log may be generated before the WAL is written, that is, in step (1) of the single-shard write described above. The hg_binlog_lsn field can directly reuse the Log Sequence Number, and the system time obtained when the LSN is generated is used as the value of hg_binlog_timestamp_us. The write type is then determined to generate the value of the hg_binlog_event_type field, and finally the update log entry is spliced together from the fields to be updated and the original data of the table. After the update log has been generated, the WAL is created and persisted, which completes step (2). In step (3), the original update is written to the tablet of the corresponding index, and the update log is written to the tablet of the update log index. The data changes of the target data table and of the update log are thus persisted in the WAL at the same time, which ensures that the events recorded in the update log correspond to the changes made to the target data table.
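The ordering of steps (1) to (3) can be sketched with a toy in-memory shard. This is an illustration under assumptions, not the storage engine of the embodiment: the class `Shard`, its attributes, the JSON encoding of WAL entries and the integer LSN are all hypothetical, and flushing and compaction (steps 4 and 5) are omitted.

```python
import itertools
import json
import time


class Shard:
    def __init__(self):
        self.wal = []          # stands in for the persisted write-ahead log
        self.table_mem = {}    # MemTable of the target data table
        self.binlog_mem = {}   # MemTable of the update log, keyed by LSN
        self._lsn = itertools.count(1)

    def write(self, key, fields):
        # Step (1): assign LSNs and build the update-log rows before the WAL.
        ts_us = time.time_ns() // 1_000
        if key in self.table_mem:
            rows = [(fields, 7), (self.table_mem[key], 3)]
        else:
            rows = [(fields, 5)]
        log_rows = [(next(self._lsn), [key, *content], etype, ts_us)
                    for content, etype in rows]
        # Step (2): one WAL entry carries both the table change and the log rows.
        self.wal.append(json.dumps({"put": [key, list(fields)], "binlog": log_rows}))
        # Step (3): apply both changes to their MemTables.
        self.table_mem[key] = tuple(fields)
        for lsn, content, etype, ts in log_rows:
            self.binlog_mem[lsn] = (content, etype, ts)

    @classmethod
    def recover(cls, wal):
        """Rebuild both MemTables by replaying the WAL after a crash."""
        shard = cls()
        for line in wal:
            entry = json.loads(line)
            key, fields = entry["put"]
            shard.table_mem[key] = tuple(fields)
            for lsn, content, etype, ts in entry["binlog"]:
                shard.binlog_mem[lsn] = (content, etype, ts)
        shard.wal = list(wal)
        shard._lsn = itertools.count(max(shard.binlog_mem, default=0) + 1)
        return shard
```

Because each WAL entry contains the table change together with its update-log rows, replaying the WAL restores the table and the update log to mutually consistent states, which is the correspondence property described above.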
Fig. 4 schematically illustrates a schematic diagram of a data processing method according to another embodiment of the present invention, where the data processing method may include the following steps:
401, acquiring an update log by using a preset interface, wherein the update log is generated by writing modification information generated by a modification operation in the case that the modification operation for a target data table is monitored, and the target data table is determined by a data processing request;
and 402, performing consumption processing on the update log.
According to an embodiment of the present invention, the data processing method shown in fig. 4 may be performed by a log receiver such as a database, a data lake, a real-time data warehouse, an offline data warehouse, and the like.
According to an embodiment of the present invention, the consumption processing performed by the log receiver on the received update log may include driving a real-time computing pipeline with the update log as a data source. Specifically, after receiving the update log, the log receiver may use the received update log as a trigger condition to start a real-time computing thread that performs real-time calculation processing.
According to the embodiment of the present invention, the real-time calculation processing may be performed with respect to the update log, but not limited thereto, and may be performed with respect to a target data table corresponding to the update log, or may be performed with respect to other service tables having an association relationship with the target data table.
In one possible implementation, the consumption processing of the update log may be implemented as:
determining a first field contained in the update log;
determining a second field corresponding to the first field from a first data table having an association relationship with the target data table based on the first field;
and writing the first field and the second field into a third data table.
According to embodiments of the present invention, in some scenarios data may be partitioned into multiple data tables by subject, dimension, and so on. When big data analysis is performed, different data tables need to be associated in order to analyze the data as a whole, a process that may be referred to as data widening.
In one possible implementation manner, for example, in the e-commerce field, the target data table may be an order table, in which information such as a user id, a commodity id and the like is recorded, and price information of the commodity may be recorded in a first data table, which may be a price dimension table. When sales calculation and analysis are needed, the data in the target data table and the first data table are needed to be associated.
Based on this, when a new order is generated, the target data table may be modified by a modification operation, for example, when order information of the new order is written into the target data table, an update log may be generated according to the modification operation, and the update log may be sent to the log receiver by using a preset interface. After the log receiving party obtains the update log, the first data table may be searched based on the commodity id in the order information recorded in the update log, so as to determine the price of the commodity corresponding to the commodity id recorded in the first data table, and the price of the commodity and the order information are written into the summary table, that is, the third data table, where the commodity id may be a first field and the price of the commodity may be a second field.
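The data widening step performed by the log receiver can be sketched as follows. The function name `widen`, the column layout of the order record (order id, user id, commodity id), the choice to process only inserted records (field value 5), and the sample values are assumptions made for illustration.

```python
INSERT = 5  # field value of the insert modification type


def widen(update_log, price_table, wide_table):
    """Join each newly inserted order with the price dimension table."""
    for lsn, content, event_type, ts_us in update_log:
        if event_type != INSERT:
            continue                              # this sketch handles new orders only
        order_id, user_id, commodity_id = content
        price = price_table.get(commodity_id)     # second field, looked up by the first field
        wide_table[order_id] = {
            "user_id": user_id,
            "commodity_id": commodity_id,
            "price": price,
        }
    return wide_table


# Hypothetical sample data: the first data table (price dimension table)
# and one update-log row generated by writing a new order.
price_table = {"item-1": 19.9}
update_log = [(1, ("order-1", "user-1", "item-1"), 5, 1656408346430986)]
summary = widen(update_log, price_table, {})
```

Here the commodity id read from the update log plays the role of the first field, the price looked up in the first data table is the second field, and `summary` stands in for the third data table.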
Fig. 5 schematically illustrates a block diagram of a data processing apparatus according to an embodiment of the present invention. As illustrated in fig. 5, a data processing apparatus 500 may include a first determining module 501, a log creating module 502, an information writing module 503, and a log transmitting module 504.
A first determining module 501, configured to determine a target data table to which the data processing request requests access;
a log creation module 502, configured to, when a modification operation for the target data table is monitored, generate an update log based on modification information generated by the modification operation;
the log sending module 503 is configured to send the update log to the log receiving party by using a pre-configured data interface, so that the log receiving party starts a data processing thread in response to receiving the update log, and performs real-time calculation processing on the target data table by using the data processing thread.
According to an embodiment of the present invention, the log creation module 502 includes:
a data table creation sub-module, configured to create a log data table uniquely corresponding to the target data table in response to the modification operation;
the information acquisition sub-module is used for acquiring the modification information generated by the modification operation;
and the information writing sub-module is used for writing the modification information into the log data table to obtain an update log.
According to an embodiment of the present invention, the modification information includes modification contents and modification types;
according to an embodiment of the present invention, an information writing sub-module includes:
a first determining unit configured to determine modification content corresponding to the modification operation and a modification type corresponding to the modification operation;
and the first writing unit is used for writing the modification type and the modification content into a log data table to obtain the update log.
According to an embodiment of the present invention, the first determining unit includes:
a first acquisition subunit configured to acquire a primary key in the modified content;
a search unit for searching the target data table to determine whether the target data table includes a target primary key corresponding to the primary key;
a first determining subunit, configured to determine, when the target data table includes a target primary key corresponding to the primary key, that the modification type is a first modification type, where the first modification type characterizes modification of the data record of the target primary key in the target data table with the modification content;
and the second determining subunit is used for determining that the modification type is a second modification type in the condition that the target main key corresponding to the main key is not included in the target data table, and the second modification type represents writing modification contents into the target data table.
According to an embodiment of the present invention, in the case where the modification type is determined to be the first modification type, the first determination unit further includes:
a third determining subunit, configured to determine whether the modification content and the data record corresponding to the primary key have the same field;
a fourth determining subunit, configured to determine, when the modification content and the data record corresponding to the primary key have the same field, that the modification type is a third modification type, where the third modification type characterizes modification of only the fields in the data record that differ from the modification content;
and a fifth determining subunit, configured to determine that the modification type is a fourth modification type when the modification content and the data record corresponding to the primary key do not have the same field, where the fourth modification type characterizes replacement of the data record corresponding to the primary key with the modification content.
According to an embodiment of the present invention, the third modification type includes a first field characterizing the corresponding data record as a pre-update data record and a second field characterizing the corresponding data record as an updated data record.
In the case where the modification type is determined to be the third modification type, the first writing unit includes:
the first splicing subunit is used for splicing the first field with the data record corresponding to the target main key to generate a first data record;
The second splicing subunit is used for splicing the second field with the modified content to generate a second data record;
and the first writing subunit is used for writing the first data record and the second data record into the update log respectively.
According to an embodiment of the present invention, the data processing apparatus 500 further includes:
the time setting module is used for setting a survival time value for the update log;
the storage module is used for storing the survival time value and the update log into the cache;
the monitoring module is used for monitoring the life cycle of the update log based on the life time value;
and the deleting module is used for deleting the update log when the life cycle is finished.
According to an embodiment of the present invention, the data processing apparatus 500 further includes:
the time acquisition module is used for acquiring the system time in response to receiving the modification operation;
a timestamp generation module for generating a timestamp based on the system time;
the serial number distribution module is used for distributing log serial numbers to the update logs;
and the information writing module is used for writing the log serial number and the time stamp into the update log.
According to an embodiment of the present invention, an information writing module includes:
the information writing unit is used for writing the serial number and the time stamp of the log into the update log as fields respectively;
According to an embodiment of the present invention, the data processing apparatus 500 further includes:
the request acquisition module is used for acquiring a query request, and the query request is generated according to the log sequence number field;
the request analysis module is used for analyzing the query request and acquiring a log serial number field carried by the query request;
and the query module is used for querying the update log based on the log serial number field.
The data processing apparatus shown in fig. 5 may perform the data processing method described in the embodiment shown in fig. 1, and its implementation principle and technical effects are not repeated. The specific manner in which the respective modules and units of the data processing apparatus in the above embodiments perform operations has been described in detail in the embodiments related to the method, and will not be described in detail here.
Fig. 6 schematically illustrates a block diagram of a data processing apparatus according to another embodiment of the present invention, and as shown in fig. 6, a data processing apparatus 600 may include a log obtaining module 601 and a log consuming module 602.
A log obtaining module 601, configured to obtain an update log using a preset interface, where the update log is generated by writing modification information generated by a modification operation when the modification operation for a target data table is monitored, and the target data table is determined by a data processing request;
The log consuming module 602 is configured to perform consuming processing on the update log.
The data processing apparatus shown in fig. 6 may perform the data processing method described in the embodiment shown in fig. 4, and its implementation principle and technical effects are not repeated. The specific manner in which the respective modules and units of the data processing apparatus in the above embodiments perform operations has been described in detail in the embodiments related to the method, and will not be described in detail here.
In one possible design, the data processing apparatus provided by the embodiments of the present invention may be implemented as a computing device, which may include a storage component 701 and a processing component 702, as shown in fig. 7;
the storage component 701 stores one or more computer instructions for the processing component 702 to invoke and execute, so as to implement the data processing method provided by the embodiment of the present invention.
Of course, the computing device may also include other components, such as input/output interfaces, communication components, and the like. The input/output interface provides an interface between the processing component and a peripheral interface module, which may be an output device, an input device, etc. The communication component is configured to facilitate wired or wireless communication between the computing device and other devices, and the like.
The computing device may be a physical device or an elastic computing host provided by the cloud computing platform, and at this time, the computing device may be a cloud server, and the processing component, the storage component, and the like may be a base server resource rented or purchased from the cloud computing platform.
When the computing device is a physical device, the computing device may be implemented as a distributed cluster formed by a plurality of servers or terminal devices, or may be implemented as a single server or a single terminal device.
The embodiment of the invention also provides a computer readable storage medium which stores a computer program, and the computer program can realize the data processing method provided by the embodiment of the invention when being executed by a computer.
The embodiment of the invention also provides a computer program product, which comprises a computer program, wherein the computer program can realize the data processing method provided by the embodiment of the invention when being executed by a computer.
Wherein the processing components of the respective embodiments above may include one or more processors to execute computer instructions to perform all or part of the steps of the methods described above. Of course, the processing component may also be implemented as one or more Application Specific Integrated Circuits (ASICs), digital Signal Processors (DSPs), digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors or other electronic elements for executing the methods described above.
The storage component is configured to store various types of data to support operation of the device. The storage component may be implemented by any type of volatile or nonvolatile memory device or a combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or a magnetic or optical disk.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (12)

1. A method of data processing, comprising:
determining a target data table to which the data processing request requests access;
generating an update log based on modification information generated by a modification operation when the modification operation for the target data table is monitored;
and sending the update log to a log receiver by utilizing a pre-configured data interface so that the log receiver consumes the update log.
2. The method of claim 1, wherein the generating an update log based on modification information generated by a modification operation when the modification operation is monitored for the target data table comprises:
creating a log data table uniquely corresponding to the target data table in response to the modifying operation;
acquiring modification information generated by the modification operation;
and writing the modification information into the log data table to obtain the update log.
3. The method of claim 2, wherein the modification information includes modification content and a modification type;
and wherein writing the modification information into the log data table to obtain the update log comprises:
determining the modification content corresponding to the modification operation and the modification type corresponding to the modification operation;
and writing the modification type and the modification content into the log data table to obtain the update log.
4. The method of claim 3, wherein the modification type is determined by:
acquiring a primary key in the modification content;
searching the target data table to determine whether the target data table includes a target primary key corresponding to the primary key;
if yes, determining that the modification type is a first modification type, wherein the first modification type represents modifying, by using the modification content, the data record of the target primary key in the target data table;
if not, determining that the modification type is a second modification type, wherein the second modification type represents writing the modification content into the target data table.
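The primary-key check of claim 4 amounts to a lookup followed by a two-way branch. The following is a minimal sketch, assuming the target data table is a dictionary keyed by primary key and that the modification content carries its key in a field named `id`; the function name, the field name, and the string labels are hypothetical.

```python
from typing import Dict

FIRST_TYPE = "first"    # modify the existing record with this primary key
SECOND_TYPE = "second"  # write the content as a new record


def classify_modification(
    target_table: Dict[str, dict],
    modification_content: dict,
    key_field: str = "id",
) -> str:
    """Return the modification type for one piece of modification content."""
    # Acquire the primary key carried by the modification content.
    primary_key = modification_content[key_field]
    # Search the target data table for a matching primary key.
    if primary_key in target_table:
        return FIRST_TYPE
    return SECOND_TYPE
```

For example, with a table holding only key `"1"`, content carrying `id` `"1"` is classified as the first type, and content carrying `id` `"2"` as the second type.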
5. The method of claim 4, wherein, in the case where the modification type is determined to be the first modification type, the method further comprises:
determining whether the data record corresponding to the primary key of the modification content has a field identical to a field in the modification content;
if yes, determining that the modification type is a third modification type, wherein the third modification type represents modifying only those fields in the data record that differ from the modification content;
if not, determining that the modification type is a fourth modification type, wherein the fourth modification type represents replacing the data record corresponding to the primary key with the modification content.
6. The method of claim 5, wherein the third modification type comprises a first field characterizing the corresponding data record as a pre-update data record and a second field characterizing the corresponding data record as an updated data record;
in the case where the modification type is determined to be the third modification type, writing the modification type and the modification content to the update log comprises:
splicing the first field with the data record corresponding to the target primary key to generate a first data record;
splicing the second field with the modification content to generate a second data record;
and writing the first data record and the second data record into the update log respectively.
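Claims 5 and 6 can be read as a refinement step followed by the construction of a before/after pair. The sketch below is one interpretation, under stated assumptions: records are dictionaries of non-key fields, "has the same field" is taken to mean that at least one field has an identical name and value in both the existing record and the modification content, and the marker field name `_change` with values `before` and `after` is hypothetical.

```python
from typing import List

THIRD_TYPE = "third"    # partial update: only differing fields change
FOURTH_TYPE = "fourth"  # full replacement of the existing record

BEFORE, AFTER = "before", "after"  # assumed marker values


def refine_type(existing: dict, modification: dict) -> str:
    """Choose between partial update and full replacement."""
    has_same_field = any(
        name in existing and existing[name] == value
        for name, value in modification.items()
    )
    return THIRD_TYPE if has_same_field else FOURTH_TYPE


def before_after_records(existing: dict, modification: dict) -> List[dict]:
    """Splice a marker onto each record to form the two log records."""
    first_record = {**existing, "_change": BEFORE}        # pre-update image
    second_record = {**modification, "_change": AFTER}    # updated image
    return [first_record, second_record]


# Usage: one field ("qty") is unchanged, so this is a partial update.
existing = {"status": "new", "qty": 2}
modification = {"status": "paid", "qty": 2}
update_log: List[dict] = []
if refine_type(existing, modification) == THIRD_TYPE:
    update_log.extend(before_after_records(existing, modification))
```

Writing both images lets a consumer reconstruct exactly which fields changed without reading the target data table again.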
7. The method according to any one of claims 1 to 6, further comprising:
setting a time-to-live value for the update log;
storing the time-to-live value and the update log into a cache;
monitoring a lifecycle of the update log based on the time-to-live value;
and deleting the update log when the life cycle is finished.
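The time-to-live handling of claim 7 can be sketched as a small cache that stores each update log together with its time-to-live value and deletes it once its lifecycle ends. This is an illustrative sketch only: the class name `UpdateLogCache`, its methods, and the injectable clock are assumptions, and eviction here happens lazily on access rather than through a background monitor.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class UpdateLogCache:
    """Cache of update logs, each with its own time-to-live value."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # log id -> (creation time, time-to-live in seconds, update log)
        self._entries: Dict[str, Tuple[float, float, dict]] = {}

    def put(self, log_id: str, update_log: dict, ttl_seconds: float) -> None:
        # Store the time-to-live value together with the update log.
        self._entries[log_id] = (self._clock(), ttl_seconds, update_log)

    def evict_expired(self) -> int:
        """Delete every update log whose lifecycle has ended."""
        now = self._clock()
        expired = [
            log_id
            for log_id, (created, ttl, _) in self._entries.items()
            if now - created >= ttl
        ]
        for log_id in expired:
            del self._entries[log_id]
        return len(expired)

    def get(self, log_id: str) -> Optional[dict]:
        self.evict_expired()
        entry = self._entries.get(log_id)
        return entry[2] if entry is not None else None
```

Passing a fake clock to the constructor makes the expiry behaviour testable without waiting for real time to pass.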
8. The method of claim 7, wherein the method further comprises:
acquiring a system time in response to receiving the modification operation;
generating a timestamp based on the system time;
assigning a log sequence number to the update log;
and writing the log sequence number and the timestamp into the update log.
9. The method of claim 8, wherein the writing the log sequence number and the timestamp to the update log comprises:
writing the log sequence number and the timestamp into the update log as separate fields;
and wherein the method further comprises:
acquiring a query request, wherein the query request is generated according to a log sequence number field;
analyzing the query request to acquire the log sequence number field carried by the query request;
querying the update log based on the log sequence number field.
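Claims 8 and 9 together describe stamping each update log entry with a timestamp and a log sequence number, then looking entries up by that number. A minimal sketch follows, assuming sequence numbers are consecutive integers starting at 1, timestamps are integer milliseconds derived from the system time, and the query request is a dictionary with an `lsn` key; all of these choices and the class name `UpdateLog` are hypothetical.

```python
import itertools
import time
from typing import Dict, Optional


class UpdateLog:
    """Update log whose entries carry 'lsn' and 'timestamp' fields."""

    def __init__(self) -> None:
        self._next_lsn = itertools.count(1)
        self._by_lsn: Dict[int, dict] = {}

    def append(self, content: dict, system_time: Optional[float] = None) -> dict:
        # Acquire the system time when the modification is received.
        if system_time is None:
            system_time = time.time()
        entry = {
            "lsn": next(self._next_lsn),             # log sequence number
            "timestamp": int(system_time * 1000),    # milliseconds
            "content": dict(content),
        }
        self._by_lsn[entry["lsn"]] = entry
        return entry

    def query(self, request: dict) -> Optional[dict]:
        # Parse the query request to obtain the log sequence number field.
        lsn = request["lsn"]
        return self._by_lsn.get(lsn)
```

Because the sequence number is stored as an ordinary field, a receiver could also use it as a resume position, although the claims do not state this.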
10. A data processing method, comprising:
obtaining an update log through a preset interface, wherein the update log is generated by writing modification information generated by a modification operation when the modification operation on a target data table is monitored, and the target data table is determined by a data processing request;
and consuming the update log.
11. A computing device comprising a processing component and a storage component;
the storage component stores one or more computer instructions; the one or more computer instructions are configured to be invoked by the processing component to perform a data processing method according to any one of claims 1 to 9, or to perform a data processing method according to claim 10.
12. A computer storage medium storing a computer program which, when executed by a computer, implements the data processing method according to any one of claims 1 to 9 or the data processing method according to claim 10.
CN202211551865.8A 2022-12-05 2022-12-05 Data processing method, computing device and computer storage medium Pending CN116225822A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211551865.8A CN116225822A (en) 2022-12-05 2022-12-05 Data processing method, computing device and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211551865.8A CN116225822A (en) 2022-12-05 2022-12-05 Data processing method, computing device and computer storage medium

Publications (1)

Publication Number Publication Date
CN116225822A true CN116225822A (en) 2023-06-06

Family

ID=86571926

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211551865.8A Pending CN116225822A (en) 2022-12-05 2022-12-05 Data processing method, computing device and computer storage medium

Country Status (1)

Country Link
CN (1) CN116225822A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination