WO2017008657A1 - Method and system for acquiring a data snapshot


Info

Publication number
WO2017008657A1
WO2017008657A1 (PCT/CN2016/088518)
Authority
WO
WIPO (PCT)
Prior art keywords
data
data record
snapshot
change
change data
Prior art date
Application number
PCT/CN2016/088518
Other languages
English (en)
French (fr)
Inventor
Deng Xiaoyong (邓小勇)
Original Assignee
Alibaba Group Holding Limited (阿里巴巴集团控股有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Limited
Publication of WO2017008657A1
Priority to US15/870,675 (published as US20180137134A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/11 File system administration, e.g. details of archiving or snapshots
    • G06F16/128 Details of file system snapshots on the file-level, e.g. snapshot creation, administration, deletion
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1471 Saving, restoring, recovering or retrying involving logging of persistent data for recovery
    • G06F16/17 Details of further file system functions
    • G06F16/1734 Details of monitoring file system events, e.g. by the use of hooks, filter drivers, logs
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G06F2201/00 Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/80 Database-specific techniques
    • G06F2201/84 Using snapshots, i.e. a logical point-in-time copy of the data

Definitions

  • the present application relates to the field of data storage technologies, and in particular, to a method for acquiring a data snapshot, and a system for acquiring a data snapshot.
  • EAI: Enterprise Application Integration.
  • ETL, the abbreviation of Extraction-Transformation-Loading, denotes the process of extracting, transforming and loading data, and is an important part of building a data warehouse.
  • ETL is the process of extracting and cleaning the data of a business system, then transforming and loading it into a data warehouse. Its purpose is to integrate the scattered, disorderly and non-uniform data in an enterprise so as to provide an analysis basis for enterprise decision-making.
  • in the prior art, the commonly used data extraction means is offline extraction, which relies on the data interface provided by the source end to traverse all of the data and extract what is desired; traditional source-side extraction (such as a select query) belongs to offline extraction. Since offline extraction traverses the current state of the source data, a data snapshot can be obtained directly without further data calculation.
  • however, the snapshot time point cannot be specified: the execution time of an offline extraction is only an approximate time (or some time after the specified time), so a snapshot at an exactly specified time point cannot be obtained;
  • the technical problem to be solved by the embodiments of the present application is to provide a method for acquiring a data snapshot, which is used to accurately obtain a snapshot of data at a specified time.
  • the embodiment of the present application further provides a system for acquiring a data snapshot to ensure implementation and application of the foregoing method.
  • the embodiment of the present application discloses a method for acquiring a data snapshot, where the method includes:
  • the first data snapshot corresponding to the specified time is obtained according to the change data record.
  • the method further includes:
  • if the specified location has a corresponding second data snapshot that was acquired before, the first data snapshot and the second data snapshot are organized into a third data snapshot of the specified time.
  • the method further includes:
  • tag information is added to the change data record as needed.
  • the change data record includes a primary key identifier, and the step of acquiring, in the destination end when the preset specified time is reached, the first data snapshot corresponding to the specified time according to the change data record includes:
  • the change data record further includes a data number, and the step of performing reverse sorting on the change data records of each primary key identifier when the specified time is reached includes:
  • the change data record for each primary key identifier is sorted according to the data number.
  • the step of loading the change data record to the destination end comprises:
  • the change data record, after data format conversion, is loaded by the subscription program into the data table of the corresponding destination end.
  • the change data record includes a source change time stamp
  • the step of loading, by the subscription program, the format-converted change data record into the data table of the corresponding destination end includes:
  • the change data record is written into a corresponding storage partition, and an identifier corresponding to the storage partition is written in a position corresponding to the change data record in the data table.
  • the method further includes:
  • there are multiple source ends, and according to the redo log (redolog) files recorded in the source ends, the step of obtaining the change data record includes:
  • the redolog files recorded in the plurality of source ends, starting from the specified location, are collectively collected into the publishing program, and the redolog file is parsed by the publishing program to obtain the corresponding change data record.
  • the first data snapshot includes a first primary key identifier, a change type, and first value information
  • the second data snapshot includes a second primary key identifier and second numeric information
  • the step of organizing the first data snapshot and the second data snapshot into the third data snapshot of the specified time includes:
  • if the first primary key identifier is empty, the second primary key identifier and the second numerical value information are organized into the third data snapshot;
  • if the first primary key identifier is not empty, the first primary key identifier and the first numerical value information are organized into the third data snapshot.
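The snapshot-merging rule in the preceding steps can be sketched as a short Python function. The record layout (a dict from primary key identifier to a (change type, value) pair for the incremental snapshot) and the handling of DELETE rows are illustrative assumptions, not the patent's prescribed schema:

```python
def full_merge(incremental, base):
    """Organize a third data snapshot from a first (incremental) snapshot
    and a previously acquired second (base) snapshot.

    Per the rule above: where the first primary key identifier is empty
    (no incremental entry for that key), the base row is kept; where it is
    not empty, the incremental row wins.  Dropping DELETE rows is an added
    assumption.
    """
    merged = dict(base)  # start from the second (base) snapshot
    for key, (change_type, value) in incremental.items():
        if change_type == "DELETE":
            merged.pop(key, None)  # final state is a delete: remove the row
        else:
            merged[key] = value    # INSERT/UPDATE: incremental row wins
    return merged
```

For example, merging a base snapshot {1: 'a', 2: 'b'} with incremental changes that delete key 1, update key 2 and insert key 3 yields a third snapshot containing only keys 2 and 3.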
  • the embodiment of the present application further discloses a system for acquiring a data snapshot, where the system includes:
  • a data record obtaining module configured to obtain a corresponding change data record based on a redo log redolog file starting from a specified location recorded in the source end;
  • a data loading module configured to load the change data record to a corresponding destination end
  • the snapshot obtaining module is configured to acquire, according to the change data record, a first data snapshot corresponding to the specified time, when the preset specified time is reached.
  • the system further comprises:
  • a merging module configured to organize the first data snapshot and the second data snapshot into a third data snapshot of the specified time when the specified location has a corresponding, previously acquired second data snapshot.
  • the system further comprises:
  • a tag adding module configured to add tag information to the change data record as needed when loading the change data record to the destination end.
  • the change data record includes a primary key identifier
  • the snapshot obtaining module includes:
  • a determining submodule configured to detect the label information in the destination end, and determine whether the specified time is reached
  • a sorting submodule configured to perform a reverse sorting on the change data record identified by each primary key when the specified time is reached;
  • the snapshot obtaining sub-module is configured to obtain, for each primary key identifier, the data record sorted in first place as the first data snapshot of that primary key identifier.
  • the change data record further includes a data number
  • the ordering sub-module is further configured to:
  • the change data record for each primary key identifier is sorted according to the data number.
  • the data loading module comprises:
  • a publishing submodule for publishing the change data record to a preset subscription program via a preset publishing program, wherein the subscription program is a program that subscribes to the change data records from the publishing program;
  • a format conversion submodule configured by the subscriber to convert the change data record into a data format required by the destination end
  • the change data record includes a source change time stamp
  • the format conversion submodule includes:
  • a partition determining unit configured to determine, by the subscriber, a storage partition to which the change data record belongs according to a source change timestamp of each change data record, where the storage partition includes a storage partition identifier;
  • a writing unit configured to write the change data record into a corresponding storage partition, and write an identifier corresponding to the storage partition into a position corresponding to the change data record in the data table.
  • the system further comprises:
  • a storage module for storing change data records within a preset time period.
  • there are multiple source ends
  • the data record obtaining module includes:
  • the collecting sub-module is configured to collect the redolog files, starting from the specified location, recorded in the plurality of source ends into the publishing program; the redolog file is parsed by the publishing program to obtain the corresponding change data record.
  • the first data snapshot includes a first primary key identifier, a change type, and first value information
  • the second data snapshot includes a second primary key identifier and second numeric information
  • the merge module includes:
  • a merging submodule configured to merge the first data snapshot and the second data snapshot according to the first primary key identifier and/or the second primary key identifier;
  • a first organization submodule configured to organize the second primary key identifier and the second numerical value information into a third data snapshot when the first primary key identifier is empty;
  • the second organization submodule is configured to organize the first primary key identifier and the first numerical value information into a third data snapshot when the first primary key identifier is not empty.
  • the embodiments of the present application include the following advantages:
  • the change flow obtained from the redolog file is ordered in time for same-key data, and after calculation a snapshot of the data at any time point can be obtained;
  • Isolated production: the embodiment of the present application reads the redolog file instead of going through the source data access interface, and therefore does not affect online production.
  • the embodiment of the present application can be used with any storage system that has a redolog file; as long as the change flow can be obtained from the redolog, all of the above technical effects can be achieved.
  • FIG. 1 is a flow chart showing the steps of Embodiment 1 of a method for acquiring a data snapshot according to the present application;
  • FIG. 2 is a schematic diagram of the incremental merge code in Embodiment 1 of a method for acquiring a data snapshot according to the present application;
  • FIG. 3 is a flow chart of the steps of Embodiment 2 of a method for acquiring a data snapshot according to the present application;
  • FIG. 4 is a schematic diagram of the full merge code in Embodiment 2 of a method for acquiring a data snapshot according to the present application;
  • FIG. 5 is a structural block diagram of an embodiment of a system for acquiring a data snapshot according to the present application.
  • Referring to FIG. 1, a flow chart of the steps of Embodiment 1 of the data snapshot acquisition method of the present application is shown; the method may include the following steps:
  • Step 101 Acquire a corresponding change data record based on a redo log redolog file that is recorded in the source end and starts at a specified location;
  • the embodiment of the present application can be applied to an ETL system.
  • the application can obtain a change data record based on the redo log redolog file recorded in the source end.
  • the storage system at the source end (such as MySQL, Oracle, HBase, etc.) stores structured data (i.e., row data); structured data is data expressed by a two-dimensional table logical structure in the storage system. Such a storage system generally has a redolog file.
  • the redolog file is how structured storage records the change process of its own data into a log. Specifically, each time the database system performs an update operation it causes a database change, so a certain amount of redo log is generated and recorded in the redo log (redolog) file; when the database fails or its media fails, the redolog file can be used to recover the database.
  • in the embodiment of the present application, the redolog file is used to obtain the change flow of the data in the source end.
  • step 101 may specifically be:
  • Sub-step S11 the redolog files recorded in the plurality of source ends starting from the specified location are collectively collected into a preset publishing program, and the redolog file is parsed by the publishing program to obtain a corresponding change data record.
  • the redolog files of the multiple source ends can be collected into a preset publishing program, and the redolog file is parsed by the publishing program rather than at the source end that produced it, thereby forming good isolation from the source end and saving resource consumption at the source end.
  • in a specific implementation, for MySQL the binary log (binlog) may be collected by way of master-slave replication; a source end that already has a redolog replication scheme may be connected to directly.
  • each time collection finishes, the cutoff position can be marked. The marked position, or the position one unit of data after it, can be used as the specified location, and the next collection starts from that specified location. In a specific implementation, the mark can be based on a time parameter; for example, if the previous collection of the redolog file stopped at 8:00 on April 20, 2015, that moment is marked, and the specified location is 8:00 on April 20, 2015. Alternatively, the mark can be based on a file number parameter; for example, if the previous collection stopped at the position numbered "4", the specified location can be the next position, i.e., the position numbered "5".
  • of course, the specified location may also be determined in other manners; the embodiment of the present application is not limited thereto.
  • after collecting the redolog file, the publishing program further parses it into a recognizable data format to obtain the corresponding change data records.
  • database changes are continuously appended to the redolog file, that is, the redolog records the data change flow; after the redolog is parsed, the corresponding change flow records are obtained.
  • the recognizable format can be an Excel-like row-and-column format (one row is one record, different columns have different meanings), a JSON format (one JSON object is one record, different internal keys have different meanings), a program object containing the records, or another serialized format such as protobuf, avro, and so on.
  • this application can also adopt different parsing methods for different sources, such as Oracle LogMiner, MySQL's mysqlbinlog, and so on.
  • the meta information can include the primary key identifier, the change type (for example insert INSERT, update UPDATE, delete DELETE, etc.; for DELETE changes, at least the changed primary key identifier is given), the source library name, the source table name, the source change timestamp, and so on; the value information can be recorded in a List<Field>, where each Field object also has its own meta information, such as the data type (INT, FLOAT, DOUBLE, STRING, DATE, BLOB, etc.), the Field name, the Field encoding, the specific Field value, and so on.
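A parsed change data record carrying the meta information and List<Field> value information described above might be modeled as follows (a minimal sketch; all class and attribute names are illustrative, not the patent's):

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class Field:
    # Each Field object carries its own meta information plus the value.
    data_type: str  # e.g. INT, FLOAT, DOUBLE, STRING, DATE, BLOB
    name: str       # Field name
    encoding: str   # Field encoding
    value: Any      # the specific Field value


@dataclass
class ChangeDataRecord:
    # Meta information parsed from the redolog file.
    primary_key: str       # primary key identifier
    change_type: str       # INSERT, UPDATE or DELETE
    source_db: str         # source library name
    source_table: str      # source table name
    source_change_ts: int  # source change timestamp
    # Value information; for DELETE changes this list may be empty.
    fields: List[Field] = field(default_factory=list)
```

Any serialization (JSON, protobuf, avro) of such an object would carry the same meta information.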
  • the embodiment of the present application is not limited to the centralized collection mode described above; those skilled in the art may also process the redolog files separately according to the location of each sub-database, that is, parse the local redolog file at each source end.
  • the advantage is that this saves the online transmission of redolog files (there is no need to transfer them to the publishing program), but this approach consumes various resources (such as CPU, memory, network, etc.) at each source end and will affect online services.
  • the embodiment of the present application may further store the change data record in the preset time period.
  • in practice, abnormalities such as user logic errors, data loss or system downtime may occur, while the real-time data stream keeps moving forward; after such an abnormal situation occurs, the entire flow position needs to be rolled back to a safe time point (within the preset time period) and the flow data re-pulled (which can be regarded as a data backfill process).
  • in this case, the change data records can be pulled directly from the cached pipeline data (provided, of course, that the rollback time point is within the cached time range), without having to re-pull the redolog file from the source end and parse it again.
  • the parsed change data records of the same primary key are arranged in the actual change order, that is, the same key data is ordered.
  • note that this application only guarantees ordering within a single key; between two keys, two tables or two libraries there may be a time difference, so scenarios that require data transactionality are not supported. A data-transactional scenario refers to a batch of data operations that must either all succeed or all fail. For example, if a bank needs to increase each account in a certain batch (say 10 accounts) by 100 yuan, then all 10 accounts must be increased by 100 yuan or none of them; it must not happen that some are increased and some are not, since it would otherwise be difficult to find out which operations succeeded and which failed, bringing inconvenience to subsequent operations.
  • in this way, the data extraction process obtains the change data records by extracting the redolog file, and can sort same-key data by time without traversing the entire data set, greatly improving the ETL speed and making it possible to obtain a data snapshot of any specified time (explained below).
  • Step 102 Load the change data record to a corresponding destination end
  • the change data record may be loaded into the corresponding destination.
  • step 102 may include the following sub-steps:
  • Sub-step S21, the change data record is published to a preset subscription program by a preset publishing program, wherein the subscription program is a program that subscribes to the change data records from the publishing program;
  • the embodiment of the present application also provides a subscription program: the publishing program focuses on parsing and publishing the redolog file and does not concern itself with the flow and use of the data after publication;
  • the subscription program is used to subscribe to the data published by the publishing program and deliver it to the destination end.
  • the embodiment of the present application implements the parsing and loading operations through two different programs, thereby achieving loose coupling between systems. In a distributed environment the amount of data may be very large; considering the performance of the subscription program, multiple subscription programs can process the change data records published by the publishing program. For example, if 10 subscribers subscribe to the same batch of data, each subscriber can be responsible for 1/10 of the data volume.
  • if a source end performs no change operation for a long time, it is difficult for the subscription program to determine whether no new data has arrived or the parsing and publishing process has a problem.
  • therefore, the publishing program can continuously issue a heartbeat to the subscription program (such as once per second) to ensure that the data time advances.
  • Sub-step S22 the change data record is converted by the subscribing program into a data format required by the destination end;
  • in a specific implementation, the change data record can be converted into the data format required by the destination end; this format conversion is the link between the calculation from the source end and the writing to the destination end, and specifically includes format conversion of the meta information and format conversion of the value information.
  • the change data record can be converted to the format of Table 1 below:
  • Multi-bit cyclic synchronization sequence number: that is, the data number of the change data record. The field has a fixed length, and the synchronization sequence numbers are allocated cyclically; the number of digits depends on the granularity of the source-side change timestamp and the maximum number of changes a single table can undergo within that granularity. For example, if the granularity of the source change timestamp is 1 second and the maximum number of changes to a table within 1 second is assumed to be 50,000, then the sequence number needs 5 digits or more, e.g., 00001. Since the source change timestamp of the next second differs from that of the previous second's data, the multi-bit synchronization sequence number can then be recycled. This field is the sort field for same-key data at the destination end;
  • Source change timestamp: the real time of the source data change; its role is to determine the storage interval into which the change data record is written at the destination end. The sort field uses the above multi-bit cyclic synchronization sequence number rather than this field, because the resolution of the change time may be too coarse: even at microsecond granularity, the same key at the source may change multiple times within one microsecond, making the order indistinguishable;
  • Source library name and source table name: label the source of the data, to facilitate data analysis and problem location;
  • Change type: mainly includes the INSERT, UPDATE and DELETE types parsed as described above;
  • Field value: the specific value information obtained by parsing. Encoding conversion or type conversion may be performed here, but the value must remain high-fidelity.
  • when the change data record is written according to the above Table 1, it can be written in rows and columns; or as one whole line with built-in row and column separators (the separators need to be special characters, and when a separator appears in the data it must be replaced); or in another serialization format such as JSON, protobuf, etc., as long as the basic meta information is as described in Table 1 above.
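Allocation of the multi-bit cyclic synchronization sequence number can be sketched as below, assuming a 1-second timestamp granularity and a 5-digit width (i.e., at most 100,000 changes per table per second); the class name and API are illustrative, not the patent's:

```python
class SyncSequencer:
    """Allocate fixed-width synchronization sequence numbers that are
    recycled whenever the source change timestamp moves to a new granule,
    since the new timestamp already distinguishes those changes from the
    previous granule's."""

    def __init__(self, width=5):
        self.width = width
        self.last_ts = None
        self.counter = 0

    def next(self, source_change_ts):
        if source_change_ts != self.last_ts:
            # New timestamp granule: the counter can safely be reused.
            self.last_ts = source_change_ts
            self.counter = 0
        self.counter += 1
        # Fixed-length and zero-padded, so it sorts correctly as a string.
        return str(self.counter % 10 ** self.width).zfill(self.width)
```

Sorting (source change timestamp, sequence number) pairs then reproduces the change order of each key even when the timestamp alone cannot distinguish two changes.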
  • Sub-step S23, the format-converted change data record is loaded by the subscription program into the data table of the corresponding destination end.
  • the change data record after the conversion format can be loaded into the data table of the corresponding destination.
  • the sub-step S23 may further include the following sub-steps:
  • Sub-step S231, the storage partition to which each change data record belongs is determined by the subscription program according to the record's source change timestamp;
  • the embodiment of the present application divides the data storage into multiple storage partitions, each of which can store data of a certain time range. For example, if each storage partition can store 1 hour of data, the 24 hours of a day can be divided into 24 storage partitions: the first storage partition stores the data of 00:00-01:00, the second stores the data of 01:00-02:00, the third stores the data of 02:00-03:00, and so on.
  • in a specific implementation, the subscription program can determine the storage partition to which a change data record belongs from its source change timestamp; for example, if the source change timestamp of the current change data record is 01:30, the storage partition to which it belongs is the second storage partition.
  • Sub-step S232 the change data record is written into the corresponding storage partition, and the identifier corresponding to the storage partition is written in a position corresponding to the change data record in the data table.
  • after determining the storage partition to which the change data record belongs, the subscription program writes the change data record to that storage partition; for example, the change data record whose source change timestamp is 01:30 is written to the second storage partition, and the identifier of the second storage partition is written into the data table of the destination end.
  • the identifier of the storage partition may be a data number, for example "1", "2", etc., or the time range corresponding to the storage partition, that is, a range timestamp identifier, as shown in Table 2 below; the range timestamp can be expressed as, for example, 2014042009 (the identifier of the storage partition covering 9:00 to 10:00 on April 20, 2014).
  • the purpose of writing the identifier of the storage partition in the data table is to quickly locate the location where the data is located.
  • for SQL-class storage, the identifier of the storage partition can be expressed by a separate label-range field, such as the range timestamp shown in Table 2; for NoSQL-class storage, a partition can be used for the expression, for example something similar to a separate column, except that in NoSQL the field of this separate column belongs to the meta information rather than to the specific values.
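Deriving the range-timestamp identifier of the hourly storage partition from a source change timestamp can be sketched as follows (epoch-seconds input and UTC interpretation are assumptions; the identifier format follows the 2014042009 example above):

```python
from datetime import datetime, timezone


def storage_partition_id(source_change_ts):
    """Map a source change timestamp (epoch seconds) to the range-timestamp
    identifier of its hourly storage partition, e.g. 2014042009 for the
    partition holding 09:00-10:00 on 2014-04-20."""
    dt = datetime.fromtimestamp(source_change_ts, tz=timezone.utc)
    return int(dt.strftime("%Y%m%d%H"))
```

Truncating the timestamp to the hour makes records of the same hour share one partition identifier, which is what allows the destination end to locate data quickly.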
  • tag information is added to the change data record as needed.
  • in a specific implementation, tag information can be added to the change data records according to business requirements (i.e., "dotting" the stream), for example writing one dot for every hour of data written, or one dot when the specified time is reached.
  • the destination end can then judge from the dots whether the data is complete and whether the specified time has been reached.
  • it should be noted that the whole path from the publishing program's processing to the subscription program's processing needs to complete within a bounded time range (for example, 5 minutes), to ensure that the data processing time difference between different keys stays within a reasonable range and to avoid delayed subsequent calculation or incomplete, lost data.
  • otherwise, one piece of data may already be at 0:00 in the morning while other data is still at 20:00 of the previous evening; without a bounded time range, it cannot be determined globally at 0:00 which changes have been fully received.
  • Step 103 In the destination end, when the preset specified time is reached, the first data snapshot corresponding to the specified time is obtained according to the change data record.
  • in the destination end, the first data snapshot corresponding to the time specified by a user (for example, a developer, a data analyst, etc.) can be obtained; here, a data snapshot refers to a copy, stored at another location (i.e., the destination end), of the data in the source-end storage system at some point in time.
  • step 103 may include the following sub-steps:
  • Sub-step S31 detecting the tag information in the destination end, determining whether the specified time is reached;
  • if the dots are made according to the specified time, then when a dot is detected, the specified time has been reached and the change data records up to that dot have been fully loaded.
  • if the dots are made according to a time interval, for example one dot per hour, and the specified time is 04:00 counting from 00:00, then reading the fourth dot indicates that the specified time has arrived and the change data records up to the fourth dot have been fully loaded.
  • Sub-step S32, if the specified time is reached, the change data records of each primary key identifier are reverse sorted;
  • Sub-step S33, the data record sorted in first place for each primary key identifier is obtained as the first data snapshot of that primary key identifier.
  • in a specific implementation, the change data records of each primary key identifier may be sorted according to the multi-bit cyclic synchronization sequence number, that is, the data number; the preferred sorting mode in the embodiment of the present application is reverse sorting, so that the data record sorted in first place is the record of the final state of that primary key identifier.
  • the process of sub-step S32 and sub-step S33 is a process of incremental merging, and the first data snapshot is an incremental snapshot obtained by incrementally merging the change data records. Incremental merging means specifying the starting point T0 of the change flow (that is, the specified location in the embodiment of the present application) and combining the data of each same key in change order (that is, sorting the same-key data), thereby obtaining the full data at the cutoff time point T1 (that is, the final state at the specified time in the embodiment of the present application); the data in this final state is called an incremental snapshot.
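The incremental merge just described can be sketched in Python: group the change data records by primary key identifier, reverse sort each group by (source change timestamp, synchronization sequence number), and keep the record in first place as the final state. The tuple layout is illustrative, not the patent's exact schema:

```python
from collections import defaultdict


def incremental_merge(change_records):
    """Incrementally merge change data records into a first data snapshot.

    Each record is assumed to be a tuple of
    (primary_key, source_change_ts, sync_no, change_type, value).
    """
    groups = defaultdict(list)
    for rec in change_records:
        groups[rec[0]].append(rec)  # group by primary key identifier

    snapshot = {}
    for key, recs in groups.items():
        # Reverse sort so the record in first place is the final state.
        recs.sort(key=lambda r: (r[1], r[2]), reverse=True)
        snapshot[key] = recs[0]
    return snapshot
```

A real implementation would additionally drop keys whose final change type is DELETE when organizing the snapshot, as in the full-merge rule described earlier.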
  • The pseudo-SQL code of the first data snapshot acquisition process is shown in the incremental merge code diagram of FIG. 2. Line 9 selects all data in the data range (partition field), distributes same-key data together, and reverse sorts by primary key, source-end change timestamp, and synchronization sequence number (i.e., the multi-digit cyclic synchronization sequence number) to obtain temporary table a; the where condition on line 11 selects the first change record, which is finally inserted into the corresponding data range of the incremental table, yielding the first data snapshot of all primary key identifiers as of the latest moment before the specified time.
  • Step 1: According to the business requirements, select the redolog files in the range from the specified location to the specified time.
  • The specified location selected in this example is 2015-04-20 00:00:00, and the specified time is 2015-04-21 00:00:00; that is, select the redolog files in the time range [2015-04-20 00:00:00, 2015-04-21 00:00:00). After they are parsed, the change data records loaded to the destination end are as shown in Table 3 below.
  • Step 2: Group the change data records of Table 3 by primary key identifier and, within each group of same-key data, reverse sort by change sequence number (that is, the multi-digit cyclic synchronization sequence number); the result is shown in Table 4.
  • Step 3: Obtain the first-ranked data record of each primary key identifier to get the final state before 2015-04-21 00:00:00, that is, the first data snapshot of 2015-04-21 00:00:00, as shown in Table 5 below.
  • The time range of the above example is only one day; that is, the snapshot covers only one day of data.
  • In practice, the time range can be more fine-grained, such as a precise hour-level snapshot, and even a precise minute-level snapshot can be obtained.
  • Specify any snapshot time point: with the change stream obtained from the redolog file, same-key data can be sorted by time, and after computation a data snapshot at any time point can be obtained.
  • Isolated production: this embodiment reads the redolog file rather than going through the source-end data access interface, so it does not affect online production.
  • General applicability: this embodiment can be used with any storage system that has redolog files; as long as the change stream can be obtained from the redolog, all of the above technical effects can be achieved.
  • Referring to FIG. 3, a flow chart of the steps of the second embodiment of the data snapshot acquisition method of the present application is shown; the method may include the following steps:
  • Step 301 Acquire a corresponding change data record based on the redo log redolog file recorded in the source end and starting from the specified location;
  • step 301 may include the following sub-steps:
  • Sub-step S41: the redolog files recorded in the multiple source ends and starting from the specified location are collected centrally into a preset publishing program, and the redolog files are parsed by the publishing program to obtain the corresponding change data records.
  • The change data records obtained by parsing are not completely identical to the data recorded in the redolog files at the source end: the source-end data consists only of specific data fields, whereas a record may include two parts, meta information and value information.
  • The meta information may include a primary key identifier, a change type, a source-end database name, a source-end table name, a source-end change timestamp, and the like.
  • The value information may be recorded in a List<Field>, and each Field object also has its own meta information, such as the data type (INT, FLOAT, DOUBLE, STRING, DATE, BLOB, etc.), the Field name, the Field encoding, the specific Field value, and so on.
  • After the change data records are obtained, this embodiment may further store the change data records within a preset time period.
  • During operation, abnormalities such as user logic errors, data loss, or system downtime may occur, while the real-time data stream keeps moving forward; after recovering from such an abnormality, the whole stream position needs to be rolled back to a safe time point (the preset time period) and the stream data pulled again (which can be regarded as a data back-filling process).
  • If the stream data within the preset time period has been cached after parsing, the change data records can be pulled directly from the cached pipeline data (provided, of course, that the rollback time point falls within the cached time range), without re-pulling and re-parsing the redolog files from the source end.
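The rollback-and-replay idea can be sketched as a small time-windowed cache. The class and method names are illustrative assumptions, not part of the patent:

```python
import bisect

class ChangeCache:
    # Keeps parsed change records for a preset time window so a stream
    # position can be rolled back to a safe point and replayed without
    # re-pulling and re-parsing redolog files from the source end.
    def __init__(self, window_seconds):
        self.window = window_seconds
        self.entries = []               # sorted list of (timestamp, record)

    def append(self, ts, record):
        self.entries.append((ts, record))
        cutoff = ts - self.window       # evict records older than the window
        while self.entries and self.entries[0][0] < cutoff:
            self.entries.pop(0)

    def replay_from(self, safe_ts):
        # Return cached records at or after the safe time point; the
        # rollback point must fall inside the cached window.
        i = bisect.bisect_left(self.entries, (safe_ts,))
        return [rec for _, rec in self.entries[i:]]

cache = ChangeCache(window_seconds=3600)
for t in (100, 200, 300):
    cache.append(t, {"seq": t})
print(len(cache.replay_from(200)))   # 2 records replayed
```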
  • Step 302: Load the change data records to the corresponding destination end.
  • Since this embodiment is applied in a distributed environment, there may also be multiple destination ends; after the change data records are obtained, they may be loaded into the corresponding destination ends.
  • step 302 may include the following sub-steps:
  • Sub-step S51: the change data records are published by a preset publishing program to a preset subscribing program, where the subscribing program is a program that subscribes to the change data records from the publishing program;
  • Sub-step S52: the change data records are converted by the subscribing program into the data format required by the destination end;
  • Sub-step S53: the change data records in the converted data format are loaded by the subscribing program into the data table of the corresponding destination end.
  • the sub-step S53 may further include the following sub-steps:
  • Sub-step S531: the subscribing program determines, according to the source-end change timestamp of each change data record, the storage partition to which the change data record belongs;
  • Sub-step S532: the change data record is written into the corresponding storage partition, and the identifier corresponding to the storage partition is written into the position corresponding to the change data record in the data table.
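Sub-steps S531 and S532 can be illustrated with a short sketch. The hour-level partition identifier format (e.g. 2015042009) follows the tables later in this document; the function names and record layout are assumptions made for the example:

```python
from datetime import datetime

def partition_id(source_change_ts):
    # Map a source change timestamp to its hour-level storage partition
    # identifier, e.g. "2015-04-20 09:30:00" -> "2015042009".
    ts = datetime.strptime(source_change_ts, "%Y-%m-%d %H:%M:%S")
    return ts.strftime("%Y%m%d%H")

def load_record(record, partitions, table):
    # Write the record into its partition and store the partition identifier
    # next to the record in the destination data table.
    pid = partition_id(record["ts"])
    partitions.setdefault(pid, []).append(record)
    table.append(dict(record, partition=pid))

partitions, table = {}, []
load_record({"key": 1111, "ts": "2015-04-20 09:30:00", "value": 10000}, partitions, table)
print(table[0]["partition"])  # 2015042009
```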
  • When loading the change data records to the destination end, tag information may be added to the change data records as needed, in order to guarantee data integrity.
  • Step 303 in the destination end, when the preset specified time is reached, acquiring the first data snapshot corresponding to the specified time according to the change data record;
  • step 303 may include the following sub-steps:
  • Sub-step S61: detecting the tag information at the destination end, and determining whether the specified time has been reached;
  • Sub-step S62: if the specified time has been reached, reverse sorting the change data records of each primary key identifier;
  • Specifically, sub-step S62 may be: if the specified time has been reached, reverse sorting the change data records of each primary key identifier according to the data number.
  • Sub-step S63: obtaining, for each primary key identifier, the first-ranked data record as the first data snapshot of that primary key identifier.
  • Step 304 If the specified location has a corresponding second data snapshot that has been acquired before, the first data snapshot and the second data snapshot are organized into a third data snapshot of the specified time.
  • In a preferred embodiment, the first data snapshot includes a first primary key identifier, a change type, and first numerical information, and the second data snapshot includes a second primary key identifier and second numerical information.
  • Step 304 can include the following sub-steps:
  • Sub-step S71: if the specified location has a corresponding previously acquired second data snapshot, merging the first data snapshot and the second data snapshot according to the first primary key identifier and/or the second primary key identifier;
  • Applied to this embodiment, the second data snapshot is a full snapshot at the specified location.
  • The full snapshot is the result of fully merging the data at the specified location.
  • Full merging means first incrementally merging the data in the range from T0 to T1 and then merging the result into the full snapshot at time T0, thereby obtaining the full snapshot at time T1; by repeating this method, a full snapshot at each successive time point can be obtained. Therefore, if the specified location is the initial location, no second data snapshot exists at the specified location; otherwise, if the specified location is not the initial location, a second data snapshot exists at the specified location.
  • Referring to the full merge code diagram of FIG. 4, lines 10-13 perform the incremental merge to obtain the temporary incremental table incre, that is, the incremental snapshot of the specified time; line 8 obtains the full value table total of the previous business point, that is, the second data snapshot of the specified location; the two are joined with a full outer join whose join condition is that the primary keys are equal (line 14), thereby obtaining a full-plus-incremental wide table (with columns prefixed by total and incre respectively; for data without changes, the incre part is entirely null).
  • Sub-step S72: if the first primary key identifier is empty, organizing the second primary key identifier and the second numerical information into the third data snapshot;
  • Sub-step S73: if the first primary key identifier is not empty, organizing the first primary key identifier and the first numerical information into the third data snapshot.
  • The acquisition of the third data snapshot and the acquisition of the first data snapshot may be implemented by the same computing program, that is, in a single step one computing program completes the entire incremental-plus-full merge without storing the incremental snapshot; the pseudo-code in FIG. 4 takes this approach. They may also be implemented by two different computing programs, that is, after the first data snapshot is obtained, another computing program is started to compute the third data snapshot.
  • Step 1: according to the incremental merge, obtain the incremental final-state table (that is, the first data snapshot) of the business time point (2015-04-20), as shown in Table 6;
  • Step 2: obtain the full table of the previous day, that is, of the specified location (2015-04-19), as shown in Table 7;
  • Step 3: make a full outer join of Table 6 and Table 7, with the join condition based on the key (primary key), as shown in Table 8;
  • Step 4: filter the data according to the judgment logic, which is: retain data whose incremental primary key is empty or whose incremental change type is not DELETE, as shown in Table 9;
  • Step 5: take the values: when the incremental primary key is empty, all fields take the full values; otherwise, all fields take the incremental values. This yields the final full snapshot of 2015-04-20, as shown in Table 10.
  • Table 10:
    id (account) | value (balance)
    1111 | 7000
    3333 | 5000
    4444 | 6000
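The five merge steps above can be condensed into a Python sketch standing in for the full outer join of FIG. 4. The dictionary-based layout and function name are assumptions made for the example:

```python
def full_merge(prev_full, incremental):
    # Full-outer-join the previous full snapshot with the incremental snapshot
    # on the primary key, drop keys whose latest change is DELETE, and prefer
    # incremental values over full values.
    merged = {}
    for key in set(prev_full) | set(incremental):
        incr = incremental.get(key)
        if incr is None:              # no change in the period: keep full value
            merged[key] = prev_full[key]
        elif incr["op"] != "DELETE":  # changed: take the incremental value
            merged[key] = incr["value"]
        # op == DELETE: the key is filtered out of the new full snapshot
    return merged

prev_full = {1111: 5000, 2222: 3000, 4444: 6000}   # Table 7 (2015-04-19)
incremental = {                                    # Table 6 (2015-04-20)
    1111: {"op": "UPDATE", "value": 7000},
    2222: {"op": "DELETE", "value": None},
    3333: {"op": "INSERT", "value": 5000},
}
result = full_merge(prev_full, incremental)
print(sorted(result.items()))  # [(1111, 7000), (3333, 5000), (4444, 6000)]
```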
  • The time range here is again only one day; that is, the snapshot covers only one day of data.
  • In practice, the time range can be more fine-grained, such as a precise hour-level snapshot.
  • Referring to FIG. 5, a structural block diagram of an embodiment of a data snapshot acquisition system according to the present application is shown; the system may specifically include the following modules:
  • the data record obtaining module 501 is configured to obtain a corresponding change data record based on the redo log redolog file recorded in the source end and starting from the specified location;
  • a data loading module 502 configured to load the change data record to a corresponding destination end
  • the snapshot obtaining module 503 is configured to: when the preset specified time is reached, obtain the first data snapshot corresponding to the specified time according to the change data record.
  • system may further include:
  • a merging module configured to organize the first data snapshot and the second data snapshot into a third data snapshot of the specified time when the specified location has a corresponding second acquired data snapshot.
  • system may further include:
  • a tag adding module configured to add tag information to the change data record as needed when loading the change data record to the destination end.
  • the change data record includes a primary key identifier
  • the snapshot obtaining module 503 may include the following submodules:
  • a determining submodule configured to detect the label information in the destination end, and determine whether the specified time is reached
  • a sorting submodule configured to perform a reverse sorting on the change data record identified by each primary key when the specified time is reached;
  • the snapshot obtaining sub-module is configured to obtain a data record in which each primary key identifier is sorted in the first place as a first data snapshot of the primary key identifier.
  • the change data record further includes a data number
  • the ordering sub-module may further be used to:
  • the change data record for each primary key identifier is sorted according to the data number.
  • the data loading module 502 may include the following sub-modules:
  • a publishing submodule, configured to publish the change data records to a preset subscribing program by a preset publishing program, where the subscribing program is a program that subscribes to the change data records from the publishing program;
  • a format conversion submodule, configured to convert, by the subscribing program, the change data records into the data format required by the destination end;
  • a loading submodule, configured to load, by the subscribing program, the change data records in the converted data format into the data table of the corresponding destination end.
  • the change data record includes a source change time stamp
  • the format conversion submodule may include the following units:
  • a partition determining unit, configured to determine, by the subscribing program according to the source-end change timestamp of each change data record, the storage partition to which the change data record belongs, where the storage partition includes a storage partition identifier;
  • a writing unit configured to write the change data record into a corresponding storage partition, and write an identifier corresponding to the storage partition into a position corresponding to the change data record in the data table.
  • system may further include:
  • a storage module for storing change data records within a preset time period.
  • In a preferred embodiment, there are multiple source ends, and the data record obtaining module 501 may include the following submodule:
  • a collecting submodule, configured to collect the redolog files recorded in the multiple source ends and starting from the specified location centrally into the publishing program, where the redolog files are parsed by the publishing program to obtain the corresponding change data records.
  • In a preferred embodiment, the first data snapshot includes a first primary key identifier, a change type, and first numerical information, and the second data snapshot includes a second primary key identifier and second numerical information.
  • the merge module may include the following submodules:
  • a merging submodule configured to merge the first data snapshot and the second data snapshot according to the first primary key identifier and/or the second primary key identifier;
  • a first organization submodule configured to organize the second primary key identifier and the second numerical value information into a third data snapshot when the first primary key identifier is empty;
  • the second organization submodule is configured to organize the first primary key identifier and the first numerical value information into a third data snapshot when the first primary key identifier is not empty.
  • embodiments of the embodiments of the present application can be provided as a method, apparatus, or computer program product. Therefore, the embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, embodiments of the present application can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) including computer usable program code.
  • Embodiments of the present application are described with reference to flowcharts and/or block diagrams of the method, terminal device (system), and computer program product according to the embodiments of the present application. It will be understood that each flow and/or block of the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions.
  • These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing terminal device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing terminal device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing terminal device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
  • These computer program instructions may also be loaded onto a computer or other programmable data processing terminal device, such that a series of operational steps are performed on the computer or other programmable terminal device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable terminal device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A data snapshot acquisition method and system, the method including: obtaining corresponding change data records based on a redo log (redolog) file recorded at a source end and starting from a specified location (101); loading the change data records to a corresponding destination end (102); and, at the destination end, when a preset specified time is reached, obtaining a first data snapshot corresponding to the specified time according to the change data records (103). A data snapshot at a specified time can thus be obtained precisely.

Description

Data snapshot acquisition method and system
This application claims priority to Chinese Patent Application No. 201510413154.8, filed on July 14, 2015 and entitled "Data snapshot acquisition method and system", the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the technical field of data storage, and in particular to a data snapshot acquisition method and a data snapshot acquisition system.
Background
With the development of enterprise informatization, more and more enterprises have built numerous information systems to help them handle and manage internal and external business. However, as the number of information systems grows, systems that work in isolation produce large amounts of redundant data and duplicated work for business staff. Enterprise Application Integration (EAI) emerged in response, and ETL is the main technology for realizing data integration.
ETL, short for Extraction-Transformation-Loading, is the process of data extraction, transformation, and loading, and is a key link in building a data warehouse. ETL is the process of loading business system data into the data warehouse after extraction, cleansing, and transformation; its purpose is to integrate the scattered, disorderly, and non-uniform data of an enterprise to provide an analytical basis for decision making.
In the prior art, the commonly used data extraction means is offline extraction, which traverses all the data through the data interface provided by the source end so as to extract the desired data; traditional source-end extraction (such as select) belongs to offline extraction. Since offline extraction directly traverses the then-current state of the source data, a data snapshot can be obtained directly without data computation.
However, the inventor found in practice that offline extraction has the following shortcomings:
(1) The snapshot time point cannot be specified: the execution time of offline extraction is an approximate time (or a time after the specified time), so a snapshot at a specified time point cannot be obtained;
(2) The snapshot is not precise: in distributed scenarios such as sharded databases and tables, distributed scheduling cannot schedule all execution points simultaneously, so the data-state time points in the snapshot are inconsistent;
(3) It is not fast enough: offline extraction must traverse the entire source data set; especially when acquiring a snapshot of data with a large total base but a small increment, the input-output ratio is quite low;
(4) It affects online production: offline extraction pulls data through the source-end interface, and since online production and offline extraction both call this data portal, online production may respond with delay or even crash.
Therefore, a technical problem that urgently needs to be solved by those skilled in the art is how to provide a data snapshot acquisition scheme for precisely obtaining a data snapshot at a specified time.
Summary
The technical problem to be solved by the embodiments of the present application is to provide a data snapshot acquisition method for precisely obtaining a data snapshot at a specified time.
Correspondingly, the embodiments of the present application further provide a data snapshot acquisition system to guarantee the implementation and application of the above method.
To solve the above problem, an embodiment of the present application discloses a data snapshot acquisition method, including:
obtaining corresponding change data records based on a redo log (redolog) file recorded at a source end and starting from a specified location;
loading the change data records to a corresponding destination end;
at the destination end, when a preset specified time is reached, obtaining a first data snapshot corresponding to the specified time according to the change data records.
Preferably, the method further includes:
if the specified location has a corresponding previously acquired second data snapshot, organizing the first data snapshot and the second data snapshot into a third data snapshot of the specified time.
Preferably, the method further includes:
when loading the change data records to the destination end, adding tag information to the change data records as needed.
Preferably, the change data records include primary key identifiers, and the step of obtaining, at the destination end, the first data snapshot corresponding to the specified time according to the change data records when the preset specified time is reached includes:
detecting the tag information at the destination end, and determining whether the specified time is reached;
if the specified time is reached, reverse sorting the change data records of each primary key identifier;
obtaining, for each primary key identifier, the first-ranked data record as the first data snapshot of that primary key identifier.
Preferably, the change data records further include data numbers, and the step of reverse sorting the change data records of each primary key identifier if the specified time is reached is:
if the specified time is reached, reverse sorting the change data records of each primary key identifier according to the data numbers.
Preferably, the step of loading the change data records to the destination end includes:
publishing the change data records to a preset subscribing program by a preset publishing program, where the subscribing program is a program that subscribes to the change data records from the publishing program;
converting, by the subscribing program, the change data records into the data format required by the destination end;
loading, by the subscribing program, the change data records in the converted data format into the data table of the corresponding destination end.
Preferably, the change data records include source-end change timestamps, and the step of loading, by the subscribing program, the change data records in the converted data format into the data table of the corresponding destination end includes:
determining, by the subscribing program according to the source-end change timestamp of each change data record, the storage partition to which the change data record belongs, where the storage partition includes a storage partition identifier;
writing the change data record into the corresponding storage partition, and writing the identifier corresponding to the storage partition into the position corresponding to the change data record in the data table.
Preferably, the method further includes:
storing the change data records within a preset time period.
Preferably, there are multiple source ends, and the step of obtaining the change data records according to the redo log (redolog) files recorded at the source ends includes:
collecting the redolog files recorded in the multiple source ends and starting from the specified location centrally into the publishing program, and parsing the redolog files by the publishing program to obtain the corresponding change data records.
Preferably, the first data snapshot includes a first primary key identifier, a change type, and first numerical information, the second data snapshot includes a second primary key identifier and second numerical information, and the step of organizing the first data snapshot and the second data snapshot into the third data snapshot of the specified time includes:
merging the first data snapshot and the second data snapshot according to the first primary key identifier and/or the second primary key identifier;
if the first primary key identifier is empty, organizing the second primary key identifier and the second numerical information into the third data snapshot;
if the first primary key identifier is not empty, organizing the first primary key identifier and the first numerical information into the third data snapshot.
An embodiment of the present application further discloses a data snapshot acquisition system, including:
a data record obtaining module, configured to obtain corresponding change data records based on a redo log (redolog) file recorded at a source end and starting from a specified location;
a data loading module, configured to load the change data records to a corresponding destination end;
a snapshot obtaining module, configured to obtain, at the destination end, a first data snapshot corresponding to a preset specified time according to the change data records when the specified time is reached.
Preferably, the system further includes:
a merging module, configured to organize the first data snapshot and the second data snapshot into a third data snapshot of the specified time when the specified location has a corresponding previously acquired second data snapshot.
Preferably, the system further includes:
a tag adding module, configured to add tag information to the change data records as needed when the change data records are loaded to the destination end.
Preferably, the change data records include primary key identifiers, and the snapshot obtaining module includes:
a determining submodule, configured to detect the tag information at the destination end and determine whether the specified time is reached;
a sorting submodule, configured to reverse sort the change data records of each primary key identifier when the specified time is reached;
a snapshot obtaining submodule, configured to obtain, for each primary key identifier, the first-ranked data record as the first data snapshot of that primary key identifier.
Preferably, the change data records further include data numbers, and the sorting submodule is further configured to:
if the specified time is reached, reverse sort the change data records of each primary key identifier according to the data numbers.
Preferably, the data loading module includes:
a publishing submodule, configured to publish the change data records to a preset subscribing program by a preset publishing program, where the subscribing program is a program that subscribes to the change data records from the publishing program;
a format conversion submodule, configured to convert, by the subscribing program, the change data records into the data format required by the destination end;
a loading submodule, configured to load, by the subscribing program, the change data records in the converted data format into the data table of the corresponding destination end.
Preferably, the change data records include source-end change timestamps, and the format conversion submodule includes:
a partition determining unit, configured to determine, by the subscribing program according to the source-end change timestamp of each change data record, the storage partition to which the change data record belongs, where the storage partition includes a storage partition identifier;
a writing unit, configured to write the change data record into the corresponding storage partition and write the identifier corresponding to the storage partition into the position corresponding to the change data record in the data table.
Preferably, the system further includes:
a storage module, configured to store the change data records within a preset time period.
Preferably, there are multiple source ends, and the data record obtaining module includes:
a collecting submodule, configured to collect the redolog files recorded in the multiple source ends and starting from the specified location centrally into the publishing program, where the redolog files are parsed by the publishing program to obtain the corresponding change data records.
Preferably, the first data snapshot includes a first primary key identifier, a change type, and first numerical information, the second data snapshot includes a second primary key identifier and second numerical information, and the merging module includes:
a merging submodule, configured to merge the first data snapshot and the second data snapshot according to the first primary key identifier and/or the second primary key identifier;
a first organizing submodule, configured to organize the second primary key identifier and the second numerical information into the third data snapshot when the first primary key identifier is empty;
a second organizing submodule, configured to organize the first primary key identifier and the first numerical information into the third data snapshot when the first primary key identifier is not empty.
Compared with the background art, the embodiments of the present application have the following advantages:
a. Any snapshot time point can be specified: with the change stream obtained through the redolog file, same-key data can be reverse sorted by time, and after computation a data snapshot at any time point can be obtained;
b. Precise snapshot: even in a distributed situation (ignoring the clock differences of the distributed hosts), as long as the change data time points are obtained from the redolog, a data snapshot with consistent time points can be obtained;
c. Fast acquisition: the embodiments of the present application obtain the change data records through the redolog file without traversing the entire data set, so the ETL speed is high;
d. Isolated production: the embodiments of the present application read the redolog file instead of going through the source-end data access interface, so the impact on online production is small or even nonexistent;
e. General applicability: the embodiments of the present application can be used generally for storage systems with redolog files; as long as the change stream can be obtained from the redolog, all of the above technical effects can be achieved.
Brief description of the drawings
FIG. 1 is a flow chart of the steps of Embodiment 1 of a data snapshot acquisition method of the present application;
FIG. 2 is a schematic diagram of the incremental merge code in Embodiment 1 of the data snapshot acquisition method of the present application;
FIG. 3 is a flow chart of the steps of Embodiment 2 of the data snapshot acquisition method of the present application;
FIG. 4 is a schematic diagram of the full merge code in Embodiment 2 of the data snapshot acquisition method of the present application;
FIG. 5 is a structural block diagram of an embodiment of a data snapshot acquisition system of the present application.
Detailed description
To make the above objects, features, and advantages of the present application more obvious and understandable, the present application is described in further detail below with reference to the drawings and specific embodiments.
Referring to FIG. 1, a flow chart of the steps of Embodiment 1 of a data snapshot acquisition method of the present application is shown; the method may include the following steps:
Step 101: obtaining corresponding change data records based on a redo log (redolog) file recorded at a source end and starting from a specified location.
The embodiments of the present application can be applied in an ETL system; in the data extraction stage, the present application can obtain change data records based on the redo log (redolog) files recorded at the source end.
In a specific implementation, the storage system at the source end (such as mysql, oracle, hbase, etc.) stores structured data (i.e., row data), which is data expressed in the storage system by a two-dimensional table logical structure; such data generally has redolog files in the storage system. A redolog file is a log in which structured storage records the change process of its own data. Specifically, every update operation executed by a database system causes a change in the database and therefore generates a certain amount of redo logs, which are recorded into redo log (redolog) files so that the redo log files can be used to recover the database when a routine failure or media failure occurs. The embodiments of the present application use the redolog files to obtain the data change stream at the source end.
The embodiments of the present application can be applied in a distributed environment, in which there may be multiple source ends. In a preferred embodiment, step 101 may specifically be:
Sub-step S11: collecting the redolog files recorded in the multiple source ends and starting from the specified location centrally into a preset publishing program, and parsing the redolog files by the publishing program to obtain the corresponding change data records.
In one implementation, the redolog files of the multiple source ends can be collected centrally into a preset publishing program, which parses the redolog files; no data parsing needs to be performed at the source end where the redolog file resides, thereby forming good isolation from the source end and saving resource consumption at the source end.
In a specific implementation, different collection methods can be used for different source ends during centralized collection. For example, mysql can be collected through master-slave replication of the binary log binlog; a source end without a ready-made redolog replication scheme can be handled directly by copying files, for example with rsync (a data mirroring and backup tool under unix-like systems), and so on.
To avoid repeated collection of redolog files, after the previous collection of redolog files is finished, the position where collection stopped can be marked; this marked position, or the position one data unit after it, can serve as the specified location, and the next collection starts from the specified location. In a specific implementation, the mark can be based on a time parameter. For example, if redolog files were previously collected up to 2015-04-20 08:00, a mark is made at 2015-04-20 08:00, and the specified location can be 2015-04-20 08:00. Alternatively, the mark can be based on a file number parameter. For example, if the number of the previously collected redolog file is 4, a mark is made at number "4", and the specified location can be the position after position "4", for example the position numbered "5". Of course, the specified location can also be determined in other ways, which the embodiments of the present application need not limit.
After the redolog files are collected, the publishing program further parses them into a recognizable data format to obtain the corresponding change data records (records). In practice, the changes of a database can be continuously appended to the redolog file; that is, the redolog records the data change stream, and after the redolog is parsed, the corresponding change stream records can be obtained.
The recognizable format can be an excel-style row-column format (one row is one record, and different columns have different meanings), a json format (one json is one record, and different keys inside have different meanings), a program object containing the record, or other serialization formats such as protobuf, avro, and so on.
In a specific implementation, different parsing methods can also be used for the redolog files of different source ends, for example oracle's logminer, mysql's mysqlbinlog, and so on.
In the embodiments of the present application, the change data record (record) obtained by parsing is not completely identical to the data recorded in the redolog file at the source end: the data at the source end consists only of the specific data fields, whereas a record may contain two parts, meta information and value information. The meta information may include a primary key identifier, a change type (for example insert INSERT, update UPDATE, delete DELETE, etc.; for a DELETE change, at least the changed primary key identifier must be given), a source-end database name, a source-end table name, a source-end change timestamp, and so on; the value information may be recorded in a List<Field>, and each Field object also has its own meta information, such as the data type (INT, FLOAT, DOUBLE, STRING, DATE, BLOB, etc.), the Field name, the Field encoding, the specific Field value, and so on.
It should be noted that the embodiments of the present application are not limited to the above centralized collection method; those skilled in the art may also process the redolog files separately by sharded database/table location, that is, parse the redolog files locally at the source end. The benefit of doing so is saving the online transmission of the redolog files (they need not be transmitted to the publishing program), but this method consumes various resources of each source end (such as CPU, memory, network, and other resources) and thus affects the online service.
Applied to the embodiments of the present application, after the change data records are obtained, the embodiments may further store the change data records within a preset time period. Specifically, during the operation of the system of this embodiment, abnormal situations such as user logic errors, data loss, or system downtime may occur, while the real-time data stream keeps moving forward; after such an abnormality occurs and is recovered from, the whole stream position needs to be rolled back to a safe time point (the preset time period), and the stream data pulled again (which can be regarded as a data back-filling process). If the stream data (i.e., the change data records) within the preset time period has been cached after parsing, the change data records can be pulled directly from this cached stream data (provided, of course, that the rollback time point falls within the cached time range), without repeatedly pulling and parsing the redolog files from the source end.
It should be noted that, since the embodiments of the present application parse the redolog files, the parsed change data records of the same primary key (key) are arranged in the actual change order; that is, same-key data is order-preserving. For different-key data or sharded database/table data, the present application only requires that the time difference between the two (between two keys, two tables, or two databases) not be too large; it imposes no requirement for scenarios needing data transactionality and the like. (A data transactionality scenario can mean that a batch of data operations either all succeed or all fail. For example, a bank needs to add 100 yuan to each account in a batch of, say, 10 accounts; the transactional scenario requires that this operation guarantees all 10 accounts are increased by 100 yuan, or none of them, rather than some increased and some not, since otherwise it would be hard to find out which succeeded and which failed, causing inconvenience for subsequent operations.)
In the data extraction stage, the embodiments of the present application obtain the change data records by extracting the redolog files; same-key data can be sorted by time without traversing the entire data set, which greatly improves the ETL speed, and a data snapshot at any specified time can be obtained (the arbitrary specified time will be explained below).
Step 102: loading the change data records to the corresponding destination end.
Since the embodiments of the present application are applied in a distributed environment, there may also be multiple destination ends; after the change data records are obtained, they can be loaded into the corresponding destination ends.
In a preferred embodiment, step 102 may include the following sub-steps:
Sub-step S21: publishing the change data records to a preset subscribing program by a preset publishing program, where the subscribing program is a program that subscribes to the change data records from the publishing program.
For light coupling between systems, the embodiments of the present application provide, besides the publishing program, a subscribing program: the publishing program focuses on parsing and publishing the redolog files without concerning itself with the flow and use of the published data, while the subscribing program subscribes to the data published by the publishing program and loads it to the destination end. The embodiments complete the parsing and loading operations through two different programs, achieving light coupling between systems. Since the data volume in a distributed environment is very large, considering the performance of the subscribing program, multiple subscribing programs can process the change data records published by the publishing program; for example, if 10 subscribing programs subscribe to and consume the same batch of data, each subscribing program can be responsible for 1/10 of the data volume.
In a specific implementation, if a source end really has no change operations for a long time, it is difficult for the subscribing program to judge whether no new data has arrived or something has gone wrong in the parsing and publishing process; in this case, the publishing program can continuously send heartbeats to the subscribing program (for example once per second) to ensure that the data time moves forward.
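The heartbeat idea can be sketched as follows; this is a deterministic toy model (lists of batches standing in for the real pipeline), and all names are assumptions of the example:

```python
def publish(batches, heartbeat=lambda: {"type": "HEARTBEAT"}):
    # Forward parsed change records to subscribers; when a polling round finds
    # the source idle (an empty batch), emit a heartbeat instead, so that the
    # subscriber can tell "no new data" apart from a stalled pipeline.
    out = []
    for batch in batches:
        if batch:
            out.extend({"type": "RECORD", "data": r} for r in batch)
        else:
            out.append(heartbeat())
    return out

msgs = publish([[{"key": 1}, {"key": 2}], [], [{"key": 3}]])
print([m["type"] for m in msgs])  # ['RECORD', 'RECORD', 'HEARTBEAT', 'RECORD']
```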
Sub-step S22: converting, by the subscribing program, the change data records into the data format required by the destination end.
After obtaining the change data records, the subscribing program can convert them into the data format required by the destination end. Data in this format is the link, obtained from the source end and written to the destination end, used for computation; specifically, the conversion can include format conversion of the meta information and format conversion of the value information.
In one implementation, the change data records can be converted into the format of Table 1 below:
[Table 1, rendered as an image in the original: the record format, with columns for the multi-digit cyclic synchronization sequence number, source-end change timestamp, source-end database name, source-end table name, change type, and Field values]
Table 1
In Table 1 above:
Multi-digit cyclic synchronization sequence number: the data number of the change data record. This field has a fixed length, and the synchronization sequence numbers are allocated cyclically. (The number of digits depends on the granularity of the source-end change timestamp and on the maximum number of changes of a single source-end table within that granularity. For example, if the granularity of the source-end change timestamp is 1 second and the maximum number of changes of one table within 1 second is assumed to be 50,000, then the sequence number can have 5 or more digits, e.g. 00001; that is, 5 or more digits can represent the changes within the next second, and since the source-end change timestamp of the next second already differs from the data of the previous second, the multi-digit synchronization sequence numbers can then be reused cyclically.) This field is the sorting field for same-key data at the destination end;
Source-end change timestamp: the real time of the source-end data change, which determines the storage interval to which the change data record is written at the destination end. The sorting field uses the above multi-digit cyclic synchronization sequence number rather than this field because the precision of the parsed change time may be coarse; for example, even at microsecond precision, same-key data at the source end may have multiple changes within one microsecond, making their order indistinguishable;
Source-end database name, source-end table name: mark the real origin of the source-end data, facilitating data analysis and problem localization;
Change type: mainly includes INSERT, UPDATE, and DELETE parsed as described above;
Field values: the specific value information obtained by parsing; encoding conversion or type conversion may be performed here, but high fidelity (no distortion) of the values must be guaranteed.
In practice, when the change data records are written into Table 1 above, they can be written row by row and column by column; they can also be written as one whole line with built-in row/column separation (the row/column separators need to be special characters, and row/column replacement is needed when such separators appear in the data); they can also be json, protobuffer, or other serialization formats, but the basic meta information is as described in Table 1 above.
Sub-step S23: loading, by the subscribing program, the change data records in the converted data format into the data table of the corresponding destination end.
After converting the data format of the change data records, the subscribing program can load the converted change data records into the data table of the corresponding destination end.
In a preferred embodiment, sub-step S23 may further include the following sub-steps:
Sub-step S231: determining, by the subscribing program according to the source-end change timestamp of each change data record, the storage partition to which the change data record belongs.
In a distributed environment the amount of data to be processed is huge. To reduce the storage pressure and data processing pressure at the destination end, the embodiments of the present application divide out multiple storage partitions for data storage, each storing data of a certain time range. For example, if each storage partition can store 1 hour of data, the 24 hours of a day can be divided into 24 storage partitions: the first stores the data of 00:00-01:00, the second the data of 01:01-02:00, the third the data of 02:01-03:00, and so on.
After the time period stored by each storage partition is determined, the storage partition to which a change data record belongs can be determined according to the record's source-end change timestamp. For example, if the source-end change timestamp of the current change data record is 01:30, the storage partition it belongs to is the second storage partition.
Sub-step S232: writing the change data record into the corresponding storage partition, and writing the identifier corresponding to the storage partition into the position corresponding to the change data record in the data table.
After the storage partition to which a change data record belongs is determined, the subscribing program writes the record into that partition. For example, a change data record with source-end change timestamp 01:30 is written into the second storage partition, and the identifier of the second storage partition is written into the data table at the destination end. The identifier of a storage partition can be a data number such as "1" or "2", or the time-period range corresponding to the storage partition, i.e., a range timestamp identifier, as shown in Table 2 below; the range timestamp can use a representation such as 2014042009 (the identifier of the storage partition storing 2014-04-20 09:00-10:00).
[Table 2, rendered as an image in the original: the destination-end data table with a range-timestamp column identifying each record's storage partition]
Table 2
The purpose of writing the identifier of the storage partition into the data table in the embodiments of the present application is to be able to quickly locate the position of the data.
It should be noted that for SQL-type storage the identifier of the storage partition can be expressed with a separate range-marking field, such as the range timestamp shown in Table 2; for NoSQL-type storage it can be expressed with a partition, for example with something similar to a separate column, except that in NoSQL the field of this separate column belongs to the meta information rather than to the specific values.
Applied to the embodiments of the present application, to guarantee the integrity of the data, the data loading process may further include the following step:
when loading the change data records to the destination end, adding tag information to the change data records as needed.
Specifically, during the writing of the change data records to the destination end, tag information can be added to the change data records on schedule according to business requirements (i.e., tagging them), for example one tag per hour of written data, or a tag when the specified time is reached, or tags according to the time of the storage partition; the destination end can then judge from the tagging whether the data is complete and whether the specified time has been reached.
It should be noted that for data of different keys, the processing from the publishing program to the subscribing program needs to be completed within a time range (e.g., 5 minutes), to keep the processing-time difference of different keys within a reasonable range and avoid data loss caused by delayed subsequent computation or incomplete data. For example, one piece of data is already at 00:00 midnight while other data is still at 20:00 the previous evening; without a bounded time range, the current overall change situation could not be obtained globally at midnight.
Step 103: at the destination end, when the preset specified time is reached, obtaining the first data snapshot corresponding to the specified time according to the change data records.
After the change data records are loaded to the destination end, the first data snapshot corresponding to a time specified by a user (for example a developer or data analyst) can be obtained at the destination end, where a data snapshot is a copy, at a certain moment, of the data of the source-end storage system in another storage system or location (i.e., the destination end).
In a preferred embodiment, step 103 may include the following sub-steps:
Sub-step S31: detecting the tag information at the destination end, and determining whether the specified time is reached.
If tags are added at the specified time, then detecting the tag indicates that the specified time has arrived and the change data records at the tag have been fully loaded.
If tags are added at fixed time intervals, for example one tag per hour, and the specified time is 04:00 starting from 00:00, then reading the fourth tag indicates that the specified time has arrived and the change data records at the fourth tag have been fully loaded.
Sub-step S32: if the specified time is reached, reverse sorting the change data records of each primary key identifier.
Sub-step S33: obtaining, for each primary key identifier, the first-ranked data record as the first data snapshot of that primary key identifier.
In one implementation, since the sorting field of same-key data at the destination end is the multi-digit cyclic synchronization sequence number, when the specified time is reached the change data records of each primary key identifier can be sorted by this sequence number, i.e., the data number. To save data traversal time and improve lookup efficiency, the preferred sorting mode in the embodiments of the present application is reverse sorting, so that the first-ranked data record is the final-state data record of the primary key identifier. In the embodiments of the present application, the process of sub-steps S32 and S33 is a process of incremental merging, and the first data snapshot is the incremental snapshot obtained by incrementally merging the change data records. Incremental merging means specifying the starting point T0 of the change stream (i.e., the specified location in the embodiments of the present application) and combining same-key data in change order (i.e., reverse sorting the same-key data), thereby obtaining the final state of all the data at the cut-off time point T1 (i.e., the specified time in the embodiments of the present application); the data in this final state is called an incremental snapshot.
In a specific implementation, the pseudo-SQL code of the first data snapshot acquisition process is shown in the incremental merge code diagram of FIG. 2. Line 9 selects all data in the data range (partition field), distributes same-key data together, and reverse sorts by primary key, source-end change timestamp, and synchronization sequence number (i.e., the multi-digit cyclic synchronization sequence number) to obtain temporary table a; the where condition on line 11 selects the first change record, which is finally inserted into the corresponding data range of the incremental table, thereby obtaining the first data snapshot of all primary key identifiers as of the latest moment before the specified time.
To help those skilled in the art better understand the embodiments of the present application, the acquisition process of the first data snapshot is illustrated below in the form of temporary tables. It should be noted that Tables 3 and 4 below are temporary tables that are not necessarily stored in an actual implementation; this example is only a walkthrough of the process, and the pseudo-SQL code in FIG. 2 directly handles this logic to generate the final first data snapshot.
Step 1: select, according to the business requirements, the redolog files in the time range from the specified location to the specified time. The specified location selected in this example is 2015-04-20 00:00:00, and the specified time is 2015-04-21 00:00:00; that is, the redolog files in the time range [2015-04-20 00:00:00, 2015-04-21 00:00:00) are selected. After they are parsed, the change data records loaded to the destination end are as shown in Table 3 below.
Change seq. no. | Source change timestamp | Source database | Source table | Change type | id (account) | value (balance) | Time range (by hour)
00552 | 2015-04-20 09:30:00 | cash_00 | account_00 | UPDATE | 1111 | 10000 | 2015042009
00034 | 2015-04-20 09:40:00 | cash_01 | account_04 | UPDATE | 2222 | 5000 | 2015042009
22222 | 2015-04-20 14:30:00 | cash_03 | account_10 | INSERT | 3333 | 5000 | 2015042014
45678 | 2015-04-20 15:00:00 | cash_00 | account_00 | UPDATE | 1111 | 11000 | 2015042015
02222 | 2015-04-20 15:10:00 | cash_00 | account_00 | UPDATE | 1111 | 12000 | 2015042015
02223 | 2015-04-20 15:10:00 | cash_00 | account_00 | UPDATE | 1111 | 11000 | 2015042015
38762 | 2015-04-20 15:20:00 | cash_01 | account_04 | DELETE | 2222 | deleted | 2015042015
00001 | 2015-04-20 16:10:00 | cash_00 | account_00 | UPDATE | 1111 | 7000 | 2015042016
Table 3
Step 2: group the change data records of Table 3 above by primary key (key) identifier and, within each group of same-key data, reverse sort by the change sequence number (i.e., the multi-digit cyclic synchronization sequence number); the result is shown in Table 4 below.
[Table 4, rendered as an image in the original: the records of Table 3 grouped by primary key identifier and reverse sorted by change sequence number]
Table 4
Step 3: obtain the first-ranked data record of each primary key identifier to get the final state before 2015-04-21 00:00:00, i.e., the first data snapshot of 2015-04-21 00:00:00, as shown in Table 5 below:
Change type | id (account) | value (balance) | Time range (by hour)
UPDATE | 1111 | 7000 | 2015042016
DELETE | 2222 | deleted | 2015042015
INSERT | 3333 | 5000 | 2015042014
Table 5
It should be noted that the time range of the above example is only one day; that is, only a one-day data snapshot is taken. In fact, the time range can be more fine-grained, for example a precise hour-level snapshot, and even a precise minute-level snapshot can be obtained.
By adopting the embodiments of the present application, the following beneficial effects can be achieved:
a. Any snapshot time point can be specified: with the change stream obtained through the redolog file, same-key data can be reverse sorted by time, and after computation a data snapshot at any time point can be obtained;
b. Precise snapshot: even in a distributed situation (ignoring the clock differences of the distributed hosts), as long as the change data time points are obtained from the redolog, a data snapshot with consistent time points can be obtained;
c. Fast acquisition: the embodiments of the present application obtain the change data records through the redolog file without traversing the entire data set, so the ETL speed is high;
d. Isolated production: the embodiments of the present application read the redolog file instead of going through the source-end data access interface, so the impact on online production is small or even nonexistent;
e. General applicability: the embodiments of the present application can be used generally for storage systems with redolog files; as long as the change stream can be obtained from the redolog, all of the above technical effects can be achieved.
Referring to FIG. 3, a flow chart of the steps of Embodiment 2 of a data snapshot acquisition method of the present application is shown; the method may include the following steps:
Step 301: obtaining corresponding change data records based on a redo log (redolog) file recorded at a source end and starting from a specified location.
In a preferred embodiment, step 301 may include the following sub-step:
Sub-step S41: collecting the redolog files recorded in the multiple source ends and starting from the specified location centrally into a preset publishing program, and parsing the redolog files by the publishing program to obtain the corresponding change data records.
In the embodiments of the present application, the change data record (record) obtained by parsing is not completely identical to the data recorded in the redolog file at the source end: the source-end data consists only of the specific data fields, whereas a record may contain two parts, meta information and value information. The meta information may include a primary key identifier, a change type, a source-end database name, a source-end table name, a source-end change timestamp, and so on; the value information may be recorded in a List<Field>, and each Field object also has its own meta information, such as the data type (INT, FLOAT, DOUBLE, STRING, DATE, BLOB, etc.), the Field name, the Field encoding, the specific Field value, and so on.
Applied to the embodiments of the present application, after the change data records are obtained, the embodiments may further store the change data records within a preset time period. Specifically, during system operation, abnormal situations such as user logic errors, data loss, or system downtime may occur, while the real-time data stream keeps moving forward; after such an abnormality occurs and is recovered from, the whole stream position needs to be rolled back to a safe time point (the preset time period), and the stream data pulled again (which can be regarded as a data back-filling process). If the stream data (i.e., the change data records) within the preset time period has been cached after parsing, the change data records can be pulled directly from this cached stream data (provided that the rollback time point falls within the cached time range), without repeatedly pulling and parsing the redolog files from the source end.
Step 302: loading the change data records to the corresponding destination end.
Since the embodiments of the present application are applied in a distributed environment, there may also be multiple destination ends; after the change data records are obtained, they can be loaded into the corresponding destination ends.
In a preferred embodiment, step 302 may include the following sub-steps:
Sub-step S51: publishing the change data records to a preset subscribing program by a preset publishing program, where the subscribing program is a program that subscribes to the change data records from the publishing program;
Sub-step S52: converting, by the subscribing program, the change data records into the data format required by the destination end;
Sub-step S53: loading, by the subscribing program, the change data records in the converted data format into the data table of the corresponding destination end.
In a preferred embodiment, sub-step S53 may further include the following sub-steps:
Sub-step S531: determining, by the subscribing program according to the source-end change timestamp of each change data record, the storage partition to which the change data record belongs;
Sub-step S532: writing the change data record into the corresponding storage partition, and writing the identifier corresponding to the storage partition into the position corresponding to the change data record in the data table.
Applied to the embodiments of the present application, to guarantee the integrity of the data, the data loading process may further include the following step:
when loading the change data records to the destination end, adding tag information to the change data records as needed.
Step 303: in the destination end, when a preset specified time is reached, obtain a first data snapshot corresponding to the specified time according to the change data records;
In a preferred implementation of the embodiments of the present application, step 303 may include the following sub-steps:
Sub-step S61: detect the tag information in the destination end to determine whether the specified time has been reached;
Sub-step S62: if the specified time has been reached, sort the change data records of each primary key identifier in reverse order;
In a preferred implementation of the embodiments of the present application, sub-step S62 may specifically be: if the specified time has been reached, sort the change data records of each primary key identifier in reverse order according to the data sequence number.
Sub-step S63: obtain the data record ranked first for each primary key identifier as the first data snapshot of that primary key identifier.
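Sub-steps S62 and S63 — reverse-sorting each key's change records and keeping the first — can be sketched as follows. This is an illustrative Python sketch, not the patent's implementation; the dictionary keys are assumed names:

```python
from itertools import groupby

def first_snapshot(records):
    """For each primary key identifier, sort its change data records in
    descending order of data sequence number (S62) and keep the record
    ranked first, i.e. the latest state, as that key's snapshot (S63)."""
    records = sorted(records, key=lambda r: r["pk"])  # groupby needs sorted input
    snapshot = {}
    for pk, group in groupby(records, key=lambda r: r["pk"]):
        latest = sorted(group, key=lambda r: r["seq"], reverse=True)[0]
        snapshot[pk] = latest
    return snapshot

changes = [
    {"pk": "1111", "seq": 1, "op": "INSERT", "value": 5000},
    {"pk": "1111", "seq": 2, "op": "UPDATE", "value": 7000},
    {"pk": "2222", "seq": 1, "op": "DELETE", "value": None},
]
snap = first_snapshot(changes)
```

Note that a DELETE still appears in the first data snapshot with its change type; it is only dropped later, during the merge with the full snapshot.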
Step 304: if the specified position has a corresponding previously obtained second data snapshot, organize the first data snapshot and the second data snapshot into a third data snapshot of the specified time.
In a preferred implementation of the embodiments of the present application, the first data snapshot includes a first primary key identifier, a change type, and first value information, and the second data snapshot includes a second primary key identifier and second value information. Step 304 may include the following sub-steps:
Sub-step S71: if the specified position has a corresponding previously obtained second data snapshot, merge the first data snapshot and the second data snapshot according to the first primary key identifier and/or the second primary key identifier;
In the embodiments of the present application, the second data snapshot is the full snapshot of the specified position. A full snapshot is the result of a full merge of the data at the specified position. A full merge means that the data within the time range T0 to T1 is first incrementally merged and then merged onto the full snapshot at time T0, yielding the full snapshot at time T1; by repeating this method, full snapshots at each successive time point can be obtained. Therefore, if the specified position is the initial position, no second data snapshot exists for it; otherwise, if the specified position is not the initial position, a second data snapshot exists for it.
Referring to the full-merge code diagram shown in FIG. 4: lines 10-13 perform the incremental merge to obtain the temporary incremental table incre, i.e., the incremental snapshot at the specified time; line 8 obtains the full-value table total of the previous business point, i.e., the second data snapshot of the specified position; the two are joined with a full outer join, the join condition being equality of primary keys (line 14), yielding a wide table combining full and incremental data (columns prefixed with total and incre respectively; for unchanged data, the incre part is entirely null).
Sub-step S72: if the first primary key identifier is null, organize the second primary key identifier and the second value information into the third data snapshot;
Sub-step S73: if the first primary key identifier is not null, organize the first primary key identifier and the first value information into the third data snapshot.
After data filtering, whether a single row takes the full value information or the incremental value information is decided by the if condition in lines 3-6 of FIG. 4: if the incremental value information does not exist, the full value information field values are chosen; otherwise, the incremental value information field values are chosen, yielding a full snapshot of all data as of the latest values at the specified time.
It should be noted that obtaining the third data snapshot and obtaining the first data snapshot can be implemented by one and the same computing program, i.e., in a single pass: one computing program completes the entire incremental-and-full merge without storing the incremental snapshot (the pseudocode in FIG. 4 is this approach). Alternatively, they can be implemented by two different computing programs, i.e., after the first data snapshot is obtained, another computing program is started to compute the third data snapshot.
To help those skilled in the art better understand the embodiments of the present application, the process of obtaining the third data snapshot is illustrated below in the form of temporary tables. It should be noted that tables 8 and 9 below are temporary tables and are not necessarily stored in an actual implementation; this example merely walks through the process, and the pseudo-SQL code in FIG. 4 processes this logic directly to produce the final third data snapshot:
Step 1: according to the incremental merge, obtain the incremental final-state table (i.e., the first data snapshot) at the business time point (2015-04-20), as shown in table 6;
Change type    id (account)    value (balance)    Time range (by hour)
UPDATE         1111            7000               2015042016
DELETE         2222            deleted            2015042015
INSERT         3333            5000               2015042014
Table 6
Step 2: obtain the full table of the previous day, i.e., the specified position (2015-04-19), as shown in table 7;
id (account)    value (balance)    Time range
1111            5000               20150419
2222            3000               20150419
4444            6000               20150419
Table 7
Step 3: perform a full outer join of table 6 and table 7, the join condition being the key (primary key), as shown in table 8;
Incr. change type    Incr. id (account)    Incr. value (balance)    Full id (account)    Full value (balance)
UPDATE               1111                  7000                     1111                 5000
DELETE               2222                  deleted                  2222                 3000
INSERT               3333                  5000
                                                                    4444                 6000
Table 8
Step 4: filter the data according to the decision logic, which is: keep rows whose incremental primary key is null or whose incremental change type is not DELETE, as shown in table 9:
Incr. change type    Incr. id (account)    Incr. value (balance)    Full id (account)    Full value (balance)
UPDATE               1111                  7000                     1111                 5000
INSERT               3333                  5000
                                                                    4444                 6000
Table 9
Step 5: take the values. When the incremental primary key is null, all fields take the full values; otherwise, all fields take the incremental values. This yields the final full snapshot for 2015-04-20, as shown in table 10.
id (account)    value (balance)
1111            7000
3333            5000
4444            6000
Table 10
It should be noted that the time range in the above example is only one day, i.e., only a one-day data snapshot is taken; in practice, the time range can be finer-grained, for example an hour-level precise snapshot.
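The five-step full merge walked through in tables 6-10 can be sketched as follows. This is an illustrative Python sketch (not the patent's pseudo-SQL of FIG. 4); it folds the join, filter, and value-selection steps into one dictionary merge:

```python
def full_merge(incremental, full):
    """Merge the incremental snapshot (table 6) onto the previous full
    snapshot (table 7): rows whose latest incremental change type is
    DELETE are dropped; otherwise the incremental value wins; full rows
    with no incremental change are carried over unchanged."""
    merged = dict(full)                      # carry over the full snapshot
    for pk, (change_type, value) in incremental.items():
        if change_type == "DELETE":
            merged.pop(pk, None)             # deleted keys drop out
        else:
            merged[pk] = value               # INSERT / UPDATE wins
    return merged

# Table 6 (first data snapshot): pk -> (change type, value)
incre = {"1111": ("UPDATE", 7000), "2222": ("DELETE", None), "3333": ("INSERT", 5000)}
# Table 7 (second data snapshot, full table of 2015-04-19): pk -> value
total = {"1111": 5000, "2222": 3000, "4444": 6000}

result = full_merge(incre, total)
# result matches table 10: {"1111": 7000, "3333": 5000, "4444": 6000}
```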
As the embodiment of FIG. 3 is basically similar to the method embodiment of FIG. 1, its description is relatively brief; for related details, refer to the description of the method embodiment.
It should be noted that, for ease of description, the method embodiments are described as a series of action combinations. However, those skilled in the art should understand that the embodiments of the present application are not limited by the described order of actions, because according to the embodiments of the present application, some steps may be performed in other orders or simultaneously. Secondly, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the embodiments of the present application.
Referring to FIG. 5, a structural block diagram of an embodiment of a data snapshot acquisition system of the present application is shown, which may specifically include the following modules:
a data record acquisition module 501, configured to obtain corresponding change data records based on a redo log (redolog) file recorded in a source end and starting from a specified position;
a data loading module 502, configured to load the change data records into the corresponding destination end;
a snapshot acquisition module 503, configured to, in the destination end, when a preset specified time is reached, obtain a first data snapshot corresponding to the specified time according to the change data records.
In a preferred implementation of the embodiments of the present application, the system may further include:
a merging module, configured to, when the specified position has a corresponding previously obtained second data snapshot, organize the first data snapshot and the second data snapshot into a third data snapshot of the specified time.
In a preferred implementation of the embodiments of the present application, the system may further include:
a tagging module, configured to, when loading the change data records into the destination end, add tag information to the change data records as needed.
In a preferred implementation of the embodiments of the present application, the change data records include primary key identifiers, and the snapshot acquisition module 503 may include the following sub-modules:
a judging sub-module, configured to detect the tag information in the destination end to determine whether the specified time has been reached;
a sorting sub-module, configured to, when the specified time has been reached, sort the change data records of each primary key identifier in reverse order;
a snapshot obtaining sub-module, configured to obtain the data record ranked first for each primary key identifier as the first data snapshot of that primary key identifier.
In a preferred implementation of the embodiments of the present application, the change data records further include data sequence numbers, and the sorting sub-module may be further configured to:
if the specified time has been reached, sort the change data records of each primary key identifier in reverse order according to the data sequence number.
In a preferred implementation of the embodiments of the present application, the data loading module 502 may include the following sub-modules:
a publishing sub-module, configured such that a preset publisher program publishes the change data records to a preset subscriber program, where the subscriber program is a program that subscribes to the change data records from the publisher program;
a format conversion sub-module, configured such that the subscriber program converts the change data records into the data format required by the destination end;
a loading sub-module, configured such that the subscriber program loads the format-converted change data records into the data table of the corresponding destination end.
In a preferred implementation of the embodiments of the present application, the change data records include source change timestamps, and the format conversion sub-module may include the following units:
a partition determination unit, configured such that the subscriber program determines, according to the source change timestamp of each change data record, the storage partition to which the change data record belongs, the storage partition including a storage partition identifier;
a writing unit, configured to write the change data record into the corresponding storage partition, and write the identifier of that storage partition into the position in the data table corresponding to the change data record.
In a preferred implementation of the embodiments of the present application, the system may further include:
a storage module, configured to store the change data records within a preset time period.
In a preferred implementation of the embodiments of the present application, there are multiple source ends, and the data record acquisition module 501 may include the following sub-module:
a collection sub-module, configured to collect the redolog files, recorded in the multiple source ends and starting from the specified position, into the publisher program, which parses the redolog files to obtain the corresponding change data records.
In a preferred implementation of the embodiments of the present application, the first data snapshot includes a first primary key identifier, a change type, and first value information, the second data snapshot includes a second primary key identifier and second value information, and the merging module may include the following sub-modules:
a merging sub-module, configured to merge the first data snapshot and the second data snapshot according to the first primary key identifier and/or the second primary key identifier;
a first organizing sub-module, configured to, when the first primary key identifier is null, organize the second primary key identifier and the second value information into the third data snapshot;
a second organizing sub-module, configured to, when the first primary key identifier is not null, organize the first primary key identifier and the first value information into the third data snapshot.
As the system embodiment is basically similar to the above method embodiments, its description is relatively brief; for related details, refer to the description of the method embodiments.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts among the embodiments, the embodiments may be referred to one another.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, an apparatus, or a computer program product. Therefore, the embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the embodiments of the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.
The embodiments of the present application are described with reference to flowcharts and/or block diagrams of the methods, terminal devices (systems), and computer program products according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing terminal device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing terminal device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing terminal device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal device, so that a series of operational steps are performed on the computer or other programmable terminal device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable terminal device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the embodiments of the present application have been described, those skilled in the art, once apprised of the basic inventive concept, can make additional changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments and all changes and modifications falling within the scope of the embodiments of the present application.
Finally, it should also be noted that in this document, relational terms such as first and second are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include", or any other variants thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or terminal device including a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or terminal device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or terminal device that includes that element.
The data snapshot acquisition method and system provided by the present application have been described in detail above. Specific examples have been used herein to explain the principles and implementations of the present application, and the descriptions of the above embodiments are only intended to help understand the method of the present application and its core ideas. Meanwhile, for those of ordinary skill in the art, changes may be made to the specific implementations and the scope of application according to the ideas of the present application. In summary, the contents of this specification should not be construed as limiting the present application.

Claims (20)

  1. A data snapshot acquisition method, characterized in that the method comprises:
    obtaining, based on a redo log (redolog) file recorded in a source end and starting from a specified position, corresponding change data records;
    loading the change data records into a corresponding destination end;
    in the destination end, when a preset specified time is reached, obtaining a first data snapshot corresponding to the specified time according to the change data records.
  2. The method according to claim 1, characterized by further comprising:
    if the specified position has a corresponding previously obtained second data snapshot, organizing the first data snapshot and the second data snapshot into a third data snapshot of the specified time.
  3. The method according to claim 1 or 2, characterized by further comprising:
    when loading the change data records into the destination end, adding tag information to the change data records as needed.
  4. The method according to claim 3, characterized in that the change data records include primary key identifiers, and the step of, in the destination end, when the preset specified time is reached, obtaining the first data snapshot corresponding to the specified time according to the change data records comprises:
    detecting the tag information in the destination end to determine whether the specified time has been reached;
    if the specified time has been reached, sorting the change data records of each primary key identifier in reverse order;
    obtaining the data record ranked first for each primary key identifier as the first data snapshot of that primary key identifier.
  5. The method according to claim 4, characterized in that the change data records further include data sequence numbers, and the step of, if the specified time has been reached, sorting the change data records of each primary key identifier in reverse order is:
    if the specified time has been reached, sorting the change data records of each primary key identifier in reverse order according to the data sequence number.
  6. The method according to claim 1, characterized in that the step of loading the change data records into the destination end comprises:
    publishing, by a preset publisher program, the change data records to a preset subscriber program, wherein the subscriber program is a program that subscribes to the change data records from the publisher program;
    converting, by the subscriber program, the change data records into a data format required by the destination end;
    loading, by the subscriber program, the format-converted change data records into a data table of the corresponding destination end.
  7. The method according to claim 6, characterized in that the change data records include source change timestamps, and the step of loading, by the subscriber program, the format-converted change data records into the data table of the corresponding destination end comprises:
    determining, by the subscriber program, according to the source change timestamp of each change data record, the storage partition to which the change data record belongs, the storage partition including a storage partition identifier;
    writing the change data record into the corresponding storage partition, and writing the identifier of that storage partition into a position in the data table corresponding to the change data record.
  8. The method according to claim 1, characterized by further comprising:
    storing the change data records within a preset time period.
  9. The method according to claim 6, characterized in that there are multiple source ends, and the step of obtaining the change data records according to the redo log (redolog) files recorded in the source ends comprises:
    collecting the redolog files, recorded in the multiple source ends and starting from the specified position, into the publisher program, which parses the redolog files to obtain the corresponding change data records.
  10. The method according to claim 2, characterized in that the first data snapshot includes a first primary key identifier, a change type, and first value information, the second data snapshot includes a second primary key identifier and second value information, and the step of organizing the first data snapshot and the second data snapshot into the third data snapshot of the specified time comprises:
    merging the first data snapshot and the second data snapshot according to the first primary key identifier and/or the second primary key identifier;
    if the first primary key identifier is null, organizing the second primary key identifier and the second value information into the third data snapshot;
    if the first primary key identifier is not null, organizing the first primary key identifier and the first value information into the third data snapshot.
  11. A data snapshot acquisition system, characterized in that the system comprises:
    a data record acquisition module, configured to obtain corresponding change data records based on a redo log (redolog) file recorded in a source end and starting from a specified position;
    a data loading module, configured to load the change data records into a corresponding destination end;
    a snapshot acquisition module, configured to, in the destination end, when a preset specified time is reached, obtain a first data snapshot corresponding to the specified time according to the change data records.
  12. The system according to claim 11, characterized by further comprising:
    a merging module, configured to, when the specified position has a corresponding previously obtained second data snapshot, organize the first data snapshot and the second data snapshot into a third data snapshot of the specified time.
  13. The system according to claim 11 or 12, characterized by further comprising:
    a tagging module, configured to, when loading the change data records into the destination end, add tag information to the change data records as needed.
  14. The system according to claim 13, characterized in that the change data records include primary key identifiers, and the snapshot acquisition module comprises:
    a judging sub-module, configured to detect the tag information in the destination end to determine whether the specified time has been reached;
    a sorting sub-module, configured to, when the specified time has been reached, sort the change data records of each primary key identifier in reverse order;
    a snapshot obtaining sub-module, configured to obtain the data record ranked first for each primary key identifier as the first data snapshot of that primary key identifier.
  15. The system according to claim 14, characterized in that the change data records further include data sequence numbers, and the sorting sub-module is further configured to:
    if the specified time has been reached, sort the change data records of each primary key identifier in reverse order according to the data sequence number.
  16. The system according to claim 11, characterized in that the data loading module comprises:
    a publishing sub-module, configured such that a preset publisher program publishes the change data records to a preset subscriber program, wherein the subscriber program is a program that subscribes to the change data records from the publisher program;
    a format conversion sub-module, configured such that the subscriber program converts the change data records into a data format required by the destination end;
    a loading sub-module, configured such that the subscriber program loads the format-converted change data records into a data table of the corresponding destination end.
  17. The system according to claim 16, characterized in that the change data records include source change timestamps, and the format conversion sub-module comprises:
    a partition determination unit, configured such that the subscriber program determines, according to the source change timestamp of each change data record, the storage partition to which the change data record belongs, the storage partition including a storage partition identifier;
    a writing unit, configured to write the change data record into the corresponding storage partition, and write the identifier of that storage partition into a position in the data table corresponding to the change data record.
  18. The system according to claim 11, characterized by further comprising:
    a storage module, configured to store the change data records within a preset time period.
  19. The system according to claim 16, characterized in that there are multiple source ends, and the data record acquisition module comprises:
    a collection sub-module, configured to collect the redolog files, recorded in the multiple source ends and starting from the specified position, into the publisher program, which parses the redolog files to obtain the corresponding change data records.
  20. The system according to claim 12, characterized in that the first data snapshot includes a first primary key identifier, a change type, and first value information, the second data snapshot includes a second primary key identifier and second value information, and the merging module comprises:
    a merging sub-module, configured to merge the first data snapshot and the second data snapshot according to the first primary key identifier and/or the second primary key identifier;
    a first organizing sub-module, configured to, when the first primary key identifier is null, organize the second primary key identifier and the second value information into the third data snapshot;
    a second organizing sub-module, configured to, when the first primary key identifier is not null, organize the first primary key identifier and the first value information into the third data snapshot.
PCT/CN2016/088518 2015-07-14 2016-07-05 一种数据快照获取的方法及系统 WO2017008657A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US15/870,675 US20180137134A1 (en) 2015-07-14 2018-01-12 Data snapshot acquisition method and system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510413154.8A CN106339274B (zh) 2015-07-14 2015-07-14 一种数据快照获取的方法及系统
CN201510413154.8 2015-07-14

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/870,675 Continuation US20180137134A1 (en) 2015-07-14 2018-01-12 Data snapshot acquisition method and system

Publications (1)

Publication Number Publication Date
WO2017008657A1 true WO2017008657A1 (zh) 2017-01-19

Family

ID=57756700

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/088518 WO2017008657A1 (zh) 2015-07-14 2016-07-05 一种数据快照获取的方法及系统

Country Status (3)

Country Link
US (1) US20180137134A1 (zh)
CN (1) CN106339274B (zh)
WO (1) WO2017008657A1 (zh)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111523002A (zh) * 2020-04-23 2020-08-11 中国农业银行股份有限公司 一种主键分配方法、装置、服务器及存储介质
CN112084117A (zh) * 2020-09-27 2020-12-15 网易(杭州)网络有限公司 一种测试方法和装置
CN112988661A (zh) * 2021-02-19 2021-06-18 金蝶软件(中国)有限公司 一种余额表更新方法及其相关设备
CN115934428A (zh) * 2023-01-10 2023-04-07 湖南三湘银行股份有限公司 一种mysql数据库的主灾备切换方法、装置及电子设备

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107229670A (zh) * 2017-04-10 2017-10-03 中国科学院信息工程研究所 基于Avro的通用数据序列化及反序列化方法
CN107330003A (zh) * 2017-06-12 2017-11-07 上海藤榕网络科技有限公司 数据同步方法、系统、存储器及数据同步设备
CN109299036B (zh) * 2017-07-25 2021-01-05 北京嘀嘀无限科技发展有限公司 标签生成方法、装置、服务器和计算机可读存储介质
CN109947841A (zh) * 2017-07-27 2019-06-28 成都蓝盾网信科技有限公司 基于事务日志分析的单导系统中oracle数据库同步技术
CN109408279A (zh) * 2017-08-16 2019-03-01 北京京东尚科信息技术有限公司 数据备份方法和装置
CN107885881A (zh) * 2017-11-29 2018-04-06 顺丰科技有限公司 业务数据实时上报、获取方法、装置、设备及其存储介质
CN108536798B (zh) * 2018-04-02 2020-12-01 携程旅游网络技术(上海)有限公司 订单级别的数据库数据的恢复方法及系统
US10606829B1 (en) * 2018-04-17 2020-03-31 Intuit Inc. Methods and systems for identifying data inconsistencies between electronic record systems using data partitioning
CN109213817B (zh) * 2018-08-10 2019-09-06 杭州数梦工场科技有限公司 增量数据抽取方法、装置及服务器
CN109710454A (zh) * 2018-11-08 2019-05-03 厦门集微科技有限公司 一种云主机快照方法及装置
CN109857991B (zh) * 2018-12-25 2023-08-18 北京像素软件科技股份有限公司 数据存储方法、装置及电子设备
CN109840246A (zh) * 2019-01-31 2019-06-04 北京三快在线科技有限公司 一种用于计算目标特征的历史数据的方法及装置
CN109918231B (zh) * 2019-02-28 2021-02-26 上海达梦数据库有限公司 数据重整的异常修复方法、装置、设备和存储介质
CN110196832A (zh) * 2019-06-04 2019-09-03 北京百度网讯科技有限公司 用于获取快照信息的方法及装置
CN111211864B (zh) * 2019-12-25 2022-07-29 安徽机电职业技术学院 一种数据传输错误处理方法及系统
CN111309673B (zh) * 2020-02-12 2023-06-23 普信恒业科技发展(北京)有限公司 增量数据的快照数据生成方法及装置
CN112667744B (zh) * 2020-12-28 2022-05-13 武汉达梦数据库股份有限公司 一种数据库中数据批量同步更新的方法和装置
CN112925768B (zh) * 2021-03-03 2024-02-27 北京中安星云软件技术有限公司 一种基于Protobuf协议的HBASE数据库解析方法及系统
CN113360322B (zh) * 2021-06-25 2023-06-13 上海上讯信息技术股份有限公司 一种基于备份系统恢复数据的方法及设备
CN115510144B (zh) * 2022-11-17 2023-04-07 北京滴普科技有限公司 一种用于数据库实时变化数据抓取的方法及系统

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080162590A1 (en) * 2007-01-03 2008-07-03 Oracle International Corporation Method and apparatus for data rollback
CN101436207A (zh) * 2008-12-16 2009-05-20 浪潮通信信息系统有限公司 一种基于日志快照的数据恢复和同步方法
CN101807210A (zh) * 2010-04-26 2010-08-18 中兴通讯股份有限公司 一种数据库间数据同步的方法、系统及设备
US20110060724A1 (en) * 2009-09-08 2011-03-10 Oracle International Corporation Distributed database recovery
CN103678494A (zh) * 2013-11-15 2014-03-26 北京奇虎科技有限公司 客户端同步服务端数据的方法及装置
CN103678718A (zh) * 2013-12-31 2014-03-26 金蝶软件(中国)有限公司 数据库同步方法及系统

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6553388B1 (en) * 2000-07-20 2003-04-22 International Business Machines Corporation Database deltas using Cyclic Redundancy Checks
US20060259527A1 (en) * 2005-05-13 2006-11-16 Devarakonda Murthy V Changed files list with time buckets for efficient storage management
US20070027938A1 (en) * 2005-07-26 2007-02-01 Scribe Software Inc. Detecting data changes
TWI316188B (en) * 2006-05-17 2009-10-21 Ind Tech Res Inst Mechanism and method to snapshot data
US8161077B2 (en) * 2009-10-21 2012-04-17 Delphix Corp. Datacenter workflow automation scenarios using virtual databases
CN104123201B (zh) * 2013-04-28 2016-06-29 广州鼎甲计算机科技有限公司 一种零损失数据的备份方法和系统
CN104133757A (zh) * 2013-11-28 2014-11-05 腾讯科技(成都)有限公司 一种获取内存信息的方法及终端
US8938414B1 (en) * 2014-06-05 2015-01-20 GoodData Corporation Data abstraction layer for interfacing with reporting systems

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080162590A1 (en) * 2007-01-03 2008-07-03 Oracle International Corporation Method and apparatus for data rollback
CN101436207A (zh) * 2008-12-16 2009-05-20 浪潮通信信息系统有限公司 一种基于日志快照的数据恢复和同步方法
US20110060724A1 (en) * 2009-09-08 2011-03-10 Oracle International Corporation Distributed database recovery
CN101807210A (zh) * 2010-04-26 2010-08-18 中兴通讯股份有限公司 一种数据库间数据同步的方法、系统及设备
CN103678494A (zh) * 2013-11-15 2014-03-26 北京奇虎科技有限公司 客户端同步服务端数据的方法及装置
CN103678718A (zh) * 2013-12-31 2014-03-26 金蝶软件(中国)有限公司 数据库同步方法及系统

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111523002A (zh) * 2020-04-23 2020-08-11 中国农业银行股份有限公司 一种主键分配方法、装置、服务器及存储介质
CN111523002B (zh) * 2020-04-23 2023-06-09 中国农业银行股份有限公司 一种主键分配方法、装置、服务器及存储介质
CN112084117A (zh) * 2020-09-27 2020-12-15 网易(杭州)网络有限公司 一种测试方法和装置
CN112084117B (zh) * 2020-09-27 2023-08-08 网易(杭州)网络有限公司 一种测试方法和装置
CN112988661A (zh) * 2021-02-19 2021-06-18 金蝶软件(中国)有限公司 一种余额表更新方法及其相关设备
CN112988661B (zh) * 2021-02-19 2024-03-19 金蝶软件(中国)有限公司 一种余额表更新方法及其相关设备
CN115934428A (zh) * 2023-01-10 2023-04-07 湖南三湘银行股份有限公司 一种mysql数据库的主灾备切换方法、装置及电子设备
CN115934428B (zh) * 2023-01-10 2023-05-23 湖南三湘银行股份有限公司 一种mysql数据库的主灾备切换方法、装置及电子设备

Also Published As

Publication number Publication date
US20180137134A1 (en) 2018-05-17
CN106339274A (zh) 2017-01-18
CN106339274B (zh) 2019-07-02

Similar Documents

Publication Publication Date Title
WO2017008657A1 (zh) 一种数据快照获取的方法及系统
US11829360B2 (en) Database workload capture and replay
US9589041B2 (en) Client and server integration for replicating data
CN107544984B (zh) 一种数据处理的方法和装置
US9747127B1 (en) Worldwide distributed job and tasks computational model
CN111400408A (zh) 数据同步方法、装置、设备及存储介质
CN106126601A (zh) 一种社保大数据分布式预处理方法及系统
Liang et al. Express supervision system based on NodeJS and MongoDB
US10877995B2 (en) Building a distributed dwarf cube using mapreduce technique
TW201530328A (zh) 爲半結構化資料構建NoSQL資料庫索引的方法及裝置
WO2020192064A1 (zh) 一种增量数据一致性实现方法及装置
WO2017096892A1 (zh) 索引构建方法、查询方法及对应装置、设备、计算机存储介质
US20140095549A1 (en) Method and Apparatus for Generating Schema of Non-Relational Database
US20170270175A1 (en) Tracking data replication and discrepancies in incremental data audits
WO2023061249A1 (zh) 分布式数据库的数据处理方法、系统、设备和存储介质
US11553023B2 (en) Abstraction layer for streaming data sources
US20170270153A1 (en) Real-time incremental data audits
CN112214453B (zh) 大规模工业数据压缩存储方法、系统及介质
CN110019169B (zh) 一种数据处理的方法及装置
Kang et al. Reducing i/o cost in olap query processing with mapreduce
US10019472B2 (en) System and method for querying a distributed dwarf cube
Ma et al. Live data replication approach from relational tables to schema-free collections using stream processing framework
CN113032368A (zh) 一种数据迁移方法、装置、存储介质及平台
Mekterović et al. Delta view generation for incremental loading of large dimensions in a data warehouse
CN107526573B (zh) 采用并行流水线处理遥感图像的方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16823804

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16823804

Country of ref document: EP

Kind code of ref document: A1