Data push method and system
Technical field
The present invention relates to the real-time consumption of large-scale data streams based on parsing database incremental logs, and more specifically to a data push method and system.
Background art
As enterprise business grows, the demand for large-scale data extraction becomes ever stronger. Data extraction means extracting the required data from a source database. In practice, the source database is most often a relational database. Common data extraction modes include: (1) full data extraction, which is similar to data migration or data replication and extracts the data of a source table or view from the database unchanged; and (2) incremental data extraction, which extracts only the data that has been added to or modified in the target tables since the previous extraction. Incremental extraction is more widely used than full extraction. How to capture changed data is the key to incremental extraction. A capture method generally has to meet two requirements: accuracy, meaning that changed data in the business system is captured exactly; and performance, meaning that the capture puts as little pressure as possible on the business system and does not affect existing business.
When the traditional approach of periodically dumping incremental data can no longer meet demand, acquiring incremental data in real time becomes the way to solve the problem. A data push platform is a complete solution for pushing incremental and full data between heterogeneous databases. Incremental push aims to deliver incremental data to subscribing clients efficiently and stably at large scale so as to meet business needs.
The traditional way of extracting incremental data is to dump it periodically into data files, but this approach has the following problems: a timestamp field has to be added to the source business tables, which pollutes them to some extent; periodic execution introduces a certain delay into the data; the dumped data files are tightly coupled to a specific business and are hard to reuse; and a scheduled task often puts heavy instantaneous pressure on the source database and the network when it starts.
An incremental data stream based on parsing the database incremental log solves these problems well: it does not intrude on the source business tables; incremental data is available in near real time (provided the subscriber consumes fast enough); data can be extracted and subscribed to on demand; and the load is spread out, so instantaneous peaks are unlikely. For example, the open-source project canal used in the prior art parses database incremental logs and provides incremental subscription and consumption. Canal handles the parsing of database incremental logs well. Building on canal, incremental data can be extracted from the source through configured extraction tasks and held temporarily in memory, and customized subscription clients can then be developed according to business needs.
Fig. 1 schematically illustrates a data push system according to the prior art. The prior-art data push system 100 comprises: a source database 101; a database incremental log parsing server 103, for example a canal server, which parses the database incremental log and provides incremental subscription and consumption; and a subscriber 105, which generates subscription tasks.
Although canal provides the basic function of parsing incremental data, it still faces the following problems:
(1) Data extraction and data subscription cannot be one-to-many
Because canal does not persist incremental data to disk (the data is only held temporarily in memory), it cannot extract data once and have it subscribed to by multiple clients. Extraction and subscription must therefore occur in pairs. Too many extraction tasks increase the pressure on the source database, waste a large amount of network traffic, and reduce the carrying capacity and throughput of a single canal server.
(2) The retention time of incremental data is not controllable
This problem is also related to the design described above. Because incremental data is not persisted, and the log retention time of the source database cannot be controlled by the data push platform, the platform cannot promise downstream subscribers a retention period for incremental data. This reduces the platform's ability to provide services externally.
(3) The subscriber depends strongly on the extractor
The dependence has two aspects.
System level: the extractor and the subscriber process data serially over a long-lived connection, so the consumption rate of the subscriber directly affects the extraction rate of the extractor.
Business level: extraction tasks are bounded by database instance and cannot hide from the subscriber information that it should not have to care about, such as how the source is split into multiple databases and tables. This results in strong coupling between extraction tasks and subscription tasks.
(4) The subscriber lacks support for common functions
Canal itself focuses only on extracting and subscribing to data, and its support for some commonly needed subscriber functions is weak. For example, although structured incremental data can be obtained through canal's subscriber SDK (Software Development Kit), there is insufficient support for the most common application scenario: parsing the structured data, converting it into the corresponding DML (Data Manipulation Language) statements, and finally inserting it into the target DB (database). Another open-source project, otter (project address: https://github.com/alibaba/otter), supports part of this functionality but still cannot meet complex and flexible customization requirements.
Summary of the invention
According to one aspect of the present invention, a data push method is provided, comprising: an extraction task creation process, in which data is dumped from a source database to a database incremental log parsing server with the database instance as the boundary, incremental data is obtained by parsing the database incremental log, and the obtained incremental data is held temporarily in the memory of the database incremental log parsing server; a message distribution process, in which the obtained incremental data is distributed as messages, each with a corresponding topic and tag according to the required rule, into one or more corresponding message queues in a distributed message queue server, where the topic represents a logical database name, the tag represents a logical table name, and every piece of incremental data distributed as a message carries its corresponding topic and tag; and a subscription and push process, in which the distributed message queue server pushes the one or more message queues to the subscriber based on the topic subscribed to by each subscription task of the subscriber.
According to another aspect of the present invention, a data push system is provided, comprising: a source database; a database incremental log parsing server; a distributed message queue server; and a subscriber. Data is dumped from the source database to the database incremental log parsing server with the database instance as the boundary; the database incremental log parsing server obtains incremental data by parsing the database incremental log and holds the obtained incremental data temporarily in its memory. The database incremental log parsing server distributes the obtained incremental data as messages, each with a corresponding topic and tag according to the required rule, into one or more corresponding message queues in the distributed message queue server, where the topic represents a logical database name, the tag represents a logical table name, and every piece of incremental data distributed as a message carries its corresponding topic and tag. The distributed message queue server pushes the one or more message queues to the subscriber based on the topic subscribed to by each subscription task of the subscriber.
By introducing a persistent message queue, the present invention makes it possible to extract data once and have it subscribed to by multiple clients. Extraction and subscription no longer need to occur in pairs, which greatly reduces network traffic, makes it possible for incremental data to be consumed by multiple subscribers, and greatly improves the carrying capacity and utilization of the servers. In addition, by using a dedicated SDK that wraps the RocketMQ client, the subscriber can be custom-developed, thereby meeting complex and flexible customization requirements.
Brief description of the drawings
Fig. 1 schematically illustrates a data push system according to the prior art.
Fig. 2 schematically illustrates a data push system according to an embodiment of the present invention.
Fig. 3 schematically illustrates a flowchart of a data push method according to an embodiment of the present invention.
Detailed description of the embodiments
Technical solutions according to embodiments of the present invention are explained in detail below with reference to the accompanying drawings.
The term "dump" means exporting data in MySQL (which may include the data structure) to text, so that it can be imported into another database instance or used as input data for another application or system.
Fig. 2 schematically illustrates a data push system according to an embodiment of the present invention. As shown in Fig. 2, the data push system 200 comprises: a source database 201, a database incremental log parsing server 203, a distributed message queue server 205, and a subscriber 207. For purposes of illustration, Fig. 2 shows only three message queues 1, 2, 3 and three corresponding subscription tasks 1, 2, 3, but those skilled in the art will appreciate that the present invention is not limited thereto and that there may be more or fewer message queues and subscription tasks.
The source database 201 is not limited to any particular type of database; it may be, for example, a MySQL database. "MySQL" is a small open-source relational database management system.
The database incremental log parsing server 203 parses the database incremental log and provides incremental subscription and consumption. It may be, for example, a canal server. "canal" parses the MySQL database incremental log and provides incremental data subscription and consumption. Open-source address of the canal project: https://github.com/alibaba/canal.
The distributed message queue server 205 provides a distributed, persistent message queue service. It guarantees message ordering and message persistence, and can therefore satisfy both the strong ordering characteristic of incremental data and the requirement for persistence. A data retention period can be promised to external users according to the disk capacity of the deployed distributed message queue server 205; as long as the data is still within its retention period, a subscriber can consume it starting from any position. The distributed message queue server 205 may be, for example, a RocketMQ server. "RocketMQ" is distributed message-oriented middleware built on a queue model, with the following characteristics: it supports strict message ordering; it supports both the topic (Topic) and queue (Queue) models; it can accumulate hundreds of millions of messages; it has good distributed characteristics; and it supports both push (Push) and pull (Pull) modes of message consumption. Open-source address of the RocketMQ project: https://github.com/alibaba/RocketMQ. Introducing distributed message queue middleware such as RocketMQ greatly improves the scalability of the system: the distribution of incremental data across different distributed message queue servers can be adjusted promptly according to the volume of business data, so that the system can be expanded quickly. The term "middleware" refers to independent system software or a service program through which distributed application software shares resources across different technologies.
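The relationship between disk capacity and the retention period that can be promised may be illustrated with a simple estimate. The following is a minimal sketch only; the function name `retention_hours`, the `headroom` parameter, and the assumption of a roughly constant ingest rate are hypothetical and are not taken from RocketMQ or from the embodiment.

```python
def retention_hours(disk_bytes, ingest_bytes_per_hour, headroom=0.2):
    """Estimate how many hours of messages fit on disk.

    Assumptions (illustrative only): the ingest rate is roughly constant and
    a fraction `headroom` of the disk is kept free as a safety margin.
    """
    if ingest_bytes_per_hour <= 0:
        raise ValueError("ingest rate must be positive")
    if not 0.0 <= headroom < 1.0:
        raise ValueError("headroom must be in [0, 1)")
    usable_bytes = disk_bytes * (1.0 - headroom)
    return usable_bytes / ingest_bytes_per_hour
```

An operator could compare the result with the retention period requested by downstream subscribers before committing to it; a real deployment would also have to account for replication and index overhead.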
The subscriber 207 generates subscription tasks. Although Fig. 2 schematically shows one subscriber 207 with three subscription tasks 1, 2, 3, those skilled in the art will appreciate that the present invention is not limited thereto: there may be one or more subscribers, and each subscriber may have one or more subscription tasks. If the subscriber 207 serves the most common scenario, pushing data into a database, only the simplest subscription configuration is needed (for example, the subscribed topic and the target DB), and the subscriber 207 can then be started without further development. If the subscriber 207 requires custom development, a dedicated SDK that wraps, for example, the RocketMQ client can be used; once the subscribed topic has been specified, incremental data can be obtained. Multiple subscribers can subscribe to the incremental data of the same topic at the same time without affecting one another. Developing general-purpose modules for the subscriber reduces the cost of developing and maintaining subscribers. These modules greatly reduce the development cost of subscribers in common scenarios and decouple the subscriber from the extractor, so that the subscriber does not need to be aware of how the source is split into multiple databases and tables, which makes the overall solution easier to extend. Through modular design and a clear boundary between extraction tasks and subscription tasks, and in combination with an existing automated deployment system, subscription and extraction tasks can be managed and maintained conveniently, greatly reducing the online maintenance workload.
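For the common database push scenario, a general-purpose subscriber module has to turn each incremental row change into a DML statement for the target DB. The following is a minimal sketch of that conversion. The event layout (the keys `type`, `table`, `row`, and `key`) and the function name `to_dml` are assumptions made for illustration; they are not the message format of canal or RocketMQ. The `%s` placeholders follow the parameter style commonly used by Python MySQL drivers.

```python
def to_dml(event):
    """Convert one row-change event into (sql, params).

    Hypothetical event layout:
      type  : "INSERT", "UPDATE" or "DELETE"
      table : target table name (assumed to come from trusted configuration)
      row   : dict of column name -> value
      key   : list of primary-key column names (UPDATE and DELETE only)
    """
    table, row, kind = event["table"], event["row"], event["type"]
    if kind == "INSERT":
        cols = list(row)
        sql = "INSERT INTO {} ({}) VALUES ({})".format(
            table, ", ".join(cols), ", ".join(["%s"] * len(cols)))
        return sql, [row[c] for c in cols]

    key = event["key"]
    where = " AND ".join("{} = %s".format(k) for k in key)
    key_values = [row[k] for k in key]
    if kind == "UPDATE":
        cols = [c for c in row if c not in key]
        if not cols:
            raise ValueError("UPDATE event has no non-key columns")
        assignments = ", ".join("{} = %s".format(c) for c in cols)
        sql = "UPDATE {} SET {} WHERE {}".format(table, assignments, where)
        return sql, [row[c] for c in cols] + key_values
    if kind == "DELETE":
        return "DELETE FROM {} WHERE {}".format(table, where), key_values
    raise ValueError("unsupported event type: {}".format(kind))
```

Values are passed as parameters rather than formatted into the statement; table and column names are interpolated directly, so in this sketch they must come from trusted subscription configuration.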
Fig. 3 schematically illustrates a flowchart of a data push method according to an embodiment of the present invention.
As shown in Fig. 3, in step 301, the database incremental log parsing server 203 creates an extraction task: with the database instance as the boundary, data is dumped from the source database 201 to the database incremental log parsing server 203, which obtains incremental data by parsing the database incremental log and holds the obtained incremental data temporarily in its memory.
In step 303, the database incremental log parsing server 203 distributes the obtained incremental data as messages, each with a corresponding topic and tag according to the required rule, into the message queues 1, 2, 3 in the distributed message queue server 205. Every piece of incremental data sent as a message carries its own topic and tag so that the subscriber 207 can subscribe and filter; the topic represents a logical database name, and the tag represents a logical table name. The required rule referred to here is the mapping from the database and table information of the source incremental data to the topic and tag of the queue. The most common scenario is a source that has been split into multiple databases and tables: for example, the database names product_pop_1, product_pop_2, ..., product_pop_N can be mapped to the queue topic product_pop, and likewise the table name sku_N is mapped to the tag sku. This rule can be defined freely according to business needs.
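The mapping rule of step 303 can be sketched as follows. This is an illustrative sketch only: the rule tables `TOPIC_RULES` and `TAG_RULES` and the function `map_to_topic_tag` are hypothetical names, and regular expressions are merely one possible way to express such a rule. The names product_pop and sku are taken from the example above.

```python
import re

# Hypothetical rule tables: each entry maps a pattern over the physical
# (sharded) name to the logical name used as topic or tag.
TOPIC_RULES = [(re.compile(r"^product_pop_\d+$"), "product_pop")]
TAG_RULES = [(re.compile(r"^sku_\d+$"), "sku")]


def _apply_rules(rules, physical_name):
    """Return the logical name of the first matching rule."""
    for pattern, logical_name in rules:
        if pattern.match(physical_name):
            return logical_name
    # No rule matched: fall back to the physical name unchanged.
    return physical_name


def map_to_topic_tag(database, table):
    """Map a source database/table pair to a (topic, tag) pair."""
    return _apply_rules(TOPIC_RULES, database), _apply_rules(TAG_RULES, table)
```

With these rules, every shard of the product_pop database is published under a single topic, so a subscriber does not need to know how many shards exist at the source.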
In step 305, the distributed message queue server 205 pushes the message queues 1, 2, 3 to the subscriber 207 based on the topic subscribed to by each of the subscription tasks 1, 2, 3 of the subscriber 207.
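The effect of steps 303 and 305, where data is extracted once and consumed independently by several subscription tasks, can be illustrated with a small in-memory model. This sketch is not RocketMQ and does not use its API; the class `TopicLog` and its methods are hypothetical stand-ins for a persistent, ordered message queue in which each subscription task keeps its own read position.

```python
from collections import defaultdict


class TopicLog:
    """In-memory stand-in for a persistent, ordered message queue."""

    def __init__(self):
        self._messages = defaultdict(list)  # topic -> ordered (tag, payload) list
        self._offsets = defaultdict(int)    # (topic, task) -> next index to read

    def publish(self, topic, tag, payload):
        """Append one message to the topic, preserving arrival order."""
        self._messages[topic].append((tag, payload))

    def poll(self, topic, task, tags=None, start=None):
        """Return the unread messages of one subscription task.

        tags  : optional set of tags to keep; other messages are skipped
        start : optional position to rewind or fast-forward to before reading
        """
        key = (topic, task)
        if start is not None:
            self._offsets[key] = start
        log = self._messages[topic]
        batch = log[self._offsets[key]:]
        self._offsets[key] = len(log)
        return [(t, p) for t, p in batch if tags is None or t in tags]
```

Because each task has its own offset, a slow or restarted task does not hold back the others, and a task can re-read retained messages by passing `start`.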
In the past, extraction tasks were created not per database instance but per group of business-related logical databases and tables. Because one database instance holds many logical databases and tables, multiple data extraction connections were opened on a single database instance, one for each business database and table group. This is why Fig. 1 shows three connections (dump 1-3) between the source database 101 and the database incremental log parsing server 103, whereas Fig. 2 shows only one connection between the source database 201 and the database incremental log parsing server 203. In contrast, the data push system and method according to the above embodiments of the present invention push incremental data through a persistent message queue. Because extraction tasks are created with the database instance as the boundary, the pressure on the source database is directly reduced compared with the earlier approach: the three dump connections shown schematically in Fig. 1 (dozens or even hundreds in real application scenarios) are reduced to the single connection shown in Fig. 2. Extraction and subscription also no longer need to occur in pairs, which greatly reduces network traffic, makes it possible for incremental data to be consumed by multiple subscribers, and greatly improves the carrying capacity and utilization of the servers.
The embodiments described above are only preferred embodiments of the present invention and are not intended to limit it. It will be apparent to those skilled in the art that various modifications and changes can be made to the embodiments of the present invention without departing from its spirit and scope. The present invention is therefore intended to cover all such modifications and variations that fall within the scope of the invention as defined by the appended claims.