Data push method and system
Technical field
The present invention relates to technology for real-time consumption of large-scale data streams based on parsing of database incremental logs and, more specifically, to a data push method and system.
Background technology
As enterprise business grows, the demand for large-scale data extraction keeps increasing. Data extraction means extracting the required data from a source database. In practice, the source database is usually a relational database. Common extraction modes are: (1) full data extraction, which is similar to data migration or data replication and extracts the data of a table or view from the source database unchanged; and (2) incremental data extraction, which extracts only the data that has been added or changed in the target tables since the previous extraction. Incremental extraction is more widely used than full extraction. Capturing the changed data is the key to incremental extraction. A capture method generally has to meet two requirements: accuracy, meaning that changed data in the business system is captured accurately; and performance, meaning that the capture puts as little pressure as possible on the business system and does not affect existing business.
Now that the traditional approach of periodically dumping incremental data can no longer meet demand, real-time acquisition of incremental data has become the trend for solving this problem. A data push platform is a complete solution for pushing incremental or full data between heterogeneous databases. Incremental push aims to deliver incremental data to subscribing clients efficiently and stably at large scale in order to meet business requirements.
The traditional way to extract incremental data is to dump it periodically into data files. This approach has several problems: a timestamp field has to be added to the source business table, which pollutes that table to some extent; periodic execution means the data is always delayed to some degree; the dumped data files are tightly coupled to one business use and are hard to reuse; and the start of each scheduled task often puts a large instantaneous load on the source database and the network.
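The timed-dump approach described above can be sketched as follows. This is only an illustration under assumed names: the table `orders`, its `updated_at` timestamp column, and the function `dump_increment` are hypothetical and do not come from any particular system. It shows why the source table needs an extra timestamp field and why changes only become visible at the next scheduled run.

```python
import sqlite3

def dump_increment(conn, last_run):
    """Return the rows changed after last_run, found via a timestamp column."""
    cur = conn.execute(
        "SELECT id, name, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_run,),
    )
    return cur.fetchall()

# Hypothetical source table; updated_at exists only to support the dump.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "a", 100), (2, "b", 200), (3, "c", 300)],
)

# A scheduled task that last ran at time 150 picks up only the later changes.
rows = dump_increment(conn, 150)
```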
An incremental data stream based on parsing the database incremental log solves these problems well: it does not intrude on the source business tables; the incremental data is available in near real time (as long as the subscription end consumes fast enough); data is extracted and subscribed to on demand; and the load is spread out, so instantaneous peaks are unlikely. For example, the open-source project canal used in the prior art is based on database incremental log parsing and provides incremental subscription and consumption. canal handles the log parsing well. Building on canal, one can configure an extraction task that extracts incremental data from the source and buffers it in memory, and then develop a customized subscription client according to business requirements.
Fig. 1 schematically illustrates a data push system according to the prior art. The prior-art data push system 100 includes: a source database 101; a database incremental log parsing server 103, which parses the database incremental log and provides incremental subscription and consumption, for example a canal server; and a subscription end 105, which generates subscription tasks.
Although canal provides the basic function of incremental data parsing, it still faces the following problems:
(1) Data extraction and data subscription cannot be one-to-many
Because canal is designed not to persist incremental data (that is, the data is only buffered in memory), canal cannot extract data once and have it subscribed to by multiple clients. Extraction and subscription therefore have to occur in pairs, and the resulting excess of extraction tasks increases the pressure on the source database, wastes a large amount of network traffic, and reduces the carrying capacity and throughput of a single canal server.
(2) The retention time of incremental data cannot be controlled
This problem also stems from the design above. Because the incremental data is not persisted and the log retention time of the source database is outside the control of the data push platform, the platform cannot promise downstream subscription ends any retention period for incremental data. This weakens the platform's ability to provide service externally.
(3) The subscription end depends too strongly on the extraction end
The dependence has two aspects.
System level: the extraction end and the subscription end process data serially over a long-lived connection, so the consumption rate of the subscription end directly affects the extraction speed of the extraction end.
Business level: an extraction task is bounded by a database instance and cannot shield the subscription end from details it should not need to care about, such as the sharding of source databases and tables. This also leads to strong coupling between extraction tasks and subscription tasks.
(4) The subscription end lacks support for common functions
canal focuses only on extracting and subscribing to the data itself and offers weak support for functions that subscription ends commonly need. For example, the subscription-end SDK (Software Development Kit) of canal can obtain structured incremental data, but the most common application scenario, namely parsing that structured data, converting it into the corresponding DML (Data Manipulation Language) statements, and finally applying them to a target DB (database), is insufficiently supported. Another open-source project, otter (https://github.com/alibaba/otter), supports part of this functionality but still cannot meet complex and flexible customization requirements.
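The scenario of turning structured incremental data into DML and applying it to a target DB can be sketched as follows. The event layout (`type`, `table`, `key`, `after`) and the function `to_dml` are assumptions made for illustration only; they are not the record format of canal or otter. Table and column names are interpolated directly and are assumed to come from a trusted source, while values are passed as bound parameters.

```python
import sqlite3

def to_dml(event):
    """Convert one row-change event (hypothetical layout) into (sql, params)."""
    table, kind = event["table"], event["type"]
    if kind == "INSERT":
        cols = list(event["after"])
        marks = ", ".join("?" for _ in cols)
        sql = f"INSERT INTO {table} ({', '.join(cols)}) VALUES ({marks})"
        return sql, [event["after"][c] for c in cols]
    if kind == "UPDATE":
        cols, keys = list(event["after"]), list(event["key"])
        sets = ", ".join(f"{c} = ?" for c in cols)
        where = " AND ".join(f"{k} = ?" for k in keys)
        params = [event["after"][c] for c in cols] + [event["key"][k] for k in keys]
        return f"UPDATE {table} SET {sets} WHERE {where}", params
    if kind == "DELETE":
        keys = list(event["key"])
        where = " AND ".join(f"{k} = ?" for k in keys)
        return f"DELETE FROM {table} WHERE {where}", [event["key"][k] for k in keys]
    raise ValueError(f"unsupported event type: {kind}")

# Hypothetical target DB and a short stream of incremental events.
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE sku (id INTEGER PRIMARY KEY, price INTEGER)")
events = [
    {"type": "INSERT", "table": "sku", "after": {"id": 1, "price": 10}},
    {"type": "INSERT", "table": "sku", "after": {"id": 2, "price": 20}},
    {"type": "UPDATE", "table": "sku", "key": {"id": 1}, "after": {"price": 15}},
    {"type": "DELETE", "table": "sku", "key": {"id": 2}},
]
for event in events:
    sql, params = to_dml(event)
    target.execute(sql, params)  # events must be applied in their original order

final_rows = target.execute("SELECT id, price FROM sku ORDER BY id").fetchall()
```

Applying the events strictly in order matters here: swapping the update and the first insert would leave the target in a different state, which is why ordered delivery of incremental data is a requirement.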
Summary of the invention
According to one aspect of the present invention, there is provided a data push method, including: extraction task creation processing, in which data is dumped from a source database to a database incremental log parsing server with a database instance as the boundary, incremental data is obtained by parsing the database incremental log, and the obtained incremental data is buffered in the memory of the database incremental log parsing server; message distribution processing, in which the obtained incremental data is distributed as messages, each with a corresponding topic and tag according to a required rule, into one or more corresponding message queues in a distributed message queue server, where the topic represents a logical database name, the tag represents a logical table name, and every piece of incremental data distributed as a message contains its corresponding topic and tag; and subscription push processing, in which the distributed message queue server pushes the one or more message queues to a subscription end based on the topic subscribed to by each subscription task of the subscription end.
According to another aspect of the present invention, there is provided a data push system, including: a source database; a database incremental log parsing server; a distributed message queue server; and a subscription end, wherein data is dumped from the source database to the database incremental log parsing server with a database instance as the boundary, and the database incremental log parsing server obtains incremental data by parsing the database incremental log and buffers the obtained incremental data in its memory; the database incremental log parsing server distributes the obtained incremental data as messages, each with a corresponding topic and tag according to a required rule, into one or more corresponding message queues in the distributed message queue server, where the topic represents a logical database name, the tag represents a logical table name, and every piece of incremental data distributed as a message contains its corresponding topic and tag; and the distributed message queue server pushes the one or more message queues to the subscription end based on the topic subscribed to by each subscription task of the subscription end.
By introducing a persistent message queue, the present invention makes it possible to extract data once and have it subscribed to by multiple clients. That is, extraction and subscription no longer need to occur in pairs, which greatly reduces network traffic, allows the incremental data to be consumed by more subscription ends, and greatly improves the carrying capacity and utilization of the server. In addition, by using a dedicated SDK that wraps the RocketMQ client, the subscription end can be custom-developed to meet complex and flexible customization requirements.
Brief description of the drawings
Fig. 1 schematically illustrates a data push system according to the prior art.
Fig. 2 schematically illustrates a data push system according to an embodiment of the present invention.
Fig. 3 schematically illustrates a flow chart of a data push method according to an embodiment of the present invention.
Embodiment
The technical solution according to embodiments of the present invention is explained in detail below with reference to the accompanying drawings.
The term "dump" refers to exporting the data in MySQL (optionally including the data structure) to a text file, so that it can be imported into another database instance or used as input data for another application or system.
Fig. 2 schematically illustrates a data push system according to an embodiment of the present invention. As shown in Fig. 2, the data push system 200 includes a source database 201, a database incremental log parsing server 203, a distributed message queue server 205, and a subscription end 207. For purposes of illustration, Fig. 2 shows only three message queues 1, 2, 3 and three corresponding subscription tasks 1, 2, 3; those skilled in the art will understand that the invention is not limited to this and that there may be more or fewer message queues and subscription tasks.
The source database 201 is not limited to any specific type of database and may be, for example, a MySQL database. MySQL is a small, open-source relational database management system.
The database incremental log parsing server 203 parses the database incremental log and provides incremental subscription and consumption. It may be, for example, a canal server. canal is based on parsing the incremental log of a MySQL database and provides incremental data subscription and consumption. The canal project is open source at: https://github.com/alibaba/canal.
The distributed message queue server 205 provides a distributed, persistent message queue service. It guarantees message ordering and message persistence and thereby meets the requirements of incremental data for strict ordering and persistence. Depending on the disk capacity deployed for the distributed message queue server 205, a number of days of data retention can be promised externally; as long as the data is within its retention period, a subscription end can consume it starting from any position. The distributed message queue server 205 may be, for example, a RocketMQ server. RocketMQ is a distributed message middleware with a queue model and has the following features: support for strict message ordering; support for both topic (Topic) and queue (Queue) modes; the capacity to accumulate hundreds of millions of messages; relatively friendly distributed characteristics; and support for consuming messages in both push (Push) and pull (Pull) modes. The RocketMQ project is open source at: https://github.com/alibaba/RocketMQ. Introducing a distributed message queue middleware such as RocketMQ greatly improves the scalability of the system: the distribution of incremental data across different distributed message queue servers can be adjusted promptly according to business data volume, which allows rapid expansion. The term "middleware" refers to independent system software or a service program through which distributed application software shares resources across different technologies.
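The idea that a subscription end can start consuming from any position while the data is still within its retention period can be sketched as follows. This is a minimal in-memory model under stated assumptions: the class `RetainedLog`, its methods, and the three-day retention value are hypothetical, and the sketch does not reflect how RocketMQ actually stores or expires messages.

```python
DAY = 86400  # seconds

class RetainedLog:
    """Append-only log that keeps entries for a fixed retention period."""

    def __init__(self, retention_seconds):
        self.retention = retention_seconds
        self.entries = []      # list of (offset, timestamp, payload)
        self.next_offset = 0

    def append(self, payload, now):
        self.entries.append((self.next_offset, now, payload))
        self.next_offset += 1

    def expire(self, now):
        """Drop entries older than the retention period."""
        self.entries = [e for e in self.entries if now - e[1] <= self.retention]

    def read_from(self, offset):
        """Return all retained payloads at or after the given offset."""
        earliest = self.entries[0][0] if self.entries else self.next_offset
        if offset < earliest:
            raise LookupError(f"offset {offset} is past the retention period")
        return [payload for (o, _, payload) in self.entries if o >= offset]

log = RetainedLog(retention_seconds=3 * DAY)
log.append("a", now=0)
log.append("b", now=1 * DAY)
log.append("c", now=2 * DAY)
log.expire(now=3 * DAY + DAY // 2)   # "a" is now 3.5 days old and is dropped

replayed = log.read_from(1)          # any retained position is a valid start
```

A consumer that asks for offset 0 after the expiry gets a `LookupError`, which models the limit of the retention promise: positions inside the period are readable, positions outside it are not.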
The subscription end 207 generates subscription tasks. Although Fig. 2 shows one subscription end 207 with three subscription tasks 1, 2, 3, those skilled in the art will understand that the invention is not limited to this: there may be one or more subscription ends, and each subscription end may have one or more subscription tasks. If the subscription end 207 handles the most common database push scenario, only a minimal subscription configuration (such as the subscribed topic and the target DB) is needed to start it, with no further development. If the subscription end 207 requires custom development, it can use a dedicated SDK that wraps, for example, the RocketMQ client, and obtain incremental data once the topic to subscribe to has been specified. Multiple subscription ends can subscribe to the incremental data of the same topic at the same time without affecting one another. Providing general-purpose modules for subscription ends reduces their development and maintenance costs. These modules greatly reduce the development cost of a subscription end in common scenarios and decouple the subscription end from the extraction end: the subscription end does not need to know the database and table sharding of the extraction end, which makes the overall solution easier to extend. With the modular design and a clear boundary between extraction tasks and subscription tasks, and together with an existing automated deployment system, subscription and extraction tasks can be managed and maintained conveniently, which greatly reduces the online maintenance workload.
Fig. 3 schematically illustrates the flow chart of data push method according to an embodiment of the invention.
As shown in Fig. 3, in step 301 the database incremental log parsing server 203 creates an extraction task. With a database instance as the boundary, data is dumped from the source database 201 to the database incremental log parsing server 203, which obtains incremental data by parsing the database incremental log and buffers the obtained incremental data in its memory.
In step 303, the database incremental log parsing server 203 distributes the obtained incremental data as messages, each with a corresponding topic and tag according to the required rule, into the message queues 1, 2, 3 in the distributed message queue server 205. Each piece of incremental data sent as a message contains its own topic and tag so that the subscription end 207 can subscribe and filter; the topic represents a logical database name and the tag represents a logical table name. The required rule is the mapping from the database and table information of the source incremental data to the queue topic and tag. The most common application scenario is a source that has been sharded into multiple databases and tables: for example, the database names product_pop_1, product_pop_2, ..., product_pop_N can be mapped to the queue topic product_pop and, likewise, the table names sku_N can be mapped to the tag sku. The rule can be defined freely according to business needs.
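The mapping rule in this step can be sketched as a small function that turns a physical database name and table name into a topic (the logical database name) and a tag (the logical table name). Stripping a trailing numeric suffix is only one possible rule, chosen here because it matches the product_pop_N and sku_N example; the names `logical_name` and `route` are hypothetical.

```python
import re

# One possible rule: a shard is named <logical name>_<number>.
SHARD_SUFFIX = re.compile(r"_\d+$")

def logical_name(physical):
    """Strip a trailing _<number> shard suffix, if present."""
    return SHARD_SUFFIX.sub("", physical)

def route(database, table):
    """Map a physical (database, table) pair to a (topic, tag) pair."""
    return logical_name(database), logical_name(table)

first = route("product_pop_1", "sku_7")
second = route("product_pop_12", "sku_3")
unsharded = route("orders", "items")
```

Both sharded inputs map to the same topic and tag, so a subscription end that subscribes to product_pop and filters on sku never needs to know how many shards the source has. A real rule set would likely be configurable per business, since a fixed suffix rule would also rewrite names that merely happen to end in a number.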
In step 305, the distributed message queue server 205 pushes the message queues 1, 2, 3 to the subscription end 207 based on the topics subscribed to by the subscription tasks 1, 2, 3 of the subscription end 207.
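How several subscription tasks can consume the same extracted data independently can be sketched with a small in-memory broker. This is an illustration under stated assumptions: the class `Broker` and its methods are hypothetical and are not the RocketMQ client API, and for simplicity each task polls for new messages, whereas the system described here pushes them. The point shown is that messages are stored once per topic (the logical database name) and each subscription task keeps its own position and its own filter on tags (the logical table names).

```python
from collections import defaultdict

class Broker:
    """Stores each message once per topic; every subscriber has its own offset."""

    def __init__(self):
        self.topics = defaultdict(list)   # topic -> list of (tag, body)
        self.offsets = defaultdict(int)   # (subscriber, topic) -> next index

    def publish(self, topic, tag, body):
        self.topics[topic].append((tag, body))

    def poll(self, subscriber, topic, tags=None):
        """Return unseen message bodies for this subscriber, filtered by tag."""
        log = self.topics[topic]
        start = self.offsets[(subscriber, topic)]
        self.offsets[(subscriber, topic)] = len(log)
        return [body for tag, body in log[start:] if tags is None or tag in tags]

broker = Broker()
broker.publish("product_pop", "sku", "s1")
broker.publish("product_pop", "price", "p1")
broker.publish("product_pop", "sku", "s2")

task1 = broker.poll("task1", "product_pop", tags={"sku"})  # only sku changes
task2 = broker.poll("task2", "product_pop")                # everything
task1_again = broker.poll("task1", "product_pop", tags={"sku"})
```

Because the offsets are tracked per subscriber, the consumption of task1 has no effect on what task2 receives, and the data was extracted and published only once.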
In the past, an extraction task was defined not per database instance but per group of business-related logical databases and tables. One database instance holds many logical databases and tables, so extraction tasks used to open as many data extraction connections to one database instance as there were business table groups. This is why there are three connections (dumps 1 to 3) between the source database 101 and the database incremental log parsing server 103 in Fig. 1 but only one connection between the source database 201 and the database incremental log parsing server 203 in Fig. 2. In the data push system and method according to the above embodiment of the present invention, incremental data is pushed through a persistent message queue and extraction tasks are created with the database instance as the boundary. Compared with the earlier form of extraction task, this directly reduces the pressure on the source database: the three dump connections shown schematically in Fig. 1 (dozens or even hundreds in real deployments) are reduced to the single connection shown in Fig. 2. Extraction and subscription also no longer need to occur in pairs, which greatly reduces network traffic, allows the incremental data to be consumed by more subscription ends, and greatly improves the carrying capacity and utilization of the server.
The embodiments described above are merely preferred embodiments of the present invention and are not intended to limit it. It will be apparent to those skilled in the art that various modifications and changes can be made to the embodiments of the present invention without departing from its spirit and scope. The present invention is therefore intended to cover all such modifications and variations that fall within the scope of the invention as defined by the appended claims.