CN109388677A

CN109388677A - Inter-cluster data synchronization method, apparatus, device, and storage medium

Info

Publication number: CN109388677A
Application number: CN201810978213.XA
Authority: CN
Inventors: 陈文彪; 林国峰; 曾宪成
Original assignee: SF Technology Co Ltd
Current assignee: SF Technology Co Ltd
Priority date: 2018-08-23
Filing date: 2018-08-23
Publication date: 2019-02-26
Anticipated expiration: 2038-08-23
Also published as: CN109388677B

Abstract

This application discloses an inter-cluster data synchronization method, apparatus, device, and storage medium. The method comprises: reading a target message offset of a first partition of a first topic of a target cluster; comparing the target message offset with an earliest message offset of a second partition of a second topic of a source cluster, the second topic having the same topic name as the first topic and the first partition having the same partition number as the second partition; if the target message offset is less than the earliest message offset, filling data into the first partition; and synchronizing the primary replica of the second partition to the first partition. According to the technical solution of the embodiments of this application, data filling overcomes the prior-art problem that topic partition data is distributed unevenly between the source cluster and the target cluster.

Description

Inter-cluster data synchronization method, apparatus, device, and storage medium

Technical field

This application relates generally to the field of big data processing, in particular to the field of Kafka, and more particularly to an inter-cluster data synchronization method, apparatus, device, and storage medium.

Background

With the development of big data, technologies such as MPP databases, data mining, distributed file systems, distributed databases, and cloud computing platforms are being continuously updated.

Kafka is a high-throughput distributed publish-subscribe messaging system that supports partitioning messages across Kafka servers and consumer clusters. With existing synchronization tools such as MirrorMaker, the messages in the topic partitions of the target cluster can end up distributed differently from the messages in the topic partitions of the source cluster.

Summary of the invention

In view of the above defects or deficiencies in the prior art, it is desirable to provide a technical solution for synchronizing data between clusters.

In a first aspect, an embodiment of this application provides an inter-cluster data synchronization method, the method comprising:

reading a target message offset of a first partition of a first topic of a target cluster;

comparing the target message offset with an earliest message offset of a second partition of a second topic of a source cluster, the second topic having the same topic name as the first topic, and the first partition having the same partition number as the second partition;

if the target message offset is less than the earliest message offset, filling data into the first partition; and

synchronizing the primary replica of the second partition to the first partition.

In a second aspect, an embodiment of this application provides an inter-cluster data synchronization apparatus, the apparatus comprising:

a target offset reading unit, configured to read a target message offset of a first partition of a first topic of a target cluster;

an offset comparison unit, configured to compare the target message offset with an earliest message offset of a second partition of a second topic of a source cluster, the second topic having the same topic name as the first topic, and the first partition having the same partition number as the second partition;

a data filling unit, configured to fill data into the first partition if the target message offset is less than the earliest message offset;

a synchronization unit, configured to synchronize the primary replica of the second partition to the first partition.

In a third aspect, an embodiment of this application provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method described in the embodiments of this application when executing the program.

In a fourth aspect, an embodiment of this application provides a computer-readable storage medium on which a computer program is stored, the computer program being used for:

implementing the method described in the embodiments of this application when the computer program is executed by a processor.

The data synchronization solution provided by the embodiments of this application addresses data synchronization between Kafka clusters. Through data filling, it overcomes the prior-art problem that the offsets of the same partition of the same topic cannot be made to correspond between clusters.

Further, by obtaining the status of the primary replica being written into the partition, the progress of data synchronization can be monitored, which makes it easier for users to check the data and improves the user experience. Checking whether the partition offset has already been stored improves data processing efficiency. Using the data of the primary replica also ensures data safety and consistency.

Brief description of the drawings

Other features, objects, and advantages of this application will become more apparent from the following detailed description of non-limiting embodiments, read with reference to the accompanying drawings:

Fig. 1 shows a schematic flowchart of an inter-cluster data synchronization method provided by an embodiment of this application;

Fig. 2 shows a schematic flowchart of an inter-cluster data synchronization method provided by another embodiment of this application;

Fig. 3 shows a schematic block diagram of an inter-cluster data synchronization apparatus provided by an embodiment of this application;

Fig. 4 shows a schematic block diagram of an inter-cluster data synchronization apparatus provided by another embodiment of this application;

Fig. 5 shows a schematic structural diagram of a computer system suitable for implementing a terminal device of the embodiments of this application.

Detailed description of the embodiments

This application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here only explain the relevant invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts relevant to the invention.

It should be noted that, provided there is no conflict, the embodiments of this application and the features in the embodiments can be combined with one another. This application is described in detail below with reference to the drawings and in conjunction with the embodiments.

Referring to Fig. 1, Fig. 1 shows a schematic flowchart of the inter-cluster data synchronization method provided by an embodiment of this application.

As shown in Fig. 1, the method comprises:

Step 110, reading the target message offset of the first partition of the first topic of the target cluster.

Data is divided into different topics by type, and the data of one topic is in turn stored in different partitions. Each partition consists of an ordered, immutable sequence of messages that are continually appended to the partition. Each message in a partition carries a sequential number, namely the offset, which uniquely identifies the position of the message within the partition. For example, messages in a Kafka partition are stored in increasing order, and the offset of each message records how many messages the partition has stored so far. Besides recording the number of stored messages, a Kafka partition can also be configured with a retention time: when a message exceeds the retention time it is removed, but the offset corresponding to that message is retained. Regularly removing or deleting consumed messages reduces disk usage, and quickly deleting discarded records improves disk utilization.
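The relationship between appending, retention, and offsets can be illustrated with a toy in-memory model. This is only a sketch: the class `PartitionLog` and its methods are hypothetical names introduced here, not Kafka APIs, and retention is simulated by dropping a fixed number of the oldest messages rather than by time.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionLog:
    """Toy model of one partition (hypothetical, not a Kafka API)."""
    start_offset: int = 0                         # earliest offset still retained
    messages: list = field(default_factory=list)  # retained messages, oldest first

    def append(self, value):
        """Append a message and return the offset assigned to it."""
        offset = self.start_offset + len(self.messages)
        self.messages.append(value)
        return offset

    def expire(self, count):
        """Drop the `count` oldest messages; remaining offsets do not change."""
        count = min(count, len(self.messages))
        del self.messages[:count]
        self.start_offset += count

    @property
    def end_offset(self):
        """Offset that the next appended message will receive."""
        return self.start_offset + len(self.messages)


log = PartitionLog()
offsets = [log.append(f"m{i}") for i in range(5)]  # assigns offsets 0..4
log.expire(2)                                      # removes m0 and m1
```

After the two oldest messages expire, the earliest retained offset is 2 while newly appended messages continue from offset 5, which is why a partition's offsets need not start at the initial value.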

In the embodiments of this application, while performing data synchronization, the processing apparatus reads the target message offset of the first partition of the first topic of the target cluster. The target cluster is the target object of the synchronization. The first topic denotes, for example, any topic in the target cluster, and the first partition denotes, for example, any partition under the first topic. Note that words such as "first" and "second" in the embodiments of this application only distinguish instances of the same technical term and should not be construed as limiting order or execution timing.

In the embodiments of this application, the target message offset is, for example, the offset of the current message of the target cluster.

Step 120, comparing the target message offset with the earliest message offset of the second partition of the second topic of the source cluster, wherein the second topic has the same topic name as the first topic, and the first partition has the same partition number as the second partition.

In the embodiments of this application, after the earliest message offset of the second partition of the second topic of the source cluster is determined, the target message offset is compared with the earliest message offset. The source cluster is the source object of the synchronization; that is, data is synchronized from the source object to the target object. The second topic has the same topic name as the first topic of the target cluster, meaning that the two denote the same topic. Likewise, the second partition has the same partition number as the first partition. For example, the messages of partition 0 of topic A in the source Kafka cluster are synchronized to partition 0 of topic A in the target Kafka cluster.

Step 130, if the target message offset is less than the earliest message offset, filling data into the first partition.

In the embodiments of this application, a consumer consumes the messages of partition 0 of topic A from the source cluster one by one and then produces those messages one by one to partition 0 of topic A in the target cluster. However, messages stored in a topic partition are usually subject to a retention time limit, so when messages are consumed from the source cluster, the offsets of a given partition of a given topic may no longer start from the initial value but from some later offset left after retention processing. For example, suppose the message offsets of partition 2 of topic A in the source cluster form the ordered range [2, 8], where 2 is the start offset, also called the earliest offset. The start offset is compared with the target offset of the target cluster, which denotes the current offset of the target cluster. If the start offset is greater than the target offset, the target partition of the target topic in the target cluster is padded by a filling process. For example, data can be filled manually, or by calling a filling function, so that the offsets of the same topic partition in the source cluster and the target cluster correspond one to one.
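The comparison in steps 120 and 130 reduces to simple arithmetic. The sketch below assumes that the target message offset is the offset the next message written to the target partition would receive; the text does not pin this down, and the function name `fill_count` is hypothetical.

```python
def fill_count(target_offset: int, earliest_source_offset: int) -> int:
    """Return how many filler messages the target partition needs.

    Assumption: `target_offset` is the offset the next message written to
    the target partition would receive.
    """
    # Filling is only needed when the target lags behind the earliest
    # offset still available in the source partition.
    return max(0, earliest_source_offset - target_offset)


# Source partition holds offsets [2, 8]; the target partition is empty.
needed = fill_count(target_offset=0, earliest_source_offset=2)
```

With the range [2, 8] from the example and an empty target partition, two filler messages occupy target offsets 0 and 1, so the first synchronized message lands at offset 2 in both clusters.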

In the embodiments of this application, data filling may be invoked, for example, as follows:

When the target message offset is less than the earliest message offset, the program synchronously calls the message filling module. The module first reads a predefined String message from the configuration file and converts it into byte-type data; the message data is compressed to increase the filling rate. The module then calls Kafka's produce API to write the data into the corresponding topic partition of the target cluster. After the data is written to the corresponding partition of the target topic of the target cluster, the program asynchronously returns whether the data filling succeeded. On success, writing continues; on failure, data filling stops and the program exits.
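The filling procedure described above might be sketched as follows. Several things here are assumptions made for illustration: `InMemoryProducer` is a hypothetical stand-in for a real Kafka producer client, `zlib` is chosen arbitrarily as the compression method, and the success result is returned synchronously rather than asynchronously.

```python
import zlib


class InMemoryProducer:
    """Hypothetical stand-in for a Kafka producer; it only appends to a list."""

    def __init__(self):
        self.partitions = {}

    def send(self, topic, partition, payload: bytes) -> bool:
        self.partitions.setdefault((topic, partition), []).append(payload)
        return True  # a real client would report success or failure


def fill_partition(producer, topic, partition, filler_text, count):
    """Write `count` filler messages; stop at the first failure.

    Returns the number of filler messages actually written.
    """
    # Convert the predefined string to bytes, then compress it.
    payload = zlib.compress(filler_text.encode("utf-8"))
    written = 0
    for _ in range(count):
        if not producer.send(topic, partition, payload):
            break  # filling stops on failure
        written += 1
    return written


producer = InMemoryProducer()
written = fill_partition(producer, "A", 2, "FILL", count=2)
```

In a real deployment the filler text would come from the configuration file and `send` would be backed by an actual producer client.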

Step 140, synchronizing the primary replica of the second partition to the first partition.

In the embodiments of this application, after the one-to-one offset mapping is established, the messages of the same partition of the same topic are synchronized to the target partition from the node on which the primary replica is stored in the source cluster.

By establishing a one-to-one correspondence between offsets, the embodiments of this application allow the topic partitions of the source cluster and the target cluster to correspond to each other, which improves the accuracy of data synchronization.

Further, the embodiments of this application also propose a technical solution for checking the progress of data synchronization while data is being synchronized. Referring to Fig. 2, Fig. 2 shows a schematic flowchart of the inter-cluster data synchronization method provided by an embodiment of this application.

As shown in Fig. 2, the method comprises:

Step 210, reading the target message offset of the first partition of the first topic of the target cluster.

Step 220, comparing the target message offset with the earliest message offset of the second partition of the second topic of the source cluster, the second topic having the same topic name as the first topic, and the first partition having the same partition number as the second partition.

Step 230, if the target message offset is less than the earliest message offset, filling data into the first partition.

Step 240, determining the node on which the primary replica of the second partition is stored.

Step 250, obtaining the partition offset of the primary replica of the second partition, wherein the start position of the partition offset is the earliest message offset.

Step 260, obtaining the primary replica corresponding to the partition offset from the node.

Step 270, writing the primary replica to the first partition.

Step 280, obtaining the status of the primary replica being written to the first partition.

Steps 210-230 are the same as steps 110-130; for their implementation, refer to the description of steps 110-130.

In the embodiments of this application, after completing the data filling of the first partition, the processing apparatus determines the node on which the primary replica of the second partition of the second topic of the source cluster is stored, and determines the partition offset of the second partition. The partition offset can be understood as the total number of messages in the second partition, that is, the span from the start position to the end position. Normally the start position is 0, so the sequence number of the message offset at the end position directly indicates the total number of messages. However, because of the retention time limit on stored messages, the start position may not be 0. The partition offset therefore takes the earliest message offset as its start position. The end position can be the position of the last offset.

After the node storing the primary replica and the partition offset are determined, the primary replica corresponding to the partition offset is obtained from that node in consumer mode and then pushed to the target partition in producer mode, which completes the data synchronization.

In Kafka, each partition of a topic may have N replicas, and Kafka uses this multi-replica mechanism for automatic failover. To keep data safe, the replicas are stored on different nodes of the cluster. A data operation on a topic partition must be applied on all nodes holding that partition in order to keep the data consistent.

After a consumer reads the messages of a partition from a Kafka server, it stores the offset of the messages in that partition. The next time it reads from the partition, whether a commit was performed for the messages determines the offset from which reading starts. If a commit occurred, reading starts from the sequence number following the position of that offset. If no commit occurred, reading starts from the offset position following the previous committed consumption.
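The resume rule in this paragraph can be sketched with a small tracker. The class `OffsetTracker` is hypothetical, and the sketch follows the wording of this description (the stored value is the offset of the last committed message) rather than the convention of any particular Kafka client.

```python
class OffsetTracker:
    """Hypothetical tracker for one consumer on one partition."""

    def __init__(self, earliest_offset):
        self.earliest_offset = earliest_offset
        self.committed = None  # offset of the last committed message, if any

    def commit(self, offset):
        """Record that the message at `offset` has been consumed."""
        self.committed = offset

    def resume_position(self):
        """Offset from which the next read starts."""
        if self.committed is None:
            return self.earliest_offset   # nothing committed yet
        return self.committed + 1         # next sequence number after the commit


tracker = OffsetTracker(earliest_offset=2)
start_without_commit = tracker.resume_position()
tracker.commit(3)
start_after_commit = tracker.resume_position()
```

If the consumer reads further messages without committing them, `resume_position` keeps returning the offset after the last commit, so those messages are read again.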

After reading messages, the consumer stores the messages and offsets accordingly. When a producer publishes a message to a partition, it must first find the node holding the primary replica of the partition; the producer publishes the message only to that node, and the nodes holding the other replicas stay consistent by following the node holding the primary replica.

After the node storing the primary replica is determined, the content of the primary replica is obtained according to the partition offset of the primary replica and then written to the target partition (that is, the target partition of the target cluster to be synchronized).
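Steps 240 to 270 can be illustrated with plain Python data structures. This is a sketch under stated assumptions: the source replicas are modelled as a dict from node name to an offset-to-message mapping, the target partition as a list whose index is the offset, and `sync_partition` is a hypothetical name.

```python
def sync_partition(source_replicas, primary_node, earliest_offset, target_log):
    """Copy the primary replica's messages to the target partition.

    source_replicas: dict mapping node name -> {offset: message}
    target_log: list whose index is the offset in the target partition
    Returns the next source offset that has not been copied yet.
    """
    primary = source_replicas[primary_node]      # step 240: node of the primary replica
    offset = earliest_offset                     # step 250: start at the earliest offset
    while offset in primary:                     # step 260: read from that node
        if len(target_log) != offset:
            raise ValueError("target offset does not line up with source offset")
        target_log.append(primary[offset])       # step 270: write to the target
        offset += 1
    return offset


# Source partition holds offsets 2..8; the target was pre-filled with 2 fillers.
source = {"node-1": {o: f"m{o}" for o in range(2, 9)}}
target = ["FILL", "FILL"]
next_offset = sync_partition(source, "node-1", 2, target)
```

The length check makes the purpose of the filling step visible: without the two filler messages, the first synchronized message would land at target offset 0 instead of 2.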

On the basis of the above embodiment, the status of the primary replica being written to the first partition can also be obtained. To improve the user experience, a data push callback function is called to return to the producer the result of writing the data of the topic partition to the target partition of the target topic, which makes it easy to check the progress of data synchronization.
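A callback of this kind might look like the following sketch. The function `push_with_callback` and the callables `write` and `on_result` are hypothetical names chosen for illustration; a real producer client would supply its own callback mechanism.

```python
def push_with_callback(messages, write, on_result):
    """Write each (offset, message) pair and report the outcome via a callback.

    `write` and `on_result` are caller-supplied callables (hypothetical).
    Returns the number of messages written successfully.
    """
    written = 0
    for offset, message in messages:
        try:
            write(offset, message)
        except Exception as exc:              # report the failure and stop
            on_result(offset, False, str(exc))
            break
        on_result(offset, True, None)
        written += 1
    return written


target = {}
progress = []


def write(offset, message):
    if message is None:
        raise ValueError("empty message")
    target[offset] = message


count = push_with_callback(
    [(2, "m2"), (3, "m3"), (4, None), (5, "m5")],
    write,
    lambda offset, ok, error: progress.append((offset, ok, error)),
)
```

The `progress` list then holds one status entry per attempted write, which is the information a user would inspect to follow the synchronization.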

On the basis of the above embodiments of this application, to further improve data processing efficiency, after the partition offset of the primary replica of the second partition is obtained, it is determined whether the partition offset has been stored. For example, the method may further include:

Step 250a, determining whether the partition offset is stored in the partition offset storage unit;

Step 250b, if it is, reading the partition offset stored in the partition offset storage unit;

Step 250c, if it is not, calling the partition offset obtaining unit to obtain the partition offset of the primary replica of the second partition, and storing the partition offset in the partition offset storage unit.

Checking whether the partition offset is already stored in the corresponding storage apparatus further saves data processing time and improves data processing efficiency. The storage apparatus is, for example, ZooKeeper or a partition offset storage unit.
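Steps 250a to 250c amount to a look-up with a fallback. In the sketch below a plain dict stands in for the storage apparatus (ZooKeeper in the example above), and `get_partition_offset` and `fetch_from_source` are hypothetical names.

```python
def get_partition_offset(store, topic, partition, fetch):
    """Return the stored partition offset, fetching and storing it if absent."""
    key = (topic, partition)
    if key in store:                    # steps 250a and 250b: already stored
        return store[key]
    offset = fetch(topic, partition)    # step 250c: obtain from the source
    store[key] = offset                 # and keep it for later look-ups
    return offset


calls = []


def fetch_from_source(topic, partition):
    """Hypothetical expensive lookup; records each call for illustration."""
    calls.append((topic, partition))
    return 2


store = {}
first = get_partition_offset(store, "A", 2, fetch_from_source)
second = get_partition_offset(store, "A", 2, fetch_from_source)
```

Only the first call reaches the source; the second is served from the store, which is where the time saving comes from.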

It should be noted that although the operations of the method of the present invention are described in the drawings in a particular order, this does not require or imply that the operations must be performed in that order, or that all of the operations shown must be performed to achieve the desired result. On the contrary, the steps depicted in the flowcharts may be executed in a different order. Additionally or alternatively, certain steps may be omitted, multiple steps may be merged into one step, and/or one step may be split into multiple steps.

Referring further to Fig. 3, it shows a schematic block diagram of an inter-cluster data synchronization apparatus 300 according to an embodiment of this application.

As shown in Fig. 3, the apparatus 300 includes:

a target offset reading unit 310, configured to read the target message offset of the first partition of the first topic of the target cluster.

an offset comparison unit 320, configured to compare the target message offset with the earliest message offset of the second partition of the second topic of the source cluster, wherein the second topic has the same topic name as the first topic, and the first partition has the same partition number as the second partition.

a data filling unit 330, configured to fill data into the first partition if the target message offset is less than the earliest message offset.

a synchronization unit 340, configured to synchronize the primary replica of the second partition to the first partition.

Further, the embodiments of this application also propose a technical solution for checking the progress of data synchronization while data is being synchronized. Referring to Fig. 4, Fig. 4 shows a schematic block diagram of an inter-cluster data synchronization apparatus 400 provided by an embodiment of this application.

As shown in Fig. 4, the apparatus 400 includes:

a target offset reading unit 410, configured to read the target message offset of the first partition of the first topic of the target cluster.

an offset comparison unit 420, configured to compare the target message offset with the earliest message offset of the second partition of the second topic of the source cluster, the second topic having the same topic name as the first topic, and the first partition having the same partition number as the second partition.

a data filling unit 430, configured to fill data into the first partition if the target message offset is less than the earliest message offset.

a determining subunit 440, configured to determine the node on which the primary replica of the second partition is stored.

a partition offset obtaining subunit 450, configured to obtain the partition offset of the primary replica of the second partition, wherein the start position of the partition offset is the earliest message offset;

a primary replica obtaining subunit 460, configured to obtain the primary replica corresponding to the partition offset from the node;

a writing subunit 470, configured to write the primary replica to the first partition.

a data push callback unit 480, configured to obtain the status of the primary replica being written to the first partition.

The target offset reading unit 410 through the data filling unit 430 are the same as the target offset reading unit 310 through the data filling unit 330; for their implementation, refer to the description of units 310-330.

On the basis of the above embodiments of this application, to further improve data processing efficiency, after the partition offset of the primary replica of the second partition is obtained, it is determined whether the partition offset has been stored. For example, the apparatus may further include:

a judging subunit 450a, configured to determine whether the partition offset is stored in the partition offset storage unit;

a reading subunit 450b, configured to read the partition offset stored in the partition offset storage unit if it is;

a calling subunit 450c, configured to, if it is not, call the partition offset obtaining unit to obtain the partition offset of the primary replica of the second partition, and store the partition offset in the partition offset storage unit.

In the embodiments of this application, the determining subunit 440, the partition offset obtaining subunit 450, the primary replica obtaining subunit 460, and the writing subunit 470 may, for example, be implemented by the synchronization unit. The judging subunit 450a, the reading subunit 450b, and the calling subunit 450c are optional components whose functions are likewise implemented as a whole by the synchronization unit.

It should be understood that the units or modules recorded in the apparatuses 300-400 correspond to the steps of the methods described with reference to Figs. 1-2. The operations and features described above for the methods therefore apply equally to the apparatuses 300-400 and the units they contain, and are not repeated here. The apparatuses 300-400 may be implemented in advance in a browser or other security application of an electronic device, or may be loaded into the browser or security application of the electronic device by downloading or similar means. The corresponding units in the apparatuses 300-400 can cooperate with units in the electronic device to implement the solutions of the embodiments of this application.

Referring now to Fig. 5, it shows a schematic structural diagram of a computer system 500 suitable for implementing a terminal device or server of the embodiments of this application.

As shown in Fig. 5, the computer system 500 includes a central processing unit (CPU) 501, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 502 or a program loaded from a storage section 508 into a random access memory (RAM) 503. The RAM 503 also stores the various programs and data required for the operation of the system 500. The CPU 501, the ROM 502, and the RAM 503 are connected to one another through a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.

The following components are connected to the I/O interface 505: an input section 506 including a keyboard, a mouse, and the like; an output section 507 including a cathode ray tube (CRT) or liquid crystal display (LCD), a speaker, and the like; a storage section 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a LAN card or a modem. The communication section 509 performs communication processing via a network such as the Internet. A drive 510 is also connected to the I/O interface 505 as needed. A removable medium 511, such as a magnetic disk, an optical disc, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 510 as needed, so that a computer program read from it can be installed into the storage section 508 as needed.

In particular, according to embodiments of the present disclosure, the processes described above with reference to Figs. 1-2 may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program containing program code for executing the methods of Figs. 1-2. In such embodiments, the computer program may be downloaded and installed from a network through the communication section 509, and/or installed from the removable medium 511.

The flowcharts and block diagrams in the drawings illustrate the architecture, functions, and operations of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that shown in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of such blocks, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

The units or modules described in the embodiments of this application may be implemented in software or in hardware. The described units or modules may also be provided in a processor; for example, a processor may be described as including a target offset reading unit, an offset comparison unit, a data filling unit, and a synchronization unit. The names of these units or modules do not, in certain cases, limit the units or modules themselves; for example, the synchronization unit may also be described as "a unit for synchronizing the primary replica of the second partition to the first partition".

In another aspect, this application also provides a computer-readable storage medium. The computer-readable storage medium may be the one included in the apparatus of the above embodiments, or it may exist separately without being assembled into a device. The computer-readable storage medium stores one or more programs, which are used by one or more processors to execute the inter-cluster data synchronization method described in this application.

The above description covers only the preferred embodiments of this application and an explanation of the technical principles applied. Those skilled in the art should understand that the scope of the invention involved in this application is not limited to technical solutions formed by the specific combination of the above technical features, and also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept. Examples include technical solutions formed by replacing the above features with (but not limited to) technical features with similar functions disclosed in this application.

Claims

1. An inter-cluster data synchronization method, characterized in that the method comprises:

reading a target message offset of a first partition of a first topic of a target cluster;

comparing the target message offset with an earliest message offset of a second partition of a second topic of a source cluster, the second topic having the same topic name as the first topic, and the first partition having the same partition number as the second partition;

if the target message offset is less than the earliest message offset, filling data into the first partition; and

synchronizing the primary replica of the second partition to the first partition.

2. The method according to claim 1, characterized in that synchronizing the primary replica of the second partition to the first partition comprises:

determining the node on which the primary replica of the second partition is stored;

obtaining the partition offset of the primary replica of the second partition, the start position of the partition offset being the earliest message offset;

obtaining the primary replica corresponding to the partition offset from the node;

writing the primary replica to the first partition.

3. The method according to claim 2, characterized in that, after obtaining the partition offset of the primary replica of the second partition, the method further comprises:

determining whether the partition offset is stored in a partition offset storage unit;

if it is, reading the partition offset stored in the partition offset storage unit;

if it is not, calling a partition offset obtaining unit to obtain the partition offset of the primary replica of the second partition through the partition offset obtaining unit, and storing the partition offset in the partition offset storage unit.

4. The method according to claim 2, characterized in that, after the primary replica is written to the first partition, the method further comprises:

obtaining the status of the primary replica being written to the first partition.

5. The method according to any one of claims 2-4, characterized in that writing the primary replica to the first partition comprises:

pushing the primary replica to the first partition.

6. An inter-cluster data synchronization apparatus, characterized in that the apparatus comprises:

a target offset reading unit, configured to read a target message offset of a first partition of a first topic of a target cluster;

an offset comparison unit, configured to compare the target message offset with an earliest message offset of a second partition of a second topic of a source cluster, the second topic having the same topic name as the first topic, and the first partition having the same partition number as the second partition;

a data filling unit, configured to fill data into the first partition if the target message offset is less than the earliest message offset;

a synchronization unit, configured to synchronize the primary replica of the second partition to the first partition.

7. The apparatus according to claim 6, characterized in that the synchronization unit comprises:

a determining subunit, configured to determine the node on which the primary replica of the second partition is stored;

a partition offset obtaining subunit, configured to obtain the partition offset of the primary replica of the second partition, the start position of the partition offset being the earliest message offset;

a primary replica obtaining subunit, configured to obtain the primary replica corresponding to the partition offset from the node;

a writing subunit, configured to write the primary replica to the first partition.

8. The apparatus according to claim 7, characterized in that, after the partition offset obtaining subunit, the synchronization unit further comprises:

a judging subunit, configured to determine whether the partition offset is stored in a partition offset storage unit;

a reading subunit, configured to read the partition offset stored in the partition offset storage unit if it is;

a calling subunit, configured to, if it is not, call a partition offset obtaining unit to obtain the partition offset of the primary replica of the second partition through the partition offset obtaining unit.

9. The apparatus according to claim 7, characterized in that, after the writing subunit, the apparatus further comprises:

a data push callback unit, configured to obtain the status of the primary replica being written to the first partition.

10. The apparatus according to any one of claims 7-9, characterized in that the writing subunit comprises:

a data push subunit, configured to push the primary replica to the first partition.

11. A computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method according to any one of claims 1-5 when executing the program.

12. A computer-readable storage medium on which a computer program is stored, the computer program being used for:

implementing the method according to any one of claims 1-5 when the computer program is executed by a processor.