CN115658347A - Data consumption method, device, electronic equipment, storage medium and program product

Publication number
CN115658347A
Authority
CN
China
Prior art keywords
data
kafka
consumed
partition
offset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211379891.7A
Other languages
Chinese (zh)
Inventor
关鑫
关振宇
朱家强
郑为锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lakala Payment Co ltd
Original Assignee
Lakala Payment Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lakala Payment Co ltd
Priority to CN202211379891.7A
Publication of CN115658347A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the present disclosure disclose a data consumption method, a data consumption apparatus, an electronic device, a storage medium and a program product. The method comprises: requesting the Kafka server to allocate a Kafka partition; after receiving information on the allocated Kafka partition returned by the Kafka server, consuming data from the allocated Kafka partition while listening for a reallocation event of that partition; after a reallocation event of the Kafka partition is detected, acquiring target data that has been consumed from the Kafka partition but not yet committed; and clearing the target data, and consuming the target data from the Kafka partition again after the reallocation of the Kafka partition is completed. In this way, repeated consumption of data after a reallocation event of the Kafka partition can be avoided.

Description

Data consumption method, device, electronic equipment, storage medium and program product
Technical Field
The disclosed embodiments relate to the field of computer technologies, and in particular, to a data consumption method, an apparatus, an electronic device, a storage medium, and a program product.
Background
Flume is a distributed, highly available system for collecting, aggregating and transmitting massive volumes of logs. In the big data era, Flume is widely used as a data collection tool in many scenarios. Flume mainly consists of four parts, namely Source (the component that acquires data in Flume), interceptor (the component that processes data in Flume), Channel (the storage component in Flume) and Sink (the data delivery component in Flume), which together complete the collection, aggregation and transmission of logs.
Kafka is essentially a message queue into which a data producer can place the data it produces. Flume, as a consumer, may consume data from the Kafka message queue and process the consumed data accordingly. However, while Flume consumes data from the Kafka message queue, the same data is easily consumed more than once. A solution is therefore needed to the problem of repeated data consumption on the Flume side.
Disclosure of Invention
The disclosed embodiments provide a data consumption method, an apparatus, an electronic device, a storage medium, and a program product.
In a first aspect, an embodiment of the present disclosure provides a data consumption method, including:
requesting the Kafka server to allocate a Kafka partition;
after receiving information on the allocated Kafka partition returned by the Kafka server, consuming data from the allocated Kafka partition, and listening for a reallocation event of the allocated Kafka partition;
after the reallocation event of the Kafka partition is detected, acquiring target data that has been consumed from the Kafka partition but not committed;
and clearing the target data, and consuming the target data from the Kafka partition again after the Kafka partition reallocation is completed.
Further, obtaining target data consumed from the Kafka partition but not committed, comprises:
and acquiring the data consumed from the Kafka partition from a cache as target data.
Further, consuming data from the assigned Kafka partition, comprising:
reading consumption data from a message queue of the Kafka partition;
caching the read consumption data, and updating an offset of the consumed data based on the data amount of the read consumption data.
Further, after caching the read consumption data and updating the offset of the consumed data based on the data amount of the read consumption data, the method further includes:
submitting an offset of the updated consumed data to the Kafka partition;
and after the offset of the updated consumed data is successfully submitted, writing the cached consumed data into a target storage device.
Further, after caching a predetermined amount of the consumption data and updating an offset of the consumed data based on the predetermined amount, the method further comprises:
submitting an offset of the updated consumed data to the Kafka partition;
and if submission of the updated offset of the consumed data fails, restoring the offset of the consumed data to its value before the update, and deleting the cached consumption data.
Further, the method further comprises:
after the target data is cleared, determining whether the offset of the consumed data is updated based on the data amount of the target data;
and when the offset of the consumed data has been updated based on the data amount of the target data, restoring the offset of the consumed data to its value before the update.
In a second aspect, an embodiment of the present disclosure provides a data consuming method, including:
requesting the Kafka server to allocate a Kafka partition;
after receiving the information of the distributed Kafka partitions returned by the Kafka server, reading a predetermined amount of consumption data from the message queue of the distributed Kafka partitions;
caching a predetermined amount of the consumption data and updating an offset of the consumed data based on the predetermined amount;
submitting an offset of the updated consumed data to a coordinator of the Kafka partition;
after the updated offset of the consumed data is successfully submitted, writing the cached consumed data into a target storage device;
and if submission of the updated offset of the consumed data fails, restoring the offset of the consumed data to its value before the update, and deleting the cached consumption data.
Further, the method further comprises:
and monitoring the reallocation event of the allocated Kafka partition.
Further, the method further comprises:
after listening to the reallocation event of the Kafka partition, acquiring target data consumed but not submitted from the Kafka partition;
and clearing the target data, and consuming the target data from the Kafka partition again after the Kafka partition is reallocated.
Further, the method further comprises:
after the target data is cleared, determining whether the offset of the consumed data is updated based on the data amount of the target data;
and when the offset of the consumed data has been updated based on the data amount of the target data, restoring the offset of the consumed data to its value before the update.
In a third aspect, an embodiment of the present disclosure provides a data consuming apparatus, including:
a first request module configured to request the Kafka server to allocate Kafka partitions;
a first receiving module configured to consume data from the allocated Kafka partition after receiving the information on the allocated Kafka partition returned by the Kafka server, while listening for a reallocation event of the allocated Kafka partition;
a first acquisition module configured to acquire target data consumed but not submitted from the Kafka partition after listening for a reallocation event of the Kafka partition;
a first purge module configured to purge the target data and to re-consume the target data from the Kafka partition after the Kafka partition reallocation is complete.
Further, the first obtaining module includes:
a first acquisition submodule configured to acquire, from a cache, the data consumed from the Kafka partition as the target data.
Further, the first receiving module includes:
a read submodule configured to read consumption data from a message queue of the Kafka partition;
a first cache submodule configured to cache the read consumption data and update an offset of the consumed data based on the data amount of the read consumption data.
Further, after the first cache submodule, the apparatus further includes:
a first commit module configured to commit the offset of the updated consumed data to the Kafka partition;
a first write module configured to write the cached consumed data to a target storage device after the offset of the updated consumed data is successfully committed.
Further, after the first cache submodule, the apparatus further includes:
a second commit module configured to commit the offset of the updated consumed data to the Kafka partition;
a first deletion module configured to restore the offset of consumed data to a value before update and delete the cached consumed data after the offset of the updated consumed data fails to commit.
Further, the apparatus further comprises:
a first determining module configured to determine whether an offset of consumed data is updated based on a data amount of the target data after the target data is cleared;
a first recovery module configured to restore the offset of the consumed data to its value before the update if the offset of the consumed data has been updated based on the data amount of the target data.
In a fourth aspect, an embodiment of the present disclosure provides a data consuming apparatus, including:
a second request module configured to request the Kafka server to allocate a Kafka partition;
a second receiving module, configured to, after receiving the information of the assigned Kafka partition returned by the Kafka server, read a predetermined amount of consumption data from a message queue of the assigned Kafka partition;
a caching module configured to cache a predetermined amount of the consumption data and update an offset of the consumed data based on the predetermined amount;
a third commit module configured to commit the offset of the updated consumed data to a coordinator of the Kafka partition;
a second writing module configured to write the cached consumption data into a target storage device after the offset of the updated consumed data is successfully submitted;
a second deletion module configured to restore the offset of consumed data to a value before update and delete the cached consumed data after the offset of the updated consumed data fails to commit.
Further, the apparatus further comprises:
a monitoring module configured to listen for a reallocation event of the allocated Kafka partition.
Further, the apparatus further comprises:
a second obtaining module configured to obtain target data consumed but not submitted from the Kafka partition after listening to the reassignment event of the Kafka partition;
a second purge module configured to purge the target data and to re-consume the target data from the Kafka partition after the Kafka partition reallocation is complete.
Further, the apparatus further comprises:
a second determination module configured to determine whether an offset of the consumed data is updated based on a data amount of the target data after the target data is cleared;
a second recovery module configured to restore the offset of the consumed data to its value before the update if the offset of the consumed data has been updated based on the data amount of the target data.
The above functions may be implemented in hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above-described functions.
In one possible design, the apparatus includes a memory configured to store one or more computer instructions that enable the apparatus to perform the corresponding method, and a processor configured to execute the computer instructions stored in the memory. The apparatus may also include a communication interface for the apparatus to communicate with other devices or a communication network.
In a fifth aspect, the disclosed embodiments provide an electronic device, including a memory for storing one or more computer instructions that support any of the above apparatuses in performing the corresponding method described above, and a processor configured to execute the computer instructions stored in the memory. The electronic device may also include a communication interface for communicating with other devices or a communication network.
In a sixth aspect, the disclosed embodiments provide a computer-readable storage medium for storing computer instructions for use by any of the above-mentioned apparatuses, including computer instructions for performing any of the above-mentioned methods.
In a seventh aspect, the disclosed embodiments provide a computer program product comprising computer instructions for implementing the steps of the method according to any one of the above aspects when executed by a processor.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
In the method, the Flume system requests the Kafka server to allocate a Kafka partition. After receiving the Kafka partition allocated by the Kafka server, the Flume system requests to consume data from the Kafka partition, and the Kafka partition, based on the offset of consumed data submitted by the Flume system, allows the Flume system to read from the message queue the unconsumed data that follows that offset. After the Flume system reads the consumption data, it temporarily caches the data, updates the offset of the consumed data based on the data read, and commits the offset of the consumed data to the Kafka partition.
A partition reallocation event at the Kafka partition may prevent the offset of consumed data submitted by the Flume system from being successfully received by the Kafka partition, so that data already consumed is consumed again by the Flume system after the partition reallocation is complete. To avoid this, the Flume system listens for partition reallocation events of the Kafka partition while consuming data. If the Flume system detects a partition reallocation event of the Kafka partition, the target data in the cache is cleared and is not written into the target storage device, which may be a file storage device in the Flume system or a database storage device. In this way, repeated consumption of data after a partition reallocation event of the Kafka partition can be avoided.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of embodiments of the disclosure.
Drawings
Other features, objects, and advantages of embodiments of the disclosure will become apparent from the following detailed description of non-limiting embodiments, which proceeds with reference to the accompanying drawings. In the drawings:
FIG. 1 shows a flow diagram of a data consumption method according to an embodiment of the present disclosure;
FIG. 2 shows a flow diagram of a data consumption method according to another embodiment of the present disclosure;
FIG. 3 illustrates an overall flow diagram of a data consumption method according to an embodiment of the present disclosure;
FIG. 4 illustrates a block diagram of a data consumption device according to an embodiment of the present disclosure;
FIG. 5 shows a block diagram of a data consumption device according to another embodiment of the present disclosure;
FIG. 6 is a schematic block diagram of a computer system suitable for use in implementing a data consumption method according to an embodiment of the present disclosure.
Detailed Description
Hereinafter, exemplary embodiments of the disclosed embodiments will be described in detail with reference to the accompanying drawings so that they can be easily implemented by those skilled in the art. Also, for the sake of clarity, parts not relevant to the description of the exemplary embodiments are omitted in the drawings.
In the disclosed embodiments, it is to be understood that terms such as "including" or "having," etc., are intended to indicate the presence of the disclosed features, numbers, steps, behaviors, components, parts, or combinations thereof, and are not intended to preclude the possibility that one or more other features, numbers, steps, behaviors, components, parts, or combinations thereof may be present or added.
It should also be noted that the embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict. The embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
The Flume system generally comprises data-receiving Source layers, interceptors, Channels (transmission channels) and the data delivery component Sink. The Source can collect data from various data sources (log files, network ports, Kafka and the like) and packages the data into Events (an Event is the basic unit in which data is transmitted in Flume and consists of an Event Header and an Event Body). An Event can be processed by a series of interceptors and is then written into a Channel. After the data is successfully written into the Channel, the Sink actively pulls the data from the Channel and writes it into big data components such as HDFS, HBase, Hive and ES for subsequent data processing.
Flume ships with many built-in interceptors, including a timestamp interceptor, a host interceptor and a regular-expression extraction interceptor. The timestamp interceptor adds the timestamp of the current time to the Event Header of each transmitted Event; the host interceptor adds the name of the current host to the Event Header of each transmitted Event; and the regular-expression extraction interceptor extracts fields from the Event Body into the Event Header by means of a regular expression.
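By way of a non-limiting illustration, the three interceptors described above can be sketched in Python as follows. This is not Flume's own (Java) interceptor API: the `Event` class, the header keys `timestamp`, `host` and `level`, and the helper `apply_chain` are hypothetical names chosen only for this sketch.

```python
import re
import socket
import time
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Event:
    """Hypothetical stand-in for a Flume Event: an Event Header map plus an Event Body."""
    body: bytes
    headers: Dict[str, str] = field(default_factory=dict)


def timestamp_interceptor(event: Event) -> Event:
    # Add the current time (milliseconds since the epoch) to the Event Header.
    event.headers["timestamp"] = str(int(time.time() * 1000))
    return event


def host_interceptor(event: Event) -> Event:
    # Add the name of the current host to the Event Header.
    event.headers["host"] = socket.gethostname()
    return event


def regex_extractor_interceptor(pattern: str, header_name: str):
    compiled = re.compile(pattern)

    def intercept(event: Event) -> Event:
        # Copy the first capture group found in the Event Body into the Event Header.
        match = compiled.search(event.body.decode("utf-8"))
        if match:
            event.headers[header_name] = match.group(1)
        return event

    return intercept


def apply_chain(event: Event, interceptors) -> Event:
    # Interceptors run in order before the Event is written into a Channel.
    for interceptor in interceptors:
        event = interceptor(event)
    return event


chain = [timestamp_interceptor, host_interceptor,
         regex_extractor_interceptor(r"level=(\w+)", "level")]
processed = apply_chain(Event(body=b"level=ERROR msg=disk full"), chain)
```

After the chain runs, `processed.headers` holds a timestamp, a host name and the extracted `level` field, while the Event Body is left unchanged.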
Fig. 1 shows a flowchart of a data consumption method according to an embodiment of the present disclosure, as shown in fig. 1, the data consumption method includes the steps of:
in step S101, the Kafka server is requested to allocate a Kafka partition;
in step S102, after receiving the information of the assigned Kafka partition returned by the Kafka server, consuming data from the assigned Kafka partition, and listening to a reallocation event of the assigned Kafka partition;
in step S103, after listening to the reallocation event of the Kafka partition, acquiring target data consumed but not submitted from the Kafka partition;
in step S104, the target data is cleared, and the target data is consumed again from the Kafka partition after the Kafka partition reallocation is completed.
As mentioned above, Flume is a distributed, highly available system for collecting, aggregating and transmitting massive volumes of logs. In the big data era, Flume is widely used as a data collection tool in many scenarios. Flume mainly consists of four parts, namely Source (the component that acquires data in Flume), interceptor (the component that processes data in Flume), Channel (the storage component in Flume) and Sink (the data delivery component in Flume), which together complete the collection, aggregation and transmission of logs.
Kafka is essentially a message queue into which a data producer can place the data it produces. Flume, as a consumer, may consume data from the Kafka message queue and process the consumed data accordingly. However, while Flume consumes data from the Kafka message queue, the same data is easily consumed more than once. A solution is therefore needed to the problem of repeated data consumption on the Flume side.
The inventor of the present disclosure finds that, when the existing Flume system consumes data from the Kafka message queue, if a network connection problem or a partition reallocation problem of the Kafka message queue is encountered, an offset of the consumed data submitted by the Flume system to the Kafka partition may not successfully reach the Kafka end, so that when the consumption is performed next time, the Kafka partition still transmits the last consumed data, and the data consumption is repeated.
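The failure mode identified above can be reproduced with a small simulation. The sketch below is only an assumption-laden model: `Partition` is an in-memory stand-in rather than a Kafka client class, and the `reachable` flag is a hypothetical way of modelling a network fault or an in-progress reallocation. It shows a naive consumer that writes to storage first and commits afterwards.

```python
class Partition:
    """In-memory stand-in for one Kafka partition (an assumption, not a Kafka API)."""

    def __init__(self, records):
        self.records = list(records)
        self.committed = 0          # offset the server believes has been consumed

    def fetch(self, max_records):
        # The server always serves from the last offset it has on record.
        return self.records[self.committed:self.committed + max_records]

    def commit(self, offset, reachable=True):
        if not reachable:           # network fault or reallocation in progress
            return False
        self.committed = offset
        return True


partition = Partition(["a", "b", "c", "d"])
sink = []                           # the target storage device

# Round 1: the naive consumer writes first; its offset commit is then lost.
batch = partition.fetch(2)
sink.extend(batch)
partition.commit(2, reachable=False)

# Round 2: the server still has offset 0 on record and resends "a" and "b".
batch = partition.fetch(2)
sink.extend(batch)
partition.commit(2)

print(sink)   # ['a', 'b', 'a', 'b']
```

Because the write to the target storage device happened before the offset was known to have reached the server, the lost commit leaves two copies of the first batch in storage.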
In view of the above problem, this embodiment proposes a data consumption method in which the Flume system requests the Kafka server to allocate a Kafka partition. After receiving the Kafka partition allocated by the Kafka server, the Flume system requests to consume data from the Kafka partition, and the Kafka partition, based on the offset of consumed data submitted by the Flume system, allows the Flume system to read from the message queue the unconsumed data that follows that offset. After the Flume system reads the consumption data, it temporarily caches the data, updates the offset of the consumed data based on the data read, and submits the offset of the consumed data to the Kafka partition.
A partition reallocation event at the Kafka partition may prevent the offset of consumed data submitted by the Flume system from being successfully received by the Kafka partition, so that data already consumed is consumed again by the Flume system after the partition reallocation is complete. To avoid this, the Flume system listens for partition reallocation events of the Kafka partition while consuming data. If the Flume system detects a partition reallocation event of the Kafka partition, the target data in the cache is cleared and is not written into the target storage device, which may be a file storage device in the Flume system or a database storage device. In this way, repeated consumption of data after a partition reallocation event of the Kafka partition can be avoided.
In an embodiment of the present disclosure, the data consumption method may be adapted to be executed on an Agent server of a Flume system.
In an embodiment of the present disclosure, the data source system may be any system that generates data, and the data generated by the data source system may be collected, aggregated, and transmitted by the Flume system, and distributed to the corresponding data receiver system. In some embodiments, the data recipient system may be any data storage system. The Flume system can comprise a plurality of agents, each Agent corresponds to a group of Source layer, channel layer and Sink layer, each Agent can correspond to a data Source system, and different data Source systems correspond to different agents. Each Agent can correspond to one or more data receiver systems, and data received from the same data source system can be transmitted to one or more data receiver systems after being processed. The data receiving systems corresponding to different agents can be different.
Data generated by the data source system may be placed in the Kafka message queue for reading by the Flume system.
In an embodiment of the present disclosure, target data consumed but not committed from the Kafka partition may be understood as data read from a message queue of the Kafka partition, but that data is only cached in the Flume system and has not yet been written to the target storage device.
The Kafka partition, when reassigned, may send a notification message to the Flume system that the Kafka partition is currently being reassigned. The Flume system may listen to the Kafka partition for a partition reallocation event based on the notification message.
In the manner of the embodiments of the present disclosure, the data duplication that arises when the Flume system consumes data from the Kafka partition can be resolved, data consumption efficiency can be improved, and data consumption cost can be reduced.
In an embodiment of the present disclosure, step S103, namely the step of obtaining the target data consumed but not submitted from the Kafka partition, further includes the following steps:
and acquiring the data consumed from the Kafka partition from a cache as target data.
In this embodiment, the target data is data held in the cache of the Flume system: the Flume system has read it from the message queue of the Kafka partition, but has not yet reported the successful read back to the Kafka partition and has not yet committed the data to a target storage device such as a database or a file. The data may be deleted from the cache and is not submitted to the target storage device for storage. After the partition reallocation of the Kafka partition ends, the previously cleared data can be obtained again, that is, the consumption data can be read again from the Kafka partition, so that repeated consumption of the data is prevented.
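One possible way to picture this behaviour is the following sketch, which assumes the reallocation notification is delivered as two callbacks. The class `CachingConsumer` and the callback names `on_reallocation_started` and `on_reallocation_completed` are hypothetical and are not taken from the Flume or Kafka code bases; the partition's message queue is modelled as a plain list.

```python
class CachingConsumer:
    """Sketch of a consumer that caches records until their offset is committed."""

    def __init__(self, log, committed_offset=0):
        self.log = log                      # stand-in for the partition's message queue
        self.committed_offset = committed_offset
        self.cache = []                     # consumed but not committed: the target data
        self.rebalancing = False

    def poll(self, max_records):
        if self.rebalancing:
            return []                       # nothing is consumed during reallocation
        start = self.committed_offset + len(self.cache)
        batch = self.log[start:start + max_records]
        self.cache.extend(batch)
        return batch

    def on_reallocation_started(self):
        # Take the cached records as the target data and clear them
        # without writing them to the target storage device.
        target_data = list(self.cache)
        self.cache.clear()
        self.rebalancing = True
        return target_data

    def on_reallocation_completed(self):
        self.rebalancing = False            # consumption resumes from committed_offset


consumer = CachingConsumer(log=["r0", "r1", "r2", "r3"])
consumer.poll(2)                            # caches r0 and r1 (uncommitted)
dropped = consumer.on_reallocation_started()
consumer.on_reallocation_completed()
again = consumer.poll(2)                    # r0 and r1 are consumed again
```

Here `dropped` and `again` contain the same two records: the cleared target data never reached storage, so reading it again after the reallocation does not create a duplicate.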
In an embodiment of the present disclosure, step S102, namely the step of consuming data from the assigned Kafka partition, further includes the following steps:
reading consumption data from a message queue of the Kafka partition;
caching the read consumption data, and updating an offset of the consumed data based on the data amount of the read consumption data.
In this alternative implementation, the Flume system consumes data from the Kafka partition by reading consumption data from the message queue of the Kafka partition that the Kafka server has allocated to it. The consumption data is determined by the Kafka partition based on the offset of consumed data that the Flume system last fed back: the offset identifies the data the Flume system has already consumed, and a portion of the unconsumed data is then placed in the message queue for the Flume system to read.
After the Flume system reads the consumption data, it stores the data in its cache, updates the offset of the consumed data according to the data amount of the current consumption data, and then feeds the offset of the consumed data back to the Kafka partition, so that the Kafka partition can place new consumption data into the message queue for the Flume system to read the next time data is consumed.
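The read-and-update step can be illustrated with a minimal sketch. The function name `read_batch`, the `max_records` parameter and the list-based queue are assumptions made for the example; the offset is counted in records here, although an implementation may measure the data amount differently.

```python
from typing import List, Sequence, Tuple


def read_batch(queue: Sequence[str], consumed_offset: int,
               max_records: int) -> Tuple[List[str], int]:
    """Read unconsumed records after consumed_offset; return them with the new offset."""
    batch = list(queue[consumed_offset:consumed_offset + max_records])
    # The offset advances by the number of records actually read,
    # which may be fewer than max_records near the end of the queue.
    return batch, consumed_offset + len(batch)


queue = ["m0", "m1", "m2", "m3", "m4"]
cache, offset = read_batch(queue, consumed_offset=0, max_records=3)
more, offset = read_batch(queue, consumed_offset=offset, max_records=3)
```

The first call caches three records and moves the offset to 3; the second call finds only two records left and moves the offset to 5.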
In an embodiment of the present disclosure, after the step of caching the read consumption data and updating the offset of the consumed data based on the data amount of the read consumption data, the method further includes the steps of:
submitting an offset of the updated consumed data to the Kafka partition;
and after the updated offset of the consumed data is successfully submitted, writing the cached consumed data into a target storage device.
In this alternative implementation, in order to further avoid the problem of repeated consumption of data due to network problems or partition reallocation, the data consumed by the Flume system from the Kafka partition is cached first, and after the offset of the consumed data submitted by the Flume system to the Kafka partition succeeds, the cached consumed data is written into the target storage device. The offset of consumed data is updated based on the amount of data currently consumed after each consumption of data. For example, the offset of the original consumed data is x, the currently consumed data amount is n, and the offset of the updated consumed data may be x + n.
In an embodiment of the present disclosure, after the step of caching the read consumption data and updating the offset of the consumed data based on the data amount of the read consumption data, the method further includes the steps of:
submitting an offset of the updated consumed data to the Kafka partition;
and after the offset of the consumed data after updating fails to be submitted, restoring the offset of the consumed data to a value before updating, and deleting the cached consumed data.
In this alternative implementation, in order to further avoid repeated consumption of data caused by network problems or partition reallocation, the data consumed by the Flume system from the Kafka partition is cached first. If the Flume system fails to submit the offset of the consumed data to the Kafka partition, the cached consumption data is cleared without being written into the target storage device, and the updated offset of the consumed data is restored to its value before the update. This ensures that consumption data whose offset was not successfully submitted is never written into the target storage device, so that consuming the same data from the Kafka partition again the next time does not cause duplicate data on the target storage device. The offset of consumed data is updated based on the amount of data currently consumed after each consumption of data. For example, if the original offset of the consumed data is x and the currently consumed data amount is n, the updated offset of the consumed data may be x + n, and the offset is restored to x if the updated offset is not successfully committed.
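The success and failure branches described above can be sketched together as a single function. The name `commit_then_write` and the `commit` callable (assumed to return True when the offset submission succeeds) are hypothetical; they stand in for whatever mechanism an implementation uses to submit the offset.

```python
def commit_then_write(cache, offset_before, commit, storage):
    """Commit x + n first; write the cache only if the commit succeeds.

    `commit` is a hypothetical callable returning True on success.
    Returns the offset the consumer should hold afterwards.
    """
    updated = offset_before + len(cache)    # x + n
    if commit(updated):
        storage.extend(cache)               # success: persist the cached batch
        cache.clear()
        return updated
    cache.clear()                           # failure: discard the cached batch
    return offset_before                    # and restore the offset to x


storage = []
ok = commit_then_write(["a", "b", "c"], 10, lambda off: True, storage)
failed = commit_then_write(["d", "e"], ok, lambda off: False, storage)
```

With x = 10 and n = 3 the first call returns 13 and stores three records; the second call, whose commit fails, leaves both the offset (13) and the storage contents unchanged.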
In an embodiment of the present disclosure, the method further comprises the steps of:
after the target data is cleared, determining whether the offset of the consumed data is updated based on the data volume of the target data;
and when the offset of the consumed data has been updated based on the data amount of the target data, restoring the offset of the consumed data to its value before the update.
In this alternative implementation, after a partition reallocation event of the Kafka partition is detected, the Flume system further needs to restore the offset of the consumed data once the target data has been cleared. That is, if the offset of the consumed data has already been updated by the data amount of the cleared target data, the offset can be restored to its value before the update, so that the target data just cleared can be consumed again after the partition reallocation is completed.
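A minimal sketch of this conditional restore follows, assuming the consumer keeps both the last offset acknowledged by Kafka and its locally updated value. The class `OffsetTracker` and its method names are hypothetical.

```python
class OffsetTracker:
    """Tracks the consumed-data offset and whether an update is still uncommitted."""

    def __init__(self, committed=0):
        self.committed = committed   # last value acknowledged by Kafka
        self.current = committed     # local value, possibly advanced

    def advance(self, n):
        self.current += n

    def mark_committed(self):
        self.committed = self.current

    def restore_if_updated(self):
        # Returns True when the offset had been advanced for data that was cleared.
        updated = self.current != self.committed
        if updated:
            self.current = self.committed
        return updated


tracker = OffsetTracker(committed=100)
tracker.advance(25)                        # offset updated for the cached target data
restored = tracker.restore_if_updated()    # reallocation: target data was cleared
unchanged = tracker.restore_if_updated()   # nothing pending, so nothing to restore
```

The first call restores the offset from 125 to 100; the second call finds no pending update and leaves the offset untouched.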
Fig. 2 shows a flow chart of a data consumption method according to another embodiment of the present disclosure, which, as shown in fig. 2, comprises the steps of:
in step S201, the Kafka server is requested to allocate a Kafka partition;
in step S202, after receiving the information of the assigned Kafka partition returned by the Kafka server, reading a predetermined amount of consumption data from the message queue of the assigned Kafka partition;
in step S203, caching a predetermined amount of the consumption data, and updating an offset of the consumed data based on the predetermined amount;
submitting an offset of the updated consumed data to a coordinator of the Kafka partition in step S204;
in step S205, after the offset of the updated consumed data is successfully submitted, writing the cached consumed data into a target storage device;
in step S206, after the offset of the consumed data after updating fails to be submitted, the offset of the consumed data is restored to the value before updating, and the cached consumed data is deleted.
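Steps S202 to S206 can be illustrated with a minimal in-memory sketch. This is not the Flume or Kafka API: the names `FakeCoordinator`, `CachingConsumer`, `poll_once`, and `fail_on_calls` are hypothetical, the partition is modelled as a plain list, and the commit failure is simulated rather than caused by a real network fault.

```python
class FakeCoordinator:
    """Hypothetical stand-in for the coordinator of the Kafka partition."""

    def __init__(self, fail_on_calls=()):
        self.committed = 0
        self.calls = 0
        self.fail_on_calls = set(fail_on_calls)

    def commit(self, offset):
        self.calls += 1
        if self.calls in self.fail_on_calls:
            return False          # simulate a commit that does not succeed
        self.committed = offset
        return True


class CachingConsumer:
    """Hypothetical consumer that writes data only after a successful commit."""

    def __init__(self, log, coordinator, batch_size):
        self.log = log            # messages of one partition, modelled as a list
        self.coordinator = coordinator
        self.batch_size = batch_size
        self.offset = 0           # local offset of consumed data
        self.cache = []
        self.storage = []         # stands in for the target storage device

    def poll_once(self):
        # S202: read a predetermined amount of data after the current offset
        batch = self.log[self.offset:self.offset + self.batch_size]
        if not batch:
            return
        # S203: cache the batch and update the offset (x -> x + n)
        previous = self.offset
        self.cache = list(batch)
        self.offset = previous + len(batch)
        # S204: submit the updated offset to the coordinator
        if self.coordinator.commit(self.offset):
            # S205: the commit succeeded, so the cached data is written
            self.storage.extend(self.cache)
        else:
            # S206: the commit failed, so the offset is restored to x
            self.offset = previous
        # In both cases the cache is emptied; on failure nothing was written
        self.cache = []


consumer = CachingConsumer(
    ["m0", "m1", "m2", "m3"], FakeCoordinator(fail_on_calls=[1]), batch_size=2
)
for _ in range(3):
    consumer.poll_once()
print(consumer.storage)  # ['m0', 'm1', 'm2', 'm3']
```

In this sketch the first commit fails, so the first batch is discarded and read again on the second call; the storage list ends up with each message exactly once.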
As mentioned above, Flume is a distributed, highly available system for collecting, aggregating, and transmitting massive volumes of logs. In the big data era, Flume is widely used in many scenarios as a data acquisition tool. Flume mainly comprises four parts that together complete the collection, aggregation, and transmission of logs: Source (the component in Flume that acquires data), Interceptor (the component in Flume that processes data), Channel (the storage component in Flume), and Sink (the data transmission component in Flume).
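The division of labour among the four parts can be sketched as a toy pipeline. The functions below are hypothetical illustrations of the roles only; they are not Flume classes, and the event layout (a dict with `body` and `headers`) is an assumption made for the example.

```python
from collections import deque


def source(lines):
    """Source: acquires raw events (here, from a list of strings)."""
    for line in lines:
        yield {"body": line, "headers": {}}


def interceptor(events):
    """Interceptor: processes events in flight (drops blanks, tags the rest)."""
    for event in events:
        if event["body"].strip():
            event["headers"]["length"] = len(event["body"])
            yield event


channel = deque()  # Channel: buffers events between Source and Sink
delivered = []     # destination that the Sink transmits to


def sink(channel, destination):
    """Sink: drains the Channel and transmits event bodies onward."""
    while channel:
        destination.append(channel.popleft()["body"])


channel.extend(interceptor(source(["login ok", "", "disk full"])))
sink(channel, delivered)
print(delivered)  # ['login ok', 'disk full']
```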
Kafka is essentially a message queue into which production data from a data producer can be placed. Flume, as a consumer, may consume data from the Kafka message queue and complete the corresponding processing of the consumed data. However, while Flume consumes data from the Kafka message queue, data is easily consumed repeatedly. Therefore, a solution is needed to the problem of repeated data consumption on the Flume side.
The inventor of the present disclosure finds that, when the existing Flume system consumes data from the Kafka message queue and encounters a network connection problem or a partition reallocation of the Kafka message queue, the offset of the consumed data submitted by the Flume system to the Kafka partition may not successfully reach the Kafka side. At the next consumption, the Kafka partition then sends the previously consumed data again, and the data is consumed repeatedly.
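The failure mode described above can be reproduced with a small simulation. The sketch assumes a simplified consumer that writes data to storage before the offset commit is confirmed; the names `log`, `committed`, `storage`, and `naive_consume` are hypothetical and the partition is modelled as a list.

```python
# Hypothetical in-memory model; not the Kafka or Flume API.
log = ["m0", "m1", "m2", "m3"]   # messages in one partition
committed = 0                    # offset recorded on the Kafka side
storage = []                     # stands in for the target storage device


def naive_consume(batch_size, commit_ok):
    """Write first, then commit: a failed commit leads to re-delivery."""
    global committed
    batch = log[committed:committed + batch_size]
    storage.extend(batch)        # data is written before the commit result is known
    if commit_ok:
        committed += len(batch)  # the Kafka side only advances on success


naive_consume(2, commit_ok=False)  # the offset does not reach the Kafka side
naive_consume(2, commit_ok=True)   # the same two messages are delivered again
print(storage)                     # ['m0', 'm1', 'm0', 'm1']
```

Because the first commit is lost, the second read starts from the same offset and the storage receives the first two messages twice.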
In view of the above problem, this embodiment proposes a data consumption method in which a Flume system requests a Kafka server to assign a Kafka partition. Upon receiving the Kafka partition assigned by the Kafka server, the Flume system requests to consume data from the Kafka partition, and the Kafka partition, based on the offset of the consumed data submitted by the Flume system, allows the Flume system to read from the message queue the unconsumed data located after that offset. After the Flume system reads the consumption data, the consumption data is temporarily cached, the offset of the consumed data is updated based on the read data, and the updated offset of the consumed data is submitted to the Kafka partition. If the updated offset of the consumed data is successfully submitted, the cached consumption data is written into the target storage device; if the updated offset of the consumed data is not successfully submitted, the offset of the consumed data is restored to the value before the update, and the cached consumption data is deleted.
With this method, repeated data consumption is avoided when the network is unstable or the Kafka partition undergoes partition reallocation, which can improve data consumption efficiency and reduce data consumption cost.
In an embodiment of the present disclosure, the data consumption method may be adapted to be executed on an Agent server of a Flume system.
In an embodiment of the present disclosure, the data source system may be any system that generates data, and the data generated by the data source system may be collected, aggregated, and transmitted by the Flume system, and distributed to the corresponding data receiver system. In some embodiments, the data receiver system may be any data storage system. The Flume system can comprise a plurality of Agents, and each Agent corresponds to a group consisting of a Source layer, a Channel layer, and a Sink layer. Each Agent can correspond to one data source system, and different data source systems correspond to different Agents. Each Agent can correspond to one or more data receiver systems, and data received from the same data source system can be transmitted to one or more data receiver systems after being processed. The data receiver systems corresponding to different Agents can be different.
Data generated by the data source system may be placed in the Kafka message queue for reading by the Flume system.
In one embodiment of the present disclosure, target data consumed but not committed from the Kafka partition may be understood as data that has been read from a message queue of the Kafka partition but is only cached in the Flume system and has not yet been written to the target storage device.
To further avoid repeated data consumption caused by network problems or partition reallocation, the data consumed by the Flume system from the Kafka partition is cached first, and the cached consumed data is written into the target storage device only after the Flume system has successfully submitted the offset of the consumed data to the Kafka partition. The offset of consumed data is updated after each consumption based on the amount of data currently consumed. For example, if the original offset of the consumed data is x and the currently consumed data amount is n, the updated offset of the consumed data may be x + n.
If the Flume system fails to submit the offset of the consumed data to the Kafka partition, the cached consumed data is cleared without being written into the target storage device, and the updated offset of the consumed data is restored to its value before the update. This ensures that consumed data whose offset was not successfully submitted is never written into the target storage device, so that consuming the same data again from the Kafka partition next time does not duplicate data on the target storage device. Continuing the example, if the updated offset x + n is not successfully committed, the offset of the consumed data is restored to x.
In an embodiment of the present disclosure, the method further comprises the steps of:
monitoring a reallocation event of the assigned Kafka partition.
In this alternative implementation, a partition reallocation event occurring at the Kafka partition may prevent the offset of consumed data submitted by the Flume system from being successfully received by the Kafka partition, which would cause the Flume system to repeatedly consume previously consumed data after the partition reallocation is complete. To avoid this, the Flume system listens for partition reallocation events of the Kafka partition while consuming data.
In an embodiment of the present disclosure, the method further comprises the steps of:
after the reallocation event of the Kafka partition is monitored, acquiring target data consumed but not submitted from the Kafka partition;
and clearing the target data, and consuming the target data from the Kafka partition again after the Kafka partition reallocation is completed.
In this alternative implementation, if the Flume system detects a partition reallocation event of the Kafka partition, the target data in the cache is cleared and is not written into the target storage device, where the target storage device may be a file storage device in the Flume system or a database storage device. In this way, repeated data consumption after a partition reallocation event of the Kafka partition can be avoided.
In one embodiment of the present disclosure, target data consumed but not committed from the Kafka partition may be understood as data that has been read from a message queue of the Kafka partition but is only cached in the Flume system and has not yet been written to the target storage device.
The Kafka partition, when reassigned, may send a notification message to the Flume system that the Kafka partition is currently being reassigned. The Flume system may listen to the Kafka partition for a partition reallocation event based on the notification message.
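The reaction to the notification message can be sketched as a pair of callbacks. The class and method names below (`RebalanceAwareBuffer`, `on_partitions_revoked`, `on_partitions_assigned`, `on_commit_success`) are hypothetical and only loosely modelled on the rebalance callbacks that Kafka client libraries expose, such as the Java client's `ConsumerRebalanceListener`; no real client is used here.

```python
class RebalanceAwareBuffer:
    """Hypothetical buffer that discards uncommitted data on reallocation."""

    def __init__(self):
        self.cache = []          # consumed but not yet committed (target data)
        self.storage = []        # stands in for the target storage device
        self.rebalancing = False

    def on_records(self, records):
        # Consumed data is only cached at first.
        self.cache.extend(records)

    def on_partitions_revoked(self):
        # Reallocation notification: clear the target data, never write it.
        dropped = len(self.cache)
        self.cache.clear()
        self.rebalancing = True
        return dropped

    def on_partitions_assigned(self):
        # Reallocation finished: consumption may resume.
        self.rebalancing = False

    def on_commit_success(self):
        # Only data whose offset was committed reaches the storage.
        self.storage.extend(self.cache)
        self.cache.clear()


buf = RebalanceAwareBuffer()
buf.on_records(["m0", "m1"])
dropped = buf.on_partitions_revoked()   # reallocation begins before the commit
buf.on_partitions_assigned()
buf.on_records(["m0", "m1"])            # the same data is consumed again
buf.on_commit_success()
print(dropped, buf.storage)             # 2 ['m0', 'm1']
```

The data cached before the reallocation is dropped rather than written, so re-consuming it afterwards leaves a single copy in storage.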
In an embodiment of the present disclosure, the method further comprises the steps of:
after the target data is cleared, determining whether the offset of the consumed data is updated based on the data volume of the target data;
and when the offset of the consumed data has been updated based on the data amount of the target data, restoring the offset of the consumed data to the value before the update.
In this alternative implementation, the process by which the Flume system consumes data from the Kafka partition includes the Flume system reading consumption data from the message queue of the Kafka partition assigned to it by the Kafka server. The consumption data is determined by the Kafka partition based on the offset of consumed data last fed back by the Flume system; that is, the Kafka partition determines the data already consumed by the Flume system from this offset and places part of the unconsumed data in the message queue for the Flume system to read.
After the Flume system reads the consumption data, it stores the consumption data in its cache, updates the offset of the consumed data according to the data volume of the current consumption data, and then feeds the updated offset back to the Kafka partition, so that the Kafka partition can put new consumption data into the message queue for the Flume system to read at the next consumption.
After detecting a partition reallocation event of the Kafka partition and clearing the target data, the Flume system also needs to restore the offset of the consumed data. That is, if the offset of the consumed data has already been updated based on the data amount of the cleared target data, the offset is restored to its value before that update, so that the cleared target data can be consumed again after the partition reallocation is completed.
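The conditional restoration of the offset can be sketched as follows. The class `ConsumerState`, its flag `offset_updated_for_cache`, and the method names are hypothetical; the sketch assumes the offset counts messages, so restoring it means subtracting the number of cleared messages.

```python
from dataclasses import dataclass, field


@dataclass
class ConsumerState:
    """Hypothetical local state of the consumer for one partition."""

    offset: int = 0                              # offset of consumed data
    cache: list = field(default_factory=list)    # target data (uncommitted)
    offset_updated_for_cache: bool = False

    def consume(self, log, n):
        # Read up to n messages after the offset, cache them, advance the offset.
        batch = log[self.offset:self.offset + n]
        self.cache = list(batch)
        self.offset += len(batch)
        self.offset_updated_for_cache = len(batch) > 0

    def clear_on_reallocation(self):
        # Clear the target data, then restore the offset only if it was
        # already advanced by the amount of that data.
        cleared = len(self.cache)
        self.cache = []
        if self.offset_updated_for_cache:
            self.offset -= cleared
            self.offset_updated_for_cache = False
        return cleared


log = ["m0", "m1", "m2", "m3", "m4"]
state = ConsumerState(offset=2)
state.consume(log, 2)                    # caches m2, m3; offset 2 -> 4
cleared = state.clear_on_reallocation()  # offset restored 4 -> 2
print(cleared, state.offset)             # 2 2
state.consume(log, 2)                    # after reallocation, read again
print(state.cache)                       # ['m2', 'm3']
```

Because the offset returns to its earlier value, the next read yields exactly the data that was cleared.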
The technical terms and technical features involved in fig. 2 and its related embodiment are the same as or similar to those mentioned in fig. 1 and its related embodiment; for their explanation and description, reference may be made to the above description of fig. 1 and its related embodiment, which is not repeated here.
FIG. 3 illustrates an overall flow diagram of a data consumption method according to an embodiment of the present disclosure. As shown in FIG. 3, the data source system sends the generated data to the Kafka system, which stores the data in different partitions. The Flume system requests to consume the data generated by the data source system from the Kafka system, and the Kafka system assigns a corresponding Kafka partition to the Flume system. The Flume system reads the corresponding data from that partition and stores the read data in its cache. The offset of the consumed data is then updated based on the amount of data read and fed back to the Kafka system.
During data consumption, the Flume system listens in real time for partition reallocation events generated by the Kafka system. Once such an event is detected, the Flume system clears the target data stored in the cache and restores the offset of the consumed data that was updated based on the data amount of the target data.
The Flume system also clears the cached data when the offset of the consumed data fails to be committed to the Kafka system, and restores the offset of the consumed data that was updated based on the data amount of the cleared data.
The following are embodiments of the disclosed apparatus that may be used to perform embodiments of the disclosed methods.
Fig. 4 shows a block diagram of a data consumption apparatus according to an embodiment of the present disclosure, which may be implemented as part or all of an electronic device by software, hardware, or a combination of both. As shown in fig. 4, the data consumption apparatus includes:
a first request module 401 configured to request the Kafka server to allocate Kafka partitions;
a first receiving module 402, configured to, after receiving the information of the assigned Kafka partition returned by the Kafka server, consume data from the assigned Kafka partition, and listen to a reallocation event of the assigned Kafka partition;
a first obtaining module 403, configured to obtain target data consumed but not submitted from the Kafka partition after listening to the reallocation event of the Kafka partition;
a first purge module 404 configured to purge the target data and to re-consume the target data from the Kafka partition after the Kafka partition reallocation is complete.
Fig. 5 shows a block diagram of a data consumption apparatus according to another embodiment of the present disclosure, which may be implemented as part or all of an electronic device by software, hardware, or a combination of both. As shown in fig. 5, the data consumption apparatus includes:
a second request module 501 configured to request the Kafka server to allocate a Kafka partition;
a second receiving module 502, configured to, after receiving the information of the assigned Kafka partition returned by the Kafka server, read a predetermined amount of consumption data from the message queue of the assigned Kafka partition;
a caching module 503 configured to cache a predetermined amount of the consumed data and update an offset of the consumed data based on the predetermined amount;
a third commit module 504 configured to commit the updated offset of consumed data to a coordinator of the Kafka partition;
a second writing module 505 configured to write the cached consumed data into a target storage device after the offset of the updated consumed data is successfully submitted;
a second deleting module 506 configured to restore the offset of the consumed data to a value before updating and delete the cached consumed data after the offset of the updated consumed data fails to commit.
The technical features involved in the above device embodiments, and their explanations and descriptions, are the same as, correspond to, or are similar to those of the above method embodiments; reference may be made to the method embodiments, and the details are not repeated herein.
The embodiment of the present disclosure also discloses an electronic device, which includes a memory and a processor; wherein
the memory is configured to store one or more computer instructions, wherein the one or more computer instructions are executed by the processor to perform any of the method steps described above.
FIG. 6 is a schematic block diagram of a computer system suitable for use in implementing a data consumption method according to an embodiment of the present disclosure.
As shown in fig. 6, the computer system 600 includes a processing unit 601 which can execute various processes in the above-described embodiments according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the computer system 600 are also stored. The processing unit 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output portion 607 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card, a modem, or the like. The communication section 609 performs communication processing via a network such as the internet. The drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 610 as necessary, so that a computer program read therefrom is installed into the storage section 608 as necessary. The processing unit 601 may be implemented as a CPU, a GPU, a TPU, an FPGA, an NPU, or other processing units.
In particular, the methods described above may be implemented as computer software programs, according to embodiments of the present disclosure. For example, embodiments of the present disclosure include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program comprising program code for performing the data consumption method. In such embodiments, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611.
A computer program product is also disclosed in embodiments of the present disclosure, the computer program product comprising computer programs/instructions which, when executed by a processor, implement any of the above method steps.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment, or a portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units or modules described in the embodiments of the present disclosure may be implemented by software or hardware. The units or modules described may also be provided in a processor, and the names of the units or modules do not in some cases constitute a limitation of the units or modules themselves.
As another aspect, the disclosed embodiment also provides a computer-readable storage medium, which may be the computer-readable storage medium included in the apparatus in the foregoing embodiment; or it may be a separate computer readable storage medium not incorporated into the device. The computer readable storage medium stores one or more programs for use by one or more processors in performing the methods described in the embodiments of the present disclosure.
The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention in the embodiments of the present disclosure is not limited to the specific combination of the above-mentioned features, but also encompasses other embodiments in which any combination of the above-mentioned features or their equivalents is made without departing from the inventive concept, for example, technical solutions formed by mutually replacing the above features with (but not limited to) features having similar functions disclosed in the embodiments of the present disclosure.

Claims (10)

1. A method of data consumption, comprising:
requesting a Kafka server to allocate a Kafka partition;
after receiving the information of the assigned Kafka partition returned by the Kafka server, consuming data from the assigned Kafka partition, and monitoring a reallocation event of the assigned Kafka partition;
after the reallocation event of the Kafka partition is monitored, acquiring target data consumed but not submitted from the Kafka partition;
and clearing the target data, and consuming the target data from the Kafka partition again after the Kafka partition is reallocated.
2. The method of claim 1, wherein obtaining target data consumed from the Kafka partition but not committed comprises:
and acquiring the data consumed from the Kafka partition from a cache as target data.
3. The method of claim 1 or 2, wherein consuming data from the assigned Kafka partition comprises:
reading consumption data from a message queue of the Kafka partition;
caching the read consumption data, and updating an offset of the consumed data based on the data amount of the read consumption data.
4. The method of claim 3, wherein after caching the read consumption data and updating an offset of consumed data based on an amount of the read consumption data, the method further comprises:
submitting an offset of the updated consumed data to the Kafka partition;
and after the offset of the updated consumed data is successfully submitted, writing the cached consumed data into a target storage device.
5. A method of data consumption, comprising:
requesting a Kafka server to allocate a Kafka partition;
after receiving the information of the assigned Kafka partition returned by the Kafka server, reading a predetermined amount of consumption data from the message queue of the assigned Kafka partition;
caching a predetermined amount of the consumption data and updating an offset of the consumed data based on the predetermined amount;
submitting an offset of the updated consumed data to a coordinator of the Kafka partition;
after the updated offset of the consumed data is successfully submitted, writing the cached consumed data into a target storage device;
and after the offset of the consumed data after updating fails to be submitted, restoring the offset of the consumed data to a value before updating, and deleting the cached consumed data.
6. A data consumption device, comprising:
a first request module configured to request a Kafka server to allocate a Kafka partition;
a first receiving module configured to consume data from the assigned Kafka partition after receiving the information of the assigned Kafka partition returned by the Kafka server, and to monitor a reallocation event of the assigned Kafka partition;
a first acquisition module configured to acquire target data consumed but not submitted from the Kafka partition after listening for a reallocation event of the Kafka partition;
a first purge module configured to purge the target data and to re-consume the target data from the Kafka partition after the Kafka partition reallocation is complete.
7. A data consumption device, comprising:
a second request module configured to request the Kafka server to allocate a Kafka partition;
a second receiving module, configured to, after receiving the information of the assigned Kafka partition returned by the Kafka server, read a predetermined amount of consumption data from a message queue of the assigned Kafka partition;
a caching module configured to cache a predetermined amount of the consumption data and update an offset of the consumed data based on the predetermined amount;
a third commit module configured to commit the offset of the updated consumed data to a coordinator of the Kafka partition;
a second write module configured to write the cached consumed data into a target storage device after the offset of the updated consumed data is successfully submitted;
a second deletion module configured to restore the offset of the consumed data to a value before updating and delete the cached consumed data after the offset of the updated consumed data fails to be submitted.
8. An electronic device comprising a memory and a processor; wherein
the memory is configured to store one or more computer instructions, wherein the one or more computer instructions are executed by the processor to implement the steps of the method of any one of claims 1-5.
9. A computer readable storage medium having computer instructions stored thereon, wherein the computer instructions, when executed by a processor, implement the steps of the method of any one of claims 1-5.
10. A computer program product comprising computer programs/instructions which, when executed by a processor, carry out the steps of the method of any one of claims 1 to 5.
CN202211379891.7A 2022-11-04 2022-11-04 Data consumption method, device, electronic equipment, storage medium and program product Pending CN115658347A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211379891.7A CN115658347A (en) 2022-11-04 2022-11-04 Data consumption method, device, electronic equipment, storage medium and program product

Publications (1)

Publication Number Publication Date
CN115658347A true CN115658347A (en) 2023-01-31

Family

ID=85016231

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211379891.7A Pending CN115658347A (en) 2022-11-04 2022-11-04 Data consumption method, device, electronic equipment, storage medium and program product

Country Status (1)

Country Link
CN (1) CN115658347A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109493076A (en) * 2018-11-09 2019-03-19 武汉斗鱼网络科技有限公司 A kind of unique consuming method of Kafka message, system, server and storage medium
CN112181686A (en) * 2020-09-28 2021-01-05 北京金山云网络技术有限公司 Data processing method, device and system, electronic equipment and storage medium
CN113779149A (en) * 2021-09-14 2021-12-10 北京知道创宇信息技术股份有限公司 Message processing method and device, electronic equipment and readable storage medium
CN114448989A (en) * 2022-01-26 2022-05-06 北京百度网讯科技有限公司 Method, device, electronic equipment, storage medium and product for adjusting message distribution
CN115202898A (en) * 2021-04-13 2022-10-18 深圳市酷开网络科技股份有限公司 Message consumption method and device, computer equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
孟月昊,等: "关于Kafka分区策略对系统性能影响的研究", 计算机时代, no. 11, pages 11 - 15 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116821117A (en) * 2023-08-30 2023-09-29 广州睿帆科技有限公司 Stream data processing method, system, equipment and storage medium
CN116821117B (en) * 2023-08-30 2023-12-12 广州睿帆科技有限公司 Stream data processing method, system, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination