CN112732165A - Offset management method, device and storage medium - Google Patents

Offset management method, device and storage medium

Info

Publication number
CN112732165A
Authority
CN
China
Prior art keywords
offset
data set
distributed data
time information
elastic distributed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911031175.8A
Other languages
Chinese (zh)
Inventor
张志刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Wodong Tianjun Information Technology Co Ltd
Priority to CN201911031175.8A
Publication of CN112732165A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0614 Improving the reliability of storage systems
    • G06F3/0619 Improving the reliability of storage systems in relation to data integrity, e.g. data losses, bit errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793 Remedial or corrective actions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/0223 User address space allocation, e.g. contiguous or non contiguous base addressing
    • G06F12/023 Free address space management
    • G06F12/0238 Memory management in non-volatile memory, e.g. resistive RAM or ferroelectric memory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Quality & Reliability (AREA)
  • Computer Security & Cryptography (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the application discloses an offset management method, an offset management device and a storage medium, wherein the method comprises the following steps: when Kafka is accessed in Spark Streaming, reading data from the Kafka into an elastic distributed data set; storing offset and time information corresponding to the elastic distributed data set; processing the data in the elastic distributed data set, and storing the offset and the time information corresponding to the elastic distributed data set in a reporting list when the data processing is successful; and reporting the first offset to the Kafka if the elastic distributed data set is the first elastic distributed data set in the report list. Therefore, when the data processing is unsuccessful, the offset corresponding to the elastic distributed data set with the data unprocessed and completed cannot be reported to Kafka, and the problem that the last batch of failed data is lost when the task is restarted is solved.

Description

Offset management method, device and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to an offset management method, an offset management apparatus, and a storage medium.
Background
When Kafka is accessed in Spark Streaming, a fault recovery mechanism needs to be established to prevent data loss caused by abnormal program failures. The checkpoint mechanism provided by Spark Streaming is too restrictive and is not suitable for use in a production environment. A more reasonable fault recovery mechanism in a production environment is to recover from failures by managing the message offsets in Kafka.
In the prior art, data acquired by Spark Streaming from Kafka is stored in an elastic (resilient) distributed data set (RDD). Specifically, after the data acquired from Kafka is stored in the RDD, the offset corresponding to the RDD is reported to Kafka. However, when the task fails abnormally, the data in the RDD has not been processed, yet the offset corresponding to the RDD has already been committed to Kafka; as a result, when the task is started again, the last batch of data, whose processing failed, is lost.
Disclosure of Invention
The embodiment of the application provides an offset management method, an offset management device and a storage medium, and solves the problem of data loss caused when Spark Streaming accesses Kafka.
In a first aspect, an embodiment of the present application provides an offset management method, including:
when Kafka is accessed in Spark Streaming, reading data from the Kafka into an elastic distributed data set;
storing offset and time information corresponding to the elastic distributed data set;
processing the data in the elastic distributed data set, and storing the offset and the time information corresponding to the elastic distributed data set in a reporting list when the data processing is successful;
reporting a first offset to the Kafka if the elastic distributed data set is a first elastic distributed data set in the report list, wherein an interval of the first offset includes an interval of an offset corresponding to the elastic distributed data set.
In a possible implementation manner of the first aspect, a minimum value of the interval of the first offset amount is 0, and a maximum value of the interval of the first offset amount is a maximum value of the interval of the offset amount corresponding to the elastic distributed data set.
In a possible implementation manner of the first aspect, the method further includes:
determining time information corresponding to the first offset;
and storing the first offset and the time information corresponding to the first offset in the report list.
In a possible implementation manner of the first aspect, the method further includes:
and if the minimum value of the interval of the offset corresponding to the elastic distributed data set is equal to 0, determining that the first offset is the offset corresponding to the elastic distributed data set, and the time information corresponding to the first offset is the time information corresponding to the elastic distributed data set.
In a possible implementation manner of the first aspect, the method further includes:
if the minimum value of the interval of the offsets corresponding to the elastic distributed data set is greater than 0, determining that the first offset is the combination of a second offset and the offsets corresponding to the elastic distributed data set, and the time information corresponding to the first offset is the time information corresponding to the second offset;
the minimum value of the interval of the second offset is 0, the maximum value of the interval of the second offset is the minimum value of the interval of the offset corresponding to the elastic distributed data set, and the time information corresponding to the second offset is the time information at the previous moment of the time information corresponding to the elastic distributed data set.
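The two cases above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, a single partition and a half-open interval convention in which the maximum of the second offset equals the minimum of the offset corresponding to the elastic distributed data set, as the text states. The names Entry and first_offset are hypothetical and are not part of the application.

```python
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    lo: int    # minimum value of the offset interval
    hi: int    # maximum value of the offset interval
    time: int  # time information associated with the offset


def first_offset(rdd: Entry, second: Optional[Entry] = None) -> Entry:
    """Return the first offset, with its time information, for a processed RDD."""
    if rdd.lo == 0:
        # Case 1: the RDD interval starts at 0, so it is itself the first offset.
        return rdd
    # Case 2: combine with the second offset, whose interval is [0, rdd.lo].
    if second is None or second.lo != 0 or second.hi != rdd.lo:
        raise ValueError("second offset must span from 0 to the RDD minimum")
    # The combined interval runs from 0 to the RDD maximum and takes the
    # time information of the second offset.
    return Entry(0, rdd.hi, second.time)


case_1 = first_offset(Entry(0, 13, 1))
case_2 = first_offset(Entry(13, 27, 2), Entry(0, 13, 1))
```

In this sketch, case_2 covers the interval from 0 to 27 and carries the time information of the second offset.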
In a possible implementation manner of the first aspect, the storing offset and time information corresponding to the elastically distributed data set includes:
and storing the offset and the time information corresponding to the elastic distributed data set in a memory.
In a possible implementation manner of the first aspect, before storing the offset and the time information corresponding to the elastically distributed data set in a report list, the method further includes:
and acquiring the offset corresponding to the elastic distributed data set from the memory according to the time information corresponding to the elastic distributed data set.
In a second aspect, an embodiment of the present application provides an offset management apparatus, including:
the reading module is used for reading data from the Kafka to an elastic distributed data set when the Kafka is accessed in Spark Streaming;
the storage module is used for storing the offset and the time information corresponding to the elastic distributed data set;
the processing module is used for processing the data in the elastic distributed data set and storing the offset and the time information corresponding to the elastic distributed data set in a reporting list when the data processing is successful;
a reporting module, configured to report a first offset to the Kafka if the elastic distributed data set is a first elastic distributed data set in the reporting list, where an interval of the first offset includes an interval of offsets corresponding to the elastic distributed data set.
In a possible implementation manner of the second aspect, a minimum value of the interval of the first offset amount is 0, and a maximum value of the interval of the first offset amount is a maximum value of the interval of the offset amount corresponding to the elastic distributed data set.
In a possible implementation manner of the second aspect, the apparatus further includes:
the determining module is used for determining the time information corresponding to the first offset;
the storage module is further configured to store the first offset and the time information corresponding to the first offset in the report list.
In a possible implementation manner of the second aspect, the determining module is specifically configured to determine that the first offset is an offset corresponding to the elastic distributed data set if a minimum value of an interval of offsets corresponding to the elastic distributed data set is equal to 0, and the time information corresponding to the first offset is time information corresponding to the elastic distributed data set.
In a possible implementation manner of the second aspect, the determining module is specifically configured to determine that the first offset is a combination of a second offset and an offset corresponding to the elastic distributed data set if a minimum value of an interval of offsets corresponding to the elastic distributed data set is greater than 0, where time information corresponding to the first offset is time information corresponding to the second offset;
the minimum value of the interval of the second offset is 0, the maximum value of the interval of the second offset is the minimum value of the interval of the offset corresponding to the elastic distributed data set, and the time information corresponding to the second offset is the time information at the previous moment of the time information corresponding to the elastic distributed data set.
In a possible implementation manner of the second aspect, the storage module is specifically configured to store the offset and the time information corresponding to the elastic distributed data set in a memory.
In a possible implementation manner of the second aspect, the obtaining module is configured to obtain, from the memory, an offset corresponding to the elastic distributed data set according to time information corresponding to the elastic distributed data set.
In a third aspect, an embodiment of the present application provides an electronic device, including:
a memory for storing a computer program;
the processor is configured to execute the computer program, and in particular, to execute the offset management method according to the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer storage medium, which includes computer instructions that, when executed by a computer, cause the computer to implement the offset management method according to the first aspect of the claims.
In a fifth aspect, the present embodiments provide a computer program product, the program product comprising a computer program, the computer program being stored in a readable storage medium, the computer program being readable from the readable storage medium by at least one processor of a computer, the at least one processor executing the computer program to cause the computer to perform the method of any of the first aspects.
According to the offset management method, the offset management device and the storage medium, when Kafka is accessed in Spark Streaming, data are read from the Kafka to the elastic distributed data set; storing offset and time information corresponding to the elastic distributed data set; processing the data in the elastic distributed data set, and storing the offset and the time information corresponding to the elastic distributed data set in a reporting list when the data processing is successful; and if the elastic distributed data set is the first elastic distributed data set in the report list, reporting the first offset to Kafka, wherein the interval of the first offset comprises the interval of the offset corresponding to the elastic distributed data set. Therefore, when the data processing is unsuccessful, namely the task processing is abnormal, the Spark Streaming does not report the offset corresponding to the elastic distributed data set with the data unprocessed to the Kafka, so that the problem that the last failed batch of data is lost when the task is started again due to the abnormal task can be solved.
Drawings
Fig. 1 is a schematic view of an application scenario according to an embodiment of the present application;
FIG. 2 is a schematic diagram of DStream generated in 4 s;
FIG. 3 is a schematic view of a Kafka partition according to an embodiment of the present application;
fig. 4 is a flowchart of an offset management method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of information interaction according to an embodiment of the present application;
fig. 6 is a flowchart of an offset management method according to an embodiment of the present application;
fig. 7 is a schematic diagram of an offset management apparatus according to an embodiment of the present application;
fig. 8 is a schematic diagram of an offset management apparatus according to an embodiment of the present application;
fig. 9 is a schematic diagram of an offset management apparatus according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Fig. 1 is a schematic view of an application scenario related to an embodiment of the present application, including: Spark Streaming, Kafka, and a storage unit.
Spark Streaming is an extension of the core Spark application programming interface (Core Spark API) and is the module in Spark that processes streaming data; it supports scalable, high-throughput, fault-tolerant processing of real-time data streams. Spark Streaming acquires data from data sources such as Kafka, processes the acquired data, and pushes the processed data to storage units such as files, databases, and real-time dashboards.
To make stream processing easier to reason about, Spark Streaming provides the discretized stream (DStream) object, which represents a continuous input stream. A DStream is a continuous sequence of elastic (resilient) distributed data sets (RDDs); each RDD inside the DStream corresponds to one computation cycle, and every operation applied to a DStream is mapped to operations on its internal RDDs. A DStream is essentially a hash table whose keys are times and whose values are RDDs, holding the RDDs generated in chronological order. Spark Streaming adds each newly generated RDD to the hash table and removes RDDs that are no longer needed, so a DStream can also be understood simply as a dynamic, time-keyed sequence of RDDs. Assuming a batch interval of 1 s, FIG. 2 is a schematic diagram of the DStream generated in 4 s.
For continuous data, Spark Streaming first discretizes the continuous data stream by cutting it into segments. Each cut generates one RDD, and each RDD contains all the data acquired in one time interval.
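The discretization described above can be sketched in plain Python (this is not Spark code). Timestamped records are cut into fixed-interval batches keyed by the end of each interval, mirroring a time-keyed table of RDDs. The function name discretize and the list-of-tuples input are assumptions made only for illustration.

```python
from collections import defaultdict


def discretize(records, interval=1.0):
    """Group (timestamp, value) records into batches keyed by interval number.

    Batch k holds the records whose timestamp lies in [(k - 1) * interval, k * interval).
    """
    batches = defaultdict(list)
    for timestamp, value in records:
        batch_key = int(timestamp // interval) + 1
        batches[batch_key].append(value)
    # Return the batches in chronological order, like a time-keyed sequence.
    return dict(sorted(batches.items()))


stream = [(0.2, "a"), (0.9, "b"), (1.5, "c"), (3.1, "d")]
batches = discretize(stream, interval=1.0)
```

With a 1 s interval, the records at 0.2 s and 0.9 s fall into batch 1, and no batch is created for an interval in which nothing arrived.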
The elastic (resilient) distributed data set (RDD) is the most basic data abstraction in Spark. It represents an immutable, partitionable collection whose elements can be computed in parallel. An RDD has the characteristics of a data flow model: automatic fault tolerance, location-aware scheduling, and scalability. An RDD allows a user to explicitly cache a working set in memory when executing multiple queries, so that subsequent queries can reuse the working set, which greatly increases query speed.
Kafka is a distributed streaming platform with three key capabilities: publishing and subscribing to streams of records, similar to a message queue or enterprise messaging system; storing streams of records in a fault-tolerant manner; and processing streams of records as they occur. It is used for two broad classes of applications: building real-time streaming data pipelines that reliably move data between systems or applications, and building real-time stream processing applications that transform or react to streams of data. Kafka runs on a cluster of one or more servers; the cluster stores streams of records in categories called topics, and each record consists of a key, a value, and a timestamp. Kafka has four core application programming interfaces (APIs): the Producer API allows an application to publish a stream of records to one or more topics; the Consumer API allows an application to subscribe to one or more topics and process the streams of records published to them; the Streams API allows an application to act as a stream processor, consuming an input stream from a topic and publishing an output stream to an output topic, effectively transforming the input stream into an output stream; the Connector API allows building and running reusable producers and consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database may capture every change to a table.
Communication between clients and servers in Kafka uses a simple, high-performance, language-independent TCP protocol. The protocol is versioned and remains backward compatible with older versions. Kafka provides a Java client, and clients are available in many other languages.
A topic is a category or name under which records are published. Topics in Kafka are multi-subscriber; that is, a topic may have zero, one, or many subscribers to the data written to it. For each topic, the Kafka cluster maintains a partitioned log, as shown in FIG. 3, in which each partition is an ordered, immutable sequence of records that is continually appended to, forming a structured commit log. Each record in a partition is assigned a sequential identification number, called the offset, that uniquely identifies the record within the partition.
The Kafka cluster retains all published records for a configurable retention period, whether or not they have been consumed. For example, if the retention policy is two days, a record can be consumed for two days after it is published, after which it is deleted to free disk space. Kafka's performance is effectively constant with respect to data size, so retaining data for a long time is not a problem. In fact, the only metadata retained per consumer is that consumer's offset, or position, in the log. The offset is controlled by the consumer: normally a consumer advances its offset linearly as it reads records, but because the position is controlled by the consumer, it can consume records in any order. For example, the consumer may return to the oldest record to reprocess past data, or may jump to the newest record and start consuming from "now".
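To make the offset semantics concrete, the following toy model (plain Python, not the Kafka client API) treats a partition as an append-only list whose index is the offset, with the read position held by the consumer. The class names PartitionLog and ToyConsumer are hypothetical.

```python
class PartitionLog:
    """Append-only sequence of records; a record's offset is its index."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset assigned to the new record

    def read(self, offset):
        return self._records[offset]

    def end_offset(self):
        return len(self._records)  # offset the next appended record will get


class ToyConsumer:
    """Consumer that owns its position in the log."""

    def __init__(self, log):
        self.log = log
        self.position = 0

    def poll(self):
        """Return the record at the current position, or None at the log end."""
        if self.position >= self.log.end_offset():
            return None
        record = self.log.read(self.position)
        self.position += 1  # normal case: advance linearly
        return record

    def seek(self, offset):
        """Move to any offset, e.g. 0 to reprocess or end_offset() for 'now'."""
        self.position = offset


log = PartitionLog()
assigned = [log.append(r) for r in ("a", "b", "c")]
consumer = ToyConsumer(log)
first_read = [consumer.poll(), consumer.poll()]
consumer.seek(0)            # rewind to the oldest record
replayed = consumer.poll()
consumer.seek(log.end_offset())  # jump to the newest position
at_end = consumer.poll()
```

Because the log itself keeps no per-consumer state, rewinding or skipping ahead changes nothing for other consumers of the same partition.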
As shown in fig. 1, when Kafka is accessed in Spark Streaming, a failure recovery mechanism needs to be established to prevent data loss caused by program abnormal failure. A more reasonable failure recovery mechanism in a production environment is failure recovery by managing message offsets in Kafka.
The Kafka direct discretized stream (Direct DStream) provided by Spark Streaming can start consuming from a specified offset in Kafka. If enable.auto.commit is set to true, Spark Streaming periodically commits the offset to Kafka; however, when the task fails abnormally, the last batch of data, whose processing failed, is lost when the task is started again, because its offset has already been committed to Kafka.
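This failure mode can be reproduced in a toy simulation (plain Python; neither Spark nor Kafka is involved). The offset is committed as soon as a batch is read, the job crashes while processing the second batch, and the restart resumes from the committed offset, so that batch is never processed. The function run_job and its parameters are illustrative assumptions.

```python
def run_job(log, committed, batch_size, crash_on_batch=None):
    """Process `log` starting at `committed`.

    Returns (processed_records, committed_offset). The offset is committed
    BEFORE the batch is processed, which is the behavior that loses data.
    """
    processed = []
    position = committed
    batch_number = 0
    while position < len(log):
        batch = log[position:position + batch_size]
        position += len(batch)
        committed = position  # committed before processing
        batch_number += 1
        if batch_number == crash_on_batch:
            return processed, committed  # simulated crash: batch not processed
        processed.extend(batch)
    return processed, committed


log = list(range(6))
first_run, committed = run_job(log, 0, batch_size=2, crash_on_batch=2)
second_run, committed = run_job(log, committed, batch_size=2)
lost = sorted(set(log) - set(first_run) - set(second_run))
```

Records 2 and 3 belong to the batch that was being processed at the time of the crash; they appear in neither run.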
In order to solve the above technical problem, an offset management mechanism is provided in an embodiment of the present application, which can effectively solve the problem of data loss when Kafka is accessed in Spark Streaming.
In the present embodiment, the phrase "B corresponding to A" means that B is associated with A. In one implementation, B may be determined from A. It should also be understood that determining B from A does not mean determining B from A alone; B may also be determined from A and/or other information.
In the description of the present application, "plurality" means two or more than two unless otherwise specified.
In addition, in order to facilitate clear description of technical solutions of the embodiments of the present application, in the embodiments of the present application, terms such as "first" and "second" are used to distinguish the same items or similar items having substantially the same functions and actions. Those skilled in the art will appreciate that the terms "first," "second," etc. do not limit quantity or order of execution, nor do they indicate any difference in importance.
The following describes in detail the offset management method when Kafka is accessed in Spark Streaming, which is provided in the embodiments of the present application, with specific examples.
Fig. 4 is a flowchart of an offset management method according to an embodiment of the present application, and fig. 5 is a schematic diagram of information interaction according to the embodiment of the present application; as shown in fig. 4 and 5, the method of the embodiment of the present application includes:
S101, when Kafka is accessed in Spark Streaming, reading data from the Kafka into an elastic distributed data set.
The execution subject of the embodiment of the present application is Spark Streaming, which is installed on an electronic device. As shown in fig. 1, Spark Streaming has access to Kafka and can acquire data from Kafka.
Specifically, as shown in fig. 5, Spark Streaming reads data from Kafka and places the read data in RDD.
As can be seen from the above, the data placed in the RDD may be understood as data in one or more partitions specified by Kafka.
S102, storing the offset and the time information corresponding to the elastic distributed data set.
As can be seen from fig. 2, Spark Streaming corresponds to a plurality of consecutive RDDs, that is, to a sequence of RDDs, and each RDD contains all the data acquired in one time interval. For example, as shown in fig. 2, the RDD corresponding to time 1 includes the data acquired in the time interval from time 0 to time 1, so time 1 can be used as the time information of that RDD.
In DStream, RDDs are arranged according to time information, each RDD corresponds to an offset and time information, and the offset corresponding to the RDD can be understood as the offset of the RDD relative to the head of the DStream.
Specifically, as shown in fig. 2 and 3, it is assumed that an RDD stores data of 3 partitions, each partition is an ordered and immutable sequence of records, and each record in a partition is assigned an offset (offset) to uniquely identify each record in the partition. For example, as shown in table 1, the offset corresponding to each RDD includes an offset interval of a piece of data of each partition stored in the RDD.
TABLE 1
RDD    Time information    Offset
RDD    t1                  Partition 0 [0-12], Partition 1 [0-9], Partition 2 [0-12]
RDD    t2                  Partition 0 [13-26], Partition 1 [10-23], Partition 2 [13-27]
RDD    t3                  Partition 0 [27-39], Partition 1 [24-35], Partition 2 [28-40]
……     ……                  ……
As shown in Table 1, the RDD at time t1 stores data with an offset interval of [0-12] in partition 0, data with an offset interval of [0-9] in partition 1, and data with an offset interval of [0-12] in partition 2. The RDD at time t2 stores data with an offset interval of [13-26] in partition 0, [10-23] in partition 1, and [13-27] in partition 2. The RDD at time t3 stores data with an offset interval of [27-39] in partition 0, [24-35] in partition 1, and [28-40] in partition 2. By analogy, the RDD at each time stores the data of one offset interval from each partition.
Based on the above, assume that S101 puts the data obtained from Kafka into the RDD at time t2. From Table 1, the offset corresponding to the RDD at time t2 is then partition 0 [13-26], partition 1 [10-23], and partition 2 [13-27], and the corresponding time information is t2.
When data in the RDD is processed, the offset corresponding to the RDD is lost, but the time information can be retained. Therefore, after each batch of data is read to the RDD, the time information and the offset corresponding to the RDD are stored, for example, as shown in fig. 5, the time information and the offset corresponding to the RDD are stored in the storage unit.
Optionally, the time information and the offset corresponding to the RDD are stored in key-value form, where the key is the time information and the value is the offset. For example, the offset and the time information corresponding to RDD2 are stored as: t2: offset.
In one possible implementation, the offset and the time information corresponding to the RDD may be stored in the memory. Specifically, in practical applications, steps S101 to S102 are executed by a thread that corresponds to a memory area. After the thread obtains the offset and the time information corresponding to the RDD, it stores them in the memory area corresponding to the thread, so that the thread can quickly read the offset and the time information corresponding to the RDD from that memory area.
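A minimal sketch of this key-value storage, assuming an ordinary in-process dictionary stands in for the memory area and using the values from Table 1. The OffsetStore class and the representation of an offset as a mapping from partition number to a (first, last) pair are hypothetical choices made for illustration.

```python
class OffsetStore:
    """In-memory map: time information -> {partition: (first, last)}."""

    def __init__(self):
        self._by_time = {}

    def save(self, time_info, offsets):
        # Copy so that later changes by the caller do not alter stored values.
        self._by_time[time_info] = dict(offsets)

    def lookup(self, time_info):
        # The time information survives processing, so it serves as the key.
        return self._by_time[time_info]


store = OffsetStore()
store.save("t1", {0: (0, 12), 1: (0, 9), 2: (0, 12)})
store.save("t2", {0: (13, 26), 1: (10, 23), 2: (13, 27)})
offsets_t2 = store.lookup("t2")
```

Keying by time information reflects the point made above: the offset is no longer available once the data in the RDD is being processed, but the time information is, so it can be used to recover the offset.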
S103, processing the data in the elastic distributed data set, and storing the offset and the time information corresponding to the RDD in a report list when the data processing is successful.
After the offset and the time information corresponding to the RDD are stored as described above, the data in the RDD is processed.
If the data in the RDD is successfully processed, for example, successfully consumed, the offset and the time information corresponding to the RDD are stored in the report list. If the data processing in the RDD fails, the offset and the time information corresponding to the RDD are not stored in the report list.
In a possible implementation manner, if the data processing in the RDD is successful, the offset corresponding to the RDD is obtained from the offset and the time information corresponding to the RDD stored in the above S102 according to the time information corresponding to the RDD.
For example, the offset and the time information corresponding to the RDD were stored in the memory in S102, so in S103 the offset corresponding to the RDD can be looked up in the memory according to the time information corresponding to the RDD.
According to the method, after the offset corresponding to the RDD is obtained, the offset corresponding to the RDD and the time information are stored in the report list.
As can be seen by continuing with the above example, assuming that the RDD for successful data processing is the RDD at time t2, the offset and the time information corresponding to the RDD at time t2 are stored in the report list, as shown in table 2:
TABLE 2
RDD    Time information    Offset
RDD    t2                  Partition 0 [13-26], Partition 1 [10-23], Partition 2 [13-27]
RDD    t4                  Partition 0 [40-50], Partition 1 [36-46], Partition 2 [41-52]
RDD    t5                  Partition 0 [51-71], Partition 1 [47-60], Partition 2 [53-73]
……     ……                  ……
The offset and the time information corresponding to the RDD stored in table 2 are both the offset and the time information corresponding to the RDD with which the data processing is successful, for example, the RDD at time t4 and the RDD at time t5 in table 2 both process successfully.
In this step, if processing of the data in the RDD fails, for example because the system hangs or loses power while the data is being processed, the offset and the time information corresponding to the RDD are not stored in the report list.
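Step S103 can be sketched as follows in plain Python, assuming dictionaries stand in for the memory of S102 and for the report list. The function process_batch and the handler callbacks are hypothetical. A failure is simulated with an exception; a real hang or power loss would simply never reach the line that updates the report list.

```python
def process_batch(time_info, records, handler, offset_store, report_list):
    """Process one batch; add it to `report_list` only if every record succeeds."""
    try:
        for record in records:
            handler(record)
    except Exception:
        # Processing failed: the batch is NOT added to the report list.
        return False
    # Processing succeeded: look up the saved offset by time information (S102)
    # and store both in the report list.
    report_list[time_info] = offset_store[time_info]
    return True


def failing_handler(record):
    raise RuntimeError("simulated processing failure")


offset_store = {"t2": {0: (13, 26)}, "t3": {0: (27, 39)}}
report_list = {}
ok = process_batch("t2", ["a", "b"], lambda r: None, offset_store, report_list)
bad = process_batch("t3", ["c"], failing_handler, offset_store, report_list)
```

After both calls, the report list contains only the entry for t2, matching the rule that a failed batch never contributes an offset.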
S104, reporting a first offset to the Kafka if the elastic distributed data set is the first elastic distributed data set in the report list, wherein the interval of the first offset comprises the interval of the offset corresponding to the RDD.
After the offset and the time information corresponding to the successfully processed RDD are stored in the report list in S103, it is determined whether the RDD is the first RDD in the report list.
If the RDD is determined to be the first RDD in the report list, a first offset is determined and reported to Kafka.
The first offset interval includes an offset interval corresponding to the RDD.
Optionally, the first offset is an offset corresponding to the RDD.
Optionally, the interval of the first offset is greater than the interval of the offset corresponding to the RDD.
For example, referring to table 2, it is assumed that the RDD at the time t2 in table 2 is the RDD at the time t2, and it is determined that the RDD at the time t2 is the first RDD in the report list, so that a first offset can be determined according to the offset of the RDD at the time t 2. The first offset includes an offset corresponding to the RDD at time t 2. For example, the first offset may be an offset of the RDD at time t2, such as partitions 0[13-26], partitions 1[10-23], and partitions 2[13-27], or the first offset may be greater than the offset of the RDD at time t2, such as the first offset is: partitions 0[10-26], partitions 1[4-23] and partitions 2[5-27 ].
In the present application, the offset corresponding to the RDD is reported to Kafka only after the data in the RDD is detected to have been processed. This avoids the problem that, when a task becomes abnormal, Spark Streaming has already automatically committed the offset to Kafka before the data was fully processed, so that the batch that failed is lost when the task is restarted.
Further, because Spark Streaming allows multiple RDDs to be processed simultaneously, a later RDD may finish processing before an earlier one; in that case the offset of the later RDD is not committed. For example, as shown in Table 2, the data in the RDD at time t4 has been processed successfully, but the data in the RDD at time t3, which precedes it, has not yet been processed, so the offset corresponding to the RDD at time t4 is not reported at this point. Kafka therefore always knows the position up to which data has been processed in order.
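One way to detect that an earlier RDD is still outstanding is to compare offsets directly. The sketch below is an assumption rather than something the text prescribes (the text decides by position in the report list): a hypothetical `continues` helper checks whether a candidate offset starts immediately after an already reported one on every partition, with offsets written as inclusive `(first, last)` pairs as in Table 2.

```python
from typing import Dict, Tuple

Offsets = Dict[int, Tuple[int, int]]  # partition id -> (first, last), inclusive


def continues(reported: Offsets, candidate: Offsets) -> bool:
    """Return True if candidate starts right after reported on every partition."""
    return reported.keys() == candidate.keys() and all(
        candidate[p][0] == reported[p][1] + 1 for p in reported
    )


t2 = {0: (13, 26), 1: (10, 23), 2: (13, 27)}
t4 = {0: (40, 50), 1: (36, 46), 2: (41, 52)}
t5 = {0: (51, 71), 1: (47, 60), 2: (53, 73)}

print(continues(t2, t4))  # False: the offsets of t3 are missing in between
print(continues(t4, t5))  # True: t5 picks up exactly where t4 ends
```

With the values of Table 2, the gap between t2 and t4 is what keeps the offset of t4 from being reported.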
In the offset management method provided by this embodiment of the present application, when Kafka is accessed in Spark Streaming, data is read from Kafka into an RDD; the offset and time information corresponding to the RDD are stored; the data in the RDD is processed, and when the processing succeeds, the offset and time information corresponding to the RDD are stored in a report list; and if the RDD is the first RDD in the report list, a first offset is reported to Kafka, where the interval of the first offset includes the interval of the offset corresponding to the RDD. Thus, when data processing does not succeed, that is, when task processing is abnormal, Spark Streaming does not report to Kafka the offset of an RDD whose data has not been processed, which avoids losing the failed batch when the task is restarted after the abnormality.
In one possible implementation, the minimum value of the interval of the first offset is 0, and the maximum value of the interval of the first offset is the maximum value of the interval of the offset corresponding to the RDD.
Continuing with Table 2, the RDD at time t2 is the first RDD in the report list, and its offset is Partition 0 [13-26], Partition 1 [10-23], and Partition 2 [13-27]. The minimum value of the interval of the first offset is therefore 0 and the maximum value is the maximum of the interval of the offset corresponding to the RDD at time t2; that is, the first offset is Partition 0 [0-26], Partition 1 [0-23], and Partition 2 [0-27].
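As a minimal sketch of this rule, the hypothetical function `first_offset_from_zero` below widens each partition's interval so that it starts at 0 and keeps the RDD's own maximum. The dictionary representation of offsets is an assumption made for illustration.

```python
from typing import Dict, Tuple

Offsets = Dict[int, Tuple[int, int]]  # partition id -> (first, last), inclusive


def first_offset_from_zero(rdd_offsets: Offsets) -> Offsets:
    """Build a first offset whose interval starts at 0 and ends at the RDD's maximum."""
    return {partition: (0, last) for partition, (_first, last) in rdd_offsets.items()}


t2 = {0: (13, 26), 1: (10, 23), 2: (13, 27)}
print(first_offset_from_zero(t2))  # {0: (0, 26), 1: (0, 23), 2: (0, 27)}
```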
Fig. 6 is a flowchart of an offset management method according to an embodiment of the present application. On the basis of the foregoing embodiment, as shown in Fig. 6, the method of this embodiment includes:
S201, when Kafka is accessed in Spark Streaming, reading data from Kafka into an elastic distributed data set (RDD).
S202, storing the offset and time information corresponding to the RDD.
S203, processing the data in the RDD, and storing the offset and time information corresponding to the RDD in a report list when the data processing is successful.
S201 to S203 are the same as S101 to S103; refer to the detailed descriptions of S101 to S103, which are not repeated here.
S204, if the RDD is the first RDD in the report list, determining a first offset and the time information corresponding to the first offset.
S205, storing the first offset and the time information corresponding to the first offset in the report list.
In one example, if the minimum value of the interval of the offset corresponding to the RDD is equal to 0, the first offset is determined to be the offset corresponding to the RDD, and the time information corresponding to the first offset is the time information corresponding to the RDD.
For example, assume the offset corresponding to the RDD in the above step is Partition 0 [0-26], Partition 1 [0-23], and Partition 2 [0-27]; the first offset is then determined to be Partition 0 [0-26], Partition 1 [0-23], and Partition 2 [0-27].
The time information corresponding to the first offset is the time information corresponding to the RDD; for example, if the time information corresponding to the RDD is t2, the time information corresponding to the first offset is also t2.
In another example, if the minimum value of the interval of the offset corresponding to the RDD is greater than 0, the first offset is determined to be the combination of a second offset and the offset corresponding to the RDD, and the time information corresponding to the first offset is the time information corresponding to the second offset.
The minimum value of the interval of the second offset is 0, the maximum value of the interval of the second offset is the minimum value of the interval of the offset corresponding to the RDD, and the time information corresponding to the second offset is the time information of the moment immediately before the time information corresponding to the RDD.
Specifically, if the minimum value of the interval of the offset corresponding to the RDD is greater than 0, other RDDs exist before this RDD; the interval of the total offset corresponding to those RDDs has a minimum value of 0 and a maximum value equal to the minimum value of the interval of the offset corresponding to this RDD.
For example, referring to Table 2, assume the RDD is the RDD at time t2 in Table 2, whose offset is Partition 0 [13-26], Partition 1 [10-23], and Partition 2 [13-27], and that the RDD at time t2 is the first RDD in the report list. It can then be determined that the data in the RDDs before the RDD at time t2 has been fully processed. Assuming the RDD before the RDD at time t2 is the RDD at time t1, the offset corresponding to the RDD at time t1 is Partition 0 [0-12], Partition 1 [0-9], and Partition 2 [0-12], which is taken as the second offset. The offset and time information corresponding to the RDD at time t1 can then be added to Table 2, giving Table 3 below.
TABLE 3
RDD | Time information | Offset
RDD | t1 | Partition 0 [0-12], Partition 1 [0-9], Partition 2 [0-12]
RDD | t2 | Partition 0 [13-26], Partition 1 [10-23], Partition 2 [13-27]
RDD | t4 | Partition 0 [40-50], Partition 1 [36-46], Partition 2 [41-52]
RDD | t5 | Partition 0 [51-71], Partition 1 [47-60], Partition 2 [53-73]
…… | …… | ……
As can be seen from Table 3, the offsets corresponding to the RDD at time t1 and the RDD at time t2 are consecutive, so they can be merged into the offset Partition 0 [0-26], Partition 1 [0-23], and Partition 2 [0-27], and the merged offset is used as the first offset.
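The merge of two consecutive offsets can be sketched as below. `merge_offsets` is a hypothetical helper, and "consecutive" is taken to mean that the later interval starts one past the end of the earlier one, which is how the intervals in Table 3 line up.

```python
from typing import Dict, Tuple

Offsets = Dict[int, Tuple[int, int]]  # partition id -> (first, last), inclusive


def merge_offsets(earlier: Offsets, later: Offsets) -> Offsets:
    """Merge two offsets partition by partition; refuse if they are not consecutive."""
    merged: Offsets = {}
    for partition, (first, last) in earlier.items():
        next_first, next_last = later[partition]
        if next_first != last + 1:
            raise ValueError(
                f"partition {partition}: {last} and {next_first} are not consecutive"
            )
        merged[partition] = (first, next_last)
    return merged


t1 = {0: (0, 12), 1: (0, 9), 2: (0, 12)}
t2 = {0: (13, 26), 1: (10, 23), 2: (13, 27)}
print(merge_offsets(t1, t2))  # {0: (0, 26), 1: (0, 23), 2: (0, 27)}
```

Raising an error on a gap is a design choice of this sketch; it prevents an offset from being extended across data that has not been processed.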
Optionally, the offsets corresponding to the RDD at time t4 and the RDD at time t5, which are also consecutive, may be merged as well, giving the report list shown in Table 4.
TABLE 4
RDD | Time information | Offset
RDD | t1 | Partition 0 [0-26], Partition 1 [0-23], Partition 2 [0-27]
RDD | t4 | Partition 0 [40-71], Partition 1 [36-60], Partition 2 [41-73]
…… | …… | ……
Referring to Table 4, the offset Partition 0 [0-26], Partition 1 [0-23], and Partition 2 [0-27] is used as the first offset, and the time information t1 is used as the time information corresponding to the first offset.
Optionally, if the report list contains other offsets consecutive with the first offset, the first offset and those consecutive offsets are merged into a new first offset.
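Applied to the whole report list, the same idea collapses every run of consecutive entries into one entry that keeps the earliest time information, which is how Table 3 becomes Table 4. The following is a sketch only: `coalesce` and `is_consecutive` are hypothetical names, and time information is represented as an integer.

```python
from typing import Dict, List, Tuple

Offsets = Dict[int, Tuple[int, int]]  # partition id -> (first, last), inclusive
Entry = Tuple[int, Offsets]           # (time information, offsets)


def is_consecutive(earlier: Offsets, later: Offsets) -> bool:
    """True if later starts one past the end of earlier on every partition."""
    return earlier.keys() == later.keys() and all(
        later[p][0] == earlier[p][1] + 1 for p in earlier
    )


def coalesce(report_list: List[Entry]) -> List[Entry]:
    """Merge runs of consecutive entries, keeping the time of the first entry in each run."""
    merged: List[Entry] = []
    for time_info, offsets in sorted(report_list, key=lambda entry: entry[0]):
        if merged and is_consecutive(merged[-1][1], offsets):
            prev_time, prev = merged[-1]
            merged[-1] = (prev_time, {p: (prev[p][0], offsets[p][1]) for p in prev})
        else:
            merged.append((time_info, dict(offsets)))
    return merged


table3 = [
    (1, {0: (0, 12), 1: (0, 9), 2: (0, 12)}),
    (2, {0: (13, 26), 1: (10, 23), 2: (13, 27)}),
    (4, {0: (40, 50), 1: (36, 46), 2: (41, 52)}),
    (5, {0: (51, 71), 1: (47, 60), 2: (53, 73)}),
]
for time_info, offsets in coalesce(table3):
    print(time_info, offsets)
```

After coalescing, only the first entry, the one whose intervals start at 0, is a candidate for reporting; the second entry stays in the list until the missing offsets of t3 arrive.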
S206, reporting the first offset to Kafka.
Continuing with Table 4, the first offset, Partition 0 [0-26], Partition 1 [0-23], and Partition 2 [0-27], is reported to Kafka.
In Table 4, the offset Partition 0 [0-26], Partition 1 [0-23], Partition 2 [0-27] and the offset Partition 0 [40-71], Partition 1 [36-60], Partition 2 [41-73] are not consecutive, which means the data in the RDD at time t3 has not yet been processed. The offset Partition 0 [40-71], Partition 1 [36-60], Partition 2 [41-73] is therefore not reported to Kafka, and from the reported first offset Kafka can determine that the data in Partition 0 [0-26], Partition 1 [0-23], and Partition 2 [0-27] has been processed successfully.
In the offset management method provided by this embodiment of the present application, if the minimum value of the interval of the offset corresponding to the RDD is equal to 0, the first offset is determined to be the offset corresponding to the RDD, and the time information corresponding to the first offset is the time information corresponding to the RDD. If the minimum value of the interval of the offset corresponding to the RDD is greater than 0, the first offset is determined to be the combination of the second offset and the offset corresponding to the RDD, and the time information corresponding to the first offset is the time information corresponding to the second offset, where the minimum value of the interval of the second offset is 0, the maximum value of the interval of the second offset is the minimum value of the interval of the offset corresponding to the RDD, and the time information corresponding to the second offset is the time information of the moment immediately before the time information corresponding to the RDD. In this way, the first offset and its corresponding time information can be determined accurately.
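The two cases can be written as one small function. This is a sketch under assumptions: `determine_first_offset` and its parameters are hypothetical, time information is an integer, and the second offset is made to end one before the RDD's minimum because the tables use inclusive intervals (as with [0-12] followed by [13-26] in Table 3).

```python
from typing import Dict, Tuple

Offsets = Dict[int, Tuple[int, int]]  # partition id -> (first, last), inclusive


def determine_first_offset(
    rdd_offsets: Offsets, rdd_time: int, previous_time: int
) -> Tuple[Offsets, int]:
    """Return (first offset, time information of the first offset)."""
    if all(first == 0 for first, _last in rdd_offsets.values()):
        # Case 1: the RDD already starts at 0, so it is the first offset itself.
        return rdd_offsets, rdd_time
    # Case 2: a second offset covers everything before the RDD ...
    second = {p: (0, first - 1) for p, (first, _last) in rdd_offsets.items()}
    # ... and the first offset is the second offset combined with the RDD's offset.
    first_offset = {p: (second[p][0], last) for p, (_first, last) in rdd_offsets.items()}
    return first_offset, previous_time


t2 = {0: (13, 26), 1: (10, 23), 2: (13, 27)}
print(determine_first_offset(t2, rdd_time=2, previous_time=1))
```

For the RDD at time t2 this yields the offset Partition 0 [0-26], Partition 1 [0-23], Partition 2 [0-27] with time information t1, matching the first row of Table 4.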
Fig. 7 is a schematic diagram of an offset management apparatus according to an embodiment of the present application. The apparatus is applied to Spark Streaming, or the apparatus is Spark Streaming itself. As shown in Fig. 7, the offset management apparatus 100 includes:
a reading module 110, configured to read data from Kafka into an elastic distributed data set (RDD) when Kafka is accessed in Spark Streaming;
a storage module 120, configured to store the offset and time information corresponding to the elastic distributed data set;
a processing module 130, configured to process the data in the elastic distributed data set, and to store the offset and time information corresponding to the elastic distributed data set in a report list when the data processing is successful;
a reporting module 140, configured to report a first offset to Kafka if the elastic distributed data set is the first elastic distributed data set in the report list, where the interval of the first offset includes the interval of the offset corresponding to the elastic distributed data set.
The offset management apparatus of the embodiment of the present application may be configured to implement the technical solution of the foregoing method, and the implementation principle and the technical effect are similar, which are not described herein again.
In a possible implementation manner, the minimum value of the interval of the first offset amount is 0, and the maximum value of the interval of the first offset amount is the maximum value of the interval of the offset amount corresponding to the elastic distributed data set.
Fig. 8 is a schematic diagram of an offset management apparatus according to an embodiment of the present application, and based on the foregoing embodiment, as shown in fig. 8, the offset management apparatus 100 further includes:
a determining module 150, configured to determine time information corresponding to the first offset;
the storage module 120 is further configured to store the first offset and the time information corresponding to the first offset in the report list.
In a possible implementation manner, the determining module 150 is specifically configured to determine that the first offset is an offset corresponding to the elastic distributed data set if a minimum value of an interval of offsets corresponding to the elastic distributed data set is equal to 0, and the time information corresponding to the first offset is time information corresponding to the elastic distributed data set.
In another possible implementation manner, the determining module 150 is specifically configured to determine that the first offset is a combination of a second offset and an offset corresponding to the elastic distributed data set if a minimum value of an interval of offsets corresponding to the elastic distributed data set is greater than 0, where time information corresponding to the first offset is time information corresponding to the second offset;
the minimum value of the interval of the second offset is 0, the maximum value of the interval of the second offset is the minimum value of the interval of the offset corresponding to the elastic distributed data set, and the time information corresponding to the second offset is the time information at the previous moment of the time information corresponding to the elastic distributed data set.
In another possible implementation manner, the storage module 120 is specifically configured to store the offset and the time information corresponding to the elastic distributed data set in a memory.
The offset management apparatus of the embodiment of the present application may be configured to implement the technical solution of the foregoing method, and the implementation principle and the technical effect are similar, which are not described herein again.
Fig. 9 is a schematic diagram of an offset management apparatus according to an embodiment of the present application, and based on the foregoing embodiment, as shown in fig. 9, the offset management apparatus 100 further includes:
an obtaining module 160, configured to obtain, according to the time information corresponding to the elastic distributed data set, an offset corresponding to the elastic distributed data set from the memory.
The offset management apparatus of the embodiment of the present application may be configured to implement the technical solution of the foregoing method, and the implementation principle and the technical effect are similar, which are not described herein again.
Optionally, the apparatuses shown in Figs. 7 to 9 may take the form of a chip product.
Fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device 200 has Spark Streaming installed and can implement the foregoing method embodiments, either in hardware or by hardware executing corresponding software. The hardware or software includes one or more modules or units corresponding to the above functions.
In one possible design, the electronic device 200 includes a processor 210 and a memory 220. The processor 210 is configured to support the electronic device 200 in performing the corresponding functions of the above apparatuses. The memory 220 is coupled to the processor 210 and stores the program instructions and data necessary for the electronic device 200.
When the electronic device 200 is turned on, the processor 210 can read the program instructions and data in the memory 220, interpret and execute the program instructions, and process the data used by the program instructions.
Those skilled in the art will appreciate that fig. 10 shows only one memory 220 and one processor 210 for ease of illustration. In an actual electronic device 200, there may be multiple processors 210 and multiple memories 220. The memory 220 may also be referred to as a storage medium or a storage device, etc., which is not limited in this application.
The electronic device of the embodiment of the present application may be configured to implement the technical solutions of the above device embodiments, and the implementation principles and technical effects thereof are similar and will not be described herein again.
Based on such understanding, the technical solution of the present application in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) or a processor to execute all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium that a computer can access, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses, and units may refer to the corresponding processes in the foregoing method embodiments, and are not described here again. In addition, the method embodiments and the apparatus embodiments may refer to each other, and the same or corresponding content in different embodiments may be cross-referenced, which is not described in detail.

Claims (10)

1. An offset management method, comprising:
when Kafka is accessed in Spark Streaming, reading data from the Kafka into an elastic distributed data set;
storing offset and time information corresponding to the elastic distributed data set;
processing the data in the elastic distributed data set, and storing the offset and the time information corresponding to the elastic distributed data set in a report list when the data processing is successful;
reporting a first offset to the Kafka if the elastic distributed data set is the first elastic distributed data set in the report list, wherein an interval of the first offset includes an interval of the offset corresponding to the elastic distributed data set.
2. The method according to claim 1, wherein a minimum value of the interval of the first offset is 0, and a maximum value of the interval of the first offset is a maximum value of the interval of the offset corresponding to the elastic distributed data set.
3. The method of claim 2, further comprising:
determining time information corresponding to the first offset;
storing the first offset and the time information corresponding to the first offset in the report list.
4. The method of claim 3, further comprising:
if the minimum value of the interval of the offset corresponding to the elastic distributed data set is equal to 0, determining that the first offset is the offset corresponding to the elastic distributed data set, and that the time information corresponding to the first offset is the time information corresponding to the elastic distributed data set.
5. The method of claim 3, further comprising:
if the minimum value of the interval of the offset corresponding to the elastic distributed data set is greater than 0, determining that the first offset is the combination of a second offset and the offset corresponding to the elastic distributed data set, and that the time information corresponding to the first offset is the time information corresponding to the second offset;
wherein the minimum value of the interval of the second offset is 0, the maximum value of the interval of the second offset is the minimum value of the interval of the offset corresponding to the elastic distributed data set, and the time information corresponding to the second offset is the time information of the moment immediately before the time information corresponding to the elastic distributed data set.
6. The method according to any one of claims 1-5, wherein the storing the offset and the time information corresponding to the elastic distributed data set comprises:
storing the offset and the time information corresponding to the elastic distributed data set in a memory.
7. The method of claim 6, wherein before storing the offset and the time information corresponding to the elastic distributed data set in the report list, the method further comprises:
acquiring the offset corresponding to the elastic distributed data set from the memory according to the time information corresponding to the elastic distributed data set.
8. An offset management apparatus, comprising:
a reading module, configured to read data from Kafka into an elastic distributed data set when the Kafka is accessed in Spark Streaming;
a storage module, configured to store the offset and the time information corresponding to the elastic distributed data set;
a processing module, configured to process the data in the elastic distributed data set and to store the offset and the time information corresponding to the elastic distributed data set in a report list when the data processing is successful;
a reporting module, configured to report a first offset to the Kafka if the elastic distributed data set is the first elastic distributed data set in the report list, wherein an interval of the first offset includes an interval of the offset corresponding to the elastic distributed data set.
9. An electronic device, comprising:
a memory, configured to store a computer program;
a processor, configured to execute the computer program to perform the offset management method according to any one of claims 1 to 7.
10. A computer storage medium comprising computer instructions which, when executed by a computer, cause the computer to implement the offset management method of any one of claims 1 to 7.
CN201911031175.8A 2019-10-28 2019-10-28 Offset management method, device and storage medium Pending CN112732165A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911031175.8A CN112732165A (en) 2019-10-28 2019-10-28 Offset management method, device and storage medium

Publications (1)

Publication Number Publication Date
CN112732165A true CN112732165A (en) 2021-04-30

Family

ID=75588990

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911031175.8A Pending CN112732165A (en) 2019-10-28 2019-10-28 Offset management method, device and storage medium

Country Status (1)

Country Link
CN (1) CN112732165A (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20040054977A (en) * 2002-12-20 2004-06-26 삼성에스디에스 주식회사 Distributed mpeg file management system and method thereof
US20090150180A1 (en) * 2005-09-23 2009-06-11 Ron Cohen Method, apparatus and solftware for identifying responders in clinical environment
CN105975582A (en) * 2016-05-05 2016-09-28 重庆市城投金卡信息产业股份有限公司 Method and system for generating RFID (Radio Frequency Identification) data into tripping OD (Origin Destination) matrix on the basis of Spark
WO2016162858A1 (en) * 2015-04-10 2016-10-13 Universita' Degli Studi Di Salerno Purifying apparatus based on photocatalysis through modulation of light emission
US20170006509A1 (en) * 2015-07-02 2017-01-05 Nokia Technologies Oy User equipment adaptation of reporting triggers based on active set size
WO2017117879A1 (en) * 2016-01-08 2017-07-13 中兴通讯股份有限公司 Personal identification processing method, apparatus and system
WO2018094961A1 (en) * 2016-11-28 2018-05-31 华为技术有限公司 Write request processing method, device, and data center
CN108108126A (en) * 2017-12-15 2018-06-01 北京奇艺世纪科技有限公司 A kind of data processing method, device and equipment
CN109656887A (en) * 2018-12-11 2019-04-19 东北大学 A kind of Distributed Time sequence pattern search method of magnanimity high-speed rail axis temperature data
KR20190069229A (en) * 2017-12-11 2019-06-19 한국교통대학교산학협력단 Method and system for managing moving objects in distributed memory
CN110287038A (en) * 2019-06-10 2019-09-27 天翼电子商务有限公司 Promote the method and system of the data-handling efficiency of Spark Streaming frame

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Tan Liang; Zhou Jing: "Real-time traffic data processing platform based on Spark Streaming", Computer Systems & Applications (计算机系统应用), no. 10, pages 137 - 143 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination