CN106776855B - Processing method for reading Kafka data based on Spark Streaming - Google Patents

Processing method for reading Kafka data based on Spark Streaming Download PDF

Info

Publication number
CN106776855B
CN106776855B (application CN201611069230.9A)
Authority
CN
China
Prior art keywords
kafka
data
spark streaming
scheduling
reading
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611069230.9A
Other languages
Chinese (zh)
Other versions
CN106776855A (en)
Inventor
程永新
谢涛
王仁铮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Qingwei Software Co Ltd
Original Assignee
Shanghai Qingwei Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Qingwei Software Co Ltd filed Critical Shanghai Qingwei Software Co Ltd
Priority to CN201611069230.9A priority Critical patent/CN106776855B/en
Publication of CN106776855A publication Critical patent/CN106776855A/en
Application granted granted Critical
Publication of CN106776855B publication Critical patent/CN106776855B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/13File access structures, e.g. distributed indices
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/18File system types
    • G06F16/1805Append-only file systems, e.g. using logs or journals to store data
    • G06F16/1815Journaling file systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/18File system types
    • G06F16/182Distributed file systems

Abstract

The invention discloses a processing method for reading Kafka data based on Spark Streaming, which comprises the following steps: S1) storing the data in topics using Kafka; S2) slicing the real-time input data stream into blocks, in units of time slices, using Spark Streaming; S3) setting the Spark Streaming complement (backfill) scheduling time in advance according to the Kafka failure records; S4) monitoring the Spark Streaming Kafka data reading process in real time; S5) re-reading the Kafka data with Spark Streaming. According to the invention, the Spark Streaming complement scheduling time is set according to the Kafka failure records, the reading process is monitored in real time, and the failed records are read again to backfill the data, so that zero data loss is guaranteed more flexibly and conveniently.

Description

Processing method for reading Kafka data based on Spark Streaming
Technical Field
The invention relates to a Kafka data processing method, in particular to a processing method for reading Kafka data based on Spark Streaming.
Background
Spark Streaming decomposes stream processing into a series of short batch jobs. The batch engine is Spark: the input data of Spark Streaming is divided into segments according to the batch interval (e.g. 1 second), each segment of data is converted into an RDD (Resilient Distributed Dataset) in Spark, each Transformation operation on a DStream in Spark Streaming becomes a Transformation operation on an RDD in Spark, and the RDD intermediate results are stored in memory. The overall stream computation can aggregate the intermediate results according to business requirements or persist the results to an external device. Fig. 1 shows the overall flow of Spark Streaming.
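The micro-batch model described above can be illustrated with a small self-contained sketch (plain Python standing in for Spark; the batch interval and record format are illustrative assumptions, not part of the patent):

```python
from collections import defaultdict

def slice_into_batches(records, batch_interval=1.0):
    """Group (timestamp, value) records into micro-batches of
    `batch_interval` seconds, mimicking how Spark Streaming turns a
    continuous stream into a sequence of RDD-like blocks."""
    batches = defaultdict(list)
    for ts, value in records:
        batches[int(ts // batch_interval)].append(value)
    # Each batch would become one RDD processed by one Spark job.
    return [batches[k] for k in sorted(batches)]

stream = [(0.1, "a"), (0.5, "b"), (1.2, "c"), (2.7, "d")]
print(slice_into_batches(stream))  # [['a', 'b'], ['c'], ['d']]
```

Each returned list corresponds to one block generated for a time slice, which in Spark Streaming would trigger one Spark Job.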
Kafka is a distributed publish-subscribe messaging system. It was originally developed by LinkedIn and later became part of the Apache project. Kafka is a distributed, partitioned, replicated, persistent log service. It is mainly used for processing active streaming data, as shown in Fig. 2.
As is well known, the big data era places ever higher requirements on the real-time performance, stability and accuracy of data processing. An emerging combined architecture couples Spark Streaming with Kafka, exploiting Spark Streaming's memory-based iterative computation and Kafka's high-concurrency data distribution to achieve real-time data processing. However, when Spark Streaming consumes data from Kafka, a potential data-loss scenario can still occur, as follows:
1. Two Executors have received input data from the receiver and cached it in Executor memory; 2. the receiver notifies the input source that the data has been received; 3. the Executors start to process the cached data according to the application code; 4. at this moment, the Driver suddenly crashes; 5. by design, once the Driver crashes, all the Executors it maintains are also killed; 6. since all Executors are killed, the data cached in their memory is lost; as a result, cached data whose receipt has already been acknowledged to the source, but which has not yet been processed, is lost; 7. because the data was cached only in Executor memory, it cannot be recovered, so the data is lost.
From the above, a method guaranteeing zero data loss is urgently needed to ensure the stability of Spark Streaming processing Kafka data.
Disclosure of Invention
The invention aims to solve the technical problem of providing a processing method for reading Kafka data based on Spark Streaming that can effectively prevent data loss and re-consume data from Kafka after failure recovery, thereby guaranteeing zero data loss more flexibly and conveniently when the Spark Streaming program fails.
The technical scheme adopted by the invention to solve the above technical problem is a processing method for reading Kafka data based on Spark Streaming, comprising the following steps: S1) storing the data in topics using Kafka, each topic containing a configurable number of partitions; S2) using Spark Streaming to segment the real-time input data stream into blocks, taking a time slice as the unit, and generating a Spark Job for each block; S3) setting the Spark Streaming complement scheduling time in advance according to the Kafka failure records; S4) monitoring the Spark Streaming Kafka data reading process in real time; S5) re-reading the Kafka data lost on failure with Spark Streaming, based on the Kafka failure records and the scheduling time.
In the above processing method for reading Kafka data based on Spark Streaming, in step S3), two database tables are created in a relational database: a scheduling table and a failure record table. The scheduling table stores the scheduling id, start time, end time, state and creation time; the failure record table stores the failure record id, offset, Kafka topic and Kafka node list. The scheduling id in the scheduling table and the failure record id in the failure record table are in a primary-key/foreign-key relationship.
In the above processing method for reading Kafka data based on Spark Streaming, step S4) includes: while Spark Streaming reads Kafka data, if the corresponding Kafka topic data is not null, obtaining the offset of the data being read from Kafka, and storing the offset, Kafka topic and Kafka node list into the failure record table of the relational database; if the data processing fails, the state in the table is set to failed.
In the above processing method for reading Kafka data based on Spark Streaming, in step S4), Spark Streaming connects directly to the Kafka nodes in Direct mode and obtains the offset of the data being read from Kafka via the createDirectStream method, while marking the state in the scheduling table as in progress; when an exception prevents the program from executing normally while Spark Streaming reads and processes Kafka data, the state in the scheduling table is set to failed.
In the above processing method for reading Kafka data based on Spark Streaming, step S5) includes: first scanning the scheduling table using the state field as the query condition, ordering by the creation time field to obtain the earliest scheduling record, then obtaining its scheduling id, using that id as the query condition on the failure record table to obtain all Kafka failure records, and then re-reading the Kafka data according to the Kafka topic and offset.
In the above processing method for reading Kafka data based on Spark Streaming, in step S4), the scheduling table and the failure record table in the relational database are first read and cached in memory, and a thread then periodically refreshes the cached data to achieve real-time monitoring.
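The periodic in-memory cache refresh described in this step can be sketched with a background thread (a simplified stand-in: the class name, refresh interval, and list-of-dicts data source are illustrative assumptions, and a real implementation would load from the relational database):

```python
import threading
import time

class TableCache:
    """Cache the scheduling/failure tables in memory and refresh them
    periodically on a daemon thread, as step S4) describes."""
    def __init__(self, load, interval=0.05):
        self._load = load              # callable returning fresh table rows
        self._lock = threading.Lock()
        self.rows = load()
        self._interval = interval
        t = threading.Thread(target=self._refresh_loop, daemon=True)
        t.start()

    def _refresh_loop(self):
        while True:
            time.sleep(self._interval)
            fresh = self._load()
            with self._lock:           # swap in the fresh rows atomically
                self.rows = fresh

    def snapshot(self):
        with self._lock:
            return list(self.rows)

source = [{"id": 1, "state": "in_progress"}]
cache = TableCache(lambda: [dict(r) for r in source])
source[0]["state"] = "failed"          # the database changes...
time.sleep(0.2)                        # ...and the cache catches up
print(cache.snapshot()[0]["state"])    # failed
```

Monitoring code reads from the cache via `snapshot()` rather than hitting the database on every check, which is the point of caching the two tables in memory.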
Compared with the prior art, the invention has the following beneficial effects: in the processing method for reading Kafka data based on Spark Streaming, the Spark Streaming complement scheduling time is set according to the Kafka failure records, the reading process is monitored in real time, and the failed records are read again to backfill the data; data loss is thereby effectively prevented, data can be re-consumed from Kafka after failure recovery, and zero data loss is guaranteed more flexibly and conveniently when the Spark Streaming program fails.
Drawings
FIG. 1 is a diagram of a Spark Streaming architecture for use with the present invention;
FIG. 2 is a schematic illustration of Kafka processing of streaming data for use in the present invention;
FIG. 3 is a block diagram of the scheduling table and failure record table model according to the present invention;
FIG. 4 is a flow chart of monitoring Kafka data reading based on Spark Streaming according to the present invention;
FIG. 5 is a flow chart of the failure record complement of the present invention.
Detailed Description
The invention is further described below with reference to the figures and examples.
According to the processing method for reading Kafka data based on Spark Streaming provided by the invention, two database tables are created in a relational database: a scheduling table (control) and a failure record table (failure). The scheduling table stores scheduling information, including the scheduling id, start time, end time, state and creation time. The failure record table stores the details of each failed data record, including the failure record id, offset, topic and Kafka node list. The scheduling id in the scheduling table and the id in the failure record table are in a primary-key/foreign-key relationship.
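A minimal sketch of the two tables in SQLite (the patent names the fields but not their DDL, so the column names and types here are illustrative assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE control (               -- scheduling table
    id           INTEGER PRIMARY KEY,
    start_time   TEXT,
    end_time     TEXT,
    state        TEXT,               -- 'in_progress' / 'failed' / 'success'
    create_time  TEXT
);
CREATE TABLE failure (               -- failure record table
    id           INTEGER,            -- foreign key -> control.id
    kafka_offset INTEGER,
    topic        TEXT,
    brokers      TEXT,               -- Kafka node list
    FOREIGN KEY (id) REFERENCES control (id)
);
""")
conn.execute("INSERT INTO control VALUES (1, '08:00', '08:05', 'failed', '2016-11-29 08:00')")
conn.execute("INSERT INTO failure VALUES (1, 42, 'events', 'kafka1:9092,kafka2:9092')")

# The primary/foreign key link lets a failed schedule be joined to its records.
row = conn.execute("""
    SELECT f.topic, f.kafka_offset
    FROM control c JOIN failure f ON c.id = f.id
    WHERE c.state = 'failed'
""").fetchone()
print(row)  # ('events', 42)
```

The join above is the lookup the complement scheduler relies on: a failed schedule id leads directly to the topic/offset pairs that must be re-read.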
While Spark Streaming reads and processes Kafka data, the offset of the data being read is first acquired from Kafka using Spark Streaming's createDirectStream method, and the offset information is written into the failure record table of the relational database with the state set to in progress.
When an exception occurs while Spark Streaming reads and processes Kafka data and the program cannot execute normally, the state is set to failed according to the captured Exception and the corresponding offset information; otherwise the state is set to success.
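The state-transition logic of these two paragraphs can be sketched as follows (plain Python, with an in-memory dict standing in for the failure record table; the function and field names are illustrative assumptions):

```python
records = {}  # failure-record id -> {offset, topic, state}

def process_batch(record_id, offset, topic, handler):
    """Register the batch as in progress, run the handler, and record
    success or failure so the complement scheduler can replay it later."""
    records[record_id] = {"offset": offset, "topic": topic, "state": "in_progress"}
    try:
        handler()
        records[record_id]["state"] = "success"
    except Exception:
        # Captured exception: mark the record failed for later backfill.
        records[record_id]["state"] = "failed"

process_batch(1, 42, "events", lambda: None)     # handler succeeds
process_batch(2, 43, "events", lambda: 1 / 0)    # handler raises -> failed
print(records[1]["state"], records[2]["state"])  # success failed
```

The key property is that the offset is persisted *before* processing starts, so a crash mid-batch still leaves enough information behind to re-read exactly that data.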
The scheduling table is then set manually in combination with the failure record table to configure the failure complement. When the Spark Streaming program restarts, it scans the scheduling table and the failure record table to obtain the complement strategy, and re-reads the data on the specified Kafka topics.
Spark Streaming can acquire Kafka data in two modes, Receiver mode and Direct mode: Receiver mode connects to the Kafka queue through ZooKeeper, while Direct mode connects directly to the Kafka nodes to acquire the data. The Receiver-based approach acquires data using a Receiver implemented with Kafka's high-level Consumer API. Data obtained by the Receiver from Kafka is stored in the memory of the Spark Executors, and the jobs launched by Spark Streaming then process that data. In the default configuration, however, this approach can lose data on an underlying failure. To enable the high-reliability mechanism and achieve zero data loss, the Write Ahead Log (WAL) mechanism of Spark Streaming must be enabled, which synchronously writes the received Kafka data into a write-ahead log on a distributed file system (such as HDFS). The Receiver mode nevertheless has the following disadvantages: 1. the WAL reduces receiver throughput, because the received data must be persisted to a reliable distributed file system; 2. for some input sources it duplicates data; for example, when reading from Kafka, one copy of the data is kept in the Kafka brokers and another in Spark Streaming. The technical scheme of the invention is built on the Direct method of acquiring Kafka data with Spark Streaming, and compared with the first zero-loss approach it brings notable benefits, specifically: 1. no Kafka receiver is required, as the Executors consume data directly from Kafka using the Simple Consumer API; 2. the WAL mechanism is no longer needed, and data can still be re-consumed from Kafka after failure recovery; 3. exactly-once semantics are preserved, as duplicate data is no longer read from the WAL; 4. zero data loss can be guaranteed more flexibly and conveniently when the Spark Streaming program fails.
The Spark Streaming used by the invention is a real-time computing framework built on Spark: through the rich APIs provided by Spark Streaming and its memory-based high-speed execution engine, users can combine streaming, batch and interactive query applications. With the development of big data, processing requirements keep rising, and the original batch framework MapReduce, suited to offline computation, cannot satisfy services with higher real-time requirements. It is therefore important to ensure that Spark Streaming acquires Kafka data efficiently and stably. To address the loss of data when Spark Streaming acquires Kafka data, the method by which Spark Streaming reads Kafka and backfills failed records mainly involves three aspects: the design of the scheduling and monitoring model, the design of the complement scheduling center, and the design of the monitoring center. The specific implementation process is as follows:
1. A scheduling table (control) and a failure record table (failure) are created in the relational database; the specific table structure is shown in Fig. 3.
2. While Spark Streaming reads and processes Kafka data, the offset of the data being read is first obtained from Kafka by Spark Streaming's createDirectStream method, and the offset information is written into the failure record table of the relational database with the state set to in progress. When an exception occurs while Spark Streaming reads and processes Kafka data, preventing the program from executing normally, an update sets the state to failed according to the captured Exception and the corresponding offset information; otherwise the state is set to success, as shown in Fig. 4.
3. The complement scheduling center is a scheduling center program. As shown in Fig. 5, it first scans the scheduling table using the state field as the query condition, orders by the creation time field to obtain the earliest scheduling record, then obtains the scheduling id, uses that id as the query condition on the failure record table to obtain all Kafka failure records, and then re-reads the Kafka data according to topic and offset for processing.
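The scan-then-replay logic of the complement scheduling center can be sketched against the SQLite tables above (a simplified stand-in: table and column names are illustrative assumptions, `reread` stands in for the actual Kafka re-read, and ascending order on the creation time is assumed to yield the earliest record):

```python
import sqlite3

def complement_schedule(conn, reread):
    """Find the earliest failed schedule, fetch its failure records,
    and re-read each (topic, offset) pair through `reread`."""
    row = conn.execute(
        "SELECT id FROM control WHERE state = 'failed' "
        "ORDER BY create_time ASC LIMIT 1").fetchone()
    if row is None:
        return []                       # nothing to backfill
    failures = conn.execute(
        "SELECT topic, kafka_offset FROM failure WHERE id = ?",
        (row[0],)).fetchall()
    return [reread(topic, off) for topic, off in failures]

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE control (id INTEGER PRIMARY KEY, state TEXT, create_time TEXT);
CREATE TABLE failure (id INTEGER, topic TEXT, kafka_offset INTEGER);
""")
conn.execute("INSERT INTO control VALUES (1, 'failed', '08:00')")
conn.execute("INSERT INTO control VALUES (2, 'failed', '09:00')")
conn.execute("INSERT INTO failure VALUES (1, 'events', 42)")

# Replays the records of schedule 1 (earliest create_time), not schedule 2.
print(complement_schedule(conn, lambda t, o: (t, o)))  # [('events', 42)]
```

In the real system `reread` would re-open a Direct-mode stream on the given topic starting from the stored offset; here it simply echoes its arguments so the selection logic can be verified.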
Although the present invention has been described with respect to the preferred embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (5)

1. A processing method for reading Kafka data based on Spark Streaming, characterized by comprising the following steps:
s1) storing the data in topics using Kafka, each topic containing a configurable number of partitions;
s2) using Spark Streaming to segment the real-time input data stream into blocks, taking a time slice as the unit, and generating a Spark Job for each block;
s3) setting the Spark Streaming complement scheduling time in advance according to the Kafka failure records;
s4) monitoring the Spark Streaming Kafka data reading process in real time;
s5) re-reading the Kafka data lost on failure with Spark Streaming according to the Kafka failure records and the scheduling time;
wherein step S3) creates two database tables in a relational database, namely a scheduling table and a failure record table, the scheduling table storing the scheduling id, start time, end time, state and creation time, the failure record table storing the failure record id, offset, Kafka topic and Kafka node list, and the scheduling id in the scheduling table and the failure record id in the failure record table being in a primary-key/foreign-key relationship.
2. The processing method for reading Kafka data based on Spark Streaming according to claim 1, wherein step S4) includes: while Spark Streaming reads Kafka data, if the corresponding Kafka topic data is not null, obtaining the offset of the data being read from Kafka, and storing the offset, Kafka topic and Kafka node list into the failure record table of the relational database; if the data processing fails, the state in the table is set to failed.
3. The processing method for reading Kafka data based on Spark Streaming according to claim 2, wherein in step S4) Spark Streaming connects directly to the Kafka nodes in Direct mode and obtains the offset of the data being read from Kafka by the createDirectStream method, while marking the state in the scheduling table as in progress; when an exception prevents the program from executing normally while Spark Streaming reads and processes Kafka data, the state in the scheduling table is set to failed.
4. The processing method for reading Kafka data based on Spark Streaming according to claim 3, wherein step S5) includes: first scanning the scheduling table using the state field as the query condition, ordering by the creation time field to obtain the earliest scheduling record, then obtaining its scheduling id, using that id as the query condition on the failure record table to obtain all Kafka failure records, and then re-reading the Kafka data according to the Kafka topic and offset.
5. The processing method for reading Kafka data based on Spark Streaming according to claim 2, wherein in step S4) the scheduling table and the failure record table in the relational database are first read and cached in memory, and a thread then periodically refreshes the cached data to achieve real-time monitoring.
CN201611069230.9A 2016-11-29 2016-11-29 Processing method for reading Kafka data based on Spark Streaming Active CN106776855B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611069230.9A CN106776855B (en) 2016-11-29 2016-11-29 Processing method for reading Kafka data based on Spark Streaming

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611069230.9A CN106776855B (en) 2016-11-29 2016-11-29 Processing method for reading Kafka data based on Spark Streaming

Publications (2)

Publication Number Publication Date
CN106776855A CN106776855A (en) 2017-05-31
CN106776855B true CN106776855B (en) 2020-03-13

Family

ID=58905124

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611069230.9A Active CN106776855B (en) 2016-11-29 2016-11-29 Processing method for reading Kafka data based on Spark Streaming

Country Status (1)

Country Link
CN (1) CN106776855B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108228830A (en) * 2018-01-03 2018-06-29 广东工业大学 A kind of data processing system
CN108062251B (en) * 2018-01-09 2023-02-28 福建星瑞格软件有限公司 Server resource recovery method and computer equipment
CN108647329B (en) * 2018-05-11 2021-08-10 中国联合网络通信集团有限公司 User behavior data processing method and device and computer readable storage medium
CN110912949B (en) * 2018-09-14 2022-11-08 北京京东尚科信息技术有限公司 Method and device for submitting sites
CN111163118B (en) * 2018-11-07 2023-04-07 株式会社日立制作所 Message transmission method and device in Kafka cluster
CN111328013B (en) * 2018-12-17 2021-11-23 中国移动通信集团山东有限公司 Mobile terminal positioning method and system
CN109634784B (en) * 2018-12-24 2021-02-26 康成投资(中国)有限公司 Spark application program control method and device
CN110647570B (en) * 2019-09-20 2022-04-29 百度在线网络技术(北京)有限公司 Data processing method and device and electronic equipment
CN110648178A (en) * 2019-09-24 2020-01-03 四川长虹电器股份有限公司 Method for increasing kafka consumption capacity
CN111061565B (en) * 2019-12-12 2023-08-25 湖南大学 Two-section pipeline task scheduling method and system in Spark environment
CN111124650B (en) * 2019-12-26 2023-10-24 中国建设银行股份有限公司 Stream data processing method and device
CN111241051B (en) * 2020-01-07 2023-09-12 深圳迅策科技有限公司 Batch data processing method and device, terminal equipment and storage medium
CN111526188B (en) * 2020-04-10 2022-11-22 北京计算机技术及应用研究所 System and method for ensuring zero data loss based on Spark Streaming in combination with Kafka
CN112615773B (en) * 2020-12-02 2023-02-28 海南车智易通信息技术有限公司 Message processing method and system
CN112800073B (en) * 2021-01-27 2023-03-28 浪潮云信息技术股份公司 Method for updating Delta Lake based on NiFi

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104636352A (en) * 2013-11-08 2015-05-20 中国石油天然气股份有限公司 SCADA system historical data complement and query processing method based on quality stamp
CN106126721A (en) * 2016-06-30 2016-11-16 北京奇虎科技有限公司 The data processing method of a kind of real-time calculating platform and device
CN106156307A (en) * 2016-06-30 2016-11-23 北京奇虎科技有限公司 The data handling system of a kind of real-time calculating platform and method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10740290B2 (en) * 2015-04-14 2020-08-11 Jetflow Technologies Systems and methods for key-value stores

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104636352A (en) * 2013-11-08 2015-05-20 中国石油天然气股份有限公司 SCADA system historical data complement and query processing method based on quality stamp
CN106126721A (en) * 2016-06-30 2016-11-16 北京奇虎科技有限公司 The data processing method of a kind of real-time calculating platform and device
CN106156307A (en) * 2016-06-30 2016-11-23 北京奇虎科技有限公司 The data handling system of a kind of real-time calculating platform and method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Notes on interfacing Spark Streaming with Kafka"; 鱼儿慢慢游; <https://www.cnblogs.com/missmzt/p/6004868.html>; 2016-10-27; pages 1-3 *

Also Published As

Publication number Publication date
CN106776855A (en) 2017-05-31

Similar Documents

Publication Publication Date Title
CN106776855B (en) Processing method for reading Kafka data based on Spark Streaming
US9917913B2 (en) Large message support for a publish-subscribe messaging system
US9298774B2 (en) Changing the compression level of query plans
CN108694195B (en) Management method and system of distributed data warehouse
US20180365254A1 (en) Method and apparatus for processing information flow data
CN106126403B (en) Oracle database failure analysis methods and device
US9037905B2 (en) Data processing failure recovery method, system and program
US20130227194A1 (en) Active non-volatile memory post-processing
US9836516B2 (en) Parallel scanners for log based replication
US8464269B2 (en) Handling and reporting of object state transitions on a multiprocess architecture
CN110688382A (en) Data storage query method and device, computer equipment and storage medium
US20180032567A1 (en) Method and device for processing data blocks in a distributed database
CN107844506B (en) Method and device for realizing data synchronization of database and cache
CN111046022A (en) Database auditing method based on big data technology
CN116501805A (en) Stream data system, computer equipment and medium
CN113779094B (en) Batch-flow-integration-based data processing method and device, computer equipment and medium
CN109902067B (en) File processing method and device, storage medium and computer equipment
CN115168297A (en) Bypassing log auditing method and device
CN117390040B (en) Service request processing method, device and storage medium based on real-time wide table
CN116932779B (en) Knowledge graph data processing method and device
CN115185995A (en) Enterprise operation rate evaluation method, system, equipment and medium
US9317546B2 (en) Storing changes made toward a limit
US20230325378A1 (en) Online Migration From An Eventually Consistent System To A Strongly Consistent System
CN109710690B (en) Service driving calculation method and system
CN116910079A (en) Method, system, device and storage medium for realizing delay association of Flink with respect to CDC data dimension table

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant