CN106776855B - Processing method for reading Kafka data based on Spark Streaming - Google Patents

Processing method for reading Kafka data based on Spark Streaming Download PDF

Info

Publication number
CN106776855B
CN106776855B (application CN201611069230.9A)
Authority
CN
China
Prior art keywords
kafka
data
spark streaming
scheduling
reading
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611069230.9A
Other languages
Chinese (zh)
Other versions
CN106776855A (en)
Inventor
程永新
谢涛
王仁铮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Qingwei Software Co Ltd
Original Assignee
Shanghai Qingwei Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Qingwei Software Co Ltd filed Critical Shanghai Qingwei Software Co Ltd
Priority to CN201611069230.9A priority Critical patent/CN106776855B/en
Publication of CN106776855A publication Critical patent/CN106776855A/en
Application granted granted Critical
Publication of CN106776855B publication Critical patent/CN106776855B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/13File access structures, e.g. distributed indices
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/18File system types
    • G06F16/1805Append-only file systems, e.g. using logs or journals to store data
    • G06F16/1815Journaling file systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/18File system types
    • G06F16/182Distributed file systems

Abstract

The invention discloses a processing method for reading Kafka data based on Spark Streaming, which comprises the following steps: S1) storing the data in topics using Kafka; S2) slicing the real-time input data stream into blocks, in units of time slices, using Spark Streaming; S3) setting the Spark Streaming complement (backfill) scheduling time in advance according to the Kafka failure records; S4) monitoring the Spark Streaming Kafka data reading process in real time; S5) re-reading the Kafka data with Spark Streaming. According to the invention, the Spark Streaming complement scheduling time is set according to the Kafka failure records, the reading process is monitored in real time, and the failed records are read again to backfill the data, so that zero data loss is guaranteed more flexibly and conveniently.

Description

Processing method for reading Kafka data based on Spark Streaming
Technical Field
The invention relates to a Kafka data processing method, in particular to a processing method for reading Kafka data based on Spark Streaming.
Background
Spark Streaming decomposes stream processing into a series of short batch jobs. The batch engine is Spark: the input data of Spark Streaming is divided into segments according to the batch interval (e.g. 1 second), each segment of data is converted into an RDD (Resilient Distributed Dataset) in Spark, each Transformation operation on a DStream in Spark Streaming becomes a Transformation operation on an RDD in Spark, and the RDD intermediate results are stored in memory. The overall stream computation can aggregate the intermediate results according to business requirements or persist the results to an external device. Fig. 1 shows the overall flow of Spark Streaming.
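The micro-batch model described above can be illustrated with a small self-contained sketch (plain Python standing in for Spark; the batch interval and record format are illustrative assumptions, not part of the patent):

```python
from collections import defaultdict

def slice_into_batches(records, batch_interval=1.0):
    """Group (timestamp, value) records into micro-batches of
    `batch_interval` seconds, mimicking how Spark Streaming turns a
    continuous stream into a sequence of RDD-like blocks."""
    batches = defaultdict(list)
    for ts, value in records:
        batches[int(ts // batch_interval)].append(value)
    # Each batch would become one RDD processed by one Spark job.
    return [batches[k] for k in sorted(batches)]

stream = [(0.1, "a"), (0.5, "b"), (1.2, "c"), (2.7, "d")]
print(slice_into_batches(stream))  # [['a', 'b'], ['c'], ['d']]
```

Each returned list corresponds to one block generated for a time slice, which in Spark Streaming would trigger one Spark Job.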
Kafka is a distributed publish-subscribe messaging system. It was originally developed by LinkedIn and later became part of the Apache project. Kafka is a distributed, partitioned, replicated, persistent log service. It is mainly used for processing active streaming data, as shown in Fig. 2.
As is well known, the big data era places ever higher requirements on the real-time performance, stability and accuracy of data processing. An emerging combined architecture couples Spark Streaming with Kafka, exploiting Spark Streaming's memory-based iterative computation and Kafka's high-concurrency data distribution to achieve real-time data processing. However, when Spark Streaming consumes data from Kafka, a potential data-loss scenario can still occur, as follows:
1. Two Executors have received input data from the receiver and cached it in Executor memory; 2. the receiver notifies the input source that the data has been received; 3. the Executors start to process the cached data according to the application code; 4. at this moment, the Driver suddenly crashes; 5. by design, once the Driver crashes, all the Executors it maintains are also killed; 6. since all Executors are killed, the data cached in their memory is lost; as a result, cached data whose receipt has already been acknowledged to the source, but which has not yet been processed, is lost; 7. because the data was cached only in Executor memory, it cannot be recovered, so the data is lost.
From the above, a method guaranteeing zero data loss is urgently needed to ensure the stability of Spark Streaming processing Kafka data.
Disclosure of Invention
The invention aims to solve the technical problem of providing a processing method for reading Kafka data based on Spark Streaming that can effectively prevent data loss and re-consume data from Kafka after failure recovery, thereby guaranteeing zero data loss more flexibly and conveniently when the Spark Streaming program fails.
The technical scheme adopted by the invention to solve the above technical problem is a processing method for reading Kafka data based on Spark Streaming, comprising the following steps: S1) storing the data in topics using Kafka, each topic containing a configurable number of partitions; S2) using Spark Streaming to segment the real-time input data stream into blocks, taking a time slice as the unit, and generating a Spark Job for each block; S3) setting the Spark Streaming complement scheduling time in advance according to the Kafka failure records; S4) monitoring the Spark Streaming Kafka data reading process in real time; S5) re-reading the Kafka data lost on failure with Spark Streaming, based on the Kafka failure records and the scheduling time.
In the above processing method for reading Kafka data based on Spark Streaming, in step S3), two database tables are created in a relational database: a scheduling table and a failure record table. The scheduling table stores the scheduling id, start time, end time, state and creation time; the failure record table stores the failure record id, offset, Kafka topic and Kafka node list. The scheduling id in the scheduling table and the failure record id in the failure record table are in a primary-key/foreign-key relationship.
In the above processing method for reading Kafka data based on Spark Streaming, step S4) includes: while Spark Streaming reads Kafka data, if the corresponding Kafka topic data is not null, obtaining the offset of the data being read from Kafka, and storing the offset, Kafka topic and Kafka node list into the failure record table of the relational database; if the data processing fails, the state in the table is set to failed.
In the above processing method for reading Kafka data based on Spark Streaming, in step S4), Spark Streaming connects directly to the Kafka nodes in Direct mode and obtains the offset of the data being read from Kafka via the createDirectStream method, while marking the state in the scheduling table as in progress; when an exception prevents the program from executing normally while Spark Streaming reads and processes Kafka data, the state in the scheduling table is set to failed.
In the above processing method for reading Kafka data based on Spark Streaming, step S5) includes: first scanning the scheduling table using the state field as the query condition, ordering by the creation time field to obtain the earliest scheduling record, then obtaining its scheduling id, using that id as the query condition on the failure record table to obtain all Kafka failure records, and then re-reading the Kafka data according to the Kafka topic and offset.
In the above processing method for reading Kafka data based on Spark Streaming, in step S4), the scheduling table and the failure record table in the relational database are first read and cached in memory, and a thread then periodically refreshes the cached data to achieve real-time monitoring.
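The periodic in-memory cache refresh described in this step can be sketched with a background thread (a simplified stand-in: the class name, refresh interval, and list-of-dicts data source are illustrative assumptions, and a real implementation would load from the relational database):

```python
import threading
import time

class TableCache:
    """Cache the scheduling/failure tables in memory and refresh them
    periodically on a daemon thread, as step S4) describes."""
    def __init__(self, load, interval=0.05):
        self._load = load              # callable returning fresh table rows
        self._lock = threading.Lock()
        self.rows = load()
        self._interval = interval
        t = threading.Thread(target=self._refresh_loop, daemon=True)
        t.start()

    def _refresh_loop(self):
        while True:
            time.sleep(self._interval)
            fresh = self._load()
            with self._lock:           # swap in the fresh rows atomically
                self.rows = fresh

    def snapshot(self):
        with self._lock:
            return list(self.rows)

source = [{"id": 1, "state": "in_progress"}]
cache = TableCache(lambda: [dict(r) for r in source])
source[0]["state"] = "failed"          # the database changes...
time.sleep(0.2)                        # ...and the cache catches up
print(cache.snapshot()[0]["state"])    # failed
```

Monitoring code reads from the cache via `snapshot()` rather than hitting the database on every check, which is the point of caching the two tables in memory.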
Compared with the prior art, the invention has the following beneficial effects: in the processing method for reading Kafka data based on Spark Streaming, the Spark Streaming complement scheduling time is set according to the Kafka failure records, the reading process is monitored in real time, and the failed records are read again to backfill the data; data loss is thereby effectively prevented, data can be re-consumed from Kafka after failure recovery, and zero data loss is guaranteed more flexibly and conveniently when the Spark Streaming program fails.
Drawings
FIG. 1 is a diagram of a Spark Streaming architecture for use with the present invention;
FIG. 2 is a schematic illustration of Kafka processing of streaming data for use in the present invention;
FIG. 3 is a block diagram of the scheduling table and failure record table model according to the present invention;
FIG. 4 is a flow chart of monitoring Kafka data reading based on Spark Streaming according to the present invention;
FIG. 5 is a flow chart of the failure record complement of the present invention.
Detailed Description
The invention is further described below with reference to the figures and examples.
According to the processing method for reading Kafka data based on Spark Streaming provided by the invention, two database tables are created in a relational database: a scheduling table (control) and a failure record table (failure). The scheduling table stores scheduling information, including the scheduling id, start time, end time, state and creation time. The failure record table stores the details of each failed data record, including the failure record id, offset, topic and Kafka node list. The scheduling id in the scheduling table and the id in the failure record table are in a primary-key/foreign-key relationship.
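A minimal sketch of the two tables in SQLite (the patent names the fields but not their DDL, so the column names and types here are illustrative assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE control (               -- scheduling table
    id           INTEGER PRIMARY KEY,
    start_time   TEXT,
    end_time     TEXT,
    state        TEXT,               -- 'in_progress' / 'failed' / 'success'
    create_time  TEXT
);
CREATE TABLE failure (               -- failure record table
    id           INTEGER,            -- foreign key -> control.id
    kafka_offset INTEGER,
    topic        TEXT,
    brokers      TEXT,               -- Kafka node list
    FOREIGN KEY (id) REFERENCES control (id)
);
""")
conn.execute("INSERT INTO control VALUES (1, '08:00', '08:05', 'failed', '2016-11-29 08:00')")
conn.execute("INSERT INTO failure VALUES (1, 42, 'events', 'kafka1:9092,kafka2:9092')")

# The primary/foreign key link lets a failed schedule be joined to its records.
row = conn.execute("""
    SELECT f.topic, f.kafka_offset
    FROM control c JOIN failure f ON c.id = f.id
    WHERE c.state = 'failed'
""").fetchone()
print(row)  # ('events', 42)
```

The join above is the lookup the complement scheduler relies on: a failed schedule id leads directly to the topic/offset pairs that must be re-read.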
While Spark Streaming reads and processes Kafka data, the offset of the data being read is first acquired from Kafka using Spark Streaming's createDirectStream method, and the offset information is written into the failure record table of the relational database with the state set to in progress.
When an exception occurs while Spark Streaming reads and processes Kafka data and the program cannot execute normally, the state is set to failed according to the captured Exception and the corresponding offset information; otherwise the state is set to success.
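The state-transition logic of these two paragraphs can be sketched as follows (plain Python, with an in-memory dict standing in for the failure record table; the function and field names are illustrative assumptions):

```python
records = {}  # failure-record id -> {offset, topic, state}

def process_batch(record_id, offset, topic, handler):
    """Register the batch as in progress, run the handler, and record
    success or failure so the complement scheduler can replay it later."""
    records[record_id] = {"offset": offset, "topic": topic, "state": "in_progress"}
    try:
        handler()
        records[record_id]["state"] = "success"
    except Exception:
        # Captured exception: mark the record failed for later backfill.
        records[record_id]["state"] = "failed"

process_batch(1, 42, "events", lambda: None)     # handler succeeds
process_batch(2, 43, "events", lambda: 1 / 0)    # handler raises -> failed
print(records[1]["state"], records[2]["state"])  # success failed
```

The key property is that the offset is persisted *before* processing starts, so a crash mid-batch still leaves enough information behind to re-read exactly that data.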
The scheduling table is then set manually in combination with the failure record table to configure the failure complement. When the Spark Streaming program restarts, it scans the scheduling table and the failure record table to obtain the complement strategy, and re-reads the data on the specified Kafka topics.
Spark Streaming can acquire Kafka data in two modes, Receiver mode and Direct mode: Receiver mode connects to the Kafka queue through ZooKeeper, while Direct mode connects directly to the Kafka nodes to acquire the data. The Receiver-based approach acquires data using a Receiver implemented with Kafka's high-level Consumer API. Data obtained by the Receiver from Kafka is stored in the memory of the Spark Executors, and the jobs launched by Spark Streaming then process that data. In the default configuration, however, this approach can lose data on an underlying failure. To enable the high-reliability mechanism and achieve zero data loss, the Write Ahead Log (WAL) mechanism of Spark Streaming must be enabled, which synchronously writes the received Kafka data into a write-ahead log on a distributed file system (such as HDFS). The Receiver mode nevertheless has the following disadvantages: 1. the WAL reduces receiver throughput, because the received data must be persisted to a reliable distributed file system; 2. for some input sources it duplicates data; for example, when reading from Kafka, one copy of the data is kept in the Kafka brokers and another in Spark Streaming. The technical scheme of the invention is built on the Direct method of acquiring Kafka data with Spark Streaming, and compared with the first zero-loss approach it brings notable benefits, specifically: 1. no Kafka receiver is required, as the Executors consume data directly from Kafka using the Simple Consumer API; 2. the WAL mechanism is no longer needed, and data can still be re-consumed from Kafka after failure recovery; 3. exactly-once semantics are preserved, as duplicate data is no longer read from the WAL; 4. zero data loss can be guaranteed more flexibly and conveniently when the Spark Streaming program fails.
The Spark Streaming used by the invention is a real-time computing framework built on Spark: through the rich APIs provided by Spark Streaming and its memory-based high-speed execution engine, users can combine streaming, batch and interactive query applications. With the development of big data, processing requirements keep rising, and the original batch framework MapReduce, suited to offline computation, cannot satisfy services with higher real-time requirements. It is therefore important to ensure that Spark Streaming acquires Kafka data efficiently and stably. To address the loss of data when Spark Streaming acquires Kafka data, the method by which Spark Streaming reads Kafka and backfills failed records mainly involves three aspects: the design of the scheduling and monitoring model, the design of the complement scheduling center, and the design of the monitoring center. The specific implementation process is as follows:
1. A scheduling table (control) and a failure record table (failure) are created in the relational database; the specific table structure is shown in Fig. 3.
2. While Spark Streaming reads and processes Kafka data, the offset of the data being read is first obtained from Kafka by Spark Streaming's createDirectStream method, and the offset information is written into the failure record table of the relational database with the state set to in progress. When an exception occurs while Spark Streaming reads and processes Kafka data, preventing the program from executing normally, an update sets the state to failed according to the captured Exception and the corresponding offset information; otherwise the state is set to success, as shown in Fig. 4.
3. The complement scheduling center is a scheduling center program. As shown in Fig. 5, it first scans the scheduling table using the state field as the query condition, orders by the creation time field to obtain the earliest scheduling record, then obtains the scheduling id, uses that id as the query condition on the failure record table to obtain all Kafka failure records, and then re-reads the Kafka data according to topic and offset for processing.
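The scan-then-replay logic of the complement scheduling center can be sketched against the SQLite tables above (a simplified stand-in: table and column names are illustrative assumptions, `reread` stands in for the actual Kafka re-read, and ascending order on the creation time is assumed to yield the earliest record):

```python
import sqlite3

def complement_schedule(conn, reread):
    """Find the earliest failed schedule, fetch its failure records,
    and re-read each (topic, offset) pair through `reread`."""
    row = conn.execute(
        "SELECT id FROM control WHERE state = 'failed' "
        "ORDER BY create_time ASC LIMIT 1").fetchone()
    if row is None:
        return []                       # nothing to backfill
    failures = conn.execute(
        "SELECT topic, kafka_offset FROM failure WHERE id = ?",
        (row[0],)).fetchall()
    return [reread(topic, off) for topic, off in failures]

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE control (id INTEGER PRIMARY KEY, state TEXT, create_time TEXT);
CREATE TABLE failure (id INTEGER, topic TEXT, kafka_offset INTEGER);
""")
conn.execute("INSERT INTO control VALUES (1, 'failed', '08:00')")
conn.execute("INSERT INTO control VALUES (2, 'failed', '09:00')")
conn.execute("INSERT INTO failure VALUES (1, 'events', 42)")

# Replays the records of schedule 1 (earliest create_time), not schedule 2.
print(complement_schedule(conn, lambda t, o: (t, o)))  # [('events', 42)]
```

In the real system `reread` would re-open a Direct-mode stream on the given topic starting from the stored offset; here it simply echoes its arguments so the selection logic can be verified.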
Although the present invention has been described with respect to the preferred embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (5)

1. A processing method for reading Kafka data based on Spark Streaming, characterized by comprising the following steps:
s1) storing the data in topics using Kafka, each topic containing a configurable number of partitions;
s2) using Spark Streaming to segment the real-time input data stream into blocks, taking a time slice as the unit, and generating a Spark Job for each block;
s3) setting the Spark Streaming complement scheduling time in advance according to the Kafka failure records;
s4) monitoring the Spark Streaming Kafka data reading process in real time;
s5) re-reading the Kafka data lost on failure with Spark Streaming according to the Kafka failure records and the scheduling time;
wherein step S3) creates two database tables in a relational database, namely a scheduling table and a failure record table, the scheduling table storing the scheduling id, start time, end time, state and creation time, the failure record table storing the failure record id, offset, Kafka topic and Kafka node list, and the scheduling id in the scheduling table and the failure record id in the failure record table being in a primary-key/foreign-key relationship.
2. The processing method for reading Kafka data based on Spark Streaming according to claim 1, wherein step S4) includes: while Spark Streaming reads Kafka data, if the corresponding Kafka topic data is not null, obtaining the offset of the data being read from Kafka, and storing the offset, Kafka topic and Kafka node list into the failure record table of the relational database; if the data processing fails, the state in the table is set to failed.
3. The processing method for reading Kafka data based on Spark Streaming according to claim 2, wherein in step S4) Spark Streaming connects directly to the Kafka nodes in Direct mode and obtains the offset of the data being read from Kafka by the createDirectStream method, while marking the state in the scheduling table as in progress; when an exception prevents the program from executing normally while Spark Streaming reads and processes Kafka data, the state in the scheduling table is set to failed.
4. The processing method for reading Kafka data based on Spark Streaming according to claim 3, wherein step S5) includes: first scanning the scheduling table using the state field as the query condition, ordering by the creation time field to obtain the earliest scheduling record, then obtaining its scheduling id, using that id as the query condition on the failure record table to obtain all Kafka failure records, and then re-reading the Kafka data according to the Kafka topic and offset.
5. The processing method for reading Kafka data based on Spark Streaming according to claim 2, wherein in step S4) the scheduling table and the failure record table in the relational database are first read and cached in memory, and a thread then periodically refreshes the cached data to achieve real-time monitoring.
CN201611069230.9A 2016-11-29 2016-11-29 Processing method for reading Kafka data based on Spark Streaming Active CN106776855B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611069230.9A CN106776855B (en) 2016-11-29 2016-11-29 Processing method for reading Kafka data based on Spark Streaming

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611069230.9A CN106776855B (en) 2016-11-29 2016-11-29 Processing method for reading Kafka data based on Spark Streaming

Publications (2)

Publication Number Publication Date
CN106776855A CN106776855A (en) 2017-05-31
CN106776855B true CN106776855B (en) 2020-03-13

Family

ID=58905124

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611069230.9A Active CN106776855B (en) 2016-11-29 2016-11-29 Processing method for reading Kafka data based on Spark Streaming

Country Status (1)

Country Link
CN (1) CN106776855B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108228830A (en) * 2018-01-03 2018-06-29 广东工业大学 A kind of data processing system
CN108062251B (en) * 2018-01-09 2023-02-28 福建星瑞格软件有限公司 Server resource recovery method and computer equipment
CN108647329B (en) * 2018-05-11 2021-08-10 中国联合网络通信集团有限公司 User behavior data processing method and device and computer readable storage medium
CN110912949B (en) * 2018-09-14 2022-11-08 北京京东尚科信息技术有限公司 Method and device for submitting sites
CN111163118B (en) * 2018-11-07 2023-04-07 株式会社日立制作所 Message transmission method and device in Kafka cluster
CN111328013B (en) * 2018-12-17 2021-11-23 中国移动通信集团山东有限公司 Mobile terminal positioning method and system
CN109634784B (en) * 2018-12-24 2021-02-26 康成投资(中国)有限公司 Spark application program control method and device
CN110647570B (en) * 2019-09-20 2022-04-29 百度在线网络技术(北京)有限公司 Data processing method and device and electronic equipment
CN110648178A (en) * 2019-09-24 2020-01-03 四川长虹电器股份有限公司 Method for increasing kafka consumption capacity
CN111061565B (en) * 2019-12-12 2023-08-25 湖南大学 Two-section pipeline task scheduling method and system in Spark environment
CN111124650B (en) * 2019-12-26 2023-10-24 中国建设银行股份有限公司 Stream data processing method and device
CN111241051B (en) * 2020-01-07 2023-09-12 深圳迅策科技有限公司 Batch data processing method and device, terminal equipment and storage medium
CN111526188B (en) * 2020-04-10 2022-11-22 北京计算机技术及应用研究所 System and method for ensuring zero data loss based on Spark Streaming in combination with Kafka
CN112615773B (en) * 2020-12-02 2023-02-28 海南车智易通信息技术有限公司 Message processing method and system
CN112800073B (en) * 2021-01-27 2023-03-28 浪潮云信息技术股份公司 Method for updating Delta Lake based on NiFi

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104636352A (en) * 2013-11-08 2015-05-20 中国石油天然气股份有限公司 SCADA system historical data complement and query processing method based on quality stamp
CN106126721A (en) * 2016-06-30 2016-11-16 北京奇虎科技有限公司 The data processing method of a kind of real-time calculating platform and device
CN106156307A (en) * 2016-06-30 2016-11-23 北京奇虎科技有限公司 The data handling system of a kind of real-time calculating platform and method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10740290B2 (en) * 2015-04-14 2020-08-11 Jetflow Technologies Systems and methods for key-value stores

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104636352A (en) * 2013-11-08 2015-05-20 中国石油天然气股份有限公司 SCADA system historical data complement and query processing method based on quality stamp
CN106126721A (en) * 2016-06-30 2016-11-16 北京奇虎科技有限公司 The data processing method of a kind of real-time calculating platform and device
CN106156307A (en) * 2016-06-30 2016-11-23 北京奇虎科技有限公司 The data handling system of a kind of real-time calculating platform and method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Notes on interfacing Spark Streaming with Kafka"; 鱼儿慢慢游; <https://www.cnblogs.com/missmzt/p/6004868.html>; 2016-10-27; pages 1-3 *

Also Published As

Publication number Publication date
CN106776855A (en) 2017-05-31

Similar Documents

Publication Publication Date Title
CN106776855B (en) Processing method for reading Kafka data based on Spark Streaming
US9917913B2 (en) Large message support for a publish-subscribe messaging system
US9298774B2 (en) Changing the compression level of query plans
CN108694195B (en) Management method and system of distributed data warehouse
US20180365254A1 (en) Method and apparatus for processing information flow data
CN106126403B (en) Oracle database failure analysis methods and device
US9037905B2 (en) Data processing failure recovery method, system and program
US20130227194A1 (en) Active non-volatile memory post-processing
US9836516B2 (en) Parallel scanners for log based replication
US8464269B2 (en) Handling and reporting of object state transitions on a multiprocess architecture
CN110688382A (en) Data storage query method and device, computer equipment and storage medium
US20180032567A1 (en) Method and device for processing data blocks in a distributed database
CN107844506B (en) Method and device for realizing data synchronization of database and cache
CN111046022A (en) Database auditing method based on big data technology
CN116501805A (en) Stream data system, computer equipment and medium
CN113779094B (en) Batch-flow-integration-based data processing method and device, computer equipment and medium
CN109902067B (en) File processing method and device, storage medium and computer equipment
CN115168297A (en) Bypassing log auditing method and device
CN117390040B (en) Service request processing method, device and storage medium based on real-time wide table
CN116932779B (en) Knowledge graph data processing method and device
CN115185995A (en) Enterprise operation rate evaluation method, system, equipment and medium
US9317546B2 (en) Storing changes made toward a limit
US20230325378A1 (en) Online Migration From An Eventually Consistent System To A Strongly Consistent System
CN109710690B (en) Service driving calculation method and system
CN116910079A (en) Method, system, device and storage medium for realizing delay association of Flink with respect to CDC data dimension table

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant