CN106776855B - Processing method for reading Kafka data based on Spark Streaming - Google Patents
- Publication number: CN106776855B
- Application number: CN201611069230.9A
- Authority: CN (China)
- Prior art keywords: kafka, data, spark streaming, scheduling, reading
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/284—Relational databases
- G06F16/13—File access structures, e.g. distributed indices
- G06F16/1815—Journaling file systems
- G06F16/182—Distributed file systems
Abstract
The invention discloses a processing method for reading Kafka data based on Spark Streaming, comprising the following steps: S1) store data in topics using Kafka; S2) use Spark Streaming to slice the real-time input data stream into blocks, with a time slice as the unit; S3) set the Spark Streaming complement (backfill) scheduling time in advance according to the recorded Kafka failure data; S4) monitor the process of Spark Streaming reading Kafka data in real time; S5) have Spark Streaming re-read the Kafka data. Because the complement scheduling time is set from the recorded Kafka failures, the reading process is monitored in real time, and the failed records are read again as a backfill, zero data loss is guaranteed more flexibly and conveniently.
Description
Technical Field
The invention relates to a Kafka data processing method, in particular to a processing method for reading Kafka data based on Spark Streaming.
Background
Spark Streaming decomposes stream processing into a series of short batch jobs. The batch engine is Spark: the input data of Spark Streaming is divided into segments (a discretized stream) according to the batch size (e.g. 1 second), each segment is converted into an RDD (Resilient Distributed Dataset) in Spark, Transformation operations on the DStream in Spark Streaming become Transformation operations on the RDDs in Spark, and the intermediate results are kept in memory. The overall stream computation can aggregate the intermediate results according to business requirements or store the results to an external device. Fig. 1 shows the overall flow of Spark Streaming.
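The discretization described above can be modeled in a few lines. The following is an illustrative sketch of grouping timestamped records into fixed-width time slices; it is not Spark's implementation, and the function name is invented for the example:

```python
def slice_into_batches(records, batch_size_s=1.0):
    """Group (timestamp, payload) records into per-time-slice blocks,
    mirroring how Spark Streaming discretizes a stream into per-batch RDDs."""
    batches = {}
    for ts, payload in records:
        slot = int(ts // batch_size_s)  # which time slice this record falls in
        batches.setdefault(slot, []).append(payload)
    # Emit blocks in time order, one block per non-empty slice
    return [batches[k] for k in sorted(batches)]

records = [(0.2, "a"), (0.7, "b"), (1.1, "c"), (2.5, "d")]
print(slice_into_batches(records))  # -> [['a', 'b'], ['c'], ['d']]
```

Each emitted block corresponds to one micro-batch, i.e. one Spark job in the real system.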
Kafka is a distributed publish-subscribe messaging system. It was originally developed at LinkedIn and later became part of the Apache project. Kafka is a distributed, partitioned, replicated, persistent log service, mainly used for processing active streaming data, as shown in Fig. 2.
As is well known, the big data era places ever higher requirements on the real-time performance, stability and accuracy of data processing. An emerging combined architecture connects Spark Streaming to Kafka, using Spark Streaming's memory-based iterative computation together with Kafka's high-concurrency data distribution capability to achieve real-time data processing. However, when Spark Streaming consumes from Kafka, a potential data loss scenario can still occur, as follows:
1. Two executors have received input data from the receiver and cached it in executor memory; 2. the receiver notifies the input source that the data has been received; 3. the executors start to process the cached data according to the application code; 4. at this moment, the driver suddenly crashes; 5. by design, once the driver goes down, the executors it maintains are all killed; 6. since all executors are killed, the data cached in their memory is also lost, so cached data whose receipt has already been acknowledged to the source but which has not yet been processed is lost; 7. the cached data cannot be recovered, because it was held only in executor memory; hence data is lost.
From the above, a method that guarantees zero loss is urgently needed to ensure the stability of Spark Streaming processing Kafka data.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a processing method for reading Kafka data based on Spark Streaming that can effectively prevent data loss and re-consume data from Kafka after failure recovery, thereby guaranteeing zero data loss more flexibly and conveniently when the Spark Streaming program encounters an exception.
The technical scheme adopted by the invention to solve the above problem is a processing method for reading Kafka data based on Spark Streaming, comprising the following steps: S1) store data in topics using Kafka, each topic containing a configurable number of partitions; S2) use Spark Streaming to segment the real-time input data stream into blocks, with a time slice as the unit, and generate a Spark job for each block; S3) set the Spark Streaming complement scheduling time in advance according to the recorded Kafka failure data; S4) monitor the process of Spark Streaming reading Kafka data in real time; S5) have Spark Streaming re-read the Kafka data lost in the failure, according to the Kafka failure records and the scheduling time.
In the above processing method for reading Kafka data based on Spark Streaming, step S3) creates two tables in a relational database: a scheduling table and a failure record table. The scheduling table stores the schedule id, start time, end time, state and creation time; the failure record table stores the failure record id, offset, Kafka topic and Kafka node list. The schedule id in the scheduling table and the failure record id in the failure record table are linked by a primary key/foreign key relationship.
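As a concrete sketch of this two-table model, the schema below uses SQLite for illustration; the exact table and column names are assumptions, since the text only fixes which fields exist:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE control (               -- scheduling table
    schedule_id  INTEGER PRIMARY KEY,
    start_time   TEXT,
    end_time     TEXT,
    state        TEXT,               -- e.g. 'in_progress' | 'failure' | 'success'
    create_time  TEXT
);
CREATE TABLE failure_record (        -- failure record table
    record_id    INTEGER PRIMARY KEY,
    schedule_id  INTEGER REFERENCES control(schedule_id),  -- PK/FK link to control
    kafka_topic  TEXT,
    kafka_offset INTEGER,
    broker_list  TEXT                -- Kafka node list
);
""")
print([r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")])
# -> ['control', 'failure_record']
```

The `REFERENCES` clause is the primary/foreign key relationship between the schedule id and the failure records described above.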
In the above processing method for reading Kafka data based on Spark Streaming, step S4) includes: while Spark Streaming reads Kafka data, if the corresponding Kafka topic data is not null, obtain the offset of the data being read from Kafka and store the offset, Kafka topic and Kafka node list into the failure record table in the relational database; if data processing raises an exception, change the state in the scheduling table to failure.
In the above processing method for reading Kafka data based on Spark Streaming, in step S4) Spark Streaming connects directly to the Kafka nodes in Direct mode, obtains the offset of the data being read from Kafka via the createDirectStream method, and marks the state in the scheduling table as in progress; when an exception prevents the program from executing normally while Spark Streaming reads and processes Kafka data, the state in the scheduling table is changed to failure.
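The state transitions of step S4) can be sketched as follows, with SQLite standing in for the relational database and a plain callable standing in for the Spark Streaming job body; all names are invented for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE control (schedule_id INTEGER PRIMARY KEY, state TEXT, create_time TEXT);
CREATE TABLE failure_record (schedule_id INTEGER, kafka_topic TEXT, kafka_offset INTEGER);
""")

def process_batch(conn, schedule_id, topic, offsets, handler):
    """Record offsets and mark the schedule 'in_progress' before processing;
    flip the state to 'failure' if the handler raises, 'success' otherwise."""
    conn.execute("INSERT INTO control VALUES (?, 'in_progress', datetime('now'))",
                 (schedule_id,))
    conn.executemany("INSERT INTO failure_record VALUES (?, ?, ?)",
                     [(schedule_id, topic, o) for o in offsets])
    try:
        handler(offsets)          # stands in for the Spark Streaming job body
        state = "success"
    except Exception:
        state = "failure"
    conn.execute("UPDATE control SET state = ? WHERE schedule_id = ?",
                 (state, schedule_id))
    return state

def failing_job(offsets):
    raise RuntimeError("executor lost")   # simulated driver/executor crash

print(process_batch(conn, 1, "events", [100, 101], failing_job))   # -> failure
print(process_batch(conn, 2, "events", [102], lambda offs: None))  # -> success
```

Because the offsets are persisted before processing starts, a batch that dies mid-flight leaves behind exactly the records needed to replay it.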
In the above processing method for reading Kafka data based on Spark Streaming, step S5) includes: first scan the scheduling table with the state field as the query condition and obtain the earliest schedule record by sorting on the creation time field; then take its schedule id, query the failure record table with that field as the condition to obtain all Kafka failure records, and re-read the Kafka data according to the Kafka topic and offset.
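The backfill lookup of step S5) reduces to two queries. A minimal sketch, again with SQLite standing in for the relational database and invented table names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE control (schedule_id INTEGER, state TEXT, create_time TEXT);
CREATE TABLE failure_record (schedule_id INTEGER, kafka_topic TEXT, kafka_offset INTEGER);
INSERT INTO control VALUES (1, 'failure', '2016-11-29 10:00:00');
INSERT INTO control VALUES (2, 'failure', '2016-11-29 11:00:00');
INSERT INTO control VALUES (3, 'success', '2016-11-29 09:00:00');
INSERT INTO failure_record VALUES (1, 'events', 100);
INSERT INTO failure_record VALUES (1, 'events', 101);
INSERT INTO failure_record VALUES (2, 'events', 200);
""")

def offsets_to_replay(conn):
    """State field as the query condition, creation time as the sort key:
    pick the earliest failed schedule, then fetch its topic/offset pairs."""
    row = conn.execute(
        "SELECT schedule_id FROM control "
        "WHERE state = 'failure' ORDER BY create_time LIMIT 1").fetchone()
    if row is None:
        return []
    return conn.execute(
        "SELECT kafka_topic, kafka_offset FROM failure_record "
        "WHERE schedule_id = ?", (row[0],)).fetchall()

print(offsets_to_replay(conn))  # -> [('events', 100), ('events', 101)]
```

The returned topic/offset pairs are what would be handed back to Spark Streaming to re-read from Kafka.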
In the above processing method for reading Kafka data based on Spark Streaming, step S4) first reads the scheduling table and the failure record table from the relational database into an in-memory cache, and a thread then refreshes the cached data periodically to achieve real-time monitoring.
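This cache-plus-refresher arrangement can be sketched as follows: a daemon thread periodically re-reads the table into a shared cache, which is what gives the monitor its near-real-time view (illustrative only; names and intervals are invented):

```python
import sqlite3
import threading
import time

# In-memory stand-in for the relational database holding the scheduling table.
db = sqlite3.connect(":memory:", check_same_thread=False)
db.execute("CREATE TABLE control (schedule_id INTEGER, state TEXT)")
db.execute("INSERT INTO control VALUES (1, 'in_progress')")
db.commit()

cache = {"rows": []}          # monitoring cache shared with the refresher thread
lock = threading.Lock()
stop = threading.Event()

def refresh(interval_s=0.05):
    # Periodically re-read the monitoring table into the in-memory cache.
    while not stop.is_set():
        rows = db.execute("SELECT schedule_id, state FROM control").fetchall()
        with lock:
            cache["rows"] = rows
        stop.wait(interval_s)

t = threading.Thread(target=refresh, daemon=True)
t.start()

# A failure elsewhere updates the table; the monitor picks it up on refresh.
db.execute("UPDATE control SET state='failure' WHERE schedule_id=1")
db.commit()
time.sleep(0.3)               # give the refresher a cycle to catch the change
stop.set()
t.join()
print(cache["rows"])  # -> [(1, 'failure')]
```

Readers of the cache see a view at most one refresh interval stale, which is the trade-off this design accepts in exchange for not querying the database on every check.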
Compared with the prior art, the invention has the following beneficial effects: in the processing method for reading Kafka data based on Spark Streaming, the complement scheduling time is set according to the recorded Kafka failures, the reading process is monitored in real time, and the failed records are read again as a backfill; this effectively prevents data loss, allows data to be re-consumed from Kafka after failure recovery, and guarantees zero data loss more flexibly and conveniently when the Spark Streaming program encounters an exception.
Drawings
FIG. 1 is a diagram of a Spark Streaming architecture for use with the present invention;
FIG. 2 is a schematic illustration of Kafka processing of streaming data for use in the present invention;
FIG. 3 is a block diagram of the scheduling table and failure record table model according to the present invention;
FIG. 4 is a flow chart of monitoring Kafka data reading based on Spark Streaming according to the present invention;
FIG. 5 is a flow chart of the failure record complement of the present invention.
Detailed Description
The invention is further described below with reference to the figures and examples.
According to the processing method for reading Kafka data based on Spark Streaming provided by the invention, two tables are created in a relational database: a scheduling table (control) and a failure record table (failure). The scheduling table stores scheduling information, including the schedule id, start time, end time, state, creation time, etc. The failure record table stores the details of each failed data record, including the failure record id, offset, topic, Kafka node list, etc. The schedule id in the scheduling table and the id in the failure record table are linked by a primary key/foreign key relationship.
While Spark Streaming reads and processes Kafka data, it first obtains the offset of the data being read from Kafka using Spark Streaming's createDirectStream method and writes the offset information into the failure record table in the relational database, with the state marked as in progress.
If an exception occurs while Spark Streaming reads and processes Kafka data and the program cannot execute normally, the state is changed to failure based on the captured Exception information and the corresponding offset information; otherwise the state is updated to success.
Combining the failure record table, a schedule is set manually in the scheduling table to configure the failure backfill. When the Spark Streaming program restarts, it scans the scheduling table and the failure record table to obtain the backfill strategy, and re-reads the data on the specified Kafka topic.
Spark Streaming can obtain Kafka data in two ways: Receiver mode and Direct mode. Receiver mode connects to the Kafka queue through ZooKeeper, while Direct mode connects directly to the Kafka nodes to obtain the data. The Receiver-based approach uses a Receiver implemented with Kafka's high-level Consumer API: data obtained by the Receiver from Kafka is stored in Spark executor memory, and the jobs launched by Spark Streaming then process that data. Under the default configuration, however, this approach can lose data on an underlying failure. To enable the high-reliability mechanism and reduce data loss to zero, Spark Streaming's Write Ahead Log (WAL) mechanism must be enabled, which synchronously writes the received Kafka data into a write-ahead log on a distributed file system (such as HDFS). The Receiver mode therefore has the following disadvantages: 1. the WAL reduces receiver throughput, because the received data must first be saved to a reliable distributed file system; 2. for some input sources it duplicates data; for example, when reading from Kafka, one copy of the data is kept in Kafka's brokers and another in Spark Streaming. The technical scheme of the invention is built on Spark Streaming's Direct method of obtaining Kafka data, which brings notable benefits compared with the first zero-loss approach: 1. no Kafka Receiver is required, as the executors consume data directly from Kafka using the Simple Consumer API; 2. the WAL mechanism is no longer needed, and data can still be re-consumed from Kafka after failure recovery; 3. exactly-once semantics are preserved, since duplicate data is no longer read from the WAL; 4. zero loss can be guaranteed more flexibly and conveniently when the Spark Streaming program encounters an exception.
Spark Streaming, as used by the invention, is a real-time computing framework built on Spark. Through the rich APIs Spark Streaming provides and its memory-based high-speed execution engine, users can combine streaming, batch and interactive query applications. With the development of big data, processing requirements keep rising, and the original batch framework MapReduce, which is suited to offline computation, cannot satisfy services with higher real-time requirements. Ensuring that Spark Streaming obtains Kafka data efficiently and stably is therefore very important. Aiming at the problem of Spark Streaming losing Kafka data, the method for Spark Streaming to read Kafka with failure backfill mainly involves three aspects: the design of the scheduling and monitoring model, the design of the backfill scheduling center, and the design of the monitoring center. The specific implementation process is as follows:
1. Create a scheduling table (control) and a failure record table (failure) in the relational database; the table structure is shown in Fig. 3.
2. While Spark Streaming reads and processes Kafka data, first obtain the offset of the data being read from Kafka via Spark Streaming's createDirectStream method and write the offset and related information into the failure record table in the relational database, with the state marked as in progress. If an exception occurs while Spark Streaming reads and processes Kafka data and the program cannot execute normally, call an update to change the state to failure based on the captured Exception information and the corresponding offset information; otherwise the state is updated to success, as shown in Fig. 4.
3. The backfill scheduling center is a scheduling program. As shown in Fig. 5, it first scans the scheduling table with the state field as the query condition, obtains the earliest schedule record by sorting on the creation time field, takes its schedule id, queries the failure record table with that field as the condition to obtain all Kafka failure records, and then re-reads and processes the Kafka data according to the topic and offset.
Although the present invention has been described with respect to the preferred embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (5)
1. A processing method for reading Kafka data based on Spark Streaming is characterized by comprising the following steps:
s1) storing data in topics using Kafka, each topic containing a configurable number of partitions;
s2) using Spark Streaming to segment the real-time input data stream into blocks, with a time slice as the unit, and generating a Spark job for each block;
s3) setting the Spark Streaming complement scheduling time in advance according to the recorded Kafka failure data;
s4) monitoring the process of Spark Streaming reading Kafka data in real time;
s5) re-reading, by Spark Streaming, the Kafka data lost in the failure according to the Kafka failure records and the scheduling time;
wherein step S3) creates two tables in a relational database, namely a scheduling table and a failure record table, the scheduling table storing the schedule id, start time, end time, state and creation time, the failure record table storing the failure record id, offset, Kafka topic and Kafka node list, and the schedule id in the scheduling table and the failure record id in the failure record table being linked by a primary key/foreign key relationship.
2. The processing method for reading Kafka data based on Spark Streaming according to claim 1, wherein step S4) includes: while Spark Streaming reads Kafka data, if the corresponding Kafka topic data is not null, obtaining the offset of the data being read from Kafka and storing the offset, Kafka topic and Kafka node list into the failure record table in the relational database; and, if data processing raises an exception, changing the state in the scheduling table to failure.
3. The processing method for reading Kafka data based on Spark Streaming according to claim 2, wherein in step S4) Spark Streaming connects directly to the Kafka nodes in Direct mode, obtains the offset of the data being read from Kafka via the createDirectStream method, and marks the state in the scheduling table as in progress; and when an exception prevents the program from executing normally while Spark Streaming reads and processes Kafka data, the state in the scheduling table is changed to failure.
4. The processing method for reading Kafka data based on Spark Streaming according to claim 3, wherein step S5) includes: first scanning the scheduling table with the state field as the query condition and obtaining the earliest schedule record by sorting on the creation time field; then taking its schedule id, querying the failure record table with that field as the condition to obtain all Kafka failure records, and re-reading the Kafka data according to the Kafka topic and offset.
5. The processing method for reading Kafka data based on Spark Streaming according to claim 2, wherein step S4) first reads the scheduling table and the failure record table from the relational database into an in-memory cache, and a thread then periodically refreshes the cached data to achieve real-time monitoring.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611069230.9A CN106776855B (en) | 2016-11-29 | 2016-11-29 | Processing method for reading Kafka data based on Spark Streaming |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106776855A | 2017-05-31 |
CN106776855B | 2020-03-13 |
Family
ID=58905124
Families Citing this family (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108228830A (en) * | 2018-01-03 | 2018-06-29 | 广东工业大学 | A kind of data processing system |
CN108062251B (en) * | 2018-01-09 | 2023-02-28 | 福建星瑞格软件有限公司 | Server resource recovery method and computer equipment |
CN108647329B (en) * | 2018-05-11 | 2021-08-10 | 中国联合网络通信集团有限公司 | User behavior data processing method and device and computer readable storage medium |
CN110912949B (en) * | 2018-09-14 | 2022-11-08 | 北京京东尚科信息技术有限公司 | Method and device for submitting sites |
CN111163118B (en) * | 2018-11-07 | 2023-04-07 | 株式会社日立制作所 | Message transmission method and device in Kafka cluster |
CN111328013B (en) * | 2018-12-17 | 2021-11-23 | 中国移动通信集团山东有限公司 | Mobile terminal positioning method and system |
CN109634784B (en) * | 2018-12-24 | 2021-02-26 | 康成投资(中国)有限公司 | Spark application program control method and device |
CN110647570B (en) * | 2019-09-20 | 2022-04-29 | 百度在线网络技术(北京)有限公司 | Data processing method and device and electronic equipment |
CN110648178A (en) * | 2019-09-24 | 2020-01-03 | 四川长虹电器股份有限公司 | Method for increasing kafka consumption capacity |
CN111061565B (en) * | 2019-12-12 | 2023-08-25 | 湖南大学 | Two-section pipeline task scheduling method and system in Spark environment |
CN111124650B (en) * | 2019-12-26 | 2023-10-24 | 中国建设银行股份有限公司 | Stream data processing method and device |
CN111241051B (en) * | 2020-01-07 | 2023-09-12 | 深圳迅策科技有限公司 | Batch data processing method and device, terminal equipment and storage medium |
CN111526188B (en) * | 2020-04-10 | 2022-11-22 | 北京计算机技术及应用研究所 | System and method for ensuring zero data loss based on Spark Streaming in combination with Kafka |
CN112615773B (en) * | 2020-12-02 | 2023-02-28 | 海南车智易通信息技术有限公司 | Message processing method and system |
CN112800073B (en) * | 2021-01-27 | 2023-03-28 | 浪潮云信息技术股份公司 | Method for updating Delta Lake based on NiFi |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104636352A (en) * | 2013-11-08 | 2015-05-20 | 中国石油天然气股份有限公司 | SCADA system historical data complement and query processing method based on quality stamp |
CN106126721A (en) * | 2016-06-30 | 2016-11-16 | 北京奇虎科技有限公司 | The data processing method of a kind of real-time calculating platform and device |
CN106156307A (en) * | 2016-06-30 | 2016-11-23 | 北京奇虎科技有限公司 | The data handling system of a kind of real-time calculating platform and method |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10740290B2 (en) * | 2015-04-14 | 2020-08-11 | Jetflow Technologies | Systems and methods for key-value stores |
Non-Patent Citations (1)
Title |
---|
"spark streaming 对接 kafka记录";鱼儿慢慢游;《https://www.cnblogs.com/missmzt/p/6004868.html》;20161027;第1-3页 * |
Legal Events
Code | Title |
---|---|
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
GR01 | Patent grant |