CN112988805A - Data processing method, device and equipment based on computing framework and storage medium


Info

Publication number: CN112988805A
Application number: CN201911281274.1A
Authority: CN (China)
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Prior art keywords: data, time slice, processing, batch, volume
Other languages: Chinese (zh)
Inventors: 安金龙, 刘业辉, 张宁, 张飞, 王彦明
Assignees: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Application filed by Beijing Jingdong Century Trading Co Ltd and Beijing Jingdong Shangke Information Technology Co Ltd


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • G06F16/24568Data stream processing; Continuous queries


Abstract

The application provides a data processing method, apparatus, device and storage medium based on a computing framework. The method includes: when data is received based on the computing framework and the data receiving speed is rate-limited, determining a first data volume, where the first data volume is the first number of pieces of data corresponding to the current time slice; when it is determined that the difference between the first data volume and a pre-stored second data volume is greater than a preset value and the first data volume is smaller than the second data volume, where the second data volume is the second number of pieces of data corresponding to the previous time slice, merging the data in the queue; and processing the merged data belonging to the same batch in a batch processing manner. Merging the data in the queue yields merged data of at least one batch. Because the data in the queue are merged, the data can be processed in advance, continuous accumulation of data is avoided, and the efficiency of data processing based on the computing framework is effectively improved.

Description

Data processing method, device and equipment based on computing framework and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a data processing method, apparatus, device, and storage medium based on a computing framework.
Background
With the development of data processing technology, the processing requirements for big data keep increasing. To meet the demand for real-time data processing, the Spark Streaming computing framework has been proposed. Spark Streaming provides a rich API and a memory-based high-speed execution engine.
With the Spark Streaming computing framework, data can be received in each time slice, and the data of each time slice is then processed as a whole. Specifically, the length of a time slice is specified through the Spark Streaming computing framework; data is received in a first time slice, and the data corresponding to the first time slice is processed as one batch; data is then received in a second time slice, and the data corresponding to the second time slice is processed as another batch; and so on. A minimal sketch of this model follows.
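For illustration only, the following minimal Scala sketch shows this micro-batch model: the slice length is fixed when the StreamingContext is created, and the data received in each slice is processed as one batch. The socket source, host and port are placeholder assumptions, not part of the described method.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TimeSliceDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("time-slice-demo").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5)) // time slice length: 5 seconds

    // Each 5-second slice of received lines becomes one RDD, i.e. one batch.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.foreachRDD { (rdd, time) =>
      // The whole slice is handled as a unit, like the batches described above.
      println(s"batch at $time: ${rdd.count()} records")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```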
However, in the prior art, when data is received and processed based on the Spark Streaming computing framework, the data in each time slice is processed as a whole: the data of a time slice must be fully received before it can be processed. As a result, when the data corresponding to the previous time slice has not yet been processed, the data of the next time slice is already being received, which causes data accumulation and low data processing efficiency.
Disclosure of Invention
The application provides a data processing method, a data processing device, data processing equipment and a storage medium based on a computing framework, which are used for solving the problems of data accumulation and low data processing efficiency in the data processing process based on the computing framework.
In a first aspect, the present application provides a data processing method based on a computing framework, the method including:
determining a first data amount when determining to speed-limit a data receiving speed when receiving data based on a calculation framework, wherein the first data amount is a first data number of data corresponding to a current time slice, and the data corresponding to the current time slice is data in a processing state;
when it is determined that a data difference value between the first data volume and a pre-stored second data volume is larger than a preset value and the first data volume is smaller than the second data volume, wherein the second data volume is a second data number of data corresponding to a previous time slice, merging the data in the queue to obtain merged data belonging to the same batch, wherein the data in the queue is the data which is received in the current time slice and is not processed;
and processing the merged data belonging to the same batch in a batch processing mode.
Further, the merging the data in the queue to obtain merged data belonging to the same batch includes:
determining the number N of batches according to the total number of data in the queue and a preset average number, wherein N is a positive integer greater than or equal to 1;
merging the data in the queue according to the number N of the batches to obtain merged data of N batches;
the data processing is carried out on the merged data belonging to the same batch in a batch processing mode, and the data processing method comprises the following steps:
and respectively processing the merged data of each batch in the merged data of the N batches in a batch processing mode.
Further, the number of batches is

N = ⌈S/T⌉

wherein S is the total number of pieces of data, and T is the average number of pieces.
Further, before determining the number N of the batches according to the total number of data in the queue and a preset average number, the method further includes:
acquiring, for each time slice of a plurality of time slices, the number of pieces of data corresponding to that time slice, wherein the plurality of time slices are time slices before the current time slice;
and taking the average value of the number of the data in each time slice as the average number.
Further, when determining to limit the data receiving speed, before determining the first data amount, the method further includes:
acquiring a first time length of the current time slice, and acquiring a second time length required for processing data corresponding to the current time slice;
and when the second time length is determined to be greater than the first time length, limiting the data receiving speed.
Further, the method further comprises:
and processing the data corresponding to each time slice at a preset speed when the second time length is determined to be less than or equal to the first time length.
Further, the method further comprises:
and when the data difference value between the first data volume and a pre-stored second data volume is determined to be smaller than or equal to a preset value and the first data volume is smaller than the second data volume, or when the first data volume is determined to be larger than or equal to the second data volume, processing the data corresponding to each time slice at a preset speed.
In a second aspect, the present application provides a computing framework-based data processing apparatus, the apparatus comprising:
a first determination unit configured to determine a first data amount when determining to speed-limit a data reception speed when receiving data based on the calculation framework, wherein the first data amount is a first number of pieces of data corresponding to a current time slice, and the data corresponding to the current time slice is data in a processing state;
a merging unit, configured to merge data in a queue to obtain merged data belonging to a same batch when it is determined that a data difference between the first data amount and a pre-stored second data amount is greater than a preset value and the first data amount is smaller than the second data amount, where the second data amount is a second data number of data corresponding to a previous time slice, and the data in the queue is data that is received in the current time slice and is not processed;
the first processing unit is used for processing the merged data belonging to the same batch in a batch processing mode.
Further, the merging unit includes:
the determining module is used for determining the number N of the batches according to the total number of the data in the queue and a preset average number, wherein N is a positive integer greater than or equal to 1;
the merging module is used for merging the data in the queue according to the number N of the batches to obtain merged data of the N batches;
the first processing unit is specifically configured to:
and respectively processing the merged data of each batch in the merged data of the N batches in a batch processing mode.
Further, the number of batches is

N = ⌈S/T⌉

wherein S is the total number of pieces of data, and T is the average number of pieces.
Further, the apparatus further comprises:
a second determining unit, configured to obtain, before the determining module determines the number N of batches according to the total number of data in the queue and a preset average number, the number of data in a time slice corresponding to each of multiple time slices, where the multiple time slices are time slices before the current time slice; and taking the average value of the number of the data in each time slice as the average number.
Further, the apparatus further comprises:
the acquisition unit is used for acquiring a first time length of the current time slice and acquiring a second time length required for processing data corresponding to the current time slice before determining a first data volume when the first determination unit determines to limit the speed of data receiving;
and the speed limiting unit is used for limiting the speed of the data receiving speed when the second time length is determined to be greater than the first time length.
Further, the apparatus further comprises:
and the second processing unit is used for processing the data corresponding to each time slice at a preset speed when the second time length is determined to be less than or equal to the first time length.
Further, the apparatus further comprises:
and the third processing unit is used for processing the data corresponding to each time slice at a preset speed when the data difference value between the first data volume and the pre-stored second data volume is determined to be smaller than or equal to a preset value and the first data volume is smaller than the second data volume or when the first data volume is determined to be larger than or equal to the second data volume.
In a third aspect, the present application provides a computing framework based data processing apparatus comprising means or means for performing the steps of any of the methods of the first aspect above.
In a fourth aspect, the present application provides a computing framework based data processing apparatus comprising a processor, a memory and a computer program, wherein the computer program is stored in the memory and configured to be executed by the processor to implement any of the methods of the first aspect.
In a fifth aspect, the present application provides a computing framework based data processing apparatus comprising at least one processing element or chip for performing any of the methods of the first aspect above.
In a sixth aspect, the present application provides a computer program for performing any of the methods of the first aspect above when executed by a processor.
In a seventh aspect, the present application provides a computer readable storage medium having the computer program of the sixth aspect stored thereon.
According to the data processing method, apparatus, device and storage medium based on the computing framework provided by the application, when data is received based on the computing framework and the data receiving speed is rate-limited, a first data volume is determined, where the first data volume is the first number of pieces of data corresponding to the current time slice, and the data corresponding to the current time slice is the data in the processing state; when it is determined that the difference between the first data volume and a pre-stored second data volume is greater than a preset value and the first data volume is smaller than the second data volume, where the second data volume is the second number of pieces of data corresponding to the previous time slice, the data in the queue is merged to obtain merged data belonging to the same batch, where the data in the queue is the data that has been received in the current time slice and is not yet processed; and the merged data belonging to the same batch is processed in a batch processing manner. During data processing based on the Spark Streaming computing framework, when the data volume received in the current time slice is far smaller than the data volume received in the previous time slice, it can be determined that the backpressure mechanism of Spark Streaming has taken effect, and from the current time slice onward the data volume received in each time slice is small; the data received in a plurality of time slices then waits to be processed and constitutes the data in the queue. Since the data volume of each time slice is small after the backpressure mechanism takes effect, in order to avoid a continuous backlog of data, the data in the queue can be merged to obtain merged data of at least one batch; the merged data of each batch is treated as one batch and processed in a batch processing manner. Therefore, by merging the data in the queue, the data can be processed in advance and continuous backlog and accumulation of data are avoided; the throughput of data processing is improved, and the efficiency of data processing based on the Spark Streaming computing framework is effectively improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present application;
fig. 2 is a schematic flowchart of a data processing method based on a computing framework according to an embodiment of the present application;
FIG. 3 is a schematic flowchart of another data processing method based on a computing framework according to an embodiment of the present application;
Fig. 4 is a schematic diagram of data processing based on Spark Streaming according to an embodiment of the present application;
Fig. 5 is a schematic structural diagram of the Spark Streaming computing framework provided in the embodiment of the present application;
Fig. 6 is an execution process of the backpressure mechanism based on the Spark Streaming framework according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a data processing apparatus based on a computing framework according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of another data processing apparatus based on a computing framework according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a data processing device based on a computing framework according to an embodiment of the present application.
With the above figures, there are shown specific embodiments of the present application, which will be described in more detail below. These drawings and written description are not intended to limit the scope of the inventive concepts in any manner, but rather to illustrate the inventive concepts to those skilled in the art by reference to specific embodiments.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
The embodiments of the present application can be applied to the technical field of data processing, and are implemented by a data processing device based on a computing framework, a terminal device, a server, or another apparatus or device.
The terms referred to in this application are explained first:
1) Data Warehouse (DW): a storage combination of data.
2) Spark: a fast, general-purpose computing engine designed for large-scale data processing; it is well suited to algorithms that require iteration, such as data mining and machine learning.
3) Spark Streaming: a framework built on Spark for processing stream data; the data stream can be divided into small time slices (in units of seconds), and the data within each time slice is processed in a manner similar to batch processing.
4) Resilient Distributed Dataset (RDD): an abstraction of distributed memory; RDD provides a highly constrained shared-memory model.
5) Partition: one RDD is physically divided into a plurality of partitions, which may be distributed over different nodes. A partition is the basic processing unit of a Spark computing task.
6) Hadoop: a distributed system infrastructure that mainly solves distributed computation and storage of massive data.
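To make terms 2), 4) and 5) concrete, here is a small illustrative Scala snippet; the sizes and values are arbitrary example assumptions:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("rdd-demo").setMaster("local[4]"))

// One RDD (term 4), physically split into eight partitions (term 5);
// each partition could live on a different node of the cluster.
val rdd = sc.parallelize(1 to 1000, numSlices = 8)
println(rdd.getNumPartitions) // 8

// Spark (term 2) runs the computation per partition, in parallel,
// and iterative work can stay in memory across steps.
println(rdd.map(_ * 2).sum()) // 1001000.0
```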
A specific application scenario of the present application is as follows: with the development of network technology, network data is widely used in various fields, and the internet generates large data flows. Moreover, data usually has large volatility; for example, during various activities or sudden hot-spot events, the data traffic may increase sharply in a short time and form huge traffic spikes. These characteristics of the data are especially apparent in the e-commerce industry.
When the data inflow speed is much higher than the data processing speed, that is, when the data receiving speed is much higher than the data processing speed, a huge load pressure is formed on a stream processing system for processing data; when the load pressure of the stream processing system cannot be relieved well, cluster resources are exhausted, and even the cluster is crashed. Thus, it is necessary to ensure the stability of the stream processing system that processes data.
MapReduce is a batch processing framework and a programming model that can be used for parallel computation over large-scale data sets. MapReduce can process data offline, but it cannot satisfy data with high real-time requirements; for example, for data in scenarios such as real-time data warehouses, real-time recommendation and user behavior analysis, MapReduce cannot process the data in time.
To facilitate distributed computing, Spark was proposed. Spark is a distributed computing framework similar to MapReduce; its core is the resilient distributed dataset. Spark provides a richer model than MapReduce and can quickly perform multiple iterations over a data set in memory. Based on these features, Spark can support complex algorithms such as data mining algorithms and graph computation algorithms.
Spark Streaming was proposed on the basis of Spark. Spark Streaming is a real-time computing framework built on Spark, which extends Spark's ability to process large-scale streaming data; moreover, Spark Streaming can process data with high real-time requirements in time. Spark Streaming provides a rich Application Programming Interface (API) and can complete data processing in memory; furthermore, Spark Streaming completes data querying and processing by combining streaming, batch processing and interactive query modes.
Based on the above description, the advantages of Spark Streaming are: it can run on hundreds of nodes with second-level latency; it uses the memory-based Spark as its execution engine, and thus is efficient and fault-tolerant; it integrates Spark's batch processing capability and interactive query capability; and Spark Streaming provides a simple interface similar to batch processing while still allowing complex algorithms to be implemented.
However, in the Spark Streaming computing framework, data is received in each time slice, and the data of each time slice is then processed as a whole. Specifically, the length of a time slice is specified through the Spark Streaming computing framework; data is received in a first time slice, and the data corresponding to the first time slice is processed as one batch; data is then received in a second time slice, and the data corresponding to the second time slice is processed as another batch; and so on.
However, when data is received and processed based on the Spark Streaming computing framework, the data in each time slice is processed as a whole: the data of a time slice must be fully received before it can be processed. As a result, when the data corresponding to the previous time slice has not yet been processed, the data of the next time slice is already being received, which causes data accumulation and low data processing efficiency.
The application provides a data processing method, a data processing device, a data processing apparatus and a storage medium based on a computing framework, which aim to solve the above technical problems in the prior art.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Fig. 1 is a schematic view of an application scenario provided in the embodiment of the present application. As shown in Fig. 1, data can be obtained from each data source and then input into Spark Streaming; Spark Streaming processes the data, and the processed data is then output to the respective data sinks. The data sources include but are not limited to: Kafka, Flume, HDFS (Hadoop Distributed File System) and Kinesis. The data sinks include but are not limited to: databases and dashboards. Kafka is a high-throughput distributed publish-subscribe messaging system; Flume is a highly available, highly reliable, distributed system provided by Cloudera for collecting, aggregating and transmitting massive logs; HDFS is a highly fault-tolerant system that provides high-throughput data access and is well suited to applications on large-scale data sets; Kinesis is an AWS cloud service for collecting real-time streaming data.
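As a hedged illustration of wiring one of these sources into Spark Streaming, the sketch below creates a direct Kafka stream; the broker address, topic name and group id are placeholder assumptions:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

val ssc = new StreamingContext(
  new SparkConf().setAppName("kafka-source-demo").setMaster("local[2]"), Seconds(5))

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker1:9092", // placeholder broker
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "demo-group")

// Direct Approach: each batch reads the new Kafka records for one time slice.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams))

stream.map(_.value()).foreachRDD(rdd => println(s"received ${rdd.count()} records"))
ssc.start()
ssc.awaitTermination()
```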
Fig. 2 is a schematic flowchart of a data processing method based on a computing framework according to an embodiment of the present application. As shown in fig. 2, the method includes:
101. when receiving data based on the calculation framework, when determining to speed-limit the data receiving speed, determining a first data amount, wherein the first data amount is a first data number of data corresponding to a current time slice, and the data corresponding to the current time slice is data in a processing state.
In this embodiment, the execution subject of this embodiment may be a data processing apparatus based on a computing framework, or a data processing device based on a computing framework, or a terminal device, or a server, or other apparatuses or devices.
In the process of data processing based on the Spark Streaming computing framework, a backpressure mechanism can be provided for Spark Streaming, so that when the data processing speed is determined to be far lower than the data receiving speed, the data receiving speed can be reduced, and the volume of data received based on the computing framework can be reduced.
In the process of data processing based on Spark Streaming, data is received based on the computing framework: data can be received in each time slice, and the data corresponding to each time slice is processed after it has been received. Each time slice therefore has a corresponding data volume, which has a number of pieces of data.
Based on the above process, after the speed limit is performed on the data receiving speed, data is received in the current time slice, so that data corresponding to the current time slice is obtained, and then the number of data corresponding to the current time slice can be determined, and for convenience of distinguishing, the number of data corresponding to the current time slice is called as a first number of data; thus, a first data volume of data corresponding to the current time slice can be obtained, and the first data volume is the first number of data pieces.
At this time, the data corresponding to the current time slice is received, and processing of the data corresponding to the current time slice is started, that is, the data corresponding to the current time slice is in a processing state. At this time, the data corresponding to the current time slice is used as the data of one batch to perform batch processing.
For example, in the process of processing data based on Spark Streaming, data is received in time slice 1, so that the data corresponding to time slice 1 is obtained, and the data corresponding to time slice 1 is then processed at a preset speed, that is, the data corresponding to time slice 1 is processed in batch as the data of one batch; data is received in time slice 2 after time slice 1, so that the data corresponding to time slice 2 is obtained, and the data corresponding to time slice 2 is then processed at the preset speed, that is, the data corresponding to time slice 2 is processed in batch as the data of another batch; then, the data receiving speed of time slice 3 after time slice 2 is limited based on the backpressure mechanism, the data corresponding to time slice 3 is received at this moment, and a first data volume is obtained, where the first data volume is the first number of pieces of data corresponding to time slice 3. At this time, processing of the data corresponding to the current time slice 3 has started but has not yet been completed.
102. When the data difference value between the first data volume and a pre-stored second data volume is larger than a preset value and the first data volume is smaller than the second data volume, wherein the second data volume is the second data number of data corresponding to a previous time slice, merging the data in the queue to obtain merged data belonging to the same batch, wherein the data in the queue is the data which is received in the current time slice and is not processed.
In this embodiment, since the data of the time slice before the current time slice has been received, that is, the data of the previous time slice has been received, the second number of pieces of data corresponding to the previous time slice can be determined; this second number of pieces is used as the second data volume.
Then, the first data volume of step 101 is compared with the second data volume, and it is determined whether the first data volume is smaller than the second data volume; when the first data volume is determined to be smaller than the second data volume, the first data volume is subtracted from the second data volume to obtain a data difference; it is then judged whether the data difference is larger than the preset value; when the data difference is determined to be larger than the preset value, it is determined that the first data volume is far smaller than the second data volume. It can thus be determined that the amount of data received in the current time slice is much smaller than the amount of data received in the previous time slice, and hence that the data receiving speed has actually been reduced, i.e., that the Spark Streaming based backpressure mechanism has taken effect. For example, 200,000 pieces of data per batch are received normally, and 500 pieces of data per batch are received after the backpressure mechanism takes effect.
At this time, since the data corresponding to the current time slice is completely received, the data corresponding to the current time slice is being subjected to batch processing, but the data corresponding to the current time slice is not completely processed, the data in the next time slice is already received, so that the data in the next time slice enters a queuing state, the data in the next time slice waits for data processing, and the data in the next time slice is taken as the data of the same batch and waits for batch processing. In addition, in the process of processing the data corresponding to the current time slice, the data of the time slice after the next time slice may also be received, so that the data enter the queuing state and become the data in the queue; it can be seen that the data corresponding to each time slice in the queued data is used as the data of one batch.
Since the data reception speed has been rate limited and it has been determined that the first amount of data is much smaller than the second amount of data, i.e. the amount of data received has been significantly reduced.
At this time, if the data in each time slice after the current time slice is processed as the data of a separate batch, the data in each time slice after the current time slice needs to keep waiting for processing, and while the data corresponding to the current time slice is being processed, data in subsequent time slices continues to be received, so that the data keeps backlogging. In this embodiment, the data in the queue may be merged and divided into at least one batch to obtain merged data of at least one batch, where the data in the merged data of each batch belongs to the same batch; by merging the data, the number of batches of data is reduced.
For example, the data in the queue may all be merged into one batch. Or, dividing the data in the queue into a plurality of batches.
For example, in the process of processing data based on Spark Streaming, data is received in time slice 1, and the data corresponding to time slice 1 is processed at a preset speed, that is, as the data of one batch; data is received in time slice 2 after time slice 1, and the data corresponding to time slice 2 is processed at the preset speed as the data of another batch; then, the data receiving speed of time slice 3 after time slice 2 is limited based on the backpressure mechanism. After the data in time slice 3 has been received, the data corresponding to time slice 3 is processed, and while this processing is still unfinished, the following time slices continue to receive data: data is received in time slice 4, in time slice 5 and in time slice 6, yielding the data corresponding to time slice 4, to time slice 5 and to time slice 6. Thus, while the data corresponding to time slice 3 is being processed, the data in the queue is obtained, which includes the data corresponding to time slice 4, the data corresponding to time slice 5 and the data corresponding to time slice 6. If the data corresponding to time slice 4 were processed as the data of one batch, the data corresponding to time slice 5 and the data corresponding to time slice 6 would need to keep waiting for processing, and the data corresponding to time slice 7 would be received while the data corresponding to time slice 4 is being processed, so the data would keep accumulating. In this case, the data corresponding to time slice 4, the data corresponding to time slice 5 and the data corresponding to time slice 6 may be merged to obtain merged data of at least one batch, for example merged data of two batches.
103. And processing the merged data belonging to the same batch in a batch processing mode.
In this embodiment, after step 102, at least one batch of merged data is obtained; then, the merged data belonging to the same batch is subjected to data processing in a batch processing manner.
For example, 2 batches of merged data are obtained, which are the merged data of batch a and the merged data of batch B. Then, the merged data of the batch A can be processed in a batch processing mode; after the merged data of the lot A is processed, the merged data of the lot B is processed in a batch processing mode. Alternatively, 2 threads may be invoked to process the merged data of batch a and the merged data of batch B separately at the same time, e.g., thread 1 may be invoked to process the merged data of batch a while thread 2 is invoked to process the merged data of batch B.
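A hedged sketch of the two processing orders just described follows, with processBatch standing in for the real per-batch logic; it is a hypothetical helper, not a Spark API:

```scala
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, Future}

// Hypothetical stand-in for the actual batch-processing logic.
def processBatch(name: String, records: Seq[String]): Unit =
  println(s"processing merged batch $name: ${records.size} records")

val batchA = Seq("a1", "a2")
val batchB = Seq("b1", "b2", "b3")

// Option 1: process batch A first, then batch B.
processBatch("A", batchA)
processBatch("B", batchB)

// Option 2: invoke two threads and process both batches at the same time,
// like thread 1 / thread 2 in the text above.
val done = Future.sequence(Seq(
  Future(processBatch("A", batchA)),
  Future(processBatch("B", batchB))))
Await.result(done, Duration.Inf)
```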
In this embodiment, when data is received based on the computing framework and the data receiving speed is rate-limited, a first data volume is determined, where the first data volume is the first number of pieces of data corresponding to the current time slice, and the data corresponding to the current time slice is the data in the processing state; when it is determined that the difference between the first data volume and a pre-stored second data volume is greater than a preset value and the first data volume is smaller than the second data volume, where the second data volume is the second number of pieces of data corresponding to the previous time slice, the data in the queue is merged to obtain merged data belonging to the same batch, where the data in the queue is the data that has been received in the current time slice and is not yet processed; and the merged data belonging to the same batch is processed in a batch processing manner. During data processing based on the Spark Streaming computing framework, when the data volume received in the current time slice is far smaller than the data volume received in the previous time slice, it can be determined that the backpressure mechanism of Spark Streaming has taken effect, and from the current time slice onward the data volume received in each time slice is small; the data received in a plurality of time slices then waits to be processed and constitutes the data in the queue. Since the data volume of each time slice is small after the backpressure mechanism takes effect, in order to avoid a continuous backlog of data, the data in the queue can be merged to obtain merged data of at least one batch; the merged data of each batch is treated as one batch and processed in a batch processing manner. Therefore, by merging the data in the queue, the data can be processed in advance and continuous backlog and accumulation of data are avoided; the throughput of data processing is improved, and the efficiency of data processing based on the Spark Streaming computing framework is effectively improved.
Fig. 3 is a schematic flowchart of another data processing method based on a computing framework according to an embodiment of the present application. As shown in fig. 3, the method includes:
201. when receiving data based on a computing framework, a first time length of a current time slice is obtained, and a second time length required for processing the data corresponding to the current time slice is obtained.
In this embodiment, the execution subject of this embodiment may be a data processing apparatus based on a computing framework, or a data processing device based on a computing framework, or a terminal device, or a server, or other apparatuses or devices.
In the process of data processing based on the Spark Streaming computing framework, the data stream can be split in units of time slices as it is received, where a time slice can be at the second level; the data corresponding to each time slice is then processed in batch mode, that is, the data corresponding to each time slice is treated as one batch, and the batches are processed in turn.
For example, Fig. 4 is a schematic diagram of data processing based on Spark Streaming provided in the embodiment of the present application. As shown in Fig. 4, a data stream is input into the Spark Streaming based computing framework, and Spark Streaming is used to process the data stream. Based on Spark Streaming, data can be received in time slice t1, that is, the data received within time length t1 is taken as one batch, and the data received in time slice t1 is then processed; while the data received in time slice t1 is being processed, data is also received in time slice t2, that is, the data received in time slice t2 is taken as another batch, and after the data received in time slice t1 has been processed, the data received in time slice t2 is processed; after receiving data in time slice t2, data continues to be received in time slice t3; and so on. After the data received in each time slice has been processed, a data processing result corresponding to each time slice is obtained.
The data source continuously generates data, so when data processing is performed based on Spark Streaming, the data receiving speed can be the speed at which the data source generates data. Spark Streaming may employ Receiver-based data reception, or it may use the direct reception mode (Direct Approach). After the data in a time slice has been received, that data is processed; during this processing, the next time slice begins and data is received in it; and so on. When the time required for processing data (the batch processing time) is greater than the time interval between slices (the batch interval), that is, when the time to process each batch of data is longer than the batch interval of Spark Streaming, more and more data is received and backlogged; the processing of data cannot keep up, data accumulation begins, and an out-of-memory (OOM) problem may even be caused, making the whole data processing process fail.
To avoid a backlog of data, the maximum amount of data received per second may be limited. For example, in Spark Streaming, the parameter spark.streaming.receiver.maxRate may be configured to limit the maximum number of records received per second by each receiver. Alternatively, when data reception is performed in the Direct Approach manner, the parameter spark.streaming.kafka.maxRatePerPartition may be configured to limit the number of records read from each Kafka partition in each run, thereby limiting the data reading speed and the maximum data reception amount per second.
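For illustration, both static limits can be set on the SparkConf; the parameter names are real Spark Streaming settings, while the numeric values below are arbitrary example assumptions:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Receiver-based reception: at most 10,000 records per second per receiver.
  .set("spark.streaming.receiver.maxRate", "10000")
  // Direct Approach (Kafka): at most 1,000 records per second per partition.
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")
```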
However, the above-mentioned method of limiting the maximum data receiving amount per second needs to estimate the data processing speed of the cluster and the data generation speed in advance. In addition, since the maximum data receiving amount per second is limited through parameter configuration, in Spark Streaming the application has to be restarted after the parameters are modified, which interrupts data processing and affects data processing efficiency; moreover, when the data generation speed is high, the data cannot be processed in time, and the cluster resource utilization is low. A backpressure mechanism may therefore be provided on the basis of Spark Streaming to solve these problems: the backpressure mechanism adapts to the cluster's data processing capability by dynamically controlling the data receiving rate, without needing to predict the data processing speed or the data generation speed.
Fig. 5 is a schematic diagram of the Spark Streaming computing framework provided in the embodiment of the present application. As shown in Fig. 5, Spark Streaming at least includes the execution framework, the Spark Streaming Driver, and SparkContext. The execution framework at least comprises the following components: Receiver and Block Manager. The Spark Streaming Driver at least comprises the following components: ReceiverTracker, Job Generator and Job Scheduler. Therefore, when the backpressure mechanism is enabled, the execution information of data processing jobs can be fed back by the JobScheduler, and the data receiving rate of the Receiver can be dynamically adjusted. Whether the backpressure mechanism is enabled is controlled by the attribute parameter spark.streaming.backpressure.enabled: the default value of the parameter is false, which means the backpressure mechanism is not enabled; when the parameter is true, the backpressure mechanism is enabled.
In the process of data processing based on Spark Streaming, the backpressure function is first turned on by setting the parameter spark.streaming.backpressure.enabled to true. At this point, however, the backpressure mechanism has not yet been triggered, i.e., backpressure has not yet taken effect. Then, during data receiving and processing, when data is received in the current time slice, the time length of the current time slice can be determined; this is referred to as the first time length T1. While the data corresponding to the current time slice is being processed, the time length required to process that data can be determined; this is referred to as the second time length T2.
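A minimal sketch of turning the backpressure function on; only the property shown is set, and the rate itself is then estimated at runtime from batch feedback:

```scala
import org.apache.spark.SparkConf

// Default is false (backpressure disabled); true enables dynamic rate control.
val conf = new SparkConf()
  .set("spark.streaming.backpressure.enabled", "true")
```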
202. And when the second time length is determined to be greater than the first time length, limiting the data receiving speed.
In the present embodiment, after step 201, the first time length T1 and the second time length T2 are compared in size.
When the second time length T2 is determined to be greater than the first time length T1, it is determined that the time required for data processing is greater than the time required for data receiving, hence that the data processing speed is less than the data receiving speed, and further that the Spark Streaming based data processing capability is insufficient; at this point, backpressure is determined to take effect, and the data receiving speed is limited.
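The comparison in steps 201-202 can be sketched as below; shouldLimitRate is a hypothetical helper used only to illustrate the decision, not a Spark API:

```scala
import scala.concurrent.duration._

// T1: length of the current time slice; T2: time needed to process its data.
def shouldLimitRate(t1: FiniteDuration, t2: FiniteDuration): Boolean =
  t2 > t1 // processing slower than arrival, so limit the receiving speed

val t1 = 5.seconds
val t2 = 8.seconds
println(shouldLimitRate(t1, t2)) // true: backpressure takes effect
```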
In an example, Fig. 6 shows the execution process of the backpressure mechanism in the Spark Streaming computing framework according to an embodiment of the present application. As shown in Fig. 6, the execution process of the backpressure mechanism is as follows: a new component, ReceiverRateController, is added to the original Spark Streaming framework and is responsible for listening for the "OnBatchCompleted" event; the ReceiverRateController then extracts processing delay information and scheduling delay information; the Estimator component estimates the maximum processing speed from this information, that is, the data receiving rate can be dynamically controlled according to it; the maximum processing speed is forwarded by the Receiver-based Input Stream component to the BlockGenerator component via the ReceiverTracker and ReceiverSupervisor components. In this way, the data receiving speed is rate-limited.
203. When determining to limit the data receiving speed, determining a first data volume, wherein the first data volume is a first data number of data corresponding to the current time slice, and the data corresponding to the current time slice is data in a processing state.
In this embodiment, after the backpressure mechanism is activated, the data is processed in batches. After backpressure takes effect and the data receiving speed is limited, data is received in the current time slice, so that the data corresponding to the current time slice is obtained, and the number of pieces of data corresponding to the current time slice can be determined; for ease of distinction, this is called the first number of data pieces. Thus, a first data volume of the data corresponding to the current time slice is obtained, and the first data volume is the first number of data pieces.
At this time, the data corresponding to the current time slice is received, and processing of the data corresponding to the current time slice is started, that is, the data corresponding to the current time slice is in a processing state. At this time, the data corresponding to the current time slice is used as the data of one batch to perform batch processing.
204. Acquire, for each time slice of a plurality of time slices, the number of pieces of data corresponding to that time slice, where the plurality of time slices are time slices before the current time slice; take the average of the numbers of pieces of data in the time slices as the average number.
In this embodiment, step 204 may be executed before step 205 or during the execution of step 205. In step 204, one parameter is determined: the average number of pieces.
The number of pieces of data in each of a plurality of time slices is obtained in advance, where the plurality of time slices are time slices before the current time slice; that is, for each time slice, the number of pieces of data in the time slice is the total count of the data received in that time slice. The numbers of pieces of data of the individual time slices are accumulated to obtain their sum, and the sum is then averaged to obtain the average number of pieces.
For example, during execution of the backpressure mechanism, P time slices have been executed, data is received in each of the P time slices, and the number of pieces of data in each of the P time slices is Q, where the Q values may be the same or different; the numbers of pieces of data of the P time slices are averaged to obtain the average number.
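Step 204 reduced to code, with recordCounts as hypothetical per-slice counts for the P earlier time slices:

```scala
// Number of pieces of data received in each of the P time slices
// before the current one (hypothetical sample values).
val recordCounts: Seq[Long] = Seq(480L, 510L, 495L, 515L)

// T: the average number of pieces per time slice.
val averagePieces: Long = recordCounts.sum / recordCounts.size
println(averagePieces) // 500
```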
205. When the data difference value between the first data volume and a pre-stored second data volume is larger than a preset value and the first data volume is smaller than the second data volume, wherein the second data volume is the second data number of data corresponding to a previous time slice, determining the number N of batches according to the total data number of the data in queue and a preset average number, wherein N is a positive integer larger than or equal to 1; wherein the data in the queue is the data received in the current time slice and unprocessed.
In one example, the number of batches is

N = ⌈S/T⌉

where S is the total number of pieces of data and T is the average number of pieces.
In this embodiment, since the data of the time slice before the current time slice has been received, that is, the data of the previous time slice has been received, the second number of pieces of data corresponding to the previous time slice can be determined; this second number of pieces is used as the second data volume.
Then, the first data amount of step 203 is compared with the second data amount. As described in step 102 of Fig. 2, when it is determined that the data difference between the first data amount and the pre-stored second data amount is greater than the preset value and the first data amount is smaller than the second data amount, it is determined that the data in the queue need to be merged; the total number of pieces of data S, that is, the total number of pieces of data in the queue, can then be obtained, and dividing S by the average number of pieces T yields the number of batches N.
In one example, when the total number of pieces of data S is not divisible by the average number of pieces T, the rounding-up formula N = ⌈S/T⌉ may be used to obtain the number of batches N.
206. And merging the data in the queue according to the number N of the batches to obtain merged data of the N batches.
In this embodiment, after step 205, the number N of batches that need to be generated is obtained; the data in the queue is then distributed into N batches of merged data, for example divided equally into the N batches.
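Steps 205-206 as a hedged sketch: compute N = ⌈S/T⌉ and split the queued records evenly into N merged batches; queued and the counts are illustrative assumptions:

```scala
// Records waiting in the queue (S = 1230 illustrative records).
val queued: Seq[String] = (1 to 1230).map(i => s"rec$i")
val averagePieces = 500 // T, obtained as in step 204

val s = queued.size
val n = math.ceil(s.toDouble / averagePieces).toInt // N = ceil(1230 / 500) = 3

// Divide the queued data equally into N merged batches.
val mergedBatches: Seq[Seq[String]] =
  queued.grouped(math.ceil(s.toDouble / n).toInt).toSeq

println(mergedBatches.map(_.size)) // List(410, 410, 410)
```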
207. And respectively processing the merged data of each batch in the merged data of the N batches in a batch processing mode.
In this embodiment, after step 206, N batches of merged data are obtained; the merged data of each batch can then be processed in a batch processing manner.
208. And when the data difference value between the first data volume and the pre-stored second data volume is determined to be smaller than or equal to a preset value and the first data volume is smaller than the second data volume, or when the first data volume is determined to be larger than or equal to the second data volume, processing the data corresponding to each time slice at a preset speed.
In this embodiment, after step 204, the first data amount of step 203 is compared with the second data amount, and it is determined whether the first data amount is smaller than the second data amount.
When the first data volume is determined to be larger than or equal to the second data volume, determining that the data volume received in the current time slice is not smaller than the data volume received in the previous time slice; or when the data difference obtained after subtracting the first data amount from the second data amount is less than or equal to the preset value, it may be determined that the data amount received in the current time slice is not much less than the data amount received in the previous time slice. In both cases, it can be considered that the data reception speed is not reduced, that is, the back pressure mechanism based on Spark Streaming has no effect, and the back pressure mechanism does not work.
Then, processing the data corresponding to each time slice at the original preset speed; at this time, the data corresponding to each time slice is still regarded as a batch; and sequentially processing the data of each batch according to a batch processing mode.
209. And processing the data corresponding to each time slice at a preset speed when the second time length is determined to be less than or equal to the first time length.
In this embodiment, after step 201, when it is determined that the second time length T2 is less than or equal to the first time length T1, it is determined that the time required for data processing is less than or equal to the time required for data reception, and then it is determined that the data processing speed is greater than or equal to the data reception speed, and then it is determined that the data processing capability based on Spark Streaming is sufficient; at this time, the backpressure is determined not to be effective, and the data receiving speed is not required to be limited. And continuing to receive data in each subsequent time slice.
Then, processing the data corresponding to each time slice at the original preset speed; at this time, the data corresponding to each time slice is still regarded as a batch; and sequentially processing the data of each batch according to a batch processing mode.
In this embodiment, on the basis of the above embodiment, the data volume of the data in each time slice is determined based on the backpressure mechanism of the Spark Streaming computing framework, and from it the number of batches into which the data need to be merged; the data in the queue are then merged into data of several batches. Data can thereby be processed in advance, so continuous backlog and accumulation of data are avoided; the throughput of data processing is improved, and the efficiency of data processing based on the Spark Streaming computing framework is effectively improved. When it is determined that the backpressure mechanism is not activated, the data corresponding to each time slice is processed at the original preset speed; in this case, the data corresponding to each time slice is still treated as one batch.
Fig. 7 is a schematic structural diagram of a data processing apparatus based on a computing framework according to an embodiment of the present application, and as shown in fig. 7, the data processing apparatus based on a computing framework according to the present embodiment may include:
a first determining unit 31 for determining a first data amount when determining to speed-limit the data receiving speed when receiving data based on the calculation framework, wherein the first data amount is a first number of pieces of data corresponding to a current time slice, and the data corresponding to the current time slice is data in a processing state.
A merging unit 32, configured to merge the queued data to obtain merged data belonging to the same batch when it is determined that a data difference between the first data amount and a pre-stored second data amount is greater than a preset value and the first data amount is smaller than the second data amount, where the second data amount is a second data number of data corresponding to a previous time slice, where the queued data is data that is received in a current time slice and is not processed.
The first processing unit 33 is configured to perform data processing on the merged data belonging to the same batch in a batch processing manner.
The data processing apparatus based on the computing framework of this embodiment can execute the data processing method based on the computing framework provided in any of the above embodiments, and the implementation principle and the technical effect are similar, and are not described herein again.
Fig. 8 is a schematic structural diagram of another data processing apparatus based on a computing framework according to an embodiment of the present application. On the basis of the embodiment shown in fig. 7, as shown in fig. 8, in the data processing apparatus based on a computing framework of this embodiment, the merging unit 32 includes:
The determining module 321 is configured to determine the number N of batches according to the total number of data pieces in the queue and a preset average number, where N is a positive integer greater than or equal to 1.
The merging module 322 is configured to merge the data in the queue according to the number N of batches to obtain merged data of N batches.
The first processing unit 33 is specifically configured to: and respectively processing the merged data of each batch in the merged data of the N batches in a batch processing mode.
In one example, the number of batches is N = ⌈S/T⌉, where S is the total number of data pieces and T is the average number of pieces.
In an example, the apparatus provided in this embodiment further includes:
A second determining unit 41, configured to acquire, before the determining module 321 determines the number N of batches according to the total number of data pieces in the queue and the preset average number, the number of data pieces corresponding to each time slice in a plurality of time slices, where the plurality of time slices are time slices before the current time slice, and to take the average value of the number of data pieces in each time slice as the average number. A sketch of this averaging follows.
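A minimal Python sketch of that averaging, assuming a hypothetical recent_counts sequence holding the number of data pieces received in each time slice before the current one:

```python
from typing import Sequence

def preset_average(recent_counts: Sequence[int]) -> int:
    """Average number of data pieces per time slice over earlier slices;
    used as the preset average number T when computing the batch count N."""
    if not recent_counts:
        return 1  # assumed fallback when no history is available yet
    return max(1, round(sum(recent_counts) / len(recent_counts)))
```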
In an example, the apparatus provided in this embodiment further includes:
An obtaining unit 42, configured to acquire a first time length of the current time slice and a second time length required for processing the data corresponding to the current time slice, before the first determining unit determines the first data volume upon deciding to limit the data receiving speed.
A speed limiting unit 43, configured to limit the data receiving speed when it is determined that the second time length is greater than the first time length.
In an example, the apparatus provided in this embodiment further includes:
A second processing unit 44, configured to process the data corresponding to each time slice at the preset speed when it is determined that the second time length is less than or equal to the first time length.
In an example, the apparatus provided in this embodiment further includes:
A third processing unit 45, configured to process the data corresponding to each time slice at the preset speed when it is determined that the data difference between the first data volume and the pre-stored second data volume is less than or equal to the preset value and the first data volume is smaller than the second data volume, or when it is determined that the first data volume is greater than or equal to the second data volume.
The data processing apparatus based on the computing framework of this embodiment can execute the data processing method based on the computing framework provided in any of the above embodiments, and the implementation principle and the technical effect are similar, and are not described herein again.
Fig. 9 is a schematic structural diagram of a data processing device based on a computing framework according to an embodiment of the present application. As shown in fig. 9, the device may be used to execute the actions or steps of the data processing method based on a computing framework in the embodiments shown in fig. 2 to fig. 6, and specifically includes: a processor 2701, a memory 2702, and a communication interface 2703.
The memory 2702 is used to store computer programs.
The processor 2701 is configured to execute the computer program stored in the memory 2702 to implement the actions of the data processing method based on the computing framework in the embodiments shown in fig. 2 to fig. 6, which are not described again.
Optionally, the data processing device based on the computing framework may further include a bus 2704. The processor 2701, the memory 2702, and the communication interface 2703 may be connected to one another via the bus 2704; the bus 2704 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus 2704 may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in fig. 9, but this does not mean that there is only one bus or only one type of bus.
The embodiments of the present application may refer to one another, and the same or similar steps and terms are not repeated here.
Alternatively, some or all of the above modules may be embedded in a chip of the data processing device based on the computing framework in the form of an integrated circuit, and they may be implemented separately or integrated together. That is, the above modules may be configured as one or more integrated circuits implementing the above methods, for example: one or more Application Specific Integrated Circuits (ASICs), one or more digital signal processors (DSPs), or one or more Field Programmable Gate Arrays (FPGAs), among others.
In an exemplary embodiment, there is also provided a non-transitory computer-readable storage medium comprising instructions, such as the memory 2702 comprising instructions, which are executable by the processor 2701 of the data processing device based on the computing framework to perform the method described above. For example, the non-transitory computer-readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
A non-transitory computer-readable storage medium, in which instructions, when executed by a processor of a computing framework-based data processing apparatus, enable the computing framework-based data processing apparatus to perform the above-described computing framework-based data processing method.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the present application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored on a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, terminal device, or data center to another website, computer, terminal device, or data center by wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, or microwave) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a terminal device or a data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a Solid State Disk (SSD)), among others.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in the embodiments of the present application may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a general purpose or special purpose computer.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (16)

1. A method for data processing based on a computing framework, the method comprising:
determining a first data volume when it is determined that a data receiving speed is to be limited while receiving data based on a computing framework, wherein the first data volume is a first number of data pieces corresponding to a current time slice, and the data corresponding to the current time slice is data in a processing state;
when it is determined that a data difference between the first data volume and a pre-stored second data volume is greater than a preset value and the first data volume is smaller than the second data volume, wherein the second data volume is a second number of data pieces corresponding to a previous time slice, merging the data in the queue to obtain merged data belonging to the same batch, wherein the data in the queue is data that has been received in the current time slice and has not been processed;
and processing the merged data belonging to the same batch in a batch processing mode.
2. The method of claim 1, wherein merging the data in the queue to obtain merged data belonging to the same batch comprises:
determining the number N of batches according to the total number of data pieces in the queue and a preset average number, wherein N is a positive integer greater than or equal to 1;
merging the data in the queue according to the number N of batches to obtain merged data of N batches;
wherein the performing data processing on the merged data belonging to the same batch in a batch processing manner comprises:
and respectively processing the merged data of each batch in the merged data of the N batches in a batch processing mode.
3. The method of claim 2, wherein the number of batches is N = ⌈S/T⌉, wherein S is the total number of data pieces and T is the average number of pieces.
4. The method of claim 2, wherein before determining the number N of batches according to the total number of data pieces in the queue and a preset average number, the method further comprises:
acquiring the number of data pieces corresponding to each time slice in a plurality of time slices, wherein the plurality of time slices are time slices before the current time slice;
and taking the average value of the number of data pieces in each time slice as the average number.
5. The method of any of claims 1-4, wherein, before determining the first data volume when it is determined that the data receiving speed is to be limited, the method further comprises:
acquiring a first time length of the current time slice, and acquiring a second time length required for processing data corresponding to the current time slice;
and when the second time length is determined to be greater than the first time length, limiting the data receiving speed.
6. The method of claim 5, further comprising:
and processing the data corresponding to each time slice at a preset speed when the second time length is determined to be less than or equal to the first time length.
7. The method according to any one of claims 1-4, further comprising:
and when it is determined that the data difference between the first data volume and the pre-stored second data volume is less than or equal to the preset value and the first data volume is smaller than the second data volume, or when it is determined that the first data volume is greater than or equal to the second data volume, processing the data corresponding to each time slice at a preset speed.
8. A computing framework-based data processing apparatus, the apparatus comprising:
a first determining unit, configured to determine a first data volume when it is determined that a data receiving speed is to be limited while receiving data based on the computing framework, wherein the first data volume is a first number of data pieces corresponding to a current time slice, and the data corresponding to the current time slice is data in a processing state;
a merging unit, configured to merge data in a queue to obtain merged data belonging to the same batch when it is determined that a data difference between the first data volume and a pre-stored second data volume is greater than a preset value and the first data volume is smaller than the second data volume, wherein the second data volume is a second number of data pieces corresponding to a previous time slice, and the data in the queue is data that has been received in the current time slice and has not been processed;
the first processing unit is used for processing the merged data belonging to the same batch in a batch processing mode.
9. The apparatus of claim 8, wherein the merging unit comprises:
the determining module is used for determining the number N of the batches according to the total number of the data in the queue and a preset average number, wherein N is a positive integer greater than or equal to 1;
the merging module is used for merging the data in the queue according to the number N of the batches to obtain merged data of the N batches;
the first processing unit is specifically configured to:
and respectively processing the merged data of each batch in the merged data of the N batches in a batch processing mode.
10. The apparatus of claim 9, wherein the number of batches is N = ⌈S/T⌉, wherein S is the total number of data pieces and T is the average number of pieces.
11. The apparatus of claim 9, further comprising:
a second determining unit, configured to acquire, before the determining module determines the number N of batches according to the total number of data pieces in the queue and a preset average number, the number of data pieces corresponding to each time slice in a plurality of time slices, wherein the plurality of time slices are time slices before the current time slice; and to take the average value of the number of data pieces in each time slice as the average number.
12. The apparatus of any one of claims 8-11, further comprising:
the acquisition unit is used for acquiring a first time length of the current time slice and a second time length required for processing the data corresponding to the current time slice before the first data volume is determined, when the first determination unit determines to limit the data receiving speed;
and the speed limiting unit is used for limiting the data receiving speed when it is determined that the second time length is greater than the first time length.
13. The apparatus of claim 12, further comprising:
and the second processing unit is used for processing the data corresponding to each time slice at a preset speed when the second time length is determined to be less than or equal to the first time length.
14. The apparatus of any one of claims 8-11, further comprising:
and the third processing unit is used for processing the data corresponding to each time slice at a preset speed when it is determined that the data difference between the first data volume and the pre-stored second data volume is less than or equal to a preset value and the first data volume is smaller than the second data volume, or when it is determined that the first data volume is greater than or equal to the second data volume.
15. A computing framework-based data processing apparatus, comprising: a processor, a memory, and a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the method of any one of claims 1-7.
16. A computer-readable storage medium, having stored thereon a computer program for execution by a processor to perform the method of any one of claims 1-7.
CN201911281274.1A 2019-12-13 2019-12-13 Data processing method, device and equipment based on computing framework and storage medium Pending CN112988805A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911281274.1A CN112988805A (en) 2019-12-13 2019-12-13 Data processing method, device and equipment based on computing framework and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911281274.1A CN112988805A (en) 2019-12-13 2019-12-13 Data processing method, device and equipment based on computing framework and storage medium

Publications (1)

Publication Number Publication Date
CN112988805A true CN112988805A (en) 2021-06-18

Family

ID=76332420

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911281274.1A Pending CN112988805A (en) 2019-12-13 2019-12-13 Data processing method, device and equipment based on computing framework and storage medium

Country Status (1)

Country Link
CN (1) CN112988805A (en)

Similar Documents

Publication Publication Date Title
Shah et al. The MDS queue: Analysing the latency performance of erasure codes
JP7020616B2 (en) Methods and systems for resource scheduling
CN109726004B (en) Data processing method and device
CN108574645B (en) Queue scheduling method and device
US20110172963A1 (en) Methods and Apparatus for Predicting the Performance of a Multi-Tier Computer Software System
CN106713396A (en) Server scheduling method and system
CN111522786A (en) Log processing system and method
WO2017185615A1 (en) Method for determining service status of service processing device and scheduling device
CN115150471B (en) Data processing method, apparatus, device, storage medium, and program product
CN110727507B (en) Message processing method and device, computer equipment and storage medium
CN112182043A (en) Log data query method, device, equipment and storage medium
CN109800085B (en) Resource configuration detection method and device, storage medium and electronic equipment
US9043535B1 (en) Minimizing application response time
Pienta et al. On the parallel simulation of scale-free networks
CN110955461B (en) Processing method, device, system, server and storage medium for computing task
CN111831408A (en) Asynchronous task processing method and device, electronic equipment and medium
CN110909072B (en) Data table establishment method, device and equipment
CN112988805A (en) Data processing method, device and equipment based on computing framework and storage medium
JP6847112B2 (en) How and devices to process data after node reboot
CN113268327A (en) Transaction request processing method and device and electronic equipment
CN109062707B (en) Electronic device, method for limiting inter-process communication thereof and storage medium
KR101989222B1 (en) Method, apparatus and system for detecting structural variations
CN115396319B (en) Data stream slicing method, device, equipment and storage medium
CN111159236A (en) Data processing method and device, electronic equipment and storage medium
CN117009094B (en) Data oblique scattering method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination