CN111324392A

CN111324392A - Method and device for automatically adjusting data processing time window

Info

Publication number: CN111324392A
Application number: CN201811522063.8A
Authority: CN
Inventors: 安金龙; 刘业辉; 张宁; 张飞; 王彦明
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Priority date: 2018-12-13
Filing date: 2018-12-13
Publication date: 2020-06-23

Abstract

The application discloses a method for automatically adjusting a data processing time window, which comprises the steps of obtaining the data volume to be processed; judging whether the current data processing time window needs to be adjusted or not according to the data volume to be processed and a standard data processing time window, wherein the standard data processing time window is a data processing time window which is set in advance for reference; and if the adjustment is needed, determining a new data processing time window according to a preset data processing time window record table, and taking the new data processing time window as the current data processing time window, wherein the data processing time window record table comprises the corresponding relation between the data processing time window and the single window processing capacity value. By applying the technical scheme disclosed by the application, the data processing time window can be dynamically adjusted according to the data volume condition, so that the processing of common data can be coped with, and corresponding strategies can be adopted in time when the flood peak data arrives, and the smoothness of data processing is ensured.

Description

Method and device for automatically adjusting data processing time window

Technical Field

The present application relates to the field of big data processing technologies, and in particular, to a method and an apparatus for automatically adjusting a data processing time window.

Background

With the development of electronic commerce, people are increasingly accustomed to buying goods online, and shopping festivals that encourage online consumption have even appeared. During these shopping-intensive periods, thousands of clients may submit orders from all directions almost simultaneously. The data generated by these orders hits the network-side back-end system and the big data computing platform like a flood, causing a data flood peak.

How to deal with such data flood peaks has become a troublesome problem in big data processing. With the rapid development of big data, service scenarios are becoming more and more complex; the offline batch processing framework MapReduce cannot meet service requirements, and a large number of scenarios need real-time data processing results for analysis and decision-making. Real-time computing is usually based on a distributed parallel computing framework, because a single computer is far from able to process, in real time, a high data volume arriving in a short time. However, even under a distributed parallel computing framework, the prior art generally copes with a data flood peak by adding servers or expanding the capacity of existing servers, so as to improve the capacity of the back-end system for processing big data. This approach mainly adds hardware, which increases hardware cost and maintenance cost.

Disclosure of Invention

The application provides a method for automatically adjusting a data processing time window, which can deal with the processing of flood peak data on the basis of not increasing the hardware cost.

The method for automatically adjusting the data processing time window comprises the following steps:

acquiring the data volume to be processed;

judging whether the current data processing time window needs to be adjusted or not according to the data volume to be processed and a standard data processing time window, wherein the standard data processing time window is a data processing time window which is set in advance for reference;

and if the adjustment is needed, determining a new data processing time window according to a preset data processing time window record table, and taking the new data processing time window as the current data processing time window, wherein the data processing time window record table comprises the corresponding relation between the data processing time window and the single window processing capacity value.

Further, the step of acquiring the data amount to be processed comprises:

acquiring the current position of data in the data queue;

acquiring the last position of the data in the recorded data queue;

and determining the data volume to be processed according to the current position and the last position.

Further, the step of determining whether the current data processing time window needs to be adjusted according to the amount of data to be processed and the standard data processing time window includes:

judging whether the data volume to be processed has backlog according to the data volume to be processed and the data processing capacity value corresponding to the standard data processing time window, and if not, taking the standard data processing time window as a new data processing time window; otherwise,

and calculating the ratio of the data volume to be processed to the data processing capacity value corresponding to the standard data processing time window, and determining whether the current data processing time window needs to be adjusted according to the ratio result.

Further, the data processing time window record table comprises a standard data processing time window, an ultrahigh data processing time window and a flood peak data processing time window, wherein the ultrahigh data processing time window is larger than the standard data processing time window, and the flood peak data processing time window is larger than the ultrahigh data processing time window;

the correspondence between the data processing time window and the single window processing capability value includes:

a standard data processing time window and a processing capacity value corresponding to the single standard data processing time window;

an ultrahigh data processing time window and the processing capacity value corresponding to a single ultrahigh data processing time window;

and a flood peak data processing time window and the processing capacity value corresponding to a single flood peak data processing time window.

Further, the step of determining a new data processing time window according to a preset data processing time window record table comprises:

if the ratio result is larger than the preset first ratio threshold and smaller than a preset second ratio threshold, determining that the new data processing time window is an ultrahigh data processing time window;

and if the ratio result is greater than or equal to the preset second ratio threshold, determining that the new data processing time window is the flood peak data processing time window.

The application also provides a device for automatically adjusting the data processing time window, which can deal with the flood peak data on the basis of not increasing the hardware cost. The device includes:

a data amount acquisition unit for acquiring a data amount to be processed;

the judging unit judges whether the current data processing time window needs to be adjusted or not according to the data volume to be processed and a standard data processing time window, wherein the standard data processing time window is a data processing time window which is set in advance for reference;

and the window adjusting unit is used for determining a new data processing time window according to a preset data processing time window record table, taking the new data processing time window as the current data processing time window, and the data processing time window record table comprises the corresponding relation between the data processing time window and the single window processing capacity value.

Further, the judging unit includes:

the first judging subunit is used for judging whether backlog exists in the data volume to be processed according to the data volume to be processed and the data processing capacity value corresponding to the standard data processing time window, and if the backlog does not exist, the standard data processing time window is used as a new data processing time window; otherwise, triggering a second judgment subunit;

and the second judgment subunit is used for calculating the ratio of the data volume to be processed to the data processing capacity value corresponding to the standard data processing time window, determining whether the current data processing time window needs to be adjusted according to the ratio result, and triggering the window adjusting unit.

Further, the window adjusting unit includes:

the storage subunit is used for storing a data processing time window record table; the data processing time window record table comprises a standard data processing time window, an ultrahigh data processing time window and a flood peak data processing time window, wherein the ultrahigh data processing time window is larger than the standard data processing time window, and the flood peak data processing time window is larger than the ultrahigh data processing time window; the correspondence between the data processing time window and the single window processing capability value includes: a standard data processing time window and the processing capacity value corresponding to a single standard data processing time window; an ultrahigh data processing time window and the processing capacity value corresponding to a single ultrahigh data processing time window; and a flood peak data processing time window and the processing capacity value corresponding to a single flood peak data processing time window;

a new window determining subunit, configured to determine, when a new data processing time window is determined, if the ratio result is greater than the preset first ratio threshold and smaller than a preset second ratio threshold, that the new data processing time window is an ultrahigh data processing time window; and if the ratio result is greater than or equal to the preset second ratio threshold, determining that the new data processing time window is the flood peak data processing time window.

The present application also provides a computer readable storage medium storing instructions that, when executed by a processor, cause the processor to perform the steps in the method of automatically adjusting a data processing time window as described above.

The present application also provides a server comprising an input interface and an output interface, the server further comprising a computer-readable storage medium as described above, and a processor that can execute instructions in the computer-readable storage medium.

According to the technical scheme, the data processing time window can be dynamically adjusted according to the data volume condition by applying the technical scheme disclosed by the application, so that the processing of common data can be dealt with, and corresponding strategies can be adopted in time when flood peak data arrives, so that the smoothness of data processing is ensured.

Drawings

Fig. 1 is a flowchart of a first embodiment of the method of the present application.

Fig. 2 is a flowchart of a second embodiment of the method of the present application.

Fig. 3 is a structural diagram of a first embodiment of the apparatus of the present application.

Fig. 4 is a structural diagram of a second embodiment of the apparatus of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is further described in detail below by referring to the accompanying drawings and examples.

According to the method and the device, the data processing capacity can be automatically adjusted according to the data volume so as to cope with a data flood peak that may occur. The time at which a data flood peak occurs cannot be predicted, so dynamic processing capability is required. In the embodiment of the present application, this dynamic processing capability is embodied in automatically adjusting the data time window. The data time window refers to a time period during which the processor processes data, and the data is processed in batches according to this time period. If the data volume is larger than the data time window can hold, the memory may overflow; if the data volume is much smaller than the set data time window can hold, resources are wasted.

In practical application, Spark Streaming is a distributed big data real-time computing framework provided by Spark, and can provide dynamic, high-throughput, fault-tolerant stream data processing. For such streaming data, an embodiment of the present application provides a method for automatically adjusting a data processing time window. Fig. 1 is a flowchart of the method. As shown in fig. 1, the method includes:

step S01: the amount of data to be processed is obtained.

In practice, Spark Streaming can obtain data from various data sources, such as Kafka, Flume, Kinesis, Twitter, and TCP sockets, and the amount of data to be processed can then be obtained from these queues.

Step S02: and judging whether the current data processing time window needs to be adjusted or not according to the data volume to be processed and a standard data processing time window, wherein the standard data processing time window is a data processing time window which is set in advance for reference.

In data processing, since changes in data volume cannot be predicted reliably, a standard data processing time window can be set in advance in order to fully utilize the processing performance without wasting resources. The processor processes the data in batches according to this standard data processing time window. Changes in data volume can also be judged with reference to this standard data processing time window. For example, if the data volume can be processed within the standard data processing time window, there is no backlog of data; if the data volume cannot be processed within the standard data processing time window, there is a backlog of data.

Step S03: and if the adjustment is needed, determining a new data processing time window according to a preset data processing time window record table, and taking the new data processing time window as the current data processing time window, wherein the data processing time window record table comprises the corresponding relation between the data processing time window and the single window processing capacity value.

In practical applications, different data processing time windows correspond to different processing capacities. That is, the larger the data processing time window, the more data can be processed in a single window, and the greater its capacity. In this step, the data processing time windows and their corresponding processing capacity values are recorded in advance, and when adjustment is needed, the record table can be searched to find the appropriate data processing time window.

In order to better illustrate the embodiments of the present application, a second method embodiment is described in detail below. In the second method embodiment, a data processing time window record table is first set for subsequent use. The record table can be set according to experience, or determined by measurement with experimental data. If it is determined by experiment, different data processing time windows can be set in advance. For example, the data processing time windows can be divided into three types, namely a standard data processing time window, an ultrahigh data processing time window and a flood peak data processing time window, and different processing capacity values are then matched with the three types, as shown in Table 1:

Table 1

That is, when the amount of data that needs to be processed is known, the required capacity can be determined from the data processing time window record table shown in Table 1, and the corresponding time window can thus be determined. The standard data processing time window can be set as the default time window of the system; according to experience it can be set to 10 seconds and can generally process 100,000 data records, where 1 data record is 1 Kbyte. The ultrahigh data processing time window is a time window for processing a large backlog that has not yet reached the peak stage; according to experience it can be set to different durations such as 20 seconds, 30 seconds or 60 seconds, which can process 200,000, 300,000 or 600,000 data records respectively, where 1 data record is 1 Kbyte. The flood peak data processing time window is a time window for processing the highest flood peak data; according to experience it can be set to 90 seconds and can process 900,000 data records, where 1 data record is 1 Kbyte. In practical applications, these data processing time windows can be further refined and classified into more levels. In this way, the data processing time window record table can take the form shown in Table 2:

Table 2

The 10-second data processing time window can be used as the standard data processing time window, the 90-second data processing time window can be used as the flood peak data processing time window, and the other data processing time windows of different levels serve as ultrahigh data processing time windows of the respective levels. The capacity value of a single time window can be represented by the number of records and the data size, and the corresponding processing time can also be recorded.

In an experiment, data in Kafka can be used as the data source to be processed by the processor. Data can be written into Kafka at a speed of 10,000 data records per second, with an average size of 1 Kbyte per data record. After the data is written, the processor may be made to perform some operation on the data in Kafka, such as a map operation, to measure its processing capacity and processing time. In practice, the processor may also be made to perform other types of operations, such as reduce or join operations.
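As a rough consistency check on these figures (an inference from the numbers in the description, not a statement made by the source), the number of records accumulated during one window is the write rate r multiplied by the window length T. With r = 10,000 records per second, this reproduces the capacity values quoted for the 10-second and 90-second windows:

```latex
\[
r \cdot T = 10\,000\ \tfrac{\text{records}}{\text{s}} \times 10\ \text{s} = 100\,000\ \text{records},
\qquad
10\,000\ \tfrac{\text{records}}{\text{s}} \times 90\ \text{s} = 900\,000\ \text{records}.
\]
```

At 1 Kbyte per record, the standard window therefore corresponds to roughly 100 Mbytes of data per batch.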

Whether it is determined by measurement in advance or set according to experience, a data processing time window record table can be set, and the table at least comprises the corresponding relation between the data processing time window and the single window processing capacity value.
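The record table described above can be sketched as a small in-memory structure. The window lengths and capacity values below are the ones quoted in the description; the names `WindowRecord`, `RECORD_TABLE` and `capacity_for` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowRecord:
    window_seconds: int     # length of the data processing time window
    capacity_records: int   # records a single window can process (1 Kbyte each)
    level: str              # "standard", "ultrahigh" or "flood_peak"


# Values quoted in the description; the table layout itself is an assumption.
RECORD_TABLE = [
    WindowRecord(10, 100_000, "standard"),
    WindowRecord(20, 200_000, "ultrahigh"),
    WindowRecord(30, 300_000, "ultrahigh"),
    WindowRecord(60, 600_000, "ultrahigh"),
    WindowRecord(90, 900_000, "flood_peak"),
]


def capacity_for(window_seconds: int) -> int:
    """Look up the single-window capacity for a given window length."""
    for record in RECORD_TABLE:
        if record.window_seconds == window_seconds:
            return record.capacity_records
    raise KeyError(f"no record for a {window_seconds}-second window")
```

Keeping the level name next to each entry makes it possible to tell the standard and flood peak windows apart from the intermediate ultrahigh levels without hard-coding their lengths elsewhere.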

Fig. 2 is a flowchart of a method for automatically adjusting a data processing time window according to a second embodiment. As shown in fig. 2, the method includes:

step L201: and acquiring the current position of the data in the data queue.

Step L202: and acquiring the last position of the data in the recorded data queue.

Step L203: and determining the data volume to be processed according to the current position and the last position.

Here, steps L201 to L203 acquire the amount of data to be processed, i.e., they are a specific implementation of step S01. Still taking Kafka data as an example, in practical applications, data to be transmitted to the processor for processing is stored in the queue in advance. After each batch of data is transferred from Kafka to the processor, its position is identified by a position value, which indicates the start position of the remaining data in Kafka and is denoted as offset1. Similarly, when new data is written into Kafka, it is also identified by a position value, which indicates the end position of the data in Kafka and is denoted as offset2. Thus, the amount of data in Kafka, i.e., the amount of data to be processed by the processor, can be calculated from the positions of offset1 and offset2. Here offset2 is the current position of the data in step L201, and offset1 is the last position of the data in step L202. In practical applications, the acquired data amount to be processed may be recorded in a configuration file together with the current time, for example: time point 2018/5/6 11:07:00, data volume 200,000.
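A minimal sketch of steps L201 to L203, assuming the two positions are plain integer offsets that have already been read from the queue. The function names and the dictionary layout of the recorded entry are hypothetical; no Kafka client API is used here.

```python
from datetime import datetime


def pending_records(last_offset: int, current_offset: int) -> int:
    """Amount of data to be processed: offset2 (current) minus offset1 (last)."""
    if current_offset < last_offset:
        raise ValueError("current offset must not be behind the last offset")
    return current_offset - last_offset


def make_log_entry(last_offset: int, current_offset: int, now: datetime) -> dict:
    """Build the entry that would be written to the configuration file."""
    return {
        # Same layout as the example in the text, e.g. 2018/5/6 11:07:00
        "time": f"{now.year}/{now.month}/{now.day} {now:%H:%M:%S}",
        "pending_records": pending_records(last_offset, current_offset),
    }
```

For instance, with a last position of 1,000,000 and a current position of 1,200,000, the entry records a pending volume of 200,000, matching the example above.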

Step L204: judging whether the data volume to be processed has backlog according to the data volume to be processed and the data processing capacity value corresponding to the standard data processing time window, and executing the step L205 if the backlog does not exist; otherwise, step L206 is performed.

In practical application, if the data volume to be processed is larger than the data processing capacity value corresponding to the standard data processing time window, there is a data backlog; otherwise, there is no backlog. For example, if the amount of data currently to be processed is 200,000 data records, the standard data processing time window is 10 seconds, and the corresponding capacity is 100,000 records, a data backlog has currently been generated.

Step L205: and taking the standard data processing time window as a new data processing time window, and exiting the process.

In practical applications, if there is no backlog of data, the standard data processing time window can be restored regardless of which data processing time window is currently in use, so that the processor no longer runs at high load, thereby saving resources.
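Steps L204 and L205 can be sketched as follows, assuming the standard window of 10 seconds and capacity of 100,000 records used in the examples. The constant and function names are hypothetical, and returning `None` to signal "continue with the ratio check" is an illustrative convention rather than something the source specifies.

```python
from typing import Optional

STANDARD_WINDOW_SECONDS = 10   # standard window from the examples
STANDARD_CAPACITY = 100_000    # records one standard window can process


def has_backlog(pending: int, standard_capacity: int = STANDARD_CAPACITY) -> bool:
    """Backlog exists only when the pending volume exceeds the standard capacity."""
    return pending > standard_capacity


def restore_if_no_backlog(pending: int) -> Optional[int]:
    """Return the standard window when there is no backlog (step L205).

    Returns None when a backlog exists, meaning the caller should go on
    to the ratio calculation of step L206.
    """
    if not has_backlog(pending):
        return STANDARD_WINDOW_SECONDS
    return None
```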

Step L206: and calculating the ratio of the data volume to be processed to the data processing capacity value corresponding to the standard data processing time window.

Step L207: determining whether the current data processing time window needs to be adjusted according to the ratio result, and executing a step L208 if the current data processing time window needs to be adjusted; otherwise, it is determined that no adjustment is required and the process exits.

If there is a data backlog, steps L206 and L207 of this embodiment determine whether to adjust the current data processing time window. This embodiment uses the preset standard data processing time window as the reference for deciding whether there is a backlog. However, in order to avoid switching the data processing time window frequently, the data amount to be processed is not directly compared with the current data processing time window; instead, the ratio of the data amount to be processed to the data processing capacity value corresponding to the standard data processing time window is used as the criterion for deciding whether to adjust.

Step L208: and determining a new data processing time window according to a preset data processing time window record table, and taking the new data processing time window as the current data processing time window. The method specifically comprises the following steps:

1) If the ratio result is larger than the preset first ratio threshold and smaller than the preset second ratio threshold, determining that the new data processing time window is an ultrahigh data processing time window.

2) If the ratio result is greater than or equal to the preset second ratio threshold, determining that the new data processing time window is the flood peak data processing time window.

In practical application, the first ratio threshold and the second ratio threshold can be set as the conditions for adjusting to an ultrahigh data processing time window or a flood peak data processing time window. When the ratio exceeds the first ratio threshold but is below the second ratio threshold, the window is set to an ultrahigh data processing time window. Of course, if the ultrahigh data processing time window has multiple levels, the exact level can be further looked up according to the data volume. When the ratio reaches or exceeds the second ratio threshold, the window is set to the flood peak data processing time window.
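The source does not spell out how the exact ultrahigh level is looked up. One plausible rule, shown here purely as an assumption, is to pick the smallest ultrahigh window whose single-window capacity covers the pending volume, falling back to the largest ultrahigh level otherwise. The level values are those quoted in the description; the names are hypothetical.

```python
# (window_seconds, capacity_records) for the ultrahigh levels, ascending.
ULTRAHIGH_LEVELS = [(20, 200_000), (30, 300_000), (60, 600_000)]


def pick_ultrahigh_window(pending: int) -> int:
    """Smallest ultrahigh window that can hold `pending` records.

    Assumed rule: if no ultrahigh level is large enough, use the largest one.
    """
    for window_seconds, capacity in ULTRAHIGH_LEVELS:
        if capacity >= pending:
            return window_seconds
    return ULTRAHIGH_LEVELS[-1][0]
```

Under this rule a pending volume of 600,000 records maps to the 60-second window, which agrees with the worked example in the description.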

Taking Table 2 as an example, assume that the first ratio threshold is N1 = 3 and the second ratio threshold is N2 = 10. If the data volume to be processed is 200,000 data records and the standard data processing time window is 10 seconds, step L204 determines that a data backlog currently exists. However, since the ratio (200,000 / 100,000 = 2) does not exceed the first ratio threshold N1, no adjustment is currently required and data can continue to be processed in accordance with the standard data processing time window of 10 seconds.

If the data volume to be processed is 600,000 data records and the standard data processing time window is 10 seconds, step L204 determines that a data backlog currently exists. Since the ratio (6) exceeds the first ratio threshold N1 but does not reach the second ratio threshold N2, it is determined that the current data processing time window needs to be adjusted to an ultrahigh data processing time window. According to the correspondence in Table 2, 60 seconds can be selected as the new data processing time window.

If the amount of data to be processed is large, reaching 1,200,000 data records, and the standard data processing time window is 10 seconds, step L204 determines that a data backlog currently exists. Moreover, since the ratio (12) exceeds the second ratio threshold N2, it is determined that the current data processing time window needs to be adjusted to the flood peak data processing time window. According to the correspondence in Table 2, 90 seconds can be selected as the new data processing time window.

In practical applications, the processing performance of the processor is limited. If the data processing time window kept growing with the data volume, the data processing speed would not increase; instead it would drop, because a data volume far beyond the processing performance of the processor causes memory overflow. Therefore, if the flood peak data processing time window is assumed to be 90 seconds, the data processing time window is not expanded any further even if the amount of data to be processed exceeds 900,000 data records, so as to ensure that memory overflow does not occur.
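The three worked examples can be reproduced with a short sketch of steps L204 to L208. It assumes N1 = 3, N2 = 10, a standard capacity of 100,000 records, and, for simplicity, a single 60-second ultrahigh level; the function and constant names are hypothetical.

```python
STANDARD_WINDOW = 10      # seconds
ULTRAHIGH_WINDOW = 60     # seconds; single ultrahigh level assumed here
FLOOD_PEAK_WINDOW = 90    # seconds; the window is never expanded beyond this
STANDARD_CAPACITY = 100_000
N1 = 3                    # first ratio threshold
N2 = 10                   # second ratio threshold


def next_window(pending: int, current_window: int) -> int:
    """Return the window to use for the next batch."""
    # L204/L205: no backlog, so restore the standard window.
    if pending <= STANDARD_CAPACITY:
        return STANDARD_WINDOW
    # L206: ratio of pending volume to the standard capacity.
    ratio = pending / STANDARD_CAPACITY
    # L208: choose the level from the thresholds.
    if ratio >= N2:
        return FLOOD_PEAK_WINDOW
    if ratio > N1:
        return ULTRAHIGH_WINDOW
    # L207: backlog exists but the ratio is small, so keep the current window.
    return current_window


print(next_window(200_000, 10))    # ratio 2: no adjustment
print(next_window(600_000, 10))    # ratio 6: ultrahigh window
print(next_window(1_200_000, 10))  # ratio 12: flood peak window
```

Because the flood peak window is the largest value the function can return, the cap on window growth follows directly from the table rather than from a separate check.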

By applying the scheme of the embodiment, the data volume to be processed is acquired in advance, the data processing time window recording table is set, and the data processing time window is dynamically adjusted according to the data volume condition. Therefore, in the case of a severe data backlog, or flood peak data, the data processing time window may be automatically adjusted to the flood peak data processing time window to cope with the large data case. Under the condition that data is not backlogged, the standard data processing time window can be automatically recovered to save resources.

In correspondence with the above method embodiments, the present application also provides a device for automatically adjusting the data processing time window. Fig. 3 is a structural diagram of a first embodiment of the apparatus of the present application. As shown in fig. 3, the apparatus includes a data amount acquisition unit 301, a judging unit 302, and a window adjusting unit 303, wherein:

The data amount acquisition unit 301 acquires the amount of data to be processed. The judging unit 302 determines whether the current data processing time window needs to be adjusted according to the data amount to be processed and a standard data processing time window, which is a data processing time window set in advance for reference. The window adjusting unit 303 determines a new data processing time window according to a data processing time window record table set in advance, and uses the new data processing time window as the current data processing time window, where the data processing time window record table includes a corresponding relationship between the data processing time window and a single window processing capability value.

That is, the data amount obtaining unit 301 obtains the data amount to be processed from the data queue, the judging unit 302 judges whether the current data processing time window needs to be adjusted, and if so, the window adjusting unit 303 is triggered. The window adjusting unit 303 determines a new data processing time window according to a data processing time window record table set in advance.
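The cooperation of the three units can be sketched as three small classes wired together. This is only an illustration of the structure: the class and method names, the `read_positions` hook, and the lookup rule inside `adjust` (smallest window whose capacity covers the pending volume, else the largest window) are all assumptions not taken from the source.

```python
from typing import Callable, Dict, Tuple


class DataAmountAcquisitionUnit:
    """Unit 301: derives the pending volume from two queue positions."""

    def __init__(self, read_positions: Callable[[], Tuple[int, int]]):
        self._read_positions = read_positions  # hypothetical hook: (last, current)

    def acquire(self) -> int:
        last, current = self._read_positions()
        return current - last


class JudgingUnit:
    """Unit 302: decides whether the current window needs adjusting."""

    def __init__(self, standard_capacity: int, first_threshold: float):
        self._standard_capacity = standard_capacity
        self._first_threshold = first_threshold

    def needs_adjustment(self, pending: int) -> bool:
        return pending / self._standard_capacity > self._first_threshold


class WindowAdjustingUnit:
    """Unit 303: picks a new window from the record table."""

    def __init__(self, record_table: Dict[int, int], current_window: int):
        self._table = dict(sorted(record_table.items()))  # seconds -> capacity
        self.current_window = current_window

    def adjust(self, pending: int) -> int:
        for window, capacity in self._table.items():
            if capacity >= pending:
                self.current_window = window
                return window
        self.current_window = max(self._table)  # never exceed the largest window
        return self.current_window


def run_once(acquire_unit, judging_unit, adjusting_unit) -> int:
    """One pass: acquire, judge, and adjust only when triggered."""
    pending = acquire_unit.acquire()
    if judging_unit.needs_adjustment(pending):
        adjusting_unit.adjust(pending)
    return adjusting_unit.current_window
```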

Fig. 4 is a structural diagram of a second embodiment of the apparatus of the present application. As shown in fig. 4, the apparatus includes a data amount acquisition unit 301, a judging unit 302, and a window adjusting unit 303. The judging unit 302 includes:

a first judging subunit 3021, configured to judge whether the amount of data to be processed has backlog according to the amount of data to be processed and the data processing capability value corresponding to the standard data processing time window, and if not, take the standard data processing time window as a new data processing time window; otherwise, the second judging subunit 3022 is triggered.

A second judging subunit 3022, configured to calculate a ratio between the data amount to be processed and the data processing capability value corresponding to the standard data processing time window, determine whether the current data processing time window needs to be adjusted according to a ratio result, and trigger the window adjusting unit 303.

The window adjusting unit 303 includes:

a storage subunit 3031, configured to store a data processing time window record table; the data processing time window record table comprises a standard data processing time window, an ultrahigh data processing time window and a flood peak data processing time window, wherein the ultrahigh data processing time window is larger than the standard data processing time window, and the flood peak data processing time window is larger than the ultrahigh data processing time window; the correspondence between the data processing time window and the single window processing capability value includes: a standard data processing time window and the processing capacity value corresponding to a single standard data processing time window; an ultrahigh data processing time window and the processing capacity value corresponding to a single ultrahigh data processing time window; and a flood peak data processing time window and the processing capacity value corresponding to a single flood peak data processing time window.

A new window determining subunit 3032, configured to determine, when determining a new data processing time window, that the new data processing time window is an ultrahigh data processing time window if the ratio result is greater than the preset first ratio threshold and smaller than a preset second ratio threshold; and if the ratio result is greater than or equal to the preset second ratio threshold, determining that the new data processing time window is the flood peak data processing time window.

The data processing time window record table stored in the storage subunit 3031 may be as shown in table one or table two. That is, the data amount obtaining unit 301 obtains the data amount to be processed from the data queue, the first judging subunit 3021 judges whether the data amount to be processed has backlog according to the data amount to be processed and the data processing capability value corresponding to the standard data processing time window, and if not, the standard data processing time window is used as a new data processing time window; otherwise, the second judging subunit 3022 is triggered. The second judging subunit 3022 calculates a ratio between the data amount to be processed and the data processing capability value corresponding to the standard data processing time window, determines whether the current data processing time window needs to be adjusted according to a ratio result, and triggers the window adjusting unit 303. The new window determining subunit 3032 in the window adjusting unit 303 queries the data processing time window record table stored in the storage subunit 3031, and determines that the new data processing time window is an ultrahigh data processing time window if the ratio result is greater than the preset first ratio threshold and smaller than the preset second ratio threshold; and if the ratio result is greater than or equal to the preset second ratio threshold, determining that the new data processing time window is the flood peak data processing time window.

The application also provides a computer readable storage medium, which can be a storage medium such as a ROM, a RAM, an EPROM, a SIM card, an optical disk and the like, and is used for storing instructions. The instructions, when executed by the processor, cause the processor to perform the steps of the method of automatically adjusting a data processing time window in the embodiments described above.

The application also provides a server, which comprises an input interface and an output interface, the computer-readable storage medium and a processor capable of executing instructions in the computer-readable storage medium.

The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the scope of protection of the present application.

Claims

1. A method for automatically adjusting a data processing time window, the method comprising:

acquiring the data volume to be processed;

2. The method of claim 1, wherein the step of obtaining the amount of data to be processed comprises:

acquiring the current position of data in the data queue;

acquiring the last position of the data in the recorded data queue;

3. The method of claim 1, wherein the step of determining whether the current data processing time window needs to be adjusted based on the amount of data to be processed and the standard data processing time window comprises:

4. The method of claim 3, wherein the data processing time window log comprises a standard data processing time window, an ultra-high data processing time window, and a flood peak data processing time window, the ultra-high data processing time window being greater than the standard data processing time window, the flood peak data processing time window being greater than the ultra-high data processing time window;

5. The method of claim 4, wherein the step of determining a new data processing time window according to a pre-set data processing time window log comprises:

6. An apparatus for automatically adjusting a data processing time window, the apparatus comprising:

a data amount acquisition unit for acquiring a data amount to be processed;

7. The apparatus according to claim 6, wherein the discriminating unit comprises:

8. The apparatus of claim 7, wherein the window adjusting unit comprises:

9. A computer readable storage medium storing instructions which, when executed by a processor, cause the processor to perform the steps in the method of automatically adjusting a data processing time window of any one of claims 1 to 5.

10. A server comprising an input interface and an output interface, the server further comprising the computer-readable storage medium of claim 9, and a processor that executes instructions in the computer-readable storage medium.