CN111444023A - Data processing method, device, equipment and readable storage medium - Google Patents

Data processing method, device, equipment and readable storage medium

Info

Publication number
CN111444023A
CN111444023A (Application CN202010284693.7A)
Authority
CN
China
Prior art keywords
data; data set; memory; data group; group
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010284693.7A
Other languages
Chinese (zh)
Inventor
栾英英 (Luan Yingying)
张静 (Zhang Jing)
童楚婕 (Tong Chujie)
严洁 (Yan Jie)
彭勃 (Peng Bo)
李福洋 (Li Fuyang)
徐晓健 (Xu Xiaojian)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bank of China Ltd
Original Assignee
Bank of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bank of China Ltd filed Critical Bank of China Ltd
Priority to CN202010284693.7A priority Critical patent/CN111444023A/en
Publication of CN111444023A publication Critical patent/CN111444023A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory

Abstract

The data processing method provided by the embodiments of the application groups a data set at risk of memory overflow to obtain a plurality of data groups, runs part of the data groups (namely, a second data group), persists the data groups that are not yet run (namely, a first data group), and runs the first data group after the running result of the second data group is obtained. Further, the running results of all the data groups are fused to obtain the running result of the target data set. On the one hand, when the second data group or the first data group is run, the amount of memory occupied is smaller than the amount of memory required to run the whole target data set; on the other hand, the data groups that are not running are persisted, so that more memory is available for the data group that is running. Therefore, by processing the data set in groups, the method greatly reduces the possibility of memory overflow and avoids failure of the Spark task.

Description

Data processing method, device, equipment and readable storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method, an apparatus, a device, and a readable storage medium for processing data.
Background
Spark (a compute engine) is a big-data parallel computing framework based on in-memory computation. The RDD (Resilient Distributed Dataset) is a core concept in Spark. During Spark execution, a job is divided into multiple Stages (execution stages) according to the dependencies between RDDs; each Stage includes one or more RDDs, and shuffle operators are involved between different Stages. Therefore, a large number of shuffle operators are involved during Spark execution, often requiring more memory than the data volume itself, at which point memory overflow easily occurs.
In the prior art, when memory overflows, the task can only be restarted after the problem has been solved manually, so the task is blocked and time and resources are wasted.
Disclosure of Invention
In view of the above, the present application provides a data processing method, apparatus, device and readable storage medium, as follows:
a method of processing data, comprising:
acquiring a target data set, wherein the target data set is a data set with memory overflow risk;
dividing the target data set to obtain at least two groups of data groups;
persisting a first data group, the first data group comprising at least one data group in the target data set;
operating a second data group and acquiring an operation result of the second data group, wherein the second data group comprises other data groups except the first data group in the target data set;
operating the first data group and acquiring an operation result of the first data group;
and fusing the operation results of all the data groups to obtain the operation result of the target data set.
Optionally, before the acquiring the target data set, further comprising:
acquiring state information of each operation stage, wherein the state information at least comprises memory usage and memory recovery frequency;
when the state information of any one of the operation stages meets a preset condition, determining that a data set included in the operation stage is a data set with a memory overflow risk, wherein the preset condition comprises: the memory usage amount of the operation stage is greater than a first preset threshold, and the memory recovery frequency of the operation stage is greater than a second preset threshold.
Optionally, the status information further includes: CPU usage;
the preset conditions further include: and the CPU usage amount of the running stage is less than a third preset threshold value.
Optionally, the method further comprises:
and when the state information of any one operation stage meets the preset condition, sending out an early warning signal.
Optionally, after the running the second data group and obtaining the running result of the second data group, the method further includes:
releasing the memory occupied by the second data group;
after the first data group is operated and the operation result of the first data group is obtained, the method further comprises the following steps:
and releasing the memory occupied by the first data group.
Optionally, the method further comprises:
and fusing the operation results of all the data groups to obtain the operation result of the target data set.
An apparatus for processing data, comprising:
a data set acquisition module, configured to acquire a target data set, wherein the target data set is a data set with a risk of memory overflow;
the grouping module is used for dividing the target data set to obtain at least two groups of data groups;
a persistence module, configured to persist a first data group, the first data group comprising at least one data group of the target data set;
the first operation module is used for operating a second data group and acquiring an operation result of the second data group, wherein the second data group comprises other data groups except the first data group in the target data set;
the second operation module is used for operating the first data group and acquiring an operation result of the first data group;
and the fusion module is used for fusing the operation results of all the data groups to obtain the operation result of the target data set.
Optionally, the data processing apparatus further includes: a monitoring module for determining the target data set;
When determining the target data set, the monitoring module is specifically configured to:
acquiring state information of each operation stage, wherein the state information at least comprises memory usage and memory recovery frequency;
when the state information of any one of the operation stages meets a preset condition, determining that a data set included in the operation stage is a data set with a memory overflow risk, wherein the preset condition comprises: the memory usage amount of the operation stage is greater than a first preset threshold, and the memory recovery frequency of the operation stage is greater than a second preset threshold.
Optionally, the data processing apparatus further includes: the memory release module is used for releasing the memory occupied by the second data group after the second data group is operated and the operation result of the second data group is obtained; and after the first data group is operated and the operation result of the first data group is obtained, releasing the memory occupied by the first data group.
A device for processing data, comprising: a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the data processing method.
A readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, carries out the steps of the method of processing data as described above.
It can be seen from the foregoing technical solutions that the data processing method provided in the embodiment of the present application groups a data set at risk of memory overflow to obtain multiple data groups, runs part of the data groups (i.e., a second data group), persists the data groups that are not yet run (i.e., a first data group), and runs the first data group after the running result of the second data group is obtained. Further, the running results of all the data groups are fused to obtain the running result of the target data set. On the one hand, when the second data group or the first data group is run, the amount of memory occupied is smaller than the amount of memory required to run the whole target data set; on the other hand, the data groups that are not running are persisted, so that more memory is available for the data group that is running. Therefore, by processing the data set in groups, the method greatly reduces the possibility of memory overflow and avoids failure of the Spark task.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is a schematic flow chart of a specific implementation of a data processing method according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a data processing method according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a data processing device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The data processing method provided by the embodiments of the present application can be applied to a data processing system, which may be provided in an application program and connected to the Spark framework through an Application Programming Interface (API). It can be understood that the data processing system can monitor the running data in real time and process the data in real time while the Spark platform is running. It should be noted that a specific implementation of the data processing method provided in the embodiments of the present application may refer to fig. 1; fig. 1 is a schematic flow diagram of a specific implementation of the data processing method provided in an embodiment of the present application, which specifically includes S101 to S109.
S101, acquiring the state information of each Stage in real time, wherein the state information comprises the memory usage amount, the CPU usage amount and the memory recycling frequency.
In this embodiment, the memory usage of any Stage refers to the amount of cluster resources the Stage occupies while running, the CPU usage refers to the amount of CPU resources the Stage occupies while running, and the memory recovery frequency refers to how often memory recovery occurs while the Stage is running, that is, the number of Full-GC (full garbage collection, i.e., memory recovery) events within a preset time period. It should be noted that each Stage is a running phase of a job executed on the Spark platform, and Stages are divided on the basis of the dependencies between the tasks in the job. For example, dependencies between RDDs include wide dependencies and narrow dependencies; when Spark divides Stages, a new Stage is created at each wide dependency, and each Stage contains one or more tasks. For the specific method of dividing Stages, reference can be made to the prior art.
It should be noted that, in this embodiment, a visual real-time monitoring module may be set, which is used to perform real-time monitoring on each Stage operation process and obtain state information in real time, and a specific implementation manner of the setting method of the real-time monitoring module may refer to the prior art, which is not described in detail in this embodiment.
And S102, when the state information of any Stage meets a preset condition, sending out an early warning signal.
In this embodiment, the preset condition may include: the memory usage of Stage is larger than a first preset threshold, the memory recovery frequency is larger than a second preset threshold, and the CPU usage is smaller than a third preset threshold.
It can be understood that the first preset threshold is a maximum value of a Stage's memory usage, preset according to the cluster resource conditions; when a Stage's memory usage is greater than the first preset threshold, the memory occupied by the running Stage is close to saturation. When the memory occupied by a running Stage is close to saturation, a memory release, i.e., one Full-GC, is executed; when the memory recovery frequency is greater than the second preset threshold, the Stage has saturated its memory many times while running. Because a Stage also occupies CPU resources while running, this embodiment further determines whether the CPU usage is less than the third preset threshold; when it is, the Stage's CPU usage during running is normal. In this way it can be determined that the excessive Full-GC during the Stage's running is caused by insufficient memory, and a possible memory overflow can be predicted in advance.
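The preset condition of S102 can be sketched as a simple predicate (a pure-Python illustration; the function name and argument names are assumptions for the example, not part of the disclosure):

```python
def at_overflow_risk(mem_used, full_gc_count, cpu_used,
                     mem_threshold, gc_threshold, cpu_threshold):
    """True when a Stage's state information meets the preset condition:
    memory usage above the first preset threshold, Full-GC frequency above
    the second preset threshold, and CPU usage below the third preset
    threshold (i.e., CPU load is normal)."""
    return (mem_used > mem_threshold
            and full_gc_count > gc_threshold
            and cpu_used < cpu_threshold)
```

Requiring low CPU usage alongside high memory usage and frequent Full-GC is what distinguishes a memory-starved Stage from one that is simply busy.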
In this embodiment, it is determined that the state information of any Stage meets the preset condition, and an early warning signal may be sent out. It should be noted that the sending form of the warning signal may be preset, and this embodiment is not described in detail herein.
S103, acquiring a target data set.
In this embodiment, a Stage whose state information satisfies the preset condition is regarded as a Stage in which memory overflow is about to occur, and is denoted as the target Stage. Multiple tasks run in the target Stage, that is, the target Stage contains multiple data sets (e.g., RDDs and/or DataFrames) processed by shuffle operators. The target data set acquired in this step is a data set in the target Stage, and may be an RDD or a DataFrame. A DataFrame is an immutable, distributed, resilient data set whose data is stored in named columns, i.e., structured data.
And S104, dividing the target data set into N data groups, wherein each data group is a group of data which can be independently operated.
In this embodiment, there are multiple ways to divide the data groups. Optionally, the number of groups N may be determined according to the remaining cluster resources. Alternatively, the number of groups N may be determined according to the total amount of cluster resources and the data volume of the target data set. It can be understood that the data of the data groups may be executed in parallel. It should be noted that, when the target data set is an RDD, the target data set may be divided according to the partitions (slices) in the RDD, that is, data belonging to the same partition is divided into the same data group.
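The division of S104 can be illustrated with a minimal round-robin split (a pure-Python analogy only; for a real RDD the division would follow its partitions, as noted above):

```python
def split_into_groups(records, n_groups):
    """Split a data set's records into n_groups independently runnable
    data groups by round-robin assignment."""
    groups = [[] for _ in range(n_groups)]
    for i, record in enumerate(records):
        groups[i % n_groups].append(record)
    return groups
```

Every record lands in exactly one group, so running the groups separately and fusing the results covers the whole target data set.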
And S105, taking at least one data group in the data groups to be processed as a first data group, and taking the data groups except the first data group as second data groups.
The to-be-processed data groups are the data groups in the target data set for which no running result has been obtained; it can be understood that, in the initial state, the to-be-processed data groups are the N data groups obtained by dividing the target data set. It should be noted that the number of data groups in the first data group may be determined according to the cluster resources; for example, when the cluster resources are sufficient, the first data group includes only one data group.
And S106, performing persistence operation on the first data group.
In this embodiment, the persistence operation on any data group in the first data group may be performed as follows: stop running the shuffle operator that processes the data group, and write the data of the data group to a disk.
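The persistence of S106 — stop processing a group and move its data out of memory onto disk — can be mimicked in pure Python with `pickle` (an illustrative analogy, not the Spark persistence API; the helper names are assumptions):

```python
import os
import pickle
import tempfile

def persist_group(group):
    """Write a not-yet-running data group to disk and return the file
    path, so the memory it occupied can be released."""
    fd, path = tempfile.mkstemp(suffix=".group")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(group, f)
    return path

def load_group(path):
    """Read a persisted data group back from disk and delete the file."""
    with open(path, "rb") as f:
        group = pickle.load(f)
    os.remove(path)
    return group
```

The round trip preserves the group's data exactly, which is what allows S108 to resume a persisted group later.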
And S107, operating the second data group, and releasing the memory occupied by the second data group when the preset operating condition is met.
It should be noted that the preset operation condition may be set according to an actual situation, and in this embodiment, the preset operation condition may be to operate the second data group to obtain a final operation result.
In this embodiment, the shuffle operator that processes the second data group is run until the running result of the second data group is obtained; the running result is then stored, the memory occupied by the data in the second data group is released, and the data of the second data group is deleted from the disk. It can be understood that, in S106, the running of the shuffle operator that processes part of the target data set is stopped and the memory occupied by that part of the data is released, so that when the shuffle operator that processes the second data group is run in this step, no memory overflow occurs.
And S108, taking the first data group as a data group to be processed, and returning to S105.
That is, at least one data group of the current first data group is taken as the new first data group, the remaining data groups are taken as the new second data group, and S106 to S108 are executed again.
It should be noted that, in this embodiment, the number of the data groups in the second data group may be one group, that is, N data groups obtained by dividing the target data set are operated one by one in this embodiment, when any one of the data groups is operating, the other data groups stop operating and are written into the disk, and when the operation result of the currently operating data group is obtained, the memory occupied by the data group is released. And further extracting the data of the next data group from the disk, and operating to obtain an operation result until the operation results of all the data groups are obtained.
It should be further noted that, in this embodiment, the number of the data groups in the second data group is not limited, that is, the second data group may also include multiple data groups, in this case, all the data groups in the second data group are simultaneously run, and when one of the data groups in the second data group runs to obtain a result, the memory occupied by the data group is released, and the data of the next data group is extracted from the disk, and the running is performed to obtain a running result until the running results of all the data groups are obtained.
It should be further noted that, in this embodiment, the operation sequence of the data sets is not limited, that is, each time the loop is executed from S105 to S108, the second data set may be any one or more data sets in the data sets that are not operated.
As can be seen from the above, the data processing method provided in this embodiment of the present application runs the target data set in batches by executing S105 to S108 in a loop: after a data group in the second data group finishes running, the memory occupied by that data group is released, and the data of a remaining not-yet-run data group is read from the disk as a new second data group to be run, until the running results of all the data groups are obtained. In summary, the second data group may include one or more data groups; compared with running a shuffle operator that processes all the data in the target data set, the amount of memory required to cyclically run the shuffle operator that processes the second data group is small, so memory overflow can be avoided.
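The loop of S105 to S108 can be sketched end to end in pure Python (an illustrative simulation only: `run_group` stands in for the shuffle computation applied to one group, and `pickle` files stand in for Spark persistence):

```python
import os
import pickle
import tempfile

def run_in_batches(groups, run_group):
    """Persist every data group to disk, then load and run the groups
    one at a time, so at most one group's data is in memory while its
    computation runs."""
    paths = []
    for group in groups:                      # persist all pending groups
        fd, path = tempfile.mkstemp(suffix=".group")
        with os.fdopen(fd, "wb") as f:
            pickle.dump(group, f)
        paths.append(path)
    results = []
    for path in paths:
        with open(path, "rb") as f:
            group = pickle.load(f)            # only this group is in memory
        os.remove(path)                       # its persisted copy is deleted
        results.append(run_group(group))      # run it; the rest stay on disk
    return results
```

Peak memory is bounded by the largest single group rather than by the whole data set, which is the core of the method's overflow avoidance.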
And S109, fusing the operation results of all the data groups to obtain the operation result of the target data set.
Specifically, running each data set produces a running result, and the running results are fused to obtain a running result of the whole target data set. The method for obtaining the operation result of each data set and the method for fusing the operation results may refer to the prior art, which is not described in detail in this embodiment.
According to the technical solutions above, the data processing method provided by the embodiment of the present application acquires the state information of each Stage in real time, judges according to the preset condition whether a risk of memory overflow exists, and reminds operation and maintenance personnel of the risk by sending an early warning signal. It automatically locates the shuffle operation causing the memory-overflow risk and the data set processed by that operation, groups the data set, runs each data group according to the grouping, and automatically persists the data that is not running. Because the amount of memory required to run and process a partial data group is small, and persisting the data groups that are not running releases part of the memory, memory overflow is avoided and failure of the Spark task is prevented.
Furthermore, the method automatically warns of the possible memory-overflow risk and automatically processes, in groups, the data set handled by the shuffle operation with the overflow risk. Compared with the prior art, in which the operator causing the memory overflow is located manually only after the overflow has occurred, this saves labor cost in the development process and improves working efficiency.
Therefore, the data processing method provided by the embodiment processes the data before the memory overflow occurs, so that the memory overflow can be effectively avoided. Fig. 2 is a schematic flow chart of a data processing method according to an embodiment of the present application, and as shown in fig. 2, the method may specifically include steps S201 to S204.
S201, acquiring a target data set.
In this embodiment, the target data set is a data set at risk of memory overflow; for example, the target data set may be an RDD. In this embodiment, whether a Stage is at risk of memory overflow is determined by monitoring the running state of the data in each Stage on the Spark platform. When a Stage is determined to be at risk of memory overflow, the RDD in that Stage is acquired as the target data set.
The monitoring method of the operating state and the determination of the overflow risk may refer to the above S101 to S102, which are not described in detail in this embodiment.
S202, dividing the target data set to obtain at least two groups of data sets.
It should be noted that, in this embodiment, the number of the divided data groups may be determined according to the usage condition of the cluster resources (for example, the occupied memory amount, and/or the total amount of the cluster resources, etc.). And, each data group may be operated individually, and a specific division method may refer to S104 described above.
S203, the first data group is persisted.
Wherein the first data group comprises at least one data group of the target data set.
In this embodiment, the persistence of the first data group may refer to the prior art. It can be understood that, after the first data group is persisted, the memory occupied by the first data group and the memory occupied by the operator that processes the first data group are released.
And S204, operating the second data group, and acquiring an operation result of the second data group.
Wherein the second data group comprises the data groups in the target data set other than the first data group. In this embodiment, the method for operating the second data group may refer to the prior art. It should be noted that, after the operation result of the second data group is obtained, this embodiment may also release the memory occupied by the second data group.
S205, operating the first data group, and acquiring an operation result of the first data group.
In this embodiment, the method for operating the first data group may refer to the prior art. It should be noted that, after the operation result of the first data group is obtained, this embodiment may also release the memory occupied by the first data group.
And S206, fusing the operation results of all the data groups to obtain the operation result of the target data set.
It can be understood that the method operates the target data set in groups, obtaining a running result for each group of data; therefore, the method further fuses the running results of all the data groups according to a preset fusion rule to obtain the running result of the whole target data set. It should be noted that the fusion rule may be preset according to the actual content and/or grouping of the target data set, and the specific fusion method may refer to the prior art.
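As one concrete example of a fusion rule (assumed for illustration; the patent leaves the rule to be preset): when each group's running result is a per-key count, the results fuse by key-wise summation:

```python
from collections import Counter

def fuse_results(group_results):
    """Fuse per-group running results into the target data set's result,
    here by summing per-key counts across all groups."""
    total = Counter()
    for result in group_results:
        total.update(result)
    return dict(total)
```

Summation is order-independent, so this fusion rule gives the same target-data-set result regardless of the order in which the groups were run.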
It can be seen from the foregoing technical solutions that the data processing method provided in the embodiment of the present application groups a data set at risk of memory overflow to obtain multiple data groups, runs part of the data groups (i.e., a second data group), persists the data groups that are not yet run (i.e., a first data group), and runs the first data group after the running result of the second data group is obtained. Further, the running results of all the data groups are fused to obtain the running result of the target data set. On the one hand, when the second data group or the first data group is run, the amount of memory occupied is smaller than the amount of memory required to run the whole target data set; on the other hand, the data groups that are not running are persisted, so that more memory is available for the data group that is running. Therefore, by processing the data set in groups, the method greatly reduces the possibility of memory overflow and avoids failure of the Spark task.
The following describes the data processing apparatus provided in the embodiments of the present application, and the data processing apparatus described below and the data processing method described above may be referred to in correspondence with each other.
Referring to fig. 3, a schematic structural diagram of a data processing apparatus according to an embodiment of the present application is shown, and as shown in fig. 3, the apparatus may include:
a data set obtaining module 301, configured to obtain a target data set, where the target data set is a data set with a risk of memory overflow;
a grouping module 302, configured to divide the target data set to obtain at least two groups of data groups;
a persistence module 303, configured to persist a first data group, the first data group comprising at least one data group in the target data set;
a first operation module 304, configured to operate a second data group and obtain an operation result of the second data group, where the second data group includes data groups other than the first data group in the target data set;
a second operation module 305, configured to operate the first data group and obtain an operation result of the first data group;
and a fusion module 306, configured to fuse the operation results of all the data sets to obtain an operation result of the target data set.
Optionally, the data processing apparatus further includes: a monitoring module for determining the target data set;
When determining the target data set, the monitoring module is specifically configured to:
acquiring state information of each operation stage, wherein the state information at least comprises memory usage and memory recovery frequency;
when the state information of any one of the operation stages meets a preset condition, determining that a data set included in the operation stage is a data set with a memory overflow risk, wherein the preset condition comprises: the memory usage amount of the operation stage is greater than a first preset threshold, and the memory recovery frequency of the operation stage is greater than a second preset threshold.
Optionally, the status information further includes: CPU usage;
the preset conditions further include: and the CPU usage amount of the running stage is less than a third preset threshold value.
Optionally, the data processing apparatus further includes:
and the early warning module is used for sending out an early warning signal when the state information of any one operation stage meets the preset condition.
Optionally, the data processing apparatus further includes a memory release module, configured to release the memory occupied by the second data group after the second data group is operated and the operation result of the second data group is obtained; and after the first data group is operated and the operation result of the first data group is obtained, releasing the memory occupied by the first data group.
It can be seen from the foregoing technical solutions that the data processing apparatus provided in this embodiment of the present application groups a data set at risk of memory overflow to obtain a plurality of data groups, runs part of the data groups (i.e., the second data group), persists the data groups that are not yet run (i.e., the first data group), and runs the first data group after the running result of the second data group is obtained. On the one hand, when the second data group or the first data group is run, the amount of memory occupied is smaller than the amount of memory required to run the whole target data set; on the other hand, the data groups that are not running are persisted, so that more memory is available for the data group that is running. Therefore, by processing the data set in groups, the apparatus greatly reduces the possibility of memory overflow and avoids failure of the Spark task.
An embodiment of the present application further provides a data processing device. Referring to fig. 4, which shows a schematic structural diagram of the device, the data processing device may include: at least one processor 401, at least one communication interface 402, at least one memory 403, and at least one communication bus 404;
in this embodiment of the present application, the processor 401, the communication interface 402, and the memory 403 communicate with one another through the communication bus 404;
the processor 401 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), one or more integrated circuits configured to implement the embodiments of the present invention, or the like;
the memory 403 may include a high-speed RAM and may further include a non-volatile memory, such as at least one disk memory;
the memory stores a program, and the processor may execute the program stored in the memory to implement the data processing method described above.
An embodiment of the present application further provides a readable storage medium storing a computer program executable by a processor; when the computer program is executed by the processor, the data processing method described above is implemented.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for processing data, comprising:
acquiring a target data set, wherein the target data set is a data set with memory overflow risk;
dividing the target data set into at least two data groups;
persisting a first data group, wherein the first data group comprises at least one of the data groups in the target data set;
operating a second data group and acquiring an operation result of the second data group, wherein the second data group comprises the data groups in the target data set other than the first data group;
operating the first data set and acquiring an operation result of the first data set;
and fusing the operation results of all the data groups to obtain the operation result of the target data set.
2. The method of processing data of claim 1, further comprising, prior to said obtaining a target data set:
acquiring state information of each operation stage, wherein the state information at least comprises memory usage and memory recovery frequency;
when the state information of any one of the operation stages meets a preset condition, determining that a data set included in the operation stage is a data set with a memory overflow risk, wherein the preset condition comprises: the memory usage amount of the operation stage is greater than a first preset threshold, and the memory recovery frequency of the operation stage is greater than a second preset threshold.
3. The method of claim 2, wherein the status information further comprises: CPU usage;
the preset conditions further include: and the CPU usage amount of the running stage is less than a third preset threshold value.
4. The data processing method according to claim 2 or 3, further comprising:
and when the state information of any one operation stage meets the preset condition, sending out an early warning signal.
5. The method of claim 1, further comprising, after the second data group is operated and the operation result of the second data group is obtained:
releasing the memory occupied by the second data group;
after the first data group is operated and the operation result of the first data group is obtained, the method further comprises the following steps:
and releasing the memory occupied by the first data group.
6. An apparatus for processing data, comprising:
a data set acquisition module, configured to acquire a target data set, wherein the target data set is a data set with a memory overflow risk;
a grouping module, configured to divide the target data set into at least two data groups;
a persistence module, configured to persist a first data group, wherein the first data group comprises at least one of the data groups in the target data set;
a first operation module, configured to operate a second data group and acquire an operation result of the second data group, wherein the second data group comprises the data groups in the target data set other than the first data group;
a second operation module, configured to operate the first data group and acquire an operation result of the first data group; and
a fusion module, configured to fuse the operation results of all the data groups to obtain an operation result of the target data set.
7. The apparatus for processing data according to claim 6, further comprising a monitoring module for determining the target data set, wherein the monitoring module is specifically configured to:
acquiring state information of each operation stage, wherein the state information at least comprises memory usage and memory recovery frequency;
when the state information of any one of the operation stages meets a preset condition, determining that a data set included in the operation stage is a data set with a memory overflow risk, wherein the preset condition comprises: the memory usage amount of the operation stage is greater than a first preset threshold, and the memory recovery frequency of the operation stage is greater than a second preset threshold.
8. The apparatus for processing data according to claim 6, further comprising:
the memory release module is used for releasing the memory occupied by the second data group after the second data group is operated and the operation result of the second data group is obtained; and after the first data group is operated and the operation result of the first data group is obtained, releasing the memory occupied by the first data group.
9. An apparatus for processing data, comprising: a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the data processing method according to any one of claims 1 to 5.
10. A readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method for processing data according to any one of claims 1 to 5.
CN202010284693.7A 2020-04-13 2020-04-13 Data processing method, device, equipment and readable storage medium Pending CN111444023A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010284693.7A CN111444023A (en) 2020-04-13 2020-04-13 Data processing method, device, equipment and readable storage medium


Publications (1)

Publication Number Publication Date
CN111444023A true CN111444023A (en) 2020-07-24

Family

ID=71651791


Country Status (1)

Country Link
CN (1) CN111444023A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102567528A (en) * 2011-12-29 2012-07-11 东软集团股份有限公司 Method and device for reading mass data
CN106569734A (en) * 2015-10-12 2017-04-19 北京国双科技有限公司 Method and device for repairing memory overflow during data shuffling
CN109634933A (en) * 2014-10-29 2019-04-16 北京奇虎科技有限公司 The method, apparatus and system of data processing
US20190155641A1 (en) * 2017-10-26 2019-05-23 Huawei Technologies Co.,Ltd. Method and apparatus for collecting information, and method and apparatus for releasing memory



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination