CN111857538A - Data processing method, device and storage medium - Google Patents

Data processing method, device and storage medium

Info

Publication number
CN111857538A
CN111857538A (application number CN201910337777.XA)
Authority
CN
China
Prior art keywords
data
processed
reducer
merged
executor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910337777.XA
Other languages
Chinese (zh)
Inventor
范振
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Wodong Tianjun Information Technology Co Ltd filed Critical Beijing Wodong Tianjun Information Technology Co Ltd
Priority to CN201910337777.XA priority Critical patent/CN111857538A/en
Publication of CN111857538A publication Critical patent/CN111857538A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F 3/061 Improving I/O performance
    • G06F 3/0611 Improving I/O performance in relation to response time
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F 3/0671 In-line storage system
    • G06F 3/0673 Single storage device
    • G06F 3/0674 Disk device
    • G06F 3/0676 Magnetic disk device

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the application provides a data processing method, a data processing device and a storage medium. The method includes: acquiring data to be processed, where the data to be processed is file data written by a first executor, the number of files corresponding to the file data is N times the number of Reducers, N is an integer greater than 1, and the Reducers are located in a second executor; merging the data to be processed that corresponds to the same Reducer to obtain merged data, where the number of merged data items is less than the number of files corresponding to the file data; receiving a first read operation sent by the second executor, where the first read operation is used to read the merged data corresponding to the Reducer; and sending the merged data corresponding to the Reducer according to the first read operation. The embodiment of the application can reduce the number of random reads, thereby accelerating the shuffle read process and reducing disk IO resource and network consumption.

Description

Data processing method, device and storage medium
Technical Field
The embodiment of the application relates to the technical field of information processing, in particular to a data processing method, a data processing device and a storage medium.
Background
In the MapReduce framework, the shuffle is the bridge connecting Map and Reduce: the output of Map must pass through the shuffle before it can be used by Reduce, so the performance of the shuffle directly affects the performance and throughput of the whole framework.
Spark, as an implementation of the MapReduce framework, naturally also implements the shuffle logic. When Spark runs an application (Application) containing a shuffle process, an executor (Executor), besides running tasks (task), is also responsible for writing the data generated by a task to disk after the task finishes, and for providing shuffle data to other executors. Specifically, the shuffle in Spark includes a write phase and a read phase, namely shuffle write and shuffle read. During shuffle write, a Mapper in an executor partitions its output according to the number of Reducers in other executors, then stores the partitioned data in a dedicated Spark directory, which is managed and served by a peripheral component, the External Shuffle Service. During shuffle read, each Reducer reads the data corresponding to it from every Mapper, which causes a large number of random reads, makes the whole shuffle read process very slow, and incurs much unnecessary disk IO resource and network consumption.
Disclosure of Invention
The embodiments of the application provide a data processing method, a data processing device and a storage medium, so as to accelerate the shuffle read process and reduce unnecessary disk IO (input/output) resource and network consumption.
In a first aspect, an embodiment of the present application provides a data processing method, including:
Acquiring data to be processed, where the data to be processed is file data written by a first executor, the number of files corresponding to the file data is N times the number of Reducers, N is an integer greater than 1, and the Reducers are located in a second executor;
merging the data to be processed corresponding to the same Reducer to obtain merged data, wherein the number of the merged data is less than that of files corresponding to the file data;
receiving a first reading operation sent by the second executor, wherein the first reading operation is used for reading the merged data corresponding to the Reducer;
and sending the merged data corresponding to the Reducer according to the first reading operation.
In a possible implementation, the number of merged data is equal to the number of reducers.
In one possible implementation, each merged data corresponds to a block id of a corresponding Reducer.
In a possible implementation manner, after merging the data to be processed corresponding to the same Reducer to obtain the merged data, the method may further include: updating the data to be processed.
In a possible implementation manner, the data to be processed is metadata corresponding to at least one application.
In a possible implementation, each application corresponds to a thread, the life cycle of the thread is the same as that of the corresponding application, and the thread is used for periodically scanning the data structure related to the corresponding application and performing the merging processing operation each time new file data is formed.
In one possible implementation, the thread employs a timer as a driver.
In a possible implementation, the data processing method may further include:
if it is monitored that an exception occurs during the merging process, receiving a second read operation sent by the second executor, where the second read operation is used to read the data before merging corresponding to the Reducer;
and sending the data before combination corresponding to the Reducer according to the second reading operation.
In a second aspect, an embodiment of the present application provides a data processing apparatus, including:
an acquisition module, configured to acquire data to be processed, where the data to be processed is file data written by a first executor, the number of files corresponding to the file data is N times the number of Reducers, N is an integer greater than 1, and the Reducers are located in a second executor;
The processing module is used for merging the data to be processed corresponding to the same Reducer to obtain merged data, and the number of the merged data is less than that of files corresponding to the file data;
a receiving module, configured to receive a first read operation sent by the second executor, where the first read operation is used to read merged data corresponding to the Reducer;
and the sending module is used for sending the merged data corresponding to the Reducer according to the first reading operation.
In a possible implementation, the number of merged data is equal to the number of reducers.
In a possible implementation, the processing module may be further configured to: and after the data to be processed corresponding to the same Reducer are merged and the merged data are obtained, updating the data to be processed.
In a possible implementation, the data processing apparatus may further include:
the monitoring module is used for monitoring whether an exception occurs in the merging processing process;
correspondingly, the receiving module is further configured to receive a second read operation sent by the second executor when the monitoring module monitors that an exception occurs during the merging process, where the second read operation is used to read the data before merging corresponding to the Reducer;
And the sending module is further used for sending the data before combination corresponding to the Reducer according to the second reading operation.
In a third aspect, an embodiment of the present application provides a data processing apparatus, including: a memory and a processor;
wherein the memory is to store program instructions;
the processor is configured to call and execute the program instructions stored in the memory, and when the processor executes the program instructions stored in the memory, the data processing apparatus is configured to execute the method according to any implementation manner of the first aspect.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, which stores instructions that, when executed on a computer, cause the computer to perform the method according to any implementation manner of the first aspect.
According to the data processing method, device and storage medium provided by the embodiments of the application, data to be processed is acquired, where the data to be processed is file data written by a first executor, the number of files corresponding to the file data is N times the number of Reducers, N is an integer greater than 1, and the Reducers are located in a second executor; the data to be processed corresponding to the same Reducer is merged to obtain merged data, where the number of merged data items is less than the number of files corresponding to the file data; a first read operation sent by the second executor is received, where the first read operation is used to read the merged data corresponding to the Reducer; and the merged data corresponding to the Reducer is sent according to the first read operation. Because the data to be processed corresponding to the same Reducer is merged and the number of merged data items is less than the number of files corresponding to the file data, the number of random reads can be reduced, thereby accelerating the shuffle read process and reducing disk IO resource and network consumption.
Drawings
Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present application;
fig. 2 is a schematic flowchart of a data processing method according to an embodiment of the present application;
fig. 3 is a schematic flowchart of a data processing method according to another embodiment of the present application;
fig. 4 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a data processing apparatus according to another embodiment of the present application;
fig. 6 is a schematic structural diagram of a data processing apparatus according to yet another embodiment of the present application.
Detailed Description
It should be understood that the numbers "first" and "second" in the embodiments of the present application are used for distinguishing similar objects, and are not necessarily used for describing a specific order or sequence order, and should not be construed as limiting the embodiments of the present application in any way.
The "and/or" mentioned in the embodiments of the present application describes an association relationship of associated objects, and indicates that three relationships may exist, for example, a and/or B may indicate: a exists alone, A and B exist simultaneously, and B exists alone, wherein A and B can be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
First, an application scenario and a part of vocabulary related to the embodiments of the present application will be described.
Kubernetes: a container cluster orchestration and management system open-sourced by Google.
Spark: a new-generation distributed in-memory computing framework and a top-level Apache open-source project. Compared with the Hadoop MapReduce computing framework, Spark keeps intermediate computation results in memory, increasing speed by 10-100 times. Shuffle is a specific stage (phase) in the MapReduce framework that lies between the Map phase and the Reduce phase: when the output of Map is to be used by Reduce, it needs to be hashed by key and distributed to each Reducer; this is the shuffle.
ESS (External Shuffle Service): a peripheral component of Spark. The locations of data written during the shuffle write process are registered with this component, which then provides shuffle read services.
Application: corresponds to a user job in Spark.
Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present application. As shown in fig. 1, the application scenario of the embodiment of the present application may include, but is not limited to: executor 1, executor 2, ESS 1, ESS 2, and executor 3. Of course, the application scenario does not limit the number of executors and ESSs; this embodiment merely takes three executors and two ESSs as an example.
The executor 1 executes task (Task) 0 and task 1, i.e., the Map task set (MapTask set) executed by executor 1 includes task 0 and task 1; executor 2 executes task 2 and task 3, i.e., the Map task set executed by executor 2 includes task 2 and task 3; executor 1 and executor 2 are mutually independent and do not interfere with each other. In addition, executor 1 and executor 2 are also responsible for writing shuffle data, i.e., writing the data generated by executing their tasks into the ESS: executor 1 writes the data generated by executing task 0 and task 1 into ESS 1, executor 2 writes the data generated by executing task 2 and task 3 into ESS 2, and ESS 1 and ESS 2 are independent of each other and do not interfere with each other.
The executor 3 executes task 4, task 5, and task 6 to read the shuffle data, i.e., the Reduce task set (ReduceTask set) executed by executor 3 includes task 4, task 5, and task 6. Specifically, executor 3 may read the data required to execute task 4, task 5, and task 6 from ESS 1 and ESS 2, respectively.
In this application scenario, the end to which shuffle data is written is called the Map end, and each task that generates data at the Map end is called a Mapper; for example, task 0, task 1, task 2, and task 3 are all Mappers. Correspondingly, the end that reads shuffle data is called the Reduce end, and each task at the Reduce end that reads data is called a Reducer; for example, task 4, task 5, and task 6 are all Reducers.
In the following embodiments, the executor that writes the shuffle data is referred to as a first executor, the executor that reads the shuffle data is referred to as a second executor, and the shuffle data written in the ESS is referred to as to-be-processed data.
For the prior art, exemplarily, when an application containing a shuffle process is run and the application corresponds to M Mappers and R Reducers, the whole shuffle read process generates M × R random reads. In practical applications, both M and R are very large, so the number of random reads becomes enormous, the consumption of disk IO resources is huge, and the shuffle read process takes a long time.
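The M × R read pattern above can be illustrated with a toy calculation. The numbers below are purely illustrative (they do not come from the patent): each (Mapper, Reducer) pair costs one random read before merging, while the merged layout needs only one read per Reducer.

```python
def random_reads_unmerged(m: int, r: int) -> int:
    """One random read per (mapper, reducer) pair in the unmerged layout."""
    return m * r

def random_reads_merged(m: int, r: int) -> int:
    """After per-reducer merging, each reducer reads one consolidated file."""
    return r

# Illustrative job size: 1000 mappers, 200 reducers.
m, r = 1000, 200
print(random_reads_unmerged(m, r))  # 200000 random reads before merging
print(random_reads_merged(m, r))    # 200 random reads after merging
```

The gap grows with M, which is why the savings are largest for big jobs.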
Based on the above problems, the embodiments of the present application merge, inside the ESS, the data to be processed that was written into the ESS through the shuffle write process, thereby reducing the consumption of disk IO resources, reducing the performance loss caused by network packet transmission, and improving operating efficiency.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific examples. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments.
Fig. 2 is a schematic flow chart of a data processing method according to an embodiment of the present application. The embodiment of the application provides a data processing method, which can be executed by a data processing device, and the data processing device can be realized by software and/or hardware. Illustratively, the data processing device may be embodied as an ESS, or the data processing device may be embedded in an ESS, such as a module or chip in the ESS.
As shown in fig. 2, the method of the embodiment of the present application includes:
s201, acquiring data to be processed.
The data to be processed is file data written by the first executor, the number of files corresponding to the file data is N times the number of Reducers, N is an integer greater than 1, and the Reducers are located in the second executor.
Still taking the application scenario shown in fig. 1 as an example, ESS 1 receives the file data P0, file data P1, and file data P2 written by the first executor (i.e., executor 1), thereby obtaining the data to be processed. It should be noted that, for the same Mapper, the number of files corresponding to its file data is equal to the number of Reducers in the second executor (i.e., executor 3). If ESS 1 corresponds to N Mappers, the number of files corresponding to the file data received by ESS 1 is N times the number of Reducers.
Considering that in practical applications the space occupied by the data generated by all Mappers corresponding to an ESS may be larger than the memory of the ESS, in this case the file data P0, P1, and P2 are written into the ESS first, and the remaining portions are written afterwards; for example, if the remaining portions are denoted file data P01, P11, and P21, then file data P01, P11, and P21 are subsequently written into the ESS. The number of files corresponding to the file data contained in the remaining portion is also a natural-number multiple of the number of Reducers.
S202, merging the data to be processed corresponding to the same Reducer to obtain merged data.
The number of merged data items is less than the number of files corresponding to the file data. Since the second executor may read the shuffle data in the ESS either during the merging process or after it has completed, the merged data in this step may be data after the merging is completed (all data merged), or data during the merging (part merged, part not yet merged). Therefore, the number of merged data items is less than the number of files corresponding to the file data, but greater than or equal to the number of Reducers.
For example, when the merging process succeeds, M × R files are finally merged into R files.
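The merge in S202 can be sketched as grouping per-Mapper partition files by their target Reducer and concatenating each group. This is a hypothetical sketch, not the patent's implementation; the keying of files by `(mapper_id, reducer_id)` and the byte concatenation are illustrative assumptions.

```python
from collections import defaultdict

def merge_by_reducer(files: dict) -> dict:
    """files maps (mapper_id, reducer_id) -> file payload bytes.

    Returns reducer_id -> merged payload, concatenated in mapper order.
    """
    merged = defaultdict(bytes)
    for (mapper_id, reducer_id), payload in sorted(files.items()):
        merged[reducer_id] += payload
    return dict(merged)

# 2 mappers x 2 reducers = 4 shuffle files before merging.
files = {
    (0, 0): b"a", (0, 1): b"b",   # mapper 0's partitions
    (1, 0): b"c", (1, 1): b"d",   # mapper 1's partitions
}
merged = merge_by_reducer(files)
# M*R = 4 input files collapse to R = 2 merged files, one per reducer.
```

A Reducer then issues a single sequential read for its merged file instead of one random read per Mapper.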
S203, receiving a first read operation sent by the second executor.
The first read operation is used to read the merged data corresponding to the Reducer.
It should be noted that when reading data, the merged data needs to be transmitted over multiple threads; otherwise the bandwidth is not fully utilized, which wastes resources, and the transmission time of a single file becomes too long, which also greatly degrades data-processing performance.
And S204, sending the merged data corresponding to the Reducer according to the first reading operation.
In this embodiment, data to be processed is obtained, where the data to be processed is file data written by a first executor, the number of files corresponding to the file data is N times the number of Reducers, N is an integer greater than 1, and the Reducers are located in a second executor; the data to be processed corresponding to the same Reducer is merged to obtain merged data, where the number of merged data items is less than the number of files corresponding to the file data; after a first read operation sent by the second executor is received, where the first read operation is used to read the merged data corresponding to the Reducer, the merged data corresponding to the Reducer is sent according to the first read operation. Because the data to be processed corresponding to the same Reducer is merged and the number of merged data items is less than the number of files corresponding to the file data, the number of random reads can be reduced, thereby accelerating the shuffle read process and reducing disk IO resource and network consumption.
Optionally, the number of merged data is equal to the number of reducers. Those skilled in the art will appreciate that when the merge process is normally completed, the number of merged data is equal to the number of reducers; if the merging processing is abnormal, merging part of the data to be processed, and keeping the rest of the data to be processed unchanged, wherein the number of the merged data is larger than the number of reducers.
After the merging process completes normally, in some embodiments, after step S202 merges the data to be processed corresponding to the same Reducer to obtain the merged data, the data processing method may further include: updating the data to be processed. That is, during the merging process, the data to be processed must be kept unchanged and no update operation is performed on it; the data to be processed is updated only after the merging process is completed, so as to prevent data inconsistency caused by exceptions.
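The "update only after the merge completes" rule can be sketched as staging the work on a copy and publishing it in a single step. This is a hypothetical illustration; the `ShuffleMetadata` class and its fields are invented for the sketch and are not the patent's data structures.

```python
import copy

class ShuffleMetadata:
    """Toy stand-in for the ESS's per-application shuffle metadata."""

    def __init__(self, entries):
        self.entries = entries  # reducer_id -> list of file paths

    def merge_and_publish(self, do_merge):
        staged = copy.deepcopy(self.entries)
        try:
            do_merge(staged)      # merge mutates the staged copy only
        except Exception:
            return False          # on failure the original stays untouched
        self.entries = staged     # publish: one reference swap at the end
        return True

meta = ShuffleMetadata({0: ["p0_r0", "p1_r0"]})
ok = meta.merge_and_publish(lambda s: s.update({0: ["merged_r0"]}))
# readers never observe a half-merged view: either the old entries or the new
```

If the merge raises midway, `entries` still describes the pre-merge files, which is exactly the state the fallback read path in the later embodiment relies on.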
Optionally, each merged datum corresponds to a block identifier of a corresponding Reducer. In this way, the Reducer may map the corresponding merged data through the block identifier, that is, change the shuffle read interface, so as to adapt to the structure of the ESS provided in the embodiment of the present application.
For example, referring to the application scenario shown in fig. 1, when executor 3 (i.e., the second executor) reads the shuffle data, it needs to perform mapping according to the passed block identifier (BlockId). The original shuffle read interface is Seq(block_id1 … block_idn), which should be mapped to the merged block ids.
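The block-id mapping can be sketched as collapsing the per-Mapper block ids a Reducer would request into one merged id per Reducer. The id formats below (`shuffle_<shuffle>_<map>_<reduce>` and the `mergedShuffle_` prefix) are illustrative assumptions, not identifiers defined by the patent.

```python
def map_to_merged(block_ids: list) -> set:
    """Collapse per-mapper shuffle block ids to one merged id per reducer."""
    merged_ids = set()
    for bid in block_ids:
        # hypothetical format: shuffle_<shuffle_id>_<map_id>_<reduce_id>
        shuffle_id, _map_id, reduce_id = bid.removeprefix("shuffle_").split("_")
        merged_ids.add(f"mergedShuffle_{shuffle_id}_{reduce_id}")
    return merged_ids

# Reducer 5 would originally request one block per mapper...
requested = ["shuffle_0_0_5", "shuffle_0_1_5", "shuffle_0_2_5"]
# ...but all three map to a single merged block id.
```

The map-side component of the request falls away, which is where the reduction in random reads comes from.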
Further, the data to be processed may be metadata corresponding to at least one application. The metadata processing corresponding to different applications is independent and non-interfering.
Optionally, each application corresponds to a thread, and the life cycle of the thread is the same as that of the corresponding application. Namely, when the application is registered, the thread corresponding to the application is generated, and after the execution of the application is finished, the thread corresponding to the application is destroyed. The thread is used for regularly scanning data structures related to corresponding applications, and merging processing operation is carried out after new file data are formed.
In one specific implementation, the ESS maintains a thread pool with one thread for each newly registered application, and passes the identification (e.g., ID) of the application into the thread when starting it; when the application finishes executing, the thread exits on its own.
In some embodiments, the threads are driven by timers: each thread periodically scans the data structures related to the corresponding application and performs the merging operation whenever new file data is formed. Using a timer as the driver prevents omissions, compared with using events as the driver.
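The per-application, timer-driven scan thread can be sketched as follows. Thread names, the scan interval, and the stop-event mechanism are illustrative assumptions for the sketch, not details from the patent.

```python
import threading
import time

def start_merge_thread(app_id, scan_once, stop_event, interval=0.01):
    """One timer-driven thread per registered application.

    The thread rescans the application's data structures every `interval`
    seconds and exits when the application finishes (stop_event is set).
    """
    def loop():
        # Event.wait doubles as the timer: returns False on timeout (rescan),
        # True once the stop event is set (application finished, thread exits).
        while not stop_event.wait(interval):
            scan_once(app_id)

    t = threading.Thread(target=loop, name=f"merge-{app_id}", daemon=True)
    t.start()
    return t

scans = []
stop = threading.Event()
t = start_merge_thread("app-1", lambda app: scans.append(app), stop)
time.sleep(0.05)          # let a few timer ticks elapse
stop.set()                # application finished: thread exits on its own
t.join(timeout=1.0)
```

Because the scan fires on every tick regardless of whether a write event was observed, newly formed files cannot be missed, which matches the stated advantage of timers over event-driven triggering.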
Fig. 3 is a schematic flow chart of a data processing method according to another embodiment of the present application. On the basis of the foregoing embodiment, as shown in fig. 3, after S202 performs merging processing on to-be-processed data corresponding to the same Reducer to obtain merged data, the method in the embodiment of the present application may further include:
s301, monitoring whether an exception occurs in the merging processing process.
If it is monitored that no exception occurs during the merging process, S203 and S204 are executed; if it is monitored that an exception occurs during the merging process, S302 and S303 are executed.
And S302, receiving a second read operation sent by the second executor.
And the second reading operation is used for reading the data before merging corresponding to the Reducer.
And S303, sending the data before combination corresponding to the Reducer according to the second reading operation.
In this embodiment, considering that an exception may occur during the merging process, the merging process is monitored for exceptions; when an exception occurs, the shuffle read process is performed according to the existing flow, thereby ensuring that the corresponding application completes normally.
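The read-path decision of S301-S303 can be sketched as a simple dispatch: serve the single merged file when the merge finished cleanly, otherwise fall back to the original pre-merge files. Function and variable names are illustrative assumptions.

```python
def serve_read(reducer_id, merged_files, premerge_files, merge_failed):
    """Dispatch a reducer's read to the merged or pre-merge data."""
    if merge_failed:
        # second read operation: fall back to the unmerged per-mapper files,
        # so the application still completes via the existing flow
        return premerge_files[reducer_id]
    # first read operation: one sequential read of the merged file
    return [merged_files[reducer_id]]

premerge = {5: ["m0_r5", "m1_r5"]}   # pre-merge: one file per mapper
merged = {5: "merged_r5"}            # post-merge: one file per reducer
assert serve_read(5, merged, premerge, merge_failed=False) == ["merged_r5"]
```

The fallback is only possible because, per the earlier embodiment, the pre-merge data is left untouched until the merge has completed.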
The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.
Fig. 4 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application. As shown in fig. 4, a data processing apparatus 40 provided in the embodiment of the present application includes: an acquisition module 41, a processing module 42, a receiving module 43 and a sending module 44. The obtaining module 41, the receiving module 43 and the sending module 44 are respectively connected to the processing module 42.
An obtaining module 41, configured to obtain data to be processed. The data to be processed is file data written by the first actuator, wherein the number of files corresponding to the file data is N times of the number of reducers, N is an integer greater than 1, and the reducers are located in the second actuator.
And the processing module 42 is configured to perform merging processing on the to-be-processed data corresponding to the same Reducer to obtain merged data. And the number of the merged data is less than the number of the files corresponding to the file data.
And the receiving module 43 is configured to receive the first read operation sent by the second actuator. The first read operation is used to read the merged data corresponding to the Reducer.
And a sending module 44, configured to send the merged data corresponding to the Reducer according to the first read operation.
Optionally, the number of merged data is equal to the number of reducers. Those skilled in the art will appreciate that when the merge process is normally completed, the number of merged data is equal to the number of reducers; if the merging processing is abnormal, merging part of the data to be processed, and keeping the rest of the data to be processed unchanged, wherein the number of the merged data is larger than the number of reducers.
After the merge process is normally completed, in some embodiments, the processing module 42 may further be configured to: and updating the data to be processed. That is, in the merging process, the data to be processed needs to be kept unchanged, the updating operation on the data is not executed, and the data to be processed after the merging process is completed is updated only, so that the inconsistency of the data caused by the abnormal phenomenon is prevented.
Wherein, each merged data corresponds to the block identifier of the corresponding Reducer. In this way, the Reducer can be mapped to the corresponding merged data through the block identifier, that is, the shuffle read interface is changed, so as to adapt to the structure of the ESS provided in the embodiment of the present application.
Further, the data to be processed may be metadata corresponding to at least one application. The metadata processing corresponding to different applications is independent and non-interfering.
Optionally, each application corresponds to a thread, and the life cycle of the thread is the same as that of the corresponding application. Namely, when the application is registered, the thread corresponding to the application is generated, and after the execution of the application is finished, the thread corresponding to the application is destroyed. The thread is used for regularly scanning data structures related to corresponding applications, and merging processing operation is carried out after new file data are formed.
In one implementation, the processing module 42 maintains a thread pool with one thread for each newly registered application, and transmits the identification (e.g., ID) of the application to the thread when starting it; when the application finishes executing, the thread exits on its own.
In some embodiments, the threads are driven by timers: each thread periodically scans the data structures related to the corresponding application and performs the merging operation whenever new file data is formed. Using a timer as the driver prevents omissions, compared with using events as the driver.
Fig. 5 is a schematic structural diagram of a data processing apparatus according to another embodiment of the present application. As shown in fig. 5, the data processing apparatus 50 may further include, on the basis of the structure shown in fig. 4:
a monitoring module 51, configured to monitor whether an exception occurs in the merging process. The monitoring module 51 is connected to the processing module 42.
Correspondingly, the receiving module 43 may be further configured to receive a second read operation sent by the second executor when the monitoring module 51 detects that an exception has occurred in the merging process. The second read operation is used to read the data before merging corresponding to the Reducer.
The sending module 44 may be further configured to send the data before merging corresponding to the Reducer according to the second read operation.
Because an exception may occur during the merging process, the monitoring module monitors for it; when an exception occurs, the shuffle read is performed according to the existing flow, which ensures that the corresponding application completes normally.
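A sketch of this monitored fallback, with `pending_files` and `merge_fn` as assumed illustrative names; on a merge exception, the Reducer is served its pre-merge files through the existing shuffle read flow:

```python
def read_with_fallback(reducer_id, pending_files, merge_fn):
    """Attempt the merge; if it raises, fall back to the original
    pre-merge files so the application still completes normally.

    pending_files maps reducer_id -> list of original file paths;
    merge_fn is a hypothetical merge hook returning one merged path.
    """
    try:
        merged = merge_fn(pending_files[reducer_id])
        return [merged]                      # first read op: merged data
    except Exception:
        return pending_files[reducer_id]     # second read op: pre-merge data
```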
The data processing apparatus provided in the embodiment of the present application may be configured to execute the data processing method described in any of the above embodiments of the present application, and the implementation principle and the technical effect are similar, which are not described herein again.
Fig. 6 is a schematic structural diagram of a data processing apparatus according to yet another embodiment of the present application. As shown in Fig. 6, the data processing apparatus 60 provided in the embodiment of the present application may include a memory 61 and a processor 62.
The memory 61 is configured to store program instructions.
The processor 62 is configured to call and execute the program instructions stored in the memory 61; when the processor 62 executes the program instructions, the data processing apparatus 60 implements the data processing method described in any of the method embodiments. The implementation principle and technical effect are similar and are not repeated here.
An embodiment of the present application further provides a computer-readable storage medium storing instructions that, when executed on a computer, cause the computer to perform the data processing method of any of the above embodiments; the implementation principle and technical effect are similar and are not repeated here.
Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be implemented by hardware related to program instructions. The program may be stored in a computer-readable storage medium; when executed, it performs the steps of the method embodiments. The storage medium includes various media that can store program code, such as ROM, RAM, magnetic disks, or optical disks.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims (14)

1. A data processing method, comprising:
acquiring data to be processed, wherein the data to be processed is file data written by a first executor, the number of files corresponding to the file data is N times the number of Reducers, N is an integer greater than 1, and the Reducers are located in a second executor;
merging the data to be processed corresponding to the same Reducer to obtain merged data, wherein the number of pieces of merged data is less than the number of files corresponding to the file data;
receiving a first reading operation sent by the second executor, wherein the first reading operation is used for reading the merged data corresponding to the Reducer;
and sending the merged data corresponding to the Reducer according to the first reading operation.
2. The method of claim 1, wherein the number of merged data is equal to the number of reducers.
3. The method of claim 2, wherein each piece of merged data corresponds to the block identifier of its corresponding Reducer.
4. The method according to claim 2, wherein after merging the to-be-processed data corresponding to the same Reducer to obtain the merged data, the method further comprises:
and updating the data to be processed.
5. The method of claim 1, wherein the data to be processed is metadata corresponding to at least one application.
6. The method of claim 5, wherein each application corresponds to a thread whose life cycle is the same as that of the corresponding application, and the thread is used for periodically scanning the data structure related to the corresponding application and performing the merge processing operation each time new file data is formed.
7. The method of claim 6, wherein the thread employs a timer as a driver.
8. The method of claim 1, further comprising:
if an exception is detected in the merging process, receiving a second reading operation sent by the second executor, wherein the second reading operation is used for reading data before merging corresponding to the Reducer;
and sending the data before combination corresponding to the Reducer according to the second reading operation.
9. A data processing apparatus, comprising:
an acquisition module, configured to acquire data to be processed, wherein the data to be processed is file data written by a first executor, the number of files corresponding to the file data is N times the number of Reducers, N is an integer greater than 1, and the Reducers are located in a second executor;
a processing module, configured to merge the data to be processed corresponding to the same Reducer to obtain merged data, wherein the number of pieces of merged data is less than the number of files corresponding to the file data;
a receiving module, configured to receive a first read operation sent by the second executor, where the first read operation is used to read merged data corresponding to the Reducer;
and a sending module, configured to send the merged data corresponding to the Reducer according to the first read operation.
10. The apparatus of claim 9, wherein the number of merged data is equal to the number of reducers.
11. The apparatus of claim 10, wherein the processing module is further configured to:
after the data to be processed corresponding to the same Reducer is merged to obtain the merged data, update the data to be processed.
12. The apparatus of claim 9, further comprising:
the monitoring module is used for monitoring whether an exception occurs in the merging processing process;
correspondingly, the receiving module is further configured to receive a second reading operation sent by the second executor when the monitoring module detects an exception in the merging process, wherein the second reading operation is used for reading the data before merging corresponding to the Reducer;
and the sending module is further configured to send the data before merging corresponding to the Reducer according to the second reading operation.
13. A data processing apparatus, comprising: a memory and a processor;
wherein the memory is configured to store program instructions; and
the processor is configured to call and execute the program instructions, wherein when the processor executes the program instructions, the data processing apparatus implements the method of any one of claims 1 to 8.
14. A computer-readable storage medium having stored therein instructions which, when executed on a computer, cause the computer to perform the method of any one of claims 1 to 8.
CN201910337777.XA 2019-04-25 2019-04-25 Data processing method, device and storage medium Pending CN111857538A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910337777.XA CN111857538A (en) 2019-04-25 2019-04-25 Data processing method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910337777.XA CN111857538A (en) 2019-04-25 2019-04-25 Data processing method, device and storage medium

Publications (1)

Publication Number Publication Date
CN111857538A true CN111857538A (en) 2020-10-30

Family

ID=72951209

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910337777.XA Pending CN111857538A (en) 2019-04-25 2019-04-25 Data processing method, device and storage medium

Country Status (1)

Country Link
CN (1) CN111857538A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023040348A1 (en) * 2021-09-14 2023-03-23 Huawei Technologies Co., Ltd. Data processing method in distributed system, and related system


Similar Documents

Publication Publication Date Title
CN108647104B (en) Request processing method, server and computer readable storage medium
US7698602B2 (en) Systems, methods and computer products for trace capability per work unit
US8112559B2 (en) Increasing available FIFO space to prevent messaging queue deadlocks in a DMA environment
US9582312B1 (en) Execution context trace for asynchronous tasks
CN111064789B (en) Data migration method and system
US20110107344A1 (en) Multi-core apparatus and load balancing method thereof
CN104571958A (en) Task execution method and task execution device
WO2020232951A1 (en) Task execution method and device
US8631086B2 (en) Preventing messaging queue deadlocks in a DMA environment
CN111416825A (en) Inter-thread lock-free log management method and system, terminal and storage medium
CN111309548B (en) Timeout monitoring method and device and computer readable storage medium
US9824229B2 (en) Controller with enhanced reliability
CN112612850A (en) Data synchronization method and device
US9507637B1 (en) Computer platform where tasks can optionally share per task resources
CN110609807A (en) Method, apparatus, and computer-readable storage medium for deleting snapshot data
CN113535321A (en) Virtualized container management method, system and storage medium
CN111857538A (en) Data processing method, device and storage medium
CN107943567B (en) High-reliability task scheduling method and system based on AMQP protocol
CN113377863B (en) Data synchronization method and device, electronic equipment and computer readable storage medium
US20070067488A1 (en) System and method for transferring data
CN107958414B (en) Method and system for eliminating long transactions of CICS (common integrated circuit chip) system
US20230393782A1 (en) Io request pipeline processing device, method and system, and storage medium
US8359602B2 (en) Method and system for task switching with inline execution
CN115499493A (en) Asynchronous transaction processing method and device, storage medium and computer equipment
CN117093335A (en) Task scheduling method and device for distributed storage system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination