CN111782367A - Distributed storage method and device, electronic equipment and computer readable medium


Info

Publication number
CN111782367A
Authority
CN
China
Prior art keywords
task
state
data
thread
external
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010616643.4A
Other languages
Chinese (zh)
Other versions
CN111782367B (en)
Inventor
齐赫
王亚知
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010616643.4A priority Critical patent/CN111782367B/en
Publication of CN111782367A publication Critical patent/CN111782367A/en
Priority to KR1020210008965A priority patent/KR102544755B1/en
Priority to JP2021008733A priority patent/JP7226743B2/en
Priority to US17/184,723 priority patent/US20210406067A1/en
Priority to EP21159869.3A priority patent/EP3933582B1/en
Application granted granted Critical
Publication of CN111782367B publication Critical patent/CN111782367B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G06F9/542 Event management; Broadcasting; Multicasting; Notifications
    • G06F3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • G06F3/0604 Improving or facilitating administration, e.g. storage management
    • G06F3/0625 Power saving in storage systems
    • G06F3/0646 Horizontal data movement in storage systems, i.e. moving data in between storage devices or systems
    • G06F3/0652 Erasing, e.g. deleting, data cleaning, moving of data to a wastebasket
    • G06F3/0659 Command handling arrangements, e.g. command buffers, queues, command scheduling
    • G06F9/3867 Concurrent instruction execution using instruction pipelines
    • G06F9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G06F2209/5018 Thread allocation
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The disclosure provides a distributed storage method, relating to the technical fields of computers and cloud computing. The method comprises the following steps: in response to a task request from the driver thread, reading data and sending it to an external shuffle service; after the data has been sent to the external shuffle service, modifying the state of the task to a wait-for-completion state; and sending the wait-for-completion state to the driver thread so that the driver thread releases the execution thread corresponding to the task. The distributed storage method can reduce the waste of execution-thread resources and improve task execution efficiency. The present disclosure also provides a distributed storage apparatus, an electronic device, and a computer-readable medium.

Description

Distributed storage method and device, electronic equipment and computer readable medium
Technical Field
The embodiment of the disclosure relates to the technical field of computers and cloud computing, in particular to a distributed storage method and device, electronic equipment and a computer readable medium.
Background
In distributed data storage, the distributed computing engine Spark needs to run jobs with the help of an external Shuffle service. Specifically, Spark continuously transmits data to the external Shuffle service; the external Shuffle service merges and sorts the data and sends it to the distributed storage system for storage. After the data has been successfully written to the distributed storage system, the external Shuffle service sends a write-success response message to an execution thread (Executor thread) of Spark. This process has low execution efficiency, is time-consuming, and wastes resources.
Brief Summary
The embodiment of the disclosure provides a distributed storage method and device, electronic equipment and a computer readable medium.
In a first aspect, an embodiment of the present disclosure provides a distributed storage method, which includes:
in response to a task request from a driver thread, reading data and sending it to an external shuffle service;
after the data has been sent to the external shuffle service, modifying the state of the task to a wait-for-completion state;
and sending the wait-for-completion state to the driver thread so that the driver thread releases the execution thread corresponding to the task.
In some embodiments, the reading data and sending it to an external shuffle service in response to a task request of a driver thread includes:
in response to the task request of the driver thread, reading the data and constructing a resilient distributed dataset;
processing the resilient distributed dataset to obtain shuffle data;
and writing the shuffle data to the external shuffle service.
In some embodiments, after the state of the task is modified to the wait-for-completion state once the data has been sent to the external shuffle service, the method further comprises:
adding the task in the wait-for-completion state to a pipeline task set, wherein the pipeline task set is a set of tasks in the wait-for-completion state.
In some embodiments, after adding the task in the wait-for-completion state to the pipeline task set, the method further includes:
in response to a response message returned by the external shuffle service, calling a callback function to perform a callback operation on the task;
and removing the task on which the callback operation has been performed from the pipeline task set.
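The response handling above can be sketched as follows. This is a minimal illustrative sketch, assuming a dict-based pipeline task set and a per-task callback; the names are hypothetical stand-ins, not Spark's actual API.

```python
# Hedged sketch of the embodiment above: when the external shuffle service
# returns a response message for a task, the task's callback runs and the
# task is removed from the pipeline task set.

pipeline_task_set = {}  # task id -> callback, for tasks awaiting a response


def add_to_pipeline(task_id, on_response):
    """Register a task that is waiting for the shuffle service's response."""
    pipeline_task_set[task_id] = on_response


def on_shuffle_response(task_id, success):
    """Invoke the task's callback, then drop it from the pipeline task set."""
    callback = pipeline_task_set.pop(task_id, None)  # remove exactly once
    if callback is not None:
        callback(success)
```

Because the task is popped before the callback runs, a duplicate response message for the same task is ignored.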
In some embodiments, after adding the task in the wait-for-completion state to the pipeline task set, the method further includes:
performing a flush operation on the tasks in the pipeline task set;
obtaining the tasks in a termination state from the pipeline task set;
calling a failure callback function and a completion callback function to perform callback operations on the tasks in the termination state;
and removing the tasks on which the callback operations have been performed from the pipeline task set.
In some embodiments, performing the flush operation on the tasks in the pipeline task set includes:
flushing the tasks in the pipeline task set at a preset time interval or when the number of tasks reaches a preset value.
In some embodiments, the termination state includes a stop state, a timeout state, and a completion state.
In a second aspect, an embodiment of the present disclosure provides a distributed storage method, including:
sending a task request to an execution thread, so that the execution thread reads data and sends it to an external shuffle service;
and releasing the execution thread corresponding to the task in response to the execution thread returning the state of the task as a wait-for-completion state, wherein the wait-for-completion state is the state of a task for which the execution thread has finished sending data to the external shuffle service.
In a third aspect, an embodiment of the present disclosure provides a distributed storage apparatus, including:
the data reading module is configured to read data in response to a task request from a driver thread;
the first sending module is configured to send the data to an external shuffle service;
the state modification module is configured to modify the state of the task to a wait-for-completion state after the data has been sent to the external shuffle service;
and the second sending module is configured to send the wait-for-completion state to the driver thread so that the driver thread releases the execution thread corresponding to the task.
In a fourth aspect, an embodiment of the present disclosure provides a distributed storage apparatus, including:
the task sending module is configured to send a task request to an execution thread, so that the execution thread reads data and sends it to an external shuffle service;
and the resource releasing module is configured to release the execution thread corresponding to the task in response to the state of the task returned by the execution thread being a wait-for-completion state, wherein the wait-for-completion state is the state of a task for which the execution thread has finished sending data to the external shuffle service.
In a fifth aspect, an embodiment of the present disclosure provides an electronic device, including:
one or more processors;
a memory having one or more programs stored thereon that, when executed by the one or more processors, cause the one or more processors to perform any of the above-described distributed storage methods;
one or more I/O interfaces connected between the processor and the memory and configured to enable information interaction between the processor and the memory.
In a sixth aspect, the disclosed embodiments provide a computer readable medium, on which a computer program is stored, which when executed by a processor, implements any of the distributed storage methods described above.
The distributed storage method provided by the embodiments of the disclosure reads data and sends it to an external shuffle service in response to a task request from the driver thread; after the data has been sent to the external shuffle service, modifies the state of the task to a wait-for-completion state; and sends the wait-for-completion state to the driver thread so that the driver thread releases the execution thread corresponding to the task. In other words, once the execution thread has finished sending data to the external shuffle service, it reports the task as being in the wait-for-completion state to the driver thread, and the driver thread immediately releases the execution thread corresponding to the task instead of waiting until the task reaches a termination state, thereby reducing the waste of execution-thread resources and improving task execution efficiency.
Drawings
The accompanying drawings are included to provide a further understanding of the embodiments of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the principles of the disclosure and not to limit the disclosure. The above and other features and advantages will become more apparent to those skilled in the art by describing in detail exemplary embodiments thereof with reference to the attached drawings, in which:
FIG. 1 is a schematic flowchart of distributed storage of data using an external Shuffle service according to an embodiment of the present disclosure;
FIG. 2 is a flowchart of a distributed storage method according to an embodiment of the present disclosure;
FIG. 3 is a flowchart of a driver thread in a distributed storage method according to an embodiment of the present disclosure;
FIG. 4 is a flowchart of another distributed storage method according to an embodiment of the present disclosure;
FIG. 5 is a flowchart of management of a Pipeline task set by a Pipeline thread according to an embodiment of the present disclosure;
FIG. 6 is a flowchart of another process for managing a Pipeline task set by a Pipeline thread according to an embodiment of the present disclosure;
FIG. 7 is a flowchart of a task state update function according to an embodiment of the present disclosure;
FIG. 8 is a flowchart of performing a failure callback using a failure callback function according to an embodiment of the present disclosure;
FIG. 9 is a flowchart of performing a completion callback using a completion callback function according to an embodiment of the present disclosure;
FIG. 10 is a flowchart of a distributed storage method according to an embodiment of the present disclosure;
FIG. 11 is a schematic block diagram of a distributed storage apparatus according to an embodiment of the present disclosure;
FIG. 12 is a schematic block diagram of a distributed storage apparatus according to an embodiment of the present disclosure;
FIG. 13 is a block diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
In order to make those skilled in the art better understand the technical solution of the present disclosure, the following describes the distributed storage method and apparatus, the electronic device, and the computer readable medium provided in the present disclosure in detail with reference to the accompanying drawings.
Example embodiments will be described more fully hereinafter with reference to the accompanying drawings; however, the disclosure may be embodied in different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Embodiments of the present disclosure and features of embodiments may be combined with each other without conflict.
As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and the present disclosure, and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
When Spark transmits data to the external Shuffle service, the external Shuffle service receives the data, merges it, and performs a simple sort on it to generate data groups; when a certain distributed storage condition is met, the data groups are sent to the distributed storage system for storage.
Distributed storage conditions typically consist of a time condition, a quantity condition, and a flush (Flush) command. The time condition is a preset time threshold: when the waiting time of the external Shuffle service reaches the preset time threshold, the external Shuffle service sends the data group to the distributed storage system for storage. The quantity condition is a preset quantity threshold: when the amount of data received by the external Shuffle service reaches the preset quantity threshold, the external Shuffle service sends the data group to the distributed storage system for storage. The Flush command is a command that forces a flush: upon receiving a Flush command, the external Shuffle service sends the data group to the distributed storage system for storage.
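The three conditions above can be sketched as a single predicate. The function, its parameter names, and the default threshold values are illustrative assumptions for this sketch, not part of the external Shuffle service's real interface.

```python
import time

# Hedged sketch of the three distributed-storage (flush) conditions described
# above: an explicit Flush command, a time threshold, and a quantity threshold.

def should_store(buffered_count, last_store_time, flush_requested,
                 time_threshold_s=10.0, count_threshold=1000, now=None):
    """Return True when any of the three storage conditions holds."""
    now = time.monotonic() if now is None else now
    if flush_requested:                            # explicit Flush command
        return True
    if now - last_store_time >= time_threshold_s:  # time condition
        return True
    return buffered_count >= count_threshold       # quantity condition
```

In practice the service would evaluate such a predicate each time data arrives and on a periodic timer, then ship the buffered data group when it returns True.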
In this embodiment, an Application is a program that implements a specific function in Spark, and a Task is a specific execution unit that processes a certain data slice. Tasks can be divided into two types: the Shuffle Map Task and the Result Task. The Shuffle Map Task writes data to the external Shuffle service, which persists the data to the distributed file system. The Result Task reads and merges data from the distributed storage system, sorts the data if sorting is needed, and generates a data group.
FIG. 1 is a schematic flowchart of distributed storage of data using an external Shuffle service according to an embodiment of the present disclosure. As shown in FIG. 1, a Spark application has one driver thread 11 and multiple execution threads 12. The driver thread 11 is mainly responsible for scheduling tasks, and the execution threads 12 are responsible for running specific tasks. Under the scheduling of the driver thread 11, an execution thread 12 sends data to the distributed file storage system 14 through the external Shuffle service 13. When data is sent to the distributed file storage system, the system writes multiple copies; after the data has been successfully written, the external Shuffle service returns a response message to the execution thread of Spark. Clearly, during the period from when Spark sends the data to the external Shuffle service until the write-success response message is received, the execution thread of Spark remains in a waiting state, which wastes its computing resources and blocks the execution of subsequent tasks.
For Spark, the embodiments of the present disclosure implement a distributed storage method using an external Shuffle service, so as to optimize the pipeline performance of the external Shuffle service. With the same resources, the parallelism of task execution is increased, thereby improving the execution efficiency of Spark and reducing resource waste.
In a first aspect, an embodiment of the present disclosure provides a distributed storage method. FIG. 2 is a flowchart of a distributed storage method according to an embodiment of the present disclosure. As shown in FIG. 2, the distributed storage method includes:
Step 201: in response to a task request from the driver thread, read data and send it to the external shuffle service.
The driver thread distributes tasks to the execution threads; the execution threads execute the corresponding tasks in response to the task requests from the driver thread, and the data is stored in the distributed file system.
In this embodiment, the driver thread stores data in the distributed file system through the external Shuffle service. When an execution thread receives a task distributed by the driver thread, it reads the data and continuously sends it to the external Shuffle service, and the external Shuffle service stores the data in the distributed file system.
In some embodiments, the execution thread runs a first task, the Shuffle Map Task, and a second task, the Result Task. The execution thread executes the Shuffle Map Task as follows: in response to the task request from the driver thread, the Shuffle Map Task reads the user's data and constructs a Resilient Distributed Dataset (RDD) from it; it then invokes the user's processing logic to process the RDD to obtain shuffle data; finally, the shuffle data is continuously written to the external shuffle service.
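The Shuffle Map Task steps above (read data, build the dataset, apply the user's processing logic, write to the shuffle service) can be sketched as follows. All names are hypothetical stand-ins rather than Spark's actual API, and a plain Python list stands in for the RDD.

```python
# Illustrative sketch of the executor-side Shuffle Map Task flow described
# above. The callables passed in are assumptions standing in for the user's
# input source, processing logic, and the external shuffle service's write path.

def run_shuffle_map_task(read_input, user_logic, shuffle_service_write):
    records = read_input()                            # read the user's data
    dataset = list(records)                           # stand-in for building an RDD
    shuffle_data = [user_logic(r) for r in dataset]   # apply the processing logic
    for chunk in shuffle_data:                        # continuously write to the service
        shuffle_service_write(chunk)
    return "WAIT_FOR_COMPLETION"                      # state reported once writing finishes
```

Note that the function returns as soon as the write to the shuffle service finishes; persistence to the distributed file system happens later, which is exactly why the wait-for-completion state exists.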
Step 202: after the data has been sent to the external shuffle service, modify the state of the task to the wait-for-completion state.
In some embodiments, the execution thread manages tasks with a task set list that marks the current state of each task. Task states include a launching state, a running state, a completion state, a failure state, a stop state, a lost state, and a wait-for-completion state. The wait-for-completion state is the state of a task for which the execution thread has finished sending data to the external Shuffle service but which has not yet ended. In other words, when the data has been written to the external Shuffle service but the external Shuffle service has not yet written it to the distributed file system, the task is not in the completion state; that is, the task has not actually ended.
In this embodiment, the wait-for-completion state is added to the original task states to indicate that the data has been written to the external Shuffle service and the task is waiting for the external Shuffle service to store the data in the distributed file system. At this point the execution thread performs no specific work and does not need to occupy resources.
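The extended task life cycle can be sketched as an enumeration. The state names below are illustrative for this sketch, not the identifiers Spark uses internally.

```python
from enum import Enum, auto

class TaskState(Enum):
    """Illustrative task states; WAIT_FOR_COMPLETION is the state this method adds."""
    LAUNCHING = auto()
    RUNNING = auto()
    FINISHED = auto()
    FAILED = auto()
    KILLED = auto()
    LOST = auto()
    WAIT_FOR_COMPLETION = auto()  # data written to the shuffle service,
                                  # persistence to the file system still pending

# States in which the task has actually ended and no further response is expected.
TERMINAL_STATES = {TaskState.FINISHED, TaskState.FAILED,
                   TaskState.KILLED, TaskState.LOST}
```

The key point is that WAIT_FOR_COMPLETION is deliberately not a terminal state: the task still awaits the shuffle service's response, even though its executor can already be released.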
Step 203: send the wait-for-completion state to the driver thread so that the driver thread releases the execution thread corresponding to the task.
In some embodiments, when the execution thread has written the data to the external Shuffle service, it sends the wait-for-completion state to the driver thread. When the driver thread receives a task state that is the wait-for-completion state, it releases the execution thread corresponding to the task so that it can assign a new task to that execution thread.
FIG. 3 is a flowchart of the driver thread in a distributed storage method according to an embodiment of the present disclosure. As shown in FIG. 3, after the driver thread receives a task state reported by an execution thread, the following steps are performed:
Step 301: receive the task state reported by the execution thread.
In some embodiments, when the execution thread has written the data to the external Shuffle service, it reports the state of the task to the driver thread as the wait-for-completion state. When the execution thread receives a message from the external Shuffle service or the distributed file system indicating that the data has been stored, it reports the state of the task to the driver thread as the completion state.
Step 302: determine whether the task state is the wait-for-completion state; if yes, go to step 305; if not, go to step 303.
In some embodiments, the driver thread determines the state of the task. When the state of the task is the wait-for-completion state, step 305 is executed; otherwise, step 303 is executed.
Step 303: determine whether the task state is the completion state; if yes, go to step 304; if not, go to step 306.
In some embodiments, the driver thread determines whether the state of the task is the completion state. If it is, step 304 is executed. If it is not, the resources of the execution thread are kept unchanged; that is, the execution thread is not released.
Step 304: determine whether the previous state of the task was the wait-for-completion state; if yes, go to step 306; if not, go to step 305.
In some embodiments, when the driver thread determines that the task is in the completion state, it further determines whether the previous state of the task was the wait-for-completion state. If it was, step 306 is executed; if it was not, step 305 is executed.
In this embodiment, steps 302 and 304 check twice whether the state of the task is the wait-for-completion state. This ensures that the driver thread releases the execution thread exactly once for a task that passes through the wait-for-completion state, and avoids the driver thread erroneously releasing execution-thread resources due to confused logic.
Step 305: release the execution thread corresponding to the task.
In some embodiments, the driver thread releases the resources of the execution thread corresponding to the task in the wait-for-completion state so that the execution thread can execute a new task.
In this embodiment, the driver thread releases the resources of the execution thread corresponding to the task when the task state is the wait-for-completion state, or when the task state is the completion state but the previous state was not the wait-for-completion state.
Step 306: keep the resources of the execution thread unchanged.
In some embodiments, when the state of the task is not the completion state, the execution thread is not released and its resources are kept unchanged. Likewise, when the task is in the completion state but its previous state was the wait-for-completion state, the execution thread is not released again and its resources are kept unchanged.
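The driver-side decision above (steps 301 to 306) can be sketched as a small function. This is an illustrative sketch of the logic in FIG. 3, not Spark's actual scheduler code, and the state-name strings are assumptions.

```python
# Illustrative sketch of the driver-side decision in FIG. 3 (steps 301-306).
# The double check on the wait-for-completion state ensures the executor slot
# is released exactly once per task: either when WAIT_FOR_COMPLETION is first
# reported, or when FINISHED arrives without a prior WAIT_FOR_COMPLETION.

def should_release_executor(current_state, previous_state):
    """Return True if the driver should release the execution thread."""
    if current_state == "WAIT_FOR_COMPLETION":       # step 302 -> step 305
        return True
    if current_state == "FINISHED":                  # step 303
        if previous_state == "WAIT_FOR_COMPLETION":  # step 304 -> step 306
            return False                             # already released earlier
        return True                                  # step 305
    return False                                     # step 306: keep resources
```

For a task that first reports WAIT_FOR_COMPLETION and later FINISHED, the function returns True once and then False, so the execution thread is released a single time.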
The distributed storage method provided by this embodiment reads data and sends it to the external shuffle service in response to a task request from the driver thread; after the data has been sent to the external shuffle service, modifies the state of the task to the wait-for-completion state; and sends the wait-for-completion state to the driver thread so that the driver thread releases the execution thread corresponding to the task. In other words, once the execution thread has finished sending data to the external shuffle service, it reports the task as being in the wait-for-completion state to the driver thread, and the driver thread immediately releases the corresponding execution thread instead of waiting until the task reaches a termination state, thereby reducing the waste of execution-thread resources and improving task execution efficiency.
Fig. 4 is a flowchart of another distributed storage method provided in the embodiments of the present disclosure. As shown in fig. 4, the distributed storage method includes:
in response to a task request of a driver thread, data is read and sent to an external shuffle service, step 401.
The driving thread distributes tasks to the execution threads; each execution thread executes the corresponding task in response to the task request of the driving thread and stores data in the distributed file system.
In this embodiment, the driving thread stores data in the distributed file system through an external Shuffle service. When the execution thread receives a task distributed by the driving thread, it reads the data and continuously sends it to the external Shuffle service, which stores the data in the distributed file system.
In some embodiments, the execution thread runs a first task (Map Shuffle Task) and a second task (ResultTask). The execution thread executes the Map Shuffle Task as follows: in response to the task request of the driving thread, the Map Shuffle Task reads the user's data and constructs a Resilient Distributed Dataset (RDD) from the data; the user's processing logic is then invoked to process the RDD and obtain shuffle data; finally, the shuffle data is continuously written to the external shuffle service.
Wherein the Resilient Distributed Dataset (RDD) is a distributed, read-only, partitioned collection object. These datasets are resilient: if part of a dataset is lost, it can be reconstructed.
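The Map Shuffle Task flow just described (read user data, build a partitioned RDD-like collection, apply the user's processing logic, continuously write shuffle data) can be sketched as follows. This is a plain-Python illustration, not Spark's actual API; all names are assumptions:

```python
class ExternalShuffleService:
    """Stand-in for the external Shuffle service that receives shuffle
    data and later persists it to the distributed file system."""
    def __init__(self):
        self.received = []

    def write(self, record):
        self.received.append(record)

def build_rdd(data, num_partitions=2):
    """Split input into read-only partitions, mimicking an RDD."""
    return [data[i::num_partitions] for i in range(num_partitions)]

def map_shuffle_task(data, user_logic, shuffle_service):
    """Illustrative Map Shuffle Task: construct the RDD, call the user's
    processing logic, and continuously write shuffle data out."""
    rdd = build_rdd(data)                               # construct the RDD
    for partition in rdd:
        shuffled = [user_logic(x) for x in partition]   # user processing logic
        for record in shuffled:
            shuffle_service.write(record)               # write shuffle data

service = ExternalShuffleService()
map_shuffle_task([1, 2, 3, 4], lambda x: x * 10, service)
print(sorted(service.received))  # -> [10, 20, 30, 40]
```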
Step 402, after the data has been sent to the external shuffle service, the state of the task is modified to the waiting completion state.
In some embodiments, the execution thread manages tasks with a task set list that marks the current state of each task. The task states comprise a starting state, a running state, a completion state, a failure state, a stop state, a lost state and a waiting completion state. The waiting completion state is the state in which the execution thread has finished sending the data to the external Shuffle service but the storage has not yet finished. In other words, when the data has been written to the external Shuffle service but the external Shuffle service has not yet written it to the distributed file system, the state of the task is not the completion state; that is, the task has not actually ended.
In this embodiment, a waiting completion state is added to the original task states to indicate that the data has been written to the external Shuffle service and the task is waiting for the external Shuffle service to store the data in the distributed file system. At this point the execution thread performs no actual work and need not occupy resources.
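The task states listed above, plus the newly added waiting completion state, can be modeled as an enumeration. The names below are hypothetical placeholders, not Spark's actual state names:

```python
from enum import Enum, auto

class TaskState(Enum):
    """Illustrative task states: the original states plus the newly added
    waiting completion state (names are hypothetical)."""
    LAUNCHING = auto()      # starting state
    RUNNING = auto()        # running state
    FINISHED = auto()       # completion state
    FAILED = auto()         # failure state
    KILLED = auto()         # stop state
    LOST = auto()           # lost state
    WAIT_FINISHED = auto()  # waiting completion state (the new state)

def occupies_executor(state):
    """Only a task that is launching or running needs the execution
    thread; a task merely waiting for the external Shuffle service to
    persist its data does not, which is why it can be released early."""
    return state in (TaskState.LAUNCHING, TaskState.RUNNING)

print(occupies_executor(TaskState.WAIT_FINISHED))  # False
```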
And step 403, sending the waiting completion state to the driving thread, so that the driving thread releases the execution thread corresponding to the task.
In some embodiments, after the execution thread has written the data to the external Shuffle service, the waiting completion state is sent to the driving thread. When the driving thread receives the waiting completion state of the task, it releases the execution thread corresponding to the task.
And step 404, adding the task in the waiting completion state into the pipeline task set.
The pipeline task set is the set of tasks, managed by the execution thread, whose state is the waiting completion state. Within the pipeline task set, tasks are managed in list form, i.e., the tasks and their states are listed.
In some embodiments, the execution thread is provided with an external Shuffle service plug-in, and a pipeline thread (Pipeline thread) responsible for maintaining the pipeline task set is added to the plug-in. When the execution thread has written its data to the external Shuffle service, the task is added to the pipeline task set and its state is modified to the waiting completion state.
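A minimal sketch of the pipeline task set maintained by the Pipeline thread might look like the following; the class and its locking scheme are illustrative assumptions:

```python
import threading

class PipelineTaskSet:
    """Illustrative pipeline task set: tasks in the waiting completion
    state, maintained by a dedicated Pipeline thread. A lock is used
    because the execution thread adds tasks while the Pipeline thread
    removes them."""
    def __init__(self):
        self._lock = threading.Lock()
        self._tasks = {}   # task_id -> state string

    def add(self, task_id):
        # Called once the execution thread has written its data to the
        # external Shuffle service: mark the task as waiting completion.
        with self._lock:
            self._tasks[task_id] = "WAIT_FINISHED"

    def remove(self, task_id):
        with self._lock:
            return self._tasks.pop(task_id, None)

    def contains(self, task_id):
        with self._lock:
            return task_id in self._tasks

pipeline = PipelineTaskSet()
pipeline.add("task-1")
print(pipeline.contains("task-1"))  # True
```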
In this embodiment, the Pipeline thread manages a Pipeline task set. Fig. 5 is a flowchart illustrating management of Pipeline task sets by Pipeline threads according to an embodiment of the present disclosure. As shown in fig. 5, Pipeline task set management by Pipeline thread includes:
and step 501, responding to a response message returned by the external shuffling service, and calling a callback function to perform callback operation on the task.
The response message returned by the external shuffle service is the message returned after the external shuffle service executes a task, i.e., after the external shuffle service stores the data in the distributed file system. The returned message is typically the state of the task.
In some embodiments, the results of the tasks performed by the external shuffling service include stop, timeout, complete, etc., with the corresponding task states being stop, timeout, and complete. For convenience of description, the present embodiment refers to these states collectively as termination states, i.e., indicating that the task has terminated. In other words, a task is considered to have terminated regardless of whether the state of the task is a stopped state, a timed-out state, or a completed state.
The callback functions comprise a failure callback function and a completion callback function, and each task registers both a failure callback and a completion callback.
In some embodiments, the Pipeline thread calls a corresponding callback function after receiving a response message returned by the external shuffling service.
Step 502, removing the task after the callback operation is executed from the pipeline task set.
And after the Pipeline thread performs callback operation on the task, removing the task from the Pipeline task set.
Fig. 6 is a flowchart of another Pipeline task set management by Pipeline threads according to an embodiment of the present disclosure. As shown in fig. 6, Pipeline task set management by Pipeline thread includes:
step 601, performing a flushing operation on the tasks of the pipeline task set.
In some embodiments, the Pipeline thread flushes the tasks according to a flush policy, which may be based on a preset time interval or on the number of tasks reaching a preset value. For example, if the preset time interval is 10 minutes, the Pipeline thread flushes the tasks in the pipeline task set every 10 minutes. Alternatively, when the number of tasks in the pipeline task set reaches a preset value, the Pipeline thread performs one flush of the tasks in the pipeline task set.
It should be noted that the Pipeline thread performs a flushing operation on the task according to the flushing policy. The embodiment does not limit the flush policy.
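One possible flush policy combining the two triggers above (elapsed time or task count) could be sketched as follows; the class name, default values, and method names are illustrative, since the embodiment does not limit the flush policy:

```python
import time

class FlushPolicy:
    """Flush when a preset time interval has elapsed or the number of
    pending tasks reaches a preset threshold (both values illustrative)."""
    def __init__(self, interval_s=600.0, max_tasks=100):
        self.interval_s = interval_s      # e.g. 10 minutes
        self.max_tasks = max_tasks        # preset task-count threshold
        self._last_flush = time.monotonic()

    def should_flush(self, pending_count):
        if pending_count >= self.max_tasks:
            return True                   # count trigger
        if time.monotonic() - self._last_flush >= self.interval_s:
            return True                   # time trigger
        return False

    def mark_flushed(self):
        self._last_flush = time.monotonic()

policy = FlushPolicy(interval_s=600.0, max_tasks=3)
print(policy.should_flush(2))  # False: neither condition met yet
print(policy.should_flush(3))  # True: task count reached the threshold
```

Batching writes this way is what reduces the number of small files and the pressure on the distributed file system, as the next paragraph notes.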
In this embodiment, the number of small files in the distributed storage process can be reduced by using the flushing policy, the pressure of distributed storage is reduced, and the throughput capacity of the distributed file system is improved.
Step 602, obtaining the task in the termination state from the pipeline task set.
The termination state comprises a stop state, a timeout state and a completion state, and correspondingly, the tasks in the termination state comprise a stopped task, a timeout task and a completed task.
In some embodiments, the tasks in the pipeline task set are filtered to obtain stopped tasks, timed-out tasks, and completed tasks.
Step 603, calling the failure callback function and the completion callback function, and performing callback operation on the task in the termination state.
In some embodiments, a task is called back by triggering its failure callback function and completion callback function. For example, for a task in the stop state, both the failure callback function and the completion callback function are triggered; for a timed-out task, both the failure callback function and the completion callback function are triggered; for a completed task, only the completion callback function is triggered.
The Pipeline thread does not restrict the order of task callbacks. For example, it may first filter out and call back the tasks in the stop state, then the timed-out tasks, and finally the completed tasks; or it may first filter out and call back the timed-out tasks, then the tasks in the stop state, and finally the completed tasks.
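The dispatch rule just described (stopped and timed-out tasks trigger both callbacks; completed tasks trigger only the completion callback) can be sketched as a small helper. The state strings are illustrative assumptions:

```python
def run_callbacks(state, on_failure, on_complete):
    """Trigger callbacks for a terminated task: stopped and timed-out
    tasks run both the failure and the completion callback; completed
    tasks run only the completion callback."""
    if state in ("stopped", "timeout"):
        on_failure(state)
        on_complete(state)
    elif state == "finished":
        on_complete(state)
    else:
        raise ValueError(f"not a termination state: {state}")

calls = []
run_callbacks("timeout", calls.append, calls.append)
print(calls)  # ['timeout', 'timeout'] -- both callbacks fired
calls.clear()
run_callbacks("finished", calls.append, calls.append)
print(calls)  # ['finished'] -- only the completion callback fired
```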
And step 604, removing the task after the callback operation is executed from the pipeline task set.
And after the Pipeline thread calls back the task, removing the task from the Pipeline task set.
In some embodiments, during a callback the Pipeline thread may call a state update function (statusUpdate function) to report the state and running result of the task to the driving thread. The execution thread may support two types of callback functions: a failure callback function and a completion callback function.
Fig. 7 is a flowchart of updating a task state by a state update function according to an embodiment of the present disclosure. As shown in fig. 7, the step of updating the task state by the state updating function includes:
step 701, judging whether the task state is a waiting completion state.
In some embodiments, step 708 is performed when it is determined that the status of the task is not a wait for completion status. When it is determined that the status of the task is a wait for completion status, step 702 is performed. When the state of the task is not the wait for completion state, the state of the task is considered to be the termination state.
Step 702, putting the task into a pipeline task set.
When the state of the task is a waiting completion state, the task is added into the pipeline task set.
Step 703 registers the failure callback function.
In step 703, the failed callback function is registered in the pipeline thread.
Step 704, register the completion callback function.
In step 704, the completion callback function is registered in the pipeline thread.
Step 705, determine whether the task is in the pipeline task set.
In some embodiments, step 706 is performed if the task is in the pipeline task set, and step 707 is performed if the task is not in the pipeline task set.
Step 706, report the status of the task to the driver thread.
In step 706, the wait for completion status of the task is reported to the driver thread.
Step 707, the task is terminated.
In step 707, if the task is not in the pipeline task set, this indicates that the callback function of the task has already been triggered, so the task is terminated directly without reporting the waiting completion state.
At step 708, the termination state of the task is reported to the driver thread.
It should be noted that, in step 708, the execution thread directly reports the termination state of the task to the driver thread according to the Spark flow.
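The statusUpdate flow of fig. 7 (steps 701-708) can be sketched as follows. The function, the `DriverStub` class, and the state strings are hypothetical illustrations, and the pipeline task set is simplified to a plain Python set; in the real flow, another thread could remove the task between steps 702 and 705, which is why step 705 re-checks membership:

```python
class DriverStub:
    """Minimal stand-in for the driving thread's reporting endpoint."""
    def __init__(self):
        self.reports = []
    def report(self, task_id, state):
        self.reports.append((task_id, state))

def status_update(task, pipeline_set, driver):
    """Illustrative statusUpdate flow (steps 701-708)."""
    if task["state"] != "WAIT_FINISHED":            # step 701: not waiting completion?
        driver.report(task["id"], task["state"])    # step 708: report terminal state
        return "reported-terminal"
    pipeline_set.add(task["id"])                    # step 702: join pipeline task set
    task["callbacks_registered"] = True             # steps 703-704: register callbacks
    if task["id"] in pipeline_set:                  # step 705: still in the set?
        driver.report(task["id"], "WAIT_FINISHED")  # step 706: report waiting completion
        return "reported-wait"
    return "terminated"                             # step 707: callbacks already ran

driver, pending = DriverStub(), set()
print(status_update({"id": "t1", "state": "WAIT_FINISHED"}, pending, driver))
# -> reported-wait
print(status_update({"id": "t2", "state": "FAILED"}, pending, driver))
# -> reported-terminal
```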
Fig. 8 is a flowchart of performing a failure callback using a failure callback function in the embodiment of the present disclosure. As shown in fig. 8, the step of performing a failure callback by using the failure callback function includes:
step 801, remove tasks from a pipeline task set.
Tasks in the pipeline task set are in the waiting completion state, but once the external Shuffle service completes the storage it returns a new state, and the execution thread updates the state of the task accordingly. Therefore, during a flush operation the task needs to be removed from the pipeline task set.
Step 802, determine if the task state is a stop state.
In some embodiments, if the state of the task is the stop state, step 803 is performed; otherwise step 804 is performed, and the task is considered to be in a failure state.
In step 803, the stop state is reported to the driver thread.
Step 804, reporting the failure state to the driving thread.
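The failure callback of fig. 8 (steps 801-804) can be sketched as follows; the `DriverStub` class and state strings are hypothetical, and the pipeline task set is simplified to a plain set:

```python
class DriverStub:
    """Minimal stand-in for the driving thread's reporting endpoint."""
    def __init__(self):
        self.reports = []
    def report(self, task_id, state):
        self.reports.append((task_id, state))

def failure_callback(task_id, state, pipeline_set, driver):
    """Illustrative failure callback: remove the task from the pipeline
    task set (step 801), report the stop state as-is (steps 802-803),
    and any other state as a failure (step 804)."""
    pipeline_set.discard(task_id)            # step 801
    if state == "KILLED":                    # step 802: stop state?
        driver.report(task_id, "KILLED")     # step 803
    else:
        driver.report(task_id, "FAILED")     # step 804

pending, driver = {"t1"}, DriverStub()
failure_callback("t1", "TIMEOUT", pending, driver)
print(pending, driver.reports)  # set() [('t1', 'FAILED')]
```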
FIG. 9 is a flow diagram illustrating a completion callback performed by a completion callback function according to an embodiment of the present disclosure. As shown in fig. 9, the step of performing a completion callback using a completion callback function includes:
step 901, judging whether a pipeline task set has a task.
In some embodiments, when there is a task in the pipeline task set, step 902 is performed; when there are no tasks in the pipeline task set, step 904 is performed.
Step 902, remove the task from the set of pipe tasks.
Step 903, reporting the completion state of the task to the driving thread.
Step 904, terminate the task.
In step 904, if the task is not in the pipeline task set, this indicates that the task is already in a failure state or a stop state (its failure callback has run), so the task is terminated directly without reporting to the driving thread.
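The completion callback of fig. 9 (steps 901-904) can be sketched similarly; again the `DriverStub` class and state strings are hypothetical, and the pipeline task set is a plain set:

```python
class DriverStub:
    """Minimal stand-in for the driving thread's reporting endpoint."""
    def __init__(self):
        self.reports = []
    def report(self, task_id, state):
        self.reports.append((task_id, state))

def completion_callback(task_id, pipeline_set, driver):
    """Illustrative completion callback: report completion only if the
    task is still in the pipeline task set (steps 901-903); otherwise a
    failure/stop callback already removed it, so terminate without
    reporting (step 904)."""
    if task_id in pipeline_set:               # step 901
        pipeline_set.discard(task_id)         # step 902
        driver.report(task_id, "FINISHED")    # step 903
        return "reported"
    return "terminated"                       # step 904

pending, driver = {"t1"}, DriverStub()
print(completion_callback("t1", pending, driver))  # reported
print(completion_callback("t1", pending, driver))  # terminated
```

Since each task registers both callbacks, the membership check is what keeps a task from being reported twice when the failure callback ran first.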
The distributed storage method provided by the embodiment of the disclosure reads data and sends it to an external shuffle service in response to a task request of the driving thread; after the data has been sent to the external shuffle service, the state of the task is modified to the waiting completion state; and the waiting completion state is sent to the driving thread so that the driving thread releases the execution thread corresponding to the task. In other words, once the execution thread has finished sending data to the external shuffle service, it returns the waiting completion state of the task to the driving thread, and the driving thread immediately releases the execution thread corresponding to the task rather than waiting until the task reaches a termination state, thereby reducing the waste of execution thread resources and improving task running efficiency.
In a second aspect, the embodiments of the present disclosure provide a distributed storage method, which is applied to a driver thread of Spark. Fig. 10 is a flowchart of a distributed storage method according to an embodiment of the present disclosure. As shown in fig. 10, the distributed storage method includes:
step 1001, a task request is sent to the execution thread to cause the execution thread to read and send data to the external shuffle service.
The driving thread distributes tasks to the execution threads; each execution thread executes the corresponding task in response to the task request of the driving thread and stores data in the distributed file system.
In this embodiment, the driving thread stores data in the distributed file system through an external Shuffle service. When the execution thread receives a task distributed by the driving thread, it reads the data and continuously sends it to the external Shuffle service, which stores the data in the distributed file system.
In some embodiments, the execution thread runs a first task (Map Shuffle Task) and a second task (ResultTask). The execution thread executes the Map Shuffle Task as follows: in response to the task request of the driving thread, the Map Shuffle Task reads the user's data and constructs a Resilient Distributed Dataset (RDD) from the data; the user's processing logic is then invoked to process the RDD and obtain shuffle data; finally, the shuffle data is continuously written to the external shuffle service.
Step 1002, in response to the execution thread returning the waiting completion state of the task, releasing the execution thread corresponding to the task.
The waiting completion state is the state of the task after the execution thread sends the data to the external shuffle service.
In some embodiments, after the execution thread has written the data to the external Shuffle service, the waiting completion state is sent to the driving thread. When the driving thread receives the waiting completion state of the task, it releases the execution thread corresponding to the task, so that the driving thread can redistribute tasks to the execution thread.
In some embodiments, for the specific workflow of the execution thread, reference may be made to the flowchart shown in fig. 3, which is not repeated here.
The distributed storage method provided by the embodiment of the disclosure reads data and sends it to an external shuffle service in response to a task request of the driving thread; after the data has been sent to the external shuffle service, the state of the task is modified to the waiting completion state; and the waiting completion state is sent to the driving thread so that the driving thread releases the execution thread corresponding to the task. In other words, once the execution thread has finished sending data to the external shuffle service, it returns the waiting completion state of the task to the driving thread, and the driving thread immediately releases the execution thread corresponding to the task rather than waiting until the task reaches a termination state, thereby reducing the waste of execution thread resources and improving task running efficiency.
In a third aspect, embodiments of the present disclosure provide a distributed storage apparatus, which is applied to an execution thread. Fig. 11 is a schematic block diagram of a distributed storage apparatus according to an embodiment of the present disclosure. As shown in fig. 11, the distributed storage apparatus includes:
a read data module 1101, configured to read data in response to a task request of a driving thread.
The driving thread distributes tasks to the execution threads; each execution thread executes the corresponding task in response to the task request of the driving thread and stores data in the distributed file system.
In this embodiment, the driving thread stores data in the distributed file system through an external Shuffle service. When the execution thread receives a task distributed by the driving thread, it reads the data and continuously sends it to the external Shuffle service, which stores the data in the distributed file system.
A first sending module 1102 for sending data to an external shuffling service.
In some embodiments, the execution thread may also process the data before sending it to the external shuffle service. Specifically, in response to the task request of the driving thread, the user's data is read and a Resilient Distributed Dataset (RDD) is constructed from the data; the user's processing logic is then invoked to process the RDD and obtain shuffle data; finally, the shuffle data is continuously written to the external shuffle service.
And a state modification module 1103, configured to modify the state of the task to the waiting completion state after the data has been sent to the external shuffle service.
In some embodiments, the execution thread manages tasks with a task set list that marks the current state of each task. The task states comprise a starting state, a running state, a completion state, a failure state, a stop state, a lost state and a waiting completion state. The waiting completion state is the state in which the execution thread has finished sending the data to the external Shuffle service but the storage has not yet finished. In other words, when the data has been written to the external Shuffle service but the external Shuffle service has not yet written it to the distributed file system, the state of the task is not the completion state; that is, the task has not actually ended.
In this embodiment, a waiting completion state is added to the original task states to indicate that the data has been written to the external Shuffle service and the task is waiting for the external Shuffle service to store the data in the distributed file system. At this point the execution thread performs no actual work and need not occupy resources.
A second sending module 1104, configured to send the wait completion status to the driving thread, so that the driving thread releases the execution thread corresponding to the task.
In some embodiments, after the execution thread has written the data to the external Shuffle service, the waiting completion state is sent to the driving thread. When the driving thread receives the waiting completion state of the task, it releases the execution thread corresponding to the task, so that the driving thread can redistribute tasks to the execution thread.
The distributed storage method provided by the embodiment of the disclosure reads data and sends it to an external shuffle service in response to a task request of the driving thread; after the data has been sent to the external shuffle service, the state of the task is modified to the waiting completion state; and the waiting completion state is sent to the driving thread so that the driving thread releases the execution thread corresponding to the task. In other words, once the execution thread has finished sending data to the external shuffle service, it returns the waiting completion state of the task to the driving thread, and the driving thread immediately releases the execution thread corresponding to the task rather than waiting until the task reaches a termination state, thereby reducing the waste of execution thread resources and improving task running efficiency.
In a fourth aspect, embodiments of the present disclosure provide a distributed storage apparatus, which is applied to a drive thread. Fig. 12 is a schematic block diagram of a distributed storage apparatus according to an embodiment of the present disclosure. As shown in fig. 12, the distributed storage apparatus includes:
a task sending module 1201, configured to send a task request to the execution thread, so that the execution thread reads and sends data to the external shuffle service.
The driving thread distributes tasks to the execution threads; each execution thread executes the corresponding task in response to the task request of the driving thread and stores data in the distributed file system.
In this embodiment, the driving thread stores data in the distributed file system through an external Shuffle service. When the execution thread receives a task distributed by the driving thread, it reads the data and continuously sends it to the external Shuffle service, which stores the data in the distributed file system.
In some embodiments, the execution thread runs a first task (Map Shuffle Task) and a second task (ResultTask). The execution thread executes the Map Shuffle Task as follows: in response to the task request of the driving thread, the Map Shuffle Task reads the user's data and constructs a Resilient Distributed Dataset (RDD) from the data; the user's processing logic is then invoked to process the RDD and obtain shuffle data; finally, the shuffle data is continuously written to the external shuffle service.
A receiving module 1202, configured to receive the state of the task returned by the execution thread.
The resource releasing module 1203 is configured to release the execution thread corresponding to the task when the state of the task returned by the execution thread is the waiting completion state.
The waiting completion state is the state of the task after the execution thread sends the data to the external shuffle service.
In some embodiments, after the execution thread has written the data to the external Shuffle service, the waiting completion state is sent to the driving thread. When the driving thread receives the waiting completion state of the task, it releases the execution thread corresponding to the task, so that the driving thread can redistribute tasks to the execution thread.
The distributed storage method provided by the embodiment of the disclosure reads data and sends it to an external shuffle service in response to a task request of the driving thread; after the data has been sent to the external shuffle service, the state of the task is modified to the waiting completion state; and the waiting completion state is sent to the driving thread so that the driving thread releases the execution thread corresponding to the task. In other words, once the execution thread has finished sending data to the external shuffle service, it returns the waiting completion state of the task to the driving thread, and the driving thread immediately releases the execution thread corresponding to the task rather than waiting until the task reaches a termination state, thereby reducing the waste of execution thread resources and improving task running efficiency.
In a fifth aspect, referring to fig. 13, an embodiment of the present disclosure provides an electronic device, including:
one or more processors 1301;
a memory 1302 on which one or more programs are stored which, when executed by the one or more processors, cause the one or more processors to implement the distributed storage method of any one of the above;
and one or more I/O interfaces 1303 connected between the processor and the memory, and configured to enable information interaction between the processor and the memory.
The processor 1301 is a device with data processing capability, and includes, but is not limited to, a Central Processing Unit (CPU), and the like; memory 1302 is a device having data storage capabilities including, but not limited to, random access memory (RAM, more specifically SDRAM, DDR, etc.), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), FLASH memory (FLASH); an I/O interface (read/write interface) 1303 is connected between the processor 1301 and the memory 1302, and can implement information interaction between the processor 1301 and the memory 1302, which includes but is not limited to a data Bus (Bus) and the like.
In some embodiments, processor 1301, memory 1302, and I/O interface 1303 connect to each other and to other components of the computing device via a bus.
In a sixth aspect, the disclosed embodiments provide a computer readable medium, on which a computer program is stored, which when executed by a processor, implements any of the distributed storage methods described above.
It will be understood by those of ordinary skill in the art that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. 
In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to those skilled in the art.
Example embodiments have been disclosed herein, and although specific terms are employed, they are used and should be interpreted in a generic and descriptive sense only and not for purposes of limitation. In some instances, features, characteristics and/or elements described in connection with a particular embodiment may be used alone or in combination with features, characteristics and/or elements described in connection with other embodiments, unless expressly stated otherwise, as would be apparent to one skilled in the art. Accordingly, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the disclosure as set forth in the appended claims.

Claims (12)

1. A distributed storage method, comprising:
responding to the task request of the driving thread, reading and sending data to an external shuffle service;
after the data has been sent to the external shuffle service, modifying the state of the task into a waiting completion state;
and sending the waiting completion state to the driving thread so that the driving thread releases the execution thread corresponding to the task.
2. The method of claim 1, wherein said reading and sending data to an external shuffle service in response to a task request of a driver thread comprises:
responding to a task request of a driving thread to read the data, and constructing an elastic distributed data set based on the data;
processing the elastic distributed data set to obtain shuffling data;
writing the shuffling data to the external shuffling service.
3. The method of claim 1, wherein the modifying the status of the task to a wait for completion status after the data is sent to the external shuffling service comprises:
adding the task in a waiting completion state into a pipeline task set; wherein the pipeline task set is a set of tasks in a wait for completion state.
4. The method of claim 3, wherein after joining the task in a wait for completion state to a pipeline task set, further comprising:
responding to a response message returned by the external shuffling service, and calling a callback function to perform callback operation on the task;
removing the task that performed the callback operation from the set of pipeline tasks.
5. The method of claim 3, wherein after joining the task in a wait for completion state to a pipeline task set, further comprising:
performing a flushing operation on the tasks in the pipeline task set;
filtering out tasks in a termination state from the pipeline task set;
calling a failure callback function and a completion callback function, and carrying out callback operation on the task in the termination state;
removing the task after the callback operation is executed from the pipeline task set.
6. The method of claim 5, wherein the flushing tasks of the pipeline task set comprises:
and flushing the tasks in the pipeline task set according to a preset time interval or when the number of the tasks reaches a preset value.
7. The method of claim 5 or 6, wherein the termination state comprises a stop state, a timeout state, and a completion state.
8. A distributed storage method, comprising:
sending a task request to an execution thread, so that the execution thread reads data and sends the data to an external shuffle service; and
releasing the execution thread corresponding to the task in response to the execution thread returning the state of the task as a wait-for-completion state, wherein the wait-for-completion state is the state of the task after the execution thread has finished sending the data to the external shuffle service.
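The driver-side method of claim 8 releases the execution thread as soon as the task reports the wait-for-completion state, rather than holding the thread until the external shuffle service acknowledges the data. A minimal sketch, with the thread pool reduced to a free-thread counter and all names (`Driver`, `submit`) hypothetical:

```python
WAIT_FOR_COMPLETION = "wait_for_completion"


class Driver:
    def __init__(self, pool_size):
        self.free_threads = pool_size

    def submit(self, execute):
        """Send the task request; release the thread early if the
        returned state is wait-for-completion (claim 8)."""
        self.free_threads -= 1     # thread occupied by the task
        state = execute()          # thread reads and sends the data
        if state == WAIT_FOR_COMPLETION:
            self.free_threads += 1 # released without awaiting the ack
        return state


driver = Driver(pool_size=2)
state = driver.submit(lambda: WAIT_FOR_COMPLETION)
```

The point of the early release is throughput: the thread becomes schedulable for the next task while the shuffle service is still persisting the previous task's data.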
9. A distributed storage device, comprising:
a data reading module configured to read data in response to a task request of a driver thread;
a first sending module configured to send the data to an external shuffle service;
a state modification module configured to modify the state of the task to a wait-for-completion state after the data has been sent to the external shuffle service; and
a second sending module configured to send the wait-for-completion state to the driver thread, so that the driver thread releases the execution thread corresponding to the task.
10. A distributed storage device, comprising:
a task sending module configured to send a task request to an execution thread, so that the execution thread reads data and sends the data to an external shuffle service;
a receiving module configured to receive the state of the task returned by the execution thread; and
a resource releasing module configured to release the execution thread corresponding to the task when the state of the task returned by the execution thread is a wait-for-completion state, wherein the wait-for-completion state is the state of the task after the execution thread has finished sending the data to the external shuffle service.
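The two devices of claims 9 and 10 can be wired together in an end-to-end sketch: the executor-side device stores the data with the external shuffle service and reports the wait-for-completion state; the driver-side device receives that state and releases the thread. Class and method names mirror the claim modules but are illustrative, and the message passing is a direct call rather than RPC.

```python
WAIT_FOR_COMPLETION = "wait_for_completion"


class ExecutorDevice:
    """Claim 9: data reading, first sending, state modification,
    and second sending modules, collapsed into one handler."""
    def __init__(self, shuffle_service):
        self.shuffle_service = shuffle_service

    def handle(self, task_id, data):
        self.shuffle_service[task_id] = list(data)  # send to shuffle service
        return WAIT_FOR_COMPLETION                  # report the new state


class DriverDevice:
    """Claim 10: task sending, receiving, and resource releasing modules."""
    def __init__(self, executor, pool_size=1):
        self.executor = executor
        self.free_threads = pool_size
        self.released = []

    def run(self, task_id, data):
        self.free_threads -= 1
        state = self.executor.handle(task_id, data)  # send task, receive state
        if state == WAIT_FOR_COMPLETION:             # resource releasing module
            self.free_threads += 1
            self.released.append(task_id)
        return state


shuffle_service = {}
driver = DriverDevice(ExecutorDevice(shuffle_service))
state = driver.run("t-9", [1, 2, 3])
```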
11. An electronic device, comprising:
one or more processors;
storage means having one or more programs stored thereon which, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-8;
one or more I/O interfaces connected between the one or more processors and the storage means, and configured to enable information exchange between the one or more processors and the storage means.
12. A computer-readable medium having a computer program stored thereon which, when executed by a processor, implements the method according to any one of claims 1-8.
CN202010616643.4A 2020-06-30 2020-06-30 Distributed storage method and device, electronic equipment and computer readable medium Active CN111782367B (en)

Priority Applications (5)

Application Number Priority Date Filing Date Title
CN202010616643.4A CN111782367B (en) 2020-06-30 2020-06-30 Distributed storage method and device, electronic equipment and computer readable medium
KR1020210008965A KR102544755B1 (en) 2020-06-30 2021-01-21 Distributed storage method and device, electronic apparatus, computer-readable medium and computer program product
JP2021008733A JP7226743B2 (en) 2020-06-30 2021-01-22 Distributed storage methods and apparatus, electronics, computer readable media and computer program products
US17/184,723 US20210406067A1 (en) 2020-06-30 2021-02-25 Distributed storage method, electronic apparatus and non-transitory computer-readable storage medium
EP21159869.3A EP3933582B1 (en) 2020-06-30 2021-03-01 Distributed storage method and device, electronic apparatus and computer-readable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010616643.4A CN111782367B (en) 2020-06-30 2020-06-30 Distributed storage method and device, electronic equipment and computer readable medium

Publications (2)

Publication Number Publication Date
CN111782367A true CN111782367A (en) 2020-10-16
CN111782367B CN111782367B (en) 2023-08-08

Family

ID=72760428

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010616643.4A Active CN111782367B (en) 2020-06-30 2020-06-30 Distributed storage method and device, electronic equipment and computer readable medium

Country Status (5)

Country Link
US (1) US20210406067A1 (en)
EP (1) EP3933582B1 (en)
JP (1) JP7226743B2 (en)
KR (1) KR102544755B1 (en)
CN (1) CN111782367B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1976349A * 2001-10-05 2007-06-06 BEA Systems Inc. System and method for receiving and sending asynchronous messages between Java servlets and HTTP clients
CN103279390A * 2012-08-21 2013-09-04 Institute of Information Engineering, Chinese Academy of Sciences Parallel processing system for optimizing small jobs
CN103605576A * 2013-11-25 2014-02-26 Huazhong University of Science and Technology Multithreading-based MapReduce execution system
US20140280394A1 * 2013-03-14 2014-09-18 Quantum Corporation Multi-Threaded Message Passing Journal
CN105373420A * 2014-08-28 2016-03-02 Beijing Qihoo Technology Co., Ltd. Data transmission method and apparatus
CN105718244A * 2016-01-18 2016-06-29 Shanghai Jiao Tong University Pipelined-data-shuffle Spark task scheduling and execution method
CN108173924A * 2017-12-26 2018-06-15 Beijing Yonghong Shangzhi Technology Co., Ltd. Big data transmission system based on a distributed platform
CN109523455A * 2018-09-30 2019-03-26 Ping An Technology (Shenzhen) Co., Ltd. Asynchronous image data transmission method, device and computer-readable storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100395715C * 2001-11-30 2008-06-18 Fujitsu Ten Ltd Microcomputer logic development system
JP2006185229A * 2004-12-28 2006-07-13 Hitachi Ltd Online synchronous processing method and device
KR20090065232A * 2007-12-17 2009-06-22 Electronics and Telecommunications Research Institute Apparatus for executing MAC application program, and apparatus and method for task scheduling of sensor node based on MAC
KR101594830B1 * 2014-01-13 2016-02-17 Entrix Co., Ltd. System for servicing cloud streaming, method of servicing cloud streaming and server for the same
KR20190069229A * 2017-12-11 2019-06-19 Korea National University of Transportation Industry-Academic Cooperation Foundation Method and system for managing moving objects in distributed memory


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LIU SHAOSHAN: "A Unified Cloud Platform for Autonomous Driving", COMPUTER *
SPARK COMMUNITY: "Job Scheduling - Spark 3.0.0 Documentation", HTTPS://WEB.ARCHIVE.ORG/WEB/20200628030519/ *
DING MENGSU; CHEN SHIMIN: "Helius: a lightweight big data computing system", COMPUTER APPLICATIONS, no. 02 *

Also Published As

Publication number Publication date
JP2022013618A (en) 2022-01-18
EP3933582B1 (en) 2023-07-19
KR20220002056A (en) 2022-01-06
KR102544755B1 (en) 2023-06-20
CN111782367B (en) 2023-08-08
US20210406067A1 (en) 2021-12-30
JP7226743B2 (en) 2023-02-21
EP3933582A1 (en) 2022-01-05

Similar Documents

Publication Publication Date Title
CN110737534A (en) Task processing method and device and server
CN111858077A (en) Recording method, device and equipment for IO request log in storage system
CN112039999A (en) Method and system for accessing distributed block storage system in kernel mode
CN112948169A (en) Data backup method, device, equipment and storage medium
CN111522598A (en) Method and device for recording restart information of embedded equipment
CN113342554A (en) IO multiplexing method, medium, device and operating system
KR20190108458A (en) Method, device and server for checking a defective function
CN111049913B (en) Data file transmission method and device, storage medium and electronic equipment
CN110753040B (en) Request processing method and device
CN111782367B (en) Distributed storage method and device, electronic equipment and computer readable medium
CN110555009B (en) Processing method and device for Network File System (NFS) service
CN114756355B (en) Method and device for automatically and quickly recovering process of computer operating system
CN108874560B (en) Method and communication device for communication
CN113342698A (en) Test environment scheduling method, computing device and storage medium
CN113127179A (en) Resource scheduling method and device, electronic equipment and computer readable medium
CN110908821A (en) Method, device, equipment and storage medium for task failure management
CN104461382A (en) Internal writing method for file server operating multiple file systems and server
CN110147370B (en) Train data storage method based on producer or consumer task scheduling mode
CN114546448B (en) Vehicle OTA parallel upgrading method and related device
CN117112311B (en) I/O driven data recovery method, system and device
CN107678838B (en) Method and device for tracking operation of virtual machine and virtual machine management platform
CN117235117A (en) Data processing method, device, electronic equipment and storage medium
CN112099945A (en) Task processing method, task processing device and electronic equipment
CN116450412A (en) Data recovery method, device, electronic equipment and computer readable storage medium
CN115794605A (en) Firmware upgrading test verification method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant