WO2024012153A1 - Data processing method and apparatus - Google Patents

Data processing method and apparatus

Info

Publication number
WO2024012153A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
dpu
read
information
sub
Prior art date
Application number
PCT/CN2023/100813
Other languages
English (en)
French (fr)
Inventor
陈一都
陈强
潘孝刚
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2024012153A1 publication Critical patent/WO2024012153A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources
    • G06F9/5066Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs

Definitions

  • the present application relates to the field of computer technology, and in particular, to a data processing method and device.
  • Abbreviations used herein: IO (input/output), HPC (high-performance computing), SC (supercomputing).
  • For parallel applications whose IO mode consists of "non-contiguous small IOs", the amount of IO data can reach the TB level.
  • When the processor processes these non-contiguous small IOs, it consumes considerable computing and time resources, and processing efficiency is low.
  • As a result, the IO performance of the application becomes its technical bottleneck; optimizing the IO performance of the application can reduce the computing time of the application running at large scale.
  • This application provides a data processing method and device for improving application IO performance.
  • embodiments of the present application provide a data processing method.
  • the data processing method is applied to a computing system.
  • the computing system includes multiple computing nodes. Each computing node runs at least one process.
  • Each computing node includes a data processing device (DPU). The method includes: a first DPU receives multiple read requests corresponding to multiple processes in the computing system, where the multiple processes may be multiple parallel processes running the same job; the first DPU aggregates the information of the data to be read by each of the multiple read requests to obtain first aggregation information; and the first DPU determines, based on the first aggregation information, first target data to be read by the first DPU.
  • In the above method, the first DPU aggregates the information of each of the multiple read requests it receives, so the first DPU does not need to send the multiple read requests to the processor one by one, which reduces the number of software-hardware interactions within the computing node and lowers CPU occupancy. In addition, reading data based on the aggregated information of multiple read requests reduces or avoids repeated IO, improves IO performance, shortens job running time, and further reduces computing resource occupancy in the computing system.
  • The first aggregation information is used to indicate first aggregation data read by the multiple read requests; that is, the aggregation data includes the data read by each of the multiple read requests.
  • The first DPU determines the first target data to be read by the first DPU based on the first aggregation information by: the first DPU divides the first aggregation data into multiple data sub-blocks; the first DPU determines, based on a mapping relationship between DPU identifiers and data sub-blocks, at least one data sub-block corresponding to the first DPU, and the first target data includes the at least one data sub-block corresponding to the first DPU.
  • In the above method, the first DPU aggregates the information of each of the multiple read requests it receives to obtain aggregation information, and the aggregation information indicates the aggregation data read by the multiple read requests.
  • The first DPU can aggregate the data read by non-contiguous small IOs into one piece of aggregation data, thereby reducing or avoiding repeated IO and improving read performance.
  • The first DPU divides the aggregation data into multiple sub-blocks; for example, the length of each sub-block can be a length suitable for a single read operation, which reduces the overall number of read IOs.
  • the first DPU is determined as an aggregate DPU in the computing system.
  • the computing system further includes a second DPU.
  • the second DPU is also an aggregate DPU.
  • The first DPU is used to read the first target data, and the second DPU is used to read second target data, where the second target data is the remaining part or all of the data in the aggregation data other than the first target data; for example, the second target data includes one or more sub-blocks of the multiple sub-blocks of the aggregation data other than the first target data.
  • In this way, the first DPU and the second DPU jointly read the aggregation data, and each DPU reads some of the multiple sub-blocks, so the time to read the data can be shortened through parallel reading, providing an efficient and flexible way to read data.
  • The method further includes: the first DPU separates the first target data read from the storage device according to the computing nodes to which the multiple read requests belong, and sends the separated data to the corresponding computing nodes.
  • The first DPU can separate and send data at the granularity of computing nodes, instead of separating and sending data per process corresponding to each read request. In this way, the data requested by multiple read requests on one computing node can be aggregated and then sent to that computing node, thereby reducing the number of network interactions.
  • the information of the data read by each read request is the address information of the data.
  • The method further includes: the first DPU receives multiple write requests corresponding to multiple processes in at least one computing node, and aggregates the information of the data to be written indicated by each of the multiple write requests to obtain second aggregation information; the first DPU determines, based on the second aggregation information, third target data to be written by the first DPU.
  • In the above method, the first DPU aggregates the information of each of the multiple write requests it receives, so the first DPU does not need to send the multiple write requests to the processor one by one, which reduces the number of software-hardware interactions within the computing node and lowers CPU occupancy. In addition, performing data write operations based on the aggregated information of multiple write requests reduces or avoids repeated IO, improves IO performance, shortens job running time, and further reduces computing resource occupancy in the computing system.
  • The second aggregation information is used to indicate second aggregation data written by the multiple write requests.
  • The first DPU determines the third target data to be written by the first DPU based on the second aggregation information by: the first DPU divides the second aggregation data into multiple data sub-blocks; the first DPU determines, based on a mapping relationship between DPU identifiers and data sub-blocks, at least one data sub-block corresponding to the first DPU, and the third target data includes the at least one data sub-block corresponding to the first DPU.
  • In the above method, the first DPU aggregates the information of each of the multiple write requests it receives to obtain aggregation information, and the aggregation information indicates the aggregation data to be written by the multiple write requests.
  • The first DPU can aggregate the data requested to be written by non-contiguous small IOs into one piece of aggregation data, thereby reducing or avoiding repeated IO and improving write performance.
  • The first DPU divides the aggregation data into multiple sub-blocks; for example, the length of each sub-block can be a length suitable for a single write operation, which reduces the overall number of write IOs.
  • the first DPU is determined as an aggregate DPU in the computing system
  • the computing system also includes a second DPU
  • the second DPU is also an aggregate DPU
  • the first DPU is used to write the third target data
  • the second DPU is used to write fourth target data
  • the fourth target data is the remaining part or all of the second aggregated data except the third target data.
  • the first DPU and the second DPU jointly write the aggregate data to the storage device.
  • Each DPU is responsible for performing write operations on some of the multiple sub-blocks, so that parallel writing shortens the time to write the data, providing an efficient and flexible way to write data.
  • the method further includes: the first DPU obtains the third target data, and writes the third target data into a storage device connected to the first DPU.
  • the information indicating the data to be written in each write request is the address information of the data to be written.
  • embodiments of the present application also provide a data processing device, which has the function of implementing the first DPU in the method example of the first aspect.
  • the beneficial effects can be found in the description of the first aspect and will not be described again here.
  • the functions described can be implemented by hardware, or can be implemented by hardware executing corresponding software.
  • the hardware or software includes one or more modules corresponding to the above functions.
  • the structure of the device includes a communication module, an aggregation module, and a processing module. These modules can perform the corresponding functions of the management node in the above method example of the second aspect. For details, please refer to the detailed description in the method example, which will not be described again here.
  • this application also provides a computing device, which includes a processor and a power supply circuit.
  • The processor executes the program instructions in the memory to perform the method provided by the above second aspect or any possible implementation of the second aspect.
  • the memory is coupled to the processor and stores program instructions and data necessary for executing the data backup process.
  • the power supply circuit is used to provide power to the processor.
  • the present application also provides a computing device.
  • the device includes a processor and a memory, and may also include a communication interface.
  • The processor executes the program instructions in the memory to perform the method provided by the above second aspect or any possible implementation of the second aspect.
  • the memory is coupled to the processor and stores program instructions and data necessary for executing the data backup process.
  • the communication interface is used to communicate with other devices, such as receiving read requests/write requests, and for example, reading data from a storage device or writing data to be written into the storage device.
  • the present application provides a computer-readable storage medium.
  • When the computer-readable storage medium is executed by a computing device, the computing device performs the method provided by the aforementioned second aspect or any possible implementation of the second aspect.
  • the storage medium stores the program.
  • the storage medium includes but is not limited to volatile memory, such as random access memory, and non-volatile memory, such as flash memory, hard disk drive (HDD), and solid state drive (SSD).
  • the present application provides a computing device program product.
  • The computing device program product includes computer instructions; when they are executed by the computing device, the computing device performs the method provided by the aforementioned second aspect or any possible implementation of the second aspect.
  • The computer program product can be a software installation package; if it is necessary to use the method provided by the first aspect or any possible implementation of the first aspect, the computer program product can be downloaded and executed on a computing device.
  • the present application also provides a chip, which is used to implement the method described in the above-mentioned second aspect and each possible implementation manner of the second aspect by executing a software program.
  • Figure 1 is a schematic diagram of a system architecture provided by an embodiment of the present application.
  • Figure 2 is a schematic diagram of the execution flow of a job provided by the embodiment of the present application.
  • Figure 3 is a schematic structural diagram of a computing node provided by an embodiment of the present application.
  • Figure 4 is a schematic flow chart of a data processing method provided by an embodiment of the present application.
  • Figure 5 is a schematic diagram of an IO relationship provided by an embodiment of the present application.
  • Figure 6 is another IO relationship schematic diagram provided by the embodiment of the present application.
  • Figure 7 is a schematic diagram of a scenario for determining a subset provided by an embodiment of the present application.
  • Figure 8 is a schematic diagram of another scenario for determining a subset provided by an embodiment of the present application.
  • Figure 9 is a schematic diagram of another scenario for determining a subset provided by an embodiment of the present application.
  • Figure 10 is a schematic flow chart of another data processing method provided by an embodiment of the present application.
  • Figure 11 is a schematic structural diagram of a data processing device provided by an embodiment of the present application.
  • FIG. 12 is a schematic structural diagram of a computing device provided by an embodiment of the present application.
  • High-performance computing is a cross-industry and cross-application computing discipline. It usually uses the most cutting-edge computer technology for the most complex and advanced scientific calculations and solutions, and is widely used for computing large-scale scientific problems and processing massive data, such as weather forecasting, automobile simulation, biopharmaceuticals, gene sequencing, nuclear explosion simulation, and chip design and manufacturing.
  • a computer cluster capable of providing HPC services is called an "HPC cluster”.
  • A computer cluster refers to a group of loosely or tightly connected computing nodes that work together, usually used to execute large-scale jobs. Deploying a cluster, which improves overall performance through concurrency, is usually more cost-effective than a single computing node of comparable speed or availability.
  • Each computing node is connected to each other through a network, and each computing node runs its own operating system instance. In most cases, each compute node uses the same hardware and the same operating system, and in some cases, different operating systems can be used on different hardware.
  • FIG. 1 is a schematic diagram of a computing node cluster provided by an embodiment of the present application.
  • the computing node cluster 10 includes a plurality of computing nodes, such as 100A, 100B, 100C, 100D and 100E. These computing nodes are used to provide computing resources.
  • a computing node it can include multiple processors or processor cores. Each processor or processor core may be a computing resource, so a physical computing node can provide multiple computing resources.
  • Computing nodes 100A, 100B, 100C, 100D, and 100E are interconnected through a network 112.
  • Computing node 160 is also connected to the network 112 and serves as a scheduler. In operation, the scheduler 160 may control the execution of jobs submitted to the computing node cluster 10.
  • Jobs can be submitted to the cluster of compute nodes 10 from any suitable source.
  • the embodiments of this application do not limit the location where jobs are submitted, nor do they limit the specific mechanism for users to submit jobs.
  • a user 132 may submit a job 136 from an enterprise 130 to a cluster of compute nodes 10.
  • user 132 operates client computer 134 to submit job 136 to compute node cluster 10 .
  • enterprise 130 is connected to computing node cluster 10 through network 120, which may be the Internet, or other networks. Therefore, users can submit jobs to the cluster of computing nodes 10 from a remote location.
  • the jobs here are usually large-scale jobs that require more computing resources to be processed in parallel. This embodiment does not limit the nature and quantity of the jobs.
  • A job may include multiple computing tasks, and these tasks can be assigned to multiple computing resources for execution. Most tasks are executed concurrently or in parallel, while some tasks depend on data generated by other tasks.
  • a job is to predict the weather of city A in the next 24 hours.
  • city A includes multiple regions, respectively recorded as region 1, region 2,..., region n (n is a positive integer)
  • this job can be divided into multiple first-level subtasks in a coarse-grained manner, and multiple first-level subtasks are executed in parallel.
  • Each first-level subtask is used to predict the weather of one of the areas in city A in the next 24 hours.
  • each first-level subtask can also be divided into multiple second-level subtasks in a fine-grained manner, which are used to predict the weather in the same area at different time periods. For example, the first-level subtask corresponding to area 1.
  • For example, the first second-level subtask in that first-level subtask (denoted as subtask 1 in Figure 2) is used to predict the weather in area 1 from 0:00 to 1:00 in the future,
  • the second second-level subtask (denoted as subtask 1' in Figure 2) is used to predict the weather in area 1 from 1:00 to 2:00 in the future,
  • and the third second-level subtask (denoted as subtask 1'' in Figure 2) is used to predict the weather in area 1 from 2:00 to 3:00 in the future, and so on.
  • multiple second-level subtasks in the same first-level subtask are executed iteratively.
  • Iterative execution means that the output result (or prediction result) of the previous second-level subtask is the input data (initial value) of the next second-level subtask. For example, the output result of subtask 1 in Figure 2 is the input data of subtask 1'. This can be understood as using the meteorological data of the same area in the period before the prediction time to predict the meteorological data of that area in the future period.
  • multiple first-level subtasks are executed in parallel, or multiple second-level subtasks belonging to the same iteration are executed in parallel, and multiple second-level subtasks in the same first-level subtask are executed iteratively.
  • FIG. 3 is a schematic structural diagram of a computing node provided by an embodiment of the present application.
  • the computing nodes 100A and 100B in FIG. 3 may be the computing nodes 100A and 100B in FIG. 1 .
  • The computing node 100A runs an operating system and one or more processes (for simplicity, computing node 100A in Figure 3 only shows processes 1 and 2, and computing node 100B only shows processes 3 and 4).
  • the multiple processes can be executed in parallel, and each process can be used to run a secondary subtask.
  • For example, process 1 is used to perform subtask 1: predict the weather in area 1 from 0:00 to 1:00 in the future;
  • process 2 is used to perform subtask 2: predict the weather in area 2 from 0:00 to 1:00 in the future;
  • process 3 is used to perform subtask 3: predict the weather in area 3 from 0:00 to 1:00 in the future;
  • process 4 is used to perform subtask 4: predict the weather in area 4 from 0:00 to 1:00 in the future. In this way, multiple subtasks are executed in parallel to improve the execution efficiency of the job.
  • Data IO is also usually generated during job execution.
  • Data IO includes read requests and write IOs.
  • a read request is used to request to read the input data of a task from the storage device 210 .
  • Write IO is used to request to write the output results of the task to the storage device 210 .
  • their respective read requests or write IOs may be generated in the same time period, that is, multiple read requests or multiple write IOs are generated at the same time.
  • For example, the multiple read requests include read request 1, read request 2, ..., and read request n, where read request 1 is used to request to read the input data of subtask 1, read request 2 is used to request to read the input data of subtask 2, and so on.
  • multiple write IOs include write IO1, write IO2, ..., and write IOn, where write IO1 is used to request the output result of subtask 1 to be written to the storage device 210, and write IO2 is used to request Request that the output result of subtask 2 be written to storage device 210, and so on.
  • These multiple read requests/write IOs generated at the same time due to parallel or concurrent execution of tasks can be called parallel or concurrent read requests/write IOs.
  • the embodiment of the present application provides a data processing method that can be used to process multiple read requests/write IOs generated within a period of time during job execution, such as multiple parallel or concurrent read requests, or multiple parallel or concurrent write IOs. Perform aggregation processing to reduce or avoid repeated IO, thereby improving the read/write performance of the application.
  • It should be noted that the above read requests and write IOs are only examples. Other types of read requests or write IOs may also be generated during job execution; read requests are not limited to reading the input data of a task, write IOs are not limited to writing output results, and the number of IOs generated by each process during task execution is not limited. It should also be noted that the number of processes shown in Figure 3 is only an example for simplicity. In actual applications, a large job is usually executed by a large number of parallel or concurrent processes. The embodiments of the present application place no restrictions on the number of tasks, the number of tasks that can be executed in parallel, the IO types, or the number of generated IOs.
  • the computing node 100 includes a processor 112, a memory 113 and a data processing device 114.
  • The processor 112 may be a central processing unit (CPU).
  • In actual applications, there are often multiple CPUs 112, and one CPU 112 has one or more processor cores.
  • each processor core can run one process, so that the multiple processor cores can run multiple processes in parallel. This embodiment does not limit the number of CPU 112 and the number of processor cores.
  • Memory 113 is used to store computer instructions and data.
  • There are many types of memory 113; please refer to the detailed introduction of the memory 1202 below, which will not be repeated here.
  • The data processing device 114 is used to calculate or process data and to communicate with external devices, for example, sending read requests/write IOs to the storage device 210; read requests are used, for example, to obtain the input data of tasks, and write IOs are used to request that the calculation results of tasks be written to the storage device 210.
  • the storage device 210 is used to store computer program instructions and data.
  • The instructions include, for example, the code of the HPC application; the data includes, for example, the input data required by jobs, configuration files, calculation results, and other data.
  • the storage device 210 may be a storage server, a storage array or a storage system, and the storage system may be a centralized storage system or a distributed storage system, which is not limited in this application.
  • the storage device 210 is usually a distributed storage system that can implement a distributed file system.
  • Each computing node 100 accesses the file system by mounting the root directory of the distributed file system to complete data access. For example, obtain the program code of an HPC application and run the HPC application to execute jobs and access file data.
  • N read requests generated by multiple MPI processes running the HPC application are used to perform read operations on the same file, such as obtaining input data of a task.
  • the N write IOs generated by the multiple MPI processes are used to perform write operations on the same file, such as writing the calculation results of the task to the file.
  • FIG. 3 only shows one data processing device 114.
  • one computing node 100 may include multiple data processing devices 114, which is not limited in this application.
  • the structure shown in Figure 3 is only an example. In actual products, the computing node 100 may have more or fewer components than in Figure 3.
  • The computing node 100 may also include a hard disk and one or more dedicated processors such as a GPU, which is not limited in the embodiments of this application.
  • the method includes the following steps:
  • Step 401 Multiple processes used to execute jobs generate respective read requests and send the read requests to the DPU of the current computing node.
  • The multiple processes used to execute the job may be called parallel processes, such as MPI processes, and the multiple parallel processes may generate respective read requests within the same time period.
  • the job involved in Figure 2 is scheduled to be executed by computing nodes 100A and 100B.
  • city A includes 4 areas.
  • each iteration includes 4 subtasks.
  • Each iteration of the job can be executed by at least 4 parallel processes on the computing nodes 100A and 100B.
  • process 1 generates read request 1 when executing subtask 1.
  • This read request may be a request to read the input data or configuration file of subtask 1, etc., and is not specifically limited.
  • Process 2 generates read request 2 when executing subtask 2.
  • Process 3 executes subtask 3 to generate read request 3
  • process 4 executes subtask 4 to generate read request 4.
  • Figure 4 is only an example, and this application does not limit the splitting method of jobs, the scheduling method of jobs, the degree of parallelism, and the distribution of parallel processes.
  • Each parallel process sends its own read request to the DPU of the computing node.
  • process 1 and process 2 send read request 1 and read request 2 to DPU 114A respectively.
  • Process 3 and process 4 send read request 3 and read request 4 to DPU114B respectively.
  • Step 402: Each DPU among the multiple DPUs (referring to the multiple DPUs corresponding to the multiple computing nodes used to execute the job; the multiple DPUs below have this meaning and will not be repeated) exchanges its respective read requests, so that each DPU obtains the read requests of all parallel processes used to execute the job.
  • Each DPU among the multiple DPUs obtains the read requests generated by the current computing node (that is, by the one or more parallel processes running on it), and then sends the read requests obtained from the current computing node to every other DPU among the multiple DPUs; each DPU also receives the read requests sent by every other computing node among the multiple computing nodes.
  • this computing node refers to the computing node to which the DPU belongs.
  • DPU114A belongs to computing node 100A
  • DPU114B belongs to computing node 100B. It can be understood that each DPU among multiple DPUs broadcasts the read request of the computing node, so that each DPU can obtain a complete and identical set of read requests.
  • DPU 114A of computing node 100A obtains read request 1 and read request 2 generated by process 1 and process 2 respectively, and sends read request 1 and read request 2 to DPU 114B of computing node 100B.
  • Similarly, DPU 114B of computing node 100B obtains read request 3 and read request 4 generated by process 3 and process 4 respectively, and sends read request 3 and read request 4 to DPU 114A of computing node 100A.
  • Thus, DPU 114A receives read request 3 and read request 4 from DPU 114B,
  • and DPU 114B receives read request 1 and read request 2 from DPU 114A. At this point, both DPU 114A and DPU 114B have obtained the same set of read requests, namely read request 1 to read request 4.
  • To exchange read requests, each DPU needs to know to which computing nodes (i.e., exchange objects) it should send the read requests of this computing node. An optional implementation for letting the DPU determine all exchange objects is as follows: for multiple parallel processes executing the same job, after the job is started, each parallel process can obtain a process identifier (such as a rank number) and the total number of processes; for example, if the total number of processes is m, the rank numbers range from 0 to m-1. Based on its own process identifier and the total number of processes, each parallel process can determine the process identifiers of the other parallel processes, and in this way the DPU can determine the exchange objects based on the rank numbers of the processes.
  • Similarly, each DPU among the multiple DPUs has a rank number, so that communication can be performed based on the rank numbers of the DPUs.
  • When multiple parallel processes run on one computing node, one DPU may receive multiple read requests from this computing node. In this case, the DPU can aggregate the multiple read requests of this computing node and send the aggregated data, which includes the multiple read requests of this computing node, to each exchange object.
  • For example, DPU 114A aggregates read request 1 and read request 2 and sends the aggregated data (including read request 1 and read request 2) to DPU 114B, instead of sending read request 1 and read request 2 to DPU 114B separately, which can reduce the number of network IOs.
  • In contrast, the network card on a computing node can only send each read request of the computing node individually, because the network card in the existing method can only passively perform forwarding.
  • Of course, the DPU can also send each read request on this computing node individually; for example, DPU 114A first sends read request 1 to DPU 114B, and then sends read request 2 to DPU 114B. This is not limited here.
  • It should be noted that Figure 4 only shows two computing nodes, which does not mean that only two DPUs among the multiple DPUs exchange read requests with each other. If more than two computing nodes are involved in the actual job, each DPU needs to send the read requests of the current computing node to every other DPU among the multiple DPUs.
  • For example, if computing node 100C (including DPU 114C) is also involved in the job, assume that computing node 100C runs process 4 and computing node 100B runs process 3, while computing node 100A still runs process 1 and process 2. Then DPU 114A sends read request 1 and read request 2 to computing node 100B and computing node 100C respectively, DPU 114B sends read request 3 to computing node 100A and computing node 100C, and likewise computing node 100C sends read request 4 to computing node 100A and computing node 100B.
  • The set of read requests obtained by each DPU includes the read requests of all processes running the job, specifically including the read requests of this computing node and the read requests from other computing nodes.
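  • For illustration only, the following sketch (not part of the patent; all class and function names are assumptions) simulates this exchange: each DPU, identified by a rank number, sends the read requests of its own computing node to every peer and collects the requests of the other nodes, so that every DPU ends up with the same complete set of read requests.

```python
# Hypothetical sketch of step 402: every DPU broadcasts the read requests of its own
# computing node to all peers (identified by rank number) and collects the requests
# of the other nodes, so that every DPU ends up with the same complete request set.

from dataclasses import dataclass

@dataclass(frozen=True)
class ReadRequest:
    rank: int      # rank of the process that issued the request
    offset: int    # start address of the data to be read (bytes)
    length: int    # length of the data to be read (bytes)

def exchange_read_requests(local_requests_per_dpu):
    """local_requests_per_dpu[r] holds the requests generated on the node of DPU r.
    Returns, for every DPU, the identical full set of requests (simulated all-to-all)."""
    num_dpus = len(local_requests_per_dpu)
    full_sets = [list(local_requests_per_dpu[r]) for r in range(num_dpus)]
    for sender in range(num_dpus):
        for receiver in range(num_dpus):
            if receiver != sender:
                # one aggregated network message per peer, not one per read request
                full_sets[receiver].extend(local_requests_per_dpu[sender])
    # sort by issuing rank so every DPU sees the same ordering
    return [sorted(s, key=lambda req: req.rank) for s in full_sets]

# Example mirroring Figure 4: DPU 0 (node 100A) holds the requests of processes 1 and 2,
# DPU 1 (node 100B) holds the requests of processes 3 and 4.
MB = 2**20
local = [
    [ReadRequest(1, 10 * MB, 2 * MB), ReadRequest(2, 12 * MB, 2 * MB)],
    [ReadRequest(3, 14 * MB, 2 * MB), ReadRequest(4, 16 * MB, 2 * MB)],
]
assert exchange_read_requests(local)[0] == exchange_read_requests(local)[1]
```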
  • Step 403 Each DPU among the plurality of DPUs aggregates the acquired information of the data requested by each read request in the plurality of read requests to obtain aggregate information.
  • For example, the information of the data requested by each read request may be the address information of the data to be read. Each DPU aggregates the address information of each of the multiple read requests to obtain aggregation information indicating the aggregation data requested to be read by the multiple read requests. It can be understood that the aggregation information is a new piece of address information, and the aggregation data it indicates includes the data requested to be read by each of the multiple read requests.
  • For example, DPU 114A aggregates the address information of data 1, data 2, data 3, and data 4 to obtain aggregation information, and the aggregation data indicated by the aggregation information includes data 1, data 2, data 3, and data 4. It should be noted that at this time data 1 to data 4 do not yet exist on DPU 114A; only the aggregation process and the aggregation data indicated by the aggregation information are illustrated here.
  • the address information of the data to be read may include the starting address and length of the data to be read.
  • the address information of data 1 is 10MB (starting address) + 2MB (length)
  • the address information of data 2 is 12MB (start address) + 2MB (length)
  • the address information of data 3 is 14MB (start address) + 2MB (length)
  • the address information of data 4 is 16MB (start address) + 2MB (length).
  • DPU 114A aggregates 10MB+2MB, 12MB+2MB, 14MB+2MB, and 16MB+2MB, and the resulting aggregation information is 10MB (starting address) + 8MB (length); the aggregation data indicated by the aggregation information includes data 1 to data 4.
  • DPU 114B performs the same operation to get the same aggregate information 10MB (starting address) + 8MB (length).
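  • As an illustrative sketch (assumed helper names, not taken from the patent), the aggregation in step 403 can be pictured as taking the minimum starting address and the maximum end address over all read requests; the same rule also covers the overlapping and discontinuous cases of Figure 6 discussed below.

```python
# Hypothetical sketch of step 403: aggregate the (start address, length) information of
# all read requests into a single piece of aggregation information covering the data
# range of every request, matching the 10MB+2MB ... 16MB+2MB -> 10MB+8MB example.

def aggregate_address_info(address_infos):
    """address_infos: list of (start, length) tuples in bytes.
    Returns (start, length) of the aggregation data covering all requests."""
    starts = [start for start, _ in address_infos]
    ends = [start + length for start, length in address_infos]
    agg_start = min(starts)
    agg_length = max(ends) - agg_start
    return agg_start, agg_length

MB = 2**20
infos = [(10 * MB, 2 * MB), (12 * MB, 2 * MB), (14 * MB, 2 * MB), (16 * MB, 2 * MB)]
assert aggregate_address_info(infos) == (10 * MB, 8 * MB)
```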
  • It should be noted that the storage addresses of data 1, data 2, data 3, and data 4 shown in Figure 5 are contiguous and the data sizes are the same. In practice, the storage addresses of the multiple data to be read may overlap, as shown in (a) of Figure 6, and/or the storage addresses of the multiple data to be read may be discontinuous, as shown in (b) of Figure 6.
  • the size of the data to be read may be exactly the same, completely different, or not exactly the same, and this application does not limit this.
  • the aggregation method is the same. For example, assuming that in (a) of Figure 6, the address information of data 1 is 10MB+5MB, the address information of data 2 is 12MB+6MB, and the address information of data 3 is 18MB+4MB, the address information of data 4 is 21MB+3MB, then the aggregated information obtained by aggregating multiple pieces of the address information can include 10MB+14MB.
  • the aggregated information obtained by aggregating multiple pieces of the address information may include 10MB+18MB.
  • It should also be noted that the address information of the data to be read in this application is not limited to the starting address and length of the data to be read and may also include other information. For example, if the multiple read requests request to read the same file, the address information carried by each read request may also include the file path, the file handle, and the starting address (offset) and length of the data to be read within the file, which is not limited in this application.
  • the file handle is the unique identifier of each file in the distributed file system, and a file can also be uniquely determined based on the file path.
  • aggregate data is a collection of multiple data to be read.
  • In other words, the aggregation data covers the range from the first data to be read (data 1 in Figure 5) to the last data to be read (data 4 in Figure 5); the starting address of the aggregation data is the starting address of the first data to be read (data 1 in Figure 5), and the length of the aggregation data is the length from the starting address of the first data to be read to the tail end of the last data to be read (data 4 in Figure 5).
  • the aggregation information indicates the aggregation data, and the aggregation information may include the starting address of the aggregation data and the length of the aggregation data.
  • Since each DPU performs aggregation based on the same set of read requests, each DPU obtains the same aggregation information. It should be noted that each DPU needs to perform the aggregation operation to obtain the aggregation information, because some DPUs subsequently need to be selected from the multiple DPUs as aggregation DPUs, and an aggregation DPU needs to read data based on the aggregation information; therefore, each DPU is required to perform the aggregation operation.
  • Step 404 Each DPU among the multiple DPUs divides the data range corresponding to the aggregated data indicated by the aggregation information into K subsets, where K is a positive integer.
  • the aggregated data is data in a file.
  • Specifically, each DPU divides the data range (or file range) corresponding to the aggregation data into multiple sub-blocks in units of a set data length, and then divides the multiple sub-blocks into K subsets; each subset may include one or more sub-blocks, and the multiple sub-blocks within a subset may be contiguous or discontinuous.
  • The data length used to divide the sub-blocks may be a preset length, or may be a data length recommended (or notified) by another device such as the storage device 210, and is not specifically limited. It should be noted that the set data length can differ in different scenarios; it can be related to one or more factors such as the storage location of the data to be read, for example the file system, storage device, or storage system that stores the data to be read, which is not limited in this application. Similarly, K can also be a preset value or determined in other ways, as described below.
  • Each DPU can divide the file range (10MB+8MB) into 2 sub-blocks in units of 4MB, namely sub-block 1 (10MB+4MB) and sub-block 2 (14MB+4MB).
  • the DPU divides the two sub-blocks into a subset. It can be seen that the subset includes sub-block 1 and sub-block 2.
  • the sets are denoted as subset 1 and subset 2.
  • subset 1 may include sub-block 1
  • subset 2 may include sub-block 2.
  • subset 1 may include sub-block 1 and sub-block 2, and subset 2 includes Sub-block 3 and sub-block 4, at this time, multiple sub-blocks in each subset are continuous.
  • Alternatively, subset 1 may include sub-block 1 and sub-block 3, and subset 2 may include sub-block 2 and sub-block 4; in this case, the multiple sub-blocks in each subset are discontinuous.
  • Figure 8 is only an example to facilitate understanding of the relationship between subsets and sub-blocks. In actual applications, the number of sub-blocks is usually smaller than the number of read requests, thereby achieving an aggregation effect.
  • Example 4 The above examples 1 to 3 show that the data to be read in the aggregated data is continuous. In fact, the data to be read included in the aggregated data may overlap, as shown in (a) of Figure 9 . Or, the data to be read included in the aggregated data may be discontinuous, as shown in (b) of FIG. 9 . Regardless of the relationship between the data to be read in the aggregated data, the method of dividing sub-blocks and subsets based on the data range corresponding to the aggregated data is the same, and will not be described again here.
  • It should also be noted that the length of the sub-block at the end may be smaller or larger than the set data length. For example, if the file range of the data to be read is 10MB+19MB and the set data length is 4MB, the range can be divided into 5 sub-blocks with a 3MB sub-block at the end, or into 4 sub-blocks with a 7MB sub-block at the end.
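  • The following sketch illustrates one possible way (a hypothetical round-robin grouping; the patent leaves the exact grouping open) to split the aggregated file range into sub-blocks of a set length, with a possibly shorter tail sub-block, and to group the sub-blocks into K subsets.

```python
# Hypothetical sketch of step 404: divide the data range indicated by the aggregation
# information into sub-blocks of a set length (the tail sub-block may be shorter), then
# group the sub-blocks into K subsets. A striped grouping is shown; a contiguous
# grouping would work equally well.

def split_into_subblocks(agg_start, agg_length, block_len):
    """Return (start, length) sub-blocks covering [agg_start, agg_start + agg_length)."""
    blocks = []
    offset = agg_start
    end = agg_start + agg_length
    while offset < end:
        length = min(block_len, end - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

def group_into_subsets(blocks, k):
    """Assign sub-block i to subset i % k (striped, possibly discontinuous subsets)."""
    subsets = [[] for _ in range(k)]
    for i, block in enumerate(blocks):
        subsets[i % k].append(block)
    return subsets

MB = 2**20
blocks = split_into_subblocks(10 * MB, 8 * MB, 4 * MB)   # [(10MB, 4MB), (14MB, 4MB)]
subsets = group_into_subsets(blocks, k=2)                 # subset 1: [(10MB, 4MB)], subset 2: [(14MB, 4MB)]
```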
  • Step 405 Each DPU in the plurality of DPUs selects K DPUs from the plurality of DPUs as aggregate DPUs.
  • Each aggregate DPU is responsible for a subset, which means that the data in the subset is read by the aggregate DPU.
  • one DPU in each compute node used to execute a job serves as an aggregate DPU.
  • Each DPU among the multiple DPUs selects the same K DPUs from the multiple DPUs as aggregation DPUs according to a consistency algorithm.
  • the number of aggregated DPUs can be a preset value (ie, K value).
  • Each DPU uses the same input data and consistency algorithm to calculate the identifiers of K DPUs.
  • The DPU indicated by each identifier is an aggregation DPU. Since the same consistency algorithm and input data are used, each DPU computes the same K aggregation DPUs.
  • The input data of the consistency algorithm includes but is not limited to one or more of the following: the identifier of each DPU among the multiple DPUs, the preset value of the number of aggregation DPUs, the aggregation information (the data range corresponding to the aggregation data), the set data length, the number of sub-blocks, and so on.
  • the calculation results of the consistency algorithm can include the identification of k DPUs, so that each DPU can determine the same calculation results, thereby determining the same K aggregate DPUs, and determining whether the DPU itself is an aggregate DPU.
  • k is a preset value.
  • the rank number of DPU114A in Figure 4 is 0, and the rank number of DPU114B is 1.
  • k can also be a value determined by other methods.
  • For example, the k value is determined based on the number of sub-blocks. If the number of sub-blocks is large, the k value can be correspondingly larger, so that multiple aggregation DPUs can execute read operations in parallel, thereby increasing the parallelism of the job and improving the efficiency of reading data. If the number of sub-blocks is small, the k value can be correspondingly smaller to balance read efficiency against the number of network IOs. The number of aggregation DPUs is usually more than one to increase the parallelism of the job.
  • In this case, the input data of the consistency algorithm can include the identifiers of the multiple DPUs, the number of sub-blocks, and optionally the parallelism of the read operation (which can be understood as the ratio of the number of sub-blocks to the number of aggregation DPUs), and so on, which is not further limited.
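  • As a minimal sketch of the idea that every DPU independently computes the same aggregation DPUs, the code below applies one hypothetical deterministic rule to the same inputs on every DPU; the patent does not fix the concrete consistency algorithm, so the stride-based choice here is only an assumption.

```python
# Hypothetical sketch of step 405: every DPU runs the same deterministic ("consistency")
# rule on the same inputs so that all DPUs select the identical K aggregation DPUs.
# Picking every (n // k)-th identifier from the sorted list is only one possible choice.

def select_aggregation_dpus(dpu_ids, k):
    """dpu_ids: identifiers (e.g. rank numbers) of all DPUs; returns K aggregation DPU ids."""
    ordered = sorted(dpu_ids)
    n = len(ordered)
    if k >= n:
        return ordered
    stride = n // k
    return [ordered[i * stride] for i in range(k)]

# Every DPU evaluates the same call with the same inputs, e.g.:
agg_dpus = select_aggregation_dpus(dpu_ids=[0, 1, 2, 3], k=2)   # -> [0, 2]
is_aggregation_dpu = 0 in agg_dpus                               # each DPU checks its own id
```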
  • each aggregation DPU can determine the subset that the aggregation DPU itself is responsible for based on the mapping relationship between K aggregation DPUs and K subsets.
  • In this mapping relationship, one aggregation DPU corresponds to one subset, and different aggregation DPUs correspond to different subsets.
  • For example, each aggregation DPU calculates its corresponding one or more sub-blocks through another consistency algorithm to determine the subset it is responsible for. For example, the number of sub-blocks in each subset (denoted as m) is determined based on the total number of sub-blocks and the k value, and every m consecutive sub-blocks form a subset.
  • Each aggregation DPU sorts the rank numbers of the k aggregation DPUs in ascending (or descending) order and selects the subset at the position corresponding to its own rank number in the sorted order. For example, if the number of aggregation DPUs is 2, the rank numbers of the 2 aggregation DPUs are rank0 and rank1 respectively.
  • the 4 sub-blocks are divided into 2 subsets.
  • Each subset includes two consecutive sub-blocks.
  • Rank0 corresponds to the first-ranked subset 1
  • rank1 corresponds to the second-ranked subset 2.
  • the rank numbers of the aggregated DPU may be discontinuous.
  • the rank numbers of multiple aggregated DPUs are 0, 4, 9, 14, etc., which will not be described again below.
  • The k aggregation DPUs all determine the subsets they are responsible for based on the same consistency algorithm.
  • For example, the consistency algorithm is: the numbers of the sub-blocks an aggregation DPU is responsible for are the aggregation DPU's own number + N*K (N = 0, 1, 2, ...).
  • Then the subset that the aggregation DPU numbered 1 is responsible for includes sub-block 1 and sub-block 3,
  • and the subset that the aggregation DPU numbered 2 is responsible for includes sub-block 2 and sub-block 4.
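  • The striped mapping just described ("aggregation DPU's own number + N*K") can be sketched as follows; the helper name is an assumption, and sub-blocks and aggregation DPUs are numbered from 1 as in the example above.

```python
# Hypothetical sketch of the subset mapping: each aggregation DPU takes the sub-blocks
# numbered (its own position) + N*K, reproducing the example in which the aggregation DPU
# numbered 1 gets sub-blocks 1 and 3 and the one numbered 2 gets sub-blocks 2 and 4 (K = 2).

def subset_for_aggregation_dpu(dpu_position, num_subblocks, k):
    """dpu_position: 1-based position of this aggregation DPU among the K aggregation DPUs.
    Returns the 1-based sub-block numbers this aggregation DPU is responsible for."""
    return list(range(dpu_position, num_subblocks + 1, k))

assert subset_for_aggregation_dpu(1, num_subblocks=4, k=2) == [1, 3]
assert subset_for_aggregation_dpu(2, num_subblocks=4, k=2) == [2, 4]
```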
  • Step 404 can also be executed after step 405. For example, after k aggregated DPUs are determined, the sub-blocks corresponding to each aggregated DPU are determined, thereby determining k subsets.
  • Step 406 Each aggregated DPU reads the data of the corresponding subset.
  • Specifically, the aggregation DPU sends at least one read request to the storage device 210 to request to read the data in the subset it is responsible for.
  • The aggregation DPU can also read the data in the subset through multiple read requests, each read request being used to request to read part of the data in the subset; this is not specifically limited.
  • For example, the aggregation DPU obtains data from the storage device 210 in units of file sub-blocks, and each read request is used to request to read the data of one sub-block in the subset.
  • DPU114A in Figure 4 is an aggregate DPU
  • DPU114B is not an aggregate DPU
  • DPU114A is used to read the data of the subset shown in Figure 4 (ie, sub-block 1 and sub-block 2)
  • Specifically, DPU 114A sends to the storage device 210 a read request 5 for reading sub-block 1 and a read request 6 for reading sub-block 2; the storage device 210 responds to read request 5 and read request 6 by sending the data of sub-block 1 and the data of sub-block 2 to DPU 114A.
  • DPU114A and DPU114B in Figure 4 are both aggregated DPUs.
  • DPU 114A is responsible for subset 1 (eg, including sub-block 1 and sub-block 3)
  • DPU 114B is responsible for subset 2 (eg, includes sub-block 2 and sub-block 4).
  • DPU 114A sends a read request 5 requesting to read sub-block 1 to the storage device 210
  • DPU 114B sends a read request 6 requesting to read sub-block 2 to the storage device 210; similarly, DPU 114A sends a read request 7 requesting to read sub-block 3 to the storage device 210,
  • and DPU 114B sends a read request 8 requesting to read sub-block 4 to the storage device 210. It should be noted that this is only an example; the storage devices 210 corresponding to different aggregation DPUs may be different, and this application does not limit this.
  • Each aggregate DPU reads the data of the subset it is responsible for, and uses less IO to read the data in the subset, reducing or avoiding repeated IO.
  • Step 407 The aggregation DPU separates the data in the read subset based on the target read request as the granularity, and feeds back the data requested by each target read request.
  • The target read request refers to, among the multiple read requests received by the aggregation DPU (in step 402), a read request whose requested data intersects with the data in the subset.
  • The existence of an intersection means that part or all of the data requested by the target read request is located in this subset. It should be noted that the number of target read requests can be one or more.
  • the corresponding target read requests include read request 1, read request 2, read request 3 and read request 4.
  • the target read requests corresponding to subset 1 include read request 1 and read request 2; the target read requests corresponding to subset 2 include read request 3 and read request 4.
  • the target read request corresponding to subset 1 shown in (b) of FIG. 8 includes read request 1 and read request 3; the target read request corresponding to subset 2 includes read request 2 and read request 4.
  • Specifically, the aggregation DPU determines one or more target read requests corresponding to the subset it is responsible for, separates the data read for the subset according to the target read requests, and obtains the data corresponding to each target read request; this data may be part or all of the data requested by the target read request. The data corresponding to each target read request is then sent to the computing node 100 to which the target read request belongs.
  • For example, DPU 114A is an aggregation DPU (called aggregation DPU 114A), which is responsible for reading the subset shown in Figure 4.
  • The aggregation DPU 114A determines that the target read requests corresponding to the subset include read request 1, read request 2, read request 3, and read request 4, and separates the data read for the subset according to these read requests into data 1 requested by read request 1, data 2 requested by read request 2, data 3 requested by read request 3, and data 4 requested by read request 4.
  • When distributing the data, for example with reference to Figure 4, the aggregation DPU 114A sends data 1 to process 1 and data 2 to process 2.
  • In one implementation, the aggregation DPU can take the target read request as the granularity and send the data requested by each target read request to the computing node to which the target read request belongs; for example, aggregation DPU 114A sends data 3 and data 4 to DPU 114B separately.
  • In another implementation, the aggregation DPU can also, based on the computing node to which the read requests belong, aggregate the data requested by multiple target read requests belonging to the same computing node and send the aggregated data to that computing node.
  • For example, aggregation DPU 114A determines that read request 3 and read request 4 both belong to computing node 100B, so aggregation DPU 114A aggregates data 3 and data 4 and sends the aggregated data, which includes data 3 and data 4, to computing node 100B, thereby reducing network IO.
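  • A rough sketch of this separation and per-node grouping (step 407) is shown below; the data structures and names are assumptions, byte ranges stand in for real storage IO, and a single contiguous range is used for the subset for simplicity.

```python
# Hypothetical sketch of step 407: the aggregation DPU finds the target read requests
# whose ranges intersect the data it has read, cuts the corresponding byte ranges out of
# the read data, and groups the pieces by the computing node that issued the requests so
# that one aggregated message per node can be sent.

def separate_and_group(subset_start, subset_data, requests):
    """subset_data: bytes read for a contiguous subset starting at subset_start.
    requests: list of dicts {"node": ..., "offset": ..., "length": ...}.
    Returns {node: [(offset, bytes), ...]} for every target read request."""
    subset_end = subset_start + len(subset_data)
    per_node = {}
    for req in requests:
        lo = max(req["offset"], subset_start)
        hi = min(req["offset"] + req["length"], subset_end)
        if lo >= hi:
            continue  # no intersection: not a target read request for this subset
        piece = subset_data[lo - subset_start:hi - subset_start]
        per_node.setdefault(req["node"], []).append((lo, piece))
    return per_node

# Example: the pieces for read requests 3 and 4 of node 100B come back as one group,
# so the aggregation DPU can send them to DPU 114B in a single network interaction.
```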
  • the data requested by a read request may be divided into one or more sub-blocks.
  • part of the data of data 2 is divided into sub-block 1
  • the remaining part of the data is divided into sub-block 2.
  • a sub-block may also contain data requested by one or more read requests.
  • sub-block 1 includes data 1 requested by read request 1 and part of data 2 requested by read request 2 .
  • Sub-block 2 includes part of data 2 requested by read request 2 and data 3 requested by read request 3.
  • the data requested by a read request may also be divided into one or more file sub-blocks.
  • data 4 is divided into 2 sub-blocks.
  • a subset may include one or more sub-blocks. Therefore, the data in a subset includes data requested by one or more read requests, and some or all of the multiple read requests may come from The same compute node 100.
  • the target read requests corresponding to subset 1 include read request 1, read request 2 and read request 3.
  • The data in subset 1 will be separated into data 1, data 2, and data 3, where data 1 and data 2 partially overlap.
  • the target read request corresponding to subset 2 includes read request 4.
  • the data in subset 2 is data 4 and does not need to be separated.
  • Step 408 The DPU receives the data sent by the aggregation DPU and distributes the data to the corresponding process according to the read request of the current computing node.
  • Specifically, the DPU sends the received data to the processes of this computing node. If the received data is aggregated data, the DPU separates the aggregated data at the granularity of the read request and sends the separated data to the corresponding processes of this computing node. For example, DPU 114B receives the data sent by DPU 114A, which includes data 3 and data 4; DPU 114B separates the data according to read request 3 and read request 4 into data 3 and data 4, sends data 3 to process 3, and sends data 4 to process 4.
  • In the above method, the DPU on each of the multiple computing nodes used to execute the job aggregates the read requests of the parallel processes, determines the aggregation information based on the multiple gathered read requests, obtains the aggregation data based on the aggregation information, and separates the aggregation data and sends it to the DPUs of the corresponding computing nodes.
  • In this way, the first DPU aggregates the information of each of the multiple read requests it receives without sending each read request to the processor for processing, which reduces the number of software-hardware interactions on the computing node and lowers CPU occupancy and computing power overhead.
  • In addition, reading data based on the aggregated information of multiple read requests reduces or avoids repeated IO, improves IO performance, shortens the running time of jobs, and further reduces the usage of computing resources.
  • This data processing method can be executed by the data processing device (referred to as DPU for short) in the computing node 100A or 100B shown in FIG. 1 or FIG. 3 .
  • the method includes the following steps:
  • Step 1001 Generate respective write IOs in multiple computing nodes used to execute the job.
  • Process 1 executes subtask 1 to generate write request 1. This write request may be a request to write the calculation result of subtask 1.
  • Process 2 executes subtask 2 to generate write request 2,
  • process 3 executes subtask 3 to generate write request 3,
  • and process 4 executes subtask 4 to generate write request 4.
  • the write request here carries information indicating the data to be written, such as address information, but does not carry the data to be written.
  • Step 1002: Each DPU among the multiple DPUs (referring to the multiple DPUs corresponding to the multiple computing nodes used to execute the job; the multiple DPUs below have this meaning and will not be repeated) exchanges its respective write requests, so that each DPU obtains the write requests of all parallel processes used to execute the job.
  • step 1002 please refer to the description of step 402 above. The difference lies in the interaction of read requests in step 402 and the interaction of write requests in step 1002, which will not be described again here.
  • Step 1003: Each DPU (referring to the DPU on each computing node used to execute the job) aggregates the information of the data to be written in each of the multiple write requests it has obtained, to obtain aggregation information.
  • In step 1003, each DPU aggregates based on the same set of write requests and obtains the same aggregation information, that is, the complete file range of the data to be written corresponding to the multiple write requests.
  • For the specific execution process of step 1003, please refer to the description of step 403 above; the difference is that step 403 aggregates the information of the data to be read, while step 1003 aggregates the information of the data to be written, which will not be repeated here.
  • Step 1004: Each DPU among the multiple DPUs divides the file range indicated by the aggregation information into K subsets, where K is a positive integer.
  • Step 1005: Each DPU among the multiple DPUs selects K DPUs from the multiple DPUs as aggregation DPUs, and determines the subset each aggregation DPU is responsible for.
  • Step 1006 Each DPU sends the data to be written on the current computing node to the corresponding aggregation DPU.
  • The aggregation DPU corresponding to the data to be written refers to the aggregation DPU responsible for the subset to which the data to be written belongs.
  • For example, in Figure 10, suppose DPU 114B determines that DPU 114A is an aggregation DPU and that DPU 114A is responsible for the subset shown in Figure 10.
  • DPU 114B determines that the data to be written on its own computing node includes data b and data d, and that the sub-blocks containing data b and data d both belong to the subset DPU 114A is responsible for.
  • DPU 114B therefore sends data b and data d to DPU 114A.
  • In step 1006, a DPU may aggregate the multiple pieces of data to be written that correspond to the multiple parallel processes on its own computing node and send the aggregated data to the corresponding aggregation DPU in a single message; for DPU 114B, for example, the aggregated data includes data b and data d.
  • For this, refer to the manner in step 402 above in which a DPU aggregates the multiple read requests on its computing node before sending them, which is not repeated here. A minimal sketch of routing and batching the to-be-written data per aggregation DPU is given below.
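A minimal sketch of step 1006 under assumed data structures (the subsets dictionary is the output of a step-1005-style assignment; owner_of and batch_by_aggregator are hypothetical helper names): each DPU looks up the aggregation DPU that owns the sub-block covering a piece of local to-be-written data, batches the pieces per aggregation DPU, and sends each batch as a single message rather than one message per write.

```python
def owner_of(offset, subsets):
    """Return the rank of the aggregation DPU whose sub-blocks cover this offset."""
    for rank, blocks in subsets.items():
        for b_start, b_len in blocks:
            if b_start <= offset < b_start + b_len:
                return rank
    raise ValueError("offset falls outside the aggregated file range")

def batch_by_aggregator(local_writes, subsets):
    """Group this node's (offset, payload) pairs by the aggregation DPU that owns them."""
    batches = {}
    for offset, payload in local_writes:
        batches.setdefault(owner_of(offset, subsets), []).append((offset, payload))
    return batches  # one network send per aggregation DPU instead of one per write
```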
  • Step 1007 The aggregation DPU writes the received data in the subset to the storage device 210.
  • For example, in Figure 10, DPU 114A sends the data in the subset it is responsible for to the storage device 210 through at least one write request.
  • For instance, DPU 114A sends write request 5 and write request 6 to the storage device 210, where write request 5 includes the data in sub-block 1 (data a and data b) and write request 6 includes the data in sub-block 2 (data c and data d).
  • Note that the data to be written within a subset may be non-contiguous, and the address space indicated by the subset in the storage device 210 may already hold data (denoted the first data). To improve write performance, the aggregation DPU may first read the first data from the storage device 210 and update it with the data to be written, thereby obtaining contiguous to-be-written data for the subset, and then write that contiguous data into the storage device 210.
  • If the address space indicated by the subset has never been written, the aggregation DPU may skip reading the first data and instead obtain the contiguous to-be-written data for the subset by zero-filling, and then write the subset's data into the storage device 210. A sketch of this read-modify-write path is given below.
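A sketch of the read-modify-write path just described, under an assumed storage interface (read_block and write_block stand in for whatever API the storage device 210 actually exposes): if the sub-block was written before, its current contents (the first data) are fetched and patched; otherwise the buffer is simply zero-filled, and in both cases a single contiguous write goes out.

```python
def flush_subblock(storage, b_start, b_len, pieces, previously_written=True):
    """Write one sub-block: pieces is a list of (offset, payload) inside the sub-block."""
    if previously_written:
        buf = bytearray(storage.read_block(b_start, b_len))  # fetch the existing "first data"
    else:
        buf = bytearray(b_len)                               # never written: zero-fill instead
    for offset, payload in pieces:                           # patch the new data into place
        lo = offset - b_start
        buf[lo:lo + len(payload)] = payload
    storage.write_block(b_start, bytes(buf))                 # one contiguous write request
```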
  • It should be noted that the aggregation information in the embodiment of Figure 10 is not the same aggregation information as in the embodiment of Figure 4; for ease of distinction, the aggregation information in the embodiment of Figure 4 may be referred to as the first aggregation information and that in the embodiment of Figure 10 as the second aggregation information.
  • It should also be understood that Figure 4 and Figure 10 merely share similar steps and are otherwise unrelated; for example, DPU 114A is the aggregation DPU in Figure 4, whereas DPU 114B may be the aggregation DPU in Figure 10.
  • Likewise, the lengths of the two sub-blocks in Figure 4 and in Figure 10 may be the same or different; this application does not limit this.
  • In the above manner, the DPU on each of the multiple computing nodes used to execute the job collects the write requests of the parallel processes, determines aggregation information from the information of the data to be written carried in the collected write requests, and writes the aggregated data to the storage device 210 based on that information.
  • In this way the processor on the computing node can be bypassed, which reduces processor occupancy and computing-power overhead as well as the number of software-hardware interactions between the DPU and the processes and the number of network IOs, thereby improving the write performance and write efficiency of the system and further reducing computing-resource usage.
  • It should be noted that Figure 4 and Figure 10 can be two independent procedures; a parallel process is not required to generate both read requests and write requests. In one possible scenario, a parallel process may generate only read requests, or only write requests, when executing a subtask, which is not specifically limited.
  • In addition, the parameters involved in Figure 4 and Figure 10 may be different.
  • For example, the number of aggregation DPUs in the method shown in Figure 4 and the number of aggregation DPUs in the method shown in Figure 10 may be the same or different, and even if they are the same, the same DPUs do not have to be selected as the aggregation DPUs.
  • Similarly, the sub-block length (the set data length) in the method shown in Figure 4 and that in the method shown in Figure 10 may be the same or different, and so on; the embodiments of this application do not limit any of this.
  • the embodiment of the present application also provides a data processing device, which is used to execute the method executed by DPU 114A or DPU 114B in the method embodiments of FIG. 4 and FIG. 10 .
  • As shown in Figure 11, the data processing device 1100 includes a communication module 1101, an aggregation module 1102, and a processing module 1103; within the data processing device 1100, the modules are connected to one another through communication paths.
  • the communication module 1101 is configured to receive multiple read requests corresponding to multiple processes in at least one computing node; for specific implementation methods, please refer to the description of steps 401 to 402 in Figure 4, which will not be described again here.
  • the aggregation module 1102 is used to aggregate the information of the data read by each of the multiple read requests received to obtain the first aggregation information; for the specific implementation, refer to the description of step 403 in Figure 4, which will not be described again here.
  • the processing module 1103 is configured to determine the first target data to be read according to the first aggregate information. For specific implementation methods, please refer to the description of steps 404 to 405 in Figure 4, which will not be described again here.
  • the communication module 1101 is also used to receive multiple write requests corresponding to multiple processes in at least one computing node; for specific implementation, please refer to the description of steps 1001 to 1002 in Figure 10, No further details will be given here.
  • the aggregation module 1102 is also configured to aggregate the information indicating the data to be written in each of the multiple write requests to obtain the second aggregation information; for the specific implementation, refer to the description of step 1003 in Figure 10, which will not be described again here.
  • the processing module 1103 is also configured to determine the second target data to be written according to the second aggregation information; for the specific implementation, refer to the descriptions of steps 1004 to 1005 in Figure 10, which will not be described again here. A minimal structural sketch of how these three modules could be composed follows.
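Purely as a structural sketch (the module names follow the text above; everything else, including the method names, is an assumption rather than the patent's implementation), the three modules of device 1100 could be composed roughly as follows:

```python
# Hypothetical composition of the communication / aggregation / processing modules.
class DataProcessingDevice:
    def __init__(self, communication, aggregation, processing):
        self.communication = communication   # receives requests, moves data on/off the node
        self.aggregation = aggregation       # merges per-request address information
        self.processing = processing         # decides which target data this device handles

    def handle_read_requests(self, read_requests):
        first_agg_info = self.aggregation.aggregate(r.info for r in read_requests)
        first_target = self.processing.select_target(first_agg_info)
        return self.communication.read(first_target)

    def handle_write_requests(self, write_requests):
        second_agg_info = self.aggregation.aggregate(w.info for w in write_requests)
        second_target = self.processing.select_target(second_agg_info)
        self.communication.write(second_target)
```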
  • FIG. 12 is a schematic structural diagram of a computing device provided by an embodiment of the present application.
  • the computing device is used to execute the method executed by DPU 114A or DPU 114B in the method embodiments of FIG. 4 and FIG. 10 .
  • the computing device 1200 includes a processor 1201, a memory 1202, and a communication interface 1203. Among them, the processor 1201, the memory 1202 and the communication interface 1203 can be connected through the bus 1204.
  • the processor 1201 is used to execute instructions stored in the memory 1202, so that the data processing device 1200 executes the data processing method provided by this application.
  • the processor 1201 may be, but is not limited to, any one or more of processors such as a data processing unit (DPU), a system on chip (SoC), a field programmable gate array (FPGA), a graphics processing unit (GPU), or an application-specific integrated circuit (ASIC).
  • Memory 1202 is used to store computer instructions and data.
  • memory 1202 stores computer instructions and data required to implement the data processing method provided by this application.
  • The memory 1202 may be volatile memory, such as random access memory (RAM) or dynamic random access memory (DRAM); it may also be non-volatile memory, such as read-only memory (ROM), storage-class memory (SCM), flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
  • the memory 1202 stores executable program code, and the processor 1201 executes the executable program code to realize the functions of the aforementioned communication module 1101, aggregation module 1102, and processing module 1103 respectively, thereby realizing the data processing method. That is, the memory 1202 stores instructions for the data processing device 1100 to execute the data processing method provided by this application.
  • the communication interface 1203 is used to communicate with internal devices or external devices, such as obtaining read requests/write requests sent by a process, or communicating with the storage device 210 to complete data access.
  • the communication interface 1203 may be a network card.
  • the bus 1204 may be a Peripheral Component Interconnect Express (PCIe) bus, a double data rate (DDR) bus, a serial advanced technology attachment (SATA) bus, a serial attached SCSI (SAS) bus, a Controller Area Network (CAN) bus, an extended industry standard architecture (EISA) bus, a unified bus (Ubus or UB), a compute express link (CXL), a cache coherent interconnect for accelerators (CCIX) link, or the like.
  • the bus can be divided into address bus, data bus, control bus, etc. For ease of presentation, only one line is used in Figure 12, but it does not mean that there is only one bus or one type of bus.
  • Bus 1204 may include a path for transferring information between the various components of the device 1200 (for example, the memory 1202, the processor 1201, and the communication interface 1203).
  • An embodiment of the present application also provides a computer program product containing instructions.
  • the computer program product may be a software or program product containing instructions capable of running on a computing device or stored in any available medium.
  • When the computer program product is run on at least one computing device, the at least one computing device is caused to perform the data processing method performed by DPU 114A in the embodiment of FIG. 4 or FIG. 10; refer to the description of each step in FIG. 4 or FIG. 10, and details are not repeated here.
  • An embodiment of the present application also provides a computer-readable storage medium.
  • the computer-readable storage medium may be any available medium that a computing device can store or a data storage device such as a data center that contains one or more available media.
  • the available media may be magnetic media (eg, floppy disk, hard disk, tape), optical media (eg, DVD), or semiconductor media (eg, solid state drive), etc.
  • the computer-readable storage medium includes instructions that instruct the computing device to perform the data processing method performed by the DPU 114A in the embodiment of FIG. 4 or FIG. 10; refer to the description of each step in FIG. 4 or FIG. 10, which will not be described again here.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device.
  • the computer instructions may be stored in one computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center by wired (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (such as infrared, radio, or microwave) means.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server or data center integrated with one or more available media.
  • the available media may be magnetic media (eg, floppy disk, hard disk, tape), optical media (eg, DVD), or semiconductor media (eg, Solid State Disk (SSD)), etc.
  • the various illustrative logic units and circuits described in the embodiments of this application may be implemented or operated by a general-purpose processor, a digital signal processor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the described functions.
  • the general-purpose processor may be a microprocessor.
  • the general-purpose processor may also be any conventional processor, controller, microcontroller or state machine.
  • a processor may also be implemented as a combination of computing devices, such as a digital signal processor and a microprocessor, a plurality of microprocessors, one or more microprocessors combined with a digital signal processor core, or any other similar configuration.
  • the steps of the method or algorithm described in the embodiments of this application can be directly embedded in hardware, a software unit executed by a processor, or a combination of the two.
  • the software unit may be stored in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, register, hard disk, removable disk, CD-ROM or any other form of storage medium in the art.
  • the storage medium can be connected to the processor, so that the processor can read information from the storage medium and can store and write information to the storage medium.
  • the storage medium can also be integrated into the processor.
  • the processor and storage medium can be housed in an ASIC.
  • These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operating steps are performed on the computer or other programmable device to produce computer-implemented processing; the instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

一种数据处理方法及装置,该方法包括:第一DPU接收至少一个计算节点中的多个进程对应的多个读请求,对该多个读请求中每个读请求所读取的数据的信息进行聚合得到第一聚合信息;根据第一聚合信息确定第一DPU待读取的第一目标数据。第一DPU对接收到的多个读请求中每个读请求的信息进行聚合,不需要第一DPU将多个读请求依次发送给CPU处理,减少软硬件交互次数,降低CPU的占用率,另外通过将多个读请求的信息进行聚合来读取数据,可减少或避免重复IO,提高IO性能,缩短作业运行时间,进一步降低计算资源占用率。

Description

一种数据处理方法及装置
相关申请的交叉引用
本申请要求在2022年07月14日提交中国专利局、申请号为202210834105.1、申请名称为“一种数据处理方法、装置及计算设备集群”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及计算机技术领域,尤其涉及一种数据处理方法及装置。
背景技术
在输入输出(input/output,IO)密集型的高性能计算场景,如高性能计算(high performance computing,HPC)或超算(supercomputing,SC)等大规模应用中,存在许多IO模式为“非连续小IO”的并行应用程序,其IO数量可以达到TB量级。处理器在处理这些非连续小IO时,会消耗较多计算资源及时间资源,并且处理效率慢,应用的IO性能成为应用的技术瓶颈,优化应用的IO性能可以大规模减少应用的计算耗时。
发明内容
本申请提供一种数据处理方法及装置，用于提高应用的IO性能。
第一方面，本申请实施例提供一种数据处理方法，数据处理方法应用于计算系统，该计算系统包括多个计算节点，每个计算节点中运行有至少一个进程，每个计算节点包括数据处理设备DPU；所述方法包括：第一DPU接收该计算系统中的多个进程对应的多个读请求，该多个进程可以是运行同一作业的多个并行进程，第一DPU对该多个读请求中每个读请求所读取的数据的信息进行聚合，以得到第一聚合信息；第一DPU根据该第一聚合信息确定第一DPU待读取的第一目标数据。
通过上述设计,第一DPU对接收到的多个读请求中每个读请求的信息进行聚合,不需要第一DPU将多个读请求依次发送给处理器处理,减少计算节点内的软硬件交互次数,降低CPU的占用率,另外通过将多个读请求的信息进行聚合来读取数据,可减少或避免重复IO,提高IO性能,缩短作业运行时间,进一步降低计算系统内的计算资源占用率。
在一种可能的实现方式中,第一聚合信息用于指示多个读请求所读取的第一聚合数据;也就是说,聚合数据包括多个读请求中每个读请求所读取的数据;第一DPU根据第一聚合信息确定第一DPU待读取的第一目标数据,包括:第一DPU将第一聚合数据划分为多个数据子块;第一DPU根据DPU的标识和数据子块的映射关系确定第一DPU对应的至少一个数据子块,第一目标数据包括第一DPU对应的至少一个数据子块。
通过上述设计,第一DPU对接收到的多个读请求中每个读请求中的信息进行聚合,得到聚合信息,聚合信息指示多个读请求所读取的聚合数据,如此,第一DPU可以将非连续小IO所读取的数据聚合为一段聚合数据,从而减少或避免重复IO,提高读性能,第一DPU将聚合数据划分为多个子块,如每个子块的长度可以是执行一次读操作的适宜长 度,总体可以减少读IO的数量。
在一种可能的实现方式中,第一DPU被确定为所述计算系统中的聚合DPU,计算系统还包括第二DPU,所述第二DPU也为聚合DPU,第一DPU用于读取第一目标数据,所述第二DPU用于读取所述第二目标数据,第二目标数据为所述聚合数据中除所述第一目标数据之外的其余部分或全部数据,如第二目标数据包括聚合数据的多个子块中除第一目标数据之外的一个或多个子块。
通过上述设计,第一DPU和第二DPU共同读取聚合数据,当将聚合数据划分为多个子块时,每个DPU读取多个子块中的部分子块,这样可以通过并行读缩短读数据时间,提供一种高效、灵活的读数据方法。
在一种可能的实现方式中,所述方法还包括:第一DPU将从存储设备读取的第一目标数据按照多个读请求所归属的计算节点进行分离,并将分离的数据发给对应的计算节点。
通过上述设计,第一DPU可以以计算节点为粒度分离并发送数据,而不是以读请求对应的进程来分离并发送数据,如此可以将一个计算节点上的多个读请求所请求的数据进行聚合后发送给该计算节点,从而减少网络交互次数。
在一种可能的实现方式中,每个读请求所读取的数据的信息为所述数据的地址信息。
在一种可能的实现方式中,所述方法还包括:第一DPU接收至少一个计算节点中的多个进程对应的多个写请求,对多个写请求中每个写请求中指示待写入数据的信息进行聚合得到第二聚合信息;第一DPU根据第二聚合信息确定所述第一DPU待写入的第三目标数据;
通过上述设计,第一DPU对接收到的多个写请求中每个写请求中的信息进行聚合,不需要第一DPU将多个写请求依次发送给处理器处理,减少计算节点内的软硬件交互次数,降低CPU的占用率,另外通过将多个写请求的信息进行聚合执行数据写操作,可减少或避免重复IO,提高IO性能,缩短作业运行时间,进一步降低计算系统内的计算资源占用率。
在一种可能的实现方式中,第二聚合信息用于指示多个写请求所写入的第二聚合数据;
第一DPU根据第二聚合信息确定第一DPU待写入的第三目标数据,包括:第一DPU将所述第二聚合数据划分为多个数据子块;第一DPU根据DPU的标识和数据子块的映射关系确定第一DPU对应的至少一个数据子块,第三目标数据包括第一DPU对应的至少一个数据子块。
通过上述设计,第一DPU对接收到的多个写请求中每个写请求中的信息进行聚合,得到聚合信息,聚合信息指示多个写请求待写入的聚合数据,如此,第一DPU可以将非连续小IO所请求写入的数据聚合为一段聚合数据,从而减少或避免重复IO,提高写性能,第一DPU将聚合数据划分为多个子块,如每个子块的长度可以是执行一次写操作的适宜长度,总体可以减少写IO的数量。
在一种可能的实现方式中,第一DPU被确定为计算系统中的聚合DPU,计算系统还包括第二DPU,第二DPU也为聚合DPU,第一DPU用于写入第三目标数据,第二DPU用于写入第四目标数据,第四目标数据为第二聚合数据中除第三目标数据之外的其余部分或全部数据。
通过上述设计,第一DPU和第二DPU共同将聚合数据写入存储设备,当将聚合数据划分为多个子块时,每个DPU负责对多个子块中的部分子块执行写操作,这样可以通过 并行写缩短写数据时间,提供一种高效、灵活的写数据方法。
在一种可能的实现方式中,所述方法还包括:第一DPU获取所述第三目标数据,并将所述第三目标数据写入与所述第一DPU连接的存储设备。
在一种可能的实现方式中,每个写请求中指示待写入数据的信息为待写入数据的地址信息。
第二方面,本申请实施例还提供了一种数据处理装置,该装置具有实现上述第一方面的方法实例中第一DPU的功能,有益效果可以参见第一方面的描述此处不再赘述。所述功能可以通过硬件实现,也可以通过硬件执行相应的软件实现。所述硬件或软件包括一个或多个与上述功能相对应的模块。在一个可能的设计中,所述装置的结构中包括通信模块、聚合模块、处理模块。这些模块可以执行上述第二方面方法示例中管理节点的相应功能,具体参见方法示例中的详细描述,此处不做赘述。
第三方面,本申请还提供了一种计算装置,所述装置包括处理器和供电电路,所述处理器执行所述存储器中的程序指令执行上述第二方面或第二方面任一可能的实现方式提供的方法。所述存储器与所述处理器耦合,其保存有执行数据备份过程中必要的程序指令和数据。供电电路用于为处理器供电。
第四方面,本申请还提供了一种计算设备,所述设备包括处理器和存储器,还可以包括通信接口,所述处理器执行所述存储器中的程序指令执行上述第二方面或第二方面任一可能的实现方式提供的方法。所述存储器与所述处理器耦合,其保存有执行数据备份过程中必要的程序指令和数据。所述通信接口,用于与其他设备进行通信,如接收读请求/写请求,又如,将从存储设备读取数据或将待写入数据写入存储设备。
第五方面,本申请提供了一种计算机可读存储介质,所述计算机可读存储介质被计算设备执行时,所述计算设备执行前述第二方面或第二方面的任意可能的实现方式中提供的方法。该存储介质中存储了程序。该存储介质包括但不限于易失性存储器,例如随机访问存储器,非易失性存储器,例如快闪存储器、硬盘(hard disk drive,HDD)、固态硬盘(solid state drive,SSD)。
第六方面,本申请提供了一种计算设备程序产品,所述计算设备程序产品包括计算机指令,在被计算设备执行时,所述计算设备执行前述第二方面或第二方面的任意可能的实现方式中提供的方法。该计算机程序产品可以为一个软件安装包,在需要使用前述第一方面或第一方面的任意可能的实现方式中提供的方法的情况下,可以下载该计算机程序产品并在计算设备上执行该计算机程序产品。
第七方面,本申请还提供一种芯片,所述芯片用于通过执行软件程序,实现上述第二方面以及第二方面的各个可能的实现方式中所述的方法。
上述第二方面至第七方面中任一实现方式的有益效果请参见第一方面的描述,此处不再赘述。
附图说明
图1为本申请实施例提供的一种系统架构示意图;
图2为本申请实施例提供的一种作业的执行流程示意图;
图3为本申请实施例提供的一种计算节点的结构示意图;
图4为本申请实施例提供的一种数据处理方法的流程示意图;
图5为本申请实施例提供的一种IO关系示意图;
图6为本申请实施例提供的另一种IO关系示意图;
图7为本申请实施例提供的一种确定子集的场景示意图;
图8为本申请实施例提供的另一种确定子集的场景示意图;
图9为本申请实施例提供的又一种确定子集的场景示意图;
图10为本申请实施例提供的另一种数据处理方法的流程示意图;
图11为本申请实施例提供的一种数据处理装置的结构示意图;
图12为本申请实施例提供的一种计算装置的结构示意图。
具体实施方式
高性能计算(high performance computing,HPC),是跨行业跨应用领域的计算学科,通常将最前沿的计算机技术用于最复杂、最尖端的科学计算和求解,广泛应用于大规模科学问题的计算和海量数据的处理,如气象预报、汽车仿真、生物制药、基因测序、核爆模拟,以及芯片设计制造等。有能力提供HPC服务的计算机集群,称之为“HPC集群”。
计算机集群(computer cluster)是指一组松散或紧密连接在一起工作的计算节点,通常用于执行大型作业。部署集群通常是通过并发度提升总体性能,比速度或可用性相当的单台计算节点的成本效益要高。各个计算节点之间通过网络相互连接,每个计算节点运行自己的操作系统实例。在大多数情况下,每台计算节点使用相同的硬件和相同的操作系统,在某些情况下,也可以在不同的硬件上使用不同的操作系统。
图1是本申请实施例提供的一种计算节点集群的示意图。如图1所示,计算节点集群10包括多个计算节点,如100A、100B、100C、100D和100E。这些计算节点用于提供计算资源。就一台计算节点来说,它可以包括多个处理器或处理器核,每个处理器或者处理器核可能是一个计算资源,因此一台物理计算节点可以提供多个计算资源。计算节点100A、100B、100C、100D和100E之间通过网络112互联。另外,计算节点160作为调度器也连接到网络112。在操作中,调度器160可以控制提交给计算节点集群10的作业的执行。
作业可以从任何合适的源头提交给计算节点集群10。本申请实施例不限定提交作业的位置,也不限定用户提交作业的具体机制。在图1中,例如,用户132可以从企业130向计算节点集群10提交作业136。具体的,在该示例中,用户132操作客户端计算机134以向计算节点集群10提交作业136。在该实例中,企业130通过网络120连接到计算节点集群10,网络120可以是因特网,或者其他网络。因此,用户可以从远程位置向计算节点集群10提交作业。这里的作业通常是需要较多计算资源并行处理的大型作业,本实施例不限定作业的性质和数量。一个作业可能保护多个计算任务,这些任务可以分配给多个计算资源执行。大多数任务是并发或并行执行的,而有一些任务则需要依赖于其他任务所产生的数据。
举例来说,一个作业是预测城市A在未来24小时内的天气,结合图2理解,假设城市A包括多个地区,分别记为地区1、地区2、…、地区n(n取正整数),示例性的,该作业可被粗粒度地拆分为多个一级子任务,多个一级子任务并行执行,每个一级子任务用于预测城市A的其中一个地区在未来24小时内的天气。进一步,每个一级子任务还可以被细粒度地拆分为多个二级子任务,分别用于预测同一个地区在不同时段的天气,比如,区域1对应的一级子任务,该一级子任务中的第一个二级子任务(图2中记为子任务1) 用于预测区域1在未来的0:00-1:00的天气,第二个二级子任务(图2中记为子任务1')用于预测区域1在未来的1:00-2:00的天气,第三个二级子任务(图2中记为子任务1'')用于预测区域1在未来的2:00-3:00的天气,依此类推。其中,同一个一级子任务中的多个二级子任务迭代执行,迭代执行是指上一个二级子任务的输出结果(或称预测结果)为下一个二级子任务的输入数据(初始值),比如,图2中子任务1的输出结果为子任务1'的输入数据,可以理解为,使用同一个地区在预测时刻前一段时间内的气象数据来预测该地区在未来一段时间内的气象数据。总而言之,多个一级子任务并行执行,或者说属于同一轮迭代的多个二级子任务并行执行,而同一个一级子任务中的多个二级子任务迭代执行。
如下结合图3理解作业的并行处理流程,图3为本申请实施例提供的一种计算节点的结构示意图。图3中的计算节点100A、100B可以是图1中的计算节点100A、100B。以一个计算节点100A为例,在软件层面,计算节点100A上运行有操作系统以及一个或多个进程(为简洁起见,图3中计算节点100A仅示出进程1、2,计算节点100B仅示出进程3、4)。该多个进程可以并行执行,每个进程可用于运行一个二级子任务,假设图2涉及的作业被调度至计算节点100A和计算节点100B执行,为便于说明,以图2中n=4,即该作业中涉及的城市A包括地区1、地区2、地区3和地区4为例,示例性地,在执行该作业的第一轮迭代时,进程1可用于执行子任务1:预测地区1在未来0:00-1:00的天气,进程2用于执行子任务2:预测地区2在未来0:00-1:00的天气,进程3用于执行子任务3:预测地区3在未来0:00-1:00的天气,进程4用于执行子任务4:预测地区4在未来0:00-1:00的天气,如此,多个子任务被并行执行,以提高作业的执行效率。
在执行作业过程中通常还会产生数据IO。数据IO包括读请求、写IO,例如读请求用于请求从存储设备210中读取任务的输入数据。写IO用于请求将任务的输出结果写入存储设备210。对于并行或并发执行的多个进程,可能会在相同的时间段内生成各自的读请求或写IO,即同期生成多个读请求或多个写IO,例如,在第一轮迭代中,多个读请求包括读请求1、读请求2、…、读请求n,其中,读请求1用于请求读取子任务1的输入数据,读请求2用于请求读取子任务2的输入数据,依此类推。又例如,在第一轮迭代中,多个写IO包括写IO1、写IO2、…、写IOn,其中,写IO1用于请求将子任务1的输出结果写入存储设备210,写IO2用于请求将子任务2的输出结果写入存储设备210,依此类推。这些由于并行或并发执行任务而同期产生的多个读请求/写IO可以称为并行或并发读请求/写IO。本申请实施例提供一种数据处理方法,可用于对作业执行过程中,在一段时间内产生的多个读请求/写IO,如多个并行或并发读请求,或者多个并行或并发写IO执行聚合处理,以减少或避免重复IO,从而提高应用的读/写性能。
需要说明的是,上述读请求及写IO仅为举例,在作业执行过程中还可能产生其他类型的读请求或写IO,并非限定于读请求仅用于读取任务的输入数据,或写IO仅用于写输出结果,也不限定任务执行过程中每个进程生成的IO数量。另外需要说明的是,图3所示的进程数量仅为保持简洁而举例,实际应用中,一个大型作业通常由大量并行或并发进程执行。本申请实施例对任务的数量,以及可以并行执行的任务的数据、IO类型、以及生成的IO数量都没有予以限制。
在硬件层面,计算节点100包括处理器112、存储器113和数据处理装置114。
其中,处理器112、存储器113和数据处理装置114可以通过总线115(可参见下文对总线1204的介绍,此处不再赘述)连接。其中,处理器112可以是中央处理器(central  processing unit,CPU),用于执行存储于存储器113内的指令,以运行操作系统及一个或多个进程。需要说明的是,图1中仅示出了一个CPU112,在实际应用中,CPU112的数量往往有多个,其中,一个CPU112又具有一个或多个处理器核。当CPU112包括多个处理器核时,每个处理器核可运行一个进程,这样多个处理器核可以并行运行多个进程,本实施例不对CPU112的数量,以及处理器核的数量进行限定。存储器113用于存储计算机指令和数据。
存储器113的类型有多种,请参见下文对存储器1202的详细介绍,此处不再赘述。
数据处理装置114,用于对数据进行计算或处理,还用于外部设备通信,如向存储设备210发送读请求/写IO,如读请求用于获取任务的输入数据等,写IO用于请求将任务的计算结果写入存储设备210。
存储设备210,用于存储计算机程序指令及数据,指令如HPC应用的代码,输入如作业所需的输入数据、配置文件、计算结果等数据。存储设备210可以是存储服务器、存储阵列或存储系统,存储系统可以是集中式存储系统或分布式存储系统,本申请对此不做限定。在HPC应用等并行应用中,存储设备210通常是分布式存储系统,可实现分布式文件系统,每个计算节点100通过挂载该分布式文件系统的根目录来访问该文件系统,以完成数据存取。如获取HPC应用的程序代码,运行该HPC应用来执行作业及存取文件的数据。在一个示例中,当HPC应用的IO模式为N:1时,运行该HPC应用的多个MPI进程生成的N个读请求用于对同一个文件执行读操作,如获取任务的输入数据。同理,该多个MPI进程生成的N个写IO用于对同一个文件执行写操作,如将任务的计算结果写入该文件。
需要说明的是,为保持简洁,图3仅示出一个数据处理装置114,实际上一个计算节点100可包括多个数据处理装置114,本申请对此不做限定。另外需要说明的是,图3所示的结构仅为示例,在实际产品中,计算节点100可比图3具有更多或更少的组件,如计算节点100还可以包括硬盘、一个或多个专用处理器如GPU等,本申请实施例对此不做限定。
接下来以本申请实施例提供的数据处理方法应用于图1所示的系统为例,对该方法进行详细说明。本申请实施例将从处理读请求、处理写请求两方面阐述,首先结合图4介绍对读请求的数据处理方法。该数据处理方法可以由图1或图3所示的计算节点100A、计算节点100B中的数据处理装置(简称为DPU)执行。
如图4所示,该方法包括如下步骤:
步骤401,用于执行作业的多个进程生成各自的读请求,并将读请求发送至本计算节点的DPU。
用于执行作业的多个进程可以称为并行进程,如MPI进程,该多个并行进程可能在相同的时间段内生成各自的读请求。举例来说,假设图2涉及的作业被调度至计算节点100A、100B来执行,该作业中城市A包括4个区域,结合图2可知,每轮迭代包括4个子任务,如此,每一轮迭代可通过计算节点100A、100B中的至少4个并行进程来执行作业。其中,进程1执行子任务1时生成读请求1,该读请求可能是请求读取子任务1的输入数据或配置文件等,具体不做限定,类似的,进程2执行子任务2生成读请求2,进程3执行子任务3生成读请求3,进程4执行子任务4生成读请求4。
需要说明的是,图4只是一种示例,本申请并不限定作业的拆分方式以及作业的调度方式、并行程度以及并行进程的分布做限定。
每个并行进程将各自的读请求发送至本计算节点的DPU,比如,进程1、进程2分别将读请求1、读请求2发送至DPU114A。进程3、进程4分别将读请求3、读请求4发送至DPU114B。
步骤402,多个DPU(指用于执行作业的多个计算节点对应的多个DPU,下文中的多个DPU均为此意,后续不再重复说明)中的每个DPU交换各自的读请求,使得每个DPU均获取到用于执行作业的所有并行进程的读请求。
该多个DPU中的每个DPU获取本计算节点(上的一个或多个并行进程分别)生成的读请求,然后将从本计算节点获取到的读请求发送至该多个DPU中的其他任一DPU。对应的,每个DPU接收多个计算节点中其他任一计算节点发送的读请求。对DPU而言,本计算节点是指DPU所归属的计算节点。比如,DPU114A归属于计算节点100A,DPU114B归属于计算节点100B。可以理解为,多个DPU中的每个DPU广播本计算节点的读请求,以使每个DPU均能够获取到一组完整且相同的读请求。
举例来说，计算节点100A的DPU114A获取进程1、进程2分别生成的读请求1、读请求2，并将读请求1和读请求2发送至计算节点100B的DPU114B。同样，计算节点100B的DPU114B获取进程3、进程4分别生成的读请求3、读请求4，并将读请求3和读请求4发送至计算节点100A的DPU114A。对应的，DPU114A从DPU114B接收到读请求3和读请求4，DPU114B从DPU114A接收到读请求1和读请求2。至此，DPU114A和DPU114B均得到一组相同的读请求，即读请求1至读请求4。
可理解,多个DPU交换读请求的前提是,每个DPU都需要知道要将本计算节点的读请求发送给哪些计算节点(即交换对象)。如何让DPU确定全部的交换对象,这里介绍一种可选的实现方式:对于执行同一作业的多个并行进程,在作业启动后,每个并行进程可获得一个进程标识(如rank号)和进程总数,如进程总数为m,rank号从0开始编号至m-1,每个并行进程基于自身的进程标识和进程总数,便可以确定其他并行进程的进程标识,如此,DPU可基于进程的rank号进行通信,如使用rank号确定对端的链路,从而将本节点上的读请求发送至其他计算节点的DPU。在另一种实现方式中,多个DPU中的每个DPU具有一个rank号,如此可基于DPU的rank号进行通信。
具体的在交换时,由于一个计算节点可能生成多个读请求,换言之,一个DPU可能接收到本计算节点的多个读请求,示例性地,该DPU在向每个交换对象发送本计算节点的多个读请求时,可以将该本计算节点的多个读请求进行聚合,将聚合后的数据发送给其他计算节点,而非单独发送每个读请求。其中,聚合后的数据包括本计算节点的多个读请求,比如图4中,DPU114A将读请求1和读请求2进行聚合,将聚合后的数据(包括读请求1和读请求2)发送至DPU114B,而不是将读请求1和读请求2单独发送至DPU114B,如此,可以减少网络IO的数量。然而现有方式中计算节点上的网卡只能单独发送本计算节点的每个读请求,这是由于现有方式网卡只能被动执行转发,当然,本申请中DPU也可以实现单独发送本计算节点上的每个读请求,比如,DPU114A先发送读请求1至DPU114B,再发送读请求2至DPU114B,此处不做限定。
需要说明的是,图4仅示出两个计算节点,并非指多个DPU中只有两个DPU互相交换读请求,示例性的,如果在实际作业中存在两个以上计算节点,则每个DPU均需要将本计算节点上的读请求发送至多个DPU中的其他任一DPU,比如,图4中参与作业的还包括计算节点100C(包括DPU114C),假设计算节点100C运行进程4,计算节点100B用 于运行进程3,计算节点100A仍运行进行1和进程2。则DPU114A将读请求1和读请求2分别发送至计算节点100B和计算节点100C。DPU114B要将读请求3发送至计算节点100A和计算节点100C。同样,计算节点100C将读请求4发送至计算节点100A和计算节点100B。
总而言之,经过交换,用于执行作业的每个计算节点的DPU均获得一组相同的读请求,该一组读请求包括该运行作业的所有进程的读请求,具体包括本计算节点的读请求和其他计算节点的读请求。
步骤403,多个DPU中的每个DPU将获取的多个读请求中每个读请求所请求读取的数据的信息进行聚合,得到聚合信息。
示例性地,每个读请求所请求读取的数据(即待读取数据)的信息可以是该待读取数据的地址信息,每个DPU将多个读请求中每个读请求的地址信息进行聚合以得到聚合信息,该聚合信息指示多个读请求所请求读取的聚合数据。可以理解为,聚合信息为一个新的地址信息,其所指示的聚合数据包括多个读请求中每个读请求所请求读取的数据。
以一个DPU如DPU114A为例,假设读请求1请求读取数据1,读请求2请求读取数据2,读请求3请求读取数据3,读请求4请求读取数据4。DPU114A将数据1的地址信息、数据2的地址信息、数据3的地址信息和数据4的地址信息进行聚合,得到聚合信息,该聚合信息所指示的聚合数据包括数据1、数据2、数据3和数据4。应注意,此时,DPU114A上并不存在数据1至数据4,此处仅为表明聚合过程及聚合信息所指示的聚合数据。
其中,待读取数据的地址信息可包括待读取数据的起始地址和长度,举例来说,如图5所示,假设数据1的地址信息为10MB(起始地址)+2MB(长度),数据2的地址信息为12MB(起始地址)+2MB(长度),数据3的地址信息为14MB(起始地址)+2MB(长度),数据4的地址信息为16MB(起始地址)+2MB(长度)。DPU114A将10MB+2MB、12MB+2MB、14MB+2MB和16MB+2MB(长度)进行聚合,如得到聚合信息为10MB(起始地址)+8MB(长度),该聚合信息所指示的聚合数据包括数据1至数据4。DPU114B执行相同的操作,以得到相同的聚合信息10MB(起始地址)+8MB(长度)。
需要说明的是,图5所示的数据1、数据2、数据3、数据4的存储地址是连续的,且数据量大小均相同,实际上,多个待读取数据的存储地址之间还可能存在重叠,如图6的(a)所示,和/或,多个待读取数据的存储地址可能是不连续的,如图6的(b)所示。其中,待读取数据的数据量大小可以完全相同或完全不同或不完全相同,本申请对此均不做限定。
无论存储地址是否连续或重叠,聚合方式均相同,比如,假设图6的(a)中,数据1的地址信息为10MB+5MB,数据2的地址信息为12MB+6MB,数据3的地址信息为18MB+4MB,数据4的地址信息为21MB+3MB,则将多个该地址信息聚合后得到的聚合信息可包括10MB+14MB。又比如,假设图6的(b)中,数据1的地址信息为10MB+5MB,数据2的地址信息为17MB+3MB,数据3的地址信息为20MB+3MB,数据4的地址信息为23MB+5MB,则将多个该地址信息聚合后得到的聚合信息可包括10MB+18MB。
另外需要说明的是,本申请中待读取数据的地址信息并不限定于待读取数据的起始地址和长度,还可以包括其他信息,如在一个示例中,该多个读请求请求对同一个文件执行读操作,则此处的每个读请求所携带的地址信息还可以包括用于指示文件的文件路径、文件句柄、待读取数据在文件内的起始地址(偏移量)和长度等一项或多项,本申请对此不 做限定。其中,文件句柄是分布式文件系统内各文件的唯一标识,基于文件路径也可以唯一确定一个文件。
总结来说,聚合数据是多个待读取数据的集合,聚合数据包括首个待读取数据(如图5中为数据1)至末尾处待读取的数据(如图5中数据4)。聚合数据的起始地址为首个待读取数据(如图5中为数据1)的起始地址、聚合数据的长度(首个待读取数据的起始地址至最后一个待读取数据(如图5中数据4)的尾端的长度)。对应的,聚合信息指示聚合数据,聚合信息可包括聚合数据的起始地址和聚合数据的长度。
综上,每个DPU基于一组相同的读请求进行聚合,均会得到一个相同的聚合信息。应注意,这里每个DPU均需要进行聚合操作以得到聚合信息,是由于后续要从该多个DPU中随机选择一些DPU作为聚合DPU,而聚合DPU需要基于聚合信息来读数据,因此,此处每个DPU均需要执行聚合操作。
步骤404,多个DPU中的每个DPU将聚合信息指示的聚合数据对应的数据范围划分为K个子集,K取正整数。
以多个读请求请求读取同一个文件内的数据,即聚合数据为一个文件内的数据为例,示例性地,首先,每个DPU以设定的数据长度为单位将聚合数据对应的数据范围(或称文件范围)划分为多个子块,然后,将该多个子块划分为K个子集,每个子集可包括一个或多个子块。每个子集内的多个子块之间可以是连续的也可以是不连续的。
其中,用于划分子块的数据长度可以预设长度,或者是其他设备如存储设备210推荐(或通知)的数据长度,具体不做限定。需要说明的是,在不同的场景中,设定的数据长度可以是不同的,这可以与待读取数据的存储位置有关,如待读取数据对应的文件系统、存储该待读取数据的存储设备或存储系统等一项或多项因素有关,本申请对此不做限定。类似的,K也可以是预设值或其他方式确定的,下文会进行说明。
如下结合图5所示的聚合数据对应的完整文件范围为10MB+8MB,示例性列出几种分块及划分子集的示例:
示例1,返回图4的S404,假设设定的数据长度为4MB,K=1,每个DPU以4MB为单位可将该文件范围(10MB+8MB)划分为2个子块,分别为子块1(10MB+4MB)、子块2(14MB+4MB)。DPU将该2个子块划分至1个子集,可见,该子集包括子块1和子块2。
示例2,参见图7所示,假设设定的数据长度为4MB,K=2,图7与图4的区别仅在于,图7中K=2,即DPU将该2个子块划分为2个子集,记为子集1和子集2,示例性的,子集1可包括子块1,子集2包括子块2。
示例3,数据长度还可以是其他值,如参见图8所示,假设设定的数据长度为2MB,K=2,如图8的(a)或图8的(b)所示,每个DPU以2MB为单位可将该文件范围(10MB+8MB)划分为4个子块,分别记为子块1(10MB+2MB)、子块2(12MB+2MB)、子块3(14MB+2MB)和子块4(16MB+2MB)。DPU将4个子块划分为2个子集,记为子集1和子集2,示例性的,参见图8的(a)所示,子集1可以包括子块1和子块2,子集2包括子块3和子块4,此时,每个子集内的多个子块之间是连续的。再示例性的,参见图6的(b)所示,子集1可以包括子块1和子块3,子集2包括子集2和子块4,此时,每个子集内的多个子块之间是不连续的。应理解,图8仅为便于理解子集与子块的关系而举例,实际应用中,子块的数量通常小于读请求的数量,从而达到聚合的效果。
示例4,上述示例1至示例3中示出聚合数据内的待读取数据是连续的,实际上是聚合数据所包括的待读取数据可以是重叠的,如图9的(a)所示。或,聚合数据所包括的待读取数据可以是不连续的,参见图9的(b)所示。无论聚合数据内的待读取数据的关系如何,基于聚合数据对应的数据范围来划分子块及子集的方式是相同,此处不再赘述。
需要说明的是,在划分子块时若不能均分,则末尾处的子块的长度可能比设定的数据长度小,或者大于设定的数据长度,比如若待读取数据的文件范围为10MB+19MB,设定的数据长度为4MB,则在划分子块时,可以划分为5个子块,末尾处的子块的大小可为3MB,或者,划分为4个子块,末尾处的子块的大小可为7MB。
步骤405,多个DPU中的每个DPU,从该多个DPU中选择K个DPU作为聚合DPU。每个聚合DPU负责一个子集,这里负责的含义是指该子集内的数据由该聚合DPU读取。
在一种示例中,每个用于执行作业的计算节点中的一个DPU作为一个聚合DPU。K的取值可以根据用于执行作业的多个计算节点的数量设定,比如,图4中,假设K=2,则其中DPU114A可为一个聚合DPU,DPU114B为另一个聚合DPU。
在另一种示例中,多个DPU中的每个DPU根据一致性算法从多个DPU中选择相同的K个DPU作为聚合DPU。示例性的,聚合DPU的数量可以是预设值(即K值),每个DPU使用相同的输入数据和一致性算法,计算出K个DPU的标识,每个标识指示的DPU为一个聚合DPU。由于使用相同的一致性算法和输入数据,因此每个DPU能够计算出相同的K个聚合DPU。
具体的,一致性算法的输入数据包括但不限于下列中的一项或多项:多个DPU中每个DPU的标识、聚合DPU的数量的预设值、聚合信息(聚合数据对应的数据范围)、设定的数据长度、子块的数量等。一致性算法的计算结果可以包括k个DPU的标识,这样每个DPU都可以确定出相同的计算结果,从而确定相同的K个聚合DPU,以及确定DPU自身是否为聚合DPU。
比如,k为预设值,如假设k=1,图4中DPU114A的rank号为0,DPU114B的rank号为1,示例性的,输入数据可以包括rank=0、rank=1、k=1,DPU114A和DPU114B分别使用相同的一致性算法和输入数据计算出一个DPU的标识,如rank号=0,并将该rank号为0的DPU作为聚合DPU。如此,DPU114A和DPU114B均能够确定DPU114A为聚合DPU,DPU114B不是聚合DPU。
值得注意的是,k还可以是通过其他方式确定的值,比如,根据子块的数量确定k值,若子块的数量较多,则k值可以相应大一点,这样便可以通过多个聚合DPU并行执行读操作,从而提高作业的并行度,从而提高读数据的效率。若子块的数量较小,则k值也可以相应小一点,以平衡读效率和网络IO的数量。通常聚合器的数量通常为多个,以提高作业的并行度。此时,一致性算法的输入数据可以包括多个DPU的标识、子块的数量,可选的,还可以包括读操作的并行度(可理解为子块的数量和聚合DPU数量的比例)等,具体不再限定。
本申请实施例中,每个聚合DPU可以根据K个聚合DPU和K个子集的映射关系来确定聚合DPU自身负责的子集,该映射关系中一个聚合DPU对应一个子集,不同聚合DPU对应的子集不同。
举例来说,每个聚合DPU通过另一种一致性算法计算其对应的一个或多个子块,从而确定其负责的子集。比如,基于子块的总数量和k值确定每个子集内的子块的数量(记 为m),每m个连续子块为一个子集。每个聚合DPU基于k个聚合DPU的rank号的大小升序排序(或降序排序),按自身rank号在该排序中的位置选择对应位置的子集,比如,聚合DPU的数量为2,该2个聚合DPU的rank号分别为rank0、rank1。结合图8的(a),4个子块被分为2个子集,每个子集包括连续的两个子块,rank0对应排在第一个的子集1,rank1对应排在第二个的子集2。应注意,上述例子仅为示意,实际上,聚合DPU的rank号可能是不连续,比如,多个聚合DPU的rank号分别为0、4、9、14等等,下文不再赘述。
再比如,k个聚合DPU均基于相同的一致性算法来确定自身负责的子集。例如,一致性算法为:聚合DPU自身的编号+N*K。聚合DPU自身的编号可以基于k个聚合DPU的rank号的升序排序确定,比如,rank号排在首个的聚合DPU的编号为1,其余聚合DPU的编号从1开始依次加1,那么rank号排在第二个的聚合DPU的编号为2,rank号排在第三个的聚合DPU的编号为3,依此类推,结合图8的(b),假设k=2,那么编号为1的聚合DPU负责的子集包括子块1、子块3。编号为2的聚合DPU负责的子集包括子块2、子块4。
值得注意的是,步骤404和步骤405中的K值是同一值,K可以是预设值或根据其他方式确定的值。步骤404也可以在步骤405之后执行,比如,确定k个聚合DPU之后,再确定每个聚合DPU对应的子块,从而确定出k个子集。
步骤406,每个聚合DPU读取对应子集的数据。
示例性地,聚合DPU向存储设备210发送至少一个读请求,来请求读取该子集内的数据。具体的,当该子集内的数据的数据量较大或不连续时,聚合DPU也可以通过多个读请求读取子集内的数据,每个读请求用于请求读取该子集内的部分数据,具体不做限定。如聚合DPU以文件子块为单位从存储设备210获取数据,每个读请求用于请求读取该子集内的一个子块的数据。
举例来说,参见图4,假设K=1,图4中DPU114A为聚合DPU,DPU114B不是聚合DPU,DPU114A用于读取图4所示的子集的数据(即子块1和子块2),示例性地,DPU114A向存储设备210发送请求读取子块1的读请求5,以及读取子块2的读请求6;存储设备210响应读请求5和读请求6,将子块1的数据、子块2的数据发送给DPU114A。
再举例来说,结合图8的(b),假设K=2,图4中DPU114A、DPU114B均为聚合DPU。假设DPU114A负责子集1(如包括子块1和子块3),DPU114B负责子集2(如包括子块2和子块4)。DPU114A向存储设备210发送请求读取子块1的读请求5,DPU114B向存储设备210发送请求读取子块2的读请求6;类似的,DPU114A向存储设备210发送请求读取子块3读请求7,DPU114B向存储设备210发送请求读取子块4的读请求8。应注意,此处仅为举例,不同的聚合DPU对应的存储设备210可能是不同的,本申请对此不做限定。
每个聚合DPU读取自身负责的子集的数据,通过较少的IO来读取子集内的数据,减少或避免重复IO。
步骤407,聚合DPU以目标读请求为粒度,对读取到的子集内的数据进行分离,并反馈每个目标读请求所请求读取的数据。
目标读请求是指聚合DPU(在步骤402中)接收到的多个读请求中,请求读取的数据与该子集内的数据存在交集的读请求。存在交集是指,该目标读请求所请求读取的部分或 全部数据处于该子集内。应注意,目标读请求的数量可以是一个或多个。
比如,对于图4所示的子集,其对应的目标读请求包括读请求1、读请求2、读请求3和读请求4。再比如,对于图7所示的,子集1对应的目标读请求包括读请求1和读请求2;子集2对应的目标读请求包括读请求3和读请求4。再比如,对于图8的(b)所示的子集1对应的目标读请求包括读请求1和读请求3;子集2对应的目标读请求包括读请求2和读请求4。
聚合DPU确定其负责的子集所对应的一个或多个目标读请求,按照目标读请求对读取到的该子集内的数据进行分离,得到每个目标读请求对应的数据,该数据可能是该目标读请求所请求读取的部分或全部数据,将每个目标读请求对应的数据发送至该目标读请求所归属的计算节点100。
继续参见图4,假设DPU114A为聚合DPU(称为聚合DPU114A),负责读取图4所示的子集,聚合DPU114A确定子集1对应的目标读请求包括读请求1、读请求2、读请求3和读请求4,则聚合DPU114A将读取的子集1的数据按照读请求1、读请求2、读请求3和读请求4进行分离,分离为读请求1所请求读取的数据1,读请求2所请求读取的数据2,读请求3所请求读取的数据3,以及读请求4所请求读取的数据4。
在分发数据时,示例性地,参见图4,聚合DPU114A将数据1发送至进程1,将数据2发送至进程2。在向其他计算节点分发数据时,聚合DPU可以以目标读请求为粒度,将每个目标读请求所请求的数据发送至该目标读请求所归属的计算节点。比如,聚合DPU114A分别独立地将数据3、数据4发送至DPU114B。再示例性地,聚合DPU还可以以读请求所归属的计算节点为粒度,对归属于同一个计算节点的多个目标读请求所请求读取的数据进行聚合,将聚合后的数据发送至该计算节点。比如,聚合DPU114A确定读请求3和读请求4均归属于计算节点100B,聚合DPU114A对数据3和数据4进行聚合,将聚合后的数据发送至计算节点100B,聚合后的数据包括数据3和数据4,从而减少网络IO。
需要说明的是,目标读请求所请求读取的数据之间可能存在重叠,因此,分离出的数据之间也可能存在重叠。一个读请求所请求读取的数据可能被分割至一个或多个子块中,如图9的(a)中,数据2的部分数据被划分至子块1,其余部分数据被划分至子块2。一个子块中也可能包括一个或多个读请求所请求读取的数据。如继续参见图9的(a),子块1包括读请求1所请求的数据1,读请求2所请求的数据2的部分数据。子块2包括读请求2所请求的数据2的部分数据和读请求3所请求的数据3。一个读请求所请求读取的数据也可能划分出一个或多个文件子块,如图9的(a)中,数据4划分出2个子块。而一个子集可包括一个或多个子块,因此,一个子集内的数据包括一个或多个读请求所请求读取的数据,并且其中的多个读请求中的部分或全部读请求可能来自同一个计算节点100。图9的(a)中,子集1对应的目标读请求包括读请求1、读请求2和读请求3,子集1内的数据将被分离为数据1、数据2和数据3,其中,数据1和数据2之间存在部分重叠。子集2对应的目标读请求包括读请求4,子集2内的数据即为数据4,不需分离。
步骤408,DPU接收聚合DPU发送的数据,并根据本计算节点的读请求,将该数据分发至对应的进程。
DPU将接收到的数据发送至本计算节点的进程,若接收到的数据为聚合后的数据,则DPU以读请求为粒度对该聚合后的数据进行分离,将分离后的数据发送至本计算节点的对应进程。比如,DPU114B接收DPU114A发送的数据,该数据包括数据3和数据4,DPU114B 按照读请求3和读请求4对该数据进行分离,分离为数据3和数据4,并将数据3发送给进程3,将数据4发送给进程4。
上述方式,用于执行作业的多个计算节点中的每个计算节点上的DPU汇聚各并行进程的读请求,基于汇聚得到的多个读请求确定聚合信息,DPU基于聚合信息获取聚合数据,并将聚合数据分离后发送给对应的计算节点的DPU,如此,第一DPU对接收到的多个读请求中每个读请求的信息进行聚合,不需要第一DPU将多个读请求依次发送给处理器处理,减少计算节点的软硬件交互次数,降低CPU的占用率及算力开销,另外通过将多个读请求的信息进行聚合来读取数据,可减少或避免重复IO,提高IO性能,可缩短运行作业时间,进一步减少计算资源占用率。
接下来结合图10介绍本申请实施例提供的写请求的数据处理方法。该数据处理方法可以由图1或图3所示的计算节点100A、计算节点100B中的数据处理装置(简称为DPU)执行。
如图10所示,该方法包括如下步骤:
步骤1001,用于执行作业的多个计算节点中生成各自的写IO。
用于执行作业的多个进程可能在相同的时间段内生成各自的写请求。举例来说,假设图2涉及的作业被调度至计算节点100A、100B来执行,为便于说明,假设该作业中城市A仅包括4个区域,如此,计算节点100A、100B可通过至少4个进程来并行执行作业。其中进程1执行子任务1生成写请求1,该写请求可能是请求写入子任务1的计算结果,类似的,进程2执行子任务2生成写请求2,进程3执行子任务3生成写请求3,进程4执行子任务4生成写请求4。
值得注意的是,此处的写请求携带有指示待写入数据的信息,如地址信息,并不携带待写入数据。
步骤1002,多个DPU(指用于执行作业的多个计算节点所对应的多个DPU,下文中的多个DPU均为此意,后续不再重复说明)中的每个DPU交换各自的写请求,使得每个DPU均获取到用于执行作业的所有并行进程的写请求。步骤1002的具体执行流程可以参见上文中步骤402的描述,区别在于步骤402中交互读请求,而步骤1002交互写请求,此处不再赘述。
步骤1003，每个DPU（指用于执行作业的每个计算节点上的DPU）将获取的多个写请求中每个写请求中的待写入数据的信息进行聚合，得到聚合信息。
每个DPU基于一组相同的写请求进行聚合，均会得到一个相同的聚合信息，即多个写请求对应的待写入数据的完整的文件范围。步骤1003的具体执行流程可以参见上文中步骤403的描述，区别在于步骤403是对待读取数据的信息进行聚合，而步骤1003是对待写入数据的信息进行聚合，此处不再赘述。
步骤1004,多个DPU中的每个DPU将聚合信息指示的文件范围划分为K个子集,K取正整数。步骤1004的具体执行流程可以分别参见上文中步骤404、步骤405的描述,此处不再赘述。
步骤1005,多个DPU中的每个DPU,从该多个DPU中选择K个DPU作为聚合DPU,以及确定每个聚合DPU所负责的子集。步骤1005的具体执行流程可以参见上文中步骤405的描述,此处不再赘述。
步骤1006,每个DPU将本计算节点上的待写入数据发送至对应的聚合DPU。
其中,待写入数据对应的聚合DPU是指包括待写入数据的子集所归属的DPU,比如,图10中,假设DPU114B确定DPU114A为聚合DPU,及DPU114A负责子集(参见图10),DPU114B确定本计算节点的待写入数据包括数据b和数据d,数据b和数据d对应的子集均为DPU114A所负责的子集,DPU114B将数据b和数据d发送给DPU114A。步骤1006中,DPU可以将本计算节点上的多个并行进程对应的多个待写入数据进行聚合,将聚合后的数据发送至对应的聚合DPU,比如对于DPU114B而言,聚合后的数据包括数据b和数据d。此处可参见上文中步骤402中,DPU将本计算节点上的多个读请求进行聚合后发送的方式,此处不再赘述。
步骤1007,聚合DPU将接收到的子集内的数据写入存储设备210。
比如图10中,DPU114A通过至少一个写请求将其负责的子集内的数据发送至存储设备210。比如,DPU114A向存储设备210发送写请求5、写请求6,写请求5包括子块1内的数据(数据a和数据b),写请求6包括子块2内的数据(数据c和数据d)。
需要注意的是,由于一个子集内的待写入数据可能是不连续的,且存储设备210内该子集指示的地址空间存储有数据(记为第一数据),本申请为了提高写性能,聚合DPU可首先从存储设备210中读取第一数据,在第一数据的基础上更新待写入数据,从而得到该子集对应的连续的待写入数据,然后将该子集对应的连续的待写入数据写入存储设备210。
如果存储设备210内该子集指示的地址空间未写过数据,则聚合DPU也可以不读取第一数据,通过补0的方式获得该子集对应的连续的待写入数据,然后将该子集的待写入数据写入存储设备210。
值得注意的是,图10实施例中的聚合信息与图4实施例中的聚合信息并不是同一个聚合信息,为便于区分,可以将图4实施例中的聚合信息替换为第一聚合信息,将图10实施例中的聚合信息替换为第二聚合信息。应理解,图4与图10只是相似步骤的方法相同,实际上两者之间并没有关联,比如,图4中DPU114A为聚合DPU,图10中可能是DPU114B为聚合DPU,另外,图4与图10中的两个子块的长度可以相同也可以不同,本申请对此均不做限定。
上述方式,用于执行作业的多个计算节点中的每个计算节点上的DPU汇聚各并行进程的写请求,基于汇聚得到的多个写请求中每个写请求携带的待写入数据的信息确定聚合信息,DPU基于聚合信息将聚合数据写入存储设备210,如此,便可以绕开计算节点上的处理器,减少处理器的占用率及算力开销,以及DPU与进程之间的软硬件交互次数和网络IO的次数,提高系统的写性能,可提高写效率,进一步减少计算资源占用率。
需要说明的是,图4和图10可以是两个独立流程,并非限定一个并行进程必须生成读请求和写请求。在一种可能的场景,并行进程在执行一个子任务时可以只生成读请求,或只生成写请求,具体不做限定。另外,图4和图10所涉及的参数可以是不同的,比如,图4所示方法中的聚合DPU的数量和图10所示方法中的聚合DPU的数量可以相同也可以不同,如果相同,也不一定选择相同的DPU作为聚合DPU。图4所示方法中的子块长度(设定的数据长度)和图10所示方法中的子块长度(设定的数据长度)可以相同也可以不同,等等,本申请实施例对此均不做限定。
基于与方法实施例同一发明构思,本申请实施例还提供了一种数据处理装置,该装置用于执行上述图4、图10方法实施例中DPU114A或DPU114B执行的方法。如图11所示,数据处理装置1100包括通信模块1101、聚合模块1102、处理模块1103;具体地,在数据 处理装置1100中,各模块之间通过通信通路建立连接。
通信模块1101,用于接收至少一个计算节点中的多个进程对应的多个读请求;具体实现方式请参见图4中的步骤401-步骤402的描述,此处不再赘述。
聚合模块1102,用于对接收到的多个读请求中每个读请求所读取的数据的信息进行聚合得到第一聚合信息;具体实现方式请参见图4中的步骤403的描述,此处不再赘述。
处理模块1103,用于根据所述第一聚合信息确定待读取的第一目标数据。具体实现方式请参见图4中的步骤404-步骤405的描述,此处不再赘述。
在一种可能的实现方式中,通信模块1101,还用于接收至少一个计算节点中的多个进程对应的多个写请求;具体实现方式请参见图10中的步骤1001-步骤1002的描述,此处不再赘述。
聚合模块1102,还用于对所述多个写请求中每个写请求中指示待写入数据的信息进行聚合得到第二聚合信息;具体实现方式请参见图10中的步骤1003的描述,此处不再赘述。
处理模块1103,还用于根据所述第二聚合信息确定待写入的第二目标数据;具体实现方式请参见图10中的步骤1004-步骤1005的描述,此处不再赘述。
图12为本申请实施例提供的一种计算设备的结构示意图。该计算设备用于执行上述图4、图10方法实施例中DPU114A或DPU114B执行的方法。该计算设备1200包括处理器1201、存储器1202和通信接口1203。其中,处理器1201、存储器1202和通信接口1203可以通过总线1204连接。
处理器1201用于执行存储器1202中存储的指令,以使数据处理装置1200执行本申请提供的数据处理方法,处理器1201可以是但不限于:数据处理单元(data processing unit,DPU)、系统级芯片(system on chip,SOC)、可编程门阵列(field programmable gate array,FPGA)、图形处理器(graphics processing unit,GPU)、特殊应用集成电路(application specific integrated circuit,ASIC)等处理器中的任意一种或多种。
存储器1202,用于存储计算机指令和数据,如存储器1202存储实现本申请提供的数据处理方法所需的计算机指令和数据。存储器1202包括易失性存储器(volatile memory),例如随机存取存储器(random access memory,RAM)、动态随机存取存储器(dynamic random access memory,DRAM)等。也可以为非易失性存储器(non-volatile memory),例如只读存储器(read-only memory,ROM)、存储级存储器(storage-class memory,SCM),快闪存储器,机械硬盘(hard disk drive,HDD)或固态硬盘(solid state drive,SSD)。
存储器1202中存储有可执行的程序代码,处理器1201执行该可执行的程序代码以分别实现前述通信模块1101、聚合模块1102、处理模块1103的功能,从而实现数据处理方法。也即,存储器1202上存有数据处理装置1100用于执行本申请提供的数据处理方法的指令。
通信接口1203,用于与内部设备或外部设备通信,如获取进程发送的读请求/写请求,又如与存储设备210通信,以完成数据存取。示例性的,通信接口1203可以是网卡。
总线1204可以是快捷外围部件互连标准(Peripheral Component Interconnect Express,PCIe)总线,或双数据速率(double data rate,DDR)总线,或串行高级技术附件(serial advanced technology attachment,SATA)总线,或串行连接SCSI(serial attached scsi,SAS)总线,或控制器局域网络总线(Controller Area Network,CAN),或扩展工业标准结构(extended industry standard architecture,EISA)总线、统一总线(unified bus,Ubus或UB)、计算机 快速链接(compute express link,CXL)、缓存一致互联协议(cache coherent interconnect for accelerators,CCIX)等。总线可以分为地址总线、数据总线、控制总线等。为便于表示,图12中仅用一条线表示,但并不表示仅有一根总线或一种类型的总线。总线1204可包括在数据处理装置1200各个部件(例如,存储器1202、处理器1201、通信接口1203)之间传送信息的通路。
本申请实施例还提供了一种包含指令的计算机程序产品。所述计算机程序产品可以是包含指令的,能够运行在计算设备上或被储存在任何可用介质中的软件或程序产品。当所述计算机程序产品在至少一个计算机设备上运行时,使得至少一个计算机设备执行上述图4或图10实施例中的DPU114A所执行的数据分离方法,参见图4或图10各步骤的描述,此处不再赘述。
本申请实施例还提供了一种计算机可读存储介质。所述计算机可读存储介质可以是计算设备能够存储的任何可用介质或者是包含一个或多个可用介质的数据中心等数据存储设备。所述可用介质可以是磁性介质,(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质(例如固态硬盘)等。该计算机可读存储介质包括指令,所述指令指示计算设备执行上述图4或图10实施例中的DPU114A所执行的数据处理方法,参见图4或图10各步骤的描述,此处不再赘述。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线(DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包括一个或多个可用介质集成的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质,(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质(例如固态硬盘(Solid State Disk,SSD))等。
本申请实施例中所描述的各种说明性的逻辑单元和电路可以通过通用处理器,数字信号处理器,专用集成电路(ASIC),现场可编程门阵列(FPGA)或其它可编程逻辑装置,离散门或晶体管逻辑,离散硬件部件,或上述任何组合的设计来实现或操作所描述的功能。通用处理器可以为微处理器,可选地,该通用处理器也可以为任何传统的处理器、控制器、微控制器或状态机。处理器也可以通过计算装置的组合来实现,例如数字信号处理器和微处理器,多个微处理器,一个或多个微处理器联合一个数字信号处理器核,或任何其它类似的配置来实现。
本申请实施例中所描述的方法或算法的步骤可以直接嵌入硬件、处理器执行的软件单元、或者这两者的结合。软件单元可以存储于RAM存储器、闪存、ROM存储器、EPROM存储器、EEPROM存储器、寄存器、硬盘、可移动磁盘、CD-ROM或本领域中其它任意形式的存储媒介中。示例性地,存储媒介可以与处理器连接,以使得处理器可以从存储媒介中读取信息,并可以向存储媒介存写信息。可选地,存储媒介还可以集成到处理器中。 处理器和存储媒介可以设置于ASIC中。
这些计算机程序指令也可装载到计算机或其他可编程数据处理设备上,使得在计算机或其他可编程设备上执行一系列操作步骤以产生计算机实现的处理,从而在计算机或其他可编程设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。
尽管结合具体特征及其实施例对本申请进行了描述,显而易见的,在不脱离本申请的精神和范围的情况下,可对其进行各种修改和组合。相应地,本说明书和附图仅仅是所附权利要求所界定的本申请的示例性说明,且视为已覆盖本申请范围内的任意和所有修改、变化、组合或等同物。显然,本领域的技术人员可以对本申请进行各种改动和变型而不脱离本申请的范围。这样,倘若本申请的这些修改和变型属于本申请权利要求及其等同技术的范围之内,则本申请也意图包括这些改动和变型在内。

Claims (15)

  1. 一种数据处理方法,其特征在于,所述数据处理方法应用于计算系统,所述计算系统包括多个计算节点,每个计算节点中运行有至少一个进程,每个计算节点连接数据处理设备DPU;
    所述方法包括:
    第一DPU接收至少一个计算节点中的多个进程对应的多个读请求,对所述多个读请求中每个读请求所读取的数据的信息进行聚合得到第一聚合信息;
    所述第一DPU根据所述第一聚合信息确定所述第一DPU待读取的第一目标数据。
  2. 如权利要求1所述的方法,其特征在于,所述第一聚合信息用于指示所述多个读请求所读取的第一聚合数据;
    所述第一DPU根据所述第一聚合信息确定所述第一DPU待读取的第一目标数据,包括:
    所述第一DPU将所述第一聚合数据划分为多个数据子块;
    所述第一DPU根据映射关系确定所述第一DPU对应的至少一个数据子块,所述第一目标数据包括所述第一DPU对应的至少一个数据子块;所述映射关系用于指示所述第一DPU对应的数据子块。
  3. 如权利要求2所述的方法,其特征在于,所述第一DPU被确定为所述计算系统中的聚合DPU,所述计算系统还包括第二DPU,所述第二DPU也为聚合DPU,所述第一DPU用于读取所述第一目标数据,所述第二DPU用于读取所述第二目标数据,所述第二目标数据为所述聚合数据中除所述第一目标数据之外的其余部分或全部数据。
  4. 如权利要求1-3任一项所述的方法,其特征在于,所述方法还包括:
    所述第一DPU将从存储设备读取的所述第一目标数据按照所述多个读请求所归属的计算节点进行分离,并将分离的数据发给对应的计算节点。
  5. 如权利要求1-4任一项所述的方法,其特征在于,每个读请求所读取的数据的信息为所述数据的地址信息。
  6. 如权利要求1-5任一项所述的方法,其特征在于,所述方法还包括:
    所述第一DPU接收至少一个计算节点中的多个进程对应的多个写请求,对所述多个写请求中每个写请求中指示待写入数据的信息进行聚合得到第二聚合信息;
    所述第一DPU根据所述第二聚合信息确定所述第一DPU待写入的第三目标数据;
    所述第一DPU获取所述第三目标数据,并将所述第三目标数据写入与所述第一DPU连接的存储设备。
  7. 如权利要求6所述的方法,其特征在于,所述第二聚合信息用于指示所述多个写请求所写入的第二聚合数据;
    所述第一DPU根据所述第二聚合信息确定所述第一DPU待写入的第三目标数据,包括:
    所述第一DPU将所述第二聚合数据划分为多个数据子块;
    所述第一DPU根据DPU的标识和数据子块的映射关系确定所述第一DPU对应的至少一个数据子块,所述第三目标数据包括所述第一DPU对应的至少一个数据子块。
  8. 一种数据处理装置,其特征在于,包括:
    通信模块,用于接收至少一个计算节点中的多个进程对应的多个读请求;
    聚合模块,用于对所述多个读请求中每个读请求所读取的数据的信息进行聚合得到第一聚合信息;
    处理模块,用于根据所述第一聚合信息确定待读取的第一目标数据。
  9. 如权利要求8所述的装置,其特征在于,所述第一聚合信息用于指示所述多个读请求所读取的第一聚合数据;
    所述处理模块在确定待读取的第一目标数据时,具体用于:
    将所述第一聚合数据划分为多个数据子块;根据映射关系确定所述数据处理装置对应的至少一个数据子块,所述第一目标数据包括所述数据处理装置对应的至少一个数据子块;所述映射关系指示所述数据处理装置与数据子块的对应关系。
  10. 如权利要求8或9所述的装置,其特征在于,所述处理模块还用于:将从存储设备读取的所述第一目标数据按照所述多个读请求所归属的计算节点进行分离,并通过所述通信模块将分离的数据发给对应的计算节点。
  11. 如权利要求8-10任一项所述的装置,其特征在于,所述通信模块,还用于接收至少一个计算节点中的多个进程对应的多个写请求;
    所述聚合模块,还用于对所述多个写请求中每个写请求中指示待写入数据的信息进行聚合得到第二聚合信息;
    所述处理模块,还用于根据所述第二聚合信息确定待写入的第二目标数据,获取所述第二目标数据,并通过所述通信模块将所述第二目标数据写入与所述数据处理装置连接的存储设备。
  12. 如权利要求11所述的装置,其特征在于,所述第二聚合信息用于指示所述多个写请求所写入的第二聚合数据;
    所述处理模块在根据所述第二聚合信息确定待写入的第二目标数据时,具体用于:
    将所述第二聚合数据划分为多个数据子块;根据映射关系确定所述数据处理装置对应的至少一个数据子块,所述第二目标数据包括所述数据处理装置对应的至少一个数据子块;所述映射关系指示所述数据处理装置与数据子块的对应关系。
  13. 一种计算装置,其特征在于,所述计算装置包括处理器和供电电路;
    所述供电电路用于为所述处理器供电;
    所述处理器用于执行上述权利要求1至7中任一项所述的方法。
  14. 一种计算设备,其特征在于,所述计算设备包括存储器和至少一个处理器,所述存储器用于存储一组计算机程序指令,当所述处理器执行所述一组程序指令时,执行上述权利要求1至7中任一项所述的方法。
  15. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质被存储设备执行时,所述存储设备执行上述权利要求1至7中任一项所述的方法。
PCT/CN2023/100813 2022-07-14 2023-06-16 一种数据处理方法及装置 WO2024012153A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210834105.1 2022-07-14
CN202210834105.1A CN117435330A (zh) 2022-07-14 2022-07-14 一种数据处理方法及装置

Publications (1)

Publication Number Publication Date
WO2024012153A1 true WO2024012153A1 (zh) 2024-01-18

Family

ID=89535479

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/100813 WO2024012153A1 (zh) 2022-07-14 2023-06-16 一种数据处理方法及装置

Country Status (2)

Country Link
CN (1) CN117435330A (zh)
WO (1) WO2024012153A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120290718A1 (en) * 2011-05-10 2012-11-15 Glenn Nethercutt Methods and Computer Program Products for Collecting Storage Resource Performance Data Using File System Hooks
CN102819407A (zh) * 2012-08-07 2012-12-12 中国科学院地理科学与资源研究所 一种在集群环境中对遥感影像数据进行高效并行存取的方法
CN103761291A (zh) * 2014-01-16 2014-04-30 中国人民解放军国防科学技术大学 一种基于聚合请求的地理栅格数据并行读写方法
CN113821164A (zh) * 2021-08-20 2021-12-21 济南浪潮数据技术有限公司 一种分布式存储系统的对象聚合方法和装置
CN114116293A (zh) * 2021-10-18 2022-03-01 中山大学 一种基于MPI-IO的MapReduce溢写改善方法

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120290718A1 (en) * 2011-05-10 2012-11-15 Glenn Nethercutt Methods and Computer Program Products for Collecting Storage Resource Performance Data Using File System Hooks
CN102819407A (zh) * 2012-08-07 2012-12-12 中国科学院地理科学与资源研究所 一种在集群环境中对遥感影像数据进行高效并行存取的方法
CN103761291A (zh) * 2014-01-16 2014-04-30 中国人民解放军国防科学技术大学 一种基于聚合请求的地理栅格数据并行读写方法
CN113821164A (zh) * 2021-08-20 2021-12-21 济南浪潮数据技术有限公司 一种分布式存储系统的对象聚合方法和装置
CN114116293A (zh) * 2021-10-18 2022-03-01 中山大学 一种基于MPI-IO的MapReduce溢写改善方法

Also Published As

Publication number Publication date
CN117435330A (zh) 2024-01-23

Similar Documents

Publication Publication Date Title
US9665404B2 (en) Optimization of map-reduce shuffle performance through shuffler I/O pipeline actions and planning
US8819335B1 (en) System and method for executing map-reduce tasks in a storage device
WO2021254135A1 (zh) 任务执行方法及存储设备
US10248346B2 (en) Modular architecture for extreme-scale distributed processing applications
WO2023082560A1 (zh) 一种任务处理方法、装置、设备及介质
US20120297216A1 (en) Dynamically selecting active polling or timed waits
US20180307603A1 (en) Memory hierarchy-aware processing
WO2018032519A1 (zh) 一种资源分配方法、装置及numa系统
CN114730275A (zh) 使用张量在分布式计算系统中进行矢量化资源调度的方法和装置
US9471387B2 (en) Scheduling in job execution
Cong et al. CPU-FPGA coscheduling for big data applications
WO2018113030A1 (en) Technology to implement bifurcated non-volatile memory express driver
Sun et al. HPSO: Prefetching based scheduling to improve data locality for MapReduce clusters
US20240220334A1 (en) Data processing method in distributed system, and related system
US11429299B2 (en) System and method for managing conversion of low-locality data into high-locality data
CN116400982B (zh) 配置中继寄存器模块的方法和装置、计算设备和可读介质
Weiland et al. Exploiting the performance benefits of storage class memory for HPC and HPDA workflows
WO2024012153A1 (zh) 一种数据处理方法及装置
Liu et al. An efficient job scheduling for MapReduce clusters
CN116932156A (zh) 一种任务处理方法、装置及系统
Li et al. Dual buffer rotation four-stage pipeline for CPU–GPU cooperative computing
US20200117596A1 (en) A Memory Allocation Manager and Method Performed Thereby for Managing Memory Allocation
US10824640B1 (en) Framework for scheduling concurrent replication cycles
US20180329756A1 (en) Distributed processing system, distributed processing method, and storage medium
US9176910B2 (en) Sending a next request to a resource before a completion interrupt for a previous request

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23838647

Country of ref document: EP

Kind code of ref document: A1