WO2021254135A1 - Task execution method and storage device - Google Patents
Task execution method and storage device
- Publication number
- WO2021254135A1 (PCT/CN2021/097449)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- subtask
- data
- processor
- subtasks
- special processor
- Prior art date
Classifications
- G—PHYSICS; G06—COMPUTING, CALCULATING OR COUNTING; G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/5061: Partitioning or combining of resources
- G06F9/4881: Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
- G06F9/5066: Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs
- G06F15/7821: Tightly coupled to memory, e.g. computational memory, smart memory, processor in memory
- G06F9/5027: Allocation of resources to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals
Definitions
- This application relates to the field of computer technology, and in particular to a task execution method and storage device.
- Near Data Processing (NDP) is a data processing paradigm that moves computation close to where the data resides, thereby minimizing or even avoiding data movement. This avoids the performance bottleneck caused by data-movement overhead and improves the efficiency of data processing tasks.
- In the related art, the database server informs the storage device, through the Intelligent Database (iDB) protocol (a query pushdown protocol), of the table query operation to be performed and the location of the data. The storage device then uses its central processing unit (CPU) to perform Structured Query Language (SQL) table query operations such as predicate filtering, column filtering, and join filtering.
- the embodiments of the present application provide a task execution method and storage device, which can improve data processing efficiency.
- the technical solution is as follows:
- In a first aspect, a task execution method is provided, applied to a storage device that includes a central processing unit and a plurality of special processors.
- The central processing unit obtains a data processing task; the central processing unit divides the data processing task into multiple subtasks; and the central processing unit assigns a first subtask among the multiple subtasks to a first special processor according to the attributes of each subtask.
- the first special processor is one of the plurality of special processors.
- The above provides a method for cooperatively processing data using multiple processors of a storage device.
- The central processing unit in the storage device divides the data processing task into multiple subtasks and assigns the subtasks to special processors in the storage device according to the attributes of the subtasks.
- On the one hand, because the central processing unit is responsible for task decomposition and scheduling while the special processors are responsible for executing subtasks, the computing power of the central processing unit and of the special processors is fully utilized.
- On the other hand, because the attributes of the subtasks are considered during assignment, each subtask can be scheduled onto a suitable special processor according to its attributes. Therefore, this method improves the efficiency of data processing.
- the attribute of the subtask includes the address of the data involved in the subtask, and the first special processor is the special processor closest to the data.
- the subtask is scheduled to be executed on the special processor closest to the data.
- the special processor can access the data and process the data nearby, thus reducing the time delay and performance overhead caused by data movement, and improving the efficiency and speed of data processing.
- the attribute of the subtask includes a calculation mode and/or concurrency of the subtask
- the first special processor is a special processor that matches the calculation mode and/or concurrency.
- the attribute of the subtask includes definition information of the subtask
- the first special processor is a special processor indicated by the definition information of the first subtask.
- The developer can specify in the definition information which processor executes the subtask, so that the subtask is scheduled onto the special processor designated by the developer, thereby meeting the developer's customization requirements.
- When a new task is added, the special processor on which the new task should be scheduled can be specified by adding the identifier of that special processor to the new task's definition information, thereby reducing the difficulty of scheduling new tasks and improving scalability.
- the attribute of the subtask includes a data set type corresponding to the subtask
- the first special processor is a special processor matching the data set type corresponding to the first subtask.
- the execution sequence of the multiple subtasks is recorded in the topology diagram, and the method further includes:
- the central processor instructs the first special processor to execute the first subtask in order according to the topology diagram.
- The central processing unit does not need to recalculate the execution order of the subtasks and can schedule directly according to the execution order recorded in the topology diagram, thereby reducing the scheduling workload.
- In addition, there are many existing scheduling optimization algorithms based on topological graphs; these algorithms can be invoked to optimize the order of subtask scheduling, thereby shortening the overall execution time of the task.
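- To make the topology-driven dispatch concrete, the following is a minimal sketch in Python, assuming the topology diagram is given as a list of dependency edges; the `assign` callback, which hands a subtask to its allocated processor, is a hypothetical stand-in for the executor described later.

```python
from collections import deque

def schedule_by_topology(subtasks, edges, assign):
    """subtasks: subtask ids; edges: (u, v) pairs meaning u must finish before v;
    assign: callback that instructs the allocated processor to run a subtask."""
    indegree = {t: 0 for t in subtasks}
    successors = {t: [] for t in subtasks}
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1
    ready = deque(t for t in subtasks if indegree[t] == 0)
    while ready:
        t = ready.popleft()
        assign(t)                      # dispatch in recorded topological order
        for v in successors[t]:
            indegree[v] -= 1           # a subtask becomes ready once all of
            if indegree[v] == 0:       # its predecessors have completed
                ready.append(v)
```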
- In a second aspect, a storage device is provided, which includes a central processing unit and a plurality of special processors.
- the storage device provided in the second aspect is used to implement the functions provided in the first aspect or any optional manner of the first aspect. For specific details, refer to the foregoing first aspect or any optional manner of the first aspect.
- In a third aspect, a computer-readable storage medium is provided that stores at least one instruction; the instruction is read by the central processing unit to cause the storage device to execute the task execution method provided in the first aspect or any optional manner of the first aspect.
- In a fourth aspect, a computer program product is provided; when it runs on a storage device, the storage device executes the task execution method provided in the first aspect or any optional manner of the first aspect.
- In a fifth aspect, a storage device is provided that has the function of implementing the foregoing first aspect or any optional manner of the first aspect.
- the storage device includes at least one module, and the at least one module is configured to implement the task execution method provided in the first aspect or any one of the optional manners of the first aspect.
- FIG. 1 is a schematic diagram of a system architecture provided by an embodiment of the present application.
- FIG. 2 is a schematic diagram of application data fragmentation provided by an embodiment of the present application.
- FIG. 3 is a schematic diagram of a system architecture provided by an embodiment of the present application.
- FIG. 4 is a flowchart of a task execution method provided by an embodiment of the present application.
- FIG. 5 is a schematic diagram of a topology diagram provided by an embodiment of the present application.
- FIG. 6 is a flowchart of a task execution method provided by an embodiment of the present application.
- FIG. 7 is a schematic structural diagram of a task execution device provided by an embodiment of the present application.
- In the related art, the processing of data is generally centralized: data is loaded from storage into memory through input/output (IO) or the network, and the central processing unit (CPU) then processes the data in memory.
- Query processing therefore first requires a large number of IO operations to load data into the memory of the computing node, making IO or the network a performance bottleneck and causing significant problems: 1) the large amount of data movement increases the delay of data processing; 2) data transmission causes contention for IO or network resources, which affects the data access and performance of other applications in the system. In the era of big data, the volume of data is growing explosively; for data analysis applications, data transmission should be avoided as much as possible to reduce data-movement overhead.
- The memory is, for example, a dynamic random access memory (DRAM).
- the CPU needs to use load/store instructions to access the memory through the memory bus.
- The performance of the CPU has been increasing rapidly, at a rate of about 60% per year, while memory performance has improved by only about 7% per year, so memory speed now seriously lags behind CPU speed and there is a serious performance gap between the memory and the CPU.
- This makes it difficult to fully exploit the CPU, and the memory system becomes the performance bottleneck of the computing system, especially in memory-intensive high-performance computing (HPC) scenarios, where memory speed severely limits system performance.
- To alleviate this, near data processing (NDP), also known as near data computing (NDC), has been proposed.
- the method provided in this embodiment can be applied to a distributed storage system or a centralized storage device, and these two application scenarios are respectively introduced below.
- Application Scenario 1 The scenario of a distributed storage system.
- the system architecture 100 is an example of an application scenario of a distributed storage system.
- the system architecture 100 is an architecture in which computing and storage are separated.
- the system architecture 100 includes a computing cluster 110 and a storage cluster 120, and the computing cluster 110 and the storage cluster 120 are connected through a network channel.
- The computing cluster 110 includes multiple computing nodes (CN). A computing node can take many forms.
- a computing node is a host, a server, a personal computer, or other devices with computing processing capabilities.
- the computing cluster 110 includes a host 110a and a host 110b. Different computing nodes in the computing cluster 110 are connected through a wired network or a wireless network. Different computing nodes in the computing cluster 110 may be distributed in different or the same locations.
- the computing node is used to generate and issue data processing tasks.
- the computing node includes at least one application (Applications) 111 and an NDP coordination module (NDP Coordinator) 112.
- the application 111 and the NDP coordination module 112 are software on the computing node.
- the application 111 is used to generate data processing tasks.
- the application 111 is a data-intensive application, that is, a large amount of data needs to be processed.
- applications 111 are online analytical processing (On-line Analytical Processing, OLAP) applications, artificial intelligence (AI) applications, online transaction processing (On-Line Transaction Processing, OLTP) applications, big data analysis applications, HPC applications Wait.
- OLAP applications for example, are used to provide multi-table joint query services in the OLAP system.
- the application 111 sends the generated data processing task to the NDP coordination module 112.
- the NDP coordination module 112 is configured to send the data processing tasks of the application 111 to the storage node where the data is located.
- the distributed storage system further includes a data management device, which is used to record the storage node where the data is located in the storage cluster 120.
- the NDP coordination module in the computing node is used to send a query request to the data management device, so as to find out which storage node the data is located in.
- the data management device stores a mapping relationship between a file identifier (identifier, ID) and the ID of the storage node where the file is located.
- the data management apparatus stores a mapping relationship between the key and the ID of the storage node where the file is located.
- the data management device is the data format service (Data Scheme Service) 130 in FIG. 3.
- The storage cluster 120 includes multiple storage nodes (Data Node, DN).
- the storage cluster 120 includes a storage node 120a, a storage node 120b, and a storage node 120c.
- Different storage nodes in the storage cluster 120 may be distributed in different or the same locations.
- Different storage nodes in the storage cluster 120 are interconnected through a high-speed network.
- Storage nodes are used to store data. Storage nodes can carry storage services applied in computing nodes and respond to IO requests from computing nodes.
- the network channel between the computing cluster 110 and the storage cluster 120 is established through at least one network device.
- the network device is used to forward data transmitted between the computing cluster 110 and the storage cluster 120.
- Network equipment includes, but is not limited to, switches, routers, etc. The network equipment is not shown in FIG. 1.
- each computing node processes a piece of data in the application data set, and each storage node stores a piece of data in the application data set, thereby ensuring load balance between the computing node and the storage node.
- FIG. 2 shows a schematic diagram of application data fragmentation.
- the data set of application 1 and the data set of application 2 are respectively distributed in storage node 1, storage node 2 to storage node n.
- the data set of application 1 is divided into n pieces of data.
- The n pieces of data include data a of application 1, data b of application 1, through data n of application 1, where data a is distributed on storage node 1, data b is distributed on storage node 2, and data n is distributed on storage node n. The data distribution of application 2 is the same as that of application 1.
- the storage cluster 120 may use multiple copies or erasure codes (EC) for data redundancy protection, so that application data is still available when some storage nodes fail, thereby ensuring high data availability.
- the second application scenario is the scenario of a centralized storage device.
- the centralized storage device is, for example, a storage array.
- the centralized storage device includes one or more controllers and one or more hard disks.
- the controller in the storage device is also called the storage controller.
- the centralized storage device is connected to the host through a wired network or a wireless network.
- The network channel between the computing cluster 110 and the storage cluster 120, or between the centralized storage device and the host, is limited by factors such as cost and distance, and suffers from disadvantages such as relatively low network bandwidth and high latency. Therefore, for data-intensive applications such as OLAP applications and big data analysis applications, the network channel between the computing device where the application is located and the storage device has become one of the main performance bottlenecks. In view of this, how to reduce or avoid the performance overhead caused by data transmission over the network channel between the computing side and the storage side, and thereby improve the efficiency of data processing, has become an urgent requirement in the above application scenarios.
- the above exemplarily introduces the application scenario and the existing requirements of the application scenario.
- the storage device provided in this embodiment and the method executed by the storage device are specifically introduced below.
- the storage device and method provided in this embodiment can meet the requirements of the above application scenarios.
- In this embodiment, the data processing process is moved from the computing nodes in the computing cluster 110 to the storage nodes in the storage cluster 120. Because a storage node can access and process locally stored data without requesting remotely stored data through a network channel, the performance bottleneck caused by transmitting data between the computing cluster 110 and the storage cluster 120 over the network channel is avoided.
- the following embodiments can be used as a set of general near-data computing mechanisms to support the execution of data processing tasks generated by various applications such as database applications, big data applications, and AI applications, thereby increasing the flexibility of near-data computing.
- In some embodiments, each subtask is further pushed down to a solid state drive (SSD) or a dual in-line memory module (DIMM).
- In addition to the CPU, the storage device may include special processors such as a graphics processing unit (GPU), a neural-network processing unit (NPU), and a data processing unit (DPU).
- Each subtask can be scheduled for execution on the most suitable processor according to its computing characteristics and requirements, so as to make full use of the heterogeneous computing resources of the storage device and maximize the efficiency of data processing.
- the exemplary embodiment of the present application provides a storage device.
- the storage device is a storage node in a distributed storage system, such as the storage node 120a, the storage node 120b, and the storage node 120c in FIG. 1.
- the storage device is a centralized storage device.
- the storage device includes multiple processors, network cards, and storage media (Storage Media).
- the multiple processors include a central processing unit and multiple dedicated processors.
- the central processing unit is used to obtain data processing tasks, divide subtasks, and schedule each special processor.
- the storage node 120a is an example of a storage device
- the central processing unit 121 in the storage node 120a is an example of a central processing unit in the storage device.
- the dedicated processor is any processor other than the central processing unit.
- the special processor has computing power, and the special processor can use its own computing power to participate in the execution of subtasks.
- the GPU 122 and NPU 123 in the storage node 120a are examples of special processors in the storage device.
- the DPU 1272 in the DIMM 127 in the storage node and the DPU 1282 in the SSD 128 in the storage node are also examples of special processors in the storage device.
- The specific types of special processors cover a variety of situations, which are illustrated below through Case 1 and Case 2.
- Case 1: the dedicated processor is an independent chip.
- For example, the dedicated processor is a chip that can work independently, such as a GPU or an NPU.
- Case 2: the dedicated processor is a processor in an element included in the storage device.
- the dedicated processor can be integrated with other components of the storage device.
- the storage device includes a hard disk, and the special processor is a controller (SSD controller) of the hard disk.
- the SSD includes a processor, and the special processor may be a processor of the SSD.
- For example, the SSD includes the DPU, and the dedicated processor is the DPU 1282 in the SSD 128.
- SSDs that include processors are also called computing SSDs or smart SSDs.
- the storage device includes a DIMM, the DIMM includes a processor, and the dedicated processor is a processor of the DIMM.
- For example, the DIMM 127 includes the DPU 1272, and the dedicated processor is the DPU 1272 in the DIMM 127.
- DIMMs including processors are also called computing DIMMs or smart DIMMs.
- the dedicated processor is an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof.
- the above-mentioned PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a general array logic (generic array logic, GAL), or any combination thereof.
- the dedicated processor can be a single-core processor or a multi-core processor.
- the multiple specialized processors included in the storage device are heterogeneous processors.
- multiple specialized processors have different hardware architectures.
- multiple specialized processors support different instruction sets.
- a special processor included in the storage device supports the X86 instruction set
- another special processor included in the storage device supports the ARM instruction set.
- storage devices include CPU, GPU, NPU, DIMM, and SSD.
- CPU, GPU, NPU, DPU in DIMM, and DPU in SSD are examples of five heterogeneous processors.
- a variety of heterogeneous specialized processors can form a heterogeneous computing resource pool, and the central processing unit can schedule resources in the heterogeneous computing resource pool to perform tasks.
- the central processing unit and the dedicated processor are connected through a high-speed interconnection network, and the central processing unit and the dedicated processor communicate through the high-speed interconnection network.
- The high-speed interconnection network is, for example, a peripheral component interconnect express (PCIe) bus, a memory fabric, high-speed Ethernet, HCCS, InfiniBand (IB), or Fibre Channel (FC).
- the network card is used to provide data communication functions.
- the network card in the storage device is the network card 125 in the storage node 120a.
- the storage medium is used to store data.
- the storage medium is the hard disk 124 in the storage node 120a.
- the hard disk 124 is used to store data.
- The hard disk 124 is, for example, a solid state drive (SSD) or a hard disk drive (HDD).
- For example, the hard disk is an SSD 128; the SSD 128 includes at least one flash memory chip 1281, and the flash memory chip 1281 is used for persistent storage of data.
- Alternatively, the storage medium may also be the DRAM chip 1271 in the DIMM 127.
- the storage device further includes a storage interface (Storage Interface) 126.
- the storage interface 126 is used to provide a data access interface to an upper layer (such as an application of a processor of the storage device and an application of a computing node).
- the storage interface 126 is a file system interface or a key-value (KV) interface.
- the storage node includes an NDP Execution Engine 20 (NDP Execution Engine), and the NDP Execution Engine 20 is software on the storage node.
- the NDP execution engine 20 runs in the central processing unit of the storage node.
- the NDP execution engine 20 runs in the controller of the storage node.
- the NDP execution engine 20 includes a parser (Parser) 201 and an executor (Executor) 202.
- the parser 201 is used to parse the definition information 203 describing the NDP task to generate a topology map 204.
- The executor 202 is used to schedule each special processor and the central processing unit to execute subtasks according to the topology map 204. For example, in FIG. 3, the executor 202 schedules the CPU to perform subtask a, the GPU to perform subtask c, the NPU to perform subtask b, the DPU in the DIMM to perform subtask e, and the DPU in the SSD to perform subtask d.
- both the parser 201 and the executor 202 are software.
- the parser 201 and the executor 202 are functional modules generated after the central processing unit of the storage node reads the program code.
- the system architecture has been introduced above, and the method 300 and the method 400 are used to exemplarily introduce the method flow of performing tasks based on the system architecture provided above.
- FIG. 4 is a flowchart of a task execution method 300 provided by an embodiment of the present application.
- the method 300 is executed by a storage device.
- the method 300 is executed by a storage node in a distributed storage system.
- the method 300 is executed by the storage node 120a, the storage node 120b, and the storage node 120c in the system architecture shown in FIG. 1.
- the method 300 is executed by a centralized storage device.
- the data processed in the method 300 is data generated and maintained by an application of the host in the system architecture shown in FIG. 1.
- the application of the host generates a data processing task according to the data it needs to process, takes the data processing task as the input of the storage device, and triggers the storage device to execute the following steps S310 to S340.
- the method 300 includes S310 to S340.
- the central processing unit obtains a data processing task.
- the data processing task is the task of processing the data stored in the storage device.
- the data processing task is an NDP task.
- the types of data processing tasks include many situations.
- For example, data processing tasks include multi-table joint query tasks generated by OLAP applications, model training tasks generated by AI applications, high-performance computing tasks generated by HPC applications, big data analysis tasks generated by big data analysis applications (such as physical experiment data analysis tasks and meteorological data analysis tasks), transaction processing tasks generated by OLTP applications, and so on.
- the data processing task comes from the computing device. Specifically, the computing device generates a data processing task, sends the data processing task to the storage device, and the central processing unit of the storage device receives the data processing task. In this way, data processing tasks are pushed down from the computing device to the storage device for execution, thereby realizing near-data processing.
- the application in the host generates an NDP task; the application sends a task push request to the NDP coordination module, the task push request carries the NDP task, and the task push request is used to request the task to be sent to the storage device.
- the NDP coordination module sends the NDP task to the storage device in response to the task push request, so that the storage device obtains the NDP task.
- the data to be processed by the data processing task is stored in a storage device.
- the computing device determines the storage device where the data is located according to the attribution location of the data, and sends a data processing task to the storage device where the data is located, so that the storage device can schedule a local processor to process the local data.
- how the computing device determines the storage device where the data is located includes multiple implementation methods. For example, in the case where the data is a file, the storage device where the file is located is determined by the ID of the file. For another example, when the data is a key-value pair, the key is used to determine the storage device where the data is located.
- the process of determining the storage device where the data is located involves the interaction between the computing device and the data management apparatus. Specifically, the computing device sends a query request to the data management apparatus, and the query request includes the ID or key of the file.
- the data management apparatus queries the node where the data is located in the storage cluster according to the ID or key of the file, and sends a query response to the computing device, and the query response includes the identification of the storage device.
- the computing device receives the query response and determines the storage device where the data is located.
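- As a concrete illustration of this lookup, the following is a minimal Python sketch, assuming the data management apparatus keeps a simple mapping from file ID (or key) to storage node ID; all identifiers here are hypothetical.

```python
# Hypothetical mapping maintained by the data management apparatus:
# file ID (or key) -> ID of the storage node holding the data.
node_of_data = {"file-0017": "DN-2", "key-report-2021": "DN-1"}

def locate(data_id: str) -> str:
    """Return the query response: the identifier of the storage device
    where the data resides."""
    return node_of_data[data_id]

# The computing device then pushes the NDP task to locate("file-0017"),
# i.e. storage node DN-2, so that a local processor can process local data.
```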
- the data processing task is described in a declarative language.
- Declarative language is a programming paradigm, which is opposite to imperative programming.
- the declarative language describes the goal of the data processing task, that is, instructs the storage device what operation to perform, but does not clearly indicate how the operation should be performed.
- the data processing task is an NDP task, and the developer has designed a declarative language to describe the NDP task, which is called the NDP description language.
- the application can define the NDP task that needs to be pushed down to the storage device through the NDP description language, and obtain the definition information of the NDP task.
- the definition information of the NDP task includes the input parameters of the NDP task, the operations that the NDP task needs to perform, and the output result of the NDP task.
- the NDP task structure defined by the NDP description language is as follows:
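- The structure listing itself is not reproduced above; based on the elements named earlier (input parameters, operations to perform, output result), a hedged reconstruction of its general shape might look like the following, where all field names are assumptions rather than the patent's exact syntax:

```
NDPTask <task name> {
    Input:  <input parameters>        // e.g. the ID or address of the data to process
    Ops:    <operations to perform>   // e.g. filter, aggregate, compress
    Output: <output result>           // e.g. the result returned to the application
}
```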
- the central processing unit divides the data processing task into multiple subtasks.
- Subtasks include but are not limited to functions or calculation steps.
- The unit for dividing subtasks can take several forms, which are illustrated below through Way 1 and Way 2.
- Way 1: a function is used as the smallest unit for dividing subtasks. The central processing unit divides the data processing task into multiple functions.
- a subtask is a function; or, a subtask includes multiple functions.
- Way 2: a calculation step is used as the smallest unit for dividing subtasks.
- the central processing unit divides the data processing task into multiple functions, and divides each function into multiple calculation steps.
- a subtask is a calculation step; or, a subtask includes multiple calculation steps. Since the data processing task is decomposed into functions and further decomposed into calculation steps, the layer-by-layer decomposition of tasks is realized, which makes the granularity of sub-tasks more refined, which helps to improve the flexibility of scheduling sub-tasks.
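- The layer-by-layer decomposition can be sketched as follows in Python; the data structures are illustrative assumptions, not the patent's, and the flag simply switches between the two granularities described above.

```python
from dataclasses import dataclass

@dataclass
class Step:
    name: str
    calc_mode: str   # calculation mode, used later when matching processors

@dataclass
class Function:
    name: str
    steps: list      # calculation steps that make up the function

def divide(functions: list, fine_grained: bool) -> list:
    """Way 1: one subtask per function. Way 2: one subtask per calculation step."""
    if not fine_grained:
        return list(functions)
    return [step for func in functions for step in func.steps]
```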
- the subtasks are divided according to the calculation mode.
- the central processing unit divides the data processing task into multiple subtasks according to the calculation mode of the function or calculation step included in the data processing task, and each subtask has the same calculation mode.
- the data processing task includes function A and function B.
- Function A is complicated, and function A includes multiple calculation modes.
- the function B is relatively simple, and the function B has only one calculation mode.
- The central processing unit divides function A into multiple calculation steps, each calculation step having one calculation mode, and treats each calculation step of function A as a subtask; the central processing unit treats function B as a single subtask. Since the subtasks are divided according to calculation mode, it is convenient to allocate a suitable special processor to each subtask according to its calculation mode.
- the subtasks are divided according to the definition information of the function.
- The central processing unit divides the data processing task into multiple subtasks according to the definition information of each function in the data processing task. For example, when writing a function, the developer indicates each calculation step included in the function, for instance by adding keywords at code line A and code line B in the function to indicate that the program code between them corresponds to a single calculation step that can be scheduled onto a special processor. The central processing unit then splits off the program code between code line A and code line B as a subtask according to the definition information of the function.
- the central processing unit allocates the first subtask among the multiple subtasks to the first special processor according to the attributes of each subtask.
- the first special processor executes the first subtask.
- This embodiment relates to how the central processing unit allocates the first subtask to the first special processor, and the process of the central processing unit allocating other subtasks to other special processors is the same.
- the first subtask is one of the multiple subtasks.
- the first special processor is one of the special processors.
- The first dedicated processor is, for example, the GPU, the NPU, the DPU in the DIMM, or the DPU in the SSD.
- This embodiment does not limit the first special processor to executing only the first subtask. The central processing unit may also allocate subtasks other than the first subtask to the first special processor.
- the central processing unit allocates part of the subtasks to itself for execution. For example, the central processing unit selects the second subtask from a plurality of subtasks, and the central processing unit executes the second subtask.
- the central processing unit assigns different subtasks of the multiple subtasks to different special processors, thereby scheduling different special processors to execute different subtasks respectively.
- the divided subtasks include subtask a, subtask b, subtask c, and subtask d.
- the central processing unit allocates subtask a to NPU, subtask b to GPU, and subtask c to DIMM.
- the DPU in the SSD allocates subtask d to the DPU in the SSD.
- the number of subtasks allocated by the central processing unit to different special processors is the same. For example, the central processing unit evenly allocates all the divided subtasks to each special processor.
- the number of subtasks allocated by the central processing unit to different special processors is different.
- the central processing unit combines the current computing power of each special processor to allocate more subtasks to special processors with idle computing power, and fewer subtasks to special processors with insufficient computing power, or not to Special processors with insufficient computing power allocate subtasks.
- For example, the central processing unit determines the computing resources of the first special processor and checks whether they are below a set threshold. If the computing resources of the first special processor are above the threshold, the central processing unit determines that the first special processor has idle computing power and allocates a first number of subtasks to it. If the computing resources of the first special processor are below the threshold, the central processing unit determines that the first special processor's computing power is insufficient, and either allocates no subtasks to it or allocates fewer than the first number of subtasks.
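- A minimal sketch of this threshold rule in Python follows; the intermediate "partial headroom" branch and all numbers are assumed policy choices, not taken from the patent.

```python
def subtasks_to_allocate(free_resources: float, threshold: float,
                         first_number: int) -> int:
    """Decide how many subtasks a special processor receives."""
    if free_resources > threshold:        # computing power is idle
        return first_number
    if free_resources > threshold / 2:    # assumed policy: partial headroom
        return first_number // 2          # fewer than the first number
    return 0                              # insufficient: allocate nothing
```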
- This embodiment also does not limit the first subtask to being executed by only one processor, namely the first special processor.
- the first special processor undertakes all the calculations of the first subtask and executes all the steps of the first subtask.
- the first dedicated processor and the central processing unit cooperate to participate in the calculation of the first subtask.
- the first special processor executes part of the steps in the first subtask
- the central processor executes another part of the steps in the first subtask.
- For example, the first special processor monitors its remaining computing resources in real time while executing the first subtask.
- If the first special processor determines that its own computing power is insufficient, it sends the results it has already computed, together with the unexecuted remainder of the first subtask, to the central processing unit, and the central processing unit continues executing the remainder of the first subtask based on those results.
- Alternatively, the first dedicated processor cooperates not with the central processing unit but with other dedicated processors.
- multiple specialized processors included in the storage device respectively have corresponding characteristics and are good at performing different tasks.
- the central processing unit can combine the characteristics of the special processor to allocate the tasks suitable for the special processor to the special processor, so as to give full play to the respective performance advantages of the special processor.
- The following uses (1) to (5) to illustrate, based on the specific characteristics of each special processor, examples of the subtasks suited to it.
- GPU is a single instruction multiple data (Single Instruction Multiple Data, SIMD) processor.
- the GPU architecture includes thousands of simple processing cores, and the GPU can perform a large number of the same operations through thousands of cores working at the same time.
- Each processing core of the GPU is more suited to calculation than to control.
- Therefore, tasks whose calculation mode is simple and whose data concurrency is large can be assigned to the GPU, so that the GPU is scheduled to execute them.
- performing matrix multiplication is a subtask with a simple calculation model and a large amount of data concurrency.
- the matrix multiplication operation is composed of a large number of vector multiplication operations.
- the vector multiplication operation is a simple operation.
- the vector multiplication operation specifically includes multiplying rows and columns, and adding the resulting products together.
- the subtasks of the matrix multiplication operation are allocated to the GPU.
- each processing core of the GPU will perform vector multiplication separately.
- The GPU uses thousands of processing cores to perform vector multiplications at the same time, so the execution of the entire matrix multiplication subtask is accelerated, which helps to improve the efficiency of executing it.
- the matrix multiplication operation is an example of subtasks suitable for allocation to the GPU, and the GPU is also suitable for performing subtasks other than the matrix multiplication operation.
- the convolution operation in the neural network is also suitable for execution by the GPU, and the GPU can be scheduled to perform the subtasks of the convolution operation.
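- To make the matrix multiplication example above concrete, the following Python sketch shows how the operation decomposes into many independent row-by-column dot products, one per output element; on a GPU, each of these would be computed by its own thread concurrently (plain Python stands in for the per-core work here).

```python
def matmul(A, B):
    """Multiply matrices given as nested lists: A is n x k, B is k x m."""
    n, k, m = len(A), len(B), len(B[0])
    # Each output cell (i, j) is an independent vector multiplication:
    # multiply row elements by column elements and add the products.
    return [[sum(A[i][t] * B[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]

# matmul([[1, 2]], [[3], [4]]) -> [[11]]; with large matrices, the n*m
# independent dot products are what the GPU's cores execute simultaneously.
```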
- the NPU is specifically designed for AI.
- the NPU includes modules required for AI calculations such as multiplication and addition, activation functions, two-dimensional data operations, and decompression.
- tasks of neural network operations are assigned to the NPU, and the NPU can use its own modules to accelerate the neural network operation tasks.
- DPU is a programmable electronic component used to process data.
- DPU has the versatility and programmability of CPU, but DPU is more specific than CPU.
- DPU can run efficiently on network data packets, storage requests or analysis requests.
- DPU has a greater degree of parallelism than CPU (that is, DPU can handle a large number of concurrent requests).
- For example, the DPU is scheduled to provide a data offload service for the global memory pool: address indexing, address querying, partition functions, and data filtering and scanning operations are assigned to the DPU.
- DIMM includes DPU and DRAM chips (DRAM chips).
- DPU can quickly access DRAM, process the data stored in DRAM, and complete tasks nearby.
- Among the processors of the storage device, the DPU in the DIMM is closest to the data, that is, it has the greatest data affinity.
- the task can be assigned to the DPU of the DIMM.
- the DPU in the DIMM is scheduled to perform tasks with irregular memory access and a relatively large amount of memory access, so as to utilize the performance advantage of the DPU to access the DRAM and save the time overhead of accessing the memory.
- the DPU in the DIMM is a processor dedicated to performing specific operations and can only perform fixed types of calculations. In this case, the DPU of the DIMM is scheduled to perform tasks corresponding to these fixed types of calculations.
- The above takes the case where the processor included in the DIMM is a DPU as an example. When the processor of the DIMM is a type of processor other than a DPU, the same strategy can be used to distribute tasks to that type of processor.
- SSD includes DPU and flash chips (Flash chips).
- the DPU of SSD can quickly access the flash chips, process the data stored in the flash chips, and complete tasks nearby.
- the DPU in the SSD can be scheduled to perform the task.
- the high bandwidth inside the SSD can be fully utilized.
- multiple SSDs can be scheduled to perform tasks in parallel, so as to utilize the concurrent processing capabilities between multiple SSDs to speed up the execution of tasks.
- the DPU of the SSD is scheduled to perform tasks whose calculation mode is simple and the amount of output data can be greatly reduced, such as filtering operations.
- the DPU in the SSD is a processor dedicated to performing specific operations and can only perform fixed types of calculations. In this case, the DPU of the SSD is scheduled to perform tasks corresponding to these fixed types of calculations.
- The above takes the case where the processor included in the SSD is a DPU as an example. When the processor of the SSD is a type of processor other than a DPU, the same strategy can be adopted to distribute tasks to that type of processor.
- The following uses Scheduling Strategy 1 to Scheduling Strategy 4 to illustrate specific examples of how to schedule a special processor.
- Scheduling Strategy 1: scheduling according to the home location of the data.
- the first scheduling strategy is also called scheduling according to the affinity of the data.
- The implementation of Scheduling Strategy 1 includes: the central processing unit determines the address of the data involved in each subtask; according to the address of the data involved in the first subtask, the central processing unit selects, from the plurality of special processors, the special processor closest to that data as the first special processor; and the central processing unit assigns the first subtask to that processor.
- the address of the data is, for example, the logical address of the data or the physical address of the data.
- the address of the data is determined by the metadata of the data, for example.
- In other words, the central processing unit schedules the subtask onto the processor of the device whose storage medium holds the data.
- the special processor closest to the data is the processor integrated with the storage medium where the data is located. For example, if the data is located in the SSD, the central processing unit allocates subtasks to the DPUs in the SSD, thereby scheduling the DPUs of the SSD to perform the subtasks. If the data is located in the DIMM, the central processing unit allocates subtasks to the DPU in the DIMM, thereby scheduling the DPU of the DIMM to perform the subtask.
- the subtasks are scheduled to be executed on the special processor closest to the data.
- the special processor can access the data and process the data nearby, thus reducing the time delay and performance overhead caused by data movement, and improving the efficiency and speed of data processing.
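- A minimal sketch of this data-affinity rule in Python follows; the address-to-medium resolution is an illustrative assumption (a real device would consult its address-mapping metadata), and the processor names echo the examples above.

```python
def resolve_medium(address: str) -> str:
    # Hypothetical stub: map a data address to its storage medium.
    # A real implementation would consult address-mapping metadata.
    return "ssd" if address.startswith("ssd://") else "dimm"

def nearest_processor(data_address: str) -> str:
    """Pick the processor integrated with the medium holding the data."""
    medium = resolve_medium(data_address)
    if medium == "ssd":
        return "DPU-in-SSD"    # process flash-resident data in place
    if medium == "dimm":
        return "DPU-in-DIMM"   # process DRAM-resident data in place
    return "CPU"               # fallback when no co-located processor exists

# nearest_processor("ssd://lun0/0x7f3a") -> "DPU-in-SSD"
```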
- Scheduling Strategy 2: scheduling according to the computing characteristics of subtasks.
- the calculation characteristics of the subtask include the calculation mode of the subtask and/or the concurrency of the subtask.
- The implementation of Scheduling Strategy 2 includes: the central processing unit determines the calculation mode and/or concurrency of each subtask, selects from the multiple special processors a special processor matching that calculation mode and/or concurrency as the first special processor, and allocates the first subtask to it. For example, when the calculation mode of a subtask is simple and its concurrency is large, the central processing unit selects the GPU, and the subtask is allocated to the GPU.
- the computational characteristics of the subtask include the type of algorithm required to perform the subtask.
- In this case, the implementation of Scheduling Strategy 2 includes: the central processing unit selects, from the multiple special processors, a special processor suited to running the type of algorithm required to execute the subtask. For example, if the subtask is face recognition, which requires a neural network algorithm, and the storage device happens to be equipped with an NPU that executes neural network algorithms, the central processing unit selects the NPU, and the NPU is scheduled to perform face recognition through the neural network algorithm.
- the subtask is image compression
- the storage device is just equipped with a dedicated chip for image compression
- the central processing unit dispatches the dedicated chip to perform image compression.
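- The matching logic of Scheduling Strategy 2 can be sketched as follows in Python; the concurrency threshold and the capability rules are illustrative assumptions drawn from the examples above.

```python
def pick_by_characteristics(calc_mode: str, concurrency: int,
                            algorithm: str) -> str:
    """Match a subtask's computing characteristics to a special processor."""
    if algorithm == "neural-network":
        return "NPU"    # e.g. face recognition via a neural network algorithm
    if algorithm == "image-compression":
        return "image-compression-chip"   # dedicated chip, if equipped
    if calc_mode == "simple" and concurrency > 1000:   # assumed threshold
        return "GPU"    # simple calculation mode, large concurrency
    return "CPU"        # no specialized match: keep it on the CPU

# pick_by_characteristics("simple", 4096, "matmul") -> "GPU"
```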
- Scheduling Strategy 3: scheduling according to the definition information of the subtasks.
- The implementation of Scheduling Strategy 3 includes: the central processing unit obtains the definition information of each subtask; according to the definition information of the first subtask, the central processing unit selects, from the multiple special processors included in the storage device, the special processor indicated by that definition information as the first special processor; and the central processing unit allocates the first subtask to the first special processor.
- the definition information of the first subtask includes the identification of the first special processor.
- the identification of the first special processor is, for example, the name of the first special processor. For example, when the definition information of the first subtask includes "GPU", it indicates that the first subtask is executed by the GPU. Since the definition information includes the identifier of the first special processor, it can specify that the first subtask is to be executed by the first special processor.
- This embodiment does not limit the definition information of the first subtask to including only the identifier of the first special processor.
- the definition information of the first subtask further includes the identifiers of processors other than the first dedicated processor.
- the definition information of the first subtask includes the identifier of each of the multiple processors, thereby indicating that there are multiple processors to choose from when assigning the first subtask.
- the central processing unit selects the first special processor from the multiple processors indicated by the definition information according to the definition information of the first subtask.
- The definition information is also used to indicate the priority of each of the multiple processors. The central processing unit selects, from the processors indicated by the definition information, the processor with the highest priority as the first special processor; or, when the computing power of the highest-priority processor indicated by the definition information is insufficient, the central processing unit selects the processor with the next highest priority as the first special processor.
- the order of the processor identifiers in the definition information indicates the priority of different processors. For example, the identifier of the first special processor in the definition information is located before the identifier of the second special processor, which means that the first special processor has a higher priority than the second special processor. For example, if the definition information includes [GPU,NPU], it means that the GPU has a higher priority than the NPU. If the definition information includes [NPU,GPU], it means that the NPU has a higher priority than the GPU.
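- This priority rule can be sketched in a few lines of Python; `has_capacity` is a hypothetical predicate standing in for the computing-power check, and the CPU fallback is an assumption.

```python
def choose_processor(processors_in_definition: list, has_capacity) -> str:
    """Earlier identifiers in the definition information have higher priority;
    fall through to the next one when the preferred processor lacks capacity."""
    for proc in processors_in_definition:     # e.g. ["GPU", "NPU"]
        if has_capacity(proc):
            return proc
    return "CPU"                              # assumed fallback

# choose_processor(["GPU", "NPU"], lambda p: p == "NPU") -> "NPU"
# (the NPU is chosen because the higher-priority GPU lacks capacity).
```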
- How to obtain the definition information of the subtask includes multiple implementation methods.
- For example, the developer specifies that the special processor suitable for executing the first subtask is the first special processor: while writing the program code of the first subtask, the developer inputs the identifier of the first special processor and other information to produce the definition information of the subtask, and saves that definition information to the storage device. When assigning the subtask, the central processing unit reads the pre-saved definition information of the first subtask.
- the first subtask is a function
- the developer defines the syntax of the function, and the definition information of the specified function needs to include the identifier of the special processor.
- the developer writes a set of NDP description language.
- the NDP description language presets some functions or calculation steps for general computing scenarios, and specifies corresponding heterogeneous processors for these functions or calculation steps, so as to pass Heterogeneous processors perform accelerated processing.
- Different functions are separately scheduled to execute on heterogeneous processors (such as the GPU, NPU, and DIMM). Because different application scenarios involve different functions or calculation steps, the NDP description language supports extending NDP's computing capabilities by defining new functions.
- the developer needs to specify the type of data set corresponding to the function, the types of input parameters, output parameters, and one or more special processors that are most suitable for the function.
- For example, the syntax for defining a function in the NDP description language is as follows.
- Decl Func ⟨function name⟩ of Dataset ⟨data set type name⟩ (arg list) [Processor 1, Processor 2, ...] // Note: this line is the declaration statement of the function, indicating the function name, the data set type name, and the processors that execute the function. Decl is an abbreviation of declaration, Func of function, and arg of argument.
- Begin
- End // Note: the part between Begin and End is the function body, which includes the program code that implements the function.
- for example, the definition information of a compression function written in the above syntax is as follows.
- Decl Func Compress of Dataset Table("LZ4") [GPU,CPU] // Note: this is the declaration statement of the compression function, indicating that the function name is Compress, the data set to be processed is of the Table type, the algorithm used to execute the compression function is the LZ4 compression algorithm, the GPU and the CPU are both suited to executing the function, and scheduling the GPU is considered first, with the CPU second.
- Ret Table // Note: this line indicates that the type of the function's output parameter is Table.
- Begin
- ……
- End
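- purely as an illustrative sketch, and not the grammar of the embodiments, the following Python snippet shows how a parser might pull the function name, data set type, arguments, and processor priority list out of such a declaration line; the regular expression and field names are assumptions.

    import re

    DECL = 'Decl Func Compress of Dataset Table("LZ4") [GPU,CPU]'

    # Tokenize the declaration statement into its four parts.
    pattern = re.compile(
        r'Decl Func\s+(?P<name>\w+)\s+of Dataset\s+(?P<dataset>\w+)'
        r'\s*\((?P<args>[^)]*)\)\s*\[(?P<procs>[^\]]*)\]'
    )
    m = pattern.match(DECL)
    decl = {
        "name": m.group("name"),             # function name: Compress
        "dataset_type": m.group("dataset"),  # data set type: Table
        "args": [a.strip() for a in m.group("args").split(",")],
        "processors": [p.strip() for p in m.group("procs").split(",")],
    }
    # The processor list keeps its order, which encodes priority: GPU first.
    assert decl["processors"] == ["GPU", "CPU"]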
- by adopting scheduling strategy 3, on the one hand, the developer can specify in the definition information which processor executes a subtask, so that the subtask is scheduled onto the special processor the developer designated, meeting the developer's customization requirements.
- on the other hand, as the computing power of storage devices grows and service demands increase, when a new task needs to run on the storage device, adding the identifier of a special processor to the new task's definition information is enough to indicate which special processor the new task should be scheduled on. This lowers the difficulty of scheduling new tasks and therefore improves scalability.
- Scheduling strategy 4: scheduling according to the data set type corresponding to the subtask.
- in some embodiments, scheduling strategy 4 is implemented as follows: the central processing unit determines the data set type corresponding to each subtask; according to the data set type corresponding to the first subtask, the central processing unit selects, from the multiple special processors included in the storage device, a special processor matching that data set type as the first special processor; and the central processing unit allocates the first subtask to the first special processor.
- the data set types include, but are not limited to, the relational data table (Table, covering both row storage and column storage) type, the image (Image) type, the text (Text) type, and so on.
- for example, the first subtask is compression and the available processors include the GPU and the CPU. If the data set type corresponding to the compression is image, then because the processor matching images is the GPU, the central processing unit selects the GPU and allocates the image-compression subtask to the GPU.
- the data set type corresponding to a subtask can be determined in multiple ways. In some embodiments, it is determined according to the definition information of the subtask, which includes the name of the data set type.
- optionally, the data set type is a developer-defined type. When writing program code, the developer uses a declaration statement to declare the custom data set type so that it can be referenced in the definition information of a subtask. For example, the syntax for declaring a custom data set type is:
- Decl Dataset <data set type name>;
- for example, the statement Decl Dataset Foo; declares a data set type named Foo.
- in addition, optionally, a binding relationship is established between each data set type and its corresponding functions. For example, a binding relationship is established between text-type data sets and the Count function. If a data set of the Table type requests to call the Count function, the call is invalid; if a data set of the text type requests to call the Count function, the call is allowed. In this way, the correct function is guaranteed to be called when the data in a data set is processed.
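- a minimal runnable sketch of such a binding check follows; the in-memory table and helper name are assumed representations for illustration, not structures defined by the embodiments.

    # Each function name is bound to the data set types allowed to call it.
    bindings = {"Count": {"Text"}}

    def call_allowed(func_name, dataset_type):
        # A call is valid only if the data set type is bound to the function.
        return dataset_type in bindings.get(func_name, set())

    assert call_allowed("Count", "Text") is True    # text data set: allowed
    assert call_allowed("Count", "Table") is False  # Table data set: invalid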
- different special processors are suited to processing different types of data; for example, GPUs are suited to processing images, and some dedicated codec processors are suited to processing videos.
- by adopting scheduling strategy 4, whether the type of data a subtask is to process matches the special processor itself is taken into account, and the subtask is scheduled onto a special processor matching its data set type. The special processor therefore processes data it is suited to handle, failures caused by a special processor being unable to recognize or process a particular type of data are avoided, and the success rate of task execution is improved.
- in some embodiments, different scheduling strategies have different priorities, and the central processing unit decides which strategy to use according to each strategy's priority. For example, among scheduling strategies 1, 2, and 3, scheduling strategy 1 has the highest priority, and scheduling strategies 2 and 3 come second.
- with this priority order, the central processing unit considers the home location of the data first, and the calculation features and the definition information of the subtasks second. For example, the central processing unit first determines whether the data is located in the DIMM or the SSD; if the data is in the DIMM or the SSD and the DIMM or SSD supports executing the task, the subtask is allocated to the processor of the DIMM or the SSD according to scheduling strategy 1.
- if the data is in neither the DIMM nor the SSD, the central processing unit selects a special processor according to the calculation features or the definition information of the subtask, following scheduling strategy 2 or scheduling strategy 3, loads the data into the memory of the selected special processor, and schedules the selected special processor to execute the subtask.
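- the following runnable Python sketch traces this priority order among strategies 1 to 3; every helper passed in is an assumed stand-in used only to make the illustration executable, not an interface of the embodiments.

    def schedule(subtask, data_location, device_supports, fallback_select, load):
        # Strategy 1 first: if the data already sits in the DIMM or SSD and
        # that device supports the subtask, execute near the data.
        loc = data_location(subtask)
        if loc in ("DIMM", "SSD") and device_supports(loc, subtask):
            return loc + " processor"
        # Otherwise fall back to strategy 2 or 3: choose by calculation
        # features or by definition information, then stage the data set
        # in the chosen processor's memory before scheduling it.
        proc = fallback_select(subtask)
        load(proc, subtask)
        return proc

    # Usage: data on an SSD that supports the subtask stays on the SSD.
    picked = schedule("filter", lambda s: "SSD", lambda loc, s: True,
                      lambda s: "GPU", lambda p, s: None)
    assert picked == "SSD processor"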
- in some embodiments, the central processing unit performs scheduling according to the execution order of the multiple subtasks recorded in a topology graph. For example, the central processing unit instructs, according to the topology graph, the first special processor to execute the first subtask in order.
- the topology graph is used to indicate the multiple subtasks and the order in which different subtasks are executed.
- the topological graph includes multiple nodes and at least one edge.
- Each node in the plurality of nodes is used to represent one subtask among the plurality of subtasks.
- for example, when the subtask is a function, the node contains the operation corresponding to the function, the input parameters of the function, the output parameters of the function, and the special processor that executes the function.
- the edges connect the nodes corresponding to different subtasks. Each edge is used to represent the dependency between different subtasks.
- the topology graph is a directed acyclic graph (DAG). DAG refers to a directed graph without loops.
- in some embodiments, the direction of the edges in the topology graph records the execution order of the subtasks. For example, if there is an edge between a first node and a second node in the topology graph whose direction runs from the first node to the second node (that is, the starting point of the edge is the first node and the end point is the second node), then the subtask corresponding to the second node is executed first, and the subtask corresponding to the first node is executed later.
- for example, referring to FIG. 3, the topology graph is DAG 204, and the subtasks represented by the nodes in DAG 204 are functions. The DAG 204 shown in FIG. 3 includes five nodes, namely node a, node b, node c, node d, and node e.
- node a represents function a
- node b represents function b
- node c represents function c
- node d represents function d
- node e represents function e.
- the topological graph has four edges, namely the edge from node a to node c, the edge from node a to node b, the edge from node a to node d, and the edge from node c to node e.
- the dependency relationship and execution order of the functions recorded by DAG204 in FIG. 3 are: function d and function e are executed first.
- Function b and function c depend on function e, and function b and function c are executed after function e is executed.
- Function a depends on function b, function c, and function d, and function a is executed last.
- according to this DAG 204, the central processing unit first instructs the DPU in the DIMM to execute function e and instructs the DPU in the SSD to execute function d; when function e completes, the central processing unit instructs the NPU to execute function b and instructs the GPU to execute function c; after function b, function c, and function d have all completed, the central processing unit executes function a.
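- the following runnable sketch replays this schedule from a dependency map; the map encodes the dependencies stated above (function a depends on b, c, and d, while functions b and c depend on e), and the processor assignments simply mirror the example.

    # Dependencies: each function maps to the set of functions it waits for.
    deps = {"a": {"b", "c", "d"}, "b": {"e"}, "c": {"e"},
            "d": set(), "e": set()}
    assigned = {"e": "DPU in DIMM", "d": "DPU in SSD",
                "b": "NPU", "c": "GPU", "a": "CPU"}

    done, rounds = set(), []
    while len(done) < len(deps):
        # Every function whose dependencies have all completed can run now.
        ready = sorted(f for f in deps if f not in done and deps[f] <= done)
        rounds.append([(f, assigned[f]) for f in ready])
        done |= set(ready)

    # Round 1 runs d and e, round 2 runs b and c, round 3 runs a,
    # matching the order described for DAG 204.
    assert [sorted(f for f, _ in r) for r in rounds] == [["d", "e"], ["b", "c"], ["a"]]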
- in some embodiments, after receiving a task's definition information, the storage device parses that definition information to generate the topology graph. For example, referring to FIG. 3, after the storage device receives the definition information of the NDP task sent by the computing device, the parser 201 (Parser) parses the definition information of the NDP task to generate DAG 204, so that each subtask of the NDP task is represented by DAG 204.
- the DAG 204 output by the parser 201 is sent to the executor 202 (Executor) included in the storage device. According to the DAG 204, the executor 202 sequentially schedules each step or each function in the NDP task to the corresponding special processor for execution, and controls the data flow between each step or each function.
- by parsing the task's definition information into a topology graph and using the topology graph for scheduling, on the one hand, because the topology graph records the execution order of the subtasks, the central processing unit does not need to recompute that order and can schedule directly in the order the topology graph records, reducing the scheduling workload.
- on the other hand, many scheduling optimization algorithms based on topology graphs already exist; these can be invoked to optimize the order in which subtasks are scheduled, shortening the overall execution time of the task.
- in some embodiments, for any one of the divided subtasks, after selecting a matching special processor for the subtask, the central processing unit determines whether the selected special processor is programmable. If the selected special processor is programmable, the central processing unit generates instructions that can execute on it: for example, if the selected special processor supports the X86 instruction set, X86 instructions are generated; if it supports the ARM instruction set, ARM instructions are generated. The central processing unit instructs the selected special processor to execute the instructions to complete the subtask, and caches the generated instructions. The next time the central processing unit schedules the subtask onto that special processor, it can invoke the pre-cached instructions to execute the subtask, eliminating the instruction generation step. If the selected special processor is not programmable, the corresponding hardware calculation module in the special processor is called to execute the subtask.
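- a compact sketch of this generate-then-cache behavior follows; the cache key and the codegen callable are assumptions chosen for illustration, not mechanisms defined by the embodiments.

    instruction_cache = {}

    def instructions_for(subtask, processor, codegen):
        # Generate instructions only the first time this subtask is
        # scheduled on this processor; later schedules reuse the cache,
        # eliminating the instruction generation step.
        key = (subtask, processor)
        if key not in instruction_cache:
            instruction_cache[key] = codegen(subtask, processor)
        return instruction_cache[key]

    calls = []
    def fake_codegen(subtask, processor):
        calls.append((subtask, processor))
        return "code for %s on %s" % (subtask, processor)

    instructions_for("filter", "GPU", fake_codegen)
    instructions_for("filter", "GPU", fake_codegen)  # served from the cache
    assert len(calls) == 1  # codegen ran exactly once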
- in some embodiments, an application deployed in the computing cluster uses the NDP description language to define the following NDP task.
- the NDP coordination module queries the Data Scheme Service according to the fileID, obtains the storage node to which the file corresponding to the fileID belongs, and forwards the NDP task to the storage node.
- the definition information of the above NDP task describes three functions to be executed in the NDP task, namely the decompress function, the filter function and the count function.
- after the storage node to which the file belongs receives the NDP task, the parser parses the description language according to the definition information of the NDP task, generating the topology graph shown in FIG. 5.
- during scheduling, according to the location of the data set and the calculation features of the functions, the storage node schedules the decompress function to execute near the data on the SSD, and then schedules the filter function and the count function to execute on the GPU.
- after the decompress function completes, the data set is loaded into the memory of the GPU, and instructions that the filter function and the count function can execute on the GPU are generated, completing the functions.
- the data reading process involved in the above process is implemented by calling a data reading function.
- the data read function is a system-defined function, used to read data from the file system, object storage and other storage systems, and return a data set object.
- the data reading interfaces include the following:
- RD_File(fileID, offset, length);
- RD_Object(key);
- RD_Plog(PlogID, offset, length).
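- the following runnable stand-ins illustrate only the shape of these read interfaces; the Dataset class and the in-memory stores are assumptions made solely so the sketch executes, whereas a real implementation would read the file system, object storage, or Plog store.

    class Dataset:
        # The data set object the read interfaces return.
        def __init__(self, payload):
            self.payload = payload

    _FILES = {"f1": b"abcdefgh"}       # assumed stand-in for a file system
    _OBJECTS = {"k1": b"object-data"}  # assumed stand-in for object storage
    _PLOGS = {"p1": b"plog-data"}      # assumed stand-in for a Plog store

    def RD_File(fileID, offset, length):
        return Dataset(_FILES[fileID][offset:offset + length])

    def RD_Object(key):
        return Dataset(_OBJECTS[key])

    def RD_Plog(PlogID, offset, length):
        return Dataset(_PLOGS[PlogID][offset:offset + length])

    assert RD_File("f1", 0, 4).payload == b"abcd"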
- This embodiment provides a method for cooperatively processing data using multiple processors of a storage device.
- the central processor in the storage device divides the data processing task into multiple subtasks and, according to the attributes of the subtasks, allocates special processors in the storage device to the subtasks.
- on the one hand, because during data processing the central processing unit takes on the work of task decomposition and task scheduling while the special processors take on the work of executing subtasks, the computing power of the central processing unit and the computing power of the special processors are both fully utilized.
- on the other hand, because the attributes of the subtasks are considered when the subtasks are assigned, each subtask can be scheduled onto an appropriate special processor according to its attributes. Therefore, this method improves the efficiency of data processing.
- the following method 400 illustrates the above method 300 by example.
- method 400 applies to the scenario of a distributed storage system, in which the application's data is broken up and distributed across multiple storage nodes.
- each storage node has a variety of heterogeneous processors, specifically including a CPU, a GPU, an NPU, a DIMM processor, and an SSD processor.
- in method 400, the data processing task is an NDP task, and the subtasks are functions.
- in other words, the method flow of method 400 concerns how the storage node schedules each function onto the most suitable of the various heterogeneous processors for execution. It should be understood that the steps method 400 shares with method 300 are described in method 300 and are not repeated here.
- FIG. 6 is a flowchart of a task execution method 400 according to an embodiment of the application.
- the method 400 includes S401 to S409.
- S401: Determine whether the data is in the DIMM or the SSD. If the data is in the DIMM or the SSD, perform S402; if the data is neither in the DIMM nor in the SSD, perform S404.
- S402: Determine whether the DIMM or SSD supports the function. If the DIMM or SSD supports the function, perform S403; otherwise, perform S404.
- S403: Select the DIMM or SSD as the special processor for executing the function, and then perform S406.
- S404: Select a special processor according to the special processor indicated by the function's definition information or according to the function's calculation features, and then perform S405.
- S405: Load the data set into the memory of the selected special processor, and then perform S406.
- S406: Determine whether the selected special processor is programmable. If the selected special processor is programmable, perform S407; if it is not programmable, perform S409.
- S407: According to the definition information of the function, generate instructions that can execute on the selected special processor, and then perform S408.
- S408: Use the selected special processor to execute the instructions to complete the function, and cache the instructions so they can be invoked directly next time.
- S409: Call the corresponding hardware module in the selected special processor to complete the function.
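- read as a whole, S401 to S409 amount to the dispatch below; this Python trace is an illustrative reduction of the flowchart, and every field of ctx is an assumed toy stand-in, not an interface of the embodiments.

    def execute_function(func, ctx):
        # S401/S402: is the data in the DIMM or SSD, and does that
        # device support the function?
        if ctx["data_loc"] in ("DIMM", "SSD") and ctx["device_supports"]:
            proc = ctx["data_loc"]           # S403: execute near the data
        else:
            proc = ctx["chosen_proc"]        # S404: definition info/features
            ctx["loaded"] = True             # S405: stage the data set
        if ctx["programmable"]:
            # S407/S408: generate instructions, run them, and cache them
            ctx["cache"][(func, proc)] = "instructions for %s on %s" % (func, proc)
        # S409 (not programmable): the fixed hardware module completes the
        # function, so nothing is generated or cached.
        return proc

    ctx = {"data_loc": "DRAM", "device_supports": False, "chosen_proc": "GPU",
           "programmable": True, "cache": {}, "loaded": False}
    assert execute_function("filter", ctx) == "GPU" and ctx["loaded"]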
- the task execution method of the embodiments of this application is described above; the task execution apparatus of the embodiments of this application is described below. It should be understood that the task execution apparatus has all of the functions of the storage device in the foregoing methods.
- the task execution apparatus 600 runs on a controller of a storage device, and the storage device includes at least one hard disk.
- the task execution apparatus 600 runs on the central processing unit of the storage device.
- FIG. 7 is a schematic structural diagram of a task execution device provided by an embodiment of the present application.
- the task execution device 600 includes: an acquisition module 601 for executing S310; a dividing module 602 for executing S320; and an allocation module 603, used to execute S330.
- it should be understood that the task execution apparatus 600 corresponds to the storage device in the foregoing method 300 or method 400, and that each module in the task execution apparatus 600 and the other operations and/or functions above are used to implement the various steps and methods performed by the storage device in method 300 or method 400. For specific details, refer to method 300 or method 400 above; for brevity, they are not repeated here.
- when the task execution device 600 executes tasks, the division into the functional modules above is merely an example for illustration. In practical applications, the functions above can be allocated to different functional modules as needed; that is, the internal structure of the task execution device can be divided into different functional modules to complete all or part of the functions described above.
- the task execution device provided in the foregoing embodiment belongs to the same concept as the foregoing method 300 or method 400. For the specific implementation process, please refer to the foregoing method 300 or method 400, which will not be repeated here.
- the acquisition module 601 in the task execution device is equivalent to the network card in the storage device, and the division module 602 and the allocation module 603 in the task execution device are equivalent to the central processing unit in the storage device.
- the disclosed system, device, and method can be implemented in other ways.
- the device embodiments described above are only illustrative.
- the division of the modules is only a division by logical function; in actual implementations there may be other divisions. For example, multiple modules or components may be combined or integrated into another system, or some features may be ignored or not implemented.
- the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or modules, and may also be electrical, mechanical or other forms of connection.
- modules described as separate components may or may not be physically separated, and the components displayed as modules may or may not be physical modules, that is, they may be located in one place, or they may be distributed on multiple network modules. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the embodiments of the present application.
- the functional modules in the various embodiments of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module.
- the above-mentioned integrated modules can be implemented in the form of hardware or software function modules.
- if the integrated module is implemented in the form of a software function module and is sold or used as an independent product, it can be stored in a computer-readable storage medium.
- based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods in the embodiments of this application.
- the aforementioned storage media include media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
- in this application, the terms "first", "second", and similar words are used to distinguish identical or similar items whose roles and functions are basically the same. It should be understood that there is no logical or temporal dependency between "first" and "second", and that these terms limit neither quantity nor execution order. It should also be understood that although the following description uses the terms first, second, and so on to describe various elements, these elements should not be limited by the terms; the terms are only used to distinguish one element from another.
- for example, without departing from the scope of the various described examples, a first subtask may be called a second subtask, and similarly, a second subtask may be called a first subtask. Both the first subtask and the second subtask may be subtasks and, in some cases, may be separate and different subtasks.
- the computer program product includes one or more computer program instructions.
- the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
- the computer instructions can be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium.
- for example, the computer program instructions can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired or wireless manner.
- the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server or data center integrated with one or more available media.
- the usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a digital video disc (DVD)), or a semiconductor medium (for example, a solid-state drive).
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computer Hardware Design (AREA)
- Computing Systems (AREA)
- Microelectronics & Electronic Packaging (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
一种任务执行方法及存储设备,属于计算机技术领域。一种利用存储设备的多种处理器协作处理数据的方法,由存储设备中的中央处理器将数据处理任务划分为多个子任务,根据子任务的属性为子任务分配存储设备中的专项处理器。一方面,由于在进行数据处理的过程中,中央处理器承担了任务分解和任务调度的工作,专项处理器承担了执行子任务的工作,使得中央处理器的算力和专项处理器的算力均得到了充分利用。另一方面,由于分配子任务时考虑了子任务的属性,使得子任务能够依据其属性被调度到合适的专项处理器上执行。因此,该方法提高了数据处理的效率。
Description
本申请涉及计算机技术领域,特别涉及一种任务执行方法及存储设备。
近数据处理(Near Data Processing,NDP)是一种数据处理的方法或概念。NDP旨在将对数据的处理和计算移动到靠近数据的地方,从而尽量的减少甚至避免数据的移动,因此避免数据移动开销所带来的性能瓶颈,进而提升执行数据处理任务的效率。
相关技术在实现NDP时,由数据库服务器通过智能数据库协议(the Intelligent Database protocol,iDB协议,一种查询下推协议)告知存储设备待执行的表查询操作和数据的位置,存储设备根据iDB协议包含的信息,通过中央处理器(central processing unit,CPU)进行谓词过滤,列过滤,连接过滤等结构化查询语言(Structured Query Language,简称SQL)查询中的表查询操作。
采用以上方法时,只局限于使用存储设备的CPU的算力,因此影响了数据处理效率。
发明内容
本申请实施例提供了一种任务执行方法及存储设备,能够提高数据处理效率。所述技术方案如下:
第一方面,提供了一种任务执行方法,该方法应用于存储设备中,该存储设备包括中央处理器和多个专项处理器。在该方法中,中央处理器获取数据处理任务;该中央处理器将该数据处理任务划分为多个子任务;该中央处理器根据各个子任务的属性,将该多个子任务中的第一子任务分配给第一专项处理器。其中,第一专项处理器是该多个专项处理器的其中一个专项处理器。
以上提供了一种利用存储设备的多种处理器协作处理数据的方法,由存储设备中的中央处理器将数据处理任务划分为多个子任务,根据子任务的属性为子任务分配存储设备中的专项处理器。一方面,由于在进行数据处理的过程中,中央处理器承担了任务分解和任务调度的工作,专项处理器承担了执行子任务的工作,使得中央处理器的算力和专项处理器的算力均得到了充分利用。另一方面,由于分配子任务时考虑了子任务的属性,使得子任务能够依据其属性被调度到合适的专项处理器上执行。因此,该方法提高了数据处理的效率。
可选地,该子任务的属性包括该子任务所涉及的数据的地址,该第一专项处理器是距离该数据最近的专项处理器。
通过这种可选方式,使得子任务被调度至距离数据最近的专项处理器上执行。由于缩短了数据从存储介质至专项处理器的传输路径,专项处理器能够就近访问数据和处理数据,因此减少了数据移动造成的时延和性能开销,提高了数据处理的效率和速度。
可选地,该子任务的属性包括该子任务的计算模式和/或并发量,该第一专项处理器是与该计算模式和/或并发量匹配的专项处理器。
由于不同的专项处理器擅长处理不同的任务,通过这种可选方式,考虑了子任务的计算特征与专项处理器本身是否匹配,将子任务调度至与其计算特征匹配的专项处理器上执行,使得专项处理器能够处理自身擅长处理的任务,从而发挥了专项处理器自身的性能优势,提高了数据处理的效率。
可选地,该子任务的属性包括该子任务的定义信息,该第一专项处理器是该第一子任务的定义信息指示的专项处理器。
通过这种可选方式,一方面,开发者能够在定义信息中指定由哪个处理器执行子任务,使得子任务调度至开发者指定的专项处理器上执行,从而满足了开发者的自定义需求。另一方面,随着存储设备的算力提升以及业务需求的增长,当需要将新的任务放在存储设备上执行时,通过在新任务的定义信息中添加专项处理器的标识,即可指明将新的任务调度到哪个专项处理器上,从而降低了调度新任务的难度,因此提高了可扩展性。
可选地,该子任务的属性包括该子任务对应的数据集类型,该第一专项处理器是与该第一子任务对应的数据集类型匹配的专项处理器。
由于不同的专项处理器适于处理不同类型的数据,例如GPU适合处理图像,一些专用的编解码处理器适合处理视频,通过这种可选方式,考虑了子任务要处理的数据类型与专项处理器本身是否匹配,将子任务调度至其数据集类型匹配的专项处理器上执行,使得专项处理器能够处理自身适合处理的数据,避免由于专项处理器无法识别和处理特定类型的数据而造成任务执行失败的情况,提高了任务执行的成功率。
可选地,该多个子任务的执行顺序被记录在拓扑图中,该方法还包括:
该中央处理器根据该拓扑图指示该第一专项处理器按照顺序执行该第一子任务。
通过这种可选方式,一方面,由于拓扑图记录了子任务的执行顺序,中央处理器无需重新计算子任务的执行顺序,能够直接按照拓扑图所记录的执行顺序进行调度,从而减少了调度的工作量。另一方面,目前存在很多基于拓扑图的调度优化算法,能够调用基于拓扑图的调度优化算法优化子任务调度的顺序,从而缩短任务整体的执行时间。
第二方面,提供了一种存储设备,该存储设备包括中央处理器和多个专项处理器。第二方面提供的存储设备用于实现第一方面或第一方面任一种可选方式所提供的功能,具体细节可参见上述第一方面或第一方面任一种可选方式。
第三方面,提供了一种计算机可读存储介质,该存储介质中存储有至少一条指令,该指令由中央处理器读取以使存储设备执行上述第一方面或第一方面任一种可选方式所提供的任务执行方法。
第四方面,提供了一种计算机程序产品,当该计算机程序产品在存储设备上运行时,使得存储设备执行上述第一方面或第一方面任一种可选方式所提供的任务执行方法。
第五方面,提供了一种存储设备,该存储设备具有实现上述第一方面或第一方面任一种可选方式的功能。该存储设备包括至少一个模块,至少一个模块用于实现上述第一方面或第一方面任一种可选方式所提供的任务执行方法。第五方面提供的存储设备的具体细节可参见上述第一方面或第一方面任一种可选方式,此处不再赘述。
图1是本申请实施例提供的一种系统架构的示意图;
图2是本申请实施例提供的一种应用数据打散的示意图;
图3是本申请实施例提供的一种系统架构的示意图;
图4是本申请实施例提供的一种任务执行方法的流程图;
图5是本申请实施例提供的一种拓扑图的示意图;
图6是本申请实施例提供的一种任务执行方法的流程图;
图7是本申请实施例提供的一种任务执行装置的结构示意图。
为使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请实施方式作进一步地详细描述。
由于本申请的一些实施例涉及近数据处理技术的应用,为了便于理解,下面先对近数据处理技术进行简单介绍。
在传统的数据处理架构中,对数据的处理一般是集中式的,即将数据从存储器通过输入输出(Input/Output,IO)或网络加载到内存,然后中央处理器(central processing unit,CPU)再对内存中数据进行处理。然而在大数据时代,需要处理的数据量呈爆炸式的增长,这种传统的数据处理架构需要传输大量的数据。对于像数据库这种数据密集型的应用,查询处理首先需要大量的IO操作,加载数据到计算节点的内存中,使得IO或网络成为系统的性能瓶颈,带来极大的性能问题:1)大量的数据移动,增加了数据处理的延迟;2)数据传输造成IO或网络资源竞争,影响系统中其它应用数据访问,影响其它应用的性能。在大数据时代,数据量呈爆炸式的增长,对于数据分析型应用,应该尽量避免数据的传输,减少数据的移动开销。
另一方面,当数据从外存加载到动态随机存取存储器(Dynamic Random Access Memory,DRAM)之后,CPU需要通过内存总线,使用存取(load/store)指令访问内存。而随着CPU的性能以每年大约60%的速度快速提升,而内存性能的提升速度只有大约7%,从而导致当前内存的速度严重落后于CPU的速度,内存和CPU之间存在严重的性能鸿沟,难以充分发挥CPU的优势,导致内存系统成为计算系统的性能瓶颈,特别是在内存密集(Memory Intensive)的高性能计算(High Performance Computing,HPC)场景中,内存速度严重限制着系统性能。此外,内存和CPU之间的内在总线也面临着带宽低,延迟高等问题,数据传输的代价高,严重影响计算系统的性能。这种限制系统性能的内存瓶颈通常称为“内存墙(Memory Wall)”。
为了解决数据移动开销所带来的性能瓶颈,需要将传统以处理器为中心的计算模式转变为以数据为中心的计算模式,将对数据的处理移动到靠近数据的地方,从而实现近数据处理(Near Data Processing,NDP)。NDP也称近数据计算(Near Data Computing,NDC),是一种处理数据的方法或概念。NDP是指将对数据的处理和计算移动到靠近数据的地方,从而尽量的减少甚至避免数据的移动,提升数据处理的效率。
下面介绍本申请实施例提供的应用场景。
本实施例提供的方法能够应用在分布式存储系统或集中式存储设备中,下面对这两种应用场景分别进行介绍。
应用场景一、分布式存储系统的场景。
参见附图1,本实施例提供了一种系统架构100,系统架构100是对分布式存储系统的应 用场景的举例说明。系统架构100是一种计算和存储分离的架构,系统架构100包括计算集群110和存储集群120,计算集群110和存储集群120通过网络通道相连。
计算集群110包括多个计算节点(computing node,CN)。计算节点的形态包括多种情况。例如,计算节点是主机、服务器、个人电脑或其他具有计算处理能力的设备。例如参见附图1,计算集群110包括主机110a和主机110b。计算集群110中的不同计算节点之间通过有线网络或无线网络相连。计算集群110中的不同计算节点可以分布在不同或相同的位置。计算节点用于生成和下发数据处理任务。
计算节点包括至少一个应用(Applications)111和NDP协调模块(NDP Coordinator)112。应用111和NDP协调模块112是计算节点上的软件。应用111用于生成数据处理任务。可选地,应用111是数据密集型应用,即需要处理海量的数据。例如,应用111是联机分析处理(On-line Analytical Processing,OLAP)应用、人工智能(artificial intelligence,AI)应用、联机事务处理(On-Line Transaction Processing,OLTP)应用、大数据分析应用、HPC应用等。OLAP应用例如用于提供OLAP系统中的多表联合查询的服务。应用111会将产生的数据处理任务发送至NDP协调模块112。NDP协调模块112用于将应用111的数据处理任务分别发送至数据所在的存储节点。
可选地,分布式存储系统还包括数据管理装置,数据管理装置用于记录数据在存储集群120中所在的存储节点。计算节点中的NDP协调模块用于向数据管理装置发送查询请求,从而查询出数据位于哪个存储节点。可选地,在数据是文件的情况下,数据管理装置保存有文件标识符(identifier,ID)和文件所在的存储节点的ID之间的映射关系。可选地,在数据是键值对的情况下,数据管理装置保存有key和文件所在的存储节点的ID之间的映射关系。示意性地,参见附图3,数据管理装置是附图3中的数据格式服务(Data Scheme Service)130。
存储集群120包括多个存储节点(Date Node,DN)。例如参见附图1,存储集群120包括存储节点120a、存储节点120b和存储节点120c。存储集群120中的不同存储节点可以分布在不同或相同的位置。存储集群120中的不同存储节点通过高速网络互连。存储节点用于存储数据。存储节点可以承载计算节点中应用的存储业务,响应计算节点的IO请求。
计算集群110和存储集群120之间的网络通道通过至少一个网络设备建立。网络设备用于转发计算集群110与存储集群120之间传输的数据。网络设备包括而不限于交换机、路由器等。网络设备在附图1未示出。
以上介绍了分布式存储系统的整体架构,以下对分布式存储系统存储的数据的分布进行简单介绍。可选地,通过分片(sharding)机制或其他划分方式,对每个应用的数据集分别进行划分,使得同一个应用的数据集被拆分为多份数据,该多份数据分别分布在不同的存储节点上。例如,每一个计算节点处理应用数据集中的一份数据,每一个存储节点存储应用数据集中的一份数据,从而保证计算节点和存储节点的负载均衡。例如,参见附图2,附图2示出了应用数据打散的示意图,应用1的数据集和应用2的数据集分别分布在存储节点1、存储节点2至存储节点n中。其中,应用1的数据集被划分为n份数据,n份数据包括应用1的数据a、应用1的数据b至应用1的数据n,其中数据a分布在存储节点1上,数据b分布在存储节点2上,数据n分布在存储节点n上,应用2的数据分布与应用1的数据分布同理。此外,存储集群120可以使用多副本或者纠删码(erasure code,EC)的方式进行数据冗余保护,使得在部分存储节点失效的情况下,应用数据仍然可用,从而保证数据的高可用。
应用场景二、集中式存储设备的场景。
集中式存储设备例如是存储阵列。集中式存储设备包括一个或多个控制器和一个或多个硬盘。存储设备中的控制器也称存储控制器。集中式存储设备通过有线网络或无线网络与主机相连。
在以上描述的两种应用场景中,计算集群110和存储集群120之间的网络通道或者集中式存储设备与主机之间的网络通道,会受到成本、距离等因素的限制,存在网络带宽相对较低、延迟高等缺点。因此,对于OLAP应用,大数据分析应用等数据密集型应用而言,应用所在的计算设备与存储设备之间的网络通道成为了主要的性能瓶颈之一。有鉴于此,如何减少或避免数据在计算侧与存储侧之间的网络通道进行传输所带来的性能开销,提高应用中数据处理的效率,已成为以上应用场景亟需满足的需求。
以上示例性介绍了应用场景以及应用场景存在的需求,以下对本实施例提供的存储设备以及存储设备执行的方法进行具体介绍。在以上应用场景中,通过本实施例提供的存储设备以及方法,能够满足上述应用场景存在的需求。具体地,通过将数据处理任务交给存储节点,使得数据的处理过程从计算集群110中的计算节点移动至存储集群120中的存储节点,由于存储节点能够访问本地存储的数据并在本地对其存储的数据进行处理,而无需通过网络通道请求远端存储的数据,从而避免数据在计算集群110和存储集群120之间通过网络通道传输带来的性能瓶颈。此外,下述实施例可以作为一套通用的近数据计算机制,支持执行数据库应用、大数据应用、AI应用等各种应用产生的数据处理任务,从而增加近数据计算的灵活性。此外,通过将数据处理任务分解为多个子任务,将每个子任务分别进一步下推至固态硬盘(solid state drive,SSD)或双列直插式存储模块(Dual-Inline-Memory-Modules,DIMM)、图形处理器(英文:Graphics Processing Unit,简称:GPU)、神经网络处理器(neural-network processing units,NPU)或者专用的数据处理单元(Data Processing Unit,DPU),分别调度各个处理器执行子任务,从而实现任务的分解和调度,每个子任务能够根据其计算特征和需求被调度最合适的处理器上执行,从而充分利用存储设备的异构计算资源,最大化数据处理的效率。
下面结合附图1和附图3,对存储设备内部的结构进行介绍。
本申请性实施例提供了一种存储设备。例如,存储设备是分布式存储系统中的存储节点,例如是附图1中的存储节点120a、存储节点120b和存储节点120c。又如,存储设备是集中式存储设备。存储设备包括多个处理器、网卡以及存储介质(Storage Media)。多个处理器包括中央处理器和多个专项处理器。
中央处理器用于获取数据处理任务、划分子任务以及对每个专项处理器进行调度。例如,参见附图1,存储节点120a是对存储设备的举例说明,存储节点120a中的中央处理器121是对存储设备中的中央处理器的举例说明。
专项处理器是中央处理器之外的任意处理器。专项处理器具有算力,专项处理器能够自身的利用算力参与子任务的执行。例如,参见附图1,存储节点120a中的GPU122、NPU123是对存储设备中的专项处理器的举例说明。此外,参见附图3,存储节点中的DIMM127中的DPU1272、存储节点中的SSD128中的DPU1282也是对存储设备中的专项处理器的举例说明。 专项处理器的具体类型包括多种情况,以下通过情况一和情况二对专项处理器举例说明。
情况一、专项处理器是独立的芯片。
例如,参见附图1或附图3,专项处理器是GPU、NPU一类的能单独工作的芯片。
情况二、专项处理器是存储设备包括的任意元件中的处理器。
在情况二下,专项处理器可以和存储设备的其他元件集成在一起。例如,参见附图1,存储设备包括硬盘,专项处理器是硬盘的控制器(SSD controller)。例如,在硬盘是SSD的情况下,SSD包括处理器,专项处理器可以是SSD的处理器。例如,参见附图3,SSD包括DPU,专项处理器是SSD128中的DPU1282。其中,包括处理器的SSD也称计算型SSD或智能SSD。在一些实施例中,存储设备包括DIMM,DIMM包括处理器,专项处理器是DIMM的处理器。例如,参见附图3,DIMM127包括DPU1272,专项处理器是DIMM127中的DPU1272。其中,包括处理器的DIMM也称计算型DIMM或智能DIMM。
在一些实施例中,专项处理器是专用集成电路(application-specific integrated circuit,ASIC),可编程逻辑器件(programmable logic device,PLD)或其组合。上述PLD可以是复杂可编程逻辑器件(complex programmable logic device,CPLD),现场可编程逻辑门阵列(field-programmable gate array,FPGA),通用阵列逻辑(generic array logic,GAL)或其任意组合。专项处理器可以是单核处理器,也可以是多核处理器。
在一些实施例中,存储设备包括的多个专项处理器是异构处理器。可选地,多个专项处理器具有不同的硬件架构。可选地,多个专项处理器支持不同的指令集。例如,存储设备包括的一个专项处理器支持X86指令集,存储设备包括的另一个专项处理器支持ARM指令集。例如,存储设备包括CPU、GPU、NPU、DIMM和SSD,在这个例子中,CPU、GPU、NPU、DIMM中的DPU以及SSD中的DPU是对五种异构处理器的举例说明。对于中央处理器而言,多种异构的专项处理器可以组成异构计算资源池,中央处理器可以调度异构计算资源池中的资源来执行任务。
中央处理器如何与专项处理器通信包括多种方式。在一些实施例中,中央处理器与专项处理器通过高速互联网络连接,中央处理器与专项处理器通过高速互联网络通信。高速互联网络例如是高速串行计算机扩展总线标准(peripheral component interconnect express,PCIe)总线、memory fabric、高速以太网、HCCS、无限带宽(InfiniBand,IB)或者光纤通道(Fibre Channel,FC)。
网卡用于提供数据通信的功能。例如,参见附图1,存储设备中的网卡是存储节点120a中的网卡125。
存储介质用于存放数据。例如,参见附图1,存储介质是存储节点120a中的硬盘124。硬盘124用于存放数据。硬盘124例如是固态硬盘(solid state drive,简称:SSD)、机械硬盘(hard disk drive,简称:HDD)。例如,参见附图3,硬盘是SSD128,SSD128包括至少一个闪存芯片1281,闪存芯片1281用于持久化存储数据。例如,参见附图3,存储介质也可以是DIM127中的DRAM芯片1271。
在一些实施例中,存储设备还包括存储接口(Storage Interface)126,存储接口126用于向上层(如存储设备的处理器和计算节点的应用)提供数据访问接口。例如,存储接口126是文件系统接口或键值(Key-Value,KV)接口。
以上从硬件的角度介绍了存储设备的内部结构,下面从软件的角度,介绍存储设备内部的逻辑功能架构。
参见附图3,存储节点包括NDP执行引擎20(NDP Execution Engine),NDP执行引擎20是存储节点上的软件。NDP执行引擎20在存储节点的中央处理器中运行。例如,NDP执行引擎20在存储节点的控制器中运行。
NDP执行引擎20包括解析器(Parser)201和执行器(Executor)202。解析器201用于对描述NDP任务的定义信息203进行解析,生成拓扑图204。执行器202用于根据拓扑图204,分别调度各个专项处理器以及中央处理器执行子任务。例如,在附图3中,执行器202调度CPU执行子任务a,调度GPU执行子任务c,调度NPU执行子任务b,调度DIMM中的DPU执行子任务e,调度SSD中的DPU执行子任务d。在一些实施例中,解析器201和执行器202均是软件。例如,解析器201和执行器202是存储节点的中央处理器读取程序代码后生成的功能模块。
以上介绍了系统架构,以下通过方法300和方法400示例性介绍基于上文提供的系统架构执行任务的方法流程。
参见附图4,附图4是本申请实施例提供的一种任务执行方法300的流程图。
所述方法300由存储设备执行。可选地,所述方法300由分布式存储系统中的存储节点执行。例如,所述方法300由附图1所示系统架构中的存储节点120a、存储节点120b和存储节点120c执行。可选地,所述方法300由集中式存储设备执行。
可选地,方法300中处理的数据是附图1所示系统架构中主机的应用产生和维护的数据。例如,主机的应用根据其需要处理的数据,产生数据处理任务,将数据处理任务作为存储设备的输入,触发存储设备执行以下步骤S310至步骤S340。
示例性地,方法300包括S310至S340。
S310、中央处理器获取数据处理任务。
数据处理任务是对存储设备存储的数据进行处理的任务。可选地,数据处理任务是NDP任务。数据处理任务的类型包括多种情况。例如,数据处理任务是OLAP应用产生的多表联合查询任务、AI应用产生的模型训练任务、HPC应用产生的高性能计算任务、大数据分析应用产生的物理实验数据分析任务、气象数据分析任务等大数据分析任务、OLTP应用产生的事务处理任务等。
中央处理器如何获取数据处理任务包括多种实现方式。在一些实施例中,数据处理任务来自于计算设备。具体地,计算设备生成数据处理任务,向存储设备发送数据处理任务,存储设备的中央处理器接收数据处理任务。通过这种方式,数据处理任务从计算设备下推到存储设备执行,从而实现近数据处理。例如,参见附图3,主机中的应用生成NDP任务;应用向NDP协调模块发送任务下推请求,任务下推请求携带NDP任务,任务下推请求用于请求将任务发送至存储设备。NDP协调模块响应于任务下推请求,将NDP任务发送至存储设备,使得存储设备得到NDP任务。
在一些实施例中,数据处理任务所要处理的数据存储在存储设备中。例如,计算设备根据数据的归属位置,确定数据所在的存储设备,向数据所在的存储设备发送数据处理任务,以便存储设备就近调度本地的处理器处理本地的数据。
其中,计算设备如何确定数据所在的存储设备包括多种实现方式。例如,在数据是文件的情况下,通过文件的ID确定文件所在的存储设备。又如,在数据是键值对的情况下,通过键(key)确定数据所在的存储设备。在一些实施例中,确定数据所在的存储设备的过程涉及计算设备与数据管理装置的交互。具体地,计算设备向数据管理装置发送查询请求,查询请求包括文件的ID或者key。数据管理装置响应于查询请求,根据文件的ID或者key,查询数据在存储集群中所在的节点,向计算设备发送查询响应,查询响应包括存储设备的标识。计算设备接收查询响应,确定数据所在的存储设备。
可选地,数据处理任务通过声明式语言描述。声明式语言是一种编程范式,与命令式编程相对立。声明式语言描述数据处理任务的目标,即指示存储设备执行什么操作,而不明确地指示具体应该如何执行操作。例如,数据处理任务是NDP任务,开发者设计了一种描述NDP任务的声明式语言,将其称为NDP描述语言。应用程序可以通过NDP描述语言定义需要下推至存储设备的NDP任务,得到NDP任务的定义信息。其中,NDP任务的定义信息包含了NDP任务的输入参数、NDP任务需要执行的操作和NDP任务的输出结果。例如,使用NDP描述语言定义的NDP任务结构如下:
S320、中央处理器将数据处理任务划分为多个子任务。
子任务包括而不限于函数或者计算步骤。划分子任务的单位包括多种情况,以下通过方式一至方式二举例说明。
方式一、以一个函数作为划分子任务的最小单位。
例如,中央处理器将数据处理任务划分为多个函数。一个子任务是一个函数;或者,一个子任务包括多个函数。
方式二、以一个计算步骤作为划分子任务的最小单位。
例如,中央处理器将数据处理任务划分为多个函数,将每个函数划分为多个计算步骤。其中,一个子任务是一个计算步骤;或者,一个子任务包括多个计算步骤。由于将数据处理任务分解为函数并进一步分解为计算步骤,实现了任务的逐层分解,使得子任务的粒度更加精细化,有助于提高调度子任务的灵活性。
在一些实施例中,子任务是根据计算模式划分的。具体而言,中央处理器根据数据处理任务包含的函数或计算步骤的计算模式,将数据处理任务划分为多个子任务,每一个子任务具有相同的计算模式。例如,数据处理任务中包括函数A和函数B。函数A复杂,函数A包括多种计算模式。而函数B比较简单,函数B只具有一种计算模式。在这个例子中,中央处理器将函数A拆分为多个计算步骤,每个计算步骤具有一个计算模式,将函数A的每个计算步骤作为一个子任务;中央处理器将函数B作为一个子任务。由于依据计算模式划分子任务,便于根据计算模式为子任务分配合适的专项处理器。
在一些实施例中,子任务是根据函数的定义信息划分的。具体而言,中央处理器根据数据处理任务中每个函数的定义信息,将数据处理任务划分为多个子任务。例如,开发者在编写函数时,在函数中注明函数包含的每个计算步骤,比如说,在函数中的代码行A和代码行B中分别添加关键字,注明代码行A和代码行B之间的程序代码对应一个单独的计算步骤,该计算步骤可被调度至某个专项处理器上。则中央处理器根据函数的定义信息,将代码行A和代码行B之间的程序代码拆分出来,作为一个子任务。
S330、中央处理器根据各个子任务的属性,将多个子任务中的第一子任务分配给第一专项处理器。
S340、第一专项处理器执行第一子任务。
本实施例涉及中央处理器如何为第一专项处理器分配第一子任务,中央处理器为其他专项处理器分配其他子任务的过程与此同理。
第一子任务是多个子任务中的其中一个子任务。第一专项处理器是多个专项处理器的其中一个专项处理器。例如,第一专项处理器是GPU、NPU、DIMM中的DPU或者SSD中的DPU。
应理解,本实施例并不限定仅为第一专项处理器分配第一子任务这一个子任务,可选地,中央处理器为第一专项处理器还分配第一子任务之外的其他子任务。
应理解,本实施例并不限定所有的子任务都要分配给专项处理器。在一些实施例中,中央处理器将部分子任务分配给自己执行。例如,中央处理器从多个子任务中选择第二子任务,中央处理器执行第二子任务。
在一些实施例中,中央处理器将多个子任务中不同子任务分配给不同的专项处理器,从而调度不同的专项处理器分别执行不同的子任务。例如,划分出的多个子任务包括子任务a、子任务b、子任务c和子任务d,中央处理器将子任务a分配给NPU,将子任务b分配给GPU,将子任务c分配给DIMM中的DPU,将子任务d分配给SSD中的DPU。
在一些实施例中,中央处理器为不同专项处理器分配的子任务的数量是相同的,例如,中央处理器将划分出的所有子任务平均分配给每个专项处理器。
在另一些实施例中,中央处理器为不同专项处理器分配的子任务的数量是不同的。例如,中央处理器结合每个专项处理器当前的算力,为具有空闲算力的专项处理器分配更多的子任务,为算力不足的专项处理器分配更少的子任务,或者不为算力不足的专项处理器分配子任务。例如,中央处理器确定第一专项处理器的计算资源,判断第一专项处理器的计算资源是否低于设定的阈值;如果第一专项处理器的计算资源高于设定的阈值,中央处理器确定第一专项处理器算力空闲,则为第一专项处理器分配第一数量的子任务;如果第一专项处理器的计算资源低于设定的阈值,中央处理器确定第二专项处理器算力不足,则不为第一专项处理器分配子任务,或者为第一专项处理器分配小于第一数量的子任务。
本实施例并不限定第一子任务仅通过第一专项处理器这一个处理器执行。在一些实施例中,第一专项处理器承担第一子任务的所有运算量,执行第一子任务的所有步骤。在另一些实施例中,第一专项处理器和中央处理器协同参与第一子任务的运算。例如,第一专项处理器执行第一子任务中的部分步骤,中央处理器执行第一子任务中另一部分步骤。比如说,第一专项处理器在执行第一子任务的过程中,实时监控计算资源的剩余情况,当第一专项处理器确定自身的算力不足时,则将已经得出的计算结果以及第一子任务中未执行的剩余部分发 送给中央处理器,中央处理器依据计算结果,继续执行第一子任务中的剩余部分。在另一些实施例中,第一专项处理器不是和中央处理器协同运算,而是和其他专项处理器协同运算。
在一些实施例中,存储设备包括的多个专项处理器分别具有对应的特点,擅长执行不同的任务。有鉴于此,中央处理器可以结合专项处理器的特点,将专项处理器适于执行的任务分配给专项处理器,从而充分发挥专项处理器各自的性能优势。以下通过(1)至(5),对如何结合专项处理器的具体特点对分配给专项处理器的子任务举例说明。
(1)适于分配给GPU的子任务。
GPU是一种单指令多数据流(Single Instruction Multiple Data,SIMD)的处理器。GPU的架构包括成千上万个简单的处理核,GPU通过成千上万个核同时工作,能进行大量的相同运算。此外,GPU的每个处理核比较适合做运算,不适合做控制。
考虑到GPU的这一特点,如果子任务涉及的运算简单且模式单一,并且子任务是由大量的这种简单、单一的运算构成的,可以将任务分配给GPU,从而调度GPU执行计算模式简单且数据并发量大的任务。
例如,执行矩阵乘运算就是一种计算模式简单且数据并发量大的子任务。具体而言,矩阵乘运算是由大量的向量乘运算构成的。向量乘运算是一种简单的操作,向量乘运算具体包括对行和列相乘,将得到的积再相加。考虑到向量乘运算任务的这种属性,在一些实施例中,将矩阵乘运算的子任务分配给GPU。GPU在执行矩阵乘运算的子任务的过程中,GPU的每个处理核会分别进行向量乘运算,GPU通过成千上万个处理核同时进行向量乘运算,使得整个向量乘运算子任务的执行得以加速,有助于提高执行向量乘运算子任务的效率。
应理解,矩阵乘运算是对适于分配给GPU的子任务的举例说明,GPU也适于执行矩阵乘运算之外的子任务。例如,神经网络中的卷积运算也适于通过GPU执行,可以调度GPU执行卷积运算子任务。
(2)适于分配给NPU的子任务。
NPU专门为AI设计,NPU包括乘加、激活函数、二维数据运算、解压缩等AI计算所需的模块。考虑到NPU的这一属性,在一些实施例中,将神经网络运算的任务(如图像识别的任务)分配给NPU,NPU能够利用自身包括的模块,加速神经网络运算任务。
(3)适于分配给DPU的子任务。
DPU是一个可编程的电子部件,用于处理数据。DPU具有CPU的通用性和可编程性,但DPU比CPU更具有专用性,DPU能在网络数据包,存储请求或分析请求上高效运行。此外,DPU比CPU具有更大程度的并行性(即DPU能处理大量并发的请求)。考虑到DPU的这一特点,在一些实施例中,调度DPU提供对全局内存池的数据卸载服务。例如,将地址索引、地址查询、分区功能以及对数据进行过滤、扫描等操作分配给DPU。
(4)适于分配给DIMM的处理器的子任务。
例如,DIMM内包括DPU和DRAM芯片(DRAM chips),DPU能够快速地访问DRAM,对存放在DRAM中的数据进行处理,从而就近完成任务。考虑到DIMM的这一特点,在一些实施例中,当任务所需处理的数据位于DIMM中的DRAM时,由于DPU与DRAM集成在同一个DIMM内,DPU具有距离数据最近或者说数据亲和性最高的优势,可以将任务分配给DIMM的DPU。通过调度DIMM中的DPU对DIMM存储的数据进行处理,能够实现存内计算(Processing in Memory)或近内存计算(Near Memory Computing),避免数据通过内 存总线进行传输,使得任务的执行得以加速,提高执行任务的效率。此外,在一些实施例中,调度DIMM中的DPU执行内存访问不规则,且内存访问量比较大的任务,从而利用DPU访问DRAM的性能优势,节省访问内存的时间开销。此外,在一些实施例中,DIMM中的DPU是专用于执行特定操作的处理器,只能完成固定类型的计算,在这种情况下,调度DIMM的DPU执行这些固定类型的计算对应的任务。
应理解,以上是对以DIMM包括的处理器是DPU的情况的举例说明,在DIMM的处理器不是DPU,而是DPU之外的其他类型的处理器的情况下,可以采用同样的策略为DIMM的其他类型处理器分配任务。
(5)适于分配给SSD的处理器的子任务。
例如,SSD包括DPU和闪存芯片(Flash chips),SSD的DPU能够快速地访问闪存芯片,对存放在闪存芯片中的数据进行处理,从而就近完成任务。在一些实施例中,考虑到SSD的这一特点,当任务要处理的数据位于SSD中的闪存芯片时,可以调度SSD中的DPU执行任务。通过调度SSD的DPU对SSD存储的数据进行处理,能够充分利用SSD盘内部的高带宽。此外,当数据分别位于多个SSD上时,可以调度多个SSD并行地执行任务,从而利用多个SSD之间的并发处理能力,加速任务的执行。此外,在一些实施例中,调度SSD的DPU执行计算模式简单且输出的数据量能极大减少的任务,例如过滤操作。此外,在一些实施例中,SSD中的DPU是专用于执行特定操作的处理器,只能完成固定类型的计算,在这种情况下,调度SSD的DPU执行这些固定类型的计算对应的任务。
应理解,以上是对以SSD包括的处理器是DPU的情况的举例说明,在SSD的处理器不是DPU,而是DPU之外的其他类型的处理器的情况下,可以采用同样的策略为SSD的其他类型处理器分配任务。
以下通过调度策略一至调度策略四,对具体如何调度专项处理器举例说明。
调度策略一、按照数据的归属位置调度。
调度策略一也称为按照数据的亲和性进行调度。在一些实施例中,调度策略一的实现方式包括:中央处理器确定子任务所涉及的数据的地址;中央处理器根据第一子任务所涉及的数据的地址,从多个专项处理器中选择距离数据最近的专项处理器,作为第一专项处理器;中央处理器将第一子任务分配给该距离数据最近的第一专项处理器。
数据的地址例如是数据的逻辑地址或数据的物理地址。数据的地址例如通过数据的元数据确定。
可选地,在采用调度策略一时,数据位于哪个装置的存储介质,中央处理器就调度哪个装置的处理器执行子任务。在这种情况下,距离数据最近的专项处理器是与数据所在的存储介质集成在一起的处理器。例如,如果数据位于SSD,则中央处理器将子任务分配给SSD中的DPU,从而调度SSD的DPU执行子任务。如果数据位于DIMM,则中央处理器将子任务分配给DIMM中的DPU,从而调度DIMM的DPU执行子任务。
通过采用调度策略一,使得子任务被调度至距离数据最近的专项处理器上执行。由于缩短了数据从存储介质至专项处理器的传输路径,专项处理器能够就近访问数据和处理数据,因此减少了数据移动造成的时延和性能开销,提高了数据处理的效率和速度。
调度策略二、按照子任务的计算特征调度。
在一些实施例中,子任务的计算特征包括子任务的计算模式和/或子任务的并发量。调度 策略二的实现方式包括:中央处理器确定子任务的计算模式和/或并发量,根据子任务的计算模式和/或并发量,从多个专项处理器中选择与计算模式和/或并发量匹配的专项处理器,作为第一专项处理器,将第一子任务分配给第一专项处理器。例如,当子任务的计算模式简单且并发量大时,则中央处理器选择GPU,将计算模式简单且并发量大的子任务分配给GPU。
在一些实施例中,子任务的计算特征包括执行子任务所需的算法的类型。调度策略二的实现方式包括:中央处理器根据执行子任务所需的算法的类型,从多个专项处理器中选择适于运行该类型算法的专项处理器。例如,子任务是人脸识别,执行人脸识别时需要使用神经网络算法,而存储设备刚好配置了执行神经网络算法的NPU,则中央处理器选择NPU,调度NPU通过神经网络算法进行人脸识别。又如,子任务是图像压缩,存储设备刚好配置了图像压缩的专用芯片,则中央处理器调度该专用芯片进行图像压缩。
由于不同的专项处理器擅长处理不同的任务,通过采用调度策略二,考虑了子任务的计算特征与专项处理器本身是否匹配,将子任务调度至与其计算特征匹配的专项处理器上执行,使得专项处理器能够处理自身擅长处理的任务,从而发挥了专项处理器自身的性能优势,提高了数据处理的效率。
调度策略三、按照子任务的定义信息调度。
在一些实施例中,调度策略三的实现方式包括:中央处理器获取每个子任务的定义信息;中央处理器根据第一子任务的定义信息,从存储设备包括的多个专项处理器中选择第一子任务的定义信息指示的专项处理器,作为第一专项处理器;中央处理器将第一子任务分配给该第一专项处理器。
第一子任务的定义信息包括第一专项处理器的标识。第一专项处理器的标识例如是第一专项处理器的名称。比如说,当第一子任务的定义信息包括“GPU”时,指示通过GPU执行第一子任务。定义信息由于包含第一专项处理器的标识,能够指明了要通过第一专项处理器执行第一子任务。
本实施例并不限定第一子任务的定义信息仅包括第一专项处理器这一个处理器的标识。在一些实施例中,第一子任务的定义信息还包括第一专项处理器之外的其他处理器的标识。例如,第一子任务的定义信息包括多个处理器中每个处理器的标识,从而指明分配第一子任务时存在多个处理器可供选择。中央处理器根据第一子任务的定义信息,从定义信息指示的多个处理器中选择第一专项处理器。
在一些实施例中,定义信息还用于指示多个处理器中每个处理器的优先级,中央处理器根据定义信息指示的每个处理器的优先级,从定义信息指示的多个处理器中选择优先级最高的处理器,作为第一专项处理器;或者,在定义信息指示的优先级最高的处理器算力不足的情况下,中央处理器选择优先级其次高的处理器作为第一专项处理器。
在一些实施例中,定义信息中通过处理器的标识的排列顺序指明不同处理器优先级的高低。例如,定义信息中第一专项处理器的标识位于第二专项处理器的标识之前,表示第一专项处理器具有比第二专项处理器更高的优先级。例如,如果定义信息包括[GPU,NPU],表示GPU比NPU优先级更高。如果定义信息包括[NPU,GPU],表示NPU比GPU优先级更高。
如何获得子任务的定义信息包括多种实现方式,例如,开发者指定适于执行第一子任务的专项处理器是第一专项处理器,开发者在编写第一子任务的程序代码的过程中,输入第一专项处理器的标识以及其他信息,得到子任务的定义信息,将子任务的定义信息保存至存储 设备中。中央处理器在调度的过程中,会读取预先保存的第一子任务的定义信息。
例如,第一子任务是函数,开发者对函数的语法进行定义,指定函数的定义信息需要包括专项处理器的标识。在一些实施例中,开发者编写了一套NDP描述语言,NDP描述语言针对通用计算场景预置了部分函数或计算步骤,并针对这些函数或计算步骤指定了对应的异构处理器,从而通过异构处理器进行加速处理。在使用这些基本的函数时,不同函数会被分别调度到异构处理器(例如GPU、NPU和DIMM等)上执行。由于不同的应用场景有不同的函数或计算步骤。因此,NDP描述语言支持通过定义新的函数来扩展NDP的计算能力。在定义新的函数时,开发者需要指定该函数对应的数据集类型、输入参数、输出参数的类型以及一个或多个该函数最适合的专项处理器。
例如,NDP描述语言定义函数的语法如下。
Decl Func<函数名>of Dataset<数据集类型名>(arg list)[处理器1,处理器2,…]//注释:这一行是函数的声明语句,表示函数的函数名、数据集类型名和执行函数的处理器。Decl是declaration(声明)的缩写。Func是function(函数)的缩写。arg是argument(参数)的缩写。
Ret<返回类型>//注释:这一行表示函数输出参数的类型。Ret是return(返回)的缩写。
Begin
<函数体>
End//注释:Begin和End之间的部分是函数体,函数体包括实现函数功能的程序代码。
例如,基于以上语法编写的压缩函数的定义信息如下。
Decl Func Compress of Dataset Table(“LZ4”)[GPU,CPU]//注释:这一行是压缩函数的声明语句,表示压缩函数的函数名是Compress,压缩函数要处理的数据集的类型是Table类型,执行压缩函数采用的算法类型是LZ4压缩算法,适于执行该压缩函数的GPU和CPU,且优先考虑调度GPU,其次考虑调度CPU。
Ret Table//注释:这一行表示函数输出参数的类型是Table类型。
Begin
……
End
通过采用调度策略三,一方面,开发者能够在定义信息中指定由哪个处理器执行子任务,使得子任务调度至开发者指定的专项处理器上执行,从而满足了开发者的自定义需求。另一方面,随着存储设备的算力提升以及业务需求的增长,当需要将新的任务放在存储设备上执行时,通过在新任务的定义信息中添加专项处理器的标识,即可指明将新的任务调度到哪个专项处理器上,从而降低了调度新任务的难度,因此提高了可扩展性。
调度策略四、按照子任务对应的数据集类型调度。
在一些实施例中,调度策略四的实现方式包括:中央处理器确定每个子任务对应的数据集类型;中央处理器根据第一子任务对应的数据集类型,从存储设备包括的多个专项处理器中,选择与该数据集类型匹配的专项处理器,作为第一专项处理器;中央处理器将第一子任务分配给该第一专项处理器。
其中,数据集类型包括而不限于关系数据表(Table,包括行存和列存)类型、图像(Image)类型、文本(Text)类型等。
例如,第一子任务是压缩,可供选择的处理器包括GPU和CPU。如果压缩对应的数据集类型是图像,由于与图像匹配的处理器是GPU,则中央处理器选择GPU,将对图像进行压缩这种子任务分配给GPU。
如何确定子任务对应的数据集类型包括多种方式。在一些实施例中,根据子任务的定义信息,确定子任务对应的数据集类型。其中,子任务的定义信息包括数据集类型的名称。可选地,数据集类型是开发者自定义的类型。开发者在编写程序代码时,使用declaration语句声明自定义的数据集类型,以便在子任务的定义信息中指定自定义的数据集类型。例如,声明自定义的数据集类型的语法为:
Decl Dataset<数据集类型名>;
例如,基于以上语法,编写了语句:Decl Dataset Foo;这条语句声明了一个名为Foo的数据集类型。
此外,可选地,将每种数据集与对应的函数建立绑定关系。例如,将文本类型的数据集与Count函数建立绑定关系。如果类型为Table的数据集请求调用Count函数,则调用是无效的。如果类型为文本的数据集请求调用Count函数,则调用是允许的。通过这种方式,保证对数据集中的数据进行处理时能够调用正确的函数。
由于不同的专项处理器适于处理不同类型的数据,例如GPU适合处理图像,一些专用的编解码处理器适合处理视频,通过采用调度策略四,考虑了子任务要处理的数据类型与专项处理器本身是否匹配,将子任务调度至其数据集类型匹配的专项处理器上执行,使得专项处理器能够处理自身适合处理的数据,避免由于专项处理器无法识别和处理特定类型的数据而造成任务执行失败的情况,提高了任务执行的成功率。
以上通过调度策略一至调度策略四,列举了几种可能的调度策略。在一些实施例中,不同的调度策略具有不同的优先级,中央处理器根据每种调度策略的优先级,判定使用哪一种调度策略。例如,调度策略一、调度策略二和调度策略三这三种调度策略中,调度策略一的优先级最高,调度策略二的优先级和调度策略三的优先级其次。采用这种优先级顺序时,中央处理器优先考虑数据的归属位置,其次考虑子任务的计算特征以及子任务的定义信息。例如,中央处理器首先判断数据是否位于DIMM或SSD中,如果数据位于DIMM或SSD,且DIMM或SSD支持任务的执行,则按照调度策略一,将子任务分配给DIMM的处理器或SSD的处理器。如果数据不在DIMM或SSD,则中央处理器按照调度策略二或调度策略三,根据子任务的计算特征或者子任务的定义信息选择专项处理器,向选择的专项处理器的内存加载数据,调度选择的专项处理器执行子任务。
在一些实施例中,中央处理器根据拓扑图中记录的多个子任务的执行顺序进行调度。例如,中央处理器根据拓扑图指示第一专项处理器按照顺序执行第一子任务。
拓扑图用于指示多个子任务以及不同子任务执行的先后顺序。具体地,拓扑图包括多个节点和至少一条边。多个节点中的每个节点用于表示多个子任务中的一个子任务。例如,在子任务为函数的情况下,节点包含了函数对应的运算、函数的输入参数、函数的输出参数和执行函数的专项处理器。边连接了不同子任务对应的节点。每条边用于表示不同子任务之间的依赖关系。可选地,拓扑图是有向无环图(Directed acyclic graph,DAG)。DAG是指一个无回路的有向图。
在一些实施例中,使用拓扑图中边的方向记录子任务的执行顺序。例如,如果拓扑图中 第一节点和第二节点之间具有一条边,边的方向是从第一节点至第二节点,即边的起点是第一节点,边的终点是第二节点,那么第二节点对应的子任务先被执行,第一节点对应的子任务后被执行。例如,参见附图3,拓扑图是DAG204,DAG204中节点表示的子任务是函数。附图3所示的DAG204包括5个节点,分别是节点a、节点b、节点c、节点d和节点e。其中,节点a表示函数a,节点b表示函数b,节点c表示函数c,节点d表示函数d,节点e表示函数e。拓扑图具有四条边,分别是从节点a至节点c的边、从节点a至节点b的边、从节点a至节点d的边和从节点c至节点e的边。附图3中DAG204记录的函数的依赖关系以及执行顺序是:函数d和函数e先被执行。函数b和函数c依赖于函数e,函数b和函数c在函数e被执行完之后再被执行。函数a依赖于函数b、函数c和函数d,函数a最后被执行。根据该DAG204,首先,中央处理器会指示DIMM中的DPU执行函数e,指示SSD中的DPU执行函数d;当函数e执行完成之后,中央处理器会指示NPU执行函数b,指示GPU执行函数c;当函数b、函数c和函数d均执行完成后,中央处理器执行函数a。
如何获得拓扑图包括多种方式。在一些实施例中,存储设备收到计算设备发送的任务的定义信息后,对任务的定义信息进行解析,生成拓扑图。例如,参见附图3,存储设备接收到计算设备发送的NDP任务的定义信息后,通过解析器201(Parser)对NDP任务的定义信息进行解析,生成DAG204,从而通过DAG204表示NDP任务中的各个子任务。解析器201输出的DAG204会发送至存储设备包括的执行器202(Executor)。执行器202根据DAG204,依次调度NDP任务中的各个步骤或各个函数到对应的专项处理器上执行,并控制各个步骤或各个函数之间的数据流动。
通过将任务的定义信息解析为拓扑图并使用拓扑图进行调度,一方面,由于拓扑图记录了子任务的执行顺序,中央处理器无需重新计算子任务的执行顺序,能够直接按照拓扑图所记录的执行顺序进行调度,从而减少了调度的工作量。另一方面,目前存在很多基于拓扑图的调度优化算法,能够调用基于拓扑图的调度优化算法优化子任务调度的顺序,从而缩短任务整体的执行时间。
在一些实施例中,对于划分出的多个子任务中的任一个子任务,中央处理器为子任务选择匹配的专项处理器之后,中央处理器判断选择的专项处理器是否可编程。如果选定的专项处理器可编程,中央处理器生成能在该选择的专项处理器上执行的指令。例如,如果选择的专项处理器支持X86指令集,则生成X86指令;如果选择的专项处理器支持ARM指令集,则生成ARM指令。中央处理器指示选定的专项处理器执行指令,从而完成子任务,并对产生的指令进行缓存。当中央处理器下次再调度该子任务到该专项处理器时,便可调用预先缓存的指令来执行子任务,省去指令生成过程。如果选定的专项处理器不可编程,则调用专项处理器中相应的硬件计算模块执行子任务。
在一些实施例中,部署在计算集群的应用使用NDP描述语言定义了如下的NDP任务。NDP协调模块根据fileID查询数据管理装置(Data Scheme Service),获得fileID所对应的文件所归属的存储节点,将NDP任务转发到存储节点。
以上NDP任务的定义信息描述了NDP任务中要执行三个函数,分别是decompress函数、filter函数和count函数。
文件归属的存储节点收到NDP任务后,根据NDP任务的定义信息,描述以后,通过解析器对描述语言进行解析,生成了附图5所示的拓扑图。存储节点在调度过程中,根据数据集的位置和函数的计算特征,将decompress函数调度到SSD上就近执行,然后将filter函数和count函数调度到GPU上执行。在decompress函数完成后,将数据集加载到GPU的内存,生成filter函数和count函数能在GPU上执行的指令,完成函数的功能。
在一些实施例中,以上过程涉及的数据读取过程通过调用数据读取函数实现。数据读取函数为系统定义的函数,用于从文件系统,对象存储等存储系统中读取数据,返回一个数据集对象。例如,数据读取接口包括以下:
RD_File(fileID,offset,length);
RD_Object(key);
RD_Plog(PlogID,offset,length)。
本实施例提供了一种利用存储设备的多种处理器协作处理数据的方法,由存储设备中的中央处理器将数据处理任务划分为多个子任务,根据子任务的属性为子任务分配存储设备中的专项处理器。一方面,由于在进行数据处理的过程中,中央处理器承担了任务分解和任务调度的工作,专项处理器承担了执行子任务的工作,使得中央处理器的算力和专项处理器的算力均得到了充分利用。另一方面,由于分配子任务时考虑了子任务的属性,使得子任务能够依据其属性被调度到合适的专项处理器上执行。因此,该方法提高了数据处理的效率。
以下通过方法400,对上述方法300举例说明。以下方法400应用在分布式存储系统的场景,应用的数据被打散并分布到多个存储节点,每个存储节点具有多种异构的处理器,具体包括CPU、GPU、NPU、DIMM的处理器、SSD的处理器。以下方法400中,数据处理任务是NDP任务,子任务是函数。换句话说,方法400描述的方法流程关于存储节点如何将每个函数调度至多种异构处理器中最合适的处理器上执行。应理解,方法400与方法300同理的步骤还请参见方法300,在方法400中不做赘述。
参见附图6,附图6为本申请实施例提供的一种任务执行方法400的流程图。
示例性地,方法400包括S401至S409。
S401、判断数据是否在DIMM或SSD中。如果数据在DIMM或SSD中,执行以下S402;如果数据不在DIMM且不在SSD中,执行以下S404。
S402、判断DIMM或SSD是否支持该函数。如果DIMM或SSD支持该函数,执行以下S403;如果数据不在DIMM且不在SSD中,执行以下S404。
S403、选定DIMM或SSD作为用于执行该函数的专项处理器,执行以下S406。
S404、根据函数的定义信息指示的专项处理器或函数的计算特征选定专项处理器,执行以下S405。
S405、加载数据集到选定的专项处理器的内存中,执行以下S406。
S406、判断选定的专项处理器是否可编程。如果选定的专项处理器可编程,执行以下S407;如果选定的专项处理器不可编程,执行以下S409。
S407、根据函数的定义信息,产生能在选定的专项处理器中执行的指令,执行以下S408。
S408、使用选定的专项处理器执行指令,完成函数,并缓存指令,以便下次直接调用。
S409、调用选定的专项处理器中相应的硬件模块完成函数。
以上介绍了本申请实施例的任务执行方法,以下介绍本申请实施例的任务执行装置,应理解,该任务执行装置其具有上述方法中存储设备的任意功能。可选地,任务执行装置600在存储设备的控制器上运行,存储设备包括至少一个硬盘。可选地,任务执行装置600在存储设备的中央处理器上运行。
图7是本申请实施例提供的一种任务执行装置的结构示意图,如图7所示,任务执行装置600包括:获取模块601,用于执行S310;划分模块602,用于执行S320;分配模块603,用于执行S330。
应理解,任务执行装置600对应于上述方法300或方法400中的存储设备,任务执行装置600中的各模块和上述其他操作和/或功能分别为了实现上述方法300或方法400中的存储设备所实施的各种步骤和方法,具体细节可参见上述方法300或方法400,为了简洁,在此不再赘述。
应理解,任务执行装置600在执行任务时,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将任务执行装置的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的任务执行装置与上述方法300或方法400属于同一构思,其具体实现过程详见上述方法300或方法400,这里不再赘述。
在一些实施例中,任务执行装置中的获取模块601相当于存储设备中的网卡,任务执行装置中的划分模块602和分配模块603相当于存储设备中的中央处理器。
本领域普通技术人员可以意识到,结合本文中所公开的实施例中描述的各方法步骤和模块,能够以电子硬件、计算机软件或者二者的结合来实现,为了清楚地说明硬件和软件的可互换性,在上述说明中已经按照功能一般性地描述了各实施例的步骤及组成。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。本领域普通技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
所属领域的技术人员可以清楚地了解到,为了描述的方便和简洁,上述描述的系统、装置和模块的具体工作过程,可以参见前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,该模块的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个模块或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另外,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口、装置或模块的间接耦合或通信连接,也可以是电的,机械的或其它的形式连接。
该作为分离部件说明的模块可以是或者也可以不是物理上分开的,作为模块显示的部件可以是或者也可以不是物理模块,即可以位于一个地方,或者也可以分布到多个网络模块上。可以根据实际的需要选择其中的部分或者全部模块来实现本申请实施例方案的目的。
另外,在本申请各个实施例中的各功能模块可以集成在一个处理模块中,也可以是各个模块单独物理存在,也可以是两个或两个以上模块集成在一个模块中。上述集成的模块既可以采用硬件的形式实现,也可以采用软件功能模块的形式实现。
该集成的模块如果以软件功能模块的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分,或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例中方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(read-only memory,ROM)、随机存取存储器(random access memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
本申请中术语“第一”“第二”等字样用于对作用和功能基本相同的相同项或相似项进行区分,应理解,“第一”、“第二””之间不具有逻辑或时序上的依赖关系,也不对数量和执行顺序进行限定。还应理解,尽管以下描述使用术语第一、第二等来描述各种元素,但这些元素不应受术语的限制。这些术语只是用于将一元素与另一元素区别分开。例如,在不脱离各种所述示例的范围的情况下,第一子任务可以被称为第二子任务,并且类似地,第二子任务可以被称为第一子任务。第一子任务和第二子任务都可以是子任务,并且在某些情况下,可以是单独且不同的子任务。
本申请中术语“至少一个”的含义是指一个或多个,本申请中术语“多个”的含义是指两个或两个以上,例如,多个第二专项处理器是指两个或两个以上的第二专项处理器。
还应理解,术语“如果”可被解释为意指“当...时”(“when”或“upon”)或“响应于确定”或“响应于检测到”。类似地,根据上下文,短语“如果确定...”或“如果检测到[所陈述的条件或事件]”可被解释为意指“在确定...时”或“响应于确定...”或“在检测到[所陈述的条件或事件]时”或“响应于检测到[所陈述的条件或事件]”。
以上描述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到各种等效的修改或替换,这些修改或替换都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以权利要求的保护范围为准。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。该计算机程序产品包括一个或多个计算机程序指令。在计算机上加载和执行该计算机程序指令时,全部或部分地产生按照本申请实施例中的流程或功能。该计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。该计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,该计算机程序指令可以从一个网站站点、计算机、服务器或数据中心通过有线或无线方式向另一个网站站点、计算机、服务器或数据中心进行传输。该计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。该可用介质可以是 磁性介质(例如软盘、硬盘、磁带)、光介质(例如,数字视频光盘(digital video disc,DVD)、或者半导体介质(例如固态硬盘)等。
本领域普通技术人员可以理解实现上述实施例的全部或部分步骤可以通过硬件来完成,也可以通过程序来指令相关的硬件完成,该程序可以存储于一种计算机可读存储介质中,上述提到的存储介质可以是只读存储器,磁盘或光盘等。
以上描述仅为本申请的可选实施例,并不用以限制本申请,凡在本申请的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本申请的保护范围之内。
Claims (12)
- 一种任务执行方法,其特征在于,所述方法应用于存储设备中,所述存储设备包括中央处理器和多个专项处理器,所述方法包括:所述中央处理器获取数据处理任务;所述中央处理器将所述数据处理任务划分为多个子任务;所述中央处理器根据各个子任务的属性,将所述多个子任务中的第一子任务分配给第一专项处理器,所述第一专项处理器是所述多个专项处理器的其中一个专项处理器。
- 根据权利要求1所述的方法,其特征在于,所述子任务的属性包括所述子任务所涉及的数据的地址,所述第一专项处理器是距离所述数据最近的专项处理器。
- 根据权利要求1所述的方法,其特征在于,所述子任务的属性包括所述子任务的计算模式和/或并发量,所述第一专项处理器是与所述计算模式和/或并发量匹配的专项处理器。
- 根据权利要求1所述的方法,其特征在于,所述子任务的属性包括所述子任务的定义信息,所述第一专项处理器是所述第一子任务的定义信息指示的专项处理器。
- 根据权利要求1所述的方法,其特征在于,所述子任务的属性包括所述子任务对应的数据集类型,所述第一专项处理器是与所述第一子任务对应的数据集类型匹配的专项处理器。
- 根据权利要求1所述的方法,其特征在于,所述多个子任务的执行顺序被记录在拓扑图中,所述方法还包括:所述中央处理器根据所述拓扑图指示所述第一专项处理器按照顺序执行所述第一子任务。
- 一种存储设备,其特征在于,所述存储设备包括中央处理器和多个专项处理器;所述中央处理器,用于获取数据处理任务;所述中央处理器,用于将所述数据处理任务划分为多个子任务;所述中央处理器,还用于根据各个子任务的属性,将所述多个子任务中的第一子任务分配给第一专项处理器,所述第一专项处理器是所述多个专项处理器的其中一个专项处理器。
- 根据权利要求7所述的存储设备,其特征在于,所述子任务的属性包括所述子任务所涉及的数据的地址,所述第一专项处理器是距离所述数据最近的专项处理器。
- 根据权利要求7所述的存储设备,其特征在于,所述子任务的属性包括所述子任务的计算模式和/或并发量,所述第一专项处理器是与所述计算模式和/或并发量匹配的专项处理器。
- 根据权利要求7所述的存储设备,其特征在于,所述子任务的属性包括所述子任务的定义信息,所述第一专项处理器是所述第一子任务的定义信息指示的专项处理器。
- 根据权利要求7所述的存储设备,其特征在于,所述子任务的属性包括所述子任务对应的数据集类型,所述第一专项处理器是与所述第一子任务对应的数据集类型匹配的专项处理器。
- 根据权利要求7所述的存储设备,其特征在于,所述中央处理器,还用于根据所述拓扑图指示所述第一专项处理器按照顺序执行所述第一子任务。
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP21825322.7A EP4160405A4 (en) | 2020-06-19 | 2021-05-31 | TASK EXECUTION METHOD AND STORAGE DEVICE |
US18/067,492 US20230124520A1 (en) | 2020-06-19 | 2022-12-16 | Task execution method and storage device |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010564326.2 | 2020-06-19 | ||
CN202010564326.2A CN113821311A (zh) | 2020-06-19 | 2020-06-19 | 任务执行方法及存储设备 |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US18/067,492 Continuation US20230124520A1 (en) | 2020-06-19 | 2022-12-16 | Task execution method and storage device |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021254135A1 true WO2021254135A1 (zh) | 2021-12-23 |
Family
ID=78912077
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2021/097449 WO2021254135A1 (zh) | 2020-06-19 | 2021-05-31 | 任务执行方法及存储设备 |
Country Status (4)
Country | Link |
---|---|
US (1) | US20230124520A1 (zh) |
EP (1) | EP4160405A4 (zh) |
CN (1) | CN113821311A (zh) |
WO (1) | WO2021254135A1 (zh) |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115374031A (zh) * | 2021-05-17 | 2022-11-22 | 三星电子株式会社 | 近存储器处理双列直插式存储器模块及其操作方法 |
CN116560785A (zh) * | 2022-01-30 | 2023-08-08 | 华为技术有限公司 | 一种访问存储节点的方法、装置及计算机设备 |
CN116709553A (zh) * | 2022-02-24 | 2023-09-05 | 华为技术有限公司 | 一种任务执行方法及相关装置 |
CN117666921A (zh) * | 2022-08-23 | 2024-03-08 | 华为技术有限公司 | 数据处理方法、加速器及计算设备 |
CN115658325B (zh) * | 2022-11-18 | 2024-01-23 | 北京市大数据中心 | 数据处理方法、装置、多核处理器、电子设备以及介质 |
CN115658277B (zh) * | 2022-12-06 | 2023-03-17 | 苏州浪潮智能科技有限公司 | 一种任务调度方法、装置及电子设备和存储介质 |
CN115951998A (zh) * | 2022-12-29 | 2023-04-11 | 上海芷锐电子科技有限公司 | 任务执行方法、图形处理器、电子设备及存储介质 |
CN116149856A (zh) * | 2023-01-09 | 2023-05-23 | 中科驭数(北京)科技有限公司 | 算子计算方法、装置、设备及介质 |
CN116074179B (zh) * | 2023-03-06 | 2023-07-14 | 鹏城实验室 | 基于cpu-npu协同的高扩展节点系统及训练方法 |
CN116594745A (zh) * | 2023-05-11 | 2023-08-15 | 阿里巴巴达摩院(杭州)科技有限公司 | 任务执行方法、系统、芯片及电子设备 |
CN118642860A (zh) * | 2024-08-15 | 2024-09-13 | 杭州嗨豹云计算科技有限公司 | 一种基于任务自适应匹配的多功能服务器及其应用方法 |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1910553A (zh) * | 2004-01-08 | 2007-02-07 | 皇家飞利浦电子股份有限公司 | 基于存储器要求在多处理器系统中进行任务调度的方法和设备 |
CN101441615A (zh) * | 2008-11-24 | 2009-05-27 | 中国人民解放军信息工程大学 | 面向任务流的高效能立体并行柔性可重构计算架构模型 |
US20150286501A1 (en) * | 2014-04-03 | 2015-10-08 | Strato Scale Ltd. | Register-type-aware scheduling of virtual central processing units |
CN105589829A (zh) * | 2014-09-15 | 2016-05-18 | 华为技术有限公司 | 基于多核处理器芯片的数据处理方法、装置以及系统 |
CN110196775A (zh) * | 2019-05-30 | 2019-09-03 | 苏州浪潮智能科技有限公司 | 一种计算任务处理方法、装置、设备以及可读存储介质 |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8473723B2 (en) * | 2009-12-10 | 2013-06-25 | International Business Machines Corporation | Computer program product for managing processing resources |
US10073715B2 (en) * | 2016-12-19 | 2018-09-11 | Intel Corporation | Dynamic runtime task management |
CN110502330A (zh) * | 2018-05-16 | 2019-11-26 | 上海寒武纪信息科技有限公司 | 处理器及处理方法 |
CN108491263A (zh) * | 2018-03-02 | 2018-09-04 | 珠海市魅族科技有限公司 | 数据处理方法、数据处理装置、终端及可读存储介质 |
US10871989B2 (en) * | 2018-10-18 | 2020-12-22 | Oracle International Corporation | Selecting threads for concurrent processing of data |
CN109885388A (zh) * | 2019-01-31 | 2019-06-14 | 上海赜睿信息科技有限公司 | 一种适用于异构系统的数据处理方法和装置 |
US11106495B2 (en) * | 2019-06-13 | 2021-08-31 | Intel Corporation | Techniques to dynamically partition tasks |
CN110489223B (zh) * | 2019-08-26 | 2022-03-29 | 北京邮电大学 | 一种异构集群中任务调度方法、装置及电子设备 |
CN110532103A (zh) * | 2019-09-09 | 2019-12-03 | 北京西山居互动娱乐科技有限公司 | 一种多任务处理的方法及装置 |
-
2020
- 2020-06-19 CN CN202010564326.2A patent/CN113821311A/zh active Pending
-
2021
- 2021-05-31 EP EP21825322.7A patent/EP4160405A4/en active Pending
- 2021-05-31 WO PCT/CN2021/097449 patent/WO2021254135A1/zh unknown
-
2022
- 2022-12-16 US US18/067,492 patent/US20230124520A1/en active Pending
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1910553A (zh) * | 2004-01-08 | 2007-02-07 | 皇家飞利浦电子股份有限公司 | 基于存储器要求在多处理器系统中进行任务调度的方法和设备 |
CN101441615A (zh) * | 2008-11-24 | 2009-05-27 | 中国人民解放军信息工程大学 | 面向任务流的高效能立体并行柔性可重构计算架构模型 |
US20150286501A1 (en) * | 2014-04-03 | 2015-10-08 | Strato Scale Ltd. | Register-type-aware scheduling of virtual central processing units |
CN105589829A (zh) * | 2014-09-15 | 2016-05-18 | 华为技术有限公司 | 基于多核处理器芯片的数据处理方法、装置以及系统 |
CN110196775A (zh) * | 2019-05-30 | 2019-09-03 | 苏州浪潮智能科技有限公司 | 一种计算任务处理方法、装置、设备以及可读存储介质 |
Non-Patent Citations (1)
Title |
---|
See also references of EP4160405A4 * |
Also Published As
Publication number | Publication date |
---|---|
EP4160405A1 (en) | 2023-04-05 |
US20230124520A1 (en) | 2023-04-20 |
EP4160405A4 (en) | 2023-10-11 |
CN113821311A (zh) | 2021-12-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021254135A1 (zh) | 任务执行方法及存储设备 | |
US10728091B2 (en) | Topology-aware provisioning of hardware accelerator resources in a distributed environment | |
US10691597B1 (en) | Method and system for processing big data | |
US20230169351A1 (en) | Distributed training method based on end-to-end adaption, and device | |
US7830387B2 (en) | Parallel engine support in display driver model | |
WO2016078008A1 (zh) | 调度数据流任务的方法和装置 | |
US20080294872A1 (en) | Defragmenting blocks in a clustered or distributed computing system | |
US10268741B2 (en) | Multi-nodal compression techniques for an in-memory database | |
US20170228422A1 (en) | Flexible task scheduler for multiple parallel processing of database data | |
US9836516B2 (en) | Parallel scanners for log based replication | |
CN110874271B (zh) | 一种海量建筑图斑特征快速计算方法及系统 | |
Arfat et al. | Big data for smart infrastructure design: Opportunities and challenges | |
US11194522B2 (en) | Networked shuffle storage | |
US20210390405A1 (en) | Microservice-based training systems in heterogeneous graphic processor unit (gpu) cluster and operating method thereof | |
Senthilkumar et al. | A survey on job scheduling in big data | |
Cong et al. | CPU-FPGA coscheduling for big data applications | |
CN111756802B (zh) | 一种数据流任务在numa平台上的调度方法及系统 | |
US20240220334A1 (en) | Data processing method in distributed system, and related system | |
Yankovitch et al. | Hypersonic: A hybrid parallelization approach for scalable complex event processing | |
Wang et al. | Improved intermediate data management for mapreduce frameworks | |
US10824640B1 (en) | Framework for scheduling concurrent replication cycles | |
Zheng et al. | Conch: A cyclic mapreduce model for iterative applications | |
US9176910B2 (en) | Sending a next request to a resource before a completion interrupt for a previous request | |
CN112486402A (zh) | 一种存储节点及系统 | |
Thamsen et al. | Adaptive resource management for distributed data analytics |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 21825322 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 2021825322 Country of ref document: EP Effective date: 20221228 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |