WO2023050857A1 - 一种计算节点集群、数据聚合方法和相关设备 - Google Patents

一种计算节点集群、数据聚合方法和相关设备 Download PDF

Info

Publication number
WO2023050857A1
WO2023050857A1 · PCT/CN2022/097285 · CN2022097285W
Authority
WO
WIPO (PCT)
Prior art keywords
aggregation
computing node
data
computing
node
Prior art date
Application number
PCT/CN2022/097285
Other languages
English (en)
French (fr)
Inventor
李秀桥
潘孝刚
陈强
高帅
孙宏伟
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Publication of WO2023050857A1 publication Critical patent/WO2023050857A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0877Cache access modes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/10Program control for peripheral devices
    • G06F13/12Program control for peripheral devices using hardware independent of the central processor, e.g. channel or peripheral processor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/54Interprogram communication

Definitions

  • The present application relates to the field of data processing, and in particular to a computing node cluster, a data aggregation method and related devices.
  • In a distributed data computing system, the data to be accessed can be assigned to multiple computing nodes for separate processing, with each computing node handling part of the data. However, because each computing node runs multiple processes for data processing, the addresses of the data processed by the multiple processes on a single computing node are not contiguous.
  • To improve the efficiency of data writing, each computing node therefore needs to exchange data with the other computing nodes so that the data it holds has contiguous addresses; the computing node can then write that data to the storage nodes.
  • In the related art, however, data can be written to the storage node only after a computing node has finished exchanging data with the other computing nodes, which increases the data access latency.
  • Embodiments of the present application provide a cluster of computing nodes, a data aggregation method, and related devices, which are used to reduce the time delay of distributed computing.
  • In a first aspect, an embodiment of the present application provides a computing node cluster that includes multiple computing nodes, among which are aggregation computing nodes. The multiple computing nodes are used to jointly execute the write operation of data to be written: each of the multiple computing nodes is used to write part of the data to be written into its local cache and then return write success; the aggregation computing node is used to aggregate the partial data stored in the caches of the multiple computing nodes into aggregated data with contiguous addresses and to write the aggregated data to a storage node.
  • In the embodiments of this application, once a computing node returns write success, the input/output (IO) of the corresponding part of the data is complete and the node can process other data.
  • In a traditional data aggregation method, by contrast, the IO completes only after data aggregation has produced the contiguous aggregated data, and only then can each computing node process other data. The embodiments of this application decouple IO from data aggregation.
  • During data aggregation the IO has already completed and the computing nodes can process other data, which releases the CPU and memory resources occupied by the IO during aggregation, improves the utilization of CPU resources, and improves the efficiency of data processing.
  • If multiple write operations of data to be written must be executed, i.e. multiple rounds of IO and data aggregation are needed, each computing node writes partial data (data blocks) of multiple pieces of data to be written, and the aggregation computing node performs multiple data aggregations (IO1, aggregation of the blocks of IO1, IO2, aggregation of the blocks of IO2, ...). Because IO is decoupled from data aggregation, while the aggregation computing node performs one round of aggregation, each computing node can already perform the IO of the next round (for example, IO2 can proceed while the blocks of IO1 are being aggregated). Different rounds of IO and data aggregation thus run in parallel, which reduces the time computing nodes spend waiting and therefore the latency of executing the multiple write operations; a toy sketch of this overlap is given below.
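  • For illustration only (this is not part of the published embodiments), the overlap can be pictured as a two-stage pipeline in which a write call returns as soon as the block reaches the node-local cache while a separate aggregation worker drains the cache in the background. The names `cache`, `task_io` and `aggregator` below are hypothetical; this is a minimal Python sketch, not the patented implementation.

```python
# Hypothetical sketch of decoupled IO and aggregation.
import queue
import threading

cache = queue.Queue()          # stands in for the node-local cache
aggregated_rounds = []         # stands in for data written to the storage node

def task_io(round_id, block):
    """Write one block to the local cache and return immediately ("write success")."""
    cache.put((round_id, block))
    return "success"           # IO for this round is complete; the next round can start

def aggregator(num_rounds, blocks_per_round):
    """Background worker: aggregates blocks of round N while round N+1 IO proceeds."""
    pending, finished = {}, 0
    while finished < num_rounds:
        round_id, block = cache.get()
        pending.setdefault(round_id, []).append(block)
        if len(pending[round_id]) == blocks_per_round:
            aggregated_rounds.append(b"".join(sorted(pending.pop(round_id))))
            finished += 1

worker = threading.Thread(target=aggregator, args=(2, 4))
worker.start()
for r in range(2):                      # two rounds of IO overlap with aggregation
    for i in range(4):
        task_io(r, f"{r}:{i}".encode())
worker.join()
print(len(aggregated_rounds), "rounds aggregated")
```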
  • In an optional implementation, the computing node cluster includes at least two aggregation computing nodes, and each of the at least two aggregation computing nodes is used to aggregate some of the data blocks of the data to be written, where the addresses of those data blocks are contiguous.
  • When aggregating its data blocks, each aggregation computing node is specifically configured to: determine whether a data block it aggregates is local to this aggregation computing node; if not, determine the computing node where the data block is located, obtain the data block from that computing node's cache, and aggregate the obtained data block with the data blocks in this aggregation computing node.
  • In the embodiments of this application, the aggregation computing node can thus determine the computing node that holds a data block it must aggregate and obtain the block from that node, realizing cross-node aggregation of data blocks.
  • In another optional implementation, when aggregating its data blocks, each aggregation computing node is specifically configured to: determine whether a data block it aggregates is local; if so, obtain the data block from the local cache and aggregate it.
  • In the embodiments of this application, the aggregation computing node thus aggregates the data blocks of its own node.
  • In a further optional implementation, the computing node cluster includes at least two aggregation computing nodes, and each of the at least two aggregation computing nodes is used to aggregate some of the data blocks of the data to be written, where the addresses of those data blocks are contiguous.
  • When aggregating its data blocks, each aggregation computing node is specifically configured to: determine whether its cache contains data blocks that are not aggregated by this aggregation computing node; if so, determine the aggregation computing node responsible for each such data block and send the data block to it; and receive, from other computing nodes, the data blocks that this aggregation computing node aggregates, and aggregate them with its own data blocks.
  • In an optional implementation, the multiple computing nodes are specifically configured to jointly execute the write operation of the data to be written according to a task issued by an application server; each aggregation computing node is specifically configured to: determine an aggregation view according to the task; determine, according to the aggregation view, the information of the computing nodes where the data blocks to be aggregated by this aggregation computing node are located; and obtain those data blocks from the corresponding computing nodes according to that information.
  • In the embodiments of this application, the aggregation computing node obtains the data blocks to be aggregated through the aggregation view, which prevents fetching the wrong blocks or missing blocks and ensures the completeness and accuracy of the aggregation result.
  • In an optional implementation, the multiple computing nodes each include a cache, the caches of the multiple computing nodes form a shared cache pool (also simply called a cache pool in this application), and every computing node can access the data in the shared cache pool.
  • In this case, the process by which the aggregation computing node obtains a data block from the determined computing node's cache may specifically include: the aggregation computing node directly reads the data block from that computing node's cache.
  • In the embodiments of this application, because the caches of the computing nodes jointly form a cache pool, the aggregation computing node can read the data blocks to be aggregated directly from the caches of other computing nodes during aggregation. This improves the efficiency with which the aggregation computing node obtains the blocks, reduces the latency of the aggregation process, and thus reduces the latency of the write operation of the data to be written.
  • In another optional implementation, the step of obtaining the data block from the determined computing node's cache may specifically include: the aggregation computing node receives a communication message from the computing node, and the communication message carries the data block to be aggregated by the aggregation computing node.
  • Optionally, the communication message may be a message of a high-speed data transmission protocol.
  • Optionally, the high-speed data transmission protocol may be remote direct memory access (RDMA).
  • In the embodiments of this application, because a high-speed transmission protocol transfers data efficiently, obtaining the data blocks over such a protocol reduces the latency of fetching them, which in turn reduces the latency of the data aggregation process and further reduces the latency of the write operation of the data to be written.
  • In a second aspect, an embodiment of the present application provides a data aggregation method.
  • The method is applied to a computing node cluster that includes multiple computing nodes.
  • The multiple computing nodes include an aggregation computing node, and the multiple computing nodes jointly execute the write operation of data to be written.
  • The method includes: each of the multiple computing nodes writes part of the data to be written into its local cache and returns write success; the aggregation computing node aggregates the partial data stored in the caches of the multiple computing nodes into aggregated data with contiguous addresses and writes the aggregated data to a storage node.
  • In an optional implementation, the computing node cluster includes at least two aggregation computing nodes, each of which is used to aggregate some data blocks of the data to be written, where the addresses of those data blocks are contiguous; the step in which the aggregation computing nodes aggregate the partial data stored in the caches of the multiple computing nodes into aggregated data with contiguous addresses may specifically include: each aggregation computing node determines whether a data block it aggregates is local; if not, it determines the computing node where the data block is located, obtains the data block from that node's cache, and aggregates it with the data blocks in this aggregation computing node.
  • In another optional implementation, with the same division of data blocks among at least two aggregation computing nodes, the aggregation step may specifically include: each aggregation computing node determines whether its cache contains data blocks that are not aggregated by this aggregation computing node; if so, it determines the aggregation computing node responsible for each such block and sends the block to it; and each aggregation computing node receives, from other computing nodes, the data blocks it aggregates and aggregates them with its own data blocks.
  • In an optional implementation, the multiple computing nodes jointly execute the write operation of the data to be written according to a task issued by an application server; before the aggregation computing node aggregates the partial data stored in the caches of the multiple computing nodes into aggregated data with contiguous addresses, the method may further include: determining an aggregation view according to the task; determining, according to the aggregation view, the information of the computing nodes where the data blocks to be aggregated by this aggregation computing node are located; and obtaining those data blocks from the corresponding computing nodes according to that information.
  • In an optional implementation, the multiple computing nodes each include a cache, the caches of the multiple computing nodes form a shared cache pool, and every computing node can access the data in the shared cache pool;
  • the step of obtaining a data block in the method may then specifically include: the aggregation computing node directly reads the data block from the computing node's cache.
  • In an optional implementation, the step of obtaining the data block from the determined computing node's cache may specifically include: the aggregation computing node receives a communication message from the computing node, and the communication message carries the data block.
  • In a third aspect, an embodiment of the present application provides a computing node including a processor, a cache and a network card.
  • The cache is used to store instructions.
  • The processor is used to invoke the instructions so that the computing device executes the data aggregation method of the second aspect.
  • In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium in which a computer program is stored; when the computer program is run, the method of the second aspect is implemented.
  • In a fifth aspect, an embodiment of the present application provides a computer program product comprising computer program code; when the computer program code is run, the method of the second aspect is implemented.
  • FIG. 1 is a schematic diagram of a network architecture applicable to an embodiment of the present application
  • FIG. 2 is a schematic flow chart of the data aggregation method provided by the embodiment of the present application.
  • FIG. 3 is a schematic diagram of the data aggregation method provided by the embodiment of the present application.
  • FIG. 4 is another schematic diagram of the data aggregation method provided by the embodiment of the present application.
  • FIG. 5 is another schematic diagram of the data aggregation method provided by the embodiment of the present application.
  • FIG. 6 is a schematic diagram of a computing node cluster provided by an embodiment of the present application.
  • FIG. 1 is a schematic diagram of a network architecture of a computing system applicable to an embodiment of the present application.
  • the architecture includes an application server, a computing node cluster, and a storage node cluster.
  • the application server runs an application, and the application generates a data access request during operation, and sends the generated data access request to the computing node cluster through the network for processing.
  • the computing node cluster includes multiple computing nodes 110 (3 computing nodes 110 are shown in FIG. 1 , but not limited to 3 computing nodes 110 ), and the computing nodes 110 can communicate with each other.
  • the computing node 110 is a computing device, such as a server, a desktop computer, or a controller of a storage array.
  • the computing node 110 includes at least a processor 112 , a memory 113 and a network card 114 .
  • The processor 112 is a central processing unit (CPU), used for processing data access requests from outside the computing node 110 or requests generated inside the computing node 110.
  • Exemplarily, when the processor 112 receives a write data request (data access request), it temporarily stores the data carried in the write request in the memory 113.
  • When the total amount of data in the memory 113 reaches a certain threshold, the processor 112 sends the data stored in the memory 113 to the storage node 100 for persistent storage.
  • In addition, the processor 112 is also used to compute or process data, for example for metadata management, data deduplication, data compression, storage-space virtualization, and address translation.
  • FIG. 1 shows only one CPU 112 in each computing node 110. In practical applications, a computing node 110 often contains multiple CPUs 112, and each CPU 112 has one or more CPU cores. This embodiment does not limit the number of CPUs or of CPU cores.
  • the cache 113 refers to an internal memory directly exchanging data with the processor. It can read and write data at any time, and the speed is very fast. It is used as a temporary data storage for the operating system or other running programs.
  • the cache includes at least two types of memory, for example, the cache can be either a random access memory or a read only memory (ROM).
  • the random access memory is dynamic random access memory (DRAM), or storage class memory (storage class memory, SCM).
  • DRAM is a semiconductor memory that, like most random access memory (RAM), is a volatile memory device.
  • SCM is a composite storage technology that combines the characteristics of traditional storage devices and memory.
  • Storage-class memory can provide faster read and write speeds than hard disks, but the access speed is slower than DRAM, and the cost is also cheaper than DRAM.
  • However, DRAM and SCM are only illustrative examples in this embodiment, and the cache may also include other random access memories, for example static random access memory (SRAM).
  • The read-only memory may be, for example, programmable read-only memory (PROM) or erasable programmable read-only memory (EPROM).
  • In addition, the cache 113 may also be a dual in-line memory module (DIMM), i.e. a module composed of dynamic random access memory (DRAM), or a solid state drive (SSD).
  • multiple caches 113 and different types of caches 113 may be configured in the computing node 110 .
  • This embodiment does not limit the quantity and type of caches 113 .
  • In addition, the cache 113 can be configured to have a power-failure protection function.
  • Power-failure protection means that the data stored in the cache 113 is not lost when the system loses power and is then powered on again. A cache with power-failure protection is called non-volatile memory.
  • the network card 114 is used for communicating with the storage node 100 .
  • For example, when the total amount of data in the cache 113 reaches a certain threshold, the computing node 110 may send a request to the storage node 100 through the network card 114 to persistently store the data.
  • the computing node 110 may further include a bus for communication between components inside the computing node 110 .
  • Functionally, because the main role of the computing node 110 in FIG. 1 is computation, it can use remote storage for persistence when storing data, so it needs less local storage than a conventional server, which saves cost and space.
  • However, this does not mean that the computing node 110 cannot have local storage.
  • the computing node 110 may also have a small number of built-in hard disks, or a small number of external hard disks.
  • Any computing node 110 can access any storage node 100 in the storage node cluster through the network.
  • the storage node cluster includes a plurality of storage nodes 100 (three storage nodes 100 are shown in FIG. 1 , but not limited to three storage nodes 100).
  • multiple computing nodes 110 in the computing node cluster are configured to jointly execute the write operation of the data to be written according to the data access request issued by the application server.
  • each computing node 110 is configured to write a part of the data to be written into the local cache, and return the write success.
  • Some of the computing nodes in the cluster are aggregation computing nodes. The aggregation computing nodes are used to aggregate the partial data of the data to be written that is stored in the caches 113 of the multiple computing nodes 110 into aggregated data with contiguous addresses, and to write the aggregated data to the storage node.
  • In the embodiments of this application, the aggregated data with contiguous addresses is also referred to as the data to be written.
  • the caches 113 on different computing nodes 110 together form a cache pool.
  • Any computing node 110 in the computing node cluster may obtain data stored on any cache 113 in the cache pool.
  • any computing node 110 in the computing node cluster can directly read the content stored in the cache 113 on other computing nodes 110 in the cache pool.
  • any computing node 110 in the computing node cluster may also obtain data stored in the cache 113 on other computing nodes 110 through communication messages or the like.
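  • As a purely illustrative sketch of the two access paths just described (direct reads from the shared cache pool versus message-based transfer), the toy `CachePool` class below stands in for the pool; its methods are invented names, and a real system would use shared memory or RDMA rather than a Python dictionary.

```python
# Hypothetical sketch of the two ways a node can obtain a block cached on a peer.
class CachePool:
    """Caches of all computing nodes, viewed as one shared pool."""
    def __init__(self):
        self.caches = {}                      # node_id -> {block_id: bytes}

    def write_local(self, node_id, block_id, data):
        self.caches.setdefault(node_id, {})[block_id] = data

    def direct_read(self, node_id, block_id):
        # Path 1: any node reads a peer's cache directly (e.g., over RDMA).
        return self.caches[node_id][block_id]

    def send_message(self, src_node, dst_node, block_id):
        # Path 2: the owning node packages the block into a communication message.
        return {"from": src_node, "to": dst_node,
                "block_id": block_id, "payload": self.caches[src_node][block_id]}

pool = CachePool()
pool.write_local("node3", "B3", b"...")
assert pool.direct_read("node3", "B3") == b"..."
```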
  • the embodiment of the present application provides a data aggregation method, which reduces the time delay of distributed computing by decoupling IO and aggregation.
  • Referring to FIG. 2, the data aggregation method provided by the embodiment of this application includes the following steps.
  • 201. A task process runs on each computing node, and each task process writes at least one data block of the data to be written.
  • FIG. 3 is a schematic diagram of the data aggregation method provided by the embodiment of this application. As shown in FIG. 3, at least one task process runs on each computing node, and each task process writes at least one data block of the data to be written into the memory of the computing node where the process is located. The data blocks written by different task processes belong to the same task.
  • Optionally, before step 201, each computing node may receive a task issued by the application server and, in step 201, write at least one data block of the data to be written according to that task.
  • Taking FIG. 4 as an example, if the task issued by the application server is to compute the product of a 4×4 matrix and a number n, the task process on each computing node can determine its own part of the task and write the corresponding data block.
  • The embodiments of this application use a distributed computing architecture, so when the application server issues a task it assigns multiple task processes to the task, and each task process executes a part of the task.
  • For example, the task of FIG. 4 can be completed by the four task processes shown in FIG. 3.
  • The application server assigns the task to the four task processes and numbers each task process to identify the part of the data each task process handles.
  • Optionally, the application server can deliver, through a message passing interface (MPI) message communication system, the task and the numbers of the task processes assigned to each computing node; each computing node then runs the corresponding task processes according to the task and the numbers (for example, computing node 1 runs task process 1).
  • If the application server delivers tasks to the computing nodes through the MPI message communication system, an MPI component runs on each computing node, and communication between computing nodes is realized through the MPI components.
  • When issuing a task, the application server sends the following to the MPI component on each computing node through the MPI message communication system:
    1) the program code of the task (executed by every computing node when writing data blocks);
    2) the number of task processes assigned to the task (for example, four, as shown in FIG. 3 and FIG. 4);
    3) the number, among all task processes, of the task process assigned to this computing node (for example, in FIG. 3 the task process on computing node 2 is numbered 2);
    4) the information of the computing node corresponding to each task process;
    5) the task-process number corresponding to the aggregation process; and
    6) the data blocks aggregated by the aggregation process.
  • Information 4), the computing node corresponding to each task process, covers the computing nodes of all task processes that complete the task. For any task process, once its number is determined, the address of the computing node where it runs can be determined from information 4), enabling communication with that computing node (i.e., with the data written by that task process).
  • Information 5), the task-process number corresponding to the aggregation process, indicates that the aggregation process used to aggregate data blocks corresponds to one of the task processes.
  • In the embodiments of this application, the aggregation process and its corresponding task process run on the same computing node (the aggregation computing node).
  • Information 6) indicates which task processes' data blocks the aggregation process aggregates.
  • For example, the aggregation process in FIG. 4 (corresponding to task process 2) aggregates data blocks B1, B2, B3 and B4.
  • The example of FIG. 4 aggregates all data blocks through a single aggregation process; in practice there may be multiple aggregation processes, each aggregating a different part of the data blocks, which is not limited here.
  • The task process on each computing node processes its part of the data according to information 1) (the task) and information 3) (its own number).
  • For example, task process 1 is numbered 1, so its sub-task is to compute the product of the first row of the matrix and n; the sub-tasks of the task processes on the other computing nodes follow by analogy and are not repeated here.
  • Note that FIG. 3 is only an example of task allocation and does not limit the number of computing nodes that complete the task; besides the three shown in FIG. 3, the task can be allocated to more or fewer computing nodes, which is not limited here. A sketch of this example is given below.
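  • The matrix example of FIG. 4 can be sketched with mpi4py as follows; this is only an illustration of how the task-process numbering maps onto processes, and the matrix values and the use of mpi4py (rather than the MPI component described above) are assumptions.

```python
# Hypothetical sketch of the FIG. 4 task: each of 4 task processes scales one
# row of a 4x4 matrix by n, producing the data block it will later write.
# Run with: mpiexec -n 4 python task.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()             # task-process number 0..3 (1..4 in the figure)

A = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
n = 2

block = [a * n for a in A[rank]]   # this process's data block (one scaled row)
print(f"task process {rank + 1} produced block {block}")
```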
  • 202. Each computing node writes the at least one data block of each task process into the cache and returns write success.
  • On each computing node, in addition to the task processes run according to the task issued by the application server, a cache process runs for each task process. As shown in FIG. 3, each task process on a computing node has a corresponding cache process.
  • The task process running on each computing node writes the data block placed in memory in step 201 into the cache of the computing node where the task process is located.
  • When the cache process determines that the data block has been written into the cache, it returns write success to the task process.
  • On each computing node, write success indicates that the IO of the corresponding task process is complete, and that task process can go on to process other data.
  • 203. Each computing node obtains the aggregation view of the task.
  • Each computing node can obtain the aggregation view corresponding to the task according to the task issued by the application server.
  • As shown in FIG. 4, the task is completed by four task processes (task processes 1 to 4 in FIG. 3 and FIG. 4), each of which writes a part of the data; the aggregation view indicates the position of each of these four parts in the contiguous aggregated data. Because the application server assigned the task to task processes with different numbers, the aggregation view also indicates the task process corresponding to each of the four parts.
  • Taking FIG. 4 as an example, B1 is the data written by task process 1 (the product of the first matrix row and n); its parameters in the aggregation view, (0, 256) and task process 1, indicate that B1 starts at position 0 of the aggregated data, spans 256 positions, and was written by task process 1. B2's parameters, (256, 256) and task process 2, indicate that B2 starts at position 256, spans 256 positions, and was written by task process 2; the parameters of the other blocks follow by analogy.
  • Note that FIG. 4 is only one example of an aggregation view; the parameters associated with the data blocks written by each task process may also include other content, which is not limited here.
  • For example, the parameters of B2 could instead be (256, 511), indicating that B2 occupies positions 256 through 511 of the aggregated data. A minimal sketch of such a view is given below.
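  • A minimal sketch of such an aggregation view, assuming it is simply a set of (offset, length, task process) records; the helper `blocks_by_node` and the mapping names are hypothetical, although the placement itself follows FIG. 3 (task processes 3 and 4 on computing node 3).

```python
# Hypothetical aggregation view for FIG. 4: each entry gives the block's offset
# in the aggregated data, its length, and the task process that wrote it.
aggregation_view = {
    "B1": {"offset": 0,   "length": 256, "task_process": 1},
    "B2": {"offset": 256, "length": 256, "task_process": 2},
    "B3": {"offset": 512, "length": 256, "task_process": 3},
    "B4": {"offset": 768, "length": 256, "task_process": 4},
}

# Assumed mapping from task process to the computing node that runs it
# (information 4) delivered with the task).
process_to_node = {1: "node1", 2: "node2", 3: "node3", 4: "node3"}

def blocks_by_node(view, proc_to_node):
    """Group the blocks an aggregator must fetch by the node that caches them."""
    grouping = {}
    for block_id, entry in view.items():
        node = proc_to_node[entry["task_process"]]
        grouping.setdefault(node, []).append(block_id)
    return grouping

print(blocks_by_node(aggregation_view, process_to_node))
# {'node1': ['B1'], 'node2': ['B2'], 'node3': ['B3', 'B4']}
```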
  • 204. The aggregation computing node obtains the data blocks from the computing nodes.
  • The multiple computing nodes of the cluster include at least one aggregation computing node. Steps 201 to 203 are performed by every computing node, so the aggregation computing node also performs steps 201 to 203.
  • As noted in step 202, each computing node runs a cache process corresponding to each task process; the cache process on the aggregation computing node is also called the aggregation process.
  • In step 202, each computing node (task process) wrote its data block into the cache, and in step 203 the aggregation computing node obtained the aggregation view. The aggregation process running on the aggregation computing node can therefore determine from the view the task processes corresponding to the data blocks it must aggregate, determine the computing nodes where those task processes run, and obtain the data blocks stored in those nodes' caches.
  • Optionally, if the application server delivered the task through the MPI message communication system, the aggregation process on the aggregation computing node can obtain the address of the computing node of each task process through the MPI component on its own node (see the description of information 4) in step 201).
  • As shown in step 4 of FIG. 4, the aggregation process on the aggregation computing node can determine from the aggregation view that it must aggregate data blocks B1 to B4, where B2 is on its own node, B1 corresponds to task process 1, B3 corresponds to task process 3, and B4 corresponds to task process 4.
  • Optionally, the aggregation computing node can read the corresponding data blocks directly from the caches of other computing nodes. For example, in FIG. 3 the aggregation computing node can determine through its MPI component that task processes 3 and 4, which correspond to data blocks B3 and B4, both run on computing node 3, and therefore read B3 and B4 directly from computing node 3's cache.
  • Alternatively, the aggregation computing node can obtain data blocks on other nodes in other ways, for example by having the computing nodes actively send them: the cache process on each computing node determines from the aggregation view which aggregation process aggregates the data block on its node, obtains the information of the aggregation computing node where that aggregation process runs, and sends the data block cached in step 202 to that aggregation computing node, as sketched below.
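  • A hedged mpi4py sketch of this "active send" path: each cache process looks up which aggregation process owns its block and sends the block there; the rank assignment (the aggregation process co-located with task process 2, i.e. rank 1) and the payloads are assumptions made for the example.

```python
# Hypothetical sketch of computing nodes actively sending cached blocks to the
# aggregator (cf. FIG. 3, where the aggregation process runs alongside task process 2).
# Run with: mpiexec -n 4 python send_blocks.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

AGGREGATOR_RANK = 1            # assumed: the aggregation process lives with task process 2
my_block = (f"B{rank + 1}", bytes([rank + 1]) * 4)   # (block id, cached payload)

if rank == AGGREGATOR_RANK:
    blocks = {my_block[0]: my_block[1]}              # the aggregator's own block
    for _ in range(comm.Get_size() - 1):
        block_id, payload = comm.recv(source=MPI.ANY_SOURCE, tag=0)
        blocks[block_id] = payload
    print("aggregator received:", sorted(blocks))
else:
    comm.send(my_block, dest=AGGREGATOR_RANK, tag=0)
```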
  • Note that the embodiments of this application use the MPI message communication system as an example of how the aggregation computing node obtains data blocks from the computing nodes.
  • The MPI message communication system is only one way to exchange data blocks; besides MPI, data blocks can also be transferred through the parallel network common data form (PnetCDF) or in other ways, which is not limited here.
  • 205. The aggregation computing node aggregates the data blocks according to the aggregation view to obtain contiguous aggregated data.
  • The aggregation computing node aggregates the data blocks obtained from the computing nodes in step 204 according to the aggregation view obtained in step 203, producing contiguous aggregated data.
  • As shown in FIG. 3, step 205 may be executed by the aggregation process running on the aggregation computing node.
  • The aggregation process aggregates data blocks B1, B2, B3 and B4 from computing nodes 1 to 3 according to the aggregation view to obtain the contiguous aggregated data.
  • 206. The aggregation computing node writes the contiguous aggregated data to the storage node.
  • After obtaining the contiguous aggregated data, the aggregation computing node writes it to the storage node; specifically, this step can be performed by the aggregation process. A sketch of steps 205 and 206 is given below.
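  • A minimal sketch of steps 205 and 206 under the same assumed view format as above: blocks are copied into a contiguous buffer at their offsets and the result is handed to the storage node (a local file stands in for the storage node here).

```python
# Hypothetical sketch of assembling blocks into address-contiguous aggregated
# data (step 205) and writing it out (step 206). The view format matches the
# earlier sketch and is an assumption, not the patent's actual format.
def aggregate(view, blocks):
    total = max(e["offset"] + e["length"] for e in view.values())
    buf = bytearray(total)
    for block_id, entry in view.items():
        start = entry["offset"]
        buf[start:start + entry["length"]] = blocks[block_id]
    return bytes(buf)

view = {
    "B1": {"offset": 0, "length": 4}, "B2": {"offset": 4, "length": 4},
    "B3": {"offset": 8, "length": 4}, "B4": {"offset": 12, "length": 4},
}
blocks = {bid: bid.encode() * 2 for bid in view}      # 4-byte stand-in payloads
aggregated = aggregate(view, blocks)

with open("aggregated.bin", "wb") as storage_node:    # stand-in for the storage node
    storage_node.write(aggregated)
```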
  • Optionally, the computing node cluster may also include at least two aggregation computing nodes, each of which implements the aggregation of part of the data blocks of the task.
  • As shown in FIG. 5, aggregation computing node 1 aggregates data blocks C1 and C3, and aggregation computing node 2 aggregates data blocks C2 and C4.
  • In step 204 (the aggregation computing node obtains the data blocks from the computing nodes), aggregation computing node 1 can determine from the aggregation view whether the data blocks it must aggregate, C1 and C3, are local to this aggregation computing node; C3 is not local, so aggregation computing node 1 determines that the computing node where C3 is located is computing node 2, obtains data block C3 from computing node 2, and aggregates data blocks C1 and C3 in step 205.
  • For the process by which an aggregation computing node determines the information of the computing node holding a data block to be aggregated, see the description of step 204; it is not repeated here.
  • Optionally, when obtaining data block C3 from computing node 2, aggregation computing node 1 may read C3 directly from computing node 2's cache.
  • The aggregation procedure of aggregation computing node 2 follows by analogy and is not repeated here.
  • Optionally, the process in which aggregation computing node 1 obtains data block C3 from aggregation computing node 2 may also be realized by aggregation computing node 2 actively sending it.
  • Specifically, in step 204, aggregation computing node 2 can determine from the aggregation view whether the data blocks C3 and C4 it wrote to its cache in step 202 include blocks that are not aggregated by this aggregation computing node (aggregation computing node 2); C3 is not aggregated here, so aggregation computing node 2 determines that the aggregation computing node responsible for aggregating C3 is aggregation computing node 1 and sends C3 to it, so that C3 can be aggregated.
  • In addition, aggregation computing node 2 aggregates the data block C4 cached on its own node with data block C2 received from aggregation computing node 1.
  • For the process by which a computing node determines the information of the aggregation computing node responsible for a data block, see the description of step 204; it is not repeated here. A routing sketch for this example is given below.
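  • For illustration, the split between the two aggregation computing nodes of FIG. 5 can be expressed as a routing decision over the aggregation view; the dictionary layout below is an assumption, but the block placement follows the description above (C1 and C2 cached on node 1, C3 and C4 on node 2).

```python
# Hypothetical sketch of FIG. 5: two aggregators each own a subset of blocks,
# and every node decides which cached blocks to keep and which to forward.
aggregator_of = {"C1": "agg1", "C3": "agg1",     # aggregation computing node 1
                 "C2": "agg2", "C4": "agg2"}     # aggregation computing node 2

local_cache = {"agg1": ["C1", "C2"],             # blocks each node wrote in step 202
               "agg2": ["C3", "C4"]}

def routing(node):
    keep = [b for b in local_cache[node] if aggregator_of[b] == node]
    forward = {b: aggregator_of[b] for b in local_cache[node] if aggregator_of[b] != node}
    return keep, forward

print(routing("agg2"))   # (['C4'], {'C3': 'agg1'}) -- C3 is sent to aggregator 1
```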
  • the method in FIG. 2 is applied to the architecture shown in FIG. 1 .
  • The task process is a process running on the CPU 112, and the cache process/aggregation process can run on the CPU 112 or on the network card 114, which is not limited here. That is, on the aggregation computing node, steps 201 to 206 can all be implemented by the CPU 112; alternatively, steps 201 and 202 (excluding returning write success) can be implemented by the CPU 112 while step 202 (returning write success) and steps 203 to 206 are implemented by the network card 114.
  • In the embodiments of this application, steps 201 and 202 are referred to as the input/output (IO) of each task process, and steps 203 to 206 are referred to as the data aggregation process of the aggregation process.
  • Once write success is returned in step 202, the task process has completed its IO and can process other data.
  • In a traditional data aggregation method, by contrast, the IO completes only after data aggregation has produced the contiguous aggregated data, and only then can each task process handle other data. The embodiments of this application decouple IO from data aggregation: while aggregation is in progress the IO has already completed and the task process can process other data, which releases the CPU and memory resources occupied by the task process during aggregation, improves the utilization of CPU resources, and improves the efficiency of data processing.
  • If multiple write operations of data to be written must be executed during one task, i.e. multiple rounds of IO and data aggregation are needed, each task process writes partial data (data blocks) of multiple pieces of data to be written, and the aggregation computing node performs multiple data aggregations (IO1, aggregation of the blocks of IO1, IO2, aggregation of the blocks of IO2, ...). Because IO is decoupled from data aggregation, while the aggregation process performs one round of aggregation, each task process can already perform the IO of the next round (for example, IO2 can proceed while the blocks of IO1 are being aggregated). Different rounds of IO and data aggregation thus run in parallel, which reduces the time task processes spend waiting and therefore the latency of completing the whole task.
  • At the hardware level, the structure of a computing node is as shown for computing node 110 in FIG. 1.
  • At the software level, as shown in FIG. 6, the computing node cluster 6000 includes multiple computing nodes 6100, and the multiple computing nodes 6100 include at least one aggregation computing node.
  • the two computing nodes 6100 in FIG. 6 are only examples, and do not limit the number of computing nodes 6100 and the number of aggregated computing nodes.
  • Each computing node 6100 includes a write module 6101 and a cache module 6102, and the cache module 6102 on the aggregation computing node 6100 is also called an aggregation module.
  • The writing modules 6101 are used to jointly execute the write operation of the data to be written. Specifically, each writing module 6101 is used to write part of the data to be written into the cache of the computing node where that writing module 6101 is located, and the cache module 6102 is used to return write success to the writing module 6101 after the partial data has been written into the cache of that computing node.
  • The cache module 6102 (aggregation module) on the aggregation computing node is configured to aggregate the partial data of the data to be written stored in the caches of the multiple computing nodes into aggregated data with contiguous addresses, and to write the aggregated data to the storage node.
  • The writing module 6101 is used to implement the steps performed by the task process in the embodiment shown in FIG. 2, namely steps 201 to 202 (excluding returning write success).
  • The cache module 6102 is used to implement returning write success in step 202 through step 204 in FIG. 2; in addition, the cache module 6102 (aggregation module) on the aggregation computing node is also used to implement the steps performed by the aggregation process in FIG. 2, namely steps 204 to 206.
  • Optionally, on the aggregation computing node, the cache module 6102 (aggregation module) may be a functional module in the processor of the aggregation computing node, or it may be a network card on the aggregation computing node; that network card may be the node's own built-in network card used to interact with other devices, or a pluggable network card, which is not limited here.
  • the disclosed system, device and method can be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • The division into units is only a division by logical function; in actual implementation there may be other ways of dividing them, for example multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.
  • If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the technical solution of the present application is essentially or part of the contribution to the prior art or all or part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a storage medium , including several instructions to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the methods described in the various embodiments of the present application.
  • The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of this application disclose a computing node cluster, a data aggregation method and related devices, used to reduce the latency of distributed computing. The computing node cluster provided by the embodiments of this application includes multiple computing nodes, among which are aggregation computing nodes. The multiple computing nodes are used to jointly execute a write operation of data to be written: each of the multiple computing nodes is used to write part of the data to be written into its local cache and then return write success; the aggregation computing node is used to aggregate the partial data stored in the caches of the multiple computing nodes into aggregated data with contiguous addresses and to write the aggregated data to a storage node.

Description

一种计算节点集群、数据聚合方法和相关设备
本申请要求于2021年9月30日提交中国国家知识产权局、申请号为CN202111166666.0、发明名称为“一种计算节点集群、数据聚合方法和相关设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及数据处理领域,尤其涉及一种计算节点集群、数据聚合方法和相关设备。
背景技术
在分布式数据计算系统中,可以将需要访问的数据分配至多个计算节点分别进行处理,每个计算节点处理部分数据,但是由于每个计算节点运行多个进程进行数据处理,对于一个计算节点来说,其中的多个进程所处理的数据的地址并不是连续的,为了提高数据写入的效率,需要与其他计算节点进行数据交换,使每个计算节点中的数据的地址都是连续的,这样计算节点就可以将其中的数据写入存储节点。但是相关技术中,只有在计算节点与其他计算节点交换完数据,才能将数据写入存储节点,这样,增加了数据的访问时延。
发明内容
本申请实施例提供了一种计算节点集群、数据聚合方法和相关设备,用于减小分布式计算的时延。
第一方面,本申请实施例提供了一种计算节点集群,包括多个计算节点,多个计算节点中包括聚合计算节点;多个计算节点用于共同执行待写数据的写入操作,多个计算节点中的每个计算节点用于将待写数据中的部分数据写入本地缓存后,返回写成功;聚合计算节点用于将多个计算节点中的缓存中存储的部分数据聚合为地址连续的聚合数据,并将聚合数据写入存储节点。
在本申请实施例中,计算节点返回写成功,就完成了对应部分数据的输入输出(input output,IO),从而可以进行其他数据的处理。相较于传统的数据聚合方法中,只有完成了数据聚合得到连续的聚合数据,IO才完成,各计算节点才能进行其他数据的处理;本申请实施例实现了IO与数据聚合的解耦,在数据聚合的过程中,IO已完成,计算节点可以进行其他数据的处理,从而在数据聚合的过程中释放了IO所占用的CPU的运算、内存等资源,提升了CPU资源的利用率,并且提升了数据处理的效率。
若需要执行多个待写数据的写入操作,即需要进行多个轮次的IO和数据聚合,则每个计算节点需要写入多个待写数据的部分数据(数据块),聚合计算节点需要进行多次数据聚合(IO1、对IO1中数据块的聚合、IO2、对IO2中数据块的聚合、……);将IO与数据聚合解耦,在聚合计算节点进行前一次数据聚合的过程中,各计算节点可以进行下一次数据聚合所对应的IO(例如对IO1中数据块进行聚合的同时,可以进行IO2),实现了不同轮次的IO与数据聚合的并行,减小了计算节点等待的时延,从而减小了执行多个待写数据的写入操作的时延。
在一种可选的实施方式中,计算节点集群包括至少两个聚合计算节点,至少两个聚合计算节点中的每个聚合计算节点,用于聚合待写入数据中的部分数据块,并且部分数据块的地址连续。每个聚合计算节点在聚合部分数据块时,具体用于:确定本聚合计算节点所聚合的数据块是否在聚合计算节点本地,若不在本地,则确定数据块所在的计算节点,并从确定的计算节点的缓存中获取数据块,并将获取的数据块与本聚合计算节点中的数据块聚合。
在本申请实施例中,聚合计算节点可以确定聚合的数据块所在的计算节点,并从该计算节点获取对应的数据块,实现了跨节点的数据块的聚合。
在一种可选的实施方式中,每个聚合计算节点在聚合部分数据块时,具体用于:确定本聚合计算节点所聚合的数据块是否在本地,若在本地,则从本地缓存中获取数据块,并实现数据块的聚合。
在本申请实施例中,聚合计算节点实现了本节点的数据块的聚合。
在一种可选的实施方式中,计算节点集群包括至少两个聚合计算节点,至少两个聚合计算节点中的每个聚合计算节点,用于聚合待写入数据中的部分数据块,部分数据块的地址连续。每个聚合计算节点在聚合部分数据块时,具体用于:确定本聚合计算节点的缓存中是否包括非本聚合计算节点聚合的数据块,若包括,则确定该数据块的聚合节点,并将该数据块发送至聚合该数据块的聚合计算节点;接收其他计算节点发送的,本聚合计算节点聚合的数据块,并将该数据块与本聚合计算节点的数据块聚合。
在一种可选的实施方式中,多个计算节点具体用于:根据应用服务器下发的任务,共同执行待写数据的写入操作;每个聚合计算节点具体用于:根据任务确定聚合视图;根据该聚合视图,确定本聚合计算节点所要聚合的数据块所在的计算节点信息,并根据该计算节点信息从对应的计算节点获取本聚合计算节点所要聚合的数据块。
在本申请实施例中,聚合计算节点通过聚合视图获取所要聚合的数据块,可以防止错误获取或遗漏获取所要聚合的数据块,确保聚合结果的完整新和准确性。
在一种可选的实施方式中,多个计算节点包括缓存,多个计算节点的缓存构成共享缓存池(本申请中也称为缓存池),每个计算节点都可以访问共享缓存池中的数据;聚合计算节点从所确定的计算节点的缓存中获取数据块的过程,具体可以包括:聚合计算节点直接从计算节点的缓存中读取数据块。
在本申请实施例中,由于各计算节点的缓存共同构成缓存池,因此在聚合计算节点进行数据聚合的过程中,可以直接从其他计算节点的缓存中读取所要聚合的数据块,提升了聚合计算节点获取所要聚合的数据块的效率,减小了数据聚合过程的时延,从而减小了待写数据的写入操作的时延。
在一种可选的实施方式中,从所确定的计算节点的缓存中获取数据块的步骤,具体可以包括:聚合计算节点接收来自计算节点的通信消息,通信消息中包括该聚合计算节点所要聚合的数据块。
在一种可选的实施方式中,通信消息可以是高速数据传输协议的消息。可选的,高速数据传输协议可以是远程直接数据存取(remote direct memory access,RDMA)。
在本申请实施例中,由于高速传输协议的传输效率高,因此聚合计算节点通过高速数据传输协议获取所要聚合的数据块,可以减小获取数据块的时延,从而减小数据聚合过程的时延,进一步减小待写数据的写入操作的时延。
第二方面,本申请实施例提供了一种数据聚合方法,该方法应用于包括多个计算节点的计算节点集群,多个计算节点中包括聚合计算节点,多个计算节点用于共同执行待写数据的写入操作,该方法包括:多个计算节点中的每个计算节点将待写数据中的部分数据写入本地缓存后,返回写成功;聚合计算节点将多个计算节点中的缓存中存储的部分数据聚合为地址连续的聚合数据,并将聚合数据写入存储节点。
第二方面的有益效果参见第一方面,此处不再赘述。
在一种可选的实施方式中,计算节点集群包括至少两个聚合计算节点,至少两个聚合计算节点中的每个聚合计算节点用于聚合待写入数据中的部分数据块,其中,部分数据块的地址连续;聚合计算节点将多个计算节点中的缓存中存储的部分数据聚合为地址连续的聚合数据的步骤,具体可以包括:每个聚合计算节点确定本聚合计算节点所聚合的数据块是否在本地,如果不在本地,则确定该数据块所在的计算节点,并从所确定的计算节点的缓存中获取该数据块与本聚合计算节点中的数据块聚合。
在一种可选的实施方式中,计算节点集群包括至少两个聚合计算节点,至少两个聚合计算节点中的每个聚合计算节点用于聚合待写入数据中的部分数据块,其中,部分数据块的地址连续;聚合计算节点将多个计算节点中的缓存中存储的部分数据聚合为地址连续的聚合数据的步骤,具体可以包括:每个聚合计算节点确定本聚合计算节点的缓存中是否包括非本聚合计算节点聚合的数据块,如果包括,则确定聚合该数据块的聚合计算节点,将该数据块发送至聚合该数据块的聚合计算节点;每个聚合计算节点接收其他计算节点发送的本聚合计算节点聚合的数据块,并与本聚合计算节点的数据块聚合。
在一种可选的实施方式中,多个计算节点具体用于:根据应用服务器下发的任务,共同执行待写数据的写入操作;在聚合计算节点将多个计算节点中的缓存中存储的部分数据聚合为地址连续的聚合数据之前,该方法还可以包括:根据任务确定聚合视图;根据聚合视图,确定本聚合计算节点聚合的数据块所在的计算节点信息,并根据该计算节点信息从对应计算节点获取本聚合计算节点聚合的数据块。
在一种可选的实施方式中,多个计算节点包括缓存,多个计算节点的缓存构成共享缓存池,每个计算节点都可以访问共享缓存池中的数据;从所确定的计算节点的缓存中获取数据块的步骤,具体可以包括:聚合计算节点直接从计算节点的缓存中读取数据块。
在一种可选的实施方式中,从所确定的计算节点的缓存中获取数据块的步骤,具体可以包括:聚合计算节点接收来自计算节点的通信消息,通信消息中包括数据块。
第三方面,本申请实施例提供了一种计算节点,包括处理器、缓存和网卡,缓存用于存储指令,处理器用于调用该指令,以使得计算设备执行前述第二方面的数据聚合方法。
第四方面,本申请实施例提供了一种计算机可读存储介质,该计算机可读存储介质存储有计算机程序,当该计算机程序被运行时,实现前述第二方面的方法。
第五方面,本申请实施例提供了一种计算机程序产品,该计算机程序产品包括:计算机程序代码,当该计算机程序代码被运行时,实现前述第二方面的方法。
附图说明
图1为本申请实施例所适用的网络架构示意图;
图2为本申请实施例提供的数据聚合方法的一个流程示意图;
图3为本申请实施例提供的数据聚合方法的一个示意图;
图4为本申请实施例提供的数据聚合方法的另一示意图;
图5为本申请实施例提供的数据聚合方法的另一示意图;
图6为本申请实施例提供的计算节点集群的一个示意图。
具体实施方式
图1为本申请实施例所适用的计算系统的网络架构示意图,如图1所示,该架构包括应用服务器、计算节点集群和存储节点集群。其中,应用服务器运行有应用,应用在运行中产生数据访问请求,并将所产生的数据访问请求通过网络发送至计算节点集群进行处理。
计算节点集群包括多个计算节点110(图1中示出了3个计算节点110,但不限于3个计算节点110),各个计算节点110之间可以相互通信。计算节点110是一种计算设备,如服务器、台式计算机或者存储阵列的控制器等。
在硬件上,如图1所示,计算节点110至少包括处理器112、内存113和网卡114。其中,处理器112是一个中央处理器(central processing unit,CPU),用于处理来自计算节点110外部的数据访问请求,或者计算节点110内部生成的请求。示例性的,处理器112接收写数据请求(数据访问请求)时,会将这些写数据请求中的数据暂时保存在内存113中。当内存113中的数据总量达到一定阈值时,处理器112将内存113中存储的数据发送给存储节点100进行持久化存储。除此之外,处理器112还用于数据进行计算或处理,例如元数据管理、重复数据删除、数据压缩、虚拟化存储空间以及地址转换等。
图1中一个计算节点110仅示出了一个CPU 112,在实际应用中,计算节点110中CPU 112的数量往往有多个,其中,一个CPU 112又具有一个或多个CPU核。本实施例不对CPU的数量,以及CPU核的数量进行限定。
缓存113是指与处理器直接交换数据的内部存储器,它可以随时读写数据,而且速度很快,作为操作系统或其他正在运行中的程序的临时数据存储器。缓存包括至少两种存储器,例如缓存既可以是随机存取存储器,也可以是只读存储器(read only memory,ROM)。举例来说,随机存取存储器是动态随机存取存储器(dynamic random access memory,DRAM),或者存储级存储器(storage class memory,SCM)。DRAM是一种半导体存储器,与大部分随机存取存储器(random access memory,RAM)一样,属于一种易失性存储器(volatile  memory)设备。SCM是一种同时结合传统储存装置与存储器特性的复合型储存技术,存储级存储器能够提供比硬盘更快速的读写速度,但存取速度上比DRAM慢,在成本上也比DRAM更为便宜。然而,DRAM和SCM在本实施例中只是示例性的说明,缓存还可以包括其他随机存取存储器,例如静态随机存取存储器(static random access memory,SRAM)等。而对于只读存储器,举例来说,可以是可编程只读存储器(programmable read only memory,PROM)、可抹除可编程只读存储器(erasable programmable read only memory,EPROM)等。另外,缓存113还可以是双列直插式存储器模块或双线存储器模块(dual In-line memory module,DIMM),即由动态随机存取存储器(DRAM)组成的模块,还可以是固态硬盘(solid state disk,SSD)。实际应用中,计算节点110中可配置多个缓存113,以及不同类型的缓存113。本实施例不对缓存113的数量和类型进行限定。此外,可对缓存113进行配置使其具有保电功能。保电功能是指系统发生掉电又重新上电时,缓存113中存储的数据也不会丢失。具有保电功能的缓存被称为非易失性存储器。
网卡114用于与存储节点100通信。例如,当缓存113中的数据总量达到一定阈值时,计算节点110可通过网卡114向存储节点100发送请求以对所述数据进行持久化存储。另外,计算节点110还可以包括总线,用于计算节点110内部各组件之间的通信。在功能上,由于图1中的计算节点110的主要功能是计算业务,在存储数据时可以利用远程存储器来实现持久化存储,因此它具有比常规服务器更少的本地存储器,从而实现了成本和空间的节省。但这并不代表计算节点110不能具有本地存储器,在实际实现中,计算节点110也可以内置少量的硬盘,或者外接少量硬盘。
任意一个计算节点110可通过网络访问存储节点集群中的任意一个存储节点100。存储节点集群包括多个存储节点100(图1中示出了三个存储节点100,但不限于三个存储节点100)。
在本申请实施例中,计算节点集群中的多个计算节点110,用于根据应用服务器下达的数据访问请求,共同执行待写数据的写入操作。具体的,每个计算节点110用于将待写数据中的部分数据写入本地缓存后,返回写成功。
计算节点集群中的部分计算节点为聚合计算节点,聚合计算节点用于:将多个计算节点110上缓存113中存储的待写数据中的部分数据,聚合为地址连续的聚合数据;并将聚合数据写入存储节点。在本申请实施例中,地址连续的聚合数据也称为待写数据。
在本申请实施例中,不同计算节点110上的缓存113,共同构成缓存池。计算节点集群中的任一计算节点110,可以获取缓存池中任一缓存113上存储的数据。可选的,计算节点集群中的任一计算节点110,可以直接读取缓存池中其他计算节点110上的缓存113所存储的内容。或者,计算节点集群中的任一计算节点110,也可以通过通信消息等形式获取其他计算节点110上的缓存113所存储的数据。
基于图1所示的架构,本申请实施例提供了一种数据聚合方法,通过IO与聚合的解耦,减小分布式计算的时延。
请参阅图2,本申请实施例提供的数据聚合方法包括:
201、每个计算节点上的任务进程运行,每个任务进程写入待写数据的至少一个数据块。
图3为本申请实施例提供的数据聚合方法的示意图,如图3所示,每个计算节点上都运行着至少一个任务进程,每个任务进程将待写数据的至少一个数据块写入进程所在计算节点的内存中。其中,不同任务进程写入的数据块为同一任务的数据块。
可选的,在步骤201之前,各计算节点可以接收应用服务器下发的任务,并在步骤201中根据该任务写入待写数据的至少一个数据块。
以图4为例,若应用服务器下发的任务为,计算4×4的矩阵与数字n的数乘结果,则各计算节点上的任务进程可以确定各自对应的任务并写入对应的数据块。
本申请实施例是分布式计算的架构,因此应用服务器在下发任务时,会为该任务分配多个任务进程,每个任务进程执行该任务的一部分。示例地,图4的任务可以通过图3所示的4个任务进程实现。应用服务器可以将任务分配至4个任务进程执行,并为每个任务进程编号,以标识不同任务进程所处理的部分数据。可选的,应用服务器可以通过消息传递接口(message passing interface,MPI)消息通信系统,将任务以及分配至对应计算节点的任务进程的编号下发至各计算节点;每个计算节点根据任务和编号运行对应的任务进程(例如计算节点1运行任务进程1)。
若应用服务器通过MPI消息通信系统向各计算节点下发任务,则各计算节点上都运行着MPI组件,通过MPI组件可以实现计算节点间的通信。在下发任务的过程中,应用服务器将通过MPI消息通信系统,向各计算节点上的MPI组件发送的任务包括:
1)运行任务的程序代码(多个计算节点写入数据块时都执行该代码);
2)为该任务分配的任务进程的数量(例如图3和图4中所示的4个);
3)分配给该计算节点的任务进程在所有任务进程中的编号(例如图3中,计算节点2上的任务进程编号为2);
4)各任务进程对应的计算节点的信息;
5)聚合进程所对应的任务进程编号;
6)聚合进程所聚合的数据块。
其中,信息4)各任务进程对应的计算节点的信息,包括完成该任务的所有任务进程所在的计算节点的信息。对于任意一个任务进程,只要确定该任务进程的编号,就可以根据上述信息4),确定该任务进程所在的计算节点的地址,从而实现与该计算节点(任务进程写入的数据)的通信。
信息5)聚合进程所对应的任务进程编号,表示用于聚合数据块的聚合进程,对应于多个任务进程中的某一个。在本申请实施例中,聚合进程与对应的任务进程运行在同一计算节点(聚合计算节点)上。
信息6)表示该聚合进程用于实现哪些任务进程所写入数据块的聚合。例如图4中的聚合进程(对应于任务进程2),用于实现数据块B1、B2、B3和B4的聚合。图4的示例中通过一个聚合进程实现所有数据块的聚合;实际上聚合进程可能有多个,分别用于聚合不同部分的数据块,此处不做限定。
各计算节点上的任务进程根据上述信息1)任务和信息3)自己对应的编号,处理对应的数据。例如任务进程1的编号为1,则对应的任务为计算矩阵第一行与n的数乘结果, 其他计算节点上任务进程的任务以此类推,此处不再赘述。
值得注意的是,图3仅是对任务分配的一个示例,并不造成对完成任务的计算节点数量的限定,除了图3所示的3个,可以将任务分配至更多或更少的计算节点完成,此处不做限定。
202、每个计算节点将每个任务进程对应的至少一个数据块写入缓存并返回写成功。
在每个计算节点上,除了根据应用服务器下发的任务运行任务进程,还会对应于每个任务进程运行缓存进程。如图3所示,每个计算节点上的任务进程,都对应运行着一个缓存进程。
每个计算节点上运行的任务进程,将步骤201中写入内存的数据块,写入任务进程所在计算节点的缓存上,缓存进程确定数据块被写入缓存,则向任务进程返回写成功。
在每个计算节点上,写成功表示对应任务进程的IO完成,则该任务进程可以进行其他数据的处理。
203、每个计算节点获取任务的聚合视图。
每个计算节点可以根据应用服务器下发的任务,获取任务对应的聚合视图。如图4所示,任务由4个任务进程(即图3和图4中的任务进程1至4)完成,各任务进程写入一部分数据,则聚合视图指示了这4部分数据在连续的聚合数据中的位置。由于在应用服务器下发任务时,将任务分配至不同编号的任务进程处理,因此此处的聚合视图还指示了这4部分数据各自对应的任务进程。
以图4为例,B1为任务进程1执行任务后写入的部分数据(即矩阵第1行与n的数乘结果,B1=(A11×n,A12×n,A13×n,A14×n)),在聚合视图中,B1对应的参数为(0,256)和任务进程1,表示B1在聚合数据中的位置为从文件头的第0位开始,以及B1的区段范围/偏移范围(offset)为256位,以及写入B1的任务进程为任务进程1;B2对应的参数为(256,256)和任务进程1,表示B1在聚合数据中的位置为从文件头的第256位开始,以及B2的区段范围/偏移范围(offset)为256位,以及写入B2的任务进程为任务进程2;其他计算节点生成的部分数据在聚合视图中的参数及其含义以此类推,不再赘述。
值得注意的是,图4仅是聚合视图的一个示例,聚合视图中各任务进程写入的数据块所对应的参数,也可以包括其他内容,此处不作限定。例如B2对应的参数可以包括(256,511),表示B2在聚合数据中的位置为从第256位至第511位。
204、聚合计算节点获取来自各计算节点的数据块。
计算节点集群的多个计算节点中,包括至少一个聚合节点,前述步骤201至203是每个计算节点执行的动作,所以聚合节点也执行了步骤201至203。
由步骤202可知,每个计算节点上都运行着对应于任务进程的缓存进程,在聚合计算节点上的缓存进程,也称为聚合进程。
在步骤202中,各计算节点(任务进程)将各自对应的数据块写入缓存;在步骤203中,聚合节点获取了聚合视图;则聚合节点上运行的聚合进程可以根据聚合视图,确定所要聚合的数据块对应的任务进程,从而确定这些任务进程所在的计算节点,进而获取这些计算节点的缓存上存储的数据块。
可选的,若应用服务器通过MPI消息通信系统下发任务,则聚合计算节点上的聚合进程可以通过本节点上的MPI组件,获取各任务进程所在的计算节点的地址(参见步骤201中对信息4)的说明)。
如图4中的步骤4所示,聚合计算节点上的聚合进程可以根据聚合视图,确定所要聚合的数据块B1至B4,其中B2在本节点上,数据块B1对应任务进程1、数据块B3任务进程3,数据块B4对应任务进程4。
可选的,聚合计算节点可以直接从其他计算节点的缓存中读取对应的数据块,例如图3中聚合节点可以根据本节点上的MPI组件,确定数据块B3和B4所对应的任务进程3和4,均在计算节点3上,从而从计算节点3的缓存中直接读取B3和B4。
除此之外,聚合计算节点也可以通过其他方式获取其他节点上的数据块。例如通过计算节点主动发送数据块,各计算节点上的缓存进程可以根据聚合视图,确定该节点上的数据块被哪个聚合进程聚合,从而获取该聚合进程所在的聚合计算节点的信息,并根据该聚合计算节点的信息,向聚合节点发送步骤202中缓存的数据块。
值得注意的是,本申请实施例以MPI消息通信系统为例,说明聚合计算节点如何从各计算节点获取数据块,MPI消息通信系统仅是交互数据块的一种实现方式,除了MPI消息通信系统,也可以通过并行网络通用数据格式(parallel network common data form,PnetCDF)或者其他方式实现数据块的传输,此处不做限定。
205、聚合计算节点根据聚合视图对数据块进行聚合,得到连续的聚合数据。
聚合计算节点可以根据步骤203中获取的聚合视图,将步骤204中获取的来自各计算节点的数据块进行聚合,得到连续的聚合数据。
如图3所示,步骤205可以由聚合节点上运行的聚合进程执行。聚合进程根据聚合视图,将来自计算节点1至3的数据块B1、B2、B3和B4聚合,得到连续的聚合数据。
206、聚合计算节点将连续的聚合数据写入存储节点。
聚合计算节点获取了连续的聚合数据,即可将连续的聚合数据写入存储节点。具体的,该步骤可以由聚合进程执行。
可选的,计算节点集群中也可以包括至少两个聚合计算节点,每个聚合计算节点可以用于实现任务中部分数据块的聚合。
如图5所示,聚合计算节点1用于实现数据块C1和C3的聚合,集合计算节点2用于实现数据块C2和C4的聚合。则在步骤204(聚合计算节点获取来自各计算节点的数据块)中,聚合计算节点1可以根据聚合视图,确定所要聚合的数据块C1和C3是否在本聚合计算节点的本地,其中C3不在本地,则聚合计算节点1可以确定C3所在的计算节点为计算节点2,从而从计算节点2处获取数据块C3,并在步骤205中实现数据块C1和C3的聚合。其中,聚合计算节点确定所要聚合数据块的计算节点的信息的过程,参见步骤204中的说明,此处不再赘述。
可选的,在聚合计算节点1从计算节点2获取数据块C3的过程中,聚合计算节点1可以从计算节点2的缓存中直接读取数据块C3。聚合计算节点2的聚合过程以此类推,不再 赘述。
可选的,聚合计算节点1从聚合计算节点2获取数据块C3的过程,也可以通过聚合计算节点2的主动发送实现。具体的,在步骤204,聚合计算节点2可以根据聚合视图,确定步骤202中写入缓存的数据块C3和C4中是否包括不在本聚合计算节点(聚合计算节点2)聚合的数据块,其中C3不在本聚合计算节点聚合,则聚合计算节点2可以确定用于聚合数据块C3的聚合计算节点为聚合计算节点1,从而向聚合计算节点1发送数据块C3,以实现数据块C3的聚合。并且,聚合计算节点2可以将缓存在本节点上的数据块C4,与来自聚合计算节点1的数据块C2聚合。其中,计算节点确定用于聚合对应数据块的聚合计算节点的信息的过程,参见步骤204中的说明,此处不再赘述。
图2的方法应用于图1所示的架构中,任务进程是运行在CPU112上的进程,缓存进程/聚合进程可以运行在CPU112上,也可以运行在网卡114上,此处不做限定。也就是说,在聚合计算节点上,步骤201至206可以由CPU112实现;也可以步骤201和202(不包括返回写成功)由CPU112实现,步骤202(返回写成功)和203至206由网卡114实现。
在本申请实施例中,步骤201和202称为各任务进程的输入输出(input-output,IO),步骤203至206称为聚合进程的数据聚合过程。步骤202中返回写成功,任务进程就完成了IO,从而可以进行其他数据的处理。相较于传统的数据聚合方法中,只有完成了数据聚合得到连续的聚合数据,IO才完成,各任务进程才能进行其他数据的处理;本申请实施例实现了IO与数据聚合的解耦,在数据聚合的过程中,IO已完成,任务进程可以进行其他数据的处理,从而在数据聚合的过程中释放了任务进程所占用的CPU的运算、内存等资源,提升了CPU资源的利用率,并且提升了数据处理的效率。
若在一个任务的执行过程中,需要执行多个待写数据的写入操作,即需要进行多个轮次的IO和数据聚合,则每个任务进程需要写入多个待写数据的部分数据(数据块),聚合计算节点需要进行多次数据聚合(IO1、对IO1中数据块的聚合、IO2、对IO2中数据块的聚合、……);将IO与数据聚合解耦,在聚合进程进行前一次数据聚合的过程中,各任务进程可以进行下一次数据聚合所对应的IO(例如对IO1中数据块进行聚合的同时,可以进行IO2),实现了不同轮次的IO与数据聚合的并行,减小了任务进程等待的时延,从而减小了完成整个任务的时延。
上面介绍了本申请实施例的实施架构和方法流程,接下来说明本申请实施例提供的计算设备。
在硬件层面上,计算节点的结构如图1中的计算节点110所示。如图6所示,在软件层面上,计算节点集群6000包括多个计算节点6100,多个计算节点6100中包括至少一个聚合计算节点。图6中的2个计算节点6100仅为示例,并不造成对计算节点6100数量以及聚合计算节点数量的限定。
每个计算节点6100包括写入模块6101和缓存模块6102,聚合计算节点6100上的缓存模块6102也称为聚合模块。
写入模块6101用于共同执行待写数据的写入操作。具体的,每个写入模块6101用于将待写数据中的部分数据写入该写入模块6101所在计算节点的缓存;缓存模块6102用于在部分数据写入对应写入模块6101所在计算节点的计算节点的缓存后,向写入模块6101返回写成功。
聚合计算节点上的缓存模块6102(聚合模块),用于将多个计算节点上缓存中存储的待写数据中的部分数据,聚合为地址连续的聚合数据;并将聚合数据写入存储节点。
其中,写入模块6101用于实现图2所示实施例中任务进程所执行的步骤,即步骤201至202(不包括返回写成功)。
缓存模块6102用于实现图2中的步骤202的返回写成功至步骤204;除此之外,聚合计算节点上的缓存模块6102(聚合模块)还用于实现图2中聚合进程所执行的步骤,即步骤204至206。
可选的,在聚合计算节点上,缓存模块6102(聚合模块)可以是聚合计算节点中处理器中的一个功能模块,也可以是聚合计算节点上的网卡,该网卡可以是聚合计算节点本身固有的用于与其他设备交互的网卡,也可以是可插拔的网卡,此处不做限定。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统,装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统,装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(read-only memory,ROM)、随机存取存储器(random access memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。

Claims (11)

  1. 一种计算节点集群,包括多个计算节点,所述多个计算节点中包括聚合计算节点;
    所述多个计算节点用于共同执行待写数据的写入操作,所述多个计算节点中的每个计算节点用于将所述待写数据中的部分数据写入本地缓存后,返回写成功;
    所述聚合计算节点用于将所述多个计算节点中的缓存中存储的所述部分数据聚合为地址连续的聚合数据,并将所述聚合数据写入存储节点。
  2. 根据权利要求1所述的计算节点集群,其特征在于,所述计算节点集群包括至少两个聚合计算节点,所述至少两个聚合计算节点中的每个聚合计算节点用于聚合所述待写入数据中的部分数据块,所述部分数据块的地址连续;每个聚合计算节点在聚合部分数据块时,具体用于:
    确定本聚合计算节点所聚合的数据块是否在本地,如果不在本地,则确定所述数据块所在的计算节点,并从所确定的计算节点的缓存中获取所述数据块与本聚合计算节点中的数据块聚合。
  3. 根据权利要求1所述的计算节点集群,其特征在于,所述计算节点集群包括至少两个聚合计算节点,所述至少两个聚合计算节点中的每个聚合计算节点用于聚合所述待写入数据中的部分数据块,所述部分数据块的地址连续;每个聚合计算节点在聚合部分数据块时,具体用于:
    确定本聚合计算节点的缓存中是否包括非本聚合计算节点聚合的数据块,如果包括,则确定聚合所述数据块的聚合计算节点,将所述数据块发送至聚合所述数据块的聚合计算节点;
    接收其他计算节点发送的本聚合计算节点聚合的数据块,并与本聚合计算节点的数据块聚合。
  4. 根据权利要求1至3中任一项所述的计算节点集群,其特征在于,所述多个计算节点具体用于:根据应用服务器下发的任务,共同执行所述待写数据的写入操作;
    所述每个聚合计算节点具体用于:
    根据所述任务确定聚合视图;
    根据所述聚合视图,确定本聚合计算节点聚合的数据块所在的计算节点信息,并根据所述计算节点信息从对应计算节点获取本聚合计算节点聚合的数据块。
  5. 根据权利要求2所述的计算节点集群,其特征在于,所述多个计算节点包括缓存,所述多个计算节点的缓存构成共享缓存池,所述每个计算节点都可以访问所述共享缓存池中的数据;
    所述从所确定的计算节点的缓存中获取所述数据块,包括:
    所述聚合计算节点直接从所述计算节点的缓存中读取所述数据块。
  6. 一种数据聚合方法,其特征在于,所述方法应用于包括多个计算节点的计算节点集群,所述多个计算节点中包括聚合计算节点,所述多个计算节点用于共同执行待写数据的写入操作,所述方法包括:
    所述多个计算节点中的每个计算节点将所述待写数据中的部分数据写入本地缓存后, 返回写成功;
    所述聚合计算节点将所述多个计算节点中的缓存中存储的所述部分数据聚合为地址连续的聚合数据,并将所述聚合数据写入存储节点。
  7. 根据权利要求6所述的方法,其特征在于,所述计算节点集群包括至少两个聚合计算节点,所述至少两个聚合计算节点中的每个聚合计算节点用于聚合所述待写入数据中的部分数据块,所述部分数据块的地址连续;
    所述聚合计算节点将所述多个计算节点中的缓存中存储的所述部分数据聚合为地址连续的聚合数据,包括:
    所述每个聚合计算节点确定本聚合计算节点所聚合的数据块是否在本地,如果不在本地,则确定所述数据块所在的计算节点,并从所确定的计算节点的缓存中获取所述数据块与本聚合计算节点中的数据块聚合。
  8. 根据权利要求6所述的方法,其特征在于,所述计算节点集群包括至少两个聚合计算节点,所述至少两个聚合计算节点中的每个聚合计算节点用于聚合所述待写入数据中的部分数据块,所述部分数据块的地址连续;
    所述聚合计算节点将所述多个计算节点中的缓存中存储的所述部分数据聚合为地址连续的聚合数据,包括:
    所述每个聚合计算节点确定本聚合计算节点的缓存中是否包括非本聚合计算节点聚合的数据块,如果包括,则确定聚合所述数据块的聚合计算节点,将所述数据块发送至聚合所述数据块的聚合计算节点;
    所述每个聚合计算节点接收其他计算节点发送的本聚合计算节点聚合的数据块,并与本聚合计算节点的数据块聚合。
  9. 根据权利要求6至8中任一项所述的方法,其特征在于,所述多个计算节点具体用于:根据应用服务器下发的任务,共同执行所述待写数据的写入操作;
    在所述聚合计算节点将所述多个计算节点中的缓存中存储的所述部分数据聚合为地址连续的聚合数据之前,所述方法还包括:
    根据所述任务确定聚合视图;
    根据所述聚合视图,确定本聚合计算节点聚合的数据块所在的计算节点信息,并根据所述计算节点信息从对应计算节点获取本聚合计算节点聚合的数据块。
  10. 根据权利要求7所述的方法,其特征在于,所述多个计算节点包括缓存,所述多个计算节点的缓存构成共享缓存池,所述每个计算节点都可以访问所述共享缓存池中的数据;
    所述从所确定的计算节点的缓存中获取所述数据块,包括:
    所述聚合计算节点直接从所述计算节点的缓存中读取所述数据块。
  11. 一种计算节点,其特征在于,包括处理器、缓存和网卡,所述缓存用于存储指令,所述处理器用于调用所述指令,以使得所述计算设备执行如权利要求6至10中任一项所述的数据聚合方法。
PCT/CN2022/097285 2021-09-30 2022-06-07 一种计算节点集群、数据聚合方法和相关设备 WO2023050857A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111166666.0A CN115878311A (zh) 2021-09-30 2021-09-30 一种计算节点集群、数据聚合方法和相关设备
CN202111166666.0 2021-09-30

Publications (1)

Publication Number Publication Date
WO2023050857A1 true WO2023050857A1 (zh) 2023-04-06

Family

ID=85756744

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/097285 WO2023050857A1 (zh) 2021-09-30 2022-06-07 一种计算节点集群、数据聚合方法和相关设备

Country Status (2)

Country Link
CN (1) CN115878311A (zh)
WO (1) WO2023050857A1 (zh)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101183298A (zh) * 2007-12-26 2008-05-21 杭州华三通信技术有限公司 一种scsi数据读写方法、系统和装置
CN101187906A (zh) * 2006-11-22 2008-05-28 国际商业机器公司 用于提供高性能可缩放文件i/o的系统和方法
CN103389884A (zh) * 2013-07-29 2013-11-13 华为技术有限公司 处理输入/输出请求的方法、宿主机、服务器和虚拟机
US20190095284A1 (en) * 2017-09-28 2019-03-28 International Business Machines Corporation Enhanced application write performance


Also Published As

Publication number Publication date
CN115878311A (zh) 2023-03-31


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22874255

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2022874255

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2022874255

Country of ref document: EP

Effective date: 20240403

NENP Non-entry into the national phase

Ref country code: DE