WO2021213076A1 - Method and device for constructing a communication topology based on multiple processing nodes - Google Patents

Method and device for constructing a communication topology based on multiple processing nodes

Publication number
WO2021213076A1
Authority
WO
WIPO (PCT)
Prior art keywords
node
processing
information
configuration information
communication
Prior art date
Application number
PCT/CN2021/080889
Other languages
English (en)
French (fr)
Inventor
朝鲁
梁帆
柴庆龙
张潇
高燕强
孙咏哲
李志勇
张晨
孟天
Original Assignee
中科寒武纪科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中科寒武纪科技股份有限公司
Priority to US17/920,961 (US12050545B2)
Priority to EP21793522.0A (EP4141685A4)
Publication of WO2021213076A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163Interprocessor communication
    • G06F15/173Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163Interprocessor communication
    • G06F15/173Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G06F15/17306Intercommunication techniques
    • G06F15/17318Parallel communications techniques, e.g. gather, scatter, reduce, broadcast, multicast, all to all
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163Interprocessor communication
    • G06F15/173Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G06F15/17306Intercommunication techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163Interprocessor communication
    • G06F15/173Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G06F15/17356Indirect interconnection networks
    • G06F15/17368Indirect interconnection networks non hierarchical topologies
    • G06F15/17375One dimensional, e.g. linear array, ring
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/177Initialisation or configuration control
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Definitions

  • This application relates to the field of artificial intelligence, and more specifically, to the field of multi-processor inter-chip communication.
  • for a single-node training time T and N nodes, the training time should ideally be T/N, which is also called the ideal linear speedup ratio.
  • the ideal linear speedup is unattainable in practice because of the communication overhead that distribution introduces.
  • the computation part can be accelerated linearly, whereas the communication part (such as the AllReduce algorithm) is inherent and cannot be eliminated.
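The gap between the ideal and the achievable speedup can be written out explicitly. The following is a minimal sketch, assuming the single-node training time $T$ is pure computation that divides evenly across $N$ nodes, and introducing a hypothetical per-run communication cost $T_{\text{comm}}$ that the original text does not define:

```latex
T_N = \frac{T}{N} + T_{\text{comm}}, \qquad
S(N) = \frac{T}{T_N} = \frac{N}{1 + N\,T_{\text{comm}}/T} < N
\quad \text{for } T_{\text{comm}} > 0 .
```

Under these assumptions the speedup $S(N)$ reaches $N$ only when $T_{\text{comm}} = 0$, which is why reducing communication overhead matters as $N$ grows.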
  • each node participating in the distributed training needs to pass its own back-propagation (BP) gradient information ΔWi to the other nodes, so that eventually each node obtains all the gradient information, that is, ΣΔWi.
  • the AllReduce algorithm can be implemented on different network topologies; the variant optimized for the ring topology (Ring) is the Ring AllReduce algorithm.
  • the core process that AllReduce needs to implement is: Receive (abbreviated as R), Compute (abbreviated as C), and Send (abbreviated as S).
  • the R part corresponds to receiving the gradient information ΔWi-1 sent by the upstream node
  • the S part corresponds to sending the computed gradient information ΔWi downstream.
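To make the R-C-S cycle concrete, the following is a minimal single-process simulation of the widely known Ring AllReduce scheme (a scatter-reduce phase followed by an all-gather phase). It is an illustrative sketch, not the implementation described in this disclosure: the function name `ring_allreduce`, the chunking scheme, and the use of plain Python lists are all assumptions.

```python
from typing import List


def ring_allreduce(grads: List[List[float]]) -> List[List[float]]:
    """Simulate a sum-AllReduce over n nodes arranged in a ring.

    grads[i] is the gradient vector held by node i; its length is assumed
    to be divisible by the number of nodes.
    """
    n = len(grads)
    size = len(grads[0])
    assert size % n == 0, "vector length must be divisible by node count"
    k = size // n
    data = [list(g) for g in grads]  # each node's local buffer

    def chunk(c: int) -> slice:
        return slice(c * k, (c + 1) * k)

    # Phase 1 (scatter-reduce): n-1 rounds of Send, Receive, Compute.
    for step in range(n - 1):
        # S: every node sends one chunk to its downstream neighbour.
        sent = [(i, (i - step) % n, data[i][chunk((i - step) % n)])
                for i in range(n)]
        for src, c, payload in sent:
            dst = (src + 1) % n
            # R + C: the receiver adds the incoming chunk to its own copy.
            sl = chunk(c)
            data[dst][sl] = [a + b for a, b in zip(data[dst][sl], payload)]

    # Phase 2 (all-gather): n-1 rounds that circulate the reduced chunks.
    for step in range(n - 1):
        sent = [(i, (i + 1 - step) % n, data[i][chunk((i + 1 - step) % n)])
                for i in range(n)]
        for src, c, payload in sent:
            dst = (src + 1) % n
            data[dst][chunk(c)] = payload  # R: overwrite with reduced chunk

    return data
```

For three nodes holding `[1, 2, 3]`, `[10, 20, 30]`, and `[100, 200, 300]`, every node ends up with the element-wise sum. Each node only ever talks to its ring neighbours, which is the property the upstream/downstream configuration described later relies on.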
  • the problem to be solved by the present disclosure is how to support the R-C-S process completely at the processing device under the premise of efficient use of computing resources and no introduction of chip thread management capabilities.
  • the purpose of the present disclosure is to solve the shortcomings of unreasonable occupation of computing resources and wasted computing resources in the prior art.
  • a method for constructing a communication topology based on multiple processing nodes, including: constructing node configuration information, the node configuration information including upstream node information, current node information, and downstream node information; and sending the node configuration information to at least two processing nodes to construct the communication topology.
  • a device for constructing a communication topology based on multiple processing nodes, including: a first device for constructing node configuration information, the node configuration information including upstream node information, current node information, and downstream node information; and a second device for sending the node configuration information to at least two processing nodes to construct the communication topology.
  • a system for constructing a communication topology based on multiple processing nodes, including: a plurality of processing nodes; and a host, the host including a building module, the building module including: a first device for constructing node configuration information, the node configuration information including upstream node information, current node information, and downstream node information; and a second device for sending the node configuration information to at least two processing nodes to construct the communication topology.
  • an electronic device including: one or more processors; and a memory in which computer-executable instructions are stored. When the computer-executable instructions are run by the one or more processors, the electronic device executes the method described above.
  • a computer-readable storage medium including computer-executable instructions, and when the computer-executable instructions are executed by one or more processors, the method described above is executed.
  • the resource pre-application method of the present disclosure ensures consistent occupation of multi-node resources in a distributed scenario and reduces resource deadlocks caused by some nodes of the processing device failing to obtain sufficient resources; it also provides automatic routing of data receiving, computing, and sending on the processing device, without requiring the host to actively intervene in the execution process of the processing device; and it is more user-friendly, since users need not understand the underlying hardware structure or the complex configuration of descriptors or templates, which reduces the development complexity of distributed tasks (such as AllReduce).
  • Fig. 1 shows a schematic structural diagram of a processing node according to an embodiment of the present disclosure.
  • Fig. 2 shows a schematic diagram of the connection relationship between one processing node and other processing nodes according to an embodiment of the present disclosure.
  • Fig. 3 shows a schematic diagram of a system environment to which the method according to the present disclosure can be applied.
  • Fig. 4a shows a flowchart of a method for constructing a communication topology based on multi-processing nodes according to an embodiment of the present disclosure.
  • Fig. 4b shows a schematic diagram of a multi-processing node system based on a multi-processing node to construct a communication topology structure according to an embodiment of the present disclosure.
  • Figures 5a-5c show schematic diagrams of setting multiple node configuration information for a single node according to an embodiment of the present disclosure, wherein Figure 5a shows a situation where a single node has multiple inputs; Figure 5b shows a situation where a single node has multiple outputs; and Figure 5c shows a situation where a single node has multiple inputs and multiple outputs.
  • Figs. 6a to 6c respectively show, by way of example, schematic diagrams of a chain topology, a ring topology, and a tree topology.
  • Fig. 7 shows a device for constructing a communication topology based on multi-processing nodes according to an embodiment of the present disclosure.
  • Fig. 8 shows a schematic block diagram of a combined processing device.
  • Figure 9 shows a schematic block diagram of a board.
  • the term “if” can be interpreted as “when” or “once” or “in response to determination” or “in response to detection” depending on the context.
  • the phrase “if determined” or “if detected [described condition or event]” can be interpreted as meaning “once determined” or “in response to determination” or “once detected [described condition or event]” or “in response to detection of [described condition or event]” depending on the context.
  • the processing device may be any apparatus, module, device, unit, etc. capable of receiving, calculating, and sending data, such as a processor, a chip, or a circuit.
  • Fig. 1 shows a schematic structural diagram of a processing node according to an embodiment of the present disclosure.
  • the processing node may be, may include, or may be included in the above-mentioned processing device.
  • the processing node may include a communication device 100, which may include a receiving device 110, a task processing device 130, a sending device 120, and a memory 140; one end of the task processing device 130 is connected to the receiving device 110, and the other end is connected to the sending device 120; the receiving device 110 and the sending device 120 are each connected to the memory 140.
  • the receiving device 110 can receive data from other processing nodes or from upper-layer drivers, and send the received data to the task processing device 130 for calculation to obtain the data to be sent; the memory 140 is used to store the various data received and calculated during communication; the sending device 120 is used to send these data out.
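The data path through the receiving device 110, task processing device 130, memory 140, and sending device 120 can be modelled in a few lines. This is a simplified sketch under stated assumptions: the class name `ProcessingNode`, the `outbox` list standing in for the outgoing link, and the scalar payloads are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ProcessingNode:
    """Toy model of one processing node's receive-compute-send path."""
    compute: Callable[[float], float]                   # task processing device 130
    memory: List[float] = field(default_factory=list)   # memory 140
    outbox: List[float] = field(default_factory=list)   # stands in for the outgoing link

    def receive(self, value: float) -> float:
        """Receiving device 110: store input, hand it to the task processor."""
        self.memory.append(value)        # keep the received data
        result = self.compute(value)     # C: compute the data to be sent
        self.memory.append(result)       # keep the calculated data
        self.send(result)
        return result

    def send(self, value: float) -> None:
        """Sending device 120: push the computed data downstream."""
        self.outbox.append(value)
```

Calling `ProcessingNode(compute=lambda x: x + 1).receive(2.0)` stores both the input and the result in `memory` and places the result in `outbox`, mirroring the R-C-S order.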
  • Fig. 2 shows a schematic diagram of the connection relationship between one processing node and other processing nodes according to an embodiment of the present disclosure.
  • Z can be regarded as a processing node, and the processing node can have multiple ports, such as ports a-f, and the processing node Z can be connected to other processing nodes A-F through these ports, respectively.
  • the connection between processing node Z and other nodes A-F can be enabled or disabled to form different topological structures.
  • Figure 2 shows that the connections between processing node Z and processing nodes A and C are enabled (indicated by solid lines), whereas the connections between processing node Z and processing nodes B, D, E, and F exist physically but carry no actual communication (indicated by dashed lines); the resulting topology is (A, Z, C).
  • the processing node Z can also form any other type of topological structure, such as (F, Z, B), (E, Z, A), (A, Z, (B, C)), and so on.
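Enabling a subset of the physical links is enough to select a topology. The sketch below assumes, as in Fig. 2, that ports a-f of node Z are wired to nodes A-F respectively; the function name `active_topology` and the convention of returning a nested tuple for multiple neighbours are illustrative assumptions.

```python
from typing import Dict, Iterable, Tuple, Union

# Physical wiring of node Z: port name -> neighbouring node (per Fig. 2).
PHYSICAL_LINKS: Dict[str, str] = {
    "a": "A", "b": "B", "c": "C", "d": "D", "e": "E", "f": "F",
}

Side = Union[str, Tuple[str, ...]]


def active_topology(links: Dict[str, str],
                    upstream_ports: Iterable[str],
                    downstream_ports: Iterable[str],
                    node: str = "Z") -> Tuple[Side, str, Side]:
    """Return the (upstream, node, downstream) topology for the enabled ports."""
    def pick(ports: Iterable[str]) -> Side:
        names = tuple(links[p] for p in ports)
        # A single neighbour is shown bare; several are grouped in a tuple.
        return names[0] if len(names) == 1 else names

    return (pick(upstream_ports), node, pick(downstream_ports))
```

With ports `a` (upstream) and `c` (downstream) enabled this yields `("A", "Z", "C")`; enabling `b` and `c` downstream instead yields `("A", "Z", ("B", "C"))`, matching the notation used above.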
  • Fig. 3 shows a schematic diagram of a system environment to which the method according to the present disclosure can be applied.
  • the system may include a host and a processing device.
  • the processing device and the processing node may be equal to or contain each other. Therefore, the two can be interchanged in the context. It should be understood that the processing device can be combined with the host to form a system, or it can be a discrete system.
  • the user can edit in the host to manage the processing equipment.
  • the host may be implemented by a general-purpose computer or a special-purpose computer, which may include collective communication primitives, such as AllReduce and Allgather mentioned above; multiple user applications; and communication drivers between nodes.
  • the processing equipment may include inter-node communication modules.
  • the inter-node communication modules have multiple communication media and corresponding ports, such as RoCE, Interlaken, etc.
  • the host also includes the user communication interface of the present disclosure, and the communication of the processing node can be managed through the user communication interface without modifying the driver every time.
  • the user does not need to know the underlying hardware structure, nor does it need to understand the analysis process of the underlying signal.
  • Fig. 4a shows a flowchart of a method for constructing a communication topology based on multiple processing nodes according to an embodiment of the present disclosure
  • Fig. 4b shows a schematic diagram of a multi-processing-node system in which a communication topology is constructed based on multiple processing nodes according to an embodiment of the present disclosure
  • the method includes: in operation S410, constructing node configuration information, where the node configuration information includes upstream node information, current node information, and downstream node information; and in operation S420, sending the node configuration information to at least two processing nodes to construct the communication topology.
  • node configuration information can be established on the host.
  • the node configuration information can indicate how the processing nodes are to be configured or the relationship between each processing node and other processing nodes. Establishing node configuration information on the host can be achieved through parallel programming.
  • the “other processing node” mentioned here may be a processing node that has a connection relationship with a given processing node. If a certain processing node is called the current node, the processing node that sends data or information to the current node is the upstream node of the current node, and the processing node that receives data or information from the current node is the downstream node of the current node. Node configuration information containing upstream node information, current node information, and downstream node information can therefore completely describe a node and the nodes adjacent to it.
  • for example, if there are two processing nodes A and B, processing node A sends data to processing node B, and the data is processed at processing node B, then processing node B is the current node, processing node A is the upstream node of processing node B, and the downstream node of processing node B is empty.
  • conversely, if processing node A is the current node, then processing node B is the downstream node of processing node A, and the upstream node of processing node A is empty.
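The two-node example can be expressed as a small data structure. The following is a minimal sketch; the `NodeConfig` class, its field names, and the use of `None` for an empty neighbour are assumptions made for illustration and are not prescribed by the disclosure.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class NodeConfig:
    """One piece of node configuration information."""
    upstream: Optional[str]    # node that sends data to `current` (None = empty)
    current: str               # node that computes on the received data
    downstream: Optional[str]  # node that receives the result (None = empty)


def configs_for_link(sender: str, receiver: str) -> List[NodeConfig]:
    """Describe a single link sender -> receiver from both nodes' viewpoints."""
    return [
        NodeConfig(upstream=None, current=sender, downstream=receiver),
        NodeConfig(upstream=sender, current=receiver, downstream=None),
    ]
```

`configs_for_link("A", "B")` produces one record in which A is the current node with an empty upstream, and one in which B is the current node with an empty downstream.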
  • sending to at least two processing nodes does not necessarily mean sending the node configuration information directly to the processing nodes; for example, it can be sent to the driver, which then sends it directly or indirectly to the processing nodes. Any direct or indirect method that enables the node configuration information to reach the processing nodes falls within the protection scope of the present disclosure.
  • the node configuration information can be sent to at least two processing nodes, thereby forming different topological networks through multiple processing nodes.
  • the constructed node configuration information can be sent to processing node 1, processing node 2, ... processing node n, and so on.
  • after receiving the node configuration information, the processing nodes form different topological networks, and communicate and process data based on these topological networks.
  • the subsequent host does not need to participate in the communication and data processing between the processing nodes, thereby reducing the interaction between the host and the device, and improving operation efficiency.
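The host-side distribution step can be sketched as follows. This is an illustrative model only: the tuple layout `(upstream, current, downstream)`, the function names `dispatch` and `edges`, and the in-memory "inbox" standing in for the driver path are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set, Tuple

Config = Tuple[Optional[str], str, Optional[str]]  # (upstream, current, downstream)


def dispatch(configs: List[Config]) -> Dict[str, List[Config]]:
    """Host side: route each configuration to the node named as `current`."""
    inbox: Dict[str, List[Config]] = defaultdict(list)
    for cfg in configs:
        inbox[cfg[1]].append(cfg)
    return dict(inbox)


def edges(inbox: Dict[str, List[Config]]) -> Set[Tuple[str, str]]:
    """Collect the directed links that the delivered configurations imply."""
    links: Set[Tuple[str, str]] = set()
    for cfgs in inbox.values():
        for up, cur, down in cfgs:
            if up is not None:
                links.add((up, cur))
            if down is not None:
                links.add((cur, down))
    return links
```

Once `dispatch` has run, each node holds everything it needs to know about its neighbours, so in this model the host plays no further part in moving data between nodes.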
  • Fig. 4b is only an example of a host and a processing device, and the two do not have to be as shown in Fig. 4b.
  • multiple processing nodes can be in one processing device or in multiple processing devices, and can be controlled by one or more hosts; each host can control one or more processing nodes, and the host's control over the processing nodes can be serial or parallel.
  • each processing node can be configured one by one, or multiple processing nodes can be configured at the same time. Any combination of the host and the processing node is within the protection scope of the present disclosure.
  • the upstream node information is used to indicate a processing node that transmits data to the current node, the current node information is used to indicate a processing node that performs calculations on the received data, and the downstream node information is used to indicate the processing node that receives the calculated data from the current node.
  • for example, processing node A is the upstream node of processing node B and sends data to processing node B; processing node B performs the computing function: it receives the data from processing node A and then calculates and processes it; processing node C is the downstream node of processing node B, and processing node B sends the processed data to processing node C once processing is complete.
  • the node configuration information can be sent to processing node B, which parses it upon receipt and thereby learns that the upstream node sending data to it is processing node A, and that after performing calculations on the received data it should send the calculated data to the downstream processing node C.
  • the processing node that receives the node configuration information will know the role it plays and the specific information of the upstream and downstream nodes. Therefore, by modifying the content of the node configuration information, different network topologies can be deployed, which improves the efficiency of network topology settings and reduces the difficulty.
  • the node configuration information may be in the form of a queue tuple <upstream node, current node, downstream node>.
  • the information contained in the tuple allows the processing node that receives the node configuration information to know its role and specific information about upstream and downstream nodes.
  • the node configuration information may be in the form of a queue tuple <upstream node, downstream node>.
  • the element "current node" is omitted because the current node can be set by default: the processing node to which the node configuration information is sent, that is, the processing node that receives it, is taken to be the current node.
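Both tuple forms can be reduced to one internal representation on the receiving side. The helper below is a hypothetical sketch: the name `normalize`, the use of `None` for an empty element, and the decision to reject a three-element tuple addressed to a different node are all assumptions.

```python
from typing import Optional, Sequence, Tuple

Config = Tuple[Optional[str], str, Optional[str]]


def normalize(cfg: Sequence[Optional[str]], receiver: str) -> Config:
    """Expand <upstream, downstream> to <upstream, current, downstream>.

    `receiver` is the node the configuration was delivered to; it becomes
    the default current node when the two-element form is used.
    """
    if len(cfg) == 3:
        up, cur, down = cfg
        if cur != receiver:
            raise ValueError(f"config for {cur!r} delivered to {receiver!r}")
        return (up, cur, down)
    if len(cfg) == 2:
        up, down = cfg
        return (up, receiver, down)
    raise ValueError("expected a 2- or 3-element tuple")
```

For instance, `normalize(("A", "C"), "B")` and `normalize(("A", "B", "C"), "B")` both give `("A", "B", "C")`. The shorter form saves one field per message at the cost of relying on correct delivery.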
  • Figures 5a-5c show schematic diagrams of setting multiple node configuration information for a single node according to an embodiment of the present disclosure, wherein Figure 5a shows a situation where a single node has multiple inputs; Figure 5b shows a situation where a single node has multiple outputs; and Figure 5c shows a situation where a single node has multiple inputs and multiple outputs.
  • node Z is the current node, which has two upstream nodes A and B, and one downstream node C. Therefore, in order to achieve such a configuration, the node configuration information sent to the processing node Z may include: <A, Z, ∅>, <B, Z, ∅>, as well as <∅, Z, C>, where the symbol ∅ indicates empty.
  • the processing node Z may receive data from the processing nodes A and B, and after calculation and processing are performed at the processing node Z, the processed data is sent to the processing node C.
  • the task processing part that processes and calculates data from processing node A and processing node B is also shown by way of example as boxes, which may correspond to the task processing device in FIG. 1; this will not be repeated hereinafter.
  • node Z is the current node, which has an upstream node A and two downstream nodes C and D. Therefore, in order to achieve such a configuration, the node configuration information sent to the processing node Z may include: <A, Z, ∅>, <∅, Z, C>, as well as <∅, Z, D>, where the symbol ∅ indicates empty.
  • the processing node Z may receive data from the processing node A, and after performing calculation and processing at the processing node Z, send the processed data to the processing nodes C and D.
  • node Z is the current node, which has two upstream nodes A and B, and two downstream nodes C and D. Therefore, in order to achieve such a configuration, the node configuration information sent to the processing node Z can include: <A, Z, ∅>, <B, Z, ∅>, <∅, Z, C>, as well as <∅, Z, D>, where the symbol ∅ indicates empty.
  • the processing node Z may receive data from the processing nodes A and B, and after calculation and processing are performed at the processing node Z, the processed data is sent to the processing nodes C and D.
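When a node receives several pieces of configuration information, it has to combine them into one view of its inputs and outputs. The sketch below assumes each piece is a tuple `(upstream, current, downstream)` with `None` for an empty element, and that one tuple is sent per connection; the function name `merge` is hypothetical.

```python
from typing import Iterable, List, Optional, Tuple

Config = Tuple[Optional[str], str, Optional[str]]


def merge(node: str, configs: Iterable[Config]) -> Tuple[List[str], List[str]]:
    """Combine all configurations for `node` into (upstream list, downstream list)."""
    ups: List[str] = []
    downs: List[str] = []
    for up, cur, down in configs:
        if cur != node:
            raise ValueError(f"configuration for {cur!r} does not belong to {node!r}")
        if up is not None and up not in ups:
            ups.append(up)       # another input of the current node
        if down is not None and down not in downs:
            downs.append(down)   # another output of the current node
    return ups, downs
```

For the multiple-input, multiple-output case, merging four single-connection tuples for node Z yields inputs `["A", "B"]` and outputs `["C", "D"]`. A tuple such as `("B", "Z", "C")`, which names both neighbours at once, is handled by the same loop.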
  • the form of the tuple can also include only the upstream node and the downstream node without including the current node.
  • processing node Z is a bridge node between processing nodes B and C.
  • one of the upstream node information and the downstream node information may be empty.
  • in addition to the above situation where the upstream node or downstream node is empty, there are other situations. For example, when a certain processing node is an endpoint in the topology, its upstream node or downstream node is empty. This is described in more detail below.
  • sending the node configuration information to at least two processing nodes to construct a communication topology includes: sending different node configuration information to at least a part of all the processing nodes, so as to construct that part of the processing nodes into different communication topologies.
  • the processing nodes that have received the node configuration information can form different connection relationships. Therefore, sending node configuration information to multiple processing nodes can form more complex and diverse topological structures.
  • Figs. 6a to 6c respectively show, by way of example, schematic diagrams of a chain topology, a ring topology, and a tree topology.
  • processing nodes A, B, and C form a chain topology.
  • the three processing nodes A, B, and C are connected serially in turn, where processing node A is an endpoint and its node configuration information is <∅, A, B>, which means that processing node A is the current node, its upstream node is empty, and its downstream node is processing node B; similarly, for processing node B, its node configuration information is <A, B, C>, which means that processing node B is the current node, its upstream node is processing node A, and its downstream node is processing node C; similarly, for processing node C, its node configuration information is <B, C, ∅>, which means that processing node C is the current node, its upstream node is processing node B, and its downstream node is empty.
  • processing nodes A, B, and C form a ring topology.
  • the three processing nodes A, B, and C are sequentially connected in series, and the processing nodes A and C are connected to form a ring structure.
  • the node configuration information for processing node A is <C, A, B>, which means that processing node A is the current node, its upstream node is processing node C, and its downstream node is processing node B; similarly, for processing node B, its node configuration information is <A, B, C>, which means that processing node B is the current node, its upstream node is processing node A, and its downstream node is processing node C; similarly, for processing node C, its node configuration information is <B, C, A>, which means that processing node C is the current node, its upstream node is processing node B, and its downstream node is processing node A.
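The chain and ring configurations follow a regular pattern and can be generated from an ordered list of nodes. This is a minimal sketch; the function names `chain_configs` and `ring_configs` and the use of `None` for an empty element are assumptions for illustration.

```python
from typing import List, Optional, Sequence, Tuple

Config = Tuple[Optional[str], str, Optional[str]]


def chain_configs(nodes: Sequence[str]) -> List[Config]:
    """One tuple per node; the two endpoints get an empty neighbour."""
    n = len(nodes)
    return [
        (nodes[i - 1] if i > 0 else None,
         nodes[i],
         nodes[i + 1] if i < n - 1 else None)
        for i in range(n)
    ]


def ring_configs(nodes: Sequence[str]) -> List[Config]:
    """Like a chain, but the last node feeds the first."""
    n = len(nodes)
    # nodes[i - 1] wraps to the last node when i == 0.
    return [(nodes[i - 1], nodes[i], nodes[(i + 1) % n]) for i in range(n)]
```

For nodes A, B, and C, `ring_configs` returns `("C", "A", "B")`, `("A", "B", "C")`, and `("B", "C", "A")`, while `chain_configs` differs only in leaving A's upstream and C's downstream empty. Switching topology is therefore a matter of sending different tuples, not rewiring.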
  • processing nodes A, B, C, and D form a tree topology.
  • processing nodes A and B are respectively connected to processing node C
  • processing node D is connected to processing node C.
  • the node configuration information for processing node A is <∅, A, C>, which means that processing node A is the current node, its upstream node is empty, and its downstream node is processing node C.
  • for processing node B, its node configuration information is <∅, B, C>, which means that processing node B is the current node, its upstream node is empty, and its downstream node is processing node C.
  • for processing node C, node configuration information <A, C, ∅> indicates that the current node is C, which has upstream node A; node configuration information <B, C, ∅> indicates that the current node is C, which has upstream node B; and node configuration information <∅, C, D> indicates that the current node is C, which has a downstream node D.
  • for processing node D, its node configuration information is <C, D, ∅>, which means that the current node is D, its upstream node is processing node C, and its downstream node is empty.
  • FIGS. 6a-6c are just a few examples of multiple topologies. Those skilled in the art can modify the node configuration information and distribute it to different nodes to construct any required topology. In addition, for the sake of brevity, the task processing part in FIGS. 5a to 5c is omitted from FIGS. 6a to 6c.
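For less regular shapes such as the tree, the per-node configuration can be derived from a list of directed links. The sketch below is an assumption-laden illustration: the function name `configs_from_edges`, the `None` placeholder for an empty element, and the rule of emitting one tuple per connection for nodes with more than one input or output are choices made here, not requirements of the disclosure.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Config = Tuple[Optional[str], str, Optional[str]]


def configs_from_edges(edges: Iterable[Tuple[str, str]]) -> Dict[str, List[Config]]:
    """Build per-node configuration tuples from directed (sender, receiver) links."""
    ups: Dict[str, List[str]] = {}
    downs: Dict[str, List[str]] = {}
    for src, dst in edges:
        downs.setdefault(src, []).append(dst)
        ups.setdefault(dst, []).append(src)
        ups.setdefault(src, [])     # make sure every node appears in both maps
        downs.setdefault(dst, [])

    result: Dict[str, List[Config]] = {}
    for node in ups:
        u, d = ups[node], downs[node]
        if len(u) <= 1 and len(d) <= 1:
            # Simple node: a single tuple describes both neighbours.
            result[node] = [(u[0] if u else None, node, d[0] if d else None)]
        else:
            # Fan-in or fan-out: one tuple per connection.
            result[node] = ([(x, node, None) for x in u]
                            + [(None, node, x) for x in d])
    return result
```

Given the links A→C, B→C, and C→D, node C receives three tuples (two inputs and one output) while A, B, and D each receive one.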
  • This configuration method is convenient for users to construct different topological structures in a simple way, thereby simplifying operations and improving efficiency.
  • constructing a communication topology structure may include allowing processing nodes in the communication topology structure to reserve resources.
  • resources such as communication resources and/or register resources, can be reserved for all processing nodes in the constructed topology structure. These resources can be used for subsequent communications, storage, calculations, etc. performed by the processing node, so that the processing node does not need to temporarily apply for resources during the processing process, thereby making subsequent processing more efficient.
  • the above-mentioned communication resources may include: ports and/or channels required for communication between nodes.
  • the communication port is a network media port module that physically connects two processing nodes.
  • the communication channel is a virtual communication link between the matched sending and receiving devices of the two processing nodes. In general terms, it determines which sending DMA module and which receiving DMA module are selected from a large pool of DMAs.
  • the register resource may include a storage space for storing task description information, which is used to indicate operations to be performed by each processing node in the constructed communication topology structure.
  • the task description information may, for example, specify what operation (sending, calculating or receiving, etc.) each processing node should perform, how to perform the operation, when to perform the operation, and so on.
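One way to picture up-front reservation is an all-or-nothing allocation across every node in the topology. The sketch below is an assumption, not the disclosed mechanism: the function name `reserve_all`, the resource kinds `"ports"` and `"channels"` as simple counters, and the two-phase check-then-commit policy are all hypothetical.

```python
from typing import Dict

Resources = Dict[str, Dict[str, int]]  # node name -> {resource kind: count}


def reserve_all(free: Resources, needs: Resources) -> bool:
    """Reserve resources on every node, or on none of them.

    Returns True and updates `free` in place if every request fits;
    returns False and leaves `free` untouched otherwise.
    """
    # Phase 1: verify every request before touching any counter.
    for node, request in needs.items():
        for kind, amount in request.items():
            if free.get(node, {}).get(kind, 0) < amount:
                return False
    # Phase 2: commit all reservations.
    for node, request in needs.items():
        for kind, amount in request.items():
            free[node][kind] -= amount
    return True
```

Because no node holds a partial reservation while waiting for another node, this policy avoids the situation in which several tasks each hold some resources and block one another.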
  • Fig. 7 shows a device for constructing a communication topology based on multiple processing nodes according to an embodiment of the present disclosure, including: a first device M710 configured to construct node configuration information, the node configuration information including upstream node information, current node information, and downstream node information; and a second device M720 configured to send the node configuration information to at least two processing nodes to construct a communication topology.
  • the above-mentioned equipment can be implemented by software, hardware, firmware, etc., to realize the functions shown in FIG. 4.
  • the device can be set up or integrated in any other device, such as a host or server.
  • the present disclosure also provides a system for constructing a communication topology based on multiple processing nodes, including: multiple processing nodes; and a host, the host including a building module, the building module including: a first device M710 configured to construct node configuration information, the node configuration information including upstream node information, current node information, and downstream node information; and a second device M720 configured to send the node configuration information to at least two processing nodes to construct a communication topology.
  • an electronic device including: one or more processors; and a memory in which computer-executable instructions are stored. When the computer-executable instructions are run by the one or more processors, the electronic device is caused to execute the method described above.
  • a computer-readable storage medium including computer-executable instructions, and when the computer-executable instructions are executed by one or more processors, the method described above is executed.
  • the resource pre-application method of the present disclosure ensures consistent occupation of multi-node resources in a distributed scenario and reduces resource deadlocks caused by some nodes of the processing device failing to obtain sufficient resources; it also provides automatic routing of data receiving, computing, and sending on the processing device, without requiring the host to actively intervene in the execution process of the processing device; and it is more user-friendly, since users need not understand the underlying hardware structure or the complex configuration of descriptors or templates, which reduces the development complexity of distributed tasks (such as AllReduce).
  • the technical solution of the present disclosure can be applied to the field of artificial intelligence, implemented in a host, a server, or implemented as or implemented in an artificial intelligence chip.
  • the chip can exist alone or be included in the communication configuration device.
  • FIG. 8 shows a combined processing device 800, which includes the aforementioned communication configuration device 802, an interconnection interface 804, and other processing devices 806.
  • the communication configuration device according to the present disclosure interacts with other processing devices to jointly complete the operation specified by the user.
  • Fig. 8 is a schematic diagram of a combined processing device.
  • Other processing devices include one or more types of general-purpose/special processors such as central processing unit CPU, graphics processing unit GPU, neural network processor, etc.
  • the number of processors included in other processing devices is not limited.
  • Other processing devices serve as the interface between the machine learning computing device and external data and control, including data handling, and completing basic controls such as turning on and stopping the machine learning computing device; other processing devices can also cooperate with the machine learning computing device to complete computing tasks.
  • the interconnection interface is used to transmit data and control commands between a communication configuration device (including, for example, a machine learning computing device) and other processing devices.
  • the communication configuration device obtains the required input data from other processing devices and writes it to the on-chip storage device of the communication configuration device; it can obtain control instructions from other processing devices and write them to the on-chip control buffer of the communication configuration device; it can also read the data in the storage module of the communication configuration device and transmit it to other processing devices.
  • the structure may further include a storage device 808, which is respectively connected to the communication configuration device and the other processing device.
  • the storage device is used to store the data in the communication configuration device and the other processing device, and is particularly suitable for data that cannot be fully stored in the internal storage of the communication configuration device or other processing device.
  • the combined processing device can be used as an SOC system on chip for mobile phones, robots, unmanned aerial vehicles, video surveillance equipment and other equipment, effectively reducing the core area of the control part, increasing processing speed, and reducing overall power consumption.
  • the interconnection interface of the combined processing device is connected to certain components of the equipment, such as a camera, a monitor, a mouse, a keyboard, a network card, or a Wi-Fi interface.
  • the present disclosure also discloses a board card, which includes a chip packaging structure.
  • the board card may also include other supporting components.
  • the supporting components include, but are not limited to: a storage device 904, an interface device 906, and a control device 908.
  • the storage device is connected to the chip in the chip packaging structure through a bus for storing data.
  • the storage device may include multiple groups of storage units 910. Each group of the storage unit and the chip are connected by a bus. It can be understood that each group of the storage units may be DDR SDRAM (English: Double Data Rate SDRAM, double-rate synchronous dynamic random access memory).
  • the storage device may include four groups of the storage units. Each group of the storage units may include multiple DDR4 chips (granules). In one embodiment, the chip may include four 72-bit DDR4 controllers. In each 72-bit DDR4 controller, 64 bits are used for data transmission and 8 bits are used for ECC checking. In one embodiment, each group of the storage units includes multiple double data rate synchronous dynamic random access memories arranged in parallel. DDR can transmit data twice in one clock cycle. A controller for the DDR is provided in the chip to control the data transmission and data storage of each storage unit.
  • the interface device is electrically connected with the chip in the chip packaging structure.
  • the interface device is used to implement data transmission between the chip and an external device 912 (for example, a server or a computer).
  • the interface device may be a standard PCIE interface.
  • the data to be processed is transferred from the server to the chip through a standard PCIE interface to realize data transfer.
  • the interface device may also be other interfaces.
  • the present disclosure does not limit the specific form of the other interfaces mentioned above; the interface unit only needs to be able to realize the transfer function.
  • the calculation result of the chip is still transmitted by the interface device back to an external device (such as a server).
  • the control device is electrically connected with the chip.
  • the control device is used to monitor the state of the chip.
  • the chip and the control device may be electrically connected through an SPI interface.
  • the control device may include a single-chip microcomputer (Micro Controller Unit, MCU).
  • the chip may include multiple processing chips, multiple processing cores, or multiple processing circuits, and can drive multiple loads. Therefore, the chip can be in different working states such as multi-load and light-load.
  • the control device can regulate the working states of the multiple processing chips, multiple processing cores, and/or multiple processing circuits in the chip.
  • the present disclosure also discloses an electronic device or apparatus, which includes the above-mentioned board card.
  • Electronic devices or apparatuses include data processing devices, robots, computers, printers, scanners, tablets, smart terminals, mobile phones, driving recorders, navigators, sensors, webcams, servers, cloud servers, still cameras, video cameras, projectors, watches, earphones, mobile storage, wearable devices, means of transportation, household appliances, and/or medical equipment.
  • the transportation means include airplanes, ships, and/or vehicles;
  • the household appliances include TVs, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods;
  • the medical equipment includes nuclear magnetic resonance scanners, B-mode ultrasound scanners, and/or electrocardiographs.
  • the disclosed device may be implemented in other ways.
  • the device embodiments described above are merely illustrative. For example, the division of the units is only a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, optical, acoustic, magnetic or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or in the form of software program modules.
  • if the integrated unit is implemented in the form of a software program module and sold or used as an independent product, it can be stored in a computer-readable memory.
  • the computer software product is stored in a memory and includes several instructions that enable a computer device (which can be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the various embodiments of the present disclosure.
  • the aforementioned memory includes: USB flash drive, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), mobile hard disk, magnetic disk, optical disk, and other media that can store program code.
  • a method for constructing a communication topology based on multiple processing nodes including:
  • node configuration information includes upstream node information, current node information, and downstream node information;
  • the node configuration information is sent to at least two processing nodes to construct the communication topology structure.
  • the upstream node information is used to indicate the processing node that transmits data to the current node
  • the current node information is used to indicate the processing node that performs calculations on the received data
  • the downstream node information is used to indicate the processing node that receives the calculated data from the current node.
  • Clause 5 The method according to any one of clauses 1-4, wherein one of the upstream node information and the downstream node information is empty.
  • Clause 7 The method according to Clause 6, wherein the communication topology includes at least one of a chain topology, a ring topology, and a tree topology.
  • Clause 8 The method of any one of clauses 1-7, wherein constructing a communication topology includes causing processing nodes in the communication topology to reserve resources.
  • Clause 9 The method according to Clause 8, wherein the resources include communication resources and/or register resources.
  • the communication resources include: ports and/or channels required for communication between nodes;
  • the register resource includes a storage space for storing task description information, the task description information being used to indicate operations to be performed by each processing node in the constructed communication topology structure.
  • a device that builds a communication topology based on multiple processing nodes including:
  • the first device constructs node configuration information, where the node configuration information includes upstream node information, current node information, and downstream node information;
  • the second device sends the node configuration information to at least two processing nodes to construct a communication topology structure.
  • a system that builds a communication topology based on multiple processing nodes including:
  • a host the host includes a building module, and the building module includes:
  • the first device is configured to construct node configuration information, where the node configuration information includes upstream node information, current node information, and downstream node information;
  • the second device sends the node configuration information to at least two processing nodes to construct a communication topology structure.
  • An electronic device including:
  • one or more processors; and
  • a memory in which computer-executable instructions are stored, where, when the computer-executable instructions are executed by the one or more processors, the electronic device is caused to execute the method according to any one of clauses 1-10.
  • Clause 14 A computer-readable storage medium comprising computer-executable instructions, when the computer-executable instructions are executed by one or more processors, the method according to any one of clauses 1-10 is performed.

Abstract

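The present disclosure relates to a method, a device, and a communication configuration device for constructing a communication topology structure based on multiple processing nodes. The communication configuration device may be included in a combined processing device, which may further include an interconnection interface and other processing devices. The communication configuration device interacts with the other processing devices to jointly complete a computing operation specified by a user. The combined processing device may further include a storage device, which is connected to the communication configuration device and to the other processing devices and is used to store data of the communication configuration device and the other processing devices. The technical solution of the present disclosure can improve the efficiency of inter-chip communication.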

Description

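Method and Device for Constructing a Communication Topology Structure Based on Multiple Processing Nodes
Cross-Reference to Related Applications
This application claims priority to Chinese patent application No. 202010334771.X, filed on April 24, 2020 and entitled "Method and Device for Constructing a Communication Topology Structure Based on Multiple Processing Nodes", the entire content of which is incorporated herein by reference.
Technical Field
The present application relates to the field of artificial intelligence and, more specifically, to inter-chip communication among multiple processors.
Background
In neural network training, if training a neural network of scale X on a single machine takes time T, then with N identical machines training that network the training time should ideally be T/N. This is called the ideal linear speedup. In practice the ideal linear speedup is unattainable, because communication overhead is introduced. The computation part can be accelerated linearly, but the communication part (for example, the AllReduce algorithm) is inherent and cannot be eliminated.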
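The gap between ideal and actual speedup can be illustrated with a small calculation. The sketch below is only an illustration under an assumed cost model that is not taken from the application: compute time scales as T/N, while a fixed communication time is added on top and does not shrink with N. The function name and parameters are hypothetical.

```python
def speedup(compute_time: float, comm_time: float, n_machines: int) -> float:
    """Speedup over one machine under an assumed additive cost model.

    Assumption (not from the source): parallel time = T/N + comm_time,
    where comm_time does not depend on N.
    """
    if n_machines < 1:
        raise ValueError("n_machines must be >= 1")
    parallel_time = compute_time / n_machines + comm_time
    return compute_time / parallel_time


ideal = speedup(100.0, 0.0, 4)    # no communication cost: the ideal linear case
actual = speedup(100.0, 25.0, 4)  # communication costs as much as the compute share
```

Under this model, any nonzero communication time keeps the speedup below N, which is why the approaches discussed next focus on shortening or hiding communication.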
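Several approaches exist for getting closer to the ideal linear speedup. One is to optimize the communication time, for example by shortening it. Another is to overlap operations, for example by hiding the communication time within the computation time (communication fusion, asynchronous updates, and so on).
The communication time can be optimized in a number of ways, such as adopting high-speed communication technologies or optimized communication algorithms.
In multi-machine, multi-card neural network training, to ensure that the data-parallel training result converges, each node participating in the distributed training needs to pass the gradient information ΔWi obtained by back propagation (BP) at the current node to the other nodes, so that every node eventually obtains all of the gradient information, that is, ΣΔWi. The method by which the gradient information is propagated and accumulated is called the AllReduce algorithm.
The AllReduce algorithm can be implemented on different network topologies. The AllReduce implementation optimized for a ring topology (Ring) is the Ring AllReduce algorithm.
From the perspective of a single card, the core process that AllReduce needs to implement is: receive (R), compute (C), and send (S). In the Ring AllReduce algorithm, part R corresponds to receiving the gradient information ΔWi-1 sent by the upstream node, part C corresponds to computing ΔWi = Add(ΔWi-1, ΔWi), and part S corresponds to sending the gradient information ΔWi to the downstream node.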
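The receive, compute, and send steps can be sketched as a single-process simulation. This is a deliberately simplified, unchunked variant written for illustration only: practical Ring AllReduce implementations split each gradient into chunks and run a reduce-scatter phase followed by an all-gather phase. The function name and the list-based "nodes" are assumptions, not part of the application.

```python
from typing import List


def ring_allreduce(grads: List[List[float]]) -> List[List[float]]:
    """Sum per-node gradients by passing a running total around a ring."""
    n = len(grads)
    if n == 0:
        return []
    # Node 0 has no partial sum to receive, so it just sends (S) its own gradient.
    in_flight = list(grads[0])
    for i in range(1, n):
        received = in_flight                                     # R: from node i-1
        in_flight = [a + b for a, b in zip(received, grads[i])]  # C: Add(dW[i-1], dW[i])
        # S: in_flight now travels on to node (i + 1) % n
    # After the last step in_flight holds the full sum; a second trip around
    # the ring (modeled here as plain copies) hands it to every node.
    return [list(in_flight) for _ in range(n)]


result = ring_allreduce([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

Each loop iteration corresponds to one node performing R, C, and S in turn, which is the per-card process the text describes.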
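However, the existing technology cannot fully support the receive, compute, and send process on the processing device side. Even where the processing device side can fully support it, doing so wastes computing resources or, because of thread management, increases chip area and energy consumption.
In addition, in the existing technology, communication between processing nodes requires the participation and management of the host. This leads to frequent communication between the host and the processing nodes and lowers both communication efficiency and computing efficiency. The problem to be solved by the present disclosure is therefore how to support an R-C-S process carried out entirely at the processing device, while using computing resources efficiently and without introducing chip thread management capability.
Summary
The purpose of the present disclosure is to overcome the shortcomings of the existing technology, in which computing resources are occupied unreasonably and wasted.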
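According to a first aspect of the present disclosure, a method for constructing a communication topology structure based on multiple processing nodes is provided, including: constructing node configuration information, where the node configuration information includes upstream node information, current node information, and downstream node information; and sending the node configuration information to at least two processing nodes to construct the communication topology structure.
According to a second aspect of the present disclosure, a device for constructing a communication topology structure based on multiple processing nodes is provided, including: a first device configured to construct node configuration information, where the node configuration information includes upstream node information, current node information, and downstream node information; and a second device configured to send the node configuration information to at least two processing nodes to construct a communication topology structure.
According to a third aspect of the present disclosure, a system for constructing a communication topology structure based on multiple processing nodes is provided, including: multiple processing nodes; and a host, where the host includes a building module, and the building module includes: a first device configured to construct node configuration information, where the node configuration information includes upstream node information, current node information, and downstream node information; and a second device configured to send the node configuration information to at least two processing nodes to construct a communication topology structure.
According to a fourth aspect of the present disclosure, an electronic device is provided, including: one or more processors; and a memory storing computer-executable instructions that, when run by the one or more processors, cause the electronic device to execute the method described above.
According to a fifth aspect of the present disclosure, a computer-readable storage medium is provided, including computer-executable instructions that, when run by one or more processors, perform the method described above.
The beneficial effects provided by the technical solution of the present disclosure include at least the following:
The resource pre-application method of the present disclosure achieves consistent occupation of multi-node resources in distributed scenarios and alleviates the resource deadlock caused when some nodes of the processing device cannot obtain sufficient resources. It also provides automatic routing of data receiving, computing, and sending on the processing device, without the host actively intervening in the execution process of the processing device. It is user-friendly: the user does not need to understand the underlying hardware structure or the complex configuration of descriptors or templates, which reduces the development complexity of distributed tasks (for example, AllReduce).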
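Brief Description of the Drawings
The above and other objects, features, and advantages of the exemplary embodiments of the present disclosure will become easier to understand by reading the following detailed description with reference to the accompanying drawings. In the drawings, several embodiments of the present disclosure are shown by way of example and not limitation, and the same or corresponding reference numerals denote the same or corresponding parts, in which:
FIG. 1 shows a schematic structural diagram of a processing node according to an embodiment of the present disclosure.
FIG. 2 shows a schematic diagram of the connection relationship between one processing node and other processing nodes according to an embodiment of the present disclosure.
FIG. 3 shows a schematic diagram of a system environment to which the method of the present disclosure can be applied.
FIG. 4a shows a flowchart of a method for constructing a communication topology structure based on multiple processing nodes according to an embodiment of the present disclosure.
FIG. 4b shows a schematic diagram of a multi-processing-node system for constructing a communication topology structure based on multiple processing nodes according to an embodiment of the present disclosure.
FIGS. 5a-5c show schematic diagrams of setting multiple pieces of node configuration information for a single node according to an embodiment of the present disclosure, where FIG. 5a shows a single node with multiple inputs, FIG. 5b shows a single node with multiple outputs, and FIG. 5c shows a single node with multiple inputs and multiple outputs.
FIGS. 6a to 6c show, by way of example, schematic diagrams of a chain topology, a ring topology, and a tree topology, respectively.
FIG. 7 shows a device for constructing a communication topology structure based on multiple processing nodes according to an embodiment of the present disclosure.
FIG. 8 shows a schematic block diagram of a combined processing device.
FIG. 9 shows a schematic block diagram of a board card.
Detailed Description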
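The technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently some, but not all, of the embodiments of the present disclosure. All other embodiments obtained by those skilled in the art based on the embodiments of the present disclosure without creative effort fall within the scope of protection of the present disclosure.
It should be understood that the terms "first", "second", "third", "fourth", and so on in the claims, description, and drawings of the present disclosure are used to distinguish different objects, not to describe a particular order. The terms "include" and "comprise" used in the description and claims of the present disclosure indicate the presence of the described features, wholes, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components, and/or collections thereof.
It should also be understood that the terminology used in this description is for the purpose of describing particular embodiments only and is not intended to limit the present disclosure. As used in the description and claims of the present disclosure, the singular forms "a", "an", and "the" are intended to include the plural forms unless the context clearly indicates otherwise. It should be further understood that the term "and/or" used in the description and claims refers to any and all possible combinations of one or more of the associated listed items, and includes these combinations.
As used in this description and the claims, the term "if" may be interpreted, depending on the context, as "when", "once", "in response to determining", or "in response to detecting". Similarly, the phrase "if it is determined" or "if [the described condition or event] is detected" may be interpreted, depending on the context, as "once it is determined", "in response to determining", "once [the described condition or event] is detected", or "in response to detecting [the described condition or event]".
Herein, a processing device may be any apparatus, module, device, or unit capable of receiving, computing, and sending data, such as a processor, a chip, or a circuit.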
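FIG. 1 shows a schematic structural diagram of a processing node according to an embodiment of the present disclosure.
The processing node may be, may include, or may be included in the processing device described above. The processing node may include a communication device 100, which may include a receiving device 110, a task processing device 130, a sending device 120, and a memory 140. One end of the task processing device 130 is connected to the receiving device 110 and the other end to the sending device 120. The receiving device 110 and the sending device 120 are each connected to the memory 140.
The receiving device 110 can receive data from other processing nodes or from an upper-layer driver and pass the received data to the task processing device 130, which computes the data to be sent. The memory 140 stores the data received during communication and the various data produced during computation. The sending device 120 sends the data out.
It should be understood that the above explanation of each processing node is given only for ease of understanding. In the technical solution of the present disclosure, the user does not need to understand the underlying hardware structure or how the underlying signals are parsed.
FIG. 2 shows a schematic diagram of the connection relationship between one processing node and other processing nodes according to an embodiment of the present disclosure.
In FIG. 2, Z can be regarded as a processing node that has multiple ports, for example ports a-f, through which it can be connected to other processing nodes A-F respectively. The connections between processing node Z and the other nodes A-F can be enabled or disabled, thereby forming different topologies. For example, FIG. 2 shows that the connections between processing node Z and processing nodes A and C are enabled (solid lines), whereas the connections to processing nodes B, D, E, and F exist physically but carry no actual communication (dashed lines). The resulting topology is (A, Z, C). Processing node Z can also form any other type of topology, such as (F, Z, B), (E, Z, A), or (A, Z, (B, C)). The topology (A, Z, (B, C)) means that the connection between processing node Z and processing node A is enabled, and that the connections between node Z and nodes B and C are enabled.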
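The enable/disable idea from FIG. 2 can be modeled in a few lines. The following is a minimal sketch for illustration: the class name `PortedNode`, its methods, and the port-to-neighbor dictionary are all hypothetical and do not describe the actual hardware interface.

```python
from typing import Dict, Iterable, Set


class PortedNode:
    """A node whose physical links exist permanently but are enabled selectively."""

    def __init__(self, name: str, wiring: Dict[str, str]) -> None:
        self.name = name
        self.wiring = dict(wiring)       # port -> physically attached neighbor
        self.enabled: Set[str] = set()   # ports that currently carry traffic

    def enable_links(self, neighbors: Iterable[str]) -> None:
        """Enable exactly the ports leading to the given neighbors."""
        port_of = {peer: port for port, peer in self.wiring.items()}
        wanted = set()
        for peer in neighbors:
            if peer not in port_of:
                raise KeyError(f"{peer} is not wired to {self.name}")
            wanted.add(port_of[peer])
        self.enabled = wanted

    def active_neighbors(self) -> Set[str]:
        return {self.wiring[port] for port in self.enabled}


# Node Z with ports a-f wired to nodes A-F, as in FIG. 2.
z = PortedNode("Z", {port: port.upper() for port in "abcdef"})
z.enable_links(["A", "C"])  # topology (A, Z, C): only ports a and c are active
```

Calling `enable_links` again with a different set of neighbors switches Z to another topology, for example `["F", "B"]` for (F, Z, B), without changing the physical wiring.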
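As can be seen from the above, the desired topology can be formed conveniently by changing which connections between each processing node and the other processing nodes are enabled.
FIG. 3 shows a schematic diagram of a system environment to which the method of the present disclosure can be applied.
As shown in FIG. 3, the system may include a host and a processing device. The processing device may be equivalent to a processing node, or the two may contain one another, so the two terms are interchangeable in this context. The processing device may be combined with the host to form one system, or it may be a discrete system. The user can edit in the host to manage the processing device. The host may be implemented with a general-purpose or special-purpose computer and may contain collective communication primitives, such as the AllReduce described above and Allgather; various user applications; and an inter-node communication driver. The processing device may include an inter-node communication module, below which are various communication media and corresponding ports, such as RoCE and Interlaken.
The host also includes the user communication interface of the present disclosure, through which the communication of the processing nodes can be managed without modifying the driver each time. The user does not need to know the underlying hardware structure or how the underlying signals are parsed.
FIG. 4a shows a flowchart of a method for constructing a communication topology structure based on multiple processing nodes according to an embodiment of the present disclosure; FIG. 4b shows a schematic diagram of a multi-processing-node system for constructing a communication topology structure based on multiple processing nodes according to an embodiment of the present disclosure.
As shown in FIG. 4a, the method includes: in operation S410, constructing node configuration information, where the node configuration information includes upstream node information, current node information, and downstream node information; and in operation S420, sending the node configuration information to at least two processing nodes to construct the communication topology structure.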
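First, the node configuration information can be established on the host. The node configuration information can indicate how a processing node is to be configured, or the relationship between each processing node and other processing nodes. Establishing the node configuration information on the host can be done through parallel programming.
According to an embodiment of the present disclosure, the "other processing nodes" here may be processing nodes that have a connection relationship with a given processing node. If a certain processing node is called the current node, then a processing node that sends data or information to the current node is an upstream node of the current node, and a processing node that receives data or information from the current node is a downstream node of the current node. Node configuration information containing upstream node information, current node information, and downstream node information can therefore completely describe a node and the other nodes adjacent to it.
In the case of two processing nodes, for example processing node A and processing node B, where processing node A sends data to B and the data is processed at processing node B, processing node B is the current node, processing node A is the upstream node of processing node B, and the downstream node of processing node B is empty.
Likewise, in the same two-node case, if processing node A is taken as the current node, then processing node B is the downstream node of processing node A, and the upstream node of processing node A is empty.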
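The two-node case can be written down as data. The sketch below is one possible representation, assuming a Python dataclass with `None` standing for an empty upstream or downstream node; the names `NodeConfig` and `two_node_configs` are hypothetical and not taken from the application.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class NodeConfig:
    """One piece of node configuration information; None means 'empty'."""
    upstream: Optional[str]
    current: str
    downstream: Optional[str]


def two_node_configs(sender: str, receiver: str) -> List[NodeConfig]:
    """Configs for a link where `sender` transmits and `receiver` processes."""
    return [
        NodeConfig(upstream=None, current=sender, downstream=receiver),
        NodeConfig(upstream=sender, current=receiver, downstream=None),
    ]


config_a, config_b = two_node_configs("A", "B")
```

Here `config_b` describes node B as the current node with upstream A and no downstream node, and `config_a` describes the same link from node A's point of view.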
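It should also be understood that "sending to at least two processing nodes" does not necessarily mean sending the node configuration information directly to the processing nodes. For example, it may be sent to a driver, which then sends it directly or indirectly to the processing nodes. Any direct or indirect way of getting the node configuration information to the processing nodes falls within the scope of protection of the present disclosure.
Once the node configuration information has been constructed, as shown in FIG. 4a and FIG. 4b, it can be sent to at least two processing nodes, so that different topology networks are formed from multiple processing nodes. In FIG. 4b, the constructed node configuration information can be sent to processing node 1, processing node 2, ..., processing node n, and so on. After receiving the node configuration information, the processing nodes form different topology networks and communicate and process data based on them.
In the above solution, once the operating rules of the processing nodes in each device have been set on the host, the host no longer needs to take part in communication between processing nodes or in data processing. This reduces the interaction between the host and the devices and improves operating efficiency.
It should be understood that FIG. 4b is only one example of a host and a processing device; the two need not be arranged as shown in FIG. 4b. For example, multiple processing nodes may be in one processing device or in several, and may be controlled by one or more hosts. Each host may control one or more processing nodes, and the host may control the processing nodes serially or in parallel: it may configure the processing nodes one by one, or configure several at the same time. Any combination of hosts and processing nodes falls within the scope of protection of the present disclosure.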
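According to an embodiment of the present disclosure, the upstream node information is used to indicate the processing node that transmits data to the current node, the current node information is used to indicate the processing node that performs computation on the received data, and the downstream node information is used to indicate the processing node that receives the computed data from the current node.
Take three interconnected processing nodes A, B, and C as an example. Processing node A is the upstream node of processing node B and sends data to processing node B. Processing node B performs the computing function: after receiving the data from processing node A, it computes and processes the data. Processing node C is the downstream node of processing node B: after processing node B has finished processing the data, the processed data is sent to processing node C. Node configuration information can therefore be sent to processing node B. After receiving and parsing it, processing node B knows that the upstream node sending data to it is processing node A and that, after computing and processing the received data, it must send the data to its downstream processing node C. By sending such node configuration information to each processing node, each receiving processing node learns the role it plays and the specifics of its upstream and downstream nodes. Different network topologies can thus be laid out by modifying the content of the node configuration information, which makes setting up a network topology more efficient and less difficult.
Node configuration information can take many forms. According to an embodiment of the present disclosure, it may take the form of a queue tuple <upstream node, current node, downstream node>. As described above, when the node configuration information is sent to each processing node, the information contained in the tuple lets the receiving processing node know the role it plays and the specifics of its upstream and downstream nodes.
According to another embodiment of the present disclosure, the node configuration information may take the form of a queue tuple <upstream node, downstream node>. In this embodiment the element "current node" is omitted because the current node can be set by default: whichever processing node the node configuration information is sent to is taken to be the current node.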
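The two tuple forms can be reconciled on the receiving side. The following is a small sketch, assuming plain Python tuples with `None` for an empty element; `normalize` is a hypothetical helper, not an interface defined by the application.

```python
from typing import Optional, Tuple

Triple = Tuple[Optional[str], str, Optional[str]]


def normalize(config: tuple, receiver: str) -> Triple:
    """Return <upstream, current, downstream> for either tuple form.

    A 2-tuple <upstream, downstream> omits the current node, so the node
    that received the tuple is filled in as the default current node.
    """
    if len(config) == 3:
        return (config[0], config[1], config[2])
    if len(config) == 2:
        upstream, downstream = config
        return (upstream, receiver, downstream)
    raise ValueError("expected a 2- or 3-element tuple")


short_form = normalize(("A", "C"), receiver="B")
long_form = normalize(("A", "B", "C"), receiver="B")
```

Both calls describe the same role for node B, which shows that the short form carries no less information as long as the tuple is delivered to the node it describes.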
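According to an embodiment of the present disclosure, there may be multiple pieces of node configuration information for a single processing node, with multiple different pieces of upstream node information and/or multiple different pieces of downstream node information.
FIGS. 5a-5c show schematic diagrams of setting multiple pieces of node configuration information for a single node according to an embodiment of the present disclosure, where FIG. 5a shows a single node with multiple inputs, FIG. 5b shows a single node with multiple outputs, and FIG. 5c shows a single node with multiple inputs and multiple outputs.
As shown in FIG. 5a, node Z is the current node and has two upstream nodes, A and B, and one downstream node, C. To realize this configuration, the node configuration information sent to processing node Z may include <A, Z, ∅>, <B, Z, ∅>, and <∅, Z, C>, where the symbol ∅ denotes empty. In this embodiment, processing node Z can receive data from processing nodes A and B and, after computing and processing at processing node Z, send the processed data to processing node C. In FIG. 5a, a box also shows, by way of example, the task processing part that processes and computes the data from processing node A and processing node B. This may correspond to the task processing device in FIG. 1 and is not described again below.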
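As shown in FIG. 5b, node Z is the current node and has one upstream node, A, and two downstream nodes, C and D. To realize this configuration, the node configuration information sent to processing node Z may include <A, Z, ∅>, <∅, Z, C>, and <∅, Z, D>, where the symbol ∅ denotes empty. In this embodiment, processing node Z can receive data from processing node A and, after computing and processing at processing node Z, send the processed data to processing nodes C and D.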
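As shown in FIG. 5c, node Z is the current node and has two upstream nodes, A and B, and two downstream nodes, C and D. To realize this configuration, the node configuration information sent to processing node Z may include <A, Z, ∅>, <B, Z, ∅>, <∅, Z, C>, and <∅, Z, D>, where the symbol ∅ denotes empty. In this embodiment, processing node Z can receive data from processing nodes A and B and, after computing and processing at processing node Z, send the processed data to processing nodes C and D.
It should be understood that the examples above use two upstream nodes and two downstream nodes, but those skilled in the art can extend the upstream and downstream nodes to any number that the number of ports allows. In addition, the tuple may also include only the upstream node and the downstream node, without the current node.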
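A node that receives several pieces of node configuration information needs to combine them into one picture of its inputs and outputs. The sketch below shows one way this could be done. It assumes that each tuple describes a single link with the other side left empty (`None`), which is inferred from the description of FIG. 5c rather than quoted from it; `merge_configs` is a hypothetical helper.

```python
from typing import List, Optional, Tuple

Config = Tuple[Optional[str], str, Optional[str]]


def merge_configs(configs: List[Config]) -> Tuple[str, List[str], List[str]]:
    """Collect the distinct upstream and downstream nodes of one current node."""
    currents = {current for _, current, _ in configs}
    if len(currents) != 1:
        raise ValueError("all tuples must name the same current node")
    inputs: List[str] = []
    outputs: List[str] = []
    for upstream, _, downstream in configs:
        if upstream is not None and upstream not in inputs:
            inputs.append(upstream)
        if downstream is not None and downstream not in outputs:
            outputs.append(downstream)
    return currents.pop(), inputs, outputs


# Assumed tuple list for the FIG. 5c case: two inputs and two outputs on node Z.
fig_5c = [("A", "Z", None), ("B", "Z", None), (None, "Z", "C"), (None, "Z", "D")]
node, inputs, outputs = merge_configs(fig_5c)
```

Because the helper only inspects the non-empty elements, it works equally for tuples that fill in both the upstream and the downstream node.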
In addition, although ∅ above denotes empty, it acts as a bridge within the same node. For example, <B, Z, ∅> and <∅, Z, C> can indicate that processing node Z is a bridging node between processing nodes B and C.
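According to an embodiment of the present disclosure, one of the upstream node information and the downstream node information may be empty.
Besides the cases above in which the upstream node or downstream node is empty, there are other cases. For example, when a processing node is an endpoint of the topology, its upstream node or downstream node is empty. This is described in more detail below.
According to an embodiment of the present disclosure, sending the node configuration information to at least two processing nodes to construct the communication topology structure includes: sending different node configuration information to at least some of all the processing nodes, so as to build the at least some processing nodes into different communication topology structures.
As can be seen from the above description, sending different node configuration information to each processing node lets the receiving processing nodes form different connection relationships. By sending node configuration information to multiple processing nodes, more complex and more varied topologies can thus be formed.
FIGS. 6a to 6c show, by way of example, schematic diagrams of a chain topology, a ring topology, and a tree topology, respectively.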
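As shown in FIG. 6a, processing nodes A, B, and C form a chain topology. The three processing nodes are connected serially in sequence. Processing node A is an endpoint, and its node configuration information is <∅, A, B>, which indicates that processing node A is the current node, its upstream node is empty, and its downstream node is processing node B. Similarly, the node configuration information of processing node B is <A, B, C>, which indicates that processing node B is the current node, its upstream node is processing node A, and its downstream node is processing node C. Similarly, the node configuration information of processing node C is <B, C, ∅>, which indicates that processing node C is the current node, its upstream node is processing node B, and its downstream node is empty.
As shown in FIG. 6b, processing nodes A, B, and C form a ring topology. The three processing nodes are connected serially in sequence, and processing nodes A and C are also connected, thereby forming a ring. The node configuration information of processing node A is <C, A, B>, which indicates that processing node A is the current node, its upstream node is processing node C, and its downstream node is processing node B. Similarly, the node configuration information of processing node B is <A, B, C>, which indicates that processing node B is the current node, its upstream node is processing node A, and its downstream node is processing node C. Similarly, the node configuration information of processing node C is <B, C, A>, which indicates that processing node C is the current node, its upstream node is processing node B, and its downstream node is processing node A.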
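The chain and ring cases follow a regular pattern, so the tuples can be generated from an ordered list of node names. The sketch below is an illustration only; `chain_configs` and `ring_configs` are hypothetical helper names, and `None` stands for the empty element.

```python
from typing import List, Optional, Tuple

Config = Tuple[Optional[str], str, Optional[str]]


def chain_configs(nodes: List[str]) -> List[Config]:
    """One <upstream, current, downstream> tuple per node; endpoints get None."""
    last = len(nodes) - 1
    return [
        (nodes[i - 1] if i > 0 else None,
         nodes[i],
         nodes[i + 1] if i < last else None)
        for i in range(len(nodes))
    ]


def ring_configs(nodes: List[str]) -> List[Config]:
    """Like a chain, but the first and last nodes are linked to close the ring."""
    n = len(nodes)
    if n < 2:
        raise ValueError("a ring needs at least two nodes")
    # nodes[i - 1] wraps to the last node when i == 0.
    return [(nodes[i - 1], nodes[i], nodes[(i + 1) % n]) for i in range(n)]


chain = chain_configs(["A", "B", "C"])
ring = ring_configs(["A", "B", "C"])
```

The only difference between the two results is at the endpoints, which matches the observation that a topology is changed simply by changing the content of the node configuration information.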
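As shown in FIG. 6c, processing nodes A, B, C, and D form a tree topology. Processing nodes A and B are each connected to processing node C, and processing node D is connected to processing node C. The node configuration information of processing node A is <∅, A, C>, which indicates that processing node A is the current node, its upstream node is empty, and its downstream node is processing node C. Similarly, the node configuration information of processing node B is <∅, B, C>, which indicates that processing node B is the current node, its upstream node is empty, and its downstream node is processing node C.
Processing node C has two inputs and one output, so it needs three groups of node configuration information: <A, C, ∅>, <B, C, ∅>, and <∅, C, D>. The node configuration information <A, C, ∅> indicates that the current node is C and that it has upstream node A; <B, C, ∅> indicates that the current node is C and that it has upstream node B; and <∅, C, D> indicates that the current node is C and that it has downstream node D.
The node configuration information of processing node D is <C, D, ∅>, which indicates that the current node is D, its upstream node is processing node C, and its downstream node is empty.
It should be understood that FIGS. 6a-6c above are only a few examples of the many possible topologies. Those skilled in the art can construct whatever topology is required by modifying the node configuration information and delivering it to different nodes. In addition, for brevity, the task processing part shown in FIGS. 5a-5c is omitted from FIGS. 6a to 6c.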
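For topologies that do not follow a simple linear pattern, the node configuration information can be derived from a list of directed links. The sketch below is an illustration under an assumed convention: every link produces one tuple for its source (with the upstream element empty) and one for its destination (with the downstream element empty). The helper name `configs_from_edges` is hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Config = Tuple[Optional[str], str, Optional[str]]


def configs_from_edges(edges: List[Tuple[str, str]]) -> Dict[str, List[Config]]:
    """Map each node to its tuples, given directed (source, destination) links."""
    configs: Dict[str, List[Config]] = {}
    for src, dst in edges:
        configs.setdefault(src, [])
        configs.setdefault(dst, [])
    for src, dst in edges:
        configs[src].append((None, src, dst))  # src gains an output link
        configs[dst].append((src, dst, None))  # dst gains an input link
    return configs


# The tree of FIG. 6c: A and B feed C, and C feeds D.
tree = configs_from_edges([("A", "C"), ("B", "C"), ("C", "D")])
```

For the tree of FIG. 6c this yields one tuple each for A, B, and D and three tuples for C. The same helper applied to a chain would emit two single-link tuples for a middle node instead of one combined tuple such as <A, B, C>, so it is a convention choice rather than the only valid encoding.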
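This way of configuring makes it easy for the user to construct different topologies in a simple manner, which simplifies operation and improves efficiency.
According to an embodiment of the present disclosure, constructing the communication topology structure may include causing the processing nodes in the communication topology structure to reserve resources.
After the communication topology structure has been constructed in the manner described above, resources such as communication resources and/or register resources can be reserved for all processing nodes in the constructed topology. These resources can be used for the communication, storage, computation, and so on that the processing nodes carry out later, so that the processing nodes do not need to request resources on the fly during processing, which makes subsequent processing more efficient.
The communication resources mentioned above may include the ports and/or channels required for communication between nodes. A communication port is the network medium port module of the physical connection between two processing nodes. A communication channel is the virtual communication link between the matched sending and receiving devices of two processing nodes; put simply, it is the choice of which sending DMA module and which receiving DMA module are selected from a large group of DMAs.
The register resources may include storage space for storing task description information, where the task description information is used to indicate the operations to be performed by each processing node in the constructed communication topology structure. The task description information may, for example, specify what operation each processing node should perform (send, compute, receive, and so on), how to perform the operation, and when to perform it.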
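Reserving resources up front can be sketched as a check-then-commit step across all nodes of the topology. The all-or-nothing policy below is an assumption chosen for illustration of how partial reservation, and the deadlock it can lead to, might be avoided; the application itself only states that resources are reserved when the topology is built. All names, and the simple counters for ports, channels, and task-descriptor slots, are hypothetical.

```python
from typing import Dict, Tuple

Demand = Tuple[int, int, int]  # (ports, channels, task-descriptor slots)


class NodeResources:
    """Free resource counters of one processing node."""

    def __init__(self, ports: int, channels: int, slots: int) -> None:
        self.ports, self.channels, self.slots = ports, channels, slots

    def can_reserve(self, demand: Demand) -> bool:
        p, c, s = demand
        return p <= self.ports and c <= self.channels and s <= self.slots

    def reserve(self, demand: Demand) -> None:
        p, c, s = demand
        self.ports -= p
        self.channels -= c
        self.slots -= s


def reserve_topology(nodes: Dict[str, NodeResources],
                     demands: Dict[str, Demand]) -> bool:
    """Reserve on every node, or on none if any node cannot satisfy its demand."""
    # Phase 1: check every node before touching any of them.
    if not all(nodes[name].can_reserve(d) for name, d in demands.items()):
        return False
    # Phase 2: every check passed, so commit on all nodes.
    for name, d in demands.items():
        nodes[name].reserve(d)
    return True


nodes = {"A": NodeResources(2, 2, 1), "B": NodeResources(1, 1, 1)}
too_much = reserve_topology(nodes, {"A": (1, 1, 1), "B": (2, 1, 1)})
fits = reserve_topology(nodes, {"A": (1, 1, 1), "B": (1, 1, 1)})
```

In the first call node B cannot supply two ports, so nothing is taken from node A either; the second call succeeds and decrements the counters on both nodes.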
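FIG. 7 shows a device for constructing a communication topology structure based on multiple processing nodes according to an embodiment of the present disclosure, including: a first device M710 configured to construct node configuration information, where the node configuration information includes upstream node information, current node information, and downstream node information; and a second device M720 configured to send the node configuration information to at least two processing nodes to construct a communication topology structure.
The above device can be implemented in software, hardware, firmware, and so on, to realize the functions shown in FIG. 4a. The device can be set up in or integrated into any other device, such as a host or a server.
Accordingly, the present disclosure also provides a system for constructing a communication topology structure based on multiple processing nodes, including: multiple processing nodes; and a host, where the host includes a building module, and the building module includes: a first device M710 configured to construct node configuration information, where the node configuration information includes upstream node information, current node information, and downstream node information; and a second device M720 configured to send the node configuration information to at least two processing nodes to construct a communication topology structure.
According to another aspect of the present disclosure, an electronic device is also provided, including: one or more processors; and a memory storing computer-executable instructions that, when run by the one or more processors, cause the electronic device to execute the method described above.
According to yet another aspect of the present disclosure, a computer-readable storage medium is also provided, including computer-executable instructions that, when run by one or more processors, perform the method described above.
The resource pre-application method of the present disclosure achieves consistent occupation of multi-node resources in distributed scenarios and alleviates the resource deadlock caused when some nodes of the processing device cannot obtain sufficient resources. It also provides automatic routing of data receiving, computing, and sending on the processing device, without the host actively intervening in the execution process of the processing device. It is user-friendly: the user does not need to understand the underlying hardware structure or the complex configuration of descriptors or templates, which reduces the development complexity of distributed tasks (for example, AllReduce).
The technical solution of the present disclosure can be applied to the field of artificial intelligence and implemented in a host, in a server, or as or in an artificial intelligence chip. The chip can exist alone or be included in the communication configuration device.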
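FIG. 8 shows a combined processing device 800, which includes the communication configuration device 802 described above, an interconnection interface 804, and other processing devices 806. The communication configuration device according to the present disclosure interacts with the other processing devices to jointly complete the operation specified by the user. FIG. 8 is a schematic diagram of the combined processing device.
The other processing devices include one or more types of general-purpose or special-purpose processors, such as a central processing unit (CPU), a graphics processing unit (GPU), or a neural network processor. The number of processors included in the other processing devices is not limited. The other processing devices serve as the interface between the machine learning computing device and external data and control, performing data transfer and basic control such as starting and stopping the machine learning computing device. The other processing devices can also cooperate with the machine learning computing device to complete computing tasks.
The interconnection interface is used to transmit data and control instructions between the communication configuration device (including, for example, a machine learning computing device) and the other processing devices. The communication configuration device obtains the required input data from the other processing devices and writes it to the on-chip storage device of the communication configuration device; it can obtain control instructions from the other processing devices and write them to the on-chip control buffer of the communication configuration device; it can also read the data in the storage module of the communication configuration device and transmit it to the other processing devices.
Optionally, the structure may further include a storage device 808, which is connected to the communication configuration device and to the other processing devices. The storage device is used to store data of the communication configuration device and the other processing devices, and is particularly suitable for data to be computed that cannot be held entirely in the internal storage of the communication configuration device or the other processing devices.
The combined processing device can be used as a system on chip (SoC) for equipment such as mobile phones, robots, unmanned aerial vehicles, and video surveillance equipment, effectively reducing the core area of the control part, increasing processing speed, and reducing overall power consumption. In this case, the interconnection interface of the combined processing device is connected to certain components of the equipment, such as a camera, a monitor, a mouse, a keyboard, a network card, or a Wi-Fi interface.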
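In some embodiments, the present disclosure also discloses a board card, which includes a chip packaging structure. Referring to FIG. 9, which provides an exemplary board card, the board card may include, in addition to the chip 902, other supporting components, including but not limited to: a storage device 904, an interface device 906, and a control device 908.
The storage device is connected by a bus to the chip in the chip packaging structure and is used to store data. The storage device may include multiple groups of storage units 910. Each group of storage units is connected to the chip by a bus. Each group of storage units may be DDR SDRAM (Double Data Rate SDRAM, double data rate synchronous dynamic random access memory).
DDR can double the speed of SDRAM without raising the clock frequency. DDR allows data to be read on both the rising and falling edges of the clock pulse, so its speed is twice that of standard SDRAM. In one embodiment, the storage device may include four groups of the storage units. Each group of storage units may include multiple DDR4 chips (granules). In one embodiment, the chip may contain four 72-bit DDR4 controllers, in each of which 64 bits are used for data transmission and 8 bits for ECC checking. In one embodiment, each group of storage units includes multiple double data rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. A controller for the DDR is provided in the chip to control the data transmission and data storage of each storage unit.
The interface device is electrically connected to the chip in the chip packaging structure. The interface device is used to implement data transmission between the chip and an external device 912 (for example, a server or a computer). In one embodiment, for example, the interface device may be a standard PCIE interface: the data to be processed is passed from the server to the chip through the standard PCIE interface to realize data transfer. In another embodiment, the interface device may be another kind of interface. The present disclosure does not limit the specific form of such other interfaces, as long as the interface unit can realize the transfer function. In addition, the computation results of the chip are transmitted back to the external device (for example, a server) by the interface device.
The control device is electrically connected to the chip and is used to monitor the state of the chip. Specifically, the chip and the control device may be electrically connected through an SPI interface. The control device may include a microcontroller unit (MCU). The chip may include multiple processing chips, multiple processing cores, or multiple processing circuits and can drive multiple loads, so the chip can be in different working states such as multi-load and light-load. The control device can regulate the working states of the multiple processing chips, multiple processing cores, and/or multiple processing circuits in the chip.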
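In some embodiments, the present disclosure also discloses an electronic device or apparatus, which includes the board card described above.
Electronic devices or apparatuses include data processing devices, robots, computers, printers, scanners, tablets, smart terminals, mobile phones, driving recorders, navigators, sensors, webcams, servers, cloud servers, still cameras, video cameras, projectors, watches, earphones, mobile storage, wearable devices, means of transportation, household appliances, and/or medical equipment.
The means of transportation include airplanes, ships, and/or vehicles; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods; the medical equipment includes nuclear magnetic resonance scanners, B-mode ultrasound scanners, and/or electrocardiographs.
It should be noted that, for simplicity of description, the foregoing method embodiments are each expressed as a series of action combinations. Those skilled in the art should understand, however, that the present disclosure is not limited by the described order of actions, because according to the present disclosure some steps may be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in the description are all optional embodiments, and the actions and modules involved are not necessarily required by the present disclosure.
In the above embodiments, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, refer to the related descriptions of other embodiments.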
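In the several embodiments provided in the present disclosure, it should be understood that the disclosed device may be implemented in other ways. For example, the device embodiments described above are merely illustrative. The division of the units is only a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, optical, acoustic, magnetic, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present disclosure may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit can be implemented in the form of hardware or in the form of a software program module.
If the integrated unit is implemented in the form of a software program module and sold or used as an independent product, it can be stored in a computer-readable memory. Based on this understanding, the technical solution of the present disclosure may be embodied in the form of a software product. The computer software product is stored in a memory and includes several instructions that enable a computer device (which can be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present disclosure. The aforementioned memory includes: a USB flash drive, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), a mobile hard disk, a magnetic disk, an optical disk, and other media that can store program code.
The embodiments of the present disclosure have been described in detail above, and specific examples have been used herein to explain the principles and implementations of the present disclosure. The descriptions of the above embodiments are intended only to help in understanding the method of the present disclosure and its core ideas. At the same time, those of ordinary skill in the art may, in accordance with the ideas of the present disclosure, make changes to the specific implementations and the scope of application. In summary, the content of this description should not be construed as limiting the present disclosure.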
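The foregoing can be better understood in light of the following clauses:
Clause 1. A method for constructing a communication topology structure based on multiple processing nodes, including:
constructing node configuration information, where the node configuration information includes upstream node information, current node information, and downstream node information; and
sending the node configuration information to at least two processing nodes to construct the communication topology structure.
Clause 2. The method according to Clause 1, wherein
the upstream node information is used to indicate the processing node that transmits data to the current node, the current node information is used to indicate the processing node that performs computation on the received data, and the downstream node information is used to indicate the processing node that receives the computed data from the current node.
Clause 3. The method according to Clause 1, wherein the node configuration information takes the form of a queue tuple <upstream node, downstream node> or <upstream node, current node, downstream node>.
Clause 4. The method according to Clause 1, wherein there are multiple pieces of node configuration information for a single processing node, with multiple different pieces of upstream node information and/or multiple different pieces of downstream node information.
Clause 5. The method according to any one of Clauses 1-4, wherein one of the upstream node information and the downstream node information is empty.
Clause 6. The method according to any one of Clauses 1-5, wherein sending the node configuration information to at least two processing nodes to construct the communication topology structure includes:
sending different node configuration information to at least some of all the processing nodes, so as to build the at least some processing nodes into different communication topology structures.
Clause 7. The method according to Clause 6, wherein the communication topology structure includes at least one of a chain topology, a ring topology, and a tree topology.
Clause 8. The method according to any one of Clauses 1-7, wherein constructing the communication topology structure includes causing the processing nodes in the communication topology structure to reserve resources.
Clause 9. The method according to Clause 8, wherein the resources include communication resources and/or register resources.
Clause 10. The method according to Clause 9, wherein
the communication resources include: ports and/or channels required for communication between nodes; and
the register resources include: storage space for storing task description information, where the task description information is used to indicate the operations to be performed by each processing node in the constructed communication topology structure.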
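Clause 11. A device for constructing a communication topology structure based on multiple processing nodes, including:
a first device configured to construct node configuration information, where the node configuration information includes upstream node information, current node information, and downstream node information; and
a second device configured to send the node configuration information to at least two processing nodes to construct a communication topology structure.
Clause 12. A system for constructing a communication topology structure based on multiple processing nodes, including:
multiple processing nodes; and
a host, where the host includes a building module, and the building module includes:
a first device configured to construct node configuration information, where the node configuration information includes upstream node information, current node information, and downstream node information; and
a second device configured to send the node configuration information to at least two processing nodes to construct a communication topology structure.
Clause 13. An electronic device, including:
one or more processors; and
a memory storing computer-executable instructions that, when run by the one or more processors, cause the electronic device to execute the method according to any one of Clauses 1-10.
Clause 14. A computer-readable storage medium including computer-executable instructions that, when run by one or more processors, perform the method according to any one of Clauses 1-10.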

Claims (14)

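  1. A method for constructing a communication topology structure based on multiple processing nodes, including:
    constructing node configuration information, where the node configuration information includes upstream node information, current node information, and downstream node information; and
    sending the node configuration information to at least two processing nodes to construct the communication topology structure.
  2. The method according to claim 1, wherein
    the upstream node information is used to indicate the processing node that transmits data to the current node, the current node information is used to indicate the processing node that performs computation on the received data, and the downstream node information is used to indicate the processing node that receives the computed data from the current node.
  3. The method according to claim 1, wherein the node configuration information takes the form of a queue tuple <upstream node, downstream node> or <upstream node, current node, downstream node>.
  4. The method according to claim 1, wherein there are multiple pieces of node configuration information for a single processing node, with multiple different pieces of upstream node information and/or multiple different pieces of downstream node information.
  5. The method according to any one of claims 1-4, wherein one of the upstream node information and the downstream node information is empty.
  6. The method according to any one of claims 1-5, wherein sending the node configuration information to at least two processing nodes to construct the communication topology structure includes:
    sending different node configuration information to at least some of all the processing nodes, so as to build the at least some processing nodes into different communication topology structures.
  7. The method according to claim 6, wherein the communication topology structure includes at least one of a chain topology, a ring topology, and a tree topology.
  8. The method according to any one of claims 1-7, wherein constructing the communication topology structure includes causing the processing nodes in the communication topology structure to reserve resources.
  9. The method according to claim 8, wherein the resources include communication resources and/or register resources.
  10. The method according to claim 9, wherein
    the communication resources include: ports and/or channels required for communication between nodes; and
    the register resources include: storage space for storing task description information, where the task description information is used to indicate the operations to be performed by each processing node in the constructed communication topology structure.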
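  11. A device for constructing a communication topology structure based on multiple processing nodes, including:
    a first device configured to construct node configuration information, where the node configuration information includes upstream node information, current node information, and downstream node information; and
    a second device configured to send the node configuration information to at least two processing nodes to construct a communication topology structure.
  12. A system for constructing a communication topology structure based on multiple processing nodes, including:
    multiple processing nodes; and
    a host, where the host includes a building module, and the building module includes:
    a first device configured to construct node configuration information, where the node configuration information includes upstream node information, current node information, and downstream node information; and
    a second device configured to send the node configuration information to at least two processing nodes to construct a communication topology structure.
  13. An electronic device, including:
    one or more processors; and
    a memory storing computer-executable instructions that, when run by the one or more processors, cause the electronic device to execute the method according to any one of claims 1-10.
  14. A computer-readable storage medium including computer-executable instructions that, when run by one or more processors, perform the method according to any one of claims 1-10.
PCT/CN2021/080889 2020-04-24 2021-03-15 Method and device for constructing communication topology structure on basis of multiple processing nodes WO2021213076A1 (zh)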

Priority Applications (2)

Application Number Priority Date Filing Date Title
US17/920,961 US12050545B2 (en) 2020-04-24 2021-03-15 Method and device for constructing communication topology structure on basis of multiple processing nodes
EP21793522.0A EP4141685A4 (en) 2020-04-24 2021-03-15 METHOD AND DEVICE FOR CONSTRUCTING A COMMUNICATIONS TOPOLOGY STRUCTURE BASED ON SEVERAL PROCESSING NODES

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010334771.XA CN113553286A (zh) 2020-04-24 2020-04-24 Method and device for constructing communication topology structure on basis of multiple processing nodes
CN202010334771.X 2020-04-24

Publications (1)

Publication Number Publication Date
WO2021213076A1 true WO2021213076A1 (zh) 2021-10-28

Family

ID=78101330

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/080889 WO2021213076A1 (zh) 2020-04-24 2021-03-15 Method and device for constructing communication topology structure on basis of multiple processing nodes

Country Status (4)

Country Link
US (1) US12050545B2 (zh)
EP (1) EP4141685A4 (zh)
CN (1) CN113553286A (zh)
WO (1) WO2021213076A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104572182A (zh) * 2014-12-23 2015-04-29 杭州华为数字技术有限公司 一种流应用的配置方法、节点及流计算系统
CN106383738A (zh) * 2016-09-30 2017-02-08 北京百度网讯科技有限公司 任务处理方法和分布式计算框架
CN109254842A (zh) * 2017-07-12 2019-01-22 腾讯科技(深圳)有限公司 分布式流式系统的资源管理方法、装置及可读存储介质
CN110262995A (zh) * 2019-07-15 2019-09-20 北京一流科技有限公司 执行体创建系统和执行体创建方法
WO2020068209A1 (en) * 2018-09-28 2020-04-02 Microsoft Technology Licensing, Llc Static streaming job startup sequence

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005198201A (ja) * 2004-01-09 2005-07-21 Ntt Docomo Inc ネットワークトポロジー構成方法及びノード
US9250973B2 (en) * 2009-03-12 2016-02-02 Polycore Software, Inc. Apparatus and associated methodology of generating a multi-core communications topology
US9948520B2 (en) * 2016-04-13 2018-04-17 Lenovo Enterprise Solutions (Singapore) Pte. Ltd. Efficiently determining network topology
CN108390771B (zh) * 2018-01-25 2021-04-16 中国银联股份有限公司 一种网络拓扑重建方法和装置
US11341009B1 (en) * 2019-01-18 2022-05-24 EMC IP Holding Company LLC Directing placement of data in cloud storage nodes

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4141685A4 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115208768A (zh) * 2022-06-15 2022-10-18 中山大学 Allreduce method for Dragonfly topology
CN115208768B (zh) * 2022-06-15 2023-08-01 中山大学 Allreduce method for Dragonfly topology

Also Published As

Publication number Publication date
US12050545B2 (en) 2024-07-30
EP4141685A4 (en) 2024-03-20
US20230169031A1 (en) 2023-06-01
EP4141685A1 (en) 2023-03-01
CN113553286A (zh) 2021-10-26

Similar Documents

Publication Publication Date Title
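WO2020078470A1 (zh) Network-on-chip data processing method and device
TW201805858A (zh) Device and method for performing neural network operations
KR101950786B1 (ko) Method for accelerating artificial neural network operations for distributed processing
CN104699654A (zh) Interconnection adaptation system and method based on the CHI on-chip interconnect bus and the QPI inter-chip interconnect bus
CN106844263B (zh) Configurable multiprocessor computer system and implementation method
CN112805727A (zh) Artificial neural network operation acceleration device for distributed processing, artificial neural network acceleration system using the same, and acceleration method for the artificial neural network
CN117493237B (zh) Computing device, server, data processing method, and storage medium
CN108256643A (zh) HMC-based neural network operation device and method
WO2022247880A1 (zh) Method for fusing operators of a neural network and related product
CN117978759B (zh) Interconnection device, high-performance switching device, and large-model all-in-one machine
WO2021213076A1 (zh) Method and device for constructing a communication topology structure based on multiple processing nodes
CN117687956B (zh) Multi-accelerator-card heterogeneous server and resource link reconstruction method
WO2021185262A1 (zh) Computing device and method, board card, and computer-readable storage medium
WO2021213075A1 (zh) Method and device for inter-node communication based on multiple processing nodes
WO2022143194A1 (zh) Method and device for executing asynchronous tasks, and computer program product
CN111340202B (zh) Operation method, device, and related product
WO2022088171A1 (en) Neural processing unit synchronization systems and methods
CN111078625B (zh) Network-on-chip processing system and network-on-chip data processing method
CN111078624B (zh) Network-on-chip processing system and network-on-chip data processing method
CN111078623B (zh) Network-on-chip processing system and network-on-chip data processing method
CN111047030A (zh) Operation method, device, computer equipment, and storage medium
CN111966399A (zh) Instruction processing method, device, and related product
CN112396186B (zh) Execution method, device, and related product
WO2020156212A1 (zh) Data processing method and apparatus, and electronic device
CN118260238A (zh) Inter-chip communication method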

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 21793522; Country of ref document: EP; Kind code of ref document: A1
ENP Entry into the national phase. Ref document number: 2021793522; Country of ref document: EP; Effective date: 20221124
NENP Non-entry into the national phase. Ref country code: DE