WO2023151216A1 - Method and chip for graph data processing (图数据处理的方法和芯片) - Google Patents

图数据处理的方法和芯片 (Method and chip for graph data processing)

Info

Publication number
WO2023151216A1
Authority
WO
WIPO (PCT)
Prior art keywords
processing
node
processing unit
data
graph data
Prior art date
Application number
PCT/CN2022/100707
Other languages
English (en)
French (fr)
Inventor
姚鹏程
蒋颖昕
郑龙
鲁芳敏
张学仓
金海
廖小飞
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Publication of WO2023151216A1 publication Critical patent/WO2023151216A1/zh

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/16 Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163 Interprocessor communication
    • G06F15/173 Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G06F15/17356 Indirect interconnection networks
    • G06F15/17368 Indirect interconnection networks non hierarchical topologies
    • G06F15/17381 Two dimensional, e.g. mesh, torus
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807 System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/505 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • The present application relates to the field of computers, and in particular, to a method and chip for graph data processing.
  • the performance of graph computing is affected by many factors such as the processing rate of graph data and the storage rate of graph data. At present, the processing rate of graph data cannot make full use of the high bandwidth of storage devices. How to improve the processing rate of graph data is an urgent problem to be solved.
  • The application provides a method and a chip for graph data processing. The chip can simultaneously distribute graph data to multiple processing engines in the same row, which improves the efficiency with which the chip distributes graph data and thereby helps to improve the graph data processing rate.
  • In a first aspect, a method for processing graph data is provided. The method is applied to a chip, and the chip includes N rows of processing engines (PEs) and N row buses, where the N row buses correspond to the N rows of PEs, N is an integer greater than 1, and each row of PEs includes at least 2 PEs. The method includes: acquiring first graph data and second graph data; determining a target row of PEs in which the first graph data and the second graph data need to be stored, where the target row of PEs is one of the N rows of PEs and includes a first PE and a second PE; determining the target row bus corresponding to the target row of PEs, where the target row bus is connected to the first PE through a first communication link and to the second PE through a second communication link, and neither communication link passes through any PE; transmitting the first graph data to the first PE through the target row bus over the first communication link; and transmitting the second graph data to the second PE through the target row bus over the second communication link.
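  • For illustration only (not part of the claimed method), the following minimal Python sketch models this dispatch step; the PE, RowBus, and dispatch names are hypothetical stand-ins for the claimed hardware behavior.

```python
# Hypothetical sketch of the claimed dispatch: one row bus carries two
# graph-data items to two PEs of the same row without passing through any PE.
class PE:
    def __init__(self, row, col):
        self.row, self.col = row, col
        self.cache = []          # local buffer for received graph data

    def receive(self, graph_data):
        self.cache.append(graph_data)

class RowBus:
    """One bus per PE row; each PE has a direct link to the bus."""
    def __init__(self, pes_in_row):
        self.pes = pes_in_row

    def transmit(self, graph_data, col):
        self.pes[col].receive(graph_data)   # direct link, no PE hops

N, M = 4, 4
pe_array = [[PE(r, c) for c in range(M)] for r in range(N)]
row_buses = [RowBus(row) for row in pe_array]

def dispatch(first_data, second_data, target_row, first_col, second_col):
    bus = row_buses[target_row]             # bus of the target row of PEs
    bus.transmit(first_data, first_col)     # first graph data -> first PE
    bus.transmit(second_data, second_col)   # second graph data -> second PE

dispatch({"node": 1}, {"edge": (1, 4)}, target_row=0, first_col=0, second_col=2)
```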
  • The chip acquires graph data from an external storage device. The graph data includes node loads and edge loads; a node load includes node information (such as node attribute information), and an edge load includes the node identifier of the source node and/or the node identifier of the destination node.
  • the edge load may also include edge information (such as edge attributes, weights, etc.).
  • Row buses are set inside the chip, and the row buses correspond to the N rows of processing engines.
  • Distributing graph data over the row bus, rather than by routing and forwarding through the PEs, is beneficial to improving the rate at which the chip distributes the graph data to be processed, thereby improving the overall efficiency of the chip for graph data processing.
  • The method further includes: transmitting the first graph data to the second PE through the second communication link through the target row bus.
  • Graph data processed by other processing engines in the same row can also be sent to a given processing engine through the row bus; then, when one of the multiple processing engines in the same row is in an idle state, it can obtain other graph data through the row bus. This is conducive to improving the utilization of the multiple processing engines, reducing the probability that a processing engine idles, and improving the chip's graph data processing efficiency.
  • the N rows of PEs further include a third PE and a fourth PE
  • The method further includes: the first PE computing a first calculation result based on the first graph data; the second PE computing a second calculation result based on the second graph data; and the third PE performing reduction processing on the first calculation result and the second calculation result and transmitting the reduced result to the fourth PE, where the fourth PE is the destination PE of the first calculation result and the second calculation result.
  • One or more of the multiple processing engines in the chip are used to first perform reduction processing on the intermediate data of graph data processing, which is conducive to sharing the data processing burden among the processing engines and improving the utilization of the multiple processing engines in the chip, and thus to improving the chip's graph data processing efficiency.
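  • A minimal sketch of this reduction step, assuming a sum as the reduction operator (the application leaves the operator generic); RoutingUnit and DestinationPE are hypothetical names.

```python
# Hypothetical sketch: a third PE reduces two partial results before
# forwarding one combined update to the destination PE. The reduction
# operator (here: sum) is an assumption; the patent leaves it generic.
def reduce_results(first_result: float, second_result: float) -> float:
    return first_result + second_result      # e.g. sum for PageRank-style updates

class RoutingUnit:
    def __init__(self, pe_id):
        self.pe_id = pe_id

    def forward(self, value, destination_pe):
        destination_pe.apply_update(value)   # single message instead of two

class DestinationPE:
    def __init__(self):
        self.node_value = 0.0

    def apply_update(self, value):
        self.node_value += value

dest = DestinationPE()
third_pe_router = RoutingUnit(pe_id=(2, 0))
reduced = reduce_results(0.25, 0.75)         # results from the first and second PE
third_pe_router.forward(reduced, dest)       # only the reduced value travels on
```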
  • Each PE in the N rows of PEs includes a graph processing unit. The first PE computing the first calculation result based on the first graph data includes: the graph processing unit of the first PE computing the first calculation result based on the first graph data. The second PE computing the second calculation result based on the second graph data includes: the graph processing unit of the second PE computing the second calculation result based on the second graph data.
  • Each PE in the N rows of PEs includes a routing unit. The routing unit of the third PE performs the reduction processing on the first calculation result and the second calculation result, and transmits the reduced result to the fourth PE.
  • A dedicated routing unit is set in each processing engine; the routing unit is used to perform reduction processing on calculation results and to route the reduced result to the destination processing engine.
  • the implementation of the technical solution is conducive to improving the utilization rate of the routing unit in the processing engine, improving the adaptability of the chip to different application scenarios, and improving the utilization rate of data processing resources in the chip.
  • Each PE in the N rows of PEs includes a cache. The method further includes: the first PE saving the first graph data in the cache of the first PE, and the second PE saving the second graph data in the cache of the second PE.
  • The N rows of PEs further include a fifth PE. The method further includes: the fifth PE performing reduction processing on a third processing result and a fourth processing result, where the third processing result and the fourth processing result are used to update the same graph data.
  • Any processing engine included in the chip can perform reduction processing on the intermediate data of graph data processing, which is beneficial to improving the efficiency of the chip in processing the graph data.
  • the N rows of PEs form a PE array of N rows and M columns, where M is an integer greater than 1.
  • PE communication links are set between adjacent PEs, and the PE communication links are used to implement data sharing between PEs.
  • the first graph data is node information of the source node
  • The method further includes: acquiring third graph data, where the third graph data is the edge load of an associated edge of the source node; sending the third graph data to the second PE through the second communication link; and the second PE calculating the update load of the destination node according to the first graph data and the third graph data, where the update load is used to update the node information of the destination node.
  • The edge load is sent to a calculation processing unit in the same row as the processing engine that updates the source node, and the processing engine that obtains the edge load only needs to route the edge load, within the column where it is located, to the calculation processing unit that updates the destination node.
  • the implementation of the technical solution is beneficial to reduce the communication overhead of the calculation processing unit between the columns.
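  • As a sketch of how an update load might be computed from the source node's information and an edge load, assuming an SSSP-style relaxation as the update rule (the application does not fix the rule); EdgeLoad and compute_update_load are illustrative names.

```python
# Hypothetical sketch of computing a destination node's update load from the
# source node's information (first graph data) and an edge load (third graph
# data). The SSSP-style relaxation is only an example update rule.
from dataclasses import dataclass

@dataclass
class EdgeLoad:
    src: int          # node identifier of the source node
    dst: int          # node identifier of the destination node
    weight: float     # edge information carried by the edge load

def compute_update_load(src_info: float, edge: EdgeLoad) -> tuple[int, float]:
    """Return (destination node id, update load) for one edge."""
    candidate_distance = src_info + edge.weight      # relax the edge
    return edge.dst, candidate_distance

def apply_update(node_info: dict[int, float], dst: int, update: float) -> None:
    # Update the destination node only if the new value improves on it.
    if update < node_info.get(dst, float("inf")):
        node_info[dst] = update

node_info = {1: 0.0}                                 # node 1 is the active source
dst, update = compute_update_load(node_info[1], EdgeLoad(src=1, dst=4, weight=2.5))
apply_update(node_info, dst, update)                 # node 4 now holds 2.5
```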
  • When the chip finishes updating the node information of the destination node, the chip obtains the edge load of an associated edge of the destination node, where the associated edge of the destination node is different from the associated edge of the source node.
  • the destination node is already an active node of the current iteration, and the destination node is the source node of the edge load of the current iteration.
  • That is, the execution of the next round of iteration is immediately triggered for that processing engine, instead of waiting for all processing engines to complete their updates before triggering execution.
  • the implementation of the technical solution is beneficial to reduce the idling time of the processing engine, is beneficial to the load balancing among multiple processing engines, and is beneficial to improving the data processing efficiency of the chip.
  • In a second aspect, a chip is provided. The chip includes N rows of processing engines (PEs) and N row buses, where the N row buses correspond to the N rows of PEs, N is an integer greater than 1, and each row of PEs includes at least 2 PEs.
  • The chip is used to: obtain first graph data and second graph data; determine a target row of PEs in which the first graph data and the second graph data need to be stored, where the target row of PEs is one of the N rows of PEs and includes a first PE and a second PE; determine the target row bus corresponding to the target row of PEs, where the target row bus is connected to the first PE through a first communication link and to the second PE through a second communication link, and neither communication link passes through any PE; transmit the first graph data to the first PE through the target row bus over the first communication link; and transmit the second graph data to the second PE through the target row bus over the second communication link.
  • The chip is further configured to: transmit the first graph data to the second PE through the second communication link through the target row bus.
  • The N rows of PEs of the chip further include a third PE and a fourth PE. The first PE is used to compute a first calculation result based on the first graph data; the second PE is used to compute a second calculation result based on the second graph data; the third PE is used to perform reduction processing on the first calculation result and the second calculation result and to transmit the reduced result to the fourth PE, where the fourth PE is the destination PE of the first calculation result and the second calculation result.
  • Each PE of the N rows of PEs includes a graph processing unit. The graph processing unit of the first PE is used to compute the first calculation result based on the first graph data; the graph processing unit of the second PE is configured to compute the second calculation result based on the second graph data.
  • Each PE of the N rows of PEs includes a routing unit, and the routing unit of the third PE is used to perform reduction processing on the first calculation result and the second calculation result and to transmit the reduced result to the fourth PE.
  • Each PE of the N rows of PEs includes a cache. The first PE is further used to save the first graph data in the cache of the first PE, and the second PE is further used to save the second graph data in the cache of the second PE.
  • The N rows of PEs further include a fifth PE, and the fifth PE is configured to perform reduction processing on a third processing result and a fourth processing result, where the third processing result and the fourth processing result are used to update the same graph data.
  • the N rows of PEs form a PE array of N rows and M columns, where M is an integer greater than 1.
  • PE communication links are set between adjacent PEs, and the PE communication links are used to implement data sharing between PEs.
  • the first graph data is node information of the source node
  • The chip is further used to: acquire third graph data, where the third graph data is the edge load of an associated edge of the source node, and send the third graph data to the second PE through the second communication link. The second PE is further used to calculate the update load of the destination node according to the first graph data and the third graph data, where the update load is used to update the node information of the destination node.
  • When the chip finishes updating the node information of the destination node, the chip is further configured to obtain the edge load of an associated edge of the destination node, where the associated edge of the destination node is different from the associated edge of the source node.
  • In a third aspect, a graph data processing device is provided, comprising: an acquisition unit configured to acquire first graph data and second graph data; N rows of processing units configured to process the first graph data and the second graph data, where N is an integer greater than 1 and each row of processing units includes at least 2 processing units; N row buses corresponding to the N rows of processing units; and a dispatch unit configured to determine a target row of processing units in which the first graph data and the second graph data need to be stored, where the target row of processing units is one of the N rows of processing units and includes a first processing unit and a second processing unit. The dispatch unit is further configured to determine the target row bus corresponding to the target row of processing units, where the target row bus is connected to the first processing unit through a first communication link and to the second processing unit through a second communication link, and neither communication link passes through any processing unit. The dispatch unit is further configured to transmit the first graph data to the first processing unit through the target row bus over the first communication link, and to transmit the second graph data to the second processing unit through the target row bus over the second communication link.
  • The dispatch unit is further configured to transmit the first graph data to the second processing unit through the second communication link through the target row bus.
  • The N rows of processing units further include a third processing unit and a fourth processing unit. The first processing unit is configured to compute a first calculation result based on the first graph data; the second processing unit is configured to compute a second calculation result based on the second graph data; the third processing unit is configured to perform reduction processing on the first calculation result and the second calculation result and to transmit the reduced result to the fourth processing unit, where the fourth processing unit is the destination processing unit of the first calculation result and the second calculation result.
  • Each processing unit of the N rows of processing units includes a graph processing subunit. The graph processing subunit of the first processing unit is configured to compute the first calculation result based on the first graph data; the graph processing subunit of the second processing unit is configured to compute the second calculation result based on the second graph data.
  • Each processing unit of the N rows of processing units includes a routing subunit, and the routing subunit of the third processing unit is used to perform reduction processing on the first calculation result and the second calculation result and to transmit the reduced result to the fourth processing unit.
  • Each processing unit of the N rows of processing units includes a storage subunit. The first processing unit is further used to save the first graph data in the storage subunit of the first processing unit, and the second processing unit is further used to save the second graph data in the storage subunit of the second processing unit.
  • The N rows of processing units further include a fifth processing unit, and the fifth processing unit is configured to perform reduction processing on a third processing result and a fourth processing result, where the third processing result and the fourth processing result are used to update the same graph data.
  • the N rows of processing units form a processing unit array of N rows and M columns, where M is an integer greater than 1.
  • Processing unit communication links are provided between adjacent processing units, and the processing unit communication links are used to implement data sharing between processing units.
  • The first graph data is node information of the source node. The acquiring unit is further configured to acquire third graph data, where the third graph data is the edge load of an associated edge of the source node. The dispatching unit is further configured to send the third graph data to the second processing unit through the second communication link. The second processing unit is further configured to calculate the update load of the destination node according to the first graph data and the third graph data, where the update load is used to update the node information of the destination node.
  • When the graph data processing device finishes updating the node information of the destination node, the acquiring unit is further configured to obtain the edge load of an associated edge of the destination node, where the associated edge of the destination node is different from the associated edge of the source node.
  • In a fourth aspect, a chipset is provided. The chipset includes a processor and the chip described in the second aspect; the processor is coupled to the chip, and the processor is used to control the chip to implement the method in the first aspect or any possible implementation manner thereof.
  • an electronic device including the chip in the second aspect.
  • An electronic device is provided, including the chipset in the fourth aspect.
  • A computer program product is provided. The computer program product includes computer program code; when the computer program code is run on a computer, the method in the first aspect or any possible implementation manner thereof is performed.
  • A computer-readable storage medium is provided. Computer instructions are stored in the computer-readable storage medium; when the computer instructions are run on a computer, the method in the first aspect or any possible implementation manner thereof is performed.
  • FIG. 1 is a graph data structure provided by an embodiment of this application.
  • FIG. 2 is an application scenario of a chip provided by an embodiment of the present application.
  • FIG. 3 is a structural diagram of a chip provided by an embodiment of the present application.
  • FIG. 4 is a structural diagram of another chip provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a graph data processing method provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of another graph data processing method provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of another graph data processing method provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of another graph data processing method provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of another graph data processing method provided by an embodiment of the present application.
  • FIG. 10 is a schematic diagram of another graph data processing method provided by an embodiment of the present application.
  • FIG. 11 is a schematic diagram of a graph data processing device provided by an embodiment of the present application.
  • references to "one embodiment” or “some embodiments” or the like in this specification means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application.
  • Appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," etc. in various places in this specification do not necessarily all refer to the same embodiment, but mean "one or more but not all embodiments" unless specifically stated otherwise.
  • the terms “including”, “comprising”, “having” and variations thereof mean “including but not limited to”, unless specifically stated otherwise.
  • FIG. 1 is a schematic diagram of a graph data structure provided by this application.
  • a graph is an abstract data type.
  • The data structure of a graph contains a finite set serving as the set of nodes (such as node 111 shown in Figure 1), and a set of unordered pairs or ordered pairs serving as the set of edges (such as edge 121 shown in Figure 1).
  • Nodes can be part of the graph structure or external entities denoted by integer subscripts or references.
  • the graph data structure may also contain an edge value, such as a weight, associated with each edge.
  • the graph data structure 1 shown in FIG. 1 includes multiple nodes such as node 111 , node 112 and node 113 and multiple edges such as edge 121 , edge 122 and edge 123 .
  • the nodes 111, 112, and 113 are adjacent nodes to each other.
  • the node 111 is connected to the node 112 through the edge 121
  • the node 111 is connected to the node 113 through the edge 122
  • the node 112 and the node 113 are connected through the edge 123 .
  • the node 111 may also be referred to as the source node and the node 113 as the destination node.
  • Graph computing refers to the process of modeling data in the form of a graph, and analyzing the graph data by calculating the attributes of nodes or edges in the graph (that is, graph attribute analysis) to obtain processing results.
  • Graph computing is a high-performance computing technology for processing grid graphs. Through graph computing, the relationship between different nodes can be obtained or the status of nodes and edges in the graph can be updated.
  • The node information of the source node or of the destination node can be regarded as one or more attributes of that node. The edge connecting the source node and the destination node also has attributes, which are called edge loads (or edge workloads, edge information) here.
  • Node information and edge loads have different practical meanings; both can be referred to as graph data in the graph computing process.
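  • A minimal sketch of this notion of graph data, with node information and edge loads held side by side; the Edge and Graph classes and their field names are illustrative, not from the application.

```python
# Hypothetical sketch of the graph data described here: node information and
# edge loads are both "graph data". Field names are illustrative only.
from dataclasses import dataclass, field

@dataclass
class Edge:
    src: int                       # node identifier of the source node
    dst: int                       # node identifier of the destination node
    load: float = 1.0              # edge load (edge information / weight)

@dataclass
class Graph:
    node_info: dict[int, float] = field(default_factory=dict)  # node -> attribute
    edges: list[Edge] = field(default_factory=list)

    def associated_edges(self, node: int) -> list[Edge]:
        """Edges whose source is the given (active) node."""
        return [e for e in self.edges if e.src == node]

# The graph of Figure 1, nodes 111-113 connected by edges 121-123:
g = Graph(node_info={111: 1.0, 112: 0.0, 113: 0.0},
          edges=[Edge(111, 112), Edge(111, 113), Edge(112, 113)])
print(g.associated_edges(111))     # the active edges of source node 111
```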
  • For example, a social network can be regarded as a graph in which personal accounts and official accounts are nodes, and a person's follows of and likes on an official account are edges; node information in a social network may include, for example, a person's browsing records and browsing time on web pages.
  • the process of determining the popularity of an official account by the number or frequency of personal attention and likes on the official account can be regarded as the process of determining the node information of the destination node according to the node information and edge load of the source node in graph computing.
  • the transaction network can be regarded as a graph composed of individuals and commodities as nodes, and individuals' purchases and collections of commodities as edges.
  • The process of determining the annual sales target of a commodity in the transaction network based on purchases of the commodity and the monthly growth and change of its favorites can be regarded as the process of determining the node information of the destination node according to the edge load in graph computing.
  • the process of determining the information of other nodes or updating the information of other nodes according to the information of some nodes and the information of some edges between nodes belongs to a kind of graph calculation.
  • For an edge, either of the two endpoints it contains can be used as a source node or as a destination node.
  • the active node is used as the source node, and the other end point of the edge opposite to the active node is used as the destination node.
  • The data structure of a graph is used as the processing object of the chip provided by this application. It should be understood that the chip provided by this application is also applicable to data organized in other ways, such as stacks, queues, arrays, linked lists, trees, heaps, and hash tables; this application does not limit this.
  • the general-purpose processing architecture based on control flow usually exhibits low instruction per cycle (IPC) throughput in the graph calculation process, that is, the processing and calculation efficiency of the computing core is low.
  • FIG. 2 is a schematic diagram of a usage scenario of the chip provided by the present application.
  • A central processing unit (CPU) 21 includes one or more processor cores, and in this embodiment of the present application, the CPU is used to process graph data.
  • the chip 22 can also be called an accelerator (accelerator), which can be provided with one or more accelerator memories (off-chip caches) 24, and the accelerator memories are used to store graph data that needs to be processed.
  • The accelerator includes a memory controller and multiple computing processing units (process elements, PEs), which may also be referred to as processing engines (process engines).
  • The memory controller is used to read the graph data to be processed from the accelerator memory and distribute the data to multiple computing processing units, and the multiple computing processing units process the data in the graph data structure to obtain processing results.
  • the accelerator outputs the processing result to the CPU, and the CPU can further process the processing result to obtain the target result, so that the accelerator can accelerate the CPU's processing of graph data.
  • the communication channel 23 is located between the CPU and the accelerator, and provides a channel for data transmission between the CPU and the accelerator.
  • the communication channel may be a high-speed serial computer expansion bus (peripheral component interconnect express, PCIe) or the like.
  • the CPU and accelerator can perform the following steps:
  • the CPU host program writes the data required by the accelerator core into the global memory of the accelerator connected to the CPU through the communication channel.
  • the accelerator executes calculation and reads data from the global memory at the same time.
  • the accelerator writes the data back to the global memory, and notifies the host that data processing is completed.
  • the CPU host program reads the data from the global memory back to the host memory, and continues processing.
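  • A sketch of these four steps as plain Python stubs, assuming a simple key-value model of the accelerator's global memory (a real system would go through a PCIe runtime such as OpenCL or XRT); all class and method names are illustrative.

```python
# Hypothetical sketch of the four host/accelerator steps above, written as
# plain Python stubs rather than real PCIe runtime calls.
class AcceleratorGlobalMemory:
    def __init__(self):
        self.buf = {}
    def write(self, key, data): self.buf[key] = data
    def read(self, key): return self.buf[key]

class Accelerator:
    def __init__(self, mem): self.mem = mem
    def run_kernel(self):
        graph = self.mem.read("graph_in")              # step 2: read while computing
        result = {n: v + 1 for n, v in graph.items()}  # placeholder computation
        self.mem.write("result", result)               # step 3: write back, notify host

mem = AcceleratorGlobalMemory()
acc = Accelerator(mem)
mem.write("graph_in", {1: 0, 2: 3})                    # step 1: host writes input
acc.run_kernel()
host_result = mem.read("result")                       # step 4: host reads back
```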
  • FIG. 3 is a schematic structural diagram of a chip provided by the present application.
  • the chip 22 includes a prefetching module 221 , a scheduling module 222 and a processing module 223 , and the chip 22 may be configured with one or more off-chip caches 24 .
  • After the prefetch module (prefetcher) obtains graph data, the dispatcher module (dispatcher) further assigns it to the processing module (processor) for processing, and the processed results are returned to the off-chip cache via the dispatcher module and the prefetch module.
  • the chip is also provided with input and output interfaces for exchanging data with the outside of the chip.
  • the prefetch module can obtain graph data to be processed from the off-chip cache through this interface, and the prefetch module can also send the data processing result of the processing module to the off-chip cache through this interface.
  • the processing module includes at least two PEs, and the PEs are connected to each other through a network on chip (NoC).
  • each PE includes a routing unit (routing unit, RU), and routing units between PEs are connected to each other, and can be used for mutual communication and data transmission between PEs.
  • Data sharing among multiple PEs on the chip can be realized by setting interconnected communication links among multiple PEs.
  • A PE includes a graph unit (GU), also called a computing unit or graph processing unit, a routing unit, and a temporary storage unit (scratchpad, SPD). The computing unit is used to process the workload assigned by the scheduling module and generate an update request.
  • The routing unit is used to send the calculation results of the computing unit, through the NoC, to the temporary storage unit of the PE that stores the corresponding node. The temporary storage unit is used to store the attributes of nodes; all the temporary storage units included in the PEs form the processing module cache, also called the on-chip cache of the chip, and the temporary storage unit contained in each PE is a part of that on-chip cache. That is, the chip in the embodiment of the present application adopts a distributed cache.
  • the processing module may include N rows of PEs, where N is an integer greater than 1, and each row of PEs includes at least 2 PEs.
  • In one embodiment, the processing module includes N*M PEs (N and M are both positive integers greater than or equal to 1), and the N*M PEs form an array of N rows and M columns. The PE in row 1 and column M can be expressed as PE(1,M), the PE located in row N and column 1 can be expressed as PE(N,1), and so on.
  • PE(n, m) means the PE in row n and column m, where n and m are both positive integers greater than or equal to 1.
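  • A sketch of one PE with the GU/RU/SPD split described above, using the 1-based PE(n,m) indexing; the modulo placement of node information onto PEs is an assumption made only so the example runs.

```python
# Hypothetical sketch of the PE internals described above: a graph unit (GU)
# computes update requests, a routing unit (RU) forwards them over the NoC,
# and a scratchpad (SPD) holds node attributes. PE(n, m) indexing is 1-based
# as in the text.
class PE:
    def __init__(self, n, m):
        self.n, self.m = n, m          # row n, column m (1-based)
        self.spd = {}                  # scratchpad: node id -> attribute

    def gu_compute(self, src_info, edge_load):
        """Graph unit: process a workload and generate an update request."""
        return edge_load["dst"], src_info + edge_load["weight"]

    def ru_route(self, pe_array, dst_node, update):
        """Routing unit: send the update to the PE whose SPD owns dst_node."""
        row = (dst_node - 1) % len(pe_array)           # assumed placement rule
        col = (dst_node - 1) % len(pe_array[0])
        owner = pe_array[row][col]
        owner.spd[dst_node] = min(owner.spd.get(dst_node, float("inf")), update)

N, M = 4, 4
pe_array = [[PE(n + 1, m + 1) for m in range(M)] for n in range(N)]
pe_1_1 = pe_array[0][0]                # PE(1,1); PE(N,1) is pe_array[N-1][0]
dst, upd = pe_1_1.gu_compute(0.0, {"dst": 4, "weight": 2.0})
pe_1_1.ru_route(pe_array, dst, upd)    # update lands in the owner PE's SPD
```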
  • the prefetch module is used to perform prefetching to obtain graph data stored on the off-chip cache.
  • The prefetch module includes a plurality of prefetch units, and each prefetch unit is connected to a pseudo channel of the off-chip memory.
  • In one embodiment, the prefetching module includes N prefetching units (N is an integer greater than 1), and each of the N prefetching units corresponds to one row of PEs among the N rows of PEs in the processing module.
  • the prefetching unit includes a vertex prefetcher (Vpref) and an edge prefetcher (Epref).
  • the point prefetcher is used to obtain the data of the active point
  • the edge prefetcher is used to prefetch the data of the active edge (or the associated edge of the active point).
  • the chip can obtain data from the external storage space.
  • the chip can acquire one or more graph data from the external storage space at a time.
  • the scheduling module is used to receive the graph data from the prefetching module, and dispatch the workload to be processed to the processing module.
  • The scheduling module includes multiple dispatch units, each dispatch unit is associated with a prefetch unit, and the dispatch unit is used to schedule the graph data in the associated prefetch unit.
  • the dispatch unit includes a vertex dispatcher unit (VDU) and an edge dispatcher unit (EDU).
  • the point dispatch unit is used to dispatch the data of the active point
  • the edge dispatch unit is used to dispatch the data of the associated edge of the active point.
  • The scheduling module includes N dispatching units, and each dispatching unit includes a point dispatching unit and an edge dispatching unit. The point dispatching unit is associated with the point prefetcher of a prefetching unit in the prefetching module, and is used to receive the active point data in the associated point prefetcher and dispatch the active point data to the processing module. The edge dispatching unit is associated with the edge prefetcher, and is used to receive the data of the active point's associated edges in the edge prefetcher and dispatch that data to the processing module.
  • one or more first communication interfaces are set between the prefetching module and the scheduling module, and the multiple prefetching units included in the prefetching module communicate with the multiple dispatching units included in the scheduling module through the first communication interface for mutual data transfer.
  • In one embodiment, the prefetching module includes multiple prefetching units and the scheduling module includes multiple dispatching units, and a separate communication interface is set between each pair of interrelated prefetching and dispatching units. That is, multiple second communication interfaces are provided between the prefetching module and the scheduling module, and the second communication interfaces are used for mutual data transmission between interrelated prefetching units and dispatching units.
  • one or more third communication interfaces are set between the scheduling module and the processing module, and the multiple dispatch units included in the scheduling module and the multiple PEs included in the processing module perform data transmission through the third communication interface .
  • the chip includes a prefetching module, a scheduling module, and a processing module.
  • the processing module includes PEs in 16 rows and 16 columns. All PEs form a PE array, and communication links are provided between adjacent PEs.
  • the prefetching module includes 16 prefetching units
  • the scheduling module includes 16 dispatching units
  • each of the 16 prefetching units is associated with each of the 16 dispatching units.
  • the interrelated prefetching unit and dispatching unit are associated with each row of PEs in the 16 rows of PEs, and are used to prefetch and distribute data for the PEs of the associated rows.
  • FIG. 4 is a schematic diagram of another chip architecture provided by the embodiment of the present application.
  • Each prefetching unit of the prefetching module 221 establishes a communication link with the off-chip cache 24; that is, at least N communication links are set between the prefetching module 221 and the off-chip cache, through which each prefetching unit obtains the data it needs.
  • Each prefetching unit in the prefetching module 221 also establishes a communication link with the corresponding dispatching unit of the scheduling module 222. Specifically, a communication link is set between the prefetching unit in the first row and the dispatching unit in the first row, a communication link is provided between the prefetching unit in the second row and the dispatching unit in the second row, and so on, up to a communication link between the prefetching unit in the nth row and the dispatching unit in the nth row.
  • the dispatch unit of each row can obtain the data of the corresponding active point from the prefetch unit connected with it through the communication link.
  • The chip also includes N row buses 224, and the N row buses are in one-to-one correspondence with the N rows of computing processing units. Specifically, communication links are set between the row bus 224 of the first row and the M computing processing units of the first row, between the row bus 224 of the second row and the M computing processing units of the second row, and so on, up to the row bus 224 of the nth row and the M computing processing units of the nth row. The communication link between a row bus and a computing processing unit does not pass through any other computing processing unit.
  • The end of each row bus 224 away from the processing units is connected to a dispatch unit. Specifically, a communication link is provided between the row bus 224 of the first row and the dispatch unit of the first row, between the row bus 224 of the second row and the dispatch unit of the second row, and so on, up to the row bus 224 of the nth row and the dispatch unit of the nth row.
  • Through the above N row buses, the dispatching unit of the nth row can dispatch the same point load or edge load to multiple computing processing units of the nth row at one time. In one embodiment, the dispatching unit of the nth row can distribute the same point load or edge load to all M computing processing units of the nth row at one time.
  • multiple calculation processing units in the same row can obtain multiple pieces of data to be processed at the same time.
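  • A sketch of the two dispatch patterns a row bus enables, broadcasting one load to all M PEs of a row or scattering M different loads in one pass; the class and method names are illustrative.

```python
# Hypothetical sketch of the broadcast capability described above: one row
# bus delivers the same load to all M PEs of its row in a single dispatch.
class RowBus:
    def __init__(self, pes_in_row):
        self.pes = pes_in_row                    # direct link to each PE

    def broadcast(self, load):
        for pe in self.pes:                      # same load, all M PEs, one pass
            pe.append(load)

    def scatter(self, loads):
        for pe, load in zip(self.pes, loads):    # different load per PE, one pass
            pe.append(load)

M = 4
row_pes = [[] for _ in range(M)]                 # stand-in PE receive buffers
bus = RowBus(row_pes)
bus.broadcast({"node": 1, "info": 0.5})          # e.g. source-node info to the row
bus.scatter([{"edge": i} for i in range(M)])     # e.g. M different edge loads
```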
  • The architecture provided in the embodiment of the present application can be implemented on a field-programmable gate array (FPGA) integrated circuit (for example, a Xilinx Alveo U280 FPGA), or on a complex programmable logic device (CPLD) and other integrated circuits; this application does not limit this.
  • the storage device used in the embodiment of the present application may be a double data rate synchronous dynamic random access memory (double data rate synchronous dynamic random access memory, DDR SDRAM) and other types of storage devices.
  • The off-chip cache in this embodiment of the present application may use a high bandwidth memory (HBM) stack.
  • In the chip provided by the embodiment of the present application, communication links are directly established between different PEs, and data transmission between PEs can be completed directly through these links without dispatching through a centralized dispatching mechanism. This improves the scalability of the chip when processing graph data, improves the chip's graph data processing efficiency, improves the chip's utilization of the high bandwidth of storage devices, and improves the performance of the chip.
  • each PE is only connected to a limited number of PEs, which reduces the hardware complexity of the chip.
  • the structure of the chip provided by the embodiment of the present application is mainly described above with reference to FIG. 2 to FIG. 4 , and the data processing method applicable to the chip provided by the present application will be further described below with reference to FIG. 5 to FIG. 10 .
  • FIG. 5 is a basic flowchart of the graph data processing performed by the chip provided by the embodiment of the present application.
  • The graph data processing performed by the chip can be divided into two phases: a scatter phase and an apply phase.
  • The scatter phase is mainly responsible for reading edge loads, processing edge loads, and generating update loads for distribution to PEs.
  • The apply phase is mainly responsible for receiving the update loads and updating the active nodes to start the next iteration.
  • S201 to S203 are the scatter phase
  • S204 to S206 are the apply phase
  • the scheduling module sequentially reads the data of the active nodes and the associated edges of the active nodes through the prefetching module.
  • the prefetching module can read the data of one or more active nodes and/or the data of the associated edges of the active nodes at one time.
  • the scheduling module distributes the data of the active nodes and the associated edges of the active nodes according to a certain algorithm.
  • the node data of the active node and the data of the associated edge of the active node may be assigned according to the node identifier of the active node.
  • the scheduling module can dispatch the data of the active node and the edge associated with the active node to the calculation processing unit through the row bus associated therewith.
  • the chip can distribute the same graph data to multiple computing processing units in the same row at one time, and can also distribute multiple different graph data to multiple computing processing units in the same row at one time.
  • If the current PE is the PE that updates the node information of the destination node, the PE stores the update load in its local SPD.
  • Otherwise, the PE sends the update load through its RU to the RU of the PE responsible for updating the node information of the destination node.
  • That RU performs the reduction operation on the one or more update loads it receives.
  • the SPD of the PE performs an apply function on each point stored locally, and sends the result to the GU.
  • The apply function here can be a user-defined function, or can be determined by other means; the apply function is used to calculate the update result of the node information after the current iteration.
  • GU compares the processing result sent by SPD with the result of the last iteration of node information, and sends the updated node information to the scheduling module.
  • the scheduling module takes the updated node in the current iteration as the active node in the next iteration, and writes the information of one or more active nodes back to the off-chip cache, thereby starting the next iteration.
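  • The scatter and apply phases above can be summarized in the following sketch of one iteration loop, assuming an SSSP-style update and a min reduction (both placeholders for whatever the chip is configured with).

```python
# Hypothetical sketch of one scatter/apply iteration (S201-S206) on a single
# PE's view of the data; the reduction and apply functions are placeholders.
def scatter(active_nodes, node_info, edges):
    """Read active nodes and their associated edges, produce update loads."""
    updates = []
    for src in active_nodes:
        for (s, dst, w) in edges:
            if s == src:
                updates.append((dst, node_info[src] + w))   # update load
    return updates

def apply(node_info, updates):
    """Reduce update loads per node, apply them, and return new active nodes."""
    reduced = {}
    for dst, val in updates:
        reduced[dst] = min(val, reduced.get(dst, float("inf")))  # reduction op
    next_active = []
    for dst, val in reduced.items():
        if val < node_info.get(dst, float("inf")):   # compare with last iteration
            node_info[dst] = val
            next_active.append(dst)                  # updated nodes become active
    return next_active

node_info = {1: 0.0}
edges = [(1, 4, 2.0), (1, 3, 1.0), (1, 8, 5.0)]
active = [1]
while active:                                        # iterate until convergence
    active = apply(node_info, scatter(active, node_info, edges))
```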
  • FIG. 6 is a schematic diagram of a graph data processing method of the chip provided by the present application.
  • the chip distributes the edge load according to the source node included in the edge load, and the PE updates the node information of the destination node locally.
  • node 1 is an active node
  • node 3 , node 4 and node 8 are adjacent nodes of node 1 .
  • This round of iteration is used to update the node information of the adjacent nodes of node 1.
  • Node 1 can also be called the source node
  • node 3, node 4, and node 8 can also be called destination nodes (that is, nodes that need to update node information) .
  • Node 1 and node 4 are connected by edge a
  • node 1 and node 3 are connected by edge b
  • nodes 1 and 8 are connected by edge c.
  • Edge a, edge b, and edge c can be called active edges or active-point associated edges.
  • Before executing the graph data processing, the chip can perform an initialization operation. The initialization operation can determine one or more active points of the first iteration of the chip's graph data processing; optionally, the initialization operation can also determine the node information of one or more active nodes for the first round of iteration.
  • initialization operations are performed by the CPU.
  • The chip reads the edge workloads (hereinafter referred to as edge loads) E1, E2, and E3 of edge a, edge b, and edge c, respectively, from the off-chip cache, and, because the three edges have the same source node, sends the three edge loads to PE(1,1), which has saved the node information of node 1.
  • PE(1,1) will process the edge loads after receiving the edge loads of the three edges.
  • PE(1,1) determines the destination node of each edge load according to the edge load, and routes the edge load to the PE storing the node information of the destination node through the RU.
  • PE(1,1) also routes the node information of node 1 to the PE storing the node information of the destination node.
  • PE(1,1) determines, according to the edge load E1 of edge a, that the destination node of this edge load is node 4, and PE(1,1) routes the edge load E1 and/or the node information of node 1 to PE(2,1), that is, the PE that stores the node information of node 4.
  • The processing of the loads corresponding to edge b and edge c is similar to that of edge a.
  • The edge load E2 of edge b will be routed to PE(1,2), and the edge load E3 of edge c will be routed to PE(3,2).
  • the PE storing the node information of the destination node updates the node information of the destination node after receiving the edge load including the destination node.
  • the PE storing the node information of the destination node updates the node information of the destination node according to one or more of the following information: edge load, node information of the source node, or current node information of the destination node.
  • the determination of the node information of the nodes in the graph is often completed through multiple rounds of iterations, so the node information of a certain node may be updated multiple times during the iteration process.
  • the current node information of the destination node refers to the node information of the destination node before the completion of the current round of iteration or at the end of the previous round of iteration.
  • the update method of the node information may be determined by the chip according to the application scenario, or may be preset by the user of the chip.
  • The chip may be pre-configured with one or more of the following algorithms and execute the apply process according to the pre-configured algorithm: the PageRank algorithm, the breadth-first search (BFS) algorithm, the single-source shortest path (SSSP) algorithm, or the collaborative filtering (CF) algorithm.
  • the chip determines the information needed to update the node information according to its preconfigured node information update method, and then updates the node information of the destination node according to the preconfigured node information update method.
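  • A sketch of such preconfigured update rules; the PageRank, SSSP, and BFS formulas below are the standard textbook forms and are not taken from the application.

```python
# Hypothetical sketch of selecting a preconfigured node-update rule; the
# patent names PageRank, BFS, SSSP, and CF as examples, and the formulas
# below are the standard textbook forms, not taken from the patent.
def pagerank_update(src_rank, out_degree, dst_rank, damping=0.85):
    # Destination accumulates the source's rank share; the base term is
    # added once per iteration elsewhere. Assumes out_degree > 0.
    return dst_rank + damping * src_rank / out_degree

def sssp_update(src_dist, edge_weight, dst_dist):
    return min(dst_dist, src_dist + edge_weight)   # edge relaxation

def bfs_update(src_level, dst_level):
    return min(dst_level, src_level + 1)           # level = hop distance

UPDATE_RULES = {"pagerank": pagerank_update, "sssp": sssp_update, "bfs": bfs_update}

rule = UPDATE_RULES["sssp"]                        # chosen by preconfiguration
new_dst_info = rule(src_dist=0.0, edge_weight=2.5, dst_dist=float("inf"))
```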
  • the chip determines the scene of the currently processed graph data according to the node information of the source node, and then determines a method for updating the node information.
  • the PE storing the node information of the destination node receives multiple edge loads with the same destination node, and the PE updates the node information of the destination node according to the multiple edge loads.
  • one or more of the node information saved by multiple PEs in the chip will be updated.
  • PE can compare the processing result of one round of iteration with the node information before updating.
  • the updated node information generated in the iterative process is sent to the scheduling module.
  • the scheduling module can determine the active nodes of the next iteration according to the node information obtained in the current iteration, and write the new one or more active nodes back to the off-chip cache and trigger the next iteration process.
  • node 3, node 4, and node 8 are the nodes updating node information in this round, and the scheduling module will return the identifiers of these nodes to the off-chip cache as active nodes in the next iteration.
  • each PE in the application phase updates the node information of the node stored locally, without routing the node information to other PEs, thus reducing the communication overhead between different PEs in the application phase.
  • FIG. 7 is a schematic diagram of another graph data processing method of the chip provided by the present application.
  • In the method shown in FIG. 7, the chip distributes the edge load according to the destination node included in the edge load. All the PEs included in the chip save copies of the node information of the nodes that may be used, and at the end of a round of iteration, the copies of node information saved by all PEs are updated.
  • the graph data structure processed in the data processing method shown in FIG. 7 is consistent with the graph data structure shown in FIG. 6 .
  • Before executing the graph data processing, the chip can perform an initialization operation. The initialization operation can determine one or more active points of the first iteration of the chip's graph data processing; optionally, the initialization operation can also determine the node information of one or more active nodes for the first round of iteration.
  • initialization operations are performed by the CPU.
  • The chip can read the edge loads E1, E2, and E3 of edge a, edge b, and edge c, respectively, from the off-chip cache, and, according to the fact that the destination node of edge a is node 4, the destination node of edge b is node 3, and the destination node of edge c is node 8, distribute the edge loads of edge a, edge b, and edge c respectively to PE(2,1), which saves the node information of node 4, PE(1,3), which saves the node information of node 3, and PE(3,2), which saves the node information of node 8.
  • PE(2,1) locally saves a copy V1R of the node information of node 1, the source node of edge a. When PE(2,1) receives the edge load E1, PE(2,1) can update the node information of node 4 according to one or more of the acquired V1R, the edge load E1, or the current node information V4 of the destination node.
  • The chip determines the information needed to update node information according to its preconfigured node information update method, and then updates the node information of the destination node according to the preconfigured update method.
  • The update method of the node information may be determined by the chip according to the application scenario, or may be one of one or more update methods preset by the user of the chip.
  • the chip determines the scene of the currently processed graph data according to the node information of the source node, and then determines a method for updating the node information.
  • the PE storing the node information of the destination node receives multiple edge loads with the same destination node, and the PE updates the node information of the destination node according to the multiple edge loads.
  • one or more of the node information saved by multiple PEs in the chip will be updated.
  • PE can compare the processing result of one round of iteration with the node information before updating.
  • the updated node information generated in the iterative process is sent to the scheduling module.
  • the scheduling module can determine the active nodes of the next iteration according to the node information obtained in the current iteration, and write the new one or more active nodes back to the off-chip cache and trigger the next iteration process.
  • node 3, node 4 and node 8 are the nodes that update the node information in this round, and the scheduling module will return the node identifiers of these nodes to the off-chip cache as active nodes in the next iteration.
  • The copies of the node information of each node (such as V1R) stored in the PEs also need to be updated.
  • the chip routes the updated node information to each PE that may use the node information.
  • The node information of node 4 is updated, and PE(2,1), which stores the node information of node 4, will route the updated node information V4 of node 4 to PE(1,1), PE(1,3), and PE(3,2).
  • The node information of node 3 is updated, and PE(1,3), which saves the node information of node 3, will route the updated node information V3 of node 3 to PE(1,1), PE(2,1), and PE(3,2).
  • the arrows connecting different PEs in the application phase in FIG. 7 schematically indicate the process in which the PE with updated node information routes the updated node information to other PEs.
  • Since all PEs retain a copy of the node information of the source node, when updating the node information of the destination node, the PE that saves the node information of the source node does not need to route that information to the PE of the destination node, which reduces the communication overhead between PEs in the scatter phase.
  • FIG. 8 is a schematic diagram of another processing method for graph data processing by the chip provided in this application.
  • in this embodiment, when distributing loads, the chip distributes the node information of the source node of an edge load to all PEs in the row where the source node is stored, and distributes the edge load to one or more PEs in that same row.
  • the graph data structure processed in the data processing method shown in FIG. 8 is consistent with the graph data structure shown in FIG. 6; for the related description, refer to the embodiment shown in FIG. 6, which is not repeated here.
  • before executing the graph data processing, the chip can perform an initialization operation, which determines one or more active nodes for the first iteration of the chip's graph data processing; optionally, the initialization operation can also determine the node information of the one or more active nodes of the first iteration.
  • in some embodiments, the initialization operation is performed by the CPU.
  • in the scatter phase, while dispatching the edge loads of edge a, edge b, and edge c, the scheduling module distributes the node information V1 of their shared source node, node 1, to all PEs in the same row as PE(1,1); PE(1,2) and PE(1,3) thus receive the node information V1 of the source node in the current iteration.
  • in other embodiments, the scheduling module can instead distribute the node information V1 of source node 1 to all PEs in the same column as PE(1,1) while dispatching the edge loads; PE(2,1) and PE(3,1) then receive the node information V1 of the source node in the current iteration.
  • in some embodiments, the scheduling module assigns edge loads to the other PEs in the same row as PE(1,1) in the order of the column where each destination node is located: the edge load E1 of edge a is assigned to PE(1,1), the edge load E3 of edge c to PE(1,2), and the edge load E2 of edge b to PE(1,3).
  • for example, the destination node of edge c is node 8; the column of node 8 is obtained through calculation as column 2, so the scheduling module distributes the edge load E3 of edge c to the PE in row 1, column 2, that is, PE(1,2).
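  • the text does not spell out how the column is obtained through calculation; a simple modulo mapping is one assumption consistent with the example, as the following sketch shows:
```python
M = 3  # columns of the PE array

def column_of(node: int) -> int:
    """Assumed 1-based column in which a node's information is stored."""
    return (node - 1) % M + 1

# destination of edge c is node 8 -> column 2, matching the dispatch of E3 to PE(1,2);
# node 3 -> column 3 (E2 to PE(1,3)) and node 4 -> column 1 (E1 to PE(1,1)) also agree.
assert column_of(8) == 2 and column_of(3) == 3 and column_of(4) == 1
```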
  • in other embodiments, the scheduling module assigns edge loads to the other PEs in the same column in the order of the row where each destination node is located: the edge load E1 of edge a is assigned to PE(1,1), the edge load E2 of edge b to PE(1,1), and the edge load E3 of edge c to PE(3,1).
  • in some embodiments, edge a, edge b, and edge c are stored in the off-chip cache classified by destination node.
  • when the scheduling module prefetches edge-load data, it reads the source node of each edge load; if that source node is not the current source node, the scheduling module fetches the next edge load of that column instead.
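  • a minimal software model of this prefetch filter, with illustrative names and data, could be:
```python
def prefetch_column(column_edges, current_source):
    """Yield only the edge loads of the current source node from one column."""
    for src, dst, w in column_edges:
        if src == current_source:
            yield (src, dst, w)
        # otherwise skip and take the next edge load of that column

col2 = [(7, 8, 0.2), (1, 8, 0.5), (2, 8, 0.9)]  # edges sorted by destination node
assert list(prefetch_column(col2, current_source=1)) == [(1, 8, 0.5)]
```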
  • in some embodiments, when a PE receives an edge load and the node information of the source node, the PE obtains the destination node of the edge load and searches the same column for the PE storing the node information of that destination node.
  • for example, when PE(1,2) receives the edge load E3, it determines that the destination node of E3 is node 8; after determining that the node information it stores is not that of node 8, it finds PE(3,2), which stores the node information of node 8, in column 2, and then sends the source node information V1 and the edge load E3 to PE(3,2).
  • in some embodiments, the PE storing the destination node's information is the current PE (for example, V3); the current PE then updates the stored node information of the destination node according to one or more of the source node's node information, the edge load, or the destination node's current node information.
  • in other embodiments, the PE storing the destination node's information is not the current PE (for example, V1 and V2); the current PE then routes the node information of the source node and/or the edge load to the PE storing the destination node's information, and after receiving them, that PE updates the stored node information of the destination node according to one or more of the source node's node information, the edge information, or the destination node's current node information.
  • optionally, the chip determines the information needed to update node information according to its preconfigured node-information update method, and then updates the node information of the destination node according to that preconfigured method.
  • the update method of the node information may be determined by the chip according to the application scenario, or may be preset by the user of the chip.
  • for example, the chip determines the scenario of the graph data currently being processed according to the node information of the source node, and then determines the method for updating the node information.
  • during one round of iteration, one or more of the node information items stored by the multiple PEs in the chip are updated.
  • a PE can compare the processing result of the round with the node information before the update, and send the node information that was updated during the iteration to the scheduling module.
  • the scheduling module can then determine the active nodes of the next iteration from the node information obtained in the current iteration, write the one or more new active nodes back to the off-chip cache, and trigger the next iteration.
  • for example, node 3, node 4, and node 8 are the nodes whose node information was updated in this round; the scheduling module can take one or more of node 3, node 4, and node 8 as active nodes of the next iteration, and then fetch the associated edges of each active node from the off-chip cache as the edge loads of the next iteration. For instance, the scheduling module takes node 3 as an active node of the next iteration and then obtains the associated edges of node 3 from the off-chip cache as the edge loads of the next iteration.
  • in this embodiment, the chip distributes each edge load to a PE in the same row as the source node, so that the edge load is only routed within its own column; this reduces the routing of edge loads between columns in the scatter phase and thus the communication overhead between PEs in the scatter phase.
  • in the apply phase, the scheduling module distributes the node information of the source node to all PEs in the same row as the source node, so that the PE storing the destination node's information only needs to route the source node's node information within the current column when updating the destination node; this reduces the inter-column routing of source node information in the apply phase and thus the communication overhead between PEs in the apply phase.
  • FIG. 9 is a schematic diagram of yet another processing method for graph data processing by the chip provided in this application.
  • the reduce function is mainly used to merge the intermediate results of data processing, so as to reduce the communication overhead generated during data processing.
  • the reduce function in the graph processing model satisfies the commutative and associative laws. Taking the graph data structure shown in FIG. 6 as an example, the commutative law and associative law in graph data processing are first briefly introduced below.
  • in a certain round of iteration, both node 3 and node 4 are active nodes, and both need to update the node information of node 5.
  • in this case the commutative law means: to update the node information of node 5, the update can be applied first according to node 3 or first according to node 4; that is, the node information of node 5 at the end of the current iteration is independent of the order in which node 3 and node 4 apply their updates to node 5.
  • in a certain round of iteration, node 1, node 4, and node 8 are all active nodes, and all of them need to update the node information of node 3.
  • in this case the associative law means: to update the node information of node 3, one can first update it according to node 1 and node 4 and then according to node 8, or one can first update it according to node 8 and node 1 and then according to node 4.
  • in other words, when more than two active nodes update the node information of the same destination node, the update results of any two or more of those active nodes can be combined first and then further combined with the updates of the other active nodes; this process does not affect the node information of the destination node at the end of the current iteration.
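  • a one-function illustration: min(), a natural reduce function for shortest-path workloads, is commutative and associative, so partial results for the same destination node can be merged in any order; all values below are made up:
```python
reduce_fn = min  # commutative and associative

a, b, c = 2.0, 5.0, 3.0  # three updates aimed at the same destination node
# any grouping and any order gives the same final result, so RUs may
# pre-merge any subset of updates while they are still in flight
assert reduce_fn(reduce_fn(a, b), c) == reduce_fn(a, reduce_fn(b, c)) == reduce_fn(c, a, b) == 2.0
```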
  • for clarity and brevity, the embodiment of FIG. 9 is described on the basis of the data processing flow shown in FIG. 8; the data processing method provided in this embodiment is not limited to that flow and is also applicable to the flows shown in FIG. 6 and FIG. 7 and to other data processing flows, which are not listed here one by one. Other functions that satisfy the commutative and associative laws in the graph processing model are likewise applicable to this data processing method.
  • (a) in FIG. 9 exemplarily shows an architecture diagram of the RU of a PE provided by an embodiment of this application.
  • the RU includes at least one set of input and output interfaces, which are used for the RU to receive data from outside the RU (such as from other PEs or the scheduling module) and for the RU to send data outward.
  • the RU can be provided with 4 stages, each of which contains 4 registers (Reg) and one reduce unit; the registers are used to store update loads, and the reduce unit is used to execute the operation corresponding to the reduce function.
  • within a group of registers located in the same pipeline, the two registers of adjacent stages can communicate with each other.
  • each time node information is updated, a register in stage 1 receives an update load through the input interface. If that register is empty, it stores the update load. If the register is not empty and the load in the register updates the same node as the received load, the reduce function is executed and the new value is stored; if the register is not empty and the load in the register updates a different node, the register sends the received load to a register of the next stage, and so on, until the load undergoes a reduce operation with a load updating the same node or is stored in an empty register.
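  • the following sketch is a minimal software model of the write path just described (4 stages, reduce on a matching node, forward on a mismatch, store on empty); the class layout and the choice of min() as the reduce function are assumptions for illustration, not the hardware design:
```python
STAGES = 4

class Pipeline:
    """One pipeline of the RU: one register per stage plus a reduce function."""
    def __init__(self, reduce_fn=min):
        self.regs = [None] * STAGES          # each entry: (node_id, value) or None
        self.reduce_fn = reduce_fn

    def insert(self, node_id, value):
        for s in range(STAGES):
            slot = self.regs[s]
            if slot is None:                 # empty register: store the load
                self.regs[s] = (node_id, value)
                return
            if slot[0] == node_id:           # same node: reduce in place
                self.regs[s] = (node_id, self.reduce_fn(slot[1], value))
                return
            # different node: fall through to the next stage's register
        raise OverflowError("all stages occupied; real hardware would stall")

p = Pipeline()
p.insert(3, 5.0)    # load updating V3 stored in stage 1
p.insert(1, 2.0)    # load updating V1 forwarded to stage 2 (stage 1 holds V3's load)
p.insert(3, 4.0)    # reduced with the stored V3 load
assert p.regs[0] == (3, 4.0) and p.regs[1] == (1, 2.0)
```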
  • after the above load processing is completed, a register in stage 1 sends the stored load value, or the value of the load after the reduce operation, to other PEs.
  • it should be noted that the register receiving the load and the register sending the load in stage 1 need not be the same register.
  • it should also be noted that the RU contained in a PE of the chip provided by this application may contain more or fewer registers and more or fewer reduce units, and more communication links may be provided between different registers; the architecture diagram of the RU shown in (a) of FIG. 9 does not constitute a limitation in this respect.
  • (b) in FIG. 9 shows the process by which the RU reads and writes loads, where V1, V2, V3, and V′3 indicate the loads of node 1, node 2, node 3, and node 3 after reduction stored in the registers; the register in the first row and first column is denoted Reg(1,1), the register in the second row and second column Reg(2,2), and so on.
  • in the phase of writing loads into the registers, Reg(1,1) and Reg(2,1) store the loads for updating V1 and V3 respectively, and Reg(1,2) stores the load for updating V2.
  • when the input port of the RU receives a new load for updating V3, the RU sends the load to the first column according to the serial number of the update load and the number of pipelines (the remainder obtained by dividing the update load's serial number by the number of pipelines is the ordinal number of the column to which the load should be sent).
  • by comparing the serial number of the update load already stored in Reg(1,1) with the serial number of the received load, the RU determines that the load should be sent to the next stage, that is, to register Reg(2,1) in the second row and first column.
  • when Reg(2,1) receives the load, the RU compares the serial number of the node updated by the load already stored in that register with the serial number of the received load, and determines that the stored load updating V3 and the newly received load updating V3 should undergo a reduce operation; the reduce operation is performed by the reduce unit, and after it completes, the reduce unit writes the resulting load V′3 for updating node 3 into the register.
  • when reading loads from the registers, taking the update load V1 of node 1 as an example, the RU sends the update load V1 of node 1 to the output port of the RU, from which it is routed to other PEs; the RU then sends the node load V′3 stored in register Reg(2,1), which is at stage 2 of the same pipeline as V1, to register Reg(1,1).
  • the RU shown in FIG. 9 may be the RU of the PE storing the node information of the destination node, or the RU of any PE included in the chip.
  • in this embodiment, loads used to update the same destination node are reduced by the RU during routing, which reduces the total amount of update-load traffic transmitted between PEs, that is, the total communication between PEs, and thus the communication overhead of the chip.
  • in addition, for the embodiment shown in FIG. 8, where loads are routed within a column, the probability that loads updating the same node are routed to the same RU increases, so the probability that the RU performs the reduce operation increases; this further reduces the total communication between PEs and thus the communication overhead of the chip.
  • FIG. 10 is a schematic diagram of yet another processing method for graph data processing by the chip provided in this application.
  • in this embodiment, PE(1,1) in the chip stores the node information V1 of node 1, PE(1,2) stores the node information V2 of node 2, and PE(2,1) stores the node information V3 of node 3.
  • in the first iteration, after the node information of V1 is updated in the apply phase, PE(1,1) immediately sends the information of V1 to the scheduling module; by comparing the node information of V1 before this round's update with the received node information, the scheduling module determines that the node information of V1 was updated in this iteration and takes node 1 as an active node of the next iteration.
  • the scheduling module further obtains the edge loads of the edges associated with node 1 through the prefetch module and sends the obtained edge loads to the PEs, triggering the next iteration of PE(1,1).
  • the scheduling module saves the node information of node 1 in the current iteration.
  • specifically, the scheduling module obtains the edge load of an edge associated with node 1, determines from the edge load that its source node is node 1, and then dispatches the edge load to all PEs in the same row as PE(1,1), which stores the node information of node 1, namely PE(1,1) and PE(1,2).
  • in other embodiments, the scheduling module may instead dispatch the edge loads of the edges associated with node 1 only to PE(1,1), which stores the node information of node 1.
  • optionally, the scheduling module may also dispatch the node information of node 1 to all PEs in the same row as PE(1,1), which stores the node information of node 1.
  • similarly, PE(1,2) can request to trigger the next iteration immediately after completing the update of node 2's node information and start the scatter phase of the next iteration, and PE(2,1) can do the same immediately after completing the update of node 3's node information.
  • in this embodiment, once a PE storing certain node information finishes the apply phase and completes the update of that node information, it directly requests the scheduling module to trigger the next iteration, without waiting for all PEs in the chip to complete the current iteration; this reduces PE idle time, improves load balance within the chip, and improves the efficiency of the chip's graph data processing.
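  • a rough software analogy of this barrier-free triggering, with hypothetical queue and scheduler objects (not the chip's actual interfaces), is sketched below:
```python
from queue import Queue

trigger_requests: Queue = Queue()
row_bus = []   # stand-in for the row bus carrying next-iteration edge loads

def on_apply_done(pe_id, node_id):
    """Called by a PE as soon as it finishes updating one node -- no global barrier."""
    trigger_requests.put((pe_id, node_id))

def scheduler_step(associated_edges):
    """Scheduling module: turn each finished update into next-round work immediately."""
    while not trigger_requests.empty():
        pe_id, node_id = trigger_requests.get()
        for edge in associated_edges.get(node_id, []):  # edges prefetched from off-chip cache
            row_bus.append((pe_id, edge))               # dispatched to that PE's row

on_apply_done((1, 1), 1)
scheduler_step({1: [(1, 4, 1.0), (1, 3, 2.0), (1, 8, 0.5)]})
assert len(row_bus) == 3
```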
  • the embodiment of the present application also provides a chip, which can be used to implement any one of the graph data processing methods shown in FIG. 5 to FIG. 10 .
  • as shown in FIG. 11, an embodiment of this application further provides a graph data processing apparatus 1100, which may include an acquisition unit 1110 used to perform acquisition actions such as obtaining graph data from the off-chip cache, like the acquisition actions performed by the prefetch module in FIG. 5 to FIG. 10;
  • the graph data processing apparatus 1100 may also include a dispatch unit 1120, used to perform the dispatch and scheduling of node information and similar actions, such as the dispatch actions performed by the scheduling module in FIG. 5 to FIG. 10;
  • the graph data processing apparatus 1100 may also include a processing unit 1130, used to perform processing actions such as the computation of node loads, like those performed by the processing module in FIG. 5 to FIG. 10; the processing unit 1130 may further include a graph processing subunit, a routing subunit, and a storage subunit, where the graph processing subunit performs actions such as the data processing performed by the PEs in FIG. 5 to FIG. 10, the routing subunit performs actions such as the reduction and routing of update loads in FIG. 5 to FIG. 10, and the storage subunit performs actions such as storing node information in FIG. 5 to FIG. 10.
  • the graph data processing apparatus 1100 may also include row buses 1140, where each row bus 1140 corresponds to one row of processing units; a separate communication link is provided between a row bus and each processing unit of the corresponding row, and this communication link does not pass through any other processing unit, so the dispatch unit can dispatch the data to be processed to the processing units through the row buses.
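  • as a rough illustration of dispatch over such a row bus (the class and data below are illustrative, not the apparatus's actual interfaces):
```python
class RowBus:
    """One bus per row, with a direct link to every processing unit of that row."""
    def __init__(self, units):
        self.units = units                  # direct links; no unit forwards for another

    def dispatch(self, data, cols=None):
        """Deliver `data` to the selected columns (default: the whole row) at once."""
        for c in (cols if cols is not None else range(len(self.units))):
            self.units[c].append(data)

row0 = [[], [], []]                         # inboxes of PE(1,1), PE(1,2), PE(1,3)
bus0 = RowBus(row0)
bus0.dispatch("V1")                         # source-node info V1 to the whole row
bus0.dispatch("E3", cols=[1])               # edge load E3 only to PE(1,2)
assert row0[1] == ["V1", "E3"]
```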
  • the embodiment of the present application also provides a chipset, the chipset includes a processor and a chip, and the chipset can be used to implement any graph data processing method as shown in FIG. 5 to FIG. 10 .
  • An embodiment of the present application also provides an electronic device, the electronic device includes a chip or a chipset, and the electronic device can be used to implement any one of the graph data processing methods shown in FIG. 5 to FIG. 10 .
  • the embodiment of the present application also provides a computer program product, the computer program product includes computer program code, when the computer program code is run on the computer, any graph data processing method shown in Figure 5 to Figure 10 is executed.
  • an embodiment of this application further provides a computer-readable storage medium in which computer instructions are stored; when the computer instructions are run on a computer, any of the graph data processing methods shown in FIG. 5 to FIG. 10 is executed.
  • the disclosed systems, devices and methods may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical functional division; in actual implementation there may be other division methods, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • if the functions described above are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium.
  • based on this understanding, the technical solution of this application, in essence, or the part contributing to the prior art, or a part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of this application.
  • the aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Processing (AREA)

Abstract

This application provides a graph data processing method and chip. The method is applied to a chip that includes a prefetch module, a scheduling module, and a processing module; the processing module includes multiple processing engines, and multiple row buses are provided between the scheduling module and the processing module. Through these row buses, the chip can dispatch multiple pieces of graph data to multiple processing engines of the same row at once. The graph data processing method provided in this application helps improve the efficiency with which the chip dispatches graph data, improve the scalability of the chip, reduce the communication overhead of the computing processing units inside the chip, and improve the chip's efficiency in processing graph data.

Description

图数据处理的方法和芯片
本申请要求于2022年02月14日提交中国专利局、申请号为202210151161.5、发明名称为“图数据处理的方法和芯片”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及计算机领域,具体地,涉及一种图数据处理的方法和芯片。
背景技术
图计算(graph computing)的性能受图数据的处理速率、图数据的存储速率等多个因素影响。目前,图数据的处理速率并无法充分利用存储设备的高带宽,如何提高图数据的处理速率是亟需解决的问题。
发明内容
本申请提供一种图数据处理的方法和芯片,通过设置行总线,该芯片可以同时将图数据分派至同一行的多个处理引擎,可以提高芯片分派图数据的效率,进而有利于提高图数据处理的速率。
第一方面,提供了一种图数据处理的方法,该方法应用于芯片,该芯片包括N行处理引擎PE,N个行总线;其中,该N个行总线与N行PE相对应;N为大于1的整数,每一行PE包括至少2个PE;该方法包括:获取第一图数据和第二图数据;确定第一图数据和第二图数据需要存储的目标行PE;该目标行PE为N行PE中的一行PE,目标行PE包括第一PE和第二PE;确定目标行PE对应的目标行总线;该目标行总线与第一PE通过第一通信链路连接;该目标行总线与第二PE通过第二通信链路连接;第一通信链路和第二通信链路不经过任何PE;通过目标行总线将第一图数据通过第一通信链路传输给第一PE;通过目标行总线将第二图数据通过第二通信链路传输给第二PE。
可选地,芯片从外部存储设备获取图数据,该图数据包括节点负载和边负载,该节点负载包括节点信息(如节点的属性信息),该边负载包括源节点的节点标识和/或目的节点的节点标识。
可选地,边负载还可以包括边的信息(例如边的属性、权重等)。
本技术方案中,芯片内部设置行总线,该行总线与N行处理引擎对应,利用行总线可以一次向同一行的多个处理引擎发送多个图数据,而无需通过处理引擎之间的通信链路来转发,有利于提高芯片分派待处理的图数据的速率,进而提高芯片对于图数据处理的整体效率。
结合第一方面,在第一方面的某些实现方式中,该方法还包括:通过该目标行总线将第一图数据通过第二通信链路传输给该第二PE。
本技术方案中,通过行总线还可以向同一个处理引擎发送与该处理引擎处于同一行的 其他处理引擎处理的数据,进而当同一行的多个处理引擎中有处于空闲状态的处理引擎时,通过行总线获取其他图数据,有利于提升多个处理引擎的利用率,减少处理引擎处理空转的机率,有利于提高芯片对于图数据的处理效率。
结合第一方面,在第一方面的某些实现方式中,该N行PE还包括第三PE和第四PE,该方法还包括:该第一PE基于该第一图数据计算得到第一计算结果;该第二PE基于该第二图数据计算得到第二计算结果;该第三PE对该第一计算结果和该第二计算结果进行规约处理,并将规约处理后的结果传输至第四PE,该第四PE为该第一计算结果和该第二计算结果的目的PE。
本技术方案中,利用芯片中的多个处理引擎的一个或多个处理引擎对图数据处理的中间过程数据先进行规约处理,有利于分担处理引擎的数据处理负担,有利于提高芯片中多个处理引擎的利用率,有利于提高芯片对图数据的处理效率。
结合第一方面,在第一方面的某些实现方式中,该N行PE的每个PE中均包含图处理单元,该第一PE的图处理单元基于该第一图数据计算得到该第一计算结果;该第二PE基于该第二图数据计算得到第二计算结果,包括:该第二PE的图处理单元基于该第二图数据计算得到该第二计算结果。
本技术方案中,通过在每个处理引擎中设置专门的图处理单元用于对图数据进行计算,由于图处理单元的功能是确定的,因而本技术方案的实施例有利于对根据芯片的实际用途对该图处理单元的材料和结构等性质进行定制,有利于提高芯片对于不同应用场景的适应性,有利于提高芯片中数据处理资源的利用率。
结合第一方面,在第一方面的某些实现方式中,该N行PE的每个PE中均包含路由单元,该第三PE的路由单元对该第一计算结果和该第二计算结果进行规约处理,并将规约处理后的结果传输至第四PE。
本技术方案中,通过在每个处理引擎中设置专门的路由单元,并利用该路由单元对计算结果执行规约处理,并将规约处理后的结果路由至目的处理引擎。本技术方案的实施,有利于提高处理引擎中路由单元的利用率,有利于提高芯片对于不同应用场景的适应性,有利于提高芯片中数据处理资源的利用率。
结合第一方面,在第一方面的某些实现方式中,该N行PE的每个PE中均包含缓存,该方法还包括:该第一PE保存该第一图数据至该第一PE的缓存中,该第二PE保存该第二图数据至该第二PE的缓存中。
本技术方案中,通过在每个处理引擎中设置专门的缓存,并将图数据保存在缓存中,多个图数据中包含的多个缓存组成了芯片的缓存。分布式缓存的设计,有利于提高芯片对于图数据读取和写入的效率,进而有利于提高芯片对图数据的处理效率。
结合第一方面,在第一方面的某些实现方式中,该N行PE还包括第五PE,该方法还包括:该第五PE对第三处理结果和第四处理结果执行规约处理,该第三处理结果和该第四处理结果用于更新同一个图数据。
本技术方案中,芯片中包含的任意一个处理引擎都可以对图数据处理的中间数据进行规约处理,有利于提高芯片对图数据处理的效率。
结合第一方面,在第一方面的某些实现方式中,该N行PE组成N行M列的PE阵列,M为大于1的整数。
结合第一方面,在第一方面的某些实现方式中,该N行PE包含的所有PE中,相邻PE之间设置有PE通信链路,该PE通信链路用于实现PE之间的数据共享。
通过在计算处理单元之间设置通信链路,不同计算处理单元之间可以直接通过该通信链路进行通信或数据传输。多个计算处理单元之间的通信无需通过集中式的分发机制实现,有利于简化芯片的架构。通过设置通信链路,可以为芯片扩展更多的计算处理单元,从而可以提高芯片的数据的处理效率。
结合第一方面,在第一方面的某些实现方式中,该第一图数据为源节点的节点信息,该方法还包括:获取第三图数据,该第三图数据为该源节点的关联边的边负载;通过该第二通信链路将该第三图数据发送至该第二PE;该第二PE根据该第一图数据和该第三图数据计算目的节点的更新负载,该更新负载用于更新该目的节点的节点信息。
本技术方案中,将边负载发送至与更新源节点的处理引擎同一行的计算处理单元中,获取边负载的处理引擎只需在其所处的列中路由边负载至更新目的节点的计算处理单元。本技术方案的实施,有利于减少计算处理单元在列之间的通信开销。
结合第一方面,在第一方面的某些实现方式中,当该芯片更新完该目的节点的节点信息时,该芯片获取该目的节点的关联边的边负载,该目的节点的关联边与该源节点的关联边不同。
应理解,当芯片获取目的节点的关联边的边负载,并将其分派至处理引擎时,目的节点已经是本轮迭代的活跃节点,目的节点即为本轮迭代的边负载的源节点。
本技术方案中,在某一个处理引擎完成节点信息的更新后,立即为该处理引擎触发执行下一轮迭代,而不是等到所有的处理引擎都完成更新再触发执行。本技术方案的实施,有利于缩减处理引擎的空转时间,有利于多个处理引擎之间的负载均衡,有利于提高芯片的数据处理效率。
第二方面,提供了一种芯片,该芯片包括N行处理引擎PE,N个行总线;其中,该N个行总线与N行PE相对应;N为大于1的整数,每一行PE包括至少2个PE;
该芯片用于:获取第一图数据和第二图数据;确定第一图数据和第二图数据需要存储的目标行PE;该目标行PE为N行PE中的一行PE,目标行PE包括第一PE和第二PE;确定目标行PE对应的目标行总线;该目标行总线与第一PE通过第一通信链路连接;该目标行总线与第二PE通过第二通信链路连接;第一通信链路和第二通信链路不经过任何PE;通过目标行总线将第一图数据通过第一通信链路传输给第一PE;通过目标行总线将第二图数据通过第二通信链路传输给第二PE。
结合第二方面,在第二方面的某些实现方式中,该芯片还用于:通过该目标行总线将第一图数据通过第二通信链路传输给该第二PE。
结合第二方面,在第二方面的某些实现方式中,该芯片的N行PE还包括第三PE和第四PE;该第一PE,用于基于该第一图数据计算得到第一计算结果;该第二PE,用于基于该第二图数据计算得到第二计算结果;该第三PE,用于对该第一计算结果和该第二计算结果进行规约处理,并将规约处理后的结果传输至第四PE,该第四PE为该第一计算结果和该第二计算结果的目的PE。
结合第二方面,在第二方面的某些实现方式中,该N行PE的每个PE中均包含图处理单元,该第一PE的图处理单元,用于基于该第一图数据计算得到该第一计算结果;该 第二PE的图处理单元,用于基于该第二图数据计算得到该第二计算结果。
结合第二方面,在第二方面的某些实现方式中,该N行PE的每个PE中均包含路由单元,该第三PE的路由单元,用于对该第一计算结果和该第二计算结果进行规约处理,并将规约处理后的结果传输至第四PE。
结合第二方面,在第二方面的某些实现方式中,该N行PE的每个PE中均包含缓存,该第一PE还用于:保存该第一图数据至该第一PE的缓存中;该第二PE还用于:保存该第二图数据至该第二PE的缓存中。
结合第二方面,在第二方面的某些实现方式中,该N行PE还包括第五PE,该第五PE,用于对第三处理结果和第四处理结果执行规约处理,该第三处理结果和该第四处理结果用于更新同一个图数据。
结合第二方面,在第二方面的某些实现方式中,该N行PE组成N行M列的PE阵列,M为大于1的整数。
结合第二方面,在第二方面的某些实现方式中,该N行PE包含的所有PE中,相邻PE之间设置有PE通信链路,该PE通信链路用于实现PE之间的数据共享。
结合第二方面,在第二方面的某些实现方式中,该第一图数据为源节点的节点信息,该芯片还用于:获取第三图数据,该第三图数据为该源节点的关联边的边负载;通过该第二通信链路将该第三图数据发送至该第二PE;该第二PE还用于:根据该第一图数据和该第三图数据计算目的节点的更新负载,该更新负载用于更新该目的节点的节点信息。
结合第二方面,在第二方面的某些实现方式中,当该芯片更新完该目的节点的节点信息时,该芯片还用于:获取该目的节点的关联边的边负载,该目的节点的关联边与该源节点的关联边不同。
第三方面,提供一种图数据处理装置,该图数据处理装置包括:获取单元,用于获取第一图数据和第二图数据;N行处理单元,用于处理该第一图数据和该第二图数据,N为大于1的整数,每一行处理单元包括至少2个处理单元;N个行总线,该N个行总线与该N行处理单元相对应;分派单元,用于确定该第一图数据和该第二图数据需要存储的目标行处理单元;该目标行处理单元为该N行处理单元中的一行处理单元,该目标行处理单元包括第一处理单元和第二处理单元;该分派单元,还用于确定该目标行处理单元对应的目标行总线;该目标行总线与该第一处理单元通过第一通信链路连接;该目标行总线与该第二处理单元通过第二通信链路连接;该第一通信链路和该第二通信链路不经过任何处理单元;该分派单元,还用于通过该目标行总线将该第一图数据通过该第一通信链路传输给该第一处理单元;通过该目标行总线将该第二图数据通过该第二通信链路传输给该第二处理单元。
结合第三方面,在第三方面的某些实现方式中,该分派单元还用于,通过该目标行总线将该第一图数据通过该第二通信链路传输给该第二处理单元。
结合第三方面,在第三方面的某些实现方式中,该N行处理单元还包括第三处理单元和第四处理单元,该第一处理单元,用于基于该第一图数据计算得到第一计算结果;该第二处理单元,用于基于该第二图数据计算得到第二计算结果;该第三处理单元,用于对该第一计算结果和该第二计算结果进行规约处理,并将规约处理后的结果传输至第四处理单元,该第四处理单元为该第一计算结果和该第二计算结果的目的处理单元。
结合第三方面,在第三方面的某些实现方式中,该N行处理单元的每个处理单元中均包含图处理子单元,该第一处理单元的图处理子单元,用于基于该第一图数据计算得到该第一计算结果;该第二处理单元的图处理子单元,用于基于该第二图数据计算得到该第二计算结果。
结合第三方面,在第三方面的某些实现方式中,该N行处理单元的每个处理单元中均包含路由子单元,该第三处理单元的路由子单元,用于对该第一计算结果和该第二计算结果进行规约处理,并将规约处理后的结果传输至该第四处理单元。
结合第三方面,在第三方面的某些实现方式中,该N行处理单元的每个处理单元中均包含存储子单元,该第一处理单元,还用于保存该第一图数据至该第一处理单元的存储子单元中;该第二处理单元,还用于保存该第二图数据至该第二处理单元的存储子单元中。
结合第三方面,在第三方面的某些实现方式中,该N行处理单元还包括第五处理单元,该第五处理单元,用于对第三处理结果和第四处理结果执行规约处理,该第三处理结果和该第四处理结果用于更新同一个图数据。
结合第三方面,在第三方面的某些实现方式中,该N行处理单元组成N行M列的处理单元阵列,M为大于1的整数。
结合第三方面,在第三方面的某些实现方式中,该N行处理单元包含的所有处理单元中,相邻处理单元之间设置有处理单元通信链路,该处理单元通信链路用于实现处理单元之间的数据共享。
结合第三方面,在第三方面的某些实现方式中,该第一图数据为源节点的节点信息,该获取单元,还用于获取第三图数据,该第三图数据为该源节点的关联边的边负载;该分派单元,还用于通过该第二通信链路将该第三图数据发送至该第二处理单元;该第二处理单元,还用于根据该第一图数据和该第三图数据计算目的节点的更新负载,该更新负载用于更新该目的节点的节点信息。
结合第三方面,在第三方面的某些实现方式中,当该图数据处理装置更新完该目的节点的节点信息时,该获取单元还用于,获取该目的节点的关联边的边负载,该目的节点的关联边与该源节点的关联边不同。
第四方面,提供一种芯片组,该芯片包括处理器以及第二方面所述的芯片,该处理器与芯片耦合,该处理器用于控制芯片以实现第一方面及其任意可能实现的方式。
第五方面,提供一种电子设备,包括第二方面中的芯片。
第六方面,提供一种电子设备,包括第三方面中的芯片组。
第七方面,提供一种计算机程序产品,该计算机程序产品包括计算机程序代码,当该计算机程序代码在计算机上运行时,第一方面或其任意可能的实现方式被执行。
第八方面,提供一种计算机可读存储介质,所述计算机可读存储介质中存储有计算机指令,当计算机指令在计算机上运行时,使得第一方面或其任意可能的实现方式中的方法被执行。
附图说明
图1是本申请实施例提供的一种图数据结构。
图2是本申请实施例提供的一种芯片的应用场景。
图3是本申请实施例提供的一种芯片的架构图。
图4是本申请实施例提供的另一种芯片的架构图。
图5是本申请实施例提供的一种图数据处理方法的示意图。
图6是本申请实施例提供的另一种图数据处理方法的示意图。
图7是本申请实施例提供的又一种图数据处理方法的示意图。
图8是本申请实施例提供的又一种图数据处理方法的示意图。
图9是本申请实施例提供的又一种图数据处理方法的示意图。
图10是本申请实施例提供的又一种图数据处理方法的示意图。
图11是本申请实施例提供的一种图数据处理装置的示意图。
具体实施方式
下面将结合附图,对本申请中的技术方案进行描述。
以下实施例中所使用的术语只是为了描述特定实施例的目的,而并非旨在作为对本申请的限制。如在本申请的说明书和所附权利要求书中所使用的那样,单数表达形式“一个”、“一种”、“所述”、“上述”、“该”和“这一”旨在也包括例如“一个或多个”这种表达形式,除非其上下文中明确地有相反指示。还应当理解,在本申请以下各实施例中,“至少一个”、“一个或多个”是指一个、两个或两个以上。术语“和/或”,用于描述关联对象的关联关系,表示可以存在三种关系;例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B的情况,其中A、B可以是单数或者复数。字符“/”一般表示前后关联对象是一种“或”的关系。
在本说明书中描述的参考“一个实施例”或“一些实施例”等意味着在本申请的一个或多个实施例中包括结合该实施例描述的特定特征、结构或特点。由此,在本说明书中的不同之处出现的语句“在一个实施例中”、“在一些实施例中”、“在其他一些实施例中”、“在另外一些实施例中”等不是必然都参考相同的实施例,而是意味着“一个或多个但不是所有的实施例”,除非是以其他方式另外特别强调。术语“包括”、“包含”、“具有”及它们的变形都意味着“包括但不限于”,除非是以其他方式另外特别强调。
图1是本申请提供的一种图数据结构的示意图。
在计算机科学中,图是一种抽象的数据类型。图的数据结构(data structure)包含一个有限的集合作为节点(如图1所示的节点111)集合,以及一个无序对或有序对的集合作为边(如图1所示的边121)的集合。节点可以是图结构的一部分,也可以是用整数下标或引用表示的外部实体。图的数据结构还可能包含和每条边相关联的数值(edge value),例如权重(weight)。
如图1所示的图数据结构1,包括节点111、节点112和节点113等多个节点以及边121、边122和边123等多条边。节点111、节点112和节点113两两之间互为相邻节点。节点111与节点112之间通过边121联系,节点111与节点113之间通过边122联系,节点112和节点113之间通过边123联系。
在以节点111为活跃节点确定节点113的节点信息的过程中,又可以称节点111为源节点,节点113为目的节点。
图计算(graph computing或graph processing)是指将数据按照图的方式建模,并通过 计算图中节点或边的属性分析图数据(即图属性分析)以获得处理结果的过程。图计算是一种高性能地处理网格图的计算技术,其通过图计算处理可以获得不同节点间的关系或者更新图中节点与边的状态。
图计算中,源节点的节点信息或目的节点的节点信息(或称节点工作负载、节点负载)可以视为是源节点或目的节点的一种或多种属性,相应的连接源节点和目的节点的边也存在一些属性,这里将其称为边负载(或称边的工作负载、边信息)。根据不同的应用场景,节点信息、边负载具有不同的实际意义。节点信息和边负载都可以称为图计算过程中的图数据。
举例而言:社交网络可以看成是以个人和公众号为节点,以个人对公众号的关注、点赞为边构成的图;社交网络中通过个人在网页中的浏览记录、浏览时间等信息以及个人对公众号关注、点赞的数量或频次确定公众号的受喜爱度的过程可以看作图计算中根据源节点的节点信息和边负载确定目的节点的节点信息的过程。
又如,交易网络可以看成是以个人和商品为节点,以个人对商品的购买、收藏为边构成的图。交易网络中根据商品的购买、收藏的月度增长量和月度增长变化量确定商品的年度销售目标的过程可以看作图计算中根据边负载确定目的节点的节点信息的过程。
根据一些节点的信息、节点之间的一些边的信息确定另一些节点的信息或对另一些节点的信息进行更新的过程属于一种图计算。
需要说明的是,对于图结构中的边,其包含的两个端点中的任意一个都可以作为源节点也可以作为目的节点。除非特别说明,以下实施例中均将活跃节点作为源节点,与活跃节点相对的边的另一个端点作为目的节点。
以下实施例中,以图的数据结构作为本申请提供的芯片的处理对象,应理解本申请提供的芯片还适用于堆栈(stack)、队列(queue)、数组(array)、链表(linked list)、树(tree)、堆积(heap)和(hash table)等不同组织方式的数据,本申请对此不作限制。
真实世界的自然图大多满足小世界网络的特点,因而由真实世界的自然图转化得到的图数据结构缺乏固定的结构化关系,且不同节点的出度和入度差别明显,因而针对这类图数据的图计算也缺少扩展性和局部性。
应理解,本申请提供的图数据处理方法对应的图数据结构既适用于源于真实世界的自然图,也适用于合成图,本申请对此不作限制。
以控制流为主的通用处理架构在图计算过程中通常表现出较低的每周期指令吞吐量(instruction per cycle,IPC),即计算核心的处理计算效率较低。通过为通用处理器设置面向图应用的专用加速器(accelerator)在一定程度上可以提高通用处理器处理图结构数据的效率,而如何高效利用加速器的片上存储资源和提高片外存储带宽的利用率,以增强加速器的效用是亟需解决的问题。
图2是本申请提供的芯片的一种使用场景示意图。
中央处理单元(central process unit,CPU)21包括一个或多个处理器核,在本申请实施例CPU用于处理图数据。
芯片22又可以称为加速器(accelerator),其可以设置一个或多个加速器内存(片外缓存)24,加速器内存用于保存需要处理的图数据。加速器包括内存控制器以及多个计算处理单元(processelement,PE),该计算处理单元也可以称为处理引擎(process engine)。 控制器用于从加速器内存中读取需要处理的图数据并将数据分发到多个计算处理单元,由多个计算处理单元对图数据结构中的数据进行处理得到处理结果。加速器将处理结果再输出到CPU,CPU可以在处理结果基础上进一步处理,得到目标结果,加速器从而可以实现对CPU处理图数据的加速。
通信通道23位于CPU和加速器之间,为CPU与加速器之间数据的传输提供通道。通信通道可以是高速串行计算机扩展总线(peripheral component interconnect express,PCIe)等。
在图数据处理过程中,CPU和加速器可以按照如下步骤执行:
S101,CPU主机程序通过通信通道将加速器内核所需的数据写入与CPU连接的加速器的全局内存中。
S102,CPU主机程序使用其输入参数设置加速器内核。
S103,CPU主机程序触发加速器内核功能的执行。
S104,加速器执行计算,同时从全局内存中读取数据。
S105,加速器将数据写回到全局内存,并通知主机数据处理已经完成。
S106,CPU主机程序将数据从全局内存读回主机内存,并继续处理。
以下实施例中,重点对S104的内容进行介绍,其余步骤不做详细描述。
图3是本申请提供的一种芯片的架构示意图。
芯片22包括预取模块221、调度模块222和处理模块223,芯片22可以配置一个或多个片外缓存24。预取模块(prefetcher)从片外缓存获取待处理数据后,由调度模块(dispatcher)进一步分配至处理模块(processor)处理,处理得到的结果再经由调度模块和预取模块返回至片外缓存。
应理解,芯片还设置输入输出接口,用于与芯片外部交换数据。例如,预取模块可以通过该接口从片外缓存中获取待处理的图数据,预取模块也可以通过该接口将处理模块的数据处理结果发送到片外缓存。
在一些实施例中,处理模块包括至少两个PE,PE与PE之间通过片上网络(network on chip,NoC)相互连接。
具体地,每个PE包括路由单元(routing unit,RU),PE与PE之间的路由单元相互连接,并可以用于PE之间的相互通信及数据传输。
通过在多个PE之间设置相互连通的通信链路可以实现芯片上多个PE的数据共享。
在一些实施例中,PE都包括图形单元(graph unit,GU)或称计算单元或称图处理单元、路由单元和暂存单元(scratchpad,SPD),计算单元用于处理调度模块分配的工作负载(workload)并生成更新请求。路由单元用于将计算单元的计算结果通过NoC发送至存储相对应的节点的PE的暂存单元中;暂存单元用于存储点的属性,所有PE包含的暂存单元组成处理模块缓存或者称为芯片的片上缓存,每个PE包含的暂存单元都属于片上缓存的一部分,即本申请实施例中芯片采用分布式缓存。
在一些实施例中,该处理模块可以包括N行PE,N为大于1的整数,每一行PE包括至少2个PE。
在一些实施例中,处理模块包括N*M个PE(N,M均为大于或等于1的正整数),N*M个PE形成N行M列的阵列,位于第一行第M列的PE可以表示为PE(1,M),位于第N行第1列的PE可以表示为PE(N,1)依次类推。以下实施例中,除非特别说明,PE(n,m)即表示第n行,第m列的PE,n,m均为大于或等于1的正整数。
预取模块用于执行预取以获取保存在片外缓存上的图数据。
在一些实施例中,预取模块包多个预取单元,每个预取单元都连接到片外存储器的一个伪通道。
在一些实施例中,预取模块包括N个预取单元(N为大于1的整数),N个预取单元中的每一个预取单元分别与处理模块中的N行PE中的每一行PE对应。
在一些实施例中,预取单元包括点预取器(vertex prefetcher,Vpref)和边预取器(edge prefetcher,Epref)。点预取器用于获取活跃点的数据,边预取器用于预取活跃边(或称活跃点关联边)的数据。
通过预取模块,芯片可以从外部存储空间中获取数据。芯片可以从外部存储空间一次获取一个或多个图数据。
调度模块用于接收来自预取模块的图数据,并将即将被处理的工作负载分派到处理模块中。
在一些实施例中,调度模块包多个分派单元(dispatcher unit),每一个分派单元都分别与每一个预取单元相关联,分派单元用于调度相关联的预取单元中的图数据。
在一些实施例中,分派单元包括点分派单元(vertex dispatcher unit,VDU)和边分派单元(edge dispatcher unit,EDU)。点分派单元用于分派活跃点的数据,边分派单元用于分派活跃点关联边的数据。
在一些实施例中,调度模块包括N个分派单元,每个分派单元包括点分派单元和边分派单元,点分派单元与预取模块中某一个预取单元的点预取单元相关联,用于接收相关联的点预取单元中的活跃点数据,并将活跃点数据分派到处理模块;边分派单元与预取模块中某一个预取单元的边预取单元相关联,用于接收相关联的边预取模块中的活跃点关联边的数据,并将活跃点关联边的数据分派到处理模块。
在一些实施例中,预取模块与调度模块之间设置一个或多个第一通信接口,预取模块中包含的多个预取单元与调度模块中包含的多个分派单元通过该第一通信接口进行相互数据传输。
在另一些实施例中,预取模块包含多个预取单元,调度模块包含多个分派单元,相互关联的预取单元与分派单元之间单独设置通信接口。即预取模块与调度模块之间设置多个第二通信接口,第二通信接口用于相互关联的预取单元与分派单元之间进行相互数据传输。
在一些实施例中,调度模块与处理模块之间设置一个或多个第三通信接口,调度模块中包含的多个分派单元与处理模块包含的多个PE之间通过第三通信接口进行数据传输。
在一个实施例中,芯片包括预取模块、调度模块和处理模块,处理模块包括16行16列个PE,所有PE组成PE阵列,相邻的PE之间设置有通信链路。预取模块包括16个预取单元,调度模块包括16个分派单元,16个预取单元中的每一个预取单元分别与16个分派单元中的每一个分派单元相关联。相互关联的预取单元与分派单元又与16行PE中的每一行PE向关联,用于为关联行的PE预取、分派数据。
如图4所示,为本申请实施例提供的另一种芯片架构示意图,相比于图3所示的芯片架构,本申请实施例中预取模块221中的每一个预取单元分别与片外缓存24建立通信链路,即预取模块221与片外缓存之间至少设置N条通信链路,通过该N条通信链路,每一个预取单元可以从片外缓存24中获取该预取单元所需要获取的数据。
预取模块221中的每一个预取单元还分别与调度模块222的每一个分派单元分别建立通信链路,具体的,第1行的预取单元与第1行的分派单元之间设置有通信链路,第2行的预取单元与第2行的分派单元之间设置有通信链路,第n行的预取单元与第n行的分派单元之间设置有通信链路。每一行的分派单元可以通过该通信链路向与其建立连接的预取单元获取相应活跃点的数据。
该芯片中还包括N条行总线224,该N条行总线与N行计算处理单元一一对应。具体的,第1行的行总线224与第1行的M个计算处理单元之间均设置有通信链路,第2行的行总线224与第2行的M个计算处理单元之间均设置有通信链路,第n行的行总线224与第n行的M个计算处理单元之间均设置有通信链路。行总线与计算处理单元之间不经过任何其他计算处理单元。
行总线224的远离处理单元的一端与分派单元连接,具体的,第1行的行总线224与第1行的分派单元之间设置有通信链路,第2行的行总线224与第2行的分派单元之间设置有通信链路,第n行的行总线224与第n行的分派单元之间设置有通信链路。
在一些实施例中,通过上述N条行总线,第n行的分派单元可以一次将相同的点负载或边负载分派到第n行的多个计算处理单元中。在一个实施例中,通过上述N条行总线,第n行的分派单元可以一次将相同的点负载或边负载分派到第n行的所有M个计算处理单元中。
在另一些实施例中,通过上述N条行总线,处于同一行的多个计算处理单元可以同时获取到多个待处理的数据。
需要说明的是,本申请实施例提供的架构可以在现场可编程逻辑门阵列(field programmable gate array,FPGA)的集成电路(例如:Xilinx Alveo U280FPGA)上实现,或者也可以在复杂可编程逻辑器件(complex programmable logic device,CPLD)等其他集成电路上实现,本申请对此不进行限制。
本申请实施例中运用的存储设备可以是双倍数据率同步动态随机存取存储器(double data rate synchronous dynamic random access memory,DDR SDRAM)等多种类型的存储设备。
示例性地,本申请实施例的片外缓存可以使用高带宽(high bandwidth memory,HBM)堆栈。
本申请实施例提供的芯片中,不同的PE之间直接构建了通信链路,PE与PE之间的数据传输可以直接通过PE之间的通信链路完成,无需通过集中式的分派机制进行分派,提高了芯片在进行图数据处理时的可扩展性,提高了芯片对图数据处理的处理效率,提高了芯片对存储设备高带宽的利用率,提升了芯片的性能。
此外,本申请实施例中,每一个PE只与有限的PE相连,降低了芯片的硬件复杂度。
以上结合图2至图4主要说明了本申请实施例提供的芯片的架构,以下结合图5至图10进一步说明适用于本申请提供的芯片的数据处理方法。
图5是本申请实施例提供的芯片进行图数据处理的基本流程图。
本申请实施例中芯片进行图数据处理可以分为两个阶段:分散阶段和应用阶段。其中分散阶段主要负责读取边负载、处理边负载以及生成更新负载分派至PE。应用阶段主要负责接收更新负载并更新活跃节点以便开始下一轮迭代。
S201至S203为分散阶段(scatter phase),S204至S206为应用阶段(apply phase)。
S201,读取活跃节点和活跃边(活跃节点关联边)。
具体地,调度模块通过预取模块顺序读取活跃节点以及活跃节点关联边的数据。预取模块可以一次读取一个或多个活跃节点的数据和/或活跃节点关联边的数据。
S202,分派工作负载。
具体地,调度模块根据一定的算法分派活跃节点和活跃节点关联边的数据。例如可以根据活跃节点的节点标识来分派活跃节点的节点数据和活跃节点关联边的数据。
调度模块可以通过与之关联的行总线向计算处理单元分派活跃节点和活跃节点关联边的数据。通过行总线,芯片可以一次向同一行的多个计算处理单元分派同一个图数据,也可以一次向同一行的多个计算处理单元分派多个不同的图数据。
S203,处理工作负载。
在一些实施例中,当前PE为更新目的节点的节点信息的PE,则该PE在本地的SPD中保存更新负载。
在另一些实施例中,当前PE不是更新目的节点的节点信息的PE,则该PE将更新负载通过RU发送至负责更新目的节点的节点信息的PE的RU中。
可选地,当用于更新同一目的节点的一个或多个更新负载在路由至负责更新目的节点的节点信息的PE过程中同时路由至任一RU,则该RU对该一个或多个更新负载执行规约操作。
S204,更新(应用)节点属性。
具体地,PE的SPD对与本地存储的每一个点执行应用(apply)函数,并将结果发送至GU。
需要说明的是,这里应用函数可以是用户自定义的函数,也可以是通过其他方式确定的,该应用函数用于计算本轮迭代后节点信息的更新结果。
S205,读取节点属性更新的结果。
GU将SPD发送的处理结果与节点信息上一轮迭代的结果进行比较,并将产出更新的节点信息发送至调度模块。
S206,生成下一轮迭代的活跃点,并将下一轮迭代的活跃点写回片外缓存。
具体地,调度模块将本轮迭代中进行更新的节点作为下一轮迭代的活跃节点,并将一个或多个活跃节点的信息写回片外缓存,从而开启下一轮迭代。
图6为本申请提供的一种芯片进行图数据处理的处理方法示意图。
在本申请实施例中,芯片根据边负载包含的源节点对边负载进行分派,PE对目的节点的节点信息在本地进行更新。
如图6所示,节点1为活跃节点,节点3、节点4和节点8为节点1的相邻节点。本轮迭代用于对节点1的相邻节点的节点信息进行更新,节点1又可以称为源节点,节点3、节点4和节点8又可以称为目的节点(即需要更新节点信息的节点)。节点1和节点4之间通过边a连接,节点1和节点3之间通过边b连接,节点1和节点8之间通过边c连接,这里边a、边b和边c可以称为活跃边或者活跃点关联边。
在执行图数据处理前,芯片可以执行初始化操作,该初始化操作可以确定芯片对图数据处理的第一轮迭代的一个或多个活跃点,可选地,初始化操作还可以确定第一轮迭代的 一个或多个活跃点的节点信息。
在一些实施例中,初始化操作有CPU执行。
在分散阶段,
对于边a、边b和边c,这三条边拥有相同的源节点,芯片可以从片外缓存中分别读取边a、边b和边c的边工作负载(以下简称边负载)E1、E2和E3,并根据三条边相同的源节点,将三条边负载发送至已经保存了节点1的节点信息的PE(1,1)。
PE(1,1)接收到三条边的边负载后会对边负载进行处理。
在一些实施例中,PE(1,1)根据边负载确定每一个边负载的目的节点,并将边负载通过RU路由到保存目的节点的节点信息的PE。
可选地,PE(1,1)还将节点1的节点信息路由至保存目的节点的节点信息的PE。
示例性地,PE(1,1)根据边a的边负载E1确定该边负载的目的节点为节点4,PE(1,1)将该边负载E1和/或节点1的节点信息路由至P(2,1),即保存节点4的节点信息的PE。
边b和边c对应负载的处理过程与边a的处理过程类似,详细可以参考边a的处理过程,边b的边负载E2会路由至PE(1,2),边c的边负载会路由至PE(3,2)。
如图6所示,分散阶段中,PE(1,1)与PE(1,2)之间、PE(1,2)与PE(1,3)之间、PE(1,1)与PE(2,1)之间、PE(1,2)与PE(2,2)之间以及PE(2,2)与PE(3,2)之间连接的箭头示意性地表示了边负载、节点信息在PE之间路由的过程。
在应用阶段,保存目的节点的节点信息的PE接收包含目的节点的边负载后,对目的节点的节点信息进行更新。
在一些实施例中,保存目的节点的节点信息的PE根据以下信息中的一项或多项更新目的节点的节点信息:边负载、源节点的节点信息或目的节点当前的节点信息。
应理解,对于复杂的图结构,确定图中节点的节点信息的往往是通过多轮的迭代完成的,因而在迭代过程中可能会多次对某一节点的节点信息进行更新。目的节点当前的节点信息是指在本轮迭代完成前或者上一轮迭代结束时,目的节点的节点信息。
这里,节点信息的更新方法可以是芯片根据应用场景确定的,也可以是芯片的用户预先设定好的。
示例性地,芯片可以预先配置如下算法中的一种或多种,并根据预先配置的算法执行应用过程:网页排名(page rank)算法、广度优先(breadth first search,BFS)算法、单源最短路径(single source shortest path,SSSP)算法或协同过滤(collaborative filtering,CF)算法。
在一些实施例中,芯片根据其预配置的节点信息更新方法,确定更新节点信息所需的信息,进而再根据该预配置的节点信息更新方法更新目的节点的节点信息。
示例性地,芯片根据源节点的节点信息确定当前处理的图数据的场景,进而确定节点信息的更新方法。在另一些实施例中,保存目的节点的节点信息的PE接收多个具有相同目的节点的边负载,该PE根据多个边负载更新目的节点的节点信息。
在一轮迭代过程中,芯片中的多个PE保存的节点信息中的一个或多个会进行更新,PE可以通过将一轮迭代过程的处理结果与更新前的节点信息进行比较,对于本轮迭代过程中产生了更新的节点信息发送至调度模块。调度模块可以根据本轮迭代获取的节点信息确定下一轮迭代的活跃节点,并将新的一个或多个活跃节点写回片外缓存中并触发下一轮 迭代过程。
示例性地,节点3、节点4和节点8为本轮更新节点信息的节点,调度模块会将这些节点的标识返回至片外缓存,作为下一轮迭代的活跃节点。
本申请实施例中,应用阶段中每个PE更新本地存储的节点的节点信息,无需将节点信息路由至其他PE,因而减少了应用阶段中不同PE之间的通信开销。
图7为本申请提供的另一种芯片进行图数据处理的处理方法示意图。
在本申请实施例中,芯片根据边负载包含的目的节点对边负载进行分派,芯片包含的所有PE中均保存了可能用到的节点的节点信息,在一轮迭代结束时对所有PE中保存的可能用到的节点的节点信息进行更新。
图7所示数据处理方法中处理的图数据结构与图6中所示的图数据结构一致,相关描述可以参考图6所示实施例的内容,此处不做赘述。
在执行图数据处理前,芯片可以执行初始化操作,该初始化操作可以确定芯片对图数据处理的第一轮迭代的一个或多个活跃点,可选地,初始化操作还可以确定第一轮迭代的一个或多个活跃点的节点信息。
在一些实施例中,初始化操作有CPU执行。
在分散阶段,
对于边a、边b和边c,芯片可以从片外缓存中分别读取边a、边b和边c的边负载E 1、E 2和E 3,并根据边a的目的节点为节点4,边b的目的节点为节点3,边c的目的节点为节点8,分别将边a、边b和边c的边负载分派到保存节点4的节点信息的PE(2,1)、保存节点3的节点信息的PE(1,3)和保存节点8的节点信息的PE(3,2)。
以边a为例,PE(2,1)本地保存了边a的源节点1的节点信息的副本V 1R,当PE(2,1)接收到边负载E 1,PE(2,1)可以根据已经获取的V 1R、边负载E 1或目的节点当前的节点信息V 4中的一项或多项更新节点4的节点信息。
边b和边c对应负载的处理过程与边a的处理过程类似,详细可以参考边a的处理过程,PE(1,3)和PE(3,2)同样也会更新节点3和节点8的节点信息。
可选地,芯片根据其预配置的节点信息更新方法,确定更新节点信息所需的信息,进而再根据该预配置的节点信息更新方法更新目的节点的节点信息。
这里,节点信息的更新方法可以是芯片根据应用场景确定的,也可以是芯片的用户预先设定好的一个或多个更新方法中的一种。
示例性地,芯片根据其预配置的节点信息更新方法,确定更新节点信息所需的信息,进而再根据该预配置的节点信息更新方法更新目的节点的节点信息。
示例性地,芯片根据源节点的节点信息确定当前处理的图数据的场景,进而确定节点信息的更新方法。
在另一些实施例中,保存目的节点的节点信息的PE接收多个具有相同目的节点的边负载,该PE根据多个边负载更新目的节点的节点信息。
在一轮迭代过程中,芯片中的多个PE保存的节点信息中的一个或多个会进行更新,PE可以通过将一轮迭代过程的处理结果与更新前的节点信息进行比较,对于本轮迭代过程中产生了更新的节点信息发送至调度模块。调度模块可以根据本轮迭代获取的节点信息确定下一轮迭代的活跃节点,并将新的一个或多个活跃节点写回片外缓存中并触发下一轮 迭代过程。
示例性地,节点3、节点4和节点8为本轮更新节点信息的节点,调度模块会将这些节点的节点标识返回至片外缓存,作为下一轮迭代的活跃节点。
在一些实施例中,一轮迭代结束时,由于部分节点的节点信息已经进行了更新,保存在所有PE的各个节点的节点信息的副本(如V 1R)也需要进行更新。芯片将已经更新的节点信息路由至各个可能用到该节点信息的PE。
示例性地,节点4的节点信息发生了更新,保存节点4的节点信息的PE(2,1)会将节点4更新后的节点信息V 4分别路由至PE(1,1)、PE(1,3)和PE(3,2)。节点3的节点信息发生了更新,保存节点3的节点信息的PE(1,3)会将节点3更新后的节点信息V 3分别路由至PE(1,1)、PE(2,1)和PE(3,2)。
图7中应用阶段不同PE之间的连接的箭头示意性地标识了更新了节点信息的PE将更新后的节点信息路由至其他PE的过程。
本申请实施例中,由于所有节点都保留了源节点的节点信息的副本,在对目的节点进行节点信息更新时,无需再由保存源节点的节点信息的PE将源节点的节点信息路由至目的节点,减少了分散阶段的PE间的通信开销。
图8为本申请提供的又一种芯片进行图数据处理的处理方法示意图。
在本申请实施例中,芯片在分派负载的将边负载的源节点的节点信息分派至源节点所在行的所有PE,并将边负载分派至源节点所在行的一个或多个PE。
图8所示数据处理方法中处理的图数据结构与图8中所示的图数据结构一致,相关描述可以参考图6所示实施例的内容,此处不做赘述。
在执行图数据处理前,芯片可以执行初始化操作,该初始化操作可以确定芯片对图数据处理的第一轮迭代的一个或多个活跃点,可选地,初始化操作还可以确定第一轮迭代的一个或多个活跃点的节点信息。
在一些实施例中,初始化操作有CPU执行。
在分散阶段,对于边a、边b和边c的边负载,调度模块在分派边负载的同时会将边a、边b和边c共有的源节点节点1的节点信息V 1分派至PE(1,1)同一行的所有PE中,PE(1,2)和PE(1,3)可以接收到本轮迭代中源节点的节点信息V 1
在另一些实施例中,调度模块在分派边负载的同时也可以将源节点节点1的节点信息V 1分派至PE(1,1)同一列的所有PE中,PE(2,1)和PE(3,1)可以接收到本轮迭代中源节点的节点信息V 1
在一些实施例中,调度模块根据目的节点所在列的先后排序依次分派边负载至PE(1,1)同一行的其他PE,即将边a的边负载E 1分派至PE(1,1),将边c的边负载E 3分派至PE(1,2),将边b的边负载E 2分派至PE(1,3)。
示例性地,边c的目的节点为节点8,通过计算得到节点8位于第2列,调度模块将边c的边负载E 3分派到第1行第2列的PE,即PE(1,2)。
在另一些实施例中,调度模块根据目的节点所在行的先后排序依次分派边负载至PE同一列的其他PE,即将边a的边负载E 1分派至PE(1,1),将边b的边负载E 2分派至PE(1,1),将边c的负载E 3分派至PE(3,1)。
在一些实施例中,边a、边b和边c在片外缓存在按照目的节点进行分类存放,调度 模块在预取边负载数据时,读取边负载的源节点,如果不是当前源节点则重新取该列的下一个边负载。
在一些实施例中,当接收到边负载和源节点的节点信息,PE会获取边负载的目的节点,并在同一列中寻找保存该目的节点的节点信息的PE。
示例性地,PE(1,2)在接收到边负载E 3时,获取边负载E 3的目的节点为节点8,在确定PE(1,2)保存的节点信息不是节点8后,在第2列中找到保存节点8的节点信息的PE(3,2),进而将源节点信息V 1和边负载E 3发送至PE(3,2)。
在一些实施例中,保存目的节点信息的PE为当前PE(例如V 3),则当前PE根据源节点的节点信息、边负载或目的节点当前的节点信息中的一项或多项更新保存的目的节点的节点信息。
在另一些实施例中,保存目的节点信息的PE不是当前PE(例如V 1和V 2),当前PE会将源节点的节点信息和/或边负载路由至保存目的节点的节点信息的PE,接收源节点的节点信息和/或边负载后,保存目的节点的节点信息的PE会根据源节点的节点信息、边信息或目的节点当前的节点信息中的一项或多项更新保存的目的节点的节点信息。
可选地,芯片根据其预配置的节点信息更新方法,确定更新节点信息所需的信息,进而再根据该预配置的节点信息更新方法更新目的节点的节点信息。
这里,节点信息的更新方法可以是芯片根据应用场景确定的,也可以是芯片的用户预先设定好的。
示例性地,芯片根据其预配置的节点信息更新方法,确定更新节点信息所需的信息,进而再根据该预配置的节点信息更新方法更新目的节点的节点信息。
示例性地,芯片根据源节点的节点信息确定当前处理的图数据的场景,进而确定节点信息的更新方法。
在一轮迭代过程中,芯片中的多个PE保存的节点信息中的一个或多个会进行更新,PE可以通过将一轮迭代过程的处理结果与更新前的节点信息进行比较,对于本轮迭代过程中产生了更新的节点信息发送至调度模块。调度模块可以根据本轮迭代获取的节点信息确定下一轮迭代的活跃节点,并将新的一个或多个活跃节点写回片外缓存中并触发下一轮迭代过程。
示例性地,节点3、节点4和节点8为本轮更新节点信息的节点,调度模块可以将节点3、节点4和节点8中的一个或多个作为下一轮迭代的活跃点,进而从片外缓存中获取活跃点的关联边作为下一轮的迭代的边负载。例如,调度模块将节点3作为下一轮迭代的活跃点,进而从片外缓存中获取节点3的关联边作为下一轮迭代的边负载。
本申请实施例中,芯片通过将边负载分派至与源节点同一行的PE,可以使边负载只在同一列内路由,有利于减少分散阶段边负载在列之间的路由,减少了分散阶段PE之间的通信开销。在应用阶段,通过调度模块将源节点的节点信息分派到源节点同一行的所有PE,应用阶段保存目的节点的节点信息的PE更新目的节点的节点信息时只需在当前列内路由源节点的节点信息,有利于减少应用阶段源节点的节点信息在列之间路由,减少了应用阶段PE之间的通信开销。
图9为本申请提供的又一种芯片进行图数据处理的处理方法示意图。
规约(reduce)函数主要用于对数据处理的中间结果进行一定的合并处理,从而减少 数据处理过程中产生的通信开销。在图处理模型中的规约函数能够满足交换律和结合律。以下以图6所示的图数据结构为例,首先简单介绍图数据处理过程中的交换律和结合律。
在某一轮迭代中,节点3和节点4均为活跃节点,节点3和节点4都需要对节点5的节点信息进行更新。这种情况下交换律体现为:对节点5进行节点信息的更新,既可以先根据节点3进行更新也可以先根据节点4进行更新,即本轮迭代结束时节点5的节点信息与节点3对节点5进行信息更新和节点4对节点5进行信息更新的先后顺序无关。
在某一轮迭代中,节点1、节点4和节点8均为活跃节点,节点1、节点4和节点8都需要对节点3的节点信息进行更新。这种情况下结合律体现为:对节点3进行节点信息的更新,既可以先根据节点1和节点4对节点3的节点信息进行更新,再根据节点8对节点3的节点信息进行更新;或者也可以先根据节点8和节点1对节点3的节点信息进行更新,再根据节点4对节点3的节点信息进行更新。也就是说,当存在两个以上的活跃节点对同一目的节点进行节点信息的更新时,可以将其中任意两个或两个以上的活跃点对目的节点节点信息更新结果进行结合,再进一步与其它活跃节点对目的节点的节点信息更新计算,该过程不影响目的节点本轮迭代结束时的节点信息。
以下仅以规约函数为例说明图9所示的数据处理方法,应理解,在图处理模型中满足交换律和结合律的其他函数也适用于本申请实施例提供的数据处理方法。
还应理解,出于清楚、简洁的目的,图9中的实施例是以图8所示的数据处理流程为基础说明的,本申请实施例提供的数据处理方法不仅适用于图8所示的数据处理流程,还适用于图6和图7所示的以及其他数据处理流程,此处并未一一列出。
图9中的(a)示例性地给出了本申请实施例提供的一种PE的RU的架构图,RU包括至少一组输入输出接口,用于RU从RU以外(如其他PE或调度模块)接收数据以及RU向外部发送数据。RU可以设置4个阶段(stage),每个阶段均包含4个寄存器(register,Reg)和一个规约单元(reduce unit),其中寄存器用于存储更新负载,规约单元用于执行规约函数相应的操作。位于同一管线的一组寄存器中,相邻阶段的两个寄存器之间都可以实现通信。
每次更新节点信息时,阶段1的某个寄存器会通过输入接口接收一个更新负载。如果该寄存器为空,则该寄存器保存该更新负载。如果寄存器不为空且寄存器中的负载与接收到的负载更新的节点相同则执行规约函数后保存新值;如果寄存器不为空且寄存器中的负载与接收到的负载更新的节点不同则该寄存器将该负载发送到下一阶段的寄存器中,直到该负载与更新的节点相同的负载进行规约操作或者该负载被存入空的寄存器。
在完成上述负载处理后,阶段1中的某个寄存器会将保存的负载值或者执行规约后的负载的值发送至其他PE中。
这里需要说明的是,阶段1中接收负载的寄存器和发送负载的寄存器可以不为同一个寄存器。
还需说明的是,本申请提供的芯片的PE包含的RU,还可以包含更多的或者更少的寄存器,也可以包含更多的或者更少的规约单元,不同寄存器之间也可以设置更多的通信链路,图9中的(a)所示的RU的架构图对此并不构成限定。
示例性地,图9中的(b)给出了RU读写负载的过程,其中V 1、V 2、V 3和V′ 3用于指示存储中寄存器中的节点1、节点2、节点3以及规约后的节点3的负载,第一行第一 列的寄存器可以表示为Reg(1,1),第二行第二列的寄存器可以表示为Reg(2,2),依此类推。
在向寄存器中写入负载阶段,Reg(1,1)和Reg(2,1)分别存储了更新V 1和V 3的负载,Reg(1,2)存储了更新V 2的负载。当RU的输入端口接收到一个新的更新V 3的负载,根据更新负载的序号和管线的数量,RU将该负载发送到第一列(通过将更新负载的序号对管线的数量取余,所得余数即为负载应发送到的列的序号)。RU通过比较Reg(1,1)中已经保存的更新负载的序号与该负载的序号,确定将该负载发送至下一阶段,即第二行第一列的寄存器Reg(2,1)。
Reg(2,1)接收到该负载,RU将第二行第一列的寄存器已经保存的负载更新节点的序号与该负载的序号进行比较,确定为该寄存器中已经保存的更新V 3的负载与新接收的更新的V 3负载执行规约操作,该规约操作由规约单元执行,完成规约操作后,规约单元将处理后得到的更新节点3的负载V′ 3写入到寄存器中。
在从寄存器中读取负载时,以读取节点1的更新负载V 1为例,RU将节点1的更新负载V 1发送至RU的输出端口,进而路由至其他PE。RU将与V 1位于同一管线阶段2的寄存器Reg(2,1)保存的节点负载V′ 3发送至寄存器Reg(1,1)中。
需要说明的是,图9所示中的RU可以是保存目的节点的节点信息的PE的RU,也可以是芯片包含的任一个PE的RU。
本申请实施例中,通过RU将用于更新同一目的节点的负载在路由过程中进行规约操作,有利于减少更新节点的负载在PE之间传输的总量,即有利于减少PE之间的通信总量,有利于减少芯片的通信开销。
此外,对于图8所示的实施例,负载在列内路由的情况,更新同一节点的负载路由至同一RU的机率提高,由RU执行规约操作的机率因此提高,更有利于减少PE之间的通信总量,更有利于减少芯片的通信开销。
图10为本申请提供的芯片进行图数据处理的又一种处理方法示意图。
在本申请实施例中,芯片中PE(1,1)保存节点1的节点信息V 1,PE(1,2)保存节点2的节点信息V 2,PE(2,1)保存节点3的节点信息V 3
在第1轮迭代中,应用阶段V 1的节点信息完成更新后,PE(1,1)立即将V 1的信息发送至调度模块,调度模块通过对比V 1本轮更新前的节点信息和获取的节点信息确定V 1的节点信息在本轮迭代中发生了更新,并将V 1作为下一轮迭代的活跃点。调度模块进一步通过预取模块获取与V 1的关联边的边负载,并将获取的边负载发送至PE,用于触发PE(1,1)的下一轮迭代。
在一些实施例中,在本轮迭代中调度模块保存节点1的节点信息。
具体地,调度模块获取V 1关联边的边负载,并根据该边负载确定其源节点为节点1,进而将该边负载分派至与保存节点1的节点信息的PE(1,1)同一行所有PE,即PE(1,1)和PE(1,2)。
在另一些实施例中,调度模块也可以将节点1的关联边的边负载分派至保存节点1的节点信息的PE(1,1)。
可选地,调度模块还可以将节点1的节点信息分派至保存节点1的节点信息的PE(1,1)同一行所有PE。
类似地,PE(1,2)在完成节点2的节点信息更新后可以立即请求触发下一轮迭代,开始 下一轮迭代的分散阶段。PE(2,1)在完成节点3的节点信息更新后可以立即请求触发下一轮迭代,开始下一轮迭代的分散阶段。
在本申请实施例中,保存某一节点信息的PE在执行完应用阶段,完成该节点信息的更新后直接向调度模块请求触发下一轮的迭代,无需等待芯片中的所有PE的本轮迭代全部完成再触发下一轮迭代,有利于减少PE的空转时间,有利于提高芯片中负载的均衡度,有利于提高芯片对图数据处理的效率。
基于相同的发明构思,本申请实施例还提供一种芯片,该芯片可以用于实现如图5至图10中任一种图数据的处理方法。
如图11所示,本申请实施例还提供一种图数据处理装置1100,该图数据处理装置1100可以包括获取单元1110,该获取单元1110用于执行从片外缓存中获取图数据等如图5至图10中预取模块执行的获取动作;
该图数据处理装置1100还可以包括分派单元1120,该分派单元1120用于执行节点信息的分派、调度等如图5至图10中调度模块执行的分派动作;
该图数据处理装置1100还可以包括处理单元1130,该处理单元1130用于执行节点负载的计算等如图5至图10中处理模块执行的处理动作;该处理单元1130还可以包括图处理子单元、路由子单元和存储子单元,其中,处理子单元用于执行如图5至图10中PE执行的数据处理等动作,路由子单元用于执行如图5至图10中更新负载的规约、路由等动作,存储子单元用于执行如图5至图10中存储节点信息等动作。
该图数据处理装置1100还可以包括行总线1140,该行总线1140与每一行的处理模块对应,该行总线与对应行的每一个处理模块之间设置有单独的通信链路,该通信链路中不经过任何其他处理单元,分派单元可以通过该行总线向处理单元分派待处理的数据。
本申请实施例还提供一种芯片组,该芯片组包括处理器和芯片,该芯片组可以用于实现如图5至图10中任一种图数据的处理方法。
本申请实施例还提供一种电子设备,该电子设备包括芯片或芯片组,该电子设备可以用于实现如图5至图10中任一种图数据的处理方法。
本申请实施例还提供一种计算机程序产品,该计算机程序产品包括计算机程序代码,当计算机程序代码在计算机上运行时,如图5至图10中任一种图数据的处理方法被执行。
本申请实施例还提供一种计算机可读存储介质,该计算即存储介质中存储计算机指令,当计算机指令在计算机上运行时,使得如图5至图10中任一种图数据的处理方法被执行。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组 件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。
所述功能如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(read-only memory,ROM)、随机存取存储器(random access memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以所述权利要求的保护范围为准。

Claims (26)

  1. 一种图数据处理的方法,其特征在于,所述方法应用于芯片,所述芯片包括N行处理引擎PE,N个行总线;其中,所述N个行总线与所述N行PE相对应;N为大于1的整数,每一行PE包括至少2个PE;所述方法包括:
    获取第一图数据和第二图数据;
    确定所述第一图数据和所述第二图数据需要存储的目标行PE;所述目标行PE为所述N行PE中的一行PE,所述目标行PE包括第一PE和第二PE;
    确定所述目标行PE对应的目标行总线;所述目标行总线与所述第一PE通过第一通信链路连接;所述目标行总线与所述第二PE通过第二通信链路连接;所述第一通信链路和所述第二通信链路不经过任何PE;
    通过所述目标行总线将所述第一图数据通过所述第一通信链路传输给所述第一PE;通过所述目标行总线将所述第二图数据通过所述第二通信链路传输给所述第二PE。
  2. 根据权利要求1所述的方法,其特征在于,所述方法还包括:通过所述目标行总线将所述第一图数据通过所述第二通信链路传输给所述第二PE。
  3. 根据权利要求1或2所述的方法,其特征在于,所述N行PE还包括第三PE和第四PE,所述方法还包括:
    所述第一PE基于所述第一图数据计算得到第一计算结果;所述第二PE基于所述第二图数据计算得到第二计算结果;
    所述第三PE对所述第一计算结果和所述第二计算结果进行规约处理,并将规约处理后的结果传输至第四PE,所述第四PE为所述第一计算结果和所述第二计算结果的目的PE。
  4. 根据权利要求3所述的方法,其特征在于,所述N行PE的每个PE中均包含图处理单元,所述第一PE基于所述第一图数据计算得到第一计算结果,包括:所述第一PE的图处理单元基于所述第一图数据计算得到所述第一计算结果;
    所述第二PE基于所述第二图数据计算得到第二计算结果,包括:所述第二PE的图处理单元基于所述第二图数据计算得到所述第二计算结果。
  5. 根据权利要求3或4所述的方法,其特征在于,所述N行PE的每个PE中均包含路由单元,所述第三PE对所述第一计算结果和所述第二计算结果进行规约处理,并将规约处理后的结果传输至第四PE,包括:
    所述第三PE的路由单元对所述第一计算结果和所述第二计算结果进行规约处理,并将规约处理后的结果传输至所述第四PE。
  6. 根据权利要求1至5中任一项所述的方法,其特征在于,所述N行PE的每个PE中均包含缓存,所述方法还包括:
    所述第一PE保存所述第一图数据至所述第一PE的缓存中,所述第二PE保存所述第二图数据至所述第二PE的缓存中。
  7. 根据权利要求1至6中任一项所述的方法,其特征在于,所述N行PE还包括第五PE,所述方法还包括:
    所述第五PE对第三处理结果和第四处理结果执行规约处理,所述第三处理结果和所述第四处理结果用于更新同一个图数据。
  8. 根据权利要求1至7中任一项所述的方法,其特征在于,所述N行PE组成N行M列的PE阵列,M为大于1的整数。
  9. 根据权利要求1至8中任一项所述的方法,其特征在于,所述N行PE包含的所有PE中,相邻PE之间设置有PE通信链路,所述PE通信链路用于实现PE之间的数据共享。
  10. 根据权利要求2所述的方法,其特征在于,所述第一图数据为源节点的节点信息,所述方法还包括:
    获取第三图数据,所述第三图数据为所述源节点的关联边的边负载;
    通过所述第二通信链路将所述第三图数据发送至所述第二PE;
    所述第二PE根据所述第一图数据和所述第三图数据计算目的节点的更新负载,所述更新负载用于更新所述目的节点的节点信息。
  11. 根据权利要求10所述的方法,其特征在于,所述方法还包括:
    当所述芯片更新完所述目的节点的节点信息时,所述芯片获取所述目的节点的关联边的边负载,所述目的节点的关联边与所述源节点的关联边不同。
  12. 一种图数据处理装置,其特征在于,包括:
    获取单元,用于获取第一图数据和第二图数据;
    N行处理单元,用于处理所述第一图数据和所述第二图数据,N为大于1的整数,每一行处理单元包括至少2个处理单元;
    N个行总线,所述N个行总线与所述N行处理单元相对应;
    分派单元,用于确定所述第一图数据和所述第二图数据需要存储的目标行处理单元;所述目标行处理单元为所述N行处理单元中的一行处理单元,所述目标行处理单元包括第一处理单元和第二处理单元;
    所述分派单元,还用于确定所述目标行处理单元对应的目标行总线;所述目标行总线与所述第一处理单元通过第一通信链路连接;所述目标行总线与所述第二处理单元通过第二通信链路连接;所述第一通信链路和所述第二通信链路不经过任何处理单元;
    所述分派单元,还用于通过所述目标行总线将所述第一图数据通过所述第一通信链路传输给所述第一处理单元;通过所述目标行总线将所述第二图数据通过所述第二通信链路传输给所述第二处理单元。
  13. 根据权利要求12所述的图数据处理装置,其特征在于,所述分派单元还用于,通过所述目标行总线将所述第一图数据通过所述第二通信链路传输给所述第二处理单元。
  14. 根据权利要求12或13所述的图数据处理装置,其特征在于,所述N行处理单元还包括第三处理单元和第四处理单元,
    所述第一处理单元,用于基于所述第一图数据计算得到第一计算结果;
    所述第二处理单元,用于基于所述第二图数据计算得到第二计算结果;
    所述第三处理单元,用于对所述第一计算结果和所述第二计算结果进行规约处理,并将规约处理后的结果传输至第四处理单元,所述第四处理单元为所述第一计算结果和所述第二计算结果的目的处理单元。
  15. 根据权利要求14所述的图数据处理装置,其特征在于,所述N行处理单元的每个处理单元中均包含图处理子单元,
    所述第一处理单元的图处理子单元,用于基于所述第一图数据计算得到所述第一计算结果;
    所述第二处理单元的图处理子单元,用于基于所述第二图数据计算得到所述第二计算结果。
  16. 根据权利要求12至15中任一项所述的图数据处理装置,其特征在于,所述N行处理单元的每个处理单元中均包含路由子单元,
    所述第三处理单元的路由子单元,用于对所述第一计算结果和所述第二计算结果进行规约处理,并将规约处理后的结果传输至所述第四处理单元。
  17. 根据权利要求12至16中任一项所述的图数据处理装置,其特征在于,所述N行处理单元的每个处理单元中均包含存储子单元,
    所述第一处理单元,还用于保存所述第一图数据至所述第一处理单元的存储子单元中;
    所述第二处理单元,还用于保存所述第二图数据至所述第二处理单元的存储子单元中。
  18. 根据权利要求12至17中任一项所述的图数据处理装置,其特征在于,所述N行处理单元还包括第五处理单元,
    所述第五处理单元,用于对第三处理结果和第四处理结果执行规约处理,所述第三处理结果和所述第四处理结果用于更新同一个图数据。
  19. 根据权利要求12至18中任一项所述的图数据处理装置,其特征在于,所述N行处理单元组成N行M列的处理单元阵列,M为大于1的整数。
  20. 根据权利要求12至19中任一项所述的图数据处理装置,其特征在于,所述N行处理单元包含的所有处理单元中,相邻处理单元之间设置有处理单元通信链路,所述处理单元通信链路用于实现处理单元之间的数据共享。
  21. 根据权利要求13所述的图数据处理装置,其特征在于,所述第一图数据为源节点的节点信息,
    所述获取单元,还用于获取第三图数据,所述第三图数据为所述源节点的关联边的边负载;
    所述分派单元,还用于通过所述第二通信链路将所述第三图数据发送至所述第二处理单元;
    所述第二处理单元,还用于根据所述第一图数据和所述第三图数据计算目的节点的更新负载,所述更新负载用于更新所述目的节点的节点信息。
  22. 根据权利要求21所述的图数据处理装置,其特征在于,当所述图数据处理装置更新完所述目的节点的节点信息时,所述获取单元还用于,获取所述目的节点的关联边的边负载,所述目的节点的关联边与所述源节点的关联边不同。
  23. 一种芯片,其特征在于,包括:处理器,用于读取存储器中存储的指令,当所述处理器执行所述指令时,使得所述芯片实现上述权利要求1至11中任一项所述的方法。
  24. 一种电子设备,其特征在于,包括权利要求12所述的芯片。
  25. 一种计算机程序产品,其特征在于,所述计算机程序产品中包括计算机程序代码,当所述计算机程序代码在计算机上运行时,权利要求1至11中任一项所述的方法被执行。
  26. 一种计算机可读存储介质,其特征在于,其上存储有计算机程序,所述计算机程序被计算机执行时,以使得实现权利要求1至11中任一项所述的方法。
PCT/CN2022/100707 2022-02-14 2022-06-23 图数据处理的方法和芯片 WO2023151216A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210151161.5A CN116627887A (zh) 2022-02-14 2022-02-14 图数据处理的方法和芯片
CN202210151161.5 2022-02-14

Publications (1)

Publication Number Publication Date
WO2023151216A1 true WO2023151216A1 (zh) 2023-08-17

Family

ID=87563506

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/100707 WO2023151216A1 (zh) 2022-02-14 2022-06-23 图数据处理的方法和芯片

Country Status (2)

Country Link
CN (1) CN116627887A (zh)
WO (1) WO2023151216A1 (zh)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108563808A (zh) * 2018-01-05 2018-09-21 中国科学技术大学 基于fpga的异构可重构图计算加速器系统的设计方法
WO2021182223A1 (ja) * 2020-03-11 2021-09-16 株式会社エヌエスアイテクス プロセッサ及びデータ経路再構成方法
CN112967172A (zh) * 2021-02-26 2021-06-15 成都商汤科技有限公司 一种数据处理装置、方法、计算机设备及存储介质
CN113407483A (zh) * 2021-06-24 2021-09-17 重庆大学 一种面向数据密集型应用的动态可重构处理器

Also Published As

Publication number Publication date
CN116627887A (zh) 2023-08-22

Similar Documents

Publication Publication Date Title
US10728091B2 (en) Topology-aware provisioning of hardware accelerator resources in a distributed environment
Ma et al. Garaph: Efficient {GPU-accelerated} graph processing on a single machine with balanced replication
TWI735545B (zh) 一種模型的訓練方法和裝置
Bryk et al. Storage-aware algorithms for scheduling of workflow ensembles in clouds
US20160132541A1 (en) Efficient implementations for mapreduce systems
CN110308984B (zh) 一种用于处理地理分布式数据的跨集群计算系统
US20170091668A1 (en) System and method for network bandwidth aware distributed learning
US9420036B2 (en) Data-intensive computer architecture
Rafique et al. Cellmr: A framework for supporting mapreduce on asymmetric cell-based clusters
CN115033188B (zh) 一种基于zns固态硬盘的存储硬件加速模块系统
Wang et al. Phase-reconfigurable shuffle optimization for Hadoop MapReduce
CN109918450B (zh) 基于分析类场景下的分布式并行数据库及存储方法
CN111630487A (zh) 用于神经网络处理的共享存储器的集中式-分布式混合组织
TW202207031A (zh) 用於記憶體通道控制器之負載平衡
Ren et al. irdma: Efficient use of rdma in distributed deep learning systems
Jing et al. MaMR: High-performance MapReduce programming model for material cloud applications
Trivedi et al. RStore: A direct-access DRAM-based data store
Sun et al. Multi-node acceleration for large-scale GCNs
WO2023124304A1 (zh) 芯片的缓存系统、数据处理方法、设备、存储介质及芯片
WO2023151216A1 (zh) 图数据处理的方法和芯片
Contini et al. Enabling Reconfigurable HPC through MPI-based Inter-FPGA Communication
TW202008172A (zh) 儲存系統
US20220374424A1 (en) Join queries in data virtualization-based architecture
Ghasemi A scalable heterogeneous dataflow architecture for big data analytics using fpgas
Sun et al. FPGA-based acceleration architecture for Apache Spark operators

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22925575

Country of ref document: EP

Kind code of ref document: A1