WO2023173912A1 - Procédé de configuration d'un réseau d'éléments de traitement (pe) et dispositif associé - Google Patents

Procédé de configuration d'un réseau d'éléments de traitement (pe) et dispositif associé

Info

Publication number
WO2023173912A1
Authority
WO
WIPO (PCT)
Prior art keywords
configuration
dynamic
array
operator
operators
Prior art date
Application number
PCT/CN2023/070594
Other languages
English (en)
Chinese (zh)
Inventor
张鑫
蔡兆晖
何雷骏
邵芳琳
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Publication of WO2023173912A1 publication Critical patent/WO2023173912A1/fr

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06GANALOGUE COMPUTERS
    • G06G7/00Devices in which the computing operation is performed by varying electric or magnetic quantities
    • G06G7/12Arrangements for performing computing operations, e.g. operational amplifiers
    • G06G7/14Arrangements for performing computing operations, e.g. operational amplifiers for addition or subtraction 
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06GANALOGUE COMPUTERS
    • G06G7/00Devices in which the computing operation is performed by varying electric or magnetic quantities
    • G06G7/12Arrangements for performing computing operations, e.g. operational amplifiers
    • G06G7/16Arrangements for performing computing operations, e.g. operational amplifiers for multiplication or division
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • the present application relates to the field of chip technology, and in particular to a configuration method of a processing element (PE) array and related equipment.
  • PE processing element
  • A coarse-grained reconfigurable array (CGRA) chip is a new generation of programmable acceleration architecture that combines the flexibility of a field programmable gate array (FPGA) chip with the high energy efficiency of an application specific integrated circuit (ASIC) chip. The PE array in the CGRA chip is configured through configuration words, allowing the CGRA chip to execute the corresponding algorithm.
  • FPGA field programmable gate array
  • ASIC application specific integrated circuit
  • the CGRA chip can configure the PE array according to the program's operators to obtain the configured PE array. Then, when the CGRA chip executes the operator, it only needs to transmit the service data to the PE array without switching the configuration of the PE array.
  • the PE array can execute the operator based on the service data.
  • Embodiments of the present application provide a PE array configuration method and related equipment for configuring the PE array.
  • the first aspect of this application provides a method for configuring a processing unit PE array, which can be applied to a chip.
  • the chip includes a processing module and a PE array.
  • The processing module generates the isomorphism features of M operators, where M is a positive integer, then determines the static configuration of N PEs in the PE array based on the isomorphism features, where N is a positive integer, and determines M dynamic configurations based on the static configuration and the overall configuration of the M operators in the PE array, where a dynamic configuration includes the configurations in the overall configuration other than the static configuration.
  • Therefore, the PE array can be configured based on the static configuration and one of the M dynamic configurations; that is, only the dynamic configuration needs to be switched, and the static configuration does not need to be switched, which reduces the switching overhead.
  • The step in which the PE array is configured based on the static configuration and at least one of the M dynamic configurations may include: when executing the first operator among the M operators, the PE array is configured based on the static configuration and the first dynamic configuration corresponding to the first operator, where the first dynamic configuration is one of the M dynamic configurations; when executing the second operator among the M operators, the PE array switches the first dynamic configuration to the second dynamic configuration corresponding to the second operator, where the second operator is an operator executed after the first operator and the second dynamic configuration is one of the M dynamic configurations. It can be seen that the PE array only needs to switch the dynamic configuration and does not need to switch the static configuration, which reduces the switching overhead.
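  • For illustration only, the following Python sketch models the static/dynamic split and the dynamic-only switching described above; the dictionary representation, the PEArray class and the field names are assumptions of this sketch, not part of the application.

```python
# Illustrative sketch (not the application's code): a PE configuration is modeled as
# a dict mapping PE name -> {"func": ..., "route": ...}. The static part is shared by
# all operators; only the dynamic part is swapped when the executed operator changes.

def split_dynamic(overall, static):
    """Return the dynamic part: everything in the overall configuration of one
    operator that is not already fixed by the shared static configuration."""
    dynamic = {}
    for pe, cfg in overall.items():
        extra = {k: v for k, v in cfg.items() if static.get(pe, {}).get(k) != v}
        if extra:
            dynamic[pe] = extra
    return dynamic

class PEArray:
    def __init__(self):
        self.config = {}

    def configure(self, static, dynamic):
        # Full configuration: the static part plus one operator's dynamic part.
        self.config = {pe: dict(cfg) for pe, cfg in static.items()}
        for pe, cfg in dynamic.items():
            self.config.setdefault(pe, {}).update(cfg)

    def switch_dynamic(self, old_dynamic, new_dynamic):
        # Only the dynamic entries are rewritten; the static part stays in place.
        for pe, cfg in old_dynamic.items():
            for key in cfg:
                self.config.get(pe, {}).pop(key, None)
        for pe, cfg in new_dynamic.items():
            self.config.setdefault(pe, {}).update(cfg)

# Example: two operators share their routing (static); only PE11's function differs.
static = {"PE11": {"route": "PE12"}}
op1 = {"PE11": {"route": "PE12", "func": "ADD"}}
op2 = {"PE11": {"route": "PE12", "func": "SUB"}}
dyn1, dyn2 = split_dynamic(op1, static), split_dynamic(op2, static)

array = PEArray()
array.configure(static, dyn1)       # execute the first operator
array.switch_dynamic(dyn1, dyn2)    # execute the second operator: dynamic part only
assert array.config == {"PE11": {"route": "PE12", "func": "SUB"}}
```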
  • Before the processing module performs the step of generating the isomorphism features of the M operators, the method may also include: the processing module obtains M data flow graphs corresponding to the M operators, and then extracts the isomorphism features from the M data flow graphs, thereby obtaining the isomorphism features of the M operators.
  • the chip further includes a memory module.
  • Before the PE array performs the step of configuring based on the static configuration and at least one of the M dynamic configurations, the method may also include: the storage module stores the mapping relationship between the static configuration and an index number; the processing module transmits a configuration word to the PE array, where the configuration word includes the index number and at least one of the M dynamic configurations; and the PE array obtains, from the storage module, the static configuration that has a mapping relationship with the index number.
  • When the processing module transmits multiple configuration words, it does not need to transmit the static configuration directly but replaces it with the index number, which reduces the transmission overhead and improves the transmission efficiency of the configuration words.
  • the configuration word also includes a configuration number, and the configuration number is used to indicate the number of times configuration is performed based on the configuration word.
  • Multiple configuration words with the same static configuration and the same dynamic configuration can be abbreviated as one configuration word to further reduce transmission overhead and improve transmission efficiency.
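  • As a purely illustrative sketch (the field names and the dataclass representation are assumptions, not the application's actual encoding), a configuration word carrying an index number, one dynamic configuration and a configuration count could be modeled as follows.

```python
# Illustrative sketch only; the field names and the encoding are assumptions.
from dataclasses import dataclass, field

@dataclass
class ConfigWord:
    static_index: int                              # index number referring to a stored static configuration
    dynamic: dict = field(default_factory=dict)    # one operator's dynamic configuration
    count: int = 1                                 # how many consecutive configurations this word stands for

# Four identical reconfiguration cycles collapse into a single configuration word:
word = ConfigWord(static_index=0, dynamic={"PE11": {"func": "ADD"}}, count=4)
```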
  • The isomorphism feature includes the routing configuration of each of the N nodes, where any two of the N nodes are directly or indirectly connected, and the static configuration includes the routing configuration of the N PEs; the routing configuration of the N PEs then does not need to be modified, reducing switching overhead.
  • The isomorphism feature may also include the functional configuration of at least 1 of the N nodes, and the static configuration then also includes the functional configuration of at least 1 of the N PEs; the functional configuration of that at least 1 PE does not need to be modified, which further reduces the switching overhead.
  • the chip also includes a MEM interface.
  • The MEM interface can obtain the source code of the program and transmit it to the processing module, so that the processing module can generate the data flow graphs of the M operators based on the source code of the program and obtain M data flow graphs.
  • The processing module can also extract an isomorphism feature from a single data flow graph, where the isomorphism feature is at least two identical local structures in that data flow graph; reusing the N PEs in the PE array corresponding to the feature reduces the number of required PEs and enhances usability.
  • The dynamic configuration may also include the configuration of at least 1 PE other than the N PEs, so that the configuration word can also be applied to operators that cannot be composed of an integer number of isomorphism features, enhancing the applicability of the configuration word.
  • The isomorphism features can also be local structures of multiple different granularities, so that the processing module can determine isomorphism features of different granularities as needed under different circumstances, reducing the number of required PEs and enhancing usability.
  • The storage module includes a configuration random access memory (config RAM) and a static configuration template library (template lib), where the config RAM is used to store configuration words and the template lib is used to store the mapping relationship between index numbers and static configurations. If more than one configuration word shares the same static configuration, only one copy of the static configuration needs to be stored in the template lib and only the index number needs to be stored in the config RAM, which reduces storage overhead compared with storing the static configuration for each configuration word.
  • config RAM configuration random access memory
  • template lib template library
  • The M operators can be all operators in a program or only part of the operators in the program, so that the chip can determine one or more different isomorphism features for a program as needed.
  • a second aspect of the present application provides a chip, which is used to perform the method described in any one of the foregoing first aspects.
  • a third aspect of the present application provides a computer-readable storage medium.
  • the computer-readable storage medium stores instructions that, when run on a computer, cause the computer to execute the method described in any one of the above-mentioned first aspects.
  • a fourth aspect of the present application provides a computer program product.
  • the computer program product includes computer-executable instructions.
  • the computer-executable instructions are stored in a computer-readable storage medium.
  • the processor of the device can read the computer-executable instructions from the computer-readable storage medium.
  • the processor executes the computer-executable instructions to cause the device to implement the method provided by the above-mentioned first aspect or any possible implementation of the first aspect.
  • a fifth aspect of the present application provides a communication device, which may include a processor, a memory, and a communication interface.
  • the processor is coupled to memory and communication interfaces.
  • the memory is used to store instructions
  • the processor is used to execute the instructions
  • the communication interface is used to communicate with other communication devices under the control of the processor.
  • the instruction causes the processor to execute the method of the first aspect or any possible implementation of the first aspect.
  • Figure 1-1 is a schematic structural diagram of a PE array;
  • Figure 1-2 is a schematic diagram of a data link in an embodiment of this application;
  • Figure 1-3 is a schematic diagram of a data link in an embodiment of this application;
  • Figure 1-4 is a schematic diagram of a chip provided by an embodiment of the present application;
  • Figure 2-1 is a schematic flowchart of Embodiment 1 of a PE array configuration method provided by an embodiment of this application;
  • Figure 2-2 is a schematic diagram of data flow diagram 1 in an embodiment of the present application;
  • Figure 2-3 is a schematic diagram of data flow diagram 2 in an embodiment of the present application;
  • Figure 2-4 is a schematic diagram of data flow diagram 3 in an embodiment of the present application;
  • Figure 2-5 is a schematic diagram of an isomorphism feature in an embodiment of the present application;
  • Figure 2-6 is a schematic diagram of data flow diagram 3 divided into multiple local structures in an embodiment of the present application;
  • Figure 2-7 is another schematic diagram of an isomorphism feature in an embodiment of the present application;
  • Figure 2-8 is another schematic diagram of an isomorphism feature in an embodiment of the present application;
  • Figure 2-9 is another schematic diagram of an isomorphism feature in an embodiment of the present application;
  • Figure 2-10 is a schematic diagram of a static configuration in an embodiment of the present application;
  • Figure 2-11 is a schematic diagram of data flow diagram 4 in an embodiment of the present application;
  • Figure 2-12 is a schematic diagram of dividing the configuration buffer (cfg buffer) into two separate storage spaces in an embodiment of the present application;
  • Figure 2-13 is a schematic diagram of the cfg buffer sequentially receiving three configuration words transmitted by the configuration random access memory (config RAM) in an embodiment of the present application;
  • Figure 3 is a schematic structural diagram of a PE array configuration device provided by an embodiment of the present application;
  • Figure 4 is a schematic structural diagram of a communication device provided by an embodiment of the present application.
  • Embodiments of the present application provide a PE array configuration method and related equipment for configuring the PE array of processing units in a CGRA chip.
  • the CGRA chip is a new generation of programmable acceleration architecture that combines the flexibility of FPGA chips with the high energy efficiency of ASIC chips.
  • the CGRA chip has a built-in PE array, which includes multiple PEs.
  • The PE array is used to execute algorithms. It should be noted that a PE is composed of multiple logic gates and is used to perform corresponding operations, such as addition, subtraction, multiplication, and division. Users can configure at least one PE of the PE array in the CGRA chip through configuration words, so that the CGRA chip can execute the corresponding algorithm.
  • connected PE00 and PE01 can directly transfer data to each other.
  • a corresponding data link is formed, which can be used to execute the corresponding algorithm, for example, the data link shown in Figure 1-2.
  • the CGRA chip can configure the PE array according to the operator of the program and obtain the data link corresponding to the operator.
  • the data link includes the configuration of multiple PEs in the PE array. Then, when the CGRA chip executes the operator, it only needs to transmit the service data to the data link corresponding to the operator without switching the configuration of the PE in the data link.
  • the data link corresponding to operator 2 is shown in Figure 1-3, where SUB is subtraction.
  • PE11 changes from addition (ADD) to subtraction (SUB).
  • Therefore, the CGRA chip still needs to perform an overall configuration switch on the N PEs.
  • this application proposes a PE array configuration method and related equipment for configuring the PE array.
  • This application can be applied to chips, which include processing modules and PE arrays.
  • The processing module generates the isomorphism features of M operators, where M is a positive integer, then determines the static configuration of N PEs in the PE array based on the isomorphism features, where N is a positive integer, and determines M dynamic configurations based on the static configuration and the overall configuration of the M operators in the PE array, the dynamic configurations being the configurations in the overall configuration other than the static configuration.
  • the PE array can be configured based on the static configuration and one of the M dynamic configurations, that is, only the dynamic configuration needs to be switched, and the static configuration does not need to be switched, which reduces the switching overhead.
  • The present application can be applied to the chip 100 shown in Figure 1-4, where the chip 100 includes a memory (MEM) interface 110, a processing module 120, a storage module 130 and a PE array 140. It should be noted that the chip 100 can be an FPGA chip, a CGRA chip, or another reconfigurable chip, which is not limited here.
  • MEM memory
  • the MEM interface 110 is an interface through which internal devices of the chip 100 interact with external devices.
  • The MEM interface 110 can receive the source code and business data of a program from a device external to the chip 100, transmit the source code of the program to the processing module 120, and transmit the business data to the storage module 130.
  • the processing module 120 may have a built-in compiler 121, where the compiler 121 is a logic module.
  • The compiler 121 can be used to: generate the isomorphism features of the M operators based on the source code of the program, determine the static configuration of N PEs in the PE array based on the isomorphism features, and determine M dynamic configurations based on the static configuration and the overall configuration of the M operators in the PE array, where a dynamic configuration includes the configurations in the overall configuration other than the static configuration.
  • the compiler 121 may store the static configuration and at least one dynamic configuration among the M dynamic configurations in the storage module 130.
  • the compiler 121 can forward the static configuration and at least one dynamic configuration among the M dynamic configurations to the storage module 130 through the MEM interface 110.
  • the compiler 121 can also be directly connected to the storage module, thereby directly forwarding the static configuration and at least one dynamic configuration among the M dynamic configurations to the storage module 130.
  • the storage module 130 may be a random access memory (RAM) built into the chip 100 .
  • the storage module 130 may transmit the static configuration and at least one dynamic configuration among the M dynamic configurations to the PE array 140, so that the PE array 140 configures based on the static configuration and at least one dynamic configuration among the M dynamic configurations.
  • RAM random access memory
  • the PE array 140 has a built-in configuration buffer (Cfg buffer) 141.
  • The Cfg buffer 141 can be used to receive the static configuration and at least one of the M dynamic configurations transmitted by the storage module 130, so that the PE array 140 is configured based on the static configuration and one of the M dynamic configurations.
  • the processing module 120 also includes a configuration switcher (Cfg switcher) 122, which can be used to switch the dynamic configuration in the Cfg buffer 141.
  • The compiler 121 can store the static configuration and at least one of the M dynamic configurations in the storage module 130. In some possible implementations, the compiler 121 also stores the mapping relationship between the static configuration and the index number in the storage module 130, and then transmits the configuration word to the PE array 140, where the configuration word includes the index number and at least one of the M dynamic configurations. It should be noted that the compiler 121 can send the configuration word to the storage module 130, which then forwards it to the PE array 140; alternatively, the compiler 121 can forward the configuration word directly to the PE array 140, which is not limited here.
  • The storage module 130 can be divided into multiple areas, namely a configuration random access memory (config RAM) 131, a static configuration template library (template lib) 132 and a data random access memory (data RAM) 133.
  • config RAM configuration random access memory
  • template lib static configuration template library
  • data RAM data random access memory
  • the config RAM 131 is used to store the configuration word and transmit the configuration word to the PE array 140;
  • the template lib 132 is used to store the mapping relationship between the static configuration and the index number, and to return the corresponding static configuration to the PE array 140 based on the index number in the configuration word;
  • the data RAM 133 is used to store business data and transmit it to the PE array 140.
  • The Cfg buffer 141 can receive the configuration word transmitted by the config RAM 131 of the storage module 130, obtain the static configuration from the template lib 132 of the storage module 130 based on the index number in the configuration word, configure the PE array 140 based on the static configuration and one of the M dynamic configurations, and calculate the business data based on the configured PE array 140 to execute the corresponding operator.
  • the foregoing has introduced the chip 100.
  • The following introduces the PE array configuration method executed in the chip 100; please refer to Figure 2-1.
  • the method embodiment mainly includes the following steps:
  • the processing module generates M data flow graphs (DFG) corresponding to M operators based on the source code of the program, where M is a positive integer.
  • DFG data flow graphs
  • the chip can receive the source code and business data of the program through the MEM interface, and then the MEM interface transmits the source code of the program to the processing module and the business data to the storage module.
  • After the processing module receives the source code of the program, it can generate a data flow graph corresponding to each of the M operators based on the source code of the program, and obtain M data flow graphs.
  • a data flow graph includes the functional configuration and routing configuration of each node among multiple nodes.
  • the M operators may be all operators in a program, or may be part of the operators in the program, which is not limited here.
  • For example, M = 3, that is, there are three operators, namely operator 1, operator 2 and operator 3.
  • operator 1 is used to calculate the multiplication and addition operations between 2×2 matrices: A*B+C*D;
  • operator 2 is used to calculate the multiplication and subtraction operations between 2×2 matrices: A*B-C*D;
  • operator 3 is used to calculate the multiplication and addition operations between 4×4 matrices: K0*K1+K2*K3+K4*K5+K6*K7.
  • A, B, C and D are all 2×2 matrices:
  • K0, K1, K2, K3, K4, K5, K6 and K7 are all 4×4 matrices:
  • E1 = A*B+C*D
  • E1 has 4 elements:
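  • The following plain-Python check of operator 1 (A*B+C*D on 2×2 matrices) is added for illustration only; the concrete matrix values are made up and are not from the application.

```python
# Plain-Python illustration of operator 1 (A*B + C*D on 2x2 matrices); the matrix
# values below are made up for the example.
def matmul2(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def matadd2(X, Y):
    return [[X[i][j] + Y[i][j] for j in range(2)] for i in range(2)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[1, 0], [0, 1]]
D = [[2, 2], [2, 2]]
E1 = matadd2(matmul2(A, B), matmul2(C, D))   # operator 1: A*B + C*D, a 2x2 result (4 elements)
```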
  • Data flow diagram 1 includes 7 nodes, namely MUL0, MUL1, MUL2, MUL3, ADD0, ADD1, and ADD2.
  • the functional configurations of MUL0, MUL1, MUL2, and MUL3 are all multiplication, and the functional configurations of ADD0, ADD1, and ADD2 are all addition;
  • The routes of MUL0 and MUL1 are configured to point to ADD0, the routes of MUL2 and MUL3 are configured to point to ADD1, and the routes of ADD0 and ADD1 are configured to point to ADD2.
  • MUL0 is used to perform the operation of Ai0*B0i
  • MUL1 is used to perform the operation of Ai1*B1i
  • MUL2 is used to perform the operation of Ci0*D0i
  • MUL3 is used to perform the operation of Ci1*D1i
  • ADD0 is used to perform the operation of MUL0+MUL1
  • ADD1 is used to perform the operation of MUL2+MUL3
  • ADD2 is used to perform the operation of ADD0+ADD1
  • the value of E1ij is obtained.
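  • For illustration, data flow diagram 1 can be written down as a small Python structure that records each node's functional configuration and routing configuration; the representation itself is an assumption of this sketch.

```python
# Data flow diagram 1 written as a Python structure (representation is an assumption):
# each node carries a functional configuration and a routing configuration (successors).
dfg1 = {
    "MUL0": {"func": "MUL", "route": ["ADD0"]},
    "MUL1": {"func": "MUL", "route": ["ADD0"]},
    "MUL2": {"func": "MUL", "route": ["ADD1"]},
    "MUL3": {"func": "MUL", "route": ["ADD1"]},
    "ADD0": {"func": "ADD", "route": ["ADD2"]},
    "ADD1": {"func": "ADD", "route": ["ADD2"]},
    "ADD2": {"func": "ADD", "route": []},   # final node, produces E1ij
}
```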
  • E2 = A*B-C*D
  • E2 has 4 elements:
  • Data flow diagram 2 includes 7 nodes, namely MUL0, MUL1, MUL2, MUL3, ADD0, ADD1, and SUB0.
  • The functional configurations of MUL0, MUL1, MUL2, and MUL3 are all multiplication, the functional configurations of ADD0 and ADD1 are both addition, and the functional configuration of SUB0 is subtraction; the routes of MUL0 and MUL1 are configured to point to ADD0, the routes of MUL2 and MUL3 are configured to point to ADD1, and the routes of ADD0 and ADD1 are configured to point to SUB0.
  • MUL0 is used to perform the operation of Ai0*B0i
  • MUL1 is used to perform the operation of Ai1*B1i
  • MUL2 is used to perform the operation of Ci0*D0i
  • MUL3 is used to perform the operation of Ci1*D1i
  • ADD0 is used to perform the operation of MUL0+MUL1
  • ADD1 is used to perform the operation of MUL2+MUL3
  • SUB0 is used to perform the operation of ADD0-ADD1
  • the value of E2ij is obtained.
  • E3 = K0*K1+K2*K3+K4*K5+K6*K7, and E3 has 16 elements:
  • E3ij = (K0i0*K1i0+K0i1*K1i1+K0i2*K1i2+K0i3*K1i3)+(K2i0*K3i0+K2i1*K3i1+K2i2*K3i2+K2i3*K3i3)+(K4i0*K5i0+K4i1*K5i1+K4i2*K5i2+K4i3*K5i3)+(K6i0*K7i0+K6i1*K7i1+K6i2*K7i2+K6i3*K7i3)
  • the processing module can generate the data flow diagram 3 shown in Figure 2-4 based on the source code of operator 3.
  • Data flow diagram 3 includes 31 nodes, namely MUL0 to MUL15 and ADD0 to ADD14.
  • The functional configurations of MUL0 to MUL15 are all multiplication, and the functional configurations of ADD0 to ADD14 are all addition; the routing configurations of MUL0 to MUL15 and ADD0 to ADD14 are shown in Figure 2-4 and will not be described in detail here.
  • ADD14 is used to perform the operation of ADD13+ADD12, and finally the value of E3ij is obtained.
  • the processing module extracts isomorphic features based on the M data flow graphs, and the isomorphic features correspond to the same local structures among the M operators.
  • the isomorphism feature may include the routing configuration of each node among the N nodes, and any two nodes among the N nodes are directly connected or indirectly connected. In some possible implementations, the isomorphism feature also includes the functional configuration of at least one node among the N nodes. For example, the isomorphism characteristics determined between the data flow diagram 1 shown in Figure 2-2 and the data flow diagram 2 shown in Figure 2-3 can be shown in Figure 2-5.
  • The isomorphism feature includes 7 nodes, namely a, b, c, d, e, f and g, and any 2 of the 7 nodes are directly or indirectly connected. Among them, the routes of a and b are configured to point to e, the routes of c and d are configured to point to f, and the routes of e and f are configured to point to g.
  • the isomorphism feature also includes the functional configuration of at least one node among the seven nodes.
  • The functional configurations of a, b, c and d are all multiplication, the functional configurations of e and f are both addition, and the functional configuration of g is not limited.
  • The processing module can also extract an isomorphism feature from a single data flow graph, where the isomorphism feature is at least two identical local structures in that data flow graph; reusing the N PEs in the PE array corresponding to the feature reduces the number of required PEs and enhances usability.
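  • The following simplified Python sketch illustrates isomorphism-feature extraction for two data flow graphs whose nodes are already aligned position by position (as with data flow diagrams 1 and 2); the alignment step and the use of None for "function not limited" are assumptions of this sketch, not the application's algorithm.

```python
# Simplified sketch: both graphs are given with nodes aligned position by position;
# routing must match exactly, and a function that differs between the two graphs
# becomes None ("not limited") in the extracted isomorphism feature.
def node(func, route):
    return {"func": func, "route": route}

dfg1 = {"MUL0": node("MUL", ["ADD0"]), "MUL1": node("MUL", ["ADD0"]),
        "MUL2": node("MUL", ["ADD1"]), "MUL3": node("MUL", ["ADD1"]),
        "ADD0": node("ADD", ["ADD2"]), "ADD1": node("ADD", ["ADD2"]),
        "ADD2": node("ADD", [])}
dfg2 = {"MUL0": node("MUL", ["ADD0"]), "MUL1": node("MUL", ["ADD0"]),
        "MUL2": node("MUL", ["ADD1"]), "MUL3": node("MUL", ["ADD1"]),
        "ADD0": node("ADD", ["SUB0"]), "ADD1": node("ADD", ["SUB0"]),
        "SUB0": node("SUB", [])}

def extract_feature(dfg_a, dfg_b, alignment):
    """alignment: feature-node name -> (node name in dfg_a, node name in dfg_b)."""
    rev_a = {a: f for f, (a, _) in alignment.items()}
    rev_b = {b: f for f, (_, b) in alignment.items()}
    feature = {}
    for f, (a, b) in alignment.items():
        route_a = [rev_a[n] for n in dfg_a[a]["route"]]
        route_b = [rev_b[n] for n in dfg_b[b]["route"]]
        if route_a != route_b:
            raise ValueError(f"routing of {f} differs; not a shared local structure")
        func_a, func_b = dfg_a[a]["func"], dfg_b[b]["func"]
        feature[f] = {"func": func_a if func_a == func_b else None,  # None: not limited
                      "route": route_a}
    return feature

alignment = {"a": ("MUL0", "MUL0"), "b": ("MUL1", "MUL1"), "c": ("MUL2", "MUL2"),
             "d": ("MUL3", "MUL3"), "e": ("ADD0", "ADD0"), "f": ("ADD1", "ADD1"),
             "g": ("ADD2", "SUB0")}
feature = extract_feature(dfg1, dfg2, alignment)
assert feature["g"]["func"] is None          # function of g is not limited
assert feature["e"] == {"func": "ADD", "route": ["g"]}
```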
  • For example, the processing module can divide data flow diagram 3 shown in Figure 2-4 as shown in Figure 2-6, thereby dividing data flow diagram 3 into 5 local structures with similar structures; based on these 5 local structures, the isomorphism feature shown in Figure 2-7 can be extracted.
  • This isomorphism feature includes 7 nodes, namely a, b, c, d, e, f, and g.
  • the routes of a and b are configured to point to e
  • the routes of c and d are configured to point to f, and the routes of e and f are configured to point to g.
  • The functional configurations of a, b, c and d are not limited, and the functional configurations of e, f and g are all addition.
  • the processing module can extract the isomorphism features shown in Figure 2-8 based on the data flow graph 1, data flow graph 2 and data flow graph 3.
  • the isomorphism features include 7 nodes. They are a, b, c, d, e, f, g respectively.
  • the routes of a and b are configured to point to e
  • the routes of c and d are configured to point to f
  • the routes of e and f are configured to point to g
  • the functional configurations of a, b, c, d and g are not limited, and the functional configurations of e and f are both addition.
  • The isomorphism features can also be local structures of multiple different granularities, so that the chip can determine isomorphism features of different granularities as needed under different circumstances, reducing the number of required PEs and enhancing usability. For example, based on data flow graph 1, data flow graph 2 and data flow graph 3, the isomorphism feature shown in Figure 2-9 can be extracted.
  • This isomorphism feature includes three nodes, namely a, b and c, where the routes of a and b are configured to point to c, and the functional configurations of a, b and c are not limited.
  • the isomorphic features shown in Figure 2-9 have smaller granularity than the isomorphic features shown in Figure 2-8.
  • The M operators can be all operators in a program or only part of the operators in the program, so that the chip can determine one or more different isomorphism features for a program as needed.
  • For example, the processing module can extract isomorphism feature 1 based on data flow graphs 1/2/3, and extract isomorphism feature 2 based on data flow graphs 4/5/6.
  • the above steps 201-202 are optional, as long as the processing module can generate isomorphic features of M operators, there is no limitation here.
  • For example, the chip can determine the isomorphism features based on the calculation formulas of the M operators, which is not limited here.
  • the processing module determines the static configuration of N PEs in the PE array based on the isomorphism characteristics, where N is a positive integer.
  • Specifically, the isomorphism feature includes N nodes. Based on the connection relationship between the N nodes in the isomorphism feature, N available PEs are selected from the PE array, where the connection relationship between the N PEs is the same as the connection relationship between the N nodes in the isomorphism feature, and each node in the isomorphism feature corresponds to one of the N PEs. Then, based on the configuration of each node in the isomorphism feature, the corresponding PE among the N PEs is configured accordingly, so as to obtain the static configuration of the N PEs.
  • If the isomorphism feature includes the routing configuration of each of the N nodes, the static configuration includes the routing configuration of each of the N PEs; if the isomorphism feature also includes the functional configuration of at least one of the N nodes, the static configuration also includes the functional configuration of at least one of the N PEs.
  • the PE array is a 3 ⁇ 3 architecture.
  • The N PEs are PE00, PE01, PE02, PE11, PE20, PE21 and PE22.
  • the static configuration of 7 PEs is obtained as shown in Figure 2-10.
  • The routing configurations of the N PEs (PE00, PE01, PE02, PE11, PE20, PE21 and PE22) are shown in Figure 2-10;
  • the functional configurations of PE01 and PE21 are addition, and the functional configurations of PE00, PE02, PE11, PE20 and PE22 are not limited.
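  • As an illustrative sketch only (the node-to-PE placement and the data representation are assumptions inferred from the description above), the static configuration of the N PEs can be derived by mapping each node of the isomorphism feature onto a PE with the same connection relationship.

```python
# Illustrative sketch; the node-to-PE placement below is inferred from the example
# (PE01 and PE21 perform addition in Figure 2-10) and is an assumption of this sketch.
feature = {  # the isomorphism feature of Figure 2-8; None means "function not limited"
    "a": {"func": None, "route": ["e"]}, "b": {"func": None, "route": ["e"]},
    "c": {"func": None, "route": ["f"]}, "d": {"func": None, "route": ["f"]},
    "e": {"func": "ADD", "route": ["g"]}, "f": {"func": "ADD", "route": ["g"]},
    "g": {"func": None, "route": []},
}
placement = {"a": "PE00", "b": "PE02", "c": "PE20", "d": "PE22",
             "e": "PE01", "f": "PE21", "g": "PE11"}

def static_config(feature, placement):
    cfg = {}
    for node_name, pe in placement.items():
        entry = {"route": [placement[t] for t in feature[node_name]["route"]]}
        if feature[node_name]["func"] is not None:   # only constrained functions go static
            entry["func"] = feature[node_name]["func"]
        cfg[pe] = entry
    return cfg

static = static_config(feature, placement)
assert static["PE01"] == {"route": ["PE11"], "func": "ADD"}   # addition, routed to PE11
assert static["PE00"] == {"route": ["PE01"]}                  # function left to the dynamic part
```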
  • the processing module determines M dynamic configurations based on the static configuration and the overall configuration of the M operators in the PE array.
  • the dynamic configuration includes other configurations in the overall configuration except the static configuration.
  • the dynamic configuration is the functional configuration of PE00, PE02, PE11, PE20, and PE22.
  • the dynamic configuration corresponding to operator 1 is: the function configurations of PE00, PE02, PE20, and PE22 are all multiplication, and the function configuration of PE11 is addition
  • the dynamic configuration corresponding to operator 2 is: the functional configurations of PE00, PE02, PE20 and PE22 are all multiplication, and the functional configuration of PE11 is subtraction;
  • operator 3 corresponds to 5 dynamic configurations, of which 4 dynamic configurations are: the functional configurations of PE00, PE02, PE20 and PE22 are all multiplication, and the functional configuration of PE11 is addition;
  • the remaining one of the 5 dynamic configurations corresponding to operator 3 is: the functional configurations of PE00, PE02, PE11, PE20 and PE22 are all addition.
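  • A minimal sketch of determining the dynamic configurations, assuming the same dictionary representation as above: the dynamic configuration of an operator is whatever its overall configuration requires beyond the shared static configuration (here only the functional configurations left unconstrained by the static configuration; routing is omitted for brevity).

```python
# Minimal sketch; routing is omitted and the dictionary representation is assumed.
static = {"PE01": {"func": "ADD"}, "PE21": {"func": "ADD"}}   # shared static functions

def dynamic_config(overall, static):
    dyn = {}
    for pe, cfg in overall.items():
        extra = {k: v for k, v in cfg.items() if static.get(pe, {}).get(k) != v}
        if extra:
            dyn[pe] = extra
    return dyn

overall_op1 = {"PE00": {"func": "MUL"}, "PE02": {"func": "MUL"},
               "PE20": {"func": "MUL"}, "PE22": {"func": "MUL"},
               "PE11": {"func": "ADD"},
               "PE01": {"func": "ADD"}, "PE21": {"func": "ADD"}}
overall_op2 = dict(overall_op1, PE11={"func": "SUB"})

dyn_op1 = dynamic_config(overall_op1, static)   # PE00/PE02/PE20/PE22: MUL, PE11: ADD
dyn_op2 = dynamic_config(overall_op2, static)   # same as operator 1, but PE11: SUB
assert dyn_op1["PE11"] == {"func": "ADD"} and dyn_op2["PE11"] == {"func": "SUB"}
assert "PE01" not in dyn_op1                    # already covered by the static configuration
```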
  • the dynamic configuration also includes the configuration of at least one PE other than the N PEs in the PE array.
  • For example, data flow diagram 4 corresponding to operator 4 is based on the isomorphism feature and the static configuration shown in Figure 2-10; in this case, the dynamic configuration corresponding to operator 4 can also include the configuration of PE10: the route of PE10 is configured to point to PE11, and the function of PE10 is configured as addition.
  • the processing module stores the mapping relationship between the static configuration and the index number in the storage module.
  • Specifically, the processing module can generate an index number for the static configuration, and store the mapping relationship between the index number and the static configuration in the storage module.
  • the storage module can store the mapping relationship through the template lib.
  • For example, there are two static configurations, namely static configuration 1 and static configuration 2.
  • the processing module can generate two index numbers, namely index number 1 and index number 2.
  • index number 1 has a mapping relationship with static configuration 1.
  • Index number 2 has a mapping relationship with static configuration 2, and the mapping relationship between index number and static configuration is stored in the template lib of the storage module.
  • the template lib of the storage module is shown in Table 1:
  • The items under the idx column are index numbers, and the items under the cfg column are the static configurations of the data links.
  • the processing module transmits the configuration word to the PE array.
  • the configuration word includes static configuration and at least one dynamic configuration among M dynamic configurations.
  • Operator 1 corresponds to 1 configuration word
  • operator 2 corresponds to 1 configuration word
  • operator 3 corresponds to 5 configuration words.
  • the dynamic configurations of the first four configuration words are the same, and only the dynamic configuration of the fifth one is different.
  • the configuration word may also include the number of configuration copies.
  • the number of configuration copies is used to indicate the number of configurations based on the configuration word.
  • Therefore, multiple configuration words with the same static configuration and the same dynamic configuration can be abbreviated as one configuration word to further reduce the transmission overhead.
  • Table 3 shows an example of the configuration words of the three operators in the embodiment of the present application.
  • Each configuration word shown in Table 3 is different, that is, one configuration word corresponds to one reconfiguration cycle.
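  • For illustration only, the configuration-count optimization can be sketched as a simple run-length merge of consecutive configuration words that share the same index number and dynamic configuration; the field names and the seven-word sequence below are assumptions based on the example above.

```python
# Illustrative run-length merge; field names and the word sequence are assumptions.
def compress(words):
    out = []
    for w in words:
        if out and out[-1]["idx"] == w["idx"] and out[-1]["dynamic"] == w["dynamic"]:
            out[-1]["count"] += 1
        else:
            out.append({"idx": w["idx"], "dynamic": w["dynamic"], "count": 1})
    return out

# Operators 1-3 need 1 + 1 + 5 reconfiguration cycles; the first four cycles of
# operator 3 are identical, so the seven configuration words compress to four.
words = ([{"idx": 0, "dynamic": "dyn_op1"}] +
         [{"idx": 0, "dynamic": "dyn_op2"}] +
         [{"idx": 0, "dynamic": "dyn_op3_a"}] * 4 +
         [{"idx": 0, "dynamic": "dyn_op3_b"}])
compressed = compress(words)
assert len(compressed) == 4 and compressed[2]["count"] == 4
```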
  • The dynamic configuration includes the configurations of the N PEs other than the static configuration, and may also include the configuration of at least 1 PE other than the N PEs, so that the dynamic configuration can also cover other PEs of the PE array beyond the N PEs, enhancing applicability.
  • Figure 2-11 shows data flow diagram 4 corresponding to operator 4.
  • the isomorphism feature shown in Figure 2-5 can only be used as the local structure of the data flow diagram 4 of operator 4.
  • The configuration of the remaining node can correspond to PE10, and the configuration of PE10 is the configuration of at least 1 PE other than the N PEs.
  • Illustratively, Table 4-1 and Table 4-2 show examples of the configuration words of the three operators in the embodiment of this application.
  • the dynamic configuration is divided into the Cfg_operation_list part and the other cfg part, where the Cfg_operation_list part is the configuration of the N PEs other than the static configuration, and the other cfg part is the configuration of at least 1 PE other than the N PEs.
  • the configuration word includes an index number and at least one dynamic configuration among the M dynamic configurations.
  • the index number represents the static configuration, which effectively reduces transmission overhead and improves transmission efficiency.
  • Table 5-1, Table 5-2, Table 5-3 and Table 5-4 show examples of the configuration words corresponding to the three operators in the embodiment of the present application.
  • the processing module can transmit the configuration word to the storage module, and then the storage module stores the configuration word through the built-in config ram and transmits the configuration word to the PE array.
  • For example, the processing module can transmit all configuration words to the storage module at one time, and the storage module sequentially transmits the configuration words to the PE array according to certain rules, one configuration word at a time.
  • the configuration words received by the storage module but not yet transmitted to the PE array can be stored in the config ram. Since the configuration word includes the index number rather than the static configuration itself, storage requirements are greatly reduced.
  • the PE array obtains the static configuration that has a mapping relationship with the index number from the storage module.
  • the cfg buffer in the PE array can obtain static configuration from the template lib of the storage module based on the index number.
  • the cfg buffer in the PE array requests a static configuration from the storage module based on the index number.
  • the storage module determines the static configuration from the template lib based on the index number and mapping relationship, and returns the static configuration to the PE array.
  • When the PE array receives a new configuration word, it checks the index number. If the index number is the same as the index number in the previously received configuration word, the PE array does not need to obtain the static configuration from the storage module; instead, it reuses the static configuration of the previous configuration word and only needs to switch the dynamic configuration, which reduces transmission overhead.
  • For example, the cfg buffer can be divided into two separate storage spaces, namely storage space 1 and storage space 2, where storage space 1 is used to store the static configuration and storage space 2 is used to store the dynamic configuration.
  • the cfg buffer in the PE array receives configuration word 1, configuration word 2, and configuration word 3 sequentially transmitted by the storage module.
  • Configuration word 1 includes the index number and dynamic configuration dynamic0, configuration word 2 includes the index number and dynamic configuration dynamic1, and configuration word 3 includes the index number and dynamic configuration dynamic2.
  • When the cfg buffer in the PE array receives configuration word 1, it obtains the static configuration from the template lib in the storage module based on the index number, and stores the static configuration and the dynamic configuration dynamic0.
  • When the cfg buffer in the PE array receives configuration word 2, it can determine that the index number in configuration word 2 is the same as the index number in configuration word 1, so the static configuration does not need to be obtained from the storage module again; the cfg buffer only needs to switch the dynamic configuration dynamic0 to the dynamic configuration dynamic1 in configuration word 2.
  • When the cfg buffer in the PE array receives configuration word 3, it can determine that the index number in configuration word 3 is the same as the index number in configuration word 2, so the static configuration does not need to be obtained from the storage module again; the cfg buffer only needs to switch the dynamic configuration dynamic1 to the dynamic configuration dynamic2 in configuration word 3. Since only the dynamic configuration needs to be switched and the static configuration does not, switching overhead is reduced.
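  • The behaviour of the cfg buffer described above can be sketched as follows; the class, method and variable names are assumptions of this illustration, not part of the application.

```python
# Sketch of the configuration buffer: storage space 1 holds the static configuration,
# storage space 2 holds the current dynamic configuration; the static configuration is
# fetched from the template lib only when the index number in the configuration word changes.
class CfgBuffer:
    def __init__(self, template_lib):
        self.template_lib = template_lib   # index number -> static configuration
        self.static = None                 # storage space 1
        self.dynamic = None                # storage space 2
        self.last_idx = None
        self.static_fetches = 0            # how many times the static configuration was fetched

    def receive(self, word):
        if word["idx"] != self.last_idx:               # new index number: fetch the static part
            self.static = self.template_lib[word["idx"]]
            self.last_idx = word["idx"]
            self.static_fetches += 1
        self.dynamic = word["dynamic"]                 # the dynamic part is always switched

template_lib = {0: "static_cfg_shared_by_operators_1_to_3"}
buf = CfgBuffer(template_lib)
for word in ({"idx": 0, "dynamic": "dynamic0"},
             {"idx": 0, "dynamic": "dynamic1"},
             {"idx": 0, "dynamic": "dynamic2"}):
    buf.receive(word)
assert buf.static_fetches == 1 and buf.dynamic == "dynamic2"
```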
  • the PE array is configured based on the static configuration and at least one dynamic configuration among the M dynamic configurations.
  • When the chip executes the first operator among the M operators, the PE array is configured based on the static configuration and the first dynamic configuration corresponding to the first operator, where the first dynamic configuration is one of the M dynamic configurations;
  • when the chip executes the second operator among the M operators, the PE array switches the first dynamic configuration to the second dynamic configuration corresponding to the second operator.
  • the second operator is executed after the first operator is executed.
  • the second dynamic configuration is one of M dynamic configurations. Since only dynamic configuration is switched, there is no need to switch static configuration, which reduces switching overhead.
  • For example, operator 1, operator 2 and operator 3 correspond to the same static configuration.
  • Therefore, the cfg switcher in the processing module only needs to switch the dynamic configuration in the cfg buffer of the PE array and does not need to switch the static configuration, thus saving switching overhead.
  • a chip 300 provided by an embodiment of the present application includes:
  • a processing module 310 and a PE array 320, wherein:
  • the processing module 310 is used to: generate the isomorphism features of M operators, where the isomorphism features correspond to the same local structure among the M operators and M is a positive integer; determine the static configuration of N PEs in the PE array based on the isomorphism features, where N is a positive integer; and determine M dynamic configurations based on the static configuration and the overall configuration of the M operators in the PE array, the dynamic configurations being the configurations in the overall configuration other than the static configuration;
  • the PE array 320 is configured to configure based on the static configuration and at least one dynamic configuration among the M dynamic configurations.
  • The PE array 320 is specifically configured to: when executing the first operator among the M operators, configure itself based on the static configuration and the first dynamic configuration corresponding to the first operator, where the first dynamic configuration is one of the M dynamic configurations; and when executing the second operator among the M operators, switch the first dynamic configuration to the second dynamic configuration corresponding to the second operator, where the second operator is an operator executed after the first operator and the second dynamic configuration is one of the M dynamic configurations.
  • the processing module 310 is configured to obtain M data flow graphs corresponding to the M operators, and extract the isomorphism features according to the M data flow graphs.
  • the chip 300 further includes a storage module 330; the processing module 310 is also configured to transmit the mapping relationship between the static configuration and the index number to the storage module 330, and to transmit a configuration word to the PE array, where the configuration word includes the index number and at least one of the M dynamic configurations; the PE array 320 is also used to obtain the static configuration from the storage module 330 based on the index number.
  • An embodiment of the present application also provides a computer storage medium, wherein the computer storage medium stores a program, and the program executes some or all of the steps described in the above method embodiments.
  • An embodiment of the present application also provides a computer program product, wherein the computer program product stores a program, and the program executes some or all of the steps recorded in the above method embodiments.
  • the communication device 400 includes:
  • a receiver 401, a transmitter 402, a processor 403 and a memory 404, which may be connected through a bus or in other ways; in Figure 4, connection through a bus is taken as an example.
  • Memory 404 may include read-only memory and random access memory and provides instructions and data to processor 403 .
  • a portion of memory 404 may also include non-volatile random access memory (NVRAM).
  • NVRAM non-volatile random access memory
  • the memory 404 stores an operating system and operating instructions, executable modules or data structures, or a subset thereof, or an extended set thereof, where the operating instructions may include various operating instructions for implementing various operations.
  • the operating system may include various system programs that are used to implement various basic services and handle hardware-based tasks.
  • the processor 403 controls the operation of the communication device 400.
  • the processor 403 may also be called a central processing unit (CPU).
  • CPU central processing unit
  • various components of the communication device 400 are coupled together through a bus system, where in addition to a data bus, the bus system may also include a power bus, a control bus, a status signal bus, etc.
  • For the sake of clarity, the various buses are all labeled as the bus system in the figure.
  • the methods disclosed in the above embodiments of the present application can be applied to the processor 403 or implemented by the processor 403.
  • the processor 403 may include the chip described in Figure 3.
  • the steps of the method disclosed in conjunction with the embodiments of the present application can be directly implemented by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
  • the software module can be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other mature storage media in this field.
  • the storage medium is located in the memory 404.
  • the processor 403 reads the information in the memory 404 and completes the steps of the above method in combination with its hardware.
  • the receiver 401 can be used to receive input numeric or character information, and generate signal input related to the relevant settings and function control of the communication device 400.
  • The transmitter 402 can include a display device such as a display screen, and can be used to output numeric or character information through an external interface.
  • the processor 403 is configured to execute the configuration method of the processing unit PE array executed by the communication device 400 .
  • the device embodiments described above are only illustrative.
  • The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units.
  • the physical unit can be located in one place, or it can be distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • the connection relationship between modules indicates that there are communication connections between them, which can be specifically implemented as one or more communication buses or signal lines.
  • The technical solution of the present application, in essence, or the part that contributes to the existing technology, can be embodied in the form of a software product.
  • The computer software product is stored in a readable storage medium, such as a computer floppy disk, a U disk, a removable hard disk, a ROM, a RAM, a magnetic disk or an optical disk, and includes a number of instructions to cause a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods described in the various embodiments of this application.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device.
  • The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center by wired (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (such as infrared, radio, or microwave) means.
  • wired such as coaxial cable, optical fiber, digital subscriber line (DSL)
  • wireless such as infrared, wireless, microwave, etc.
  • The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media.
  • the available media may be magnetic media (eg, floppy disk, hard disk, tape), optical media (eg, DVD), or semiconductor media (eg, Solid State Disk (SSD)), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Power Engineering (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Stored Programmes (AREA)

Abstract

Embodiments of the present application disclose a method for configuring a processing element (PE) array and a related device, which are used to configure a PE array. The present application can be applied to a chip, and the chip includes a processing module and a PE array. The processing module generates isomorphism features of M operators, then determines a static configuration of N PEs in the PE array according to the isomorphism features, and determines M dynamic configurations based on the static configuration and an overall configuration of the M operators in the PE array, the dynamic configurations being the configurations in the overall configuration other than the static configuration. Consequently, the PE array can be configured based on the static configuration and one of the M dynamic configurations, without switching the static configuration of the PE array, which reduces the switching overhead.
PCT/CN2023/070594 2022-03-17 2023-01-05 Procédé de configuration d'un réseau d'éléments de traitement (pe) et dispositif associé WO2023173912A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210264327.4A CN116822595A (zh) 2022-03-17 2022-03-17 一种处理单元pe阵列的配置方法和相关设备
CN202210264327.4 2022-03-17

Publications (1)

Publication Number Publication Date
WO2023173912A1 true WO2023173912A1 (fr) 2023-09-21

Family

ID=88022168

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/070594 WO2023173912A1 (fr) 2022-03-17 2023-01-05 Procédé de configuration d'un réseau d'éléments de traitement (pe) et dispositif associé

Country Status (2)

Country Link
CN (1) CN116822595A (fr)
WO (1) WO2023173912A1 (fr)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107402745A (zh) * 2017-07-04 2017-11-28 清华大学 数据流图的映射方法及装置
US20210092174A1 (en) * 2019-09-23 2021-03-25 Netapp, Inc. Methods for dictionary-based compression and devices thereof

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107402745A (zh) * 2017-07-04 2017-11-28 清华大学 数据流图的映射方法及装置
US20210092174A1 (en) * 2019-09-23 2021-03-25 Netapp, Inc. Methods for dictionary-based compression and devices thereof

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YIN CHONG-YONG, YIN SHOU-YI, WEI SHAO-JUN: "Optimization of configuration contexts generated for reconfigurable media processor ", JOURNAL OF JILIN UNIVERSITY (ENGINEERING AND TECHNOLOGY EDITION), vol. 42, no. 04, 1 January 2012 (2012-01-01), pages 1059 - 1065, XP093090367 *
YIN SHOUYI, YIN CHONGYONG, LIU LEIBO, ZHU MIN, WEI SHAOJUN: "Configuration Context Reduction for Coarse-Grained Reconfigurable Architecture", IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, INFORMATION & SYSTEMS SOCIETY, TOKYO., JP, vol. E95-D, no. 2, 1 January 2012 (2012-01-01), JP , pages 335 - 344, XP093090377, ISSN: 0916-8532, DOI: 10.1587/transinf.E95.D.335 *

Also Published As

Publication number Publication date
CN116822595A (zh) 2023-09-29

Similar Documents

Publication Publication Date Title
Patarasuk et al. Bandwidth optimal all-reduce algorithms for clusters of workstations
US10938889B2 (en) Performing optimized collective operations in a irregular subcommunicator of compute nodes in a parallel computer
US20080022079A1 (en) Executing an allgather operation with an alltoallv operation in a parallel computer
US11398981B2 (en) Path creation method and device for network on chip and electronic apparatus
US11789733B2 (en) Instruction processing apparatus, acceleration unit, and server
US9246792B2 (en) Providing point to point communications among compute nodes in a global combining network of a parallel computer
Touzene On all-to-all broadcast in dense Gaussian network on-chip
US10476492B2 (en) Structures and operations of integrated circuits having network of configurable switches
US9390054B2 (en) Identifying a largest logical plane from a plurality of logical planes formed of compute nodes of a subcommunicator in a parallel computer
US9330059B2 (en) Identifying logical planes formed of compute nodes of a subcommunicator in a parallel computer
US8296457B2 (en) Providing nearest neighbor point-to-point communications among compute nodes of an operational group in a global combining network of a parallel computer
US9769112B2 (en) Optimising data transmission in a hypercube network
KR102238600B1 (ko) 스케쥴러 컴퓨팅 장치, 그것을 포함하는 분산 컴퓨팅 시스템의 데이터 노드 및 그것의 방법
WO2023173912A1 (fr) Procédé de configuration d'un réseau d'éléments de traitement (pe) et dispositif associé
US20220343144A1 (en) Server and accelerator for neural network computations
CN112448853B (zh) 一种网络拓扑图优化方法、终端设备及存储介质
Zhao et al. A novel energy-aware multi-task dynamic mapping heuristic of NoC-based MPSoCs
US11223703B2 (en) Instruction initialization in a dataflow architecture
Ueno et al. VCSN: Virtual circuit-switching network for flexible and simple-to-operate communication in HPC FPGA cluster
Larson et al. The möbius cubes
Rettkowski et al. Application-specific processing using high-level synthesis for networks-on-chip
US11954053B2 (en) Integrating buffer views into buffer access operations in a coarse-grained reconfigurable computing environment
Touzene All-to-all broadcast in hexagonal torus networks on-chip
WO2022029926A1 (fr) Système informatique et procédé de traitement de calcul
US20240241844A1 (en) Method and System for Integrating Buffer Views into Buffer Access Operations in Reconfigurable Computing Environments

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23769423

Country of ref document: EP

Kind code of ref document: A1