WO2023173912A1 - Configuration method for processing element (pe) array and related device - Google Patents

Configuration method for processing element (PE) array and related device

Info

Publication number
WO2023173912A1
Authority
WO
WIPO (PCT)
Prior art keywords
configuration
dynamic
array
operator
operators
Prior art date
Application number
PCT/CN2023/070594
Other languages
French (fr)
Chinese (zh)
Inventor
张鑫
蔡兆晖
何雷骏
邵芳琳
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Publication of WO2023173912A1 publication Critical patent/WO2023173912A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06G ANALOGUE COMPUTERS
    • G06G7/00 Devices in which the computing operation is performed by varying electric or magnetic quantities
    • G06G7/12 Arrangements for performing computing operations, e.g. operational amplifiers
    • G06G7/14 Arrangements for performing computing operations, e.g. operational amplifiers for addition or subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06G ANALOGUE COMPUTERS
    • G06G7/00 Devices in which the computing operation is performed by varying electric or magnetic quantities
    • G06G7/12 Arrangements for performing computing operations, e.g. operational amplifiers
    • G06G7/16 Arrangements for performing computing operations, e.g. operational amplifiers for multiplication or division
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • the present application relates to the field of chip technology, and in particular to a configuration method of a processing element (PE) array and related equipment.
  • PE processing element
  • A coarse-grained reconfigurable array (CGRA) chip is a new generation of programmable acceleration architecture that combines the flexibility of a field programmable gate array (FPGA) chip with the high energy efficiency of an application specific integrated circuit (ASIC) chip. The PE array in the CGRA chip is configured through configuration words, allowing the CGRA chip to execute the corresponding algorithm.
  • FPGA field programmable gate array
  • ASIC application specific integrated circuit
  • the CGRA chip can configure the PE array according to the program's operators to obtain the configured PE array. Then, when the CGRA chip executes the operator, it only needs to transmit the service data to the PE array without switching the configuration of the PE array.
  • the PE array can execute the operator based on the service data.
  • Embodiments of the present application provide a PE array configuration method and related equipment for configuring the PE array.
  • the first aspect of this application provides a method for configuring a processing element (PE) array, which can be applied to a chip.
  • the chip includes a processing module and a PE array.
  • the processing module generates the isomorphism features of M operators, where M is a positive integer, then determines the static configuration of N PEs in the PE array based on the isomorphism features, where N is a positive integer, and determines M dynamic configurations based on the static configuration and the overall configuration of the M operators in the PE array, where each dynamic configuration includes the configuration in the overall configuration other than the static configuration.
  • the PE array can be configured based on the static configuration and one of the M dynamic configurations, that is, only the dynamic configuration needs to be switched, and the static configuration does not need to be switched, which reduces the switching overhead.
  • the step of configuring the PE array based on the static configuration and at least one dynamic configuration among the M dynamic configurations may include: when executing a first operator among the M operators, the PE array is configured based on the static configuration and a first dynamic configuration corresponding to the first operator, where the first dynamic configuration is one of the M dynamic configurations; when executing a second operator among the M operators, the PE array switches the first dynamic configuration to a second dynamic configuration corresponding to the second operator, where the second operator is an operator executed after the first operator and the second dynamic configuration is one of the M dynamic configurations. The PE array therefore only needs to switch its dynamic configuration and does not need to switch the static configuration, which reduces the switching overhead.
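  • The switching behaviour described here can be illustrated with a minimal Python sketch. The class `PEArray`, its `writes` counter and the dictionary layout keyed by `(PE name, field)` are hypothetical names chosen for illustration and do not come from the application; the sketch only counts how many PE settings are rewritten when the static part is loaded once and only the dynamic part is switched per operator.

```python
from dataclasses import dataclass, field


@dataclass
class PEArray:
    """Toy model of a PE array that counts how many PE settings are rewritten."""
    static: dict = field(default_factory=dict)   # loaded once
    dynamic: dict = field(default_factory=dict)  # replaced per operator
    writes: int = 0

    def load_static(self, static_cfg: dict) -> None:
        self.static = dict(static_cfg)
        self.writes += len(static_cfg)

    def switch_dynamic(self, dynamic_cfg: dict) -> None:
        # Only the dynamic part is replaced; the static part is left untouched.
        self.dynamic = dict(dynamic_cfg)
        self.writes += len(dynamic_cfg)

    def overall(self) -> dict:
        # Overall configuration = static configuration + current dynamic configuration.
        return {**self.static, **self.dynamic}


# Hypothetical example values: two PEs fixed to addition, one PE that changes.
static_cfg = {("PE01", "func"): "ADD", ("PE21", "func"): "ADD"}
dyn_first = {("PE11", "func"): "ADD"}    # first operator
dyn_second = {("PE11", "func"): "SUB"}   # second operator

array = PEArray()
array.load_static(static_cfg)
array.switch_dynamic(dyn_first)
array.switch_dynamic(dyn_second)

full_reload_writes = 2 * (len(static_cfg) + len(dyn_first))
print(array.writes, "writes instead of", full_reload_writes)
```

  • In this toy model the two operators cost four setting writes in total, whereas reloading the overall configuration for each operator would cost six.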
  • before the processing module generates the isomorphism features of the M operators, the method may also include: the processing module obtains M data flow graphs corresponding to the M operators. The processing module can then extract the isomorphism features from the M data flow graphs, thereby obtaining the isomorphism features of the M operators.
  • the chip further includes a storage module.
  • before the PE array is configured based on the static configuration and at least one of the M dynamic configurations, the method may also include: the storage module stores the mapping relationship between the static configuration and an index number; the processing module transmits a configuration word to the PE array, where the configuration word includes the index number and at least one dynamic configuration among the M dynamic configurations; and the PE array obtains, from the storage module, the static configuration that has a mapping relationship with the index number.
  • when the processing module transmits multiple configuration words, it does not need to transmit the static configuration directly but replaces it with the index number, which reduces the transmission overhead and improves the transmission efficiency of the configuration words.
  • the configuration word also includes a configuration number, and the configuration number is used to indicate the number of times configuration is performed based on the configuration word.
  • Multiple configuration words with the same static configuration and the same dynamic configuration can be abbreviated as one configuration word to further reduce transmission overhead and improve transmission efficiency.
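  • As a rough illustration of a configuration word that carries an index number, a dynamic configuration and a configuration count, the following Python sketch merges consecutive identical words into one. The field names `static_index`, `dynamic` and `count`, and the helper `merge_repeats`, are assumptions made for this example; the application does not specify an encoding.

```python
from typing import NamedTuple


class ConfigWord(NamedTuple):
    static_index: int   # index number standing in for the static configuration
    dynamic: tuple      # dynamic configuration, carried in full
    count: int = 1      # number of times configuration is performed with this word


def merge_repeats(words):
    """Collapse consecutive words with the same index and dynamic part into one word."""
    merged = []
    for word in words:
        if merged and merged[-1][:2] == word[:2]:
            last = merged[-1]
            merged[-1] = last._replace(count=last.count + word.count)
        else:
            merged.append(word)
    return merged


# Hypothetical stream: four identical words followed by a different one.
words = [ConfigWord(0, ("MUL", "MUL", "MUL", "MUL", "ADD"))] * 4
words.append(ConfigWord(0, ("ADD", "ADD", "ADD", "ADD", "ADD")))

merged = merge_repeats(words)
for word in merged:
    print(word)
```

  • Five words shrink to two here, the first carrying a count of 4.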
  • the isomorphism feature includes the routing configuration of each node among N nodes, where any two of the N nodes are directly or indirectly connected, and the static configuration includes the routing configuration of the N PEs. The routing configuration of the N PEs therefore does not need to be modified, which reduces switching overhead.
  • the isomorphism feature also includes the functional configuration of at least 1 node among the N nodes, and the static configuration also includes the functional configuration of at least 1 PE among the N PEs. The functional configuration of that at least one PE therefore does not need to be modified, which reduces the switching overhead.
  • the chip also includes a MEM interface.
  • the MEM interface can obtain the source code of the program and transmit it to the processing module, so that the processing module can generate a data flow graph for each of the M operators based on the source code, obtaining M data flow graphs.
  • the processing module can also extract isomorphism features from a single data flow graph, in which case the isomorphism feature is a local structure that appears at least twice in the data flow graph. Reusing the N PEs in the PE array that correspond to the isomorphism feature reduces the number of required PEs and enhances usability.
  • the dynamic configuration also includes the configuration of at least 1 PE other than the N PEs, so the configuration word can also be applied to operators that cannot be composed of an integer number of isomorphism features, which enhances the applicability of the configuration word.
  • isomorphism features can also be local structures of multiple different granularities, so that the processing module can determine isomorphism features of different granularities as needed in different circumstances, reducing the number of required PEs and enhancing usability.
  • the storage module includes a configuration random access memory (config RAM) and a static configuration template library (template lib), where the config RAM is used to store configuration words and the template lib is used to store the mapping relationship between index numbers and static configurations. If more than one configuration word has the same static configuration, only one copy of the static configuration needs to be stored in the template lib and only the index number needs to be stored in the config RAM. Compared with storing the static configuration for each configuration word, this reduces storage overhead.
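  • The division of labour between the config RAM and the template lib can be sketched as follows. This is an illustrative Python model only: the class `StorageModule`, its methods `add_word` and `resolve`, and the tuple-based configurations are hypothetical and are not taken from the application.

```python
class StorageModule:
    """Toy model: template lib holds each static configuration once, config RAM holds words."""

    def __init__(self):
        self.template_lib = {}   # index number -> static configuration
        self.config_ram = []     # configuration words: (index number, dynamic configuration)
        self._index_of = {}      # static configuration -> index number (lookup helper)

    def add_word(self, static_cfg: tuple, dynamic_cfg: tuple) -> int:
        index = self._index_of.get(static_cfg)
        if index is None:
            # First time this static configuration is seen: store it once.
            index = len(self.template_lib)
            self.template_lib[index] = static_cfg
            self._index_of[static_cfg] = index
        self.config_ram.append((index, dynamic_cfg))
        return index

    def resolve(self, position: int) -> tuple:
        """Return (static, dynamic) for the configuration word at the given position."""
        index, dynamic_cfg = self.config_ram[position]
        return self.template_lib[index], dynamic_cfg


# Hypothetical configurations: one shared static part, three dynamic parts.
static_cfg = (("PE01", "ADD"), ("PE21", "ADD"))
storage = StorageModule()
storage.add_word(static_cfg, (("PE11", "ADD"),))
storage.add_word(static_cfg, (("PE11", "SUB"),))
storage.add_word(static_cfg, (("PE11", "ADD"),))

print(len(storage.template_lib), "static configuration stored for",
      len(storage.config_ram), "configuration words")
```

  • Each configuration word then only carries a small index, and the static configuration is looked up when the word is consumed.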
  • config RAM configuration random access memory
  • template lib template library
  • the M operators can be all operators in a program or part of the operators in the program, so that the chip can determine one or more different isomorphism features for a program as needed.
  • a second aspect of the present application provides a chip, which is used to perform the method described in any one of the foregoing first aspects.
  • a third aspect of the present application provides a computer-readable storage medium.
  • the computer-readable storage medium stores instructions that, when run on a computer, cause the computer to execute the method described in any one of the above-mentioned first aspects.
  • a fourth aspect of the present application provides a computer program product.
  • the computer program product includes computer-executable instructions.
  • the computer-executable instructions are stored in a computer-readable storage medium.
  • the processor of the device can read the computer-executable instructions from the computer-readable storage medium.
  • the processor executes the computer-executable instructions to cause the device to implement the method provided by the above-mentioned first aspect or any possible implementation of the first aspect.
  • a fifth aspect of the present application provides a communication device, which may include a processor, a memory, and a communication interface.
  • the processor is coupled to the memory and the communication interface.
  • the memory is used to store instructions
  • the processor is used to execute the instructions
  • the communication interface is used to communicate with other communication devices under the control of the processor.
  • the instruction causes the processor to execute the method of the first aspect or any possible implementation of the first aspect.
  • Figure 1-1 is a schematic diagram of the structure of the PE array.
  • Figure 1-2 is a schematic diagram of the data link in the embodiment of this application.
  • Figure 1-3 is a schematic diagram of the data link in the embodiment of this application.
  • Figure 1-4 is a schematic diagram of a chip provided by an embodiment of the present application.
  • Figure 2-1 is a schematic flowchart of Embodiment 1 of a PE array configuration method provided by the embodiment of this application.
  • Figure 2-2 is a schematic diagram of data flow diagram 1 in the embodiment of the present application.
  • Figure 2-3 is a schematic diagram of data flow diagram 2 in the embodiment of the present application.
  • Figure 2-4 is a schematic diagram of data flow diagram 3 in the embodiment of the present application.
  • Figure 2-5 is a schematic diagram of the isomorphism feature in the embodiment of the present application.
  • Figure 2-6 is a schematic diagram of data flow diagram 3 divided into multiple local structures in the embodiment of the present application.
  • Figure 2-7 is another schematic diagram of the isomorphism feature in the embodiment of the present application.
  • Figure 2-8 is another schematic diagram of the isomorphism feature in the embodiment of the present application.
  • Figure 2-9 is another schematic diagram of the isomorphism feature in the embodiment of the present application.
  • Figure 2-10 is a schematic diagram of the static configuration in the embodiment of the present application.
  • Figure 2-11 is a schematic diagram of data flow diagram 4 in the embodiment of the present application.
  • Figure 2-12 is a schematic diagram of dividing the configuration buffer (cfg buffer) into two separate storage spaces in the embodiment of the present application.
  • Figure 2-13 is a schematic diagram of the cfg buffer sequentially receiving three configuration words transmitted by the configuration random access memory (config RAM) in the embodiment of the present application.
  • Figure 3 is a schematic structural diagram of a PE array configuration device provided by an embodiment of the present application.
  • Figure 4 is a schematic structural diagram of a communication device provided by an embodiment of the present application.
  • Embodiments of the present application provide a PE array configuration method and related equipment for configuring the PE array of processing units in a CGRA chip.
  • the CGRA chip is a new generation of programmable acceleration architecture that combines the flexibility of FPGA chips with the high energy efficiency of ASIC chips.
  • the CGRA chip has a built-in PE array, which includes multiple PEs.
  • the PE array is used to execute algorithms. It should be noted that PE is composed of multiple logic gates, which are used to perform corresponding operations, such as addition, subtraction, multiplication, division, etc. Users can configure at least one PE of the PE array in the CGRA chip through the configuration word, so that the CGRA chip can execute the corresponding algorithm.
  • connected PE00 and PE01 can directly transfer data to each other.
  • a corresponding data link is formed, which can be used to execute the corresponding algorithm, for example, the data link shown in Figure 1-2.
  • the CGRA chip can configure the PE array according to the operator of the program and obtain the data link corresponding to the operator.
  • the data link includes the configuration of multiple PEs in the PE array. Then, when the CGRA chip executes the operator, it only needs to transmit the service data to the data link corresponding to the operator without switching the configuration of the PE in the data link.
  • the data link corresponding to operator 2 is shown in Figure 1-3, where SUB is subtraction.
  • PE11 changes from addition (ADD) to subtraction (SUB).
  • the CGRA chip still needs to perform an overall configuration switch on the N PEs.
  • this application proposes a PE array configuration method and related equipment for configuring the PE array.
  • This application can be applied to chips, which include processing modules and PE arrays.
  • the processing module generates the isomorphism features of M operators, where M is a positive integer, then determines the static configuration of N PEs in the PE array based on the isomorphism features, where N is a positive integer, and determines M dynamic configurations based on the static configuration and the overall configuration of the M operators in the PE array, where each dynamic configuration is the configuration in the overall configuration other than the static configuration.
  • the PE array can be configured based on the static configuration and one of the M dynamic configurations, that is, only the dynamic configuration needs to be switched, and the static configuration does not need to be switched, which reduces the switching overhead.
  • the present application can be applied to the chip 100 shown in Figures 1-4, where the chip 100 includes a memory (memory, MEM) interface 110, a processing module 120, a storage module 130 and a PE array 140. It should be noted that the chip 100 can be an FPGA chip, a CGRA chip, or other reconfigurable chips, which is not limited here.
  • MEM memory
  • the MEM interface 110 is an interface through which internal devices of the chip 100 interact with external devices.
  • the MEM interface 110 can receive the source code and business data of the program from devices external to the chip 100, and transmit the source code of the program to the processing module 120 and the business data to the storage module 130.
  • the processing module 120 may have a built-in compiler (compiler) 121, where the compiler 121 is a logic module.
  • the compiler 121 can be used to: generate the isomorphism features of M operators based on the source code of the program, determine the static configuration of N PEs in the PE array based on the isomorphism features, and determine M dynamic configurations based on the static configuration and the overall configuration of the M operators in the PE array, where each dynamic configuration includes the configuration in the overall configuration other than the static configuration.
  • the compiler 121 may store the static configuration and at least one dynamic configuration among the M dynamic configurations in the storage module 130.
  • the compiler 121 can forward the static configuration and at least one dynamic configuration among the M dynamic configurations to the storage module 130 through the MEM interface 110.
  • the compiler 121 can also be directly connected to the storage module, thereby directly forwarding the static configuration and at least one dynamic configuration among the M dynamic configurations to the storage module 130.
  • the storage module 130 may be a random access memory (RAM) built into the chip 100 .
  • the storage module 130 may transmit the static configuration and at least one dynamic configuration among the M dynamic configurations to the PE array 140, so that the PE array 140 configures based on the static configuration and at least one dynamic configuration among the M dynamic configurations.
  • RAM random access memory
  • the PE array 140 has a built-in configuration buffer (Cfg buffer) 141.
  • the Cfg buffer 141 can be used to receive the static configuration and at least one of the M dynamic configurations transmitted by the storage module 130, so that the PE array 140 is configured based on the static configuration and one of the M dynamic configurations.
  • the processing module 120 also includes a configuration switcher (Cfg switcher) 122, which can be used to switch the dynamic configuration in the Cfg buffer 141.
  • the compiler 121 can store the static configuration and at least one dynamic configuration among the M dynamic configurations in the storage module 130. In some possible implementations, the compiler 121 also stores the mapping relationship between the static configuration and the index number in the storage module 130, and then transmits the configuration word to the PE array 140. The configuration word includes the index number and at least one dynamic configuration among the M dynamic configurations. It should be noted that the compiler 121 can send the configuration word to the storage module 130, and then the storage module 130 forwards the configuration word to the PE array 140. Alternatively, the compiler 121 can directly forward the configuration word to the PE array 140, which is not limited here.
  • the storage module 130 can be divided into multiple areas, namely configuration random access memory (config RAM) 131, static configuration template library (template lib) 132 and data random access memory (data RAM) 133.
  • config RAM configuration random access memory
  • template lib static configuration template library
  • data RAM data random access memory
  • the config RAM 131 is used to store the configuration word and transmit the configuration word to the PE array 140
  • the template lib 132 is used to store the mapping relationship between the static configuration and the index number, and to return the corresponding static configuration to the PE array 140 based on the index number in the configuration word.
  • the data RAM 133 is used to store business data and transmit it to the PE array 140.
  • the Cfg buffer 141 can receive the configuration word transmitted by the config RAM 131 of the storage module 130, obtain the static configuration from the template lib 132 of the storage module 130 based on the index number in the configuration word, and configure the PE array 140 based on the static configuration and one of the M dynamic configurations. The configured PE array 140 then computes on the business data to execute the corresponding operator.
  • the foregoing has introduced the chip 100.
  • Next, the PE array configuration method executed in the chip 100 is introduced. Please refer to Figure 2-1.
  • the method embodiment mainly includes the following steps:
  • the processing module generates M data flow graphs (DFG) corresponding to M operators based on the source code of the program, where M is a positive integer.
  • DFG data flow graphs
  • the chip can receive the source code and business data of the program through the MEM interface, and then the MEM interface transmits the source code of the program to the processing module and the business data to the storage module.
  • after the processing module receives the source code of the program, it can generate a data flow diagram corresponding to each of the M operators based on the source code of the program, obtaining M data flow diagrams.
  • a data flow graph includes the functional configuration and routing configuration of each node among multiple nodes.
  • the M operators may be all operators in a program, or may be part of the operators in the program, which is not limited here.
  • M = 3, that is, there are three operators, namely operator 1, operator 2 and operator 3.
  • operator 1 is used to calculate the multiplication and addition operations between 2×2 matrices: $A \times B + C \times D$
  • operator 2 is used to calculate the multiplication and subtraction operations between 2×2 matrices: $A \times B - C \times D$
  • Operator 3 is used to calculate the multiplication and addition operations between 4×4 matrices: $K_0 \times K_1 + K_2 \times K_3 + K_4 \times K_5 + K_6 \times K_7$
  • A, B, C and D are all 2×2 matrices.
  • $K_0$, $K_1$, $K_2$, $K_3$, $K_4$, $K_5$, $K_6$ and $K_7$ are all 4×4 matrices.
  • $E_1 = A \times B + C \times D$
  • $E_1$ has 4 elements: $E_{1,ij} = A_{i0} B_{0j} + A_{i1} B_{1j} + C_{i0} D_{0j} + C_{i1} D_{1j}$, where $i, j \in \{0, 1\}$.
  • Data flow diagram 1 includes 7 nodes, namely MUL0, MUL1, MUL2, MUL3, ADD0, ADD1, and ADD2.
  • the functional configurations of MUL0, MUL1, MUL2, and MUL3 are all multiplication, and the functional configurations of ADD0, ADD1, and ADD2 are all addition;
  • the routes of MUL0 and MUL1 are configured to point to ADD0, the routes of MUL2 and MUL3 are configured to point to ADD1, and the routes of ADD0 and ADD1 are configured to point to ADD2.
  • MUL0 is used to perform the operation $A_{i0} \times B_{0j}$
  • MUL1 is used to perform the operation $A_{i1} \times B_{1j}$
  • MUL2 is used to perform the operation $C_{i0} \times D_{0j}$
  • MUL3 is used to perform the operation $C_{i1} \times D_{1j}$
  • ADD0 is used to perform the operation MUL0+MUL1
  • ADD1 is used to perform the operation MUL2+MUL3
  • ADD2 is used to perform the operation ADD0+ADD1
  • $E_{1,ij}$ is obtained.
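  • To make the role of the seven nodes concrete, the following Python sketch evaluates one output element through MUL0 to MUL3 and ADD0 to ADD2 and applies it to every position of a 2×2 result. It assumes the usual row-by-column matrix product; the function names `dfg1`, `column` and `matmul_add` and the sample matrices are illustrative only.

```python
def dfg1(a_row, b_col, c_row, d_col):
    """Evaluate one element of A*B + C*D through the seven nodes of data flow diagram 1."""
    mul0 = a_row[0] * b_col[0]
    mul1 = a_row[1] * b_col[1]
    mul2 = c_row[0] * d_col[0]
    mul3 = c_row[1] * d_col[1]
    add0 = mul0 + mul1   # MUL0 and MUL1 route to ADD0
    add1 = mul2 + mul3   # MUL2 and MUL3 route to ADD1
    add2 = add0 + add1   # ADD0 and ADD1 route to ADD2
    return add2


def column(matrix, j):
    return [row[j] for row in matrix]


def matmul_add(a, b, c, d):
    """Apply the same data flow diagram to each (i, j) position of the 2x2 result."""
    return [[dfg1(a[i], column(b, j), c[i], column(d, j)) for j in range(2)]
            for i in range(2)]


# Sample 2x2 matrices chosen for this example.
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[1, 0], [0, 1]]
D = [[2, 3], [4, 5]]

E1 = matmul_add(A, B, C, D)
print(E1)
```

  • The same node structure is reused for all four elements; only the service data fed into it changes.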
  • $E_2 = A \times B - C \times D$
  • $E_2$ has 4 elements: $E_{2,ij} = (A_{i0} B_{0j} + A_{i1} B_{1j}) - (C_{i0} D_{0j} + C_{i1} D_{1j})$, where $i, j \in \{0, 1\}$.
  • Data flow diagram 2 includes 7 nodes, namely MUL0, MUL1, MUL2, MUL3, ADD0, ADD1, and SUB0.
  • the functional configurations of MUL0, MUL1, MUL2, and MUL3 are all multiplication, the functional configurations of ADD0 and ADD1 are both addition, and the functional configuration of SUB0 is subtraction; the routes of MUL0 and MUL1 are configured to point to ADD0, the routes of MUL2 and MUL3 are configured to point to ADD1, and the routes of ADD0 and ADD1 are configured to point to SUB0.
  • MUL0 is used to perform the operation $A_{i0} \times B_{0j}$
  • MUL1 is used to perform the operation $A_{i1} \times B_{1j}$
  • MUL2 is used to perform the operation $C_{i0} \times D_{0j}$
  • MUL3 is used to perform the operation $C_{i1} \times D_{1j}$
  • ADD0 is used to perform the operation MUL0+MUL1
  • ADD1 is used to perform the operation MUL2+MUL3
  • SUB0 is used to perform the operation ADD0-ADD1
  • $E_{2,ij}$ is obtained.
  • $E_3 = K_0 \times K_1 + K_2 \times K_3 + K_4 \times K_5 + K_6 \times K_7$, so $E_3$ has 16 elements:
  • $E_{3,ij} = (K_{0,i0} K_{1,0j} + K_{0,i1} K_{1,1j} + K_{0,i2} K_{1,2j} + K_{0,i3} K_{1,3j}) + (K_{2,i0} K_{3,0j} + K_{2,i1} K_{3,1j} + K_{2,i2} K_{3,2j} + K_{2,i3} K_{3,3j}) + (K_{4,i0} K_{5,0j} + K_{4,i1} K_{5,1j} + K_{4,i2} K_{5,2j} + K_{4,i3} K_{5,3j}) + (K_{6,i0} K_{7,0j} + K_{6,i1} K_{7,1j} + K_{6,i2} K_{7,2j} + K_{6,i3} K_{7,3j})$
  • the processing module can generate the data flow diagram 3 shown in Figure 2-4 based on the source code of operator 3.
  • data flow diagram 3 includes 31 nodes, namely MUL0 to MUL15 and ADD0 to ADD14.
  • the functional configurations of MUL0 to MUL15 are all multiplication, and the functional configurations of ADD0 to ADD14 are all addition; the routing configurations of MUL0 to MUL15 and ADD0 to ADD14 are shown in Figure 2-4 and will not be described in detail here.
  • ADD14 is used to perform the operation of ADD13+ADD12, and finally obtain the value of E 3 ij.
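  • The node count of data flow diagram 3 follows from summing 16 products pairwise. The short Python sketch below, with the hypothetical helper `count_nodes`, counts the multiplication and addition nodes of such a pairwise reduction; it is a sanity check on the arithmetic, not a description of how the processing module builds the diagram.

```python
def count_nodes(num_products: int) -> tuple:
    """Count MUL and ADD nodes needed to sum `num_products` products pairwise."""
    muls = num_products
    adds = 0
    level = num_products
    while level > 1:
        pairs, leftover = divmod(level, 2)
        adds += pairs             # one ADD node per pair at this level
        level = pairs + leftover  # values passed on to the next level
    return muls, adds


muls, adds = count_nodes(16)
print(muls, "MUL nodes +", adds, "ADD nodes =", muls + adds, "nodes")
```

  • For 16 products this gives 16 multiplication nodes and 8 + 4 + 2 + 1 = 15 addition nodes, 31 in total; for the 4 products of data flow diagram 1 it gives 4 and 3, matching its 7 nodes.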
  • the processing module extracts isomorphic features based on the M data flow graphs, and the isomorphic features correspond to the same local structures among the M operators.
  • the isomorphism feature may include the routing configuration of each node among N nodes, where any two of the N nodes are directly or indirectly connected. In some possible implementations, the isomorphism feature also includes the functional configuration of at least one node among the N nodes. For example, the isomorphism feature determined from data flow diagram 1 shown in Figure 2-2 and data flow diagram 2 shown in Figure 2-3 can be as shown in Figure 2-5.
  • This isomorphism feature includes 7 nodes, namely a, b, c, d, e, f, and g, any two of which are directly or indirectly connected. The routes of a and b are configured to point to e, the routes of c and d are configured to point to f, and the routes of e and f are configured to point to g.
  • the isomorphism feature also includes the functional configuration of at least one node among the seven nodes: the functional configurations of a, b, c, and d are all multiplication, the functional configurations of e and f are both addition, and the functional configuration of g is not limited.
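  • One simple way to picture the extraction for data flow diagrams 1 and 2 is sketched below in Python. It assumes the nodes of the graphs have already been aligned under the common names a to g, which sidesteps the general graph-matching problem; the helper `extract_feature`, the `"ANY"` marker and the dictionary layout are hypothetical and are not the application's algorithm.

```python
ANY = "ANY"  # marker for a functional configuration that is not limited


def extract_feature(dfgs):
    """dfgs: list of dicts {node: (function, route target)} with aligned node names."""
    first = dfgs[0]
    feature = {}
    for node, (func, target) in first.items():
        # A node belongs to the feature only if its routing agrees in every graph.
        if any(g.get(node, (None, object()))[1] != target for g in dfgs):
            continue
        # The function is kept only when it is identical in every graph.
        same_func = all(g[node][0] == func for g in dfgs)
        feature[node] = (func if same_func else ANY, target)
    return feature


# Data flow diagrams 1 and 2 with nodes renamed a..g (g is ADD2 or SUB0).
dfg_1 = {"a": ("MUL", "e"), "b": ("MUL", "e"), "c": ("MUL", "f"), "d": ("MUL", "f"),
         "e": ("ADD", "g"), "f": ("ADD", "g"), "g": ("ADD", None)}
dfg_2 = dict(dfg_1, g=("SUB", None))

feature = extract_feature([dfg_1, dfg_2])
for node, cfg in feature.items():
    print(node, cfg)
```

  • With these inputs the routing of all seven nodes is shared, a to d keep multiplication, e and f keep addition, and g is left unrestricted.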
  • the processing module can also extract isomorphism features from a single data flow graph.
  • in that case, the isomorphism feature is a local structure that appears at least twice in the data flow graph.
  • reusing the N PEs in the PE array that correspond to the isomorphism feature reduces the number of required PEs and enhances usability.
  • the processing module can divide data flow diagram 3 shown in Figure 2-4 into the 5 local structures with similar structures shown in Figure 2-6. Based on these 5 local structures, the isomorphism feature shown in Figure 2-7 can be extracted.
  • This isomorphism feature includes 7 nodes, namely a, b, c, d, e, f, and g.
  • the routes of a and b are configured to point to e, the routes of c and d are configured to point to f, and the routes of e and f are configured to point to g.
  • the functional configurations of a, b, c, and d are not limited, and the functional configurations of e, f, and g are all addition.
  • the processing module can extract the isomorphism feature shown in Figure 2-8 based on data flow diagram 1, data flow diagram 2 and data flow diagram 3.
  • This isomorphism feature includes 7 nodes, namely a, b, c, d, e, f, and g.
  • the routes of a and b are configured to point to e, the routes of c and d are configured to point to f, and the routes of e and f are configured to point to g.
  • the functional configurations of a, b, c, d, and g are not limited, and the functional configurations of e and f are both addition.
  • isomorphism features can also be local structures of multiple different granularities, so that the chip can determine isomorphism features of different granularities as needed in different circumstances, reducing the number of required PEs and enhancing usability.
  • For example, based on data flow diagram 1, data flow diagram 2 and data flow diagram 3, the isomorphism feature shown in Figure 2-9 can be extracted.
  • This isomorphism feature includes three nodes, namely a, b, and c, where the routes of a and b are configured to point to c, and the functional configurations of a, b, and c are not limited.
  • the isomorphism feature shown in Figure 2-9 has a smaller granularity than the isomorphism feature shown in Figure 2-8.
  • the M operators can be all operators in a program or part of the operators in the program, so that the chip can determine one or more different isomorphism features for a program as needed.
  • For example, the processing module can extract isomorphism feature 1 based on data flow diagrams 1, 2 and 3, and extract isomorphism feature 2 based on data flow diagrams 4, 5 and 6.
  • the above steps 201-202 are optional, as long as the processing module can generate isomorphic features of M operators, there is no limitation here.
  • the chip can determine the isomorphism characteristics based on the calculation formula of M operators, which is not limited here.
  • the processing module determines the static configuration of N PEs in the PE array based on the isomorphism characteristics, where N is a positive integer.
  • the isomorphism feature includes N nodes. Based on the connection relationship between the N nodes in the isomorphism feature, available N PEs are selected from the PE array, where the connection relationships of the N PEs are The connection relationship between N nodes in the isomorphic feature is the same. One node in the isomorphic feature corresponds to one PE among the N PEs. Then, based on the configuration of each node in the isomorphism feature, corresponding configurations are performed on the corresponding PEs among the N PEs to obtain the static configuration of the N PEs.
  • if the isomorphism feature includes the routing configuration of each of the N nodes, the static configuration includes the routing configuration of each of the N PEs; if the isomorphism feature also includes the functional configuration of at least one of the N nodes, the static configuration also includes the functional configuration of at least one PE among the N PEs.
  • the PE array has a 3×3 architecture.
  • the N PEs are PE00, PE01, PE02, PE11, PE20, PE21 and PE22.
  • the static configuration of these 7 PEs is obtained as shown in Figure 2-10.
  • the static configuration includes the routing configurations of the N PEs (PE00, PE01, PE02, PE11, PE20, PE21 and PE22).
  • the functional configurations of PE01 and PE21 are addition, and the functional configurations of PE00, PE02, PE11, PE20, and PE22 are not limited.
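  • The step from an isomorphism feature to a static configuration can be sketched in Python as follows. The node-to-PE mapping `NODE_TO_PE`, the assumption that a PE can only route to a horizontally or vertically adjacent PE in the 3×3 array, and the helper names are all hypothetical choices for illustration; Figure 2-10 itself is not reproduced in this text, so the actual mapping may differ.

```python
ANY = "ANY"  # functional configuration not limited

# Isomorphism feature with e and f fixed to addition: node -> (function, route target).
FEATURE = {"a": (ANY, "e"), "b": (ANY, "e"), "c": (ANY, "f"), "d": (ANY, "f"),
           "e": ("ADD", "g"), "f": ("ADD", "g"), "g": (ANY, None)}

# Assumed placement of the seven nodes on the 3x3 array (PErc = row r, column c).
NODE_TO_PE = {"a": "PE00", "b": "PE02", "e": "PE01",
              "c": "PE20", "d": "PE22", "f": "PE21", "g": "PE11"}


def pe_coords(name):
    return int(name[2]), int(name[3])


def adjacent(p, q):
    (r1, c1), (r2, c2) = pe_coords(p), pe_coords(q)
    return abs(r1 - r2) + abs(c1 - c2) == 1


def static_config(feature, node_to_pe):
    """Translate the feature into per-PE settings, keeping only what the feature fixes."""
    cfg = {}
    for node, (func, target) in feature.items():
        pe = node_to_pe[node]
        route = node_to_pe[target] if target is not None else None
        if route is not None and not adjacent(pe, route):
            raise ValueError(f"{pe} cannot route directly to {route}")
        entry = {"route": route}
        if func != ANY:
            entry["func"] = func  # function is part of the static configuration
        cfg[pe] = entry
    return cfg


STATIC = static_config(FEATURE, NODE_TO_PE)
for pe in sorted(STATIC):
    print(pe, STATIC[pe])
```

  • Under these assumptions PE01 and PE21 carry both a route and the addition function, while the other five PEs carry only a route and leave their function to the dynamic configuration.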
  • the processing module determines M dynamic configurations based on the static configuration and the overall configuration of the M operators in the PE array.
  • the dynamic configuration includes other configurations in the overall configuration except the static configuration.
  • the dynamic configuration is the functional configuration of PE00, PE02, PE11, PE20, and PE22.
  • the dynamic configuration corresponding to operator 1 is: the function configurations of PE00, PE02, PE20, and PE22 are all multiplication, and the function configuration of PE11 is addition
  • the dynamic configuration corresponding to operator 2 is: the function configurations of PE00, PE02, PE20, and PE22 are all multiplication, and the function configuration of PE11 is subtraction
  • operator 3 corresponds to 5 dynamic configurations, of which 4 dynamic configurations are: the function configurations of PE00, PE02, PE20, and PE22 are all multiplication, and the function configuration of PE11 is subtraction
  • the remaining one of the 5 dynamic configurations corresponding to operator 3 is: the function configurations of PE00, PE02, PE11, PE20, and PE22 are all subtraction.
  • the dynamic configuration also includes the configuration of at least one PE other than the N PEs in the PE array.
  • the data flow diagram 4 corresponding to operator 4 is based on the isomorphism characteristics shown in Figure 2-10.
  • the dynamic configuration corresponding to operator 4 can also include the configuration of PE10: the route of PE10 is configured to point to PE11, and the function of PE10 is configured as addition.
  • the processing module stores the mapping relationship between the static configuration and the index number in the storage module.
  • the processing module can generate an index number for the static configuration, and store the mapping relationship between the index number and the static configuration in the storage module.
  • the storage module can store the mapping relationship through the template lib.
  • there are two static configurations namely static configuration 1 and static configuration 2.
  • the processing module can generate two index numbers, namely index number 1 and index number 2.
  • index number 1 has a mapping relationship with static configuration 1.
  • Index number 2 has a mapping relationship with static configuration 2, and the mapping relationship between index number and static configuration is stored in the template lib of the storage module.
  • the template lib of the storage module is shown in Table 1:
  • the items under the idx column are represented as index numbers, and the items under the cfg column are static configurations of the data link.
  • the processing module transmits the configuration word to the PE array.
  • the configuration word includes static configuration and at least one dynamic configuration among M dynamic configurations.
  • Operator 1 corresponds to 1 configuration word
  • operator 2 corresponds to 1 configuration word
  • operator 3 corresponds to 5 configuration words.
  • the dynamic configurations of the first four configuration words are the same, and only the dynamic configuration of the fifth one is different.
  • the configuration word may also include the number of configuration copies.
  • the number of configuration copies is used to indicate the number of configurations based on the configuration word.
  • multiple copies of a configuration word that have the same static configuration and the same dynamic configuration can be abbreviated as 1 copy of the configuration word to further reduce the transmission overhead.
  • Table 3 is an example of the configuration words of three operators in the embodiment of the present application.
  • each configuration word shown in Table 3 is different, that is, one configuration word corresponds to one reconfigurable cycle.
  • the dynamic configuration includes the configurations of the N PEs other than the static configuration, and also includes the configuration of at least 1 PE other than the N PEs, so that the dynamic configuration can cover PEs of the PE array beyond those corresponding to the N nodes, enhancing the applicability.
  • FIG. 2-11 shows the data flow diagram 4 corresponding to operator 4.
  • the isomorphism feature shown in Figure 2-5 can only be used as the local structure of the data flow diagram 4 of operator 4.
  • the configuration of the remaining node can correspond to PE10.
  • the configuration of PE10 is the configuration of at least 1 PE other than the N PEs.
  • Illustratively, Table 4-1 and Table 4-2 are examples of configuration words of three operators in the embodiment of this application.
  • the dynamic configuration is divided into the Cfg_operation_list part and the other cfg part, where the Cfg_operation_list part is the configuration of the N PEs other than the static configuration, and the other cfg part is the configuration of at least 1 PE other than the N PEs.
  • the configuration word includes an index number and at least one dynamic configuration among the M dynamic configurations.
  • the index number represents the static configuration, which effectively reduces transmission overhead and improves transmission efficiency.
  • Table 5-1, Table 5-2, Table 5-3 and Table 5-4 are examples of configuration words corresponding to three operators in the embodiment of the present application.
  • the processing module can transmit the configuration word to the storage module, and then the storage module stores the configuration word through the built-in config ram and transmits the configuration word to the PE array.
  • the processing module can transmit all configuration words to the storage module at one time, and the storage module sequentially transmits the configuration words to the PE array according to certain rules, one configuration word at a time.
  • the configuration words received by the storage module but not yet transmitted to the PE array can be stored in the config ram. Since the configuration word includes the index number rather than the static configuration itself, storage requirements are greatly reduced.
  • the PE array obtains the static configuration that has a mapping relationship with the index number from the storage module.
  • the cfg buffer in the PE array can obtain static configuration from the template lib of the storage module based on the index number.
  • the cfg buffer in the PE array requests a static configuration from the storage module based on the index number.
  • the storage module determines the static configuration from the template lib based on the index number and mapping relationship, and returns the static configuration to the PE array.
  • when the PE array receives a new configuration word, it checks the index number. If the index number is the same as the index number in the last received configuration word, the PE array does not need to obtain the static configuration from the storage module. Instead, it reuses the static configuration of the previous configuration word and only needs to switch the dynamic configuration, which reduces transmission overhead.
  • the cfg buffer can be divided into two separate storage spaces, namely storage space 1 and storage space 2.
  • Storage space 1 is used to store static configuration
  • storage space 2 is used to store dynamic configuration.
  • the cfg buffer in the PE array receives configuration word 1, configuration word 2, and configuration word 3 sequentially transmitted by the storage module.
  • Configuration word 1 includes the index number and dynamic configuration dynamic0
  • configuration word 2 includes the index number and dynamic configuration dynamic1
  • configuration word 3 includes the index number and dynamic configuration dynamic2.
  • when the cfg buffer in the PE array receives configuration word 1, it obtains the static configuration from the template lib in the storage module based on the index number, and stores the static configuration and the dynamic configuration dynamic0.
  • when the cfg buffer in the PE array receives configuration word 2, it can determine that the index number in configuration word 2 is the same as the index number in configuration word 1, so the static configuration (static) does not need to be obtained again from the config RAM in the storage module; instead, the dynamic configuration dynamic0 is switched to the dynamic configuration dynamic1 in configuration word 2.
  • when the cfg buffer in the PE array receives configuration word 3, it can determine that the index number in configuration word 3 is the same as the index number in configuration word 2, so the static configuration (static) does not need to be obtained again from the config RAM in the storage module; instead, the dynamic configuration dynamic1 is switched to the dynamic configuration dynamic2 in configuration word 3. Since only the dynamic configuration needs to be switched and the static configuration stays in place, switching overhead is reduced.
  • the PE array is configured based on the static configuration and at least one dynamic configuration among the M dynamic configurations.
  • when the chip executes the first operator among the M operators, the PE array is configured based on the static configuration and the first dynamic configuration corresponding to the first operator, where the first dynamic configuration is one of the M dynamic configurations;
  • the PE array switches the first dynamic configuration to the second dynamic configuration corresponding to the second operator.
  • the second operator is executed after the first operator is executed.
  • the second dynamic configuration is one of M dynamic configurations. Since only dynamic configuration is switched, there is no need to switch static configuration, which reduces switching overhead.
  • operator 1, operator 2, and operator 3 correspond to the same static configuration.
  • the cfg switcher in the processing module only needs to switch the dynamic configuration in the cfg buffer of the PE array and does not need to switch the static configuration, which saves switching overhead.
  • a chip 300 provided by an embodiment of the present application includes:
  • a processing module 310 and a PE array 320, wherein:
  • the processing module 310 is used to generate isomorphism features of M operators, where the isomorphism features correspond to the same local structure among the M operators and M is a positive integer; determine the static configuration of N PEs in the PE array according to the isomorphism features, where N is a positive integer; and determine M dynamic configurations based on the static configuration and the overall configuration of the M operators in the PE array, where a dynamic configuration is the other configurations in the overall configuration except the static configuration;
  • the PE array 320 is used to perform configuration based on the static configuration and at least one dynamic configuration among the M dynamic configurations.
  • the PE array 320 is specifically configured to: when executing the first operator among the M operators, perform configuration based on the static configuration and the first dynamic configuration corresponding to the first operator, where the first dynamic configuration is one of the M dynamic configurations; when executing the second operator among the M operators, switch the first dynamic configuration to the second dynamic configuration corresponding to the second operator, where the second operator is an operator executed after the first operator is executed, and the second dynamic configuration is one of the M dynamic configurations.
  • the processing module 310 is configured to obtain M data flow graphs corresponding to the M operators, and extract the isomorphism features according to the M data flow graphs.
  • the chip 300 further includes a storage module 330; the processing module 310 is also configured to transmit the mapping relationship between the static configuration and the index number to the storage module 330, and to transmit a configuration word to the PE array, where the configuration word includes the index number and at least one dynamic configuration among the M dynamic configurations; the PE array 320 is also used to obtain the static configuration from the storage module 330 based on the index number.
  • An embodiment of the present application also provides a computer storage medium, wherein the computer storage medium stores a program, and the program executes some or all of the steps described in the above method embodiments.
  • An embodiment of the present application also provides a computer program product, wherein the computer program product stores a program, and the program executes some or all of the steps recorded in the above method embodiments.
  • the communication device 400 includes:
  • Receiver 401, transmitter 402, processor 403 and memory 404 may be connected through a bus or other means. In FIG. 4, the connection through the bus is taken as an example.
  • Memory 404 may include read-only memory and random access memory and provides instructions and data to processor 403 .
  • a portion of memory 404 may also include non-volatile random access memory (NVRAM).
  • the memory 404 stores an operating system and operating instructions, executable modules or data structures, or a subset thereof, or an extended set thereof, where the operating instructions may include various operating instructions for implementing various operations.
  • the operating system may include various system programs that are used to implement various basic services and handle hardware-based tasks.
  • the processor 403 controls the operation of the communication device 400.
  • the processor 403 may also be called a central processing unit (CPU).
  • various components of the communication device 400 are coupled together through a bus system, where in addition to a data bus, the bus system may also include a power bus, a control bus, a status signal bus, etc.
  • for clarity, the various buses are all referred to as the bus system in the figure.
  • the methods disclosed in the above embodiments of the present application can be applied to the processor 403 or implemented by the processor 403.
  • the processor 403 may be included as a chip as described in FIG. 3 .
  • the steps of the method disclosed in conjunction with the embodiments of the present application can be directly implemented by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
  • the software module can be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other mature storage media in this field.
  • the storage medium is located in the memory 404.
  • the processor 403 reads the information in the memory 404 and completes the steps of the above method in combination with its hardware.
  • the receiver 401 can be used to receive input numeric or character information, and generate signal input related to the relevant settings and function control of the communication device 400.
  • the transmitter 402 can include a display device such as a display screen, and the transmitter 402 can be used to output numeric or character information through an external interface.
  • the processor 403 is configured to execute the configuration method of the processing unit PE array executed by the communication device 400 .
  • the device embodiments described above are only illustrative.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they can be located in one place or distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • the connection relationship between modules indicates that there are communication connections between them, which can be specifically implemented as one or more communication buses or signal lines.
  • the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product.
  • the computer software product is stored in a readable storage medium, such as a computer floppy disk, a USB flash drive, a removable hard disk, ROM, RAM, a magnetic disk or an optical disk, and includes a number of instructions to cause a computer device (which can be a personal computer, server, or network device, etc.) to execute the methods described in various embodiments of this application.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (such as coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (such as infrared, wireless, microwave, etc.) means.
  • the computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a server or data center, that integrates one or more available media.
  • the available media may be magnetic media (eg, floppy disk, hard disk, tape), optical media (eg, DVD), or semiconductor media (eg, Solid State Disk (SSD)), etc.

Abstract

Embodiments of the present application disclose a configuration method for a processing element (PE) array and a related device, which are used for configuring a PE array. The present application may be applied to a chip, and the chip comprises a processing module and a PE array. The processing module generates isomorphism features of M operators, then determines a static configuration of N PEs in the PE array according to the isomorphism features, and determines M dynamic configurations on the basis of the static configuration and an overall configuration of the M operators in the PE array, the dynamic configurations being other configurations in the overall configuration except for the static configuration. Therefore, the PE array may be configured on the basis of the static configuration and one of the M dynamic configurations, without switching the static configuration of the PE array, thereby reducing the switching overhead.

Description

Configuration method for a processing element (PE) array and related device
This application claims priority to Chinese Patent Application No. 202210264327.4, filed with the China Patent Office on March 17, 2022 and entitled "Configuration method for a processing element (PE) array and related device", which is incorporated herein by reference in its entirety.
Technical field
This application relates to the field of chip technology, and in particular to a configuration method for a processing element (PE) array and a related device.
Background
A coarse-grained reconfigurable array (CGRA) chip is a new generation of programmable acceleration architecture that combines the flexibility of a field programmable gate array (FPGA) chip with the high energy efficiency of an application specific integrated circuit (ASIC) chip. The PE array in a CGRA chip is configured through configuration words so that the CGRA chip can execute the corresponding algorithm.
Currently, a CGRA chip can configure the PE array according to an operator of a program to obtain a configured PE array. When the CGRA chip then executes that operator, it only needs to transmit service data to the PE array; the configuration of the PE array does not need to be switched, and the PE array can execute the operator based on the service data.
However, because the number of PEs in the PE array of a CGRA chip is limited, multiple different operators often need to reuse the same one or more PEs in the PE array. When the CGRA chip executes operator 2 after operator 1, the configuration of the reused PE or PEs must be switched. This switching overhead is high and restricts further improvement of CGRA chip performance.
Summary of the invention
Embodiments of this application provide a configuration method for a PE array and a related device, which are used to configure the PE array.
A first aspect of this application provides a configuration method for a processing element (PE) array, which can be applied to a chip that includes a processing module and a PE array. The processing module generates isomorphism features of M operators, where M is a positive integer; determines a static configuration of N PEs in the PE array according to the isomorphism features, where N is a positive integer; and determines M dynamic configurations based on the static configuration and the overall configuration of the M operators in the PE array, where a dynamic configuration includes the configurations in the overall configuration other than the static configuration. The PE array can then be configured based on the static configuration and one of the M dynamic configurations. Only the dynamic configuration needs to be switched, not the static configuration, which reduces switching overhead.
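The split between static and dynamic configuration can be sketched as follows. This is a minimal illustration, assuming a hypothetical encoding in which a configuration is a dictionary keyed by (PE, field). The PE names and the functions of operator 1 and operator 2 follow the 3×3 example of this application; the concrete routes in `ROUTES` and the helper names `overall_config` and `split_dynamic` are assumptions made only for illustration.

```python
# Hypothetical routing for the 7 PEs; the concrete routes are an assumption.
ROUTES = {
    "PE00": "PE01", "PE02": "PE01", "PE01": "PE11",
    "PE20": "PE21", "PE22": "PE21", "PE21": "PE11",
    "PE11": "out",
}

# Static configuration: routing of all N PEs plus the fixed functions of PE01 and PE21.
STATIC = {(pe, "route"): dst for pe, dst in ROUTES.items()}
STATIC[("PE01", "func")] = "add"
STATIC[("PE21", "func")] = "add"


def overall_config(corner_func, center_func):
    """Build an operator's overall configuration on the 7 PEs."""
    cfg = dict(STATIC)
    for pe in ("PE00", "PE02", "PE20", "PE22"):
        cfg[(pe, "func")] = corner_func
    cfg[("PE11", "func")] = center_func
    return cfg


def split_dynamic(overall, static):
    """Return the part of `overall` that is not covered by `static`."""
    for key, value in static.items():
        if overall.get(key) != value:
            raise ValueError(f"overall configuration disagrees with static at {key}")
    return {k: v for k, v in overall.items() if k not in static}


dynamic_op1 = split_dynamic(overall_config("mul", "add"), STATIC)  # operator 1
dynamic_op2 = split_dynamic(overall_config("mul", "sub"), STATIC)  # operator 2
```

Under these assumptions each dynamic configuration holds only the five function entries of PE00, PE02, PE11, PE20 and PE22, while the nine static entries are shared by both operators.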
In some possible implementations, the step in which the PE array is configured based on the static configuration and at least one of the M dynamic configurations may include: when the first operator among the M operators is executed, the PE array is configured based on the static configuration and the first dynamic configuration corresponding to the first operator, where the first dynamic configuration is one of the M dynamic configurations; when the second operator among the M operators is executed, the PE array switches the first dynamic configuration to the second dynamic configuration corresponding to the second operator, where the second operator is executed after the first operator and the second dynamic configuration is one of the M dynamic configurations. The PE array therefore only needs to switch its dynamic configuration, not its static configuration, which reduces switching overhead.
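To make the saving concrete, the toy model below counts how many configuration entries are written when the second operator follows the first. The class `PEArrayModel`, the entry counts (9 static, 5 dynamic) and the dictionary encoding are hypothetical; the sketch only illustrates the switching behaviour described above.

```python
class PEArrayModel:
    """Toy model of a PE array that counts how many configuration entries are written."""

    def __init__(self):
        self.static = {}
        self.dynamic = {}
        self.entries_written = 0

    def configure(self, static, dynamic):
        """Initial configuration: load both the static and the dynamic part."""
        self.static = dict(static)
        self.dynamic = dict(dynamic)
        self.entries_written += len(static) + len(dynamic)

    def switch_dynamic(self, dynamic):
        """Operator switch: only the dynamic part is rewritten."""
        self.dynamic = dict(dynamic)
        self.entries_written += len(dynamic)


# Hypothetical sizes: 9 static entries and 5 dynamic entries per operator.
static = {f"s{i}": "fixed" for i in range(9)}
first_dynamic = {"PE00": "mul", "PE02": "mul", "PE20": "mul", "PE22": "mul", "PE11": "add"}
second_dynamic = {"PE00": "mul", "PE02": "mul", "PE20": "mul", "PE22": "mul", "PE11": "sub"}

array = PEArrayModel()
array.configure(static, first_dynamic)   # executing the first operator
array.switch_dynamic(second_dynamic)     # executing the second operator
```

In this model the two operators cost 14 + 5 = 19 entry writes, whereas reloading the full configuration for the second operator would cost 14 + 14 = 28.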
In some possible implementations, before the processing module performs the step of generating the isomorphism features of the M operators, the method may further include: the processing module obtains M data flow graphs corresponding to the M operators. The processing module can then extract the isomorphism features from the M data flow graphs, thereby obtaining the isomorphism features of the M operators.
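The application does not prescribe a particular extraction algorithm. As a deliberately simplified sketch, the code below assumes the local structures of the data flow graphs have already been aligned node by node (a real extractor would need some form of subgraph matching) and keeps only the edges and node operations that all of them share. The node names `n0` to `n6`, the edge set and the function names are hypothetical.

```python
def common_feature(graphs):
    """Intersect pre-aligned data flow graphs.

    Each graph is a dict with "edges" (a set of (src, dst) pairs) and
    "ops" (a dict mapping node name to operation).
    """
    edges = set.intersection(*(g["edges"] for g in graphs))
    first_ops = graphs[0]["ops"]
    ops = {
        node: op
        for node, op in first_ops.items()
        if all(g["ops"].get(node) == op for g in graphs)
    }
    return {"edges": edges, "ops": ops}


# Hypothetical 7-node tree: four leaves feed two adders, which feed one root.
EDGES = {("n0", "n4"), ("n1", "n4"), ("n2", "n5"), ("n3", "n5"), ("n4", "n6"), ("n5", "n6")}


def local_structure(leaf_op, root_op):
    """Build one local structure with the given leaf and root operations."""
    ops = {n: leaf_op for n in ("n0", "n1", "n2", "n3")}
    ops.update({"n4": "add", "n5": "add", "n6": root_op})
    return {"edges": set(EDGES), "ops": ops}


graphs = [
    local_structure("mul", "add"),  # operator 1
    local_structure("mul", "sub"),  # operator 2
    local_structure("sub", "sub"),  # one local structure of operator 3
]
feature = common_feature(graphs)
```

With these inputs the shared feature keeps all six edges but only the two intermediate additions, which mirrors the example in which the routing and the functions of PE01 and PE21 are static while the other functions are dynamic.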
In some possible implementations, the chip further includes a storage module. Before the PE array performs the step of configuring based on the static configuration and at least one of the M dynamic configurations, the method may further include: the storage module stores the mapping relationship between the static configuration and an index number; the processing module transmits a configuration word to the PE array, where the configuration word includes the index number and at least one of the M dynamic configurations; and the PE array obtains, from the storage module, the static configuration that has a mapping relationship with the index number. When the processing module transmits multiple configuration words, the static configuration is not transmitted directly but is replaced by the index number, which reduces transmission overhead and improves the transmission efficiency of configuration words.
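The index-number mechanism can be pictured with the following sketch. The class names `TemplateLib` and `CfgBuffer`, their methods and the tuple encoding of a configuration word are hypothetical; the behaviour (look up the static configuration by index number, and skip the lookup when the index number repeats) follows the description in this application.

```python
class TemplateLib:
    """Stores the mapping between index numbers and static configurations."""

    def __init__(self):
        self._table = {}
        self.lookups = 0

    def register(self, static):
        """Assign the next index number to a static configuration."""
        idx = len(self._table) + 1
        self._table[idx] = static
        return idx

    def fetch(self, idx):
        """Return the static configuration mapped to `idx`."""
        self.lookups += 1
        return self._table[idx]


class CfgBuffer:
    """Holds the static and dynamic configuration currently applied to the PE array."""

    def __init__(self, template_lib):
        self._lib = template_lib
        self._idx = None
        self.static = None
        self.dynamic = None

    def receive(self, word):
        """Apply a configuration word given as (index number, dynamic configuration)."""
        idx, dynamic = word
        if idx != self._idx:          # fetch only when the index number changes
            self.static = self._lib.fetch(idx)
            self._idx = idx
        self.dynamic = dynamic        # the dynamic part is always switched


lib = TemplateLib()
idx = lib.register({"route": "static configuration 1"})
buffer = CfgBuffer(lib)
for dynamic in ("dynamic0", "dynamic1", "dynamic2"):
    buffer.receive((idx, dynamic))
```

Three configuration words that share one index number trigger a single lookup in the template library; only the dynamic part changes on the second and third word.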
In some possible implementations, the configuration word further includes a number of configuration copies, which indicates the number of times configuration is performed based on the configuration word. Multiple configuration words with the same static configuration and the same dynamic configuration can be abbreviated as one configuration word, which further reduces transmission overhead and improves transmission efficiency.
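The number of configuration copies behaves like a run-length count. The sketch below assumes a configuration word is a pair (index number, dynamic configuration) and that only consecutive identical words are merged; the function names and the labels `dynamic_a` and `dynamic_b` are illustrative.

```python
from itertools import groupby


def compress(words):
    """Collapse runs of identical (index, dynamic) words into (index, dynamic, copies)."""
    return [(idx, dyn, len(list(run))) for (idx, dyn), run in groupby(words)]


def expand(compressed):
    """Inverse of `compress`: repeat each word `copies` times."""
    return [(idx, dyn) for idx, dyn, copies in compressed for _ in range(copies)]


# Operator 3 in the example: five configuration words, the first four identical.
operator3_words = [(1, "dynamic_a")] * 4 + [(1, "dynamic_b")]
compressed = compress(operator3_words)
```

Here the five words of operator 3 shrink to two entries, one with four copies and one with a single copy, and `expand` restores the original sequence.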
In some possible implementations, the isomorphism feature includes the routing configuration of each of N nodes, any two of the N nodes are directly or indirectly connected, and the static configuration includes the routing configurations of the N PEs. The routing configurations of the N PEs then do not need to be modified, which reduces switching overhead.
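One way to picture how N connected nodes are matched to N PEs with the same connection relationship is a small backtracking search. The sketch assumes a 3×3 array in which each PE is connected to its four mesh neighbours, and a hypothetical 7-node feature; neither the topology nor the search strategy is specified by this application.

```python
def neighbours(pe, rows=3, cols=3):
    """PEs reachable in one hop on a 4-neighbour mesh (an assumed topology)."""
    r, c = pe
    steps = ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
    return {(x, y) for x, y in steps if 0 <= x < rows and 0 <= y < cols}


def place(nodes, edges, rows=3, cols=3):
    """Map each node to a distinct PE so that every edge joins adjacent PEs."""
    all_pes = [(r, c) for r in range(rows) for c in range(cols)]

    def consistent(node, pe, mapping):
        near = neighbours(pe, rows, cols)
        for a, b in edges:
            if a == node:
                other = b
            elif b == node:
                other = a
            else:
                continue
            if other in mapping and mapping[other] not in near:
                return False
        return True

    def search(i, mapping):
        if i == len(nodes):
            return dict(mapping)
        node = nodes[i]
        for pe in all_pes:
            if pe not in mapping.values() and consistent(node, pe, mapping):
                mapping[node] = pe
                result = search(i + 1, mapping)
                if result is not None:
                    return result
                del mapping[node]  # undo and try the next PE
        return None

    return search(0, {})


# Hypothetical feature: four leaves feed two intermediate nodes, which feed one root.
NODES = ["n6", "n4", "n5", "n0", "n1", "n2", "n3"]
EDGES = [("n0", "n4"), ("n1", "n4"), ("n2", "n5"), ("n3", "n5"), ("n4", "n6"), ("n5", "n6")]
placement = place(NODES, EDGES)
```

Once such a placement is found, the route of every selected PE is determined by the edges of the feature and can stay fixed across all operators that share it.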
In some possible implementations, the isomorphism feature further includes the functional configuration of at least 1 of the N nodes, and the static configuration further includes the functional configuration of at least 1 of the N PEs. The routing configuration of at least one of the N PEs then does not need to be modified, which reduces switching overhead.
In some feasible implementations, the chip further includes a MEM interface. The MEM interface can obtain the source code of a program and transmit it to the processing module, so that the processing module can generate the data flow graphs of the M operators based on the source code and obtain M data flow graphs.
In some possible implementations, the processing module can extract an isomorphism feature from within a single data flow graph, where the isomorphism feature is at least two identical local structures in that data flow graph. Reusing the N PEs in the PE array that correspond to the isomorphism feature reduces the number of PEs required and enhances usability.
In some possible implementations, the dynamic configuration further includes the configuration of at least 1 PE other than the N PEs. The configuration word is then also applicable to operators that cannot be composed of an integer number of isomorphism features, which enhances the applicability of the configuration word.
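The sketch below splits a dynamic configuration into the part that applies to the N PEs (called Cfg_operation_list in this application) and the part that applies to other PEs (called other cfg). The PE10 entry follows the operator 4 example, in which PE10 routes to PE11 and performs addition; the functions assigned to the other PEs and the name `partition_dynamic` are placeholders for illustration.

```python
# The N PEs covered by the static configuration in the 3x3 example.
N_PES = {"PE00", "PE01", "PE02", "PE11", "PE20", "PE21", "PE22"}


def partition_dynamic(dynamic, n_pes=N_PES):
    """Split a dynamic configuration into (Cfg_operation_list, other cfg)."""
    operation_list = {pe: cfg for pe, cfg in dynamic.items() if pe in n_pes}
    other_cfg = {pe: cfg for pe, cfg in dynamic.items() if pe not in n_pes}
    return operation_list, other_cfg


# Operator 4: function values are placeholders, except for PE10, whose route
# and function follow the example (route to PE11, function addition).
dynamic_op4 = {
    "PE00": {"func": "mul"},
    "PE02": {"func": "mul"},
    "PE11": {"func": "add"},
    "PE20": {"func": "mul"},
    "PE22": {"func": "mul"},
    "PE10": {"route": "PE11", "func": "add"},
}
operation_list, other_cfg = partition_dynamic(dynamic_op4)
```

Because PE10 lies outside the N PEs, both its route and its function travel in the dynamic part, while the N PEs only carry their non-static function entries.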
In some possible implementations, the isomorphism features can also be local structures of several different granularities, so that the processing module can determine isomorphism features of different granularities as needed in different situations, which reduces the number of PEs required and enhances usability.
In some possible implementations, the storage module includes a configuration random access memory (config RAM) and a static configuration template library (template lib). The config RAM is used to store configuration words, and the template lib is used to store the mapping relationship between index numbers and static configurations. For multiple configuration words with the same static configuration, only one copy of the static configuration needs to be stored in the template lib, and only the index number needs to be stored in the config RAM. Compared with storing the static configuration for every configuration word, this reduces storage overhead.
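A back-of-the-envelope comparison shows where the storage saving comes from. All bit widths below are hypothetical and chosen only to make the arithmetic visible; the count of 7 configuration words matches the example (one each for operators 1 and 2, five for operator 3).

```python
def storage_bits(num_words, static_bits, dynamic_bits, index_bits):
    """Compare storing full words against index-based words plus one template."""
    full = num_words * (static_bits + dynamic_bits)
    indexed = static_bits + num_words * (index_bits + dynamic_bits)
    return full, indexed


# Hypothetical sizes: 7 words sharing one 64-bit static configuration,
# 16-bit dynamic configurations and 4-bit index numbers.
full, indexed = storage_bits(num_words=7, static_bits=64, dynamic_bits=16, index_bits=4)
```

With these assumed sizes, storing every word in full takes 560 bits, while one template plus seven indexed words takes 204 bits; the gap widens as more words share the same static configuration.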
In some possible implementations, the M operators can be all of the operators in a program or some of the operators in the program, so that the chip can determine one or more different isomorphism features for a program as needed. This suits programs whose operators do not share a single suitable isomorphism feature, and enhances applicability.
A second aspect of this application provides a chip, which is used to perform the method described in any one of the implementations of the first aspect.
A third aspect of this application provides a computer-readable storage medium. The computer-readable storage medium stores instructions that, when run on a computer, cause the computer to perform the method described in any one of the implementations of the first aspect.
A fourth aspect of this application provides a computer program product. The computer program product includes computer-executable instructions stored in a computer-readable storage medium. A processor of a device can read the computer-executable instructions from the computer-readable storage medium and execute them, so that the device implements the method provided by the first aspect or any possible implementation of the first aspect.
A fifth aspect of this application provides a communication device, which may include a processor, a memory, and a communication interface. The processor is coupled to the memory and the communication interface. The memory is used to store instructions, the processor is used to execute the instructions, and the communication interface is used to communicate with other communication devices under the control of the processor. When executed by the processor, the instructions cause the processor to perform the method of the first aspect or any possible implementation of the first aspect.
For the technical effects of the second to fifth aspects, or of any of their possible implementations, refer to the technical effects of the first aspect or of its different possible implementations; they are not described again here.
Description of the drawings
Figure 1-1 is a schematic structural diagram of a PE array;
Figure 1-2 is a schematic diagram of a data link in an embodiment of this application;
Figure 1-3 is a schematic diagram of a data link in an embodiment of this application;
Figure 1-4 is a schematic diagram of an embodiment of a chip provided by an embodiment of this application;
Figure 2-1 is a schematic flowchart of Embodiment 1 of a PE array configuration method provided by an embodiment of this application;
Figure 2-2 is a schematic diagram of data flow graph 1 in an embodiment of this application;
Figure 2-3 is a schematic diagram of data flow graph 2 in an embodiment of this application;
Figure 2-4 is a schematic diagram of data flow graph 3 in an embodiment of this application;
Figure 2-5 is a schematic diagram of an isomorphism feature in an embodiment of this application;
Figure 2-6 is a schematic diagram of data flow graph 3 divided into multiple local structures in an embodiment of this application;
Figure 2-7 is another schematic diagram of an isomorphism feature in an embodiment of this application;
Figure 2-8 is another schematic diagram of an isomorphism feature in an embodiment of this application;
Figure 2-9 is another schematic diagram of an isomorphism feature in an embodiment of this application;
Figure 2-10 is a schematic diagram of a static configuration in an embodiment of this application;
Figure 2-11 is a schematic diagram of data flow graph 4 in an embodiment of this application;
Figure 2-12 is a schematic diagram of two separate storage spaces partitioned within the configuration buffer (cfg buffer) in an embodiment of this application;
Figure 2-13 is a schematic diagram of the cfg buffer sequentially receiving three configuration words transmitted by the configuration random access memory (config RAM) in an embodiment of this application;
Figure 3 is a schematic structural diagram of a PE array configuration device provided by an embodiment of this application;
Figure 4 is a schematic structural diagram of a communication apparatus provided by an embodiment of this application.
Detailed description
Embodiments of the present application provide a PE array configuration method and a related device for configuring the processing element (PE) array in a CGRA chip.
The embodiments of the present application are described below with reference to the accompanying drawings.
The terms "first", "second", and so on in the description, the claims, and the above drawings of this application are used to distinguish similar objects and do not necessarily describe a specific order or sequence. It should be understood that terms used in this way are interchangeable where appropriate; they are merely a way of distinguishing objects with the same attributes when describing the embodiments of this application. In addition, the terms "include" and "have", and any variations of them, are intended to cover non-exclusive inclusion, so that a process, method, system, product, or device that comprises a series of units is not necessarily limited to those units, but may include other units that are not expressly listed or that are inherent to the process, method, product, or device.
A CGRA chip is a new-generation programmable acceleration architecture that combines the flexibility of an FPGA chip with the high energy efficiency of an ASIC chip. A CGRA chip has a built-in PE array that includes multiple PEs and is used to execute algorithms. Each PE consists of multiple logic gates and performs a corresponding operation, such as addition, subtraction, multiplication, or division. A user can configure at least one PE of the PE array in the CGRA chip through a configuration word, so that the CGRA chip can execute the corresponding algorithm.
For example, Figure 1-1 shows a 3×3 PE array in a CGRA chip. Each element can be denoted PEij (i = 0, 1, 2; j = 0, 1, 2), and the arrows indicate the directions in which connected PEs pass data directly to each other. For example, the connected PE00 and PE01 can pass data directly to each other. After the user configures at least one PE in the PE array, a corresponding data link is formed, which can be used to execute the corresponding algorithm, for example the data link shown in Figure 1-2.
Currently, a CGRA chip can configure the PE array according to an operator of a program and obtain the data link corresponding to that operator, where the data link includes the configurations of multiple PEs in the PE array. When the CGRA chip executes the operator, it only needs to feed service data into the data link corresponding to the operator, without switching the configuration of the PEs in the data link.
However, because the number of PEs in the PE array of a CGRA chip is limited, several different operators often need to reuse the same one or more PEs in the PE array. When the CGRA chip executes operator 2 after operator 1, it has to switch the configuration of the reused PE or PEs. The switching overhead is high, which limits further improvement of CGRA chip performance.
For example, the data link corresponding to operator 2 is shown in Figure 1-3, where SUB denotes subtraction. The only difference from the data link shown in Figure 1-2 is that PE11 changes from addition (ADD) to subtraction (SUB), yet the CGRA chip still has to switch the overall configuration of all N PEs in the data link.
To this end, this application proposes a PE array configuration method and a related device for configuring the PE array.
This application is applicable to a chip that includes a processing module and a PE array. The processing module generates an isomorphism feature of M operators, where M is a positive integer, and then determines a static configuration of N PEs in the PE array according to the isomorphism feature, where N is a positive integer. Based on the static configuration and the overall configurations of the M operators in the PE array, it determines M dynamic configurations, where a dynamic configuration is the part of an overall configuration other than the static configuration. The PE array can then be configured with the static configuration plus one of the M dynamic configurations. Only the dynamic configuration has to be switched while the static configuration stays in place, which reduces the switching overhead.
For example, this application is applicable to the chip 100 shown in Figure 1-4, where the chip 100 includes a memory (MEM) interface 110, a processing module 120, a storage module 130, and a PE array 140. It should be noted that the chip 100 may be an FPGA chip, a CGRA chip, or another reconfigurable chip, which is not limited here.
The MEM interface 110 is the interface through which the internal components of the chip 100 interact with external devices. For example, the MEM interface 110 can receive the source code and service data of a program from a device external to the chip 100, transmit the source code to the processing module 120, and transmit the service data to the storage module 130.
A compiler 121 may be built into the processing module 120, where the compiler 121 is a logical module. The compiler 121 can be used to generate the isomorphism feature of the M operators based on the source code of the program, determine the static configuration of the N PEs in the PE array according to the isomorphism feature, and determine the M dynamic configurations based on the static configuration and the overall configurations of the M operators in the PE array, where a dynamic configuration includes the part of an overall configuration other than the static configuration. The compiler 121 may store the static configuration and at least one of the M dynamic configurations in the storage module 130. In some possible implementations, the compiler 121 forwards the static configuration and at least one of the M dynamic configurations to the storage module 130 through the MEM interface 110. In some possible implementations, the compiler 121 is directly connected to the storage module 130 and forwards the static configuration and at least one of the M dynamic configurations to it directly.
The storage module 130 may be a random access memory (RAM) built into the chip 100. The storage module 130 may transmit the static configuration and at least one of the M dynamic configurations to the PE array 140, so that the PE array 140 is configured based on the static configuration and at least one of the M dynamic configurations.
A configuration buffer (Cfg buffer) 141 is built into the PE array 140. The Cfg buffer 141 can receive the static configuration and at least one of the M dynamic configurations transmitted by the storage module 130, so that the PE array 140 is configured based on the static configuration and one of the M dynamic configurations. The processing module 120 further includes a configuration switcher (Cfg switcher) 122, which can be used to switch the dynamic configuration in the Cfg buffer 141.
In some possible implementations, the compiler 121 stores the static configuration and at least one of the M dynamic configurations in the storage module 130. In some possible implementations, the compiler 121 also stores a mapping between the static configuration and an index number in the storage module 130, and then transmits a configuration word to the PE array 140, where the configuration word includes the index number and at least one of the M dynamic configurations. It should be noted that the compiler 121 may send the configuration word to the storage module 130, which then forwards it to the PE array 140, or the compiler 121 may forward the configuration word to the PE array 140 directly; this is not limited here.
In some possible implementations, the storage module 130 is divided into multiple regions: a configuration random access memory (config RAM) 131, a static configuration template library (template lib) 132, and a data random access memory (data RAM) 133. The config RAM 131 stores configuration words and transmits them to the PE array 140. The template lib 132 stores the mapping between static configurations and index numbers and returns the corresponding static configuration to the PE array 140 based on the index number in a configuration word. The data RAM 133 stores service data and transmits it to the PE array 140. The Cfg buffer 141 can thus receive a configuration word transmitted by the config RAM 131 of the storage module 130, obtain the static configuration from the template lib 132 of the storage module 130 based on the index number in the configuration word, configure the PE array 140 based on the static configuration and one of the M dynamic configurations, and have the configured PE array 140 process the service data to execute the corresponding operator.
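The lookup described above can be sketched in Python as follows. This is only an illustrative model, not the chip's actual data format: the dictionary layout, the field names `idx`, `dynamic`, `route`, and `func`, and the function `resolve_config_word` are all assumptions introduced here. The static entries loosely follow the 3×3 example used later in this description.

```python
from copy import deepcopy

# Hypothetical template lib: index number -> static configuration.
# Each PE entry may hold a "route" (destination PE) and/or a "func".
TEMPLATE_LIB = {
    1: {
        "PE00": {"route": "PE01"},
        "PE02": {"route": "PE01"},
        "PE20": {"route": "PE21"},
        "PE22": {"route": "PE21"},
        "PE01": {"route": "PE11", "func": "ADD"},
        "PE21": {"route": "PE11", "func": "ADD"},
        "PE11": {},  # final node; its function comes from the dynamic part
    }
}


def resolve_config_word(config_word, template_lib):
    """Merge the static configuration selected by the index number with
    the dynamic configuration carried in the configuration word."""
    merged = deepcopy(template_lib[config_word["idx"]])
    for pe, fields in config_word["dynamic"].items():
        merged.setdefault(pe, {}).update(fields)
    return merged


# Hypothetical configuration word for a multiply-add operator.
word_op1 = {
    "idx": 1,
    "dynamic": {
        "PE00": {"func": "MUL"},
        "PE02": {"func": "MUL"},
        "PE20": {"func": "MUL"},
        "PE22": {"func": "MUL"},
        "PE11": {"func": "ADD"},
    },
}

full_config = resolve_config_word(word_op1, TEMPLATE_LIB)
```

Because the configuration word carries only an index number in place of the static part, the same template entry can be reused by every operator that shares it; the sketch copies the template so that the stored entry is never modified.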
The chip 100 has been described above. The PE array configuration method executed in the chip 100 is described next. Referring to Figure 2-1, the method embodiment mainly includes the following steps:
201. The processing module generates M data flow graphs (DFGs) corresponding to M operators based on the source code of a program, where M is a positive integer.
In this embodiment of the application, the chip can receive the source code and service data of the program through the MEM interface, which then transmits the source code to the processing module and the service data to the storage module. After receiving the source code, the processing module can generate a data flow graph for each of the M operators, obtaining M data flow graphs. A data flow graph includes the functional configuration and routing configuration of each of multiple nodes. In some possible implementations, the M operators may be all of the operators in a program or only some of them, which is not limited here.
For example, M = 3, that is, there are three operators: operator 1, operator 2, and operator 3. Operator 1 computes a multiply-add over 2×2 matrices, A*B + C*D; operator 2 computes a multiply-subtract over 2×2 matrices, A*B - C*D; operator 3 computes a multiply-add over 4×4 matrices:
$$K_0 * K_1 + K_2 * K_3 + K_4 * K_5 + K_6 * K_7$$
where A, B, C, and D are all 2×2 matrices:
Figure PCTCN2023070594-appb-000001
and $K_0, K_1, K_2, K_3, K_4, K_5, K_6$, and $K_7$ are all 4×4 matrices:
Figure PCTCN2023070594-appb-000002
where p = 0, 1, 2, 3, 4, 5, 6, 7.
For example, take operator 1 and let the 2×2 matrix $E_1 = A*B + C*D$. Then $E_1$ has four elements:
Figure PCTCN2023070594-appb-000003
For any element $E_{1,ij}$ (i = 0, 1; j = 0, 1) of $E_1$, the following operation needs to be performed:
$$E_{1,ij} = (A_{i0}*B_{0j} + A_{i1}*B_{1j}) + (C_{i0}*D_{0j} + C_{i1}*D_{1j})$$
It follows that the processing module can generate data flow graph 1, shown in Figure 2-2, based on the source code of operator 1. Data flow graph 1 includes seven nodes: MUL0, MUL1, MUL2, MUL3, ADD0, ADD1, and ADD2. The functional configuration of MUL0, MUL1, MUL2, and MUL3 is multiplication, and the functional configuration of ADD0, ADD1, and ADD2 is addition. The routing configuration of MUL0 and MUL1 points to ADD0, that of MUL2 and MUL3 points to ADD1, and that of ADD0 and ADD1 points to ADD2.
Here, MUL0 computes $A_{i0}*B_{0j}$, MUL1 computes $A_{i1}*B_{1j}$, MUL2 computes $C_{i0}*D_{0j}$, MUL3 computes $C_{i1}*D_{1j}$, ADD0 computes MUL0 + MUL1, ADD1 computes MUL2 + MUL3, and ADD2 computes ADD0 + ADD1, which finally yields the value of $E_{1,ij}$.
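As a minimal sketch of how data flow graph 1 produces one output element, the Python below names each intermediate value after the node that computes it and indexes the operands with the standard row-by-column rule. The function names `operator1_element` and `matmul2` are illustrative assumptions; `matmul2` is only a reference implementation used for comparison.

```python
def operator1_element(A, B, C, D, i, j):
    """Evaluate element (i, j) of A*B + C*D through the seven nodes of
    data flow graph 1 (four multiplications, three additions)."""
    mul0 = A[i][0] * B[0][j]
    mul1 = A[i][1] * B[1][j]
    mul2 = C[i][0] * D[0][j]
    mul3 = C[i][1] * D[1][j]
    add0 = mul0 + mul1  # element (i, j) of A*B
    add1 = mul2 + mul3  # element (i, j) of C*D
    add2 = add0 + add1  # final result
    return add2


def matmul2(X, Y):
    """Reference 2x2 matrix product, used only to check the sketch."""
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[1, 0], [0, 1]]
D = [[2, 3], [4, 5]]

E1 = [[operator1_element(A, B, C, D, i, j) for j in range(2)]
      for i in range(2)]
```

The same seven-node structure is evaluated four times, once per output element, with only the input operands changing.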
As another example, take operator 2 and let the 2×2 matrix $E_2 = A*B - C*D$. Then $E_2$ has four elements:
Figure PCTCN2023070594-appb-000004
For any element $E_{2,ij}$ (i = 0, 1; j = 0, 1) of $E_2$, the following operation needs to be performed:
$$E_{2,ij} = (A_{i0}*B_{0j} + A_{i1}*B_{1j}) - (C_{i0}*D_{0j} + C_{i1}*D_{1j})$$
It follows that the processing module can generate data flow graph 2, shown in Figure 2-3, based on the source code of operator 2. Data flow graph 2 includes seven nodes: MUL0, MUL1, MUL2, MUL3, ADD0, ADD1, and SUB0. The functional configuration of MUL0, MUL1, MUL2, and MUL3 is multiplication, that of ADD0 and ADD1 is addition, and that of SUB0 is subtraction. The routing configuration of MUL0 and MUL1 points to ADD0, that of MUL2 and MUL3 points to ADD1, and that of ADD0 and ADD1 points to SUB0.
Here, MUL0 computes $A_{i0}*B_{0j}$, MUL1 computes $A_{i1}*B_{1j}$, MUL2 computes $C_{i0}*D_{0j}$, MUL3 computes $C_{i1}*D_{1j}$, ADD0 computes MUL0 + MUL1, ADD1 computes MUL2 + MUL3, and SUB0 computes ADD0 - ADD1, which finally yields the value of $E_{2,ij}$.
As another example, take operator 3 and let the 4×4 matrix $E_3 = K_0*K_1 + K_2*K_3 + K_4*K_5 + K_6*K_7$. Then $E_3$ has 16 elements:
Figure PCTCN2023070594-appb-000005
For any element $E_{3,ij}$ (i = 0, 1, 2, 3; j = 0, 1, 2, 3) of $E_3$, the following operation needs to be performed:
$$\begin{aligned} E_{3,ij} ={} & (K_{0,i0}*K_{1,0j} + K_{0,i1}*K_{1,1j} + K_{0,i2}*K_{1,2j} + K_{0,i3}*K_{1,3j}) \\ & + (K_{2,i0}*K_{3,0j} + K_{2,i1}*K_{3,1j} + K_{2,i2}*K_{3,2j} + K_{2,i3}*K_{3,3j}) \\ & + (K_{4,i0}*K_{5,0j} + K_{4,i1}*K_{5,1j} + K_{4,i2}*K_{5,2j} + K_{4,i3}*K_{5,3j}) \\ & + (K_{6,i0}*K_{7,0j} + K_{6,i1}*K_{7,1j} + K_{6,i2}*K_{7,2j} + K_{6,i3}*K_{7,3j}) \end{aligned}$$
In this embodiment of the application, the processing module can generate data flow graph 3, shown in Figure 2-4, based on the source code of operator 3. Data flow graph 3 includes 31 nodes: MUL0 to MUL15 and ADD0 to ADD14. The functional configuration of MUL0 to MUL15 is multiplication, and that of ADD0 to ADD14 is addition. The routing configurations of MUL0 to MUL15 and ADD0 to ADD14 are shown in Figure 2-4 and are not described in detail here. ADD14 computes ADD13 + ADD12, which finally yields the value of $E_{3,ij}$.
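The node count of 31 follows from reducing 16 products with a binary adder tree, which needs 15 additions. The sketch below builds such a tree in Python. The pairing order and the resulting node numbering are assumptions made for illustration (the actual routing is the one drawn in Figure 2-4), and `build_multiply_add_dfg` is a hypothetical helper name.

```python
def build_multiply_add_dfg(num_mul):
    """Return {node: (function, destination)} for num_mul multipliers
    reduced by a balanced binary adder tree. Assumes num_mul is a power
    of two and that neighbouring nodes are paired level by level."""
    if num_mul < 1 or num_mul & (num_mul - 1):
        raise ValueError("num_mul must be a power of two")
    level = [f"MUL{k}" for k in range(num_mul)]
    funcs = {name: "MUL" for name in level}
    dfg = {}
    next_add = 0
    while len(level) > 1:
        parents = []
        for left, right in zip(level[0::2], level[1::2]):
            parent = f"ADD{next_add}"
            next_add += 1
            funcs[parent] = "ADD"
            dfg[left] = (funcs[left], parent)
            dfg[right] = (funcs[right], parent)
            parents.append(parent)
        level = parents
    root = level[0]
    dfg[root] = (funcs[root], None)  # the root has no destination node
    return dfg


dfg3 = build_multiply_add_dfg(16)  # shape of operator 3
dfg1 = build_multiply_add_dfg(4)   # shape of operator 1
```

With this numbering, the 4-multiplier tree has the same shape as data flow graph 1, and in the 16-multiplier tree the last adder, ADD14, combines ADD12 and ADD13.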
202. The processing module extracts an isomorphism feature from the M data flow graphs, where the isomorphism feature corresponds to a local structure that is the same across the M operators.
In some possible implementations, the isomorphism feature may include the routing configuration of each of N nodes, where any two of the N nodes are directly or indirectly connected. In some possible implementations, the isomorphism feature further includes the functional configuration of at least one of the N nodes. For example, the isomorphism feature determined between data flow graph 1 (Figure 2-2) and data flow graph 2 (Figure 2-3) may be as shown in Figure 2-5. It includes seven nodes, a, b, c, d, e, f, and g, any two of which are directly or indirectly connected: the routing configuration of a and b points to e, that of c and d points to f, and that of e and f points to g. The isomorphism feature also includes the functional configuration of at least one of the seven nodes: as shown in Figure 2-5, the functional configuration of a, b, c, and d is multiplication, that of e and f is addition, and that of g is not restricted.
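One simple way to picture the extraction is a node-by-node comparison of graphs whose nodes have already been aligned. The Python sketch below does exactly that and nothing more: it assumes the alignment (positions a to g) is given, whereas finding common subgraphs in general is a harder problem that the text does not spell out. The name `extract_feature` and the use of `None` for "not restricted" are illustrative choices.

```python
def extract_feature(dfgs):
    """Keep every node whose route is identical in all graphs. The
    function is kept only if it is also identical; otherwise it is
    recorded as None, meaning 'not restricted'."""
    reference = dfgs[0]
    feature = {}
    for node, (func, dest) in reference.items():
        same_route = all(node in g and g[node][1] == dest for g in dfgs)
        if not same_route:
            continue
        same_func = all(g[node][0] == func for g in dfgs)
        feature[node] = (func if same_func else None, dest)
    return feature


# Data flow graphs 1 and 2 with nodes renamed to positions a..g.
DFG1 = {
    "a": ("MUL", "e"), "b": ("MUL", "e"),
    "c": ("MUL", "f"), "d": ("MUL", "f"),
    "e": ("ADD", "g"), "f": ("ADD", "g"),
    "g": ("ADD", None),
}
DFG2 = dict(DFG1, g=("SUB", None))  # only the final node differs

feature = extract_feature([DFG1, DFG2])
```

For these two graphs the result matches the description of Figure 2-5: all routes are shared, a to d stay multiplication, e and f stay addition, and g is left unrestricted.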
In some possible implementations, the processing module can extract an isomorphism feature from within a single data flow graph, in which case the isomorphism feature is a local structure that occurs at least twice in that graph. Reusing the N PEs in the PE array that correspond to the isomorphism feature reduces the number of PEs required and improves usability. For example, the processing module can partition Figure 2-4 as shown in Figure 2-6, dividing data flow graph 3 into five structurally similar local structures, from which the isomorphism feature shown in Figure 2-7 can be extracted. This isomorphism feature includes seven nodes, a, b, c, d, e, f, and g, where the routing configuration of a and b points to e, that of c and d points to f, and that of e and f points to g. The functional configuration of a, b, c, and d is not restricted, and that of e, f, and g is addition.
In this embodiment of the application, the processing module can extract the isomorphism feature shown in Figure 2-8 based on data flow graph 1, data flow graph 2, and data flow graph 3. This isomorphism feature includes seven nodes, a, b, c, d, e, f, and g. The routing configuration of a and b points to e, that of c and d points to f, and that of e and f points to g. The functional configuration of a, b, c, d, and g is not restricted, and that of e and f is addition.
In some possible implementations, isomorphism features can also be local structures of several different granularities, so the chip can determine isomorphism features of different granularities as needed in different situations, which reduces the number of PEs required and improves usability. For example, the isomorphism feature shown in Figure 2-9 can be extracted based on data flow graph 1, data flow graph 2, and data flow graph 3. It includes three nodes, a, b, and c, where the routing configuration of a and b points to c, and the functional configuration of a, b, and c is not restricted. The isomorphism feature shown in Figure 2-9 has a finer granularity than the one shown in Figure 2-8.
In some possible implementations, the M operators may be all of the operators in a program or only some of them, so the chip can determine one or more different isomorphism features for a program as needed. This makes the method applicable to programs whose operators do not share a single suitable isomorphism feature. For example, if a program includes six operators corresponding to six data flow graphs, data flow graphs 1 to 6, the processing module can extract isomorphism feature 1 based on data flow graphs 1, 2, and 3, and isomorphism feature 2 based on data flow graphs 4, 5, and 6.
It should be noted that steps 201 and 202 are optional; it is sufficient that the processing module can generate the isomorphism feature of the M operators, and how it does so is not limited here. For example, the chip can determine the isomorphism feature based on the formulas of the M operators.
203. The processing module determines the static configuration of N PEs in the PE array according to the isomorphism feature, where N is a positive integer.
In some possible implementations, the isomorphism feature includes N nodes. Based on the connection relationships among the N nodes, N available PEs are selected from the PE array such that the connection relationships of the N PEs are the same as those of the N nodes and each node of the isomorphism feature corresponds to exactly one of the N PEs. Then, based on the configuration of each node in the isomorphism feature, the corresponding PE among the N PEs is configured accordingly, which yields the static configuration of the N PEs. Correspondingly, if the isomorphism feature includes the routing configuration of each of the N nodes, the static configuration includes the routing configuration of each of the N PEs; if the isomorphism feature includes the functional configuration of at least one of the N nodes, the static configuration includes the functional configuration of at least one of the N PEs.
For example, Figure 1-1 shows a PE array with a 3×3 architecture. The isomorphism feature shown in Figure 2-8 can be mapped onto N PEs of that PE array (PE00, PE01, PE02, PE11, PE20, PE21, and PE22, that is, N = 7), which yields the static configuration of the seven PEs shown in Figure 2-10. The routing configurations of the N PEs (PE00, PE01, PE02, PE11, PE20, PE21, and PE22) form a transmission path. In some possible implementations, the functional configuration of PE01 and PE21 is addition, and that of PE00, PE02, PE11, PE20, and PE22 is not restricted.
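The mapping step can be sketched in Python as a placement check followed by a translation of node settings into PE settings. Several things here are assumptions: PEs are taken to be directly connected only to their horizontal and vertical neighbours, the assignment of a to d to the four corner PEs is inferred from adjacency because the text does not state it, and the names `adjacent` and `static_config` are hypothetical.

```python
def coords(pe):
    """Split a name such as 'PE21' into its two grid indices (2, 1)."""
    return int(pe[2]), int(pe[3])


def adjacent(p, q):
    """Assumed connectivity: only horizontal or vertical neighbours."""
    (r1, c1), (r2, c2) = coords(p), coords(q)
    return abs(r1 - r2) + abs(c1 - c2) == 1


# Isomorphism feature of Figure 2-8: node -> (function or None, destination).
FEATURE = {
    "a": (None, "e"), "b": (None, "e"),
    "c": (None, "f"), "d": (None, "f"),
    "e": ("ADD", "g"), "f": ("ADD", "g"),
    "g": (None, None),
}

# Assumed node-to-PE placement on the 3x3 array.
PLACEMENT = {
    "a": "PE00", "b": "PE02", "e": "PE01",
    "c": "PE20", "d": "PE22", "f": "PE21",
    "g": "PE11",
}


def static_config(feature, placement):
    """Translate the feature into per-PE settings, rejecting any route
    that would connect two PEs that are not neighbours."""
    config = {}
    for node, (func, dest) in feature.items():
        pe = placement[node]
        entry = {}
        if dest is not None:
            target = placement[dest]
            if not adjacent(pe, target):
                raise ValueError(f"{pe} cannot route directly to {target}")
            entry["route"] = target
        if func is not None:
            entry["func"] = func
        config[pe] = entry
    return config


static = static_config(FEATURE, PLACEMENT)
```

Under these assumptions the result has routes for six PEs, addition fixed on PE01 and PE21, and PE11 left open, in line with the description of Figure 2-10.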
204. The processing module determines M dynamic configurations based on the static configuration and the overall configurations of the M operators in the PE array, where a dynamic configuration includes the part of an overall configuration other than the static configuration.
For example, if the static configuration extracted based on operator 1, operator 2, and operator 3 is as shown in Figure 2-10, the dynamic configuration consists of the functional configurations of PE00, PE02, PE11, PE20, and PE22. The dynamic configuration for operator 1 is: the functional configuration of PE00, PE02, PE20, and PE22 is multiplication, and that of PE11 is addition. The dynamic configuration for operator 2 is: the functional configuration of PE00, PE02, PE20, and PE22 is multiplication, and that of PE11 is subtraction. Operator 3 corresponds to five dynamic configurations. In four of them, the functional configuration of PE00, PE02, PE20, and PE22 is multiplication and that of PE11 is addition; in the remaining one, the functional configuration of PE00, PE02, PE11, PE20, and PE22 is addition.
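The split between static and dynamic parts can be sketched in Python as a per-PE set difference. The dictionary layout and the helper names `overall_config` and `dynamic_config` are assumptions for illustration; the static configuration below restates the one described for Figure 2-10.

```python
STATIC = {
    "PE00": {"route": "PE01"}, "PE02": {"route": "PE01"},
    "PE20": {"route": "PE21"}, "PE22": {"route": "PE21"},
    "PE01": {"route": "PE11", "func": "ADD"},
    "PE21": {"route": "PE11", "func": "ADD"},
    "PE11": {},
}


def overall_config(static, leaf_func, root_func):
    """Build an operator's overall configuration: the static settings
    plus functions for the four leaf PEs and the final PE."""
    cfg = {pe: dict(fields) for pe, fields in static.items()}
    for pe in ("PE00", "PE02", "PE20", "PE22"):
        cfg[pe]["func"] = leaf_func
    cfg["PE11"]["func"] = root_func
    return cfg


def dynamic_config(overall, static):
    """Keep only the settings the static configuration does not fix.
    Settings present in both are assumed to have the same value."""
    dynamic = {}
    for pe, fields in overall.items():
        fixed = static.get(pe, {})
        rest = {k: v for k, v in fields.items() if k not in fixed}
        if rest:
            dynamic[pe] = rest
    return dynamic


dyn_op1 = dynamic_config(overall_config(STATIC, "MUL", "ADD"), STATIC)
dyn_op2 = dynamic_config(overall_config(STATIC, "MUL", "SUB"), STATIC)

# PEs whose settings actually change when switching operator 1 -> 2.
changed = sorted(pe for pe in dyn_op1.keys() | dyn_op2.keys()
                 if dyn_op1.get(pe) != dyn_op2.get(pe))
```

In this model, switching from operator 1 to operator 2 changes a single field on PE11, while all routing and the functions of PE01 and PE21 remain loaded.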
In some possible implementations, the dynamic configuration further includes the configuration of at least one PE in the PE array other than the N PEs. As an example, Figure 2-11 shows data flow graph 4, which corresponds to operator 4. Based on the isomorphism feature shown in Figure 2-10, the dynamic configuration corresponding to operator 4 may further include the configuration of PE10: the routing configuration of PE10 points to PE11, and the function configuration of PE10 is addition.
205. The processing module stores the mapping relationship between the static configuration and an index number in the storage module.
Optionally, in some feasible implementations, the processing module may generate an index number for the static configuration and store the mapping relationship between the index number and the static configuration in the storage module. In some possible implementations, the storage module stores this mapping relationship in its template lib. For example, given two static configurations, static configuration 1 and static configuration 2, the processing module may generate two index numbers, index number 1 and index number 2, where index number 1 maps to static configuration 1 and index number 2 maps to static configuration 2, and store these mappings in the template lib of the storage module.
As an example, the template lib of the storage module is shown in Table 1:
Table 1
Idx    Cfg
#0     Cfg_template_0
#1     Cfg_template_1
#n     Cfg_template_n
The entries in the Idx column are index numbers, and the entries in the Cfg column are the static configurations of the data links.
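A template lib of this kind behaves like a small lookup table from index number to static configuration. The following Python sketch is illustrative only: the class name `TemplateLib`, its methods, and the choice to reuse an index when an identical template is registered twice are assumptions, not details from the application.

```python
class TemplateLib:
    """Hypothetical model of the template lib: index number -> static configuration."""

    def __init__(self):
        self._templates = []  # position in the list is the index number

    def register(self, static_cfg):
        """Store a static configuration and return its index number.

        Reusing the index of an identical template is an assumption of this sketch.
        """
        for idx, template in enumerate(self._templates):
            if template == static_cfg:
                return idx
        self._templates.append(static_cfg)
        return len(self._templates) - 1

    def lookup(self, idx):
        """Return the static configuration mapped to the given index number."""
        return self._templates[idx]


lib = TemplateLib()
idx_a = lib.register({"PE01": "add", "PE21": "add"})
idx_b = lib.register({"PE01": "sub"})
```

With such a table in place, a configuration word only needs to carry the small index number rather than the full static configuration.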
206. The processing module transmits a configuration word to the PE array.
In some possible implementations, the configuration word includes the static configuration and at least one of the M dynamic configurations.
As an example, Table 2 shows the configuration words corresponding to the 3 operators in this embodiment of the present application.
Table 2
Figure PCTCN2023070594-appb-000006
The static configurations of operator 1, operator 2, and operator 3 are all the same. Operator 1 corresponds to 1 configuration word, operator 2 corresponds to 1 configuration word, and operator 3 corresponds to 5 configuration words. Among the 5 configuration words of operator 3, the first 4 have the same dynamic configuration, and only the 5th has a different dynamic configuration.
In some possible implementations, the configuration word may further include a number of configuration copies, which indicates how many times configuration is performed based on that configuration word. Multiple configuration words that share the same static configuration and the same dynamic configuration can then be abbreviated to a single configuration word, which further reduces transmission overhead. As an example, Table 3 shows the configuration words of the 3 operators in this embodiment of the present application.
Table 3
Figure PCTCN2023070594-appb-000007
The entries in the *num column give the number of configuration copies. Note that every configuration word in Table 3 is different; that is, one configuration word corresponds to one reconfigurable cycle.
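Collapsing consecutive identical configuration words into one word plus a copy count is a form of run-length encoding. The sketch below shows the idea in Python; the tuple layout `(static, dynamic, num)` and the labels `T0`, `dyn_a`, and `dyn_b` are hypothetical stand-ins for the contents of the tables, which are only available as images.

```python
from itertools import groupby


def compress(words):
    """Collapse runs of identical (static, dynamic) words into (static, dynamic, num)."""
    return [(static, dynamic, len(list(run))) for (static, dynamic), run in groupby(words)]


def expand(compressed):
    """Inverse of compress: repeat each word `num` times."""
    return [(static, dynamic) for static, dynamic, num in compressed for _ in range(num)]


# Five words for one operator: the first four share a dynamic configuration.
operator3_words = [("T0", "dyn_a")] * 4 + [("T0", "dyn_b")]
operator3_compressed = compress(operator3_words)
```

Here the five words shrink to two entries, one with a copy count of 4 and one with a copy count of 1, and `expand` recovers the original sequence.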
In some possible implementations, the dynamic configuration includes the configuration of the N PEs other than the static configuration, and further includes the configuration of at least one PE other than the N PEs. The dynamic configuration can therefore also cover PEs outside the N PEs of the PE array, which broadens applicability.
As an example, Figure 2-11 shows data flow graph 4, which corresponds to operator 4. The isomorphism feature shown in Figure 2-5 covers only a local structure of data flow graph 4. The one remaining node can be mapped to PE10, and the configuration of PE10 is the configuration of the at least one PE other than the N PEs. As an example, Table 4-1 or Table 4-2 shows the configuration words of the 3 operators in this embodiment of the present application.
Table 4-1
Figure PCTCN2023070594-appb-000008
Table 4-2
Figure PCTCN2023070594-appb-000009
The dynamic configuration is divided into a Cfg_operation_list part and an other cfg part. The Cfg_operation_list part is the configuration of the N PEs other than the static configuration, and the other cfg part is the configuration of the at least one PE other than the N PEs.
In some possible implementations, the configuration word includes an index number and at least one of the M dynamic configurations. Representing the static configuration by its index number reduces transmission overhead and improves transmission efficiency. As an example, Table 5-1, Table 5-2, Table 5-3, or Table 5-4 shows the configuration words corresponding to the 3 operators in this embodiment of the present application.
Table 5-1
Figure PCTCN2023070594-appb-000010
Table 5-2
Figure PCTCN2023070594-appb-000011
Table 5-3
Figure PCTCN2023070594-appb-000012
Table 5-4
Figure PCTCN2023070594-appb-000013
In some possible implementations, the processing module may transmit the configuration words to the storage module; the storage module then stores the configuration words in its built-in config ram and transmits them to the PE array. In some possible implementations, the transmission module may transmit all configuration words to the storage module at once, and the storage module transmits them to the PE array in sequence according to a given rule, one configuration word at a time. Configuration words that the storage module has received but not yet transmitted to the PE array can be held in the config ram. Because each configuration word carries the index number rather than the static configuration itself, the storage requirement is greatly reduced.
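The buffering behaviour described above can be sketched as a first-in, first-out queue of index-based configuration words. Everything in this Python sketch is an assumption for illustration: the `ConfigWord` fields mirror the idx, Cfg_operation_list, other cfg, and *num columns mentioned in the text, while the `StorageModule` class, its method names, and the FIFO ordering rule are hypothetical (the application only says words are sent "according to a given rule").

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigWord:
    idx: int                   # index number of the static configuration
    cfg_operation_list: tuple  # dynamic configuration of the N PEs
    other_cfg: tuple = ()      # dynamic configuration of PEs outside the N PEs
    num: int = 1               # number of configuration copies


class StorageModule:
    """Hypothetical model: config ram holds words not yet sent to the PE array."""

    def __init__(self):
        self.config_ram = deque()

    def load(self, words):
        """Receive all configuration words from the sender at once."""
        self.config_ram.extend(words)

    def next_word(self):
        """Forward one configuration word (FIFO order assumed), or None if empty."""
        return self.config_ram.popleft() if self.config_ram else None


storage = StorageModule()
storage.load([
    ConfigWord(idx=0, cfg_operation_list=("mul", "mul", "add", "mul", "mul")),
    ConfigWord(idx=0, cfg_operation_list=("mul", "mul", "sub", "mul", "mul"), num=4),
])
first = storage.next_word()
```

Each queued word stores only a small integer for the static part, which is the source of the storage saving the paragraph describes.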
207. The PE array obtains, from the storage module, the static configuration that has a mapping relationship with the index number.
Optionally, in some possible implementations, the cfg buffer in the PE array may obtain the static configuration from the template lib of the storage module based on the index number. As an example, the cfg buffer in the PE array requests the static configuration from the storage module based on the index number; the storage module determines the static configuration from the template lib based on the index number and the mapping relationship, and returns the static configuration to the PE array.
Note that when the PE array receives a new configuration word, it checks the index number in that word. If the index number is the same as the one in the previously received configuration word, the PE array does not need to obtain the static configuration from the storage module; it reuses the static configuration of the previous configuration word and only switches the dynamic configuration, which reduces transmission overhead.
In some possible implementations, as shown in Figure 2-12, the cfg buffer may be divided into two separate storage spaces, storage space 1 and storage space 2, where storage space 1 stores the static configuration and storage space 2 stores the dynamic configuration. As an example, as shown in Figure 2-13, the cfg buffer in the PE array receives configuration word 1, configuration word 2, and configuration word 3, transmitted in sequence by the storage module. Configuration word 1 includes the index number and dynamic configuration dynamic0, configuration word 2 includes the index number and dynamic configuration dynamic1, and configuration word 3 includes the index number and dynamic configuration dynamic2. When the cfg buffer receives configuration word 1, it obtains the static configuration from the template lib in the storage module based on the index number, and stores the static configuration and dynamic configuration dynamic0. When the cfg buffer receives configuration word 2, it can determine that the index number in configuration word 2 is the same as the index number in configuration word 1, so it does not need to obtain the static configuration (static) from the storage module again; instead, it switches dynamic configuration dynamic0 to dynamic configuration dynamic1 from configuration word 2. When the cfg buffer receives configuration word 3, it can determine that the index number in configuration word 3 is the same as the index number in configuration word 2, so it again does not need to obtain the static configuration (static) from the storage module; instead, it switches dynamic configuration dynamic1 to dynamic configuration dynamic2 from configuration word 3. Because only the dynamic configuration is switched and the static configuration is left in place, the switching overhead is reduced.
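The two-space cfg buffer and its "fetch the static configuration only when the index number changes" rule can be modelled in a few lines of Python. This is a sketch under stated assumptions: the class `CfgBuffer`, its attribute names, the `static_fetches` counter, and the use of a plain dictionary as the template lib are all hypothetical.

```python
class CfgBuffer:
    """Hypothetical model of a cfg buffer with separate static and dynamic spaces."""

    def __init__(self, template_lib):
        self.template_lib = template_lib  # index number -> static configuration
        self.current_idx = None
        self.static = None    # storage space 1
        self.dynamic = None   # storage space 2
        self.static_fetches = 0  # counts lookups in the template lib

    def receive(self, idx, dynamic):
        """Apply a configuration word made of an index number and a dynamic part."""
        if idx != self.current_idx:
            # Index changed: fetch the static configuration from the template lib.
            self.static = self.template_lib[idx]
            self.current_idx = idx
            self.static_fetches += 1
        # The dynamic configuration is always replaced.
        self.dynamic = dynamic


buffer = CfgBuffer({0: "static"})
for dyn in ("dynamic0", "dynamic1", "dynamic2"):
    buffer.receive(0, dyn)
```

In this run the three configuration words share index number 0, so the static configuration is fetched once while the dynamic configuration is switched on every word.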
208. The PE array is configured based on the static configuration and at least one of the M dynamic configurations.
As an example, when the chip executes a first operator among the M operators, the PE array is configured based on the static configuration and a first dynamic configuration corresponding to the first operator, where the first dynamic configuration is one of the M dynamic configurations. When the chip executes a second operator among the M operators, the PE array switches the first dynamic configuration to a second dynamic configuration corresponding to the second operator, where the second operator is executed after the first operator has finished and the second dynamic configuration is one of the M dynamic configurations. Because only the dynamic configuration is switched and the static configuration is left in place, the switching overhead is reduced.
As an example, operator 1, operator 2, and operator 3 correspond to the same static configuration. When the PE array executes operator 1, operator 2, and operator 3 one after another in any order, the cfg switcher in the processing module only needs to switch the dynamic configuration in the cfg buffer of the PE array and does not need to switch the static configuration, which saves switching overhead.
Note that, for brevity, each of the foregoing method embodiments is described as a series of combined actions. Those skilled in the art should understand, however, that the present application is not limited by the described order of actions, because according to the present application some steps may be performed in another order or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions and modules involved are not necessarily required by the present application.
To better implement the foregoing solutions of the embodiments of the present application, related apparatuses for implementing them are provided below.
Referring to Figure 3, a chip 300 provided by an embodiment of the present application includes:
a processing module 310 and a PE array 320, where
the processing module 310 is configured to: generate an isomorphism feature of M operators, where the isomorphism feature corresponds to the same local structure among the M operators and M is a positive integer; determine a static configuration of N PEs in the PE array based on the isomorphism feature, where N is a positive integer; and determine M dynamic configurations based on the static configuration and the overall configurations of the M operators in the PE array, where a dynamic configuration is the configuration in the overall configuration other than the static configuration; and
the PE array 320 is configured to be configured based on the static configuration and at least one of the M dynamic configurations.
In some possible implementations, the PE array 320 is specifically configured to: when a first operator among the M operators is executed, be configured based on the static configuration and a first dynamic configuration corresponding to the first operator, where the first dynamic configuration is one of the M dynamic configurations; and when a second operator among the M operators is executed, switch the first dynamic configuration to a second dynamic configuration corresponding to the second operator, where the second operator is executed after the first operator has finished and the second dynamic configuration is one of the M dynamic configurations.
In some possible implementations, the processing module 310 is configured to obtain M data flow graphs corresponding to the M operators and extract the isomorphism feature from the M data flow graphs.
In some possible implementations, the chip 300 further includes a storage module 330. The processing module 310 is further configured to transmit the mapping relationship between the static configuration and an index number to the storage module 330, and to transmit a configuration word to the PE array, where the configuration word includes the index number and at least one of the M dynamic configurations. The PE array 320 is further configured to obtain the static configuration from the storage module 330 based on the index number.
Note that the information exchange between the modules/units of the foregoing apparatus, their execution processes, and related content are based on the same concept as the method embodiments of the present application and bring the same technical effects. For details, refer to the description in the foregoing method embodiments; they are not repeated here.
An embodiment of the present application further provides a computer storage medium that stores a program, and the program, when executed, performs some or all of the steps described in the foregoing method embodiments.
An embodiment of the present application further provides a computer program product that stores a program, and the program, when executed, performs some or all of the steps described in the foregoing method embodiments.
Another communication device provided by an embodiment of the present application is described next. Referring to Figure 4, the communication device 400 includes:
a receiver 401, a transmitter 402, a processor 403, and a memory 404. In some embodiments of the present application, the receiver 401, the transmitter 402, the processor 403, and the memory 404 may be connected through a bus or in another manner; Figure 4 uses a bus connection as an example.
The memory 404 may include read-only memory and random access memory, and provides instructions and data to the processor 403. A portion of the memory 404 may further include non-volatile random access memory (NVRAM). The memory 404 stores an operating system and operating instructions, executable modules, or data structures, or a subset or an extended set thereof, where the operating instructions may include various instructions for implementing various operations. The operating system may include various system programs for implementing basic services and handling hardware-based tasks.
The processor 403 controls the operation of the communication device 400 and may also be referred to as a central processing unit (CPU). In a specific application, the components of the communication device 400 are coupled together through a bus system, which may include a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity, however, the various buses are all labeled as the bus system in the figure.
The methods disclosed in the foregoing embodiments of the present application may be applied to the processor 403 or implemented by the processor 403. The processor 403 may include the chip shown in Figure 3. The steps of the methods disclosed in the embodiments of the present application may be performed directly by a hardware decoding processor, or by a combination of hardware and software modules in the decoding processor. The software modules may be located in a storage medium that is mature in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. The storage medium is located in the memory 404; the processor 403 reads the information in the memory 404 and completes the steps of the foregoing methods in combination with its hardware.
The receiver 401 may be used to receive input digit or character information and to generate signal input related to the settings and function control of the communication device 400. The transmitter 402 may include a display device such as a display screen, and may be used to output digit or character information through an external interface.
In this embodiment of the present application, the processor 403 is configured to perform the foregoing method for configuring a processing element (PE) array, as performed by the communication device 400.
Note also that the apparatus embodiments described above are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the apparatus embodiments provided in the present application, a connection between modules indicates that they have a communication connection, which may be implemented as one or more communication buses or signal lines.
The technical solution of the present application, in essence or in the part that contributes to the prior art, may be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a computer floppy disk, USB flash drive, removable hard disk, ROM, RAM, magnetic disk, or optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present application.
The foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (such as infrared, radio, or microwave). The computer-readable storage medium may be any available medium that a computer can store data on, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).

Claims (15)

  1. A method for configuring a processing element (PE) array, applied to a chip that includes a processing module and a PE array, the method comprising:
    generating, by the processing module, an isomorphism feature of M operators, where the isomorphism feature corresponds to the same local structure among the M operators and M is a positive integer;
    determining, by the processing module, a static configuration of N PEs in the PE array based on the isomorphism feature, where N is a positive integer;
    determining, by the processing module, M dynamic configurations based on the static configuration and the overall configurations of the M operators in the PE array, where a dynamic configuration includes the configuration in the overall configuration other than the static configuration; and
    configuring the PE array based on the static configuration and at least one of the M dynamic configurations.
  2. The method according to claim 1, wherein configuring the PE array based on the static configuration and at least one of the M dynamic configurations comprises:
    when a first operator among the M operators is executed, configuring the PE array based on the static configuration and a first dynamic configuration corresponding to the first operator, where the first dynamic configuration is one of the M dynamic configurations; and
    when a second operator among the M operators is executed, switching, by the PE array, the first dynamic configuration to a second dynamic configuration corresponding to the second operator, where the second operator is executed after the first operator has finished and the second dynamic configuration is one of the M dynamic configurations.
  3. The method according to claim 1 or 2, wherein before the processing module generates the isomorphism feature of the M operators, the method further comprises:
    obtaining, by the processing module, M data flow graphs corresponding to the M operators;
    and wherein generating, by the processing module, the isomorphism feature of the M operators comprises:
    extracting, by the processing module, the isomorphism feature from the M data flow graphs.
  4. The method according to claims 1-3, wherein the chip further includes a storage module, and before the PE array is configured based on the static configuration and at least one of the M dynamic configurations, the method further comprises:
    storing, by the storage module, a mapping relationship between the static configuration and an index number;
    transmitting, by the processing module, a configuration word to the PE array, where the configuration word includes the index number and at least one of the M dynamic configurations; and
    obtaining, by the PE array from the storage module, the static configuration that has the mapping relationship with the index number.
  5. The method according to claim 4, wherein the configuration word further includes a number of configuration copies, which indicates the number of times configuration is performed based on the configuration word.
  6. The method according to any one of claims 1 to 5, wherein the isomorphism feature includes a routing configuration of each of N nodes, any two of the N nodes being directly or indirectly connected; and the static configuration includes the routing configurations of the N PEs.
  7. The method according to claim 6, wherein the isomorphism feature further includes a function configuration of at least one of the N nodes; and the static configuration further includes a function configuration of at least one of the N PEs.
  8. The method according to any one of claims 1 to 7, wherein the dynamic configuration further includes a configuration of at least one PE other than the N PEs.
  9. A chip, comprising:
    a processing module and a PE array;
    the processing module is configured to: generate an isomorphism feature of M operators, where the isomorphism feature corresponds to an identical local structure shared among the M operators, and M is a positive integer; determine a static configuration of N PEs in the PE array according to the isomorphism feature, where N is a positive integer; and determine M dynamic configurations based on the static configuration and the overall configuration of the M operators in the PE array, where the dynamic configuration includes the configuration in the overall configuration other than the static configuration;
    the PE array is configured to: perform configuration based on the static configuration and at least one of the M dynamic configurations.
  10. The chip according to claim 9, wherein the PE array is specifically configured to:
    when executing a first operator of the M operators, perform configuration based on the static configuration and a first dynamic configuration corresponding to the first operator, where the first dynamic configuration is one of the M dynamic configurations; and
    when executing a second operator of the M operators, switch the first dynamic configuration to a second dynamic configuration corresponding to the second operator, where the second operator is an operator executed after execution of the first operator is complete, and the second dynamic configuration is one of the M dynamic configurations.
  11. The chip according to claim 9 or 10, wherein
    the processing module is further configured to: obtain M data flow graphs corresponding to the M operators, and extract the isomorphism feature from the M data flow graphs.
  12. The chip according to claims 9 to 11, further comprising: a storage module;
    the storage module is configured to: store a mapping relationship between the index number and the static configuration;
    the processing module is further configured to: transmit a configuration word to the PE array, where the configuration word includes the index number and at least one of the M dynamic configurations;
    the PE array is further configured to: obtain, from the storage module, the static configuration that has the mapping relationship with the index number.
  13. A computer-readable storage medium, wherein the computer-readable storage medium stores a program, and the program causes a computer device to perform the method according to any one of claims 1 to 8.
  14. A computer program product, wherein the computer program product includes computer-executable instructions, and the computer-executable instructions are stored in a computer-readable storage medium; a processor of a device reads the computer-executable instructions from the computer-readable storage medium, and the processor executes the computer-executable instructions to cause the device to perform the method according to any one of claims 1 to 8.
  15. A communication device, wherein the communication device includes a processor, a memory, and a communication interface;
    the processor is coupled to the memory and the communication interface;
    the memory is configured to store instructions, the processor is configured to execute the instructions, and the communication interface is configured to communicate with other communication devices under the control of the processor;
    the instructions, when executed by the processor, cause the processor to perform the method according to any one of claims 1 to 8.
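
As an informal illustration of the chip claims above (claims 9, 10 and 12), the following Python sketch shows one way a shared static configuration could be separated from per-operator dynamic configurations, and how a configuration word carrying an index number could be applied to a PE array. It is not the claimed implementation. The names `split_configs`, `ConfigWord` and `PEArray`, the string-valued per-PE settings, and the `writes` counter are all hypothetical. In particular, the static part is approximated here as the per-PE settings that are identical in every overall configuration, which stands in for the isomorphism feature that the claims derive from data flow graphs.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical model: a configuration maps a PE identifier to a setting string.
Config = dict[str, str]


def split_configs(overall: list[Config]) -> tuple[Config, list[Config]]:
    """Split M overall configurations into a shared static part and M dynamic parts.

    Assumption: the static part is every PE setting that is identical in all
    M overall configurations; each dynamic part is whatever remains.
    """
    first = overall[0]
    static = {pe: val for pe, val in first.items()
              if all(cfg.get(pe) == val for cfg in overall[1:])}
    dynamic = [{pe: val for pe, val in cfg.items() if pe not in static}
               for cfg in overall]
    return static, dynamic


@dataclass
class ConfigWord:
    index: int       # index number mapped to a stored static configuration
    dynamic: Config  # dynamic configuration of one operator


class PEArray:
    def __init__(self, static_store: dict[int, Config]):
        self.static_store = static_store      # storage module: index -> static config
        self.loaded_index: Optional[int] = None
        self.state: Config = {}
        self.writes = 0                       # per-PE settings written so far

    def configure(self, word: ConfigWord) -> None:
        static = self.static_store[word.index]
        if word.index != self.loaded_index:
            # The static part is written only when a different index is requested.
            self.writes += len(static)
            self.loaded_index = word.index
        # Switching operators replaces only the dynamic part.
        self.state = {**static, **word.dynamic}
        self.writes += len(word.dynamic)


# Two operators (M = 2) that share routing on pe0 and pe1.
overall = [
    {"pe0": "route:E", "pe1": "route:S", "pe2": "op:add"},
    {"pe0": "route:E", "pe1": "route:S", "pe2": "op:mul", "pe3": "op:relu"},
]
static, dynamic = split_configs(overall)

array = PEArray({0: static})
array.configure(ConfigWord(index=0, dynamic=dynamic[0]))  # first operator
state_after_first = dict(array.state)
array.configure(ConfigWord(index=0, dynamic=dynamic[1]))  # switch to second operator
```

In this toy run the two routing settings are written once, and switching to the second operator writes only its two dynamic settings, so five settings are written in total rather than the seven that two full reconfigurations would need.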
PCT/CN2023/070594 2022-03-17 2023-01-05 Configuration method for processing element (pe) array and related device WO2023173912A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210264327.4A CN116822595A (en) 2022-03-17 2022-03-17 Configuration method of processing unit PE array and related equipment
CN202210264327.4 2022-03-17

Publications (1)

Publication Number Publication Date
WO2023173912A1 (en) 2023-09-21

Family

ID=88022168

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/070594 WO2023173912A1 (en) 2022-03-17 2023-01-05 Configuration method for processing element (pe) array and related device

Country Status (2)

Country Link
CN (1) CN116822595A (en)
WO (1) WO2023173912A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107402745A (en) * 2017-07-04 2017-11-28 清华大学 The mapping method and device of DFD
US20210092174A1 (en) * 2019-09-23 2021-03-25 Netapp, Inc. Methods for dictionary-based compression and devices thereof

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YIN CHONG-YONG, YIN SHOU-YI, WEI SHAO-JUN: "Optimization of configuration contexts generated for reconfigurable media processor ", JOURNAL OF JILIN UNIVERSITY (ENGINEERING AND TECHNOLOGY EDITION), vol. 42, no. 04, 1 January 2012 (2012-01-01), pages 1059 - 1065, XP093090367 *
YIN SHOUYI, YIN CHONGYONG, LIU LEIBO, ZHU MIN, WEI SHAOJUN: "Configuration Context Reduction for Coarse-Grained Reconfigurable Architecture", IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, INFORMATION & SYSTEMS SOCIETY, TOKYO., JP, vol. E95-D, no. 2, 1 January 2012 (2012-01-01), JP , pages 335 - 344, XP093090377, ISSN: 0916-8532, DOI: 10.1587/transinf.E95.D.335 *

Also Published As

Publication number Publication date
CN116822595A (en) 2023-09-29

Similar Documents

Publication Publication Date Title
US10938889B2 (en) Performing optimized collective operations in a irregular subcommunicator of compute nodes in a parallel computer
US20080022079A1 (en) Executing an allgather operation with an alltoallv operation in a parallel computer
US11398981B2 (en) Path creation method and device for network on chip and electronic apparatus
US11789733B2 (en) Instruction processing apparatus, acceleration unit, and server
US9246792B2 (en) Providing point to point communications among compute nodes in a global combining network of a parallel computer
Devanathan et al. Congestion-aware wireless network-on-chip for high-speed communication
Musha et al. Deep learning on high performance FPGA switching boards: Flow-in-cloud
Touzene On all-to-all broadcast in dense Gaussian network on-chip
US10476492B2 (en) Structures and operations of integrated circuits having network of configurable switches
US9390054B2 (en) Identifying a largest logical plane from a plurality of logical planes formed of compute nodes of a subcommunicator in a parallel computer
US9330059B2 (en) Identifying logical planes formed of compute nodes of a subcommunicator in a parallel computer
US8296457B2 (en) Providing nearest neighbor point-to-point communications among compute nodes of an operational group in a global combining network of a parallel computer
US9769112B2 (en) Optimising data transmission in a hypercube network
WO2023173912A1 (en) Configuration method for processing element (pe) array and related device
KR102238600B1 (en) Scheduler computing device, data node of distributed computing system having the same, and method thereof
US20220343144A1 (en) Server and accelerator for neural network computations
US9367329B2 (en) Initialization of multi-core processing system
US11223703B2 (en) Instruction initialization in a dataflow architecture
Larson et al. The möbius cubes
Rettkowski et al. Application-specific processing using high-level synthesis for networks-on-chip
US11954053B2 (en) Integrating buffer views into buffer access operations in a coarse-grained reconfigurable computing environment
Touzene All-to-all broadcast in hexagonal torus networks on-chip
Yang et al. Ray tracing on a networked processor array
WO2022029926A1 (en) Computer system and computation processing method
JP6665607B2 (en) Communication management method, communication management program, and information processing device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23769423

Country of ref document: EP

Kind code of ref document: A1