WO2023173912A1

WO2023173912A1 - Configuration method for processing element (pe) array and related device

Info

Publication number: WO2023173912A1
Application number: PCT/CN2023/070594
Authority: WO
Inventors: 张鑫; 蔡兆晖; 何雷骏; 邵芳琳
Original assignee: 华为技术有限公司
Priority date: 2022-03-17
Filing date: 2023-01-05
Publication date: 2023-09-21
Also published as: CN116822595A

Abstract

Embodiments of the present application disclose a configuration method for a processing element (PE) array and a related device, which are used for configuring a PE array. The present application may be applied to a chip, and the chip comprises a processing module and a PE array. The processing module generates homogeneous features of M operators, then determines a static configuration of N PEs in the PE array according to the homogeneous features, and determines M dynamic configurations on the basis of the static configuration and an overall configuration of the M operators in the PE array, the dynamic configurations being other configurations in the overall configuration except for the static configuration. Therefore, the PE array may be configured on the basis of the static configuration and one of the M dynamic configurations, without switching the static configuration of the PE array, thereby reducing the switching overhead.

Description

Configuration method for a processing element (PE) array and related device

This application claims priority to Chinese patent application No. 202210264327.4, filed with the China Patent Office on March 17, 2022 and entitled "A configuration method and related equipment for a processing unit PE array", the entire content of which is incorporated herein by reference.

Technical field

The present application relates to the field of chip technology, and in particular to a configuration method of a processing element (PE) array and related equipment.

Background

A coarse-grained reconfigurable array (CGRA) chip is a new generation of programmable acceleration architecture that combines the flexibility of a field programmable gate array (FPGA) chip with the high energy efficiency of an application-specific integrated circuit (ASIC) chip. The PE array in the CGRA chip is configured through configuration words, allowing the CGRA chip to execute the corresponding algorithm.

Currently, the CGRA chip can configure the PE array according to the program's operators to obtain the configured PE array. Then, when the CGRA chip executes the operator, it only needs to transmit the service data to the PE array without switching the configuration of the PE array. The PE array can execute the operator based on the service data.

However, in a CGRA chip, due to the limited number of PEs in the PE array, multiple different operators often need to multiplex the same one or more PEs in the PE array. When the CGRA chip executes operator 1 and then executes operator 2, the configuration of one or more multiplexed PEs needs to be switched. The switching overhead is high, which restricts the further improvement of the performance of the CGRA chip.

Summary of the invention

Embodiments of the present application provide a PE array configuration method and related equipment for configuring the PE array.

The first aspect of this application provides a method for configuring a processing element (PE) array, which can be applied to a chip. The chip includes a processing module and a PE array. The processing module generates the isomorphism features of M operators, where M is a positive integer. It then determines the static configuration of N PEs in the PE array based on the isomorphism features, where N is a positive integer, and determines M dynamic configurations based on the static configuration and the overall configuration of the M operators in the PE array. Each dynamic configuration includes the configurations in the overall configuration other than the static configuration. The PE array can then be configured based on the static configuration and one of the M dynamic configurations. That is, only the dynamic configuration needs to be switched and the static configuration does not, which reduces the switching overhead.

In some possible implementations, the step in which the PE array is configured based on the static configuration and at least one of the M dynamic configurations may include the following. When executing a first operator among the M operators, the PE array is configured based on the static configuration and a first dynamic configuration corresponding to the first operator, the first dynamic configuration being one of the M dynamic configurations. When executing a second operator among the M operators, the PE array switches the first dynamic configuration to a second dynamic configuration corresponding to the second operator, where the second operator is an operator executed after the first operator and the second dynamic configuration is one of the M dynamic configurations. The PE array therefore only needs to switch its dynamic configuration and does not need to switch the static configuration, which reduces the switching overhead.
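The switching behaviour described above can be illustrated with a minimal sketch. The class name `PEArraySketch`, the `(PE, field)` key encoding and the write counter are illustrative assumptions and are not defined in this application; the sketch only shows that, once the static configuration has been written, each operator change rewrites just the dynamic entries.

```python
class PEArraySketch:
    """Toy model (assumption): a configuration is a dict of (PE, field) -> value."""

    def __init__(self, static):
        self.active = dict(static)   # the static part is written once
        self.dynamic_keys = set()    # keys owned by the current dynamic configuration
        self.writes = len(static)    # number of configuration entries written so far

    def switch_dynamic(self, dynamic):
        # Remove entries of the previous dynamic configuration, then write the new one.
        for key in self.dynamic_keys - set(dynamic):
            del self.active[key]
        self.active.update(dynamic)
        self.dynamic_keys = set(dynamic)
        self.writes += len(dynamic)


# Hypothetical 7-PE example: all routes and six functions are static,
# and only the function of PE "g" differs between the two operators.
static = {(pe, "route"): dst for pe, dst in
          [("a", "e"), ("b", "e"), ("c", "f"), ("d", "f"),
           ("e", "g"), ("f", "g"), ("g", "out")]}
static.update({(pe, "func"): "MUL" for pe in "abcd"})
static.update({("e", "func"): "ADD", ("f", "func"): "ADD"})

array = PEArraySketch(static)
array.switch_dynamic({("g", "func"): "ADD"})  # first operator
array.switch_dynamic({("g", "func"): "SUB"})  # second operator

# For comparison: rewriting the overall configuration for each operator.
full_reload_writes = 2 * (len(static) + 1)
print(array.writes, full_reload_writes)
```

In this toy accounting, the second operator costs one additional entry write rather than a rewrite of every entry.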

In some possible implementations, before the processing module performs the step of generating the isomorphism features of the M operators, the method may also include: the processing module obtains M data flow graphs corresponding to the M operators. The processing module can then extract the isomorphism features from the M data flow graphs, thereby obtaining the isomorphism features of the M operators.

In some possible implementations, the chip further includes a storage module. Before the PE array is configured based on the static configuration and at least one of the M dynamic configurations, the method may also include: the storage module stores the mapping relationship between the static configuration and an index number; the processing module transmits a configuration word to the PE array, where the configuration word includes the index number and at least one of the M dynamic configurations; and the PE array obtains, from the storage module, the static configuration that has a mapping relationship with the index number. When the processing module transmits multiple configuration words, it does not need to transmit the static configuration directly but replaces it with the index number, which reduces the transmission overhead and improves the transmission efficiency of the configuration words.

In some possible implementations, the configuration word also includes a configuration number, which indicates the number of times configuration is performed based on the configuration word. Multiple configuration words with the same static configuration and the same dynamic configuration can thus be abbreviated as one configuration word, further reducing transmission overhead and improving transmission efficiency.
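As a hedged illustration of the configuration number, the following sketch merges consecutive configuration words that share the same index number and the same dynamic configuration. The `ConfigWord` fields and the `compress` helper are assumptions made for this example. The application does not specify the word layout, nor whether only consecutive words are merged; the sketch merges only consecutive runs so that execution order is preserved.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class ConfigWord:
    index: int       # index number of the static configuration
    dynamic: tuple   # dynamic configuration (hypothetical encoding)
    count: int = 1   # configuration number: how many times to configure


def compress(words):
    """Merge consecutive words with equal (index, dynamic) into one word."""
    merged = []
    for (index, dynamic), run in groupby(words, key=lambda w: (w.index, w.dynamic)):
        merged.append(ConfigWord(index, dynamic, sum(w.count for w in run)))
    return merged


words = [
    ConfigWord(0, ("ADD",)),
    ConfigWord(0, ("ADD",)),
    ConfigWord(0, ("SUB",)),
    ConfigWord(0, ("ADD",)),
]
compressed = compress(words)
print(compressed)
```

Here four words become three: the first two are abbreviated into a single word whose `count` is 2.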

In some possible implementations, the isomorphism feature includes the routing configuration of each of N nodes, any two of the N nodes are directly or indirectly connected, and the static configuration includes the routing configuration of the N PEs. The routing configuration of the N PEs then does not need to be modified, which reduces the switching overhead.

In some possible implementations, the isomorphism feature also includes the functional configuration of at least one of the N nodes, and the static configuration also includes the functional configuration of at least one of the N PEs. The functional configuration of that at least one PE then does not need to be modified, which reduces the switching overhead.

In some possible implementations, the chip also includes a MEM interface. The MEM interface can obtain the source code of the program and transmit it to the processing module, so that the processing module can generate the data flow graphs of the M operators based on the source code of the program, obtaining M data flow graphs.

In some possible implementations, the processing module can extract the isomorphism feature from a single data flow graph, where the isomorphism feature is a local structure that occurs at least twice in the data flow graph. Reusing the N PEs in the PE array that correspond to the isomorphism feature reduces the number of required PEs and enhances usability.

In some possible implementations, the dynamic configuration also includes the configuration of at least one PE other than the N PEs. The configuration word can then also be applied to operators that cannot be composed of an integer number of isomorphism features, which enhances the applicability of the configuration word.

In some possible implementations, the isomorphism features can also be local structures of multiple different granularities, so that the processing module can determine isomorphism features of different granularities as needed in different situations, which reduces the number of required PEs and enhances usability.

In some possible implementations, the storage module includes a configuration random access memory (config RAM) and a static configuration template library (template lib). The config RAM is used to store configuration words, and the template lib is used to store the mapping relationship between index numbers and static configurations. If more than one configuration word has the same static configuration, only one copy of the static configuration needs to be stored in the template lib, and only the index number needs to be stored in the config RAM. Compared with storing the static configuration for each configuration word, this reduces storage overhead.

In some possible implementations, the M operators can be all of the operators in a program or only some of them, so that the chip can determine one or more different isomorphism features for a program as needed. This suits programs whose operators do not all share one suitable isomorphism feature, enhancing applicability.

A second aspect of the present application provides a chip, which is used to perform the method described in any one of the foregoing first aspects.

A third aspect of the present application provides a computer-readable storage medium. The computer-readable storage medium stores instructions that, when run on a computer, cause the computer to execute the method described in any one of the above-mentioned first aspects.

A fourth aspect of the present application provides a computer program product. The computer program product includes computer-executable instructions stored in a computer-readable storage medium. The processor of a device can read the computer-executable instructions from the computer-readable storage medium and execute them, causing the device to implement the method provided by the above-mentioned first aspect or any possible implementation of the first aspect.

A fifth aspect of the present application provides a communication device, which may include a processor, a memory, and a communication interface. The processor is coupled to memory and communication interfaces. The memory is used to store instructions, the processor is used to execute the instructions, and the communication interface is used to communicate with other communication devices under the control of the processor. When executed by the processor, the instruction causes the processor to execute the method of the first aspect or any possible implementation of the first aspect.

Among them, the technical effects brought by the second to fifth aspects or any one of the possible implementation methods can be referred to the technical effects brought by the first aspect or different possible implementation methods of the first aspect, and will not be described again here.

Description of the drawings

Figure 1-1 is a schematic diagram of the structure of the PE array;

Figure 1-2 is a schematic diagram of the data link in the embodiment of this application;

Figure 1-3 is a schematic diagram of the data link in the embodiment of this application;

Figure 1-4 is a schematic diagram of a chip provided by an embodiment of the present application;

Figure 2-1 is a schematic flowchart of Embodiment 1 of a PE array configuration method provided by the embodiment of this application;

Figure 2-2 is a schematic diagram of data flow diagram 1 in the embodiment of the present application;

Figure 2-3 is a schematic diagram of data flow diagram 2 in the embodiment of the present application;

Figure 2-4 is a schematic diagram of data flow diagram 3 in the embodiment of the present application;

Figure 2-5 is a schematic diagram of the isomorphism feature in the embodiment of the present application;

Figure 2-6 is a schematic diagram of data flow diagram 3 divided into multiple local structures in the embodiment of the present application;

Figure 2-7 is another schematic diagram of the isomorphism feature in the embodiment of the present application;

Figure 2-8 is another schematic diagram of the isomorphism feature in the embodiment of the present application;

Figure 2-9 is another schematic diagram of the isomorphism feature in the embodiment of the present application;

Figure 2-10 is a schematic diagram of the static configuration in the embodiment of the present application;

Figure 2-11 is a schematic diagram of data flow diagram 4 in the embodiment of the present application;

Figure 2-12 is a schematic diagram of dividing the configuration buffer (cfg buffer) into two separate storage spaces in the embodiment of the present application;

Figure 2-13 is a schematic diagram of the cfg buffer sequentially receiving three configuration words transmitted by the configuration random access memory (config RAM) in the embodiment of the present application;

Figure 3 is a schematic structural diagram of a PE array configuration device provided by an embodiment of the present application;

Figure 4 is a schematic structural diagram of a communication device provided by an embodiment of the present application.

Detailed description

Embodiments of the present application provide a PE array configuration method and related equipment for configuring the PE array of processing units in a CGRA chip.

The embodiments of the present application are described below with reference to the accompanying drawings.

The terms "first", "second", etc. in the description and claims of this application and the above-mentioned drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that the terms so used are interchangeable under appropriate circumstances, and are merely a way of distinguishing objects with the same attributes in describing the embodiments of the present application. Furthermore, the terms "include" and "have" and any variations thereof are intended to cover non-exclusive inclusion, such that a process, method, system, product or apparatus comprising a series of elements is not necessarily limited to those elements, but may include other elements that are not explicitly listed or that are inherent to such a process, method, product or apparatus.

A CGRA chip is a new generation of programmable acceleration architecture that combines the flexibility of FPGA chips with the high energy efficiency of ASIC chips. The CGRA chip has a built-in PE array, which includes multiple PEs and is used to execute algorithms. It should be noted that a PE is composed of multiple logic gates and is used to perform corresponding operations, such as addition, subtraction, multiplication and division. Users can configure at least one PE of the PE array in the CGRA chip through a configuration word, so that the CGRA chip can execute the corresponding algorithm.

For example, please refer to Figure 1-1, which shows a 3×3 PE array in a CGRA chip. Each element can be expressed as PEij (i=0,1,2; j=0,1,2), and the arrows indicate the direction of direct data transfer between connected PEs. For example, the connected PE00 and PE01 can transfer data directly to each other. When the user configures at least one PE in the PE array, a corresponding data link is formed, which can be used to execute the corresponding algorithm, for example, the data link shown in Figure 1-2.

Currently, the CGRA chip can configure the PE array according to the operator of the program and obtain the data link corresponding to the operator. The data link includes the configuration of multiple PEs in the PE array. Then, when the CGRA chip executes the operator, it only needs to transmit the service data to the data link corresponding to the operator without switching the configuration of the PE in the data link.

For example, the data link corresponding to operator 2 is shown in Figure 1-3, where SUB denotes subtraction. The only difference is that PE11 changes from addition (ADD) to subtraction (SUB), yet the CGRA chip still needs to perform an overall configuration switch on the N PEs.

To this end, this application proposes a PE array configuration method and related equipment for configuring the PE array.

This application can be applied to a chip that includes a processing module and a PE array. The processing module generates the isomorphism features of M operators, where M is a positive integer. It then determines the static configuration of N PEs in the PE array based on the isomorphism features, where N is a positive integer, and determines M dynamic configurations based on the static configuration and the overall configuration of the M operators in the PE array. Each dynamic configuration consists of the configurations in the overall configuration other than the static configuration. The PE array can then be configured based on the static configuration and one of the M dynamic configurations. That is, only the dynamic configuration needs to be switched and the static configuration does not, which reduces the switching overhead.

Illustratively, the present application can be applied to the chip 100 shown in Figure 1-4, where the chip 100 includes a memory (MEM) interface 110, a processing module 120, a storage module 130 and a PE array 140. It should be noted that the chip 100 can be an FPGA chip, a CGRA chip, or another reconfigurable chip, which is not limited here.
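The relationship between the overall, static and dynamic configurations can be sketched as a set operation. In the sketch below, an overall configuration is modelled as a dictionary keyed by `(PE, field)`; this encoding, the `split_configs` helper and the `tree_config` example are assumptions for illustration only. The static configuration is taken here as the set of entries shared by all M overall configurations, whereas in the application it is derived from the isomorphism features.

```python
from functools import reduce


def split_configs(overall):
    """Return (static, dynamics) for a list of M overall configurations.

    static:   entries identical in every overall configuration
    dynamics: for each operator, the entries of its overall configuration
              that are not part of the static configuration
    """
    shared = reduce(lambda a, b: a & b, (set(o.items()) for o in overall))
    static = dict(shared)
    dynamics = [{k: v for k, v in o.items() if k not in static} for o in overall]
    return static, dynamics


def tree_config(root_func):
    # Hypothetical overall configuration of a 7-PE tree a..g.
    routes = {"a": "e", "b": "e", "c": "f", "d": "f", "e": "g", "f": "g", "g": "out"}
    funcs = {"a": "MUL", "b": "MUL", "c": "MUL", "d": "MUL",
             "e": "ADD", "f": "ADD", "g": root_func}
    cfg = {(pe, "route"): dst for pe, dst in routes.items()}
    cfg.update({(pe, "func"): fn for pe, fn in funcs.items()})
    return cfg


static, dynamics = split_configs([tree_config("ADD"), tree_config("SUB")])
print(len(static), dynamics)
```

With M=2 operators that differ only in the function of one PE, each dynamic configuration reduces to a single entry.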

The MEM interface 110 is an interface through which internal devices of the chip 100 interact with external devices. For example, the MEM interface 110 can receive the source code and business data of the program from a device external to the chip 100, and transmit the source code of the program to the processing module 120 and the business data to the storage module 130.

The processing module 120 may have a built-in compiler 121, where the compiler 121 is a logic module. The compiler 121 can be used to: generate the isomorphism features of M operators based on the source code of the program, determine the static configuration of N PEs in the PE array based on the isomorphism features, and determine M dynamic configurations based on the static configuration and the overall configuration of the M operators in the PE array, where each dynamic configuration includes the configurations in the overall configuration other than the static configuration. The compiler 121 may store the static configuration and at least one of the M dynamic configurations in the storage module 130. In some possible implementations, the compiler 121 can forward the static configuration and at least one of the M dynamic configurations to the storage module 130 through the MEM interface 110. In some possible implementations, the compiler 121 can also be directly connected to the storage module 130, thereby forwarding the static configuration and at least one of the M dynamic configurations to the storage module 130 directly.

The storage module 130 may be a random access memory (RAM) built into the chip 100. The storage module 130 may transmit the static configuration and at least one of the M dynamic configurations to the PE array 140, so that the PE array 140 is configured based on the static configuration and at least one of the M dynamic configurations.

The PE array 140 has a built-in configuration buffer (Cfg buffer) 141. The Cfg buffer 141 can be used to receive the static configuration and at least one of the M dynamic configurations transmitted by the storage module 130, so that the PE array 140 is configured based on the static configuration and one of the M dynamic configurations. The processing module 120 also includes a configuration switcher (Cfg switcher) 122, which can be used to switch the dynamic configuration in the Cfg buffer 141.

In some possible implementations, the compiler 121 also stores the mapping relationship between the static configuration and the index number in the storage module 130, and then transmits the configuration word to the PE array 140. The configuration word includes the index number and at least one of the M dynamic configurations. It should be noted that the compiler 121 can send the configuration word to the storage module 130, which then forwards the configuration word to the PE array 140. Alternatively, the compiler 121 can forward the configuration word to the PE array 140 directly, which is not limited here.

In some possible implementations, the storage module 130 can be divided into multiple areas, namely a configuration random access memory (config RAM) 131, a static configuration template library (template lib) 132 and a data random access memory (data RAM) 133. The config RAM 131 is used to store the configuration word and transmit it to the PE array 140. The template lib 132 is used to store the mapping relationship between the static configuration and the index number, and to return the corresponding static configuration to the PE array 140 based on the index number in the configuration word. The data RAM 133 is used to store business data and transmit it to the PE array 140. The Cfg buffer 141 can then receive the configuration word transmitted by the config RAM 131 of the storage module 130 and obtain the static configuration from the template lib 132 of the storage module 130 based on the index number in the configuration word. The PE array 140 is configured based on the static configuration and one of the M dynamic configurations, and business data is computed on the configured PE array 140 to execute the corresponding operator.

The foregoing has introduced the chip 100. Next, the PE array configuration method executed in the chip 100 is introduced. Please refer to Figure 2-1. The method embodiment mainly includes the following steps:
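The division of labour between the config RAM and the template lib can be sketched as follows. The class `StorageModuleSketch`, its method names and the dictionary encoding of configurations are assumptions for illustration; the application does not define a software interface for the storage module 130.

```python
class StorageModuleSketch:
    """Toy model (assumption) of the config RAM and the template lib."""

    def __init__(self):
        self.template_lib = {}  # index number -> static configuration
        self.config_ram = []    # configuration words: (index number, dynamic configuration)

    def add_template(self, static):
        # Store each distinct static configuration only once.
        for index, stored in self.template_lib.items():
            if stored == static:
                return index
        index = len(self.template_lib)
        self.template_lib[index] = dict(static)
        return index

    def push_word(self, static, dynamic):
        self.config_ram.append((self.add_template(static), dict(dynamic)))


def resolve(storage, word):
    """What a Cfg buffer would assemble: look up static by index, add dynamic."""
    index, dynamic = word
    return {**storage.template_lib[index], **dynamic}


static = {("e", "func"): "ADD", ("f", "func"): "ADD"}
storage = StorageModuleSketch()
for root in ("ADD", "SUB", "ADD"):
    storage.push_word(static, {("g", "func"): root})

print(len(storage.template_lib), len(storage.config_ram))
print(resolve(storage, storage.config_ram[1]))
```

Three configuration words are stored, but the shared static configuration is stored only once and is referenced by its index number.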

201. The processing module generates M data flow graphs (DFG) corresponding to M operators based on the source code of the program, where M is a positive integer.

In this embodiment of the present application, the chip can receive the source code and business data of the program through the MEM interface, and then the MEM interface transmits the source code of the program to the processing module and the business data to the storage module. After the processing module receives the source code of the program, it can generate a data flow diagram corresponding to each of the M operators based on the source code of the program, and obtain M data flow diagrams. Among them, a data flow graph includes the functional configuration and routing configuration of each node among multiple nodes. In some possible implementations, the M operators may be all operators in a program, or may be part of the operators in the program, which is not limited here.

For example, M=3, that is, there are three operators: operator 1, operator 2 and operator 3. Operator 1 computes a multiply-add between 2×2 matrices: A*B+C*D. Operator 2 computes a multiply-subtract between 2×2 matrices: A*B-C*D. Operator 3 computes a multiply-add between 4×4 matrices:

$$K_0 K_1 + K_2 K_3 + K_4 K_5 + K_6 K_7$$

Here A, B, C and D are all 2×2 matrices, and $K_0$, $K_1$, $K_2$, $K_3$, $K_4$, $K_5$, $K_6$ and $K_7$ are all 4×4 matrices $K_p$, where p=0,1,2,3,4,5,6,7.

For example, taking operator 1, let the 2×2 matrix $E_1 = AB + CD$. Then $E_1$ has 4 elements.

For any element $(E_1)_{ij}$ (i=0,1; j=0,1) of $E_1$, the following operation needs to be performed:

$$(E_1)_{ij} = (A_{i0} B_{0j} + A_{i1} B_{1j}) + (C_{i0} D_{0j} + C_{i1} D_{1j})$$

The processing module can therefore generate data flow diagram 1, shown in Figure 2-2, based on the source code of operator 1. Data flow diagram 1 includes 7 nodes, namely MUL0, MUL1, MUL2, MUL3, ADD0, ADD1 and ADD2. The functional configurations of MUL0, MUL1, MUL2 and MUL3 are all multiplication, and the functional configurations of ADD0, ADD1 and ADD2 are all addition. The routes of MUL0 and MUL1 are configured to point to ADD0, the routes of MUL2 and MUL3 are configured to point to ADD1, and the routes of ADD0 and ADD1 are configured to point to ADD2.

MUL0 performs $A_{i0} B_{0j}$, MUL1 performs $A_{i1} B_{1j}$, MUL2 performs $C_{i0} D_{0j}$ and MUL3 performs $C_{i1} D_{1j}$. ADD0 performs MUL0+MUL1, ADD1 performs MUL2+MUL3, and ADD2 performs ADD0+ADD1, which finally yields the value of $(E_1)_{ij}$.
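As a check on the node assignment of data flow diagram 1, the following sketch evaluates one output element through the seven nodes and compares the result with a plain matrix computation. It assumes the standard matrix product, in which element (i, j) pairs row i of the left matrix with column j of the right matrix. The function names `dfg1_element` and `matmul` and the sample matrices are illustrative assumptions.

```python
def dfg1_element(A, B, C, D, i, j):
    """Evaluate element (i, j) of A*B + C*D node by node, as in data flow diagram 1."""
    mul0 = A[i][0] * B[0][j]
    mul1 = A[i][1] * B[1][j]
    mul2 = C[i][0] * D[0][j]
    mul3 = C[i][1] * D[1][j]
    add0 = mul0 + mul1   # ADD0 = MUL0 + MUL1
    add1 = mul2 + mul3   # ADD1 = MUL2 + MUL3
    return add0 + add1   # ADD2 = ADD0 + ADD1


def matmul(X, Y):
    """Reference product of two square matrices given as nested lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[1, 0], [0, 1]]
D = [[2, 3], [4, 5]]

E1 = [[dfg1_element(A, B, C, D, i, j) for j in range(2)] for i in range(2)]
AB, CD = matmul(A, B), matmul(C, D)
reference = [[AB[i][j] + CD[i][j] for j in range(2)] for i in range(2)]
print(E1)
```

Replacing the final addition with a subtraction gives the corresponding evaluation for operator 2, which is the only node whose function differs.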

For example, taking operator 2, let the 2×2 matrix $E_2 = AB - CD$. Then $E_2$ has 4 elements.

For any element $(E_2)_{ij}$ (i=0,1; j=0,1) of $E_2$, the following operation needs to be performed:

$$(E_2)_{ij} = (A_{i0} B_{0j} + A_{i1} B_{1j}) - (C_{i0} D_{0j} + C_{i1} D_{1j})$$

The processing module can therefore generate data flow diagram 2, shown in Figure 2-3, based on the source code of operator 2. Data flow diagram 2 includes 7 nodes, namely MUL0, MUL1, MUL2, MUL3, ADD0, ADD1 and SUB0. The functional configurations of MUL0, MUL1, MUL2 and MUL3 are all multiplication, the functional configurations of ADD0 and ADD1 are both addition, and the functional configuration of SUB0 is subtraction. The routes of MUL0 and MUL1 are configured to point to ADD0, the routes of MUL2 and MUL3 are configured to point to ADD1, and the routes of ADD0 and ADD1 are configured to point to SUB0.

MUL0 performs $A_{i0} B_{0j}$, MUL1 performs $A_{i1} B_{1j}$, MUL2 performs $C_{i0} D_{0j}$ and MUL3 performs $C_{i1} D_{1j}$. ADD0 performs MUL0+MUL1, ADD1 performs MUL2+MUL3, and SUB0 performs ADD0-ADD1, which finally yields the value of $(E_2)_{ij}$.

For example, taking operator 3, let the 4×4 matrix $E_3 = K_0 K_1 + K_2 K_3 + K_4 K_5 + K_6 K_7$. Then $E_3$ has 16 elements.

For any element $(E_3)_{ij}$ (i=0,1,2,3; j=0,1,2,3) of $E_3$, the following operation needs to be performed:

$$
\begin{aligned}
(E_3)_{ij} ={}& \big((K_0)_{i0}(K_1)_{0j} + (K_0)_{i1}(K_1)_{1j} + (K_0)_{i2}(K_1)_{2j} + (K_0)_{i3}(K_1)_{3j}\big) \\
&+ \big((K_2)_{i0}(K_3)_{0j} + (K_2)_{i1}(K_3)_{1j} + (K_2)_{i2}(K_3)_{2j} + (K_2)_{i3}(K_3)_{3j}\big) \\
&+ \big((K_4)_{i0}(K_5)_{0j} + (K_4)_{i1}(K_5)_{1j} + (K_4)_{i2}(K_5)_{2j} + (K_4)_{i3}(K_5)_{3j}\big) \\
&+ \big((K_6)_{i0}(K_7)_{0j} + (K_6)_{i1}(K_7)_{1j} + (K_6)_{i2}(K_7)_{2j} + (K_6)_{i3}(K_7)_{3j}\big)
\end{aligned}
$$

In this embodiment of the present application, the processing module can generate data flow diagram 3, shown in Figure 2-4, based on the source code of operator 3. Data flow diagram 3 includes 31 nodes, namely MUL0 to MUL15 and ADD0 to ADD14. The functional configurations of MUL0 to MUL15 are all multiplication, and the functional configurations of ADD0 to ADD14 are all addition. The routing configurations of MUL0 to MUL15 and ADD0 to ADD14 are shown in Figure 2-4 and are not described in detail here. ADD14 performs ADD13+ADD12, which finally yields the value of $(E_3)_{ij}$.
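The node count of 31 can be reproduced with a small sketch that builds a pairwise reduction tree over the 16 products. The exact wiring of Figure 2-4 is not reproduced in the text, so the balanced pairwise ordering and the ADD numbering below are assumptions; the function name `reduction_dfg` is likewise illustrative.

```python
def reduction_dfg(num_products):
    """Build a balanced MUL/ADD reduction tree (assumed wiring).

    Returns (nodes, routes), where routes maps each node to the node it feeds.
    """
    if num_products < 1 or num_products & (num_products - 1):
        raise ValueError("this sketch assumes a power-of-two number of products")
    nodes = [f"MUL{k}" for k in range(num_products)]
    routes = {}
    level = list(nodes)
    add_id = 0
    while len(level) > 1:
        next_level = []
        # Pair up neighbouring results and feed each pair into a new ADD node.
        for left, right in zip(level[0::2], level[1::2]):
            add = f"ADD{add_id}"
            add_id += 1
            routes[left] = add
            routes[right] = add
            nodes.append(add)
            next_level.append(add)
        level = next_level
    return nodes, routes


nodes3, routes3 = reduction_dfg(16)  # operator 3: 16 products per output element
nodes1, routes1 = reduction_dfg(4)   # operator 1: 4 products per output element
print(len(nodes3), len(nodes1))
```

Under this wiring, 16 multiplications need 8+4+2+1 = 15 additions, and the last addition combines the two preceding partial sums, consistent with ADD14 performing ADD13+ADD12.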

202. The processing module extracts isomorphic features based on the M data flow graphs, and the isomorphic features correspond to the same local structures among the M operators.

In some possible implementations, the isomorphism feature may include the routing configuration of each of N nodes, where any two of the N nodes are directly or indirectly connected. In some possible implementations, the isomorphism feature also includes the functional configuration of at least one of the N nodes. For example, the isomorphism feature determined from data flow diagram 1 shown in Figure 2-2 and data flow diagram 2 shown in Figure 2-3 can be as shown in Figure 2-5. This isomorphism feature includes 7 nodes, namely a, b, c, d, e, f and g, any two of which are directly or indirectly connected. The routes of a and b are configured to point to e, the routes of c and d are configured to point to f, and the routes of e and f are configured to point to g. Exemplarily, the isomorphism feature also includes the functional configuration of at least one of the 7 nodes: as shown in Figure 2-5, the functional configurations of a, b, c and d are all multiplication, the functional configurations of e and f are both addition, and the functional configuration of g is not limited.
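A simplified sketch of deriving such a feature from two data flow graphs is given below. It assumes that the graphs have already been aligned node by node (a to g), which sidesteps the general graph matching problem; the function `extract_feature`, the `(function, route)` encoding and the use of `None` for "not limited" are all assumptions for illustration and are not the extraction procedure of this application.

```python
def extract_feature(dfgs):
    """Common feature of node-aligned data flow graphs.

    Each graph maps node -> (function, route). The feature keeps the shared
    route of every node and keeps a function only where all graphs agree;
    None marks a function that is not limited.
    """
    first = dfgs[0]
    if any(d.keys() != first.keys() for d in dfgs):
        raise ValueError("graphs are not aligned on the same node names")
    feature = {}
    for node, (func, route) in first.items():
        if any(d[node][1] != route for d in dfgs):
            raise ValueError(f"routing of node {node} differs between graphs")
        agreed = all(d[node][0] == func for d in dfgs)
        feature[node] = (func if agreed else None, route)
    return feature


def tree(root_func):
    # Node-aligned form of data flow diagrams 1 and 2 (g is ADD2 or SUB0).
    return {"a": ("MUL", "e"), "b": ("MUL", "e"),
            "c": ("MUL", "f"), "d": ("MUL", "f"),
            "e": ("ADD", "g"), "f": ("ADD", "g"),
            "g": (root_func, None)}


feature = extract_feature([tree("ADD"), tree("SUB")])
print(feature)
```

For the two example graphs, every route and six of the seven functions are shared, and only the function of g is left open.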

In some possible implementations, the processing module can extract isomorphism features from a data flow graph. The isomorphism features are at least two identical local structures in the data flow graph. By reusing the isomorphism features The N PEs in the PE array corresponding to the feature reduce the number of required PEs and enhance usability. For example, the processing module can divide Figure 2-4 into Figure 2-6, thereby dividing the data flow Figure 3 into 5 local structures with similar structures. Based on these 5 local structures, Figure 2-7 can be extracted. The isomorphic characteristics shown. This isomorphism feature includes 7 nodes, namely a, b, c, d, e, f, and g. Among them, the routes of a and b are configured to point to e, and the routes of c and d are configured to point to f and e. , f's route is configured to point to g. Among them, the functional configurations of a, b, c, and d are not limited, and the functional configurations of e, f, and g are all additive.

In the embodiment of this application, the processing module can extract the isomorphism features shown in Figure 2-8 based on the data flow graph 1, data flow graph 2 and data flow graph 3. The isomorphism features include 7 nodes. They are a, b, c, d, e, f, g respectively. Among them, the routes of a and b are configured to point to e, the routes of c and d are configured to point to f, the routes of e and f are configured to point to g, the function configurations of a, b, c, d, and g are not limited, e, The functional configurations of f are all additive.

In some possible implementations, isomorphism features can also be local structures of multiple different granularities, so that the chip can determine isomorphism features of different granularities as needed under different circumstances, reducing the number of required PEs. Enhanced usability. For example, based on data flow graph 1, data flow graph 2 and data flow graph 3, the isomorphism feature shown in Figure 2-9 can be extracted. This isomorphism feature includes three nodes, namely a and b. , c, where the routes of a and b are configured to point to c, and the functional configurations of a, b, and c are not limited. The isomorphic features shown in Figure 2-9 have smaller granularity than the isomorphic features shown in Figure 2-8.

In some possible implementations, the M operators can be all of the operators in a program or only some of them, so that the chip can determine one or more different isomorphism features for a program as needed. This suits programs for which no single suitable isomorphism feature can be extracted from all of their operators, enhancing applicability. For example, if the program includes 6 operators corresponding to 6 data flow graphs, namely data flow graphs 1, 2, 3, 4, 5 and 6, the processing module can extract isomorphism feature 1 based on data flow graphs 1, 2 and 3, and extract isomorphism feature 2 based on data flow graphs 4, 5 and 6.

It should be noted that steps 201-202 above are optional. How the processing module generates the isomorphism features of the M operators is not limited here. For example, the chip can determine the isomorphism features based on the calculation formulas of the M operators.

203. The processing module determines the static configuration of N PEs in the PE array based on the isomorphism characteristics, where N is a positive integer.

In some possible implementations, the isomorphism feature includes N nodes. Based on the connection relationship between the N nodes in the isomorphism feature, N available PEs are selected from the PE array such that the connection relationship of the N PEs is the same as the connection relationship between the N nodes in the isomorphism feature. One node in the isomorphism feature corresponds to one PE among the N PEs. Then, based on the configuration of each node in the isomorphism feature, the corresponding PE among the N PEs is configured accordingly to obtain the static configuration of the N PEs. Correspondingly, if the isomorphism feature includes the routing configuration of each of the N nodes, the static configuration also includes the routing configuration of each of the N PEs; if the isomorphism feature includes the functional configuration of at least one of the N nodes, the static configuration also includes the functional configuration of at least one PE among the N PEs.

As an example, as shown in Figure 1-1, the PE array has a 3×3 architecture. The isomorphism feature shown in Figure 2-8 can be mapped to N PEs of the PE array shown in Figure 1-1 (PE00, PE01, PE02, PE11, PE20, PE21, and PE22, that is, N=7), giving the static configuration of the 7 PEs shown in Figure 2-10. The routing configurations of these N PEs constitute a transmission path. In some possible implementations, the functional configurations of PE01 and PE21 are addition, and the functional configurations of PE00, PE02, PE11, PE20, and PE22 are not limited.
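The mapping in step 203 can be sketched in Python as follows. This is only an illustration under stated assumptions: the node-to-PE binding in `BINDING` is a guess consistent with the text (Figure 2-10 is not reproduced here), PEs are assumed to route only to horizontal or vertical neighbours in the 3×3 array, and `static_configuration` is a hypothetical helper name.

```python
# Isomorphism feature of Figure 2-8: routes and the functions that are fixed.
FEATURE_ROUTES = {"a": "e", "b": "e", "c": "f", "d": "f", "e": "g", "f": "g"}
FEATURE_FUNCS = {"e": "add", "f": "add"}  # a, b, c, d, g are not limited

# Assumed node-to-PE binding (not taken from the figures).
BINDING = {"a": "PE00", "b": "PE02", "e": "PE01",
           "c": "PE20", "d": "PE22", "f": "PE21", "g": "PE11"}


def position(pe):
    """'PE21' -> (2, 1): row and column inside the 3x3 array."""
    return int(pe[2]), int(pe[3])


def adjacent(src, dst):
    """Assumption: a PE can route only to a horizontal or vertical neighbour."""
    (r1, c1), (r2, c2) = position(src), position(dst)
    return abs(r1 - r2) + abs(c1 - c2) == 1


def static_configuration(routes, funcs, binding):
    """Translate the per-node configuration into a per-PE static configuration."""
    static = {}
    for node, pe in binding.items():
        entry = {}
        if node in routes:
            target = binding[routes[node]]
            if not adjacent(pe, target):
                raise ValueError(f"{pe} cannot route to {target}")
            entry["route"] = target
        if node in funcs:
            entry["func"] = funcs[node]
        static[pe] = entry
    return static


static = static_configuration(FEATURE_ROUTES, FEATURE_FUNCS, BINDING)
for pe in sorted(static):
    print(pe, static[pe])
```

With this binding, PE01 and PE21 receive both a route and the addition function, while the other five PEs receive at most a route, so their functions remain open for the dynamic configuration.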

204. The processing module determines M dynamic configurations based on the static configuration and the overall configuration of the M operators in the PE array. The dynamic configuration includes other configurations in the overall configuration except the static configuration.

For example, if the static configuration extracted based on operator 1, operator 2, and operator 3 is as shown in Figure 2-10, then the dynamic configuration is the functional configuration of PE00, PE02, PE11, PE20, and PE22. The dynamic configuration corresponding to operator 1 is: the functional configurations of PE00, PE02, PE20, and PE22 are all multiplication, and the functional configuration of PE11 is addition. The dynamic configuration corresponding to operator 2 is: the functional configurations of PE00, PE02, PE20, and PE22 are all multiplication, and the functional configuration of PE11 is subtraction. Operator 3 corresponds to 5 dynamic configurations. In 4 of them, the functional configurations of PE00, PE02, PE20, and PE22 are all multiplication, and the functional configuration of PE11 is subtraction; in the remaining one, the functional configurations of PE00, PE02, PE11, PE20, and PE22 are all subtraction.
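Step 204 amounts to subtracting the static configuration from each operator's overall configuration. The Python sketch below illustrates this for operators 1 and 2. The flat `(PE, field)` key layout, the routes listed in `STATIC`, and the helper names `overall_configuration` and `dynamic_configuration` are assumptions made for illustration only.

```python
# Static configuration as (PE, field) -> value. The routes are an assumed
# reading of Figure 2-10; the two addition functions come from the text.
STATIC = {
    ("PE00", "route"): "PE01", ("PE02", "route"): "PE01",
    ("PE20", "route"): "PE21", ("PE22", "route"): "PE21",
    ("PE01", "route"): "PE11", ("PE21", "route"): "PE11",
    ("PE01", "func"): "add", ("PE21", "func"): "add",
}


def overall_configuration(corner_op, centre_op):
    """Static part plus the functions an operator needs on the remaining PEs."""
    overall = dict(STATIC)
    for pe in ("PE00", "PE02", "PE20", "PE22"):
        overall[(pe, "func")] = corner_op
    overall[("PE11", "func")] = centre_op
    return overall


def dynamic_configuration(overall, static):
    """Everything in the overall configuration that the static one does not cover."""
    for key, value in static.items():
        if overall.get(key) != value:
            raise ValueError(f"overall configuration conflicts with static at {key}")
    return {key: value for key, value in overall.items() if key not in static}


dynamic_op1 = dynamic_configuration(overall_configuration("mul", "add"), STATIC)
dynamic_op2 = dynamic_configuration(overall_configuration("mul", "sub"), STATIC)
print(dynamic_op1)
print(dynamic_op2)
```

The two results differ only in the function of PE11, which is the only entry that would have to change when switching between operator 1 and operator 2.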

In some possible implementations, the dynamic configuration also includes the configuration of at least one PE other than the N PEs in the PE array. For example, Figure 2-11 shows data flow graph 4 corresponding to operator 4, which is based on the isomorphism feature shown in Figure 2-10. The dynamic configuration corresponding to operator 4 can also include the configuration of PE10: the route of PE10 is configured to point to PE11, and the function of PE10 is configured as addition.

205. The processing module stores the mapping relationship between the static configuration and the index number in the storage module.

Optionally, in some feasible implementations, the processing module can generate an index number for the static configuration, and store the mapping relationship between the index number and the static configuration in the storage module. In some possible implementations, the storage module can store the mapping relationship in the template lib. For example, if there are two static configurations, namely static configuration 1 and static configuration 2, the processing module can generate two index numbers, namely index number 1 and index number 2, where index number 1 has a mapping relationship with static configuration 1 and index number 2 has a mapping relationship with static configuration 2. The mapping relationships between the index numbers and the static configurations are stored in the template lib of the storage module.

For example, the template lib of the storage module is shown in Table 1:

Table 1

idx	cfg
#0	Cfg_template_0
#1	Cfg_template_1
#n	Cfg_template_n

The entries in the idx column are the index numbers, and the entries in the cfg column are the static configurations of the data link.
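The template lib of Table 1 behaves like a lookup table from index number to static configuration. A minimal Python sketch is given below. The class name `TemplateLib`, its methods, and the choice to return the existing index when an equal static configuration is registered twice are assumptions for illustration, not details from the application.

```python
class TemplateLib:
    """Hypothetical stand-in for the template lib: index number -> static configuration."""

    def __init__(self):
        self._templates = []

    def register(self, static_cfg):
        """Store a static configuration and return its index number.

        Registering an equal configuration again returns the existing index.
        """
        if static_cfg in self._templates:
            return self._templates.index(static_cfg)
        self._templates.append(static_cfg)
        return len(self._templates) - 1

    def lookup(self, idx):
        """Return the static configuration mapped to an index number."""
        return self._templates[idx]


lib = TemplateLib()
idx0 = lib.register("Cfg_template_0")
idx1 = lib.register("Cfg_template_1")
print(idx0, idx1, lib.lookup(idx1))
```

Here the strings stand in for real static configurations. A configuration word then only needs to carry the small index number.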

206. The processing module transmits the configuration word to the PE array.

In some possible implementations, the configuration word includes static configuration and at least one dynamic configuration among M dynamic configurations.

Illustratively, Table 2 shows an example of the configuration words corresponding to three operators in the embodiment of the present application.

Table 2

Among them, the static configurations of operator 1, operator 2 and operator 3 are all the same. Operator 1 corresponds to 1 configuration word, operator 2 corresponds to 1 configuration word, and operator 3 corresponds to 5 configuration words. Among the five configuration words of operator 3, the dynamic configurations of the first four configuration words are the same, and only the dynamic configuration of the fifth one is different.

In some possible implementations, the configuration word may also include a number of configuration copies, which indicates how many times configuration is performed based on that configuration word. Multiple configuration words with the same static configuration and the same dynamic configuration can then be abbreviated to 1 configuration word, further reducing the transmission overhead. Illustratively, Table 3 shows an example of the configuration words of three operators in the embodiment of the present application.

Table 3

Among them, the number of configuration copies is expressed by using the items under the *num column. It should be noted that each configuration word shown in Table 3 is different, that is, one configuration word corresponds to one reconfigurable cycle.
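The abbreviation with a number of configuration copies resembles run-length encoding. The Python sketch below illustrates it for operator 3, whose first four configuration words are identical. It assumes that only consecutive identical words are merged and models a configuration word as a `(static, dynamic)` pair; `add_copy_counts` and the placeholder names `dyn_a` and `dyn_b` are hypothetical.

```python
from itertools import groupby


def add_copy_counts(config_words):
    """Collapse consecutive identical configuration words into (word, num) pairs."""
    return [(word, len(list(run))) for word, run in groupby(config_words)]


# Operator 3: five configuration words, the first four identical.
operator3_words = [("static", "dyn_a")] * 4 + [("static", "dyn_b")]
compressed = add_copy_counts(operator3_words)
print(compressed)
```

Five transmitted words shrink to two, and the num field lets the receiver know how many times each one applies.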

In some possible implementations, the dynamic configuration includes the configurations of the N PEs other than the static configuration, and also includes the configuration of at least 1 PE other than the N PEs. The dynamic configuration can thus cover PEs of the PE array beyond the N PEs, enhancing applicability.

For example, Figure 2-11 shows data flow graph 4 corresponding to operator 4. The isomorphism feature shown in Figure 2-5 is only a local structure of data flow graph 4 of operator 4, and the remaining node can correspond to PE10. The configuration of PE10 is the configuration of at least 1 PE other than the N PEs. Illustratively, Table 4-1 and Table 4-2 show examples of configuration words of three operators in the embodiment of this application.

Table 4-1

Table 4-2

Among them, the dynamic configuration is divided into the Cfg_operation_list part and the other cfg part, where the Cfg_operation_list part is the configuration of the N PEs other than the static configuration, and the other cfg part is the configuration of at least 1 PE other than the N PEs.
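The split between the two parts can be sketched in Python as a partition of the dynamic configuration by PE. In this illustration the PE10 entries come from the operator 4 example in the text, whereas the functions assigned to the N PEs are placeholders (Figure 2-11 is not reproduced here), and `split_dynamic` is a hypothetical helper name.

```python
# The N PEs covered by the static configuration.
STATIC_PES = {"PE00", "PE01", "PE02", "PE11", "PE20", "PE21", "PE22"}


def split_dynamic(dynamic, static_pes):
    """Split a dynamic configuration into (Cfg_operation_list part, other cfg part)."""
    operation_list = {k: v for k, v in dynamic.items() if k[0] in static_pes}
    other_cfg = {k: v for k, v in dynamic.items() if k[0] not in static_pes}
    return operation_list, other_cfg


# Operator 4: PE10 entries follow the text; the rest are placeholder functions.
dynamic_op4 = {
    ("PE00", "func"): "mul", ("PE02", "func"): "mul",
    ("PE20", "func"): "mul", ("PE22", "func"): "mul",
    ("PE11", "func"): "add",
    ("PE10", "route"): "PE11", ("PE10", "func"): "add",
}
operation_list, other_cfg = split_dynamic(dynamic_op4, STATIC_PES)
print(other_cfg)
```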

In some possible implementations, the configuration word includes an index number and at least one dynamic configuration among the M dynamic configurations. The index number stands for the static configuration, which effectively reduces transmission overhead and improves transmission efficiency. Illustratively, Table 5-1, Table 5-2, Table 5-3 and Table 5-4 show examples of configuration words corresponding to three operators in the embodiment of the present application.

Table 5-1

Table 5-2

Table 5-3

Table 5-4

In some possible implementations, the processing module can transmit the configuration word to the storage module, and then the storage module stores the configuration word through the built-in config ram and transmits the configuration word to the PE array. In some possible implementations, the transmission module can transmit all configuration words to the storage module at one time, and the storage module sequentially transmits the configuration words to the PE array according to certain rules, one configuration word at a time. The configuration words received by the storage module but not yet transmitted to the PE array can be stored in the config ram. Since the configuration word includes the index number rather than the static configuration itself, storage requirements are greatly reduced.

207. The PE array obtains the static configuration that has a mapping relationship with the index number from the storage module.

Optionally, in some possible implementations, the cfg buffer in the PE array can obtain static configuration from the template lib of the storage module based on the index number. For example, the cfg buffer in the PE array requests a static configuration from the storage module based on the index number. The storage module determines the static configuration from the template lib based on the index number and mapping relationship, and returns the static configuration to the PE array.

It should be noted that when the PE array receives a new configuration word, it checks the index number. If the index number is the same as the index number in the previously received configuration word, the PE array does not need to obtain the static configuration from the storage module. Instead, it keeps using the static configuration of the previous configuration word and only needs to switch the dynamic configuration, which reduces transmission overhead.

In some possible implementations, as shown in Figure 2-12, the cfg buffer can be divided into two separate storage spaces, namely storage space 1 and storage space 2. Storage space 1 is used to store the static configuration, and storage space 2 is used to store the dynamic configuration. For example, as shown in Figure 2-13, the cfg buffer in the PE array receives configuration word 1, configuration word 2, and configuration word 3 transmitted in sequence by the storage module. Configuration word 1 includes the index number and dynamic configuration dynamic0, configuration word 2 includes the index number and dynamic configuration dynamic1, and configuration word 3 includes the index number and dynamic configuration dynamic2. When the cfg buffer in the PE array receives configuration word 1, it obtains the static configuration from the template lib in the storage module based on the index number, and stores the static configuration and dynamic configuration dynamic0. When the cfg buffer receives configuration word 2, it determines that the index number in configuration word 2 is the same as the index number in configuration word 1, so the static configuration (static) does not need to be obtained from the storage module again; the cfg buffer only switches dynamic configuration dynamic0 to dynamic configuration dynamic1 in configuration word 2. When the cfg buffer receives configuration word 3, it determines that the index number in configuration word 3 is the same as the index number in configuration word 2, so the static configuration again does not need to be obtained from the storage module; the cfg buffer only switches dynamic configuration dynamic1 to dynamic configuration dynamic2 in configuration word 3. Since only the dynamic configuration is switched and the static configuration stays in place, switching overhead is reduced.
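The behaviour of the cfg buffer for three configuration words that share one index number can be sketched in Python as follows. The class `CfgBuffer`, its attribute names, and the use of a plain dictionary in place of the template lib are assumptions for illustration. The counter `static_fetches` exists only to make the saved accesses visible.

```python
class CfgBuffer:
    """Sketch of a cfg buffer with separate static and dynamic storage spaces."""

    def __init__(self, template_lib):
        self.template_lib = template_lib  # index number -> static configuration
        self.index = None                 # index number of the stored static configuration
        self.static = None                # storage space 1
        self.dynamic = None               # storage space 2
        self.static_fetches = 0           # number of accesses to the template lib

    def receive(self, index, dynamic):
        """Handle one configuration word made of an index number and a dynamic configuration."""
        if index != self.index:
            # New index number: fetch the static configuration once.
            self.static = self.template_lib[index]
            self.index = index
            self.static_fetches += 1
        # The dynamic configuration is switched on every configuration word.
        self.dynamic = dynamic


cfg_buffer = CfgBuffer({0: "Cfg_template_0"})
for index, dynamic in [(0, "dynamic0"), (0, "dynamic1"), (0, "dynamic2")]:
    cfg_buffer.receive(index, dynamic)
print(cfg_buffer.static_fetches, cfg_buffer.dynamic)
```

Only the first configuration word triggers a fetch of the static configuration. The following two words only overwrite storage space 2.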

208. The PE array is configured based on the static configuration and at least one dynamic configuration among the M dynamic configurations.

For example, when the chip executes the first operator among the M operators, the PE array is configured based on the static configuration and the first dynamic configuration corresponding to the first operator, where the first dynamic configuration is one of the M dynamic configurations. When the chip executes the second operator among the M operators, the PE array switches the first dynamic configuration to the second dynamic configuration corresponding to the second operator, where the second operator is an operator executed after the first operator is executed, and the second dynamic configuration is one of the M dynamic configurations. Since only the dynamic configuration is switched, there is no need to switch the static configuration, which reduces switching overhead.

For example, operator 1, operator 2, and operator 3 correspond to the same static configuration. When the PE array executes operator 1, operator 2, and operator 3 in any order, the cfg switcher in the processing module only needs to switch the dynamic configuration in the cfg buffer of the PE array and does not need to switch the static configuration, thus saving switching overhead.

It should be noted that, for simplicity of description, the foregoing method embodiments are expressed as a series of action combinations. However, those skilled in the art should know that the present application is not limited by the described action sequence, because in accordance with this application certain steps may be performed in other orders or simultaneously. Those skilled in the art should also know that the embodiments described in the specification are preferred embodiments, and the actions and modules involved are not necessarily required by this application.

In order to facilitate better implementation of the above solutions in the embodiments of the present application, relevant devices for implementing the above solutions are also provided below.

Please refer to Figure 3. A chip 300 provided by an embodiment of the present application includes:

processing module 310 and PE array 320; wherein,

The processing module 310 is used to generate isomorphism features of M operators, where the isomorphism features correspond to the same local structure among the M operators and M is a positive integer; determine the static configuration of N PEs in the PE array according to the isomorphism features, where N is a positive integer; and determine M dynamic configurations based on the static configuration and the overall configuration of the M operators in the PE array, where the dynamic configuration is the other configurations in the overall configuration except the static configuration;

The PE array 320 is configured to configure based on the static configuration and at least one dynamic configuration among the M dynamic configurations.

In some possible implementations, the PE array 320 is specifically configured to: when executing the first operator among the M operators, perform configuration based on the static configuration and the first dynamic configuration corresponding to the first operator, where the first dynamic configuration is one of the M dynamic configurations; and when executing the second operator among the M operators, switch the first dynamic configuration to the second dynamic configuration corresponding to the second operator, where the second operator is an operator executed after the first operator is executed, and the second dynamic configuration is one of the M dynamic configurations.

In some possible implementations, the processing module 310 is configured to obtain M data flow graphs corresponding to the M operators, and extract the isomorphism features according to the M data flow graphs.

In some possible implementations, the chip 300 further includes a storage module 330. The processing module 310 is also configured to transmit the mapping relationship between the static configuration and the index number to the storage module 330, and to transmit a configuration word to the PE array, where the configuration word includes the index number and at least one dynamic configuration among the M dynamic configurations. The PE array 320 is also used to obtain the static configuration from the storage module 330 based on the index number.

It should be noted that the information interaction, execution process, etc. between the modules/units of the above-mentioned device are based on the same concept as the method embodiments of the present application, and the technical effects they bring are the same as those of the method embodiments of the present application. For the specific content, please refer to the descriptions in the method embodiments shown above in this application, which will not be repeated here.

An embodiment of the present application also provides a computer storage medium, wherein the computer storage medium stores a program, and the program executes some or all of the steps described in the above method embodiments.

An embodiment of the present application also provides a computer program product, wherein the computer program product stores a program, and the program executes some or all of the steps recorded in the above method embodiments.

Next, another communication device provided by an embodiment of the present application is introduced. Please refer to Figure 4. The communication device 400 includes:

Receiver 401, transmitter 402, processor 403 and memory 404. In some embodiments of the present application, the receiver 401, the transmitter 402, the processor 403 and the memory 404 may be connected through a bus or other means. In FIG. 4, the connection through the bus is taken as an example.

Memory 404 may include read-only memory and random access memory and provides instructions and data to processor 403 . A portion of memory 404 may also include non-volatile random access memory (NVRAM). The memory 404 stores an operating system and operating instructions, executable modules or data structures, or a subset thereof, or an extended set thereof, where the operating instructions may include various operating instructions for implementing various operations. The operating system may include various system programs that are used to implement various basic services and handle hardware-based tasks.

The processor 403 controls the operation of the communication device 400. The processor 403 may also be called a central processing unit (CPU). In specific applications, various components of the communication device 400 are coupled together through a bus system, where in addition to a data bus, the bus system may also include a power bus, a control bus, a status signal bus, etc. However, for the sake of clarity, various buses are called bus systems in the figure.

The methods disclosed in the above embodiments of the present application can be applied to the processor 403 or implemented by the processor 403. The processor 403 may be included as a chip as described in FIG. 3 . The steps of the method disclosed in conjunction with the embodiments of the present application can be directly implemented by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor. The software module can be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other mature storage media in this field. The storage medium is located in the memory 404. The processor 403 reads the information in the memory 404 and completes the steps of the above method in combination with its hardware.

The receiver 401 can be used to receive input numeric or character information, and generate signal input related to the relevant settings and function control of the communication device 400. The transmitter 402 can include a display device such as a display screen, and the transmitter 402 can be used to output numeric or character information through an external interface.

In this embodiment of the present application, the processor 403 is configured to execute the configuration method of the processing unit PE array executed by the communication device 400 .

In addition, it should be noted that the device embodiments described above are only illustrative. The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they can be located in one place or distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the device embodiments provided in this application, the connection relationship between modules indicates that there are communication connections between them, which can be specifically implemented as one or more communication buses or signal lines.

The technical solution of the present application, in essence or in the part that contributes to the existing technology, can be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a computer floppy disk, a U disk, a mobile hard disk, ROM, RAM, a magnetic disk or an optical disk, and includes a number of instructions to cause a computer device (which can be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments of this application.

In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented using software, it may be implemented in whole or in part in the form of a computer program product.

The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server or data center by wired (such as coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (such as infrared, wireless, microwave, etc.) means. The computer-readable storage medium may be any available medium that a computer can store, or a data storage device such as a server or data center that integrates one or more available media. The available media may be magnetic media (e.g., floppy disk, hard disk, tape), optical media (e.g., DVD), or semiconductor media (e.g., Solid State Disk (SSD)), etc.

Claims

A method for configuring a processing unit PE array, which is characterized in that it is used in a chip, and the chip includes a processing module and a PE array, and the method includes:

The processing module generates isomorphic features of M operators, the isomorphic features correspond to the same local structures among the M operators, and M is a positive integer;

The processing module determines the static configuration of N PEs in the PE array based on the isomorphism characteristics, where N is a positive integer;

The processing module determines M dynamic configurations based on the static configuration and the overall configuration of the M operators in the PE array, where the dynamic configuration includes the other configurations in the overall configuration except the static configuration;

The PE array is configured based on the static configuration and at least one dynamic configuration among the M dynamic configurations.
The method according to claim 1, characterized in that, the PE array configuring the PE array based on the static configuration and at least one dynamic configuration among the M dynamic configurations includes:

When executing the first operator among the M operators, the PE array is configured based on the static configuration and the first dynamic configuration corresponding to the first operator, and the first dynamic configuration is one of the M dynamic configurations;

When executing the second operator among the M operators, the PE array switches the first dynamic configuration to the second dynamic configuration corresponding to the second operator, the second operator is an operator executed after the first operator is executed, and the second dynamic configuration is one of the M dynamic configurations.
The method according to claim 1 or 2, characterized in that before the processing module generates isomorphism features of M operators, it further includes:

The processing module obtains M data flow graphs corresponding to the M operators;

The processing module generating the isomorphism features of the M operators includes:

The processing module extracts the isomorphism features according to the M data flow graphs.
The method according to any one of claims 1 to 3, characterized in that the chip further includes a storage module, and before the PE array is configured based on the static configuration and at least one dynamic configuration among the M dynamic configurations, the method further includes:

The storage module stores the mapping relationship between the static configuration and the index number;

The processing module transmits a configuration word to the PE array, where the configuration word includes the index number and at least one dynamic configuration among the M dynamic configurations;

The PE array obtains the static configuration that has a mapping relationship with the index number from the storage module.
The method according to claim 4, wherein the configuration word further includes a configuration number, and the configuration number is used to indicate the number of times configuration is performed based on the configuration word.
The method according to any one of claims 1 to 5, characterized in that the isomorphism feature includes the routing configuration of each node among the N nodes, and any two nodes among the N nodes are directly connected or indirectly connected; the static configuration includes the routing configuration of the N PEs.
The method according to claim 6, wherein the isomorphism feature further includes a functional configuration of at least one of the N nodes; and the static configuration further includes a functional configuration of at least one PE of the N PEs.
The method according to any one of claims 1 to 7, characterized in that the dynamic configuration further includes the configuration of at least one PE other than the N PEs.
A chip is characterized by including:

Processing module and PE array:

The processing module is used to: generate isomorphism features of M operators, where the isomorphism features correspond to the same local structure among the M operators and M is a positive integer; determine the static configuration of N PEs in the PE array according to the isomorphism features, where N is a positive integer; and determine M dynamic configurations based on the static configuration and the overall configuration of the M operators in the PE array, where the dynamic configuration includes the other configurations in the overall configuration except the static configuration;

The PE array is configured to configure based on the static configuration and at least one dynamic configuration among the M dynamic configurations.
The chip according to claim 9, characterized in that the PE array is specifically used for:

When executing the first operator among the M operators, configuration is performed based on the static configuration and the first dynamic configuration corresponding to the first operator, and the first dynamic configuration is one of the M dynamic configurations;

When executing the second operator among the M operators, the first dynamic configuration is switched to the second dynamic configuration corresponding to the second operator, the second operator is an operator executed after the first operator is executed, and the second dynamic configuration is one of the M dynamic configurations.
The chip according to claim 9 or 10, characterized in that:

The processing module is also configured to: obtain M data flow graphs corresponding to the M operators, and extract the isomorphism features according to the M data flow graphs.
The chip according to any one of claims 9 to 11, further comprising: a storage module;

The storage module is used to: store the mapping relationship between the index number and the static configuration;

The processing module is further configured to: transmit a configuration word to the PE array, where the configuration word includes the index number and at least one dynamic configuration among the M dynamic configurations;

The PE array is also used to obtain the static configuration that has a mapping relationship with the index number from the storage module.
A computer-readable storage medium, characterized in that the computer-readable storage medium stores a program, and the program causes the computer device to execute the method according to any one of claims 1-8.
A computer program product, characterized in that the computer program product includes computer-executable instructions, and the computer-executable instructions are stored in a computer-readable storage medium; the processor of a device reads the computer-executable instructions from the computer-readable storage medium, and the processor executes the computer-executable instructions to cause the device to perform the method according to any one of claims 1-8.
A communication device, characterized in that the communication device includes a processor, a memory and a communication interface;

the processor is coupled to the memory and the communication interface;

The memory is used to store instructions, the processor is used to execute the instructions, and the communication interface is used to communicate with other communication devices under the control of the processor;

The instructions, when executed by the processor, cause the processor to perform the method according to any one of claims 1-8.