CN119166321A - Method, processor and device for allocating vector computing power

Info

Publication number: CN119166321A
Application number: CN202311438764.4A
Authority: CN
Inventors: 吴兴良; 林越
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2023-06-19
Filing date: 2023-10-31
Publication date: 2024-12-20

Abstract

A method, processor and device for allocating vector computing power, belonging to the field of vector computing technology. After the system vector length is updated, the method selects operators in a vector computing power pool according to the updated system vector length and reconfigures the vector execution unit. The newly configured vector execution unit can thus make full use of the operators it includes while still being able to perform vector calculations, which avoids, as far as possible, the waste of computing power caused by operators not participating in the calculation.

Description

Method, processor and device for allocating vector computing power

The present disclosure claims priority to Chinese patent application No. 202310734151.9, entitled "A Scheduling Method" and filed on June 19, 2023, the entire contents of which are incorporated herein by reference.

Technical Field

The present disclosure relates to the field of vector computing technologies, and in particular, to a method, a processor, and an apparatus for vector computing power distribution.

Background

Vector computation is widely used in many fields, such as high performance computing (HPC) and artificial intelligence (AI) computing.

Currently, in an instruction pipeline that performs vector computation, a fixed number of vector execution units (Vector Execution Unit, VEU) are configured. Each vector execution unit consists of a fixed number of operators and can compute vectors no longer than the system vector length.

When the user updates the system vector length to a value far smaller than the current system vector length, a large share of the operators in each vector execution unit do not participate in vector calculations. The vector execution units therefore cannot use their full computing capability, and computing power is wasted.

Disclosure of Invention

The disclosure provides a method, a processor and equipment for vector calculation power distribution, which can improve the calculation power utilization rate of a vector execution unit, and the corresponding technical scheme is as follows:

In a first aspect, a method for allocating vector computing power is provided. The method includes: an operator control circuit determines that a system vector length is updated from a first vector length to a second vector length, selects operators in a vector computing power pool based on the second vector length, and configures a first vector execution unit; vector operations of the second vector length are then executed by the first vector execution unit. The first vector execution unit is a vector execution unit for executing vector operations of the second vector length and is composed of at least one operator, where the total data length that the at least one operator can calculate is greater than or equal to the second vector length.

According to this technical solution, after the system vector length is updated, operators are selected from the vector computing power pool according to the updated system vector length and the vector execution unit is reconfigured. The newly configured vector execution unit can thus make full use of the operators it includes while still being able to execute the vector calculation, which avoids, as far as possible, the waste of computing power caused by operators not participating in the calculation.

In one possible implementation, selecting operators in the vector computing power pool and configuring the first vector execution unit may proceed as follows: a first number of operators to be included in the first vector execution unit is determined based on a multiple relationship between the second vector length and the data length that a single operator can calculate; operators are selected in the vector computing power pool; and the selected first number of operators are configured as the first vector execution unit.

In the technical scheme provided by the disclosure, in order to enable the operators included in the configured vector execution unit to meet the vector calculation requirement and not to be idle when vector calculation is performed, when determining the number of operators included in the vector execution unit, a multiple relation between the second vector length and the data length which can be calculated by a single operator is calculated first, and then the number of operators included in the first vector execution unit is determined based on the multiple relation.

In one possible implementation, in the case that the updated system vector length is smaller than the pre-update system vector length, after the first number of operators to be included in the first vector execution unit is determined, the first number may first be compared with a second number of operators included in a second vector execution unit, where the second vector execution unit is a vector execution unit that executes vector operations of the pre-update system vector length. If the second number is determined to be greater than the first number, continuing to use the second vector execution unit for subsequent vector calculations would leave some of its operators idle during those calculations and waste computing power. In this case, operators can be selected in the vector computing power pool and the vector execution unit can be reconfigured.

In one possible implementation, in the case that the updated system vector length is greater than the pre-update system vector length, after the first number of operators to be included in the first vector execution unit is determined, the first number may first be compared with a third number of operators included in the second vector execution unit, where the second vector execution unit is a vector execution unit that executes vector operations of the pre-update system vector length. If the third number is determined to be smaller than the first number, the vector length that the second vector execution unit can execute cannot support the subsequent vector calculation. In this case, operators can be selected in the vector computing power pool and the vector execution unit can be reconfigured.

In one possible implementation, operators may be selected in the vector computing power pool based on the identification of each operator in the pool, and every first number of selected operators are configured as one vector execution unit.
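The two comparisons above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the disclosed circuit: the function names `required_operators` and `needs_reconfiguration` are hypothetical, and lengths are taken to be in bits.

```python
def required_operators(vector_len_bits: int, operator_len_bits: int) -> int:
    """First number: operators one unit needs to cover the new vector length (rounded up)."""
    return -(-vector_len_bits // operator_len_bits)  # integer ceiling division


def needs_reconfiguration(current_ops_per_unit: int,
                          vector_len_bits: int,
                          operator_len_bits: int) -> bool:
    """Return True when the existing unit size no longer matches the new vector length."""
    required = required_operators(vector_len_bits, operator_len_bits)
    if current_ops_per_unit > required:
        # Vector length shrank: some operators in each unit would sit idle.
        return True
    if current_ops_per_unit < required:
        # Vector length grew: one unit can no longer cover a whole vector.
        return True
    return False
```

With 64-bit operators and a new vector length of 256 bits, a unit of 8 or of 2 operators would be reconfigured, while a unit of 4 operators would be kept.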

In one possible implementation, after the system vector length is updated, in order to be able to cooperate with the reconfigured vector execution unit to store the calculation result, the vector physical registers may be further grouped in the present disclosure, by selecting a vector physical register based on the second vector length, and configuring at least one vector physical register set, where each vector physical register set includes at least one vector physical register, and the total storable data length of the at least one vector physical register is greater than or equal to the second vector length.

In the technical scheme provided by the disclosure, after the length of the system vector is updated, vector physical registers can be grouped according to the updated length of the system vector, and the vector physical register set is configured, so that the configured vector physical register set can store calculation data and calculation results of subsequent vector calculation.

In one possible implementation, selecting vector physical registers and configuring at least one vector physical register set is performed by determining a fourth number of vector physical registers included in each vector physical register set based on a multiple relationship between a second vector length and a data length storable by a single vector physical register. Vector physical registers are selected, and every fourth number of selected vector physical registers are configured into a vector physical register group.

In the technical scheme provided by the disclosure, in order to enable the vector physical registers included in the configured single vector physical register set to meet the requirement of storing the calculation result of single vector calculation, and not to generate too much residual storage space when storing the single calculation result, when determining the number of vector physical registers included in the vector physical register set, a multiple relation between the second vector length and the storable data length of the single vector physical register is calculated first, and then the number of vector physical registers included in the single vector physical register set is determined based on the multiple relation.

In one possible implementation, in the case that the updated system vector length is smaller than the pre-update system vector length, after the fourth number of vector physical registers to be included in each vector physical register group is determined, the fourth number may be compared with a fifth number of vector physical registers included in each currently configured vector physical register group. If the fifth number is determined to be greater than the fourth number, continuing to store the results of subsequent vector calculations in the currently configured vector physical register groups would leave considerable unused storage space in a group each time a single calculation result is stored. In this case, the vector physical register groups may be reconfigured.

In one possible implementation, in the case that the updated system vector length is greater than the pre-update system vector length, after the fourth number of vector physical registers to be included in each vector physical register group is determined, the fourth number may be compared with a sixth number of vector physical registers included in each currently configured vector physical register group. If the sixth number is determined to be smaller than the fourth number, a single currently configured vector physical register group may be unable to store the result of one vector calculation. In this case, the vector physical register groups may be reconfigured.

In one possible implementation, vector physical registers may be selected based on the identification of each vector physical register, and every fourth number of selected vector physical registers are configured as one vector physical register group.

In one possible implementation, in order to allocate vector physical register sets to vectors, in the technical solution provided in the present disclosure, it may also be recorded whether each vector physical register set is idle, and the processing may be such that a set identifier is allocated to each vector physical register set, and a correspondence between the set identifier of the vector physical register set and status indication information of the vector physical register set is established, where the status indication information is used to indicate whether the vector physical register set is idle.

In one possible implementation, in the case of storing the calculation result using the vector physical register set, the processing flow of vector calculation may be such that a vector calculation instruction is acquired, and a destination vector physical register set corresponding to the vector calculation instruction is determined in the free vector physical register set according to the correspondence between the set identifier of the vector physical register set and the state indication information of the vector physical register set. Executing the vector calculation instruction through the first vector execution unit, and writing a calculation result corresponding to the vector calculation instruction into the target vector physical register group.
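The correspondence between group identifiers and status indication information can be pictured as a small lookup table. The sketch below is an assumption-laden illustration: the class name `RegisterGroupTable`, the first-idle selection policy, and the `release` step are hypothetical and are not specified by the disclosure.

```python
class RegisterGroupTable:
    """Maps each vector physical register group identifier to an idle flag."""

    def __init__(self, group_ids):
        # Every group starts idle (assumption: nothing is allocated at configuration time).
        self.idle = {group_id: True for group_id in group_ids}

    def allocate(self):
        """Pick the first idle group as the destination group and mark it busy."""
        for group_id, is_idle in self.idle.items():
            if is_idle:
                self.idle[group_id] = False
                return group_id
        return None  # no idle group is available

    def release(self, group_id):
        """Mark a group idle again once its contents are no longer needed."""
        self.idle[group_id] = True
```

A vector calculation instruction would call `allocate()` to obtain its destination group before the result is written by the vector execution unit.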

In a second aspect, the present disclosure provides an apparatus for vector computing force allocation, the apparatus comprising respective modules for performing the method of vector computing force allocation in the first aspect or any one of the possible implementations of the first aspect.

In a third aspect, the present disclosure provides a processor comprising logic circuitry for performing the method of vector computing force allocation as described in the first aspect above, and power supply circuitry.

In a fourth aspect, the present disclosure provides a computing device comprising a processor and a memory, the processor configured to perform the method of vector computing force allocation as described in the first aspect above.

In a fifth aspect, the present disclosure provides a computer-readable storage medium having instructions stored therein that, when run on a computing device, cause the computer to perform the method of the above aspects.

In a sixth aspect, the present disclosure provides a computer program product containing instructions that, when run on a computing device, cause the computing device to perform the method of the above aspects.

Further combinations of the present disclosure may be made to provide further implementations based on the implementations provided in the above aspects.

Drawings

FIG. 1 is a schematic architecture diagram of an instruction pipeline provided by the present disclosure;

FIG. 2 is a flow chart of a method for vector computing force distribution provided by the present disclosure;

FIG. 3 is a schematic illustration of the effect of a vector computing force distribution provided by the present disclosure;

FIG. 4 is a schematic diagram of the effect of a vector computing force distribution provided by the present disclosure;

FIG. 5 is a flow chart of a method for vector computing force distribution provided by the present disclosure;

FIG. 6 is a schematic diagram of the effect of a vector computing force distribution provided by the present disclosure;

FIG. 7 is a flow chart of a method of vector computation provided by the present disclosure;

FIG. 8 is a schematic diagram of a processor provided by the present disclosure;

FIG. 9 is a schematic diagram of a computing device provided by the present disclosure;

FIG. 10 is a schematic structural diagram of a device for vector computing force distribution provided by the present disclosure.

Detailed Description

In order to improve the utilization rate of vector execution units, the present disclosure provides a method of vector computing force allocation, in which a processor adapts the data length that can be calculated by a single vector execution unit, and the number of vector execution units, according to the change of the system vector length. Therefore, when the length of the system vector is smaller, more vector execution units can be configured, the parallelism of vector calculation is improved, and the arithmetic unit participates in the vector calculation as much as possible, so that the calculation power waste of the vector execution units is avoided.

The method for vector computing power allocation provided in the present disclosure may be applied to an instruction pipeline for performing vector computation in a processor, and the processor may be deployed in various computing devices. The processor may be implemented in at least one hardware form of an application-specific integrated circuit (ASIC), a digital signal processor (DSP), a field-programmable gate array (FPGA), and a programmable logic array (PLA); the processor may also have other hardware implementations, which the present disclosure does not limit. The processor may be, for example, a central processing unit (CPU). In some examples, the processor may also be implemented using a graphics processing unit (GPU), a data processing unit (DPU), a system on chip (SoC), an accelerator chip or accelerator card, or the like. In some examples, the processor may also be an artificial intelligence (AI) processor.

Referring to FIG. 1, an instruction pipeline 100 may include a decode unit (Decoder) 101, a rename unit (Rename) 102, a reorder buffer (ReOrder Buffer) 103, an issue unit (Issue) 104, vector physical registers (Vector Physical Registers) 105, and operators 106, where a plurality of operators form a vector execution unit (Vector Execution Unit) 1061, and an operator may perform floating point addition, floating point multiplication, integer addition, integer multiplication, and other operations. In addition, the architecture shown in FIG. 1 may also include an operator control circuit 107 and a vector physical register control circuit 108. The operator control circuit 107 may perform the method of vector computing power allocation provided by the present disclosure, grouping operators based on the system vector length and configuring each group as a vector execution unit. The vector physical register control circuit 108 may likewise perform the method provided by the present disclosure, grouping vector physical registers based on the system vector length.

The method of vector computing force distribution provided by the present disclosure is described below with reference to the accompanying drawings. As shown in fig. 2, the process flow of the method may include the steps of:

In step 201, the operator control circuit determines that the system vector length is updated from the first vector length to the second vector length.

The system vector length refers to a vector length in vector calculation performed by the computing device, and the system vector length can be set by a user according to actual calculation requirements.

In practice, the user may update the system vector length according to the actual computing requirements. To update the system vector length, a user may input an update instruction for the system vector length to the computing device. In response, the processor receives the update instruction and updates the system vector length from a first vector length to a second vector length indicated by the update instruction. The operator control circuit obtains the second vector length and determines that the system vector length is updated from the first vector length to the second vector length.

In one possible implementation, a Scalable Vector Extension control register (SVE Control Register) is provided in the processor, and when specifying the system vector length a user may input a Move to System Register (MSR) instruction to the computing device. The processor receives the MSR instruction and updates the LEN field in the SVE Control Register from a first value to a second value indicated by the MSR instruction. The value of the LEN field indicates the system vector length: the first value indicates the first vector length, and the second value indicates the second vector length. The operator control circuit reads the second value of the LEN field in the SVE Control Register, determines the second vector length indicated by the second value, and thereby determines that the system vector length is updated from the first vector length to the second vector length.
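The disclosure does not state how a LEN value maps to a vector length. As a hedged illustration only, the sketch below assumes the encoding that Arm documents for the LEN field of its SVE control registers (ZCR_ELx), where a 4-bit LEN value selects a vector length of (LEN + 1) × 128 bits. The function name is hypothetical.

```python
def vector_length_from_len_field(len_field: int) -> int:
    """Decode a LEN field value into a vector length in bits.

    Assumption (not from the disclosure): 4-bit LEN, length = (LEN + 1) * 128 bits.
    """
    if not 0 <= len_field <= 0xF:
        raise ValueError("LEN is assumed to be a 4-bit field")
    return (len_field + 1) * 128


# Example: a LEN update from 3 to 1 would correspond to 512 bits -> 256 bits.
first_vector_length = vector_length_from_len_field(3)
second_vector_length = vector_length_from_len_field(1)
```

Under this assumption, the operator control circuit would compare the two decoded lengths to detect that the system vector length has changed.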

Step 202, the arithmetic unit control circuit selects arithmetic units from the vector calculation force pool based on the second vector length, and configures at least one vector execution unit, wherein the vector execution unit is composed of at least one arithmetic unit, and the total calculated length of the at least one arithmetic unit is greater than or equal to the second vector length.

The data types that an operator can calculate may include FP8, BF16, half-precision floating point (FP16), single-precision floating point (FP32), double-precision floating point (FP64), INT8, short integer (INT16), basic integer (INT32), long integer (INT64), and so on. The data length that a single operator can calculate is 8 bits when the operator supports the INT8 and FP8 data types; 16 bits when it supports BF16, half-precision floating point, and short integer; 32 bits when it supports single-precision floating point and basic integer; and 64 bits when it supports double-precision floating point and long integer. These data types merely illustrate the data length that a single operator can calculate; the present disclosure does not limit the data types that an operator can calculate or the corresponding data lengths.
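The mapping from data type to per-operator data length listed above can be written as a lookup table. This is a minimal sketch; the dictionary and function names are hypothetical, and the table covers only the example types named in the text.

```python
# Data length (in bits) that a single operator can calculate, per example data type.
DATA_LENGTH_BITS = {
    "INT8": 8, "FP8": 8,
    "BF16": 16, "FP16": 16, "INT16": 16,
    "FP32": 32, "INT32": 32,
    "FP64": 64, "INT64": 64,
}


def operator_data_length(data_type: str) -> int:
    """Return the per-operator data length in bits for one of the example data types."""
    return DATA_LENGTH_BITS[data_type]
```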

In an implementation, when a plurality of operators are included in the vector calculation force pool, the operator control circuit may determine, based on the second vector length and a data length that can be calculated by a single operator, the number of operators included in one vector execution unit and the number of vector execution units configured when determining that the system vector length is updated from the first vector length to the second vector length. Then, the issue width of the issue unit in the instruction pipeline is updated to the number of vector execution units. Where the issue width refers to the maximum number of vector compute instructions that the issue unit is allowed to issue to the vector execution unit in one clock cycle.

There may be a variety of calculation methods for how many vector execution units are configured and how many operators each vector execution unit includes, so long as the total data length that each vector execution unit can calculate is not less than the second vector length, and the following exemplary methods are described.

Method one:

The number of vector execution units to be configured is determined based on a multiple relationship between the total data length that the operators in the vector computing power pool can calculate and the second vector length. Then, the number of operators included in one vector execution unit is determined based on a multiple relationship between the number of operators in the vector computing power pool and the number of configured vector execution units. Specifically, the following calculation may be adopted.

The number of operators in the vector calculation force pool is multiplied by the length of data which can be calculated by a single operator to obtain a first numerical value. Dividing the first numerical value by the length of the system vector, and then rounding up the division result to obtain the number of configured vector execution units. Dividing the number of the arithmetic units in the vector calculation force pool by the number of the configured vector execution units, and then rounding up the division result to obtain the number of the arithmetic units included in one vector execution unit. The calculation formula may be the following formula (1):

$$N = \left\lceil \frac{E \times L}{VL} \right\rceil, \qquad n = \left\lceil \frac{E}{N} \right\rceil \tag{1}$$

where E is the number of operators in the vector computing power pool, L is the data length that a single operator can calculate, VL is the second vector length, E × L is the first value, N is the number of configured vector execution units, n is the number of operators included in one vector execution unit, and ⌈ ⌉ denotes rounding up.

The method one is described below by way of one example:

The number of the arithmetic units is 32, the data length which can be calculated by a single arithmetic unit is 64 bits, and the system vector length is 256 bits. The number of operators is multiplied by the length of data that can be calculated by a single operator, resulting in a first value of 2048. Dividing the first value by the system vector length 256 yields a second value of 8. Dividing the number of operators by the second value to obtain a third value of 4. That is, the operators are divided into 8 groups each including 4 operators, each group of operators is configured as one vector execution unit, and a total of 8 vector execution units are configured.
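The arithmetic of method one can be checked with a short Python sketch. The function name `method_one` and its parameter names are hypothetical; the rounding follows the description above (both divisions rounded up).

```python
def ceil_div(a: int, b: int) -> int:
    """Integer division rounded up."""
    return -(-a // b)


def method_one(num_operators: int, operator_len_bits: int, vector_len_bits: int):
    """Return (number of vector execution units, operators per unit)."""
    first_value = num_operators * operator_len_bits      # total data length of the pool
    num_units = ceil_div(first_value, vector_len_bits)   # second value in the example
    ops_per_unit = ceil_div(num_operators, num_units)    # third value in the example
    return num_units, ops_per_unit


# Worked example from the text: 32 operators, 64 bits each, 256-bit system vector length.
num_units, ops_per_unit = method_one(32, 64, 256)
```

For the example inputs this yields 8 vector execution units of 4 operators each, matching the values derived in the text.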

The second method is as follows:

The number of operators included in one vector execution unit is determined based on a multiple relationship between the second vector length and the data length that can be calculated by a single operator. Then, the number of vector execution units to be configured is determined from a multiple relationship between the number of operators in the vector calculation force pool and the number of operators included in one vector execution unit. Specifically, the following calculation formula may be adopted.

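The second vector length is divided by the data length that a single operator can calculate, and the result is rounded up to obtain the number of operators included in one vector execution unit. The number of operators in the vector computing power pool is then divided by the number of operators included in one vector execution unit, and the result is rounded up to obtain the number of configured vector execution units. The calculation formula may be the following formula (2):

$$n = \left\lceil \frac{VL}{L} \right\rceil, \qquad N = \left\lceil \frac{E}{n} \right\rceil \tag{2}$$

where VL is the second vector length, L is the data length that a single operator can calculate, E is the number of operators in the vector computing power pool, n is the number of operators included in one vector execution unit, N is the number of configured vector execution units, and ⌈ ⌉ denotes rounding up.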

The second method is described below by way of an example:

The number of the arithmetic units is 32, the data length which can be calculated by a single arithmetic unit is 64 bits, and the system vector length is 256 bits. Dividing the length of the system vector by the length of the data which can be calculated by a single arithmetic unit, and obtaining a third value of 4. Dividing the number of operators by the third value to obtain a second value of 8. That is, the operators are divided into 8 groups each including 4 operators, each group of operators is configured as one vector execution unit, and a total of 8 vector execution units are configured.
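Method two reverses the order of the two divisions. The sketch below is illustrative only; `method_two` and its parameter names are hypothetical, and the final line reflects the earlier statement that the issue width is updated to the number of vector execution units.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer division rounded up."""
    return -(-a // b)


def method_two(num_operators: int, operator_len_bits: int, vector_len_bits: int):
    """Return (operators per unit, number of vector execution units)."""
    ops_per_unit = ceil_div(vector_len_bits, operator_len_bits)  # third value in the example
    num_units = ceil_div(num_operators, ops_per_unit)            # second value in the example
    return ops_per_unit, num_units


# Worked example from the text: 32 operators, 64 bits each, 256-bit system vector length.
ops_per_unit, num_units = method_two(32, 64, 256)
issue_width = num_units  # issue width follows the number of configured units
```

For these inputs the result is again 4 operators per unit and 8 units, so the issue width becomes 8.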

After the number of the configured vector execution units and the number of the operators included in one vector execution unit are determined, comparing the number of the operators included in the current vector execution unit with the number of the operators included in the determined vector execution unit, if the two numbers are different, selecting the operators from the vector calculation pool according to the identification of the operators, and configuring the vector execution unit. Here, the current vector execution unit is the vector execution unit that executes the first vector length.

Suppose the number of operators is E and the operators are identified as C₁, C₂, C₃, …, C_E. The n operators identified as C₁, C₂, …, C_n are divided into one group and configured as one vector execution unit, the n operators identified as C_{n+1}, C_{n+2}, …, C_{2n} are divided into another group and configured as another vector execution unit, and so on, until the E operators are divided into N groups and configured as N vector execution units.

For example, the number of operators is 32 and the operators are identified as C₁, C₂, C₃, …, C₃₂. The 4 operators identified as C₁, C₂, C₃, C₄ are divided into one group and configured as one vector execution unit, the 4 operators identified as C₅, C₆, C₇, C₈ are divided into another group and configured as another vector execution unit, and so on, until the 32 operators are divided into 8 groups and configured as 8 vector execution units.

Through steps 201 and 202 above, when the system vector length is shortened, the operators can be reallocated and the vector execution units reconfigured, so that the number of vector execution units is adjusted accordingly, the issue width is increased, and the parallelism of vector calculation is improved.

The following example describes the effect that the method provided by the present disclosure can achieve when the system vector length is shortened:
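Grouping by identifier amounts to slicing an ordered list of operator identifiers into consecutive chunks. The following sketch assumes identifiers are plain strings such as "C1"; the function name `group_operators` is hypothetical.

```python
def group_operators(operator_ids, ops_per_unit):
    """Split an ordered list of operator identifiers into consecutive groups of ops_per_unit."""
    return [
        operator_ids[start:start + ops_per_unit]
        for start in range(0, len(operator_ids), ops_per_unit)
    ]


# Example from the text: 32 operators C1..C32, 4 operators per vector execution unit.
operator_ids = [f"C{i}" for i in range(1, 33)]
vector_execution_units = group_operators(operator_ids, 4)
```

Each inner list stands for one vector execution unit; the first is `["C1", "C2", "C3", "C4"]` and there are 8 lists in total.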

As shown in FIG. 3, the number of operators is 32 and the data length that a single operator can calculate is 64 bits. Before the update, the system vector length is 512 bits, the number of vector execution units is 4, the issue width of the issue unit is 4, each vector execution unit includes 8 operators, and the data length that a single vector execution unit can calculate is 512 bits. After the update, the system vector length is 256 bits. If the configuration of the vector execution units were left unchanged, the issue width would still be 4, and half of the operators would perform no calculation during vector calculations, wasting computing power. Through step 202 above, the operators can be reallocated: the number of operators in a single vector execution unit is reduced and the number of vector execution units is increased, so that 8 vector execution units are configured, each including 4 operators, and the issue width of the issue unit is accordingly updated to 8. Therefore, after the system vector length is shortened, more vector calculation instructions can be executed at the same time, the parallelism of vector calculation is improved, the waste of computing power is avoided, and the overall calculation efficiency is improved.

When the system vector length is lengthened, the operators can likewise be reallocated and the vector execution units reconfigured to adjust the data length that a vector execution unit can calculate, so that a vector execution unit can calculate a longer vector in a single pass.

The following example describes the effect that the method provided by the present disclosure can achieve when the system vector length is lengthened:
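The "half of the operators idle" observation can be expressed as a utilization ratio. This is an illustrative calculation only; the function name `operator_utilization` is hypothetical and the ratio is a simplification that ignores any effects other than operator width.

```python
def operator_utilization(vector_len_bits: int, ops_per_unit: int, operator_len_bits: int) -> float:
    """Fraction of a vector execution unit's operators that take part in one vector calculation."""
    unit_width_bits = ops_per_unit * operator_len_bits
    used_bits = min(vector_len_bits, unit_width_bits)
    return used_bits / unit_width_bits


# 256-bit vectors on the old configuration (8 operators of 64 bits per unit).
before = operator_utilization(256, 8, 64)
# 256-bit vectors after reconfiguration (4 operators of 64 bits per unit).
after = operator_utilization(256, 4, 64)
```

Here `before` evaluates to 0.5 and `after` to 1.0, which corresponds to the change from 4 half-used units to 8 fully used units described above.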

As shown in fig. 4, the number of the operators is 32, the data length that can be calculated by a single operator is 64 bits, the system vector length before updating is 128 bits, the number of vector execution units is 16, each vector execution unit comprises 2 operators, the data length that can be calculated by a single vector execution unit is 128 bits, and the emission width of the emission unit is 16. The length of the updated system vector is changed to 256 bits, if the configuration of the vector execution unit is unchanged and the data length which can be calculated by a single vector execution unit is still 128 bits, the vector execution unit cannot calculate the instruction one vector at a time, the vector execution unit needs to be divided into a plurality of times of execution, and the calculation efficiency is low. Through the above step 202, the operators may be reassigned, the number of operators in a single vector execution unit is increased, 8 vector execution units are configured, the transmission width of the transmitting unit is updated to 8, each vector execution unit includes 4 operators, and the data length that can be calculated by the single vector execution unit is 256 bits. Therefore, after the length of the system vector is prolonged, the length of data which can be calculated by each vector execution unit is also adapted to be prolonged, so that the vector execution unit can execute one vector calculation instruction at a time without being divided into multiple times of execution, and the calculation efficiency is improved.
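The cost of leaving the units unchanged when the vector length grows can be estimated as the number of passes one instruction needs. The sketch below is a simplified illustration; the function name `passes_per_instruction` is hypothetical, and it assumes an instruction is split into equal-width passes.

```python
def passes_per_instruction(vector_len_bits: int, ops_per_unit: int, operator_len_bits: int) -> int:
    """Number of passes one vector execution unit needs to process a full vector."""
    unit_width_bits = ops_per_unit * operator_len_bits
    return -(-vector_len_bits // unit_width_bits)  # integer ceiling division


# 256-bit vectors on the old configuration (2 operators of 64 bits per unit).
passes_before = passes_per_instruction(256, 2, 64)
# 256-bit vectors after reconfiguration (4 operators of 64 bits per unit).
passes_after = passes_per_instruction(256, 4, 64)
```

With the example values, the unchanged configuration needs 2 passes per instruction, while the reconfigured units need 1.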

After the operators are grouped, the operator control circuit may send the configuration information of the vector execution units to the transmitting unit. The configuration information may include the number of vector execution units, the identification of each vector execution unit, and the identifications of the operators included in each vector execution unit. For example, the configuration information of the vector execution units may be as shown in Table 1 below.

TABLE 1

Of course, steps 201 and 202 may also be performed by other hardware in the processor, such as the transmitting unit; the present disclosure does not limit the specific execution body.

After the vector execution unit is configured, the vector calculation of the second vector length described above may be performed by the configured vector execution unit.

In one possible implementation, the vector physical registers may also be grouped according to the system vector length. After grouping, one vector physical register set is used to store one vector, so that a vector physical register set can store a vector longer than the one that a single vector physical register can store. Accordingly, referring to FIG. 5, the process of grouping the vector physical registers may include the following steps:

Step 203, the vector physical register control circuit determines that the system vector length is updated from the first vector length to the second vector length.

The specific process of the vector physical register control circuit in step 203 is the same as that of the operator control circuit in step 201, and will not be described here again.

Step 204, the vector physical register control circuit selects a vector physical register based on the second vector length, and configures at least one vector physical register set, where each vector physical register set includes at least one vector physical register, and the total storable data length of the at least one vector physical register is greater than or equal to the second vector length.

The data length that can be stored in a single vector physical register refers to the maximum data length that can be stored in one vector physical register.

In an implementation, upon determining that the system vector length is updated from the first vector length to the second vector length, the vector physical register control circuit may determine, based on the second vector length and the data length that a single vector physical register can store, the number of vector physical registers included in one vector physical register set and the number of vector physical register sets to configure.

The number of vector physical register sets to configure and the number of vector physical registers in each set can be calculated in several ways, provided that the total storable data length of each vector physical register set is greater than or equal to the system vector length. Two exemplary methods are described below:

Method one:

The number of configured vector physical register sets is determined based on a multiple relationship between the total data length that the available vector physical registers can store and the second vector length. Then, the number of vector physical registers included in one vector physical register group is determined based on a multiple relationship between the number of available vector physical registers and the number of configured vector physical register groups. Specifically, the following calculation may be adopted.

The number of available vector physical registers is multiplied by the data length that a single vector physical register can store to obtain a second value. The second value is divided by the second vector length, and the division result is rounded up to obtain the number of configured vector physical register groups. The number of available vector physical registers is then divided by the number of configured vector physical register groups, and the division result is rounded up to obtain the number of vector physical registers included in one vector physical register group. The calculation formula may be the following formula (3):

$$M = \left\lceil \frac{R \times L}{VL} \right\rceil, \qquad m = \left\lceil \frac{R}{M} \right\rceil \tag{3}$$

Wherein R is the number of available vector physical registers, L is the data length that a single vector physical register can store, VL is the second vector length, R×L is the second value, M is the number of configured vector physical register groups, and m is the number of vector physical registers included in one vector physical register group.

Method one is described below by way of an example:

The system vector length is 256 bits, a single vector physical register can store a data length of 128 bits, and there are 256 vector physical registers. Multiplying the number of vector physical registers by the data length that a single vector physical register can store gives a fourth value of 32768. Dividing the fourth value by the system vector length gives a fifth value of 128. Dividing the number of vector physical registers by the fifth value gives a sixth value of 2. The vector physical registers are therefore divided into 128 vector physical register groups, each including 2 vector physical registers.
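Method one can be sketched in a few lines of Python. The names `ceil_div` and `method_one` are hypothetical and chosen only for this illustration; the rounding follows the description of method one above.

```python
def ceil_div(a, b):
    """Integer division rounded up (avoids floating-point rounding)."""
    return -(-a // b)


def method_one(R, L, VL):
    """Return (M, m): number of groups and registers per group.

    R: number of available vector physical registers
    L: data length (bits) one register can store
    VL: second (updated) system vector length in bits
    """
    M = ceil_div(R * L, VL)  # total capacity divided by the vector length
    m = ceil_div(R, M)       # registers per group
    return M, m


print(method_one(R=256, L=128, VL=256))  # (128, 2)
```

For the values of the example (256 registers of 128 bits, 256-bit vectors) the sketch yields 128 groups of 2 registers.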

Method two:

The number of vector physical registers included in one vector physical register group is determined based on a multiple relationship between the second vector length and the data length that can be stored by the single vector physical register. The number of configured vector physical register sets is then determined based on a multiple relationship between the number of available vector physical registers and the number of vector physical registers that one vector physical register set includes. Specifically, the following calculation formula may be adopted.

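The second vector length is divided by the data length that a single vector physical register can store, and the division result is rounded up to obtain the number of vector physical registers included in one vector physical register group. The number of available vector physical registers is then divided by the number of vector physical registers included in one vector physical register group, and the division result is rounded up to obtain the number of configured vector physical register groups. The calculation formula may be the following formula (4):

$$m = \left\lceil \frac{VL}{L} \right\rceil, \qquad M = \left\lceil \frac{R}{m} \right\rceil \tag{4}$$

Wherein m is the number of vector physical registers included in one vector physical register group, M is the number of configured vector physical register groups, R is the number of available vector physical registers, L is the data length that a single vector physical register can store, and VL is the second vector length.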

The second method is described below by way of an example:

There are 256 vector physical registers, a single vector physical register can store a data length of 128 bits, and the system vector length is 256 bits. Dividing the system vector length by the data length that a single vector physical register can store gives a sixth value of 2. Dividing the number of vector physical registers by the sixth value gives a fifth value of 128. That is, the vector physical registers are divided into 128 groups, each vector physical register group including 2 vector physical registers.
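Method two can be sketched in the same way. The name `method_two` is hypothetical; the sketch also checks the capacity condition stated in step 204, namely that one group can store at least one vector, which holds by construction because the group size is computed first.

```python
def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)


def method_two(R, L, VL):
    """Return (m, M): registers per group, then number of groups."""
    m = ceil_div(VL, L)  # registers needed to hold one vector
    M = ceil_div(R, m)   # groups formed from the available registers
    return m, M


m, M = method_two(R=256, L=128, VL=256)
# Capacity condition from step 204: one group stores at least one vector.
assert m * 128 >= 256
print(m, M)  # 2 128
```

Both methods give the same grouping for this example; they differ only in which quantity is computed first.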

After the number of vector physical register sets to configure and the number of vector physical registers per set are determined, the number of vector physical registers included in a current vector physical register set is compared with the determined number. If the two numbers differ, the vector physical registers are selected according to their identifications and the vector physical register sets are configured.

Suppose the number of vector physical registers is R and their identifications are V₁, V₂, V₃, …, V_R. The m vector physical registers identified as V₁, V₂, …, V_m form one vector physical register group, the m vector physical registers identified as V_(m+1), V_(m+2), …, V_(2m) form another vector physical register group, and so on, so that the R vector physical registers are divided into M groups.

For example, the number of vector physical registers is 256 and their identifications are V₁, V₂, V₃, …, V₂₅₆. The 2 vector physical registers identified as V₁ and V₂ form one vector physical register group, the 2 vector physical registers identified as V₃ and V₄ form another vector physical register group, and so on, so that the 256 vector physical registers are divided into 128 groups.
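Grouping by consecutive identifiers amounts to cutting the ordered list of identifiers into chunks of m. The following minimal Python sketch illustrates this; the function name `group_by_identifier` and the string form of the identifiers are assumptions for illustration.

```python
def group_by_identifier(register_ids, m):
    """Split an ordered list of register identifiers into groups of m."""
    return [register_ids[i:i + m] for i in range(0, len(register_ids), m)]


ids = [f"V{i}" for i in range(1, 257)]  # V1 ... V256
groups = group_by_identifier(ids, 2)
print(len(groups), groups[0], groups[1])  # 128 ['V1', 'V2'] ['V3', 'V4']
```

The position of a group in the resulting list could serve as its group identification, although the disclosure does not prescribe how group identifications are assigned.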

Through steps 203 and 204 above, when the system vector length increases and a single vector physical register can no longer store one vector, a plurality of vector physical registers can be combined into a group, and one vector physical register group is used to store one vector. The vector here may be the calculation result of a vector calculation.

The above effects are described below by way of example:

As shown in fig. 6, there are 256 vector physical registers, and a single vector physical register can store a data length of 128 bits. Before the update, the system vector length is 128 bits, and one vector physical register can store one vector. After the update, the system vector length is 256 bits, and one vector physical register can no longer store one vector. Through steps 203 and 204 above, every 2 vector physical registers form one vector physical register group, giving 128 vector physical register groups; each vector physical register group can store a data length of 256 bits and stores one vector. In this way, the instruction pipeline can support vector calculations with longer vector lengths.

After the vector physical registers are grouped, the vector physical register control circuit may send the grouping information of the vector physical registers to the renaming unit. The grouping information may include the number of vector physical register groups, the identifications of the vector physical registers included in each vector physical register group, and the group identification of each vector physical register group. For example, the grouping information of the vector physical registers may be as shown in Table 2 below:

TABLE 2

Of course, steps 203 and 204 may also be performed by other hardware in the processor, such as the renaming unit; the present disclosure does not limit the specific execution body.

Based on the vector computing power allocation described above, the present disclosure also provides a vector calculation method. Referring to fig. 7, the method may include the following processing steps:

Step 301, the decoding unit decodes the code to obtain at least one set of vector operation information.

In an implementation, the decoding unit obtains code and decodes it to obtain at least one set of vector operation information, where each set of vector operation information includes a vector calculation instruction, the identification of the architectural register (architected register number, ARN) of each piece of calculation data, and the identification of the architectural register of the calculation result. The vector calculation instruction may be a micro-operation (μop). The calculation data is the data that participates in the vector calculation; for example, if the vector calculation is A+B, then vector A and vector B are the calculation data.

Step 302, the decoding unit sends the at least one set of vector operation information to the renaming unit.

Step 303, for each vector calculation instruction, the renaming unit determines the destination vector physical register set corresponding to the vector calculation instruction.

The destination vector physical register set is used for storing the calculation result of the vector calculation instruction.

In an implementation, a renaming list is recorded in the renaming unit or in the reorder buffer. The renaming list includes a correspondence between the identifications of architectural registers and the group identifications of vector physical register groups, and this correspondence can change dynamically. For each received set of vector operation information, the renaming unit obtains the identification of the architectural register of each piece of calculation data in the vector operation information, and looks up in the renaming list the group identification of the original vector physical register group corresponding to that identification. The calculation data is then read from the original vector physical register group that was found.

A state list of the vector physical register groups is also recorded in the renaming unit or in the reorder buffer. The state list includes a correspondence between the group identification of each vector physical register group and the state indication information of that group. For the vector calculation instruction in each received set of vector operation information, the renaming unit selects from the state list, according to the dependency between this vector calculation instruction and the other vector calculation instructions in the at least one set of vector operation information, a vector physical register group whose state is idle as the destination vector physical register group corresponding to the vector calculation instruction.
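The interplay of the renaming list and the state list can be illustrated with a toy Python model. The class `RenameUnit`, its method names and the register names `v0`, `v1`, `v2` are hypothetical. The sketch simply takes the first idle group as the destination and deliberately omits the dependency analysis between instructions and the release of groups that are no longer needed.

```python
class RenameUnit:
    """Toy model of the renaming list and the state list (illustrative only)."""

    def __init__(self, num_groups):
        # Renaming list: architectural register id -> group id.
        self.rename_list = {}
        # State list: index is the group id, True means the group is idle.
        self.idle = [True] * num_groups

    def bind(self, arn, group_id):
        """Record that an architectural register now maps to a group."""
        self.rename_list[arn] = group_id
        self.idle[group_id] = False

    def rename(self, source_arns, dest_arn):
        # Look up the original groups that hold the calculation data.
        source_groups = [self.rename_list[arn] for arn in source_arns]
        # Take the first idle group as destination (ValueError if none is idle).
        dest_group = self.idle.index(True)
        self.bind(dest_arn, dest_group)
        return source_groups, dest_group


unit = RenameUnit(num_groups=128)
unit.bind("v0", 0)
unit.bind("v1", 1)
print(unit.rename(["v0", "v1"], "v2"))  # ([0, 1], 2)
```

After the call, the renaming list maps `v2` to group 2 and the state list marks that group as no longer idle.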

In addition, so that the calculation results can be returned in the order of the vector calculation instructions in the code even when the vector calculation instructions are executed out of order, the renaming unit may send the vector calculation instructions in the at least one set of vector operation information to the reorder buffer in the order in which they appear in the code. The reorder buffer assigns a ROB id to each vector calculation instruction, and the ROB id indicates the position of the corresponding vector calculation instruction in the code.

Step 304, the renaming unit sends the vector calculation instruction, the calculation data of the vector calculation instruction and the group identification of the destination vector physical register group corresponding to the vector calculation instruction to the transmitting unit.

In one possible implementation, in the case where the reorder buffer assigns ROB ids to the vector calculation instructions, the renaming unit may further obtain the ROB id corresponding to each vector calculation instruction and send the ROB id together with the vector calculation instruction to the transmitting unit.

Step 305, the transmitting unit sends the vector calculation instruction, the calculation data of the vector calculation instruction, and the group identification of the destination vector physical register group corresponding to the vector calculation instruction to the vector execution unit.

In an implementation, the transmitting unit obtains the states of the vector execution units and determines which vector execution units are idle. Specifically, when receiving a vector calculation instruction, the transmitting unit may determine, according to the configuration information of the vector execution units, the operators included in each vector execution unit, and then obtain the states of those operators. If all the operators included in a vector execution unit are idle, the transmitting unit determines that the vector execution unit is idle.
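The idle check described above reduces to an all-operators-idle test per vector execution unit. A minimal Python sketch follows; the dictionary layout of the configuration information and the identifiers `VEU0`, `ALU0` and so on are assumptions made for illustration.

```python
def idle_veus(veu_config, operator_busy):
    """Return the ids of vector execution units whose operators are all idle.

    veu_config: VEU id -> list of operator ids (from the configuration information)
    operator_busy: operator id -> True if the operator is currently busy
    """
    return [
        veu_id
        for veu_id, operator_ids in veu_config.items()
        if not any(operator_busy[op] for op in operator_ids)
    ]


veu_config = {"VEU0": ["ALU0", "ALU1"], "VEU1": ["ALU2", "ALU3"]}
operator_busy = {"ALU0": False, "ALU1": False, "ALU2": True, "ALU3": False}
print(idle_veus(veu_config, operator_busy))  # ['VEU0']
```

Here `VEU1` is not idle because one of its two operators is still busy.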

For each vector calculation instruction, the transmitting unit sends the vector calculation instruction, the calculation data of the vector calculation instruction, and the group identification of the destination vector physical register group corresponding to the vector calculation instruction to an idle vector execution unit. When a plurality of vector execution units are idle, the transmitting unit may send vector calculation instructions to the plurality of idle vector execution units at the same time, one instruction to each.

In one possible implementation, in the case where the reorder buffer assigns ROB ids to the vector calculation instructions, the transmitting unit may also send the ROB id of the vector calculation instruction together with the vector calculation instruction to the vector execution unit.

Step 306, the vector execution unit calculates the calculation data according to the vector calculation instruction to obtain a calculation result, and writes the calculation result into the destination vector physical register group corresponding to the vector calculation instruction.

In an implementation, the vector execution unit includes a plurality of operators. The transmitting unit may divide the two pieces of calculation data of the vector calculation instruction into a plurality of groups of elements and send one group of elements to one operator, which performs the operation on that group of elements to obtain an element calculation result. For example, the vector operation indicated by the vector calculation instruction is vector addition, and the calculation data includes A and B, where A is the vector (A1, A2, A3, A4) and B is the vector (B1, B2, B3, B4). The vector execution unit includes four operators; A1 and B1 form one group of elements, A2 and B2 form one group, A3 and B3 form one group, and A4 and B4 form one group, and the four operators calculate A1+B1, A2+B2, A3+B3 and A4+B4, respectively.
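The element-wise distribution in this example can be modelled as follows. This is a sketch only: `operator_add` stands in for a single hardware operator, the numeric element values are made up for illustration, and the sketch assumes that there are at least as many operators as element groups, as in the four-operator example.

```python
def operator_add(x, y):
    """One operator: adds a single group of elements."""
    return x + y


def vector_add(a, b, num_operators):
    """Send element group i, that is (a[i], b[i]), to operator i."""
    if len(a) != len(b):
        raise ValueError("operands must have the same number of elements")
    if len(a) > num_operators:
        raise ValueError("not enough operators to execute in a single pass")
    return [operator_add(x, y) for x, y in zip(a, b)]


A = (1, 2, 3, 4)      # stands for (A1, A2, A3, A4)
B = (10, 20, 30, 40)  # stands for (B1, B2, B3, B4)
print(vector_add(A, B, num_operators=4))  # [11, 22, 33, 44]
```

In hardware the four additions would run in parallel; the list comprehension only models which element group each operator receives.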

The operators send the element calculation results to the vector physical register control circuit in sequence, together with the group identification of the destination vector physical register group corresponding to the vector calculation instruction. The vector physical register control circuit, which records the grouping information of the vector physical registers, looks up the vector physical registers included in the destination vector physical register group and writes the calculation result into those vector physical registers.
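The write-back step can be sketched as a lookup in the grouping information followed by a split of the result across the registers of the group. The function name `write_result`, the dictionary-based register file and the layout in which the lowest elements go to the first register of the group are all assumptions; the disclosure does not specify the element layout.

```python
def write_result(group_info, register_file, group_id, element_results, elements_per_register):
    """Write element results into the registers of the destination group."""
    registers = group_info[group_id]  # registers that make up the group
    if len(element_results) > len(registers) * elements_per_register:
        raise ValueError("result does not fit in the destination register group")
    for i, reg in enumerate(registers):
        start = i * elements_per_register
        register_file[reg] = list(element_results[start:start + elements_per_register])


group_info = {0: ["V1", "V2"], 1: ["V3", "V4"]}  # grouping information
register_file = {}
# Four 64-bit element results, two per 128-bit register.
write_result(group_info, register_file, 0, [11, 22, 33, 44], elements_per_register=2)
print(register_file)  # {'V1': [11, 22], 'V2': [33, 44]}
```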

In one possible implementation, in the case where the reorder buffer assigns ROB ids to the vector calculation instructions, the vector execution unit may further send the ROB id corresponding to the vector calculation instruction together with the calculation result of the vector calculation instruction to the vector physical register control circuit. Correspondingly, the vector physical register control circuit can write the calculation results of the vector calculation instructions into the corresponding vector physical register groups in sequence according to the ROB ids.
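Writing results in ROB id order when they arrive out of order can be sketched with a small buffer and a counter for the next expected ROB id. The function name `write_in_rob_order` is hypothetical, and the sketch assumes that ROB ids are consecutive integers starting at 0, which the disclosure does not state.

```python
def write_in_rob_order(completions):
    """Reorder (rob_id, result) pairs that arrive out of order.

    Returns the pairs in the order in which they would be written back.
    """
    pending = {}   # results that arrived before their turn
    next_id = 0    # next ROB id allowed to be written
    written = []
    for rob_id, result in completions:
        pending[rob_id] = result
        # Drain every result whose predecessors have all been written.
        while next_id in pending:
            written.append((next_id, pending.pop(next_id)))
            next_id += 1
    return written


arrivals = [(1, "r1"), (0, "r0"), (2, "r2")]
print(write_in_rob_order(arrivals))  # [(0, 'r0'), (1, 'r1'), (2, 'r2')]
```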

Fig. 8 is a schematic diagram of a processor provided in the present disclosure. As shown in fig. 8, the processor 800 includes a logic circuit 801 and a power supply circuit 802. The logic circuit 801 may comprise at least one instruction pipeline as shown in fig. 1. The power supply circuit 802 is configured to supply power to the logic circuit 801.

In one possible implementation, the processor 800 may be a CPU or another general-purpose processor. The processor 800 may also be one or more integrated circuits for implementing the aspects of the present disclosure, such as a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a programmable logic device (PLD), a field-programmable gate array (FPGA), discrete gate or transistor logic, discrete hardware components, or the like. A general-purpose processor may be a microprocessor or any conventional processor or the like.

Fig. 9 is a schematic diagram of a computing device provided by the present disclosure. As shown in fig. 9, the computing device 900 includes a bus 902, a processor 904, a memory 906, and a communication interface 908. The processor 904, the memory 906, and the communication interface 908 communicate via the bus 902. The bus 902 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. Buses may be divided into address buses, data buses, control buses, and so on. For ease of illustration, only one line is shown in fig. 9, but this does not mean that there is only one bus or only one type of bus. The bus 902 may include a path for transferring information between the components of the computing device 900 (e.g., the memory 906, the processor 904, and the communication interface 908). The processor 904 may include any one or more of a CPU, a graphics processing unit (GPU), a microprocessor (MP), or a digital signal processor (DSP). The memory 906 may serve as internal memory or external storage of the computing device 900. The memory 906 may include volatile memory, such as random access memory (RAM). The memory 906 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid-state drive (SSD).

Fig. 10 is a schematic diagram of an apparatus for vector computing power allocation provided in the present disclosure. As shown in fig. 10, the apparatus 1000 includes a determining module 1001, an operator allocation module 1002, and a vector calculation module 1003, where:

A determining module 1001, configured to determine that the system vector length is updated from the first vector length to the second vector length;

An operator allocation module 1002, configured to select an operator from a vector computing power pool based on the second vector length, and configure a first vector execution unit, where the first vector execution unit is a vector execution unit that performs vector operations of the second vector length, the first vector execution unit is composed of at least one operator, and the total data length that the at least one operator can calculate is greater than or equal to the second vector length;

A vector calculation module 1003, configured to perform vector operations of the second vector length by the first vector execution unit.

It should be understood that the apparatus provided by the present disclosure may be implemented by a central processing unit (CPU), an application-specific integrated circuit (ASIC), or a programmable logic device (PLD), where the PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), a data processing unit (DPU), a system on chip (SoC), an off-chip card, or any combination thereof. The apparatus may also be implemented by the operator control circuit shown in fig. 1, by the processor shown in fig. 8, or by the computing device shown in fig. 9. When the above vector computing power allocation method is implemented by software, the apparatus 1000 and its modules may be software modules.

In one possible implementation, the operator allocation module 1002 is configured to:

Determining a first number of operators included in the first vector execution unit based on a multiple relationship between the second vector length and the data length that a single operator can calculate;

Selecting operators in the vector computing power pool, and configuring the selected first number of operators as a first vector execution unit.

In a possible implementation manner, in a case that the second vector length is smaller than the first vector length, before selecting operators in the vector computing power pool and configuring the selected first number of operators as the first vector execution unit, the operator allocation module 1002 is further configured to:

Comparing a second number of operators included in a second vector execution unit with the first number, and determining that the second number is greater than the first number, wherein the second vector execution unit is one vector execution unit executing the first vector length.

In a possible implementation manner, in a case that the second vector length is greater than the first vector length, before selecting operators in the vector computing power pool and configuring the selected first number of operators as the first vector execution unit, the operator allocation module 1002 is further configured to:

Comparing a third number of operators included in a second vector execution unit with the first number, and determining that the third number is smaller than the first number, wherein the second vector execution unit is one vector execution unit executing the first vector length.

In one possible implementation, the operator allocation module 1002 is configured to select operators in the vector computing power pool according to the identification of each operator in the vector computing power pool, and configure the selected first number of operators as a first vector execution unit.

In one possible implementation manner, the apparatus further includes a vector physical register allocation module configured to:

Selecting vector physical registers based on the second vector length, and configuring at least one vector physical register group, wherein each vector physical register group comprises at least one vector physical register, and the total data length storable by the at least one vector physical register is greater than or equal to the second vector length.

In one possible implementation manner, the vector physical register allocation module is configured to:

Determining a fourth number of vector physical registers included in each vector physical register group based on a multiple relationship between the second vector length and a data length storable by a single vector physical register;

Selecting vector physical registers, and configuring every fourth number of selected vector physical registers as one vector physical register group.

In one possible implementation, in a case where the second vector length is smaller than the first vector length, before selecting vector physical registers and configuring every fourth number of selected vector physical registers as one vector physical register group, the vector physical register allocation module is further configured to:

Determine that a fifth number of vector physical registers included in each current vector physical register group is greater than the fourth number.

In one possible implementation, in a case where the second vector length is greater than the first vector length, before selecting vector physical registers and configuring every fourth number of selected vector physical registers as one vector physical register group, the vector physical register allocation module is further configured to:

Determine that a sixth number of vector physical registers included in each current vector physical register group is less than the fourth number.

In one possible implementation, the vector physical register allocation module is configured to select vector physical registers according to the identification of each vector physical register, and configure every fourth number of selected vector physical registers as one vector physical register group.

In one possible implementation, the vector physical register allocation module is further configured to:

Assigning a group identifier to each vector physical register group;

And establishing a corresponding relation between the group identification of the vector physical register group and state indication information of the vector physical register group, wherein the state indication information is used for indicating whether the vector physical register group is idle or not.

In a possible implementation manner, the vector calculation module 1003 is configured to:

Obtaining a vector calculation instruction;

Determining, according to the correspondence, a target vector physical register group corresponding to the vector calculation instruction among the idle vector physical register groups;

Executing the vector calculation instruction through the first vector execution unit, and writing a calculation result corresponding to the vector calculation instruction into the target vector physical register group.

The apparatus 1000 for vector computing power allocation is applied to the processor in fig. 8 and fig. 9. The division into the functional modules described above is only an example of how the apparatus 1000 performs vector computing power allocation; in practical applications, the functions may be assigned to different functional modules as needed, that is, the internal structure of the apparatus 1000 may be divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus 1000 and the method for vector computing power allocation belong to the same concept; for the specific implementation process, refer to the description of the method, which is not repeated here.

Finally, it should be noted that the foregoing embodiments are merely intended to illustrate the technical solutions of the present disclosure, not to limit them. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the protection scope of the technical solutions of the embodiments of the present disclosure.

Claims

1. A method for vector computing power allocation, the method comprising:

Determining that the system vector length is updated from a first vector length to a second vector length;

Selecting an operator from a vector computing power pool based on the second vector length, and configuring a first vector execution unit, wherein the first vector execution unit is a vector execution unit that performs vector operations of the second vector length, the first vector execution unit consists of at least one operator, and the total data length that the at least one operator can calculate is greater than or equal to the second vector length;

Vector operations of the second vector length are performed by the first vector execution unit.

2. The method of claim 1, wherein the selecting an operator from a vector computing power pool based on the second vector length, and configuring a first vector execution unit, comprises:

3. The method of claim 2, wherein, in the case where the second vector length is less than the first vector length, before the selecting operators in the vector computing power pool and configuring the selected first number of operators as a first vector execution unit, the method further comprises:

4. The method of claim 2, wherein, in the case where the second vector length is greater than the first vector length, before the selecting operators in the vector computing power pool and configuring the selected first number of operators as a first vector execution unit, the method further comprises:

5. The method according to any one of claims 2-4, wherein the selecting operators in the vector computing power pool and configuring the selected first number of operators as a first vector execution unit comprises:

6. The method according to any one of claims 1-5, further comprising:

7. The method of claim 6, wherein selecting a vector physical register based on the second vector length configures at least one vector physical register set, comprising:

8. The method of claim 7, wherein, in the case where the second vector length is less than the first vector length, before the selecting vector physical registers and configuring every fourth number of selected vector physical registers as one vector physical register group, the method further comprises:

9. The method of claim 7, wherein, in the case where the second vector length is greater than the first vector length, before the selecting vector physical registers and configuring every fourth number of selected vector physical registers as one vector physical register group, the method further comprises:

10. The method according to any of claims 7-9, wherein the selecting vector physical registers, configuring each fourth number of selected vector physical registers as a set of vector physical registers, comprises:

11. The method according to any one of claims 7-10, wherein after said configuring the selected every fourth number of vector physical registers as one vector physical register bank, the method further comprises:

Assigning a group identifier to each vector physical register group;

12. The method of claim 11, wherein the performing, by the first vector execution unit, the vector operation of the second vector length comprises:

Obtaining a vector calculation instruction;

13. An apparatus for vector computing power allocation, the apparatus comprising:

A determining module, configured to determine that the system vector length is updated from a first vector length to a second vector length;

A vector computing power allocation module, configured to select an operator from a vector computing power pool based on the second vector length, and configure a first vector execution unit, wherein the first vector execution unit is a vector execution unit that performs vector operations of the second vector length, the first vector execution unit consists of at least one operator, and the total data length that the at least one operator can calculate is greater than or equal to the second vector length;

A vector calculation module, configured to perform vector operations of the second vector length by the first vector execution unit.

14. A processor comprising a logic circuit and a power supply circuit, the logic circuit being configured to perform the method for vector computing power allocation of any one of claims 1 to 12.

15. A computing device comprising a processor and a memory, the processor being configured to perform the method for vector computing power allocation of any one of claims 1 to 12.