CN113407154A - Vector calculation device and method - Google Patents


Info

Publication number
CN113407154A
Authority
CN
China
Prior art keywords
vector
calculated
scalar
parallel
value
Prior art date
Legal status
Pending
Application number
CN202010183821.9A
Other languages
Chinese (zh)
Inventor
俞立呈
李涛
侯新宇
张斌
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN202010183821.9A
Publication of CN113407154A
Current legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/57 Arithmetic logic units [ALU], i.e. arrangements or devices for performing two or more of the operations covered by groups G06F7/483 – G06F7/556 or for performing logical operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Advance Control (AREA)

Abstract

The embodiments of this application disclose a vector calculation device and method in the field of computers. They reduce the time and power consumed by vector calculation and improve vector calculation efficiency. The specific scheme is as follows: obtain a vector to be calculated and a first function; compare each scalar value of the vector with the rule conditions in parallel to obtain state information for every scalar value in the vector; determine that at least one scalar value in the vector is in the normal state; and substitute the normal-state scalar values of the vector into the first function in parallel to obtain the calculation result of the first function for the vector.

Description

Vector calculation device and method
Technical Field
The embodiments of this application relate to the field of computers, and in particular to a vector calculation device and method.
Background
The basic math library is the core of a high-performance computer system and is mainly used for the numerically intensive operations common in science and engineering. Its computational performance directly affects the efficiency of the scientific and engineering applications built on top of it, so improving the calculation efficiency of the basic math library is very important.
With the introduction of single instruction, multiple data (SIMD) technology, vector basic math libraries built on SIMD have come into wide use. Compared with a traditional scalar basic math library, a vector basic math library processes a vector containing multiple scalar values in parallel instead of serially, which accelerates data processing and noticeably raises the calculation rate of the library. In fields such as multimedia and signal processing, using a vector basic math library for numerical calculation markedly improves data processing efficiency.
Currently, a vector basic math library performs a calculation as follows. First, the vector to be calculated and the calculation function are obtained. Then the states of the scalar values in the vector are detected serially to judge whether each one is normal. If all scalar values are normal, the calculation function is converted into a polynomial by a power series, and all scalar values are substituted into the polynomial in parallel using the SIMD instruction set; the resulting polynomial values are output. If at least one scalar value in the vector is abnormal, every scalar value is processed serially: abnormal scalar values produce abnormal output, while normal scalar values undergo polynomial calculation and their results are output.
As can be seen, calculation in the current vector basic math library takes a long time, processes data inefficiently, reads many instructions, issues many memory access transactions, and consumes considerable power.
Disclosure of Invention
This application provides a vector calculation device and method that reduce the time and power consumed in vector calculation and improve vector calculation efficiency.
To achieve the foregoing objective, this application adopts the following technical solutions:
In a first aspect, this application provides a vector calculation apparatus, which may include an acquisition unit, a processing unit, a determining unit, and a calculation unit. The apparatus is configured with a logic module that may include multiple groups of parallel arithmetic and logic units (ALUs). The acquisition unit is configured to obtain a vector to be calculated and a first function, where the vector to be calculated includes multiple scalar values. The processing unit is configured to compare, in parallel through the multiple groups of parallel ALUs, each scalar value of the vector to be calculated with the rule condition, to obtain state information of each scalar value in the vector. The rule condition is used to judge whether a scalar value is normal, and the state information of a scalar value indicates whether, compared with the rule condition, that scalar value is in a normal state or an abnormal state. The determining unit is configured to determine that at least one scalar value in the vector to be calculated is in the normal state. The calculation unit is configured to substitute, in parallel, the normal-state scalar values of the vector into the first function to obtain the calculation result of the first function for the vector to be calculated.
With this vector calculation apparatus, the state of every scalar value in a vector is detected in parallel, and when the vector contains normal scalar values, the function is calculated on those values in parallel. This reduces the number of instruction reads and memory access transactions during vector calculation, which in turn reduces the time and power consumed and improves vector calculation efficiency.
An ALU may refer to a combinational logic circuit that performs arithmetic and logical operations. The logic module includes a plurality of groups of ALUs connected in parallel for performing parallel operations.
The content of the rule condition may be configured according to actual requirements, which is not limited in this application.
The first function is the calculation to be performed on the vector to be calculated. The first function referred to here may be the expression of the function itself, a polynomial obtained by converting it, or another form.
With reference to the first aspect or the foregoing possible implementation, in another possible implementation, the processing unit is specifically configured to call a first instruction to activate the multiple groups of parallel ALUs, and to compare, in parallel through those ALUs, each scalar value of the vector to be calculated with the rule condition to obtain the state information of each scalar value. In this implementation, the parallel comparison is performed by the single configured first instruction rather than by multiple instructions, which reduces the number of instruction reads and memory access transactions, and thereby reduces the time and power consumed and improves vector calculation efficiency.
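As a minimal sketch of what the first instruction computes, the Python below simulates the lanes with a list and checks each one against a rule condition. The specific rule (a finite value within an assumed closed range `[min_threshold, max_threshold]`) and the name `compare_lanes` are illustrative assumptions, not the instruction defined in this application.

```python
import math


def compare_lanes(vector, min_threshold, max_threshold):
    """Compare every lane with the rule condition and return one state per lane.

    True marks a lane in the normal state, False an abnormal one. The list
    comprehension stands in for the parallel ALUs: each element is an
    independent comparison, as it would be in one lane of the hardware.
    """
    return [
        math.isfinite(x) and min_threshold <= x <= max_threshold
        for x in vector
    ]


# Lanes: normal, NaN, above the range, negative infinity.
states = compare_lanes([1.0, float("nan"), 5.0, float("-inf")], 0.0, 4.0)
print(states)  # [True, False, False, False]
```

On hardware all four comparisons would complete in one instruction; the sketch only shows which result each lane should produce.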
With reference to the first aspect or any one of the foregoing possible implementations, in another possible implementation, the determining unit is specifically configured to call a second instruction to activate the multiple groups of parallel ALUs, and to compare, in parallel through those ALUs, the state information of each scalar value in the vector to be calculated with a judgment condition, to determine that at least one scalar value in the vector is in the normal state. In this implementation, whether the vector contains a normal-state scalar value is judged in parallel by the single configured second instruction rather than by multiple instructions, which reduces the number of instruction reads and memory access transactions, and thereby reduces the time and power consumed and improves vector calculation efficiency.
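The second instruction's job, reducing per-lane state information to a single decision, can be sketched by packing the lane states into an integer mask and testing the mask against a judgment condition. The mask layout (bit i for lane i) and the judgment condition `mask != 0` are assumptions made for illustration only.

```python
def pack_states(states):
    """Pack per-lane normal flags into one integer mask (bit i = lane i)."""
    mask = 0
    for lane, is_normal in enumerate(states):
        if is_normal:
            mask |= 1 << lane
    return mask


def has_normal_lane(mask):
    """Assumed judgment condition: a non-zero mask means some lane is normal."""
    return mask != 0


def all_lanes_normal(mask, lane_count):
    """Every bit set: all lanes are normal, so no abnormal output is needed."""
    return mask == (1 << lane_count) - 1


mask = pack_states([True, False, True, False])
print(bin(mask), has_normal_lane(mask), all_lanes_normal(mask, 4))
# 0b101 True False
```

A single compare on the packed mask replaces a per-lane loop, which is the kind of saving the text attributes to the second instruction.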
The judgment condition may be configured according to the actual requirement of the user, and the present application is not particularly limited.
With reference to the first aspect or any one of the foregoing possible implementations, in another possible implementation, the acquisition unit may further be configured to obtain the coefficient vector of the polynomial converted from the first function, and the calculation unit is specifically configured to call a third instruction to activate the multiple groups of parallel ALUs, and to perform, in parallel through those ALUs, multiply-add calculation on the normal-state scalar values of the vector to be calculated and the coefficient vector, to obtain the polynomial values of those normal-state scalar values. In this implementation, parallel polynomial calculation is performed by the single configured third instruction rather than by multiple instructions, which reduces the number of instruction reads and memory access transactions, and thereby reduces the time and power consumed and improves vector calculation efficiency.
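The multiply-add calculation attributed to the third instruction can be sketched with Horner's rule, where each step over the coefficient vector is one lane-wise multiply-add. The function name and the use of `None` to mark abnormal lanes are hypothetical choices for this sketch; hardware would process the lanes in parallel rather than in a Python loop.

```python
def polynomial_normal_lanes(vector, states, coeffs):
    """Evaluate a0 + a1*x + ... + an*x**n on the normal lanes only.

    coeffs is the coefficient vector [a0, a1, ..., an]. Abnormal lanes are
    left as None so that a later step can give them their abnormal output.
    """
    # Start every normal lane at 0.0; mark abnormal lanes with None.
    acc = [0.0 if ok else None for ok in states]
    # Horner's rule, highest coefficient first: each pass is one lane-wise
    # multiply-add, acc = acc * x + a, applied to every normal lane.
    for a in reversed(coeffs):
        acc = [None if y is None else y * x + a for y, x in zip(acc, vector)]
    return acc


print(polynomial_normal_lanes([1.0, 2.0, 3.0], [True, False, True], [1.0, 2.0, 3.0]))
# [6.0, None, 34.0]
```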
With reference to the first aspect or any one of the foregoing possible implementations, in another possible implementation, the vector calculation apparatus may further include a configuration unit configured to configure instructions that activate the multiple groups of parallel ALUs.
The instructions configured by the configuration unit to activate the multiple groups of parallel ALUs may be one or more of the following: the first instruction, the second instruction, and the third instruction.
With reference to the first aspect or any one of the foregoing possible implementations, in another possible implementation, the logic module may include a first logic sub-module, a second logic sub-module, and a third logic sub-module; each logic sub-module includes parallel ALUs for performing a different parallel operation. The parallel ALUs in one logic sub-module are a subset of the multiple groups of parallel ALUs in the logic module.
With reference to the first aspect or any one of the foregoing possible implementations, in another possible implementation, the rule condition may include one or more of the following: whether the value is greater than a maximum threshold, whether it is less than a minimum threshold, whether it is not a number (NaN), whether it is negative infinity, and whether it is positive infinity.
The state information may include the comparison result for each rule condition item.
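When the state information keeps one comparison result per rule condition item, a per-lane bit field is a natural encoding. The sketch below assumes a hypothetical bit layout for the five items listed above; a lane whose state is 0 triggered no item and can be treated as normal.

```python
import math

# Hypothetical bit layout: one bit per rule condition item.
ABOVE_MAX = 1 << 0
BELOW_MIN = 1 << 1
IS_NAN = 1 << 2
NEG_INF = 1 << 3
POS_INF = 1 << 4


def lane_state(x, min_threshold, max_threshold):
    """Record the comparison result of x against each rule item as bit flags.

    Every item is tested independently, as separate comparisons would be in
    hardware. Comparisons with NaN are always false, so a NaN lane sets only
    IS_NAN. A state of 0 means no item fired: the lane is normal.
    """
    state = 0
    if x > max_threshold:
        state |= ABOVE_MAX
    if x < min_threshold:
        state |= BELOW_MIN
    if math.isnan(x):
        state |= IS_NAN
    if x == -math.inf:
        state |= NEG_INF
    if x == math.inf:
        state |= POS_INF
    return state


def vector_states(vector, min_threshold, max_threshold):
    """Apply lane_state to every lane of the vector."""
    return [lane_state(x, min_threshold, max_threshold) for x in vector]


print(vector_states([1.0, 9.0, float("nan"), float("inf")], 0.0, 4.0))
# [0, 1, 4, 17]
```

Keeping every item's result, rather than a single normal/abnormal flag, lets a later abnormal-output step pick a value that depends on which condition fired.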
With reference to the first aspect or any one of the foregoing possible implementations, in another possible implementation, the rule condition may be whether the value falls within a specified range.
With reference to the first aspect or any one of the foregoing possible implementation manners, in another possible implementation manner, the vector calculation apparatus may further include an output unit, configured to output a calculation result of the first function of the vector to be calculated.
With reference to the first aspect or any one of the foregoing possible implementations, in another possible implementation, the output unit may further be configured to perform abnormal output for any scalar value in the vector to be calculated that is in the abnormal state.
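The output unit's merge step can be sketched as follows: normal lanes keep their computed value and abnormal lanes receive an abnormal output. The `abnormal_output` callable is a hypothetical hook for whatever convention the math library defines; NaN is used below only as an example preset value.

```python
import math


def merge_output(vector, values, states, abnormal_output):
    """Normal lanes keep their computed value; abnormal lanes get
    abnormal_output(x), a hypothetical per-library convention hook."""
    return [
        v if ok else abnormal_output(x)
        for x, v, ok in zip(vector, values, states)
    ]


out = merge_output([1.0, -1.0], [6.0, None], [True, False], lambda x: math.nan)
print(out)  # [6.0, nan]
```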
In a second aspect, this application provides a vector calculation method, which may include: obtaining a vector to be calculated and a first function, where the vector to be calculated includes multiple scalar values; comparing each scalar value of the vector to be calculated with the rule condition in parallel to obtain state information of each scalar value, where the rule condition is used to judge whether a scalar value is normal, and the state information of a scalar value indicates whether, compared with the rule condition, that scalar value is in a normal state or an abnormal state; determining that at least one scalar value in the vector to be calculated is in the normal state; and substituting, in parallel, the normal-state scalar values of the vector into the first function to obtain the calculation result of the first function for the vector to be calculated.
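The steps of the method can be combined into one sketch. Lanes are simulated with Python lists, the rule condition is assumed to be "finite and within a closed range", and `coeffs` is assumed to be the polynomial coefficient vector `[a0, a1, ..., an]` converted from the first function; none of these names come from the application.

```python
import math


def vector_calculate(vector, coeffs, min_threshold, max_threshold, abnormal_output):
    """Sketch of the method: parallel state check, then parallel polynomial."""
    # Step 1: compare every lane with the rule condition (state per lane).
    states = [math.isfinite(x) and min_threshold <= x <= max_threshold
              for x in vector]
    values = [None] * len(vector)
    # Step 2: go on to the polynomial only if at least one lane is normal.
    if any(states):
        acc = [0.0] * len(vector)
        # Step 3: Horner's rule on the normal lanes; abnormal lanes are untouched.
        for a in reversed(coeffs):
            acc = [y * x + a if ok else y
                   for y, x, ok in zip(acc, vector, states)]
        values = acc
    # Step 4: abnormal lanes receive their abnormal output.
    return [v if ok else abnormal_output(x)
            for x, v, ok in zip(vector, values, states)]


print(vector_calculate([1.0, 2.0, math.inf], [1.0, 1.0, 0.5], 0.0, 10.0,
                       lambda x: math.nan))
# [2.5, 5.0, nan]
```

Unlike the prior-art flow, a single abnormal lane does not force the normal lanes onto a serial path.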
With this vector calculation method, the state of every scalar value in a vector is detected in parallel during vector calculation, and when the vector contains normal scalar values, the function is calculated on those values in parallel. This reduces the number of instruction reads and memory access transactions, which in turn reduces the time and power consumed and improves vector calculation efficiency.
With reference to the second aspect, in a possible implementation, comparing each scalar value of the vector to be calculated with the rule condition in parallel to obtain the state information of each scalar value includes: calling a first instruction to compare the scalar values of the vector with the rule condition in parallel, obtaining the state information of each scalar value. In this implementation, the parallel comparison is performed by the single configured first instruction rather than by multiple instructions, which reduces the number of instruction reads and memory access transactions, and thereby reduces the time and power consumed and improves vector calculation efficiency.
With reference to the second aspect or the foregoing possible implementation, in another possible implementation, determining that at least one scalar value in the vector to be calculated is in the normal state includes: calling a second instruction to compare the state information of each scalar value with the judgment condition in parallel, determining that at least one scalar value in the vector is in the normal state. In this implementation, whether the vector contains a normal-state scalar value is judged in parallel by the single configured second instruction rather than by multiple instructions, which reduces the number of instruction reads and memory access transactions, and thereby reduces the time and power consumed and improves vector calculation efficiency.
With reference to the second aspect or any one of the foregoing possible implementations, in another possible implementation, the vector calculation method may further include obtaining the coefficient vector of the polynomial converted from the first function. Substituting the normal-state scalar values in parallel into the first function to obtain the calculation result then includes: calling a third instruction to perform parallel multiply-add calculation on the normal-state scalar values and the coefficient vector, obtaining the polynomial values of the normal-state scalar values. In this implementation, parallel polynomial calculation is performed by the single configured third instruction rather than by multiple instructions, which reduces the number of instruction reads and memory access transactions, and thereby reduces the time and power consumed and improves vector calculation efficiency.
With reference to the second aspect or any one of the foregoing possible implementations, in another possible implementation, the vector calculation method may further include configuring instructions that activate multiple groups of parallel ALUs used to perform parallel operations.
With reference to the second aspect or any one of the foregoing possible implementations, in another possible implementation, the rule condition may include one or more of the following: whether the value is greater than a maximum threshold, whether it is less than a minimum threshold, whether it is not a number (NaN), whether it is negative infinity, and whether it is positive infinity. The state information may include the comparison result for each rule condition item.
In a third aspect, this application provides another vector calculation apparatus that can implement the functions in the foregoing apparatus examples. The functions may be implemented by hardware, or by hardware executing corresponding software; the hardware or software includes one or more modules corresponding to these functions. The vector calculation apparatus may take the form of a chip product.
With reference to the third aspect, in a possible implementation, the vector calculation apparatus includes a processor and a memory. The processor is configured to support the apparatus in performing the corresponding functions of the foregoing method. The memory is coupled to the processor and stores the program instructions and data necessary for the vector calculation apparatus.
In a fourth aspect, a computer-readable storage medium is provided, which includes instructions that, when executed on a computer, cause the computer to perform the vector calculation method provided in any one of the above aspects or any one of the possible implementations.
In a fifth aspect, a computer program product containing instructions is provided, which when run on a computer causes the computer to perform the vector calculation method provided in any one of the above aspects or any one of the possible implementations.
In a sixth aspect, an embodiment of the present application provides a chip system, where the chip system includes a processor and may further include a memory, and is configured to implement the functions in the foregoing method. The chip system may be formed by a chip, and may also include a chip and other discrete devices.
It should be noted that, all possible implementation manners of any one of the above aspects may be combined without departing from the scope of the claims.
Drawings
FIG. 1 is a schematic diagram illustrating a process of calculating a function value of a scalar value according to the prior art;
FIG. 2 is a schematic diagram illustrating a process of calculating a function value of a vector according to the prior art;
FIG. 3 is a schematic view of an application scenario according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a vector computing apparatus according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of another vector calculation apparatus according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of another vector computing apparatus according to an embodiment of the present application;
FIG. 7 is a schematic diagram illustrating a storage structure of state information of a vector according to an embodiment of the present application;
FIG. 8 is a schematic diagram illustrating a storage structure of state information of a vector according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of another vector computing apparatus according to an embodiment of the present application;
FIG. 10 is a schematic diagram illustrating a connection structure between a processing unit and a logic module according to an embodiment of the present application;
FIG. 11 is a schematic diagram illustrating a connection structure between a determining unit and a logic module according to an embodiment of the present application;
FIG. 12 is a schematic diagram of another connection structure between a determining unit and a logic module according to an embodiment of the present application;
FIG. 13 is a schematic diagram of a connection structure between a calculation unit and a logic module according to an embodiment of the present application;
FIG. 14 is a flowchart illustrating a vector calculation method according to an embodiment of the present application.
Detailed Description
The terms "first," "second," and "third," etc. in the description and claims of this application and the above-described drawings are used for distinguishing between different objects and not for limiting a particular order.
In the embodiments of the present application, words such as "exemplary" or "for example" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "e.g.," is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, use of the word "exemplary" or "such as" is intended to present relevant concepts in a concrete fashion for ease of understanding.
In the description of this application, "/" indicates an "or" relationship between the associated objects; for example, A/B may indicate A or B. In this application, "and/or" merely describes an association between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A alone, both A and B, or B alone, where A and B may be singular or plural. Also, in the description of this application, "a plurality of" means two or more unless otherwise specified. "At least one of the following" or similar expressions refer to any combination of these items, including any combination of single or plural items. For example, at least one of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c may each be single or multiple.
In the embodiments of the present application, at least one may also be described as one or more, and a plurality may be two, three, four or more, which is not limited in the present application.
For ease of understanding, the terms referred to in this application are explained first.
Scalar value: a quantized number or a mathematical expression. A scalar value may also be referred to as a lane. For example, the scalar value may be 0, or it may be an expression given as a formula image in the original publication.
A vector, may refer to a mathematical expression consisting of a plurality of scalar values.
The calculation of the mathematical library may refer to a process of performing mathematical function calculation on a numerical value (a scalar numerical value or a vector) to obtain a function value corresponding to the numerical value.
The vector to be calculated may refer to a vector for which function calculation is required. The vector to be calculated may include multiple scalar values. For example, the vector to be calculated may be [1, 2, 3, 4].
To clarify the calculation process of the mathematical library, first, the calculation process of a scalar value in the mathematical library will be described in detail. Fig. 1 illustrates a flow of calculating a function value of a scalar value, which is used to calculate a value of a function M of a scalar value H. The scalar value H can be any value to be calculated, and the function M can be any function in the mathematical library. The process may include S101 to S103.
S101, detecting the state of the scalar numerical value H.
S101 may be implemented as follows. First, determine whether the scalar value H is positive infinity (INF) or negative infinity; if so, H is considered to be in an abnormal state. If H is neither positive nor negative infinity, compare H with the normal value range set by the user: if H is outside that range, H is in an abnormal state; if H is within it, H is in the normal state.
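S101 for a single scalar value can be sketched as below. The closed range and the function name are assumptions; note that a NaN input also ends up abnormal here, because every comparison with NaN is false.

```python
import math


def scalar_is_normal(h, normal_min, normal_max):
    """S101 for one scalar value H."""
    # First check: positive or negative infinity is always abnormal.
    if math.isinf(h):
        return False
    # Second check: the user-set normal range (assumed closed here).
    # NaN fails both comparisons, so it is treated as abnormal too.
    return normal_min <= h <= normal_max


print([scalar_is_normal(h, 0.0, 1.0) for h in (0.5, 2.0, math.inf, math.nan)])
# [True, False, False, False]
```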
If the scalar value H is in the normal state, S102 is executed. If the scalar value H is in an abnormal state, S103 is executed.
S102, calculating the value of a function M of the scalar numerical value H.
Specifically, the polynomial converted from the function M is obtained, and the scalar value H is substituted into it; the value of the polynomial is taken as the value of the function M of H.
S103, performing abnormal output on the scalar value H.
In one possible implementation, the abnormal output in S103 may be the value that the function M should output for a given abnormal input, as defined mathematically or by the convention of the math library.
For example, if by convention the function M outputs NaN when the input is positive infinity, then NaN may be output in S103 when the scalar value H is positive infinity.
In another possible implementation, the value of the function M calculated for the abnormal-state scalar value H may be output in S103.
For example, if the function M is ln x, the normal value range is greater than 0. When the scalar value H is 0, the function M can still be evaluated for H, giving negative infinity, so negative infinity may be output in S103.
If the function M cannot be evaluated for the abnormal-state scalar value H, a preset value, for example NaN, may be output in S103.
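Taking ln x as the function M, the S102/S103 split might look like the sketch below. For illustration the normal branch calls `math.log` directly instead of a converted polynomial. Returning +inf for positive infinity and NaN for the remaining abnormal inputs are assumed conventions; the text only fixes the ln 0 = -inf example and suggests NaN as a preset value.

```python
import math


def ln_with_abnormal_output(h):
    """ln x with normal range x > 0, following the S102/S103 split."""
    if h > 0 and not math.isinf(h):
        return math.log(h)  # normal state: ordinary calculation (S102)
    if h == 0:
        return -math.inf    # abnormal but computable: ln x -> -inf as x -> 0+
    if h == math.inf:
        return math.inf     # assumed convention: ln(+inf) = +inf
    return math.nan         # negative, -inf, or NaN: preset value


print([ln_with_abnormal_output(h) for h in (1.0, 0.0, -1.0, math.inf)])
# [0.0, -inf, nan, inf]
```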
In another possible implementation, the scalar value H and its abnormal state may be output in S103; alternatively, an indication that the scalar value H entered an exception may be output.
For example, if the scalar value H is positive infinity, the output may be: scalar value H is positive infinity.
For another example, if the scalar value H is positive infinity, the output may be: scalar value H entered an exception.
Currently, a vector math library calculates the function value of a vector (containing multiple scalar values) as shown in FIG. 2, through steps S201 to S203, to obtain the value of a function N of a vector I. The vector I may be any vector to be calculated and includes multiple scalar values; the function N may be any function in the math library.
S201, detecting the state of each scalar numerical value in the vector I.
S201 is implemented as: the states of each scalar value in vector I are detected serially.
Specifically, the detection process of each scalar value may refer to S101, and is not described in detail.
Further, if each scalar value in the vector I is in a normal state, executing S202; if the scalar value in the vector I is in an abnormal state, S203 is executed.
S202, calculating the value of the function N of the vector I.
Specifically, a polynomial after power series conversion is performed on the function N is obtained, then each scalar numerical value in the vector I is respectively substituted into the polynomial, and a polynomial value of each scalar numerical value is obtained through parallel calculation and is used as a value of the function N of the vector I.
For example, relative instruction computation directions in a SIMD instruction set may be employedThe polynomial value of quantity I, the polynomial N after function N conversion is: a is0+a1x+a2x2Wherein the vector a is a coefficient vector of the polynomial N, and a ═ a0 a1 a2]. The process of computing a polynomial value of a vector I using associated instructions in a SIMD instruction set may comprise:
VLD: Y = a[2];
VLD: A = a[1];
VFMA: Y = Y * X + A;
VLD: A = a[0];
VFMA: Y = Y * X + A;
where VLD is a vector load instruction and VFMA is a fused multiply-add instruction. This is Horner's scheme: after the first VFMA, $Y = a_2 X + a_1$, and after the second, $Y = (a_2 X + a_1) X + a_0 = a_0 + a_1 X + a_2 X^2$. When X holds the vector I, the resulting Y is the polynomial value of the vector I, which may also be called the value of the function N of the vector I.
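Horner's scheme for $a_0 + a_1 x + a_2 x^2$ needs two multiply-adds starting from $Y = a_2$, applied to every lane at once. The Python sketch below emulates this with lists standing in for vector registers; `vld` and `vfma` are hypothetical helpers that mimic a broadcast load and a lane-wise multiply-add, not a real SIMD API.

```python
from typing import List, Sequence

def vld(value: float, lanes: int) -> List[float]:
    """Emulated broadcast load: copy one coefficient into every lane."""
    return [value] * lanes

def vfma(y: Sequence[float], x: Sequence[float], a: Sequence[float]) -> List[float]:
    """Emulated lane-wise multiply-add: y * x + a in each lane."""
    return [yi * xi + ai for yi, xi, ai in zip(y, x, a)]

def horner_vector(coeffs: Sequence[float], x: Sequence[float]) -> List[float]:
    """Evaluate coeffs[0] + coeffs[1]*x + ... for every lane of x."""
    lanes = len(x)
    y = vld(coeffs[-1], lanes)            # Y = a[n]
    for a_k in reversed(coeffs[:-1]):     # a[n-1], ..., a[0]
        y = vfma(y, x, vld(a_k, lanes))   # Y = Y * X + a[k]
    return y

print(horner_vector([1.0, 2.0, 3.0], [0.0, 1.0, 2.0]))  # [1.0, 6.0, 17.0]
```

Each loop iteration corresponds to one VLD plus one VFMA, so a degree-n polynomial costs n multiply-adds per vector regardless of the number of lanes.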
S203: process each scalar value in the vector I serially.
Specifically, in S203 the value of the function N is calculated serially for each scalar value in the vector I that is in the normal state; each such calculation follows S102 and is not described again here.
In addition, in S203 exception output is performed for each scalar value in the vector I that is in an abnormal state; each exception output follows S103 and is not described again here.
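For comparison with the parallel scheme of this application, the conventional flow S201 to S203 can be sketched in Python. The function and parameter names are hypothetical, and the "vectorized" S202 path is only modelled by a list comprehension; the point is the control flow, in which a single abnormal value pushes the whole vector onto the serial S203 path.

```python
import math
from typing import Callable, List

def serial_vector_eval(
    vec: List[float],
    is_normal: Callable[[float], bool],
    f_normal: Callable[[float], float],
    f_abnormal: Callable[[float], float],
) -> List[float]:
    """Sketch of the conventional flow S201-S203."""
    # S201: detect states one scalar at a time (all() stops at the first
    # abnormal value).
    if all(is_normal(h) for h in vec):
        # S202: every value is normal, so the whole vector takes the
        # polynomial path, modelled here as one comprehension.
        return [f_normal(h) for h in vec]
    # S203: at least one value is abnormal, so every element is handled
    # serially: computed (S102) or exception-output (S103).
    out = []
    for h in vec:
        out.append(f_normal(h) if is_normal(h) else f_abnormal(h))
    return out

# Example with M = ln x (normal range assumed to be: finite and > 0).
def ln_is_normal(h: float) -> bool:
    return math.isfinite(h) and h > 0.0

def ln_abnormal(h: float) -> float:
    return -math.inf if h == 0.0 else (math.inf if h == math.inf else math.nan)

print(serial_vector_eval([1.0, 0.0, -1.0], ln_is_normal, math.log, ln_abnormal))
# [0.0, -inf, nan]
```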
As can be seen from the above, the existing vector calculation process must detect the state of each scalar value in the vector serially, and whenever any scalar value is abnormal, every scalar value in the vector must be processed serially (serial calculation or exception output). This causes many instruction fetches and many memory access transactions, so vector calculation takes a long time, is inefficient, and consumes a lot of power.
Based on this, embodiments of the present application provide a vector computing apparatus and method. During vector computing, the states of the scalar values in a vector are detected in parallel; when the vector contains scalar values in the normal state, the function is calculated in parallel for those normal scalar values. This reduces the number of instruction fetches and memory access transactions during vector computing, which in turn shortens the computing time, lowers power consumption, and improves vector computing efficiency.
Embodiments of the present application will be described in detail below with reference to the accompanying drawings.
The solution provided by the embodiments of the present application can be applied to the application scenario shown in fig. 3. As shown in fig. 3, the application scenario may include a server 301 and an administrator 302. The administrator 302 can manage the server 301 according to actual needs. For example, the administrator 302 may directly input a vector to be calculated to the server 301, and the server 301 processes the input vector (performs vector calculation and/or exception processing) and then outputs the result.
Optionally, the application scenario may further include a terminal device 303, and the administrator 302 may manage the server 301 through the terminal device 303. For example, the administrator 302 may input a vector to be calculated to the server 301 through the terminal device 303, the server 301 processes the input vector (performs vector calculation and/or exception processing), and then outputs the result through the terminal device 303.
The server 301 may be a processor, a computer, a physical server, or a cloud server, or other devices with data processing capability and storage capability, which is not limited in this application.
The terminal device 303 may be an electronic device with input and display functions, such as a computer, a netbook, a television, and a mobile phone.
In one aspect, the present embodiment provides a vector computing apparatus, which may be deployed in the server 301 shown in fig. 3, and the vector computing apparatus may be part or all of the server 301. Alternatively, the vector computing apparatus may be deployed separately, for example, as an electronic device or chip system with associated data processing and storage capabilities that may be in communication with the server 301.
Fig. 4 illustrates a vector calculating apparatus 40 provided in the embodiment of the present application. As shown in fig. 4, the vector calculation apparatus 40 may include a processor 401, a memory 402, and a transceiver 403.
The respective constituent components of the vector calculation apparatus 40 will be specifically described below with reference to fig. 4:
The memory 402 is used to store program code, configuration files, data, or other content that implements the methods of the present application. It may be a volatile memory, such as a random-access memory (RAM); a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); or a combination of the above types of memory.
The transceiver 403 is used for information interaction between the vector computing apparatus 40 and other devices.
The processor 401 may be the control center of the vector calculation apparatus 40. For example, the processor 401 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application, such as one or more digital signal processors (DSPs) or one or more field-programmable gate arrays (FPGAs).
In particular, the processor 401 may perform the following functions by running or executing software programs and/or modules stored in the memory 402:
obtaining a vector to be calculated and a first function, where the vector to be calculated contains multiple scalar values; comparing each scalar value of the vector to be calculated with the rule condition in parallel to obtain the state information of each scalar value, where the rule condition is used to judge whether a scalar value is normal, and the state information of a scalar value indicates whether, compared with the rule condition, that scalar value is in the normal state or an abnormal state; determining that the vector to be calculated contains scalar values in the normal state; and substituting the scalar values in the normal state into the first function in parallel to obtain the calculation result of the first function of the vector to be calculated.
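The four functions listed above can be strung together in a minimal Python sketch. A list comprehension stands in for the parallel lanes, one Boolean per lane stands in for the state information, and the names `vector_compute`, `rule`, and `first_function` are hypothetical; returning `None` when no lane is normal is an assumption, since the text does not describe that case here.

```python
import math
from typing import Callable, List, Optional, Tuple

def vector_compute(
    vec: List[float],
    rule: Callable[[float], bool],
    first_function: Callable[[float], float],
) -> Tuple[List[bool], Optional[List[Optional[float]]]]:
    """Obtain -> compare in parallel -> determine -> calculate in parallel."""
    # Compare every scalar value with the rule condition; lanes are
    # independent, which is what allows the hardware to do this in parallel.
    normal = [rule(h) for h in vec]
    # Determine: continue only if at least one lane is normal
    # (returning None otherwise is an assumption of this sketch).
    if not any(normal):
        return normal, None
    # Calculate: apply the first function to normal lanes only; abnormal
    # lanes stay None for separate exception output.
    results = [first_function(h) if ok else None for h, ok in zip(vec, normal)]
    return normal, results

states, values = vector_compute([1.0, -2.0, 4.0],
                                lambda h: math.isfinite(h) and h > 0.0,
                                math.sqrt)
print(states, values)  # [True, False, True] [1.0, None, 2.0]
```

Unlike the serial flow, an abnormal lane here does not force the normal lanes off the parallel path.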
It should be noted that the system framework for performing vector calculation may be configured according to actual situations, and the present application is not limited specifically.
Fig. 5 illustrates another vector calculating apparatus 50 provided in the embodiment of the present application. As shown in fig. 5, the vector calculating device 50 may include: instruction fetch unit 51, first instruction decode unit 52, second instruction decode unit 53, source register 54, first ALU 55, second ALU 56, memory access unit 57, and destination register 58.
The instruction fetch unit 51 is configured to read an instruction to be executed.
The first instruction decoding unit 52 is configured to generate a hardware control signal according to a standard instruction, and read the content in the source register 54.
The second instruction decoding unit 53 is configured to generate a hardware control signal according to a customized math library-related instruction (e.g., the first instruction in the embodiment of the present application), and read the contents of the source register 54.
A source register 54 for storing relevant source data for the vector calculation. For example, source registers 54 may be used to store a plurality of scalar values of the vectors to be computed.
The first ALU 55 is used for performing arithmetic logic calculations corresponding to standard instructions, and memory access address calculations.
The second ALU 56 is used to perform the arithmetic logic calculations (parallel operations) corresponding to the customized math-library-related instructions, and memory access address calculations.
The memory access unit 57 is used to send memory access requests.
The destination register 58 is used to store calculation or memory access results.
As shown in fig. 5, the process by which the vector calculating device 50 completes a vector calculation may include five stages: instruction fetch stage 501, decode stage 502, execute stage 503, memory access stage 504, and write-back stage 505.
Instruction fetch stage 501: the instruction fetch unit 51 reads the instruction to be executed.
Decode stage 502: the first instruction decoding unit 52 and the second instruction decoding unit 53 generate hardware control signals according to the instructions and read the contents of the source register 54.
Execute stage 503: the first ALU 55 and the second ALU 56 perform the arithmetic logic calculations and memory access address calculations corresponding to the instructions.
Memory access stage 504: the memory access unit 57 issues memory access requests.
Write-back stage 505: the calculation or memory access result is written back to the destination register 58.
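If the five stages overlap as in a classic in-order pipeline (an assumption; the text only lists the stages), instruction i occupies stage s in cycle i + s. The Python sketch below builds that ideal schedule with no stalls or hazards; the stage abbreviations and the function name `pipeline_schedule` are illustrative.

```python
from typing import Dict, List

# Abbreviations for stages 501-505: fetch, decode, execute, memory access, write-back.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_schedule(instructions: List[str]) -> List[Dict[str, str]]:
    """Ideal in-order schedule: instruction i is in stage s during cycle i + s."""
    total_cycles = len(instructions) + len(STAGES) - 1
    table = []
    for cycle in range(total_cycles):
        row = {}
        for i, name in enumerate(instructions):
            s = cycle - i
            if 0 <= s < len(STAGES):
                row[STAGES[s]] = name
        table.append(row)
    return table

for cycle, row in enumerate(pipeline_schedule(["I0", "I1", "I2"])):
    print(cycle, row)
```

Under these assumptions n instructions finish in n + 4 cycles, which is why cutting the number of instructions fetched (as the parallel detection aims to do) directly shortens the run.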
In another aspect, an embodiment of the present application provides another vector calculating device. Fig. 6 is a schematic structural diagram of a vector calculating device 60 provided in an embodiment of the present application. As shown in fig. 6, the vector calculating device 60 may include: a logic module 601, an obtaining unit 602, a processing unit 603, a determining unit 604, and a calculating unit 605.
The logic module 601 in the vector calculating device 60 may include multiple groups of parallel-connected ALUs, and the vector calculating device 60 may perform parallel operations by calling an instruction that starts these parallel ALUs in the logic module 601.
Optionally, the logic module 601 may include multiple logic sub-modules, each containing parallel-connected ALUs. Each logic sub-module starts its own parallel ALUs when an instruction is called, and different logic sub-modules may perform different parallel operations. The parallel ALUs in one logic sub-module are a subset of the groups of parallel ALUs in the logic module 601.
For example, the logic module 601 may include a first logic sub-module, a second logic sub-module, and a third logic sub-module, each containing parallel-connected ALUs and each used to perform a different parallel operation (implementing a different function). The parallel ALUs in each of these three logic sub-modules may be a subset of the groups of parallel ALUs in the logic module 601.
The obtaining unit 602 is configured to obtain a vector to be calculated and a first function.
The processing unit 603 is configured to compare scalar values of the vector to be calculated with the rule condition in parallel through a plurality of groups of ALUs connected in parallel in the logic module 601, and obtain state information of each scalar value in the vector to be calculated.
The rule condition is used for judging whether the scalar numerical value is normal or not.
Specifically, the user may configure the content of the rule condition according to the actual requirement, which is not limited in this application.
In one possible implementation, the rule condition may include one or more of the following: whether greater than a maximum threshold, whether less than a minimum threshold, whether NaN, whether negative infinity, whether positive infinity.
In another possible implementation, the rule condition may be: whether it falls within the specified range. For example, a rule condition may be: whether greater than or equal to 0.
The state information of a scalar value indicates whether, compared with the rule condition, the scalar value is in the normal state or an abnormal state.
Specifically, the state information may include the comparison result for each rule condition item; that is, it may include multiple state types, each corresponding to one rule condition.
The specific storage structure of the state information may be as shown in Examples A and B below.
Example A: fig. 7 illustrates a storage structure of the state information of a vector. The vector includes 4 lanes (lane0, lane1, lane2, lane3), and the state information of the 4 lanes is stored at fixed positions. The state information of each lane (scalar value) includes 5 state types (state bits): LO, HI, NaN, IN, and IP. LO indicates whether the scalar value is less than the minimum threshold, HI indicates whether it is greater than the maximum threshold, NaN indicates whether it is NaN (not a number), IN indicates whether it is negative infinity, and IP indicates whether it is positive infinity. A state bit of 0 negates the condition the bit describes, and a state bit of 1 affirms it. For example, if the LO bit in the state information of a scalar value is 1, the scalar value is less than the minimum threshold; if the LO bit is 0, the scalar value is not less than (that is, greater than or equal to) the minimum threshold.
Specifically, when the state information of a scalar value is stored in the structure shown in fig. 7, if the content of one or more state bits in the state information of a scalar value is 1, it indicates that the scalar value is in an abnormal state; if the contents of all state types of a scalar value are 0, it indicates that the scalar value is in a normal state.
Example B: the storage structure of the state information of a vector may also be as shown in fig. 8. Compared with the structure shown in fig. 7, the state information in fig. 8 adds a state flag bit Y, which directly indicates whether the corresponding scalar value is in the normal state (1) or an abnormal state (0). Specifically, if one or more state bits in the state information of a scalar value are 1, the state flag bit Y is set to 0, indicating that the scalar value is in an abnormal state; if all state bits of a scalar value are 0, Y is set to 1, indicating that the scalar value is in the normal state.
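Examples A and B differ only in whether normality must be derived from the five state bits or can be read from one extra bit. A minimal Python sketch of one lane's state word follows; the bit positions are an assumption chosen for illustration (the text names the state types but does not give their order), and the function names are hypothetical.

```python
# Hypothetical bit positions for one lane's state word.
LO, HI, NAN, IN, IP = 1 << 0, 1 << 1, 1 << 2, 1 << 3, 1 << 4
Y = 1 << 5                       # state flag bit, Example B only
STATE_BITS = LO | HI | NAN | IN | IP

def pack_state(lo=False, hi=False, nan=False, neg_inf=False, pos_inf=False,
               with_y=False) -> int:
    """Build one lane's state word; with_y selects the Example B layout."""
    word = 0
    for flag, bit in ((lo, LO), (hi, HI), (nan, NAN), (neg_inf, IN), (pos_inf, IP)):
        if flag:
            word |= bit
    if with_y and word == 0:     # Example B: Y = 1 only when no state bit is set
        word |= Y
    return word

def is_normal_a(word: int) -> bool:
    """Example A: normal iff every state bit is 0."""
    return (word & STATE_BITS) == 0

def is_normal_b(word: int) -> bool:
    """Example B: read the state flag bit Y directly."""
    return (word & Y) != 0

print(is_normal_a(pack_state()), is_normal_b(pack_state(hi=True, with_y=True)))  # True False
```

Example B trades one extra bit per lane for a single-bit test during the later determination step.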
The determining unit 604 is configured to determine that the vector to be calculated contains scalar values in the normal state.
The calculating unit 605 is configured to substitute the scalar numerical value in the vector to be calculated, which is in the normal state, into the first function in parallel for calculation, so as to obtain a calculation result of the first function of the vector to be calculated.
Optionally, the first function is the calculation to be performed on the vector to be calculated. The first function referred to here may be the expression of the first function itself, a polynomial obtained by converting the first function, or the like; this is not limited in the embodiments of the present application.
Further, as shown in fig. 9, the vector calculation apparatus 60 may further include a configuration unit 606 for configuring an instruction for activating a plurality of groups of parallel ALUs in the logic module 601, which may instruct some or all of the ALUs in the logic module 601 to be activated.
Further, as shown in fig. 9, the vector calculation apparatus 60 may further include an output unit 607 for outputting a calculation result of the first function of the vector to be calculated.
Optionally, the output unit 607 may be further configured to, when a scalar numerical value in an abnormal state exists in the vector to be calculated, perform abnormal output on the scalar numerical value in the abnormal state in the vector to be calculated.
It should be noted that, for the specific implementation of the output unit 607, reference may be made to the implementation of S203, which is not described in detail.
Each unit is described in detail below.
The vector to be calculated acquired by the acquisition unit 602 may include a plurality of scalar numerical values. The first function is a calculation involving a vector to be calculated, and the application does not limit the type and form of the first function.
Specifically, the obtaining unit 602 may obtain a vector and a function input by a user, which are respectively used as a vector to be calculated and a first function; or acquiring a specified vector and a function in a certain calculation process, and respectively taking the vector and the function as a vector to be calculated and a first function; or obtaining the vector and the function in other modes to be respectively used as the vector to be calculated and the first function. This is not particularly limited in the present application.
Further, if the first function is a function that can be converted into a polynomial, the obtaining unit 602 may be further configured to obtain a polynomial coefficient vector after the first function is converted.
Illustratively, during program development, the first function may be converted into a polynomial by power-series expansion (or by Chebyshev polynomial approximation, rational function approximation, or another conversion), and the resulting polynomial coefficients may be stored in designated registers. The obtaining unit 602 may then read the coefficient vector of the converted polynomial directly from those registers. The converted polynomial may be stored in a register as a coefficient vector or in another form, which is not limited in this application.
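As an illustration of the conversion step, the power-series (Taylor) coefficients of e^x are 1/k!, so the coefficient vector is [1, 1, 1/2!, 1/3!, ...]. The Python sketch below builds that vector and evaluates it with Horner's scheme. Choosing exp, truncating at degree 12, and the function names are assumptions for illustration; a truncated series like this is accurate only for small |x|.

```python
import math
from typing import List

def exp_taylor_coeffs(degree: int) -> List[float]:
    """Coefficient vector [a0, a1, ..., a_degree] with a_k = 1 / k!."""
    return [1.0 / math.factorial(k) for k in range(degree + 1)]

def poly_eval(coeffs: List[float], x: float) -> float:
    """Horner evaluation of a0 + a1*x + ... + an*x**n."""
    y = coeffs[-1]
    for a_k in reversed(coeffs[:-1]):
        y = y * x + a_k
    return y

coeffs = exp_taylor_coeffs(12)   # would be stored in designated registers
print(abs(poly_eval(coeffs, 0.5) - math.exp(0.5)) < 1e-12)  # True
```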
In a possible implementation, the configuration unit 606 may be configured to configure a first instruction. The first instruction instructs the logic module 601 to start the parallel ALUs corresponding to the first instruction, which compare each scalar value of the vector to be calculated with the rule conditions in parallel to obtain the state information of each scalar value in the vector to be calculated.
When a plurality of groups of parallel ALUs included in the logic module 601 are multiplexed by a plurality of instructions, the parallel ALUs corresponding to the first instruction are a plurality of groups of parallel ALUs included in the logic module 601. When the logic module 601 includes different logic sub-modules corresponding to different instructions, the first instruction is used to instruct to start the parallel ALUs in the first logic sub-module corresponding to the first instruction in the logic module 601.
Correspondingly, the processing unit 603 may be specifically configured to call the first instruction to start the groups of parallel-connected ALUs in the logic module 601, and to compare each scalar value of the vector to be calculated with the rule conditions in parallel through these ALUs, obtaining the state information of each scalar value in the vector to be calculated.
For example, fig. 10 illustrates a connection structure of the processing unit 603 and the logic module 601. Assume that the rule conditions include: whether greater than a maximum threshold, whether less than a minimum threshold, whether NaN, whether negative infinity, whether positive infinity. As shown in fig. 10, the processing unit 603 may include a first vector register 1001, a second register 1002, a third register 1003, and a fourth vector register 1004. The first vector register 1001, the second register 1002, and the third register 1003 are input registers, and the fourth vector register 1004 is an output register. The first vector register 1001 is used to store a plurality of scalar values in a vector to be computed. The second register 1002 is used to store the minimum threshold value in the rule condition and the third register 1003 is used to store the maximum threshold value in the rule condition. The fourth vector register 1004 is used to store the comparison result (state information for each scalar value in the vector to be computed).
The second register 1002 and the third register 1003 may be vector registers or scalar registers, and the types of the second register 1002 and the third register 1003 are not limited in the embodiment of the present application.
Specifically, the dashed box in fig. 10 represents the groups of parallel-connected ALUs in the logic module 601 that correspond to the first instruction. As shown in fig. 10, the processing unit 603 calls the first instruction and starts these groups of parallel ALUs. Positive infinity, negative infinity, NaN, the minimum threshold, and the maximum threshold are input into different ALUs of these groups as their respective reference values, and each scalar value in the vector to be calculated is then input into each ALU. Each ALU compares the scalar value it receives with its own reference value and outputs a result. Because the groups of parallel ALUs corresponding to the first instruction operate in parallel, every scalar value in the vector to be calculated is compared with positive infinity, negative infinity, NaN, the minimum threshold, and the maximum threshold at the same time. The comparison results for each scalar value (whether it is greater than the maximum threshold, whether it is less than the minimum threshold, whether it is NaN, whether it is positive infinity, and whether it is negative infinity) are stored in the fourth vector register 1004 as the state information of each scalar value in the vector to be calculated.
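The comparisons performed by the first instruction can be modelled lane by lane in Python. This is a sketch under assumptions: the bit values follow an illustrative layout, `first_instruction` and `compare_lane` are hypothetical names, and NaN is detected with `math.isnan` because under IEEE 754 a NaN never compares equal to a reference NaN, so the NaN test cannot be a plain equality comparison. Note that +inf sets both HI and IP, and -inf sets both LO and IN.

```python
import math
from typing import List

LO, HI, NAN, IN, IP = 1, 2, 4, 8, 16   # illustrative bit layout

def compare_lane(x: float, min_t: float, max_t: float) -> int:
    """One lane: the five comparisons the ALU groups perform."""
    word = 0
    if x < min_t:
        word |= LO
    if x > max_t:
        word |= HI
    if math.isnan(x):          # NaN == NaN is False, so test it directly
        word |= NAN
    if x == -math.inf:
        word |= IN
    if x == math.inf:
        word |= IP
    return word

def first_instruction(vec: List[float], min_t: float, max_t: float) -> List[int]:
    """Each lane's state word depends only on its own scalar value,
    so all lanes can be evaluated in parallel."""
    return [compare_lane(x, min_t, max_t) for x in vec]

print(first_instruction([1.0, -5.0, math.nan, math.inf], 0.0, 10.0))  # [0, 1, 4, 18]
```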
It should be noted that if the number of groups of parallel ALUs in the logic module is smaller than the number of rule condition items (for example, fewer than 5 groups in the example of fig. 10), the processing unit 603 may implement the corresponding function by reusing the ALUs over multiple cycles.
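The multi-cycle reuse can be sketched in Python, assuming one ALU group per rule condition item per cycle, so k groups need ceil(5 / k) cycles for the five items. The rule table, bit values, and the name `compare_multicycle` are illustrative assumptions; in hardware the items within one cycle, and all lanes, run concurrently.

```python
import math
from typing import Callable, List, Tuple

# One entry per rule condition item: (state bit, predicate on x, min, max).
Rule = Tuple[int, Callable[[float, float, float], bool]]
RULES: List[Rule] = [
    (1, lambda x, lo, hi: x < lo),            # LO
    (2, lambda x, lo, hi: x > hi),            # HI
    (4, lambda x, lo, hi: math.isnan(x)),     # NaN
    (8, lambda x, lo, hi: x == -math.inf),    # IN
    (16, lambda x, lo, hi: x == math.inf),    # IP
]

def compare_multicycle(vec: List[float], min_t: float, max_t: float,
                       alu_groups: int) -> Tuple[List[int], int]:
    """Evaluate all rule items with only `alu_groups` ALU groups per cycle.

    Returns the per-lane state words and the number of cycles used,
    which is ceil(len(RULES) / alu_groups).
    """
    if alu_groups < 1:
        raise ValueError("need at least one ALU group")
    words = [0] * len(vec)
    cycles = 0
    for start in range(0, len(RULES), alu_groups):
        cycles += 1
        # Rule items assigned to the available ALU groups in this cycle.
        for bit, pred in RULES[start:start + alu_groups]:
            for lane, x in enumerate(vec):
                if pred(x, min_t, max_t):
                    words[lane] |= bit
    return words, cycles

print(compare_multicycle([1.0, -math.inf], 0.0, 10.0, 2))  # ([0, 9], 3)
```

The state words are the same for any number of groups; only the cycle count changes.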
The judgment condition may be configured according to the user's actual requirements, which is not specifically limited in this application. For example, the judgment condition may be that at least one scalar value is in the normal state, or that no scalar value is in an abnormal state.
In another possible implementation, the configuration unit 606 may be further configured to configure a second instruction. The second instruction instructs the logic module 601 to start the parallel ALUs corresponding to the second instruction, which compare the state information of each scalar value in the vector to be calculated with the judgment condition to determine that the vector to be calculated contains scalar values in the normal state.
When the plurality of groups of parallel ALUs included in the logic module 601 are multiplexed by a plurality of instructions, the parallel ALUs corresponding to the second instruction are the plurality of groups of parallel ALUs included in the logic module 601. When the logic module 601 includes different logic sub-modules corresponding to different instructions, the second instruction is used to instruct the start of the parallel ALUs in the second logic sub-module corresponding to the second instruction in the logic module 601.
Accordingly, the determining unit 604 may be configured to call the second instruction to start the groups of parallel-connected ALUs in the logic module 601, and to compare the state information of each scalar value in the vector to be calculated with the judgment condition in parallel through these ALUs, determining that the vector to be calculated contains scalar values in the normal state.
The connection structure between the determination unit 604 and the logic module 601 may include the following examples 1 and 2.
Example 1: for the storage structure of state information shown in fig. 7, fig. 11 illustrates a connection structure between the determining unit 604 and the logic module 601. As shown in fig. 11, the determining unit 604 may include a status register 1101, a state condition, a judgment condition, and a target address. The status register 1101 stores the state information of each scalar value in the vector to be calculated; for example, it may be the output vector register of the first instruction. The state condition selects the one or more state types to be queried; here, it is set to traverse every state type. The judgment condition is that at least one scalar value is in the normal state. The target address is used to instruct the calculating unit 605 to start the corresponding operation.
Specifically, the dashed box in fig. 11 represents the groups of parallel-connected ALUs in the logic module 601 that correspond to the second instruction. As shown in fig. 11, the determining unit 604 calls the second instruction and starts these groups of parallel ALUs, which read every state type of each scalar value in the status register to obtain the state (normal or abnormal) of each scalar value in the vector to be calculated. The determining unit 604 then checks whether these states satisfy the judgment condition and, when at least one scalar value in the vector to be calculated is in the normal state, instructs the calculating unit 605 to start the corresponding operation.
Example 2: for the storage structure of state information shown in fig. 8, fig. 12 illustrates another connection structure between the determining unit 604 and the logic module 601. As shown in fig. 12, the determining unit 604 may include a status register 1201, a judgment condition, and a target address. The status register 1201 stores the state information of each scalar value in the vector to be calculated; for example, it is the output vector register of the first instruction. The judgment condition is that at least one scalar value is in the normal state. The target address is used to instruct the calculating unit 605 to start the corresponding operation.
Specifically, the dashed box in fig. 12 represents the groups of parallel-connected ALUs in the logic module 601 that correspond to the second instruction. As shown in fig. 12, the determining unit 604 calls the second instruction and starts these groups of parallel ALUs, which read the state flag bit in the state information of each scalar value in the status register to obtain the state (normal or abnormal) of each scalar value in the vector to be calculated. The determining unit 604 then checks whether these states satisfy the judgment condition and, when at least one scalar value in the vector to be calculated is in the normal state, instructs the calculating unit 605 to start the corresponding operation.
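The second instruction reduces the per-lane state words to a single branch decision. A minimal Python sketch covering both storage layouts (Example 1: traverse all state bits; Example 2: read Y) and both judgment conditions mentioned above follows; the bit layout and the names `lane_is_normal` and `second_instruction` are assumptions.

```python
from typing import List

# Illustrative layout: five state bits (LO, HI, NaN, IN, IP) in bits 0-4,
# state flag bit Y in bit 5 (Example 2 only).
STATE_BITS = 0b011111
Y_FLAG = 0b100000

def lane_is_normal(word: int, has_y_flag: bool) -> bool:
    if has_y_flag:
        return (word & Y_FLAG) != 0        # Example 2: read Y only
    return (word & STATE_BITS) == 0        # Example 1: traverse all state types

def second_instruction(words: List[int], has_y_flag: bool,
                       condition: str = "any") -> bool:
    """Return True when the judgment condition holds, i.e. when control
    should go to the target address that starts the calculating unit."""
    normal = [lane_is_normal(w, has_y_flag) for w in words]
    if condition == "any":   # at least one scalar value is normal
        return any(normal)
    if condition == "all":   # no scalar value is abnormal
        return all(normal)
    raise ValueError(f"unknown judgment condition: {condition!r}")

print(second_instruction([0b000001, 0b000000], has_y_flag=False))  # True
```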
In another possible implementation, the configuration unit 606 may be further configured to configure a third instruction. The third instruction instructs the logic module 601 to start the parallel ALUs corresponding to the third instruction, which perform parallel multiply-add calculations on the normal-state scalar values in the vector to be calculated and the coefficient vector to obtain the polynomial values of those normal-state scalar values.
When the plurality of groups of parallel ALUs included in the logic module 601 are multiplexed by a plurality of instructions, the parallel ALU corresponding to the third instruction is the plurality of groups of parallel ALUs included in the logic module 601. When the logic module 601 includes different logic sub-modules corresponding to different instructions, the third instruction is used to instruct to start the parallel ALUs in the third logic sub-module corresponding to the third instruction in the logic module 601.
Accordingly, the calculating unit 605 may be specifically configured to call the third instruction to start the groups of parallel-connected ALUs in the logic module 601, and to substitute the normal-state scalar values in the vector to be calculated into the first function through these ALUs, obtaining the first function values of the normal-state scalar values in the vector to be calculated.
For example, when the first function can be converted into a polynomial, the calculating unit 605 may be specifically configured to call the third instruction to start the groups of parallel-connected ALUs in the logic module 601, and to perform multiply-add calculations on the normal-state scalar values in the vector to be calculated and the coefficient vector in parallel through these ALUs, obtaining the polynomial values of the normal-state scalar values in the vector to be calculated. The coefficient vector is obtained by the obtaining unit 602.
For example, fig. 13 illustrates the connection structure between the calculation unit 605 and the logic module 601. As shown in fig. 13, the calculation unit 605 may include a fifth vector register 1301, a sixth vector register 1302, and a seventh vector register 1303. The fifth vector register 1301 and the sixth vector register 1302 are input registers, and the seventh vector register 1303 is an output register. The fifth vector register 1301 stores the normal-state scalar values in the vector to be calculated. The sixth vector register 1302 stores the coefficient vector (e.g., $[a_0\ a_1\ a_2\ a_3]$). The seventh vector register 1303 stores the calculation result, which is the first function values of the normal-state scalar values in the vector to be calculated.
Specifically, the dashed box in fig. 13 represents the plurality of groups of parallel ALUs in the logic module 601 that correspond to the third instruction. As shown in fig. 13, the calculation unit 605 calls the third instruction to start these parallel ALUs. It then inputs the normal-state scalar values in the vector to be calculated and the coefficient vector acquired by the obtaining unit 602 into them, and the ALUs perform the multiply-add operations in parallel. The resulting first function values of the normal-state scalar values serve as the calculation result of the first function of the vector to be calculated. They are stored in the output vector register (the seventh vector register 1303).
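The fig. 13 dataflow can be modeled in software as shown below. This is a minimal sketch. The function name `third_instruction` is hypothetical, and so are the register names `v5`, `v6` and `v7`, which stand for registers 1301, 1302 and 1303. Evaluating the polynomial with Horner's rule (one multiply-add per coefficient) is also an assumption. The text only states that multiply-add operations are performed in parallel, not in what order.

```python
from typing import List, Sequence


def third_instruction(v5: Sequence[float], v6: Sequence[float]) -> List[float]:
    """Software model of the fig. 13 datapath (a sketch, not the patented circuit).

    v5: input register holding the normal-state scalar values, one per lane.
    v6: input register holding the coefficient vector [a0, a1, ..., an],
        lowest power first, as in the [a0 a1 a2 a3] example.
    Returns the contents of the output register v7: the polynomial value per lane.
    """
    v7: List[float] = []
    for x in v5:                 # in hardware, all lanes would run at the same time
        acc = 0.0
        for a in reversed(v6):   # Horner's rule, highest coefficient first
            acc = acc * x + a    # one multiply-add step per coefficient
        v7.append(acc)
    return v7
```

For a degree-3 polynomial, each lane needs four multiply-add steps, the first of which just loads $a_3$. Every lane executes the same sequence, so the steps map naturally onto identical ALUs running in parallel.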
The embodiments of the present application provide a vector computing apparatus. During vector computation, the apparatus detects the state of each scalar value in a vector in parallel. When normal-state scalar values exist in the vector, it performs the function calculation on those values in parallel. This reduces the number of instruction fetches and memory access transactions during vector computation. As a result, time consumption and power consumption are reduced, and vector computation efficiency is improved.
The following describes in detail how each unit of the vector computing apparatus provided in the embodiments of the present application functions during vector computation. The example used is computing the function f(x) over a vector A.
First, the obtaining unit obtains the vector A and the polynomial coefficient vector obtained by converting the function f(x).
Here, A = [2 36 98996475 12]. Expanding f(x) as a power series gives the polynomial $f(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3$, with $a_0 = 1$, $a_1 = 1$, $a_2 = 0$ and $a_3 = 1$, so the polynomial coefficient vector is [1 1 0 1].
Then, the processing unit calls the first instruction to start the plurality of groups of parallel ALUs in the logic module. Through these ALUs, it compares the scalar values of vector A with the rule conditions in parallel and obtains the state information of each value in vector A.
The rule conditions are: greater than the maximum threshold (100000), less than the minimum threshold (-100000), NaN, positive infinity, and negative infinity.
Specifically, comparing the value 2 with each rule condition shows the following: it is not greater than the maximum threshold (100000), not less than the minimum threshold (-100000), not NaN, not positive infinity, and not negative infinity. The value 2 is therefore in the normal state. In the same way, the values 36 and 12 in vector A are found to be in the normal state. The value 98996475 is greater than the maximum threshold (100000) and is therefore in the abnormal state.
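The per-lane comparison performed by the first instruction can be sketched as follows. Several details are assumptions, because the text only lists the five conditions:

- the bit-flag encoding of the state information;
- the function name `first_instruction`;
- reporting infinities only under their own flags, rather than also flagging them as exceeding the thresholds.

The thresholds are compared strictly, so 100000 itself counts as normal.

```python
import math
from typing import List, Sequence

# Hypothetical bit assignments for the per-lane state information (0 = normal).
ABOVE_MAX, BELOW_MIN, IS_NAN, POS_INF, NEG_INF = 1, 2, 4, 8, 16


def first_instruction(vec: Sequence[float],
                      max_t: float = 100000,
                      min_t: float = -100000) -> List[int]:
    """Compare every lane with the rule conditions and return one state word per lane."""
    states: List[int] = []
    for x in vec:                # each iteration stands for one ALU lane
        s = 0
        if math.isnan(x):        # checked first: NaN compares false with everything
            s |= IS_NAN
        elif math.isinf(x):
            s |= POS_INF if x > 0 else NEG_INF
        else:
            if x > max_t:
                s |= ABOVE_MAX
            if x < min_t:
                s |= BELOW_MIN
        states.append(s)
    return states
```

For the example vector A, `first_instruction([2, 36, 98996475, 12])` returns `[0, 0, 1, 0]`. Only the third lane has a nonzero state, with the "above maximum" flag set.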
The determination unit determines that three values in vector A (2, 36 and 12) are in the normal state and one value (98996475) is in the abnormal state.
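The determination step can be sketched as follows, assuming state words in which 0 means normal. Claim 3 describes this step as comparing each lane's state information with a judgment condition under a second instruction. The names `second_instruction` and `pack_normal` are illustrative assumptions. So is the idea of packing the normal lanes together before computing; hardware could equally keep the values in place and mask the abnormal lanes.

```python
from typing import List, Sequence, Tuple


def second_instruction(states: Sequence[int]) -> List[bool]:
    """Compare each lane's state word with the judgment condition 'state == 0'."""
    return [s == 0 for s in states]


def pack_normal(vec: Sequence[float],
                mask: Sequence[bool]) -> Tuple[List[int], List[float]]:
    """Collect the positions and values of normal lanes, e.g. to load an input register."""
    idx = [i for i, ok in enumerate(mask) if ok]
    return idx, [vec[i] for i in idx]
```

For vector A with state words `[0, 0, 1, 0]`, the mask is `[True, True, False, True]`. Packing yields positions `[0, 1, 3]` and values `[2, 36, 12]`, which are the three normal values identified above.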
The configuration unit configures the third instruction.
The calculation unit uses the configured third instruction to calculate the function values of 2, 36 and 12 in parallel. Specifically, the multiply-add calculation is performed on [2 36 / 12] and the coefficient vector [1 1 0 1]. Here "/" marks the position of the abnormal value 98996475, which is excluded. The resulting polynomial values are [11 46693 / 1741].
The output unit outputs the polynomial values [11 46693 / 1741] and reports that 98996475 is greater than the maximum threshold.
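With the coefficient vector [1 1 0 1], the polynomial is $f(x) = 1 + x + x^3$. The three normal-state values can be checked term by term:

```latex
\begin{aligned}
f(2)  &= 1 + 2 + 0 \cdot 2^2 + 2^3 = 1 + 2 + 0 + 8 = 11 \\
f(36) &= 1 + 36 + 0 \cdot 36^2 + 36^3 = 1 + 36 + 0 + 46656 = 46693 \\
f(12) &= 1 + 12 + 0 \cdot 12^2 + 12^3 = 1 + 12 + 0 + 1728 = 1741
\end{aligned}
```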
In another aspect, an embodiment of the present application provides a vector calculation method, which can be executed by the aforementioned vector calculation apparatus. As shown in fig. 14, the method may include the following steps:
S1401: The vector calculation device obtains a vector to be calculated and a first function.
The implementation process of S1401 may refer to the specific implementation of the aforementioned obtaining unit 602, and is not described in detail here.
S1402: The vector calculation device compares the scalar values of the vector to be calculated with the rule conditions in parallel and obtains the state information of each scalar value in the vector to be calculated.
The implementation process of S1402 may refer to the specific implementation of the configuration unit 606 and the processing unit 603, which is not described herein again.
S1403: The vector calculation device determines that the vector to be calculated contains scalar values in the normal state.
The implementation process of S1403 may refer to the specific implementation of the configuration unit 606 and the determination unit 604, which is not described in detail here.
S1404: The vector calculation device substitutes the normal-state scalar values in the vector to be calculated into the first function in parallel and obtains the calculation result of the first function of the vector to be calculated.
The implementation process of S1404 may refer to specific implementations of the configuration unit 606 and the calculation unit 605, which are not described herein again.
Further, as shown in fig. 14, the vector calculation method provided in the embodiment of the present application may further include:
S1405: The vector calculation device outputs the calculation result of the first function of the vector to be calculated.
Further, as shown in fig. 14, the vector calculation method provided in the embodiment of the present application may further include:
If the vector to be calculated contains scalar values in the abnormal state, S1406 is performed on those abnormal-state scalar values.
If the vector to be calculated contains no scalar value in the abnormal state, the process ends after S1405 is executed.
S1406: The vector calculation device outputs the abnormal-state scalar values in the vector to be calculated.
It should be noted that, for the specific implementation of S1406, reference may be made to S103; details are not repeated here.
It should be noted that the present application does not specifically limit the execution order of S1401 to S1406, and a user may configure and adjust the order according to actual requirements. For example, S1406 may be performed before S1404, after S1404, or after S1405.
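The flow of fig. 14 (S1401 to S1406) can be sketched as a single function. This is an illustrative model, not the claimed method. The names `vector_calculation` and `is_normal` are hypothetical; `is_normal` is a predicate that stands in for the rule conditions. The abnormal values are collected independently of the function evaluation, reflecting that S1406 may run before or after S1404 and S1405.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple


def vector_calculation(
    vec: Sequence[float],                 # S1401: vector to be calculated
    f: Callable[[float], float],          # S1401: first function
    is_normal: Callable[[float], bool],   # stand-in for the rule conditions
) -> Tuple[List[Optional[float]], Dict[int, float]]:
    # S1402: state information for every lane (done in parallel in hardware).
    states = [is_normal(x) for x in vec]

    # S1406 payload: abnormal values keyed by position; independent of S1404.
    abnormal = {i: x for i, (x, ok) in enumerate(zip(vec, states)) if not ok}

    results: List[Optional[float]] = [None] * len(vec)
    # S1403: proceed only if at least one value is in the normal state.
    if any(states):
        # S1404: evaluate f on the normal-state lanes only.
        for i, (x, ok) in enumerate(zip(vec, states)):
            if ok:
                results[i] = f(x)
    # S1405 outputs `results`; S1406 outputs `abnormal`.
    return results, abnormal
```

Consider `vector_calculation([2, 36, 98996475, 12], lambda x: 1 + x + x**3, lambda x: -100000 <= x <= 100000)`. It returns `([11, 46693, None, 1741], {2: 98996475})`, where `None` plays the role of the "/" placeholder in the worked example.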
As another form of the present embodiment, there is provided a computer-readable storage medium having stored thereon instructions that, when executed, perform the vector calculation method in the above-described method embodiment.
As another form of the present embodiment, there is provided a computer program product containing instructions that, when executed, perform the vector calculation method in the above-described method embodiment.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described herein may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on, or transmitted over, a computer-readable medium as one or more instructions or code. Computer-readable media include both computer storage media and communication media. Communication media include any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of some interfaces, devices or units, and may be an electric or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus a software functional unit.
An integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute some steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (12)

1. A vector calculation apparatus, characterized in that the apparatus is provided with a logic module; the logic module comprises a plurality of groups of Arithmetic Logic Units (ALUs) connected in parallel; and the apparatus comprises:
an acquisition unit configured to obtain a vector to be calculated and a first function, wherein the vector to be calculated comprises a plurality of scalar values;
a processing unit configured to compare, through the plurality of groups of parallel ALUs, the scalar values of the vector to be calculated with rule conditions respectively and in parallel, to obtain state information of each scalar value in the vector to be calculated; wherein the rule conditions are used to judge whether a scalar value is normal, and the state information of a scalar value indicates whether, compared with the rule conditions, the scalar value is in a normal state or an abnormal state;
a determination unit configured to determine that a scalar value in the vector to be calculated is in the normal state;
a computing unit configured to substitute, in parallel, the normal-state scalar values in the vector to be calculated into the first function for calculation, to obtain a calculation result of the first function of the vector to be calculated.
2. The apparatus of claim 1,
the processing unit is specifically configured to call a first instruction to start the plurality of groups of parallel ALUs, and to compare, through the plurality of groups of parallel ALUs, the scalar values of the vector to be calculated with the rule conditions respectively and in parallel, to obtain the state information of each scalar value in the vector to be calculated.
3. The apparatus according to claim 1 or 2,
the determining unit is specifically configured to call a second instruction to start the plurality of groups of parallel ALUs, and to compare, through the plurality of groups of parallel ALUs, the state information of each scalar value in the vector to be calculated with a judgment condition in parallel, to determine that a scalar value in the vector to be calculated is in the normal state.
4. The apparatus according to any one of claims 1-3, wherein the obtaining unit is further configured to obtain a polynomial coefficient vector obtained by converting the first function;
the computing unit is specifically configured to call a third instruction to start the plurality of groups of parallel ALUs, and to perform, through the plurality of groups of parallel ALUs, multiply-add calculations in parallel on the normal-state scalar values in the vector to be calculated and the coefficient vector, to obtain polynomial values of the normal-state scalar values in the vector to be calculated.
5. The apparatus according to any one of claims 1 to 4,
the apparatus further comprises a configuration unit configured to configure instructions for starting the plurality of groups of parallel ALUs.
6. The apparatus of any one of claims 1-5, wherein the logic module comprises a first logic sub-module, a second logic sub-module, and a third logic sub-module; the different logic sub-modules each comprise parallel ALUs for executing different parallel operations; and the parallel ALUs included in one logic sub-module are a subset of the plurality of groups of parallel ALUs included in the logic module.
7. A method of vector computation, the method comprising:
obtaining a vector to be calculated and a first function, wherein the vector to be calculated comprises a plurality of scalar numerical values;
comparing the scalar values of the vector to be calculated with rule conditions respectively and in parallel, to obtain state information of each scalar value in the vector to be calculated; wherein the rule conditions are used to judge whether a scalar value is normal, and the state information of a scalar value indicates whether, compared with the rule conditions, the scalar value is in a normal state or an abnormal state;
determining that a scalar value in the vector to be calculated is in the normal state; and
substituting, in parallel, the normal-state scalar values in the vector to be calculated into the first function for calculation, to obtain a calculation result of the first function of the vector to be calculated.
8. The method according to claim 7, wherein the comparing, in parallel, the scalar values of the vector to be calculated with rule conditions to obtain state information of each scalar value in the vector to be calculated comprises:
calling a first instruction to perform the parallel comparison of the scalar values of the vector to be calculated with the rule conditions, to obtain the state information of each scalar value in the vector to be calculated.
9. The method according to claim 7 or 8, wherein the determining that a scalar value in the vector to be calculated is in the normal state comprises:
calling a second instruction to compare, in parallel, the state information of each scalar value in the vector to be calculated with a judgment condition, to determine that a scalar value in the vector to be calculated is in the normal state.
10. The method according to any one of claims 7 to 9,
the method further comprises: obtaining a polynomial coefficient vector obtained by converting the first function;
the substituting, in parallel, the normal-state scalar values in the vector to be calculated into the first function for calculation, to obtain the calculation result of the first function of the vector to be calculated, comprises:
calling a third instruction to perform, in parallel, multiply-add calculations on the normal-state scalar values in the vector to be calculated and the coefficient vector, to obtain polynomial values of the normal-state scalar values in the vector to be calculated.
11. The method according to any one of claims 7-10, further comprising:
configuring an instruction for starting a plurality of groups of parallel Arithmetic Logic Units (ALU); the parallel ALUs are used to perform parallel operations.
12. A vector calculation apparatus, the apparatus comprising: a processor and a memory;
the memory is connected to the processor, the memory being configured to store a computer program which, when executed by the processor, causes the apparatus to perform the vector calculation method according to any of claims 7-11.
CN202010183821.9A 2020-03-16 2020-03-16 Vector calculation device and method Pending CN113407154A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010183821.9A CN113407154A (en) 2020-03-16 2020-03-16 Vector calculation device and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010183821.9A CN113407154A (en) 2020-03-16 2020-03-16 Vector calculation device and method

Publications (1)

Publication Number Publication Date
CN113407154A true CN113407154A (en) 2021-09-17

Family

ID=77676846

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010183821.9A Pending CN113407154A (en) 2020-03-16 2020-03-16 Vector calculation device and method

Country Status (1)

Country Link
CN (1) CN113407154A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230237097A1 (en) * 2020-06-25 2023-07-27 Nec Corporation Information processing device, information processing method, and recording medium

Similar Documents

Publication Publication Date Title
US8984043B2 (en) Multiplying and adding matrices
CN111656367A (en) System and architecture for neural network accelerator
US9495153B2 (en) Methods, apparatus, and instructions for converting vector data
JP5647859B2 (en) Apparatus and method for performing multiply-accumulate operations
US9575753B2 (en) SIMD compare instruction using permute logic for distributed register files
KR20240011204A (en) Apparatuses, methods, and systems for instructions of a matrix operations accelerator
EP3757769B1 (en) Systems and methods to skip inconsequential matrix operations
TWI787357B (en) Method and system for operating product and methods for operating dot product and operating convolution
EP3623940A2 (en) Systems and methods for performing horizontal tile operations
WO2021116799A1 (en) Mixed precision floating-point multiply-add operation
CN111381808B (en) Multiplier, data processing method, chip and electronic equipment
EP4020169A1 (en) Apparatuses, methods, and systems for 8-bit floating-point matrix dot product instructions
CN113407154A (en) Vector calculation device and method
US11880683B2 (en) Packed 16 bits instruction pipeline
CN110688153B (en) Instruction branch execution control method, related equipment and instruction structure
CN111158757B (en) Parallel access device and method and chip
WO2019023910A1 (en) Data processing method and device
CN209895329U (en) Multiplier and method for generating a digital signal
CN113591031A (en) Low-power-consumption matrix operation method and device
US8700887B2 (en) Register, processor, and method of controlling a processor using data type information
CN113867799A (en) Computing device, integrated circuit chip, board card, electronic equipment and computing method
CN113867800A (en) Computing device, integrated circuit chip, board card, electronic equipment and computing method
CN112149050A (en) Apparatus, method and system for enhanced matrix multiplier architecture
US11886737B2 (en) Devices and systems for in-memory processing determined
CN113508363B (en) Arithmetic and logical operations in a multi-user network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination