CN116302117A - Data processing method and device, processor, electronic equipment and storage medium - Google Patents


Info

Publication number
CN116302117A
CN116302117A (application CN202310297284.4A)
Authority
CN
China
Prior art keywords
vector
parameter
parameters
destination
product
Legal status: Pending
Application number
CN202310297284.4A
Other languages
Chinese (zh)
Inventor
刘磊
Current Assignee: Haiguang Information Technology Co Ltd
Original Assignee: Haiguang Information Technology Co Ltd
Application filed by Haiguang Information Technology Co Ltd
Priority to CN202310297284.4A
Publication of CN116302117A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/38: Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F 9/3885: Concurrent instruction execution using a plurality of independent parallel functional units
    • G06F 9/3887: ... controlled by a single instruction for multiple data lanes [SIMD]
    • G06F 9/3893: ... controlled in tandem, e.g. multiplier-accumulator
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Advance Control (AREA)

Abstract

A data processing method and device, a processor, an electronic device and a storage medium. The data processing method comprises the following steps: receiving a vector multiply-add instruction, wherein the vector multiply-add instruction comprises a destination address indicating where a destination operand is to be stored, and a first vector and a second vector as source operands, the first vector comprises N first parameters, the second vector comprises N second parameters, the N first parameters and the N second parameters have the same data type, the data type comprises 8-bit integer, and the N first parameters and the N second parameters have a one-to-one correspondence; performing a vector multiply-add operation on the first vector and the second vector to obtain a destination vector; and storing the destination vector at the destination address. The data processing method realizes vector multiply-add operations on 8-bit integers of the same data type without producing data overflow, improves the computational performance of low-precision general matrix multiplication or convolution calculation, and reduces or avoids the precision degradation caused by overflow.

Description

Data processing method and device, processor, electronic equipment and storage medium
Technical Field
Embodiments of the present disclosure relate to a data processing method and apparatus, a processor, an electronic device, and a non-transitory computer-readable storage medium.
Background
An instruction set is the set of instructions in a central processing unit (Central Processing Unit, CPU) that performs computation and controls the computer system. Each new CPU design defines an instruction system that cooperates with the other hardware circuits.
The SIMD (Single Instruction Multiple Data) instruction set, i.e., single-instruction multiple-data-stream technology, uses one instruction to operate on multiple data lanes in parallel. A SIMD instruction controls multiple parallel processing elements under one controller, so one instruction operation executes over multiple data streams, which can multiply the running speed of a program. SIMD instructions are very similar in nature to a vector processor: the same operation is performed simultaneously on a set of data (also known as a "data vector") under one controller to achieve spatial parallelism. For example, an ordinary add instruction can add only two numbers at a time, whereas a SIMD add instruction can add two arrays (vectors) at a time.
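As an illustration (not from the patent text), the contrast can be sketched in C with the existing SSE2 intrinsics: the scalar loop below performs one addition per iteration, while the SIMD version adds four 32-bit lanes per instruction.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Scalar: one addition per loop iteration. */
void add_scalar(const int32_t *a, const int32_t *b, int32_t *c, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* SIMD: each _mm_add_epi32 adds four 32-bit lanes at once
 * (n is assumed to be a multiple of 4 for brevity). */
void add_simd(const int32_t *a, const int32_t *b, int32_t *c, int n) {
    for (int i = 0; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(c + i), _mm_add_epi32(va, vb));
    }
}
```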
Disclosure of Invention
At least one embodiment of the present disclosure provides a data processing method, including: receiving a vector multiply-add instruction, where the vector multiply-add instruction includes a destination address indicating where a destination operand is stored, and a first vector and a second vector as source operands, the first vector includes N first parameters, the second vector includes N second parameters, the N first parameters and the N second parameters have the same data type, the data type includes 8-bit integer, and there is a one-to-one correspondence between the N first parameters and the N second parameters; performing a vector multiply-add operation on the first vector and the second vector to obtain a destination vector, where the destination vector includes M destination parameters, each destination parameter is the sum of at least two product results, and each product result is the product of a corresponding first parameter and second parameter; and storing the destination vector at the destination address, where M and N are positive integers.
For example, in the data processing method provided by at least one embodiment of the present disclosure, M = N/2, the i-th destination parameter of the M destination parameters is the sum of the product of the (2i-1)-th first parameter and the (2i-1)-th second parameter and the product of the (2i)-th first parameter and the (2i)-th second parameter, and i sequentially takes 1 to M.
For example, in the data processing method provided by at least one embodiment of the present disclosure, performing the vector multiply-add operation on the first vector and the second vector to obtain the destination vector includes: performing, in combination with an intermediate buffer unit, M cyclic multiply-add operations on the N first parameters and the N second parameters to obtain the destination vector.
For example, in the data processing method provided by at least one embodiment of the present disclosure, performing, in combination with the intermediate buffer unit, the M cyclic multiply-add operations on the N first parameters and the N second parameters to obtain the destination vector includes: in the i-th cyclic multiply-add operation for the i-th destination parameter of the M destination parameters: calculating the product of the (2i-1)-th first parameter and the (2i-1)-th second parameter to obtain a first product result; writing the first product result into the intermediate buffer unit; calculating the product of the (2i)-th first parameter and the (2i)-th second parameter to obtain a second product result; reading the first product result buffered in the intermediate buffer unit, adding the second product result to the first product result, and writing the addition result back into the intermediate buffer unit; and assigning the addition result buffered in the intermediate buffer unit, as the i-th destination parameter, to the corresponding interval of the destination vector.
For example, in the data processing method provided by at least one embodiment of the present disclosure, the intermediate buffer unit is a first physical register with a bit width of 16 bits, and calculating the product of the (2i-1)-th first parameter and the (2i-1)-th second parameter to obtain the first product result and writing the first product result into the intermediate buffer unit includes: extracting, from the storage address where the first vector is stored, a first operand with a bit width of 16 bits that includes the (2i-1)-th first parameter and the (2i)-th first parameter, and writing it into a second physical register; extracting, from the storage address where the second vector is stored, a second operand with a bit width of 16 bits that includes the (2i-1)-th second parameter and the (2i)-th second parameter, and writing it into a third physical register; calculating the product of the first operand and the second operand by using a first multiplier to obtain the first product result, where the first multiplier is an 8-bit multiplier; and writing the first product result into the first physical register.
For example, in the data processing method provided by at least one embodiment of the present disclosure, calculating the product of the (2i)-th first parameter and the (2i)-th second parameter to obtain the second product result includes: shifting the first operand and the second operand to obtain a first shift result and a second shift result; and calculating the product of the first shift result and the second shift result as the second product result by using a second multiplier, where the second multiplier is an 8-bit multiplier.
For example, in the data processing method provided by at least one embodiment of the present disclosure, in response to a first instruction set architecture, the vector multiply-add instruction includes two instruction parameters, including a first instruction parameter indicating the memory address where the first vector and the destination vector are stored, and a second instruction parameter indicating the memory address where the second vector is stored; the first instruction parameter is in the form of a vector register, and the second instruction parameter is in the form of a vector register or system memory.
For example, in the data processing method provided by at least one embodiment of the present disclosure, in response to a second instruction set architecture, the vector multiply-add instruction includes three instruction parameters, including a first instruction parameter indicating the destination address for storing the destination vector, a second instruction parameter indicating the storage address of the first vector, and a third instruction parameter indicating the storage address of the second vector; the first instruction parameter and the second instruction parameter are in the form of vector registers, and the third instruction parameter is in the form of a vector register or system memory.
At least one embodiment of the present disclosure provides a data processing apparatus, including a decoding unit, an execution unit, and a storage unit. The decoding unit is configured to: receive a vector multiply-add instruction, where the vector multiply-add instruction includes a destination address indicating where a destination operand is stored, and a first vector and a second vector as source operands, the first vector includes N first parameters, the second vector includes N second parameters, the N first parameters and the N second parameters have the same data type, the data type includes 8-bit integer, and the N first parameters and the N second parameters have a one-to-one correspondence; and decode the vector multiply-add instruction so that the execution unit executes the vector multiply-add instruction. The storage unit is configured to store or fetch the first vector, the second vector, and a destination vector. The execution unit executing the vector multiply-add instruction includes performing the following operations: reading the first vector and the second vector from the storage unit; performing a vector multiply-add operation on the first vector and the second vector to obtain the destination vector, where the destination vector includes M destination parameters, each destination parameter is the sum of at least two product results, and each product result is the product of a corresponding first parameter and second parameter; and storing the destination vector at the destination address in the storage unit, where M and N are positive integers.
For example, in the data processing apparatus provided by at least one embodiment of the present disclosure, when performing the vector multiply-add operation on the first vector and the second vector to obtain the destination vector, the execution unit performs the following operations: performing, in combination with an intermediate buffer unit, M cyclic multiply-add operations on the N first parameters and the N second parameters to obtain the destination vector.
For example, in the data processing apparatus provided by at least one embodiment of the present disclosure, when performing, in combination with the intermediate buffer unit, the M cyclic multiply-add operations on the N first parameters and the N second parameters to obtain the destination vector, the execution unit performs the following operations: in the i-th cyclic multiply-add operation for the i-th destination parameter of the M destination parameters: calculating the product of the (2i-1)-th first parameter and the (2i-1)-th second parameter to obtain a first product result; writing the first product result into the intermediate buffer unit; calculating the product of the (2i)-th first parameter and the (2i)-th second parameter to obtain a second product result; reading the first product result buffered in the intermediate buffer unit, adding the second product result to the first product result, and writing the addition result back into the intermediate buffer unit; and assigning the addition result buffered in the intermediate buffer unit, as the i-th destination parameter, to the corresponding interval of the destination vector.
For example, in the data processing apparatus provided by at least one embodiment of the present disclosure, the intermediate buffer unit is a first physical register with a bit width of 16 bits, the execution unit includes a first multiplier, and the first multiplier is an 8-bit multiplier; when calculating the product of the (2i-1)-th first parameter and the (2i-1)-th second parameter to obtain the first product result and writing the first product result into the intermediate buffer unit, the execution unit performs the following operations: extracting, from the storage address in the storage unit where the first vector is stored, a first operand with a bit width of 16 bits that includes the (2i-1)-th first parameter and the (2i)-th first parameter, and writing it into a second physical register; extracting, from the storage address in the storage unit where the second vector is stored, a second operand with a bit width of 16 bits that includes the (2i-1)-th second parameter and the (2i)-th second parameter, and writing it into a third physical register; calculating the product of the first operand and the second operand by using the first multiplier to obtain the first product result; and writing the first product result into the first physical register.
For example, in the data processing apparatus provided by at least one embodiment of the present disclosure, the execution unit further includes a second multiplier, and the second multiplier is an 8-bit multiplier; when calculating the product of the (2i)-th first parameter and the (2i)-th second parameter to obtain the second product result, the execution unit performs the following operations: shifting the first operand and the second operand to obtain a first shift result and a second shift result; and calculating the product of the first shift result and the second shift result as the second product result by using the second multiplier.
For example, in at least one embodiment of the present disclosure, a data processing apparatus is provided, where the storage unit includes a vector register, a cache, or a system memory.
At least one embodiment of the present disclosure provides a processor comprising a data processing apparatus according to any one of the embodiments of the present disclosure.
At least one embodiment of the present disclosure provides an electronic device, including: a memory non-transitory storing computer-executable instructions; a processor configured to execute the computer-executable instructions, wherein the computer-executable instructions, when executed by the processor, implement a data processing method according to any embodiment of the present disclosure.
At least one embodiment of the present disclosure provides a non-transitory computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement a data processing method according to any embodiment of the present disclosure.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings of the embodiments will be briefly described below; it is apparent that the drawings described below relate only to some embodiments of the present disclosure and do not limit the present disclosure.
FIG. 1 is a schematic flow chart of a data processing method according to at least one embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a vector multiply-add operation provided by at least one embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a cyclic multiply-add calculation process according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a cyclic multiply-add calculation process according to another embodiment of the present disclosure;
FIG. 5 is a diagram illustrating an implementation of a vector multiply-add instruction in a microarchitecture according to an embodiment of the disclosure;
FIG. 6A is a schematic block diagram of a data processing apparatus provided in accordance with at least one embodiment of the present disclosure;
FIG. 6B is a schematic block diagram of another data processing apparatus provided in accordance with at least one embodiment of the present disclosure;
FIG. 7 is a schematic block diagram of a processor provided in accordance with at least one embodiment of the present disclosure;
FIG. 8 is a schematic block diagram of an electronic device provided in accordance with at least one embodiment of the present disclosure;
fig. 9 is a schematic diagram of a non-transitory computer readable storage medium according to at least one embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present disclosure. It will be apparent that the described embodiments are some, but not all, of the embodiments of the present disclosure. All other embodiments, which can be made by one of ordinary skill in the art without the need for inventive faculty, are within the scope of the present disclosure, based on the described embodiments of the present disclosure.
Unless defined otherwise, technical or scientific terms used in this disclosure should be given the ordinary meaning as understood by one of ordinary skill in the art to which this disclosure belongs. The terms "first," "second," and the like, as used in this disclosure, do not denote any order, quantity, or importance, but rather are used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that elements or items preceding the word are included in the element or item listed after the word and equivalents thereof, but does not exclude other elements or items. The terms "connected" or "connected," and the like, are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", etc. are used merely to indicate relative positional relationships, which may also be changed when the absolute position of the object to be described is changed.
In order to keep the following description of the embodiments of the present disclosure clear and concise, the present disclosure omits a detailed description of some known functions and known components.
Deep neural networks have recently achieved major performance breakthroughs in computer vision. The performance of the most advanced neural networks benefits from very deep and over-parameterized multi-layer architectures, often with hundreds of thousands or millions of parameters. However, this large number of parameters requires high-performance vector computing processors to process the data, and the demands on computing power and storage resources grow rapidly. One possible way to address this problem is to reduce the data bit width, because a lower data bit width can effectively reduce the computation density while lowering the power consumption of the computer and the storage requirements. Decreasing the bit width does cost some model accuracy, but in practice it has been demonstrated that using 8-bit integers (Int8) loses very little accuracy compared to using floating-point data and fully meets commercial requirements.
Reasoning and training at low numerical precision has the following concrete advantages: first, many operations are limited by memory bandwidth; reducing precision allows better use of the cache and reduces bandwidth bottlenecks, so data can move through the memory hierarchy faster to maximize the use of computing resources. Second, hardware can achieve more operations per second at lower numerical precision, because the multipliers require less silicon area and power.
At present, for 8-bit integers, i.e., data with a bit width of 8 bits, the bit width is reduced by a factor of 4 compared with FP32 (32-bit floating point), so reasoning and training at the lower numerical precision of 8-bit integers can achieve higher operation speed and make maximum use of computing resources. Currently, the X86 architecture provides the vpmaddubsw instruction for operating on 8-bit data. The input parameters of this instruction must be an 8-bit unsigned integer (Unsigned Int8) and an 8-bit signed integer (Signed Int8). However, in current reasoning and training scenarios, the input parameters are often all 8-bit signed integers, so when using the currently provided 8-bit operation instruction, one of the 8-bit signed integers must first be converted into an 8-bit unsigned integer.
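For reference (an illustrative sketch, not part of the patent), this existing instruction is exposed in C compilers as the SSSE3 intrinsic _mm_maddubs_epi16; its first input must hold unsigned bytes, which is exactly why two signed-int8 inputs require a prior conversion:

```c
#include <tmmintrin.h>  /* SSSE3: _mm_maddubs_epi16 maps to (v)pmaddubsw */
#include <stdint.h>

/* Pairwise u8 x s8 multiply-add: 16 unsigned bytes times 16 signed
 * bytes, adjacent products summed into 8 signed 16-bit lanes. */
__m128i maddubs_example(const uint8_t *u, const int8_t *s) {
    __m128i a = _mm_loadu_si128((const __m128i *)u); /* unsigned operand */
    __m128i b = _mm_loadu_si128((const __m128i *)s); /* signed operand   */
    return _mm_maddubs_epi16(a, b);
}
```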
However, this conversion requires additional instructions, which greatly increases the instruction count, especially in large-scale computing where the count can reach the billions, so the conversion causes a performance loss. Furthermore, when the input parameters approach the maximum values, the instruction produces data overflow with undefined overflow behavior, resulting in calculation errors.
For the vpmaddubsw instruction, the inputs include source operation array 1 and source operation array 2, and the output is the destination operation array. Source operation array 1 comprises a plurality of source operands 1, each of 8-bit unsigned integer type with value range [0, 255] (including 0 and 255); source operation array 2 comprises a plurality of source operands 2, each of 8-bit signed integer type with value range [-128, 127] (including -128 and 127); the destination operation array comprises a plurality of destination operands, each of 16-bit signed integer type with value range [-32768, 32767] (including -32768 and 32767).
When the vpmaddubsw instruction is executed, the first destination operand D0 is the sum of the product of the first source operand 1 (e.g., S0) and the first source operand 2 (e.g., S0') and the product of the second source operand 1 (e.g., S1) and the second source operand 2 (e.g., S1'), i.e., D0 = S0*S0' + S1*S1'. For example, when S0 = 255, S1 = 255, S0' = 127, S1' = 127, D0 = 255×127 + 255×127 = 64770, and 64770 is larger than the range representable by the destination operand, so data overflow occurs. In neural network calculations, practical tests show that the probability of data overflow is high, which causes calculation errors.
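A scalar sketch (illustrative only) makes the overflow arithmetic concrete:

```c
#include <stdint.h>
#include <stdio.h>

/* One destination lane of a vpmaddubsw-style operation:
 * unsigned 8-bit times signed 8-bit, pairwise sum into 16 bits. */
int main(void) {
    uint8_t s0 = 255, s1 = 255;  /* source operands 1 (unsigned) */
    int8_t  t0 = 127, t1 = 127;  /* source operands 2 (signed)   */

    int32_t exact = (int32_t)s0 * t0 + (int32_t)s1 * t1;
    printf("exact sum = %d, INT16_MAX = %d\n", (int)exact, (int)INT16_MAX);
    /* Prints 64770 > 32767: the true result does not fit in a
     * signed 16-bit destination operand, i.e. the lane overflows. */
    return 0;
}
```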
Typically, to address data overflow, the input bit width is reduced by one bit, for example by setting the value range of source operand 1 to [0, 127] (including 0 and 127) and the value range of source operand 2 to [-64, 63] (including -64 and 63); however, this further reduces the numerical precision, resulting in an accuracy loss.
Further, in practice, since an 8-bit signed integer must be converted into an 8-bit unsigned integer, performing operations between two 8-bit signed integers in this way is more than 5% slower than directly performing operations between an 8-bit signed integer and an 8-bit unsigned integer.
At least one embodiment of the present disclosure provides a data processing method, a data processing apparatus, an electronic device, a processor, and a non-transitory computer-readable storage medium. The data processing method comprises the following steps: receiving a vector multiply-add instruction, wherein the vector multiply-add instruction comprises a destination address indicating where a destination operand is to be stored, and a first vector and a second vector as source operands, the first vector comprises N first parameters, the second vector comprises N second parameters, the N first parameters and the N second parameters have the same data type, the data type comprises 8-bit integer, and the N first parameters and the N second parameters have a one-to-one correspondence; performing a vector multiply-add operation on the first vector and the second vector to obtain a destination vector, wherein the destination vector comprises M destination parameters, each destination parameter is the sum of at least two product results, and each product result is the product of a corresponding first parameter and second parameter; and storing the destination vector at the destination address, wherein M and N are positive integers.
In this data processing method, a new low-precision INT8 instruction is provided that realizes vector multiply-add operations on 8-bit integers of the same data type, for example on two signed 8-bit integers or two unsigned 8-bit integers. The data processing method does not produce data overflow, solves the performance loss caused by converting 8-bit signed integers to 8-bit unsigned integers, improves the computational performance of low-precision general matrix multiplication (GEMM, General Matrix Multiplication) or convolution calculations, and reduces or avoids the precision degradation caused by overflow.
Embodiments of the present disclosure will be described in detail below with reference to the attached drawings, but the present disclosure is not limited to these specific embodiments.
Fig. 1 is a schematic flow chart of a data processing method according to at least one embodiment of the present disclosure. As shown in fig. 1, the data processing method at least includes steps S10 to S30.
For example, in step S10, a vector multiply add instruction is received.
For example, the vector multiply-add instruction includes a destination address indicating where the destination operand is to be stored, and a first vector and a second vector as source operands; the first vector includes N first parameters, the second vector includes N second parameters, the N first parameters and the N second parameters have the same data type, the data type includes 8-bit integer, and the N first parameters have a one-to-one correspondence with the N second parameters. Here, N is a positive integer.
For example, the data type may include an 8-bit unsigned integer or an 8-bit signed integer. For example, each first parameter and each second parameter is an 8-bit signed integer (Signed Int8), or each first parameter and each second parameter is an 8-bit unsigned integer (Unsigned Int8).
For example, the data bit width of the first vector may be 8×N bits and the data bit width of the second vector may be 8×N bits; for example, the data bit width of the first vector may be 128 bits, 256 bits, 512 bits, or the like. Those skilled in the art may set the widths of the first vector and the second vector according to actual needs.
For example, there is a one-to-one correspondence between the N first parameters and the N second parameters.
For example, the N first parameters include a first parameter A0, a first parameter A1, ..., and a first parameter A(N-1). For example, the first parameter A0 comprises bits 0 to 7 of the first vector, the first parameter A1 comprises bits 8 to 15 of the first vector, and so on. For example, the N second parameters include a second parameter B0, a second parameter B1, ..., and a second parameter B(N-1). For example, the second parameter B0 comprises bits 0 to 7 of the second vector, the second parameter B1 comprises bits 8 to 15 of the second vector, and so on. For example, the first parameter A0 corresponds to the second parameter B0, the first parameter A1 corresponds to the second parameter B1, ..., and the first parameter A(N-1) corresponds to the second parameter B(N-1).
In step S20, a vector multiply-add operation is performed on the first vector and the second vector to obtain a destination vector.
For example, the destination vector includes M destination parameters, each destination parameter being a sum of at least two product results, each product result being a product of a corresponding first parameter and a second parameter.
For example, the vector multiply-add instruction is a SIMD instruction, i.e., the same multiply-add operation is performed simultaneously on the first vector and the second vector, respectively, to achieve spatial parallelism.
Here, M is a positive integer smaller than N, and M and N have a multiple relationship, where the multiple relationship between M and N is determined by the number of product results summed in each destination parameter.
In step S30, the destination vector is stored in the destination address.
For example, the destination address may be a vector register, e.g., a register defined by the instruction set architecture (ISA, Instruction Set Architecture); for example, in the X86 instruction set, the vector register may be one of ymm0 to ymm15 (16 256-bit logical registers) or xmm0 to xmm7 (8 128-bit logical registers). Furthermore, the vector registers in embodiments of the present disclosure may be registers allocated within the processor for vector multiply-add operations.
The destination address may also be a cache address or a memory address, for example. For example, the destination address may be an address of 128 bits or 256 bits or 512 bits of system memory.
Fig. 2 is a schematic diagram of a vector multiply-add operation according to at least one embodiment of the present disclosure.
As shown in FIG. 2, the first vector includes N first parameters, namely first parameter A0, first parameter A1, first parameter A2, first parameter A3, ..., first parameter A(N-1).
As shown in FIG. 2, the second vector includes N second parameters, namely second parameter B0, second parameter B1, second parameter B2, second parameter B3, ..., second parameter B(N-1).
For example, as shown in FIG. 2, M = N/2, and the i-th destination parameter of the M destination parameters is the sum of the product of the (2i-1)-th first parameter and the (2i-1)-th second parameter and the product of the (2i)-th first parameter and the (2i)-th second parameter, with i sequentially taking 1 to M.
For example, as shown in FIG. 2, the 1st destination parameter M0 = A0*B0 + A1*B1, the 2nd destination parameter M1 = A2*B2 + A3*B3, ..., and the last destination parameter M(N/2-1) = A(N-2)*B(N-2) + A(N-1)*B(N-1).
Of course, in other embodiments, each destination parameter may be the sum of more product results, for example the sum of 4 product results, that is, M0 = A0*B0 + A1*B1 + A2*B2 + A3*B3 and M1 = A4*B4 + A5*B5 + A6*B6 + A7*B7; this disclosure does not specifically limit this.
For example, the bit width of the first parameter and the second parameter is 8 bits; e.g., the first parameter and the second parameter are signed 8-bit integers. For example, the bit width of the destination parameter is 16 bits; e.g., the destination parameter is a signed 16-bit integer.
For example, taking the sum of more product results as the destination parameter may be implemented by superposing multiple base instructions. In this disclosure, the specific embodiments are described taking M = N/2 as an example, that is, the vector multiply-add instruction may be a base instruction. The execution process of an instruction that takes the sum of more product results as the destination parameter is similar to that of this embodiment and is not described again in detail.
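For concreteness, the M = N/2 semantics described above can be modeled by a scalar reference function (an illustrative sketch with 0-based indexing, not the patent's implementation; the proposed instruction itself is not part of any existing ISA):

```c
#include <stdint.h>

/* Reference model of the proposed same-signedness int8 multiply-add:
 * dst[i] = a[2i]*b[2i] + a[2i+1]*b[2i+1], with int16 destination lanes. */
void vector_madd_ref(const int8_t *a, const int8_t *b,
                     int16_t *dst, int n /* n = N, assumed even */) {
    for (int i = 0; i < n / 2; i++) {
        int32_t sum = (int32_t)a[2 * i]     * b[2 * i]
                    + (int32_t)a[2 * i + 1] * b[2 * i + 1];
        dst[i] = (int16_t)sum;  /* per the disclosure, stored as signed int16 */
    }
}
```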
Low-precision multiply-add operations are widely used in deep neural networks; low-precision operation not only alleviates the excessive storage caused by the large number of parameters in a neural network, but also improves the performance of the neural network while keeping the accuracy loss within a controllable range. In the above embodiment, the first parameter and the second parameter are set to the same data type with a bit width of 8 bits, for example both are 8-bit signed integers, which fills the gap that current instruction sets provide no vector multiply-add operation between a signed 8-bit integer and a signed 8-bit integer; the provided low-precision INT8 vector multiply-add instruction needs no sign conversion, improving computing performance. In addition, the low-precision vector multiply-add instruction makes better use of the cache, reduces bandwidth bottlenecks, and can move data through the memory hierarchy faster, so computing resources are used to the maximum extent and more operations per second are achieved.
In addition, in the data processing method provided by at least one embodiment of the present disclosure, even if the first parameter or the second parameter is close to the maximum value of its range, no data overflow occurs.
For example, each first parameter is an 8-bit signed integer with value range [-128, 127] (including -128 and 127); each second parameter is an 8-bit signed integer with value range [-128, 127] (including -128 and 127); each destination parameter is a 16-bit signed integer with value range [-32768, 32767] (including -32768 and 32767). If A0 = 127, A1 = 127, B0 = 127, B1 = 127, then M0 = 127×127 + 127×127 = 32258, and 32258 is within the range the destination parameter can represent. Therefore, in the data processing method provided by at least one embodiment of the present disclosure, the vector multiply-add operation does not overflow, and no input bit needs to be dropped, so the precision of the vector multiply-add operation is maintained and the stability and accuracy of the calculation are ensured.
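The bound in this example can be checked mechanically (illustrative only; it covers the worst positive case from the example above):

```c
#include <stdint.h>

/* 127*127 + 127*127 = 32258 <= INT16_MAX = 32767. */
_Static_assert(127 * 127 + 127 * 127 <= INT16_MAX,
               "two products of 127*127 fit in a signed 16-bit lane");
```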
Instruction sets are generally classified into RISC (Reduced Instruction Set Computer) and CISC (Complex Instruction Set Computer). On top of the CISC instruction set, a series of instruction set extensions has been developed: the SSE instruction set (Streaming SIMD Extensions), the AVX instruction set (Advanced Vector Extensions), and the like.
For example, in some embodiments, in response to a first instruction set architecture, a vector multiply add instruction includes two instruction parameters including a first instruction parameter indicating a memory address to store a first vector and a destination vector, and a second instruction parameter indicating a memory address to store a second vector; the first instruction parameter is in the form of a vector register and the second instruction parameter is in the form of a vector register or system memory.
For example, the first instruction set architecture may be an SSE instruction set architecture.
For example, the vector multiply-add instruction may then be represented as "PMADDSBSW xmm1, xmm2/m128" to complete a 128-bit vector multiply-add operation. In this instruction, xmm1 is the first instruction parameter and is a 128-bit vector register in which the first vector (a source operand) and the destination vector (the destination operand) are stored; xmm1 is readable and writable. xmm2/m128 is the second instruction parameter and is a 128-bit vector register or a 128-bit memory address in which the second vector (a source operand) is stored; the vector register xmm2 is read-only.
For example, in other embodiments, in response to the second instruction set architecture, the vector multiply-add instruction includes three instruction parameters, including a first instruction parameter indicating the destination address for storing the destination vector, a second instruction parameter indicating the storage address of the first vector, and a third instruction parameter indicating the storage address of the second vector; the first instruction parameter and the second instruction parameter are in the form of vector registers, and the third instruction parameter is in the form of a vector register or system memory.
For example, the second instruction set architecture may be an AVX instruction set architecture.
For example, in some examples, the vector multiply-add instruction may be represented as "VPMADDSBSW xmm1, xmm2, xmm3/m128" for performing a 128-bit vector multiply-add operation. In this instruction, xmm1 is the first instruction parameter and is a 128-bit vector register in which the destination vector (the destination operand) is stored; the vector register xmm1 is write-only. xmm2 is the second instruction parameter and is a 128-bit vector register in which the first vector is stored as a source operand. xmm3/m128 is the third instruction parameter and is a 128-bit vector register or a 128-bit memory address in which the second vector is stored as a source operand; the vector register xmm3 is read-only.
For example, in other examples, the vector multiply-add instruction may be represented as "VPMADDSBSW ymm1, ymm2, ymm3/m256" for performing a 256-bit vector multiply-add operation. In this instruction, ymm1 is the first instruction parameter and is a 256-bit vector register in which the destination vector (the destination operand) is stored; the vector register ymm1 is write-only. ymm2 is the second instruction parameter and is a 256-bit vector register in which the first vector is stored as a source operand. ymm3/m256 is the third instruction parameter and is a 256-bit vector register or a 256-bit memory address; the second vector is stored as a source operand in the vector register ymm3 or the memory address m256, and the vector register ymm3 is read-only.
Table 1 illustrates the addressing fields of the parameters of the vector multiply-add instruction.
TABLE 1

| Instruction set architecture | First instruction parameter | Second instruction parameter | Third instruction parameter | Fourth instruction parameter |
| First instruction set architecture | ModRM.reg | ModRM.r/m | NA | NA |
| Second instruction set architecture | ModRM.reg | VEX.vvvv | ModRM.r/m | NA |
For example, addressing of the instruction parameters is provided by means of three fields, namely ModRM.reg, ModRM.r/m, and VEX.vvvv in Table 1 (NA indicates that the instruction parameter is not required). VEX.vvvv provides register addressing of source or destination operands; ModRM.reg provides register addressing of source or destination operands; ModRM.r/m can, in addition to providing register addressing of source or destination operands, provide addressing of memory operands.
For example, step S20 may include: performing, in combination with an intermediate buffer unit, M cyclic multiply-add operations on the N first parameters and the N second parameters to obtain the destination vector.
For example, the intermediate buffer unit may be a physical register; one execution of the vector multiply-add operation reuses the intermediate buffer unit for the cyclic multiply-add operations, so that the limited storage resources are occupied as little as possible while the cyclic multiply-add structure improves instruction execution efficiency.
For example, taking M = N/2 as an example, performing, in combination with the intermediate buffer unit, M cyclic multiply-add operations on the N first parameters and the N second parameters to obtain the destination vector may include: in the i-th cyclic multiply-add operation for the i-th destination parameter of the M destination parameters: calculating the product of the (2i-1)-th first parameter and the (2i-1)-th second parameter to obtain a first product result; writing the first product result into the intermediate buffer unit; calculating the product of the (2i)-th first parameter and the (2i)-th second parameter to obtain a second product result; reading the first product result buffered in the intermediate buffer unit, adding the second product result to the first product result, and writing the addition result back into the intermediate buffer unit; and assigning the addition result buffered in the intermediate buffer unit, as the i-th destination parameter, to the corresponding interval of the destination vector.
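A sketch of this cyclic multiply-add, with the 16-bit intermediate buffer modeled as a local variable (illustrative; the names are not from the patent):

```c
#include <stdint.h>

/* M = N/2 cyclic multiply-add mirroring the step order above:
 * first product -> buffer; second product; read, add, write back;
 * assign the buffered sum as the i-th destination parameter. */
void madd_with_temp(const int8_t *a, const int8_t *b,
                    int16_t *dst, int m /* M = N/2 */) {
    for (int i = 1; i <= m; i++) {   /* i-th cyclic multiply-add */
        int16_t temp;                /* 16-bit intermediate buffer */
        temp = (int16_t)((int16_t)a[2 * i - 2] * b[2 * i - 2]);          /* 1st product */
        int16_t second = (int16_t)((int16_t)a[2 * i - 1] * b[2 * i - 1]); /* 2nd product */
        temp = (int16_t)(temp + second);  /* add and write back */
        dst[i - 1] = temp;                /* i-th destination parameter */
    }
}
```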
The following describes the execution process of a vector multiply-add instruction in a data processing method according to at least one embodiment of the present disclosure with reference to the accompanying drawings.
Fig. 3 is a schematic diagram of a cyclic multiply-add calculation process according to an embodiment of the disclosure.
For example, in the embodiment shown in FIG. 3, a vector multiply add instruction is used to perform a 128-bit vector multiply add operation using registers that are 128 bits wide, such that a first vector includes 16 first parameters and a second vector includes 16 second parameters. For example, the 1 st first parameter is the 0 th to 7 th bits in the first vector, the 2 nd first parameter is the 8 th to 15 th bits in the first vector, and so on; the 1 st second parameter is the 0 th to 7 th bits in the second vector, the 2 nd second parameter is the 8 th to 15 th bits in the second vector, and so on.
For example, 16 first parameters correspond one-to-one with 16 second parameters, e.g., 1 st first parameter corresponds to 1 st second parameter, 2 nd first parameter corresponds to 2 nd second parameter, and so on.
For example, in fig. 3, the first vector and the destination vector share one vector register, DEST denotes the first vector and the destination vector, and SRC denotes the second vector.
For example, the loop multiply-add computation process shown in FIG. 3 may be the execution of a vector multiply-add instruction under the first instruction set architecture, e.g., where the vector multiply-add instruction may be represented as "PMADDSBSW DEST, SRC".
As shown in FIG. 3, in the 1st cyclic multiply-add operation, first, bits 0-7 of SRC, SRC[7:0], are acquired as the 1st second parameter and bits 0-7 of DEST, DEST[7:0], are acquired as the 1st first parameter; the product of the 1st first parameter and the 1st second parameter is calculated to obtain the first product result (SRC[7:0] × DEST[7:0]), and the first product result is written into the 16-bit intermediate buffer TEMP. Then, bits 8-15 of SRC, SRC[15:8], are acquired as the 2nd second parameter and bits 8-15 of DEST, DEST[15:8], are acquired as the 2nd first parameter; the product of the 2nd first parameter and the 2nd second parameter is calculated to obtain the second product result (SRC[15:8] × DEST[15:8]), the second product result is added to the first product result buffered in the intermediate buffer unit, and the addition result is written back into the intermediate buffer unit. Then, the addition result buffered in the intermediate buffer unit is assigned, as the 1st destination parameter, to the lower 16 bits of DEST (DEST[15:0]).
As shown in FIG. 3, in the 2nd cyclic multiply-add operation, first, bits 16-23 of SRC, SRC[23:16], are acquired as the 3rd second parameter and bits 16-23 of DEST, DEST[23:16], are acquired as the 3rd first parameter; the product of the 3rd first parameter and the 3rd second parameter is calculated to obtain the first product result (SRC[23:16] × DEST[23:16]), and the first product result is written into the 16-bit intermediate buffer TEMP. Then, bits 24-31 of SRC, SRC[31:24], are acquired as the 4th second parameter and bits 24-31 of DEST, DEST[31:24], are acquired as the 4th first parameter; the product of the 4th first parameter and the 4th second parameter is calculated to obtain the second product result (SRC[31:24] × DEST[31:24]), the second product result is added to the first product result buffered in the intermediate buffer unit, and the addition result is written back into the intermediate buffer unit. Then, the addition result buffered in the intermediate buffer unit is assigned, as the 2nd destination parameter, to bits 16-31 of DEST (DEST[31:16]).
The loop then continues for 6 more iterations in the same manner until all bits in the first vector and the second vector have been processed, and DEST finally stores the destination vector comprising 8 destination parameters; the detailed process is not repeated.
For example, in the above procedure, since DEST represents the storage address of the destination operand, the procedure of assigning a value to the corresponding section in DEST is also equivalent to the procedure of storing the destination vector in the destination address.
Fig. 4 is a schematic diagram of a cyclic multiply-add calculation process according to another embodiment of the disclosure.
For example, in the embodiment shown in fig. 4, the vector multiply-add instruction is configured to complete 128-bit vector multiply-add operation, and the used register is 128-bit wide, so that the first vector includes 16 first parameters, the second vector includes 16 second parameters, the 16 first parameters and the 16 second parameters are in one-to-one correspondence, and the definition and correspondence of the first parameters and the second parameters may refer to the related description in fig. 3, and the repetition is omitted.
For example, unlike fig. 3, in fig. 4, the first vector and the destination vector do not share one vector register, SRC1 represents the first vector, SRC2 represents the second vector, and DEST represents the destination vector.
For example, the cyclic multiply-add calculation process shown in FIG. 4 may be the execution of the vector multiply-add instruction under the second instruction set architecture; for example, the vector multiply-add instruction may be represented as "VPMADDSBSW DEST, SRC1, SRC2".
As shown in FIG. 4, in the 1st cyclic multiply-add operation, first, bits 0-7 of SRC1, SRC1[7:0], are acquired as the 1st first parameter and bits 0-7 of SRC2, SRC2[7:0], are acquired as the 1st second parameter; the product of the 1st first parameter and the 1st second parameter is calculated to obtain the first product result (SRC1[7:0] × SRC2[7:0]), and the first product result is written into the 16-bit intermediate buffer TEMP. Then, bits 8-15 of SRC1, SRC1[15:8], are acquired as the 2nd first parameter and bits 8-15 of SRC2, SRC2[15:8], are acquired as the 2nd second parameter; the product of the 2nd first parameter and the 2nd second parameter is calculated to obtain the second product result (SRC1[15:8] × SRC2[15:8]), the second product result is added to the first product result buffered in the intermediate buffer unit, and the addition result is written back into the intermediate buffer unit. Then, the addition result buffered in the intermediate buffer unit is assigned, as the 1st destination parameter, to the lower 16 bits of DEST (DEST[15:0]).
As shown in FIG. 4, in the 2nd cyclic multiply-add operation, first, bits 16-23 of SRC1, SRC1[23:16], are acquired as the 3rd first parameter and bits 16-23 of SRC2, SRC2[23:16], are acquired as the 3rd second parameter; the product of the 3rd first parameter and the 3rd second parameter is calculated to obtain the first product result (SRC1[23:16] × SRC2[23:16]), and the first product result is written into the 16-bit intermediate buffer TEMP. Then, bits 24-31 of SRC1, SRC1[31:24], are acquired as the 4th first parameter and bits 24-31 of SRC2, SRC2[31:24], are acquired as the 4th second parameter; the product of the 4th first parameter and the 4th second parameter is calculated to obtain the second product result (SRC1[31:24] × SRC2[31:24]), the second product result is added to the first product result buffered in the intermediate buffer unit, and the addition result is written back into the intermediate buffer unit. Then, the addition result buffered in the intermediate buffer unit is assigned, as the 2nd destination parameter, to bits 16-31 of DEST (DEST[31:16]).
The loop then continues for 6 more iterations in the same manner until all bits in the first vector and the second vector have been processed, and DEST finally stores the destination vector comprising 8 destination parameters; the detailed process is not repeated.
For example, the vector multiply-add operations for 256-bit, 512-bit, and other widths are similar to the process of FIG. 3 or FIG. 4, except for the number of loop iterations: a 256-bit operation performs 16 cyclic multiply-add calculations in total, and a 512-bit operation performs 32; the repeated details are not described again.
For example, in the process of calculating the first product result and the second product result, several first parameters and second parameters can be extracted at one time, and the first and second parameters located in the high bits are then obtained by shifting; this reduces the number of data moves from the storage unit, improves calculation efficiency, and thus improves instruction execution efficiency.
For example, the intermediate buffer unit is a first physical register with a bit width of 16 bits; calculating the product of the (2i-1)-th first parameter and the (2i-1)-th second parameter to obtain the first product result, and writing the first product result into the intermediate buffer unit, may include: extracting a first operand with a bit width of 16 bits from the storage address storing the first vector and writing it into a second physical register, where the first operand includes the (2i-1)-th first parameter and the (2i)-th first parameter; extracting a second operand with a bit width of 16 bits from the storage address storing the second vector and writing it into a third physical register, where the second operand includes the (2i-1)-th second parameter and the (2i)-th second parameter; calculating the product of the first operand and the second operand by using a first multiplier to obtain the first product result; and writing the first product result into the first physical register.
For example, calculating the product of the (2i)-th first parameter and the (2i)-th second parameter to obtain the second product result may include: shifting the first operand and the second operand to obtain a first shift result and a second shift result; and calculating the product of the first shift result and the second shift result as the second product result by using a second multiplier.
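A sketch of the shift-based extraction (illustrative; one way the shift step could be expressed, not the patent's microcode):

```c
#include <stdint.h>

/* A 16-bit operand holds two 8-bit parameters: the (2i-1)-th in the
 * low byte and the (2i)-th in the high byte. The low byte is used
 * directly; the high byte is shifted down into the low 8 bits. */
static inline int8_t low_param(uint16_t op)  { return (int8_t)(op & 0xFF); }
static inline int8_t high_param(uint16_t op) { return (int8_t)(op >> 8);  }
```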
FIG. 5 is a diagram illustrating an implementation of a vector multiply-add instruction in a microarchitecture according to an embodiment of the disclosure. FIG. 5 shows the execution of the i-th cyclic multiply-add calculation, where i is a positive integer sequentially taking 1 to M.
For example, when executed, the vector multiply-add instruction is implemented in hardware circuitry by a plurality of micro-instructions, micro-operations, microcode entry points, decoded instructions, control signals, or the like. For example, multiplication is performed by a multiplier, and addition is performed by an adder.
For example, the multiplication operates on vector registers, and the multiplier may reuse a multiplier in a floating-point arithmetic unit.
As shown in FIG. 5, a first operand with a bit width of 16 bits is extracted from the storage address storing the first vector and written into the second physical register; for example, bits 16×(i-1) through 16×i-1 of the first vector are extracted as the first operand and written into the second physical register, so that the first operand includes the (2i-1)-th first parameter and the (2i)-th first parameter.
As shown in FIG. 5, a second operand with a bit width of 16 bits is extracted from the storage address storing the second vector and written into the third physical register; for example, bits 16×(i-1) through 16×i-1 of the second vector are extracted as the second operand and written into the third physical register, so that the second operand includes the (2i-1)-th second parameter and the (2i)-th second parameter.
The product of the first operand and the second operand is calculated using the first multiplier to obtain the first product result. Since the first multiplier is an 8-bit multiplier, although the first operand and the second operand are 16 bits wide, by default the product of their lower 8 bits is calculated, i.e., the product of the (2i-1)-th first parameter and the (2i-1)-th second parameter.
The first product result is then written to a first physical register for caching.
As shown in FIG. 5, the first operand and the second operand are shifted to obtain the first shift result and the second shift result. For example, the upper 8 bits of the first operand are shifted into the lower 8 bits to obtain the first shift result, and the upper 8 bits of the second operand are shifted into the lower 8 bits to obtain the second shift result; at this point, the lower 8 bits of the first shift result are the (2i)-th first parameter and the lower 8 bits of the second shift result are the (2i)-th second parameter.
Then, the product of the first shift result and the second shift result is calculated as the second product result by using the second multiplier. Similarly, since the second multiplier is an 8-bit multiplier, although the shift results are 16 bits wide, by default it calculates the product of the lower 8 bits, that is, the product of the (2i)-th first parameter and the (2i)-th second parameter.
The second product result is then added to the first product result cached in the first physical register to obtain the i-th destination parameter, and the i-th destination parameter is stored in the corresponding interval of the destination vector.
For example, the above procedure is performed repeatedly to obtain the M destination parameters in sequence; after the cyclic multiply-add operation has been performed M times, the destination vector to be stored at the destination address is obtained.
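Put together, the M iterations amount to the following C sketch. The arrays are 0-indexed, so the (2i-1)-th and (2i)-th parameters of the text become indices 2*i and 2*i+1; the function and parameter names are illustrative assumptions, and 16-bit overflow behavior is again left as in the sketch above:

```c
#include <stdint.h>
#include <stddef.h>

/* Full cyclic multiply-add over N signed 8-bit parameters per source
 * vector, producing M = N/2 signed 16-bit destination parameters. */
void vector_multiply_add(const int8_t *first, const int8_t *second,
                         int16_t *dest, size_t n) {
    size_t m = n / 2;
    for (size_t i = 0; i < m; i++) {
        /* int8_t operands promote to int, so each product is exact. */
        int16_t first_product  = (int16_t)(first[2 * i]     * second[2 * i]);
        int16_t second_product = (int16_t)(first[2 * i + 1] * second[2 * i + 1]);
        /* Only the sum is stored; the order in which the two products
         * are computed does not matter. */
        dest[i] = (int16_t)(first_product + second_product);
    }
}
```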
It should be noted that the instruction is executed out of order in micro-instruction form. In practice, it is therefore also possible to compute the second product result first, cache it in the first physical register, and add the first product result to the second product result cached in the first physical register to obtain the i-th destination parameter; this is not specifically limited in the present disclosure. That is, the first product result and the second product result may be computed in either order: one result is cached in the intermediate buffer unit, and the addition operation is performed once the other result has been calculated.
For example, in this embodiment, since shifting is faster than fetching data from memory or a register, the execution process provided in this embodiment reduces the number of data moves and improves calculation efficiency compared with fetching data anew for each multiplication. In addition, throughout the cyclic multiply-add process, the same intermediate buffer unit is reused for every multiply-add, so part of the storage space is multiplexed and registers do not need to be requested repeatedly; this reduces the storage space occupied during instruction execution, improves the utilization efficiency of storage resources, and improves execution efficiency.
In the data processing method provided by at least one embodiment of the present disclosure, a vector multiply-add instruction usable for low-precision vector multiply-add operations is provided. The vector multiply-add instruction supports vector multiply-add calculation between signed integers, which not only alleviates the excessive data-storage requirements caused by the large number of parameters in a neural network, but also improves the performance of the neural network while keeping the precision loss within a controllable range. It improves the calculation performance of low-precision INT8, reduces the precision problems caused by overflow, and avoids the performance loss caused by data format conversion. It can be used for general matrix multiplication or convolution operations in deep learning, improves the performance of low-precision matrix multiplication and convolution, is widely applicable in deep-learning inference, and fills the current gap of instructions supporting vector multiply-add calculation between signed integers.
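To illustrate one such use in general matrix multiplication or convolution, the following hedged C sketch finishes an INT8 dot product by widening and summing the 16-bit destination parameters into a 32-bit accumulator. It assumes the vector_multiply_add sketch given above and an even n; none of these names are APIs defined by the disclosure:

```c
#include <stdint.h>
#include <stddef.h>

/* INT8 dot product for a GEMM/convolution inner loop. tmp must hold
 * n/2 elements; vector_multiply_add is the sketch from above. */
int32_t dot_int8(const int8_t *a, const int8_t *b, int16_t *tmp, size_t n) {
    vector_multiply_add(a, b, tmp, n);  /* n/2 pairwise int16 sums */
    int32_t acc = 0;
    for (size_t i = 0; i < n / 2; i++)
        acc += tmp[i];                  /* widen to 32 bits before summing */
    return acc;
}
```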
Corresponding to the data processing method, at least one embodiment of the present disclosure further provides a data processing device.
Fig. 6A is a schematic block diagram of a data processing apparatus according to at least one embodiment of the present disclosure, and fig. 6B is a schematic block diagram of another data processing apparatus according to at least one embodiment of the present disclosure.
As shown in fig. 6A, the data processing apparatus 100 includes a decoding unit 101, an executing unit 102, and a storage unit 103.
For example, the decoding unit 101 is configured to receive a vector multiply-add instruction and parse the vector multiply-add instruction, so that the vector multiply-add instruction is executed by the execution unit 102.
For example, the vector multiply-add instruction includes a destination address indicating where the destination operand is to be stored, and a first vector and a second vector as source operands. The first vector includes N first parameters and the second vector includes N second parameters; the N first parameters and the N second parameters have the same data type, the data type including 8-bit integer, and the N first parameters are in one-to-one correspondence with the N second parameters.
For a specific description of the parameters in the vector multiply-add instruction, reference may be made to the description of step S10 in the foregoing data processing method, which is not repeated here.
For example, the decoding unit 101 outputs one or more micro-instructions, micro-operations, microcode entry points, decoded instructions, control signals, or the like. These one or more lower-level instructions or control signals implement the high-level vector multiply-add instruction through circuit-level (gate-level) or hardware-level operations.
For example, the storage unit 103 is configured to store or acquire the first vector, the second vector, and the destination vector. For example, the storage unit 103 may include vector registers, a cache, or system memory, among others.
The operation of the data processing device involves multiple read and write operations on registers and memory addresses. In some embodiments, the registers include vector registers for holding data; for example, data of the corresponding bit width may also be obtained directly from the system memory or the cache.
For example, FIG. 6A is a block diagram of a data processing apparatus corresponding to a first instruction set architecture. For example, in the first instruction set architecture, the vector multiply-add instruction includes two instruction parameters: a first instruction parameter indicating the memory address where the first vector and the destination vector are stored, and a second instruction parameter indicating the memory address where the second vector is stored. The first instruction parameter takes the form of a vector register, and the second instruction parameter takes the form of a vector register or system memory.
For example, in FIG. 6A, the destination vector and the first vector use vector registers, such as XMM/YMM registers, and the second vector uses a vector register or a memory address, such as a 128-bit or 256-bit memory location.
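To make the two-parameter form concrete, here is a minimal C sketch of the operand roles. The type and function names are hypothetical, the real operands are XMM/YMM registers or memory rather than C pointers, and little-endian packing of each 16-bit result is an assumption:

```c
#include <stdint.h>

/* A 128-bit register or memory location modeled as 16 raw bytes
 * (hypothetical type; the disclosure defines no such C type). */
typedef struct { uint8_t b[16]; } vec128_t;

/* Two-parameter form: reg1 supplies the first vector on input and is
 * overwritten with the destination vector on output; reg2_or_mem
 * supplies the second vector and is only read. */
void vmadd_int8_2op(vec128_t *reg1, const vec128_t *reg2_or_mem) {
    int16_t out[8];
    for (int i = 0; i < 8; i++) {
        int8_t a0 = (int8_t)reg1->b[2 * i];
        int8_t a1 = (int8_t)reg1->b[2 * i + 1];
        int8_t b0 = (int8_t)reg2_or_mem->b[2 * i];
        int8_t b1 = (int8_t)reg2_or_mem->b[2 * i + 1];
        out[i] = (int16_t)(a0 * b0 + a1 * b1);
    }
    for (int i = 0; i < 8; i++) {
        /* The destination overwrites the first source register,
         * assuming little-endian byte order within each 16-bit lane. */
        reg1->b[2 * i]     = (uint8_t)((uint16_t)out[i] & 0xFF);
        reg1->b[2 * i + 1] = (uint8_t)((uint16_t)out[i] >> 8);
    }
}
```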
For the first instruction set architecture and the representation of the vector multiply-add instruction, reference may be made to the related description in the foregoing data processing method, which is not repeated here.
In fig. 6A, the execution unit 102 is coupled to the decoding unit 101 and the storage unit 103. The execution unit 102 generally includes digital circuits such as an arithmetic unit and an arithmetic logic unit for logic operations, as well as digital circuits such as a multiplier and an adder. After receiving the decoding result, the execution unit 102 performs the corresponding operations, finally obtains the destination vector, and outputs it to the storage unit 103.
For example, executing the vector multiply-add instruction by the execution unit 102 includes performing the following operations: reading the first vector and the second vector from the storage unit 103; performing a vector multiply-add operation on the first vector and the second vector to obtain the destination vector, wherein the destination vector includes M destination parameters, each destination parameter is the sum of at least two product results, and each product result is the product of the corresponding first parameter and second parameter; and storing the destination vector at the destination address in the storage unit 103, where M and N are positive integers.
For example, M=N/2, and the i-th destination parameter of the M destination parameters is the sum of the product of the (2i-1)-th first parameter and the (2i-1)-th second parameter and the product of the (2i)-th first parameter and the (2i)-th second parameter, where i takes the values 1 to M in sequence.
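Written out as a formula, with $a_j$ and $b_j$ denoting the $j$-th first and second parameters (1-based) and $d_i$ the $i$-th destination parameter:

$$ d_i = a_{2i-1}\, b_{2i-1} + a_{2i}\, b_{2i}, \qquad i = 1, \dots, M, \quad M = N/2. $$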
For example, when the execution unit 102 performs the vector multiply-add operation on the first vector and the second vector to obtain the destination vector, the following operation is performed: performing, in combination with an intermediate buffer unit, M cyclic multiply-add operations on the N first parameters and the N second parameters respectively, to obtain the destination vector.
For example, when the execution unit 102 performs, in combination with the intermediate buffer unit, the M cyclic multiply-add operations on the N first parameters and the N second parameters to obtain the destination vector, the following operations are performed in the i-th cyclic multiply-add operation for the i-th one of the M destination parameters: calculating the product of the (2i-1)-th first parameter and the (2i-1)-th second parameter to obtain a first product result; writing the first product result into the intermediate buffer unit; calculating the product of the (2i)-th first parameter and the (2i)-th second parameter to obtain a second product result; reading the first product result cached in the intermediate buffer unit, adding the second product result and the first product result, and writing the addition result into the intermediate buffer unit; and assigning the addition result cached in the intermediate buffer unit, as the i-th destination parameter, to the corresponding interval of the destination vector.
For example, the intermediate buffer unit is a first physical register with a bit width of 16 bits, and the execution unit includes a first multiplier, which is an 8-bit multiplier. When the execution unit 102 calculates the product of the (2i-1)-th first parameter and the (2i-1)-th second parameter to obtain the first product result and writes the first product result into the intermediate buffer unit, the following operations are performed: writing a first operand with a bit width of 16 bits, extracted from the storage address of the first vector in the storage unit, into a second physical register, wherein the first operand includes the (2i-1)-th first parameter and the (2i)-th first parameter; writing a second operand with a bit width of 16 bits, extracted from the storage address of the second vector in the storage unit, into a third physical register, wherein the second operand includes the (2i-1)-th second parameter and the (2i)-th second parameter; calculating the product of the first operand and the second operand by using the first multiplier to obtain the first product result; and writing the first product result into the first physical register.
For example, the execution unit further includes a second multiplier, which is an 8-bit multiplier. When the execution unit 102 calculates the product of the (2i)-th first parameter and the (2i)-th second parameter to obtain the second product result, the following operations are performed: shifting the first operand and the second operand to obtain a first shift result and a second shift result; and calculating the product of the first shift result and the second shift result as the second product result by using the second multiplier.
For example, the execution unit 102 includes a multiplier, through which multiplication is performed, and an adder, through which addition is performed. For example, the multiplication operations involve vector registers, and the multiplier may reuse a multiplier in a floating-point arithmetic unit.
It should be noted that, regarding the process of performing the vector multiply-add operation on the first vector and the second vector by the execution unit 102 to obtain the destination vector, reference may be made to the related content of step S20 shown in fig. 1, and the description will not be repeated here.
It should be noted that fig. 6A describes a relatively simple instruction execution flow; other well-known components related to instruction execution, such as an instruction fetch unit, a data cache, and a bus interface unit, are by default included in the data processing apparatus and are not described in detail here.
As shown in fig. 6B, the data processing apparatus 200 includes a decoding unit 201, an executing unit 202, and a storage unit 203.
The data processing apparatus 200 is identical in structure and function to the data processing apparatus 100, except that the data processing apparatus 200 corresponds to a second instruction set architecture. For example, in the second instruction set architecture, the vector multiply-add instruction includes three instruction parameters: a first instruction parameter indicating the destination address for storing the destination vector, a second instruction parameter indicating the storage address of the first vector, and a third instruction parameter indicating the storage address of the second vector. The first instruction parameter and the second instruction parameter take the form of vector registers, and the third instruction parameter takes the form of a vector register or system memory.
For example, in FIG. 6B, the destination vector uses a vector register, such as an XMM/YMM register; the first vector uses a vector register, such as an XMM/YMM register; and the second vector uses a vector register or a memory address, such as an XMM/YMM register or a 128-bit or 256-bit memory location.
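Correspondingly, the three-parameter form names the destination register separately, so both source vectors are preserved. A hedged sketch, reusing the hypothetical vec128_t model and vmadd_int8_2op sketch from the FIG. 6A discussion above:

```c
/* Three-parameter form: destination and both sources are named
 * explicitly, so neither source register is overwritten. */
void vmadd_int8_3op(vec128_t *dst, const vec128_t *src1,
                    const vec128_t *src2_or_mem) {
    *dst = *src1;                      /* copy the first source, then */
    vmadd_int8_2op(dst, src2_or_mem);  /* apply the two-operand semantics */
}
```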
For the second instruction set architecture and the representation of the vector multiply-add instruction, reference may be made to the related description of the data processing method, which is not repeated here.
The decoding unit 201 has the same structure and function as the decoding unit 101 in fig. 6A, the execution unit 202 has the same structure and function as the execution unit 102 in fig. 6A, and the storage unit 203 has the same function as the storage unit 103 in fig. 6A; therefore, for descriptions of the decoding unit 201, the execution unit 202, and the storage unit 203, reference may be made to the descriptions of the decoding unit 101, the execution unit 102, and the storage unit 103, which are not repeated here.
In the data processing device provided in at least one embodiment of the present disclosure, a vector multiply-add instruction usable for low-precision vector multiply-add operations is provided. The vector multiply-add instruction supports vector multiply-add calculation between signed integers, which not only alleviates the excessive data-storage requirements caused by the large number of parameters in a neural network, but also improves the performance of the neural network while keeping the precision loss within a controllable range. It improves the calculation performance of low-precision INT8, reduces the precision problems caused by overflow, and avoids the performance loss caused by data format conversion. It can be used for general matrix multiplication or convolution operations in deep learning, improves the performance of low-precision matrix multiplication and convolution, is widely applicable in deep-learning inference, and fills the current gap of instructions supporting vector multiply-add calculation between signed integers.
At least one embodiment of the present disclosure also provides a processor. Fig. 7 is a schematic block diagram of a processor provided in accordance with at least one embodiment of the present disclosure.
As shown in fig. 7, the processor 300 includes a data processing apparatus 301 according to any of the embodiments of the present disclosure. Regarding the structure, function and technical effects of the data processing apparatus 301, reference is made to the data processing apparatus 100 or the data processing apparatus 200 as described above, and the description thereof will not be repeated here.
For example, the processor 300 may be implemented as a central processing unit, such as a central processing unit with the X86 architecture, according to actual needs.
It should be noted that, in the embodiments of the present disclosure, the processor 300 may include more or fewer circuits or units; for example, it may further include other circuits or units that cooperate with the data processing apparatus 301, such as an instruction cache, an instruction scheduler, an instruction buffer, and related components, which are not specifically limited in this disclosure.
The connection relations between these circuits or units are not limited and may be determined according to actual requirements. The specific configuration of each circuit or unit is not limited either; each may be constituted by analog devices according to circuit principles, by digital chips, or in other applicable manners.
The processor provided in at least one embodiment of the present disclosure may achieve similar technical effects as the aforementioned data processing apparatus, and will not be described herein.
Some embodiments of the present disclosure also provide an electronic device. Fig. 8 is a schematic block diagram of an electronic device provided in at least one embodiment of the present disclosure.
For example, as shown in fig. 8, electronic device 400 includes a processor 410 and a memory 420. It should be noted that the components of the electronic device 400 shown in fig. 8 are exemplary only and not limiting, and that the electronic device 400 may have other components as desired for practical applications.
For example, the processor 410 and the memory 420 may communicate with each other directly or indirectly.
For example, the processor 410 and the memory 420 may communicate over a network. The network may include a wireless network, a wired network, and/or any combination of wireless and wired networks. Communication between the processor 410 and the memory 420 may also be implemented via a system bus, which is not limited by the present disclosure.
For example, in some embodiments, the memory 420 is used to store computer-readable instructions non-transitorily. The processor 410 is configured to execute the computer-readable instructions, and the computer-readable instructions, when executed by the processor 410, implement the data processing method according to any of the embodiments described above. For the specific implementation of each step of the data processing method and the related explanations, reference may be made to the embodiments of the data processing method, which are not repeated here.
For example, the processor 410 may control other components in the electronic device 400 to perform desired functions. The processor 410 may be a central processing unit (CPU), a graphics processing unit (GPU), a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The central processing unit (CPU) may be of the X86 or ARM architecture, etc.
For example, the memory 420 may include any combination of one or more computer program products, and the computer program products may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. Volatile memory may include, for example, random access memory (RAM) and/or cache memory. Non-volatile memory may include, for example, read-only memory (ROM), a hard disk, erasable programmable read-only memory (EPROM), portable compact disc read-only memory (CD-ROM), USB memory, flash memory, and the like. One or more computer-readable instructions may be stored on the computer-readable storage medium, and the processor 410 may execute them to implement various functions of the electronic device 400. Various applications and various data may also be stored in the storage medium.
For example, in some embodiments, the electronic device 400 may be a cell phone, tablet, electronic paper, television, display, notebook, digital photo frame, navigator, wearable electronic device, smart home device, or the like.
For example, the electronic device 400 may include a display panel that may be used to display interactive content, etc. For example, the display panel may be a rectangular panel, a circular panel, an elliptical panel, a polygonal panel, or the like. In addition, the display panel may be not only a planar panel but also a curved panel or even a spherical panel.
For example, the electronic device 400 may have a touch function, that is, the electronic device 400 may be a touch device.
For example, for a detailed description of the procedure by which the electronic device 400 performs the data processing method, reference may be made to the related description in the embodiments of the data processing method, which is not repeated here.
Fig. 9 is a schematic diagram of a non-transitory computer readable storage medium according to at least one embodiment of the present disclosure. For example, as shown in FIG. 9, one or more computer-readable instructions 510 may be stored non-transitory on the storage medium 500. For example, computer readable instructions 510, when executed by a processor, may perform one or more steps in accordance with the data processing methods described above.
For example, the storage medium 500 may be applied to the electronic device 400 described above. For example, the storage medium 500 may include the memory 420 in the electronic device 400.
For example, for the description of the storage medium 500, reference may be made to the description of the memory 420 in the embodiment of the electronic device 400, which is not repeated here.
For example, the storage device may comprise any combination of one or more computer program products, and the computer program products may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. Volatile memory may include, for example, random access memory (RAM) and/or cache memory. Non-volatile memory may include, for example, read-only memory (ROM), a hard disk, erasable programmable read-only memory (EPROM), portable compact disc read-only memory (CD-ROM), USB memory, flash memory, and the like. One or more computer-readable instructions may be stored on the computer-readable storage medium, and a processor may execute them to implement various functions of the processor. Various applications and various data may also be stored in the storage medium.
For example, the storage medium may include a memory card of a smart phone, a memory component of a tablet computer, a hard disk of a personal computer, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), portable compact disc read-only memory (CD-ROM), flash memory, or any combination of the foregoing, as well as other suitable storage media.
For the purposes of this disclosure, the following points are also noted:
(1) The drawings of the embodiments of the present disclosure relate only to the structures related to the embodiments of the present disclosure, and other structures may refer to the general design.
(2) The embodiments of the present disclosure and features in the embodiments may be combined with each other to arrive at a new embodiment without conflict.
The foregoing is merely a specific embodiment of the disclosure, but the scope of the disclosure is not limited thereto and should be determined by the scope of the claims.

Claims (17)

1. A data processing method, comprising:
receiving a vector multiply-add instruction, wherein the vector multiply-add instruction comprises a destination address indicating where a destination operand is to be stored, and a first vector and a second vector as source operands, the first vector comprises N first parameters, the second vector comprises N second parameters, the N first parameters and the N second parameters have the same data type, the data type comprises 8-bit integer, and the N first parameters and the N second parameters have a one-to-one correspondence;
performing a vector multiply-add operation on the first vector and the second vector to obtain a destination vector, wherein the destination vector comprises M destination parameters, each destination parameter is the sum of at least two product results, and each product result is the product of the corresponding first parameter and the second parameter; and
and storing the destination vector into the destination address, wherein M and N are positive integers.
2. The data processing method according to claim 1, wherein M=N/2,
the i-th destination parameter in the M destination parameters is the sum of the product of the (2i-1)-th first parameter and the (2i-1)-th second parameter and the product of the (2i)-th first parameter and the (2i)-th second parameter, and i takes the values 1 to M in sequence.
3. The data processing method according to claim 1, wherein performing a vector multiply-add operation on the first vector and the second vector to obtain a destination vector, comprises:
performing, in combination with an intermediate buffer unit, M cyclic multiply-add operations on the N first parameters and the N second parameters respectively, so as to obtain the destination vector.
4. A data processing method according to claim 3, wherein, in combination with an intermediate buffer unit, performing M cyclic multiply-add operations on the N first parameters and the N second parameters, respectively, to obtain the destination vector, comprises:
in the i-th cyclic multiply-add operation for the i-th one of the M destination parameters:
calculating the product of the (2i-1)-th first parameter and the (2i-1)-th second parameter to obtain a first product result;
writing the first product result into the intermediate buffer unit;
calculating the product of the (2i)-th first parameter and the (2i)-th second parameter to obtain a second product result;
reading the first product result cached in the intermediate buffer unit, adding the second product result and the first product result, and writing the addition result into the intermediate buffer unit; and
assigning the addition result cached in the intermediate buffer unit, as the i-th destination parameter, to a corresponding interval in the destination vector.
5. The data processing method of claim 4, wherein the intermediate buffer unit is a first physical register having a bit width of 16 bits,
calculating the product of the (2i-1)-th first parameter and the (2i-1)-th second parameter to obtain a first product result, and writing the first product result into the intermediate buffer unit, comprises:
writing a first operand having a bit width of 16 bits, which includes the (2i-1)-th first parameter and the (2i)-th first parameter, from the storage address where the first vector is stored into a second physical register;
writing a second operand having a bit width of 16 bits, which includes the (2i-1)-th second parameter and the (2i)-th second parameter, from the storage address where the second vector is stored into a third physical register;
calculating the product of the first operand and the second operand by using a first multiplier to obtain the first product result, wherein the first multiplier is an 8-bit multiplier; and
writing the first product result into the first physical register.
6. The data processing method of claim 5, wherein calculating the product of the (2i)-th first parameter and the (2i)-th second parameter to obtain a second product result comprises:
shifting the first operand and the second operand to obtain a first shift result and a second shift result; and
calculating a product of the first shift result and the second shift result as the second product result by using a second multiplier, wherein the second multiplier is an 8-bit multiplier.
7. The data processing method of any of claims 1-6, wherein, in response to a first instruction set architecture, the vector multiply-add instruction includes two instruction parameters, the two instruction parameters including a first instruction parameter indicating the memory address where the first vector and the destination vector are stored, and a second instruction parameter indicating the memory address where the second vector is stored;
the first instruction parameter is in the form of a vector register, and the second instruction parameter is in the form of a vector register or system memory.
8. The data processing method of any of claims 1-6, wherein, in response to a second instruction set architecture, the vector multiply-add instruction includes three instruction parameters, the three instruction parameters including a first instruction parameter indicating the destination address at which the destination vector is to be stored, a second instruction parameter indicating the storage address of the first vector, and a third instruction parameter indicating the storage address of the second vector;
the first instruction parameter and the second instruction parameter are in the form of vector registers, and the third instruction parameter is in the form of a vector register or system memory.
9. A data processing apparatus, comprising a decoding unit, an execution unit and a storage unit, wherein,
the decoding unit is configured to:
receiving a vector multiply-add instruction, wherein the vector multiply-add instruction comprises a destination address indicating where a destination operand is to be stored, and a first vector and a second vector as source operands, the first vector comprises N first parameters, the second vector comprises N second parameters, the N first parameters and the N second parameters have the same data type, the data type comprises 8-bit integer, and the N first parameters and the N second parameters have a one-to-one correspondence; and
parsing the vector multiply-add instruction so that the vector multiply-add instruction is executed by the execution unit;
the storage unit is configured to store or acquire the first vector, the second vector and a destination vector;
the execution unit executing the vector multiply-add instruction includes performing the following operations:
reading the first vector and the second vector from the storage unit;
performing a vector multiply-add operation on the first vector and the second vector to obtain the destination vector, wherein the destination vector comprises M destination parameters, each destination parameter is the sum of at least two product results, and each product result is the product of the corresponding first parameter and the second parameter;
and storing the destination vector into the destination address in the storage unit, wherein M and N are positive integers.
10. The data processing apparatus according to claim 9, wherein the execution unit performs a vector multiply-add operation on the first vector and the second vector to obtain the destination vector, comprising:
performing, in combination with an intermediate buffer unit, M cyclic multiply-add operations on the N first parameters and the N second parameters respectively, so as to obtain the destination vector.
11. The data processing apparatus according to claim 10, wherein, when the execution unit performs, in combination with the intermediate buffer unit, the M cyclic multiply-add operations on the N first parameters and the N second parameters respectively to obtain the destination vector, the following operations are performed:
in the i-th cyclic multiply-add operation for the i-th one of the M destination parameters:
calculating the product of the (2i-1)-th first parameter and the (2i-1)-th second parameter to obtain a first product result;
writing the first product result into the intermediate buffer unit;
calculating the product of the (2i)-th first parameter and the (2i)-th second parameter to obtain a second product result;
reading the first product result cached in the intermediate buffer unit, adding the second product result and the first product result, and writing the addition result into the intermediate buffer unit; and
assigning the addition result cached in the intermediate buffer unit, as the i-th destination parameter, to a corresponding interval in the destination vector.
12. The data processing apparatus of claim 11, wherein the intermediate buffer unit is a first physical register having a bit width of 16 bits, the execution unit comprises a first multiplier, and the first multiplier is an 8-bit multiplier,
wherein, when the execution unit calculates the product of the (2i-1)-th first parameter and the (2i-1)-th second parameter to obtain the first product result and writes the first product result into the intermediate buffer unit, the following operations are performed:
writing a first operand with a bit width of 16 bits, extracted from the storage address of the first vector in the storage unit, into a second physical register, wherein the first operand includes the (2i-1)-th first parameter and the (2i)-th first parameter;
writing a second operand with a bit width of 16 bits, extracted from the storage address of the second vector in the storage unit, into a third physical register, wherein the second operand includes the (2i-1)-th second parameter and the (2i)-th second parameter;
calculating the product of the first operand and the second operand by using the first multiplier to obtain the first product result; and
writing the first product result into the first physical register.
13. The data processing apparatus of claim 12, wherein the execution unit further comprises a second multiplier, the second multiplier being an 8-bit multiplier,
wherein, when the execution unit calculates the product of the (2i)-th first parameter and the (2i)-th second parameter to obtain the second product result, the following operations are performed:
shifting the first operand and the second operand to obtain a first shift result and a second shift result; and
calculating a product of the first shift result and the second shift result as the second product result by using the second multiplier.
14. The data processing apparatus according to any of claims 9-13, wherein the storage unit comprises a vector register, a cache, or a system memory.
15. A processor comprising a data processing apparatus as claimed in any one of claims 9 to 14.
16. An electronic device, comprising:
a memory non-transitory storing computer-executable instructions;
a processor configured to execute the computer-executable instructions,
wherein the computer executable instructions when executed by the processor implement the data processing method according to any of claims 1-8.
17. A non-transitory computer-readable storage medium, wherein the non-transitory computer-readable storage medium stores computer-executable instructions,
the computer executable instructions, when executed by a processor, implement the data processing method according to any of claims 1-8.