WO2024020761A1 - Register for predicate deposit - Google Patents

Register for predicate deposit

Info

Publication number
WO2024020761A1
WO2024020761A1 PCT/CN2022/107731
Authority
WO
WIPO (PCT)
Prior art keywords
bitmask
expanded
processor
format
predicate
Prior art date
Application number
PCT/CN2022/107731
Other languages
English (en)
Inventor
Raanan Sade
Albert Jan YZELMAN
Jaiswal Manish KUMAR
Yixuan HU
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. filed Critical Huawei Technologies Co., Ltd.
Priority to PCT/CN2022/107731 priority Critical patent/WO2024020761A1/fr
Publication of WO2024020761A1 publication Critical patent/WO2024020761A1/fr

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/3001Arithmetic instructions
    • G06F9/30014Arithmetic instructions with variable precision
    • G06F9/30018Bit or string instructions
    • G06F9/30036Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • G06F9/30038Instructions to perform operations on packed data, e.g. vector, tile or matrix operations using a mask

Definitions

  • the present disclosure in some embodiments thereof, relates to processors and, more specifically, but not exclusively, to micro-instructions for vector processors.
  • processor architectures are designed to execute a single instruction multiple data (SIMD) instruction set architecture (ISA) .
  • vector processors are designed to reduce code size and/or number of executed micro-operations while still operating on multiple data. Examples of vector processors include ARM Scalable Vector Extension (SVE) and Intel AVX (Advanced Vector eXtension) .
  • a processor is configured for: obtaining a bitmask in a compressed format from a memory register, converting, in a single micro-operation, the bitmask from compressed format to expanded format, and writing the bitmask in the expanded format to a predicate register, for serving as a predicate for execution of instructions.
  • a method executed by a processor comprises: obtaining a bitmask in a compressed format from a memory register, converting, in a single micro-operation, the bitmask from compressed format to expanded format, and writing the bitmask in the expanded format to a predicate register, for serving as a predicate for execution of instructions.
  • a computer program comprises program instructions which, when executed by a processor, cause the processor to: obtain a bitmask in a compressed format from a memory register, convert, in a single micro-operation, the bitmask from compressed format to expanded format, and write the bitmask in the expanded format to a predicate register, for serving as a predicate for execution of instructions.
  • a non-transitory medium stores program instructions, which, when executed by a processor, cause the processor to: obtain a bitmask in a compressed format from a memory register, convert, in a single micro-operation, the bitmask from compressed format to expanded format, and write the bitmask in the expanded format to a predicate register, for serving as a predicate for execution of instructions.
  • the amount of compression varies by the size of the elements of the expanded bitmask. For example, where a single bit of the compressed bitmask corresponds to an element of 8 bits of the expanded bitmask, the compression ratio is 8 times (8x) .
  • the single micro-operation improves computational efficiency of the processor, by reducing the overhead and/or improving hardware performance in comparison to using several lines of software code. Converting from the compressed format of the bitmask to the expanded format of the bitmask in a single micro-operation simplifies the conversion process and/or reduces errors, for example, in comparison to other approaches in which a programmer defines multiple complex bit manipulation operations.
  • the processor comprises a vector processor configured for executing instructions defined by a single instructions multiple data (SIMD) instruction set architecture (ISA) , and wherein the processor is further configured for obtaining the bitmask in the expanded format from the predicate register for executing instructions defined by the SIMD ISA.
  • the single micro-operation improves memory used by the vector processor, and/or improves efficiency of the vector processor’s execution of SIMD ISA.
  • the bitmask in the expanded format includes a plurality of elements, each element having a number of bits defined by a data type represented by the element corresponding to a data type of a vector that is processed according to the bitmask, wherein the bitmask in the expanded format includes a single bit per element in a fixed position, and a plurality of placeholder bits, wherein the bitmask in the compressed format includes a plurality of bits corresponding to the plurality of elements and excludes placeholder bits.
  • the savings in memory obtained by the compressed format of the bitmask is determined according to the data type. For example, for a vector array where each bit of the bitmask corresponds to a byte of data, the expanded format of the bitmask uses 8X more memory than the compressed form of the bitmask.
  • converting comprises: for each respective bit of a plurality of bits of the bitmask in the compressed format: copying the respective bit to a fixed position within a corresponding element of the plurality of elements of the bitmask in the expanded format, and adding placeholder bits of the corresponding element.
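  • The per-bit copy described above can be sketched as a behavioral Python model (the function name and parameters are illustrative; this models the result of the conversion, not the single-micro-operation hardware path):

```python
def expand_bitmask(compressed: int, num_elements: int, element_bits: int) -> int:
    """Behavioral model: copy each compressed bit to a fixed position
    (here, the lowest bit) of an element-sized field; all other bits
    of the element are placeholders, left as zero."""
    expanded = 0
    for i in range(num_elements):
        bit = (compressed >> i) & 1            # i-th bit of the compressed mask
        expanded |= bit << (i * element_bits)  # fixed position within element i
    return expanded

# 8 elements of 8 bits each: one active bit lands at the LSB of every byte
print(hex(expand_bitmask(0b10110011, 8, 8)))  # 0x100010100000101
```

Changing element_bits adapts the same loop to other data types (e.g., 16, 32, or 64 bits per element), which is the adaptability the fixed-position scheme provides.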
  • the conversion is computationally efficient, and/or easily adaptable for different data types of the elements according to the location of the fixed position.
  • the predicate register is dynamically overwritten for each new bitmask.
  • the same predicate register, to which the expanded form of the bitmask is written to, may be reused for new computed expanded bitmasks, rather than storing the expanded bitmasks in memory. Re-using the predicate register for different expanded bitmasks (while storing the compressed bitmask in the memory register) reduces the amount of memory needed, for example, in comparison to storing the expanded bitmasks in memory.
  • bitmask in the memory register is retained and not overwritten for each new bitmask, wherein a different memory register is used for new different bitmasks.
  • Multiple different bitmasks may be stored in compressed format in memory registers.
  • the compressed format uses less memory for the registers in comparison to the expanded format.
  • a size of the predicate register is fixed with respect to a size of a vector being processed by the execution of the instructions.
  • Fixing the size of the predicate register with respect to the size of the vector optimizes the amount of memory allocated to the predicate register.
  • each single bit of a plurality of single bits of the bitmask in the expanded format corresponds to a single element of a plurality of elements of a vector being processed by the processor, wherein placeholder bits that exclude the plurality of single bits of the bitmask in expanded format are written to the predicate register and are unused in processing the vector, wherein the bitmask in the compressed format includes the plurality of single bits and excludes the placeholder bits, wherein a number of bits in each of the plurality of elements is defined based on a data type represented by each element.
  • the amount of memory saved by the compressed format in comparison to the expanded format depends on the size of the vector, and the data type of elements of the vector.
  • in a further implementation form of the first, second, third, and fourth aspects, the method further comprises dynamically adapting the size of the bitmask in the expanded format, and dynamically adapting the size of each single element, according to a size and type of data of a current vector being processed.
  • the processor is further configured for multiplying a sparse matrix by the vector in view of the bitmask in the expanded format.
  • the single micro-operation for obtaining the bitmask in expanded format improves efficiency of the processor that uses the bitmask for multiplying the sparse matrix by the vector.
  • the bitmask comprises a predicate.
  • the single micro-operation for obtaining the predicate in expanded format improves efficiency of the processor that uses the predicate for execution of instructions.
  • FIG. 1 is a flowchart of a method of converting a compressed bitmask to an expanded bitmask, in accordance with some embodiments;
  • FIG. 2 is a block diagram of components of a processor (s) for conversion of a compressed bitmask to an expanded bitmask, in accordance with some embodiments;
  • FIG. 3 is an example of a compressed bitmask and an expanded bitmask, in accordance with some embodiments of the present invention.
  • FIG. 4 is a schematic depicting an example of a compressed bitmask and an expanded bitmask, in accordance with some embodiments of the present invention.
  • FIG. 5 is a schematic depicting an example of pseudocode for converting a compressed bitmask to an expanded bitmask, in accordance with some embodiments of the present invention.
  • the present disclosure in some embodiments thereof, relates to processors and, more specifically, but not exclusively, to micro-instructions for vector processors.
  • the terms bitmask and predicate are interchangeable.
  • the terms compressed bitmask and bitmask in compressed format are interchangeable.
  • the terms expanded bitmask and bitmask in expanded format are interchangeable.
  • An aspect of some embodiments relates to processors, systems, computing devices, methods, and/or code instructions (stored on a memory and executed by one or more processors) for converting a bitmask from compressed format to expanded format.
  • the bitmask is stored in a memory register in compressed format.
  • the compressed format excludes placeholder bits.
  • the compressed format may include a single bit per element of a defined vector. The size of the element may vary according to data type, for example, 2, 4, 8, 16 bytes per element.
  • the bitmask in compressed format is converted to expanded format in a single micro-operation.
  • the expanded bitmask may be written to a predicate register.
  • the expanded bitmask includes multiple placeholder bits (which are not used in computations) .
  • the size of the expanded bitmask corresponds to the size of the defined vector, including multiple elements of the data type.
  • the expanded bitmask may serve as a predicate for execution of instructions.
  • the expanded bitmask is used by a vector processor to perform a multiplication of a sparse matrix by the defined vector.
  • the amount of compression varies by the size of the elements of the expanded bitmask, for example, where a single bit of the compressed bitmask corresponds to an element of 8 bits of the expanded bitmask, the compression ratio is 8 times (8x) .
  • the single micro-operation improves computational efficiency of the processor, by reducing the overhead and/or improving hardware performance in comparison to using several lines of software code. Converting from the compressed format of the bitmask to the expanded format of the bitmask in a single micro-operation simplifies the conversion process and/or reduces errors, for example, in comparison to other approaches in which a programmer defines multiple complex bit manipulation operations.
  • the overhead for converting the compressed bitmask to the expanded bitmask may be negligible.
  • the single micro-operation is executed in two or fewer cycles of latency, with a throughput of one on a single hardware execution unit.
  • the latency is the number of processor (e.g., CPU) cycles (1/frequency) that the processor takes to complete the specific task.
  • the throughput is the maximum number of completions of the instruction per cycle.
  • the unit of execution is a single hardware block that is designed to perform specific or multiple functions.
  • At least some implementations described herein address the technical problem of improving efficiency of a processor (e.g., vector processor) that uses a predicate to control active elements of a vector (e.g., stored in a vector register) for example, for storage in memory, execution, and/or other operations defined by the ISA (e.g., SIMD ISA) .
  • the technical problem may be to provide an improvement to the memory accessed by the processor.
  • At least some implementations described herein improve a processor that uses a predicate to control active elements, such as of a vector register.
  • the improvement may be, for example, reduction in amount of memory used to store the expanded predicate, and/or improved operation efficiency of the processor such as a reduction in the number of micro-instructions being executed.
  • the predicate register size is typically fixed with respect to the SIMD vector length and holds 1 bit per byte of the SIMD vector; this allows expressing element predication for elements of size 1 byte or larger.
  • when the predicate register is used to control larger element sizes, it requires a single bit per element. In some of those SIMD ISAs, the single-bit location that controls a larger element is aligned with the position of the lowest byte of the element, which requires padding the actual mask bits with redundant bits. For example, in ARM SVE, when using the predicate for double precision floating point operations (8-byte elements) , the predicate byte includes eight bits per element while it only requires a single bit per element.
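  • The sizing described above can be sketched with a back-of-envelope Python calculation (the 512-bit vector length and SVE-style one-predicate-bit-per-vector-byte layout are assumptions for illustration, not details from this disclosure):

```python
# Assumed: 512-bit vector, one predicate bit per vector byte (SVE-style).
VECTOR_BITS = 512
ELEM_BYTES = 8                              # double precision elements

pred_bits = VECTOR_BITS // 8                # one bit per vector byte -> 64
elements = VECTOR_BITS // (ELEM_BYTES * 8)  # -> 8 elements
meaningful = elements                       # one active bit per element
padding = pred_bits - meaningful            # -> 56 redundant (padding) bits

print(pred_bits, meaningful, padding)       # 64 8 56
```

Only 8 of the 64 predicate bits carry meaning for 8-byte elements; the other 56 are the padding the compressed format avoids storing.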
  • the predicate is expressed with the exact number of required bits, which solves the problem described above in a different way.
  • the present disclosure may be a system, a method, and/or a computer program product.
  • the computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present disclosure.
  • the computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device.
  • the computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network.
  • the computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN) , or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider) .
  • electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA) , or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present disclosure.
  • each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function (s) .
  • the functions noted in the block may occur out of the order noted in the figures.
  • two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • FIG. 1 is a flowchart of a method of converting a compressed bitmask to an expanded bitmask, in accordance with some embodiments.
  • FIG. 2 is a block diagram of components of a processor (s) 202 for conversion of a compressed bitmask 252 to an expanded bitmask 256, in accordance with some embodiments.
  • FIG. 3 is an example of a compressed bitmask and an expanded bitmask, in accordance with some embodiments of the present invention.
  • FIG. 4 is a schematic depicting an example of a compressed bitmask and an expanded bitmask, in accordance with some embodiments of the present invention.
  • System 200 may implement the acts of the method described with reference to FIG. 1 and/or FIGs. 3-5, by processor (s) 202 of a computing device 204 executing micro-code instructions 206A.
  • Processor (s) 202 accesses a memory register 250 storing compressed bitmask 252, and predicate register 254 storing expanded bitmask 256. There may be multiple memory registers 250, each storing different bitmasks 252. Processor (s) 202 may access a vector register 258 storing vector data 260 (e.g., obtained from data 216A) , i.e., the actual data being processed.
  • bitmask may be referred to as a predicate.
  • Predicate register 254 stores the predicate 256 in expanded form (i.e., the expanded bitmask) .
  • the predicate 256 in predicate register 254 may control computation behavior per element, for example control and/or select the active elements of vector data 260 stored in vector register 258, such as for memory, execution, and/or other types of SIMD ISA.
  • Processor (s) 202 may be implemented as a vector processor, for example, designed to execute SIMD ISA.
  • Processor (s) 202 may be implemented as, for example, a central processing unit (CPU) , a graphics processing unit (GPU) , a field programmable gate array (FPGA) , a digital signal processor (DSP) , an application specific integrated circuit (ASIC) , customized circuits, processors for interfacing with other units, and/or specialized hardware accelerators.
  • Processor (s) 202 may be implemented as a single processor, a multi-core processor, and/or a cluster of processors arranged for parallel processing (which may include homogenous and/or heterogeneous processor architectures) .
  • Other examples of processor (s) 202 include ARM-based products and x86 processors from Intel or AMD.
  • Processor (s) 202 executes micro-code instructions 206A, for example, the single micro-instruction for converting the compressed bitmask to the expanded bitmask, as described herein.
  • Processor (s) 202 may execute code 206B stored on a memory 206.
  • Processor (s) 202 may be included within a computing device 204.
  • Computing device 204 may be implemented as, for example, one or more of: a computing cloud, a cloud network, a computer network, a virtual machine (s) (e.g., hypervisor, virtual server) , a network node (e.g., switch, a virtual network, a router, a virtual router) , a single computing device (e.g., client terminal) , a group of computing devices arranged in parallel, a network server, a web server, a storage server, a local server, a remote server, a client terminal, a mobile device, a stationary device, a kiosk, a smartphone, a laptop, a tablet computer, a wearable computing device, a glasses computing device, a watch computing device, and a desktop computer.
  • Processor (s) 202 and/or computing device 204 may be implemented as a high-performance computer (HPC) (e.g., processing cores, and/or system-on-chip (SoC) ) which may be executing software application using sparse matrix-vector (SPMV) multiplication computations.
  • SPMV is a fundamental and/or useful computation in HPC applications. SPMV generates a vector result from multiplying a matrix with a vector, typically a highly sparse matrix that is stored in a compressed format in memory multiplied by a dense vector.
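  • As a toy illustration in plain Python (the arithmetic that SPMV predication controls, not the claimed hardware path), a per-element mask selects which products contribute to the result:

```python
# One row of a sparse matrix (stored densely here for clarity) times a dense vector.
row  = [3.0, 0.0, 2.0, 0.0]
x    = [1.0, 5.0, 2.0, 7.0]
mask = [v != 0.0 for v in row]   # compressed predicate: one flag per element

# Only unmasked (non-zero) elements participate in the accumulation.
y = sum(r * xi for r, xi, m in zip(row, x, mask) if m)
print(y)  # 3*1 + 2*2 = 7.0
```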
  • Memory 206 stores code instructions executable by processor (s) 202, for example, a random access memory (RAM) , dynamic random access memory (DRAM) and/or storage class memory (SCM) , non-volatile memory, magnetic media, semiconductor memory devices, hard drive, removable storage, and optical media (e.g., DVD, CD-ROM) .
  • Memory 206 may store, for example, one or more of: micro-code instructions 206A and/or code 206B that implements one or more features and/or acts of the method described with reference to FIG. 1 when executed by processor (s) 202, memory register 250 storing compressed bitmask 252, and/or predicate register 254 storing expanded bitmask 256.
  • Computing device 204 may include a data storage device 216 for storing data, for example, data 216A that is processed using the expanded bitmask 256, for example, a compressed matrix for multiplication by a vector (e.g., dense vector) .
  • Data repository 216 may be implemented as, for example, a memory, a local hard-drive, virtual storage, a removable storage unit, an optical disk, a storage device, and/or as a remote server and/or computing cloud (e.g., accessed using a network connection) .
  • Computing device 204 may include a network interface 218 for connecting to network 214, for example, one or more of, a network interface card, a wireless interface to connect to a wireless network, a physical interface for connecting to a cable for network connectivity, a virtual interface implemented in software, network communication software providing higher layers of network connectivity, and/or other implementations.
  • Network 214 may be implemented as, for example, the internet, a local area network, a virtual private network, a wireless network, a cellular network, a local bus, a point to point link (e.g., wired) , and/or combinations of the aforementioned.
  • Computing device 204 may connect using network 214 (or another communication channel, such as through a direct link (e.g., cable, wireless) and/or indirect link (e.g., via an intermediary computing unit such as a server, and/or via a storage device) with server (s) 210 and/or client terminal (s) 212.
  • Computing device 204 may include and/or be in communication with one or more physical user interfaces 208 that include a mechanism for a user to enter data and/or view data.
  • Exemplary user interfaces 208 include, for example, one or more of, a touchscreen, a display, a keyboard, a mouse, and voice activated software using speakers and microphone.
  • the processor obtains a bitmask in a compressed format.
  • the compressed bitmask may be obtained from a memory register.
  • the bitmask represents a predicate, for example, used by a vector processor to control execution of instructions according to individual elements of a vector.
  • the processor converts the bitmask from compressed format to expanded format in a single micro-operation.
  • Original vector 302 is of total length 512 bits.
  • Vector 302 includes 8 elements, each element of size 8 bytes and/or each of size 64 bits for storing numbers of type ‘double precision’ .
  • a single element 302A is marked for clarity and simplicity of explanation.
  • vector 302 has a value of A0BC00DE, where each letter or zero value is a single element.
  • a compressed bitmask 304, optionally a predicate, associated with vector 302 has a length in bits corresponding to the number of elements of vector 302, i.e., a single bit of the bitmask is assigned to each element of vector 302.
  • compressed bitmask 304 has a length of 8 bits corresponding to the 8 elements of vector 302.
  • a bit value of one is assigned to non-zero values of vector 302, and a zero value is assigned to zero values of vector 302.
  • the following bits of bitmask 304 are assigned to corresponding elements of vector 302: 1 ⁇ A; 0 ⁇ 0; 1 ⁇ B; 1 ⁇ C; 0 ⁇ 0; 0 ⁇ 0; 1 ⁇ D; and 1 ⁇ E.
  • Compressed bitmask 304 excludes placeholder bits.
  • Compressed bitmask 306 may be stored in a single byte in a memory register as 10110011. Compressed bitmask 306 excludes placeholder bits.
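  • The mapping from vector elements to compressed bits can be checked with a short Python sketch (illustrative only):

```python
# Elements of vector 302, most significant first: A, 0, B, C, 0, 0, D, E.
# A non-zero element maps to bit 1; a zero element maps to bit 0.
elements = ['A', '0', 'B', 'C', '0', '0', 'D', 'E']
compressed = ''.join('0' if e == '0' else '1' for e in elements)
print(compressed)  # '10110011'
```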
  • Compressed bitmask 306 may be expanded to expanded bitmask 308, as described herein.
  • Expanded bitmask 308 includes multiple elements (one element 308A is marked for clarity and simplicity of explanation) , for example, representing a full predicate for double precision usage with SVE instruction.
  • the number of elements corresponds to the number of elements of vector 302, for example, 8 as shown in the example.
  • Each element of bitmask 308 has a number of bits defined by the data type represented by the element, which corresponds to the data type of vector 302 that is processed according to the bitmask. In the example shown, each element of expanded bitmask 308 includes 8 bits.
  • Each element of expanded bitmask 308 may include a single active bit (one bit 310 is shown for clarity and simplicity of explanation) having a value of zero or one, and multiple placeholder bits (one set of placeholder bits 312 is shown for clarity and simplicity of explanation) .
  • Each active bit of each element of expanded bitmask 308 corresponds to one byte of each element of vector 302. Placeholder bits are shown as having a value of ‘x’ indicating that the actual value is irrelevant since the actual value of the placeholder bits is not used to control instructions.
  • each element 308A of expanded bitmask 308 includes a single bit 310 and seven placeholder bits 312.
  • the savings in memory obtained by the compressed format of the bitmask is determined according to the data type, i.e., according to the length of each element of the vector associated with the bitmask.
  • the expanded format of the bitmask uses 8 times (8X) more memory than the compressed form of the bitmask.
  • Each single active bit 310 of expanded bitmask 308 corresponds to a single element 302A of vector 302 being processed by the processor.
  • Placeholder bits 312, i.e., the bits of expanded bitmask 308 other than the active single bits 310, may be written to the predicate register. Placeholder bits 312 are unused in processing vector 302.
  • Compressed bitmask 306 includes the active single bits and excludes the placeholder bits.
  • a number of bits in each of the elements 308A of expanded bitmask 308 is defined based on a data type represented by each element, for example, 8 bytes (64 bits) for double precision numbers as shown in the example. The amount of memory saved by the compressed format in comparison to the expanded format depends on the size of the vector, and the data type of elements of the vector.
  • compressed bitmask 404 stored in a memory register is converted to an expanded bitmask 408, as described herein.
  • Elements in compressed bitmask 404 are 8 bits each; in total, compressed bitmask 404 is 64 bits. The data elements themselves are 64 bits each, but those reside in the data vector.
  • Compressed bitmask 404 includes eight elements of 8 bits (one byte) each, for example, for representing double precision numbers whose data elements are 64 bits each. Values from the most significant bit (MSB) to the least significant bit (LSB) are shown in hexadecimal notation, as follows: AF, 54, 77, C2, A5, A0, 21, and 87.
  • Expansion of compressed element 404A having value A5, is shown for clarity and simplicity of explanation.
  • Expanded bitmask 408 is shown for compressed element 404A.
  • Expanded bitmask 408 for compressed element 404A includes eight elements in hexadecimal notation, where the LSB indicates the active bit, and bits to the left of the active bit are placeholder bits set to zero.
  • compressed element 404A, i.e., A5, which has a binary value of 10100101, is expanded to 0100010000010001 in hexadecimal, or in binary: 00000001 00000000 00000001 00000000 00000000 00000001 00000000 00000001.
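The worked expansion of element 404A can be checked with a short sketch (illustrative only; the bit positions assume 8-bit expanded elements with the active bit in the LSB, as in the figure):

```python
a5 = 0xA5  # compressed element 404A, binary 10100101
# bit i of the compressed byte becomes the LSB of 8-bit expanded element i
expanded = sum(((a5 >> i) & 1) << (8 * i) for i in range(8))
assert expanded == 0x0100010000010001  # matches the hexadecimal value above
```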
  • the process is iterated (using different index values) for each element of compressed bitmask 404, to obtain a full expanded bitmask.
  • the expansion may be executed using the pseudocode function R2PDEP (i.e., register to predicate) described herein, optionally executed as a single micro-instruction.
  • pseudocode 500 is described in a vector agnostic manner and with the ARM assembly style, but it is to be understood that the pseudocode may be adapted to any different assembly as well as vector specific ISA.
  • Line 502 defines a pseudocode function termed R2PDEP (register to predicate) for converting a compressed bitmask stored in a memory register denoted Xn, to an expanded bitmask for writing to a predicate register denoted Pd.
  • An index, termed Imm4, may be set for defining the size of the elements of the expanded bitmask.
  • Line 504 sets the size of each element.
  • the size of each element may be set according to the data type represented by the element, for example, 64 bits (8 bytes) for double precision numbers.
  • Line 506 sets the number of the elements.
  • the number of elements is obtained by dividing the total length of the vector (denoted VL) by the size of each element.
  • Line 508 sets the index for the number of elements.
  • Line 510 initializes all the bits of the predicate register to placeholder bits, for example, setting all bits to zero.
  • Line 512 defines a loop over the number of elements.
  • Line 514 copies the compressed predicate bit into the fixed location of the predicate register.
  • the copied bit serves as the active bit.
  • the non-copied bits, which were previously set to zero, serve as placeholder bits.
  • the processor copies the respective bit to a fixed position within a corresponding element of the multiple elements of the bitmask in the expanded format.
  • the copied bits within the corresponding elements may be copied to fixed positions.
  • the copied bits within the corresponding elements represent active bits. Placeholder bits may be added to each corresponding element. Alternatively, placeholder bits are not actively added, but are already existing.
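The behavior described for lines 502 through 514 can be modeled as a hedged Python sketch; the function name, parameters, and the one-predicate-bit-per-data-byte layout are assumptions drawn from the figures, not the actual micro-operation:

```python
def r2pdep(xn, vl, esize):
    """Model of R2PDEP: expand the compressed bitmask in register Xn into
    a predicate-register layout, assuming one predicate bit per data byte."""
    stride = esize // 8             # predicate bits per element (1 per data byte)
    num_elems = vl // esize         # line 506: VL divided by the element size
    pd = 0                          # line 510: initialize all bits to placeholders
    for j in range(num_elems):      # line 512: loop over the elements
        bit = (xn >> j) & 1         # compressed predicate bit for element j
        pd |= bit << (j * stride)   # line 514: copy to the fixed (LSB) position
    return pd

# e.g., a 512-bit vector of 64-bit doubles: eight elements, 8-bit stride
print(hex(r2pdep(0xA5, 512, 64)))
```

Note that the loop only sets active bits; the placeholder bits remain at their initialized zero value, mirroring the description above.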
  • the processor may dynamically adapt the size of the expanded bitmask, and/or dynamically adapt the size of each element of the expanded bitmask, according to a size and/or type of data of a current vector being processed. Dynamically adapting the size of the bitmask efficiently allocates the amount of memory needed.
  • the conversion is computationally efficient, and/or easily adaptable for different data types of the elements according to the location of the fixed position.
  • the expanded bitmask may be written to a predicate register.
  • the bitmask of the predicate register may be implemented as a predicate that is associated with a vector. Each active bit of the predicate may control execution of instructions of a corresponding element of the vector.
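As a simplified illustration of such predicated execution (not the disclosed hardware), an active bit can gate a per-element operation, with inactive elements left unchanged:

```python
def masked_add(pd, a, b, stride=8):
    """Add b to a element-wise only where the element's active predicate
    bit (the LSB of each stride-bit predicate element) is set; inactive
    elements keep their original value (merging behavior, an assumption)."""
    return [ai + bi if (pd >> (j * stride)) & 1 else ai
            for j, (ai, bi) in enumerate(zip(a, b))]

# predicate from the A5 example: elements 0, 2, 5, and 7 are active
print(masked_add(0x0100010000010001, [0.0] * 8, [1.0] * 8))
```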
  • a size of the predicate register may be fixed with respect to a size of the vector. Fixing the size of the predicate register with respect to the size of the vector optimizes the amount of memory allocated to the predicate register.
  • the processor uses the expanded bitmask in the predicate register for serving as a predicate for controlling execution of instructions.
  • the processor multiplies a sparse matrix by the vector in view of the bitmask in the expanded format, for example, using block or bit-mask column row sparse matrix compression.
  • the single micro-operation for obtaining the bitmask in expanded format improves efficiency of the processor that uses the bitmask for multiplying the sparse matrix by the vector.
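One way such a bit-mask sparse row format could work is sketched below (an illustrative layout, assuming nonzero values packed in ascending column order; not necessarily the disclosed compression scheme):

```python
def bitmask_row_dot(mask, values, x):
    """Dot product of one bitmask-compressed sparse row with dense x:
    bit j of mask marks a stored nonzero in column j, and values holds
    the nonzeros packed in ascending column order."""
    acc, k = 0.0, 0
    for j in range(len(x)):
        if (mask >> j) & 1:          # predicate bit selects the multiply-add
            acc += values[k] * x[j]
            k += 1
    return acc

# row with nonzeros 2.0 in column 0 and 3.0 in column 2
print(bitmask_row_dot(0b101, [2.0, 3.0], [1.0, 10.0, 100.0]))
```

A full sparse matrix-vector multiply would apply this per row; expanding each row's mask into the predicate register lets the hardware perform the gated multiply-adds with vector instructions instead of the scalar loop shown here.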
  • the processor is implemented as a vector processor that executes instructions defined by a SIMD ISA.
  • the processor obtains the bitmask in the expanded format from the predicate register for executing instructions defined by the SIMD ISA.
  • the single micro-operation reduces the memory used by the vector processor, and/or improves the efficiency of the vector processor’s execution of the SIMD ISA.
  • the compressed format uses less memory for the registers in comparison to the expanded format.
  • the compressed bitmasks in the memory registers which were expanded in preceding iterations are retained.
  • the compressed bitmasks in the memory registers which were expanded in preceding iterations are not overwritten with new bitmasks during the current iteration.
  • a different memory register is used for new different bitmasks.
  • the relevant compressed bitmasks are expanded into the predicate register, without overwriting the compressed bitmasks stored in the memory registers.
  • the predicate register may be dynamically overwritten with the new bitmask expanded during the current iteration.
  • the same predicate register, in which the expanded form of the bitmask is written to, may be reused for new computed expanded bitmasks, rather than storing the expanded bitmasks in memory. Re-using the predicate register for different expanded bitmasks (while storing the compressed bitmask in the memory register) reduces the amount of memory needed, for example, in comparison to storing the expanded bitmasks in memory.
  • It is expected that during the life of a patent maturing from this application many relevant processors will be developed, and the scope of the term processor is intended to include all such new technologies a priori.
  • composition or method may include additional ingredients and/or steps, but only if the additional ingredients and/or steps do not materially alter the basic and novel characteristics of the claimed composition or method.
  • a compound or “at least one compound” may include a plurality of compounds, including mixtures thereof.
  • range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the disclosure. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.
  • a numerical range is indicated herein, it is meant to include any cited numeral (fractional or integral) within the indicated range.
  • the phrases “ranging/ranges between” a first indicated number and a second indicated number and “ranging/ranges from” a first indicated number “to” a second indicated number are used herein interchangeably and are meant to include the first and second indicated numbers and all the fractional and integral numerals therebetween.

Abstract

Embodiments are provided for converting a bitmask from a compressed format to an expanded format. The bitmask is stored in a memory register in a compressed format. The compressed format excludes placeholder bits. The compressed format includes a single bit per element of a defined vector. The size of the element may vary according to the data type, for example 4, 8, or 16 bytes per element. The bitmask in compressed format is converted to the expanded format in a single micro-operation. The expanded bitmask may be written to a predicate register. The expanded bitmask includes multiple placeholder bits (which are not used in computations). The size of the expanded bitmask corresponds to the size of the defined vector, which includes multiple elements of the data type. The expanded bitmask may be used for the execution of instructions. For example, the expanded bitmask is used by a vector processor to multiply a sparse matrix by the vector.
PCT/CN2022/107731 2022-07-26 2022-07-26 Registre pour dépôt de prédicat WO2024020761A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/107731 WO2024020761A1 (fr) 2022-07-26 2022-07-26 Registre pour dépôt de prédicat

Publications (1)

Publication Number Publication Date
WO2024020761A1 true WO2024020761A1 (fr) 2024-02-01

Family

ID=89704905

Country Status (1)

Country Link
WO (1) WO2024020761A1 (fr)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107003851A (zh) * 2014-12-27 2017-08-01 Intel Corporation Method and apparatus for compressing mask values
CN107003845A (zh) * 2014-12-23 2017-08-01 Intel Corporation Method and apparatus for variable expansion between a mask register and a vector register
US20180074792A1 (en) * 2016-09-10 2018-03-15 Sap Se Loading data for iterative evaluation through simd registers


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22952228

Country of ref document: EP

Kind code of ref document: A1