WO2022191859A1 - Vector processing using vector-specific data type - Google Patents

Vector processing using vector-specific data type Download PDF

Info

Publication number
WO2022191859A1
WO2022191859A1 (PCT/US2021/022229)
Authority
WO
WIPO (PCT)
Prior art keywords
vector
data type
register
instruction
data
Prior art date
Application number
PCT/US2021/022229
Other languages
French (fr)
Inventor
Jian Wei
Original Assignee
Zeku, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zeku, Inc. filed Critical Zeku, Inc.
Priority to PCT/US2021/022229 priority Critical patent/WO2022191859A1/en
Publication of WO2022191859A1 publication Critical patent/WO2022191859A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30036Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/3004Arrangements for executing specific machine instructions to perform operations on memory
    • G06F9/30043LOAD or STORE instructions; Clear instruction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30076Arrangements for executing specific machine instructions to perform miscellaneous control operations, e.g. NOP
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098Register arrangements
    • G06F9/30105Register structure
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098Register arrangements
    • G06F9/30105Register structure
    • G06F9/30109Register structure having multiple operands in a single register
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30145Instruction analysis, e.g. decoding, instruction word fields
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3885Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units
    • G06F9/3887Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]

Definitions

  • Embodiments of the present disclosure relate to processors and operations thereof.
  • Parallel computing is the major acceleration solution for different kinds of processors, such as central processing unit (CPU), graphics processing unit (GPU), digital signal processor (DSP), etc.
  • Single instruction, multiple data (SIMD) is a common form of such parallel computing.
  • High-performance computing requires a parallel vector processor, which may have to handle different types of data formats constantly.
  • Switching operand data types, however, can introduce substantial performance overhead on known processors.
  • a processor includes an instruction decode unit, a vector register operatively coupled to the instruction decode unit, and a vector load/store unit operatively coupled to the vector register.
  • the instruction decode unit is configured to decode an instruction to load a first vector to determine a first data type of the first vector.
  • the vector register is configured to store the first data type of the first vector.
  • the vector load/store unit is configured to load the first vector into the vector register, such that the first vector is associated with the first data type in the vector register.
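The load path described in the bullets above can be sketched in a few lines of Python. The names `DataType`, `VectorRegisterSet`, and `load_vector` are hypothetical, introduced only for illustration; the disclosure does not specify any software model.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class DataType:
    # Vector-specific data type: element count, element kind, accuracy.
    num_elements: int
    element_kind: str   # e.g., "fixed" or "float"
    element_bits: int   # e.g., 8, 16, 32

class VectorRegisterSet:
    # One register set: data registers plus a small data-type register.
    def __init__(self) -> None:
        self.data: Optional[List[int]] = None
        self.dtype: Optional[DataType] = None

def load_vector(reg_set: VectorRegisterSet,
                vector: List[int], dtype: DataType) -> None:
    # The load/store unit writes the vector AND its decoded data type
    # into the same register set, so no global variable needs updating.
    reg_set.data = list(vector)
    reg_set.dtype = dtype
```

The key point of the sketch is that the data type travels with the vector into the register set, rather than living in a separate CSR.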
  • a system-on-a-chip includes a memory configured to store a vector and an instruction to load the vector, and a processor operatively coupled to the memory.
  • the processor includes an instruction decode unit, a vector register operatively coupled to the instruction decode unit, and a vector load/store unit operatively coupled to the vector register.
  • the instruction decode unit is configured to decode the instruction to load the vector to determine a data type of the vector.
  • the vector register is configured to store the data type of the vector.
  • the vector load/store unit is configured to load the vector from the memory into the vector register, such that the vector is associated with the data type in the vector register.
  • a processor in still another example, includes a vector register, an instruction decode unit, and a vector function unit operatively coupled to the instruction decode unit and the vector register.
  • the vector register is configured to store a plurality of vectors and a plurality of data types associated with the plurality of vectors, respectively.
  • the instruction decode unit is configured to decode an instruction to operate a first vector of the plurality of vectors.
  • the vector function unit is configured to retrieve the first vector and a first data type of the plurality of data types that is associated with the first vector from the vector register, and operate on the first vector based on the first data type.
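The operate path can be sketched similarly, again with hypothetical names: the function unit obtains both the operand and its data type from the register set itself, with no separate CSR lookup per operand.

```python
class RegSet:
    # Minimal stand-in for a vector register set holding a vector
    # together with its own data type (illustrative, not the patent's design).
    def __init__(self, data, dtype):
        self.data = data
        self.dtype = dtype

def vfu_add(reg_a, reg_b):
    # The VFU reads each operand AND its data type from the vector
    # register itself; no per-operand global-variable (CSR) check.
    assert reg_a.dtype == reg_b.dtype  # type conversion not modeled here
    return RegSet([a + b for a, b in zip(reg_a.data, reg_b.data)],
                  reg_a.dtype)
```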
  • an SoC includes a memory configured to store an instruction to operate a vector and a processor operatively coupled to the memory.
  • the processor includes a vector register, an instruction decode unit, and a vector function unit operatively coupled to the instruction decode unit and the vector register.
  • the vector register is configured to store the vector and a data type associated with the vector.
  • the instruction decode unit is configured to decode an instruction to operate the vector.
  • the vector function unit is configured to retrieve the vector and the data type associated with the vector from the vector register, and operate on the vector based on the data type.
  • a method for vector operation is disclosed.
  • An instruction to load a vector is decoded to determine a data type of the vector.
  • the data type of the vector is stored in a vector register.
  • the vector is loaded into the vector register.
  • the vector is associated with the data type.
  • a method for vector operation is disclosed.
  • a plurality of vectors and a plurality of data types associated with the plurality of vectors, respectively, are stored in a vector register.
  • An instruction to operate a first vector of the plurality of vectors is decoded.
  • the first vector and a first data type of the plurality of data types that is associated with the first vector are retrieved from the vector register.
  • the first vector is operated based on the first data type.
  • FIG. 1 illustrates a block diagram of an exemplary system having an SoC, according to some embodiments of the present disclosure.
  • FIG. 2 illustrates a detailed block diagram of an exemplary SoC in the system of FIG. 1, according to some embodiments of the present disclosure.
  • FIG. 3 illustrates a vector processing scheme using global variables.
  • FIGs. 4A and 4B illustrate an exemplary vector processing scheme using vector- specific data types, according to some embodiments of the present disclosure.
  • FIG. 5A illustrates a flow chart of an exemplary method for vector operation using vector-specific data types, according to some embodiments of the present disclosure.
  • FIG. 5B illustrates a flow chart of another exemplary method for vector operation using vector-specific data types, according to some embodiments of the present disclosure.
  • references in the specification to “one embodiment,” “an embodiment,” “an example embodiment,” “some embodiments,” “certain embodiments,” etc. indicate that one or more embodiments described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases do not necessarily refer to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it would be within the knowledge of a person skilled in the pertinent art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.
  • the term “one or more” as used herein, depending at least in part upon context, may be used to describe any feature, structure, or characteristic in a singular sense or may be used to describe combinations of features, structures or characteristics in a plural sense.
  • terms, such as “a,” “an,” or “the,” again, may be understood to convey a singular usage or to convey a plural usage, depending at least in part upon context.
  • the terms “based on,” “based upon,” and terms with similar meaning may be understood as not necessarily intended to convey an exclusive set of factors and may, instead, allow for existence of additional factors not necessarily expressly described, again, depending at least in part on context.
  • Vector processing is a data processing method that implements a set of computer instructions on one-dimensional arrays of data, as compared to single data items. Such a data array is also known as a “vector.” Vector processing avoids the overhead of the loop control mechanism that occurs in general-purpose computers. Each vector includes multiple data elements, which can be in different data types, e.g., fixed-point numbers (short integers and long integers), floating-point numbers, etc., and have different accuracies, e.g., 8-bit, 12-bit, 2-bit, 32-bit, etc.
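As a toy illustration (hypothetical Python, not part of the disclosure), the loop-control overhead that vector processing avoids can be seen by contrasting a per-element scalar loop with a whole-array operation:

```python
def scalar_add(a, b):
    # Scalar processing: loop control (branch test, index update) runs
    # once per data element.
    result = []
    for i in range(len(a)):
        result.append(a[i] + b[i])
    return result

def vector_add(a, b):
    # Vector processing: conceptually a single instruction over the
    # whole one-dimensional array, with no per-element loop overhead.
    return [x + y for x, y in zip(a, b)]
```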
  • A known vector instruction extension uses global variables to indicate the data type of each element of each vector operand and of the computation result.
  • These global variables are typically held in a control and status register (CSR).
  • The global variable-based vector processing mechanism also significantly limits system flexibility, since in the extreme case the data type can differ for every vector in the vector register. Requiring every vector in the vector register to share the single data type indicated by the global variables would therefore be a severe limitation.
  • Various embodiments in accordance with the present disclosure introduce vector-specific data types that can be associated with each vector and stored along with the associated vector in a vector register, eliminating the need for the global variable-based vector processing mechanism and the accompanying special instructions to program the global variables.
  • The processor performance on vector processing can be significantly improved by eliminating both the need to update the global variables through CSR instructions when loading vectors into the vector register and the need to check the global variables to determine the data types of the operands when performing vector operations.
  • In the extreme case of one CSR instruction per data operation to set the data type, performance may roughly double.
  • The scheme disclosed herein can also reduce the instruction code size by eliminating the CSR instructions for global variables.
  • The scheme disclosed herein is also backward compatible with the existing vector instruction extension standard, such that any existing program code based on the existing standard can still run on processors implementing the vector-specific data types disclosed herein.
  • FIG. 1 illustrates a block diagram of an exemplary system 100 having an SoC 102, according to some embodiments of the present disclosure.
  • System 100 may include SoC 102 having a processor 108 and a primary memory 110, a bus 104, and a secondary memory 106.
  • System 100 may be applied or integrated into various systems and apparatus capable of high speed data processing, such as computers and wireless communication devices.
  • system 100 may be part of a mobile phone, a desktop computer, a laptop computer, a tablet, a vehicle computer, a gaming console, a printer, a positioning device, a wearable electronic device, a smart sensor, a virtual reality (VR) device, an augmented reality (AR) device, or any other suitable electronic device having high-speed data processing capability.
  • SoC 102 may serve as an application processor (AP) and/or baseband processor (BP) that imports data and instructions from secondary memory 106, executes instructions to perform various mathematical and logical calculations on the data, and exports the calculation results for further processing and transmission over cellular networks.
  • secondary memory 106 may be located outside SoC 102 and operatively coupled to SoC 102 through bus 104.
  • Secondary memory 106 may receive and store data of different types from various sources via communication channels (e.g., bus 104).
  • secondary memory 106 may receive and store digital imaging data captured by a camera of the wireless communication device, voice data transmitted via cellular networks, such as a phone call from another user, or text data input by the user of the system through an interactive input device, such as a touch panel, a keyboard, or the like.
  • Secondary memory 106 may also receive and store computer instructions to be loaded to processor 108 for data processing.
  • Such instructions may be in the form of an instruction set, which contains discrete instructions that direct the microprocessor or other functional components of the microcontroller chip to perform one or more of the following types of operations: data handling and memory operations, arithmetic and logic operations, control flow operations, co-processor operations, etc.
  • Secondary memory 106 may be provided as a standalone component in or attached to the apparatus, such as a hard drive, a Flash drive, a solid-state drive (SSD), or the like. Other types of memory compatible with the current disclosure may also be conceived. It is understood that secondary memory 106 may not be the only component capable of storing data and instructions.
  • Primary memory 110 may also store data and instructions and, unlike secondary memory 106, may have direct access to processor 108.
  • Secondary memory 106 may be a non-volatile memory, which can keep the stored data even though power is lost.
  • primary memory 110 may be volatile memory, and the data may be lost once the power is lost. Because of this difference in structure and design, each type of memory may have its own dedicated use within the system.
  • Data between secondary memory 106 and SoC 102 may be transmitted via bus 104.
  • Bus 104 functions as a highway that allows data to move between various nodes, e.g., memory, microprocessor, transceiver, user interface, or other sub-components in system 100, according to some embodiments.
  • Bus 104 can be serial or parallel.
  • Bus 104 can also be implemented by hardware (such as electrical wires, optical fiber, etc.). It is understood that bus 104 can have sufficient bandwidth for storing and loading a large amount of data (e.g., vectors) between secondary memory 106 and primary memory 110 without delay to the data processing by processor 108.
  • SoC designs may integrate one or more components for computation and processing on an integrated-circuit (IC) substrate.
  • An SoC is an attractive design choice because of its compact area and low power consumption.
  • processor 108 and primary memory 110 are integrated into SoC 102. It is understood that in some examples, primary memory 110 and processor 108 may not be integrated on the same chip, but instead on separate chips.
  • Processor 108 may include any suitable specialized processor including, but not limited to, CPU, GPU, DSP, tensor processing unit (TPU), vision processing unit (VPU), neural processing unit (NPU), synergistic processing unit (SPU), physics processing unit (PPU), and image signal processor (ISP).
  • Processor 108 may also include a microcontroller unit (MCU), which can handle a specific operation in an embedded system.
  • each MCU handles a specific operation of a mobile device, for example, communications other than cellular communication (e.g., Bluetooth communication, Wi-Fi communication, FM radio, etc.), power management, display drive, positioning and navigation, touch screen, camera, etc.
  • processor 108 may include one or more processing cores 112.
  • processing core 112 may include one or more functional units that perform various data operations.
  • processing core 112 may include an arithmetic logic unit (ALU) that performs arithmetic and bitwise operations on data (also known as “operand”), such as addition, subtraction, increment, decrement, AND, OR, Exclusive-OR, etc.
  • processing core 112 may also include a floating-point unit (FPU) that performs operations on floating-point operands, complementing the ALU's handling of fixed-point operands.
  • processing cores 112 may include scalar function units (SFUs) for handling scalar operations and vector function units (VFUs) for handling vector operations.
  • processor 108 may carry out data and instruction operations in serial or in parallel. This multi-core processor design can effectively enhance the processing speed of processor 108 and multiply its performance.
  • processor 108 may be a vector processor in which processing core 112 includes VFUs for handling vector operations based on vector-specific data types, as opposed to global-variable data types, as disclosed below in detail.
  • Register array 114 may be operatively coupled to processing core 112 and primary memory 110 and include multiple sets of registers for various purposes. Because of their architecture design and proximity to processing core 112, register array 114 allows processor 108 to access data, execute instructions, and transfer computation results faster than primary memory 110, according to some embodiments.
  • register array 114 includes a plurality of physical registers fabricated on SoC 102, such as fast static random-access memory (SRAM) having multiple transistors and multiple dedicated read and write ports for high-speed processing and simultaneous read and/or write operations, thus distinguishing it from primary memory 110 and secondary memory 106 (such as a dynamic random-access memory (DRAM), a hard drive, or the like).
  • register array 114 serves as an intermediary memory placed between primary memory 110 and processing core 112. For example, register array 114 may hold frequently used programs or processing tools so that access time to these data can be reduced, thus increasing the processing speed of processor 108 while also reducing power consumption of SoC 102. In another example, register array 114 may store data being operated on by processing core 112, thus reducing delay in accessing the data from primary memory 110. Such registers are known as data registers. Another type is address registers, which may hold addresses and may be used by instructions for indirect access of primary memory 110. There are also status registers, such as the CSR, that decide whether a certain instruction should be executed. In some embodiments, at least part of register array 114 is implemented by a physical register file (PRF) within processor 108.
  • register array 114 includes a vector register configured to associate each vector with a corresponding data type and store the vector along with the corresponding data type of the vector, as described below in detail.
  • the data type may be indicative of at least one of the number of data elements in the vector, the type of each data element, or the accuracy of each data element. It can thereby replace the global variables that indicate the data type of the current vector operand, which are stored in the CSR, separate from the vector register.
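One plausible way to encode such a per-vector data type field is a small packed bitfield. The layout below is purely an assumption for illustration; the disclosure does not specify any particular encoding.

```python
# Hypothetical packed encoding of a per-vector data type field.
# Assumed bit layout (not from the disclosure):
#   bit 2    : element kind (0 = fixed-point, 1 = floating-point)
#   bits 1-0 : element width code (0 -> 8, 1 -> 16, 2 -> 32, 3 -> 64 bits)
_WIDTH_TO_CODE = {8: 0, 16: 1, 32: 2, 64: 3}
_CODE_TO_WIDTH = {v: k for k, v in _WIDTH_TO_CODE.items()}

def pack_dtype(is_float: bool, element_bits: int) -> int:
    return (int(is_float) << 2) | _WIDTH_TO_CODE[element_bits]

def unpack_dtype(code: int):
    return bool(code >> 2), _CODE_TO_WIDTH[code & 0b11]
```

A field this small is consistent with the later observation that the data type register adds little area relative to the data registers.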
  • Control module 116 may be operatively coupled to primary memory 110 and processing core 112. Control module 116 may be implemented by circuits fabricated on the same semiconductor chip as processing core 112. Control module 116 may serve a role similar to that of a command tower. For example, control module 116 may retrieve and decode various computer instructions from primary memory 110 for processing core 112 and instruct processing core 112 which processes to carry out on operands loaded from primary memory 110. Computer instructions may be in the form of a computer instruction set. Different computer instructions may have a different impact on the performance of processor 108.
  • control module 116 may further include an instruction decoder (not shown in FIG. 1, described below in detail) that decodes the computer instructions into instructions readable by other components on processor 108, such as processing core 112. The decoded instructions may be subsequently provided to processing core 112.
  • control module 116 determines the data type of each vector when decoding an instruction to load the vector from primary memory 110 to register array 114, such that register array 114 can associate the determined data type with the vector and store the pair of vector and associated data type, for example, into the vector register. That is, a one-to-one mapping between vectors and their data types can be formed and recorded by control module 116 in conjunction with register array 114, as described below in detail.
  • Other components may be included in SoC 102 as well, such as interfacing components for data loading, storing, routing, or multiplexing within SoC 102, as described below in detail with respect to FIG. 2.
  • FIG. 2 illustrates a detailed block diagram of exemplary SoC 102 in system 100 of FIG. 1, according to some embodiments of the present disclosure.
  • processor 108 may be configured to handle both vectors and scalars by including one or more vector function units (VFUs) 202 and one or more scalar function units (SFUs) 204.
  • Each VFU 202 or SFU 204 may be fully pipelined and can perform arithmetic or logic operations on vectors and scalars, respectively.
  • VFUs 202 and SFUs 204 may be parts of processing core 112 in FIG. 1.
  • Processor 108 in FIG. 2 may further include vector registers 206 and scalar registers 208.
  • Processor 108 may further include one or more multiplexer units (MUXs) 218 operatively coupled to vector registers 206 and VFUs 202 and configured to select among multiple data inputs for output.
  • two MUXs 218 may each select a vector operand from vector registers 206 and output it to VFUs 202 for vector operations on the two vector operands, and one MUX 218 may select an operation result and output it back to vector registers 206.
  • processor 108 may further include one or more MUXs 220 operatively coupled to scalar registers 208 and SFUs 204 and configured to select among multiple data inputs for output.
  • processor 108 may also include another register, CSR 210, as part of register array 114 in FIG. 1.
  • CSR 210 may store additional information about the results of instructions, e.g., comparisons.
  • CSR 210 includes several independent flags such as carry, overflow, and zero.
  • CSR 210 may be used to determine the outcome of conditional branch instructions or other forms of conditional execution.
  • CSR 210 is also configured to store global variables indicative of the data type of the current vector operand, as described below in detail.
  • processor 108 may further include data load/store units for moving data between primary memory 110 and registers 206 and 208, including a vector load/store unit 214 operatively coupled to vector registers 206 and a scalar load/store unit 216 operatively coupled to scalar registers 208.
  • vector load/store unit 214 may load vector operands from primary memory 110 to vector registers 206 and store vector results from vector registers 206 to primary memory 110;
  • scalar load/store unit 216 may load scalar operands from primary memory 110 to scalar registers 208 and store scalar results from scalar registers 208 to primary memory 110.
  • Data, such as scalars and vectors, may move through data paths having components such as vector load/store unit 214, vector registers 206, MUXs 218, VFUs 202, CSR 210, scalar load/store unit 216, scalar registers 208, MUXs 220, and SFUs 204.
  • control module 116 of processor 108 may include an instruction fetch unit 211 and an instruction decode unit 212 (collectively also known as an instruction processing unit (IPU)).
  • Instruction fetch unit 211 may be operatively coupled to primary memory 110 and configured to fetch instructions from primary memory 110 that are to be processed by processor 108.
  • Instruction decode unit 212 may be operatively coupled to instruction fetch unit 211 and each of the components in the data paths described above and configured to decode each instruction and control the operations of each component in the data paths described above based on the decoded instruction, as described below in detail.
  • the instruction includes a SIMD instruction or a vector instruction.
  • Instruction decode unit 212 may determine whether the fetched instruction is a scalar instruction or a vector instruction. If it is a scalar instruction, scalar processing may be performed by SFUs 204 in conjunction with scalar registers 208. If it is a vector instruction, vector processing may be performed by VFUs 202 in conjunction with vector registers 206. For example, the address of the vector operand in primary memory 110 may be determined from the decoded instruction and provided to vector load/store unit 214 to load the vector operand into vector registers 206. At the same time, the decoded instruction may be provided to VFUs 202 to operate on the vector operand from vector registers 206.
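The decode-and-dispatch step above can be sketched as follows; the `kind` tag is an illustrative stand-in for the opcode fields a real decoder would inspect.

```python
def dispatch(decoded):
    # The instruction decode unit routes the instruction to the scalar
    # or vector data path ("kind" is an illustrative tag, not a real
    # instruction encoding).
    if decoded["kind"] == "vector":
        return "VFU"   # handled by VFUs with the vector registers
    return "SFU"       # handled by SFUs with the scalar registers
```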
  • FIG. 3 illustrates a vector processing scheme using global variables. As shown in FIG. 3, vector registers 302 include multiple register sets 308 (32 sets, from V0 to V31, in this example), each of which includes a number of registers 310 ([0] to [VLM-1], where VLM represents the “maximum vector length”).
  • Each register set 308 stores a vector having the number of data elements that is not greater than VLM, each of which is stored in a respective register 310.
  • the data types of vectors stored in vector registers 302 may vary, for example, with different numbers of data elements (as long as the number does not exceed VLM), different types of the data elements (e.g., fixed-point numbers, floating-point numbers, etc.), and/or different accuracies of the data elements (e.g., 8-bits, 12-bits, 16-bits, 24-bits, 32-bits, etc.).
  • a vector CSR 304 stores a plurality of global variables 312 indicative of the data type of the current vector operand, such as “vtype” for setting the type and accuracy of each data element of the vector (e.g., 16-bits signed integers, or 32-bits unsigned floating-point numbers), and “vl” for setting the number of data elements in the vector.
  • Global variables 312 can further include the fixed-point rounding mode and saturation flag, and resumption data element after trap.
  • the vector processing scheme illustrated in FIG. 3 requires an instruction decode unit 314 to receive and decode special CSR instructions that set the global variables (GV) and update global variables 312 in vector CSR 304 each time the data type of the current vector operand changes.
  • a GV-CSR instruction needs to be inserted alongside each vector operation instruction in order to keep the data type of the current vector operand up to date.
  • VFU 306 when a VFU 306 performs a vector operation, VFU 306 needs to separately retrieve the data itself (e.g., vector operands V-A and V-B) from vector registers 302 and the data type (e.g., GV-A and GV-B) from vector CSR 304. For example, VFU 306 may retrieve the first vector operand V-A from vector registers 302 and retrieve the current data type GV-A from vector CSR 304. When retrieving the second vector operand V-B from vector registers 302, VFU 306 needs to check the current data type GV-B again as it may be different from the previous data type GV- A.
  • The scheme of FIG. 3 thus introduces significant overhead to vector processing due to the need both to update global variables 312 in vector CSR 304 through specific CSR instructions and to check global variables in vector CSR 304 each time a new vector operand is retrieved by VFU 306.
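A toy instruction-count model (hypothetical, for intuition only) shows how per-type CSR updates inflate the instruction stream in the global-variable scheme:

```python
def count_instructions(op_dtypes):
    # op_dtypes: the data type required by each successive vector
    # operation. In the global-variable scheme, every change of data
    # type forces an extra GV-CSR write instruction.
    csr_writes = 0
    current = None
    for dt in op_dtypes:
        if dt != current:
            csr_writes += 1
            current = dt
    return len(op_dtypes) + csr_writes
```

In the worst case, where the data type changes on every operation, the instruction count doubles, consistent with the "performance may be doubled" estimate earlier in the text.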
  • FIGs. 4A and 4B illustrate an exemplary vector processing scheme using vector-specific data types, according to some embodiments of the present disclosure.
  • the vector processing scheme is implemented by processor 108 described above in detail with respect to FIGs. 1 and 2. For ease of description, the details of the components in processor 108 may not be repeated.
  • vector registers 206 may include a plurality of register sets 402, each including a plurality of data registers 404 as well as a data type register 406 (or a data type field as part of a data register 404).
  • the number of register sets 402 may vary in different implementations, such as 32 (from V0 to V31) as shown in FIG. 4A, which determines the maximum number of vectors that can be stored in vector registers 206 at the same time.
  • the number of data registers 404 in each register set 402 may vary, but their total length may not exceed VLM.
  • VLM is the same for each register set 402, for example, 512 bits or 1,024 bits. It is understood that although VLM may be the same, the sizes of vectors stored in register sets 402 may vary as a vector does not have to occupy the entire data registers 404, i.e., to reach VLM, as long as it does not exceed VLM. It is further understood that even for vectors having the same size, e.g., VLM, their data types may still vary as they can have different numbers of data elements, different types of data elements, and/or different accuracies of data elements.
  • register set 402 also includes data type register 406 for storing the data type of the vector, which can replace global variables 312 in vector CSR 304 in FIG. 3.
  • the size of each register set 402 may be the total of the size of data registers 404 and the size of data type register 406.
  • data type register 406 may have a significantly smaller size due to the limited information to be recorded therein and thus, add little overhead to the processor size.
  • the size of data type register 406 is not greater than 1/100 of the size of data register 404.
  • the size of data type register 406 may be 4 bits, while the size of data register 404 may be 512 bits or 1,024 bits.
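As a rough illustration, the register-set layout described above can be modeled in software. The field widths below (512-bit data registers, 4-bit type tag) are taken from the example sizes in the text; the class itself is a hypothetical sketch for illustration, not the hardware design:

```python
from dataclasses import dataclass

VLM_BITS = 512        # maximum vector length in bits per register set (example size from the text)
TYPE_TAG_BITS = 4     # size of data type register 406 (example size from the text)

@dataclass
class RegisterSet:
    data: int = 0           # packed vector bits, up to VLM_BITS wide
    data_type_tag: int = 0  # small tag encoding element count/type/accuracy

    def store(self, data: int, tag: int) -> None:
        # a vector may be shorter than VLM, but never longer
        assert data.bit_length() <= VLM_BITS, "vector exceeds VLM"
        assert tag.bit_length() <= TYPE_TAG_BITS, "tag exceeds data type register"
        self.data, self.data_type_tag = data, tag

# The type tag adds 4 bits to 512, i.e., well under the 1/100 overhead bound.
print(TYPE_TAG_BITS / VLM_BITS)  # 0.0078125
```

Note that the tag adds under one percent to each register set, which is the "little overhead" argument made in the text.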
  • FIG. 4A shows the scheme of determining and associating the data type with a corresponding vector, according to some embodiments.
  • Instruction decode unit 212 may be configured to decode an instruction to load a vector to determine the data type of the vector.
  • instruction decode unit 212 may receive an instruction to load a vector V (Instr: LOAD V) from primary memory 110 through instruction fetch unit 211.
  • the load instruction may include information about the vector, for example, the address of the vector in primary memory 110.
  • instruction decode unit 212 may determine the data type of the vector, which may be indicative of at least one of the number of data elements in the vector (e.g., 1 to 32), the type of each data element (e.g., signed or unsigned fixed-point numbers (integers), or signed or unsigned floating-point numbers), or the accuracy of each data element (e.g., 8 bits, 12 bits, 16 bits, 24 bits, or 32 bits).
  • the load instruction includes additional information for instruction decode unit 212 to determine the data type of the vector. Instruction decode unit 212 may perform the data type identification operation for each vector to be loaded to vector registers 206 to ensure that all vectors stored in vector registers 206 can have their data types determined by instruction decode unit 212.
  • instruction decode unit 212 is also configured to decode the load instruction to determine the address of the vector and provide the address information to vector load/store unit 214, such that vector load/store unit 214 can load the vector V from the corresponding memory address in primary memory 110 and send the vector to vector registers 206, according to some embodiments.
  • Instruction decode unit 212 is also configured to provide the determined data type of the vector V to vector registers 206 as well, according to some embodiments.
  • Vector registers 206 may be operatively coupled to instruction decode unit 212 and vector load/store unit 214 and configured to associate each vector received from vector load/store unit 214 with a respective data type of the vector received from instruction decode unit 212 and store both the vector and the associated data type. For example, each pair of vector and associated data type may be stored into data registers 404 and data type register 406, respectively, of a respective register set 402. The association may be performed, for example, based on the address, or any other suitable identifiers, of the vector. It is understood that, different from global variables 312 in FIG. 3, vector registers 206 may be configured to store a plurality of vectors and a plurality of data types associated with the plurality of vectors, respectively.
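The load-and-associate flow just described can be sketched as a small software model. The instruction encoding, memory address, and type-tag string below are hypothetical assumptions for illustration, not the actual hardware interface:

```python
# hypothetical memory contents and register file for illustration
memory = {0x100: [1, 2, 3, 4]}
vector_registers = {}          # register set name -> (vector, data type)

def decode_load(instr):
    """Instruction decode unit 212: derive address and data type from the load."""
    return instr["addr"], instr["dtype"]

def execute_load(instr, dest):
    addr, dtype = decode_load(instr)          # decode determines the data type
    vector = memory[addr]                     # vector load/store unit 214 fetches
    vector_registers[dest] = (vector, dtype)  # stored together in register set 402

execute_load({"op": "LOAD", "addr": 0x100, "dtype": "int16x4"}, dest="V0")
print(vector_registers["V0"])  # ([1, 2, 3, 4], 'int16x4')
```

The key property is that the vector and its data type land in the same register set, so any later consumer gets both in one lookup.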
  • FIG. 4B shows the scheme of performing vector operations on vectors using their associated data types, according to some embodiments.
  • Instruction decode unit 212 may be configured to decode an instruction to operate a vector of the plurality of vectors stored in vector registers 206. In some embodiments in which SIMD is implemented, the instruction is to operate multiple vectors stored in vector registers 206. For example, instruction decode unit 212 may receive an instruction to operate on vectors V-A and V-B (Instr: OP V-A, V-B) from primary memory 110 through instruction fetch unit 211.
  • the operation instruction may be an arithmetical operation or a logic operation.
  • Instruction decode unit 212 may decode information from the instruction including, for example, the operation (OP, e.g., add, subtract, multiply, divide, OR, AND, XOR, NAND, etc.) and the addresses or any other suitable identifiers of the vector operands (V-A and V-B), and provide the decoded information about the operation instruction to VFU 202.
  • VFU 202 may be operatively coupled to instruction decode unit 212 and vector registers 206 and configured to retrieve the vector and the data type associated with the vector from vector registers 206, and operate on the vector based on the data type. In some embodiments in which SIMD is implemented, VFU 202 may retrieve each of the multiple vector operands and their associated data types. For example, VFU 202 may retrieve the two vector operands V-A and V-B from data registers 404 of corresponding register sets 402, respectively, in vector registers 206 based on decoded instruction information from instruction decode unit 212.
  • VFU 202 may also retrieve the two data types DT-A and DT-B associated with the two vector operands V-A and V-B, respectively, from data type registers 406 of corresponding register sets 402, respectively, in vector registers 206 based on decoded instruction information from instruction decode unit 212 as well. That is, different from VFU 306 in FIG. 3 that checks global variables 312 in vector CSR 304 to determine the data type of the current vector operand, VFU 202 may be further configured to refrain from retrieving the global variable from CSR 210 when operating on the vector. In some embodiments, CSR 210 does not include global variables for indicating the data type of the current vector operand. In some embodiments, CSR 210 still includes the global variables for indicating the data type of the current vector operand for backward-compatible purposes, but VFU 202 may not retrieve such global variables when it can retrieve the vector-specific data types from vector registers.
  • VFU 202 may operate on the two vector operands V-A and V-B based on their data types, and return the result vector and its data type back to vector registers 206. That is, the vector resulting from the operation may be associated with a data type, which may be stored in data registers 404 and data type register 406, respectively, in register set 402.
  • For example, the data type of the result vector may be a 32-bit signed integer; for a “x” operation with the data types of 8-bit signed and 8-bit signed, the data type of the result vector may be a 16-bit signed short-integer; for a “x” operation with the data types of 32-bit floating-point number and 32-bit integer, the data type of the result vector may be a 32-bit floating-point number.
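The result-type examples above can be captured in a small lookup sketch. Only the two multiply entries are grounded in the text; the "+" entry and the fallback behavior are assumptions added for illustration:

```python
def result_dtype(op, dtype_a, dtype_b):
    """Return the data type tag of the result vector of `dtype_a op dtype_b`."""
    rules = {
        ("x", "i8", "i8"): "i16",    # 8-bit signed x 8-bit signed -> 16-bit signed (from the text)
        ("x", "f32", "i32"): "f32",  # 32-bit float x 32-bit integer -> 32-bit float (from the text)
        ("+", "i32", "i32"): "i32",  # assumed: same-type addition keeps the type
    }
    # fallback is an assumption; the text does not state a general promotion rule
    return rules.get((op, dtype_a, dtype_b), dtype_a)

print(result_dtype("x", "i8", "i8"))    # i16
print(result_dtype("x", "f32", "i32"))  # f32
```

Because the result's tag is computed alongside the result vector, both can be written back to a register set together, which is what keeps the association intact across chained operations.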
  • FIG. 5A illustrates a flow chart of an exemplary method 500 for vector operation using vector-specific data types, according to some embodiments of the present disclosure.
  • Examples of the apparatus that can perform operations of method 500 include, for example, processor 108 depicted in FIGs. 1, 2, and 4A or any other suitable apparatus disclosed herein. It is understood that the operations shown in method 500 are not exhaustive and that other operations can be performed as well before, after, or between any of the illustrated operations. Further, some of the operations may be performed simultaneously, or in a different order than shown in FIG. 5A.
  • method 500 starts at operation 502, in which an instruction to load a vector is decoded to determine a data type of the vector.
  • the data type may be indicative of at least one of the number of data elements in the vector, the type of each of the data elements, or the accuracy of each of the data elements.
  • instruction decode unit 212 of processor 108 may decode a load instruction to determine the data type of a vector.
  • Method 500 proceeds to operation 504, as illustrated in FIG. 5 A, in which the data type of the vector is stored in a vector register, for example, data type register 406 of register set 402 in vector registers 206 as shown in FIG. 4A.
  • Method 500 proceeds to operation 506, as illustrated in FIG. 5A, in which the vector is loaded into the vector register.
  • vector load/store unit 214 may load the vector from primary memory 110 into data register 404 of register set 402 in vector registers 206 based on the decoded instruction, e.g., the address information of the vector, provided by instruction decode unit 212.
  • Method 500 proceeds to operation 508, as illustrated in FIG. 5A, in which the vector is associated with the data type, for example, by vector registers 206 as shown in FIG. 4A.
  • FIG. 5B illustrates a flow chart of another exemplary method 510 for vector operation using vector-specific data types, according to some embodiments of the present disclosure.
  • Examples of the apparatus that can perform operations of method 510 include, for example, processor 108 depicted in FIGs. 1, 2, and 4B or any other suitable apparatus disclosed herein. It is understood that the operations shown in method 510 are not exhaustive and that other operations can be performed as well before, after, or between any of the illustrated operations. Further, some of the operations may be performed simultaneously, or in a different order than shown in FIG. 5B.
  • method 510 starts at operation 512, in which an instruction to operate the vector is decoded.
  • the operate instruction includes an instruction to operate a first vector and a second vector.
  • instruction decode unit 212 may decode an operation instruction to operate two vector operands stored in vector registers 206.
  • Method 510 proceeds to operation 514, as illustrated in FIG. 5B, in which the vector and the data type are retrieved from the vector register.
  • VFU 202 may retrieve the first and second vector operands and their associated data types from vector registers 206.
  • Method 510 proceeds to operation 516, as illustrated in FIG. 5B, in which the vector is operated based on the data type.
  • VFU 202 may operate on the first and second vector operands based on their data types.
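Method 510's retrieve-and-operate steps can be sketched end to end as a toy software model (an assumption for illustration, not the processor's implementation). The point it demonstrates is that each operand and its data type come from the register file in a single step, with no global CSR read:

```python
# hypothetical preloaded register file: each entry pairs a vector with its type
vector_registers = {
    "V1": ([1, 2, 3], "i16"),
    "V2": ([10, 20, 30], "i16"),
}

def execute_op(op, src_a, src_b, dest):
    # operation 514: operand and data type are retrieved together,
    # so no global CSR lookup is needed before operating
    vec_a, dtype_a = vector_registers[src_a]
    vec_b, dtype_b = vector_registers[src_b]
    assert dtype_a == dtype_b, "toy model supports only same-type operands"
    if op == "add":
        result = [a + b for a, b in zip(vec_a, vec_b)]
    else:
        raise NotImplementedError(op)
    # operation 516's result carries its own data type tag back to the register file
    vector_registers[dest] = (result, dtype_a)

execute_op("add", "V1", "V2", dest="V3")
print(vector_registers["V3"])  # ([11, 22, 33], 'i16')
```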
  • the functions described herein may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or encoded as instructions or code on a non-transitory computer-readable medium.
  • Computer-readable media includes computer storage media. Storage media may be any available media that can be accessed by a computing device, such as system 100 in FIG. 1.
  • such computer-readable media can include RAM, read-only memory (ROM), electrically erasable programmable ROM (EEPROM), compact disc-ROM (CD-ROM) or other optical disk storage, hard disk drive (HDD), such as magnetic disk storage or other magnetic storage devices, flash drive, SSD, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a processing system, such as a mobile device or a computer.
  • Disk and disc, as used herein, include CD, laser disc, optical disc, digital versatile disc (DVD), and floppy disk, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
  • a processor includes an instruction decode unit, a vector register operatively coupled to the instruction decode unit, and a vector load/store unit operatively coupled to the vector register.
  • the instruction decode unit is configured to decode an instruction to load a first vector to determine a first data type of the first vector.
  • the vector register is configured to store the first data type of the first vector.
  • the vector load/store unit is configured to load the first vector into the vector register, such that the first vector is associated with the first data type in the vector register.
  • the vector register is configured to store the first vector, and associate the first vector with the first data type.
  • the vector register is configured to store a plurality of vectors and a plurality of data types associated with the plurality of vectors, respectively.
  • the first data type is indicative of at least one of a number of data elements in the first vector, a type of each of the data elements, or an accuracy of each of the data elements.
  • the instruction decode unit is further configured to decode an instruction to operate the first vector.
  • the processor further includes a vector function unit operatively coupled to the vector register and configured to retrieve the first vector and the first data type from the vector register, and operate on the first vector based on the first data type.
  • the instruction includes an instruction to operate the first vector and a second vector
  • the vector function unit is further configured to retrieve the second vector and a second data type associated with the second vector from the vector register, and operate on the first and second vectors based on the first and second data types.
  • the processor further includes a control and status register configured to store a global variable
  • the vector function unit is further configured to refrain from retrieving the global variable from the control and status register when operating on the first vector.
  • an SoC includes a memory configured to store a vector and an instruction to load the vector, and a processor operatively coupled to the memory.
  • the processor includes an instruction decode unit, a vector register operatively coupled to the instruction decode unit, and a vector load/store unit operatively coupled to the vector register.
  • the instruction decode unit is configured to decode the instruction to load the vector to determine a data type of the vector.
  • the vector register is configured to store the data type of the vector.
  • the vector load/store unit is configured to load the vector from the memory into the vector register, such that the vector is associated with the data type in the vector register.
  • a processor includes a vector register, an instruction decode unit, and a vector function unit operatively coupled to the instruction decode unit and the vector register.
  • the vector register is configured to store a plurality of vectors and a plurality of data types associated with the plurality of vectors, respectively.
  • the instruction decode unit is configured to decode an instruction to operate a first vector of the plurality of vectors.
  • the vector function unit is configured to retrieve the first vector and a first data type of the plurality of data types that is associated with the first vector from the vector register, and operate on the first vector based on the first data type.
  • the instruction includes an instruction to operate the first vector and a second vector of the plurality of vectors
  • the vector function unit is further configured to retrieve the second vector and a second data type of the plurality of data types that is associated with the second vector from the vector register, and operate on the first and second vectors based on the first and second data types.
  • the first data type is different from the second data type.
  • the first data type is indicative of at least one of a number of data elements in the first vector, a type of each of the data elements, or an accuracy of each of the data elements.
  • the instruction decode unit is further configured to decode an instruction to load the first vector to determine the first data type of the first vector.
  • the processor further includes a vector load/store unit operatively coupled to the vector register and configured to load the first vector into the vector register, such that the first vector is associated with the first data type in the vector register.
  • the processor further includes a control and status register configured to store a global variable
  • the vector function unit is further configured to refrain from retrieving the global variable from the control and status register when operating on the first vector.
  • an SoC includes a memory configured to store an instruction to operate a vector and a processor operatively coupled to the memory.
  • the processor includes a vector register, an instruction decode unit, and a vector function unit operatively coupled to the instruction decode unit and the vector register.
  • the vector register is configured to store the vector and a data type associated with the vector.
  • the instruction decode unit is configured to decode an instruction to operate the vector.
  • the vector function unit is configured to retrieve the vector and the data type associated with the vector from the vector register, and operate on the vector based on the data type.
  • a method for vector operation is disclosed.
  • An instruction to load a vector is decoded to determine a data type of the vector.
  • the data type of the vector is stored in a vector register.
  • the vector is loaded into the vector register.
  • the vector is associated with the data type.
  • an instruction to operate the vector is decoded, the vector and the data type are retrieved from the vector register, and the vector is operated based on the data type.
  • a method for vector operation is disclosed.
  • a plurality of vectors and a plurality of data types associated with the plurality of vectors, respectively, are stored in a vector register.
  • An instruction to operate a first vector of the plurality of vectors is decoded.
  • the first vector and a first data type of the plurality of data types that is associated with the first vector are retrieved from the vector register.
  • the first vector is operated based on the first data type.
  • the instruction includes an instruction to operate the first vector and a second vector of the plurality of vectors.
  • the second vector and a second data type of the plurality of data types that is associated with the second vector are retrieved from the vector register, and the first and second vectors are operated based on the first and second data types.
  • an instruction to load the first vector is decoded to determine the first data type of the first vector, the first data type of the first vector is stored in the vector register, the first vector is loaded into the vector register, and the first vector is associated with the first data type.

Abstract

Embodiments of processors and operations thereof are disclosed. In an example, a processor includes an instruction decode unit, a vector register operatively coupled to the instruction decode unit, and a vector load/store unit operatively coupled to the vector register. The instruction decode unit is configured to decode an instruction to load a first vector to determine a first data type of the first vector. The vector register is configured to store the first data type of the first vector. The vector load/store unit is configured to load the first vector into the vector register, such that the first vector is associated with the first data type in the vector register.

Description

VECTOR PROCESSING USING VECTOR-SPECIFIC DATA TYPE
BACKGROUND
[0001] Embodiments of the present disclosure relate to processors and operations thereof.
[0002] Parallel computing is the major acceleration solution for different kinds of processors, such as central processing unit (CPU), graphics processing unit (GPU), digital signal processor (DSP), etc. Single instruction, multiple data (SIMD) is a class of parallel computing that performs the same operation on multiple data points simultaneously. High-performance computing requires a parallel vector processor, which may have to handle different types of data formats constantly. However, switching operand data types can introduce huge performance overhead in known processors.
SUMMARY
[0003] Embodiments of processors and operations thereof are disclosed herein.
[0004] In one example, a processor includes an instruction decode unit, a vector register operatively coupled to the instruction decode unit, and a vector load/store unit operatively coupled to the vector register. The instruction decode unit is configured to decode an instruction to load a first vector to determine a first data type of the first vector. The vector register is configured to store the first data type of the first vector. The vector load/store unit is configured to load the first vector into the vector register, such that the first vector is associated with the first data type in the vector register.
[0005] In another example, a system-on-a-chip (SoC) includes a memory configured to store a vector and an instruction to load the vector, and a processor operatively coupled to the memory. The processor includes an instruction decode unit, a vector register operatively coupled to the instruction decode unit, and a vector load/store unit operatively coupled to the vector register. The instruction decode unit is configured to decode the instruction to load the vector to determine a data type of the vector. The vector register is configured to store the data type of the vector. The vector load/store unit is configured to load the vector from the memory into the vector register, such that the vector is associated with the data type in the vector register.
[0006] In still another example, a processor includes a vector register, an instruction decode unit, and a vector function unit operatively coupled to the instruction decode unit and the vector register. The vector register is configured to store a plurality of vectors and a plurality of data types associated with the plurality of vectors, respectively. The instruction decode unit is configured to decode an instruction to operate a first vector of the plurality of vectors. The vector function unit is configured to retrieve the first vector and a first data type of the plurality of data types that is associated with the first vector from the vector register, and operate on the first vector based on the first data type.
[0007] In yet another example, an SoC includes a memory configured to store an instruction to operate a vector and a processor operatively coupled to the memory. The processor includes a vector register, an instruction decode unit, and a vector function unit operatively coupled to the instruction decode unit and the vector register. The vector register is configured to store the vector and a data type associated with the vector. The instruction decode unit is configured to decode an instruction to operate the vector. The vector function unit is configured to retrieve the vector and the data type associated with the vector from the vector register, and operate on the vector based on the data type.
[0008] In yet another example, a method for vector operation is disclosed. An instruction to load a vector is decoded to determine a data type of the vector. The data type of the vector is stored in a vector register. The vector is loaded into the vector register. The vector is associated with the data type.
[0009] In yet another example, a method for vector operation is disclosed. A plurality of vectors and a plurality of data types associated with the plurality of vectors, respectively, are stored in a vector register. An instruction to operate a first vector of the plurality of vectors is decoded. The first vector and a first data type of the plurality of data types that is associated with the first vector are retrieved from the vector register. The first vector is operated based on the first data type.
BRIEF DESCRIPTION OF THE DRAWINGS
[0010] The accompanying drawings, which are incorporated herein and form a part of the specification, illustrate embodiments of the present disclosure and, together with the description, further serve to explain the principles of the present disclosure and to enable a person skilled in the pertinent art to make and use the present disclosure.
[0011] FIG. 1 illustrates a block diagram of an exemplary system having an SoC, according to some embodiments of the present disclosure.
[0012] FIG. 2 illustrates a detailed block diagram of an exemplary SoC in the system of FIG. 1, according to some embodiments of the present disclosure.
[0013] FIG. 3 illustrates a vector processing scheme using global variables.
[0014] FIGs. 4A and 4B illustrate an exemplary vector processing scheme using vector-specific data types, according to some embodiments of the present disclosure.
[0015] FIG. 5A illustrates a flow chart of an exemplary method for vector operation using vector-specific data types, according to some embodiments of the present disclosure.
[0016] FIG. 5B illustrates a flow chart of another exemplary method for vector operation using vector-specific data types, according to some embodiments of the present disclosure.
DETAILED DESCRIPTION
[0017] Although specific configurations and arrangements are discussed, it should be understood that this is done for illustrative purposes only. A person skilled in the pertinent art will recognize that other configurations and arrangements can be used without departing from the spirit and scope of the present disclosure. It will be apparent to a person skilled in the pertinent art that the present disclosure can also be employed in a variety of other applications.
[0018] It is noted that references in the specification to “one embodiment,” “an embodiment,” “an example embodiment,” “some embodiments,” “certain embodiments,” etc., indicate that one or more embodiments described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases do not necessarily refer to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it would be within the knowledge of a person skilled in the pertinent art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.
[0019] In general, terminology may be understood at least in part from usage in context.
For example, the term “one or more” as used herein, depending at least in part upon context, may be used to describe any feature, structure, or characteristic in a singular sense or may be used to describe combinations of features, structures or characteristics in a plural sense. Similarly, terms, such as “a,” “an,” or “the,” again, may be understood to convey a singular usage or to convey a plural usage, depending at least in part upon context. In addition, the terms “based on,” “based upon,” and terms with similar meaning may be understood as not necessarily intended to convey an exclusive set of factors and may, instead, allow for existence of additional factors not necessarily expressly described, again, depending at least in part on context.
[0020] Various aspects of the present disclosure will now be described with reference to various apparatus and methods. These apparatus and methods will be described in the following detailed description and illustrated in the accompanying drawings by various blocks, modules, units, components, circuits, steps, operations, processes, algorithms, etc. (collectively referred to as “elements”). These elements may be implemented using electronic hardware, firmware, computer software, or any combination thereof. Whether such elements are implemented as hardware, firmware, or software depends upon the particular application and design constraints imposed on the overall system.
[0021] Vector processing is a data processing method that implements a set of computer instructions on one-dimensional arrays of data, as compared to single data items. Such a data array is also known as a “vector.” Vector processing avoids the overhead of the loop control mechanism that occurs in general-purpose computers. Each vector includes multiple data elements, which can be in different data types, e.g., fixed-point numbers (short integers and long integers), floating-point numbers, etc., and have different accuracies, e.g., 8-bit, 12-bit, 24-bit, 32-bit, etc.
[0022] In order to process vectors with different types of data, some processors introduced vector instruction extension, which uses global variables to indicate the data type of each element of each vector operand and the computation result. However, this approach can introduce huge performance overhead from switching the global variables with special control and status register (CSR) instructions when it needs to operate on different types of vectors. The global variable-based vector processing mechanism also significantly limits the system flexibility, as the data type can differ even between individual vectors in the vector register in an extreme case. Thus, it would be a fatal limitation if the same data type, as indicated by the global variables, were required across all vectors in the vector register. For example, if each vector in the vector register has a different type of data, the processor may need to re-program the global variables through a CSR instruction each time, i.e., one extra CSR instruction between every two data operations, thereby reducing the processing performance by half.
[0023] Various embodiments in accordance with the present disclosure introduce vector-specific data types that can be associated with each vector and stored along with the associated vector in a vector register, eliminating the need for the global variable-based vector processing mechanism and the accompanying special instructions to program the global variables. As a result, the processor performance on vector processing can be significantly improved by skipping both the need to update the global variables through CSR instructions when loading the vectors into the vector register and the need to check the global variables to determine the data types of the operands when performing the vector operations. For example, the performance may be doubled in the extreme case of one global variable update per data operation. The scheme disclosed herein can also reduce the instruction code size by saving the CSR instructions for global variables. Moreover, the scheme disclosed herein is backward compatible with the existing vector instruction extension standard, such that any existing program code based on the existing standard can still run on processors implementing the vector-specific data types disclosed herein.
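The claim that performance may be doubled in the extreme case can be illustrated with a hypothetical instruction count, assuming one CSR write per data-type switch in the global-variable scheme and none in the vector-specific scheme (the counting model is an assumption for illustration, not a cycle-accurate measurement):

```python
def baseline_count(op_types):
    """Global-variable scheme: one extra CSR write whenever the type changes."""
    issued, current = 0, None
    for t in op_types:
        if t != current:
            issued += 1          # "set vtype"-style CSR instruction
            current = t
        issued += 1              # the vector operation itself
    return issued

def tagged_count(op_types):
    """Vector-specific scheme: the type tag travels with each vector."""
    return len(op_types)

alternating = ["i16", "f32"] * 4   # extreme case: type switches on every operation
print(baseline_count(alternating), tagged_count(alternating))  # 16 8
```

When the type never changes, the two schemes differ by only a single setup instruction, which matches the intuition that the benefit grows with the frequency of data-type switches.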
[0024] FIG. 1 illustrates a block diagram of an exemplary system 100 having an SoC 102, according to some embodiments of the present disclosure. System 100 may include SoC 102 having a processor 108 and a primary memory 110, a bus 104, and a secondary memory 106. System 100 may be applied or integrated into various systems and apparatus capable of high-speed data processing, such as computers and wireless communication devices. For example, system 100 may be part of a mobile phone, a desktop computer, a laptop computer, a tablet, a vehicle computer, a gaming console, a printer, a positioning device, a wearable electronic device, a smart sensor, a virtual reality (VR) device, an augmented reality (AR) device, or any other suitable electronic device having high-speed data processing capability. Using a wireless communication device as an example, SoC 102 may serve as an application processor (AP) and/or baseband processor (BP) that imports data and instructions from secondary memory 106, executes instructions to perform various mathematical and logical calculations on the data, and exports the calculation results for further processing and transmission over cellular networks.
[0025] As shown in FIG. 1, secondary memory 106 may be located outside SoC 102 and operatively coupled to SoC 102 through bus 104. Secondary memory 106 may receive and store data of different types from various sources via communication channels (e.g., bus 104). For example, secondary memory 106 may receive and store digital imaging data captured by a camera of the wireless communication device, voice data transmitted via cellular networks, such as a phone call from another user, or text data input by the user of the system through an interactive input device, such as a touch panel, a keyboard, or the like. Secondary memory 106 may also receive and store computer instructions to be loaded to processor 108 for data processing. Such instructions may be in the form of an instruction set, which contains discrete instructions that direct the microprocessor or other functional components of the microcontroller chip to perform one or more of the following types of operations: data handling and memory operations, arithmetic and logic operations, control flow operations, co-processor operations, etc. Secondary memory 106 may be provided as a standalone component in or attached to the apparatus, such as a hard drive, a Flash drive, a solid-state drive (SSD), or the like. Other types of memory compatible with the current disclosure may also be conceived. It is understood that secondary memory 106 may not be the only component capable of storing data and instructions. Primary memory 110 may also store data and instructions and, unlike secondary memory 106, may have direct access to processor 108. Secondary memory 106 may be a non-volatile memory, which can keep the stored data even when power is lost. In contrast, primary memory 110 may be a volatile memory, and the data may be lost once power is lost. Because of this difference in structure and design, each type of memory may have its own dedicated use within the system.
[0026] Data between secondary memory 106 and SoC 102 may be transmitted via bus 104.
Bus 104 functions as a highway that allows data to move between various nodes, e.g., memory, microprocessor, transceiver, user interface, or other sub-components in system 100, according to some embodiments. Bus 104 can be serial or parallel, and can be implemented in hardware (such as electrical wires, optical fiber, etc.). It is understood that bus 104 can have sufficient bandwidth for storing and loading a large amount of data (e.g., vectors) between secondary memory 106 and primary memory 110 without delaying the data processing by processor 108.
[0027] SoC designs may integrate one or more components for computation and processing on an integrated-circuit (IC) substrate. For applications where chip size matters, such as smartphones and wearable gadgets, SoC design is an ideal choice because of its compact area and low power consumption. In some embodiments, as shown in FIG. 1, one or more processors 108 and primary memory 110 are integrated into SoC 102. It is understood that in some examples, primary memory 110 and processor 108 may not be integrated on the same chip, but instead on separate chips.
[0028] Processor 108 may include any suitable specialized processor including, but not limited to, CPU, GPU, DSP, tensor processing unit (TPU), vision processing unit (VPU), neural processing unit (NPU), synergistic processing unit (SPU), physics processing unit (PPU), and image signal processor (ISP). Processor 108 may also include a microcontroller unit (MCU), which can handle a specific operation in an embedded system. In some embodiments in which system 100 is used in wireless communications, each MCU handles a specific operation of a mobile device, for example, communications other than cellular communication (e.g., Bluetooth communication, Wi-Fi communication, FM radio, etc.), power management, display drive, positioning and navigation, touch screen, camera, etc.
[0029] As shown in FIG. 1, processor 108 may include one or more processing cores 112
(a.k.a. “cores”), a register array 114, and a control module 116. In some embodiments, processing core 112 may include one or more functional units that perform various data operations. For example, processing core 112 may include an arithmetic logic unit (ALU) that performs arithmetic and bitwise operations on data (also known as “operands”), such as addition, subtraction, increment, decrement, AND, OR, Exclusive-OR, etc. Processing core 112 may also include a floating-point unit (FPU) that performs similar arithmetic operations but on a type of operand (e.g., floating-point numbers) different from those operated on by the ALU (e.g., binary numbers). The operations may be addition, subtraction, multiplication, etc. As described below in detail, another way of categorizing the functional units may be based on whether the data processed by the functional unit is a scalar or a vector. For example, processing cores 112 may include scalar function units (SFUs) for handling scalar operations and vector function units (VFUs) for handling vector operations. It is understood that in the case that processor 108 includes multiple processing cores 112, each processing core 112 may carry out data and instruction operations in serial or in parallel. This multi-core processor design can effectively enhance the processing speed of processor 108 and multiply its performance. In some embodiments, processor 108 may be a vector processor in which processing core 112 includes VFUs for handling vector operations based on vector-specific data types, as opposed to global-variable data types, as disclosed below in detail.
[0030] Register array 114 may be operatively coupled to processing core 112 and primary memory 110 and include multiple sets of registers for various purposes. Because of their architecture design and proximity to processing core 112, register array 114 allows processor 108 to access data, execute instructions, and transfer computation results faster than primary memory 110, according to some embodiments. In some embodiments, register array 114 includes a plurality of physical registers fabricated on SoC 102, such as fast static random-access memory (RAM) having multiple transistors and multiple dedicated read and write ports for high-speed processing and simultaneous read and/or write operations, thus distinguishing it from primary memory 110 and secondary memory 106 (such as a dynamic random-access memory (DRAM), a hard drive, or the like). The register size may be measured by the number of bits a register can hold (e.g., 4 bits, 8 bits, 16 bits, 32 bits, 64 bits, etc.). In some embodiments, register array 114 serves as an intermediary memory placed between primary memory 110 and processing core 112. For example, register array 114 may hold frequently used programs or processing tools so that access time to these data can be reduced, thus increasing the processing speed of processor 108 while also reducing power consumption of SoC 102. In another example, register array 114 may store data being operated on by processing core 112, thus reducing delay in accessing the data from primary memory 110. Registers of this type are known as data registers. Another type is address registers, which may hold addresses and may be used by instructions for indirect access of primary memory 110. There are also status registers that decide whether a certain instruction should be executed, such as the CSR. In some embodiments, at least part of register array 114 is implemented by a physical register file (PRF) within processor 108.
[0031] Consistent with the scope of the present disclosure, in some embodiments, register array 114 includes a vector register configured to associate each vector with a corresponding data type and store the vector along with the corresponding data type of the vector, as described below in detail. The data type may be indicative of at least one of the number of data elements in the vector, the type of each data element, or the accuracy of each data element. This stored data type can replace the global variables indicating the data type of the vector operand, which are otherwise stored in the CSR separate from the vector register.
[0032] Control module 116 may be operatively coupled to primary memory 110 and processing core 112. Control module 116 may be implemented by circuits fabricated on the same semiconductor chip as processing core 112. Control module 116 may serve a role similar to that of a command tower. For example, control module 116 may retrieve and decode various computer instructions from primary memory 110 for processing core 112 and instruct processing core 112 on what processes are to be carried out on operands loaded from primary memory 110. Computer instructions may be in the form of a computer instruction set. Different computer instructions may have a different impact on the performance of processor 108. For example, instructions from a reduced instruction set computer (RISC) are generally simpler than those from a complex instruction set computer (CISC) and thus may be used to achieve fewer cycles per instruction, thereby reducing the processing time of processor 108. Examples of processes carried out by processor 108 include setting a register to a fixed value, copying data from a memory location to a register, adding, subtracting, multiplying, and dividing, comparing values stored in two different registers, etc. In some embodiments, control module 116 may further include an instruction decoder (not shown in FIG. 1, described below in detail) that decodes the computer instructions into instructions readable by other components on processor 108, such as processing core 112. The decoded instructions may be subsequently provided to processing core 112.
[0033] Consistent with the scope of the present disclosure, in some embodiments, control module 116 determines the data type of each vector when decoding an instruction to load the vector from primary memory 110 to register array 114, such that register array 114 can associate the determined data type with the vector and store the pair of vector and associated data type, for example, into the vector register. That is, a one-to-one mapping between vectors and their data types can be formed and recorded by control module 116 in conjunction with register array 114, as described below in detail.
[0034] It is understood that additional components, although not shown in FIG. 1, may be included in SoC 102 as well, such as interfacing components for data loading, storing, routing, or multiplexing within SoC 102, as described below in detail with respect to FIG. 2.
[0035] FIG. 2 illustrates a detailed block diagram of exemplary SoC 102 in system 100 of
FIG. 1, according to some embodiments of the present disclosure. As shown in FIG. 2, processor 108 may be configured to handle both vectors and scalars by including one or more vector function units (VFUs) 202 and one or more scalar function units (SFUs) 204. Each VFU 202 or SFU 204 may be fully pipelined and can perform arithmetic or logic operations on vectors and scalars, respectively. VFUs 202 and SFUs 204 may be parts of processing core 112 in FIG. 1.
[0036] Processor 108 in FIG. 2 may further include vector registers 206 and scalar registers
208 configured to store vectors and scalars, respectively, as parts of register array 114 in FIG. 1. Processor 108 may further include one or more multiplexer units (MUXs) 218 operatively coupled to vector registers 206 and VFUs 202 and configured to select among multiple input data to output. For example, two MUXs 218 may each select a vector operand from vector registers 206 and output it to VFUs 202 for vector operations on the two vector operands, and one MUX 218 may select an operation result and output it back to vector registers 206. Similarly, processor 108 may further include one or more MUXs 220 operatively coupled to scalar registers 208 and SFUs 204 and configured to select among multiple input data to output. As shown in FIG. 2, processor 108 may also include another register, CSR 210, as part of register array 114 in FIG. 1. CSR 210 may store additional information about the results of instructions, e.g., comparisons. In some embodiments, CSR 210 includes several independent flags such as carry, overflow, and zero. CSR 210 may be used to determine the outcome of conditional branch instructions or other forms of conditional execution. In some known systems, CSR 210 is also configured to store global variables indicative of the data type of the current vector operand, as described below in detail.
[0037] As shown in FIG. 2, processor 108 may further include data load/store units for moving data between primary memory 110 and registers 206 and 208, including a vector load/store unit 214 operatively coupled to vector registers 206 and a scalar load/store unit 216 operatively coupled to scalar registers 208. For example, vector load/store unit 214 may load vector operands from primary memory 110 to vector registers 206 and store vector results from vector registers 206 to primary memory 110; scalar load/store unit 216 may load scalar operands from primary memory 110 to scalar registers 208 and store scalar results from scalar registers 208 to primary memory 110. As described above, data, such as scalars and vectors, can be transferred and processed in data paths having components such as vector load/store unit 214, vector registers 206, MUXs 218, VFUs 202, CSR 210, scalar load/store unit 216, scalar registers 208, MUXs 220, and SFUs 204.
[0038] As shown in FIG. 2, in some embodiments, control module 116 of processor 108 may include an instruction fetch unit 211 and an instruction decode unit 212 (a.k.a., collectively, an instruction processing unit (IPU)). Instruction fetch unit 211 may be operatively coupled to primary memory 110 and configured to fetch instructions from primary memory 110 that are to be processed by processor 108. Instruction decode unit 212 may be operatively coupled to instruction fetch unit 211 and each of the components in the data paths described above and configured to decode each instruction and control the operations of each component in the data paths based on the decoded instruction, as described below in detail. Consistent with the scope of the present disclosure, in some embodiments, for vector processing, the instruction includes a single instruction, multiple data (SIMD) instruction or a vector instruction. Instruction decode unit 212 may determine whether the fetched instruction is a scalar instruction or a vector instruction. If it is a scalar instruction, scalar processing may be performed by SFUs 204 in conjunction with scalar registers 208. If it is a vector instruction, vector processing may be performed by VFUs 202 in conjunction with vector registers 206. For example, the address of the vector operand in primary memory 110 may be determined from the decoded instruction and provided to vector load/store unit 214 to load the vector operand into vector registers 206. At the same time, the decoded instruction may be provided to VFUs 202 to operate on the vector operand from vector registers 206.
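The scalar-versus-vector dispatch described above can be sketched in software. The following is an illustrative model only; the mnemonic convention (a "v" opcode prefix marking vector instructions, as in some ISAs) and the function names are assumptions, not part of the disclosed design.

```python
def dispatch(instruction: str) -> str:
    """Classify a fetched instruction and pick the data path that handles it."""
    opcode = instruction.split()[0]
    # Assumption for illustration: vector instructions carry a "v" mnemonic prefix.
    if opcode.startswith("v"):
        return "VFU"   # vector path: vector load/store unit + vector registers
    return "SFU"       # scalar path: scalar load/store unit + scalar registers

print(dispatch("vadd v0, v1, v2"))  # vector instruction routed to the VFU
print(dispatch("add x0, x1, x2"))   # scalar instruction routed to the SFU
```

In hardware this classification is performed by instruction decode unit 212, which then steers the decoded fields to the corresponding load/store unit and function unit.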
[0039] FIG. 3 illustrates a vector processing scheme using global variables. As shown in
FIG. 3, vector registers 302 include multiple register sets 308 (32 sets, from V0 to V31, in this example), each of which includes a number of registers 310 ([0] to [VLM-1], where VLM represents the “maximum vector length”). Each register set 308 stores a vector having a number of data elements not greater than VLM, each of which is stored in a respective register 310. The data types of vectors stored in vector registers 302 may vary, for example, with different numbers of data elements (as long as the number does not exceed VLM), different types of data elements (e.g., fixed-point numbers, floating-point numbers, etc.), and/or different accuracies of data elements (e.g., 8-bits, 12-bits, 16-bits, 24-bits, 32-bits, etc.). As vector registers 302 do not store the data type of each vector, in order for a vector function unit (VFU) 306 to know the data type of the current vector operand when performing a vector operation on it, a vector CSR 304 stores a plurality of global variables 312 indicative of the data type of the current vector operand, such as “vtype” for setting the type and accuracy of each data element of the vector (e.g., 16-bit signed integers, or 32-bit floating-point numbers), and “vl” for setting the number of data elements in the vector. Global variables 312 can further include the fixed-point rounding mode and saturation flag, and the resumption data element after a trap.
[0040] The vector processing scheme illustrated in FIG. 3 requires an instruction decode unit 314 to receive and decode special CSR instructions that set the global variables (GV) and update global variables 312 in vector CSR 304 each time the data type of the current vector operand changes. In the extreme case in which each vector operand has a different data type from the previous one, a GV-CSR instruction needs to be inserted before each vector operation instruction in order to keep updating the data type of the current vector operand. That is, when a VFU 306 performs a vector operation, VFU 306 needs to separately retrieve the data itself (e.g., vector operands V-A and V-B) from vector registers 302 and the data type (e.g., GV-A and GV-B) from vector CSR 304. For example, VFU 306 may retrieve the first vector operand V-A from vector registers 302 and retrieve the current data type GV-A from vector CSR 304. When retrieving the second vector operand V-B from vector registers 302, VFU 306 needs to check the current data type GV-B again as it may be different from the previous data type GV-A.
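The instruction-count overhead of this global-variable scheme can be illustrated with a small simulation. The sketch below is not actual ISA code; the mnemonics and type names are assumptions chosen to mirror the description above.

```python
def instruction_stream(op_types):
    """Return the instruction sequence needed for a list of vector-op data types
    under a global-variable scheme: a CSR write is required whenever the data
    type of the next operation differs from the currently programmed one."""
    stream, current_vtype = [], None
    for t in op_types:
        if t != current_vtype:                  # data type changed since last op
            stream.append(f"csrw vtype, {t}")   # extra GV-CSR instruction
            current_vtype = t
        stream.append(f"vop ({t})")
    return stream

# Worst case from paragraph [0022]: the type alternates, so every vector
# operation needs a preceding CSR write, doubling the instruction count.
worst = instruction_stream(["i16", "f32", "i16", "f32"])
print(len(worst))  # 8 instructions for 4 vector operations
```

With vector-specific data types, the `csrw` entries disappear entirely, since the data type travels with the vector in the vector register.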
[0041] The vector processing scheme illustrated in FIG. 3 thus introduces significant overhead to vector processing due to the need both to update global variables 312 in vector CSR 304 through special CSR instructions and to check the global variables in vector CSR 304 each time a new vector operand is retrieved by VFU 306. To solve these overhead issues, FIGs. 4A and 4B illustrate an exemplary vector processing scheme using vector-specific data types, according to some embodiments of the present disclosure. In some embodiments, the vector processing scheme is implemented by processor 108 described above in detail with respect to FIGs. 1 and 2. For ease of description, the details of the components in processor 108 may not be repeated.
[0042] As shown in FIG. 4A, vector registers 206, e.g., part of a physical register file, may include a plurality of register sets 402, each including a plurality of data registers 404 as well as a data type register 406 (or a data type field as part of a data register 404). The number of register sets 402 may vary in different implementations, such as 32 (from V0 to V31) as shown in FIG. 4A, which determines the maximum number of vectors that can be stored in vector registers 206 at the same time. The number of data registers 404 in each register set 402 may vary, but their total length may not exceed VLM. In some embodiments, VLM is the same for each register set 402, for example, 512 bits or 1,024 bits. It is understood that although VLM may be the same, the sizes of vectors stored in register sets 402 may vary, as a vector does not have to occupy the entire data registers 404, i.e., to reach VLM, as long as it does not exceed VLM. It is further understood that even for vectors having the same size, e.g., VLM, their data types may still vary as they can have different numbers of data elements, different types of data elements, and/or different accuracies of data elements. For example, a vector having a 512-bit length may include 16 data elements, each of which is a 32-bit floating-point number, or 32 data elements, each of which is a 16-bit fixed-point number. To correctly identify the data type of each vector stored in a respective register set 402 in vector registers 206, register set 402 also includes data type register 406 for storing the data type of the vector, which can replace global variables 312 in vector CSR 304 in FIG. 3. Thus, the size of each register set 402 may be the total of the size of data registers 404 and the size of data type register 406.
It is understood that compared with data registers 404, data type register 406 may have a significantly smaller size due to the limited information to be recorded therein and may thus add little overhead to the processor size. In some embodiments, the size of data type register 406 is not greater than 1/100 of the size of data registers 404. For example, the size of data type register 406 may be 4 bits, while the size of data registers 404 may be 512 bits or 1,024 bits.
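The register-set layout described above can be modeled as a simple data structure. The sizes and field names below are illustrative assumptions drawn from the examples in paragraph [0042], not a normative definition of register set 402.

```python
from dataclasses import dataclass, field

VLM_BITS = 512         # assumed maximum vector length per register set
TYPE_FIELD_BITS = 4    # assumed size of the data type register (<= 1/100 of VLM)

@dataclass
class RegisterSet:
    """One register set 402: data registers 404 plus a data type register 406."""
    data: bytes = b""   # vector payload, at most VLM_BITS bits
    dtype: str = ""     # e.g. "16xf32" = 16 elements, each a 32-bit float

# 32 register sets, V0..V31, bound the number of co-resident vectors.
vector_registers = [RegisterSet() for _ in range(32)]
```

Because each set carries its own `dtype` field, different data types can co-exist across the 32 sets at the same time, which a single global `vtype` variable cannot express.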
[0043] FIG. 4A shows the scheme of determining and associating the data type with a corresponding vector, according to some embodiments. Instruction decode unit 212 may be configured to decode an instruction to load a vector to determine the data type of the vector. For example, instruction decode unit 212 may receive an instruction to load a vector V (Instr: LOAD V) from primary memory 110 through instruction fetch unit 211. The load instruction may include information about the vector, for example, the address of the vector in primary memory 110. Based on the address information, instruction decode unit 212 may determine the data type of the vector, which may be indicative of at least one of the number of data elements in the vector (e.g., 1 to 32), the type of each data element (e.g., signed or unsigned fixed-point numbers (integers), or floating-point numbers), or the accuracy of each data element (e.g., 8-bits, 12-bits, 16-bits, 24-bits, or 32-bits). In some embodiments, besides the address information, the load instruction includes additional information for instruction decode unit 212 to determine the data type of the vector. Instruction decode unit 212 may perform the data type identification operation for each vector to be loaded to vector registers 206 to ensure that all vectors stored in vector registers 206 have their data types determined by instruction decode unit 212.
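One way the decode unit might derive a data type from an encoded field in the load instruction is sketched below. The bit layout, encoding tables, and the `count x kind width` string format are purely hypothetical; the disclosure does not specify the instruction encoding.

```python
# Assumed encodings, for illustration only.
ELEM_TYPES = {0: "u", 1: "i", 2: "f"}             # unsigned int, signed int, float
ELEM_WIDTHS = {0: 8, 1: 12, 2: 16, 3: 24, 4: 32}  # element accuracy in bits

def decode_dtype(type_field: int) -> str:
    """Decode a hypothetical type field of a load instruction into a data type
    string such as '16xf32' (16 elements, each a 32-bit float)."""
    count = (type_field >> 8) & 0x3F       # number of data elements (1 to 32)
    kind = ELEM_TYPES[(type_field >> 3) & 0x3]
    width = ELEM_WIDTHS[type_field & 0x7]
    return f"{count}x{kind}{width}"

# 16 elements, float kind (2), 32-bit width (code 4):
print(decode_dtype((16 << 8) | (2 << 3) | 4))  # '16xf32'
```

Whatever the actual encoding, the decoded data type is then forwarded to vector registers 206 alongside the vector loaded by vector load/store unit 214.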
[0044] As shown in FIG. 4A, instruction decode unit 212 is also configured to decode the load instruction to determine the address of the vector and provide the address information to vector load/store unit 214, such that vector load/store unit 214 can load the vector V from the corresponding memory address in primary memory 110 and send the vector to vector registers 206, according to some embodiments. Instruction decode unit 212 is also configured to provide the determined data type of the vector V to vector registers 206 as well, according to some embodiments.
[0045] Vector registers 206 may be operatively coupled to instruction decode unit 212 and vector load/store unit 214 and configured to associate each vector received from vector load/store unit 214 with a respective data type of the vector received from instruction decode unit 212 and store both the vector and the associated data type. For example, each pair of vector and associated data type may be stored into data registers 404 and data type register 406, respectively, of a respective register set 402. The association may be performed, for example, based on the address, or any other suitable identifiers, of the vector. It is understood that different from global variables 312 in FIG. 3 in which only one data type is identified at a particular time point for the current vector operand, different data types can co-exist at the same time for different vectors stored in vector registers 206. That is, vector registers 206 may be configured to store a plurality of vectors and a plurality of data types associated with the plurality of vectors, respectively.
[0046] FIG. 4B shows the scheme of performing vector operations on vectors using their associated data types, according to some embodiments. Instruction decode unit 212 may be configured to decode an instruction to operate on a vector of the plurality of vectors stored in vector registers 206. In some embodiments in which SIMD is implemented, the instruction is to operate on multiple vectors stored in vector registers 206. For example, instruction decode unit 212 may receive an instruction to operate on vectors V-A and V-B (Instr: OP V-A, V-B) from primary memory 110 through instruction fetch unit 211. The operation instruction may specify an arithmetic operation or a logic operation. Instruction decode unit 212 may decode information from the instruction including, for example, the operation (OP, e.g., add, subtract, multiply, divide, OR, AND, XOR, NAND, etc.) and the addresses or any other suitable identifiers of the vector operands (V-A and V-B), and provide the decoded information about the operation instruction to VFU 202.
[0047] VFU 202 may be operatively coupled to instruction decode unit 212 and vector registers 206 and configured to retrieve the vector and the data type associated with the vector from vector registers 206, and operate on the vector based on the data type. In some embodiments in which SIMD is implemented, VFU 202 may retrieve each of the multiple vector operands and their associated data types. For example, VFU 202 may retrieve the two vector operands V-A and V-B from data registers 404 of corresponding register sets 402, respectively, in vector registers 206 based on the decoded instruction information from instruction decode unit 212. VFU 202 may also retrieve the two data types DT-A and DT-B associated with the two vector operands V-A and V-B, respectively, from data type registers 406 of the corresponding register sets 402 based on the decoded instruction information as well. That is, different from VFU 306 in FIG. 3, which checks global variables 312 in vector CSR 304 to determine the data type of the current vector operand, VFU 202 may be further configured to refrain from retrieving the global variable from CSR 210 when operating on the vector. In some embodiments, CSR 210 does not include global variables for indicating the data type of the current vector operand. In some embodiments, CSR 210 still includes the global variables for indicating the data type of the current vector operand for backward-compatibility purposes, but VFU 202 may not retrieve such global variables when it can retrieve the vector-specific data types from vector registers 206.
[0048] In either case, VFU 202 may operate on the two vector operands V-A and V-B based on their data types, and return the result vector and its data type back to vector registers 206. That is, the vector resulting from the operation may be associated with a data type, and the two may be stored in data registers 404 and data type register 406, respectively, of a register set 402. For example, for a “+/-” operation with the data types of 32-bit signed integer and 32-bit unsigned integer, the data type of the result vector may be a 32-bit signed integer; for a “×” operation with the data types of 8-bit signed integer and 8-bit signed integer, the data type of the result vector may be a 16-bit signed short integer; for a “×” operation with the data types of 32-bit floating-point number and 32-bit integer, the data type of the result vector may be a 32-bit floating-point number.
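The result-data-type rules exemplified in paragraph [0048] can be written as a small lookup function. The rule set below covers only the three examples given above; the type-name strings (`i32`, `u32`, `i8`, `i16`, `f32`) are illustrative shorthand, not part of the disclosed encoding.

```python
def result_dtype(op: str, a: str, b: str) -> str:
    """Return the data type of a result vector for the example rules above."""
    if op in ("+", "-") and {a, b} == {"i32", "u32"}:
        return "i32"   # signed +/- unsigned 32-bit -> 32-bit signed integer
    if op == "x" and a == b == "i8":
        return "i16"   # 8-bit signed x 8-bit signed -> 16-bit signed short integer
    if op == "x" and {a, b} == {"f32", "i32"}:
        return "f32"   # 32-bit float x 32-bit integer -> 32-bit float
    raise NotImplementedError(f"no rule given for {a} {op} {b}")

print(result_dtype("+", "i32", "u32"))  # i32
print(result_dtype("x", "i8", "i8"))    # i16
```

In the disclosed scheme, this derived data type is written into data type register 406 of the destination register set together with the result vector, so later operations can consume the result without consulting any CSR.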
[0049] FIG. 5A illustrates a flow chart of an exemplary method 500 for vector operation using vector-specific data types, according to some embodiments of the present disclosure. Examples of the apparatus that can perform operations of method 500 include, for example, processor 108 depicted in FIGs. 1, 2, and 4A or any other suitable apparatus disclosed herein. It is understood that the operations shown in method 500 are not exhaustive and that other operations can be performed as well before, after, or between any of the illustrated operations. Further, some of the operations may be performed simultaneously, or in a different order than shown in FIG. 5A.
[0050] Referring to FIG. 5A, method 500 starts at operation 502, in which an instruction to load a vector is decoded to determine a data type of the vector. The data type may be indicative of at least one of the number of data elements in the vector, the type of each of the data elements, or the accuracy of each of the data elements. For example, as shown in FIG. 4A, instruction decode unit 212 of processor 108 may decode a vector load instruction to determine the data type of a vector. Method 500 proceeds to operation 504, as illustrated in FIG. 5A, in which the data type of the vector is stored in a vector register, for example, data type register 406 of register set 402 in vector registers 206 as shown in FIG. 4A. Method 500 proceeds to operation 506, as illustrated in FIG. 5A, in which the vector is loaded into the vector register. For example, as shown in FIG. 4A, vector load/store unit 214 may load the vector from primary memory 110 into data register 404 of register set 402 in vector registers 206 based on the decoded instruction, e.g., the address information of the vector, provided by instruction decode unit 212. Method 500 proceeds to operation 508, as illustrated in FIG. 5A, in which the vector is associated with the data type, for example, by vector registers 206 as shown in FIG. 4A.
[0051] FIG. 5B illustrates a flow chart of another exemplary method 510 for vector operation using vector-specific data types, according to some embodiments of the present disclosure. Examples of the apparatus that can perform operations of method 510 include, for example, processor 108 depicted in FIGs. 1, 2, and 4B or any other suitable apparatus disclosed herein. It is understood that the operations shown in method 510 are not exhaustive and that other operations can be performed as well before, after, or between any of the illustrated operations. Further, some of the operations may be performed simultaneously, or in a different order than shown in FIG. 5B.
[0052] Referring to FIG. 5B, method 510 starts at operation 512, in which an instruction to operate on the vector is decoded. In some embodiments, the operate instruction includes an instruction to operate on a first vector and a second vector. For example, as shown in FIG. 4B, instruction decode unit 212 may decode an operation instruction to operate on two vector operands stored in vector registers 206. Method 510 proceeds to operation 514, as illustrated in FIG. 5B, in which the vector and the data type are retrieved from the vector register. For example, as shown in FIG. 4B, VFU 202 may retrieve the first and second vector operands and their associated data types from vector registers 206. Method 510 proceeds to operation 516, as illustrated in FIG. 5B, in which the operation is performed on the vector based on the data type. For example, as shown in FIG. 4B, VFU 202 may operate on the first and second vector operands based on their data types.
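The two methods can be sketched end to end in one toy simulation: the load path stores the vector together with its data type (operations 502-508), and the operate path retrieves both from the vector register without touching any CSR (operations 512-516). The toy memory contents, register names, and the `"4xi16"` type string are illustrative assumptions.

```python
# Toy memory: address -> (vector elements, data type determined at decode time).
memory = {0x100: ([1, 2, 3, 4], "4xi16"), 0x200: ([5, 6, 7, 8], "4xi16")}
vector_registers = {}  # register name -> (vector, data type) pair

def load(reg: str, addr: int) -> None:
    """Operations 502-508: decode determines the data type, then the vector
    and its data type are stored together in the vector register."""
    vec, dtype = memory[addr]
    vector_registers[reg] = (vec, dtype)

def vadd(dst: str, src1: str, src2: str) -> None:
    """Operations 512-516: retrieve each operand with its data type from the
    vector register (no CSR lookup), then operate based on the data types."""
    (a, ta), (b, tb) = vector_registers[src1], vector_registers[src2]
    assert ta == tb, "element-wise add requires matching data types here"
    vector_registers[dst] = ([x + y for x, y in zip(a, b)], ta)

load("V0", 0x100)
load("V1", 0x200)
vadd("V2", "V0", "V1")
print(vector_registers["V2"])  # ([6, 8, 10, 12], '4xi16')
```

Note that `vadd` never consults a global type variable: the data type travels with each operand and is attached to the result, mirroring the association formed in FIG. 4A and consumed in FIG. 4B.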
[0053] In various aspects of the present disclosure, the functions described herein may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or encoded as instructions or code on a non-transitory computer-readable medium. Computer-readable media includes computer storage media. Storage media may be any available media that can be accessed by a computing device, such as system 100 in FIG. 1. By way of example, and not limitation, such computer-readable media can include RAM, read-only memory (ROM), electrically erasable programmable ROM (EEPROM), compact disc-ROM (CD-ROM) or other optical disk storage, hard disk drive (HDD), such as magnetic disk storage or other magnetic storage devices, flash drive, solid-state drive (SSD), or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a processing system, such as a mobile device or a computer. Disk and disc, as used herein, include CD, laser disc, optical disc, digital versatile disc (DVD), and floppy disk, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
[0054] According to one aspect of the present disclosure, a processor includes an instruction decode unit, a vector register operatively coupled to the instruction decode unit, and a vector load/store unit operatively coupled to the vector register. The instruction decode unit is configured to decode an instruction to load a first vector to determine a first data type of the first vector. The vector register is configured to store the first data type of the first vector. The vector load/store unit is configured to load the first vector into the vector register, such that the first vector is associated with the first data type in the vector register.
[0055] In some embodiments, the vector register is configured to store the first vector, and associate the first vector with the first data type.
[0056] In some embodiments, the vector register is configured to store a plurality of vectors and a plurality of data types associated with the plurality of vectors, respectively.
[0057] In some embodiments, the first data type is indicative of at least one of a number of data elements in the first vector, a type of each of the data elements, or an accuracy of each of the data elements.
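One way such a data type could be represented is as a small packed descriptor held in the data type register, recording the number of data elements, the element type, and the element accuracy. The bit layout below is purely an assumption for illustration (the disclosure does not specify an encoding), and it assumes power-of-two lane counts and element widths.

```python
# Hypothetical packing of a vector-specific data type descriptor:
#   bits [7:4] log2(number of elements) | [3:2] element kind | [1:0] log2(bits/8)
KIND = {"int": 0, "uint": 1, "float": 2}

def encode_type(num_elements: int, kind: str, bits: int) -> int:
    # num_elements and bits are assumed to be powers of two.
    lanes_field = num_elements.bit_length() - 1
    width_field = (bits // 8).bit_length() - 1
    return lanes_field << 4 | KIND[kind] << 2 | width_field

def decode_type(desc: int):
    lanes = 1 << (desc >> 4)
    kind = [k for k, v in KIND.items() if v == (desc >> 2) & 0b11][0]
    bits = 8 << (desc & 0b11)
    return lanes, kind, bits

desc = encode_type(8, "float", 16)   # e.g. 8 lanes of 16-bit floating point
```

Encoding and decoding round-trip, so a single small field in the register set suffices to reconstruct all three properties.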
[0058] In some embodiments, the instruction decode unit is further configured to decode an instruction to operate the first vector. In some embodiments, the processor further includes a vector function unit operatively coupled to the vector register and configured to retrieve the first vector and the first data type from the vector register, and operate on the first vector based on the first data type.
[0059] In some embodiments, the instruction includes an instruction to operate the first vector and a second vector, and the vector function unit is further configured to retrieve the second vector and a second data type associated with the second vector from the vector register, and operate on the first and second vectors based on the first and second data types.
[0060] In some embodiments, the first data type is different from the second data type.
[0061] In some embodiments, the processor further includes a control and status register configured to store a global variable, and the vector function unit is further configured to refrain from retrieving the global variable from the control and status register when operating on the first vector.
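The benefit of refraining from the control and status register can be sketched in software, contrasting a global-variable scheme with the per-register scheme. The CSR class and the read counter are illustrative assumptions used only to make the absence of the global lookup observable.

```python
# Sketch: with the data type stored next to each vector operand, the
# function unit never touches the control and status register on the
# operate path.
class CSR:
    """Stand-in control and status register holding a global type variable."""
    def __init__(self, global_type):
        self.global_type = global_type
        self.reads = 0
    def read(self):
        self.reads += 1
        return self.global_type

def operate(vec, vec_type, csr):
    # The type travels with the operand, so csr.read() is never called here.
    assert vec_type is not None
    return [v * 2 for v in vec]   # stand-in element-wise operation

csr = CSR("int16")
out = operate([3, 5], "int8", csr)
# csr.reads is still 0: no global-state lookup was needed
```

Avoiding the global lookup also means two in-flight operations with different data types need no save/restore of a shared mode register.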
[0062] According to another aspect of the present disclosure, an SoC includes a memory configured to store a vector and an instruction to load the vector, and a processor operatively coupled to the memory. The processor includes an instruction decode unit, a vector register operatively coupled to the instruction decode unit, and a vector load/store unit operatively coupled to the vector register. The instruction decode unit is configured to decode the instruction to load the vector to determine a data type of the vector. The vector register is configured to store the data type of the vector. The vector load/store unit is configured to load the vector from the memory into the vector register, such that the vector is associated with the data type in the vector register.
[0063] According to still another aspect of the present disclosure, a processor includes a vector register, an instruction decode unit, and a vector function unit operatively coupled to the instruction decode unit and the vector register. The vector register is configured to store a plurality of vectors and a plurality of data types associated with the plurality of vectors, respectively. The instruction decode unit is configured to decode an instruction to operate a first vector of the plurality of vectors. The vector function unit is configured to retrieve the first vector and a first data type of the plurality of data types that is associated with the first vector from the vector register, and operate on the first vector based on the first data type.
[0064] In some embodiments, the instruction includes an instruction to operate the first vector and a second vector of the plurality of vectors, and the vector function unit is further configured to retrieve the second vector and a second data type of the plurality of data types that is associated with the second vector from the vector register, and operate on the first and second vectors based on the first and second data types.
[0065] In some embodiments, the first data type is different from the second data type.
[0066] In some embodiments, the first data type is indicative of at least one of a number of data elements in the first vector, a type of each of the data elements, or an accuracy of each of the data elements.
[0067] In some embodiments, the instruction decode unit is further configured to decode an instruction to load the first vector to determine the first data type of the first vector. In some embodiments, the processor further includes a vector load/store unit operatively coupled to the vector register and configured to load the first vector into the vector register, such that the first vector is associated with the first data type in the vector register.
[0068] In some embodiments, the processor further includes a control and status register configured to store a global variable, and the vector function unit is further configured to refrain from retrieving the global variable from the control and status register when operating on the first vector.
[0069] According to yet another aspect of the present disclosure, an SoC includes a memory configured to store an instruction to operate a vector and a processor operatively coupled to the memory. The processor includes a vector register, an instruction decode unit, and a vector function unit operatively coupled to the instruction decode unit and the vector register. The vector register is configured to store the vector and a data type associated with the vector. The instruction decode unit is configured to decode an instruction to operate the vector. The vector function unit is configured to retrieve the vector and the data type associated with the vector from the vector register, and operate on the vector based on the data type.
[0070] According to yet another aspect of the present disclosure, a method for vector operation is disclosed. An instruction to load a vector is decoded to determine a data type of the vector. The data type of the vector is stored in a vector register. The vector is loaded into the vector register. The vector is associated with the data type.
[0071] In some embodiments, an instruction to operate the vector is decoded, the vector and the data type are retrieved from the vector register, and the vector is operated on based on the data type.
[0072] According to yet another aspect of the present disclosure, a method for vector operation is disclosed. A plurality of vectors and a plurality of data types associated with the plurality of vectors, respectively, are stored in a vector register. An instruction to operate a first vector of the plurality of vectors is decoded. The first vector and a first data type of the plurality of data types that is associated with the first vector are retrieved from the vector register. The first vector is operated on based on the first data type.
[0073] In some embodiments, the instruction includes an instruction to operate the first vector and a second vector of the plurality of vectors. In some embodiments, the second vector and a second data type of the plurality of data types that is associated with the second vector are retrieved from the vector register, and the first and second vectors are operated on based on the first and second data types.
[0074] In some embodiments, an instruction to load the first vector is decoded to determine the first data type of the first vector, the first data type of the first vector is stored in the vector register, the first vector is loaded into the vector register, and the first vector is associated with the first data type.
[0075] The foregoing description of the specific embodiments will reveal the general nature of the present disclosure that others can, by applying knowledge within the skill of the art, readily modify and/or adapt for various applications such specific embodiments, without undue experimentation, without departing from the general concept of the present disclosure. Therefore, such adaptations and modifications are intended to be within the meaning and range of equivalents of the disclosed embodiments, based on the teaching and guidance presented herein. It is to be understood that the phraseology or terminology herein is for the purpose of description and not of limitation, such that the terminology or phraseology of the present specification is to be interpreted by the skilled artisan in light of the teachings and guidance.
[0076] Embodiments of the present disclosure have been described above with the aid of functional building blocks illustrating the implementation of specified functions and relationships thereof. The boundaries of these functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternate boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed.
[0077] The Summary and Abstract sections may set forth one or more but not all exemplary embodiments of the present disclosure as contemplated by the inventor(s), and thus, are not intended to limit the present disclosure and the appended claims in any way.
[0078] Various functional blocks, modules, and steps are disclosed above. The particular arrangements provided are illustrative and without limitation. Accordingly, the functional blocks, modules, and steps may be re-ordered or combined in different ways than in the examples provided above. Likewise, certain embodiments include only a subset of the functional blocks, modules, and steps, and any such subset is permitted.
[0079] The breadth and scope of the present disclosure should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.

Claims

WHAT IS CLAIMED IS:
1. A processor, comprising: an instruction decode unit configured to decode an instruction to load a first vector to determine a first data type of the first vector; a vector register operatively coupled to the instruction decode unit and configured to store the first data type of the first vector; and a vector load/store unit operatively coupled to the vector register and configured to load the first vector into the vector register, such that the first vector is associated with the first data type in the vector register.
2. The processor of claim 1, wherein the vector register is configured to store the first vector, and associate the first vector with the first data type.
3. The processor of claim 2, wherein the vector register is configured to store a plurality of vectors and a plurality of data types associated with the plurality of vectors, respectively.
4. The processor of claim 1, wherein the first data type is indicative of at least one of a number of data elements in the first vector, a type of each of the data elements, or an accuracy of each of the data elements.
5. The processor of claim 1, wherein the instruction decode unit is further configured to decode an instruction to operate the first vector; and the processor further comprises a vector function unit operatively coupled to the vector register and configured to retrieve the first vector and the first data type from the vector register, and operate on the first vector based on the first data type.
6. The processor of claim 5, wherein the instruction comprises an instruction to operate the first vector and a second vector; and the vector function unit is further configured to retrieve the second vector and a second data type associated with the second vector from the vector register, and operate on the first and second vectors based on the first and second data types.
7. The processor of claim 6, wherein the first data type is different from the second data type.
8. The processor of claim 5, further comprising a control and status register configured to store a global variable, wherein the vector function unit is further configured to refrain from retrieving the global variable from the control and status register when operating on the first vector.
9. A system-on-a-chip (SoC), comprising: a memory configured to store a vector and an instruction to load the vector; and a processor operatively coupled to the memory and comprising: an instruction decode unit configured to decode the instruction to load the vector to determine a data type of the vector; a vector register operatively coupled to the instruction decode unit and configured to store the data type of the vector; and a vector load/store unit operatively coupled to the vector register and configured to load the vector from the memory into the vector register, such that the vector is associated with the data type in the vector register.
10. A processor, comprising: a vector register configured to store a plurality of vectors and a plurality of data types associated with the plurality of vectors, respectively; an instruction decode unit configured to decode an instruction to operate a first vector of the plurality of vectors; and a vector function unit operatively coupled to the instruction decode unit and the vector register and configured to retrieve the first vector and a first data type of the plurality of data types that is associated with the first vector from the vector register, and operate on the first vector based on the first data type.
11. The processor of claim 10, wherein the instruction comprises an instruction to operate the first vector and a second vector of the plurality of vectors; and the vector function unit is further configured to retrieve the second vector and a second data type of the plurality of data types that is associated with the second vector from the vector register, and operate on the first and second vectors based on the first and second data types.
12. The processor of claim 11, wherein the first data type is different from the second data type.
13. The processor of claim 10, wherein the first data type is indicative of at least one of a number of data elements in the first vector, a type of each of the data elements, or an accuracy of each of the data elements.
14. The processor of claim 10, wherein the instruction decode unit is further configured to decode an instruction to load the first vector to determine the first data type of the first vector; and the processor further comprises a vector load/store unit operatively coupled to the vector register and configured to load the first vector into the vector register, such that the first vector is associated with the first data type in the vector register.
15. The processor of claim 10, further comprising a control and status register configured to store a global variable, wherein the vector function unit is further configured to refrain from retrieving the global variable from the control and status register when operating on the first vector.
16. A system-on-a-chip (SoC), comprising: a memory configured to store an instruction to operate a vector; and a processor operatively coupled to the memory and comprising: a vector register configured to store the vector and a data type associated with the vector; an instruction decode unit configured to decode the instruction to operate the vector; and a vector function unit operatively coupled to the instruction decode unit and the vector register and configured to retrieve the vector and the data type associated with the vector from the vector register, and operate on the vector based on the data type.
17. A method for vector operation, comprising: decoding an instruction to load a vector to determine a data type of the vector; storing the data type of the vector in a vector register; loading the vector into the vector register; and associating the vector with the data type.
18. The method of claim 17, further comprising: decoding an instruction to operate the vector; retrieving the vector and the data type from the vector register; and operating on the vector based on the data type.
19. A method for vector operation, comprising: storing a plurality of vectors and a plurality of data types associated with the plurality of vectors, respectively, in a vector register; decoding an instruction to operate a first vector of the plurality of vectors; retrieving the first vector and a first data type of the plurality of data types that is associated with the first vector from the vector register; and operating on the first vector based on the first data type.
20. The method of claim 19, wherein the instruction comprises an instruction to operate the first vector and a second vector of the plurality of vectors; and the method further comprises: retrieving the second vector and a second data type of the plurality of data types that is associated with the second vector from the vector register; and operating on the first and second vectors based on the first and second data types.
21. The method of claim 19, further comprising: decoding an instruction to load the first vector to determine the first data type of the first vector; storing the first data type of the first vector in the vector register; loading the first vector into the vector register; and associating the first vector with the first data type.
Application PCT/US2021/022229, filed 2021-03-12 (priority 2021-03-12): Vector processing using vector-specific data type, WO2022191859A1 (en)


Publication: WO2022191859A1, published 2022-09-15 (Family ID: 83227230)

Patent Citations (6) — * cited by examiner, † cited by third party

- US20020062436A1 * (priority 1997-10-09, published 2002-05-23), Timothy J. Van Hook: Method for providing extended precision in SIMD vector arithmetic operations
- US20060129787A1 * (priority 1997-10-09, published 2006-06-15), MIPS Technologies, Inc.: Providing extended precision in SIMD vector arithmetic operations
- US20110264720A1 * (priority 2005-12-30, published 2011-10-27), Wajdi Feghali: Cryptographic system, method and multiplier
- US20110231635A1 * (priority 2010-03-22, published 2011-09-22), Samsung Electronics Co., Ltd.: Register, Processor, and Method of Controlling a Processor
- US20140047211A1 * (priority 2012-08-09, published 2014-02-13), International Business Machines Corporation: Vector register file
- US20140095828A1 * (priority 2012-09-28, published 2014-04-03), Mikhail Plotnikov: Vector move instruction controlled by read and write masks


Legal Events

121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document: 21930569; Country: EP; Kind code: A1)
NENP: non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry into European phase (Ref document: 21930569; Country: EP; Kind code: A1)