CN112860320A - Method, system, device and medium for data processing based on RISC-V instruction set


Info

Publication number: CN112860320A
Application number: CN202110175746.6A
Authority: CN (China)
Prior art keywords: instruction, data, cache, coefficient, vector
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Other languages: Chinese (zh)
Inventor: 贾兆荣 (Jia Zhaorong)
Current assignee: Shandong Yingxin Computer Technology Co Ltd (the listed assignee may be inaccurate; Google has not performed a legal analysis)
Original assignee: Shandong Yingxin Computer Technology Co Ltd
Application filed by Shandong Yingxin Computer Technology Co Ltd
Priority application: CN202110175746.6A (the priority date is an assumption and is not a legal conclusion)
Publication: CN112860320A
PCT application: PCT/CN2022/074414, published as WO2022170997A1

Classifications

    • G - PHYSICS > G06 - COMPUTING; CALCULATING OR COUNTING > G06F - ELECTRIC DIGITAL DATA PROCESSING > G06F 9/00 - Arrangements for program control, e.g. control units > G06F 9/06 - using stored programs > G06F 9/30 - Arrangements for executing machine instructions, e.g. instruction decode
        • G06F 9/30047 - Prefetch instructions; cache control instructions
        • G06F 9/30007 - Arrangements for executing specific machine instructions to perform operations on data operands
        • G06F 9/3005 - Arrangements for executing specific machine instructions to perform operations for flow control
        • G06F 9/30105 - Register structure
        • G06F 9/38 - Concurrent instruction execution, e.g. pipeline or look ahead > G06F 9/3802 - Instruction prefetching
            • G06F 9/3804 - Instruction prefetching for branches, e.g. hedging, branch folding
            • G06F 9/3814 - Implementation provisions of instruction buffers, e.g. prefetch buffer; banks
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS > Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE > Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
        • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Advance Control (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention discloses a method, a system, a device and a storage medium for data processing based on a RISC-V instruction set. The method comprises the following steps: fetching an instruction from a RISC-V instruction space, caching it in a cache, and judging the type of the instruction; in response to the instruction being a branch jump instruction, regenerating an instruction address and jumping to the corresponding branch according to the instruction address; in response to jumping to the AIPU branch, storing the feature data and coefficient data for the current convolution operation in a first-level input feature cache and a first-level coefficient cache, and storing the feature data and coefficient data for the next convolution operation in a second-level input feature cache and a second-level coefficient cache; and performing a convolution operation on the corresponding feature data and coefficient data, and activating, normalizing and pooling the operation result. The invention designs a processor architecture with three instruction branches based on the RISC-V instruction set, realizes general control, vector operation, and accelerated convolution and matrix computation, and is suitable for an edge-side AI inference chip.

Description

Method, system, device and medium for data processing based on RISC-V instruction set
Technical Field
The present invention relates to the field of data processing, and more particularly, to a method, a system, a computer device and a readable medium for data processing based on the RISC-V instruction set.
Background
The value of data lies in analysis and use, not mere storage. Data volume keeps growing, yet not all of it can be sent over the network to the cloud, because bandwidth grows more slowly than data does. Application scenarios with strict real-time requirements, such as automatic and unmanned driving, need decisions made at the edge. Scenarios with strict privacy requirements, such as medical information or data a user does not want to share with the cloud, need the data kept local. Moreover, most data produced by, for example, security devices is useless or has no mining value; transmitting all of it to the cloud wastes bandwidth, whereas performing intelligent analysis at the edge and uploading only useful or potentially useful data saves a great deal of network bandwidth. Data processing is therefore inevitably shifting from the cloud to the edge, and an edge AI (artificial intelligence) chip is the natural consequence.
Artificial-intelligence processing at the edge requires an AI chip, whose main challenges are computing power and computing efficiency. The computing power of an AI chip is determined by the number of on-chip computation units. Since AI computation involves very large amounts of data, in theory the more computing power the better; in practice, computing power is limited by several factors:
1. On-chip memory bandwidth and bus bandwidth: the central contradiction of an AI chip is that between memory bandwidth and computing power. The greater the computing power, the larger the input data, intermediate results and output data, and the higher the required memory bandwidth; current memory bandwidth falls far short of what the computation demands. If the computation units and storage units are not arranged sensibly, the result is high nominal computing power but low efficiency.
2. AI computation involves many operators, such as convolution, matrix computation, normalization, activation, pooling and other linear and nonlinear calculations. Deep neural network models usually consist of multiple layers, with the output of one layer serving as the input of the next; within a layer, the result of the multiply-add operation is often the input of activation, pooling and normalization. If multithreading, parallel computing and compute pipelining cannot be arranged sensibly, one step's computation blocks the next, wasting resources and reducing computing efficiency.
3. As shown in FIG. 2, AI involves varied operators, but the AI chip hardware is fixed and invariant. Making fixed hardware process variable operators efficiently requires software that can allocate hardware resources according to the hardware architecture and compile efficient machine code. The AI chip must also offer efficient control capability.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method, a system, a computer device and a computer-readable storage medium for data processing based on the RISC-V instruction set. An AIPU (artificial intelligence unit) and the CPU share memory, so that computation sits next to storage: memory access bandwidth is improved, data interaction between the AIPU and the CPU is convenient, the amount of data exchanged with the external bus is reduced, and the demand on bus bandwidth is lowered. In addition, the AIPU and the CPU each contain small buffers for caching input data, intermediate results, output data and the instructions pre-fetched by the CPU, so that data can be loaded while data is being computed, the time available for reading and writing data is extended, and the demand on bus bandwidth is further reduced.
In view of the above, an aspect of the embodiments of the present invention provides a method for data processing based on the RISC-V instruction set, comprising the following steps: fetching an instruction from a RISC-V instruction space, caching it in a cache, and judging the type of the instruction; in response to the instruction being a branch jump instruction, regenerating an instruction address and jumping to the corresponding branch according to the instruction address; in response to jumping to the AIPU branch, storing the feature data and coefficient data for the current convolution operation in a first-level input feature cache and a first-level coefficient cache, and storing the feature data and coefficient data for the next convolution operation in a second-level input feature cache and a second-level coefficient cache; and performing a convolution operation on the corresponding feature data and coefficient data, and activating, normalizing and pooling the operation result.
In some embodiments, the method further comprises: in response to jumping to a vector architecture branch, performing a vector operation in accordance with the instruction.
In some embodiments, the method further comprises: in response to the instruction being a load or store instruction, reading the data at the storage-space address given by the source operand into the destination operand.
In some embodiments, the method further comprises: judging whether the vector registers corresponding to the vector source operands are in the same group; and in response to the vector registers corresponding to the vector source operands not being in the same group, enabling two ports with the same bit width as the vector registers to read and write simultaneously.
In some embodiments, the method further comprises: configuring the register file in the AIPU branch as two parts, wherein the first part runs the current AIPU operation and the second part acquires the parameters required for the next AIPU operation.
In some embodiments, the method further comprises: reading the corresponding data, performing dimension conversion on the data according to the operation requirement, and writing the converted data into the corresponding coefficient cache or input feature cache.
In some embodiments, the method further comprises: in response to convolution calculation, reading the data in the first-level input feature cache and the first-level coefficient cache, and judging whether the remaining space of the first-level input feature cache and the first-level coefficient cache is larger than the size of the next group of data; and in response to the remaining space of the first-level input feature cache and the first-level coefficient cache being larger than the size of the next group of data, enabling the write cache.
In another aspect of the embodiments of the present invention, a system for data processing based on the RISC-V instruction set is provided, comprising: an acquisition module configured to fetch an instruction from a RISC-V instruction space, cache it in a cache, and judge the type of the instruction; a jump module configured to, in response to the instruction being a branch jump instruction, regenerate an instruction address and jump to the corresponding branch according to the instruction address; an AIPU module configured to, in response to jumping to the AIPU branch, store the feature data and coefficient data for the current convolution operation in a first-level input feature cache and a first-level coefficient cache, and store the feature data and coefficient data for the next convolution operation in a second-level input feature cache and a second-level coefficient cache; and an execution module configured to perform a convolution operation on the corresponding feature data and coefficient data, and activate, normalize and pool the operation result.
In another aspect of the embodiments of the present invention, there is also provided a computer device, including: at least one processor; and a memory storing computer instructions executable on the processor, the instructions when executed by the processor implementing the steps of the method as above.
In a further aspect of the embodiments of the present invention, a computer-readable storage medium is also provided, storing a computer program which, when executed by a processor, implements the above method steps.
The invention has the following beneficial technical effects: the AIPU (artificial intelligence unit) and the CPU share memory, so that computation sits next to storage, memory access bandwidth is improved, data interaction between the AIPU and the CPU is convenient, the amount of data exchanged with the external bus is reduced, and the demand on bus bandwidth is lowered. In addition, small buffers are provided in the AIPU and the CPU for caching input data, intermediate results, output data and the instructions pre-fetched by the CPU, so that data can be loaded while data is being computed, the time available for reading and writing data is extended, and the demand on bus bandwidth is further reduced.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other embodiments can be obtained by using the drawings without creative efforts.
FIG. 1 is a diagram illustrating an embodiment of a method for data processing based on RISC-V instruction set according to the present invention;
FIG. 2 is a schematic diagram of a CPU architecture according to an embodiment of the present invention;
FIG. 3 is a diagram of the AIPU architecture provided by the present invention;
FIG. 4 is a diagram illustrating convolution operations according to an embodiment of the method for processing data based on RISC-V instruction set;
FIG. 5 is a diagram of a hardware structure of an embodiment of a computer device for data processing based on RISC-V instruction set according to the present invention;
FIG. 6 is a diagram of an embodiment of a computer storage medium for data processing based on a RISC-V instruction set.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.
It should be noted that all expressions using "first" and "second" in the embodiments of the present invention are used to distinguish two entities or parameters that share a name. "First" and "second" are merely for convenience of description and should not be construed as limiting the embodiments of the present invention, and the following embodiments do not repeat this note.
In view of the above objects, a first aspect of embodiments of the present invention proposes an embodiment of a method for data processing based on a RISC-V instruction set. FIG. 1 is a diagram illustrating an embodiment of a method for data processing based on a RISC-V instruction set according to the present invention. As shown in fig. 1, the embodiment of the present invention includes the following steps:
S1, fetching an instruction from the RISC-V instruction space, caching it in a cache, and judging the type of the instruction;
S2, in response to the instruction being a branch jump instruction, regenerating an instruction address and jumping to the corresponding branch according to the instruction address;
S3, in response to jumping to the AIPU branch, storing the feature data and coefficient data for the current convolution operation in the first-level input feature cache and first-level coefficient cache, and storing the feature data and coefficient data for the next convolution operation in the second-level input feature cache and second-level coefficient cache; and
S4, performing a convolution operation on the corresponding feature data and coefficient data, and activating, normalizing and pooling the operation result.
The embodiment of the invention adopts an integrated storage-computation structure in which the AIPU and the CPU share memory, so that computation sits next to storage, memory access bandwidth is improved, data interaction between the AIPU and the CPU is convenient, the amount of data exchanged with the external bus is reduced, and the demand on bus bandwidth is lowered. In addition, small buffers are provided in the AIPU and the CPU for caching input data, intermediate results, output data and the instructions pre-fetched by the CPU, so that data can be loaded while data is being computed, the time available for reading and writing data is extended, and the demand on bus bandwidth is further reduced.
The RISC-V instruction set includes a general instruction set and a vector extension instruction set, and can be divided into: the integer instruction set I, the integer multiplication and division instruction set M, the atomic instruction set A, the single-precision floating-point instruction set F, the double-precision floating-point instruction set D, the compressed instruction set C, and the vector instruction set V. The arithmetic logic unit executes the IMAFDC instruction sets, and the vector operation unit executes the vector instruction set V. The CPU architecture is designed around the RISC-V instruction set; the CPU's role is to run system code, completing system control and data operations.
FIG. 2 is a schematic diagram of the CPU architecture in an embodiment of the present invention. As shown in FIG. 2, the CPU adopts a two-stage pipeline. The first stage is instruction fetch, responsible for fetching an instruction from the instruction storage space and caching it in the instruction cache. The second stage decodes and executes the instruction. Decoding determines the instruction type (vector instruction or ordinary instruction), and the corresponding data operation is started according to the instruction type and opcode. For a vector add instruction, for example, data is read from the vector data store into a vector register, the operation completes in the vector operation unit, and the result is cached in the vector data cache.
The purpose of the vector data cache is as follows: in AI inference, vector operations are usually not independent; a computation is often completed by pipelining several vector operations. If an intermediate result were stored to SRAM (static random-access memory), the vector data might need several cycles to store or read back, greatly lengthening the vector computation cycle. With a vector data cache, data can be loaded into the cache before vector computation starts, and after the computation finishes the final result is stored to the data SRAM. The pre-reading of vector data and the storing of results can be completed during other operations, shortening the vector operation period. The port of the vector data cache is wide, meeting the bandwidth requirement of the vector operation unit.
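The two-stage flow described above (fetch into a small instruction cache, then decode and dispatch by type) can be sketched in a few lines of C. This is an illustrative model, not the patent's hardware: the opcode values follow the published RV32 base encoding (bits [6:0] of the instruction word), while the cache contents and dispatch messages are invented for the example.

```c
#include <stdint.h>
#include <stdio.h>

/* Stage 2 of the pipeline: classify a fetched instruction word and
 * dispatch it. Opcodes follow the standard RV32 encoding. */
typedef enum { GENERAL, VECTOR, BRANCH, LOADSTORE } insn_class;

static insn_class classify(uint32_t insn) {
    switch (insn & 0x7fu) {       /* opcode field, bits [6:0] */
    case 0x57u: return VECTOR;    /* OP-V: vector arithmetic   */
    case 0x63u: return BRANCH;    /* conditional branch        */
    case 0x03u:                   /* load                      */
    case 0x23u: return LOADSTORE; /* store                     */
    default:    return GENERAL;
    }
}

int main(void) {
    /* Stage 1 has already fetched these words into the instruction cache. */
    uint32_t icache[] = { 0x020081d7u,   /* a vector instruction (opcode 0x57) */
                          0x00000063u }; /* beq x0, x0, 0 (opcode 0x63)        */
    for (unsigned i = 0; i < sizeof icache / sizeof icache[0]; i++) {
        switch (classify(icache[i])) {
        case VECTOR:    puts("vector unit: operands staged through the vector data cache"); break;
        case BRANCH:    puts("branch: if taken, regenerate the pc and clear the instruction cache"); break;
        case LOADSTORE: puts("load/store unit"); break;
        default:        puts("arithmetic logic unit");
        }
    }
    return 0;
}
```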
An instruction is fetched from the RISC-V instruction space and cached, and whether it is a branch jump instruction is judged. In response to the instruction being a branch jump instruction, the instruction address is regenerated and execution jumps to the corresponding branch according to that address. When a branch jump instruction is encountered and the branch is taken (or the jump is unconditional), the PC (instruction address) is regenerated and the instructions in the instruction cache are cleared.
The architecture has three branches: a general architecture branch, which supports the general instruction sets and implements the CPU functions; a vector architecture branch, which supports the RISC-V vector instruction set and performs vector operations; and an AIPU branch, which supports general load/store instructions and custom user instructions and completes specialized intensive computation such as convolution and matrix multiplication. The AIPU branch connects to the AIPU architecture: it configures the registers of all AIPU functional modules through the CPU's load/store instructions, and the functional modules inside the AIPU are then controlled only by those registers, with no further CPU instructions involved. Computation is therefore efficient but not very flexible, suiting specialized large-scale computation. The vector architecture branch is controlled by the CPU's vector instructions, each step requiring instruction control; it is more flexible than the AIPU but less efficient, suiting small-batch, diversified vector computation. Since vector operations involve a great deal of data, accelerating vector loads and stores is critical.
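To make the register-driven AIPU branch concrete, the following bare-metal-style C sketch launches one layer purely through memory-mapped register writes, each compiling to an ordinary store instruction. The register layout, field names and base address are assumptions made for the example; the patent states only that the AIPU's functional modules are controlled by registers configured via load/store.

```c
#include <stdint.h>

/* Hypothetical memory-mapped view of the AIPU configuration registers. */
typedef struct {
    volatile uint32_t feature_base; /* input feature address in shared SRAM */
    volatile uint32_t coeff_base;   /* coefficient (weight) address         */
    volatile uint32_t dims;         /* packed tensor dimensions             */
    volatile uint32_t start;        /* write 1 to launch the layer          */
} aipu_regs;

#define AIPU ((aipu_regs *)0x40001000u) /* invented base address */

static void aipu_run_layer(uint32_t feat, uint32_t coef, uint32_t dims) {
    AIPU->feature_base = feat; /* each assignment is one store instruction */
    AIPU->coeff_base   = coef;
    AIPU->dims         = dims;
    AIPU->start        = 1;    /* from here the hardware sequences itself;
                                  no further CPU instructions are needed  */
}
```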
In response to jumping to the AIPU branch, the feature data and coefficient data for the current convolution operation are stored in the first-level input feature cache and the first-level coefficient cache, and the feature data and coefficient data for the next convolution operation are stored in the second-level input feature cache and the second-level coefficient cache. The input feature vector cache and the coefficient vector cache mainly buffer the data the multiply-add unit is to compute in the current clock cycle; the data are computed in parallel as vectors. Since these data cannot all be read out of the input feature (or coefficient) cache in a single cycle, the reuse of input feature data and coefficient (weight) data inherent in convolution must be exploited so that data reading is hidden inside data computation and the whole computing process is never interrupted.
FIG. 3 is a schematic diagram of the AIPU architecture provided by the present invention. As shown in FIG. 3, the AIPU architecture includes a register file, a DMA, read/write interface arbitration, an address generator, a convolution timing controller, vector caches, a multiply-add operation matrix, an intermediate result accumulator, and special vector operation units.
The core of the AIPU is the multiply-add operation matrix, which contains a large number of multiplier and adder resources and performs parallel, high-speed multiply-add operations, meeting the computing-power demand of intensive convolution/matrix operations. The other modules exist to make convolution more efficient. Data multiplexing addresses the contradiction that computation needs much data while the data bus and SRAM bandwidth are limited: read data is reused as much as possible, easing the bandwidth pressure. The caches absorb the differences in data throughput between upstream and downstream modules, reducing stalls so that all functional modules run at full speed. The vector operation units provide the algorithmic support the convolution algorithm needs, so that once data is read the required operations can be finished before the data is stored, instead of reading the data repeatedly to complete one full convolution. The address generator, together with read/write control, arranges data through different read/write orders, making convolution more efficient. The convolutional neural networks used in AI computation generally have many layers, which an AI inference chip must compute layer by layer, each layer containing a large number of convolution or matrix operations; with a ping-pong register in place, the parameters required for the next layer of AIPU computation, such as data dimensions, can be configured while the current layer is being computed.
FIG. 4 is a diagram illustrating convolution operations in an embodiment of the method for data processing based on the RISC-V instruction set according to the present invention. As shown in FIG. 4, in one computation of the multiply-add matrix, the vector block f0 undergoes multiply-add operations with w0 … w7 simultaneously (a vector block contains several vector elements; a multiply-add operation multiplies corresponding elements and accumulates the products into the output). f0 and w0 … w7 are fed into the multiply-add matrix, f0 is copied 8 times, and the copies undergo multiply-add operations with the vector blocks w0 … w7 respectively. Likewise, f1 … f7 must each be multiplied and accumulated with w0 … w7. In this process f0 … f7 reuse the vector blocks w0 … w7, and every w block reuses the same f block. Therefore, across these 8 matrix operations, w0 … w7 need to be fetched only once, and one new f vector block is read per computation. The 8 operations take 8 clock cycles, and reading the next w0 … w7 also takes 8 cycles, so the reading of w blocks can be hidden inside the computation (the data-reading process overlaps completely with computation, and computation never pauses to wait for data). This is why an input feature vector cache and a coefficient vector cache are needed.
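The reuse argument can be checked with a small model, assuming a vector-block length K and arbitrary data: w0 … w7 stay resident for eight compute cycles while one new f block is consumed per cycle, which is exactly the overlap described above.

```c
#include <stdio.h>

#define K 4 /* illustrative vector-block length */

/* one multiply-add pass: multiply corresponding elements, sum the products */
static int dot(const int f[K], const int w[K]) {
    int acc = 0;
    for (int i = 0; i < K; i++) acc += f[i] * w[i];
    return acc;
}

int main(void) {
    int f[8][K], w[8][K], out[8][8];
    for (int i = 0; i < 8; i++)
        for (int k = 0; k < K; k++) { f[i][k] = i + k; w[i][k] = i - k; }

    /* w0..w7 are fetched once and stay resident for all 8 cycles; each
     * cycle reads ONE new f block and broadcasts it across the 8 columns,
     * so the 8-cycle reload of the next w group hides behind compute. */
    for (int cycle = 0; cycle < 8; cycle++)      /* one f block per cycle   */
        for (int j = 0; j < 8; j++)              /* f[cycle] copied 8 times */
            out[cycle][j] = dot(f[cycle], w[j]);

    printf("out[0][0] = %d\n", out[0][0]);
    return 0;
}
```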
The intermediate result buffer caches intermediate results of vector computation: by the principle of convolution, the final result cannot be obtained from a single vector multiply-add; several multiply-add results must be accumulated. A buffer is therefore placed after the multiply-add result accumulator. When continued accumulation of intermediate results yields a complete final result, that result is stored in the complete-result buffer. This buffer has several roles:
1. avoiding data being overwritten by later intermediate results;
2. the cache buffer is shared by a following activation module, a pooling module and the like and is used for storing input data and output data of the calculation modules;
3. the module has bus read-write control and sends the final calculation data to the DMA interface.
In some embodiments, the method further comprises: configuring the register file in the AIPU branch as two parts, the first running the current AIPU operation and the second acquiring the parameters required for the next AIPU operation. The register file can be declared as system registers when the compiler back end is extended, and the configuration information is loaded by load instructions. The register file is configured as two parts that perform ping-pong operation: while the first part controls the current AIPU run, the second part receives the parameters required for the next AIPU computation, and when the first part's run finishes, the second part's registers become the currently active ones. This keeps the AIPU running continuously without interruption.
The register-file configuration and switching principle is as follows: two sets of registers are declared when the compiler back end describes the chip architecture, so the compiler finds the corresponding registers from the register description in the architecture. For example, load r0, address loads the data at address into register set 0, and load r1, address loads it into register set 1. When the AIPU uses the registers, however, it must know which set is active, so a "compute done" signal alternately enables register set 0 and register set 1. When programming, immediately after one AIPU computation is enabled, the other register set should be configured in preparation for the next AIPU start.
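A minimal sketch of this ping-pong discipline, with invented field names: while one bank drives the current run, software fills the other, and the "compute done" event swaps them.

```c
#include <stdint.h>
#include <stdio.h>

typedef struct { uint32_t feature_base, coeff_base, dims; } layer_cfg;

static layer_cfg bank[2]; /* register set 0 and register set 1   */
static int active;        /* bank currently controlling the AIPU */

/* Fill the shadow bank with the next layer's parameters while the
 * active bank drives the current computation. */
static void prepare_next_layer(uint32_t f, uint32_t w, uint32_t d) {
    layer_cfg *c = &bank[active ^ 1];
    c->feature_base = f; c->coeff_base = w; c->dims = d;
}

/* The hardware's "compute done" signal alternates the banks, so the
 * next layer starts at once from the freshly configured set. */
static void on_compute_done(void) { active ^= 1; }

int main(void) {
    prepare_next_layer(0x1000, 0x2000, 0x0202); /* configure layer 1 while layer 0 runs */
    on_compute_done();                          /* layer 0 done: bank 1 becomes active  */
    printf("active bank %d, feature_base 0x%x\n", active, bank[active].feature_base);
    return 0;
}
```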
In some embodiments, the method further comprises: in response to the instruction being a load or store instruction, reading the data at the storage-space address given by the source operand into the destination operand. A RISC-V instruction typically has one or two source operands rs1, rs2, with vector source operands vs1, vs2 as the vector counterparts. An instruction provides the source operands to the corresponding execution unit (data load, data store, scalar computation, vector computation, and so on) according to its opcode (which encodes the computation type, such as an add, subtract, multiply or divide operation). For example, when the opcode denotes a load/store, marking the instruction as a memory access, the execution unit reads the data in the data store at the address held in rs1 into the destination operand (rd or vd).
In some embodiments, the method further comprises: judging whether the vector registers corresponding to the vector source operands are in the same group; and in response to the vector registers corresponding to the vector source operands not being in the same group, enabling two ports with the same bit width as the vector registers to read and write simultaneously. To speed up vector loads and stores, several ports as wide as a vector register can be provided. For example, there are 32 vector registers, divided in hardware into 4 groups, each group served by one port. The port to enable follows from the vector-group setting of the vsetvli instruction: if vsetvli t0, a0, e8, m4 sets groups of 4 vector registers, software divides the 32 registers into 8 groups, and the mapping to hardware is 2 software vector groups per hardware vector group. If the vector registers vs1 and vs2 are in the same hardware group during computation, only one port is enabled for reading and writing; if they fall in two groups, two ports are enabled to read and write simultaneously.
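With the grouping above (32 registers, 4 hardware groups of 8), the port-enable decision reduces to comparing hardware group indices, as in this sketch:

```c
#include <stdio.h>

static int hw_group(int vreg) { return vreg / 8; } /* 4 hardware groups of 8 */

/* one port suffices when both vector sources sit behind the same port */
static int ports_needed(int vs1, int vs2) {
    return hw_group(vs1) == hw_group(vs2) ? 1 : 2;
}

int main(void) {
    printf("v2,v6  -> %d port(s)\n", ports_needed(2, 6));  /* same group: 1      */
    printf("v2,v12 -> %d port(s)\n", ports_needed(2, 12)); /* different group: 2 */
    return 0;
}
```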
In some embodiments, the method further comprises: in response to jumping to a vector architecture branch, performing a vector operation in accordance with the instruction.
In some embodiments, the method further comprises: reading the corresponding data, performing dimension conversion on it according to the operation requirement, and writing the converted data into the corresponding coefficient cache or input feature cache. In the AIPU architecture, both the coefficient cache unit and the input feature cache unit read weight and feature data from the shared external SRAM, and the address generator produces the corresponding SRAM addresses according to the register configuration. Convolution and matrix operations take different forms in different applications; convolution alone subdivides into one-/two-/three-dimensional convolution, dilated convolution, depthwise convolution, separable convolution, transposed convolution and so on. Different computation forms read data in different patterns, and convolution usually transforms the dimensions of the data, so the address generator must read data in different patterns according to the register configuration, the conversion being accomplished, in effect, by the read order itself. The role of the address generator and read/write data control is therefore: according to the computation's requirements, complete the data reading and the corresponding dimension conversion, then write the data into the corresponding coefficient (weight) cache unit or input feature cache unit.
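As one illustration of conversion by read order, the sketch below walks a row-major H x W x C feature map one channel plane at a time, so the data lands in the cache in CHW order, transposed in effect. The configuration structure is an assumption; the patent requires only register-configured address generation.

```c
#include <stdint.h>
#include <stdio.h>

typedef struct { uint32_t base, h, w, c; } agen_cfg;

/* Emit addresses channel-plane by channel-plane: the source is stored
 * row-major HWC, so reading in this order performs an HWC -> CHW
 * transpose as the data is copied into the input feature cache. */
static void generate(const agen_cfg *g, void (*emit)(uint32_t)) {
    for (uint32_t c = 0; c < g->c; c++)
        for (uint32_t y = 0; y < g->h; y++)
            for (uint32_t x = 0; x < g->w; x++)
                emit(g->base + (y * g->w + x) * g->c + c);
}

static void print_addr(uint32_t a) { printf("%u ", a); }

int main(void) {
    agen_cfg g = { 0, 2, 2, 3 };  /* tiny 2x2x3 map at SRAM address 0  */
    generate(&g, print_addr);     /* prints: 0 3 6 9 1 4 7 10 2 5 8 11 */
    putchar('\n');
    return 0;
}
```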
The convolution timing control unit is the control core of the whole AIPU; it collects the states of all functional modules, controls the enabling of the relevant modules, and generates the synchronization signal of the convolution operation. The convolution synchronization signal is the beat of the whole convolution process. The whole convolution comprises N (N >= 1) beats; one beat comprises M (M >= 1) clock cycles; and one multiply-add plus accumulation takes one clock cycle, so one beat contains M such multiply-add-and-accumulate operations. The size of M is determined by the number of times data is reused during convolution. For example, if the same group of coefficients is reused 8 times, the minimum value of M is 8 (if the compute cycles suffice to load the next group of data, M equals the compute cycles; otherwise M needs extra time for loading the next group). Because convolution computation and data loading proceed in parallel, the data-loading synchronization signal is the convolution synchronization signal delayed by the fixed data read/write period; likewise, the accumulator's synchronization signal is the convolution synchronization signal delayed by the fixed multiply-add period.
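The rule for the beat length M stated above fits in two lines, assuming the reuse count and the load time of the next data group are known:

```c
#include <stdio.h>

/* A beat must cover both the compute cycles (one per reuse of the
 * resident coefficient group) and the cycles spent loading the next
 * group; whichever is longer sets M. */
static unsigned beat_cycles(unsigned reuse_count, unsigned load_cycles) {
    return reuse_count >= load_cycles ? reuse_count : load_cycles;
}

int main(void) {
    printf("M = %u\n", beat_cycles(8, 8));  /* load hides behind compute: M = 8 */
    printf("M = %u\n", beat_cycles(8, 12)); /* slow load stretches the beat: 12 */
    return 0;
}
```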
In some embodiments, the method further comprises: in response to convolution calculation, reading the data in the first-level input feature cache and the first-level coefficient cache, and judging whether the remaining space of those caches is larger than the size of the next group of data; and in response to the remaining space being larger than the size of the next group of data, enabling the write cache. The remaining available space of the input feature cache and the coefficient cache is determined jointly by the reading and writing of data: writing data shrinks the remaining space, and reading data grows it. The convolution timing controller computes the available cache space from the counts of cache reads and writes, and asserts convolution enable once the data in the two caches suffices to begin convolution (for example, the coefficient data meets the reuse count, the input feature data suffices for several computations, and the computation time is no less than the loading time of the data needed by the next computation). During convolution, continued reading of the input feature cache and coefficient cache gradually grows the remaining space of both, and once the remaining space exceeds the size of the next group of data, write-cache enable is asserted. Thus, if the next group of data loads in less time than the previous group takes to compute, convolution runs without interruption; if computation is fast and loading slow, the convolution process is interrupted.
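A sketch of this occupancy bookkeeping, with illustrative sizes: writes shrink the free space, reads grow it, and the write enable fires only once both level-1 caches can hold the next group.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t capacity; /* total cache bytes        */
    uint32_t used;     /* bytes currently buffered */
} l1_cache;

static void on_write(l1_cache *c, uint32_t n) { c->used += n; } /* free space shrinks */
static void on_read (l1_cache *c, uint32_t n) { c->used -= n; } /* free space grows   */

static bool write_enable(const l1_cache *feat, const l1_cache *coef,
                         uint32_t next_group) {
    return feat->capacity - feat->used >= next_group &&
           coef->capacity - coef->used >= next_group;
}

int main(void) {
    l1_cache feat = { 1024, 0 }, coef = { 1024, 0 };
    on_write(&feat, 896); on_write(&coef, 896); /* current group buffered    */
    on_read(&feat, 512);  on_read(&coef, 512);  /* convolution consumes data */
    printf("write enable: %d\n", write_enable(&feat, &coef, 448)); /* 1 */
    return 0;
}
```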
A convolution operation is performed on the corresponding feature data and coefficient data, and the result is activated, normalized and pooled. In convolution, after the multiply-add computation, the data must still undergo operations such as activation (e.g., ReLU), normalization and pooling. If a low-efficiency vector operation unit outside the AIPU were used, large numbers of multiply-add intermediate results would pile up waiting for activation or pooling, and the vector unit would drag down the efficiency of the whole convolution. Therefore, the activation and similar vector operations required by convolution are specialized and placed after the multiply-add matrix unit. These special vector computation units, such as the activation unit, can be chained in series with the multiply-add and accumulation units or work independently, and they share the intermediate result buffer unit.
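A minimal sketch of the fused post-processing chain, using ReLU and a 1x2 max-pool as stand-ins for the specialized units that follow the multiply-add matrix (the actual operator set and data widths are not fixed here):

```c
#include <stdio.h>

static int relu(int x)        { return x > 0 ? x : 0; }
static int max2(int a, int b) { return a > b ? a : b; }

int main(void) {
    int acc[4] = { -3, 5, 2, -1 }; /* accumulated multiply-add results */
    /* activation feeds pooling directly, in series after accumulation,
     * without a round trip through a general vector unit */
    for (int i = 0; i < 4; i += 2)
        printf("pooled[%d] = %d\n", i / 2, max2(relu(acc[i]), relu(acc[i + 1])));
    return 0;
}
```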
The key points of the embodiment of the invention are as follows:
(1) A processor architecture with three instruction branches is designed around the RISC-V instruction set: a general instruction branch, a vector instruction branch, and an AIPU branch.
(2) An AIPU architecture is designed. The AIPU is combined with the RISC-V architecture in the form of an accelerator, and a dedicated register file, configured through RISC-V load/store instructions, accelerates convolution and matrix operations.
(3) The architecture of the AIPU multiply-add array, a two-dimensional parallel multiply-add unit, is designed. Two vector caches are provided and, together with the input feature cache and the coefficient cache, form two stages of double buffering. The front-stage double buffer (whose aim is to keep the downstream units continuously supplied with data) is formed jointly by the input feature cache, the coefficient cache and the convolution control unit: by monitoring the remaining space in real time, the data needed next is written into the caches while they are continuously read out. The rear-stage buffer increases bandwidth and multiplexes data.
(4) Each buffer in the AIPU is designed to reasonably absorb the different data throughput rates between the functional modules at each stage of the convolution operation.
(5) A flexible address generator is designed: according to the register configuration, it cooperates with the rear-stage buffer and converts the data dimensions while reading the data.
(6) Ping-pong operation registers are designed to keep two successive, different convolution operations running without interruption.
The architecture of the embodiment of the invention is very flexible in application: it has the control capability of a general-purpose CPU and the computing power AI requires. It can be applied in edge devices for the artificial intelligence of things (AIoT). It can also achieve higher computing power through a network-on-chip (NoC) and be installed in a PC or server as an accelerator card to perform cloud inference or training.
It should be particularly noted that, the steps in the embodiments of the method for processing data based on RISC-V instruction set described above can be mutually intersected, replaced, added, or deleted, so that these reasonable permutation and combination transformations for the method for processing data based on RISC-V instruction set also belong to the protection scope of the present invention, and the protection scope of the present invention should not be limited to the embodiments.
In view of the above object, according to a second aspect of the embodiments of the present invention, there is provided a system for data processing based on the RISC-V instruction set, comprising: an acquisition module configured to fetch an instruction from a RISC-V instruction space, cache it in a cache, and judge the type of the instruction; a jump module configured to, in response to the instruction being a branch jump instruction, regenerate an instruction address and jump to the corresponding branch according to the instruction address; an AIPU module configured to, in response to jumping to the AIPU branch, store the feature data and coefficient data for the current convolution operation in a first-level input feature cache and a first-level coefficient cache, and store the feature data and coefficient data for the next convolution operation in a second-level input feature cache and a second-level coefficient cache; and an execution module configured to perform a convolution operation on the corresponding feature data and coefficient data, and activate, normalize and pool the operation result.
In some embodiments, the system further comprises a vector module configured to: in response to jumping to a vector architecture branch, perform a vector operation in accordance with the instruction.
In some embodiments, the system further comprises a first determining module configured to: in response to the instruction being a load or store instruction, read the data at the storage-space address given by the source operand into the destination operand.
In some embodiments, the system further comprises a second determining module configured to: judge whether the vector registers corresponding to the vector source operands are in the same group; and in response to the vector registers corresponding to the vector source operands not being in the same group, enable two ports with the same bit width as the vector registers to read and write simultaneously.
In some embodiments, the system further comprises a configuration module configured to: configure the register file in the AIPU branch as two parts, wherein the first part runs the current AIPU operation and the second part acquires the parameters required for the next AIPU operation.
In some embodiments, the system further comprises a conversion module configured to: read the corresponding data, perform dimension conversion on the data according to the operation requirement, and write the converted data into the corresponding coefficient cache or input feature cache.
In some embodiments, the system further comprises a computing module configured to: in response to convolution calculation, read the data in the first-level input feature cache and the first-level coefficient cache, and judge whether the remaining space of the first-level input feature cache and the first-level coefficient cache is larger than the size of the next group of data; and in response to the remaining space of the first-level input feature cache and the first-level coefficient cache being larger than the size of the next group of data, enable the write cache.
In view of the above object, a third aspect of the embodiments of the present invention provides a computer device, comprising: at least one processor; and a memory storing computer instructions executable on the processor, the instructions being executable by the processor to perform the steps of: S1, fetching an instruction from a RISC-V instruction space, caching it in a cache, and judging the type of the instruction; S2, in response to the instruction being a branch jump instruction, regenerating an instruction address and jumping to the corresponding branch according to the instruction address; S3, in response to jumping to the AIPU branch, storing the feature data and coefficient data for the current convolution operation in a first-level input feature cache and a first-level coefficient cache, and storing the feature data and coefficient data for the next convolution operation in a second-level input feature cache and a second-level coefficient cache; and S4, performing a convolution operation on the corresponding feature data and coefficient data, and activating, normalizing and pooling the operation result.
In some embodiments, the steps further comprise: in response to jumping to a vector architecture branch, performing a vector operation in accordance with the instruction.
In some embodiments, the steps further comprise: in response to the instruction being a load or store instruction, reading the data at the storage-space address given by the source operand into the destination operand.
In some embodiments, the steps further comprise: judging whether the vector registers corresponding to the vector source operands are in the same group; and in response to the vector registers corresponding to the vector source operands not being in the same group, enabling two ports with the same bit width as the vector registers to read and write simultaneously.
In some embodiments, the steps further comprise: configuring the register file in the AIPU branch as two parts, wherein the first part runs the current AIPU operation and the second part acquires the parameters required for the next AIPU operation.
In some embodiments, the steps further comprise: reading the corresponding data, performing dimension conversion on the data according to the operation requirement, and writing the converted data into the corresponding coefficient cache or input feature cache.
In some embodiments, the steps further comprise: in response to convolution calculation, reading the data in the first-level input feature cache and the first-level coefficient cache, and judging whether the remaining space of the first-level input feature cache and the first-level coefficient cache is larger than the size of the next group of data; and in response to the remaining space of the first-level input feature cache and the first-level coefficient cache being larger than the size of the next group of data, enabling the write cache.
Fig. 5 is a schematic diagram of a hardware structure of an embodiment of the computer device for performing data processing based on the RISC-V instruction set according to the present invention.
Taking the apparatus shown in fig. 5 as an example, the apparatus includes a processor 201 and a memory 202, and may further include: an input device 203 and an output device 204.
The processor 201, the memory 202, the input device 203 and the output device 204 may be connected by a bus or other means, and fig. 5 illustrates the connection by a bus as an example.
The memory 202, as a non-volatile computer-readable storage medium, can store non-volatile software programs, non-volatile computer-executable programs and modules, such as the program instructions/modules corresponding to the method for data processing based on the RISC-V instruction set in the embodiments of the present application. By running the non-volatile software programs, instructions and modules stored in the memory 202, the processor 201 executes the various functional applications and data processing of the server, that is, implements the method for data processing based on the RISC-V instruction set of the above method embodiments.
The memory 202 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of a method of data processing based on a RISC-V instruction set, and the like. Further, the memory 202 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, memory 202 may optionally include memory located remotely from processor 201, which may be connected to local modules via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 203 may receive information such as a user name and a password that are input. The output device 204 may include a display device such as a display screen.
One or more program instructions/modules corresponding to the method for performing data processing based on the RISC-V instruction set are stored in the memory 202, and when executed by the processor 201, perform the method for performing data processing based on the RISC-V instruction set in any of the above-described method embodiments.
Any embodiment of a computer apparatus for performing the method for data processing based on the RISC-V instruction set may achieve the same or similar effects as any of the corresponding embodiments of the method described above.
The invention also provides a computer readable storage medium storing a computer program which, when executed by a processor, performs the method as above.
FIG. 6 is a schematic diagram of an embodiment of a computer storage medium for performing data processing based on RISC-V instruction set according to the present invention. Taking the computer storage medium as shown in fig. 6 as an example, the computer readable storage medium 3 stores a computer program 31 which, when executed by a processor, performs the method as described above.
Finally, it should be noted that, as one of ordinary skill in the art can appreciate that all or part of the processes of the methods of the above embodiments can be implemented by a computer program to instruct related hardware, and a program of the method for data processing based on the RISC-V instruction set can be stored in a computer readable storage medium, and when the program is executed, the program can include the processes of the embodiments of the methods as described above. The storage medium of the program may be a magnetic disk, an optical disk, a Read Only Memory (ROM), a Random Access Memory (RAM), or the like. The embodiments of the computer program may achieve the same or similar effects as any of the above-described method embodiments.
The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the present disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.
It should be understood that, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly supports the exception. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items.
The numbers of the embodiments disclosed in the embodiments of the present invention are merely for description, and do not represent the merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
Those of ordinary skill in the art will understand that: the discussion of any embodiment above is meant to be exemplary only, and is not intended to intimate that the scope of the disclosure, including the claims, of embodiments of the invention is limited to these examples; within the idea of an embodiment of the invention, also technical features in the above embodiment or in different embodiments may be combined and there are many other variations of the different aspects of the embodiments of the invention as described above, which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements, and the like that may be made without departing from the spirit and principles of the embodiments of the present invention are intended to be included within the scope of the embodiments of the present invention.

Claims (10)

1. A method for performing data processing based on a RISC-V instruction set, comprising the steps of:
obtaining an instruction from a RISC-V instruction space, caching the instruction into a cache, and determining the type of the instruction;
in response to the instruction being a branch jump instruction, regenerating an instruction address and jumping to the corresponding branch according to the instruction address;
in response to jumping to an AIPU branch, storing the feature data and coefficient data for the current convolution operation in a first-level input feature cache and a first-level coefficient cache, and storing the feature data and coefficient data for the next convolution operation in a second-level input feature cache and a second-level coefficient cache; and
performing a convolution operation according to the corresponding feature data and coefficient data, and activating, normalizing, and pooling the operation result.
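By way of illustration only, the double-buffering behavior recited in claim 1 can be sketched in C. Everything below — the tile_cache type, the buffer size, the stub conv and activate kernels, and the aipu_step function — is a hypothetical example under assumed names, not the patented implementation; normalization and pooling are elided.

    #include <stddef.h>

    #define CACHE_WORDS 1024

    typedef struct {
        float feature[CACHE_WORDS]; /* input feature cache */
        float coeff[CACHE_WORDS];   /* coefficient cache   */
    } tile_cache;

    static tile_cache level1; /* data for the current convolution      */
    static tile_cache level2; /* data prefetched for the next one      */

    /* Stand-in kernels; real hardware would perform 2-D convolution. */
    static void conv(float *out, const tile_cache *c, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            out[i] = c->feature[i] * c->coeff[i];
    }

    static void activate(float *out, size_t n) /* ReLU as an example */
    {
        for (size_t i = 0; i < n; i++)
            if (out[i] < 0.0f) out[i] = 0.0f;
    }

    /* One AIPU step: compute on the level-1 data, then swap the two
     * cache levels so the prefetched data becomes current. */
    void aipu_step(float *out, size_t n)
    {
        conv(out, &level1, n);
        activate(out, n); /* normalize() and pool() would follow here */

        tile_cache tmp = level1;
        level1 = level2;
        level2 = tmp;
    }

While aipu_step consumes the level-1 data, a separate loader can fill the level-2 caches, so computation and data movement overlap rather than serialize — which is the point of having two cache levels.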
2. The method of claim 1, further comprising:
in response to jumping to a vector architecture branch, performing a vector operation in accordance with the instruction.
3. The method of claim 1, further comprising:
in response to the instruction being a load or store instruction, reading the data at the storage-space address given by the source operand into the destination operand.
4. The method of claim 1, further comprising:
determining whether the vector registers corresponding to the vector source operands are in the same group; and
in response to the vector registers corresponding to the vector source operands not being in the same group, enabling two ports with the same bit width as the vector registers to read and write simultaneously.
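A minimal sketch of the port-enable decision in claim 4, in C; the group width and the function names are assumptions made for the example, not taken from the patent.

    #include <stdbool.h>

    #define REGS_PER_GROUP 8 /* assumed grouping, e.g. v0-v7, v8-v15 */

    static int vreg_group(int vreg) { return vreg / REGS_PER_GROUP; }

    /* Decide how many full-width ports to enable for two vector source
     * operands: registers in different groups can be read through two
     * ports in the same cycle; registers in one group share a port. */
    int read_ports_to_enable(int vs1, int vs2)
    {
        bool same_group = (vreg_group(vs1) == vreg_group(vs2));
        return same_group ? 1 : 2;
    }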
5. The method of claim 1, further comprising:
configuring the register file in the AIPU branch into two parts, wherein the first part runs the current AIPU operation and the second part acquires the parameters required for the next AIPU operation.
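Claim 5's two-part register file behaves like a pair of parameter banks that swap roles between operations. The C sketch below is illustrative only; aipu_params and its fields are invented names.

    typedef struct {
        int kernel_w, kernel_h, stride; /* illustrative parameters */
    } aipu_params;

    static aipu_params bank[2];
    static int active = 0; /* index of the bank driving the current op */

    /* Fill the idle bank with the parameters of the next operation. */
    void load_next_params(const aipu_params *next)
    {
        bank[active ^ 1] = *next;
    }

    /* Called when the next operation starts: the prefilled bank
     * becomes active, and the old one is free to be reloaded. */
    const aipu_params *swap_param_banks(void)
    {
        active ^= 1;
        return &bank[active];
    }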
6. The method of claim 5, further comprising:
reading the corresponding data, performing dimension conversion on the data according to the operation requirement, and writing the converted data into the corresponding coefficient cache or input feature cache.
7. The method of claim 1, further comprising:
in response to a convolution calculation, reading the data in the first-level input feature cache and the first-level coefficient cache, and determining whether the remaining space of the first-level input feature cache and the first-level coefficient cache is larger than the size of the next group of data; and
starting a write cache in response to the remaining space of the first-level input feature cache and the first-level coefficient cache being larger than the size of the next group of data.
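The remaining-space test of claim 7 reduces to a simple predicate; the byte-accounting model below (capacity and used fields, next_group_bytes) is an assumed illustration in C, not the patent's implementation.

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct {
        size_t capacity; /* total size of the level-1 cache            */
        size_t used;     /* bytes still queued for the ongoing compute */
    } l1_cache;

    /* Start the write cache only when both level-1 caches have room
     * for the next group of data. */
    bool write_cache_enable(const l1_cache *feat, const l1_cache *coef,
                            size_t next_group_bytes)
    {
        size_t feat_free = feat->capacity - feat->used;
        size_t coef_free = coef->capacity - coef->used;
        return feat_free > next_group_bytes &&
               coef_free > next_group_bytes;
    }

Gating the write cache on this predicate ensures that incoming data never overwrites entries the convolution has not yet consumed.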
8. A system for performing data processing based on a RISC-V instruction set, comprising:
an acquisition module configured to acquire an instruction from a RISC-V instruction space, cache the instruction into a cache, and determine the type of the instruction;
a jump module configured to, in response to the instruction being a branch jump instruction, regenerate an instruction address and jump to the corresponding branch according to the instruction address;
an AIPU module configured to, in response to jumping to an AIPU branch, store the feature data and coefficient data for the current convolution operation in a first-level input feature cache and a first-level coefficient cache, and store the feature data and coefficient data for the next convolution operation in a second-level input feature cache and a second-level coefficient cache; and
an execution module configured to perform a convolution operation according to the corresponding feature data and coefficient data, and to activate, normalize, and pool the operation result.
9. A computer device, comprising:
at least one processor; and
a memory storing computer instructions executable on the processor, wherein the instructions, when executed by the processor, implement the steps of the method of any one of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which, when executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN202110175746.6A 2021-02-09 2021-02-09 Method, system, device and medium for data processing based on RISC-V instruction set Pending CN112860320A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110175746.6A CN112860320A (en) 2021-02-09 2021-02-09 Method, system, device and medium for data processing based on RISC-V instruction set
PCT/CN2022/074414 WO2022170997A1 (en) 2021-02-09 2022-01-27 Data processing method and system based on risc-v instruction set, and device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110175746.6A CN112860320A (en) 2021-02-09 2021-02-09 Method, system, device and medium for data processing based on RISC-V instruction set

Publications (1)

Publication Number Publication Date
CN112860320A true CN112860320A (en) 2021-05-28

Family

ID=75989351

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110175746.6A Pending CN112860320A (en) 2021-02-09 2021-02-09 Method, system, device and medium for data processing based on RISC-V instruction set

Country Status (2)

Country Link
CN (1) CN112860320A (en)
WO (1) WO2022170997A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115801147B (en) * 2022-11-30 2023-09-22 珠海笛思科技有限公司 Data communication processing method and system
CN118276951B (en) * 2024-06-04 2024-08-09 山东浪潮科学研究院有限公司 RISC-V based instruction expansion method and implementation device
CN118427030B (en) * 2024-07-05 2024-09-13 长沙麟卓信息科技有限公司 Harvard architecture multi-level instruction cache measuring and calculating method based on random instruction set

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109740749A (en) * 2017-10-30 2019-05-10 北京深鉴智能科技有限公司 The hardware realization apparatus and method that the full connection of high speed calculates
CN111656367A (en) * 2017-12-04 2020-09-11 优创半导体科技有限公司 System and architecture for neural network accelerator
CN110007961B (en) * 2019-02-01 2023-07-18 中山大学 RISC-V-based edge computing hardware architecture
JP7308674B2 (en) * 2019-07-08 2023-07-14 キヤノン株式会社 Arithmetic processing device and arithmetic processing method
CN112860320A (en) * 2021-02-09 2021-05-28 山东英信计算机技术有限公司 Method, system, device and medium for data processing based on RISC-V instruction set

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1516001A (en) * 2003-01-08 2004-07-28 上海海尔集成电路有限公司 New-type RISC pieline microcontroller structure and its operation method
US20110035745A1 (en) * 2008-03-17 2011-02-10 Institute Of Computing Technology Of The Chinese Academy Of Sciences Risc processor apparatus and method for supporting x86 virtual machine
CN106940815A (en) * 2017-02-13 2017-07-11 西安交通大学 A kind of programmable convolutional neural networks Crypto Coprocessor IP Core
CN108647773A (en) * 2018-04-20 2018-10-12 复旦大学 A kind of hardwired interconnections framework of restructural convolutional neural networks
CN110659069A (en) * 2018-06-28 2020-01-07 赛灵思公司 Instruction scheduling method for performing neural network computation and corresponding computing system
CN111191774A (en) * 2018-11-14 2020-05-22 上海富瀚微电子股份有限公司 Simplified convolutional neural network-oriented low-cost accelerator architecture and processing method thereof
CN111078287A (en) * 2019-11-08 2020-04-28 苏州浪潮智能科技有限公司 Vector operation co-processing method and device
CN111160545A (en) * 2019-12-31 2020-05-15 北京三快在线科技有限公司 Artificial neural network processing system and data processing method thereof
CN111582465A (en) * 2020-05-08 2020-08-25 中国科学院上海高等研究院 Convolutional neural network acceleration processing system and method based on FPGA and terminal
CN112130901A (en) * 2020-09-11 2020-12-25 山东云海国创云计算装备产业创新中心有限公司 RISC-V based coprocessor, data processing method and storage medium
CN112232517A (en) * 2020-09-24 2021-01-15 苏州浪潮智能科技有限公司 Artificial intelligence accelerates engine and artificial intelligence treater

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022170997A1 (en) * 2021-02-09 2022-08-18 山东英信计算机技术有限公司 Data processing method and system based on risc-v instruction set, and device and medium
CN113254391A (en) * 2021-06-25 2021-08-13 之江实验室 Neural network accelerator convolution calculation and data loading parallel method and device
WO2023284130A1 (en) * 2021-07-15 2023-01-19 深圳供电局有限公司 Chip and control method for convolution calculation, and electronic device
CN114399034A (en) * 2021-12-30 2022-04-26 北京奕斯伟计算技术有限公司 Data handling method for direct memory access device
CN115113933A (en) * 2022-08-25 2022-09-27 旋智电子科技(上海)有限公司 Apparatus for accelerating data operations
CN115113933B (en) * 2022-08-25 2022-11-15 旋智电子科技(上海)有限公司 Apparatus for accelerating data operation
CN115248701A (en) * 2022-09-21 2022-10-28 进迭时空(杭州)科技有限公司 Zero-copy data transmission device and method between processor register files
CN115576606A (en) * 2022-11-16 2023-01-06 苏州浪潮智能科技有限公司 Method for realizing matrix transposition multiplication, coprocessor, server and storage medium
CN116149554A (en) * 2023-02-08 2023-05-23 珠海妙存科技有限公司 RISC-V and extended instruction based data storage processing system and method thereof
CN116149554B (en) * 2023-02-08 2023-11-24 珠海妙存科技有限公司 RISC-V and extended instruction based data storage processing system and method thereof
CN116804915A (en) * 2023-08-28 2023-09-26 腾讯科技(深圳)有限公司 Data interaction method, processor, device and medium based on memory
CN116804915B (en) * 2023-08-28 2023-12-15 腾讯科技(深圳)有限公司 Data interaction method, processor, device and medium based on memory

Also Published As

Publication number Publication date
WO2022170997A1 (en) 2022-08-18

Similar Documents

Publication Publication Date Title
CN112860320A (en) Method, system, device and medium for data processing based on RISC-V instruction set
CN108268278B (en) Processor, method and system with configurable spatial accelerator
US10564980B2 (en) Apparatus, methods, and systems for conditional queues in a configurable spatial accelerator
CN109213723B (en) Processor, method, apparatus, and non-transitory machine-readable medium for dataflow graph processing
US11307873B2 (en) Apparatus, methods, and systems for unstructured data flow in a configurable spatial accelerator with predicate propagation and merging
US11086816B2 (en) Processors, methods, and systems for debugging a configurable spatial accelerator
US10445451B2 (en) Processors, methods, and systems for a configurable spatial accelerator with performance, correctness, and power reduction features
US10515046B2 (en) Processors, methods, and systems with a configurable spatial accelerator
US10915471B2 (en) Apparatuses, methods, and systems for memory interface circuit allocation in a configurable spatial accelerator
US10387319B2 (en) Processors, methods, and systems for a configurable spatial accelerator with memory system performance, power reduction, and atomics support features
US10380063B2 (en) Processors, methods, and systems with a configurable spatial accelerator having a sequencer dataflow operator
US10445234B2 (en) Processors, methods, and systems for a configurable spatial accelerator with transactional and replay features
US10496574B2 (en) Processors, methods, and systems for a memory fence in a configurable spatial accelerator
US10416999B2 (en) Processors, methods, and systems with a configurable spatial accelerator
US20190303297A1 (en) Apparatus, methods, and systems for remote memory access in a configurable spatial accelerator
US11029958B1 (en) Apparatuses, methods, and systems for configurable operand size operations in an operation configurable spatial accelerator
CN111566623A (en) Apparatus, method and system for integrated performance monitoring in configurable spatial accelerators
US10678724B1 (en) Apparatuses, methods, and systems for in-network storage in a configurable spatial accelerator
US12086080B2 (en) Apparatuses, methods, and systems for a configurable accelerator having dataflow execution circuits
WO2021034587A1 (en) Multiple output fusion for operations performed in a multi-dimensional array of processing units
GB2464292A (en) SIMD processor circuit for performing iterative SIMD multiply-accumulate operations
US20210200540A1 (en) Apparatuses, methods, and systems for fused operations in a configurable spatial accelerator
Kang et al. Datapath Extension of NPUs to Support Nonconvolutional Layers Efficiently
CN118708246A (en) RISC-V based multi-precision vector operation device
CN118193054A (en) Custom instruction processing method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination