WO2022170997A1

WO2022170997A1 - Data processing method and system based on risc-v instruction set, and device and medium

Info

Publication number: WO2022170997A1
Application number: PCT/CN2022/074414
Authority: WO
Inventors: 贾兆荣
Original assignee: 山东英信计算机技术有限公司
Priority date: 2021-02-09
Filing date: 2022-01-27
Publication date: 2022-08-18
Also published as: CN112860320A

Abstract

Disclosed in the present application are a data processing method and system based on an RISC-V instruction set, and a device and a storage medium. The method comprises: acquiring an instruction from an RISC-V instruction space, caching the instruction into a cache, and determining the type of the instruction; in response to the instruction being a branch skipping instruction, regenerating an instruction address, and skipping to a corresponding branch according to the instruction address; in response to skipping to an AIPU branch, storing, by means of a first-stage input feature cache and a first-stage coefficient cache, feature data and coefficient data which are used for the current convolution operation, and storing, by means of a second-stage input feature cache and a second-stage coefficient cache, feature data and coefficient data which are used for the next convolution operation; and performing a convolution operation according to the corresponding feature data and coefficient data, and performing activation, normalization and pooling on a result obtained by means of the operation. By means of the present application, a processor architecture with three instruction branches is designed according to an RISC-V instruction set, such that general control, vector operation, convolution and matrix acceleration calculation are realized. The present application is suitable for an AI inference chip of a terminal.

Description

Method, system, device and medium for data processing based on RISC-V instruction set

This application claims the priority of the Chinese patent application filed on February 9, 2021 with the application number 202110175746.6 and the invention titled "Method, System, Device and Medium for Data Processing Based on RISC-V Instruction Set", all of which are The contents are incorporated herein by reference.

technical field

The present application relates to the field of data processing, and more particularly, to a method, system, computer device and readable medium for data processing based on the RISC-V instruction set.

Background technique

The value of data lies in analysis and utilization, not simple storage. The amount of data is constantly growing, and it is impossible to transmit all data to the cloud through the network, and the speed of bandwidth growth is slower than the speed of data growth. For application scenarios with high real-time requirements, we need to judge the data at the edge, such as autonomous driving, unmanned driving and other fields. For scenarios with high privacy protection requirements, such as medical information or data that users are unwilling to share in the cloud, it needs to be stored locally. For example, most of the data generated by security equipment is useless or data that has no potential to be tapped. It is a waste of bandwidth to transmit all data to the cloud. If intelligent analysis is performed at the edge, only useful or potential data is transmitted to the cloud. Greatly saves network bandwidth. Therefore, the transfer of data processing from the cloud to the edge is an inevitable trend. Therefore, edge-end AI (artificial intelligence, artificial intelligence) chips are also the general trend.

Artificial intelligence processing at the edge requires AI chips, and the challenges faced by AI chips are mainly computing power and computing efficiency. The computing power of an AI chip is determined by the number of on-chip computing units. Since the amount of data involved in AI computing is very large, in theory, the larger the computing power of an AI chip, the better, but in fact, the computing power of an AI chip is restricted by various factors:

1. On-chip storage bandwidth and bus bandwidth: The main contradiction of AI chips is the contradiction between storage bandwidth and computing power. The greater the computing power, the greater the amount of input data, intermediate results, and output data, and the higher the required storage bandwidth. However, the current storage bandwidth is far from meeting the computing power requirements. If the computing units and storage units cannot be reasonably arranged, it will lead to There is a large but inefficient result.

2. There are various operators involved in AI computation, such as convolution computation, matrix computation, normalization, activation, pooling and other linear and nonlinear computations. A deep neural network model usually consists of multiple layers, and the output of the previous layer is the input of the next layer; in the same layer, the result of the multiplication and addition operation is often the input of activation, pooling, and normalization. Therefore, if multi-threading/parallel computing/computation pipeline cannot be implemented reasonably, the calculation of the previous step will hinder the calculation of the next step, causing waste of resources and reducing computing efficiency.

3. As mentioned in 2, there are various operators involved in AI, but the AI chip is fixed. How to make the unchanged hardware efficiently handle the variable operators requires that the software can be reasonably based on the hardware architecture. Allocate hardware resources and compile efficient machine code. At the same time, AI chips are also required to have efficient control capabilities.

SUMMARY OF THE INVENTION

In view of this, the purpose of the embodiments of the present application is to propose a method, system, computer equipment and computer-readable storage medium for data processing based on RISC-V instruction set, through AIPU (AI process unit, artificial intelligence processing unit) and CPU shares memory, making computing and storage adjacent, improving memory access bandwidth, facilitating data interaction between AIPU and CPU, reducing the amount of data interaction with external buses, and reducing the demand for bus bandwidth. At the same time, the AIPU and the CPU each have a small buffer (cache) used to cache input data, intermediate results, output data and CPU pre-reading instructions, allowing data to be loaded at the same time as data calculation and prolonging data read and write time. Further reducing the need for bus bandwidth.

Based on the above purpose, an aspect of the embodiments of the present application provides a method for data processing based on a RISC-V instruction set, including the following steps: acquiring an instruction from the RISC-V instruction space and caching it in the cache, and judging the instruction type; in response to the instruction being a branch jump instruction, regenerate the instruction address, and jump to the corresponding branch according to the instruction address; in response to jumping to the AIPU branch, through the first level input feature cache and the first The first-level coefficient cache stores the feature data and coefficient data for the current convolution operation, and stores the feature data and coefficient data of the next convolution operation through the second-level input feature cache and the second-level coefficient cache; and according to the corresponding feature data Perform convolution operation with coefficient data, and activate, normalize and pool the result obtained by operation.

In some embodiments, the method further includes performing a vector operation according to the instruction in response to the jump to the vector architecture branch.

In some embodiments, the method further includes: in response to the instruction being a load or store instruction, reading the address in the storage space into the destination operand according to the address in the source operand.

In some embodiments, the method further includes: judging whether the vector registers corresponding to the vector source operands are in the same group; and in response to the vector registers corresponding to the vector source operands not being in the same group, making two Ports with the same bit width of the vector registers can be read and written at the same time.

In some embodiments, the method further includes: configuring the register file in the AIPU branch into two parts, the first part runs the current AIPU operation, and the second part obtains the parameters required by the AIPU for the next operation.

In some embodiments, the method further includes: reading data corresponding to the operation and performing dimension transformation on the data corresponding to the operation according to the requirements of the operation, and writing the converted data into the corresponding coefficients Cache or input feature cache.

In some embodiments, the method further includes: in response to performing the convolution calculation, reading data in the first-level input feature buffer and the first-level coefficient buffer, and determining the first-level input feature buffer and the first-level coefficient buffer. whether the remaining space of the first-level coefficient cache is larger than the size of the next group of data; and in response to the remaining space of the first-level input feature cache and the first-level coefficient cache being larger than the size of the next group of data, enable write caching.

Another aspect of the embodiments of the present application provides a system for data processing based on a RISC-V instruction set, including: an acquisition module configured to acquire instructions from the RISC-V instruction space and cache them in the cache, and determine the the type of the instruction; the jump module, configured to regenerate the instruction address in response to the instruction being a branch jump instruction, and to jump to the corresponding branch according to the instruction address; the AIPU module, configured to jump to the AIPU in response to the instruction Branch, store the feature data and coefficient data for the current convolution operation through the first-level input feature cache and the first-level coefficient cache, and store the next convolution operation through the second-level input feature cache and the second-level coefficient cache. feature data and coefficient data; and an execution module configured to perform a convolution operation according to the corresponding feature data and coefficient data, and activate, normalize and pool the results obtained by the operation.

In yet another aspect of the embodiments of the present application, a computer device is also provided, including: at least one processor; and a memory, where the memory stores computer instructions that can be executed on the processor, and the instructions are executed by the processor. The processor implements the steps of the above method when executed.

In another aspect of the embodiments of the present application, a computer-readable storage medium is also provided, where the computer-readable storage medium stores a computer program that implements the above method steps when executed by a processor.

The present application has the following beneficial technical effects: the AIPU (AI process unit, artificial intelligence processing unit) shares memory with the CPU, wherein the AIPU establishes the first-level input feature cache, the first-level coefficient cache and the first-level coefficient cache in the shared memory. The second-level input feature cache and the second-level coefficient cache make calculation and storage adjacent, improve memory access bandwidth, facilitate data interaction between AIPU and CPU, reduce the amount of data interaction with external buses, and reduce the bandwidth of the bus. demand. At the same time, there is a small buffer inside the AIPU and CPU to cache input data, intermediate results, output data and CPU pre-reading instructions, allowing data to be loaded at the same time as data calculation, prolonging data reading and writing time, and further reducing the need for bus bandwidth requirements.

Description of drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the following briefly introduces the accompanying drawings required for the description of the embodiments or the prior art. Obviously, the drawings in the following description are only These are some embodiments of the present application. For those of ordinary skill in the art, other embodiments can also be obtained according to these drawings without creative efforts.

1 is a schematic diagram of an embodiment of a method for data processing based on a RISC-V instruction set provided by the present application;

2 is a schematic diagram of a CPU architecture in an embodiment of the present application;

3 is a schematic diagram of an AIPU architecture provided by the present application;

Fig. 4 is the schematic diagram of convolution operation in the embodiment of the method for data processing based on RISC-V instruction set provided by this application;

5 is a schematic diagram of the hardware structure of an embodiment of a computer device for data processing based on a RISC-V instruction set provided by the present application;

FIG. 6 is a schematic diagram of an embodiment of a computer storage medium for data processing based on a RISC-V instruction set provided by the present application.

Detailed ways

In order to make the objectives, technical solutions and advantages of the present application clearer, the following describes the embodiments of the present application in detail with reference to the accompanying drawings and specific embodiments.

It should be noted that all expressions using "first" and "second" in the embodiments of the present application are for the purpose of distinguishing two entities with the same name but not the same or non-identical parameters. It can be seen that "first" and "second" It is only for the convenience of expression and should not be construed as a limitation on the embodiments of the present application, and subsequent embodiments will not describe them one by one.

Based on the above purpose, in the first aspect of the embodiments of the present application, an embodiment of a method for data processing based on a RISC-V instruction set is proposed. FIG. 1 shows a schematic diagram of an embodiment of a method for data processing based on a RISC-V instruction set provided by the present application. As shown in Figure 1, the embodiment of the present application includes the following steps:

S1. Acquire the instruction cache from the RISC-V instruction space into the cache, and determine the type of the instruction;

S2, in response to the instruction being a branch jump instruction, regenerate the instruction address, and jump to the corresponding branch according to the instruction address;

S3. In response to jumping to the AIPU branch, store the feature data and coefficient data for the current convolution operation through the first-level input feature buffer and the first-level coefficient buffer, and pass the second-level input feature buffer and the second-level coefficient buffer The cache stores feature data and coefficient data for the next convolution operation; and

S4. Perform a convolution operation according to the corresponding feature data and coefficient data, and activate, normalize and pool the results obtained by the operation.

In the embodiment of the present application, a storage-computing integrated structure is adopted, and the AIPU and the CPU share memory, wherein the AIPU establishes the first-level input feature cache, the first-level coefficient cache, and the second-level input in the shared memory. Feature cache and second-level coefficient cache make computing adjacent to storage, improve memory access bandwidth, facilitate data interaction between AIPU and CPU, reduce the amount of data interaction with external buses, and reduce the demand for bus bandwidth. At the same time, there is a small buffer inside the AIPU and CPU to cache input data, intermediate results, output data and CPU pre-reading instructions, allowing data to be loaded at the same time as data calculation, prolonging data reading and writing time, and further reducing the need for bus bandwidth requirements.

RISC-V instruction set includes general instruction set and vector extended instruction set, which can be divided into: integer instruction set I, multiply-add operation instruction set M, atomic operation instruction set A, single-precision instruction set F, double-precision instruction set D, compression Instruction set C and vector instruction set V. The arithmetic logic operation unit completes the IMAFDC instruction set operation, and the vector operation unit completes the vector instruction set V operation. The CPU architecture is designed according to the RISC-V instruction set. The function of the CPU is to run system code and complete system control and data operations.

FIG. 2 shows a schematic diagram of a CPU architecture in an embodiment of the present application. As shown in Figure 2, the CPU adopts a two-stage pipeline architecture. The first stage is the instruction fetch stage, which is responsible for fetching the instruction cache from the instruction storage space into the instruction cache. The second stage decodes and executes the instruction. When decoding, analyze the type of the instruction (vector instruction or ordinary instruction), and start the corresponding data operation according to the corresponding instruction type and opcode. For example, the vector add instruction will read the data from the vector data storage to the vector register, and then in The operation is completed in the vector operation unit, and the result is cached in the vector data cache.

The meaning of setting the vector data cache here is: In AI inference calculation, vector operations are usually not independent, and it is often necessary to reasonably complete the calculation with multiple vector operations in the form of pipelines. If the intermediate results are stored in the data sram (Static Random Access Memory , static random access memory), the vector data may take multiple cycles to complete the store or read, which will greatly increase the vector calculation cycle. Setting a vector cache buffer can load data into the vector cache buffer in advance before starting the vector calculation, and store the final result in the data sram after the vector calculation is completed. Both prefetching and result storage of vector data can be done during other operations, reducing vector operation cycles. The port of the vector data cache module is wide to meet the bandwidth requirements of the vector operation unit.

The instruction is obtained from the RISC-V instruction space and cached in the cache, and it is judged whether the instruction is a branch jump instruction; in response to the instruction being a branch jump instruction, the instruction address is regenerated, and the corresponding branch is jumped according to the instruction address. When encountering a branch jump instruction, if the branch jump is established (or unconditional jump), the pc (instruction address) will be regenerated, and the instruction in the instruction cache will be cleared.

The architecture has three architecture branches, namely: general architecture branch, which is used to support general-purpose instructions and realize the functions of CPU; vector architecture branch, which is used to support RISC-V vector instruction set and complete vector operations; AIPU branch, which supports General load/store instructions, support custom user instructions, used to complete special intensive calculations such as convolution and matrix multiplication. Among them, the AIPU branch can establish a connection with the AIPU architecture. The AIPU branch configures the registers of each functional module through the load (load)/store (store) instructions of the CPU. The work of each functional module in the AIPU is only controlled by the registers and does not require the participation of CPU instructions. Therefore, the calculation efficiency is high but not flexible enough. for special large-scale computing. The vector architecture branch is controlled by the vector instructions of the CPU, and each step of the operation requires instruction control. It can be seen that the vector architecture branch is more flexible than the AIPU, but the calculation efficiency is lower, and it is suitable for small batches and diversified vector calculations. Since the vector operation involves a lot of data, how to speed up the data load and store is the key.

In response to jumping to the AIPU branch, the feature data and coefficient data for the current convolution operation are stored through the first-level input feature cache and the first-level coefficient cache, and stored through the second-level input feature cache and the second-level coefficient cache Feature data and coefficient data for the next convolution operation. The input feature vector buffer and coefficient vector buffer are mainly used to buffer the data to be calculated in the current clock cycle of the multiply-add operation unit, and these data are all calculated in parallel in the form of vectors. Since it is impossible to read all these data from the input feature (or coefficient) buffer in a single cycle, it is necessary to make reasonable use of the multiplexing characteristics of the input feature data and coefficient (weight) data in the convolution to hide the process of reading data in In the process of data calculation, the whole calculation process will not be interrupted.

FIG. 3 shows a schematic diagram of the AIPU architecture provided by this application. As shown in Figure 3, the AIPU architecture includes register files, DMA, read and write interface arbitration, address generators, convolution timing controllers, vector caches, multiply-add operation matrices, intermediate result accumulators, and special vector operation units.

The core of the AIPU architecture is the multiplication and addition matrix module, which contains a large number of multiplication and addition hardware resources, which can realize parallel and high-speed multiplication and addition operations to meet the computing power requirements of intensive convolution/matrix operations; other modules are for Make the convolution operation more efficient. For example, the data multiplexing introduced is to solve the problem that the data demand is large during the calculation, but the bandwidth of the data bus and SRAM is not enough. The read data is reused as much as possible to reduce the pressure on the bandwidth; buffer The setting of (cache) is to adjust the data throughput rate of the modules before and after the buffer, reduce the occurrence of blocking, and make each functional module run at full speed without blocking; the vector operation unit can provide different algorithm support according to the requirements of the convolution algorithm, so that First, the data can be read to complete the operation used, and then stored, instead of reading the data multiple times to complete a complete convolution calculation; the address generator cooperates with read and write control, and can read and write data through different read and write data. Sequential data arrangement makes the convolution operation more efficient; the convolutional neural network used in AI computing is usually divided into many layers, and the AI inference chip requires layer-by-layer calculation, each layer has a large number of convolution or matrix operations, including After the ping-pong register is established, the parameters required for the calculation of the next layer of AIPU, such as data dimensions and other information, can be configured while calculating the current layer. In this way, after the end of this layer, the calculation of the next layer can start immediately. It reduces the computing time of the entire neural network and improves the computing efficiency.

FIG. 4 shows a schematic diagram of a convolution operation in an embodiment of a method for data processing based on a RISC-V instruction set provided by the present application. As shown in Figure 4, in one calculation of the multiplication and addition operation matrix, the f0 vector block is simultaneously multiplied and added with w0...w7 (the vector block contains multiple vector elements, and the multiplication and addition operation is to multiply the corresponding vector elements and then multiply the elements of all elements. Multiply and accumulate, and the accumulated sum is the output result). Corresponding f0 and w0...w7 to the multiplication and addition matrix, f0 is equivalent to 8 times of copying, and do multiplication and addition operations with w0...w7 vector blocks respectively. In the same way, f1...f7 all need to do multiplication and addition with w0...w7. In this process, f0...f7 multiplex w0...w7 vector blocks, and each w vector block multiplexes the same f vector block. Therefore, in these 8 matrix operations, it is only necessary to take w0...w7 once, and then read a block of f vectors for each calculation. 8 operations require 8 clock cycles, and reading w0...w7 also requires 8 cycles, so the process of reading the w vector block can be hidden in the calculation process (that is, the process of reading data completely overlaps with the calculation process, no need Interrupt the data calculation process and wait for the data to be read). This is why the input feature vector buffer and coefficient vector buffer need to be set up.

The intermediate result cache is used to cache the intermediate results of vector calculations. According to the convolution principle, a vector multiplication and addition operation cannot obtain the final result, and the results of multiple multiplications and additions need to be accumulated. Therefore, a cache is set after the multiply-accumulate result accumulator. When the intermediate results are continuously accumulated to obtain the complete final result, the complete result needs to be stored in the complete result cache buffer. This cache buffer has multiple functions:

1. Avoid data being overwritten by subsequent intermediate results;

2. The cache buffer is shared by the subsequent activation modules, pooling modules, etc., and is used to store the input data and output data of these computing modules;

3. The module has a bus read and write control to send the final calculation data to the DMA interface.

In some embodiments, the method further includes: configuring the register file in the AIPU branch into two parts, the first part runs the current AIPU operation, and the second part obtains the parameters required by the AIPU for the next operation. The register file can be set as a system register when adding a compiler backend, and the configuration information is loaded by the load instruction. The register file is configured into two parts and performs ping-pong operation, that is: when the first part controls the AIPU operation process, the second part accepts the parameters required for the next calculation of the AIPU, and when the first part of the operation is completed, the second part of the register is converted to the currently available register. register. This can ensure continuous and uninterrupted work of the AIPU.

The principle of register file configuration and conversion is as follows: Since two sets of registers are added to the back end of the compiler when the chip architecture is described in the back end of the compiler, the compiler will find the corresponding registers according to the description of the registers in the architecture. For example, load r0, addr loads the data at address into register 0, load r1, addr loads the data into register 1. However, when the AIPU uses registers, it needs to determine which register is available. At this time, a "calculation complete" signal is required to alternately enable register 0 and register 1. During programming, after enabling an AIPU calculation, another register needs to be configured immediately to prepare for the next AIPU startup.

In some embodiments, the method further includes: in response to the instruction being a load or store instruction, reading the address in the storage space into the destination operand according to the address in the source operand. RISC-V instructions usually have one or two source operands rs1 and rs2, and the corresponding vector source operands are vs1 and vs2. The instruction sends the source operand to the corresponding execution unit (including data load, data storage, scalar calculation, vector calculation, etc.) according to the opcode (representing the type of calculation, such as addition, subtraction, multiplication, division, etc.). For example, when the opcode represents load/store, it indicates that the instruction is an access storage instruction, and the execution unit reads the address of the data storage space into the destination operand (rd or vd) according to the address in rs1.

In some embodiments, the method further includes: judging whether the vector registers corresponding to the vector source operands are in the same group; and in response to the vector registers corresponding to the vector source operands not being in the same group, making two Ports with the same bit width of the vector registers can be read and written at the same time. In order to speed up the load and store of the vector, you can set up multiple ports with the same width as the vector register. For example, there are 32 vector registers, the hardware divides these 32 vector registers into 4 groups, each group corresponds to a port. Enable the corresponding port according to the number of vector groups set by the vsetvli instruction. For example, if the vsetvli t0, a0, e8, m4 instructions set a group of 4 vector registers, the software divides the 32 registers into 8 groups, and the corresponding relationship with the hardware is 2 Each software vector group corresponds to one hardware vector group. If the vector registers vs1 and vs2 are in the same group during calculation, only one port is enabled to read and write, and if they are in two groups, two ports are enabled to read and write at the same time.

In some embodiments, the method further includes: according to the requirements of the operation, reading data corresponding to the operation and performing dimension transformation on the data corresponding to the operation, and writing the converted data into the corresponding coefficient buffer or input feature cache. According to the architecture of the AIPU, both the coefficient cache unit and the input feature cache unit need to read the weight and feature data from the shared external SRAM, and the address generator generates the corresponding SRAM address according to the register configuration. Convolution calculation or matrix operation will have different calculation methods according to different applications. For example, convolution calculation is divided into one-dimensional/two-dimensional/three-dimensional convolution, hole convolution, depthwise convolution, separated convolution, and transposed volume. accumulate and so on. Different calculation methods read data in different ways. Convolution calculation usually also transforms the dimension of the data accordingly, which requires the address generator to read data in different ways according to the configuration of the register, and complete these conversions in disguise . It can be seen that the functions of the address generator and read and write data control are: according to different calculation requirements, complete the reading of data and make corresponding dimension conversion, and then write the corresponding coefficient (weight) buffer unit or input feature (feature) ) in the cache unit.

The convolution timing control unit is the control core of the entire AIPU, which is responsible for collecting the status of each functional module, controlling the enabling of related modules, and generating the synchronization signal of the convolution operation. The convolution sync signal is the beat of the entire convolution process. The whole convolution process includes N (N>=1) beats, one beat contains M (>=1) clock cycles, and the cycle of multiply-add operation and accumulation operation is one clock cycle. Therefore, a beat contains M of these multiply-add and accumulate operations. The size of M is determined by the number of times of data multiplexing in the convolution process. For example, if the same set of coefficients is multiplexed 8 times, the minimum value of M is 8 (if the number of calculation cycles is enough to load the next set of data, then M is the calculation cycle; otherwise, M needs extra time to load the next set of data). Since the convolution calculation and data loading are performed synchronously, the synchronization signal of the data loading is the synchronization signal of the convolution calculation after a fixed read and write data cycle delay. Similarly, the synchronization signal of the accumulator is the delayed signal of the convolution synchronization signal after a fixed multiply-add operation cycle.

In some embodiments, the method further includes: in response to performing the convolution calculation, reading data in the first-level input feature buffer and the first-level coefficient buffer, and determining the first-level input feature buffer and the first-level coefficient buffer. whether the remaining space of the first-level coefficient cache is larger than the size of the next group of data; and in response to the remaining space of the first-level input feature cache and the first-level coefficient cache being larger than the size of the next group of data, enable write caching. The remaining available storage space of the input feature cache and coefficient cache is determined by the two processes of reading and writing data. Data writing reduces the remaining available space, and data reading increases the remaining available space. The convolution timing controller calculates the available space size of the cache according to the number of times of reading and writing the cache. When the data in the two caches is enough to start the convolution operation (for example, the coefficient data meets the requirement of the number of multiplexing, and the input feature data meets the requirement of multiple computations). requirements, and the calculation time is greater than or equal to the data loading time required for the next calculation), enable convolution. During the convolution calculation, the input feature buffer and coefficient buffer are continuously read, so that the remaining space of the two buffers gradually increases. When the remaining space is larger than the size of the next set of data, the write cache is enabled. Therefore, if the load time of the next set of data is less than the convolution calculation time of the previous set, the convolution calculation will run uninterrupted. If the calculation is fast and the data loading is slow, there will be interruptions in the convolution calculation process.

The convolution operation is performed according to the corresponding feature data and coefficient data, and the results obtained by the operation are activated, normalized and pooled. In the convolution operation, the data needs to be activated (such as Relu), normalized, and pooled after the multiplication and addition calculation is completed. If a vector operation unit with slow computational efficiency outside the AIPU is used, a large number of intermediate results of multiplication and addition operations will be accumulated before the vector operation, waiting for activation or pooling, and the efficiency of the entire convolution operation will be lowered by the vector operation unit. Therefore, vector operations such as activations required for convolution are specialized and placed after the multiply-add matrix unit. Activation, etc. The dedicated vector calculation unit can be connected in series with the multiply-accumulate unit and the accumulation unit, or can work independently, and the intermediate result cache unit is shared by these dedicated vector calculation units.

Key points of the embodiments of the present application:

(1) According to the RISC-V instruction set, the processor architecture of three instruction branches is designed, namely: general instruction branch, vector instruction branch, and AIPU branch;

(2) The AIPU architecture is designed. The AIPU is combined with the RISC-V architecture in the form of an accelerator. It has a dedicated register file and is configured by the RISC-V load/store instruction to accelerate convolution and matrix operations;

(3) The architecture of the AIPU multiply-add operation array is designed, which is a two-dimensional parallel multiply-add operation unit. There are two vector cache buffers, which cooperate with the previous input feature cache buffer and coefficient cache buffer to form a two-level double buffer. The front-level double buffer (the purpose is to make the subsequent units have continuous data) is composed of the input feature buffer, coefficient buffer and convolution control unit. Using the method of real-time monitoring of the remaining space, the data is continuously read out of the buffer at the same time. , write the data required for the next step into the buffer. The back-end buffer is to realize the functions of increasing bandwidth and data multiplexing.

(4) The design of each cache buffer in the AIPU, and the difference in the data throughput rate between the functional modules at all levels of the buffer convolution operation is reasonable;

(5) A flexible address generator is designed: according to the configuration of the register, the address generator is matched with the buffer of the latter stage to complete the transformation of the data dimension while reading the data;

(6) The ping-pong operation register is designed to ensure the uninterrupted operation of the two different convolution operations before and after.

The application of the architecture in the embodiment of the present application is very flexible, and has both the control function of a general-purpose CPU and the computing power required by AI. It can be applied to edge-end machines of artificial intelligence and IoT. It can also achieve greater computing power through the Internet-on-Chip (NoC), and install it on a PC or server in the form of an accelerator card to realize cloud-based reasoning or training.

It should be particularly pointed out that the steps in the above-mentioned various embodiments of the method for data processing based on the RISC-V instruction set can be intersected, replaced, added, and deleted. The data processing method of the -V instruction set should also belong to the protection scope of the present application, and the protection scope of the present application should not be limited to the embodiments.

Based on the above purpose, in a second aspect of the embodiments of the present application, a system for data processing based on the RISC-V instruction set is proposed, including: an acquisition module configured to acquire an instruction cache from the RISC-V instruction space and store it in the cache , and determine the type of the instruction; the jump module is configured to regenerate the instruction address in response to the instruction being a branch jump instruction, and jump to the corresponding branch according to the instruction address; the AIPU module is configured to respond It jumps to the AIPU branch, stores the feature data and coefficient data for the current convolution operation through the first-level input feature cache and the first-level coefficient cache, and stores the next level through the second-level input feature cache and the second-level coefficient cache. feature data and coefficient data of a one-step convolution operation; and an execution module configured to perform a convolution operation according to the corresponding feature data and coefficient data, and activate, normalize and pool the results obtained by the operation.

In some embodiments, the system further includes a vector module configured to perform vector operations according to the instructions in response to jumping to the vector architecture branch.

In some embodiments, the system further includes a first judgment module configured to: in response to the instruction being a load or store instruction, read the address of the storage space into the destination operand according to the address in the source operand.

In some embodiments, the system further includes a second judgment module configured to: judge whether the vector registers corresponding to the vector source operands are in the same group; and respond that the vector registers corresponding to the vector source operands are not in the same group Inside, two ports with the same bit width as the vector register can be read and written at the same time.

In some embodiments, the system further includes a configuration module configured to configure the register file in the AIPU branch into two parts, the first part runs the current AIPU operation, and the second part obtains the parameters required by the AIPU for the next operation.

In some embodiments, the system further includes a conversion module configured to read data corresponding to the operation and perform dimension conversion on the data corresponding to the operation according to the requirements of the operation, and write the converted data into into the corresponding coefficient buffer or input feature buffer.

In some embodiments, the system further includes a computing module configured to read data in the first-level input feature buffer and the first-level coefficient buffer in response to performing the convolution calculation, and determine the first-level input feature Whether the remaining space of the cache and the first-level coefficient cache is larger than the size of the next set of data; and in response to the first-level input feature cache and the remaining space of the first-level coefficient cache being larger than the size of the next set of data, enable write caching .

Based on the above purpose, in a third aspect of the embodiments of the present application, a computer device is proposed, including: at least one processor; and a memory, where the memory stores computer instructions that can be executed on the processor, and the instructions are executed by the processor to The following steps are implemented: S1. Obtain the instruction from the RISC-V instruction space and cache it in the cache, and determine the type of the instruction; S2. In response to the instruction being a branch jump instruction, regenerate the instruction address, and jump to the corresponding instruction address according to the instruction address. branch; S3, in response to jumping to the AIPU branch, store the feature data and coefficient data for the current convolution operation through the first-level input feature cache and the first-level coefficient cache, and pass the second-level input feature cache and the first-level coefficient cache. The second-level coefficient cache stores the feature data and coefficient data of the next convolution operation; and S4, performs convolution operation according to the corresponding feature data and coefficient data, and activates, normalizes and pools the result obtained by the operation.

In some embodiments, the steps further comprise: in response to jumping to the vector architecture branch, performing a vector operation according to the instruction.

In some embodiments, the step further includes: in response to the instruction being a load or store instruction, reading the address in the storage space into the destination operand according to the address in the source operand.

In some embodiments, the steps further include: judging whether the vector registers corresponding to the vector source operands are in the same group; and in response to the vector registers corresponding to the vector source operands not being in the same group, making two Ports with the same bit width of the vector registers can be read and written at the same time.

In some embodiments, the step further includes: configuring the register file in the AIPU branch into two parts, the first part runs the current AIPU operation, and the second part obtains the parameters required by the AIPU for the next operation.

In some embodiments, the step further includes: reading data corresponding to the operation and performing dimension transformation on the data corresponding to the operation according to the requirements of the operation, and writing the converted data into the corresponding coefficients Cache or input feature cache.

In some embodiments, the step further includes: in response to performing the convolution calculation, reading the data in the first-level input feature buffer and the first-level coefficient buffer, and judging the first-level input feature buffer and the first-level coefficient buffer. whether the remaining space of the first-level coefficient cache is larger than the size of the next group of data; and in response to the remaining space of the first-level input feature cache and the first-level coefficient cache being larger than the size of the next group of data, enable write caching.

As shown in FIG. 5 , it is a schematic diagram of the hardware structure of an embodiment of the above-mentioned computer device for data processing based on the RISC-V instruction set provided for this application.

Taking the device shown in FIG. 5 as an example, the device includes a processor 201 and a memory 202 , and may also include an input device 203 and an output device 204 .

The processor 201 , the memory 202 , the input device 203 and the output device 204 may be connected by a bus or in other ways, and the connection by a bus is taken as an example in FIG. 5 .

As a non-volatile computer-readable storage medium, the memory 202 can be used to store non-volatile software programs, non-volatile computer-executable programs and modules. The program instruction/module corresponding to the data processing method. The processor 201 executes various functional applications and data processing of the server by running the non-volatile software programs, instructions and modules stored in the memory 202, that is, the data processing based on the RISC-V instruction set of the above method embodiments is implemented. Methods.

The memory 202 may include a storage program area and a storage data area, wherein the storage program area may store an operating system and an application program required by at least one function; the storage data area may store the use of the method for data processing based on the RISC-V instruction set created data, etc. Additionally, memory 202 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, memory 202 may optionally include memory located remotely from processor 201, which may be connected to local modules via a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.

The input device 203 can receive input information such as user name and password. The output device 204 may include a display device such as a display screen.

One or more program instructions/modules corresponding to the method for data processing based on the RISC-V instruction set are stored in the memory 202, and when executed by the processor 201, the execution based on the RISC-V instruction set in any of the above method embodiments is performed. method of data processing.

Any embodiment of a computer device that executes the above-mentioned method for data processing based on a RISC-V instruction set can achieve the same or similar effects as any of the foregoing method embodiments corresponding to it.

The present application also provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program that executes the above method when executed by a processor.

As shown in FIG. 6 , it is a schematic diagram of an embodiment of the above-mentioned computer storage medium for data processing based on the RISC-V instruction set provided for this application. Taking the computer storage medium shown in FIG. 6 as an example, the computer readable storage medium 3 stores a computer program 31 that executes the above method when executed by the processor.

Finally, it should be noted that those of ordinary skill in the art can understand that all or part of the process in the method of the above-mentioned embodiments can be implemented by instructing the relevant hardware through a computer program. The program of the method for data processing based on the RISC-V instruction set can be Stored in a computer-readable storage medium, when the program is executed, it may include the processes of the above-mentioned method embodiments. Wherein, the storage medium of the program may be a magnetic disk, an optical disk, a read only memory (ROM) or a random access memory (RAM) or the like. The above computer program embodiments can achieve the same or similar effects as any of the foregoing method embodiments corresponding thereto.

The above are exemplary embodiments disclosed in the present application, but it should be noted that various changes and modifications may be made without departing from the scope of the disclosure of the embodiments of the present application defined by the claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements disclosed in the embodiments of the present application may be described or claimed in an individual form, unless explicitly limited to the singular, they may also be construed as a plurality.

It should be understood that, as used herein, the singular form "a" is intended to include the plural form as well, unless the context clearly supports an exception. It will also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items.

The above-mentioned embodiments of the present application disclose the serial numbers of the embodiments only for description, and do not represent the advantages and disadvantages of the embodiments.

Those of ordinary skill in the art can understand that all or part of the steps of implementing the above embodiments can be completed by hardware, or can be completed by instructing relevant hardware through a program, and the program can be stored in a computer-readable storage medium. The storage medium can be a read-only memory, a magnetic disk or an optical disk, and the like.

Those of ordinary skill in the art should understand that the discussion of any of the above embodiments is only exemplary, and is not intended to imply that the scope of the disclosure of the embodiments of the present application is limited to these examples; under the idea of the embodiments of the present application, the above embodiments or Combinations of technical features in different embodiments are also possible, and there are many other variations of different aspects of the embodiments of the present application as above, which are not provided in detail for the sake of brevity. Therefore, any omission, modification, equivalent replacement, improvement, etc. made within the spirit and principle of the embodiments of the present application should be included within the protection scope of the embodiments of the present application.

Claims

A method for data processing based on a RISC-V instruction set, characterized in that it comprises the following steps:

Obtain the instruction cache from the RISC-V instruction space into the cache, and determine the type of the instruction;

In response to the instruction being a branch jump instruction, regenerate the instruction address, and jump to the corresponding branch according to the instruction address;

In response to jumping to the AIPU branch, the feature data and coefficient data for the current convolution operation are stored through the first-level input feature cache and the first-level coefficient cache, and stored through the second-level input feature cache and the second-level coefficient cache Feature data and coefficient data for the next convolution operation; and

The convolution operation is performed according to the corresponding feature data and coefficient data, and the results obtained by the operation are activated, normalized and pooled.
The method of claim 1, further comprising:

In response to jumping to the vector architecture branch, vector operations are performed according to the instruction.
The method of claim 1, further comprising:

In response to the instruction being a load or store instruction, the address in the storage space is read into the destination operand according to the address in the source operand.
The method of claim 1, further comprising:

Determine whether the vector registers corresponding to the vector source operands are in the same group; and

In response to the vector registers corresponding to the vector source operands not being in the same group, two ports with the same bit width as the vector registers are simultaneously read and written.
The method of claim 1, further comprising:

The register file in the AIPU branch is configured into two parts, the first part runs the current AIPU operation, and the second part obtains the parameters required by the AIPU for the next operation.
The method of claim 5, further comprising:

According to the requirements of the operation, read the data corresponding to the operation and perform dimension conversion on the data corresponding to the operation, and write the converted data into the corresponding coefficient buffer or input feature buffer.
The method of claim 1, further comprising:

In response to performing the convolution calculation, read the data in the first-level input feature cache and the first-level coefficient cache, and determine whether the remaining space of the first-level input feature cache and the first-level coefficient cache is larger than the next level. the size of the group data;

In response to the remaining space of the first-level input feature cache and the first-level coefficient cache being larger than the size of the next group of data, the write cache is enabled.
A system for data processing based on the RISC-V instruction set, characterized in that it includes:

an acquisition module, configured to acquire instructions from the RISC-V instruction space and cache them in the cache, and determine the type of the instructions;

a jump module, configured to regenerate an instruction address in response to the instruction being a branch jump instruction, and jump to a corresponding branch according to the instruction address;

The AIPU module is configured to, in response to jumping to the AIPU branch, store feature data and coefficient data for the current convolution operation through the first-level input feature cache and the first-level coefficient cache, and store feature data and coefficient data for the current convolution operation through the second-level input feature cache and the first-level coefficient cache The second-level coefficient cache stores feature data and coefficient data for the next convolution operation; and

The execution module is configured to perform a convolution operation according to the corresponding feature data and coefficient data, and activate, normalize and pool the results obtained by the operation.
A computer equipment, characterized in that, comprising:

at least one processor; and

a memory storing computer instructions executable on the processor, the instructions implementing the steps of the method of any one of claims 1-7 when executed by the processor.
A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, the steps of the method of any one of claims 1-7 are implemented.