CN112860320A - Method, system, device and medium for data processing based on RISC-V instruction set


Info

Publication number: CN112860320A
Application number: CN202110175746.6A
Authority: CN (China)
Prior art keywords: instruction, data, cache, coefficient, vector
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Other languages: Chinese (zh)
Inventor: 贾兆荣 (Jia Zhaorong)
Current assignee: Shandong Yingxin Computer Technology Co Ltd (the listed assignee may be inaccurate; Google has not performed a legal analysis)
Original assignee: Shandong Yingxin Computer Technology Co Ltd
Application filed by Shandong Yingxin Computer Technology Co Ltd
Priority application: CN202110175746.6A (the priority date is an assumption and is not a legal conclusion)
Publication: CN112860320A
PCT application: PCT/CN2022/074414, published as WO2022170997A1

Classifications

    • G - PHYSICS > G06 - COMPUTING; CALCULATING OR COUNTING > G06F - ELECTRIC DIGITAL DATA PROCESSING > G06F 9/00 - Arrangements for program control, e.g. control units > G06F 9/06 - using stored programs > G06F 9/30 - Arrangements for executing machine instructions, e.g. instruction decode
        • G06F 9/30047 - Prefetch instructions; cache control instructions
        • G06F 9/30007 - Arrangements for executing specific machine instructions to perform operations on data operands
        • G06F 9/3005 - Arrangements for executing specific machine instructions to perform operations for flow control
        • G06F 9/30105 - Register structure
        • G06F 9/38 - Concurrent instruction execution, e.g. pipeline or look ahead > G06F 9/3802 - Instruction prefetching
            • G06F 9/3804 - Instruction prefetching for branches, e.g. hedging, branch folding
            • G06F 9/3814 - Implementation provisions of instruction buffers, e.g. prefetch buffer; banks
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS > Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE > Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
        • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Advance Control (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention discloses a method, a system, a device and a storage medium for data processing based on a RISC-V instruction set. The method comprises the following steps: fetching an instruction from a RISC-V instruction space, caching it in a cache, and judging the type of the instruction; in response to the instruction being a branch jump instruction, regenerating an instruction address and jumping to the corresponding branch according to the instruction address; in response to jumping to the AIPU branch, storing the feature data and coefficient data for the current convolution operation in a first-level input feature cache and a first-level coefficient cache, and storing the feature data and coefficient data for the next convolution operation in a second-level input feature cache and a second-level coefficient cache; and performing a convolution operation on the corresponding feature data and coefficient data, and activating, normalizing and pooling the operation result. The invention designs a processor architecture with three instruction branches based on the RISC-V instruction set, realizes general control, vector operation, and accelerated convolution and matrix computation, and is suitable for an edge-side AI inference chip.

Description

Method, system, device and medium for data processing based on RISC-V instruction set
Technical Field
The present invention relates to the field of data processing, and more particularly, to a method, a system, a computer device and a readable medium for data processing based on the RISC-V instruction set.
Background
The value of data lies in analysis and use, not mere storage. Data volume keeps growing, yet not all of it can be sent over the network to the cloud, because bandwidth grows more slowly than data does. Application scenarios with strict real-time requirements, such as automatic and unmanned driving, need decisions made at the edge. Scenarios with strict privacy requirements, such as medical information or data a user does not want to share with the cloud, need the data kept local. Moreover, most data produced by, for example, security devices is useless or has no mining value; transmitting all of it to the cloud wastes bandwidth, whereas performing intelligent analysis at the edge and uploading only useful or potentially useful data saves a great deal of network bandwidth. Data processing is therefore inevitably shifting from the cloud to the edge, and an edge AI (artificial intelligence) chip is the natural consequence.
Artificial-intelligence processing at the edge requires an AI chip, whose main challenges are computing power and computing efficiency. The computing power of an AI chip is determined by the number of on-chip computation units. Since AI computation involves very large amounts of data, in theory the more computing power the better; in practice, computing power is limited by several factors:
1. On-chip memory bandwidth and bus bandwidth: the central contradiction of an AI chip is that between memory bandwidth and computing power. The greater the computing power, the larger the input data, intermediate results and output data, and the higher the required memory bandwidth; current memory bandwidth falls far short of what the computation demands. If the computation units and storage units are not arranged sensibly, the result is high nominal computing power but low efficiency.
2. AI computation involves many operators, such as convolution, matrix computation, normalization, activation, pooling and other linear and nonlinear calculations. Deep neural network models usually consist of multiple layers, with the output of one layer serving as the input of the next; within a layer, the result of the multiply-add operation is often the input of activation, pooling and normalization. If multithreading, parallel computing and compute pipelining cannot be arranged sensibly, one step's computation blocks the next, wasting resources and reducing computing efficiency.
3. As shown in FIG. 2, AI involves varied operators, but the AI chip hardware is fixed and invariant. Making fixed hardware process variable operators efficiently requires software that can allocate hardware resources according to the hardware architecture and compile efficient machine code. The AI chip must also offer efficient control capability.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method, a system, a computer device and a computer-readable storage medium for data processing based on the RISC-V instruction set. An AIPU (artificial intelligence unit) and the CPU share memory, so that computation sits next to storage: memory access bandwidth is improved, data interaction between the AIPU and the CPU is convenient, the amount of data exchanged with the external bus is reduced, and the demand on bus bandwidth is lowered. In addition, the AIPU and the CPU each contain small buffers for caching input data, intermediate results, output data and the instructions pre-fetched by the CPU, so that data can be loaded while data is being computed, the time available for reading and writing data is extended, and the demand on bus bandwidth is further reduced.
In view of the above, an aspect of the embodiments of the present invention provides a method for data processing based on the RISC-V instruction set, comprising the following steps: fetching an instruction from a RISC-V instruction space, caching it in a cache, and judging the type of the instruction; in response to the instruction being a branch jump instruction, regenerating an instruction address and jumping to the corresponding branch according to the instruction address; in response to jumping to the AIPU branch, storing the feature data and coefficient data for the current convolution operation in a first-level input feature cache and a first-level coefficient cache, and storing the feature data and coefficient data for the next convolution operation in a second-level input feature cache and a second-level coefficient cache; and performing a convolution operation on the corresponding feature data and coefficient data, and activating, normalizing and pooling the operation result.
In some embodiments, the method further comprises: in response to jumping to a vector architecture branch, performing a vector operation in accordance with the instruction.
In some embodiments, the method further comprises: in response to the instruction being a load or store instruction, reading the data at the storage-space address given by the source operand into the destination operand.
In some embodiments, the method further comprises: judging whether the vector registers corresponding to the vector source operands are in the same group; and in response to the vector registers corresponding to the vector source operands not being in the same group, enabling two ports with the same bit width as the vector registers to read and write simultaneously.
In some embodiments, the method further comprises: configuring the register file in the AIPU branch as two parts, wherein the first part runs the current AIPU operation and the second part acquires the parameters required for the next AIPU operation.
In some embodiments, the method further comprises: reading the corresponding data, performing dimension conversion on the data according to the operation requirement, and writing the converted data into the corresponding coefficient cache or input feature cache.
In some embodiments, the method further comprises: in response to convolution calculation, reading the data in the first-level input feature cache and the first-level coefficient cache, and judging whether the remaining space of the first-level input feature cache and the first-level coefficient cache is larger than the size of the next group of data; and in response to the remaining space of the first-level input feature cache and the first-level coefficient cache being larger than the size of the next group of data, enabling the write cache.
In another aspect of the embodiments of the present invention, a system for data processing based on the RISC-V instruction set is provided, comprising: an acquisition module configured to fetch an instruction from a RISC-V instruction space, cache it in a cache, and judge the type of the instruction; a jump module configured to, in response to the instruction being a branch jump instruction, regenerate an instruction address and jump to the corresponding branch according to the instruction address; an AIPU module configured to, in response to jumping to the AIPU branch, store the feature data and coefficient data for the current convolution operation in a first-level input feature cache and a first-level coefficient cache, and store the feature data and coefficient data for the next convolution operation in a second-level input feature cache and a second-level coefficient cache; and an execution module configured to perform a convolution operation on the corresponding feature data and coefficient data, and activate, normalize and pool the operation result.
In another aspect of the embodiments of the present invention, there is also provided a computer device, including: at least one processor; and a memory storing computer instructions executable on the processor, the instructions when executed by the processor implementing the steps of the method as above.
In a further aspect of the embodiments of the present invention, a computer-readable storage medium is also provided, storing a computer program which, when executed by a processor, implements the above method steps.
The invention has the following beneficial technical effects: the AIPU (artificial intelligence unit) and the CPU share memory, so that computation sits next to storage, memory access bandwidth is improved, data interaction between the AIPU and the CPU is convenient, the amount of data exchanged with the external bus is reduced, and the demand on bus bandwidth is lowered. In addition, small buffers are provided in the AIPU and the CPU for caching input data, intermediate results, output data and the instructions pre-fetched by the CPU, so that data can be loaded while data is being computed, the time available for reading and writing data is extended, and the demand on bus bandwidth is further reduced.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other embodiments can be obtained by using the drawings without creative efforts.
FIG. 1 is a diagram illustrating an embodiment of a method for data processing based on RISC-V instruction set according to the present invention;
FIG. 2 is a schematic diagram of a CPU architecture according to an embodiment of the present invention;
FIG. 3 is a diagram of the AIPU architecture provided by the present invention;
FIG. 4 is a diagram illustrating convolution operations according to an embodiment of the method for processing data based on RISC-V instruction set;
FIG. 5 is a diagram of a hardware structure of an embodiment of a computer device for data processing based on RISC-V instruction set according to the present invention;
FIG. 6 is a diagram of an embodiment of a computer storage medium for data processing based on a RISC-V instruction set.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.
It should be noted that all expressions using "first" and "second" in the embodiments of the present invention are used to distinguish two entities or parameters that share a name. "First" and "second" are merely for convenience of description and should not be construed as limiting the embodiments of the present invention, and the following embodiments do not repeat this note.
In view of the above objects, a first aspect of embodiments of the present invention proposes an embodiment of a method for data processing based on a RISC-V instruction set. FIG. 1 is a diagram illustrating an embodiment of a method for data processing based on a RISC-V instruction set according to the present invention. As shown in fig. 1, the embodiment of the present invention includes the following steps:
S1, fetching an instruction from the RISC-V instruction space, caching it in a cache, and judging the type of the instruction;
S2, in response to the instruction being a branch jump instruction, regenerating an instruction address and jumping to the corresponding branch according to the instruction address;
S3, in response to jumping to the AIPU branch, storing the feature data and coefficient data for the current convolution operation in the first-level input feature cache and first-level coefficient cache, and storing the feature data and coefficient data for the next convolution operation in the second-level input feature cache and second-level coefficient cache; and
S4, performing a convolution operation on the corresponding feature data and coefficient data, and activating, normalizing and pooling the operation result.
The embodiment of the invention adopts an integrated storage-computation structure in which the AIPU and the CPU share memory, so that computation sits next to storage, memory access bandwidth is improved, data interaction between the AIPU and the CPU is convenient, the amount of data exchanged with the external bus is reduced, and the demand on bus bandwidth is lowered. In addition, small buffers are provided in the AIPU and the CPU for caching input data, intermediate results, output data and the instructions pre-fetched by the CPU, so that data can be loaded while data is being computed, the time available for reading and writing data is extended, and the demand on bus bandwidth is further reduced.
The RISC-V instruction set includes a general instruction set and a vector extension instruction set, and can be divided into: the integer instruction set I, the integer multiplication and division instruction set M, the atomic instruction set A, the single-precision floating-point instruction set F, the double-precision floating-point instruction set D, the compressed instruction set C, and the vector instruction set V. The arithmetic logic unit executes the IMAFDC instruction sets, and the vector operation unit executes the vector instruction set V. The CPU architecture is designed around the RISC-V instruction set; the CPU's role is to run system code, completing system control and data operations.
FIG. 2 is a schematic diagram of the CPU architecture in an embodiment of the present invention. As shown in FIG. 2, the CPU adopts a two-stage pipeline. The first stage is instruction fetch, responsible for fetching an instruction from the instruction storage space and caching it in the instruction cache. The second stage decodes and executes the instruction. Decoding determines the instruction type (vector instruction or ordinary instruction), and the corresponding data operation is started according to the instruction type and opcode. For a vector add instruction, for example, data is read from the vector data store into a vector register, the operation completes in the vector operation unit, and the result is cached in the vector data cache.
The purpose of the vector data cache is as follows: in AI inference, vector operations are usually not independent; a computation is often completed by pipelining several vector operations. If an intermediate result were stored to SRAM (static random-access memory), the vector data might need several cycles to store or read back, greatly lengthening the vector computation cycle. With a vector data cache, data can be loaded into the cache before vector computation starts, and after the computation finishes the final result is stored to the data SRAM. The pre-reading of vector data and the storing of results can be completed during other operations, shortening the vector operation period. The port of the vector data cache is wide, meeting the bandwidth requirement of the vector operation unit.
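The two-stage flow described above (fetch into a small instruction cache, then decode and dispatch by type) can be sketched in a few lines of C. This is an illustrative model, not the patent's hardware: the opcode values follow the published RV32 base encoding (bits [6:0] of the instruction word), while the cache contents and dispatch messages are invented for the example.

```c
#include <stdint.h>
#include <stdio.h>

/* Stage 2 of the pipeline: classify a fetched instruction word and
 * dispatch it. Opcodes follow the standard RV32 encoding. */
typedef enum { GENERAL, VECTOR, BRANCH, LOADSTORE } insn_class;

static insn_class classify(uint32_t insn) {
    switch (insn & 0x7fu) {       /* opcode field, bits [6:0] */
    case 0x57u: return VECTOR;    /* OP-V: vector arithmetic   */
    case 0x63u: return BRANCH;    /* conditional branch        */
    case 0x03u:                   /* load                      */
    case 0x23u: return LOADSTORE; /* store                     */
    default:    return GENERAL;
    }
}

int main(void) {
    /* Stage 1 has already fetched these words into the instruction cache. */
    uint32_t icache[] = { 0x020081d7u,   /* a vector instruction (opcode 0x57) */
                          0x00000063u }; /* beq x0, x0, 0 (opcode 0x63)        */
    for (unsigned i = 0; i < sizeof icache / sizeof icache[0]; i++) {
        switch (classify(icache[i])) {
        case VECTOR:    puts("vector unit: operands staged through the vector data cache"); break;
        case BRANCH:    puts("branch: if taken, regenerate the pc and clear the instruction cache"); break;
        case LOADSTORE: puts("load/store unit"); break;
        default:        puts("arithmetic logic unit");
        }
    }
    return 0;
}
```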
An instruction is fetched from the RISC-V instruction space and cached, and whether it is a branch jump instruction is judged. In response to the instruction being a branch jump instruction, the instruction address is regenerated and execution jumps to the corresponding branch according to that address. When a branch jump instruction is encountered and the branch is taken (or the jump is unconditional), the PC (instruction address) is regenerated and the instructions in the instruction cache are cleared.
The architecture has three branches: a general architecture branch, which supports the general instruction sets and implements the CPU functions; a vector architecture branch, which supports the RISC-V vector instruction set and performs vector operations; and an AIPU branch, which supports general load/store instructions and custom user instructions and completes specialized intensive computation such as convolution and matrix multiplication. The AIPU branch connects to the AIPU architecture: it configures the registers of all AIPU functional modules through the CPU's load/store instructions, and the functional modules inside the AIPU are then controlled only by those registers, with no further CPU instructions involved. Computation is therefore efficient but not very flexible, suiting specialized large-scale computation. The vector architecture branch is controlled by the CPU's vector instructions, each step requiring instruction control; it is more flexible than the AIPU but less efficient, suiting small-batch, diversified vector computation. Since vector operations involve a great deal of data, accelerating vector loads and stores is critical.
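To make the register-driven AIPU branch concrete, the following bare-metal-style C sketch launches one layer purely through memory-mapped register writes, each compiling to an ordinary store instruction. The register layout, field names and base address are assumptions made for the example; the patent states only that the AIPU's functional modules are controlled by registers configured via load/store.

```c
#include <stdint.h>

/* Hypothetical memory-mapped view of the AIPU configuration registers. */
typedef struct {
    volatile uint32_t feature_base; /* input feature address in shared SRAM */
    volatile uint32_t coeff_base;   /* coefficient (weight) address         */
    volatile uint32_t dims;         /* packed tensor dimensions             */
    volatile uint32_t start;        /* write 1 to launch the layer          */
} aipu_regs;

#define AIPU ((aipu_regs *)0x40001000u) /* invented base address */

static void aipu_run_layer(uint32_t feat, uint32_t coef, uint32_t dims) {
    AIPU->feature_base = feat; /* each assignment is one store instruction */
    AIPU->coeff_base   = coef;
    AIPU->dims         = dims;
    AIPU->start        = 1;    /* from here the hardware sequences itself;
                                  no further CPU instructions are needed  */
}
```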
In response to jumping to the AIPU branch, the feature data and coefficient data for the current convolution operation are stored in the first-level input feature cache and the first-level coefficient cache, and the feature data and coefficient data for the next convolution operation are stored in the second-level input feature cache and the second-level coefficient cache. The input feature vector cache and the coefficient vector cache mainly buffer the data the multiply-add unit is to compute in the current clock cycle; the data are computed in parallel as vectors. Since these data cannot all be read out of the input feature (or coefficient) cache in a single cycle, the reuse of input feature data and coefficient (weight) data inherent in convolution must be exploited so that data reading is hidden inside data computation and the whole computing process is never interrupted.
FIG. 3 is a schematic diagram of the AIPU architecture provided by the present invention. As shown in FIG. 3, the AIPU architecture includes a register file, a DMA, read/write interface arbitration, an address generator, a convolution timing controller, vector caches, a multiply-add operation matrix, an intermediate result accumulator, and special vector operation units.
The core of the AIPU is the multiply-add operation matrix, which contains a large number of multiplier and adder resources and performs parallel, high-speed multiply-add operations, meeting the computing-power demand of intensive convolution/matrix operations. The other modules exist to make convolution more efficient. Data multiplexing addresses the contradiction that computation needs much data while the data bus and SRAM bandwidth are limited: read data is reused as much as possible, easing the bandwidth pressure. The caches absorb the differences in data throughput between upstream and downstream modules, reducing stalls so that all functional modules run at full speed. The vector operation units provide the algorithmic support the convolution algorithm needs, so that once data is read the required operations can be finished before the data is stored, instead of reading the data repeatedly to complete one full convolution. The address generator, together with read/write control, arranges data through different read/write orders, making convolution more efficient. The convolutional neural networks used in AI computation generally have many layers, which an AI inference chip must compute layer by layer, each layer containing a large number of convolution or matrix operations; with a ping-pong register in place, the parameters required for the next layer of AIPU computation, such as data dimensions, can be configured while the current layer is being computed.
FIG. 4 is a diagram illustrating convolution operations in an embodiment of the method for data processing based on the RISC-V instruction set according to the present invention. As shown in FIG. 4, in one computation of the multiply-add matrix, the vector block f0 undergoes multiply-add operations with w0 … w7 simultaneously (a vector block contains several vector elements; a multiply-add operation multiplies corresponding elements and accumulates the products into the output). f0 and w0 … w7 are fed into the multiply-add matrix, f0 is copied 8 times, and the copies undergo multiply-add operations with the vector blocks w0 … w7 respectively. Likewise, f1 … f7 must each be multiplied and accumulated with w0 … w7. In this process f0 … f7 reuse the vector blocks w0 … w7, and every w block reuses the same f block. Therefore, across these 8 matrix operations, w0 … w7 need to be fetched only once, and one new f vector block is read per computation. The 8 operations take 8 clock cycles, and reading the next w0 … w7 also takes 8 cycles, so the reading of w blocks can be hidden inside the computation (the data-reading process overlaps completely with computation, and computation never pauses to wait for data). This is why an input feature vector cache and a coefficient vector cache are needed.
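The reuse argument can be checked with a small model, assuming a vector-block length K and arbitrary data: w0 … w7 stay resident for eight compute cycles while one new f block is consumed per cycle, which is exactly the overlap described above.

```c
#include <stdio.h>

#define K 4 /* illustrative vector-block length */

/* one multiply-add pass: multiply corresponding elements, sum the products */
static int dot(const int f[K], const int w[K]) {
    int acc = 0;
    for (int i = 0; i < K; i++) acc += f[i] * w[i];
    return acc;
}

int main(void) {
    int f[8][K], w[8][K], out[8][8];
    for (int i = 0; i < 8; i++)
        for (int k = 0; k < K; k++) { f[i][k] = i + k; w[i][k] = i - k; }

    /* w0..w7 are fetched once and stay resident for all 8 cycles; each
     * cycle reads ONE new f block and broadcasts it across the 8 columns,
     * so the 8-cycle reload of the next w group hides behind compute. */
    for (int cycle = 0; cycle < 8; cycle++)      /* one f block per cycle   */
        for (int j = 0; j < 8; j++)              /* f[cycle] copied 8 times */
            out[cycle][j] = dot(f[cycle], w[j]);

    printf("out[0][0] = %d\n", out[0][0]);
    return 0;
}
```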
The intermediate result buffer caches intermediate results of vector computation: by the principle of convolution, the final result cannot be obtained from a single vector multiply-add; several multiply-add results must be accumulated. A buffer is therefore placed after the multiply-add result accumulator. When continued accumulation of intermediate results yields a complete final result, that result is stored in the complete-result buffer. This buffer has several roles:
1. avoiding data being overwritten by later intermediate results;
2. the cache buffer is shared by a following activation module, a pooling module and the like and is used for storing input data and output data of the calculation modules;
3. the module has bus read-write control and sends the final calculation data to the DMA interface.
In some embodiments, the method further comprises: configuring the register file in the AIPU branch as two parts, the first running the current AIPU operation and the second acquiring the parameters required for the next AIPU operation. The register file can be declared as system registers when the compiler back end is extended, and the configuration information is loaded by load instructions. The register file is configured as two parts that perform ping-pong operation: while the first part controls the current AIPU run, the second part receives the parameters required for the next AIPU computation, and when the first part's run finishes, the second part's registers become the currently active ones. This keeps the AIPU running continuously without interruption.
The register-file configuration and switching principle is as follows: two sets of registers are declared when the compiler back end describes the chip architecture, so the compiler finds the corresponding registers from the register description in the architecture. For example, load r0, address loads the data at address into register set 0, and load r1, address loads it into register set 1. When the AIPU uses the registers, however, it must know which set is active, so a "compute done" signal alternately enables register set 0 and register set 1. When programming, immediately after one AIPU computation is enabled, the other register set should be configured in preparation for the next AIPU start.
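A minimal sketch of this ping-pong discipline, with invented field names: while one bank drives the current run, software fills the other, and the "compute done" event swaps them.

```c
#include <stdint.h>
#include <stdio.h>

typedef struct { uint32_t feature_base, coeff_base, dims; } layer_cfg;

static layer_cfg bank[2]; /* register set 0 and register set 1   */
static int active;        /* bank currently controlling the AIPU */

/* Fill the shadow bank with the next layer's parameters while the
 * active bank drives the current computation. */
static void prepare_next_layer(uint32_t f, uint32_t w, uint32_t d) {
    layer_cfg *c = &bank[active ^ 1];
    c->feature_base = f; c->coeff_base = w; c->dims = d;
}

/* The hardware's "compute done" signal alternates the banks, so the
 * next layer starts at once from the freshly configured set. */
static void on_compute_done(void) { active ^= 1; }

int main(void) {
    prepare_next_layer(0x1000, 0x2000, 0x0202); /* configure layer 1 while layer 0 runs */
    on_compute_done();                          /* layer 0 done: bank 1 becomes active  */
    printf("active bank %d, feature_base 0x%x\n", active, bank[active].feature_base);
    return 0;
}
```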
In some embodiments, the method further comprises: in response to the instruction being a load or store instruction, reading the data at the storage-space address given by the source operand into the destination operand. A RISC-V instruction typically has one or two source operands rs1, rs2, with vector source operands vs1, vs2 as the vector counterparts. An instruction provides the source operands to the corresponding execution unit (data load, data store, scalar computation, vector computation, and so on) according to its opcode (which encodes the computation type, such as an add, subtract, multiply or divide operation). For example, when the opcode denotes a load/store, marking the instruction as a memory access, the execution unit reads the data in the data store at the address held in rs1 into the destination operand (rd or vd).
In some embodiments, the method further comprises: judging whether the vector registers corresponding to the vector source operands are in the same group; and in response to the vector registers corresponding to the vector source operands not being in the same group, enabling two ports with the same bit width as the vector registers to read and write simultaneously. To speed up vector loads and stores, several ports as wide as a vector register can be provided. For example, there are 32 vector registers, divided in hardware into 4 groups, each group served by one port. The port to enable follows from the vector-group setting of the vsetvli instruction: if vsetvli t0, a0, e8, m4 sets groups of 4 vector registers, software divides the 32 registers into 8 groups, and the mapping to hardware is 2 software vector groups per hardware vector group. If the vector registers vs1 and vs2 are in the same hardware group during computation, only one port is enabled for reading and writing; if they fall in two groups, two ports are enabled to read and write simultaneously.
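With the grouping above (32 registers, 4 hardware groups of 8), the port-enable decision reduces to comparing hardware group indices, as in this sketch:

```c
#include <stdio.h>

static int hw_group(int vreg) { return vreg / 8; } /* 4 hardware groups of 8 */

/* one port suffices when both vector sources sit behind the same port */
static int ports_needed(int vs1, int vs2) {
    return hw_group(vs1) == hw_group(vs2) ? 1 : 2;
}

int main(void) {
    printf("v2,v6  -> %d port(s)\n", ports_needed(2, 6));  /* same group: 1      */
    printf("v2,v12 -> %d port(s)\n", ports_needed(2, 12)); /* different group: 2 */
    return 0;
}
```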
In some embodiments, the method further comprises: in response to jumping to a vector architecture branch, performing a vector operation in accordance with the instruction.
In some embodiments, the method further comprises: reading the corresponding data, performing dimension conversion on it according to the operation requirement, and writing the converted data into the corresponding coefficient cache or input feature cache. In the AIPU architecture, both the coefficient cache unit and the input feature cache unit read weight and feature data from the shared external SRAM, and the address generator produces the corresponding SRAM addresses according to the register configuration. Convolution and matrix operations take different forms in different applications; convolution alone subdivides into one-/two-/three-dimensional convolution, dilated convolution, depthwise convolution, separable convolution, transposed convolution and so on. Different computation forms read data in different patterns, and convolution usually transforms the dimensions of the data, so the address generator must read data in different patterns according to the register configuration, the conversion being accomplished, in effect, by the read order itself. The role of the address generator and read/write data control is therefore: according to the computation's requirements, complete the data reading and the corresponding dimension conversion, then write the data into the corresponding coefficient (weight) cache unit or input feature cache unit.
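As one illustration of conversion by read order, the sketch below walks a row-major H x W x C feature map one channel plane at a time, so the data lands in the cache in CHW order, transposed in effect. The configuration structure is an assumption; the patent requires only register-configured address generation.

```c
#include <stdint.h>
#include <stdio.h>

typedef struct { uint32_t base, h, w, c; } agen_cfg;

/* Emit addresses channel-plane by channel-plane: the source is stored
 * row-major HWC, so reading in this order performs an HWC -> CHW
 * transpose as the data is copied into the input feature cache. */
static void generate(const agen_cfg *g, void (*emit)(uint32_t)) {
    for (uint32_t c = 0; c < g->c; c++)
        for (uint32_t y = 0; y < g->h; y++)
            for (uint32_t x = 0; x < g->w; x++)
                emit(g->base + (y * g->w + x) * g->c + c);
}

static void print_addr(uint32_t a) { printf("%u ", a); }

int main(void) {
    agen_cfg g = { 0, 2, 2, 3 };  /* tiny 2x2x3 map at SRAM address 0  */
    generate(&g, print_addr);     /* prints: 0 3 6 9 1 4 7 10 2 5 8 11 */
    putchar('\n');
    return 0;
}
```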
The convolution timing control unit is the control core of the whole AIPU; it collects the states of all functional modules, controls the enabling of the relevant modules, and generates the synchronization signal of the convolution operation. The convolution synchronization signal is the beat of the whole convolution process. The whole convolution comprises N (N >= 1) beats; one beat comprises M (M >= 1) clock cycles; and one multiply-add plus accumulation takes one clock cycle, so one beat contains M such multiply-add-and-accumulate operations. The size of M is determined by the number of times data is reused during convolution. For example, if the same group of coefficients is reused 8 times, the minimum value of M is 8 (if the compute cycles suffice to load the next group of data, M equals the compute cycles; otherwise M needs extra time for loading the next group). Because convolution computation and data loading proceed in parallel, the data-loading synchronization signal is the convolution synchronization signal delayed by the fixed data read/write period; likewise, the accumulator's synchronization signal is the convolution synchronization signal delayed by the fixed multiply-add period.
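The rule for the beat length M stated above fits in two lines, assuming the reuse count and the load time of the next data group are known:

```c
#include <stdio.h>

/* A beat must cover both the compute cycles (one per reuse of the
 * resident coefficient group) and the cycles spent loading the next
 * group; whichever is longer sets M. */
static unsigned beat_cycles(unsigned reuse_count, unsigned load_cycles) {
    return reuse_count >= load_cycles ? reuse_count : load_cycles;
}

int main(void) {
    printf("M = %u\n", beat_cycles(8, 8));  /* load hides behind compute: M = 8 */
    printf("M = %u\n", beat_cycles(8, 12)); /* slow load stretches the beat: 12 */
    return 0;
}
```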
In some embodiments, the method further comprises: in response to convolution calculation, reading the data in the first-level input feature cache and the first-level coefficient cache, and judging whether the remaining space of those caches is larger than the size of the next group of data; and in response to the remaining space being larger than the size of the next group of data, enabling the write cache. The remaining available space of the input feature cache and the coefficient cache is determined jointly by the reading and writing of data: writing data shrinks the remaining space, and reading data grows it. The convolution timing controller computes the available cache space from the counts of cache reads and writes, and asserts convolution enable once the data in the two caches suffices to begin convolution (for example, the coefficient data meets the reuse count, the input feature data suffices for several computations, and the computation time is no less than the loading time of the data needed by the next computation). During convolution, continued reading of the input feature cache and coefficient cache gradually grows the remaining space of both, and once the remaining space exceeds the size of the next group of data, write-cache enable is asserted. Thus, if the next group of data loads in less time than the previous group takes to compute, convolution runs without interruption; if computation is fast and loading slow, the convolution process is interrupted.
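A sketch of this occupancy bookkeeping, with illustrative sizes: writes shrink the free space, reads grow it, and the write enable fires only once both level-1 caches can hold the next group.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t capacity; /* total cache bytes        */
    uint32_t used;     /* bytes currently buffered */
} l1_cache;

static void on_write(l1_cache *c, uint32_t n) { c->used += n; } /* free space shrinks */
static void on_read (l1_cache *c, uint32_t n) { c->used -= n; } /* free space grows   */

static bool write_enable(const l1_cache *feat, const l1_cache *coef,
                         uint32_t next_group) {
    return feat->capacity - feat->used >= next_group &&
           coef->capacity - coef->used >= next_group;
}

int main(void) {
    l1_cache feat = { 1024, 0 }, coef = { 1024, 0 };
    on_write(&feat, 896); on_write(&coef, 896); /* current group buffered    */
    on_read(&feat, 512);  on_read(&coef, 512);  /* convolution consumes data */
    printf("write enable: %d\n", write_enable(&feat, &coef, 448)); /* 1 */
    return 0;
}
```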
A convolution operation is performed on the corresponding feature data and coefficient data, and the result is activated, normalized and pooled. In convolution, after the multiply-add computation, the data must still undergo operations such as activation (e.g., ReLU), normalization and pooling. If a low-efficiency vector operation unit outside the AIPU were used, large numbers of multiply-add intermediate results would pile up waiting for activation or pooling, and the vector unit would drag down the efficiency of the whole convolution. Therefore, the activation and similar vector operations required by convolution are specialized and placed after the multiply-add matrix unit. These special vector computation units, such as the activation unit, can be chained in series with the multiply-add and accumulation units or work independently, and they share the intermediate result buffer unit.
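A minimal sketch of the fused post-processing chain, using ReLU and a 1x2 max-pool as stand-ins for the specialized units that follow the multiply-add matrix (the actual operator set and data widths are not fixed here):

```c
#include <stdio.h>

static int relu(int x)        { return x > 0 ? x : 0; }
static int max2(int a, int b) { return a > b ? a : b; }

int main(void) {
    int acc[4] = { -3, 5, 2, -1 }; /* accumulated multiply-add results */
    /* activation feeds pooling directly, in series after accumulation,
     * without a round trip through a general vector unit */
    for (int i = 0; i < 4; i += 2)
        printf("pooled[%d] = %d\n", i / 2, max2(relu(acc[i]), relu(acc[i + 1])));
    return 0;
}
```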
The key points of the embodiment of the invention are as follows:
(1) A processor architecture with three instruction branches is designed around the RISC-V instruction set: a general instruction branch, a vector instruction branch, and an AIPU branch.
(2) An AIPU architecture is designed. The AIPU is combined with the RISC-V architecture in the form of an accelerator, and a dedicated register file, configured through RISC-V load/store instructions, accelerates convolution and matrix operations.
(3) The architecture of the AIPU multiply-add array, a two-dimensional parallel multiply-add unit, is designed. Two vector caches are provided and, together with the input feature cache and the coefficient cache, form two stages of double buffering. The front-stage double buffer (whose aim is to keep the downstream units continuously supplied with data) is formed jointly by the input feature cache, the coefficient cache and the convolution control unit: by monitoring the remaining space in real time, the data needed next is written into the caches while they are continuously read out. The rear-stage buffer increases bandwidth and multiplexes data.
(4) Each buffer in the AIPU is designed to reasonably absorb the different data throughput rates between the functional modules at each stage of the convolution operation.
(5) A flexible address generator is designed: according to the register configuration, it cooperates with the rear-stage buffer and converts the data dimensions while reading the data.
(6) Ping-pong operation registers are designed to keep two successive, different convolution operations running without interruption.
The architecture of the embodiment of the invention is very flexible in application: it has the control capability of a general-purpose CPU and the computing power AI requires. It can be applied in edge devices for the artificial intelligence of things (AIoT). It can also achieve higher computing power through a network-on-chip (NoC) and be installed in a PC or server as an accelerator card to perform cloud inference or training.
It should be particularly noted that, the steps in the embodiments of the method for processing data based on RISC-V instruction set described above can be mutually intersected, replaced, added, or deleted, so that these reasonable permutation and combination transformations for the method for processing data based on RISC-V instruction set also belong to the protection scope of the present invention, and the protection scope of the present invention should not be limited to the embodiments.
In view of the above object, according to a second aspect of the embodiments of the present invention, there is provided a system for data processing based on the RISC-V instruction set, comprising: an acquisition module configured to fetch an instruction from a RISC-V instruction space, cache it in a cache, and judge the type of the instruction; a jump module configured to, in response to the instruction being a branch jump instruction, regenerate an instruction address and jump to the corresponding branch according to the instruction address; an AIPU module configured to, in response to jumping to the AIPU branch, store the feature data and coefficient data for the current convolution operation in a first-level input feature cache and a first-level coefficient cache, and store the feature data and coefficient data for the next convolution operation in a second-level input feature cache and a second-level coefficient cache; and an execution module configured to perform a convolution operation on the corresponding feature data and coefficient data, and activate, normalize and pool the operation result.
In some embodiments, the system further comprises a vector module configured to: in response to jumping to a vector architecture branch, perform a vector operation in accordance with the instruction.
In some embodiments, the system further comprises a first determining module configured to: in response to the instruction being a load or store instruction, read the data at the storage-space address given by the source operand into the destination operand.
In some embodiments, the system further comprises a second determining module configured to: judge whether the vector registers corresponding to the vector source operands are in the same group; and in response to the vector registers corresponding to the vector source operands not being in the same group, enable two ports with the same bit width as the vector registers to read and write simultaneously.
In some embodiments, the system further comprises a configuration module configured to: configure the register file in the AIPU branch as two parts, wherein the first part runs the current AIPU operation and the second part acquires the parameters required for the next AIPU operation.
In some embodiments, the system further comprises a conversion module configured to: read the corresponding data, perform dimension conversion on the data according to the operation requirement, and write the converted data into the corresponding coefficient cache or input feature cache.
In some embodiments, the system further comprises a computing module configured to: in response to convolution calculation, read the data in the first-level input feature cache and the first-level coefficient cache, and judge whether the remaining space of the first-level input feature cache and the first-level coefficient cache is larger than the size of the next group of data; and in response to the remaining space of the first-level input feature cache and the first-level coefficient cache being larger than the size of the next group of data, enable the write cache.
In view of the above object, a third aspect of the embodiments of the present invention provides a computer device, comprising: at least one processor; and a memory storing computer instructions executable on the processor, the instructions being executable by the processor to perform the steps of: S1, fetching an instruction from a RISC-V instruction space, caching it in a cache, and judging the type of the instruction; S2, in response to the instruction being a branch jump instruction, regenerating an instruction address and jumping to the corresponding branch according to the instruction address; S3, in response to jumping to the AIPU branch, storing the feature data and coefficient data for the current convolution operation in a first-level input feature cache and a first-level coefficient cache, and storing the feature data and coefficient data for the next convolution operation in a second-level input feature cache and a second-level coefficient cache; and S4, performing a convolution operation on the corresponding feature data and coefficient data, and activating, normalizing and pooling the operation result.
In some embodiments, the steps further comprise: in response to jumping to a vector architecture branch, performing a vector operation in accordance with the instruction.
In some embodiments, the steps further comprise: in response to the instruction being a load or store instruction, reading the data at the storage-space address given by the source operand into the destination operand.
In some embodiments, the steps further comprise: judging whether the vector registers corresponding to the vector source operands are in the same group; and in response to the vector registers corresponding to the vector source operands not being in the same group, enabling two ports with the same bit width as the vector registers to read and write simultaneously.
In some embodiments, the steps further comprise: configuring the register file in the AIPU branch as two parts, wherein the first part runs the current AIPU operation and the second part acquires the parameters required for the next AIPU operation.
In some embodiments, the steps further comprise: reading the corresponding data, performing dimension conversion on the data according to the operation requirement, and writing the converted data into the corresponding coefficient cache or input feature cache.
In some embodiments, the steps further comprise: in response to convolution calculation, reading the data in the first-level input feature cache and the first-level coefficient cache, and judging whether the remaining space of the first-level input feature cache and the first-level coefficient cache is larger than the size of the next group of data; and in response to the remaining space of the first-level input feature cache and the first-level coefficient cache being larger than the size of the next group of data, enabling the write cache.
Fig. 5 is a schematic diagram of a hardware structure of an embodiment of the computer device for performing data processing based on the RISC-V instruction set according to the present invention.
Taking the apparatus shown in fig. 5 as an example, the apparatus includes a processor 201 and a memory 202, and may further include: an input device 203 and an output device 204.
The processor 201, the memory 202, the input device 203 and the output device 204 may be connected by a bus or other means, and fig. 5 illustrates the connection by a bus as an example.
The memory 202, as a non-volatile computer-readable storage medium, can store non-volatile software programs, non-volatile computer-executable programs and modules, such as the program instructions/modules corresponding to the method for data processing based on the RISC-V instruction set in the embodiments of the present application. By running the non-volatile software programs, instructions and modules stored in the memory 202, the processor 201 executes the various functional applications and data processing of the server, that is, implements the method for data processing based on the RISC-V instruction set of the above method embodiments.
The memory 202 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of a method of data processing based on a RISC-V instruction set, and the like. Further, the memory 202 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, memory 202 may optionally include memory located remotely from processor 201, which may be connected to local modules via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 203 may receive information such as a user name and a password that are input. The output device 204 may include a display device such as a display screen.
One or more program instructions/modules corresponding to the method for performing data processing based on the RISC-V instruction set are stored in the memory 202, and when executed by the processor 201, perform the method for performing data processing based on the RISC-V instruction set in any of the above-described method embodiments.
Any embodiment of a computer apparatus for performing the method for data processing based on the RISC-V instruction set may achieve the same or similar effects as any of the corresponding embodiments of the method described above.
The invention also provides a computer readable storage medium storing a computer program which, when executed by a processor, performs the method as above.
FIG. 6 is a schematic diagram of an embodiment of a computer storage medium for performing data processing based on RISC-V instruction set according to the present invention. Taking the computer storage medium as shown in fig. 6 as an example, the computer readable storage medium 3 stores a computer program 31 which, when executed by a processor, performs the method as described above.
Finally, it should be noted that, as one of ordinary skill in the art can appreciate that all or part of the processes of the methods of the above embodiments can be implemented by a computer program to instruct related hardware, and a program of the method for data processing based on the RISC-V instruction set can be stored in a computer readable storage medium, and when the program is executed, the program can include the processes of the embodiments of the methods as described above. The storage medium of the program may be a magnetic disk, an optical disk, a Read Only Memory (ROM), a Random Access Memory (RAM), or the like. The embodiments of the computer program may achieve the same or similar effects as any of the above-described method embodiments.
The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the present disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.
It should be understood that, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly supports the exception. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items.
The numbers of the embodiments disclosed in the embodiments of the present invention are merely for description, and do not represent the merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
Those of ordinary skill in the art will understand that: the discussion of any embodiment above is meant to be exemplary only, and is not intended to intimate that the scope of the disclosure, including the claims, of embodiments of the invention is limited to these examples; within the idea of an embodiment of the invention, also technical features in the above embodiment or in different embodiments may be combined and there are many other variations of the different aspects of the embodiments of the invention as described above, which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements, and the like that may be made without departing from the spirit and principles of the embodiments of the present invention are intended to be included within the scope of the embodiments of the present invention.

Claims (10)

1. A method for performing data processing based on a RISC-V instruction set, comprising the steps of:
obtaining an instruction from a RISC-V instruction space, caching the instruction into a cache, and determining the type of the instruction;
in response to the instruction being a branch jump instruction, regenerating an instruction address and jumping to the corresponding branch according to the instruction address;
in response to jumping to an AIPU branch, storing the feature data and coefficient data for the current convolution operation in a first-level input feature cache and a first-level coefficient cache, and storing the feature data and coefficient data for the next convolution operation in a second-level input feature cache and a second-level coefficient cache; and
performing a convolution operation according to the corresponding feature data and coefficient data, and activating, normalizing, and pooling the operation result.
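By way of illustration only, the double-buffering behavior recited in claim 1 can be sketched in C. Everything below — the tile_cache type, the buffer size, the stub conv and activate kernels, and the aipu_step function — is a hypothetical example under assumed names, not the patented implementation; normalization and pooling are elided.

    #include <stddef.h>

    #define CACHE_WORDS 1024

    typedef struct {
        float feature[CACHE_WORDS]; /* input feature cache */
        float coeff[CACHE_WORDS];   /* coefficient cache   */
    } tile_cache;

    static tile_cache level1; /* data for the current convolution      */
    static tile_cache level2; /* data prefetched for the next one      */

    /* Stand-in kernels; real hardware would perform 2-D convolution. */
    static void conv(float *out, const tile_cache *c, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            out[i] = c->feature[i] * c->coeff[i];
    }

    static void activate(float *out, size_t n) /* ReLU as an example */
    {
        for (size_t i = 0; i < n; i++)
            if (out[i] < 0.0f) out[i] = 0.0f;
    }

    /* One AIPU step: compute on the level-1 data, then swap the two
     * cache levels so the prefetched data becomes current. */
    void aipu_step(float *out, size_t n)
    {
        conv(out, &level1, n);
        activate(out, n); /* normalize() and pool() would follow here */

        tile_cache tmp = level1;
        level1 = level2;
        level2 = tmp;
    }

While aipu_step consumes the level-1 data, a separate loader can fill the level-2 caches, so computation and data movement overlap rather than serialize — which is the point of having two cache levels.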
2. The method of claim 1, further comprising:
in response to jumping to a vector architecture branch, performing a vector operation in accordance with the instruction.
3. The method of claim 1, further comprising:
in response to the instruction being a load or store instruction, reading the data at the storage-space address given by the source operand into the destination operand.
4. The method of claim 1, further comprising:
determining whether the vector registers corresponding to the vector source operands are in the same group; and
in response to the vector registers corresponding to the vector source operands not being in the same group, enabling two ports with the same bit width as the vector registers to read and write simultaneously.
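A minimal sketch of the port-enable decision in claim 4, in C; the group width and the function names are assumptions made for the example, not taken from the patent.

    #include <stdbool.h>

    #define REGS_PER_GROUP 8 /* assumed grouping, e.g. v0-v7, v8-v15 */

    static int vreg_group(int vreg) { return vreg / REGS_PER_GROUP; }

    /* Decide how many full-width ports to enable for two vector source
     * operands: registers in different groups can be read through two
     * ports in the same cycle; registers in one group share a port. */
    int read_ports_to_enable(int vs1, int vs2)
    {
        bool same_group = (vreg_group(vs1) == vreg_group(vs2));
        return same_group ? 1 : 2;
    }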
5. The method of claim 1, further comprising:
configuring the register file in the AIPU branch into two parts, wherein the first part runs the current AIPU operation and the second part acquires the parameters required for the next AIPU operation.
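Claim 5's two-part register file behaves like a pair of parameter banks that swap roles between operations. The C sketch below is illustrative only; aipu_params and its fields are invented names.

    typedef struct {
        int kernel_w, kernel_h, stride; /* illustrative parameters */
    } aipu_params;

    static aipu_params bank[2];
    static int active = 0; /* index of the bank driving the current op */

    /* Fill the idle bank with the parameters of the next operation. */
    void load_next_params(const aipu_params *next)
    {
        bank[active ^ 1] = *next;
    }

    /* Called when the next operation starts: the prefilled bank
     * becomes active, and the old one is free to be reloaded. */
    const aipu_params *swap_param_banks(void)
    {
        active ^= 1;
        return &bank[active];
    }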
6. The method of claim 5, further comprising:
reading the corresponding data, performing dimension conversion on the data according to the operation requirement, and writing the converted data into the corresponding coefficient cache or input feature cache.
7. The method of claim 1, further comprising:
in response to a convolution calculation, reading the data in the first-level input feature cache and the first-level coefficient cache, and determining whether the remaining space of the first-level input feature cache and the first-level coefficient cache is larger than the size of the next group of data; and
starting a write cache in response to the remaining space of the first-level input feature cache and the first-level coefficient cache being larger than the size of the next group of data.
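The remaining-space test of claim 7 reduces to a simple predicate; the byte-accounting model below (capacity and used fields, next_group_bytes) is an assumed illustration in C, not the patent's implementation.

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct {
        size_t capacity; /* total size of the level-1 cache            */
        size_t used;     /* bytes still queued for the ongoing compute */
    } l1_cache;

    /* Start the write cache only when both level-1 caches have room
     * for the next group of data. */
    bool write_cache_enable(const l1_cache *feat, const l1_cache *coef,
                            size_t next_group_bytes)
    {
        size_t feat_free = feat->capacity - feat->used;
        size_t coef_free = coef->capacity - coef->used;
        return feat_free > next_group_bytes &&
               coef_free > next_group_bytes;
    }

Gating the write cache on this predicate ensures that incoming data never overwrites entries the convolution has not yet consumed.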
8. A system for performing data processing based on a RISC-V instruction set, comprising:
an acquisition module configured to acquire an instruction from a RISC-V instruction space, cache the instruction into a cache, and determine the type of the instruction;
a jump module configured to, in response to the instruction being a branch jump instruction, regenerate an instruction address and jump to the corresponding branch according to the instruction address;
an AIPU module configured to, in response to jumping to an AIPU branch, store the feature data and coefficient data for the current convolution operation in a first-level input feature cache and a first-level coefficient cache, and store the feature data and coefficient data for the next convolution operation in a second-level input feature cache and a second-level coefficient cache; and
an execution module configured to perform a convolution operation according to the corresponding feature data and coefficient data, and to activate, normalize, and pool the operation result.
9. A computer device, comprising:
at least one processor; and
a memory storing computer instructions executable on the processor, wherein the instructions, when executed by the processor, implement the steps of the method of any one of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which, when executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN202110175746.6A 2021-02-09 2021-02-09 Method, system, device and medium for data processing based on RISC-V instruction set Pending CN112860320A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110175746.6A CN112860320A (en) 2021-02-09 2021-02-09 Method, system, device and medium for data processing based on RISC-V instruction set
PCT/CN2022/074414 WO2022170997A1 (en) 2021-02-09 2022-01-27 Data processing method and system based on risc-v instruction set, and device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110175746.6A CN112860320A (en) 2021-02-09 2021-02-09 Method, system, device and medium for data processing based on RISC-V instruction set

Publications (1)

Publication Number Publication Date
CN112860320A true CN112860320A (en) 2021-05-28

Family

ID=75989351

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110175746.6A Pending CN112860320A (en) 2021-02-09 2021-02-09 Method, system, device and medium for data processing based on RISC-V instruction set

Country Status (2)

Country Link
CN (1) CN112860320A (en)
WO (1) WO2022170997A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115801147B (en) * 2022-11-30 2023-09-22 珠海笛思科技有限公司 Data communication processing method and system
CN118276951B (en) * 2024-06-04 2024-08-09 山东浪潮科学研究院有限公司 RISC-V based instruction expansion method and implementation device
CN118427030B (en) * 2024-07-05 2024-09-13 长沙麟卓信息科技有限公司 Harvard architecture multi-level instruction cache measuring and calculating method based on random instruction set

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109740749A (en) * 2017-10-30 2019-05-10 北京深鉴智能科技有限公司 The hardware realization apparatus and method that the full connection of high speed calculates
CN111656367A (en) * 2017-12-04 2020-09-11 优创半导体科技有限公司 System and architecture for neural network accelerator
CN110007961B (en) * 2019-02-01 2023-07-18 中山大学 RISC-V-based edge computing hardware architecture
JP7308674B2 (en) * 2019-07-08 2023-07-14 キヤノン株式会社 Arithmetic processing device and arithmetic processing method
CN112860320A (en) * 2021-02-09 2021-05-28 山东英信计算机技术有限公司 Method, system, device and medium for data processing based on RISC-V instruction set

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1516001A (en) * 2003-01-08 2004-07-28 上海海尔集成电路有限公司 New-type RISC pieline microcontroller structure and its operation method
US20110035745A1 (en) * 2008-03-17 2011-02-10 Institute Of Computing Technology Of The Chinese Academy Of Sciences Risc processor apparatus and method for supporting x86 virtual machine
CN106940815A (en) * 2017-02-13 2017-07-11 西安交通大学 A kind of programmable convolutional neural networks Crypto Coprocessor IP Core
CN108647773A (en) * 2018-04-20 2018-10-12 复旦大学 A kind of hardwired interconnections framework of restructural convolutional neural networks
CN110659069A (en) * 2018-06-28 2020-01-07 赛灵思公司 Instruction scheduling method for performing neural network computation and corresponding computing system
CN111191774A (en) * 2018-11-14 2020-05-22 上海富瀚微电子股份有限公司 Simplified convolutional neural network-oriented low-cost accelerator architecture and processing method thereof
CN111078287A (en) * 2019-11-08 2020-04-28 苏州浪潮智能科技有限公司 Vector operation co-processing method and device
CN111160545A (en) * 2019-12-31 2020-05-15 北京三快在线科技有限公司 Artificial neural network processing system and data processing method thereof
CN111582465A (en) * 2020-05-08 2020-08-25 中国科学院上海高等研究院 Convolutional neural network acceleration processing system and method based on FPGA and terminal
CN112130901A (en) * 2020-09-11 2020-12-25 山东云海国创云计算装备产业创新中心有限公司 RISC-V based coprocessor, data processing method and storage medium
CN112232517A (en) * 2020-09-24 2021-01-15 苏州浪潮智能科技有限公司 Artificial intelligence accelerates engine and artificial intelligence treater

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022170997A1 (en) * 2021-02-09 2022-08-18 山东英信计算机技术有限公司 Data processing method and system based on risc-v instruction set, and device and medium
CN113254391A (en) * 2021-06-25 2021-08-13 之江实验室 Neural network accelerator convolution calculation and data loading parallel method and device
WO2023284130A1 (en) * 2021-07-15 2023-01-19 深圳供电局有限公司 Chip and control method for convolution calculation, and electronic device
CN114399034A (en) * 2021-12-30 2022-04-26 北京奕斯伟计算技术有限公司 Data handling method for direct memory access device
CN115113933A (en) * 2022-08-25 2022-09-27 旋智电子科技(上海)有限公司 Apparatus for accelerating data operations
CN115113933B (en) * 2022-08-25 2022-11-15 旋智电子科技(上海)有限公司 Apparatus for accelerating data operation
CN115248701A (en) * 2022-09-21 2022-10-28 进迭时空(杭州)科技有限公司 Zero-copy data transmission device and method between processor register files
CN115576606A (en) * 2022-11-16 2023-01-06 苏州浪潮智能科技有限公司 Method for realizing matrix transposition multiplication, coprocessor, server and storage medium
CN116149554A (en) * 2023-02-08 2023-05-23 珠海妙存科技有限公司 RISC-V and extended instruction based data storage processing system and method thereof
CN116149554B (en) * 2023-02-08 2023-11-24 珠海妙存科技有限公司 RISC-V and extended instruction based data storage processing system and method thereof
CN116804915A (en) * 2023-08-28 2023-09-26 腾讯科技(深圳)有限公司 Data interaction method, processor, device and medium based on memory
CN116804915B (en) * 2023-08-28 2023-12-15 腾讯科技(深圳)有限公司 Data interaction method, processor, device and medium based on memory

Also Published As

Publication number Publication date
WO2022170997A1 (en) 2022-08-18

Similar Documents

Publication Publication Date Title
CN112860320A (en) Method, system, device and medium for data processing based on RISC-V instruction set
CN108268278B (en) Processor, method and system with configurable spatial accelerator
US10564980B2 (en) Apparatus, methods, and systems for conditional queues in a configurable spatial accelerator
CN109213723B (en) Processor, method, apparatus, and non-transitory machine-readable medium for dataflow graph processing
US11307873B2 (en) Apparatus, methods, and systems for unstructured data flow in a configurable spatial accelerator with predicate propagation and merging
US11086816B2 (en) Processors, methods, and systems for debugging a configurable spatial accelerator
US10445451B2 (en) Processors, methods, and systems for a configurable spatial accelerator with performance, correctness, and power reduction features
US10515046B2 (en) Processors, methods, and systems with a configurable spatial accelerator
US10915471B2 (en) Apparatuses, methods, and systems for memory interface circuit allocation in a configurable spatial accelerator
US10387319B2 (en) Processors, methods, and systems for a configurable spatial accelerator with memory system performance, power reduction, and atomics support features
US10380063B2 (en) Processors, methods, and systems with a configurable spatial accelerator having a sequencer dataflow operator
US10445234B2 (en) Processors, methods, and systems for a configurable spatial accelerator with transactional and replay features
US10496574B2 (en) Processors, methods, and systems for a memory fence in a configurable spatial accelerator
US10416999B2 (en) Processors, methods, and systems with a configurable spatial accelerator
US20190303297A1 (en) Apparatus, methods, and systems for remote memory access in a configurable spatial accelerator
US11029958B1 (en) Apparatuses, methods, and systems for configurable operand size operations in an operation configurable spatial accelerator
CN111566623A (en) Apparatus, method and system for integrated performance monitoring in configurable spatial accelerators
US10678724B1 (en) Apparatuses, methods, and systems for in-network storage in a configurable spatial accelerator
US12086080B2 (en) Apparatuses, methods, and systems for a configurable accelerator having dataflow execution circuits
WO2021034587A1 (en) Multiple output fusion for operations performed in a multi-dimensional array of processing units
GB2464292A (en) SIMD processor circuit for performing iterative SIMD multiply-accumulate operations
US20210200540A1 (en) Apparatuses, methods, and systems for fused operations in a configurable spatial accelerator
Kang et al. Datapath Extension of NPUs to Support Nonconvolutional Layers Efficiently
CN118708246A (en) RISC-V based multi-precision vector operation device
CN118193054A (en) Custom instruction processing method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination