CN112130901A

CN112130901A - RISC-V based coprocessor, data processing method and storage medium

Info

Publication number: CN112130901A
Application number: CN202010954828.6A
Authority: CN
Inventors: 邹晓峰; 李拓; 李仁刚; 刘同强; 周玉龙; 王朝辉
Original assignee: Shandong Yunhai Guochuang Cloud Computing Equipment Industry Innovation Center Co Ltd
Current assignee: Shandong Yunhai Guochuang Cloud Computing Equipment Industry Innovation Center Co Ltd
Priority date: 2020-09-11
Filing date: 2020-09-11
Publication date: 2020-12-25

Abstract

The application provides a RISC-V based coprocessor, including: a RoCC interface, connected with the second-level cache of the main processor and used for data interaction with the main processor; a buffer area, connected with the second-level cache and used for caching the data to be processed distributed by the main processor; a scheduler, connected with the buffer area and used for scheduling the data to be processed to the vector processing module or the scalar processing module; a vector processing module, used for processing vector data in the data to be processed; and a scalar processing module, used for processing scalar data in the data to be processed. The coprocessor can meet the high real-time and high-performance requirements of computations such as matrix calculation in edge computing scenarios. The coprocessor can be used as an independent accelerator chip, and can be flexibly managed and configured by software through a control register. The application also provides a data processing method and a computer-readable storage medium, which have the same beneficial effects.

Description

RISC-V based coprocessor, data processing method and storage medium

Technical Field

The present application relates to the field of chips, and in particular, to a RISC-V based coprocessor, a data processing method, and a storage medium.

Background

With the rise of edge computing, end-side data is growing rapidly along with application demands, and end-side computing has strict real-time requirements. Because of problems such as network latency, centralized computing tasks, typified by artificial intelligence inference, are gradually moving to the edge side. The computing demands on the edge side are therefore increasing, and the end side also places new demands on high-performance computing.

In edge computing, in addition to common processors based on architectures such as ARM, a RISC-V architecture processor can be selected as the computing core. RISC-V is an open reduced instruction set architecture introduced in recent years. It is released under a BSD open source license and, compared with other architectures such as ARM, is free and open source, low cost, lightweight, and low power. Owing to these characteristics of its open architecture, RISC-V has broad application prospects in the field of edge computing.

However, in the related art, RISC-V architecture processors provide few vector instructions, and general-purpose processors largely do not support vector calculation because implementing it incurs high overhead, large area, and high power consumption. Such processors therefore cannot meet applications' requirements for real-time, high-performance computing.

Disclosure of Invention

The purpose of the application is to provide a RISC-V based coprocessor, a data processing method and a computer-readable storage medium, in which the coprocessor performs vector calculation operations that are difficult for the main processor to implement.

In order to solve the above technical problem, the present application provides a RISC-V based coprocessor, comprising:

a RoCC interface, connected with the second-level cache of the main processor and used for data interaction with the main processor, the main processor being a RISC-V architecture processor;

a buffer area, connected with the second-level cache and used for caching the data to be processed distributed by the main processor;

a scheduler, connected with the buffer area and used for scheduling the data to be processed to a vector processing module or a scalar processing module;

at least one vector processing module, configured to process vector data in the data to be processed;

and a scalar processing module, configured to process scalar data in the data to be processed.

Optionally, the vector processing module includes two vector processing units, and each vector processing unit includes a plurality of vector processing subunits;

the vector processing units are used to perform 32-bit computations independently, jointly through bit-width splicing, or in parallel.

The present application also provides a data processing method, based on the above-mentioned RISC-V based coprocessor, comprising:

receiving data to be processed;

loading the data to be processed to a buffer area of the coprocessor through a second-level cache;

reading the data to be processed in the buffer area, and dividing the data to be processed into vector data and scalar data by using a scheduler;

and sending the vector data to a vector processing module for processing, and sending the scalar data to a scalar processing module for processing.

Optionally, if the coprocessor includes at least two vector processing modules, after the scheduler divides the data to be processed into vector data and scalar data, and before the vector data is sent to the vector processing module for processing, the method further includes:

dividing the vector data into vector subdata with the same number as the vector processing modules;

sending the vector data to a vector processing module for processing comprises:

and respectively sending the vector subdata to each vector processing module for processing.

Optionally, before receiving the data to be processed, the method further includes:

compiling a tool chain based on a RISC-V compiler, wherein the tool chain supports vector extension instructions;

loading the data to be processed to the buffer area of the coprocessor through a second-level cache comprises:

and loading the data to be processed to the buffer area of the coprocessor through the second-level cache by using instructions from the tool chain.

Optionally, after sending the vector data to a vector processing module for processing and sending the scalar data to a scalar processing module for processing, the method further includes:

receiving a first result of the vector processing module, receiving a second result of the scalar processing module;

obtaining a processing result according to the first result and the second result;

and writing the processing result back to the buffer area.

Optionally, after writing the processing result back to the buffer, the method further includes:

and writing the processing result back to a preset address in the main processor from the buffer area.

Optionally, the method further includes:

and returning the state information of the coprocessor to a main processor through the RoCC interface of the coprocessor.

Optionally, the method further includes:

and the coprocessor reads the cache information of the main processor through the RoCC interface or receives the instruction information of the main processor.

The present application also provides a computer-readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method as set forth above.

The application provides a RISC-V based coprocessor, including: a RoCC interface, connected with the second-level cache of the main processor and used for data interaction with the main processor, the main processor being a RISC-V architecture processor; a buffer area, connected with the second-level cache and used for caching the data to be processed distributed by the main processor; a scheduler, connected with the buffer area and used for scheduling the data to be processed to a vector processing module or a scalar processing module; at least one vector processing module, configured to process vector data in the data to be processed; and a scalar processing module, configured to process scalar data in the data to be processed.

By providing a coprocessor connected with the main processor, the application adds a coprocessor based on RISC-V vector instruction extension to a traditional computing system as a computing accelerator. The main processor is a RISC-V architecture multi-core processor, and the dedicated coprocessor module is designed through vector instruction extension, so that the high real-time and high-performance requirements of computations such as matrix calculation in edge computing scenarios are met. The coprocessor can be used as an independent accelerator chip, is connected to the main processor through the RoCC interface, and can be flexibly managed and configured by software through a control register.

The present application further provides a data processing method and a computer-readable storage medium, which have the above beneficial effects and are not described herein again.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

FIG. 1 is a schematic structural diagram of a RISC-V based coprocessor according to an embodiment of the present application;

FIG. 2 is a schematic structural diagram of a main processor and a coprocessor connected to each other according to an embodiment of the present application;

FIG. 3 is a flowchart of a data processing method according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Referring to FIG. 1 and FIG. 2, FIG. 1 is a schematic structural diagram of a RISC-V based coprocessor according to an embodiment of the present application, and FIG. 2 is a schematic structural diagram of a main processor and a coprocessor connected to each other according to an embodiment of the present application. The coprocessor shown in FIG. 1 includes a RoCC interface, a buffer area, a scheduler, at least one vector processing module, and a scalar processing module.

In fig. 1, the coprocessor includes two vector processing modules, but the number of the vector processing modules in the coprocessor is not limited in this embodiment, and can be set by those skilled in the art according to the actual vector data processing requirements. It is easily understood that the greater the number of vector processing modules, the more the vector data processing efficiency is improved.

The RoCC (Rocket Custom Coprocessor) interface is a dedicated coprocessor interface for the RISC-V architecture. It mainly carries custom instructions, realizes data interaction between the main processor and the coprocessor, and accelerates the calculation of specific algorithms. The RoCC interface serves three functions: first, as a core control interface, it transmits coprocessor state information between the coprocessor and the main processor; second, as a register interface, it carries the instruction interaction between the coprocessor and the main processor; and third, as a memory access interface, it gives the coprocessor access to the cache on the main processor.
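By way of a non-limiting illustration of the custom instructions carried over the register interface, the following Python sketch packs and unpacks a 32-bit instruction word. The field layout (funct7, rs2, rs1, xd, xs1, xs2, rd, opcode) and the custom-0 opcode value 0x0B follow the convention commonly used with the Rocket Chip RoCC interface; they are assumptions for the example and are not specified in this application. The function names are hypothetical.

```python
# Illustrative sketch only. The field layout follows the common Rocket Chip
# RoCC convention, which this application does not spell out.
CUSTOM0_OPCODE = 0x0B  # RISC-V "custom-0" major opcode (assumed for the example)


def encode_rocc(funct7, rs2, rs1, xd, xs1, xs2, rd, opcode=CUSTOM0_OPCODE):
    """Pack RoCC-style instruction fields into one 32-bit word."""
    return (
        (funct7 & 0x7F) << 25   # bits 31:25, selects the coprocessor operation
        | (rs2 & 0x1F) << 20    # bits 24:20, second source register
        | (rs1 & 0x1F) << 15    # bits 19:15, first source register
        | (xd & 1) << 14        # bit 14, set if a result is written to rd
        | (xs1 & 1) << 13       # bit 13, set if rs1 is read
        | (xs2 & 1) << 12       # bit 12, set if rs2 is read
        | (rd & 0x1F) << 7      # bits 11:7, destination register
        | (opcode & 0x7F)       # bits 6:0, major opcode
    )


def decode_rocc(word):
    """Unpack a 32-bit word into its RoCC-style fields."""
    return {
        "funct7": (word >> 25) & 0x7F,
        "rs2": (word >> 20) & 0x1F,
        "rs1": (word >> 15) & 0x1F,
        "xd": (word >> 14) & 1,
        "xs1": (word >> 13) & 1,
        "xs2": (word >> 12) & 1,
        "rd": (word >> 7) & 0x1F,
        "opcode": word & 0x7F,
    }
```

For example, `decode_rocc(encode_rocc(0, 2, 1, 1, 1, 1, 3))` recovers the same field values that were passed to the encoder.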

The main processor caches the instructions that require coprocessor calculation in the buffer area through the RoCC interface. The cached instructions are mainly vector instructions, together with the necessary scalar instructions, jump instructions, and the like.

The scheduler coordinates the scheduling of the instructions cached in the buffer area and performs instruction fetch, decode, execute, write-back and similar operations in the order in which the instructions were compiled.

It should be noted that the data to be processed sent by the main processor needs to be divided into scalar data and vector data in the buffer area, and is input into the scalar processing module and the vector processing module for processing under the control of the scheduler.

The scalar processing module mainly executes scalar calculation instructions. Its main functions include instruction decoding, a scalar instruction calculation module, instruction flow control, interrupt control, scalar data storage, a scalar unit control register, and the like. Since an accelerated program does not consist entirely of vector extension instructions, some scalar instructions are implemented independently in the coprocessor, which reduces interaction with the main processor and improves execution efficiency.

The vector processing module mainly executes vector calculation instructions. Its main functions include instruction decoding, vector instruction calculation modules, reduction, shuffling, global control registers, and the like. The vector processing module comprises two identical vector processing units (vector units), and each vector unit contains vector processing subunits, the number N of which is set according to actual requirements. The vector processing units typically have three modes of computation. In the first mode, each vector unit independently executes 32-bit calculations, and each processing unit has an operand bit width of 32 bits. In the second mode, bit-width splicing is performed: the two units simultaneously handle the upper 32 bits and the lower 32 bits of a 64-bit vector operation. In the third mode, computation splicing is performed: the two units execute the same 32-bit vector instruction in parallel, so that the number of processing subunits taking part in the operation is twice that of a single unit. For example, when the two vector units execute independently, the number of subunits per vector calculation is N, and when the two vector units compute cooperatively, the number is 2N. The three calculation modes can be selected flexibly, which greatly improves the flexibility of vector processing.
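As a non-limiting illustration, the three computation modes can be modeled in software as follows. The sketch uses lane-wise addition as a stand-in for an arbitrary vector operation; the function names are hypothetical, and the carry passed from the low half to the high half in the spliced mode is an assumption about how the two units would cooperate, since the application does not describe that detail.

```python
MASK32 = 0xFFFFFFFF  # each lane operates on 32-bit operands


def vector_unit_add(a, b):
    """One vector unit: lane-wise 32-bit addition with wrap-around."""
    return [(x + y) & MASK32 for x, y in zip(a, b)]


def independent_mode(a0, b0, a1, b1):
    """Mode 1: each unit runs its own 32-bit vector operation (N lanes each)."""
    return vector_unit_add(a0, b0), vector_unit_add(a1, b1)


def spliced_64bit_mode(a, b):
    """Mode 2: one unit handles the low 32 bits and the other the high 32 bits
    of each 64-bit element (carry hand-off is an assumption of this sketch)."""
    result = []
    for x, y in zip(a, b):
        lo = (x & MASK32) + (y & MASK32)
        carry = lo >> 32
        hi = ((x >> 32) & MASK32) + ((y >> 32) & MASK32) + carry
        result.append(((hi & MASK32) << 32) | (lo & MASK32))
    return result


def parallel_mode(a, b, n):
    """Mode 3: both units run the same 32-bit operation, covering 2N lanes."""
    if not (len(a) == len(b) == 2 * n):
        raise ValueError("parallel mode expects exactly 2N lanes")
    return vector_unit_add(a[:n], b[:n]) + vector_unit_add(a[n:], b[n:])
```

In this model, the independent and parallel modes differ only in whether the two units receive separate operations or two halves of one 2N-lane operation.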

As shown in FIG. 2, the system further includes a main processor connected to the coprocessor. The main processor is the host of the system and is mainly used for running edge computing applications such as neural network inference programs. In practical applications, a RISC-V architecture processor is required as the main processor, and it is usually a multi-core processor. Each core is connected to a first-level data cache (L1D-Cache) and a first-level instruction cache (L1I-Cache); the L1 caches are connected to the second-level cache through a cache coherence hub, and the second-level cache is shared among the cores. The peripheral equipment mainly comprises the peripheral modules required in practical edge computing scenarios, such as various sensors, analog-to-digital conversion devices, image and video acquisition interfaces, network interfaces, and the like.

As can be seen from the above, the embodiment of the present application provides a coprocessor connected to the main processor. Compared with a conventional computing system, a coprocessor based on RISC-V vector instruction extension is added as a computing accelerator, the main processor is a RISC-V architecture multi-core processor, and the dedicated coprocessor module is designed through vector instruction extension, so that the high real-time performance requirements of computations such as matrix calculation in edge computing scenarios are met.

On the basis of the above embodiment, the coprocessor can be used as a separate accelerator chip connected to the main processor through the RoCC interface, in which case the main processor needs to reserve a RoCC interface. The coprocessor can be enabled by software through a configuration register, which allows flexible management and configuration of the coprocessor and reduces power consumption.
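The software-configured enabling described above can be sketched, purely by way of example, as a small register model. The bit layout below (bit 0 as the enable flag, bits 2:1 as a vector mode selector) and the class name are hypothetical; the application does not define the register format.

```python
ENABLE_BIT = 0x1                 # hypothetical: bit 0 enables the coprocessor
MODE_SHIFT = 1                   # hypothetical: bits 2:1 select the vector mode
MODE_MASK = 0x3 << MODE_SHIFT


class ControlRegister:
    """Software view of a coprocessor control register (illustrative model)."""

    def __init__(self):
        self.value = 0           # coprocessor starts disabled

    def enable(self):
        self.value |= ENABLE_BIT

    def disable(self):
        # Clearing the enable bit models switching the coprocessor off
        # when it is not needed.
        self.value &= ~ENABLE_BIT

    def set_mode(self, mode):
        """Select vector mode 0, 1 or 2 without touching the other bits."""
        if mode not in (0, 1, 2):
            raise ValueError("mode must be 0, 1 or 2")
        self.value = (self.value & ~MODE_MASK) | (mode << MODE_SHIFT)

    @property
    def enabled(self):
        return bool(self.value & ENABLE_BIT)
```

A read-modify-write update is used in `set_mode` so that changing one field leaves the enable flag unchanged.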

Referring to FIG. 3, FIG. 3 is a flowchart of a data processing method provided in the present application. The method is based on the RISC-V based coprocessor described in the above embodiment, and includes:

S101: receiving data to be processed;

S102: loading the data to be processed to a buffer area of the coprocessor through a second-level cache;

S103: reading the data to be processed in the buffer area, and dividing the data to be processed into vector data and scalar data by using a scheduler;

S104: and sending the vector data to a vector processing module for processing, and sending the scalar data to a scalar processing module for processing.
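The four steps above can be sketched, under simplifying assumptions, as the following Python model. The rule that lists are vector data and plain numbers are scalar data, as well as the placeholder operations (doubling each vector element, incrementing each scalar), are hypothetical choices made only to keep the example runnable; the class and function names are likewise illustrative.

```python
from collections import deque


def is_vector(item):
    """Hypothetical classification rule: lists are vector data."""
    return isinstance(item, list)


class Coprocessor:
    """Toy model of the buffer area, scheduler and processing modules."""

    def __init__(self):
        self.buffer = deque()    # buffer area of the coprocessor

    def load(self, data):
        """S102: load the data to be processed into the buffer area."""
        self.buffer.extend(data)

    def schedule(self):
        """S103: read the buffer and divide its contents into vector and scalar data."""
        vector_data, scalar_data = [], []
        while self.buffer:
            item = self.buffer.popleft()
            (vector_data if is_vector(item) else scalar_data).append(item)
        return vector_data, scalar_data

    def process(self):
        """S104: send each kind of data to its processing module."""
        vector_data, scalar_data = self.schedule()
        vector_results = [[2 * x for x in v] for v in vector_data]  # placeholder vector op
        scalar_results = [s + 1 for s in scalar_data]               # placeholder scalar op
        return vector_results, scalar_results
```

In this model, S101 corresponds to the caller obtaining the data that is then passed to `load`.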

In this embodiment, the processor structure formed by the main processor and the coprocessor shown in FIG. 2 is the execution subject. When data to be processed is received, a corresponding register value on the main processor is changed, which indicates that the data to be processed is handed to the coprocessor for processing. In other words, this embodiment assumes by default that the main processor performs a preliminary check after receiving the data to be processed, in order to determine whether the data needs to be processed by the coprocessor; if so, the corresponding register value is changed and the data is sent to the buffer area of the coprocessor through the second-level cache. Since this embodiment focuses on how the coprocessor is used to process data, the check is not described as a separate step; practical applications may include this preliminary check or use other similar checks.

After the data to be processed enters the buffer area of the coprocessor, the scheduler divides it into vector data and scalar data and distributes them. Because the coprocessor comprises at least one vector processing module but only one scalar processing module, the scheduler needs to distribute the vector data according to the number of vector processing modules actually present in the coprocessor, so that every vector processing module takes part in processing the vector data and the processing efficiency is improved.

On this basis, if the coprocessor includes at least two vector processing modules, then after the scheduler divides the data to be processed into vector data and scalar data, and before the vector data is sent to the vector processing modules, the vector data should be divided into as many pieces of vector sub-data as there are vector processing modules. In step S104, the pieces of vector sub-data are then sent to the respective vector processing modules for processing. It should be noted that this embodiment only requires the number of pieces of vector sub-data to equal the number of vector processing modules; it does not require each vector processing module to receive the same amount of data. That is, the way the scheduler apportions the vector data is not limited, and those skilled in the art can allocate the vector data in various manners.
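One possible allocation, given only as an example, is a near-even contiguous split. The function name and the choice of a near-even split are assumptions of this sketch; as stated above, the embodiment only requires that the number of pieces equal the number of vector processing modules.

```python
def split_vector_data(vector_data, num_modules):
    """Divide vector data into exactly num_modules contiguous pieces.

    Piece sizes differ by at most one element; any other apportionment
    would equally satisfy the requirement on the number of pieces.
    """
    if num_modules < 1:
        raise ValueError("at least one vector processing module is required")
    base, extra = divmod(len(vector_data), num_modules)
    pieces, start = [], 0
    for i in range(num_modules):
        size = base + (1 if i < extra else 0)  # first `extra` pieces get one more
        pieces.append(vector_data[start:start + size])
        start += size
    return pieces
```

When there are fewer elements than modules, some pieces are empty, which still keeps the count of pieces equal to the count of modules.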

After step S104, the first result of the vector processing module and the second result of the scalar processing module are received, the processing result is obtained according to the first result and the second result, the processing result is written back to the buffer area, and finally the processing result is written back from the buffer area to the preset address in the main processor. Because the data to be processed is first sent to the main processor, the processing result produced by the coprocessor can be returned to the preset address of the main processor and from there to the peripheral equipment. Alternatively, a corresponding interface may be configured on the coprocessor to return the processing result directly to the requester of the data processing. The preset address is not particularly limited, and may be in memory or in a cache.
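The write-back path can be illustrated with the following minimal sketch. How the processing result is derived from the first and second results is not specified in the application, so the sketch simply pairs them; the dictionary used as main processor memory, the list used as the buffer area, and the function name are all hypothetical.

```python
def combine_and_write_back(first_result, second_result, buffer, main_memory, preset_address):
    """Combine both results, write them to the buffer area, then to the preset address."""
    # Assumption: the processing result is the pair of both partial results.
    processing_result = {"vector": first_result, "scalar": second_result}
    buffer.append(processing_result)            # write back to the buffer area
    main_memory[preset_address] = buffer[-1]    # write back from the buffer area to the main processor
    return processing_result
```

For instance, calling the function with an empty list as the buffer and an empty dictionary as memory leaves the combined result both in the buffer and at the chosen address.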

On the basis of the above embodiment, in order to ensure the stability of the coprocessor, the state information of the coprocessor can be returned to the main processor through the RoCC interface, and the coprocessor can read the cache information of the main processor or receive the instruction information of the main processor through the same interface. The RoCC interface thus enables the exchange of state information between the main processor and the coprocessor. The main processor usually has a corresponding monitoring element on the host, whereas the coprocessor may not, so the main processor can monitor the state of the coprocessor through the RoCC interface and handle an abnormality of the coprocessor in time. In addition, the buffer area usually caches only the data to be processed, and the corresponding control commands can be transmitted through the RoCC interface. For example, before S101 is executed to receive the data to be processed, a tool chain that supports vector extension instructions may be compiled based on a RISC-V compiler, and the instructions originating from the compiled tool chain can be transmitted through the RoCC interface. Once the tool chain has been compiled, step S102 may be executed to load the data to be processed into the buffer area of the coprocessor through the second-level cache by using instructions from the tool chain. It should be noted that the tool chain only needs to be compiled once; afterwards, the instructions in the tool chain can be used directly, without recompiling each time data to be processed is received.

This embodiment provides a data processing method, that is, it discloses a process in which the main processor and the coprocessor described in the above embodiments cooperate to complete data processing, in particular the processing of vector data by the coprocessor. The dedicated coprocessor is designed through vector instruction extension and meets the high real-time and high-performance requirements of computations such as matrix calculation in edge computing scenarios. At the same time, an open-source compiler tool chain can be used, which simplifies the compilation flow.

The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed, implements the steps provided by the above-described embodiments. The storage medium may include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system provided by the embodiment, the description is relatively simple because the system corresponds to the method provided by the embodiment, and the relevant points can be referred to the method part for description.

The principles and embodiments of the present application are explained herein using specific examples, which are provided only to help understand the method and the core idea of the present application. It should be noted that, for those skilled in the art, it is possible to make several improvements and modifications to the present application without departing from the principle of the present application, and such improvements and modifications also fall within the scope of the claims of the present application.

It is further noted that, in the present specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Claims

1. A RISC-V based coprocessor, comprising:

2. The coprocessor of claim 1, wherein the vector processing module comprises two vector processing units, the vector processing units comprising vector processing sub-units;

3. A data processing method, based on the RISC-V based coprocessor of claim 1 or 2, comprising:

receiving data to be processed;

4. The data processing method according to claim 3, wherein if the coprocessor includes at least two vector processing modules, after the scheduler divides the data to be processed into vector data and scalar data and before the vector data is sent to the vector processing modules for processing, the method further comprises:

sending the vector data to a vector processing module for processing comprises:

5. The data processing method of claim 3, wherein before receiving the data to be processed, further comprising:

6. The data processing method of claim 3, wherein after sending the vector data to a vector processing module for processing and sending the scalar data to a scalar processing module for processing, the method further comprises:

and writing the processing result back to the buffer area.

7. The data processing method of claim 6, wherein after writing the processing result back to the buffer, further comprising:

8. The data processing method of claim 3, further comprising:

9. The data processing method of claim 8, further comprising:

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of claims 3-9.