CN111340185A

CN111340185A - Convolutional neural network acceleration method, system, terminal and storage medium

Info

Publication number: CN111340185A
Application number: CN202010094798.6A
Authority: CN
Inventors: 邹晓峰; 李拓; 刘同强; 周玉龙; 王朝辉; 李仁刚
Original assignee: Suzhou Inspur Intelligent Technology Co Ltd
Current assignee: Suzhou Inspur Intelligent Technology Co Ltd
Priority date: 2020-02-16
Filing date: 2020-02-16
Publication date: 2020-06-26

Abstract

The invention provides a convolutional neural network acceleration method, a system, a terminal and a storage medium, wherein the method comprises the following steps: generating a soft core of the RISC-V processor by using a source code generator; the RISC-V single core is constructed by setting an extended DMA (direct memory access) of the soft core of the RISC-V processor, a memory controller and a distributed memory module; constructing a multi-core acceleration array with a preset specification by using the RISC-V single core; and accessing the many-core acceleration array into a convolutional neural network system, wherein the convolutional neural network system comprises a main processor and convolutional neural network hardware. The invention can greatly improve the memory access bandwidth in the calculation process, reduce the memory access delay, simultaneously improve the calculation performance of the convolutional neural network and realize the calculation acceleration of the convolutional neural network.

Description

Convolutional neural network acceleration method, system, terminal and storage medium

Technical Field

The invention relates to the technical field of convolutional neural networks, in particular to a convolutional neural network acceleration method, a convolutional neural network acceleration system, a convolutional neural network acceleration terminal and a storage medium.

Background

With the advent of the big data era, mass data shows exponential explosive growth along with the improvement of computer performance, and various deep learning algorithms represented by a convolutional neural network are widely applied. However, based on the hierarchical and convolution calculation structures of the neural network, the huge calculation amount and parameters become the performance bottleneck of the convolution neural network, and especially, the large amount of parameter storage and memory access delay become the calculation bottleneck.

Disclosure of Invention

In view of the above-mentioned deficiencies of the prior art, the present invention provides a convolutional neural network acceleration method, system, terminal and storage medium, so as to solve the above-mentioned technical problems.

In a first aspect, the present invention provides a convolutional neural network acceleration method, including:

generating a soft core of the RISC-V processor by using a source code generator;

the RISC-V single core is constructed by setting an extended DMA (direct memory access) of the soft core of the RISC-V processor, a memory controller and a distributed memory module;

constructing a multi-core acceleration array with a preset specification by using the RISC-V single core;

and accessing the many-core acceleration array into a convolutional neural network system, wherein the convolutional neural network system comprises a main processor and convolutional neural network hardware.

Further, the generating of the soft core of the RISC-V processor by the source code generator includes:

generating parameter configuration by an open source RISC-V RockChip generator through a kernel;

and generating a soft core RTL source code of the RISC-V32-bit processor according to the parameter configuration.

Further, the method for constructing the RISC-V single core by setting the extended DMA, the memory controller and the distributed memory module of the soft core of the RISC-V processor comprises the following steps:

the method comprises the following steps that an AXI bus interface based on a soft core of the RISC-V processor extends a direct memory access module, a memory controller and a distributed memory module, wherein the direct memory access module is connected with convolutional neural network hardware.

Further, the method for constructing a multi-core acceleration array with a preset specification by using a RISC-V single core includes:

setting the number of RISC-V single cores of the many-core acceleration array according to the calculation amount requirement of the convolutional neural network;

and constructing a set number of RISC-V single cores to form a many-core acceleration array.

Further, the method further comprises:

generating a 64-bit RISC-V dual-core processor by utilizing an open source RISC-V tool chain;

adding a direct memory access module and a memory device to the RISC-V dual-core processor;

configuring a dual-core RISC-V system by utilizing the open source firmware and the Linux system in the RISC-V ecology;

and setting a RoCC conversion interface in the dual-core RISC-V system.

Further, the setting of the RoCC conversion interface in the dual-core RISC-V system includes:

generating a RoCC conversion interface by utilizing an open source RISC-V tool chain;

and respectively connecting the many-core acceleration array and the convolutional neural network hardware by using the RoCC conversion interface.

In a second aspect, the present invention provides a convolutional neural network acceleration system, including:

the system comprises a main processor, convolutional neural network hardware and a many-core acceleration array, wherein the main processor is in communication connection with the convolutional neural network; the many-core acceleration array is respectively interconnected with the main processor and the convolutional neural network hardware;

the many-core acceleration array comprises a plurality of RISC-V single cores, a RISC-V single core RISC-V32-bit processor, a direct memory access module, a memory controller and a distributed memory module.

Further, the many-core acceleration array is interconnected with the main processor through a RoCC conversion interface; the many-core acceleration array is interconnected with the convolutional neural network through the direct memory access module.

In a third aspect, a terminal is provided, including:

a processor, a memory, wherein,

the memory is used for storing a computer program which,

the processor is used for calling and running the computer program from the memory so as to make the terminal execute the method of the terminal.

In a fourth aspect, a computer storage medium is provided having stored therein instructions that, when executed on a computer, cause the computer to perform the method of the above aspects.

The beneficial effect of the invention is that,

the convolutional neural network acceleration method, the system, the terminal and the storage medium provided by the invention have the advantages that the many-core acceleration sequence based on the RISC-V many-core framework is constructed, the many-core acceleration sequence is accessed into the convolutional neural network system, the concurrent access of the parameters of the convolutional calculation in the convolutional neural network is realized in a parallel access mode, and the high-speed parameter access is provided for the convolutional calculation of the convolutional neural network, so that the access bandwidth in the convolutional calculation is increased, and the access bandwidth bottleneck faced by the existing neural network is eliminated. The invention can greatly improve the memory access bandwidth in the calculation process, reduce the memory access delay, simultaneously improve the calculation performance of the convolutional neural network and realize the calculation acceleration of the convolutional neural network.

In addition, the invention has reliable design principle, simple structure and very wide application prospect.

Drawings

In order to more clearly illustrate the embodiments or technical solutions in the prior art of the present invention, the drawings used in the description of the embodiments or prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained based on these drawings without creative efforts.

FIG. 1 is a schematic flow diagram of a method of one embodiment of the invention.

Fig. 2 is a schematic architecture diagram of a system of one embodiment of the present invention.

Fig. 3 is a schematic structural diagram of a terminal according to an embodiment of the present invention.

Detailed Description

In order to make those skilled in the art better understand the technical solution of the present invention, the technical solution in the embodiment of the present invention will be clearly and completely described below with reference to the drawings in the embodiment of the present invention, and it is obvious that the described embodiment is only a part of the embodiment of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The following explains key terms appearing in the present invention.

The RISC-V architecture is a latest generation open Instruction Set Architecture (ISA), belongs to a reduced instruction set, uses a BSDLicense open source protocol, and has the characteristics of light weight and low power consumption. Users can rapidly design and realize processors based on the RISC-V instruction set based on open source software and hardware ecology of the RISC-V instruction set, wherein the ecology comprises ISA specifications, complete software stacks of embedded type and general computation, various RISC-V processors and system-level hardware basic architectures. The RISC-V design mode adopts a modular design, can meet different application requirements through the combination of different module instructions, and also has an extended instruction function, so that a user can customize the instruction function and realize the corresponding function according to actual requirements. Based on the characteristics, the RISC-V is particularly suitable for application scenes of lightweight and many-core, and is particularly suitable for the design and implementation of many-core accelerators.

DMA (Direct Memory Access) allows hardware devices of different speeds to communicate without relying on a large amount of interrupt load of a CPU, so that peripheral devices can directly Access a Memory through a DMA controller.

FIG. 1 is a schematic flow diagram of a method of one embodiment of the invention. The execution subject in fig. 1 may be a convolutional neural network acceleration system.

As shown in fig. 1, the method 100 includes:

step 110, generating a soft core of the RISC-V processor by using a source code generator;

step 120, a RISC-V single core is constructed by setting the extended DMA of the soft core of the RISC-V processor, the memory controller and the distributed memory module;

step 130, constructing a multi-core acceleration array with a preset specification by using the RISC-V single core;

step 140, accessing the many-core acceleration array to a convolutional neural network system, wherein the convolutional neural network system comprises a main processor and convolutional neural network hardware.

In order to facilitate understanding of the present invention, the acceleration method of the convolutional neural network provided in the present invention is further described below with reference to the principle of the acceleration method of the convolutional neural network of the present invention and the acceleration process of the convolutional neural network in the embodiment.

Specifically, the convolutional neural network acceleration method includes:

and S1, generating a soft core of the RISC-V processor by using the source code generator.

Generating a soft core of the RISC-V processor: an open-source RISC-V RocktChip generator (a set of processor source code generator based on a RISC-V reduced instruction set developed by Berkeley division of California university) is utilized to generate soft-core RTL source codes of a RISC-V32-bit processor through kernel generation parameter configuration.

S2, the RISC-V single core is constructed by setting the extension DMA, the memory controller and the distributed memory module of the soft core of the RISC-V processor.

And for the processor source code generated in the step S1, constructing the simplest 32-bit RISC-V single-core processing system based on the generated AXI bus interface extended DMA, the memory controller and the distributed memory module.

And S3, constructing a multi-core acceleration array with a preset specification by using the RISC-V single core.

According to the specification of the many-core acceleration array set according to the calculation amount of the convolutional neural network, the embodiment sets the many-core acceleration array of 8 × 8, namely, 64 32-bit RISC-V single-core processing systems are created to form the many-core acceleration array.

And S4, accessing the many-core acceleration array into a convolutional neural network system, wherein the convolutional neural network system comprises a main processor and convolutional neural network hardware.

Constructing a RISC-V main processor system: a64-bit RISC-V dual-core processor is generated by utilizing an open source RISC-V tool chain, a DDR controller and a memory device are added, and a dual-core RISC-V system is designed by utilizing open source firmware and a Linux system in a RISC-V ecology. The main processor system can also adopt other existing architecture processors, but the RoCC conversion interface is finally designed to realize the interconnection with the acceleration array. The method for constructing the RoCC interface module comprises the following steps: the RoCC interface module is generated by an open source RISC-V tool chain and is interconnected with a neural network and an RV-32 calculation acceleration array.

Constructing a neural network processing module: the convolutional neural network can be realized by adopting a common multilayer convolutional neural network architecture, and comprises a data input layer, a convolutional calculation layer, an excitation layer, a pooling layer, a full-link layer and the like. In the invention, all storage and access interfaces of convolution calculation are required to be extracted, and are interconnected with the many-core acceleration array through a Buffer (data cache module) by a DMA interface.

Integrating the designed or generated modules, loading system firmware, starting the system, and loading an application program for testing and debugging.

As shown in fig. 2, the present embodiment provides a convolutional neural network acceleration system, including:

Specifically, the main processor includes: 64-bit RISC-V dual-core processor, DDR controller, memory device, open source firmware and Linux system. The system also comprises two identical RoCC conversion interfaces which are respectively interconnected with the convolutional neural network hardware and the many-core acceleration array.

The many-core acceleration array comprises: an 8x8 RV _32 array, i.e., 64 RV _32 computational units. The RV _32 calculation unit includes: RISC-V32-bit processor, AXI bus interface extended DMA, memory controller and distributed memory module.

The convolutional neural network hardware includes: a data input layer, a convolution calculation layer, an excitation layer, a pooling layer, a full-link layer and the like. The convolutional neural network implementation of the present invention depends on the actual neural network type that needs acceleration of the user, and the exemplary system of the present invention employs a common multi-layer convolutional neural network.

In this embodiment, a common multilayer convolutional neural network hardware is adopted, all storage and access interfaces of the convolutional neural network hardware are extracted, and the DMA interface is interconnected with the many-core acceleration array through a Buffer (data cache module).

Fig. 3 is a schematic structural diagram of a terminal system 300 according to an embodiment of the present invention, where the terminal system 300 may be used to execute the convolutional neural network acceleration method according to the embodiment of the present invention.

The terminal system 300 may include: a processor 310, a memory 320, and a communication unit 330. The components communicate via one or more buses, and those skilled in the art will appreciate that the architecture of the servers shown in the figures is not intended to be limiting, and may be a bus architecture, a star architecture, a combination of more or less components than those shown, or a different arrangement of components.

The memory 320 may be used for storing instructions executed by the processor 310, and the memory 320 may be implemented by any type of volatile or non-volatile storage terminal or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk or optical disk. The executable instructions in memory 320, when executed by processor 310, enable terminal 300 to perform some or all of the steps in the method embodiments described below.

The processor 310 is a control center of the storage terminal, connects various parts of the entire electronic terminal using various interfaces and lines, and performs various functions of the electronic terminal and/or processes data by operating or executing software programs and/or modules stored in the memory 320 and calling data stored in the memory. The processor may be composed of an Integrated Circuit (IC), for example, a single packaged IC, or a plurality of packaged ICs connected with the same or different functions. For example, the processor 310 may include only a Central Processing Unit (CPU). In the embodiment of the present invention, the CPU may be a single operation core, or may include multiple operation cores.

A communication unit 330, configured to establish a communication channel so that the storage terminal can communicate with other terminals. And receiving user data sent by other terminals or sending the user data to other terminals.

The present invention also provides a computer storage medium, wherein the computer storage medium may store a program, and the program may include some or all of the steps in the embodiments provided by the present invention when executed. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM) or a Random Access Memory (RAM).

Although the present invention has been described in detail by referring to the drawings in connection with the preferred embodiments, the present invention is not limited thereto. Various equivalent modifications or substitutions can be made on the embodiments of the present invention by those skilled in the art without departing from the spirit and scope of the present invention, and these modifications or substitutions are within the scope of the present invention/any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A convolutional neural network acceleration method, comprising:

2. The method of claim 1, wherein generating a RISC-V processor soft core using a source code generator comprises:

3. The method of claim 1, wherein said constructing a RISC-V single core by configuring extended DMA, memory controller and distributed memory module of said RISC-V processor soft core comprises:

4. The method of claim 1, wherein constructing a pre-specified many-core accelerator array using RISC-V single cores comprises:

5. The method of claim 1, further comprising:

and setting a RoCC conversion interface in the dual-core RISC-V system.

6. The method of claim 5, wherein the setting of the RoCC conversion interface in the dual-core RISC-V system comprises:

7. A convolutional neural network acceleration system, comprising:

8. The system of claim 7, wherein the many-core acceleration array is interconnected with a main processor through a RoCC translation interface; the many-core acceleration array is interconnected with the convolutional neural network through the direct memory access module.

9. A terminal, comprising:

a processor;

a memory for storing instructions for execution by the processor;

wherein the processor is configured to perform the method of any one of claims 1-6.

10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-6.