WO2021017378A1 - FPGA-based convolution parameter acceleration device and data read-write method - Google Patents

FPGA-based convolution parameter acceleration device and data read-write method

Info

Publication number
WO2021017378A1
Authority
WO
WIPO (PCT)
Prior art keywords
convolution parameters
read
last
write
convolution
Prior art date
Application number
PCT/CN2019/126433
Other languages
French (fr)
Chinese (zh)
Inventor
马向华
马成森
边立剑
Original Assignee
上海安路信息科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海安路信息科技有限公司
Publication of WO2021017378A1 publication Critical patent/WO2021017378A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • The present invention relates to the technical field of integrated circuits, and in particular to an FPGA-based convolution parameter acceleration device and a data read-write method.
  • CNN: Convolutional Neural Network.
  • At present, CNN neural networks are implemented mainly on computer platforms.
  • The CNN architecture is deployed on a development-side computer, weight training is performed on massive data, and suitable weight coefficients are finally generated; after the weight coefficients are fixed, an identical architecture can be deployed on the product side.
  • The product-side architecture omits the training part and uses the generated weight coefficients to make the CNN neural network work directly. Where portability and practicality must be considered on the product side, high-end servers and workstations are generally not used to host CNN neural networks; under the requirements of reduced cost and size, embedded development becomes the first choice.
  • FPGA-based CNN neural network convolution accelerators have become a popular research direction, for example "FPGA Parallel Architecture Design of the Convolutional Neural Network (CNN) Algorithm" (Wang Wei et al., Microelectronics and Computers, April 2019) and "Design and Application Research of a Convolutional Neural Network Accelerator Based on the ZYNQ Platform" (Deng Shuai, Beijing University of Technology, May 2018). The latter only describes the theoretical process without giving an actual design model or performance analysis; the former proposes a more concrete neural network convolution accelerator model.
  • According to the paper's own analysis, that model improves performance considerably, but a product-level implementation still suffers from insufficient on-chip data throughput and is difficult to apply in practice.
  • For example, the yoloV2 image classification algorithm requires 17.4G operations per image frame; with that design, even under seamless data flow, only a processing speed of 1.15 frames/s can be achieved.
  • For a CNN neural network accelerator, refer to Figure 1.
  • Existing CNN neural network accelerator technology is not yet mature.
  • The main problems are relatively high cost and low data throughput, which lead to excessive computation latency and make real-time, low-cost applications unattainable.
  • A CNN neural network is a complex system.
  • Existing published designs are basically monolithic, without configurable modular design, which makes design changes, upgrades, and porting inefficient and reduces design reusability.
  • The purpose of the present invention is to provide an FPGA-based convolution parameter acceleration device and a data read-write method that solve the technical problems of slow data processing and insufficient data throughput in the prior art.
  • An FPGA-based method for reading and writing convolution parameter data includes:
  • determining whether a parameter is the last of a set of input convolution parameters; if it is not the last, the write control counter is incremented and an address is allocated in a first random access memory for each parameter of the set;
  • determining whether a parameter is the last of a set of output convolution parameters; if it is not the last, the first random access memory outputs one of the set of convolution parameters according to its address and the first read control counter is incremented; then determining whether the predetermined number of outputs of the set has been completed; if completed, the first read control counter and the second read control counter are cleared.
  • If it is the last of the set of input convolution parameters, the write control counter is cleared.
  • If it is the last of the set of output convolution parameters and the predetermined number of outputs of the set has not been completed, the first read control counter is cleared and the second read control counter is incremented by 1.
  • While a set of convolution parameters is written into the first random access memory, the second random access memory outputs another set of convolution parameters; or, while the first random access memory outputs a set of convolution parameters, another set is written into the second random access memory.
  • After the input of a set of convolution parameters is completed, the method further includes: determining whether it is the last of another input set of convolution parameters; if it is not the last, the write control counter is incremented and an address is allocated in the second random access memory for each parameter of that other set.
  • After the output of a set of convolution parameters is completed, the method further includes: determining whether it is the last of another output set of convolution parameters; if it is not the last, the second random access memory outputs one of that other set and the first read control counter is incremented.
  • This application also discloses an FPGA-based convolution parameter acceleration device, including:
  • at least one random access memory configured to store convolution parameters;
  • a write address control unit configured to determine whether a parameter is the last of a set of input convolution parameters; if it is not the last, the write control counter is incremented and an address is allocated in the first random access memory for each parameter of the set;
  • a read address control unit that determines whether a parameter is the last of a set of output convolution parameters; if it is not the last, the first random access memory outputs one of the set according to its address and the first read control counter is incremented; the unit then determines whether the predetermined number of outputs of the set has been completed, and if so, the first read control counter and the second read control counter are cleared.
  • The device may include first and second random access memories: while a set of convolution parameters is written into the first, the second outputs another set; or, while the first outputs a set, another set is written into the second.
  • The write address control unit is also configured to determine whether a parameter is the last of another input set of convolution parameters; if it is not the last, an address is allocated in the second random access memory for each parameter of that other set and the write control counter is incremented.
  • The read address control unit is also configured to determine whether a parameter is the last of another output set of convolution parameters; if it is not the last, the second random access memory outputs one of that other set and the first read control counter is incremented.
  • The FPGA-based convolution parameter acceleration device of the present invention uses minimal logic resources to form minimal convolution parameter management.
  • The device's interface is simple and easy to use, it occupies few resources, is easy to port, and has short input and output paths. Because two random access memories are used internally, data can be read and written at the same time and output continuously, keeping the device at its peak state for long periods, which greatly improves parallelism and achieves high data throughput.
  • Figure 1 shows a process diagram of the convolution technology in the CNN neural network model in the prior art
  • Figure 2 shows a schematic diagram of an acceleration device in an embodiment of the present invention
  • Figure 3 shows a schematic diagram of an acceleration device in another embodiment of the present invention.
  • Figure 4 shows a process diagram of data writing in an embodiment of the present invention
  • Figure 5 shows a process diagram of data output in an embodiment of the present invention.
  • CNN: Convolutional Neural Network
  • Convolution parameters: convolution kernel parameters in a CNN
  • FPGA: Field Programmable Gate Array
  • RAM: Random Access Memory
  • The acceleration device 100 includes:
  • at least one random access memory, shown in FIG. 2 as a first random access memory 101, configured to store convolution parameters;
  • a write address control unit 201 configured to determine whether a parameter is the last of a set of input convolution parameters; if it is not the last, the write control counter (not shown in the figure) in the write address control unit 201 is incremented by 1, and an address is allocated in the first random access memory 101 for each parameter of the set;
  • a read address control unit 202 configured to determine whether a parameter is the last of a set of output convolution parameters; if it is not the last, the first random access memory 101 outputs one of the set according to its address and the first read control counter in the read address control unit 202 is incremented by 1; the unit then determines whether the predetermined number of outputs of the set has been completed, and if so, the first and second read control counters (not shown in the figure) in the read address control unit 202 are cleared.
  • In this way, minimal logic resources form a minimized convolution parameter management acceleration unit whose interface is simple and easy to use, with low resource occupation, easy porting, and short input and output paths.
  • In another embodiment, the acceleration device of the present application includes a first random access memory 101 and a second random access memory 102; while a set of convolution parameters is written into the first random access memory 101, the second random access memory 102 outputs another set of convolution parameters; or, while the first random access memory 101 outputs a set of convolution parameters, another set is written into the second random access memory 102.
  • The write address control unit 201 is also configured to determine whether a parameter is the last of another input set of convolution parameters; if it is not the last, an address is allocated in the second random access memory 102 for each parameter of that other set, and the write control counter is incremented by 1.
  • The read address control unit 202 is also configured to determine whether a parameter is the last of another output set of convolution parameters; if it is not the last, the second random access memory 102 outputs one of that other set, and the first read control counter in the read address control unit 202 is incremented.
  • Because the acceleration device uses two RAMs internally, data can be read and written at the same time: one RAM writes data while the other reads data, so parallel data processing is realized, output is continuous, and the peak state is maintained for long periods.
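The two-RAM scheme above is a classic ping-pong (double-buffer) arrangement. As an illustration only, a minimal software model of the idea might look like the following sketch; all names (`PingPongParams`, `swap`, etc.) are hypothetical, since the patent describes hardware RAMs and control units, not code:

```python
# Illustrative ping-pong (double-buffer) model of the two-RAM scheme:
# one RAM is filled with the next parameter set while the other is
# being read, so output never has to stall between kernel sets.

class PingPongParams:
    def __init__(self):
        self.banks = [[], []]   # the first and second random access memories
        self.read_bank = 0      # index of the bank currently being read

    def swap(self):
        # The roles of the two RAMs exchange each time a set is consumed.
        self.read_bank ^= 1

    def write_next_set(self, params):
        # Fill the bank that is NOT being read, concurrently with output.
        self.banks[self.read_bank ^ 1] = list(params)

    def read_current_set(self):
        return list(self.banks[self.read_bank])

pp = PingPongParams()
pp.write_next_set([1, 2, 3])    # load kernel set A into the idle bank
pp.swap()
pp.write_next_set([4, 5, 6])    # load set B while set A is readable
assert pp.read_current_set() == [1, 2, 3]
pp.swap()
assert pp.read_current_set() == [4, 5, 6]
```

In hardware the "write" and "read" calls would proceed in the same clock cycles on separate RAM ports; the model only shows how the bank roles alternate.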
  • Another embodiment of the present application also discloses an FPGA-based data read-write method, including:
  • The FPGA-based data writing method includes:
  • Step 11: determine whether there is data input; if there is no data input, go to step 15, where the write control counter of the write control unit is cleared;
  • Step 12: determine whether it is the last of the input set of convolution parameters; if it is not the last, go to step 13, where the write control counter is incremented by 1, and then to step 14, where an address is allocated in the first random access memory for each parameter of the set;
  • If it is the last of the set of convolution parameters, go to step 15, where the write control counter of the write control unit is cleared.
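The write flow (steps 11 through 15) can be sketched as a small software model. This is an illustration only; the class and method names are hypothetical, and the patent specifies a hardware counter and RAM rather than Python objects:

```python
# Illustrative model of the write address control flow (steps 11-15):
# each incoming convolution parameter is stored at the address given by
# the write control counter, which increments until the last parameter
# of the set arrives and is then cleared.

class WriteAddressControl:
    def __init__(self, ram_size):
        self.ram = [None] * ram_size   # first random access memory
        self.write_counter = 0         # write control counter

    def on_input(self, value, is_last):
        self.ram[self.write_counter] = value   # step 14: store at allocated address
        if is_last:
            self.write_counter = 0             # step 15: clear the counter
        else:
            self.write_counter += 1            # step 13: increment the counter

ctrl = WriteAddressControl(ram_size=16)
params = [3, 1, 4, 1, 5, 9, 2, 6, 5]           # one set of kernel weights
for i, p in enumerate(params):
    ctrl.on_input(p, is_last=(i == len(params) - 1))
assert ctrl.ram[:len(params)] == params        # every parameter got an address
assert ctrl.write_counter == 0                 # counter cleared after the last one
```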
  • The FPGA-based data reading method includes:
  • Step 21: determine whether there is data output; if there is no data output, go to step 26, where the first read control counter and the second read control counter are cleared.
  • Step 22: determine whether it is the last of the set of convolution parameters; if it is the last, go to step 27, where the first read control counter is cleared and the second read control counter is incremented by 1.
  • Step 23: if it is not the last, the first random access memory 101 outputs one of the set of convolution parameters according to its address; at the same time, in step 24, the first read control counter is incremented by 1; then step 21 is repeated.
  • In step 27, the first read control counter is cleared and the second read control counter is incremented by 1, indicating that one complete pass of the set of convolution parameters has been output.
  • Step 25: determine whether the predetermined number of outputs of the set of convolution parameters has been completed; if not, return to step 21 to determine whether to output data.
  • If completed, step 26 is entered, and the first read control counter and the second read control counter are cleared.
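The read flow (steps 21 through 27) can likewise be sketched in software. The first read control counter addresses parameters within the set, while the second counts completed passes, since a kernel set is output a predetermined number of times. Function and variable names here are hypothetical illustrations, not the patent's hardware:

```python
# Illustrative model of the read address control flow (steps 21-27):
# the set is output `repeats` times; the first counter walks the
# addresses within the set, the second counts completed passes.

def read_parameter_set(ram, set_len, repeats):
    first_cnt = 0    # first read control counter (address within the set)
    second_cnt = 0   # second read control counter (completed passes)
    out = []
    while second_cnt < repeats:          # step 25: predetermined count reached?
        out.append(ram[first_cnt])       # step 23: output by address
        if first_cnt == set_len - 1:     # step 22: last parameter of the set?
            first_cnt = 0                # step 27: clear the first counter...
            second_cnt += 1              # ...and increment the second
        else:
            first_cnt += 1               # step 24: increment the first counter
    return out                           # step 26: both counters start over cleared

ram = [3, 1, 4, 1, 5]
assert read_parameter_set(ram, set_len=5, repeats=3) == ram * 3
```

Repeating the set this way matches the device's purpose: one kernel's weights are reused across many output positions, so the RAM streams the same set continuously without re-loading it.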
  • While a set of convolution parameters is written into the first random access memory 101, the second random access memory 102 outputs another set of convolution parameters; or, while the first random access memory 101 outputs a set of convolution parameters, another set of convolution parameters is written into the second random access memory 102.
  • After a set of convolution parameters has been stored in the first random access memory 101, the method further includes: determining whether it is the last of another input set of convolution parameters; if it is not the last, the write control counter is incremented and an address is allocated in the second random access memory for each parameter of that other set, so that the other set of convolution parameters is written into the second random access memory while the first random access memory can output data at the same time.
  • After the output of a set of convolution parameters from the first random access memory 101 is completed, the method further includes: determining whether it is the last of another output set of convolution parameters; if it is not the last, the second random access memory outputs one of that other set and the first read control counter is incremented, so that the second random access memory outputs data while the first random access memory can write data at the same time.
  • Because the acceleration device uses two RAMs, data can be read and written at the same time: one RAM writes data while the other reads data, so parallel data processing is realized, output is continuous, and the peak state is maintained for long periods.
  • The first embodiment is a method embodiment corresponding to this embodiment.
  • The technical details in the first embodiment can be applied to this embodiment, and the technical details in this embodiment can also be applied to the first embodiment.
  • Each module shown in the embodiment of the acceleration device can be understood with reference to the relevant description of the data read-write method.
  • The functions of the modules shown in the embodiments of the acceleration device can be realized by a program (executable instructions) running on a processor, or by dedicated logic circuits. If the acceleration device of the embodiments of the present application is implemented in the form of software function modules and sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present application, in essence or in the part contributing over the prior art, can be embodied in the form of a software product.
  • The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the methods described in the various embodiments of the present application.
  • The aforementioned storage media include media that can store program code, such as USB flash drives, removable hard disks, read-only memory (ROM), magnetic disks, and optical discs. Accordingly, the embodiments of the present application are not limited to any specific combination of hardware and software.
  • FPGA-readable storage media include permanent and non-permanent, removable and non-removable media, and information storage can be realized by any method or technology.
  • The information can be computer-readable instructions, data structures, program modules, or other data.
  • Examples of storage media for FPGA configuration files include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible by a computing device. As defined herein, FPGA-readable storage media do not include transitory media such as modulated data signals and carrier waves.
  • PRAM: phase-change memory
  • SRAM: static random access memory
  • DRAM: dynamic random access memory
  • RAM: random access memory
  • ROM: read-only memory
  • EEPROM: electrically erasable programmable read-only memory
  • Flash memory or other memory technologies
  • CD-ROM: compact disc read-only memory
  • DVD: digital versatile disc
  • Magnetic cassettes, magnetic tape, or other magnetic storage devices
  • When an act is performed "based on" a certain element, it means the act is performed at least based on that element; this includes two situations: performing the act based only on that element, and performing the act based on that element together with other elements.
  • Expressions such as "multiple" and "a plurality of" include two and more than two.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)
  • Advance Control (AREA)
  • Techniques For Improving Reliability Of Storages (AREA)

Abstract

An FPGA-based convolution parameter acceleration device (100) and a data read-write method. The method comprises: determining whether a parameter is the last of a set of input convolution parameters and, if not, incrementing a write control counter and allocating an address in a first random access memory (101) for each of the set of convolution parameters; determining whether a parameter is the last of a set of output convolution parameters and, if not, having the first random access memory (101) output one of the set of convolution parameters according to its address and incrementing a first read control counter; and determining whether the predetermined number of outputs of the set of convolution parameters has been completed and, if so, clearing the first read control counter and a second read control counter.

Description

FPGA-based convolution parameter acceleration device and data read-write method
Technical field
The present invention relates to the technical field of integrated circuits, and in particular to an FPGA-based convolution parameter acceleration device and a data read-write method.
Background
In the field of artificial intelligence, the convolutional neural network (CNN) is a relatively mature technical solution. It has representation-learning capability and can perform shift-invariant classification of input information according to its hierarchical structure. With the advancement of deep learning theory and improvements in numerical computing hardware, CNN neural networks have developed rapidly and are widely applied in computer vision, natural language processing, and other fields, mainly for target classification; they hold an overwhelming advantage in applications such as image recognition and speech recognition.
At present, CNN neural networks are implemented mainly on computer platforms: the CNN architecture is deployed on a development-side computer, weight training is performed on massive data, and suitable weight coefficients are finally generated. After the weight coefficients are fixed, an identical CNN architecture, with the training part removed, can be deployed on the product side, and the generated weight coefficients allow the CNN neural network to work directly. Where portability and practicality must be considered on the product side, high-end servers and workstations are generally not used to host CNN neural networks; under the requirements of reduced cost and size, embedded development becomes the first choice.
Research on embedded CNN development has been carried out in recent years. Digital signal processing (DSP) and ARM (Advanced RISC Machines) approaches are generally ruled out because their computation time is too long, so FPGA design of CNN neural network convolution accelerators has become a popular research direction, for example "FPGA Parallel Architecture Design of the Convolutional Neural Network (CNN) Algorithm" (Wang Wei et al., Microelectronics and Computers, April 2019) and "Design and Application Research of a Convolutional Neural Network Accelerator Based on the ZYNQ Platform" (Deng Shuai, Beijing University of Technology, May 2018). The latter only describes the theoretical process without giving an actual design model or performance analysis. The former proposes a more concrete neural network convolution accelerator model; according to the paper's own analysis, the model improves performance considerably, but for a product-level implementation it still suffers from insufficient on-chip data throughput and is difficult to apply in practice. For example, the yoloV2 image classification algorithm requires 17.4G operations per image frame; with that design, even under the condition of seamless data flow, only a processing speed of 1.15 frames/s can be achieved.
A CNN neural network accelerator is shown in Figure 1. Existing CNN neural network accelerator technology is not yet mature: the main problems are relatively high cost and low data throughput, which lead to excessive computation latency, so real-time, low-cost applications cannot be satisfied. In addition, since a CNN neural network is a complex system, existing published design approaches are basically monolithic, without configurable modular design, which makes design changes, upgrades, and porting inefficient and reduces design reusability.
Summary of the invention
The purpose of the present invention is to provide an FPGA-based convolution parameter acceleration device and a data read-write method that solve the technical problems of slow data processing and insufficient data throughput in the prior art.
To solve the above problems, this application discloses an FPGA-based method for reading and writing convolution parameter data, including:
determining whether a parameter is the last of a set of input convolution parameters; if it is not the last, the write control counter is incremented and an address is allocated in a first random access memory for each parameter of the set;
determining whether a parameter is the last of a set of output convolution parameters; if it is not the last, the first random access memory outputs one of the set of convolution parameters according to its address and a first read control counter is incremented; and determining whether the predetermined number of outputs of the set of convolution parameters has been completed; if completed, the first read control counter and a second read control counter are cleared.
In a preferred example, if it is the last of the set of input convolution parameters, the write control counter is cleared.
In a preferred example, if it is the last of the set of output convolution parameters and the predetermined number of outputs of the set has not been completed, the first read control counter is cleared and the second read control counter is incremented by 1.
In a preferred example, while a set of convolution parameters is written into the first random access memory, a second random access memory outputs another set of convolution parameters; or, while the first random access memory outputs a set of convolution parameters, another set of convolution parameters is written into the second random access memory.
In a preferred example, after the input of a set of convolution parameters is completed, the method further includes: determining whether it is the last of another input set of convolution parameters; if it is not the last, the write control counter is incremented and an address is allocated in the second random access memory for each parameter of that other set.
In a preferred example, after the output of a set of convolution parameters is completed, the method further includes: determining whether it is the last of another output set of convolution parameters; if it is not the last, the second random access memory outputs one of that other set of convolution parameters and the first read control counter is incremented.
本申请还公开了一种基于FPGA的卷积参数加速装置包括:The application also discloses an FPGA-based convolution parameter acceleration device including:
至少一个随机读写存储器,配置为存储卷积参数;At least one random read-write memory, configured to store convolution parameters;
写地址控制单元,配置为判断是否为输入的一组卷积参数的最后一个,若不是最后一个,则写控制计数器自增,在第一随机读写存储器中为该组卷积参数中的每个分配地址;The write address control unit is configured to determine whether it is the last one of the input set of convolution parameters, if it is not the last one, the write control counter is incremented, and in the first random read/write memory, it is each set of convolution parameters. Allocation address
读地址控制单元,判断是否为输出的一组卷积参数的最后一个,若不是最后一个,所述第一随机读写存储器根据地址输出该组卷积参数中的一个,第一读控制计数器自增;判断是否完成该组卷积参数的预定次数的输出,若完成,则所述第一读控制计数器、第二读控制计数器清零。The read address control unit judges whether it is the last one of a set of output convolution parameters. If it is not the last one, the first random read/write memory outputs one of the set of convolution parameters according to the address, and the first read control counter automatically Increment; It is judged whether the output of the predetermined number of times of the set of convolution parameters is completed, and if completed, the first read control counter and the second read control counter are cleared.
In a preferred example, the device includes first and second random access memories; while a group of convolution parameters is written into the first random access memory, the second random access memory outputs another group of convolution parameters; or, while the first random access memory outputs a group of convolution parameters, another group of convolution parameters is written into the second random access memory.
In a preferred example, the device includes first and second random access memories; the write address control unit is further configured to determine whether an input convolution parameter is the last one of another group of convolution parameters; if it is not the last one, an address is allocated in the second random access memory for each parameter of that group and the write control counter is incremented.
In a preferred example, the device includes first and second random access memories; the read address control unit is further configured to determine whether an output convolution parameter is the last one of another group of convolution parameters; if it is not the last one, the second random access memory outputs one parameter of that group and the first read control counter is incremented.
Compared with the prior art, the present application has the following beneficial effects:
The FPGA-based convolution parameter acceleration device of the present invention uses minimal logic resources to form a minimized convolution parameter management unit. The device has a simple, easy-to-use interface, occupies few resources, is easy to port, and has short input and output paths. Because two random access memories are used internally, data can be read and written simultaneously and output continuously, keeping the device at peak throughput for long periods, which greatly improves parallelism and achieves a high data throughput rate.
Description of the Drawings
Fig. 1 shows a process diagram of the convolution technique in a prior-art CNN neural network model;
Fig. 2 shows a schematic diagram of an acceleration device in an embodiment of the present invention;
Fig. 3 shows a schematic diagram of an acceleration device in another embodiment of the present invention;
Fig. 4 shows a process diagram of data writing in an embodiment of the present invention;
Fig. 5 shows a process diagram of data output in an embodiment of the present invention.
Detailed Description
In the following description, numerous technical details are set forth so that the reader may better understand the present application. However, those of ordinary skill in the art will understand that the technical solutions claimed in the present application can be implemented even without these technical details and with various changes and modifications based on the following embodiments.
To make the objectives, technical solutions, and advantages of the present invention clearer, embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
Explanation of some terms:
CNN: Convolutional Neural Network
Convolution parameters: the convolution kernel parameters in a CNN
FPGA: Field-Programmable Gate Array
RAM: Random Access Memory
Referring to Fig. 2, the present application further discloses an FPGA-based convolution parameter acceleration device. The acceleration device 100 includes:
at least one random access memory (RAM), configured to store convolution parameters; Fig. 2 shows a device including one first random access memory 101;
a write address control unit 201, configured to determine whether an input convolution parameter is the last one of a group of convolution parameters; if it is not the last one, a write control counter (not shown) in the write address control unit 201 is incremented by 1 and an address is allocated in the first random access memory 101 for each parameter of the group;
a read address control unit 202, configured to determine whether an output convolution parameter is the last one of a group of convolution parameters; if it is not the last one, the first random access memory 101 outputs one parameter of the group according to its address and a first read control counter in the read address control unit 202 is incremented by 1; and to determine whether the group of convolution parameters has been output a predetermined number of times; if so, a second read control counter (not shown) in the read address control unit 202 is cleared. In this embodiment, minimal logic resources form a minimized acceleration unit for convolution parameter management: its interface is simple and easy to use, it occupies few resources, it is easy to port, and its input and output paths are short.
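As a rough illustration of the counter behavior just described, the following Python sketch models the write address control unit in software. This is only an aid to understanding: the actual unit is FPGA logic, and the class and method names are illustrative, not taken from the patent.

```python
class WriteAddressControl:
    """Software model of the write address control unit: allocates sequential
    RAM addresses for the parameters of one group; the write control counter
    is cleared after the last parameter of the group."""

    def __init__(self):
        self.write_counter = 0  # the "write control counter"

    def on_input(self, is_last: bool) -> int:
        """Called once per input convolution parameter; returns the RAM
        address allocated for that parameter."""
        addr = self.write_counter
        if is_last:
            self.write_counter = 0   # group finished: clear the counter
        else:
            self.write_counter += 1  # more parameters follow: increment
        return addr
```

For a group of three parameters, the unit hands out addresses 0, 1, 2 and then returns to address 0, ready for the next group.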
In a preferred example, referring to Fig. 3, the acceleration device of the present application includes a first random access memory 101 and a second random access memory 102. While a group of convolution parameters is written into the first random access memory 101, the second random access memory 102 outputs another group of convolution parameters; or, while the first random access memory 101 outputs a group of convolution parameters, another group of convolution parameters is written into the second random access memory 102.
In a preferred example, the device includes a first random access memory 101 and a second random access memory 102; the write address control unit 201 is further configured to determine whether an input convolution parameter is the last one of another group of convolution parameters; if it is not the last one, an address is allocated in the second random access memory 102 for each parameter of that group and the write control counter is incremented by 1.
In a preferred example, the device includes a first random access memory 101 and a second random access memory 102; the read address control unit 202 is further configured to determine whether an output convolution parameter is the last one of another group of convolution parameters; if it is not the last one, the second random access memory 102 outputs one parameter of that group and the first read control counter in the read address control unit 202 is incremented.
Because the acceleration device uses two RAMs internally, data can be read and written simultaneously: one RAM is used for writing while the other is used for reading. This enables parallel processing and continuous output, keeping the device at its peak throughput for long periods.
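The two-RAM arrangement above is a classic ping-pong (double) buffer. A minimal Python sketch of the role-swapping, assuming a simple "write one whole group, then swap" discipline (the names are illustrative, and a real FPGA design swaps roles with select signals rather than Python lists):

```python
class PingPongBuffer:
    """Two RAMs: at any time one is the write side and the other the read
    side, so a new group can be loaded while the previous one streams out."""

    def __init__(self):
        self.rams = [[], []]  # RAM 101 and RAM 102, modeled as lists
        self.write_sel = 0    # index of the RAM currently used for writing

    def load_group(self, group):
        """Write a new group into the write-side RAM, then swap roles so the
        freshly written RAM becomes the read side."""
        self.rams[self.write_sel] = list(group)
        self.write_sel ^= 1

    def read_group(self):
        """Stream out the group held in the read-side RAM."""
        return list(self.rams[self.write_sel ^ 1])
```

With this discipline, `read_group()` always returns the most recently completed group, while `load_group()` can fill the other RAM concurrently.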
Another embodiment of the present application discloses an FPGA-based data read/write method, including:
Referring to Fig. 4, the FPGA-based data writing method includes:
In step 11, it is determined whether there is data input; if there is no data input, the method proceeds to step 15, where the write control counter of the write control unit is cleared.
If there is data input, the method proceeds to step 12, where it is determined whether the input is the last one of a group of convolution parameters. If it is not the last one, the method proceeds to step 13, where the write control counter is incremented by 1, and then to step 14, where an address is allocated in the first random access memory for each parameter of the group.
If it is the last one of the group of convolution parameters, the method proceeds to step 15, where the write control counter of the write control unit is cleared.
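The write flow of Fig. 4 can be sketched in software as follows. This is a hedged model, not the FPGA implementation: the function name is illustrative, and the RAM is modeled as a Python dict mapping addresses to parameters.

```python
def write_group(ram, params):
    """Software model of steps 11-15: allocate one sequential address per
    parameter of the group; the write control counter is cleared after the
    last parameter (and when there is no input)."""
    write_counter = 0
    for i, p in enumerate(params):           # step 11: data input present
        is_last = (i == len(params) - 1)     # step 12: last of the group?
        ram[write_counter] = p               # step 14: store at allocated address
        if is_last:
            write_counter = 0                # step 15: clear the counter
        else:
            write_counter += 1               # step 13: increment the counter
    return ram
```

Writing a three-parameter group into an empty RAM model fills addresses 0, 1, and 2 and leaves the counter cleared for the next group.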
Referring to Fig. 5, the FPGA-based data output method includes:
First, in step 21, it is determined whether there is data output; if there is no data output, the method proceeds to step 26, where the first read control counter and the second read control counter are cleared.
If there is data output, the method proceeds to step 22, where it is determined whether the output is the last one of a group of convolution parameters. If it is the last one of the group, the method proceeds to step 27, where the first read control counter is cleared and the second read control counter is incremented by 1.
If it is not the last one, the method proceeds to step 23, where the first random access memory 101 outputs one parameter of the group according to its address, and at the same time to step 24, where the first read control counter is incremented by 1; step 21 is then performed again.
In a preferred example, if the output is the last one of a group of convolution parameters and the group has not yet been output the predetermined number of times, then, corresponding to step 27, the first read control counter is cleared and the second read control counter is incremented by 1, indicating that the output of one group of convolution parameters for one point has been completed.
Next, the method proceeds to step 25, where it is determined whether the group of convolution parameters has been output the predetermined number of times; if not, the method returns to step 21 to determine whether there is data to output.
If the group of convolution parameters has been output the predetermined number of times, the method proceeds to step 26, where the first read control counter and the second read control counter are cleared.
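The output flow of Fig. 5 can likewise be sketched as a software model. This is an illustrative reading of steps 21-27, assuming the last parameter of the group is also output when its counter rolls over; the function and variable names are not from the patent.

```python
def output_group(ram, group_size, repeats):
    """Software model of steps 21-27: stream the group out `repeats` times.
    The first read control counter indexes within the group; the second
    counts completed passes over the group."""
    out = []
    read1 = 0  # first read control counter: position within the group
    read2 = 0  # second read control counter: completed group outputs
    while read2 < repeats:                 # step 25: predetermined count done?
        out.append(ram[read1])             # step 23: output by address
        if read1 == group_size - 1:        # step 22: last of the group
            read1 = 0                      # step 27: clear first counter...
            read2 += 1                     # ...and increment the second
        else:
            read1 += 1                     # step 24: increment first counter
    read1 = read2 = 0                      # step 26: clear both counters
    return out
```

For a two-parameter group output three times, the model emits the group back-to-back three times and then clears both counters.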
In a preferred example, while a group of convolution parameters is written into the first random access memory 101, the second random access memory 102 outputs another group of convolution parameters; or, while the first random access memory 101 outputs a group of convolution parameters, another group of convolution parameters is written into the second random access memory 102.
In a preferred example, after the input of one group of convolution parameters is completed, that group is stored in the first random access memory 101. The method then further includes: determining whether an input convolution parameter is the last one of another group; if it is not the last one, the write control counter is incremented and an address is allocated in the second random access memory for each parameter of that group, so that the other group of convolution parameters is written into the second random access memory while the first random access memory can output data at the same time.
In a preferred example, after the output of one group of convolution parameters from the first random access memory 101 is completed, the method further includes: determining whether an output convolution parameter is the last one of another group; if it is not the last one, the second random access memory outputs one parameter of that group and the first read control counter is incremented, so that the second random access memory is used to output data while the first random access memory can be written at the same time.
Because the acceleration device uses two RAMs internally, data can be read and written simultaneously: one RAM is used for writing while the other is used for reading. This enables parallel processing and continuous output, keeping the device at its peak throughput for long periods.
The first embodiment is the embodiment corresponding to the present embodiment; technical details described in the first embodiment may be applied in this embodiment, and technical details described in this embodiment may also be applied in the first embodiment.
It should be noted that those skilled in the art will understand that the functions of the modules shown in the embodiments of the above acceleration device can be understood with reference to the related description of the data read/write method. The functions of these modules may be realized by a program (executable instructions) running on a processor, or by specific logic circuits. If the acceleration device of the embodiments of the present application is implemented in the form of software function modules and sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present application, in essence, or the part contributing to the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods described in the embodiments of the present application. The aforementioned storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disc. Thus, the embodiments of the present application are not limited to any specific combination of hardware and software.
Correspondingly, another embodiment of the present application is implemented by a configuration file in an FPGA-readable storage medium. FPGA-readable storage media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of storage media for FPGA configuration files include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can store information accessible by a computing device. As defined herein, FPGA-readable storage media do not include transitory media such as modulated data signals and carrier waves.
It should be noted that, in the application documents of this patent, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a" does not exclude the existence of other identical elements in the process, method, article, or device that includes that element. In the application documents of this patent, if it is mentioned that an action is performed according to a certain element, it means that the action is performed at least according to that element, which includes two cases: performing the action only according to that element, and performing the action according to that element and other elements. Expressions such as "multiple" and "multiple times" include two, twice, two kinds, as well as more than two, more than twice, and more than two kinds.
All documents mentioned in this specification are considered to be included in the disclosure of the present application as a whole, so that they may serve as a basis for amendment when necessary. In addition, it should be understood that the above are only preferred embodiments of this specification and are not intended to limit its scope of protection. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of one or more embodiments of this specification shall fall within the scope of protection of one or more embodiments of this specification.
In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the specific or sequential order shown to achieve the desired results. In some implementations, multitasking and parallel processing are also possible or may be advantageous.

Claims (10)

  1. An FPGA-based method for reading and writing convolution parameter data, comprising:
    determining whether an input convolution parameter is the last one of a group of convolution parameters; if it is not the last one, incrementing a write control counter and allocating an address in a first random access memory for each parameter of the group;
    determining whether an output convolution parameter is the last one of a group of convolution parameters; if it is not the last one, outputting, by the first random access memory, one parameter of the group according to its address and incrementing a first read control counter; and determining whether the group of convolution parameters has been output a predetermined number of times; if so, clearing the first read control counter and a second read control counter.
  2. The method of claim 1, wherein, if the input convolution parameter is the last one of the group, the write control counter is cleared.
  3. The method of claim 1, wherein, if the output convolution parameter is the last one of the group and the group has not yet been output the predetermined number of times, the first read control counter is cleared and the second read control counter is incremented by 1.
  4. The method of claim 1, wherein, while a group of convolution parameters is written into the first random access memory, a second random access memory outputs another group of convolution parameters; or, while the first random access memory outputs a group of convolution parameters, another group of convolution parameters is written into the second random access memory.
  5. The method of claim 1, further comprising, after the input of a group of convolution parameters is completed: determining whether an input convolution parameter is the last one of another group; if it is not the last one, incrementing the write control counter and allocating an address in a second random access memory for each parameter of that group.
  6. The method of claim 1, further comprising, after the output of a group of convolution parameters is completed: determining whether an output convolution parameter is the last one of another group; if it is not the last one, outputting, by a second random access memory, one parameter of that group and incrementing the first read control counter.
  7. An FPGA-based convolution parameter acceleration device, comprising:
    at least one random access memory, configured to store convolution parameters;
    a write address control unit, configured to determine whether an input convolution parameter is the last one of a group of convolution parameters; if it is not the last one, a write control counter is incremented and an address is allocated in a first random access memory for each parameter of the group;
    a read address control unit, configured to determine whether an output convolution parameter is the last one of a group of convolution parameters; if it is not the last one, the first random access memory outputs one parameter of the group according to its address and a first read control counter is incremented; and to determine whether the group of convolution parameters has been output a predetermined number of times; if so, the first read control counter and a second read control counter are cleared.
  8. The device of claim 7, comprising first and second random access memories, wherein, while a group of convolution parameters is written into the first random access memory, the second random access memory outputs another group of convolution parameters; or, while the first random access memory outputs a group of convolution parameters, another group of convolution parameters is written into the second random access memory.
  9. The device of claim 7, comprising first and second random access memories, wherein the write address control unit is further configured to: determine whether an input convolution parameter is the last one of another group; if it is not the last one, allocate an address in the second random access memory for each parameter of that group, the write control counter being incremented.
  10. The device of claim 7, comprising first and second random access memories, wherein the read address control unit is further configured to: determine whether an output convolution parameter is the last one of another group; if it is not the last one, the second random access memory outputs one parameter of that group and the first read control counter is incremented.
PCT/CN2019/126433 2019-08-01 2019-12-18 Fpga-based convolution parameter acceleration device and data read-write method WO2021017378A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910708612.9 2019-08-01
CN201910708612.9A CN110390392B (en) 2019-08-01 2019-08-01 Convolution parameter accelerating device based on FPGA and data reading and writing method

Publications (1)

Publication Number Publication Date
WO2021017378A1 true WO2021017378A1 (en) 2021-02-04

Family

ID=68288406

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/126433 WO2021017378A1 (en) 2019-08-01 2019-12-18 Fpga-based convolution parameter acceleration device and data read-write method

Country Status (2)

Country Link
CN (1) CN110390392B (en)
WO (1) WO2021017378A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110390392B (en) * 2019-08-01 2021-02-19 上海安路信息科技有限公司 Convolution parameter accelerating device based on FPGA and data reading and writing method

Citations (4)

Publication number Priority date Publication date Assignee Title
US20180341495A1 (en) * 2017-05-26 2018-11-29 Purdue Research Foundation Hardware Accelerator for Convolutional Neural Networks and Method of Operation Thereof
CN109409509A (en) * 2018-12-24 2019-03-01 济南浪潮高新科技投资发展有限公司 A kind of data structure and accelerated method for the convolutional neural networks accelerator based on FPGA
CN109784489A (en) * 2019-01-16 2019-05-21 北京大学软件与微电子学院 Convolutional neural networks IP kernel based on FPGA
CN110390392A (en) * 2019-08-01 2019-10-29 上海安路信息科技有限公司 Deconvolution parameter accelerator, data read-write method based on FPGA

Family Cites Families (13)

Publication number Priority date Publication date Assignee Title
US5764374A (en) * 1996-02-05 1998-06-09 Hewlett-Packard Company System and method for lossless image compression having improved sequential determination of golomb parameter
EP1089475A1 (en) * 1999-09-28 2001-04-04 TELEFONAKTIEBOLAGET L M ERICSSON (publ) Converter and method for converting an input packet stream containing data with plural transmission rates into an output data symbol stream
CN100466601C (en) * 2005-04-28 2009-03-04 华为技术有限公司 Data read/write device and method
CN101257313B (en) * 2007-04-10 2010-05-26 深圳市同洲电子股份有限公司 Deconvolution interweave machine and method realized based on FPGA
CN104461934B (en) * 2014-11-07 2017-06-30 北京海尔集成电路设计有限公司 A kind of time solution convolutional interleave device and method of suitable DDR memory
CN106940815B (en) * 2017-02-13 2020-07-28 西安交通大学 Programmable convolutional neural network coprocessor IP core
CN108169727B (en) * 2018-01-03 2019-12-27 电子科技大学 Moving target radar scattering cross section measuring method based on FPGA
CN108154229B (en) * 2018-01-10 2022-04-08 西安电子科技大学 Image processing method based on FPGA (field programmable Gate array) accelerated convolutional neural network framework
CN109086867B (en) * 2018-07-02 2021-06-08 武汉魅瞳科技有限公司 Convolutional neural network acceleration system based on FPGA
CN109032781A (en) * 2018-07-13 2018-12-18 重庆邮电大学 A kind of FPGA parallel system of convolutional neural networks algorithm
CN109214281A (en) * 2018-07-30 2019-01-15 苏州神指微电子有限公司 A kind of CNN hardware accelerator for AI chip recognition of face
CN109359729B (en) * 2018-09-13 2022-02-22 深思考人工智能机器人科技(北京)有限公司 System and method for realizing data caching on FPGA
CN109711533B (en) * 2018-12-20 2023-04-28 西安电子科技大学 Convolutional neural network acceleration system based on FPGA

Patent Citations (4)

Publication number Priority date Publication date Assignee Title
US20180341495A1 (en) * 2017-05-26 2018-11-29 Purdue Research Foundation Hardware Accelerator for Convolutional Neural Networks and Method of Operation Thereof
CN109409509A (en) * 2018-12-24 2019-03-01 济南浪潮高新科技投资发展有限公司 A kind of data structure and accelerated method for the convolutional neural networks accelerator based on FPGA
CN109784489A (en) * 2019-01-16 2019-05-21 北京大学软件与微电子学院 Convolutional neural networks IP kernel based on FPGA
CN110390392A (en) * 2019-08-01 2019-10-29 上海安路信息科技有限公司 Deconvolution parameter accelerator, data read-write method based on FPGA

Also Published As

Publication number Publication date
CN110390392A (en) 2019-10-29
CN110390392B (en) 2021-02-19

Similar Documents

Publication Publication Date Title
US20210103818A1 (en) Neural network computing method, system and device therefor
CN110175140A (en) Fusion memory part and its operating method
US20180004690A1 (en) Efficient context based input/output (i/o) classification
CN104123171B (en) Virtual machine migrating method and system based on NUMA architecture
US10684946B2 (en) Method and device for on-chip repetitive addressing
EP3973401B1 (en) Interleaving memory requests to accelerate memory accesses
US20210295607A1 (en) Data reading/writing method and system in 3d image processing, storage medium and terminal
CN106910528A (en) A kind of optimization method and device of solid state hard disc data routing inspection
US11138104B2 (en) Selection of mass storage device streams for garbage collection based on logical saturation
WO2022199027A1 (en) Random write method, electronic device and storage medium
CN102413186A (en) Resource scheduling method and device based on private cloud computing, and cloud management server
TW202138999A (en) Data dividing method and processor for convolution operation
WO2021017378A1 (en) Fpga-based convolution parameter acceleration device and data read-write method
US11429299B2 (en) System and method for managing conversion of low-locality data into high-locality data
US11436486B2 (en) Neural network internal data fast access memory buffer
CN110569112B (en) Log data writing method and object storage daemon device
CN115759979B (en) Intelligent process processing method and system based on RPA and process mining
CN110618872A (en) Hybrid memory dynamic scheduling method and system
CN103761052A (en) Method for managing cache and storage device
TW201435586A (en) Flash memory apparatus, and method and device for managing data thereof
CN105912404B (en) A method of finding strong continune component in the large-scale graph data based on disk
CN111737190B (en) Dynamic software and hardware cooperation method of embedded system and embedded system
CN102541463B (en) Flash memory device and data access method thereof
US11442643B2 (en) System and method for efficiently converting low-locality data into high-locality data
CN112905239B (en) Point cloud preprocessing acceleration method based on FPGA, accelerator and electronic equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19939614

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19939614

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 14.10.2022)
