CN111275194B - NLP reasoning acceleration system based on FPGA - Google Patents
- Publication number
- CN111275194B (application CN202010094699.8A)
- Authority
- CN
- China
- Prior art keywords
- data
- kernel
- model
- core
- fpga
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/76—Architectures of general purpose stored program computers
- G06F15/78—Architectures of general purpose stored program computers comprising a single central processing unit
- G06F15/7839—Architectures of general purpose stored program computers comprising a single central processing unit with memory
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/76—Architectures of general purpose stored program computers
- G06F15/78—Architectures of general purpose stored program computers comprising a single central processing unit
- G06F15/7867—Architectures of general purpose stored program computers comprising a single central processing unit with reconfigurable architecture
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computer Hardware Design (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Memory System Of A Hierarchy Structure (AREA)
Abstract
The invention relates to the technical field of data processing and provides an FPGA-based NLP inference acceleration system comprising a field programmable gate array (FPGA) and a double data rate synchronous dynamic random access memory (DDR). The FPGA comprises an ARM core and a plurality of model computation kernels that execute operations in parallel. The ARM core stores a trained BERT model and, according to that model, performs actions including scheduling each model computation kernel, configuring cache sizes, and configuring data read/write addresses, thereby realizing high-speed artificial intelligence inference tasks.
Description
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to an FPGA-based NLP inference acceleration system.
Background
The CPU + GPU platform architecture is currently the mainstream way to implement artificial intelligence. Both a CPU and a GPU are composed of a controller (control unit), storage (cache/registers and DRAM), and arithmetic logic units (ALUs). In a CPU, the controller and storage occupy a large part of the structure, while in a GPU the logic units are much larger than the other two combined. These different architectures mean that a CPU performs instruction processing, execution, and function calls well, but because its logic units account for a smaller share, its capacity for data processing (arithmetic or logic operations) is much weaker than a GPU's. Therefore, in the CPU + GPU architecture, the CPU performs the scheduling of the artificial intelligence scheme and the GPU performs the parallel algorithm computation.
However, the CPU + GPU platform architecture has the following drawbacks:
(1) The CPU + GPU architecture is common in bulky devices such as PCs and servers, but is rarely used at the front end or on mobile devices, for example in autonomous driving. The combined architecture is typically used for big-data training of artificial intelligence models, while dedicated artificial intelligence chips are typically used for front-end inference, so the high performance of the CPU + GPU cannot show its advantage in front-end inference applications;
(2) The CPU + GPU architecture is costly, which simple applications cannot justify; yet using only the CPU makes the latency of the computation result too large, for example too large to meet the application requirement of judging obstacles in autonomous driving.
Disclosure of Invention
Aiming at the above defects in the prior art, the invention provides an FPGA-based NLP inference acceleration system, intended to solve the problems that the CPU + GPU architecture of the prior art cannot exert its performance advantage in front-end inference applications and suffers excessive latency.
The technical scheme provided by the invention is as follows: an FPGA-based NLP inference acceleration system comprises a field programmable gate array (FPGA) and a double data rate synchronous dynamic random access memory (DDR). The FPGA comprises an ARM core and a plurality of model computation kernels that execute operations in parallel, the model computation kernels comprising a corpus participle reading kernel, a weight reading kernel, a time sequence control kernel, a data distribution kernel, a parallel computing kernel, an addition kernel, an activation kernel, a soft maximization kernel, a layer normalization kernel, and a corpus participle DDR-writing kernel;
the ARM core is internally stored with a trained BERT model and used for executing actions including scheduling of each model calculation core, configuration of cache size and configuration of data read-write address according to the BERT model;
the corpus participle reading kernel is used for reading token data which is vectorized and completed by the ARM kernel from the double-rate synchronous dynamic random access memory DDR, transmitting the read token data to the data distribution kernel and storing the token data in an on-chip cache RAM;
the weight reading kernel is used for reading the weight data which is already vectorized by the ARM kernel from the DDR and storing the weight data into an on-chip cache RAM;
the time sequence control kernel is used for counting the number of the running periods of the model calculation kernel in the whole system and feeding back the number to the data distribution kernel;
the data distribution kernel is used for driving each model calculation kernel to read data according to the configuration of the ARM kernel, starting the calculation of the parallel calculation kernel, providing calculation data for the calculation kernel, providing addition control for the addition kernel and providing configuration whether to be started or not for the activation kernel;
the parallel computing kernel is used for calling a corresponding number of DSP kernels according to the vectorized data size of the ARM kernel to perform full parallel computing;
the addition kernel is used for controlling each addition operation in the plurality of model calculation kernels;
and the plurality of model computation kernels exchange data via FIFOs, with all of them except the soft maximization kernel and the layer normalization kernel working in a cascaded pipelined manner.
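The cascade described above — kernels linked by small FIFOs, each stage consuming the previous stage's output — can be sketched in software as a chain of Python generators. The stage names and operations below are illustrative placeholders, not the patent's actual kernel implementations:

```python
# Each generator stage plays the role of one model computation kernel;
# the generator protocol itself acts as the small inter-kernel fifo.
def read_tokens(tokens):
    yield from tokens                 # corpus participle reading stage

def multiply(stream, weight):
    for t in stream:
        yield t * weight              # parallel computing stage (one MAC lane)

def add_bias(stream, bias):
    for t in stream:
        yield t + bias                # addition stage

def activate(stream):
    for t in stream:
        yield max(t, 0)               # activation stage (ReLU-style)

# Stages are cascaded: data flows through all of them in pipelined fashion.
pipeline = activate(add_bias(multiply(read_tokens([-3, 1, 2]), 2), 1))
out = list(pipeline)                  # [0, 3, 5]
```

Each element moves through every stage without waiting for the whole upstream stage to finish, which is the essence of the cascaded pipelining.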
As an improvement, the specific execution steps of the ARM core are as follows:
reading the trained model from an external path and parsing it to obtain model basic data;
vectorizing the weight data of the model, and storing the vectorized weight data into a double-rate synchronous dynamic random access memory (DDR);
vectorizing the participle data of the corpus of the model, and storing the vectorized participle data of the corpus into a double-rate synchronous dynamic random access memory (DDR);
and configuring the data distribution kernel according to the analyzed data, so that the data distribution kernel starts reasoning calculation.
As an improvement, the model basic data comprise the data dimensions and computation amount of each layer of the model, the number of iterations required to complete all computations of one layer given the FPGA's parallel computing resources, the required cache size, the start address of the configured cache, and whether activation, layer normalization, and soft maximization are required.
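As an illustration of how the iteration count in the model basic data falls out of a layer's dimensions and the FPGA's parallel resources (the DSP count, layer shape, and one-MAC-per-cycle assumption below are hypothetical, not figures from the patent):

```python
import math

def layer_plan(rows, inner, cols, dsp_cores):
    """Derive a layer's multiply-accumulate count and the iterations needed,
    assuming each DSP core completes one MAC per clock cycle."""
    macs = rows * inner * cols              # computation amount of the layer
    iterations = math.ceil(macs / dsp_cores)
    return {"macs": macs, "iterations": iterations}

# e.g. a 128-token x 768 x 768 projection on 2048 parallel DSP slices
plan = layer_plan(128, 768, 768, 2048)      # 75497472 MACs, 36864 iterations
```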
As an improvement, the weight reading kernel reads the weight data continuously, and the data distribution kernel reads data with blocking semantics;
when the data distribution kernel is not reading data, the weight reading kernel fills the data-interaction FIFO and then suspends its DDR-reading process until the FIFO data are consumed.
As an improved scheme, the data distribution kernel is internally provided with double caches which are respectively used for caching original data and computing output data, and provides cache data for the soft maximization kernel and the layer normalization kernel.
As an improved scheme, the parallel computing kernel is internally provided with double caches which are respectively used for caching weight data required by current computation and weight data required by next computation.
As an improved scheme, a plurality of model computing kernels which execute operation actions in parallel are packaged into an IP kernel.
In the embodiment of the invention, the FPGA-based NLP inference acceleration system comprises a field programmable gate array (FPGA) and a double data rate synchronous dynamic random access memory (DDR), the FPGA comprising an ARM core and a plurality of model computation kernels that execute operations in parallel. The ARM core stores a trained BERT model and, according to that model, performs actions including scheduling each model computation kernel, configuring cache sizes, and configuring data read/write addresses, thereby realizing high-speed artificial intelligence inference tasks.
Drawings
In order to more clearly illustrate the detailed description of the invention or the technical solutions in the prior art, the drawings that are needed in the detailed description of the invention or the prior art will be briefly described below. Throughout the drawings, like elements or portions are generally identified by like reference numerals. In the drawings, elements or portions are not necessarily drawn to scale.
FIG. 1 is a schematic structural diagram of an FPGA-based NLP inference acceleration system provided by the invention;
the system comprises an ARM core, a 2-corpus participle reading core, a 3-weight reading core, a 4-time sequence control core, a 5-data distribution core, a 6-parallel computing core, a 7-addition core, an 8-activation core, a 9-soft maximization core, a 10-layer normalization core and an 11-corpus participle writing DDR core.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings. The following examples are merely for illustrating the technical solutions of the present invention more clearly, and therefore are only examples, and the protection scope of the present invention is not limited thereby.
Fig. 1 shows a schematic structural diagram of an FPGA-based NLP inference acceleration system provided in the present invention, and for convenience of description, only the parts related to the embodiment of the present invention are shown in fig. 1.
The FPGA-based NLP inference acceleration system comprises a field programmable gate array FPGA (hereinafter, FPGA) and a double data rate synchronous dynamic random access memory DDR (hereinafter, DDR). The FPGA comprises an ARM kernel 1 and a plurality of model computation kernels that execute operations in parallel: a corpus participle reading kernel 2, a weight reading kernel 3, a time sequence control kernel 4, a data distribution kernel 5, a parallel computing kernel 6, an addition kernel 7, an activation kernel 8, a soft maximization kernel 9, a layer normalization kernel 10, and a corpus participle DDR-writing kernel 11.
the ARM core 1 is internally stored with a trained BERT model and is used for executing actions including scheduling of each model calculation core, configuration of cache size and configuration of data read-write address according to the BERT model;
the corpus participle reading kernel 2 is used for reading token data which is already vectorized by the ARM kernel 1 from the double-rate synchronous dynamic random access memory DDR, transmitting the read token data to the data distribution kernel 5 and storing the token data in an on-chip cache RAM;
the weight reading kernel 3 is used for reading the weight data which are already vectorized by the ARM kernel 1 from the double-rate synchronous dynamic random access memory DDR and storing the weight data into an on-chip cache RAM, wherein the weight data are stored into the on-chip cache, so that the data distribution kernel 5 can call the weight data, particularly, the data volume is large, the on-chip cache RAM cannot completely cache all data, the weight reading kernel 3 continuously reads the weight data, and the data reading process of the data distribution kernel 5 is blocking reading; when the data distribution kernel 5 does not read data, after the weight reading kernel 3 is fully written with the data interaction fifo, the process for reading the DDR is suspended, and the fifo data is waited to be read;
the time sequence control kernel 4 is used for counting the number of the operating cycles of the model calculation kernel in the whole system and feeding back the number to the data distribution kernel 5, wherein the time sequence control kernel 4 mainly supervises the whole calculation process, counts the number of the operating cycles of the model calculation kernel, feeds back the number to the data distribution kernel 5 for data reading address offset, and the data distribution kernel 5 judges whether the data calculation of one layer is completed or not by comparing the number of the operating cycles of the layer configured by the ARM kernel 1 and feeds back the number to the ARM kernel 1 for scheduling of the model layer calculation;
the data distribution kernel 5 is configured to drive each model computation kernel to read data, start computation of a parallel computation kernel 6, provide computation data for the computation kernel, provide summation control for the summation kernel 7, and provide configuration for whether the activation kernel 8 is started or not, where the data distribution kernel 5 is internally configured with dual caches, which are respectively used for caching original data and computation output data, and providing cached data for the soft maximization kernel 9 and the layer normalization kernel 10;
the parallel computing kernel 6 is used for calling DSP kernels with corresponding quantity according to the vectorized data quantity of the ARM kernel 1 to perform full parallel computing, and double caches are also developed inside the parallel computing kernel for caching the weight data required by the current computing and the weight data required by the next computing. The caching of the weight data required by the next calculation is performed in the calculation process, so that the calculation process cannot be interrupted by calling the weights twice before and after;
the addition kernel 7 is configured to control each addition operation in the plurality of model calculation kernels, and specifically complete an addition operation in matrix multiply-add operation, an addition operation of layer normalization, an addition operation of offset, and the like;
in the activation kernel 8, the activation calculation is a nonlinear calculation, which is an algorithm in neural network calculation;
the soft maximization kernel 9 can run a Softmax algorithm, which is an algorithm in neural network calculation, and realizes the function of centralizing dispersed data, so that certain data is prevented from exceeding a storable range through calculation;
in the layer normalization kernel 10, layerorm is an algorithm in neural network computation, namely softmax; the function is similar.
The corpus participle DDR-writing kernel 11 outputs intermediate computation results to the DDR, which makes it convenient to debug the whole computation process from a PC.
The plurality of model computation kernels exchange data via FIFOs, and all of them except the soft maximization kernel 9 and the layer normalization kernel 10 work in a cascaded pipelined manner.
In the invention, the FPGA (Field Programmable Gate Array) is a further development of programmable devices such as PAL and GAL. Its basic structure comprises programmable input/output units, configurable logic blocks, digital clock management modules, embedded block RAM, routing resources, embedded dedicated hard cores, and low-level embedded functional units. With abundant routing resources, repeatable programmability, high integration, high parallelism, low power consumption, and low cost, the FPGA is widely applied in digital circuit design.
in this embodiment, the FPGA (e.g. XILINX Zynq UltraScale + MPSoC) is configured with not only rich logic resources but also high-performance ARM hardmac, large-capacity RAM, rich high-speed peripheral interfaces, high-performance DSP cores, rich IP and other resources. The ARM hard core can be embedded with a linux system, develop application software, and schedule resources, calculation, tasks and the like of the whole system. The high-performance DSP core and rich logic resources can realize parallel rapid calculation of the model. The high-speed peripheral interfaces can realize rapid data transmission between boards. The rich IP cores speed up programming. Therefore, almost all elements of digital system design are integrated in the FPGA, a CPU + GPU architecture can be replaced to realize high-speed artificial intelligence reasoning tasks, the size is small, the power consumption is low, a large number of computing tasks can be parallelly operated by the parallel architecture, and low time delay and high performance are realized.
In this embodiment, artificial intelligence is intelligence conferred on machines so that they can perform some work in place of humans. The basic method of implementing artificial intelligence is machine learning, which uses algorithms to parse data, learn from it, and then make decisions and predictions about real-world events. Unlike traditional hard-coded software that solves a specific task, machine learning is "trained" with large amounts of data, learning from the data through various algorithms how to complete a task, and thereby solving or handling a certain class of tasks. Machine learning derives directly from the early artificial intelligence field. Conventional algorithms include decision tree learning, inductive logic programming, clustering, reinforcement learning, and Bayesian networks, among others.
Deep learning is a technology for realizing machine learning, and by establishing a deep artificial neural network and training and learning a large amount of data, the neural network can accurately analyze the characteristics of input data, so that a machine can make accurate judgment.
Natural Language Processing (NLP) is a sub-domain of Artificial Intelligence (AI). With the rise of speech interfaces and chat robots, NLP has become one of the most important technologies in the information era and is an important component of artificial intelligence. Through a deep learning method, an effective neural network model is established, so that a machine can fully understand and express the meaning of a language.
BERT is a natural language processing model released by Google that, owing to its excellent performance, is widely used in various natural language processing tasks such as sentiment analysis, spam filtering, named entity recognition, and document clustering. BERT was trained on a large English text corpus containing 3.3 billion words, and in some cases it understands language even better than the average person. Its advantage is that it can be trained on unlabeled datasets and, without major modification, generalized to various applications.
In the embodiment of the invention, the FPGA-based NLP inference acceleration system is realized with an ARM master control and a plurality of parallel model computation kernels. The parallel model kernels are packaged into an IP (intellectual property) core, and the whole system is built from an AXI (Advanced eXtensible Interface) bus together with the ARM kernel 1, a DDR controller, a RAM read/write controller, and other cores. The ARM core handles the scheduling of the whole BERT model: a trained model built with a framework such as Caffe or PyTorch is stored on the ARM side, and the scheduling of the model computation kernels, the configuration of cache sizes, the configuration of data read/write addresses, and so on are completed according to the BERT model. Data interaction between model computation kernels is carried out through small FIFOs. Except for the layer normalization kernel 10 and the soft maximization kernel 9, which operate independently, the other kernels operate in a cascaded pipelined manner.
Double caches for weights and word segments are provided inside the FPGA for fast caching of the model's intermediate results.
In the embodiment of the present invention, the specific execution steps of the ARM core 1 are as follows:
(1) reading the trained model from an external path and parsing it to obtain model basic data, where the model basic data comprise the data dimensions and computation amount of each layer of the model, the number of iterations required to complete all computations of one layer given the FPGA's parallel computing resources, the required cache size, the start address of the configured cache, and whether activation, layer normalization, and soft maximization are required;
(2) vectorizing the weight data of the model, and storing the vectorized weight data into a double-rate synchronous dynamic random access memory (DDR);
the vectorization aims to realize quick RAM reading, realize data parallel computation and accelerate the whole computation flow. For example, the multiplication of two matrices, the matrix may be vectorized in rows or columns, and each time the calculation is to multiply one row of data by one column of data, so as to obtain one point of the output matrix. The multiplication of two vectors of the row of data or a column of data is realized by calling a plurality of DSP cores for parallel operation, and the calculation of row-column multiplication addition is completed in one clock cycle;
the essence of the process of vectorizing the model weight data is to dump discontinuously stored data required in calculation into continuously stored data, so that the data reading times are reduced, and the data reading speed is increased. Firstly, partitioning original data according to a vectorization rule, and then continuously storing vectorized discontinuous data according to a data reading sequence.
(3) Vectorizing the participle data of the corpus of the model, and storing the vectorized participle data of the corpus into a double-rate synchronous dynamic random access memory (DDR), wherein the specific implementation is shown in step (2), and the details are not repeated herein;
(4) and configuring the data distribution kernel 5 according to the analyzed data, so that the data distribution kernel 5 starts reasoning calculation.
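The block-then-dump vectorization of steps (2) and (3) above can be sketched as follows: the weight matrix is partitioned into blocks and each block is flattened into the contiguous order in which the compute kernels will read it (the block size and the column-major read order are illustrative assumptions, not details from the patent):

```python
import numpy as np

def vectorize_for_ddr(weights, block):
    """Repack a weight matrix so that formerly scattered elements are stored
    contiguously in read order, enabling burst reads from DDR."""
    rows, cols = weights.shape
    chunks = []
    for c in range(0, cols, block):         # assumed column-major read order
        for r in range(0, rows, block):
            chunks.append(weights[r:r + block, c:c + block].ravel())
    return np.concatenate(chunks)           # one contiguous DDR image

w = np.arange(16, dtype=np.float32).reshape(4, 4)
flat = vectorize_for_ddr(w, block=2)        # first block: [0, 1, 4, 5]
```

After the repack, a single sequential burst delivers exactly the operands one row-by-column multiply-add needs, which is what reduces the number of reads.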
In the embodiment of the invention, the FPGA's flexibly configurable structure, abundant computing resources, RAM resources, and hard-core resources allow the various algorithm kernels to be configured flexibly for fully parallel computation, greatly reducing computation latency; in addition, individual functional kernels can be enabled or disabled as needed through configuration information, further reducing power consumption.
The above embodiments are only used to illustrate the technical solution of the present invention, not to limit it. While the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the present invention and should be construed as falling within the scope of the claims of the present invention.
Claims (7)
1. An FPGA-based NLP reasoning acceleration system is characterized by comprising a field programmable gate array FPGA and a double-rate synchronous dynamic random access memory DDR, wherein the field programmable gate array FPGA comprises an ARM core and a plurality of model computation cores for executing operation actions in parallel, and the plurality of model computation cores comprise a corpus participle reading core, a weight reading core, a time sequence control core, a data distribution core, a parallel computation core, an addition core, an activation core, a soft maximization core, a layer normalization core and a corpus participle writing DDR core;
the ARM core is internally stored with a trained BERT model and used for executing actions including scheduling of each model calculation core, configuration of cache size and configuration of data read-write address according to the BERT model;
the corpus participle reading kernel is used for reading token data which is vectorized and completed by the ARM kernel from the double-rate synchronous dynamic random access memory DDR, transmitting the read token data to the data distribution kernel and storing the token data in an on-chip cache RAM;
the weight reading kernel is used for reading the weight data which is already vectorized by the ARM kernel from the DDR and storing the weight data into an on-chip cache RAM;
the time sequence control kernel is used for counting the number of the running cycles of the model calculation kernel in the whole system and feeding back the number to the data distribution kernel;
the data distribution kernel is used for driving each model calculation kernel to read data according to the configuration of the ARM kernel, starting the calculation of the parallel calculation kernel, providing calculation data for the calculation kernel, providing addition control for the addition kernel and providing configuration whether the activation kernel is started or not for the activation kernel;
the parallel computing kernel is used for calling DSP kernels with corresponding quantity according to the vectorized data size of the ARM kernel to perform full parallel computing;
the addition kernel is used for controlling each addition operation in the plurality of model calculation kernels;
and the plurality of model calculation kernels except the soft maximization kernel and the layer normalization kernel work in a cascading flow mode.
2. The FPGA-based NLP reasoning acceleration system of claim 1, wherein the ARM core performs the following steps:
reading the trained model from an external path, and parsing the read model to obtain model base data;
vectorizing the weight data of the model, and storing the vectorized weight data into the double data rate synchronous dynamic random access memory (DDR);
vectorizing the corpus token data of the model, and storing the vectorized corpus token data into the DDR;
and configuring the data distribution kernel according to the parsed data, so that the data distribution kernel starts the reasoning computation.
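The four ARM-core steps above can be sketched as a host-side deployment routine. The model file format, the DDR interface, and every function name below are illustrative assumptions:

```python
import numpy as np

# A minimal host-side sketch of the four ARM-core steps of claim 2.
DDR = {}  # stands in for the DDR memory shared with the FPGA fabric

def parse_model(model):
    # step 1: read the trained model and extract per-layer base data
    return [{"shape": w.shape, "need_act": True} for w in model["weights"]]

def vectorize(array, lanes=16):
    # steps 2-3: pad and reshape data into fixed-width vectors the FPGA reads
    flat = array.reshape(-1)
    pad = (-len(flat)) % lanes
    return np.pad(flat, (0, pad)).reshape(-1, lanes)

def deploy(model, tokens):
    base = parse_model(model)
    DDR["weights"] = [vectorize(w) for w in model["weights"]]
    DDR["tokens"] = vectorize(tokens)
    # step 4: configuration handed to the data distribution kernel
    return {"layers": base}

model = {"weights": [np.ones((8, 8)), np.ones((8, 4))]}
config = deploy(model, np.arange(10.0))
print(len(config["layers"]))  # 2
```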
3. The FPGA-based NLP reasoning acceleration system of claim 2, wherein the model base data comprises: the data dimensions of each layer of the model; the computation amount of each layer; the number of iterations needed to complete all computations of one layer given the FPGA's parallel computing resources; the required cache size; the start address of the configured cache; and whether activation, layer normalization, and softmax are required.
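One plausible layout of this per-layer base data, with the iteration count derived from the layer's computation amount and the FPGA's parallel resources, is the following sketch; the field names and the iteration formula are assumptions, not taken from the patent:

```python
import math
from dataclasses import dataclass

# Hypothetical container for the per-layer "model base data" of claim 3.
@dataclass
class LayerConfig:
    rows: int            # data dimensions of the layer
    cols: int
    cache_bytes: int     # required cache size
    cache_base: int      # start address of the configured cache
    need_activation: bool
    need_layer_norm: bool
    need_softmax: bool

    @property
    def macs(self) -> int:
        # computation amount: multiply-accumulates in the layer
        return self.rows * self.cols

    def iterations(self, parallel_dsp: int) -> int:
        # iterations needed to cover the whole layer with the
        # FPGA's parallel computing resources
        return math.ceil(self.macs / parallel_dsp)

cfg = LayerConfig(rows=128, cols=768, cache_bytes=128 * 768 * 2,
                  cache_base=0x0000, need_activation=True,
                  need_layer_norm=True, need_softmax=False)
print(cfg.iterations(parallel_dsp=4096))  # ceil(98304 / 4096) = 24
```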
4. The FPGA-based NLP reasoning acceleration system of claim 2, wherein the weight reading kernel reads the weight data continuously, while the data distribution kernel reads data in a blocking manner;
when the data distribution kernel is not reading data, the weight reading kernel suspends reading from the DDR once it has filled the data interaction FIFO, and waits for the FIFO data to be read.
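This producer-consumer handshake can be modeled with a bounded queue: the weight reader streams words into the FIFO and blocks (i.e. "suspends reading DDR") when it is full, while the data distribution kernel drains it at its own pace. Sizes and names are illustrative:

```python
import threading
import queue

fifo = queue.Queue(maxsize=4)      # stands in for the data interaction FIFO

def weight_reader(n_words):
    # weight reading kernel: continuous DDR reads, back-pressured by the FIFO
    for addr in range(n_words):
        word = addr * 2            # stands in for a DDR read at `addr`
        fifo.put(word)             # blocks while the FIFO is full
    fifo.put(None)                 # end-of-stream marker

def data_distributor(out):
    # data distribution kernel: blocking reads from the FIFO
    while True:
        word = fifo.get()
        if word is None:
            break
        out.append(word)

received = []
t = threading.Thread(target=weight_reader, args=(16,))
t.start()
data_distributor(received)
t.join()
print(len(received))  # 16
```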
5. The FPGA-based NLP reasoning acceleration system of claim 1, wherein the data distribution kernel is internally configured with double caches, which buffer the original data and the computation output data and provide cached data to the softmax kernel and the layer normalization kernel.
6. The FPGA-based NLP reasoning acceleration system of claim 1, wherein the parallel computing kernel is internally configured with double caches, which buffer the weight data required by the current computation and the weight data required by the next computation, respectively.
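The double caches of claims 5 and 6 are a ping-pong buffering scheme: while one buffer feeds the current computation, the other is filled with the data for the next one. A minimal software sketch, with the prefetch source as an illustrative generator:

```python
# Ping-pong (double) buffering: swap the roles of two buffers each step so
# the idle buffer can be prefilled while the active one is consumed.
def pingpong(stream):
    buffers = [None, None]
    it = iter(stream)
    buffers[0] = next(it)          # prefill buffer 0 before computing
    active = 0
    for nxt in it:
        buffers[1 - active] = nxt  # fill the idle buffer ("next weights")
        yield buffers[active]      # computation consumes the active buffer
        active = 1 - active        # swap roles
    yield buffers[active]          # drain the last buffer

weights_per_layer = [f"w{i}" for i in range(4)]
print(list(pingpong(weights_per_layer)))  # ['w0', 'w1', 'w2', 'w3']
```

In hardware this hides the DDR fetch latency of the next layer's weights behind the current layer's computation; in this software model the "fill" is just an assignment.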
7. The FPGA-based NLP reasoning acceleration system of claim 1, wherein the plurality of model computation kernels that execute operations in parallel are packaged as IP cores.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010094699.8A CN111275194B (en) | 2020-02-16 | 2020-02-16 | NLP reasoning acceleration system based on FPGA |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111275194A CN111275194A (en) | 2020-06-12 |
CN111275194B (en) | 2022-06-21
Family
ID=71002748
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010094699.8A Active CN111275194B (en) | 2020-02-16 | 2020-02-16 | NLP reasoning acceleration system based on FPGA |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111275194B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110110844B (en) * | 2019-04-24 | 2021-01-12 | Xidian University | Convolutional neural network parallel processing method based on OpenCL
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190108448A1 (en) * | 2017-10-09 | 2019-04-11 | VAIX Limited | Artificial intelligence framework |
CN110321997B (en) * | 2018-03-31 | 2021-10-19 | Xilinx, Inc. | High-parallelism computing platform, system and computing implementation method
CN109803068A (en) * | 2019-01-21 | 2019-05-24 | Zhengzhou Yunhai Information Technology Co., Ltd. | Heterogeneous hybrid system and method based on security monitoring
- 2020-02-16: Application CN202010094699.8A filed in China; granted as CN111275194B, status Active
Also Published As
Publication number | Publication date |
---|---|
CN111275194A (en) | 2020-06-12 |
Similar Documents
Publication | Title |
---|---|
Han et al. | Hardware implementation of spiking neural networks on FPGA |
US11599779B2 (en) | Neural network circuitry having approximate multiplier units |
CN111105023B (en) | Data stream reconstruction method and reconfigurable data stream processor |
Al-Shamma et al. | Boosting convolutional neural networks performance based on FPGA accelerator |
CN116384312B (en) | Circuit yield analysis method based on parallel heterogeneous computation |
US11921814B2 (en) | Method and device for matrix multiplication optimization using vector registers |
US20200226458A1 (en) | Optimizing artificial neural network computations based on automatic determination of a batch size |
Vu et al. | Efficient optimization and hardware acceleration of cnns towards the design of a scalable neuro inspired architecture in hardware |
Falsafi et al. | Near-memory data services |
Zhang et al. | η-lstm: Co-designing highly-efficient large lstm training via exploiting memory-saving and architectural design opportunities |
CN116401552A (en) | Classification model training method and related device |
Karam et al. | Energy-efficient adaptive hardware accelerator for text mining application kernels |
CN111275194B (en) | NLP reasoning acceleration system based on FPGA |
Mandal et al. | COIN: Communication-aware in-memory acceleration for graph convolutional networks |
Guan et al. | Recursive binary neural network training model for efficient usage of on-chip memory |
US20190130276A1 (en) | Tensor manipulation within a neural network |
CN115437778A (en) | Kernel scheduling method and device, electronic equipment and computer readable storage medium |
CN111914867A (en) | Convolutional neural network IP core design based on FPGA |
Glint et al. | Hardware-software codesign of dnn accelerators using approximate posit multipliers |
CN115222028A (en) | One-dimensional CNN-LSTM acceleration platform based on FPGA and implementation method |
He et al. | An LSTM acceleration engine for FPGAs based on caffe framework |
Mohaidat et al. | A Survey on Neural Network Hardware Accelerators |
Bai et al. | An OpenCL-based FPGA accelerator with the Winograd’s minimal filtering algorithm for convolution neuron networks |
Gao et al. | Performance comparative analysis of artificial intelligence chip technology |
CN114298329A (en) | Model training method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |