CN117764123A - Neural network acceleration system, testing device and electronic equipment thereof

Info

Publication number: CN117764123A
Application number: CN202211111855.2A
Applicant / Current assignee: Institute of Microelectronics of CAS
Inventors: 李艳杰, 李莹, 周崟灏
Filing date: 2022-09-13
Publication date: 2024-03-26
Legal status: Pending
Original language: Chinese (zh)

Classifications

    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application discloses a neural network acceleration system, a testing device thereof and an electronic device, and relates to the fields of machine learning and artificial intelligence. The system comprises an acceleration top-layer interface unit and a neural network unit. The neural network unit comprises a convolution layer module, a pooling layer module and a full-connection layer module which are connected in sequence, and the three modules operate in a data-stream-level, fully parallel pipeline mode. The acceleration top-layer interface unit completes the inter-layer connection among the convolution layer module, the pooling layer module and the full-connection layer module. The system can fully exploit the inherent parallelism of the convolutional neural network, so that it meets the low-power and low-cost requirements of Internet-of-Things terminal devices that run artificial neural network inference, as well as the real-time inference requirements of industrial applications.

Description

Neural network acceleration system, testing device and electronic equipment thereof
Technical Field
The application relates to the field of machine learning and artificial intelligence, in particular to a neural network acceleration system, a testing device and electronic equipment thereof.
Background
As more and more industries, such as consumer electronics, automotive electronics and industrial control, introduce Artificial Intelligence (AI), artificial intelligence is developing at an unprecedented pace, and deep learning and neural networks have grown with it. The larger the neural network, the larger the amount of computation required; a traditional software implementation can complete artificial intelligence operations, but suffers from high power consumption and high latency. The convolutional neural network (Convolutional Neural Network, CNN) is a deep machine-learning algorithm derived from the artificial neural network. It is highly tolerant to deformations of an image such as translation, scaling and tilting, acts as a perceptron that is sensitive to image features, and its weight-sharing network structure is closer to a biological neural network, which reduces the complexity of the network model and the number of weights.
At present, convolutional neural networks are mainly implemented in software on general-purpose processors. In fact, a convolutional neural network is a feed-forward network structure with a high degree of independence between layers: the computation of each layer is independent and there is no data feedback between layers, so the convolutional neural network is a highly parallel network structure.
However, a general-purpose processor is optimized for logic processing and transaction processing, and these characteristics are not suited to exploiting the parallelism of a convolutional neural network; a software-based convolutional neural network therefore cannot meet application requirements in terms of real-time performance and power consumption. Common neural network acceleration chips adopt a processing element (PE) array design with complex control logic, high power consumption and complex application scenarios, and they require a large amount of memory data movement, so such designs are not suitable for low-power Internet-of-Things terminal devices. A convolutional neural network acceleration and integration method is therefore needed for application in low-power Internet-of-Things terminal devices.
Disclosure of Invention
The purpose of the application is to provide a neural network acceleration system, a testing device thereof and an electronic device, so as to solve the problems that existing neural network acceleration chips adopt a PE array design, have complex control logic, high power consumption and complex application scenarios, require a large amount of memory data movement, and are therefore unsuitable for low-power Internet-of-Things terminal devices.
In a first aspect, the present application provides a neural network acceleration system, the system comprising an acceleration top layer interface unit and a neural network unit;
The neural network unit comprises a convolution layer module, a pooling layer module and a full-connection layer module which are sequentially connected; the convolution layer module, the pooling layer module and the full-connection layer module operate in a data stream-level full-parallel pipeline operation mode;
the acceleration top layer interface unit is used for completing interlayer connection among the convolution layer module, the pooling layer module and the full-connection layer module.
With the above technical solution, the neural network acceleration system provided by the application comprises an acceleration top-layer interface unit and a neural network unit; the neural network unit comprises a convolution layer module, a pooling layer module and a full-connection layer module which are connected in sequence, and the three modules operate in a data-stream-level, fully parallel pipeline mode; the acceleration top-layer interface unit completes the inter-layer connection among the three modules. The system can fully exploit the inherent parallelism of the convolutional neural network, so that it meets the low-power and low-cost requirements of Internet-of-Things terminal devices that run artificial neural network inference, as well as the real-time inference requirements of industrial applications.
In one possible implementation manner, the convolution layer module comprises a convolution layer interface design sub-module, a convolution layer on-chip storage sub-module, an input loading state machine design sub-module, an input buffer construction sub-module and a multiply-add buffer construction sub-module which are connected in sequence;
the convolution layer interface design submodule is used for controlling parameter input corresponding to convolution layer input data through parameterization so as to increase reusability of the convolution layer module;
the convolution layer on-chip storage sub-module is used for storing the weight data in the convolution layer input data according to a preset bit width and a preset group size;
the input loading state machine design submodule is used for determining the state of the first-in first-out queue buffer area according to the preset state corresponding relation;
the input buffer construction sub-module is used for acquiring the input data corresponding to the state of the first-in first-out queue buffer, and pushing the input data into the corresponding convolution kernel position according to a preset push-in position;
the multiply-add buffer construction sub-module is used for starting the first-column multiplication operation after the input data has been pushed into the first column of the buffer, performing the addition operation after the corresponding convolution kernel is filled to obtain an addition result, and sending the convolved addition result to the pooling layer module;
The convolution layer input data comprises the number of input and output channels, weight bit width, feature map size, convolution step length and output data width.
In one possible implementation manner, the pooling layer module comprises a pooling layer interface design sub-module, a pooling layer on-layer storage sub-module, a comparator design sub-module, an upper layer feature map buffer area construction sub-module and a data bit truncation sub-module which are connected in sequence;
the pooling layer interface design sub-module is used for adding the first two rows of the addition results input by the convolution layer module into the upper-layer feature map buffer construction sub-module, performing data processing on the addition results, processing the addition results through the data bit truncation sub-module to obtain output results, determining the maximum value of the corresponding output results in every four selected blocks by using the comparator design sub-module so as to determine the pooling layer output result, and inputting the pooling layer output result to the full-connection layer module.
In one possible implementation manner, the full-connection layer module is configured to load data on the pooling layer output result input by the pooling layer module, obtain full-connection layer input data, perform full-connection matrix multiplication on the full-connection layer input data, and determine an output probability value in the last layer after three layers of full-connection processing.
In one possible implementation, the inter-layer connection includes an input data interface control, an output data interface control, and an inter-layer connection control.
In a second aspect, the present application further provides a neural network acceleration system testing device, which comprises the neural network acceleration system of any one of the implementations of the first aspect and a main control module connected to the neural network acceleration system;
the main control module is configured to acquire image data to be processed;
the main control module is further configured to determine convolutional layer input data based on the image data to be processed;
the neural network acceleration system is configured to pipeline the convolutional layer input data in parallel to determine an output probability value.
In one possible implementation manner, the main control module comprises a kernel control sub-module, a device bus, a storage unit and a peripheral IP (intellectual property core) sub-module which are connected in sequence; the kernel control sub-module is used for controlling data interaction; the neural network acceleration system is mounted on the device bus and is used for completing parallel pipeline processing of the convolution layer input data under the control of the kernel control sub-module and determining the output probability value.
In one possible implementation manner, the peripheral IP sub-module comprises an image acquisition module, an image display module and an acceleration recognition result display module;
the image acquisition module is configured to acquire the image data to be processed corresponding to the image to be processed in the storage unit, and convert the image data to be processed into binary image data; the main control module is configured to perform median filtering on the binary image data, perform horizontal and vertical projection on the video stream data to be processed to obtain digit location data, and determine the digit location data as the convolution layer input data;
the image display module is configured to display an image corresponding to the image data to be processed;
the acceleration recognition result display module is configured to display recognition results corresponding to probability values output by the neural network acceleration system.
The beneficial effects of the neural network acceleration system testing device provided in the second aspect are the same as those of the neural network acceleration system described in the first aspect or any possible implementation manner of the first aspect, and are not described here again.
In a third aspect, the present application further provides an electronic device, including: one or more processors; and one or more machine-readable media having instructions stored thereon that, when executed by the one or more processors, cause the electronic device to implement the neural network acceleration system described in any of the possible implementations of the first aspect.
The beneficial effects of the electronic device provided in the third aspect are the same as those of the neural network acceleration system described in the first aspect or any possible implementation manner of the first aspect, and are not described herein.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
fig. 1 shows a schematic structural diagram of a neural network acceleration system according to an embodiment of the present application;
FIG. 2 is a schematic diagram illustrating a push sequence of buffer weights according to an embodiment of the present application;
FIG. 3 is a schematic diagram illustrating a pooling layer FIFO queue and input data buffer configuration according to an embodiment of the present disclosure;
FIG. 4 shows a schematic diagram of a neural network unit operation pipeline provided in an embodiment of the present application;
fig. 5 is a schematic diagram of a display interface of an upper computer according to an embodiment of the present application;
fig. 6 is a schematic hardware structure of an electronic device according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a chip according to an embodiment of the present application.
Detailed Description
In order to clearly describe the technical solutions of the embodiments of the present application, the words "first", "second", etc. are used in the embodiments of the present application to distinguish between identical or similar items having substantially the same function and effect. For example, the first threshold and the second threshold are merely used to distinguish different thresholds, and their order is not limited. Those skilled in the art will appreciate that the words "first", "second", etc. do not limit quantity or order of execution, and that the items they qualify are not necessarily different.
In this application, the terms "exemplary" or "such as" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "for example" should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion.
In the present application, "at least one" means one or more, and "a plurality of" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate: A alone, both A and B, or B alone, where A and B may be singular or plural. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. "At least one of the following" or similar expressions refer to any combination of these items, including any combination of single items or plural items. For example, at least one of a, b or c may represent: a; b; c; a and b; a and c; b and c; or a, b and c, where a, b and c may each be singular or plural.
Fig. 1 shows a schematic structural diagram of a neural network acceleration system provided in an embodiment of the present application, and as shown in fig. 1, a neural network acceleration system 10 includes an acceleration top layer interface unit 101 and a neural network unit 102;
the neural network unit 102 comprises a convolution layer module 1021, a pooling layer module 1022 and a full connection layer module 1023 which are sequentially connected; wherein the convolutional layer module 1021, the pooling layer module 1022, and the full-connection layer module 1023 operate in a full-parallel pipeline operation mode;
the acceleration top layer interface unit 101 is configured to complete interlayer connection among the convolutional layer module 1021, the pooling layer module 1022, and the full-connection layer module 1023.
In this application, the interlayer connection includes an input data interface control, an output data interface control, and an interlayer connection control.
The acceleration top-layer interface unit may comprise an input interface module and an output interface module. The input interface module completes input data interface control, and the output interface module completes output data interface control. The input interface module comprises an image data port, an input data count/address port, a clock port and a synchronous reset port; through the input data count/address port, data processed by the camera module can be received and sent into the neural network unit for the convolution operation. The output interface module adapts the output to the Advanced eXtensible Interface (AXI) bus protocol; its ports include an output valid signal (dout_vld) and an AXI read valid signal, so that the neural network unit can be mounted on the AXI bus, and when the valid signal is asserted the AXI bus can start reading data.
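As a rough, cycle-level illustration of this output handshake, the following Python sketch models a transfer that completes only when the accelerator's output valid signal and the AXI read-ready signal are both high in the same cycle; the function name, the axi_rready trace and the stall behaviour are simplifying assumptions rather than the patent's RTL.

```python
# Minimal reference-model sketch of the output-side AXI handshake (illustrative assumptions only).
def transfer_output(dout_stream, axi_rready_trace):
    """dout_stream: per-cycle (dout_vld, data) pairs; axi_rready_trace: per-cycle 0/1 ready flags."""
    received = []
    for (dout_vld, data), axi_rready in zip(dout_stream, axi_rready_trace):
        axi_rvalid = dout_vld          # the accelerator's valid output drives the AXI read-valid signal
        if axi_rvalid and axi_rready:  # handshake: both sides ready, so the beat transfers
            received.append(data)
        # otherwise the beat stalls and the data is held for a later cycle
    return received

# Example: three cycles, the bus stalls on the second one, so the second value transfers on cycle 3.
print(transfer_output([(1, 7), (1, 9), (1, 9)], [1, 0, 1]))  # -> [7, 9]
```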
The embodiment of the application provides a neural network acceleration system comprising an acceleration top-layer interface unit and a neural network unit; the neural network unit comprises a convolution layer module, a pooling layer module and a full-connection layer module which are connected in sequence, and the three modules operate in a data-stream-level, fully parallel pipeline mode; the acceleration top-layer interface unit completes the inter-layer connection among the three modules. The system can fully exploit the inherent parallelism of the convolutional neural network, so that it meets the low-power and low-cost requirements of Internet-of-Things terminal devices that run artificial neural network inference, as well as the real-time inference requirements of industrial applications.
In the application, the convolution layer module comprises a convolution layer interface design sub-module, a convolution layer on-chip storage sub-module, an input loading state machine design sub-module, an input buffer zone construction sub-module and a multiplication and addition buffer zone construction sub-module which are connected in sequence.
The convolution layer interface design submodule is used for controlling parameter input corresponding to convolution layer input data through parameterization so as to increase reusability of the convolution layer module.
The convolution layer input data comprises the number of input and output channels, weight bit width, feature map size, convolution step length and output data width.
The convolution layer on-chip storage sub-module is used for storing the weight data in the convolution layer input data according to the preset bit width and the preset group size.
In the present application, the convolution layer on-chip storage sub-module may be a block random access memory (BRAM) used to store the weight data. According to the quantized weight bit width, each stored word is 10 bits wide, and every 9 words form one group of convolution kernel data; that is, the preset bit width in the present application may be 10 and the preset group size may be 9. The convolution bias data may be stored in registers, also with a 10-bit width, and the data of each layer is stored in sequence. Table 1 shows a schematic storage manner of the convolution layer input data provided in the embodiment of the present application; "Layer" indicates the layer number, and the weight storage order is the same as the loading order (a software packing sketch is given after Table 1).
TABLE 1
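As a software illustration of this storage format (10-bit quantized weights, nine per 3x3 convolution kernel, stored layer by layer), a minimal Python packing sketch is given below; the function name and the flat word ordering are assumptions made for illustration, not the actual BRAM memory map.

```python
# Illustrative sketch: pack quantized weights as 10-bit words, 9 per 3x3 kernel, layer by layer.
# The flattening order (layer -> kernel -> row-major 3x3 positions) is an assumption.
def pack_weights(layers):
    """layers: per layer, a list of 3x3 kernels; each kernel is a 3x3 list of small integers."""
    words = []
    for kernels in layers:                    # weights are stored (and later loaded) layer by layer
        for kernel in kernels:
            for row in kernel:
                for w in row:
                    words.append(w & 0x3FF)   # keep the low 10 bits (two's complement)
    return words                              # every consecutive group of 9 words is one kernel

kernel = [[1, -2, 3], [4, 5, -6], [7, 8, 9]]
mem = pack_weights([[kernel]])
assert len(mem) == 9 and mem[1] == (-2 & 0x3FF)
```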
The input loading state machine design submodule is used for determining the state of the first-in first-out queue buffer area according to the preset state corresponding relation.
In this application, the preset state correspondence may include five states, and the state machine determines the state of the first-in first-out (FIFO) buffer from them: state0 indicates that input data is being pushed into the first column of the convolution kernel, state1 indicates that input data is being pushed into the second column, state2 indicates that input data is being pushed into the third column, and state3 indicates that the convolution kernel buffer is full; otherwise the buffer is not full. When the first column of conv_buf has been pushed in, the buffer starts loading, and once the first column is full the multiplication operation can begin. When the state machine is in state3, the buffer is full.
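A minimal Python reference model of this loading state machine could look as follows; only the four named states are modelled, and the transition conditions (one column pushed per step, window consumed when full) are assumptions based on the description above.

```python
# Sketch of the input-loading state machine; state names follow the description, transitions are assumed.
STATE0, STATE1, STATE2, STATE3 = range(4)  # push column 1, push column 2, push column 3, buffer full

def next_state(state, column_pushed, window_consumed):
    """column_pushed: a full column entered the FIFO buffer; window_consumed: the 3x3 window was used."""
    if state == STATE0:
        return STATE1 if column_pushed else STATE0  # first column done: multiplication may start
    if state == STATE1:
        return STATE2 if column_pushed else STATE1
    if state == STATE2:
        return STATE3 if column_pushed else STATE2  # third column done: convolution kernel buffer is full
    return STATE0 if window_consumed else STATE3    # stay full until the window is consumed
```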
The input buffer area construction submodule is used for acquiring input data corresponding to the first-in first-out buffer area state and pressing the input data into the corresponding convolution kernel position according to a preset pressing-in convolution kernel position.
In this application, the FIFO buffer conv_buf[o][a][i][j] is a four-dimensional array used to store the input data to be convolved. The convolution kernel used for the convolution operation is 3×3, and the input data must be read into the convolution buffer before the convolution operation is performed, so a 3×3 FIFO buffer is constructed to store the input data. Fig. 2 shows a schematic diagram of the push order of the buffer weights provided in the embodiment of the present application. As shown in fig. 2, FIFO0 pushes data at convolution kernel positions 1, 2, 3, FIFO1 pushes data at positions 4, 5, 6, and FIFO2 pushes data at positions 7, 8, 9. A data push requires three clock cycles: positions 1, 4, 7 are pushed into the buffer in the first clock cycle, positions 2, 5, 8 in the second, and positions 3, 6, 9 in the third; after the first column is pushed in, the first-column multiplication operation begins on the next rising clock edge.
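The push order of fig. 2 can be paraphrased by the small single-channel model below: three row FIFOs feed a 3x3 window, and each clock cycle pushes one column (positions 1/4/7, then 2/5/8, then 3/6/9). Collapsing the four-dimensional conv_buf array to one channel and one kernel is a simplification for readability.

```python
from collections import deque

# Illustrative single-channel model of the 3x3 input window fed by three row FIFOs.
class ConvWindow:
    def __init__(self):
        self.fifos = [deque(), deque(), deque()]      # FIFO0/FIFO1/FIFO2 hold kernel rows 1/2/3
        self.window = [[None] * 3 for _ in range(3)]  # simplified stand-in for conv_buf[o][a][i][j]

    def push_column(self, col_index, pixels):
        """pixels: (top, middle, bottom) values pushed in the same clock cycle."""
        for row, value in enumerate(pixels):
            self.fifos[row].append(value)
            self.window[row][col_index] = value       # cycle 0 fills positions 1, 4, 7, and so on

    def full(self):
        return all(v is not None for row in self.window for v in row)

w = ConvWindow()
for c, column in enumerate([(1, 4, 7), (2, 5, 8), (3, 6, 9)]):  # three cycles fill one kernel window
    w.push_column(c, column)
assert w.full()
```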
The multiplication and addition buffer area construction submodule is used for starting the first-column multiplication operation after the input data is pressed into the first column of the buffer area, performing addition operation after a corresponding convolution kernel is pressed fully, obtaining an addition result, and sending the addition result after convolution to the pooling layer module.
In the present application, the multiply-add buffer construction sub-module starts the first-column multiplication operation after the input data has been pushed into the first column of the buffer. This part adopts a pipeline design: on the rising clock edge after the add (add_conv) signal becomes valid, the multiplication operation starts for all the data in the conv_buf buffer, and when a convolution kernel is filled, the addition operation is performed. In every clock cycle the multiplication result is stored in the multiplication result (mul_res) register and the addition result is stored in the addition result (add_res) register, and the convolved result is sent to the pooling layer module.
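Functionally, the multiply-then-add stage amounts to the small sketch below: all window values are multiplied by their weights (the mul_res stage), and the products are summed together with the bias once the kernel is full (the add_res stage). Treating the two register stages as a single Python function is a simplification of the pipelined hardware.

```python
# Two-stage multiply/accumulate sketch: products first (mul_res), then one accumulated sum (add_res).
def conv_mac(window, weights, bias):
    mul_res = [x * w for x, w in zip(window, weights)]  # stage 1: element-wise products of the 3x3 window
    add_res = sum(mul_res) + bias                       # stage 2: accumulate once the kernel is full
    return add_res                                      # this convolved sum is sent to the pooling layer

window  = [1, 0, 1, 1, 0, 0, 1, 1, 0]   # 3x3 window, flattened row-major
weights = [2, -1, 3, 0, 5, -2, 1, 1, 4]
print(conv_mac(window, weights, bias=1))  # -> 8
```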
In the application, the first-layer convolution is a multiply-add operation, and the input image is binarized at the algorithm end, so the multiplication of the input data by the weights becomes a multiplication of single-bit 0/1 data by multi-bit weights, and the multiplication operation can therefore be omitted.
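Because the first-layer inputs are single-bit 0/1 values, each product is either the weight itself or zero, so the multiplier can be replaced by a selection; a one-line Python illustration (not the RTL) is given below.

```python
# First-layer "multiplication" with binarized inputs: select the weight wherever the input bit is 1.
def binary_mac(bits, weights, bias=0):
    return sum(w for b, w in zip(bits, weights) if b) + bias  # no hardware multiplier needed

assert binary_mac([1, 0, 1], [5, -3, 2]) == 7
```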
In the application, the pooling layer module comprises a pooling layer interface design submodule, a pooling layer on-layer storage submodule, a comparator design submodule, an upper layer characteristic diagram buffer zone construction submodule and a data bit truncation submodule which are connected in sequence;
the pooling layer interface design sub-module is used for adding the first two rows of the addition results input by the convolution layer module into the upper-layer feature map buffer construction sub-module, performing data processing on the addition results, processing the addition results through the data bit truncation sub-module to obtain output results, determining the maximum value of the corresponding output results in every four selected blocks by using the comparator design sub-module so as to determine the pooling layer output result, and inputting the pooling layer output result to the full-connection layer module.
The pooling layer interface design sub-module receives the feature map information of the previous layer, passes it on to the next layer, and stores all corresponding data into the next feature map. Fig. 3 shows a schematic diagram of the pooling layer FIFO queues and the input data buffer structure provided in the embodiment of the present application. As shown in fig. 3, the pooling kernel size is 2×2 and maximum pooling is adopted: a comparator selects the maximum value in every block of four values. Because the pooling kernel is 2×2, the first two rows of input data must be added into the buffer, which is likewise implemented with two FIFO queues (FIFO0 and FIFO1). The value in the bias result (bias_res) register is added to the data in the pooling layer result (pool_res) output register, and a data bit truncation operation is then performed before the result is output to the next layer. The truncation operation reduces the data bit width; it does not reduce the recognition accuracy and instead helps to avoid over-fitting.
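A compact Python reference for this pooling path (2x2 maximum pooling over two FIFO-buffered rows, addition of the bias, then bit truncation) might look like the sketch below; the exact truncation point, here dropping two low-order bits, is an assumption about how the bit width is reduced.

```python
# Illustrative pooling-path model: 2x2 max pooling, add the bias, then truncate the bit width.
def pool2x2_row_pair(row0, row1, bias, drop_bits=2):
    """row0/row1: two consecutive feature-map rows (the two FIFO-buffered lines)."""
    out = []
    for c in range(0, len(row0), 2):
        block = [row0[c], row0[c + 1], row1[c], row1[c + 1]]  # every block of four values
        pooled = max(block) + bias                            # comparator tree, then bias_res addition
        out.append(pooled >> drop_bits)                       # data bit truncation to shrink the width
    return out

print(pool2x2_row_pair([12, 7, 3, 9], [5, 20, 11, 2], bias=4))  # -> [(20+4)>>2, (11+4)>>2] = [6, 3]
```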
In the application, the full-connection layer module is used for loading data of the pooling layer output result input by the pooling layer module to obtain full-connection layer input data, carrying out full-connection matrix multiplication on the full-connection layer input data, and determining an output probability value by the last layer after three layers of full-connection processing.
The fully-connected layer design is similar to the convolution layer design: a receiving buffer implemented with a FIFO loads the input data, the multiply-add operations are then performed, and the interface and timing are similar to those of the convolution layer; only the parameter values need to be changed for reuse. A fully-connected matrix multiplication is performed on the input data from the previous layer, and after three fully-connected layers the last layer outputs the statistical probabilities.
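The fully-connected path reduces to three successive matrix-vector products whose final outputs are read as class scores; the Python sketch below illustrates this, with made-up layer sizes standing in for the patent's actual dimensions.

```python
# Sketch of the three fully-connected layers; dimensions and values are illustrative placeholders.
def fully_connected(x, weights, biases):
    return [sum(xi * wij for xi, wij in zip(x, col)) + b for col, b in zip(weights, biases)]

def fc_head(x, layers):
    """layers: a (weight_matrix, bias_vector) pair for each of the three FC layers."""
    for w, b in layers:
        x = fully_connected(x, w, b)
    return x  # last layer: one score per class, read as the output probability values

layers = [
    ([[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 1, 1]], [0, 0, 0]),  # FC1: 4 -> 3
    ([[1, 1, 0], [0, 1, 1], [1, 0, 1]], [0, 0, 0]),           # FC2: 3 -> 3
    ([[1, 2, 3], [3, 2, 1]], [1, -1]),                        # FC3: 3 -> 2 class scores
]
print(fc_head([1, 2, 3, 4], layers))
```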
Fig. 4 shows a schematic diagram of the neural network unit operation pipeline provided in an embodiment of the present application. As shown in fig. 4, CONV3 represents the three convolution layers, that is, the convolution layer module in the present application; POOL3 represents the three pooling layers, that is, the pooling layer module; and Dense1 represents the fully-connected layer, that is, the full-connection layer module. It can be seen that in this application the convolution layer module, the pooling layer module and the full-connection layer module operate in parallel, where FIFO denotes first-in first-out buffering, MUL denotes multiplication, ADD denotes addition, and CAP denotes data allocation.
The embodiment of the application provides a neural network acceleration system comprising an acceleration top-layer interface unit and a neural network unit; the neural network unit comprises a convolution layer module, a pooling layer module and a full-connection layer module which are connected in sequence, and the three modules operate in a data-stream-level, fully parallel pipeline mode; the acceleration top-layer interface unit completes the inter-layer connection among the three modules. The system can fully exploit the inherent parallelism of the convolutional neural network, so that it meets the low-power and low-cost requirements of Internet-of-Things terminal devices that run artificial neural network inference, as well as the real-time inference requirements of industrial applications.
The embodiment of the application provides a neural network acceleration system testing device, which comprises the neural network acceleration system shown in fig. 1 and a main control module connected with the neural network acceleration system;
the main control module is configured to acquire image data to be processed;
the main control module is further configured to determine convolutional layer input data based on the image data to be processed;
The neural network acceleration system is configured to pipeline the convolutional layer input data in parallel to determine an output probability value.
The neural network acceleration system testing device comprises a neural network acceleration system and a main control module connected to it. The main control module is configured to acquire the image data to be processed and to determine the convolution layer input data based on that image data; the neural network acceleration system is configured to perform parallel pipeline processing on the convolution layer input data and determine the output probability value. The device fully exploits the inherent parallelism of the convolutional neural network, so that it meets the low-power and low-cost requirements of Internet-of-Things terminal devices that run artificial neural network inference, as well as the real-time inference requirements of industrial applications, and it has the advantages of modular design, programmability, reconfigurability, high scalability, easy software and hardware implementation and low implementation cost.
In the application, the main control module is implemented as a system-on-chip (SOC) based on the open-source Hummingbird E203 RISC-V core, and comprises a kernel control sub-module, a device bus (ICB/AXI bus interface), a storage unit and a peripheral IP (intellectual property core) sub-module which are connected in sequence. The kernel control (E203 RISC-V) sub-module is used for controlling data interaction; the neural network acceleration system (NR_CNN, Number Recognition CNN) is mounted on the device bus and performs parallel pipeline processing on the convolution layer input data under the control of the kernel control sub-module to determine the output probability value.
Optionally, the peripheral IP sub-module comprises an image acquisition module, an image display module and an acceleration recognition result display module;
the image acquisition module is configured to acquire the image data to be processed corresponding to the image to be processed in the storage unit, and convert the image data to be processed into binary image data; the main control module is configured to perform median filtering on the binary image data, perform horizontal and vertical projection on the video stream data to be processed to obtain digit location data, and determine the digit location data as the convolution layer input data;
the image display module is configured to display an image corresponding to the image data to be processed;
the acceleration recognition result display module is configured to display recognition results corresponding to probability values output by the neural network acceleration system.
It should be noted that the main function of the image acquisition module is to read out 640×480 RGB pixel data from the camera by using the I2C read/write timing to configure the relevant registers, convert the RGB pixels into grayscale data (YCbCr format) and then into black-and-white picture data (binary data), write the 640×480 data into a block random access memory (Block RAM) for buffering, and output it to the video graphics array (PMOD-VGA) interface under timing control so that, with the appropriate control signals, it is displayed on the screen. Likewise, after the 640×480 binary data is generated, the picture data is median filtered and the video stream data is projected horizontally and vertically to locate the digit; the digit is marked with a red frame and resized (Resize) into valid 28×28 data, which is then fed to the input of the CNN LeNet-5 module mounted on the bus for recognition.
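The preprocessing chain in this paragraph (binarization, median filtering, horizontal and vertical projection to locate the digit, then resizing to 28x28) can be modelled in software roughly as below; the threshold value, the 3x3 filter size and the nearest-neighbour resize are assumptions, and the actual design implements these steps in FPGA logic rather than Python.

```python
import numpy as np

# Illustrative software model of the preprocessing chain (the design does this in hardware).
def locate_and_resize(gray, threshold=128, out_size=28):
    binary = (gray < threshold).astype(np.uint8)         # binarize: dark (digit) pixels become 1

    padded = np.pad(binary, 1)                           # 3x3 median filter (assumed size) for denoising
    filtered = np.zeros_like(binary)
    for r in range(binary.shape[0]):
        for c in range(binary.shape[1]):
            filtered[r, c] = np.median(padded[r:r + 3, c:c + 3])

    rows = np.where(filtered.sum(axis=1) > 0)[0]         # horizontal projection
    cols = np.where(filtered.sum(axis=0) > 0)[0]         # vertical projection
    box = filtered[rows.min():rows.max() + 1, cols.min():cols.max() + 1]   # digit bounding box

    ri = np.arange(out_size) * box.shape[0] // out_size  # nearest-neighbour resize (assumed)
    ci = np.arange(out_size) * box.shape[1] // out_size
    return box[np.ix_(ri, ci)]                           # 28x28 input for the LeNet-5 accelerator

digit = locate_and_resize(np.random.randint(0, 256, (480, 640), dtype=np.uint8))
assert digit.shape == (28, 28)
```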
In this application, the main purpose of the device bus (ICB/AXI bus interface) is to mount the convolutional neural network unit (CNN) on the SOC reasonably and effectively and to generate a suitable interface for interacting with the CNN data stream. The added ICB-to-AXI bus interface is assigned an appropriate address. When the read address is accessed, the output valid signal (vld) of the CNN is assigned to the read valid signal axi_rvalid and to axi_arready (the slave is ready to accept address and control information); otherwise, the address control signal of the channel, with axi_rvalid and axi_rready valid (indicating that read data can be received), is assigned to axi_arready.
Optionally, the acceleration recognition result display module includes a first display mode and a second display mode;
the first display mode displays the recognition result in real time through an upper computer corresponding to the acceleration recognition result display module; the second display mode determines and displays the recognition result through serial port configuration.
In the application, the acceleration recognition result display module provides two display modes. One is to display the recognition result in real time in a self-developed upper computer program; the other is to transmit a picture to be recognized, stored in a read-only memory (ROM), to the CNN through the bus, send the recognition result to the AXI bus after the CNN has processed the data, read the AXI bus with a software program and show the read data on the LED lamps. Fig. 5 shows a schematic diagram of the upper computer display interface provided in an embodiment of the present application. As shown in fig. 5, the upper computer software for displaying the handwritten digit recognition result is developed with Qt on the software side, and the real-time recognition result output over the serial port is displayed there.
In the application, a fully parallel neural network acceleration system is designed and implemented on a field-programmable gate array (FPGA) and, by integrating the main control module, namely the Hummingbird E203 RISC-V processor core, and using the related peripherals, it is realized as a complete embedded system that can be applied in a real scenario. The convolutional neural network test model is LeNet-5. The system can fully exploit the inherent parallelism of the convolutional neural network, so that it meets the low-power and low-cost requirements of Internet-of-Things terminal devices that run artificial neural network inference, as well as the real-time inference requirements of industrial applications.
The neural network acceleration system testing device comprises a neural network acceleration system and a main control module connected to it. The main control module is configured to acquire the image data to be processed and to determine the convolution layer input data based on that image data; the neural network acceleration system is configured to perform parallel pipeline processing on the convolution layer input data and determine the output probability value. The device fully exploits the inherent parallelism of the convolutional neural network, so that it meets the low-power and low-cost requirements of Internet-of-Things terminal devices that run artificial neural network inference, as well as the real-time inference requirements of industrial applications, and it has the advantages of modular design, programmability, reconfigurability, high scalability, easy software and hardware implementation and low implementation cost.
The test device for the neural network acceleration system provided by the application is applied to the neural network acceleration system shown in any one of fig. 1 to 4, and is not repeated here.
The electronic device in the embodiment of the application may be an apparatus, or may be a component, an integrated circuit, or a chip in a terminal. The device may be a mobile electronic device or a non-mobile electronic device. By way of example, the mobile electronic device may be a cell phone, tablet computer, notebook computer, palm computer, vehicle-mounted electronic device, wearable device, ultra-mobile personal computer (ultra-mobile personal computer, UMPC), netbook or personal digital assistant (personal digital assistant, PDA), etc., and the non-mobile electronic device may be a server, network attached storage (Network Attached Storage, NAS), personal computer (personal computer, PC), television (TV), teller machine or self-service machine, etc., and the embodiments of the present application are not limited in particular.
The electronic device in the embodiment of the application may be a device having an operating system. The operating system may be an Android operating system, an IOS operating system, or other possible operating systems, which is not specifically limited in the embodiments of the present application.
Fig. 6 shows a schematic hardware structure of an electronic device according to an embodiment of the present application. As shown in fig. 6, the electronic device 200 includes a processor 210.
As shown in FIG. 6, the processor 210 may be a general purpose central processing unit (central processing unit, CPU), microprocessor, application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of the programs of the present application.
As shown in fig. 6, the electronic device 200 may further include a communication line 240. Communication line 240 may include a pathway to transfer information between the aforementioned components.
Optionally, as shown in fig. 6, the electronic device may further include a communication interface 220. The communication interface 220 may be one or more. The communication interface 220 may use any transceiver-like device for communicating with other devices or communication networks.
Optionally, as shown in fig. 6, the electronic device may also include a memory 230. The memory 230 is used to store computer-executable instructions for performing aspects of the present application and is controlled by the processor for execution. The processor is configured to execute computer-executable instructions stored in the memory, thereby implementing the method provided in the embodiments of the present application.
As shown in fig. 6, the memory 230 may be, but is not limited to, a read-only memory (ROM) or other type of static storage device that can store static information and instructions, a random access memory (RAM) or other type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage, an optical disc storage (including a compact disc, a laser disc, an optical disc, a digital versatile disc, a Blu-ray disc, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Memory 230 may be a stand-alone device coupled to processor 210 via communication line 240. Memory 230 may also be integrated with processor 210.
Alternatively, the computer-executable instructions in the embodiments of the present application may be referred to as application program codes, which are not specifically limited in the embodiments of the present application.
In a particular implementation, as one embodiment, as shown in FIG. 6, processor 210 may include one or more CPUs, such as CPU0 and CPU1 in FIG. 6.
In a specific implementation, as an embodiment, as shown in fig. 6, the terminal device may include a plurality of processors, such as a first processor 2101 and a second processor 2102 in fig. 6. Each of these processors may be a single-core processor or a multi-core processor.
Fig. 7 is a schematic structural diagram of a chip according to an embodiment of the present application. As shown in fig. 7, the chip 300 includes one or more (including two) processors 210.
Optionally, as shown in fig. 7, the chip further includes a communication interface 220 and a memory 230, and the memory 230 may include a read-only memory and a random access memory, and provides operation instructions and data to the processor. A portion of the memory may also include non-volatile random access memory (non-volatile random access memory, NVRAM).
In some implementations, as shown in FIG. 7, memory 230 stores elements, execution modules or data structures, or a subset thereof, or an extended set thereof.
In the embodiment of the present application, as shown in fig. 7, by calling the operation instruction stored in the memory (the operation instruction may be stored in the operating system), the corresponding operation is performed.
As shown in fig. 7, the processor 210 controls the processing operation of any one of the terminal devices, and the processor 210 may also be referred to as a central processing unit (central processing unit, CPU).
As shown in fig. 7, memory 230 may include a read-only memory and a random access memory and provides instructions and data to the processor. A portion of memory 230 may also include NVRAM. The processor, the communication interface and the memory are coupled together by a bus system that may include, in addition to a data bus, a power bus, a control bus, a status signal bus, etc. For clarity of illustration, however, the various buses are labeled as bus system 310 in fig. 7.
As shown in fig. 7, the method disclosed in the embodiments of the present application may be applied to a processor or implemented by a processor. The processor may be an integrated circuit chip having signal processing capability. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, a digital signal processor (DSP), an ASIC, a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or perform the methods, steps and logic blocks disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly embodied as being performed by a hardware decoding processor, or performed by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory or a register. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.
In one aspect, a computer readable storage medium is provided, in which instructions are stored, which when executed, implement the functions performed by the terminal device in the above embodiments.
In one aspect, a chip is provided for use in a terminal device, the chip including at least one processor and a communication interface, the communication interface being coupled to the at least one processor, the processor being configured to execute instructions to implement the functions performed by the neural network acceleration system in the above-described embodiments.
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer programs or instructions. When the computer program or instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are performed in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, a terminal, a user equipment, or other programmable apparatus. The computer program or instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another computer readable storage medium, for example, the computer program or instructions may be transmitted from one website site, computer, server, or data center to another website site, computer, server, or data center by wired or wireless means. The computer readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that integrates one or more available media. The usable medium may be a magnetic medium, e.g., floppy disk, hard disk, tape; optical media, such as digital video discs (digital video disc, DVD); but also semiconductor media such as solid state disks (solid state drive, SSD).
Although the present application has been described herein in connection with various embodiments, other variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed application, from a review of the figures, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the "a" or "an" does not exclude a plurality. A single processor or other unit may fulfill the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
Although the present application has been described in connection with specific features and embodiments thereof, it will be apparent that various modifications and combinations can be made without departing from the spirit and scope of the application. Accordingly, the specification and drawings are merely exemplary illustrations of the present application as defined in the appended claims and are considered to cover any and all modifications, variations, combinations, or equivalents that fall within the scope of the present application. It will be apparent to those skilled in the art that various modifications and variations can be made in the present application without departing from the spirit or scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to include such modifications and variations as well.

Claims (10)

1. A neural network acceleration system, wherein the system comprises an acceleration top layer interface unit and a neural network unit;
the neural network unit comprises a convolution layer module, a pooling layer module and a full-connection layer module which are sequentially connected; the convolution layer module, the pooling layer module and the full-connection layer module operate in a data stream-level full-parallel pipeline operation mode;
the acceleration top layer interface unit is used for completing interlayer connection among the convolution layer module, the pooling layer module and the full-connection layer module.
2. The neural network acceleration system of claim 1, wherein the convolutional layer module comprises a convolutional layer interface design sub-module, a convolutional layer on-chip storage sub-module, an input loading state machine design sub-module, an input buffer construction sub-module, and a multiply-add buffer construction sub-module, which are connected in sequence;
the convolution layer interface design submodule is used for controlling parameter input corresponding to convolution layer input data through parameterization so as to increase reusability of the convolution layer module;
the convolution layer on-chip storage sub-module is used for storing the weight data in the convolution layer input data according to a preset bit width and a preset group size;
The input loading state machine design submodule is used for determining the state of the first-in first-out queue buffer area according to the preset state corresponding relation;
the input buffer construction sub-module is used for acquiring the input data corresponding to the state of the first-in first-out queue buffer, and pushing the input data into the corresponding convolution kernel position according to a preset push-in position;
the multiply-add buffer construction sub-module is used for starting the first-column multiplication operation after the input data has been pushed into the first column of the buffer, performing the addition operation after the corresponding convolution kernel is filled to obtain an addition result, and sending the convolved addition result to the pooling layer module;
the convolution layer input data comprises the number of input and output channels, weight bit width, feature map size, convolution step length and output data width.
3. The neural network acceleration system of claim 1, wherein the pooling layer module comprises a pooling layer interface design sub-module, a pooling layer on-layer storage sub-module, a comparator design sub-module, an upper layer feature map buffer construction sub-module, and a data bit truncation sub-module that are connected in sequence;
the pooling layer interface design sub-module is used for adding the first two rows of the addition results input by the convolution layer module into the upper-layer feature map buffer construction sub-module, performing data processing on the addition results, processing the addition results through the data bit truncation sub-module to obtain output results, determining the maximum value of the corresponding output results in every four selected blocks by using the comparator design sub-module so as to determine the pooling layer output result, and inputting the pooling layer output result to the full-connection layer module.
4. The neural network acceleration system of claim 1, wherein the full-connection layer module is configured to load data on a pooling layer output result input by the pooling layer module, obtain full-connection layer input data, perform full-connection matrix multiplication on the full-connection layer input data, and determine an output probability value in a last layer after three-layer full-connection processing.
5. The neural network acceleration system of claim 1, wherein the inter-layer connections comprise an input data interface control, an output data interface control, and an inter-layer connection control.
6. A neural network acceleration system testing device, which is characterized by comprising the neural network acceleration system according to any one of claims 1-5, and further comprising a main control module connected with the neural network acceleration system;
the main control module is configured to acquire image data to be processed;
the main control module is further configured to determine convolutional layer input data based on the image data to be processed;
the neural network acceleration system is configured to pipeline the convolutional layer input data in parallel to determine an output probability value.
7. The neural network acceleration system testing apparatus of claim 6, wherein the main control module comprises a kernel control sub-module, a device bus, a storage unit and a peripheral IP (intellectual property core) sub-module which are connected in sequence; the kernel control sub-module is used for controlling data interaction; the neural network acceleration system is mounted on the device bus and is used for completing parallel pipeline processing of the convolution layer input data under the control of the kernel control sub-module and determining the output probability value.
8. The neural network acceleration system testing apparatus of claim 7, wherein the peripheral IP sub-module comprises an image acquisition module, an image display module and an acceleration recognition result display module;
the image acquisition module is configured to acquire the image data to be processed corresponding to the image to be processed in the storage unit, and convert the image data to be processed into binary image data; the main control module is configured to perform median filtering on the binary image data, perform horizontal and vertical projection on the video stream data to be processed to obtain digit location data, and determine the digit location data as the convolution layer input data;
the image display module is configured to display an image corresponding to the image data to be processed;
the acceleration recognition result display module is configured to display recognition results corresponding to probability values output by the neural network acceleration system.
9. The neural network acceleration system testing apparatus of claim 8, wherein the acceleration recognition result display module includes a first display mode and a second display mode;
the first display mode is that the recognition result is displayed in real time through an upper computer corresponding to the acceleration recognition result display module;
And the second display mode is to determine and display the identification result through serial port configuration.
10. An electronic device, comprising: one or more processors; and one or more machine readable media having instructions stored thereon, which when executed by the one or more processors, cause the apparatus to perform the neural network acceleration system of any of claims 1-5.

