CN113961505A - High-performance hardware acceleration and algorithm verification system and method - Google Patents

Info

Publication number
CN113961505A
Authority
CN
China
Prior art keywords
module
main processor
interface
coprocessor
processor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111196413.8A
Other languages
Chinese (zh)
Inventor
魏伟
贾庆生
孙旭
张楷龙
王锡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Panda Electronics Co Ltd
Nanjing Panda Electronics Manufacturing Co Ltd
Original Assignee
Nanjing Panda Electronics Co Ltd
Nanjing Panda Electronics Manufacturing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Panda Electronics Co Ltd, Nanjing Panda Electronics Manufacturing Co Ltd
Priority to CN202111196413.8A
Publication of CN113961505A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/16 Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163 Interprocessor communication
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/38 Information transfer, e.g. on bus
    • G06F13/40 Bus structure
    • G06F13/4063 Device-to-bus coupling
    • G06F13/4068 Electrical coupling
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/38 Information transfer, e.g. on bus
    • G06F13/42 Bus transfer protocol, e.g. handshake; Synchronisation
    • G06F13/4282 Bus transfer protocol, e.g. handshake; Synchronisation on a serial bus, e.g. I2C bus, SPI bus
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/445 Program loading or initiating
    • G06F9/44521 Dynamic linking or loading; Link editing at or after load time, e.g. Java class loading
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2213/00 Indexing scheme relating to interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F2213/0016 Inter-integrated circuit (I2C)
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2213/00 Indexing scheme relating to interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F2213/0026 PCI express
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2213/00 Indexing scheme relating to interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F2213/0042 Universal serial bus [USB]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Advance Control (AREA)

Abstract

The invention discloses a high-performance hardware acceleration and algorithm verification system comprising a processor module, a storage module, a high-speed communication interface module, a peripheral interface module, a clock module, a power management module, and a debugging and downloading module. The storage module, the high-speed communication interface module, the peripheral interface module, the clock module, the power management module, and the debugging and downloading module are all connected to the processor module. The processor module comprises a main processor and a coprocessor: the main processor implements the hardware acceleration function and verifies the algorithm, and the coprocessor assists the main processor. The main processor is an FPGA, the coprocessor is a chip with an ARM hard core, and the two are connected through LVDS differential signal lines and a single-ended MIO. The invention also discloses a high-performance hardware acceleration and algorithm verification method. The invention improves the operational performance of the algorithm verification system and the quality of signal transmission, and reduces the power consumption of the system.

Description

High-performance hardware acceleration and algorithm verification system and method
Technical Field
The present invention relates to an algorithm verification system and method, and more particularly, to a high-performance hardware acceleration and algorithm verification system and method.
Background
With the development of science and technology, artificial intelligence and related fields have advanced rapidly. Deep learning, a key technology in artificial intelligence, is widely applied in computer vision, natural language processing, and other fields.
In the early stage of this development, complex algorithms such as deep learning were mainly executed on a CPU, but a CPU cannot efficiently execute complex algorithms that contain a large number of numerical operations. The traditional scheme is to use a GPU, which contains a large number of computation cores, to process highly parallel deep learning models. However, the GPU has high energy consumption and cannot be deployed and applied on a large scale. The parallelism of the GPU is achieved by replicating the same general-purpose computing core many times, and the limitations of the GPU instruction set mean that only one primitive instruction is executed per clock cycle. The architecture of the GPU is also fixed and cannot support some special instructions. Therefore, a traditional GPU-based algorithm verification system suffers from low operating efficiency, low instruction compatibility, and low flexibility.
Other single-processor algorithm verification systems must fetch, decode, and execute instructions while running an algorithm, which also reduces system performance and leads to high power consumption.
Disclosure of Invention
Purpose of the invention: the invention aims to provide a high-performance hardware acceleration and algorithm verification system and method that solve the problems of low algorithm verification efficiency and high system power consumption in algorithm verification systems.
Technical scheme: the high-performance hardware acceleration and algorithm verification system comprises a processor module, a storage module, a high-speed communication interface module, a peripheral interface module, a clock module, a power management module, and a debugging and downloading module; the storage module, the high-speed communication interface module, the peripheral interface module, the clock module, the power management module, and the debugging and downloading module are all connected to the processor module;
the processor module comprises a main processor and a coprocessor; the main processor implements the hardware acceleration function and verifies the algorithm, and the coprocessor coordinates the work of the main processor; the storage module caches system data and stores the operating program; the high-speed communication interface module transmits and processes data between the processor module and external high-speed devices; the peripheral interface module transmits and processes signals between the processor module and external devices; the clock module provides the system clock for the processor module; the power management module provides the working voltage for the processor module; the debugging and downloading module provides program downloading and function debugging for the system.
The main processor is an FPGA, the coprocessor is a chip with an ARM hard core, and the main processor is connected with the coprocessor through LVDS differential signal lines and a single-ended MIO.
The storage module comprises a first storage module and a second storage module; the first storage module is connected to the main processor, and the second storage module is connected to the coprocessor; the first storage module comprises a first DDR storage module and a first nonvolatile storage module; the second storage module comprises a second DDR storage module, a second nonvolatile storage module, and an embedded storage module;
the first DDR storage module caches data of the FPGA main processor, and the first nonvolatile storage module stores the system program of the main processor, which is run after power-on;
the second DDR storage module caches data of the ARM coprocessor, the second nonvolatile storage module stores the system program of the coprocessor, which is run after power-on, and the embedded storage module stores an embedded operating system that schedules and controls the task processes of the main processor and the coprocessor;
the first DDR storage module is connected to the main processor through an address bus and a data bus; the second DDR storage module is connected to the coprocessor through an address bus and a data bus; the first nonvolatile storage module is connected to the main processor through an SPI bus interface and a BPI bus interface; the second nonvolatile storage module is connected to the coprocessor through an SPI bus interface; and the embedded storage module is connected to the coprocessor through an MMC bus interface.
The high-speed communication interface module comprises a PCIE interface module, an SATA interface module and a high-speed optical fiber interface module; the PCIE interface module realizes communication and data interaction between the system and an external PC, the SATA interface module realizes data read-write operation of the system to SATA interface equipment, and the high-speed optical fiber module realizes communication and data interaction between the system and external optical fiber equipment.
The peripheral interface module comprises a network communication module, a USB module, and an SD card module, each of which communicates with the processor; the network communication module supports the transmission of gigabit Ethernet data, and the USB module handles information interaction between the system and a PC host or a USB peripheral. The peripheral interface module also comprises an FMC interface expansion module, which communicates with the processor and is used to expand the peripheral interfaces.
The clock module comprises a main processor clock module and a coprocessor clock module; the main processor clock module provides a high-precision LVDS differential system clock for the main processor; the coprocessor clock module provides a single-ended active system clock for the coprocessor.
The debugging and downloading module comprises a JTAG interface module, a UART interface module, and an I2C interface module; the JTAG interface module provides program downloading and debugging for the main processor and the coprocessor; the UART interface module provides online debugging of programs and communicates with an external RS232 interface; the I2C interface module operates the modules in the system that have an I2C communication interface, and supports program interaction and transmission.
The first storage module further comprises a parallel FLASH storage module with a BPI interface. When the LUT usage in the main processor exceeds a threshold value, the main processor loads its program from the BPI-interface FLASH to load and start the main processor program; when the LUT usage in the main processor is less than the threshold value, the main processor loads its program from the SPI-interface FLASH to load and start the main processor program.
The high-performance hardware acceleration and algorithm verification method of the invention comprises the following steps:
(1) the coprocessor reads N pictures from a peripheral connected to the peripheral interface module or the high-speed communication interface module, where N is greater than or equal to 2, and a standard value is set for each picture;
when the LUT usage in the main processor exceeds a threshold value, the main processor loads its program from the BPI-interface FLASH to load and start the main processor program, and when the LUT usage in the main processor is less than the threshold value, the main processor loads its program from the SPI-interface FLASH to load and start the main processor program;
(4) the main processor converts the picture data received over LVDS into parallel bitmap data, caches the bitmap data in the DDR, and performs algorithm verification on the N input pictures simultaneously;
(5) it is judged whether each algorithm verification result is equal to the standard value of the corresponding picture; if so, step (1) is executed, and if not, step (6) is executed;
(6) the algorithm is corrected and re-verified, and step (5) is executed again.
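The verify-and-correct loop of steps (4) to (6) can be sketched in software as follows. This is only an illustrative model under stated assumptions: the names `verify`, `verify_until_correct`, `correct`, and `max_rounds` are hypothetical, the pictures are processed sequentially rather than in parallel on an FPGA, and the bound on correction rounds is added here so the loop always terminates.

```python
from typing import Callable, Sequence, Tuple

Algorithm = Callable[[bytes], int]


def verify(algorithm: Algorithm,
           pictures: Sequence[bytes],
           standard_values: Sequence[int]) -> bool:
    """Steps (4)-(5): run the algorithm on every picture and compare each
    result with that picture's standard value."""
    if len(pictures) != len(standard_values):
        raise ValueError("each picture needs exactly one standard value")
    results = [algorithm(p) for p in pictures]  # sequential stand-in for parallel hardware
    return all(r == s for r, s in zip(results, standard_values))


def verify_until_correct(algorithm: Algorithm,
                         correct: Callable[[Algorithm], Algorithm],
                         pictures: Sequence[bytes],
                         standard_values: Sequence[int],
                         max_rounds: int = 10) -> Tuple[Algorithm, int]:
    """Steps (5)-(6): while verification fails, correct the algorithm and
    verify again. Returns the passing algorithm and the corrections used."""
    rounds = 0
    while not verify(algorithm, pictures, standard_values):
        if rounds == max_rounds:  # hypothetical safety bound, not in the source
            raise RuntimeError("algorithm still fails after max_rounds corrections")
        algorithm = correct(algorithm)
        rounds += 1
    return algorithm, rounds
```

For example, with a toy "algorithm" that should return the sum of a picture's bytes but is off by one, a single correction round is enough for verification to pass.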
Advantageous effects: compared with the prior art, the invention has the following notable advantages:
(1) The system platform, in which the FPGA main processor and the ARM coprocessor work together, is fully programmable in both software and hardware and can implement parallel algorithms efficiently; its dynamically reconfigurable nature, combined with rich peripheral interfaces and modules, effectively improves the operating performance of the system and reduces its power consumption.
(2) The FPGA main processor and the ARM coprocessor realize cooperative processing between the FPGA main processor and the ARM coprocessor by using differential data signals, meet the requirements of high-speed data exchange and transmission, are suitable for high-bandwidth data interaction, and effectively improve the quality of signal transmission.
(3) The main processor adopts a plurality of high-speed expanded HPC interfaces to realize the input of multi-channel high-speed data, and realizes the interaction and synchronous processing with the coprocessor, thereby realizing the efficient operation effect and improving the performance of the system.
(4) According to the proportion of the LUT used in the system program, the serial and parallel interface communication modes are dynamically switched and selected, so that the processing speed of the system is improved, the power consumption of the system is reduced, and the starting requirement of system equipment is met.
Drawings
FIG. 1 is a schematic structural view of the present invention;
FIG. 2 is a schematic diagram of one embodiment of the present invention.
Detailed Description
The technical scheme of the invention is further explained by combining the attached drawings.
As shown in fig. 1, the high-performance hardware acceleration and algorithm verification system according to the present invention includes a processor module, a storage module, a high-speed communication interface module, a peripheral interface module, a clock module, a power management module, and a debugging and downloading module. Each of the other modules is connected to the processor module.
The processor module comprises an FPGA main processor module and a ZYNQ coprocessor module with an ARM hard core. The main processor implements the hardware acceleration function and verifies the algorithm, and the coprocessor coordinates the work of the main processor by sending control instructions. The FPGA main processor module is connected to the coprocessor module through several pairs of LVDS differential signal lines and a single-ended MIO, which carry the interaction and control of instructions and data.
The storage module comprises a first storage module on the main processor side and a second storage module on the coprocessor side. The first storage module comprises a first DDR storage module and a first nonvolatile storage module. The second storage module comprises a second DDR storage module, a second nonvolatile storage module, and an embedded storage module. The first DDR storage module caches data while the FPGA main processor is processing it, and the first nonvolatile storage module stores the main processor's program, which is run after power-on. The second DDR storage module caches and handles data on the coprocessor side, and the second nonvolatile storage module stores the operating program and code of the coprocessor, which are used for system initialization and program execution after power-on. The embedded storage module stores an embedded operating system, which can schedule and control the task processes of the main processor and the coprocessor. Each DDR storage module is connected to the processor module through an address bus and a data bus. The nonvolatile storage module is connected to the processor module through the SPI interface and the BPI bus interface. The embedded storage module is connected to the coprocessor through the MMC bus interface.
The high-speed communication interface module is connected with a high-speed communication interface of the FPGA main processor module, so that the transmission and processing of data with high bandwidth, high throughput and low time delay requirements are realized. The high-speed communication interface module comprises a PCIE interface module, an SATA interface module and a high-speed optical fiber interface module. And the PCIE interface module is used for realizing communication and data interaction between the system and an external PC. The SATA interface module realizes the read-write operation of the system to the data of the SATA interface device. The high-speed optical fiber module is used for receiving external high-speed optical signals and transmitting the external high-speed optical signals to the main processor.
The peripheral interface module comprises an FMC interface expansion module, various high-speed and low-speed peripheral interfaces can be expanded through the FMC interface, input and processing of various signals are achieved, and transmission and processing of high-speed HDMI signals, DP video signals and high-speed AD/DA signals can be achieved through the expansion interface module.
The peripheral interface module also comprises a network communication module, a USB module and an SD card module which are communicated and connected with the coprocessor. The network communication module supports the transmission of gigabit Ethernet data and information. The USB module is used for information interaction between the system and the PC host end and between the system and the USB peripheral. When the USB module works in the master mode, the peripheral equipment realizes information interaction with the master equipment through the USB. When the USB module works in the slave mode, the host PC realizes the control and signal transmission of the system through the USB interface.
The clock module comprises a clock module of the main processor and a clock module of the coprocessor. The clock module of the main processor mainly provides a system clock for the main processor, and the clock is a high-precision LVDS differential clock and provides a clock reference for the normal work of a system. The clock module of the coprocessor mainly provides a system clock for the embedded hard core, and the clock is a single-ended active clock.
The power management module provides stable power for the main processor and the coprocessor to ensure normal and stable operation of the system, and also provides a stable operating level standard for the other peripheral modules that communicate with the main processor and the coprocessor. The power management module ensures stable and normal operation by controlling the power-on sequence of the system. For both the main processor and the coprocessor, the power-on sequence is: core power → auxiliary power → port power → peripheral interface power.
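The ordering constraint can be modelled as a simple sequencer that enables one rail at a time and only moves on once the current rail reports stable. This is a minimal sketch: the rail names, the `enable_rail` and `rail_is_stable` callbacks, and the abort-on-failure behaviour are hypothetical illustrations, since the text specifies only the order.

```python
from typing import Callable, List, Sequence

# Order taken from the description: core, auxiliary, port, peripheral interface.
POWER_ON_ORDER = ("core", "auxiliary", "port", "peripheral_interface")


def power_on(enable_rail: Callable[[str], None],
             rail_is_stable: Callable[[str], bool],
             order: Sequence[str] = POWER_ON_ORDER) -> List[str]:
    """Enable each supply rail in order; stop if a rail fails to stabilise."""
    enabled: List[str] = []
    for rail in order:
        enable_rail(rail)
        if not rail_is_stable(rail):
            # Assumed policy: never power a later rail before an earlier one is good.
            raise RuntimeError(f"rail {rail!r} did not stabilise; sequence aborted")
        enabled.append(rail)
    return enabled
```

In a real board this sequencing is normally done by the power-management hardware itself; the sketch only makes the dependency between the stages explicit.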
The debugging and downloading module realizes the functions of program downloading and function debugging of the system and comprises a JTAG interface module, a UART interface module and an I2C interface module. The JTAG interface module is used for program downloading and debugging functions of a plurality of processors. The UART interface module can realize the function of an online debugging interface with a program, and can also realize the functions of communication and data transmission with an external RS232 interface by adopting a level conversion chip. The I2C interface module realizes read-write operation to the module with I2C communication interface in the system, and realizes the functions of program interaction and transmission.
As shown in fig. 2, in this embodiment the FPGA main processor is a Xilinx V7 series device. The device contains 36 high-speed GTX transceivers with a single-channel rate of up to 12.5 Gbps, which makes it suitable for receiving and transmitting ultra-high-bandwidth data signals. It also integrates serial-parallel conversion modules and can decode and process different input signals. The device provides multiple groups of LVDS-level interface signals for communicating and exchanging signals with other processors.
The coprocessor module is a ZYNQ device with an ARM hard core. The ARM core is embedded in the device, and its operating frequency can reach 866 MHz. The coprocessor module comprises a logic part and an embedded processing part; the logic part is responsible for communicating with and controlling the FPGA main processor. The communication mode comprises a differential signal interface and a single-ended communication port. The single-ended communication port lets the ARM coprocessor control and communicate with the FPGA main processor. The differential signal interface carries the signal interaction between the FPGA main processor and the ARM coprocessor and comprises a high-speed differential transmission interface and a low-voltage signal transmission interface, so that signals and data of different rates and formats can be transmitted.
In this embodiment, the DDR storage module of the main processor is a DDR3 or DDR4 module. Taking DDR3 as an example, because the module works in double-data-rate mode, a DDR3 clock of 533 MHz gives a data rate of 1066 MT/s; the data width is 64 bits and the address bus width is 15 bits, which meets the caching requirements of FPGA main processor data processing. The nonvolatile storage module is mainly a FLASH storage module with an SPI communication interface. The program is written to this module through an external JTAG interface after the main processor is powered on, so that on power-up the FPGA main processor reads the program through an internal JTAG interface, which ensures normal operation of the main processor system.
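The DDR3 figures follow from simple arithmetic: double data rate means two transfers per clock cycle, and each transfer moves one bus width of data. The sketch below works this through; the function names are hypothetical, and the bandwidth is a theoretical peak (1 MB taken as 10^6 bytes) that ignores refresh and protocol overhead.

```python
def ddr_transfer_rate_mt_s(clock_mhz: float) -> float:
    """Double data rate: one transfer on each clock edge, so 2 per cycle."""
    return clock_mhz * 2


def ddr_peak_bandwidth_mb_s(clock_mhz: float, bus_width_bits: int) -> float:
    """Theoretical peak: transfers per second times bytes per transfer."""
    return ddr_transfer_rate_mt_s(clock_mhz) * bus_width_bits / 8


# Values from the embodiment: 533 MHz clock, 64-bit data bus.
rate = ddr_transfer_rate_mt_s(533)        # 1066 million transfers per second
peak = ddr_peak_bandwidth_mb_s(533, 64)   # 8528 MB/s theoretical peak
```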
The storage module of the main processor also comprises a parallel FLASH storage module with a BPI interface. Either the parallel FLASH module or the serial FLASH module can be selected for starting the main processor. When the usage of LUT units in the main processor exceeds 65%, the program load is large; to also meet the 100 ms set-up time required for PCIE device identification, the main processor outputs a device start selection signal through a device selection module, selects the BPI parallel communication mode, and loads the program from the BPI-interface FLASH. This starts the program quickly, meets the loading and start-up requirements of high-speed peripherals, improves loading efficiency, and reduces system power consumption. Conversely, when the LUT usage in the main processor is less than 65%, the main processor outputs a device start selection signal through the device selection module, selects the SPI serial communication mode, and loads the program from the SPI-interface FLASH for power-on start-up and operation. With this scheme, the program loading path is selected dynamically and adaptively according to the proportion of internal LUTs occupied by the program, which improves the processing efficiency of the system and reduces its power consumption.
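The selection rule reduces to a threshold comparison on LUT usage. The following is a minimal sketch of that decision only; the function name is hypothetical, and because the text does not say what happens at exactly 65%, the sketch assumes that case falls to SPI.

```python
LUT_THRESHOLD = 0.65  # 65% LUT usage, as stated in the embodiment


def select_boot_interface(lut_usage: float,
                          threshold: float = LUT_THRESHOLD) -> str:
    """Return "BPI" (parallel FLASH) for large designs, "SPI" (serial FLASH) otherwise.

    lut_usage is the fraction of LUTs used by the program, from 0.0 to 1.0.
    Usage exactly at the threshold is treated as SPI (an assumption).
    """
    if not 0.0 <= lut_usage <= 1.0:
        raise ValueError("lut_usage must be a fraction between 0 and 1")
    return "BPI" if lut_usage > threshold else "SPI"
```

With the roughly 70% usage quoted for the deep learning model in the working example, this rule selects BPI.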
The DDR storage module of the coprocessor is a DDR3 storage module of the same type as that of the main processor, and caches the data signals that are received from the ZYNQ coprocessor's peripherals and processed by the coprocessor. The nonvolatile storage module of the coprocessor has the same specification as the SPI storage module of the FPGA main processor; it holds the program that runs after the coprocessor is powered on and ensures normal operation of the coprocessor. The embedded storage module is an EMMC module that can run an embedded operating system, which manages and controls the running processes of the coprocessor module and indirectly controls the main processor through the MIO of the coprocessor. This further improves the operating efficiency of the acceleration system, provides task scheduling, and controls the operating power consumption of the system.
In this embodiment, the PCIE interface of the high-speed communication interface module works in x8 mode and supports high-bandwidth transfers with a data throughput of 7.8 GB/s, providing high-speed communication and data exchange between the processor module and the PC: the acceleration system performs the acceleration, and the PC performs the final processing of the signals. The SATA interface module serves as an external hard-disk interface, extending external storage to enlarge the storage space of the acceleration and verification system. It supports the standard SATA 2.0 transfer rate of 3 Gbps, which meets high-bandwidth transfer requirements and provides fast reads and writes with very low response latency. The high-speed optical fiber interface module comprises a 10G SFP+ module and a 40G QSFP+ module. After receiving an external optical signal, the module converts and processes it through its internal photoelectric conversion module, so that the system can process externally received signals efficiently and operate reliably.
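The quoted link rates can be related to usable payload bandwidth by accounting for line encoding. The 7.8 GB/s figure is consistent with eight lanes at 8 GT/s using 128b/130b encoding (PCIe 3.0 signalling), although the text does not name the PCIe generation, so that is an assumption here. SATA 2.0 uses 8b/10b encoding on its 3 Gbps line. The helper name below is hypothetical, and the results ignore packet and protocol overhead.

```python
def payload_rate_mb_s(line_rate_gbps: float, lanes: int,
                      payload_bits: int, encoded_bits: int) -> float:
    """Payload bandwidth in MB/s (10^6 bytes) after line-encoding overhead."""
    bits_per_second = line_rate_gbps * 1e9 * lanes * payload_bits / encoded_bits
    return bits_per_second / 8 / 1e6


# Assumed PCIe 3.0 x8: 8 GT/s per lane, 128b/130b encoding.
pcie_x8 = payload_rate_mb_s(8, lanes=8, payload_bits=128, encoded_bits=130)

# SATA 2.0: one 3 Gbps lane, 8b/10b encoding.
sata2 = payload_rate_mb_s(3, lanes=1, payload_bits=8, encoded_bits=10)
```

Under these assumptions the PCIe link carries about 7877 MB/s, in line with the 7.8 GB/s in the text, and the SATA link carries 300 MB/s.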
In this embodiment, the peripheral interface expansion module includes an HPC connector that conforms to the FMC interface standard and can input and output various types of interface signals. It supports input and output of HDMI and DP signals at up to 8K, meeting the requirements of ultra-high-definition video signal processing. To meet the requirements of high-speed analog-digital signal acquisition and processing, the HPC module also supports AD/DA acquisition at ultra-high sampling rates, up to the GSPS range, which further expands the peripheral functions of the system, accommodates various types of input signals, and extends the compatibility of the system.
In this embodiment, the network communication module uses an RJ45 interface to support the transmission of gigabit Ethernet data and information. The USB module uses USB 2.0 and handles information interaction between the system and the PC host and USB peripherals. The clock module of the main processor uses a 200 MHz differential active crystal oscillator, and the system clock is input through a clock input port of an HR bank of the FPGA main processor. The clock module of the coprocessor inputs a 33.3333 MHz clock signal, which ensures normal and stable operation of each module of the coprocessor system. The debugging and downloading module uses a JTAG interface module. Because the processor module and the debugging interfaces of the other peripherals all use JTAG, multiple JTAG interfaces must be driven. Connecting them in a daisy chain, firstly, saves the PCB space that multiple JTAG ports would occupy and, secondly, makes it convenient to upgrade the FPGA program online or remotely.
The working process of the invention is as follows:
In one embodiment, the system provided by the invention is used to verify a deep learning algorithm for target recognition. The algorithm identifies whether any components are missing from a circuit board after the components have been mounted.
In this embodiment, a large number of component images are used as the sample set. 80% of the images in the sample set are used as the training set to build the deep learning model, and the remaining 20% are used as the test set, which is fed to the trained model to verify the algorithm. The test set is set to 100 pictures, and the standard value of each picture is a fixed value in the range 0 to 100.
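The 80/20 partition can be sketched as follows. A 100-picture test set at 20% implies a sample set of 500 pictures; that total, the file names, and the seeded shuffle are assumptions made for illustration and are not stated in the text.

```python
import random


def split_samples(samples, train_fraction=0.8, seed=0):
    """Shuffle deterministically, then cut into training and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical sample set: 500 pictures gives a 400/100 split.
samples = [f"img_{i:03d}.bmp" for i in range(500)]
train_set, test_set = split_samples(samples)
print(len(train_set), len(test_set))
```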
The bmp pictures of the test set are stored on an SD card in the peripheral interface module. After the system is powered on, the coprocessor reads the test set pictures from the SD card over the SDIO bus, converts each bitmap into a serial signal by parallel-to-serial conversion, and transmits the serial signal to the main processor over the LVDS signal lines. Because the deep learning model is complex to implement and consumes a large share of the internal logic (LUT usage in the processor is about 70%), the main processor reads its program from FLASH in BPI mode for power-on initialization. It then converts the serial picture signals received over LVDS back into parallel bitmap data, caches the bitmap data in DDR, runs the deep learning model, and processes the bitmaps to obtain a result once the algorithm has executed.
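The parallel-to-serial conversion on the coprocessor side and the matching serial-to-parallel conversion on the main processor side can be modelled in software as below. This is only a behavioural sketch: the MSB-first bit order and the byte-wide parallel word are assumptions, and the real LVDS link would also involve clocking and framing that are not shown.

```python
def serialize(data: bytes):
    """Parallel-to-serial: yield each byte as eight bits, MSB first."""
    for byte in data:
        for shift in range(7, -1, -1):
            yield (byte >> shift) & 1


def deserialize(bits) -> bytes:
    """Serial-to-parallel: regroup a bit stream into bytes, MSB first."""
    out = bytearray()
    acc = 0
    count = 0
    for bit in bits:
        acc = (acc << 1) | bit
        count += 1
        if count == 8:
            out.append(acc)
            acc, count = 0, 0
    return bytes(out)


bitmap = bytes([0x42, 0x4D, 0x00, 0xFF])  # 0x42 0x4D is "BM", the BMP signature
assert deserialize(serialize(bitmap)) == bitmap
```

The round trip returns the original bytes, which is the property the receiving side relies on before caching the bitmap in DDR.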
The coprocessor sends a control instruction to the main processor through the MIO, and the main processor runs the algorithm on all 100 pictures in parallel. The result for each picture is compared with the corresponding standard value of the test set. When a result equals its standard value, the algorithm model is judged accurate; otherwise the model is judged inaccurate and must be revised.
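The comparison step can be sketched in software as follows. Threads stand in for the FPGA's hardware parallelism, `run_model` is a hypothetical placeholder for the deep learning model, and the sketch assumes the model counts as accurate only when every result matches its standard value.

```python
from concurrent.futures import ThreadPoolExecutor


def run_model(picture):
    """Hypothetical stand-in for the model; returns a value in 0..100."""
    return picture["pixels_sum"] % 101


def verify(pictures, standard_values, model=run_model):
    """Run the model on all pictures concurrently and compare with the standards."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(model, pictures))  # map preserves input order
    mismatches = [i for i, (r, s) in enumerate(zip(results, standard_values))
                  if r != s]
    return len(mismatches) == 0, mismatches


# Hypothetical test set of 100 pictures with matching standard values.
pictures = [{"pixels_sum": i} for i in range(100)]
standards = [i % 101 for i in range(100)]
accurate, bad = verify(pictures, standards)
print(accurate, bad)
```

Returning the indices of the mismatching pictures, rather than only a pass/fail flag, shows which samples the model needs to be revised against.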
The invention uses the FPGA main processor to exchange data with the bitmap data cached in DDR3. The bitmap image data can be transferred concurrently according to a given concurrent length and number of transfers, which effectively improves the operating speed of the system and, in turn, the efficiency of algorithm verification.
In the invention, the coprocessor sends a single control instruction to the main processor within one clock cycle; the main processor receives the instruction and starts executing, completing the processing and verification of the entire bitmap data. A conventional single processor, by contrast, must fetch, decode, and execute an instruction in each clock cycle and can therefore verify only a specific bit width of the image data per cycle. The more pictures the test set contains, the more clock cycles are required. Compared with a conventional single processor, the algorithm verification system can therefore improve efficiency by about 45% when verifying complex algorithms and reduce system power consumption by about 20%.

Claims (10)

1. A high performance hardware acceleration and algorithm verification system, characterized in that: the system comprises a processor module, a storage module, a high-speed communication interface module, a peripheral interface module, a clock module, a power management module and a debugging and downloading module; the storage module, the high-speed communication interface module, the peripheral interface module, the clock module, the power management module and the debugging and downloading module are all connected with the processor module;
the processor module comprises a main processor and a coprocessor, the main processor realizes a hardware acceleration function and verifies an algorithm, and the coprocessor assists the main processor to work;
the storage module realizes the caching of system data and the storage of an operating program;
the high-speed communication interface module realizes the transmission and processing of data between the processor module and external high-speed equipment;
the peripheral interface module realizes the transmission and processing of data between the processor module and the external equipment;
the clock module provides a system clock for the processor module;
the power management module provides working voltage for the processor module;
the debugging and downloading module realizes the program downloading and function debugging of the system.
2. The high performance hardware acceleration and algorithm validation system of claim 1, wherein: the main processor is an FPGA, the coprocessor is a chip with an ARM hard core, and the main processor is connected with the coprocessor through LVDS differential signal lines and a single-ended MIO.
3. The high performance hardware acceleration and algorithm validation system of claim 2, wherein: the storage module comprises a first memory and a second memory, the first memory is connected with the main processor, and the second memory is connected with the coprocessor; the first memory comprises a first DDR memory module and a first nonvolatile memory module; the second memory comprises a second DDR memory module, a second nonvolatile memory module and an embedded memory module;
the first DDR memory module is used for caching data of the main processor, and the first nonvolatile memory module is used for storing a system program of the main processor for system operation after power-on;
the second DDR memory module is used for caching data of the coprocessor, the second nonvolatile memory module is used for storing a system program of the coprocessor for system operation after power-on, and the embedded memory module is used for storing an embedded operating system to realize scheduling and control of the task processes of the main processor and the coprocessor;
the first DDR memory module is connected with the main processor through an address bus and a data bus;
the second DDR memory module is connected with the coprocessor through an address bus and a data bus;
the first nonvolatile storage module is connected with the main processor through an SPI bus interface and a BPI bus interface;
the second nonvolatile storage module is connected with the coprocessor through an SPI bus interface;
and the embedded storage module is connected with the coprocessor through an MMC bus interface.
4. The high performance hardware acceleration and algorithm validation system of claim 2, wherein: the high-speed communication interface module comprises a PCIE interface module, an SATA interface module and a high-speed optical fiber interface module;
the PCIE interface module realizes communication and data interaction between the system and an external PC, the SATA interface module realizes data read-write operation of the system to SATA interface equipment, and the high-speed optical fiber module realizes communication and data interaction between the system and external optical fiber equipment.
5. The high performance hardware acceleration and algorithm validation system of claim 2, wherein:
the peripheral interface module comprises a network communication module, a USB module and an SD card module which are communicated with the processor; the network communication module supports the transmission of gigabit Ethernet data, and the USB module is used for information interaction between the system and a PC host or a USB peripheral.
6. The high performance hardware acceleration and algorithm validation system of claim 5, wherein: the peripheral interface module further comprises an FMC interface expansion module, which communicates with the processor and is used for expanding the peripheral interfaces.
7. The high performance hardware acceleration and algorithm validation system of claim 2, wherein:
the clock module comprises a main processor clock module and a coprocessor clock module; the main processor clock module provides an LVDS differential system clock for the main processor; the coprocessor clock module provides a single-ended active system clock for the coprocessor.
8. The high performance hardware acceleration and algorithm validation system of claim 2, wherein:
the debugging and downloading module comprises a JTAG interface module, a UART interface module and an I2C interface module;
the JTAG interface module realizes the program downloading and debugging of the main processor and the coprocessor;
the UART interface module realizes the online debugging of programs and communicates with an external RS232 interface;
the I2C interface module operates the module with I2C communication interface in the system to realize the interaction and transmission of programs.
9. The high performance hardware acceleration and algorithm validation system of claim 3, wherein: the first memory further comprises a parallel FLASH memory module with a BPI interface; when the LUT usage in the main processor exceeds a threshold value, the main processor loads its program from the FLASH of the BPI interface to realize the loading and starting of the main processor program, and when the LUT usage in the main processor is less than the threshold value, the main processor loads its program from the FLASH of the SPI interface to realize the loading and starting of the main processor program.
10. A high-performance hardware acceleration and algorithm verification method, characterized in that the method comprises the following steps:
(1) the coprocessor reads N pictures from a peripheral connected with the peripheral interface module or the high-speed communication interface module, wherein N is more than or equal to 2, and a standard value of each picture is set;
(2) converting the picture into serial data and transmitting the serial data to a main processor through an LVDS signal line;
(3) when the LUT usage in the main processor exceeds a threshold value, the main processor loads its program from the FLASH of the BPI interface to realize the loading and starting of the main processor program, and when the LUT usage in the main processor is less than the threshold value, the main processor loads its program from the FLASH of the SPI interface to realize the loading and starting of the main processor program;
(4) the main processor converts the picture data received through LVDS into parallel bitmap data, caches the bitmap data in a DDR storage module, and performs algorithm verification on the N input pictures simultaneously;
(5) judging whether the algorithm verification result is equal to the standard value of the corresponding picture; if so, executing step (1), and if not, executing step (6);
(6) correcting the algorithm, re-verifying the algorithm, and executing step (5).
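Steps (3), (5) and (6) of the method can be sketched as a small control flow. Everything named here is hypothetical: the 0.5 threshold, the toy "model" that multiplies a picture value by a gain, and the correction rule are placeholders. The claims leave the case where LUT usage equals the threshold unspecified, so the sketch arbitrarily maps it to SPI, and it stops on the first accurate round instead of returning to step (1) for the next batch.

```python
def select_boot_flash(lut_usage, threshold=0.5):
    """Step (3): pick the configuration FLASH from the LUT usage ratio."""
    # Assumption: usage exactly at the threshold falls through to SPI.
    return "BPI" if lut_usage > threshold else "SPI"


def make_model(gain):
    """Toy stand-in for the algorithm under verification."""
    def model(picture):
        return picture * gain
    model.gain = gain
    return model


def correct(model):
    """Step (6): hypothetical correction that bumps the toy model's gain."""
    return make_model(model.gain + 1)


def verify_until_accurate(model, pictures, standard_values, max_rounds=10):
    """Repeat steps (5) and (6) until every result matches its standard value."""
    for round_no in range(1, max_rounds + 1):
        results = [model(p) for p in pictures]
        if results == list(standard_values):   # step (5): compare
            return model, round_no
        model = correct(model)                 # step (6): correct, then re-verify
    raise RuntimeError("model still inaccurate after max_rounds corrections")


boot = select_boot_flash(0.70)  # about 70% LUT usage, as in the embodiment
final_model, rounds = verify_until_accurate(make_model(0), [1, 2, 3], [2, 4, 6])
print(boot, final_model.gain, rounds)
```

With the toy inputs shown, the loop needs three verification rounds before the results match, and the 70% LUT usage from the embodiment selects the BPI FLASH.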
CN202111196413.8A 2021-10-14 2021-10-14 High-performance hardware acceleration and algorithm verification system and method Pending CN113961505A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111196413.8A CN113961505A (en) 2021-10-14 2021-10-14 High-performance hardware acceleration and algorithm verification system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111196413.8A CN113961505A (en) 2021-10-14 2021-10-14 High-performance hardware acceleration and algorithm verification system and method

Publications (1)

Publication Number Publication Date
CN113961505A true CN113961505A (en) 2022-01-21

Family

ID=79463840

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111196413.8A Pending CN113961505A (en) 2021-10-14 2021-10-14 High-performance hardware acceleration and algorithm verification system and method

Country Status (1)

Country Link
CN (1) CN113961505A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114780449A (en) * 2022-04-01 2022-07-22 扬州宇安电子科技有限公司 Data storage and transmission system based on ZYNQ chip
CN114780449B (en) * 2022-04-01 2022-11-25 扬州宇安电子科技有限公司 Data storage and transmission system based on ZYNQ chip
CN115357534A (en) * 2022-07-29 2022-11-18 中国科学院合肥物质科学研究院 High-speed multi-channel LVDS acquisition system and storage medium
CN115357534B (en) * 2022-07-29 2024-04-09 中国科学院合肥物质科学研究院 High-speed multipath LVDS acquisition system and storage medium

Similar Documents

Publication Publication Date Title
EP2472409B1 (en) Input-output module, and method for extending a memory interface for input-output operations
CN113961505A (en) High-performance hardware acceleration and algorithm verification system and method
US7600142B2 (en) Integrated circuit conserving power during transitions between normal and power-saving modes
CN108804376B (en) Small heterogeneous processing system based on GPU and FPGA
CN102012885A (en) System and method for realizing communication by adopting dynamic I2C bus
GB2494257A (en) Memory interface with a clock channel, command bus and address bus.
CN103870429A (en) High-speed-signal processing board based on embedded GPU
CN104834620A (en) SPI (serial peripheral interface) bus circuit, realization method and electronic equipment
CN111190855A (en) FPGA multiple remote configuration system and method
CN101872308A (en) Memory bar control system and control method thereof
CN106980587B (en) General input/output time sequence processor and time sequence input/output control method
CN114297962A (en) Self-adaptive interface FPGA software and hardware collaborative simulation acceleration system
CN215117312U (en) Real-time signal processing platform based on MPSOC
CN104951268A (en) Method for implementing extended high-performance graphics card based on CPCI
CN101950276B (en) Memory access unit and program performing method thereof
CN201812284U (en) Memory interface
CN111124991A (en) Reconfigurable microprocessor system and method based on interconnection of processing units
CN204706031U (en) Serial peripheral equipment interface SPI bus circuit and electronic equipment
CN206975631U (en) A kind of universal input output timing processor
CN114328342B (en) Novel program control configuration method for PCIe heterogeneous accelerator card
CN215450217U (en) Image processing module
CN115934631B (en) Intelligent storage platform based on MPSoC
CN112740193A (en) Method for accelerating system execution operation of big data operation
US20220374382A1 (en) Method and apparatus for extending i3c capability across multiple platforms and devices over usb-c connection
CN219122693U (en) COMe mainboard based on Feiteng processor and bridge piece

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination