CN113962842B - Dynamic stepless derotation system and method based on high-level synthesis for large-scale integrated circuits

Info

Publication number: CN113962842B
Application number: CN202111223132.7A
Authority: CN
Inventors: 张弘; 宋剑波; 杨一帆; 邢万里; 袁丁; 李旭亮
Original assignee: Beihang University
Current assignee: Beihang University
Priority date: 2021-10-20
Filing date: 2021-10-20
Publication date: 2022-12-09
Anticipated expiration: 2041-10-20
Also published as: CN113962842A

Abstract

The invention relates to a dynamic stepless derotation system and method based on high-level synthesis for large-scale integrated circuits. The system comprises a video acquisition module, a video decoding module, a video storage module, a data communication module, a video encoding module, a dynamic stepless derotation module, and a pixel merging module (the four-in-one module) designed to reduce algorithm latency and improve bus bandwidth utilization. The invention uses high-level synthesis to implement dynamic stepless derotation and can derotate captured video in real time on an electro-optical platform. It exploits the parallel acceleration and pipeline optimization of an FPGA (field-programmable gate array) and offers high video resolution, a wide derotation range, high derotation precision, clear processed images free of jagged edges, low output latency, strong system stability, ease of manufacture, low power consumption, and small size.

Description

A dynamic stepless derotation system based on high-level synthesis of large-scale integrated circuits and its method

Technical field

The invention relates to the field of intelligent embedded video processing, and in particular to a dynamic stepless derotation system and method based on high-level synthesis for large-scale integrated circuits.

Background

During video recording and aiming with an airborne pod, the outer frame of the television camera inevitably rolls, which moves the optical system relative to the carrier aircraft and rotates the image. Likewise, a fighter aircraft in flight often rolls through large angles (up to 360°), so the television picture rotates through large angles and seriously impairs the operator's view. Therefore, in many optical sighting devices and electro-optical pod systems, the original video acquired by the television system must be counter-rotated, that is, derotated, to eliminate the image rotation caused by changes in aircraft attitude. This keeps the image upright and stable, which helps the operator observe the scene and supports subsequent target detection, recognition, and tracking. Three derotation methods are common in engineering practice: electronic, optical, and physical derotation. Optical derotation is currently the most widely used; it corrects the image by rotating a derotation prism in the imaging optical path. Although this approach has low latency and a fast response, its manufacturing process is complex, its derotation-angle precision is low, and the system is bulky and power-hungry. With the rapid development of large-scale integrated circuits and digital signal processing, electronic derotation implemented by real-time video image processing algorithms has become the mainstream research direction. It overcomes the above shortcomings of optical derotation systems and is increasingly widely used.

With the continuing development of computer vision and the steadily improving performance of processing chips, electronic derotation based on video image processing has become the mainstream research direction among derotation technologies, and eliminating the image rotation caused by changes in aircraft attitude through electronic derotation has become the first choice in engineering applications.

Summary of the invention

Technical problem solved by the invention: to overcome the shortcomings of the prior art by providing a dynamic stepless derotation system and method based on high-level synthesis for large-scale integrated circuits. Built on high-level synthesis and exploiting FPGA parallel acceleration and pipeline optimization, it achieves stepless derotation with high precision, a wide range, strong real-time performance, and high output image quality. The angular precision reaches 0.001°, so the system responds to very small angles; the derotation range is 0-360°, so any angle can be derotated; one frame is processed in less than 12 ms, so derotation runs in real time; and bilinear interpolation is used, so the output image is smooth and free of jagged edges. Because real-time performance conflicts with precision, range, and image quality, the prior art achieves only one or a few of these targets at a time and never all of them together. The invention therefore has high engineering value.

Technical solution of the invention: a dynamic stepless derotation system designed using high-level synthesis for large-scale integrated circuits. Its core innovations are as follows: 1) it uses high-level synthesis, that is, a high-level language such as C++, for FPGA algorithm design, optimization, and resource scheduling; 2) the algorithm is pipelined, which raises data throughput, greatly reduces latency, and improves the real-time performance of image derotation; 3) multiple high-bandwidth AXI buses operate in parallel in real time, which improves data read/write efficiency and the real-time performance of the algorithm; 4) a four-in-one module merges the four 8-bit pixels needed for bilinear interpolation into one 32-bit word, so that all four pixels can later be fetched in a single read, greatly reducing the latency that repeated reads would otherwise incur.

The system comprises a video acquisition module, a video decoding module, a core processing module, and a video encoding module. The core processing module is a heterogeneous system-on-chip with an FPGA+ARM architecture, namely a Zynq UltraScale+ MPSoC 15EG chip. The FPGA contains the dynamic stepless derotation module, a video-to-AXI-stream module, an AXI video stream DDR read/write module, and the pixel merging module (the four-in-one module), which is designed to reduce algorithm latency and improve bus bandwidth utilization. The ARM side contains the video storage module (DDR) and the RS422 serial communication module. Data communication between the FPGA and the ARM uses the AXI control bus.

The video acquisition module uses a camera to capture the original video, which is the data to be derotated. After capture, the original video enters the video decoding module.

The video decoding module converts the serial video captured by the camera into parallel video data and produces a set of explicit video synchronization signals. The decoded parallel video data and synchronization signals are sent to the FPGA.

In the FPGA, the video-to-AXI-stream module first converts the video data into AXI video stream data, which has lower latency and lends itself to data synchronization and pipelined acceleration. The AXI video stream then flows into the four-in-one module. Because the subsequent derotation uses bilinear interpolation, each output pixel requires reading four adjacent pixels from DDR. The latency of a single pixel read is considerable, and repeated reads raise it further. The four-in-one module therefore buffers the data stream in on-chip cache two rows at a time and merges the four 8-bit pixels around each pixel into one 32-bit word. When the four pixels adjacent to a given pixel are needed later, a single read of the merged 32-bit word, split into four independent 8-bit values, delivers all four pixels at once. This makes full use of the AXI bus bandwidth and cuts the read latency to a quarter of its original value. The merged 32-bit video stream is then buffered into the DDR on the ARM side through the AXI video stream DDR read/write module.
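The pixel merging described above can be sketched in a few lines. This is a minimal software model, not the HLS implementation: the function names, the byte order within the 32-bit word, and the replication of the last column at the right edge are all assumptions made for illustration.

```python
def pack4(p00, p10, p01, p11):
    """Pack four 8-bit pixels into one 32-bit word (p00 in the low byte; assumed order)."""
    for p in (p00, p10, p01, p11):
        if not 0 <= p <= 0xFF:
            raise ValueError("pixel out of 8-bit range")
    return p00 | (p10 << 8) | (p01 << 16) | (p11 << 24)


def unpack4(word):
    """Split a 32-bit word back into its four 8-bit pixels."""
    return tuple((word >> shift) & 0xFF for shift in (0, 8, 16, 24))


def merge_two_rows(row_top, row_bottom):
    """Merge each 2x2 neighbourhood of two buffered rows into one 32-bit word.

    Word i holds pixels i and i+1 of the top row, then pixels i and i+1 of the
    bottom row. The last column is replicated at the right edge (assumption).
    """
    n = len(row_top)
    words = []
    for i in range(n):
        j = min(i + 1, n - 1)
        words.append(pack4(row_top[i], row_top[j], row_bottom[i], row_bottom[j]))
    return words
```

One read of a merged word followed by `unpack4` then yields all four interpolation neighbours, which is the source of the fourfold reduction in reads.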

Following the derotation command and derotation angle sent by the host computer through the RS422 serial communication module, the dynamic stepless derotation module derotates the video data buffered in DDR. Working with the four-in-one module described above, it splits each 32-bit word read from DDR into four 8-bit values for bilinear interpolation, and the processed image is written back to DDR. The AXI video stream DDR read/write module then reads the buffered derotated image from DDR into an AXI video stream, the AXI-stream-to-video module converts that stream into parallel video data with explicit synchronization signals, and the parallel video data is sent to the video encoding module, which encodes it and outputs it to a monitor or capture card for real-time display.

The bilinear-interpolation-based electronic image derotation algorithm is as follows:

(1) From the derotation angle sent by the host computer, compute for each pixel (x′, y′) of the derotated image the coordinates (x, y) of the corresponding pixel in the image before derotation. The formula is as follows:

where θ is the rotation angle and the matrix in the formula is the rotation matrix.

Rotation is generally performed about the image center (x₀, y₀), so the above formula should be rewritten as:

In scalar form, the above formula is:
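The formula images for step (1) are not reproduced in this text. A standard form consistent with the description is sketched below. It assumes θ is positive counterclockwise with the y axis pointing up; with image coordinates whose y axis points down, the signs of the sine terms flip. The patent's own sign convention may differ.

```latex
% Inverse mapping from the derotated pixel (x', y') to the source pixel (x, y)
\begin{bmatrix} x \\ y \end{bmatrix}
=
\begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix}
\begin{bmatrix} x' \\ y' \end{bmatrix}

% Rotation about the image center (x_0, y_0)
\begin{bmatrix} x \\ y \end{bmatrix}
=
\begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix}
\begin{bmatrix} x' - x_0 \\ y' - y_0 \end{bmatrix}
+
\begin{bmatrix} x_0 \\ y_0 \end{bmatrix}

% Scalar form
x = (x' - x_0)\cos\theta + (y' - y_0)\sin\theta + x_0
y = -(x' - x_0)\sin\theta + (y' - y_0)\cos\theta + y_0
```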

(2) Map pixels using bilinear interpolation. The source-image coordinates (x, y) computed in step (1) are usually not integers, so pixels cannot be mapped one to one. Resampling is generally used to handle the non-integer pixel coordinates that arise during mapping.

In image reconstruction theory, three interpolation methods are commonly used for image mapping: nearest-neighbor interpolation, bilinear interpolation, and cubic interpolation. Nearest-neighbor interpolation gives poor results, and the derotated image shows obvious jagged edges and burrs. Bilinear and cubic interpolation give better results, with continuous gray levels and no jagged edges. Cubic interpolation, however, is complex and slow to compute, which makes real-time requirements hard to meet in practice. As a compromise between derotation precision and real-time performance, the invention uses an image derotation algorithm based on bilinear interpolation.

Figure 2 illustrates the principle of the electronic derotation algorithm based on bilinear interpolation. The method interpolates linearly in the x and y directions using the gray values of the four integer-coordinate points surrounding the non-integer sample point. In Figure 2, (x, y) is the coordinate of the pixel to be interpolated, expressed relative to the point (0, 0) so that 0 ≤ x, y ≤ 1; f(x, y) is the gray value at (x, y); and f(0,0), f(1,0), f(0,1), f(1,1) are the gray values of the four surrounding points. The bilinear interpolation formula follows:

f(x,y) = [f(1,0)-f(0,0)]x + [f(0,1)-f(0,0)]y + [f(1,1)-f(1,0)-f(0,1)+f(0,0)]xy + f(0,0)
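Bilinear interpolation can also be written as a weighted sum of the four neighbours, which makes the corner cases easy to check. The sketch below is illustrative; the function name and argument order are assumptions, and x and y are fractional offsets in [0, 1] from the point whose gray value is f00.

```python
def bilinear(f00, f10, f01, f11, x, y):
    """Interpolate inside the unit square from four corner gray values.

    f00, f10, f01, f11 are the values at (0,0), (1,0), (0,1), (1,1);
    x and y are fractional offsets in [0, 1].
    """
    return (f00 * (1 - x) * (1 - y)
            + f10 * x * (1 - y)
            + f01 * (1 - x) * y
            + f11 * x * y)
```

At each corner the function returns that corner's value, and at the center of the square it returns the mean of the four values.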

(3) Determine the image boundary after derotation. The size of the rotated image generally differs from that of the original, so the boundary must be recomputed. The left, right, top, and bottom boundary positions are calculated as follows:

left = max(x₁, x₂, x₃, x₄)

right = min(x₁, x₂, x₃, x₄)

top = max(y₁, y₂, y₃, y₄)

bottom = min(y₁, y₂, y₃, y₄)

(4) Fix the image resolution. In engineering practice the output resolution is usually fixed, whereas derotating the original video by different angles changes the image resolution, and the resulting size varies with the angle. The invention therefore crops the derotated image about its center to a fixed output resolution, so that the output image always has the same size.
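Steps (1) to (4) can be combined into a small software model: inverse-map each output pixel about the image center, interpolate bilinearly, and keep the output the same size as the input. This is a sketch for illustration only, not the FPGA implementation. The function name, the sign convention of the angle, the fill value for pixels that map outside the source, and the rounding of coordinates to suppress floating-point noise are all assumptions.

```python
import math


def derotate(image, theta_deg, fill=0):
    """Derotate a grayscale image (list of rows, at least 2x2) about its center.

    The output has the same size as the input, which amounts to a center
    crop; output pixels that map outside the source receive `fill`.
    """
    h, w = len(image), len(image[0])
    x0, y0 = (w - 1) / 2.0, (h - 1) / 2.0
    t = math.radians(theta_deg)
    c, s = math.cos(t), math.sin(t)
    out = [[fill] * w for _ in range(h)]
    for yp in range(h):
        for xp in range(w):
            # Inverse mapping: output pixel (xp, yp) -> source position (x, y)
            dx, dy = xp - x0, yp - y0
            x = round(dx * c + dy * s + x0, 9)
            y = round(-dx * s + dy * c + y0, 9)
            if not (0 <= x <= w - 1 and 0 <= y <= h - 1):
                continue
            # Top-left neighbour, clamped so that the 2x2 window stays inside
            xi = min(int(x), w - 2)
            yi = min(int(y), h - 2)
            fx, fy = x - xi, y - yi
            f00, f10 = image[yi][xi], image[yi][xi + 1]
            f01, f11 = image[yi + 1][xi], image[yi + 1][xi + 1]
            value = (f00 * (1 - fx) * (1 - fy) + f10 * fx * (1 - fy)
                     + f01 * (1 - fx) * fy + f11 * fx * fy)
            out[yp][xp] = int(round(value))
    return out
```

For example, `derotate([[1, 2], [3, 4]], 180)` returns `[[4, 3], [2, 1]]`, and an angle of 0 returns the input unchanged.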

The focus of the invention is implementing dynamic stepless derotation with high-level synthesis for large-scale integrated circuits. This is what guarantees real-time performance at high resolution, and it is the invention's most important innovation.

Compared with the prior art, the invention has the following advantages:

(1) The invention introduces the four-in-one module, which takes full advantage of high-bandwidth data streaming. The data stream is buffered in on-chip cache two rows at a time, the four 8-bit pixels around each pixel are merged into one 32-bit word, and the merged words are streamed into DDR. When a pixel is later derotated by bilinear interpolation, the 32-bit word is read and split into four 8-bit pixels, which are exactly the four pixels the interpolation needs. Four pixels are thus obtained in a single read, which cuts the algorithm latency to a quarter of its original value. The processing latency equals that of nearest-neighbor derotation, while the result is far better.

(2) Pipelined acceleration of the algorithm. A major advantage of an FPGA over a general embedded system is that an algorithm can be optimized as a data pipeline, so the invention writes the algorithm in pipelined form. During development in Vivado HLS, the pipeline directive (the pipelining optimization pragma) is applied, and the program is written to follow the pipeline programming principle of one input, one use, and one output: each datum is input exactly once, used exactly once, and must eventually be output exactly once. This prevents the data stream from stalling and allows the algorithm to be pipelined at the cost of hardware logic resources.

Specifically, pipelining allows operations to execute concurrently: a step does not wait for all operations to finish before the next one begins. Pipelining applies to functions and loops. Taking loop pipelining as an example, each loop iteration involves three operations on a variable: read, compute, and write. Without pipelining, the three operations execute serially; an input is read every 3 clock cycles and the value is output 2 clock cycles later. With pipelining, a read is issued every clock cycle and several data items are processed in parallel. Figure 3 shows the latency before and after pipelining. Before, 3 clock cycles separate two reads and the last write is reached after 8 clock cycles; after, 1 clock cycle separates two reads and the last write is reached after 4 clock cycles. Pipelining therefore raises data throughput, greatly reduces latency, and improves the real-time performance of image derotation.
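The cycle counts quoted above follow from a simple model of a loop with a fixed stage depth and initiation interval. The sketch below assumes a loop of three iterations, which the text does not state but which reproduces the 8-cycle and 4-cycle figures; the function name is hypothetical.

```python
def last_write_cycle(iterations, depth, interval):
    """Cycle index, counted from the first read at cycle 0, of the last write.

    depth    -- number of one-cycle stages per iteration (read, compute, write = 3)
    interval -- cycles between the starts of consecutive iterations
    """
    return (iterations - 1) * interval + (depth - 1)


# Three iterations of a read/compute/write loop (assumed iteration count)
serial = last_write_cycle(iterations=3, depth=3, interval=3)     # no pipelining
pipelined = last_write_cycle(iterations=3, depth=3, interval=1)  # pipelined
```

With an interval of 1, the total time grows by one cycle per additional iteration instead of three, which is why the benefit increases with the number of pixels processed.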

(3) Real-time parallel use of multiple high-bandwidth AXI buses. The invention targets real-time derotation of high-resolution images, but the on-chip block RAM (BRAM) of the FPGA is too small to buffer a full high-resolution frame. The invention therefore attaches a 64-bit, 128 MB DDR chip to the ARM side for image buffering. Unlike buffering directly in BRAM, the DDR is attached to the ARM side, so the FPGA must read and write it over the AXI bus. Analysis and latency measurements show that, because the algorithm has already been pipelined as described in (2), the latency of the derotation algorithm itself is low, and most of the latency comes from reading and writing DDR over the AXI bus. The FPGA+ARM chip used, the Zynq UltraScale+ MPSoC 15EG, has abundant AXI bus resources (seven 128-bit AXI buses), so the invention reads, writes, and processes multiple pixels simultaneously over several high-bandwidth AXI buses in parallel, which greatly reduces latency, raises data throughput, and improves real-time performance. The final design uses two 128-bit buses and one 64-bit bus in parallel. For 1080p grayscale images, the bilinear-interpolation derotation algorithm has an overall latency of 12 ms across the full 360° range, so derotation completes within one frame period for both 30 fps and 60 fps video; in other words, high-resolution images are derotated in real time. The design uses only 36% of the bus resources to derotate 1080p images, so using more buses could further raise the resolution that can be derotated in real time.
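The two figures in this paragraph can be checked with simple arithmetic. The sketch below assumes that the 36% bus usage is measured as used bus width over the total width of seven 128-bit buses; the text does not state how the figure was derived, and the function names are hypothetical.

```python
def bus_utilisation(used_widths_bits, total_buses=7, bus_width_bits=128):
    """Fraction of the total AXI bus width that is in use (assumed metric)."""
    return sum(used_widths_bits) / (total_buses * bus_width_bits)


def fits_in_frame(latency_ms, fps):
    """True if the processing latency is shorter than one frame period."""
    return latency_ms < 1000.0 / fps


# Two 128-bit buses and one 64-bit bus, as stated in the text
utilisation_percent = round(bus_utilisation([128, 128, 64]) * 100)
```

Under this reading, 320 of 896 bits are in use, which rounds to 36%, and a 12 ms latency fits within both the 33.3 ms frame period at 30 fps and the 16.7 ms frame period at 60 fps.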

(4) High-level synthesis for algorithm design, optimization, and resource scheduling. The Zynq UltraScale+ MPSoC 15EG used in the invention is a heterogeneous embedded chip developed by Xilinx and is programmed with the Vivado development suite, which includes the high-level development tool Vivado HLS. Within the HLS framework, algorithms are developed and optimized in a high-level language (C, C++, or SystemC) following specific coding rules, and the HLS tool then converts the high-level program into a hardware description language (Verilog HDL or VHDL). High-level synthesis makes algorithm optimization and the scheduling of logic resources convenient, greatly improves development efficiency, exploits the parallelism of multiple AXI buses and the multi-pipeline acceleration of the FPGA+ARM architecture, and significantly improves the performance of the derotation algorithm. The design trades off logic resource usage, latency, and throughput. Because the chip has abundant logic resources, the invention spends logic resources to obtain lower algorithm latency and higher data throughput. It improves the derotation algorithm in two ways: data type optimization and data throughput optimization. For data types, the invention uses 20-bit data in many places, but the widths of standard C data types are integer multiples of 8 bits, and using 32-bit integers directly would waste logic resources and fail to exploit the performance and parallelism of the FPGA. The invention therefore defines a 20-bit data type using the arbitrary-width data types provided by the HLS tool, which greatly reduces logic resource usage. For data throughput, the invention follows the principle of trading area for speed: loops are pipelined and unrolled, raising throughput and performance at the cost of logic resources.
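The effect of a 20-bit type can be modelled in software. The text does not say what the 20-bit values hold, so the sketch below uses a purely hypothetical example: an unsigned fixed-point coordinate with 11 integer bits (enough for 0 to 2047) and 9 fractional bits. In Vivado HLS such a value would be declared with an arbitrary-precision type such as `ap_uint<20>` from `ap_int.h`; the Python functions here only emulate the width limit.

```python
WIDTH = 20      # total bits, as stated in the text
FRAC_BITS = 9   # hypothetical split: 11 integer bits + 9 fractional bits


def to_fixed(value, frac_bits=FRAC_BITS, width=WIDTH):
    """Encode a non-negative real value as an unsigned fixed-point integer."""
    raw = int(round(value * (1 << frac_bits)))
    if not 0 <= raw < (1 << width):
        raise OverflowError("value does not fit in %d bits" % width)
    return raw


def from_fixed(raw, frac_bits=FRAC_BITS):
    """Decode an unsigned fixed-point integer back to a real value."""
    return raw / (1 << frac_bits)


# Fraction of storage bits saved relative to a 32-bit integer
bits_saved = 1 - WIDTH / 32
```

Under this assumed split, a coordinate such as 1919.5 is stored exactly, and each value occupies 37.5% fewer bits than a 32-bit integer would.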

(5) In practical tests, 1920×1080 visible-light images are derotated in real time, with a derotation range of 0-360°, a latency below 12 ms, a derotation-angle precision of 0.001°, and a maximum pixel error of less than 1 pixel. The system as a whole offers high video resolution, a wide derotation range, high derotation precision, clear processed images free of jagged edges, low output latency, strong system stability, ease of manufacture, low power consumption, and small size.

Brief description of the drawings

Figure 1 is a block diagram of the dynamic stepless derotation system based on high-level synthesis for large-scale integrated circuits;

Figure 2 is a schematic diagram of the principle of the image derotation algorithm based on bilinear interpolation;

Figure 3 shows the effect of pipeline optimization on latency;

Figure 4 is a flowchart of the dynamic stepless derotation processing module;

Figure 5 demonstrates the effect of the dynamic stepless derotation system: (a) before derotation, (b) after derotation.

Detailed description

Specific embodiments of the invention are further described below with reference to the accompanying drawings.

As shown in Figure 1, the derotation system comprises a video acquisition module, a video decoding module, a core processing module, and a video encoding module. The core processing module is a heterogeneous system-on-chip with an FPGA+ARM architecture. The FPGA contains the dynamic stepless derotation module, a video-to-AXI-stream module, an AXI video stream DDR read/write module, and the pixel merging module (the four-in-one module), which is designed to reduce algorithm latency and improve bus bandwidth utilization. The ARM side contains the video storage module (DDR) and the RS422 serial communication module. Data communication between the FPGA and the ARM uses the AXI control bus.

The video acquisition module is an industrial camera with a resolution of 1920×1080 and a frame rate of 30 Hz or 60 Hz; the video output format is not restricted. The video decoding module is a video decoder chip that converts the incoming serial video signal into parallel video together with the data enable signal DE, the horizontal synchronization signal HSYNC, and the vertical synchronization signal VSYNC. The data, enable, and synchronization signals are passed to the FPGA for further processing. The video storage module combines four 16-bit 128 MB DDR4 devices into one 64-bit 128 MB DDR. Derotation requires a full-frame buffer, and the on-chip memory of the FPGA is too small to hold a full frame, so external memory is needed. The invention attaches the DDR to the ARM side of the Zynq chip, which simplifies subsequent operations. The data communication module has two parts. The first is communication between the electronic derotation system and the host controller, which is based on RS422; this stable low-speed link is sufficient for transmitting the derotation angle. The second is communication between the FPGA side and the ARM side inside the Zynq chip, which uses the AXI bus protocol provided by Xilinx to transfer command and image information. The video encoding module is a video encoder chip that converts the parallel video data, the data enable signal DE, the horizontal synchronization signal HSYNC, and the vertical synchronization signal VSYNC into a serial video signal, which is output to a monitor or capture card for real-time display. The core processing module is a Zynq UltraScale+ MPSoC 15EG, a heterogeneous system-on-chip with an ARM+FPGA architecture. Zynq chips combine parallel acceleration on the FPGA side with control and scheduling on the ARM side and are among the mainstream heterogeneous system-on-chip devices. The core of the invention is the four-in-one module and the dynamic stepless derotation module. The algorithm of the dynamic stepless derotation module is deployed on the FPGA side, while memory scheduling and communication with the host computer run on the ARM side.

The invention comprises the following steps:

Step 1: Video capture and decoding

The invention uses an industrial camera to capture video images and decodes them with a decoder chip to obtain parallel video together with the data-valid signal DE, the line synchronization signal HSYNC, and the field synchronization signal VSYNC. Because the design is built around FPGA AXI data streams, the decoded signals are fed into the video-to-AXI-bus-video-stream module, which converts the parallel video data into AXI bus video stream data so that pipeline acceleration can be implemented efficiently in later stages.
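The conversion from parallel video to a stream can be illustrated with a small software model. The patent does not describe the sideband signals; the sketch below assumes the convention commonly used for AXI4-Stream video, where `tuser` marks the first pixel of a frame and `tlast` marks the last pixel of each line. The function names are hypothetical.

```python
def to_axi_stream(frame):
    """Turn a frame (list of rows of pixel values) into stream beats.

    Each beat is (tdata, tuser, tlast). Assumed convention: tuser = 1 on the
    first pixel of the frame, tlast = 1 on the last pixel of every line.
    """
    beats = []
    for r, row in enumerate(frame):
        for c, pixel in enumerate(row):
            tuser = 1 if (r == 0 and c == 0) else 0
            tlast = 1 if c == len(row) - 1 else 0
            beats.append((pixel, tuser, tlast))
    return beats


def from_axi_stream(beats):
    """Rebuild the rows from the beats, using tlast as the line delimiter."""
    frame, row = [], []
    for tdata, _tuser, tlast in beats:
        row.append(tdata)
        if tlast:
            frame.append(row)
            row = []
    return frame
```

The point of the model is that the explicit DE/HSYNC/VSYNC timing is replaced by two in-band markers, so downstream stages only need to consume beats in order. The reverse conversion in Step 5 regenerates explicit sync signals from the same markers.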

Step 2: Merging adjacent pixels

The invention introduces a four-in-one module. Each time two lines of the data stream have arrived, they are buffered in on-chip memory, and the four adjacent 8-bit pixels around each pixel are merged into one 32-bit word. When a pixel is later derotated by bilinear interpolation, this 32-bit word is read and split back into four 8-bit pixels, which are exactly the four pixels that the interpolation requires. Four pixels are thus fetched in a single read, which reduces the algorithm delay to one quarter of the original.
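The merge and split operations can be sketched as follows. The byte order inside the 32-bit word, the choice of neighbours (the pixel itself, right, below, below-right), and the clamping at the last row and column are assumptions made for illustration; the patent does not specify them.

```python
def pack4(f00, f10, f01, f11):
    """Pack four 8-bit pixels into one 32-bit word.

    Assumed layout: f00 (the pixel itself) in bits 0-7, f10 (right neighbour)
    in bits 8-15, f01 (pixel below) in bits 16-23, f11 (below-right) in 24-31.
    """
    for p in (f00, f10, f01, f11):
        if not 0 <= p <= 255:
            raise ValueError("pixels must be 8-bit values")
    return f00 | (f10 << 8) | (f01 << 16) | (f11 << 24)


def unpack4(word):
    """Split a 32-bit word back into (f00, f10, f01, f11)."""
    return tuple((word >> shift) & 0xFF for shift in (0, 8, 16, 24))


def merge_frame(frame):
    """Replace every pixel by the packed word of its 2x2 neighbourhood.

    Only two consecutive rows are needed at a time, which mirrors the
    two-line buffering described in the text. Edges are clamped (assumption).
    """
    h, w = len(frame), len(frame[0])
    merged = []
    for r in range(h):
        r1 = min(r + 1, h - 1)
        row = []
        for c in range(w):
            c1 = min(c + 1, w - 1)
            row.append(pack4(frame[r][c], frame[r][c1],
                             frame[r1][c], frame[r1][c1]))
        merged.append(row)
    return merged
```

With this layout, a single 32-bit read at position (r, c) returns all four samples that bilinear interpolation needs, instead of four separate 8-bit reads.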

Step 3: Video data storage

The 32-bit video stream data merged in Step 2 is cached in the ARM-side DDR through the AXI video stream DDR read-write module.

Step 4: Real-time dynamic stepless derotation of the video data

The flowchart of the dynamic stepless derotation module is shown in Figure 4. The dynamic stepless derotation algorithm is designed with Vivado high-level synthesis and packaged as an IP core. The IP core defines two m_axi (AXI master) ports, one for reading from and one for writing to the DDR4. The read m_axi port fetches the original pixel data from a frame buffer in the DDR4 over the AXI bus; after the derotation algorithm has processed the data, the write m_axi port outputs the result to another frame buffer in the DDR, which completes the image derotation flow.
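A software model of the read, derotate, and write flow is sketched below. It is not the HLS source of the IP core: the rotation sign convention, the image-center definition, the packed-word layout, the edge clamping, and the fill value for out-of-range samples are all assumptions, and the helper names are hypothetical. Two Python lists stand in for the input and output frame buffers in DDR.

```python
import math


def pack4(f00, f10, f01, f11):
    """Pack self, right, below and below-right pixels (assumed byte order)."""
    return f00 | (f10 << 8) | (f01 << 16) | (f11 << 24)


def unpack4(word):
    return tuple((word >> shift) & 0xFF for shift in (0, 8, 16, 24))


def merge_frame(frame):
    """Build the merged input buffer; edges are clamped (assumption)."""
    h, w = len(frame), len(frame[0])
    return [[pack4(frame[r][c],
                   frame[r][min(c + 1, w - 1)],
                   frame[min(r + 1, h - 1)][c],
                   frame[min(r + 1, h - 1)][min(c + 1, w - 1)])
             for c in range(w)] for r in range(h)]


def derotate(merged, theta_deg, fill=0):
    """Read packed words from `merged`, write 8-bit pixels to a new buffer."""
    h, w = len(merged), len(merged[0])
    x0, y0 = (w - 1) / 2.0, (h - 1) / 2.0      # assumed image center
    t = math.radians(theta_deg)
    cos_t, sin_t = math.cos(t), math.sin(t)
    out = [[fill] * w for _ in range(h)]       # the "other" frame buffer
    for yo in range(h):
        for xo in range(w):
            dx, dy = xo - x0, yo - y0
            # Inverse mapping: where does this output pixel come from?
            xs = cos_t * dx + sin_t * dy + x0
            ys = -sin_t * dx + cos_t * dy + y0
            xi, yi = math.floor(xs), math.floor(ys)
            if 0 <= xi < w and 0 <= yi < h:
                fx, fy = xs - xi, ys - yi      # fractional offsets in [0, 1)
                f00, f10, f01, f11 = unpack4(merged[yi][xi])  # one read
                val = ((f10 - f00) * fx + (f01 - f00) * fy
                       + (f11 + f00 - f10 - f01) * fx * fy + f00)
                out[yo][xo] = min(255, max(0, int(round(val))))
    return out
```

A derotation angle of zero reproduces the input exactly, which is a convenient sanity check. In hardware, the inner loop body is the part that would be pipelined, with the single packed read replacing four separate pixel fetches.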

Step 5: Video encoding and output display

After Step 4, the derotated image resides in a buffer region of the DDR. The AXI video stream DDR read-write module is used again to read the buffered derotated image from the DDR into an AXI video stream, and the AXI-bus-video-stream-to-video module converts that stream into parallel video data with explicit synchronization signals. The parallel video is sent to the video encoder chip, encoded, and output to a monitor or capture card for real-time display of the derotated result.

Following the steps above, the host computer can specify any derotation angle and the system outputs the derotated result in real time. For example, when the host computer issues a derotation angle of 0.625° clockwise, the images before and after processing are as shown in Figure 5. Figure 5(a) is the original image before derotation. It is tilted with respect to the horizontal, that is, the optical axis is not accurately leveled and the image carries a counterclockwise rotation, which the host computer measures as 0.625°. The host computer therefore issues a derotation angle of 0.625° to the system. The image after derotation is shown in Figure 5(b): it is level in the horizontal direction and is sharp with no jagged edges. The derotation angle precision reaches 0.001°, and the processing time for this frame is less than 12 ms, so the system has good real-time performance.
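The figures quoted in this example can be put in context with two short calculations: the frame period available at each frame rate, and the displacement that a given angle produces at the image corner, where it is largest. This is a sketch only; the rotation is assumed to be about the image center, and the function names are illustrative.

```python
import math


def frame_period_ms(frame_hz):
    """Time available per frame at the given frame rate."""
    return 1000.0 / frame_hz


def corner_shift_px(width, height, angle_deg):
    """Arc length travelled by the farthest corner for a small rotation
    about the image center (radius times angle in radians)."""
    radius = math.hypot(width / 2.0, height / 2.0)
    return radius * math.radians(angle_deg)


if __name__ == "__main__":
    print(frame_period_ms(60))                  # about 16.7 ms
    print(frame_period_ms(30))                  # about 33.3 ms
    print(corner_shift_px(1920, 1080, 0.625))   # about 12 pixels
    print(corner_shift_px(1920, 1080, 0.001))   # about 0.02 pixels
```

Under these assumptions, a processing time below 12 ms fits within the roughly 16.7 ms frame period at 60 Hz. The 0.625° tilt moves the corners of a 1920×1080 image by about 12 pixels, whereas a 0.001° step corresponds to about 0.02 pixels at the corner, well below one pixel.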

Content not described in detail in this specification belongs to the prior art known to those skilled in the art.

The above embodiments are provided only to describe the invention and are not intended to limit its scope. The scope of the invention is defined by the appended claims. All equivalent replacements and modifications made without departing from the spirit and principle of the invention fall within its scope.

Claims

1. A dynamic stepless derotation system based on high-level synthesis for large-scale integrated circuits, characterized in that: the system comprises a video acquisition module, a video decoding module, a core processing module, and a video encoding module; the core processing module is a heterogeneous system-on-chip with an FPGA+ARM architecture; the FPGA comprises a dynamic stepless derotation module, a video-to-AXI-bus-video-stream module, an AXI video stream DDR read-write module, and a pixel merging module, namely a four-in-one module, designed to reduce algorithm delay and improve bus bandwidth utilization; the ARM comprises a video storage module (DDR) and an RS422 serial communication module, and data communication between the FPGA and the ARM is carried out over an AXI control bus;

the video acquisition module acquires an original video image with a camera, the video image being the data to be derotated; the acquired original video image enters the video decoding module;

the video decoding module converts the serial video acquired by the camera into parallel video data and obtains a set of explicit video synchronization signals; the decoded parallel video data and synchronization signals are sent to the FPGA;

in the FPGA, the video-to-AXI-bus-video-stream module first converts the video data into AXI bus video stream data, which has lower delay and is better suited to data synchronization and pipeline acceleration; the data in AXI bus video stream format then flows into the four-in-one module, which caches the data stream in on-chip memory each time two lines have flowed in and merges the four adjacent 8-bit pixels around each pixel into one 32-bit word; when the four pixels adjacent to a given pixel must be read subsequently, the merged 32-bit word is read once and split into four independent 8-bit values, so that four pixels are read in a single access; this processing makes use of the AXI bus bandwidth to reduce the delay to one quarter of the original; the merged 32-bit video stream data is cached into the DDR of the ARM through the AXI video stream DDR read-write module;

the dynamic stepless derotation module dynamically performs stepless derotation on the video data cached in the DDR according to a derotation command and a derotation angle sent by the host computer through the RS422 serial communication module; during derotation it works together with the four-in-one module, splitting each 32-bit word read from the DDR into four 8-bit values for bilinear interpolation, and the processed video image is stored back in the DDR; the AXI video stream DDR read-write module then reads the cached derotated video image from the DDR into an AXI video stream, the AXI-bus-video-stream-to-video module converts the AXI video stream into parallel video data with explicit synchronization signals, and the parallel video data is sent to the video encoding module to be encoded and output to a display or capture card for real-time display.

2. The dynamic stepless derotation system based on high-level synthesis for large-scale integrated circuits according to claim 1, wherein: the four-in-one module and the dynamic stepless derotation module are developed with the high-level synthesis tool Vivado HLS and are pipelined using the pipeline precompiler directive, i.e. the pipeline optimization directive, provided that the program satisfies the requirement that each data item is input once, used once, and output once, so that data that originally required 8 clock cycles to process is processed in 4 clock cycles.

3. The dynamic stepless derotation system based on high-level synthesis for large-scale integrated circuits according to claim 1, wherein: the system further improves the performance of the derotation algorithm through data type optimization, i.e. user-defined bit-width data types, and through data throughput optimization; and multiple AXI high-bandwidth buses are used for real-time parallel optimization, with multiple pixels read, written, and processed simultaneously by parallel computation.

4. The dynamic stepless derotation system based on high-level synthesis for large-scale integrated circuits according to claim 1, wherein: the dynamic stepless derotation module performs real-time derotation using an electronic image derotation algorithm based on bilinear interpolation, which comprises the following steps:

(1) According to the derotation angle sent by the host computer, solve, for each pixel (x', y') of the derotated video image, the coordinates (x, y) of the corresponding pixel in the video image before derotation

where θ is the derotation angle, and x₀ and y₀ are the horizontal and vertical coordinates of the image center, respectively;

(2) Pixel mapping using bilinear interpolation

f(x, y) = [f(1, 0) - f(0, 0)]x + [f(0, 1) - f(0, 0)]y + [f(1, 1) + f(0, 0) - f(1, 0) - f(0, 1)]xy + f(0, 0)

where x and y are the integer coordinates obtained by rounding the derotated pixel coordinates from step (1); f(0, 0), f(1, 0), f(0, 1), and f(1, 1) are the gray values of the four pixels surrounding (x, y); and f(x, y) is the gray value obtained by bilinear interpolation at (x, y);

(3) Determine the boundary of the derotated image: the size of the rotated image generally differs from that before rotation, so the boundary of the video image must be determined again; the upper, lower, left, and right boundary positions of the video image are calculated as follows:

left = max(x₁, x₂, x₃, x₄)

right = min(x₁, x₂, x₃, x₄)

top = max(y₁, y₂, y₃, y₄)

bottom = min(y₁, y₂, y₃, y₄)

(4) Fix the image resolution: crop the derotated video image about the center of the video image so that the output resolution is fixed, i.e. the output image keeps the same size.
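The coordinate formula referred to in step (1) does not appear in this text. As an assumption only, a standard inverse rotation about the image center that is consistent with the symbols defined in step (1) can be written as follows; the sign of θ depends on the rotation convention, which is not stated here.

```latex
\begin{aligned}
x &= (x' - x_0)\cos\theta + (y' - y_0)\sin\theta + x_0,\\
y &= -(x' - x_0)\sin\theta + (y' - y_0)\cos\theta + y_0.
\end{aligned}
```

For θ = 0 this reduces to x = x' and y = y', so a zero derotation angle leaves the image unchanged.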

5. The dynamic stepless derotation system based on high-level synthesis for large-scale integrated circuits according to claim 1, wherein: the heterogeneous system-on-chip with an FPGA+ARM architecture used as the core processing module is a Zynq UltraScale+ MPSoC 15EG chip.

6. A dynamic stepless derotation method based on high-level synthesis for large-scale integrated circuits, characterized by comprising the following steps:

(1) Converting the serial video acquired by a camera into parallel video data, obtaining a set of explicit video synchronization signals, and sending the decoded parallel video data and synchronization signals to an FPGA;

(2) In the FPGA, converting the video data, through a video-to-AXI-bus-video-stream module, into AXI bus video stream data, which has lower delay and is better suited to data synchronization and pipeline acceleration;

(3) Passing the data in AXI bus video stream format into a four-in-one module; because the subsequent derotation uses bilinear interpolation, processing each pixel requires reading its four adjacent pixels from the DDR; the four-in-one module caches the data stream in on-chip memory each time two lines have flowed in and merges the four adjacent 8-bit pixels around each pixel into one 32-bit word; when the four pixels adjacent to a given pixel must be read subsequently, the merged 32-bit word is read once and split into four independent 8-bit values, so that four pixels are read in a single access; this processing makes full use of the AXI bus bandwidth to reduce the delay to one quarter of the original;

(4) Caching the merged 32-bit video stream data into the DDR of an ARM through an AXI video stream DDR read-write module;

(5) Performing, with the dynamic stepless derotation module, dynamic stepless derotation on the video data cached in the DDR according to a derotation command and a derotation angle sent by a host computer through an RS422 serial communication module; during derotation the module works together with the four-in-one module, splitting each 32-bit word read from the DDR into four 8-bit values for bilinear interpolation, and the processed video image is stored back in the DDR;

(6) Reading the cached derotated video image from the DDR into an AXI video stream using the AXI video stream DDR read-write module, converting the AXI video stream into parallel video data with explicit synchronization signals using the AXI-bus-video-stream-to-video module, and sending the parallel video data to the video encoding module to be encoded and output to a display or capture card for real-time display;

in steps (3) and (5), the four-in-one module and the dynamic stepless derotation module are developed with the high-level synthesis tool Vivado HLS, and the algorithm is pipelined using the pipeline precompiler directive, provided that the program satisfies the requirement that each data item is input once, used once, and output once, so that data that originally required 8 clock cycles to process is processed in 4 clock cycles; in addition, the performance of the derotation algorithm is improved through data type optimization and data throughput optimization; and multiple AXI high-bandwidth buses are used for real-time parallel optimization, with multiple pixels read, written, and processed simultaneously by parallel computation.