WO2023184754A1 - Configurable real-time disparity point cloud computing device and method - Google Patents

Configurable real-time disparity point cloud computing device and method (可配置实时视差点云计算装置及方法)

Info

Publication number
WO2023184754A1
WO2023184754A1 (PCT/CN2022/101751)
Authority
WO
WIPO (PCT)
Prior art keywords
array
module
data
image
computing resources
Prior art date
Application number
PCT/CN2022/101751
Other languages
English (en)
French (fr)
Inventor
孟照腾
蒿杰
胡文庆
孙亚强
舒琳
历宁
范秋香
Original Assignee
中国科学院自动化研究所
广东人工智能与先进计算研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院自动化研究所, 广东人工智能与先进计算研究院 filed Critical 中国科学院自动化研究所
Publication of WO2023184754A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/60Memory management
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the field of microelectronics technology, and in particular to a configurable real-time parallax point cloud computing device and method.
  • CPU and GPU have good programmability, can adapt to different matching parameters to the greatest extent, and can meet stereo matching tasks in different scenarios; however, their real-time performance is poor and cannot satisfy applications with high real-time requirements.
  • ASIC has high energy efficiency and real-time performance, but its flexibility is poor and it cannot adapt to different matching parameters.
  • FPGA can effectively accelerate computing-intensive tasks, but the existing technology can only adapt to different matching parameters through reconfiguration, and the time cost of redesigning and reconfiguring is large.
  • to address the problems of the prior art, this application provides a configurable real-time disparity point cloud computing device and method.
  • in a first aspect, this application provides a configurable real-time disparity point cloud computing device, including:
  • an image cache unit, a cache controller, a processing element (PE) array, a result shaping module, a minimum value search module and a configuration analysis module;
  • the image cache unit is connected to the cache controller and is used to reshape the cached binocular image data, under the control of the cache controller, according to the specified window size and sliding-window order, and then output the image window data to the cache controller;
  • the cache controller is connected to the configuration analysis module and the PE array respectively, and is used to control the image cache unit to output image window data according to the control signal transmitted by the configuration analysis module; the image window data is distributed by the cache controller to the PEs in the PE array;
  • the PE array is connected to the configuration parsing module and the result shaping module respectively, and is used to generate a number of PUs of a specified structure according to the control signal transmitted by the configuration parsing module, to process the input image window data based on these PUs, and to output the obtained SAD matching cost calculation results to the result shaping module;
  • the result shaping module is connected to the configuration parsing module and the minimum value search module respectively, and is used to add fields to the input SAD matching cost calculation results according to the control signal transmitted by the configuration parsing module and then output them to the minimum value search module;
  • the minimum value search module is connected to the configuration analysis module, and is used to search step by step for the minimum of the input SAD matching cost calculation results according to the control signal transmitted by the configuration analysis module and the minimum value search algorithm, and to output the disparity value corresponding to the minimum matching cost;
  • the configuration parsing module is used to parse the received configuration information, generate corresponding control signals and input them to the cache controller, the PE array, the result shaping module and the minimum value search module respectively.
  • the PEs in the PE array are interconnected up, down, left and right, transferring intermediate results in the vertical direction, and transferring operands and matching costs in the horizontal direction.
  • the PE array includes one or more of the following types of PEs:
  • Ultra PE is used to calculate the absolute value of the difference between two operands during the SAD matching cost calculation process, as well as the accumulation operation of partial sums;
  • Standard PE is used to calculate the absolute value of the difference between two operands during the SAD matching cost calculation process, as well as the accumulation operation of partial sums;
  • Lite PE is used to perform the operation of calculating the absolute value of the difference between two operands during the SAD matching cost calculation process
  • the computing resources corresponding to the Ultra PE are larger than those of the Standard PE.
  • each column in the PE array can be configured as one or more PUs, and the PUs are used to perform SAD matching cost calculation operations with a specified window size.
  • the PE in the first row is the Ultra PE or the Standard PE.
  • this application also provides a configurable real-time disparity point cloud computing method, including:
  • the configuration parsing module parses the received configuration information, generates corresponding control signals, and inputs them to the cache controller, the PE array, the result shaping module and the minimum value search module respectively;
  • the cache controller, according to the control signal transmitted by the configuration analysis module, controls the image cache unit to output image window data corresponding to one or more channels of binocular image data in the specified window size and sliding-window order;
  • the PE array generates a number of PUs of a specified structure according to the control signal transmitted by the configuration analysis module, and processes the input image window data based on these PUs to obtain the SAD matching cost calculation results corresponding to the image window data;
  • the result shaping module adds a field to the result of the SAD matching cost calculation output by the PE array according to the control signal transmitted by the configuration analysis module;
  • the minimum value search module searches for the minimum value step by step on the SAD matching cost calculation results after adding fields according to the control signal and the minimum value search algorithm transmitted by the configuration analysis module, and outputs the disparity value corresponding to the minimum matching cost.
  • the configuration information includes:
  • Image resolution, matching window size, disparity search depth, number of binocular image data channels, and PE working mode.
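  • For illustration only, the configuration fields listed above could be grouped as follows before being parsed into control signals; this is a minimal sketch, and the class and field names (MatchConfig, pe_mode, etc.) are hypothetical rather than taken from the patent:

```python
from dataclasses import dataclass

@dataclass
class MatchConfig:
    """Hypothetical container for the configuration information listed above."""
    image_width: int       # e.g. 640, 1280 or 1920
    image_height: int      # e.g. 480, 720 or 1080
    window_size: int       # matching window side length (3, 5, ..., 2n+1)
    disparity_depth: int   # disparity search depth
    num_streams: int       # number of binocular image data channels
    pe_mode: str           # PE working mode, e.g. "ultra" / "standard" / "lite"

cfg = MatchConfig(image_width=640, image_height=480, window_size=3,
                  disparity_depth=64, num_streams=1, pe_mode="standard")
```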
  • the configuration information is determined by:
  • Allocate computing resources according to the number of unit computing resources that can be allocated in the PE array, the disparity search depth corresponding to each data stream, and the video frame rate corresponding to each data stream;
  • in the case where there are two data streams, allocating computing resources according to the number of allocatable unit computing resources in the PE array, the disparity search depth corresponding to each data stream, and the video frame rate corresponding to each data stream includes:
  • determining the computing resources remaining allocatable in the PE array after one unit computing resource has been allocated to each data stream;
  • if the remaining allocatable computing resources can provide at least one unit computing resource for each data stream and the first condition is met, continuing to allocate one unit computing resource to each data stream;
  • if the remaining allocatable computing resources can provide at least one unit computing resource for each data stream but the first condition is not met, allocating a unit computing resource to a single data stream according to the magnitude relationship between a first value and a second value;
  • wherein the first value is determined based on the disparity search depth and video frame rate corresponding to each data stream, the second value is determined based on the number of unit computing resources currently allocated to each data stream, and the first condition is determined based on the first value, the second value and a preset threshold.
  • the method also includes:
  • if the remaining allocatable computing resources can only provide unit computing resources for a target data stream, then all the remaining allocatable computing resources are allocated to the target data stream.
  • the configurable real-time disparity point cloud computing device and method provided by this application adapt to different matching parameters through the configuration parsing module without reconfiguring the FPGA; the SAD matching cost is computed by the parallel pipelined PE array structure, meeting high real-time requirements, so that adaptation to different matching parameters and high real-time performance are ensured at the same time.
  • Figure 1 is an overall block diagram of the configurable real-time parallax point cloud generation system based on FPGA provided by this application;
  • Figure 2 is a schematic structural diagram of a configurable real-time parallax point cloud computing device provided by this application;
  • Figure 3 is a schematic diagram of the resolution compatibility method of the image cache unit provided by this application.
  • FIG. 4 is a schematic diagram of the definitions of PSAD and PSUM provided by this application.
  • FIG. 5 is a schematic diagram of the SAD matching cost calculation pipeline design provided by this application.
  • Figure 6 is a schematic diagram of the fully parallel pipelined SAD computing array provided by this application.
  • FIG. 7 is the internal structure diagram of Ultra PE provided by this application.
  • Figure 8 is a schematic flow chart of the configurable real-time disparity point cloud computing method provided by this application.
  • Figure 9 is a schematic diagram of the resource-aware configuration generation process provided by this application.
  • stereo matching is a key link in binocular stereo vision.
  • the stereo matching algorithm searches for corresponding points in the left and right images based on the similarity of pixel information to determine the disparity. By searching for corresponding points in the pixels of the entire image, a disparity point cloud of the entire image can be generated, which can then be used for tasks such as ranging or three-dimensional reconstruction.
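  • (The patent does not state the depth formula itself; for background, the standard rectified-stereo relation that turns such a disparity into a range value is sketched below, with focal length f in pixels and baseline B in metres as assumed inputs.)

```python
def depth_from_disparity(disparity_px: float, focal_px: float, baseline_m: float) -> float:
    """Standard pinhole-stereo relation Z = f * B / d (background only, not part of the patent text)."""
    if disparity_px <= 0:
        return float("inf")   # zero disparity corresponds to a point at infinity
    return focal_px * baseline_m / disparity_px
```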
  • stereo matching algorithms can be divided into local matching algorithms, global matching algorithms and semi-global matching algorithms. Due to its unique real-time characteristics, the local matching algorithm is widely used in high real-time applications.
  • Stereo matching algorithms can be deployed on different platforms such as CPU, GPU, FPGA and ASIC.
  • CPU and GPU have good programmability and can adapt to different matching parameters (such as matching window size, disparity search depth, image resolution, etc.) to the greatest extent, and can meet stereo matching tasks in different scenes, but their real-time performance is poor and cannot meet high real-time application requirements.
  • ASIC has high energy efficiency and real-time performance, but its flexibility is poor and cannot adapt to different matching parameters.
  • FPGA can effectively accelerate computing-intensive tasks, and can adapt to different matching parameters through reconstruction. It can achieve a compromise between real-time performance and flexibility, and has become a mainstream solution for stereo matching acceleration.
  • the core idea of this application is to adapt to different matching parameters through the configuration parsing module without reconfiguring the FPGA, and to complete the SAD matching cost calculation through the pipelined PE array structure to ensure high real-time performance.
  • FIG 1 is an overall block diagram of the configurable real-time parallax point cloud generation system based on FPGA provided by this application.
  • the image data collected by the left and right camera lenses and acquisition chips is transmitted into the FPGA chip through high-speed interfaces such as the Mobile Industry Processor Interface (MIPI), Universal Serial Bus (USB), High Definition Multimedia Interface (HDMI) or DisplayPort (DP); after passing through the basic image processing module, it enters the image distortion correction module for imaging distortion correction; under the scheduling of the cache controller, the corrected pixel data is cached in the external memory in frame units in a ping-pong manner; the image data cached in the external memory is then sent to the configurable disparity point cloud computing module for the stereo matching calculation, which uses a local matching algorithm with SAD as the metric of the matching cost.
  • the obtained disparity data is output through a high-speed interface such as Peripheral Component Interconnect Express (PCIe).
  • the main control unit (such as the built-in CPU on the FPGA) generates configuration information and sends the configuration information to each configurable module to realize the configuration of matching parameters.
  • the camera and acquisition chip can adjust different resolutions and frame rates through register configuration.
  • the basic image processing module includes demosaicing, grayscale correction, image color format conversion, resolution cropping, etc.; cropping from a high resolution to a lower resolution can be achieved by configuring registers;
  • the image distortion correction module can achieve compatibility with different resolutions through register configuration.
  • the configurable real-time parallax point cloud generation system based on FPGA provided by this application can realize the configuration of different resolutions, matching window widths, and parallax search depths by configuring registers, and can support real-time processing of multiple sets of binocular data.
  • FIG. 2 is a schematic structural diagram of a configurable real-time parallax point cloud computing device provided by this application. As can be seen from Figure 2, the device can be applied to a configurable real-time parallax point cloud generation system based on FPGA.
  • the device includes an image cache unit 200, a cache controller 210, a processing element (PE) array 220, a result shaping module 230, a minimum value search module 240 and a configuration analysis module 250.
  • the image cache unit 200 is connected to the cache controller 210 and is used to reshape the cached binocular image data according to the specified window size and sliding-window order under the control of the cache controller 210 and then output the image window data to the cache controller 210;
  • the cache controller 210 is connected to the configuration parsing module 250 and the PE array 220 respectively, and is used to control the image cache unit 200 to output image window data according to the control signal transmitted by the configuration parsing module 250.
  • the image window data is distributed to the PE array through the cache controller 210.
  • the PE array 220 is connected to the configuration parsing module 250 and the result shaping module 230 respectively, and is used to generate a number of algorithm processing units (Processing Units, PUs) of a specified structure according to the control signal transmitted by the configuration parsing module 250, to process the input image window data based on these PUs, and to output the obtained SAD matching cost calculation results to the result shaping module 230;
  • the PE array adopts a mesh topology, and each PE is interconnected with the four PEs above, below, to its left and to its right;
  • a PU of the specified structure refers to an algorithm processing unit consisting of one column of several rows of PEs, used to complete the SAD matching cost calculation for a specified window and obtain the calculation result; the number of rows can be determined according to the image window size in the configuration information; for example, if the image window size is 3×3, a PU contains 1 column and 3 rows, i.e. 3 PEs in total;
  • SAD matching cost calculation refers to the cumulative sum of the absolute values of the differences between the corresponding pixel values of two windows (such as the left and right eye image windows).
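  • A minimal reference sketch of this SAD matching cost definition (plain Python, windows given as equal-size nested lists; purely illustrative):

```python
def sad_cost(left_window, right_window):
    """Cumulative sum of absolute differences between corresponding pixels of two windows."""
    return sum(abs(int(l) - int(r))
               for lrow, rrow in zip(left_window, right_window)
               for l, r in zip(lrow, rrow))

# |1-2| + |2-2| + |3-1| + |4-4| = 3
assert sad_cost([[1, 2], [3, 4]], [[2, 2], [1, 4]]) == 3
```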
  • the result shaping module 230 is connected to the configuration analysis module 250 and the minimum value search module 240 respectively, and is used to add data fields to the input SAD matching cost calculation results according to the control signal transmitted by the configuration analysis module 250 and then output it to the minimum value search module 240 ;
  • the minimum search module 240 is connected to the configuration parsing module 250, and is used to search for the minimum value of the input SAD matching cost calculation results step by step based on the minimum search tree, and output the disparity value corresponding to the minimum matching cost;
  • the configuration parsing module 250 is used to parse the received configuration information, generate corresponding control signals and input them to the cache controller, PE array, result shaping module and minimum search module respectively.
  • the storage size of the image cache unit 200 is 48.64 KB, comprising 38 banks; each bank consists of a dual-port block random access memory (Block Random Access Memory, BRAM) of 1280 bytes and can store 1280 bytes of pixel data; for example, it can store one row of pixel data of each of the left and right images at 480p resolution (i.e. 640×480), 1/2 row of each of the left and right images at 720p resolution (i.e. 1280×720), or 1/3 row of each of the left and right images at 1080p resolution (i.e. 1920×1080); those skilled in the art should understand, however, that the storage size, number of banks and bank size of the image cache unit 200 are not restrictive and can be adjusted flexibly as needed.
  • FIG. 3 is a schematic diagram of the resolution compatibility method of the image cache unit provided by this application; as can be seen from the figure, the image cache unit 200 can perform a reshaping operation for images of different resolutions, i.e. the pixel data is read out, or the pixels are rearranged, according to a fixed window size and sliding-window order.
  • the matching window size after reshaping is 2n+1 (n∈N*, n≤9) (note: at 1920×1080 resolution, the matching window size after reshaping is 2n+1 (n∈N*, n≤6)).
  • for a single image, each line of the reshaping buffer can cache 640 pixels, so it can perform the reshaping operation for an image whose rows contain 640·n pixels (n a positive integer).
  • images are cached line by line starting from Bank0 in order of pixel coordinates; if the input image has 480p resolution, each row of pixels occupies one bank, and data can be read from each bank simultaneously to complete the reshaping; if the input image has 720p resolution, each row of pixels occupies two banks, and the reshaping is completed by simultaneously reading data from the (2n-1)-th banks (n a positive integer); if the input image has 1080p resolution, each row of pixels occupies three banks, and the reshaping is completed by simultaneously reading data from the (3n-1)-th banks (n a positive integer); when all banks are full, the input data overwrites the historical data starting from Bank0 and caching continues line by line.
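  • A small sketch of the line-to-bank arithmetic described above, assuming one reshaping-buffer line holds 640 pixels and there are 38 banks; the function names are illustrative:

```python
PIXELS_PER_BANK_LINE = 640   # each line of the reshaping buffer caches 640 pixels (see above)
NUM_BANKS = 38

def banks_per_image_row(row_width_px: int) -> int:
    """One 480p row -> 1 bank, one 720p row -> 2 banks, one 1080p row -> 3 banks."""
    return -(-row_width_px // PIXELS_PER_BANK_LINE)   # ceiling division

def bank_holding_pixel(row: int, col: int, row_width_px: int) -> int:
    """Rows are cached line by line from Bank0, wrapping and overwriting once all banks are full."""
    per_row = banks_per_image_row(row_width_px)
    return (row * per_row + col // PIXELS_PER_BANK_LINE) % NUM_BANKS

print(banks_per_image_row(640), banks_per_image_row(1280), banks_per_image_row(1920))  # 1 2 3
```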
  • the cache controller 210 can, according to the control signals transmitted by the configuration analysis module 250, generate the read/write control signals of the image cache unit and manage the reading and writing of data in the BRAM (for example, reading from the external memory and writing into the BRAM, or reading from the BRAM and distributing to the PE array).
  • the PE array 220 completes SAD-based matching cost calculation.
  • Each PE in the PE array can be configured in different working modes according to the control signal transmitted by the configuration parsing module 250 .
  • each PE has a different computing task in different working modes (different window widths, disparity search depths, resolutions and numbers of video streams) and requires different FPGA resources; according to all the computing parameters the architecture can support, a configuration space is constructed for each PE, and three types of PE structures are designed based on this configuration space: Ultra PE, Standard PE and Lite PE.
  • the three types of PE are distributed in different rows of the array according to the different computing tasks: the more complex calculations are completed by Ultra PEs, then Standard PEs, and the simple calculations are completed by Lite PEs, which saves FPGA resource consumption to the greatest extent.
  • at the same time, each PE in the PE array is interconnected with its neighbours above, below, left and right, so that operands and matching costs can be transferred in the horizontal direction and intermediate results in the vertical direction, realizing a multi-stage pipeline design of the SAD matching cost calculation and accelerating it; each column in the PE array can be configured to include one or more PUs, to implement matching cost calculations at different search depths.
  • the result shaping module 230 adds fields to the SAD calculation results generated by the Ultra PE and Standard PE rows.
  • the SAD matching cost calculation results generated by the Ultra PE and Standard PE rows are transmitted to the result shaping module through the interconnection data channels in the horizontal direction; combining the control signal transmitted by the configuration analysis module, the result shaping module adds a position field to the SAD result at the position of the corresponding candidate disparity (the difference between the horizontal coordinates of the centre points of the left and right windows), i.e. an 8-bit binary code is added before each result to indicate that this SAD value is the matching cost under that candidate disparity, and the matching costs with the added fields are sent to the minimum value search module.
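  • A sketch of the field addition described above: an 8-bit candidate-disparity code is packed in front of each SAD value (the exact bit layout, with the disparity in the high bits, is an assumption for illustration):

```python
SAD_BITS = 16   # assumed width of the SAD value field

def tag_sad_result(candidate_disparity: int, sad_value: int) -> int:
    """Prepend an 8-bit candidate-disparity field to a SAD matching cost."""
    assert 0 <= candidate_disparity < 256
    return (candidate_disparity << SAD_BITS) | (sad_value & ((1 << SAD_BITS) - 1))

def untag(word: int):
    """Recover (candidate_disparity, sad_value) from a tagged word."""
    return word >> SAD_BITS, word & ((1 << SAD_BITS) - 1)
```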
  • the minimum value search module 240 uses the minimum value search tree to search for the minimum value in the matching cost step by step, and outputs the disparity value corresponding to the minimum matching cost.
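  • A behavioural sketch of such a step-by-step (tournament-style) minimum search over tagged (disparity, cost) pairs; the pairing order is an assumption, only the level-by-level reduction is taken from the text:

```python
def min_search_tree(tagged_costs):
    """Reduce (disparity, cost) pairs level by level, keeping the smaller cost of each pair."""
    level = list(tagged_costs)
    while len(level) > 1:
        nxt = [min(level[i], level[i + 1], key=lambda dc: dc[1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:           # an odd element is carried unchanged to the next level
            nxt.append(level[-1])
        level = nxt
    best_disparity, _best_cost = level[0]
    return best_disparity

print(min_search_tree([(0, 310), (1, 285), (2, 402), (3, 333)]))   # -> 1
```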
  • the configuration analysis module 250 receives the configuration information sent by the CPU, including the number of channels, resolution, search depth, matching window size and other information of the video to be processed.
  • the configuration parsing module 250 parses the received configuration information and generates corresponding flag signals for the cache controller 210 to generate read/write addresses and read/write enables, and transmits the control signals to the PE array 220, the result shaping module 230 and the minimum value search module 240.
  • the configuration analysis module 250 realizes adaptation to different matching parameters without reconfiguring the FPGA, and the SAD matching cost is computed through the pipelined PE array structure to ensure real-time performance, thereby overcoming the defect of the existing technology that real-time performance and adaptation to different matching parameters cannot be achieved at the same time.
  • the PEs in the PE array are interconnected up, down, left and right, transferring intermediate results in the vertical direction, and transferring operands and final matching costs in the horizontal direction.
  • each PE in the PE array is interconnected up, down, left, and right, so that the operands and matching costs can be transferred in the horizontal direction, and the intermediate results can be transferred in the vertical direction.
  • the data multiplexing characteristics in the stereo matching pipeline mode are fully utilized to transfer the operands and the final SAD matching cost within the PE array to avoid long-distance data access.
  • in the vertical direction, the transfer and accumulation of partial sums in the calculation of the left-right window matching costs are completed.
  • FIG 4 is a schematic diagram of the definitions of PSAD and PSUM provided by this application.
  • PSAD means to find the absolute difference between the pixel values at the same corresponding position in the same column of the left image and the right image, and add the absolute differences in this column.
  • FIG. 5 is a schematic diagram of the SAD matching cost calculation pipeline design provided by this application.
  • the SAD computation process is first explained: for the two input images (left image and right image), each pixel of the left image (called an anchor point) is scanned in turn, and for each anchor point the following operations are performed: construct a fixed-size matching window (such as 3×3, 5×5, ...) centred on the anchor point and select all pixels in the area covered by the window; cover the corresponding position in the right image with the same window and select all pixels in that covered area; compute the absolute values of the differences between the grayscale values of corresponding pixels in the left and right covered areas and add these absolute values; with a step size of 1, move the covered area in the right image to the left, take out all pixels in the new covered area and compute the SAD value; repeat the previous step until the centre of the covered area in the right image exceeds the disparity search range; find the window corresponding to the minimum SAD value in this range, whose centre point is the corresponding point of the left-image anchor, and the difference between the horizontal coordinates of the anchor point and its corresponding point in the right image is the disparity of that anchor point.
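  • A plain reference sketch of the anchor-point search just described (scan a window, slide the right-image window left by 1 within the search range, keep the disparity of the minimum SAD); border handling is simplified and the helper names are illustrative:

```python
def sad_cost(a, b):
    # same definition as the SAD sketch given earlier
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def disparity_for_anchor(left, right, row, col, win=3, search_range=4):
    """Disparity of the minimum-SAD right-image window for one left-image anchor point."""
    r = win // 2
    def window(img, cr, cc):
        return [img[y][cc - r:cc + r + 1] for y in range(cr - r, cr + r + 1)]
    ref = window(left, row, col)
    best_d, best_cost = 0, float("inf")
    for d in range(search_range):          # move the right-image coverage area leftwards
        if col - d - r < 0:                # stop at the image border
            break
        cost = sad_cost(ref, window(right, row, col - d))
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d
```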
  • Part (a) in Figure 5 shows the disparity search calculation process of an anchor point with a pixel value of 98 (the center point of the 3 ⁇ 3 window) in the left image, where the window size is 3 ⁇ 3 and the search range is 4.
  • during the search, the SAD values between the left window data and windows 1, 2, 3 and 4 in the right image are computed in sequence, and the minimum SAD value is found to obtain the disparity.
  • Part (b) of Figure 5 shows the implementation process of the SAD calculation pipeline in the computing architecture proposed by this application.
  • as an example, the SAD parallelization proceeds as follows: the SAD calculation for two 3×3 windows can be divided into 3 sub-processes, each sub-process being named a PSAD process and its result being defined as a PSUM; the SAD result can then be obtained by adding several PSUMs.
  • the calculation process shown in part (a) of Figure 5 takes one complete SAD calculation as the basic granularity, and this calculation method involves repeated accesses to the same data.
  • in order to achieve efficient pipelined computation, the SAD calculation process is adjusted to the form shown in part (b) of Figure 5.
  • this form takes one PSAD calculation as the granularity and realizes the SAD calculation through the accumulation of PSUMs; specifically, at time t1 the sub-window marked ① in the left image and the 4 sub-windows marked ① in the right image execute 4 PSAD processes in parallel, generating 4 PSUMs, which from right to left are PSUM1_1, PSUM1_2, PSUM1_3 and PSUM1_4 (PSUMn_m denotes the PSAD result of left-image sub-window n with the m-th sub-window, counted from right to left, of the right-image sub-window set n); at time t2 the sub-window marked ② in the left image and the 4 sub-windows marked ② in the right image execute 4 PSAD processes in parallel, generating PSUM2_1, PSUM2_2, PSUM2_3 and PSUM2_4 from right to left, and so on.
  • the SAD process of window 1 shown in part (a) of Figure 5 can be obtained by adding PSUM1_1, PSUM2_1, and PSUM3_1, which is recorded as SUM.
  • This pipeline method can fully exploit the data reusability in the picture on the right.
  • when computing the matching cost of the two windows centred on the value 90 in the left image at disparity 0, the result is obtained as SUM plus PSUM4_1 (the PSAD result of the sub-window marked ④ in the left image of part (b) of Figure 5 with the first sub-window, counted from right to left, of the sub-window set marked ④ in the right image) minus PSUM1_1, which avoids redundant calculations.
  • This calculation method requires local caching of historical PSUM.
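  • A compact sketch of this PSAD/PSUM decomposition: the SAD of a window is the sum of its per-column PSUMs, and the window one pixel further along reuses the previous sum by adding the newly entered column and subtracting the column that left, mirroring part (b) of Figure 5; the function names are illustrative:

```python
def column_psum(left, right, row, col, d, win=3):
    """One PSAD: |left - right| summed over a single column pair at candidate disparity d."""
    r = win // 2
    return sum(abs(left[y][col] - right[y][col - d]) for y in range(row - r, row + r + 1))

def sad_from_psums(left, right, row, col, d, win=3):
    """SAD at (row, col) as the sum of the window's column PSUMs.

    Sliding the anchor right by one pixel reuses this value:
        new_SAD = old_SAD + PSUM(new rightmost column) - PSUM(old leftmost column),
    which is why the historical PSUMs are cached locally.
    """
    r = win // 2
    return sum(column_psum(left, right, row, c, d, win) for c in range(col - r, col + r + 1))
```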
  • Figure 6 is a schematic diagram of a fully parallel pipelined SAD computing array provided by this application.
  • the array is composed of three rows and four columns of PEs and can be configured as four PUs (PU1, PU2, PU3 and PU4), completing the full-disparity parallel computation for a window size of 3×3 and a search range of four.
  • each sub-window in the left image of part (b) of Figure 5 needs to perform a PSAD calculation with each of the four sub-windows in the right image; each PSAD calculation is completed by one column of PEs (i.e. one PU), the number of PEs in a column being determined by the window size, so four such columns are needed to complete this full-disparity parallel computation.
  • data can be transferred between PEs in the up, down, left and right directions.
  • the entire SAD calculation phase is divided into two sub-phases, which are operand filling and pipeline calculation.
  • (1) operand filling phase, clk1~clk4: as shown in part (b) of Figure 5, the cache controller takes out the three values in sub-window ① of the left image and multicasts them to PE00-PE03, PE10-PE13 and PE20-PE23 in Figure 6 respectively (43 is multicast to PE00-PE03, 87 to PE10-PE13 and 34 to PE20-PE23); the cache controller then takes out the data of the four sub-windows marked ① in the right image in sequence and, through the horizontal data transfer paths of the array, sends the four sub-windows to the 12 PEs (88 is sent to PE03, 59 to PE13, 88 to PE23, 1 to PE02, 45 to PE12, 6 to PE22, 42 to PE01, 58 to PE11, 14 to PE21, 69 to PE00, 72 to PE10 and 0 to PE20).
  • (2) pipeline calculation phase, clk5: each PE calculates the difference between the two operands in its registers and takes the absolute value; taking PE00 as an example, it calculates the AD value (26) of its two operands (43 and 69) and registers it.
  • clk6: update the operands in each PE register, calculate the AD values of the two updated operands, and add the AD values generated at clk5 column by column. Specifically, the PE03 operand is updated to 1, the PE13 operand to 45, the PE23 operand to 6, the PE02 operand to 42, the PE12 operand to 58, the PE22 operand to 14, the PE01 operand to 69, the PE11 operand to 72, the PE21 operand to 0, the PE00 operand to 55, the PE10 operand to 80 and the PE20 operand to 87.
  • the four sub-windows marked ② in the right image share reusable values with the four sub-windows marked ① in the right image, so only the values of the rightmost sub-window, i.e. 55, 80 and 87, are taken from the cache buffer (inside the PE structure) and sent to PE00, PE10 and PE20.
  • the operands of the remaining three sub-windows are passed horizontally from the PE on the left to the PE on the right. Taking PE00 as an example, it calculates the AD value (43) of the two operands (98 and 55) in its registers and registers it.
  • the addition of the AD values is completed in two steps: taking the first column as an example, the first step adds the AD values in PE00 and PE10 and produces an intermediate result p, and the second step adds the AD value in PE20 to the intermediate result p.
  • at clk6 the first step of this process is completed: the AD value (15) of PE10 is passed to PE00 and added to the AD value (26) of PE00, and the intermediate result (41) is temporarily stored.
  • the remaining column procedures are the same as the first column.
  • clk7: update the operands in each PE register, calculate the AD values of the updated operands, complete the second step of adding the AD values generated at clk5, and at the same time complete the first step of adding the AD values generated at clk6.
  • the update of the PE operands and the calculation of their AD values are similar to the process above. As for the additions, taking the first column as an example, PE20 passes the AD value (34) of the operands (34 and 0) generated at clk5 to PE00, where it is added to the intermediate result (41) of clk6, giving PSUM1_1 (75).
  • PE10 passes the AD value (18) of the operands (98 and 80) generated at clk6 to PE00, where it is added to the AD value (43) of the operands (98 and 55) generated by PE00 at clk6, producing an intermediate result (61).
  • the remaining columns are the same as the first column.
  • clk8: update the operands in each PE register, calculate the AD values of the updated operands, complete the second step of adding the AD values generated at clk6, and at the same time complete the first step of adding the AD values generated at clk7.
  • the update of the PE operands and the calculation of their AD values are similar to the process above. Taking the first column as an example, PE20 passes the AD value (48) of the operands (39 and 87) generated at clk6 to PE00, where it is added to the intermediate result (61) of clk7, giving PSUM2_1 (109).
  • PE10 passes the AD value (89) of the operands (90 and 1) generated at clk7 to PE00, where it is added to the AD value (12) of the operands (44 and 56) generated by PE00 at clk7, producing an intermediate result (101).
  • the remaining columns are the same as the first column.
  • clk9: complete the second step of adding the AD values generated at clk7, and at the same time add PSUM1_1~PSUM1_4 to PSUM2_1~PSUM2_4 respectively. Specifically, taking the first column as an example, PE20 passes the AD value (0) of the operands (45 and 45) of clk7 to PE00, where it is added to the intermediate result (101) of clk8 to obtain PSUM3_1 (101).
  • in PE00, PSUM1_1 (75) and PSUM2_1 (109) are added to obtain the intermediate result q (184).
  • the remaining columns are the same as the first column.
  • clk10: complete the addition of the intermediate result q and PSUM3_1 to produce the final SUM value. Specifically, taking the first column as an example, in PE00 the addition of PSUM3_1 (101) and the intermediate result q (184) is completed to obtain the final SUM value (285). The remaining columns are the same as the first column.
  • the PSUM values are transferred vertically upwards between PEs, while the right-image data and the SAD calculation results generated by the Ultra PE and Standard PE rows are transferred horizontally between PEs, thereby speeding up the SAD matching cost calculation and reducing the time wasted on repeatedly reading data.
  • the PE array includes one or more of the following types of PE:
  • Ultra PE is used to calculate the absolute value of the difference between two operands during the SAD matching cost calculation process, as well as the accumulation operation of partial sums;
  • Standard PE is used to calculate the absolute value of the difference between two operands during the SAD matching cost calculation process, as well as the accumulation operation of partial sums;
  • Lite PE is used to perform the operation of calculating the absolute value of the difference between two operands during the SAD matching cost calculation process
  • the computing resources corresponding to Ultra PE are larger than those of Standard PE.
  • specifically, when the PE array is laid out it includes one or more of the three types of PE: the PE array can be built entirely from Ultra PEs, it can be built from Ultra PEs and Standard PEs, or it can be built from Ultra PEs, Standard PEs and Lite PEs together.
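  • A trivial sketch of the capability split between the three PE variants described above (only the presence or absence of partial-sum accumulation is modelled; everything else is omitted):

```python
from enum import Enum

class PEType(Enum):
    ULTRA = "ultra"        # AD computation + partial-sum accumulation, most computing resources
    STANDARD = "standard"  # AD computation + partial-sum accumulation, fewer resources
    LITE = "lite"          # AD computation only

def can_accumulate(pe: PEType) -> bool:
    """Only Ultra and Standard PEs carry the adders needed to accumulate partial sums."""
    return pe in (PEType.ULTRA, PEType.STANDARD)
```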
  • FIG. 7 is the internal structure diagram of Ultra PE provided by this application.
  • Ultra PE internally includes the following components: AD value accumulation units 701 to 705, which complete the accumulation of the intermediate results (AD values) transferred from PEs in the vertical direction and generate PSUMs that are passed to other PEs through the data interconnect; an AD calculation unit 706, which computes the difference between two operands and takes its absolute value (hereinafter the AD value); a PSUM cache unit 707, which provides local temporary caching of historical PSUMs during time-division multiplexing; a PSUM accumulation unit 708; a matching cost cache unit 709; and an operand cache unit 710, which temporarily stores operands during time-division multiplexing.
  • Standard PE includes AD value accumulation units 701 to 703 in Figure 7; AD calculation unit 706; PSUM cache unit 707; PSUM accumulation unit 708; matching cost cache unit 709; operand cache unit 710.
  • Lite PE only includes the AD calculation unit 706 and the operand cache unit 710 of Figure 7.
  • the AD value accumulation units 701 to 705 include an adder, two registers (A and B), a MUX and a DEMUX.
  • operand 1 of the adder comes from the vertical data bus and is transferred from other PEs; operand 2 is the historical accumulation result or 0; when the accumulation of one PSAD is completed, operand 2 is set to zero so as to proceed with the next PSAD calculation.
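  • A behavioural sketch of one AD-value accumulation unit as described above (operand 1 from the vertical bus, operand 2 either the running sum or 0, cleared when a PSAD completes); the class and signal names are illustrative, and the example values reuse the clk-by-clk walkthrough above:

```python
class ADAccumUnit:
    """Behavioural model of one AD-value accumulation unit (names are illustrative)."""
    def __init__(self):
        self.acc = 0                       # running partial sum (registers A/B in Figure 7)

    def step(self, ad_from_below: int, psad_done: bool) -> int:
        """Add the AD value arriving on the vertical bus; clear the sum after a PSAD completes."""
        self.acc += ad_from_below          # operand 1: vertical bus; operand 2: history (or 0)
        psum = self.acc
        if psad_done:
            self.acc = 0                   # operand 2 is set to zero for the next PSAD
        return psum

u = ADAccumUnit()
for ad, done in [(26, False), (15, False), (34, True)]:   # AD values of the first column
    psum = u.step(ad, done)
print(psum)   # 75 == PSUM1_1 in the pipeline example above
```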
  • in order to perform as many disparity calculations in parallel as possible and to be compatible with different matching window sizes and disparity search depths, each column of the PE array can be configured as one or more PUs of a certain size; for example, if each column of the PE array has 10 PEs, it can be configured as three PUs that each calculate a 3×3 window matching cost and complete three SAD matching cost calculations in parallel, as two PUs that each calculate a 5×5 window matching cost and complete two SAD matching cost calculations in parallel, or as one PU that calculates a 9×9 window matching cost and completes one SAD matching cost calculation.
  • optionally, each column of the PE array has 19 PEs, and these PEs can be configured in the 22 combinations of computing units shown in Table 1; compatibility with different window sizes is achieved through these configuration combinations. It should be noted that the numbers in the table indicate how many computing units capable of calculating the left-right window matching cost of the corresponding window size can be configured; taking combination 1 as an example, each column of PEs can be configured as 5 computing units that each calculate the matching cost of 3×3 left and right windows.
  • for example, each column of the PE array has 19 PEs and there are 25 columns in total; each column of PEs can be configured as 5 PUs for calculating the matching cost of left and right windows of size 3×3, so 25 columns of PEs can provide parallel processing power for 125 3×3 windows.
  • when the disparity search depth is less than or equal to 125, the array can perform full-disparity parallel computation; when the disparity search depth is greater than 125, time-division multiplexing is used for the computation.
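  • The parallelism arithmetic behind these statements, as a small sketch (the 5-PUs-per-column and 125-window figures are taken from the text; the helper simply computes the number of time-multiplexing rounds implied by a given search depth):

```python
PARALLEL_3x3_PUS = 125   # 25 columns x 5 PUs per 19-PE column, as stated above

def multiplex_rounds(search_depth: int, parallel_pus: int = PARALLEL_3x3_PUS) -> int:
    """Full-disparity parallel if the depth fits; otherwise the array is time-division multiplexed."""
    return -(-search_depth // parallel_pus)   # ceiling division

print(multiplex_rounds(100))   # 1 -> full-disparity parallel computation
print(multiplex_rounds(250))   # 2 -> the array is reused twice per anchor
```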
  • the PE in the first row of this PU is Ultra PE or Standard PE.
  • placing an Ultra PE or a Standard PE at the top of the PU achieves the advantages of low resource consumption and low device power consumption.
  • Step 804: the minimum value search module, according to the control signal and the minimum value search algorithm passed by the configuration analysis module, searches step by step for the minimum among the SAD matching cost calculation results with the added fields, and outputs the disparity value corresponding to the minimum matching cost.
  • computing resources are allocated according to the number of allocatable unit computing resources in the PE array, the disparity search depth corresponding to each data stream and the frame rate requirement corresponding to each data stream; for example, after resources from the configurable resource space are allocated to the data streams to be served, the resources allocated to each data stream are dynamically adjusted according to the difference between the proportion of computing resources already allocated to a data stream and the proportion of computing resources it requires, where the computing resources required by a data stream are characterized by its disparity search depth and video frame rate.
  • the above method can allocate computing resources to each data flow based on the availability of array computing resources and the index requirements of each data flow, which can achieve the effect of making full use of computing resources and balancing the performance indicators of different data flows.
  • the resources allocated to each data flow can be dynamically adjusted, thereby achieving better resource allocation and high utilization.
  • if the remaining allocatable computing resources can only provide unit computing resources for a target data stream, then all the remaining allocatable computing resources are allocated to the target data stream.
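  • A hedged sketch of this resource-aware allocation for two data streams A and B, following the description above: the requirement of a stream is characterized by disparity search depth × frame rate, units are granted to both streams while the allocated-unit ratio stays within a threshold δ of the requirement ratio, otherwise the lagging stream is served alone, and when only one stream can still be served it receives everything that is left. The exact ratio test and the choice of the target stream in the last branch are assumptions:

```python
def allocate_units(total_units: int, depth_a: float, fps_a: float,
                   depth_b: float, fps_b: float, delta: float = 0.2):
    """Resource-aware allocation sketch for two binocular data streams A and B.

    Assumed first condition: |I_A/I_B - (D_A*F_A)/(D_B*F_B)| <= delta, where I is the number
    of unit computing resources already granted, D the disparity search depth and F the frame
    rate required by each stream; delta is a small empirical threshold."""
    need = (depth_a * fps_a) / (depth_b * fps_b)   # first value: required-resource ratio
    ia, ib = 1, 1                                  # start by granting one unit to each stream
    remaining = total_units - 2
    while remaining > 0:
        if remaining >= 2:                         # both streams can still receive a unit
            if abs(ia / ib - need) <= delta:       # first condition met: grant one to each
                ia, ib, remaining = ia + 1, ib + 1, remaining - 2
            elif ia / ib < need:                   # stream A is lagging behind its requirement
                ia, remaining = ia + 1, remaining - 1
            else:                                  # stream B is lagging
                ib, remaining = ib + 1, remaining - 1
        else:                                      # only one stream can still be served:
            ia, remaining = ia + remaining, 0      # hand it everything left (simplified choice)
    return ia, ib

print(allocate_units(125, depth_a=128, fps_a=30, depth_b=64, fps_b=30))
```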

Abstract

This application provides a configurable real-time disparity point cloud computing device and method, including an image cache unit, a cache controller, a PE array, a result shaping module, a minimum value search module and a configuration parsing module. The image cache unit outputs image window data according to a specified window size and sliding-window order; the cache controller controls the image cache unit to output the image window data and distributes it to the PEs in the PE array; the PE array generates a number of PUs of a specified structure and obtains SAD matching cost calculation results; the result shaping module adds data fields to the matching costs; the minimum value search module searches the matching costs step by step for the minimum to obtain the disparity value; the configuration parsing module parses the received configuration information and generates corresponding control signals that are input to the other modules. In this way, disparity point cloud computation is performed in real time, and the matching parameters are configurable without reconfiguration.

Description

可配置实时视差点云计算装置及方法
相关申请的交叉引用
本申请要求于2022年04月01日提交的申请号为202210348784.1,发明名称为“可配置实时视差点云计算装置及方法”的中国专利申请的优先权,其通过引用方式全部并入本文。
技术领域
本申请涉及微电子技术领域,尤其涉及一种可配置实时视差点云计算装置及方法。
背景技术
立体匹配是双目立体视觉中的关键环节,立体匹配算法根据像素信息相似性来搜索左右图的对应点,从而确定视差。通过对全图的像素点进行对应点搜索,可以生成整张图的视差点云,进而用于测距或三维重建等任务。立体匹配算法可以部署在中央处理器(Central Processing Unit,CPU)、图形处理器(Graphic Processing Unit,GPU)、现场可编程门阵列(Field-Programmable Gate Array,FPGA)、专用集成电路(Application Specific Integrated Circuit,ASIC)等不同的平台上。
CPU和GPU具有较好的可编程性,可以最大程度地适配不同的匹配参数,能够满足不同场景的立体匹配任务,但其实时性差,无法满足高实时性的应用需求;ASIC具有较高的能效与实时性,但其灵活性较差,无法适应不同的匹配参数。FPGA可以有效地加速计算密集型任务,但现有技术只能通过重构来适配不同的匹配参数,重构设计的时间成本较大。
发明内容
针对现有技术存在的问题,本申请提供一种可配置实时视差点云计算装置和方法。
第一方面,本申请提供一种可配置实时视差点云计算装置,包括:
图像缓存单元、缓存控制器、处理单元PE阵列、结果整形模块、最小 值搜索模块以及配置解析模块;
其中,所述图像缓存单元与所述缓存控制器连接,用于在所述缓存控制器的控制下,按照指定窗口大小和滑窗顺序,对缓存的双目图像数据进行整形后输出图像窗口数据至所述缓存控制器;
所述缓存控制器分别与所述配置解析模块和所述PE阵列连接,用于根据所述配置解析模块传递的控制信号,控制所述图像缓存单元输出图像窗口数据,所述图像窗口数据经所述缓存控制器分发至所述PE阵列中的PE;
所述PE阵列分别与所述配置解析模块和所述结果整形模块连接,用于根据所述配置解析模块传递的控制信号,生成指定结构的若干PU,并基于所述指定结构的若干PU对输入的图像窗口数据进行处理,得到SAD匹配代价计算结果输出至所述结果整形模块;
所述结果整形模块分别与所述配置解析模块和所述最小值搜索模块连接,用于根据所述配置解析模块传递的控制信号,对输入的SAD匹配代价计算结果进行字段添加后输出至所述最小值搜索模块;
所述最小值搜索模块与所述配置解析模块连接,用于根据所述配置解析模块传递的控制信号和最小值搜索算法,对输入的SAD匹配代价计算结果逐级搜索最小值,并输出最小匹配代价对应的视差值;
所述配置解析模块用于对接收到的配置信息进行解析,生成相应的控制信号分别输入至所述缓存控制器、所述PE阵列、所述结果整形模块和所述最小值搜索模块。
可选地,所述PE阵列中的PE采用上下左右互联的方式,在垂直方向上进行中间结果的传递,在水平方向上进行操作数及匹配代价的传递。
可选地,所述PE阵列中包括以下类型PE中的一种或多种:
Ultra PE,用于执行SAD匹配代价计算过程中,对两个操作数作差求绝对值,以及部分和的累加操作;
Standard PE,用于执行SAD匹配代价计算过程中,对两个操作数作差求绝对值,以及部分和的累加操作;
Lite PE,用于执行SAD匹配代价计算过程中,对两个操作数作差求绝对值的操作;
其中,所述Ultra PE对应的计算资源大于所述Standard PE。
可选地,所述PE阵列中的每一列均可配置成一个或者多个PU,所述PU用于执行指定窗口大小的SAD匹配代价计算操作。
可选地,所述PU中,第一行的PE为所述Ultra PE或所述Standard PE。
第二方面,本申请还提供一种可配置实时视差点云计算方法,包括:
所述配置解析模块对接收的配置信息进行解析,生成相应的控制信号,分别输入至所述缓存控制器、所述PE阵列、所述结果整形模块和所述最小值搜索模块;
所述缓存控制器根据所述配置解析模块传递的控制信号,按照指定窗口大小和滑窗顺序,控制所述图像缓存单元输出一路或多路双目图像数据对应的图像窗口数据;
所述PE阵列根据所述配置解析模块传递的控制信号,生成指定结构的若干PU,并基于所述指定结构的若干PU对输入的图像窗口数据进行处理,得到所述图像窗口数据对应的SAD匹配代价计算结果;
所述结果整形模块根据所述配置解析模块传递的控制信号,对所述PE阵列输出的所述SAD匹配代价计算的结果添加字段;
所述最小值搜索模块根据所述配置解析模块传递的控制信号和最小值搜索算法,对添加字段后的SAD匹配代价计算结果逐级搜索最小值,并输出最小匹配代价对应的视差值。
可选地,所述配置信息包括:
图像分辨率、匹配窗口大小、视差搜索深度、双目图像数据的路数以及PE工作模式。
可选地,所述配置信息的确定方式包括:
确定满足单数据流或者多数据流性能指标;
根据所述PE阵列中可分配的单位计算资源个数、每个数据流对应的视差搜索深度以及每个数据流对应的视频帧率分配计算资源;
根据分配的计算资源生成配置信息。
可选地,在数据流数量为两个的情况下,所述根据所述PE阵列中可分配的单位计算资源个数、每个数据流对应的视差搜索深度以及每个数据 流对应的视频帧率分配计算资源,包括:
确定在为每个数据流都分配一个单位计算资源后所述PE阵列中剩余可分配的计算资源;
若所述剩余可分配的计算资源可以为每个数据流提供至少一个单位计算资源,且满足第一条件,则继续为每个数据流都分配一个单位计算资源;
若所述剩余可分配的计算资源可以为每个数据流提供至少一个单位计算资源,但不满足第一条件,则根据第一数值和第二数值之间的大小关系,单独为每个数据流分配单位计算资源;
其中,所述第一数值根据每个数据流对应的视差搜索深度和视频帧率确定,所述第二数值根据当前已分配给每个数据流的单位计算资源个数确定,所述第一条件根据所述第一数值、所述第二数值以及预设阈值确定。
可选地,所述方法还包括:
若所述剩余可分配的计算资源仅可以为目标数据流提供单位计算资源,则将所述剩余可分配的计算资源全部分配给目标数据流。
本申请提供的可配置实时视差点云计算装置及方法,通过配置解析模块实现适配不同的匹配参数,同时无需重构FPGA;通过PE阵列并行流水结构完成SAD匹配代价的计算,满足高实时性要求,保证了适配不同的匹配信息与高实时性的兼顾。
附图说明
为了更清楚地说明本申请或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作一简单地介绍,显而易见地,下面描述中的附图是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1是本申请提供的基于FPGA的可配置实时视差点云生成系统的整体框图;
图2是本申请提供的可配置实时视差点云计算装置的结构示意图;
图3是本申请提供的图像缓存单元对分辨率的兼容方法示意图;
图4是本申请提供的PSAD和PSUM定义示意图;
图5是本申请提供的SAD匹配代价计算流水设计示意图;
图6是本申请提供的全并行流水SAD计算阵列示意图;
图7是本申请提供的Ultra PE内部结构图;
图8是本申请提供的可配置实时视差点云计算方法的流程示意图;
图9是本申请提供的资源感知型配置生成流程示意图。
具体实施方式
为使本申请的目的、技术方案和优点更加清楚,下面将结合本申请中的附图,对本申请中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
双目立体视觉作为一种获取场景深度信息的常用手段,具有良好的可靠性和鲁棒性,广泛应用于移动机器人、自动驾驶、工业自动化及自动监控等领域。立体匹配是双目立体视觉中的关键环节。立体匹配算法根据像素信息相似性来搜索左右图的对应点,从而确定视差。通过对全图的像素点进行对应点搜索,可以生成整张图的视差点云,进而用于测距或三维重建等任务。目前立体匹配算法可以分为局部匹配算法、全局匹配算法和半全局匹配算法。局部匹配算法由于其独特的实时性特性,被广泛应用于高实时性的应用中。
立体匹配算法可以部署在如CPU、GPU、FGPA、ASIC等不同的平台上。CPU和GPU具有较好的可编程性,可以最大程度地适配不同的匹配参数(如匹配窗口大小、视差搜索深度、图像分辨率等),能够满足不同场景的立体匹配任务,但其实时性差,无法满足高实时性的应用需求。ASIC具有较高的能效与实时性,但其灵活性较差,无法适应不同的匹配参数。FPGA可以有效地加速计算密集型任务,并且能够通过重构来适配不同的匹配参数,能够做到实时性和灵活性的折中,已经成为立体匹配加速的主流方案。
目前已有较多的基于FPGA的立体匹配加速平台,实现了诸如差的绝对值之和算法(Sum of Absolute Differences,SAD)、差的平方和算法(Sum of Squared Differences,SSD)、Census Transform等局部匹配算法的部署。 但这些平台只能通过重构设计来适配不同的匹配参数,且重构设计的时间成本较大。因此,本申请提供一种基于FPGA的立体匹配算法解决方案,通过该解决方案可以实现立体匹配任务同时满足实时性、灵活适配不同匹配参数、无需重构的要求。
本申请的核心思想是:通过配置解析模块实现适配不同的匹配参数,同时无需重构FPGA;通过PE阵列流水结构完成SAD匹配代价计算,保证高实时性。
图1为本申请提供的基于FPGA的可配置实时视差点云生成系统的整体框图,从图1中可以看出,左右相机镜头及采集芯片采集到的图像数据,经过高速接口,例如移动产业处理器接口(Mobile Industry Processor Interface,MIPI)、通用串行总线(Universal Serial Bus,USB)、高清多媒体接口(High Definition Multimedia Interface,HDMI)、显示接口(DisplayPort,DP)等传输进FPGA芯片;经过基础图像处理模块之后进入图像畸变校正模块进行成像畸变校正;输出的校正后的像素数据在缓存控制器的调度下,以乒乓缓存的方式以帧为单位缓存至外部存储器中;外部存储器中缓存的图像数被发送至可配置视差点云计算模块进行立体匹配计算过程。立体匹配采用局部匹配算法,以SAD为指标来度量匹配代价。得到的视差数据通过高速接口(例如高速串行计算机扩展总线(Peripheral Component Interconnect Express,PCIE))进行输出。
主控单元(例如FPGA上内置CPU)生成配置信息,并将配置信息发送至各可配置模块,实现匹配参数的配置,其中相机及采集芯片可以通过寄存器配置实现不同分辨率和帧率的调整,基础图像处理模块包括了demosaic、灰度校正、图像色彩格式转换、分辨率裁剪等,可以通过配置寄存器的方式实现由高分辨率向低分辨率的裁剪,图像畸变校正模块可以通过配置寄存器的方式实现不同分辨率的兼容。
本申请提供的基于FPGA的可配置实时视差点云生成系统可以通过配置寄存器的方式实现不同分辨率、匹配窗口宽度、视差搜索深度的配置,且能够支持多组双目数据实时处理。
图2为本申请提供的可配置实时视差点云计算装置的结构示意图,从图2中可以看出,该装置可以应用于基于FPGA的可配置实时视差点云生 成系统,该装置包括图像缓存单元200、缓存控制器210、处理单元(Processing Element,PE)阵列220、结果整形模块230、最小值搜索模块240,配置解析模块250。
其中,图像缓存单元200与缓存控制器210连接,用于在缓存控制器210的控制下,按照指定窗口大小和滑窗顺序,对缓存的双目图像数据进行整形后输出图像窗口数据至缓存控制器210;
缓存控制器210分别与配置解析模块250和PE阵列220连接,用于根据配置解析模块250传递的控制信号,控制图像缓存单元200输出图像窗口数据,图像窗口数据经缓存控制器210分发至PE阵列220中的PE;
PE阵列220分别与配置解析模块250和结果整形模块230连接,用于根据配置解析模块250传递的控制信号,生成指定结构的若干算法处理单元(Processing Unit,PU),并基于指定结构的若干PU对输入的图像窗口数据进行处理,得到SAD匹配代价计算结果输出至结果整形模块230;
其中,PE阵列采用mesh拓扑结构,每个PE与其上下左右四个PE互联,指定结构的PU指的是包含一列若干行PE的算法处理单元,用于完成指定窗口的SAD匹配代价计算,获得计算结果,其行数可以根据配置信息中的图像窗口尺寸确定,例如图像窗口尺寸为3×3大小,则一个PU包含1列3行共3个PE;
SAD匹配代价计算指的是两个窗口(例如左右目图像窗口)的对应像素值的差的绝对值的累加和。
结果整形模块230分别与配置解析模块250和最小值搜索模块240连接,用于根据配置解析模块250传递的控制信号,对输入的SAD匹配代价计算结果进行数据字段添加后输出至最小值搜索模块240;
最小值搜索模块240与配置解析模块250连接,用于基于最小值搜索树,对输入的SAD匹配代价计算结果逐级搜索最小值,并输出最小匹配代价对应的视差值;
配置解析模块250用于对接收到的配置信息进行解析,生成相应的控制信号分别输入至缓存控制器、PE阵列和结果整形模块,最小搜索模块。
从图2中可以看出,图像缓存单元200存储大小为48.64KB,包括38个Bank,每个Bank由一个大小为1280Byte的双端口块随机存取存储器 (Block Random Access Memory,BRAM)组成,可以存储1280Byte的像素数据,例如可以存储480p分辨率(即分辨率为640×480)的左右图像各自一行像素数据,可以存储720p分辨率(即分辨率为1280×720)的左右图像各1/2行像素数据,也可以存储1080p分辨率(即分辨率为1920×1080)的左右图像各1/3行像素数据。但是本领域技术人员应当理解,图像缓存单元200的存储大小、Bank个数及Bank大小,这些并非是限制性的,他们可以根据需要灵活调整。
图3为本申请提供的图像缓存单元对分辨率的兼容方法示意图,从图中可以看出,图像缓存单元200可以为不同分辨率图像进行整形操作,即按照固定窗口大小以及滑窗顺序将像素数据取出或者将像素重新排列,整形后匹配窗口大小为2n+1(n∈N^*,n≤9)(注:1920×1080分辨率下,整形后匹配窗口大小为2n+1(n∈N^*,n≤6))。对于单个图像来说,整形buffer的每一行可以缓存640个像素,其可以完成行像素数量为640*n(n属于正整数)的图片的整形操作。例如,图像按照像素坐标顺序,从Bank0开始逐行缓存。输入图像为480p分辨率,则每行像素占用一个Bank,进行数据整形时,从每个bank中同时读取数据即可完成整形;输入图像为720p分辨率,则每行像素占用两个Bank,进行数据整形时,从第2n-1(n属于正整数)个Bank中同时读取数据即可完成整形;输入图像为1080p分辨率,则每行像素占用3个bank,进行数据整形时,从第3n-1(n属于正整数)个Bank中同时读取数据即可完成整形。当所有Bank写满后,输入数据将从Bank0开始覆盖历史数据,继续进行逐行缓存。
缓存控制器210可以根据配置解析模块250传递的控制信号,生成图像缓存单元的读写控制信号,BRAM中数据的读写(例如从外部存储器读出,写入BRAM,从BRAM读出,分发给PE阵列等)。
PE阵列220完成基于SAD的匹配代价计算。PE阵列中的每一个PE都可以根据配置解析模块250传递的控制信号,被配置为不同工作模式。每一个PE在不同工作模式下(不同窗口宽度、视差搜索深度、分辨率及视频流数量)的计算任务不同,需要的FPGA资源也不一样,根据该架构所能支持的所有计算参数,为每个PE构建了一个配置空间,并且根据配置空间设计了三类PE结构:Ultra PE、Standard PE及Lite PE。这三种PE根 据计算任务的不同,分布于阵列中不同的行,较为复杂的计算由Ultra PE完成,其次是Standard PE,简单计算由Lite PE完成,可以最大限度节约FPGA资源消耗;同时PE阵列中的每一个PE采用上下左右互联的方式,可以在水平方向上传递操作数及匹配代价,在垂直方向上传递中间结果,从而实现SAD匹配代价计算的多级流水设计,加速SAD匹配代价计算;PE阵列中的每一列均可配置为包括一个或者多个PU,实现不同搜索深度的匹配代价计算。
结果整形模块230对Ultra PE和Standard PE行生成的SAD计算结果进行字段添加。Ultra PE和Standard PE行生成的SAD匹配代价计算结果,经过水平方向上的互联数据通道传输至结果整形模块,结果整形模块结合配置解析模块传递的控制信号,为对应候选视差(左右窗口中心点横坐标的差)位置的SAD结果添加位置字段,即在每个结果前添加8bit二进制码,用于表示该SAD值在该候选视差下的匹配代价。并将添加字段后的匹配代价发送至最小值搜索模块。
最小值搜索模块240采用最小值搜索树,逐级搜索匹配代价中的最小值,并输出最小匹配代价对应的视差值。
配置解析模块250接收CPU发送的配置信息,包括待处理视频的路数、分辨率,搜索深度,匹配窗口大小等信息。配置解析模块250对收到的配置信息进行解析,生成相应的标志信号,供缓存控制器210生成读写地址与读写使能;传递控制信号至PE阵列220、结果整形模块230,最小值搜索模块240。
配置解析模块250实现适配不同的匹配参数,同时无需重构FPGA;通过PE阵列流水结构完成SAD匹配代价的计算,保证实时性,从而克服了现有技术无法兼顾实时性与同时适配不同的匹配信息的缺陷。
可选地,PE阵列中的PE采用上下左右互联的方式,在垂直方向上进行中间结果的传递,在水平方向上进行操作数及最终匹配代价的传递。
具体地,在SAD匹配代价的计算过程中,将PE阵列中的每一个PE采用上下左右互联的方式,可以在水平方向上传递操作数及匹配代价,在垂直方向上传递中间结果。水平方向上,充分利用立体匹配流水模式中的数据复用特点,在PE阵列内传递操作数及最终SAD匹配代价,避免远距 离数据访存。在垂直方向上,完成左右窗匹配代价计算过程中部分和的传递以及累加。
图4为是本申请提供的PSAD和PSUM定义示意图。从图4中可以看出,PSAD表示对左图和右图的同一列中处于同一对应位置的像素值求绝对差值,并将这一列的绝对差值相加。
图5为本申请提供的SAD匹配代价计算流水设计示意图。首先说明SAD求取过程:对于两幅输入图像(左图和右图),依次扫描左图的每一个像素点(称为锚点),在扫描左图的每一个像素点时,进行如下操作:以每一个锚点为中心构造一个固定尺寸的匹配窗口(如3×3、5×5……),选择出窗口覆盖区域的所有像素点;同样用窗口覆盖右图对应位置,并且选择出覆盖区域的所有像素点;求取左图覆盖区域与右图覆盖区域对应像素点的灰度值之差的绝对值,并将绝对值相加;以1为步长,向左移动右图的覆盖区域,并取出覆盖区域的所有像素点,并求SAD值;重复上一步骤,直到右图覆盖区域中心位置超出视差搜索范围;找到此范围内最小SAD值对应的窗口,其中心点即为左图锚点的对应点,左图锚点和其在右图的对应点的横坐标的差值即为该锚点的视差。
图5中的(a)部分展示了左图像素值为98(3×3窗口的中心点)的一个锚点的视差搜索计算过程,其中窗口尺寸为3×3,搜索范围为4,搜索过程中,依次计算左图窗口数据与右图窗口1、2、3、4的SAD值,并求取最小SAD值,进而得到视差。如图5中的(b)部分展示了本申请提出的计算架构中SAD计算流水的实现过程,示例性的,SAD并行过程如下,两个3×3大小的窗口进行SAD计算过程可以划分为3个子过程,将子过程命名为PSAD过程,PSAD过程的结果定义为PSUM,则SAD计算结果可以通过若干PSUM相加得到。
图5中的(a)部分所示的计算过程以一个完整的SAD计算为基本粒度,此计算方式存在数据的重复访存。为了实现高效的流水计算,现将SAD计算过程调整为图5中的(b)部分所示的形式,该形式以一个PSAD计算为粒度,通过PSUM的累加实现SAD的计算过程。具体来说,t1时刻,左图中①所示的子窗口与右图中①所示的4个子窗口并行执行4个PSAD过程,产生4个PSUM,从右至左分别为PSUM1_1、PSUM1_2、PSUM1_3、 PSUM1_4(PSUMn_m:左图子窗口n与右图n所示的子窗口集合中从右向左的第m个子窗口的PSAD结果);t2时刻,左图中②所示的子窗口与右图中②所示的4个子窗口并行执行4个PSAD过程,产生4个PSUM,从右至左分别为PSUM2_1、PSUM2_2、PSUM2_3、PSUM2_4······以此类推。这种计算方式下,图5中的(a)部分所示的窗口1的SAD过程可以由PSUM1_1、PSUM2_1、PSUM3_1相加得到,记为SUM。这种流水方式可以充分挖掘右图的数据复用性。当计算以左图中90为中心点,视差为0的两个窗口的匹配代价时,采用SUM加上PSUM4_1(图5中的(b)部分左图中④所示的子窗口与右图中④所示的子窗口集合中从右向左第一个子窗口的PSAD结果)再减去PSUM1_1的方式,避免冗余计算。此种计算方式需要对历史PSUM进行本地缓存。
示例性地,图6为本申请提供的全并行流水SAD计算阵列示意图,该阵列由三行四列PE组成,可配置为四个PU(PU1、PU2、PU3及PU4),完成窗口尺寸为3×3,搜索范围为四的全视差并行计算过程。图5中的(b)部分左图的每一个子窗口需要跟右图的四个子窗口进行PSAD计算,每个PSAD计算由一列PE(即1个PU)完成,每一列PE中PE的个数由窗口尺寸决定,因此完成该全视差并行计算需要四列这样的PE,PE之间可以在上下左右方向进行数据传递。整个SAD计算阶段分为两个子阶段,分别进行操作数填充和流水线计算。
(1)操作数填充阶段:
clk1~clk4:如图5中的(b)部分所示,缓存控制器将左图①子窗口中的三个数取出,分别多播给图6中的PE00-PE03、PE10-PE13、PE20-PE23(将43多播给PE00-PE03,87多播给PE10-PE13,34多播给PE20-PE23);缓存控制器依次取出右图①中的四个子窗口数据,通过阵列的横向数据传递路径,将四个子窗口数据发送至12个PE(将88发送至PE03,59发送至PE13,88发送至PE23,1发送至PE02,45发送至PE12,6发送至PE22,42发送至PE01,58发送至PE11,14发送至PE21,69发送至PE00,72发送至PE10,0发送至PE20)。
流水线计算阶段:
clk5:各PE计算寄存器中两个操作数的差并取绝对值。以PE00为例, 计算两个操作数(43和69)的AD值(26)并寄存。
clk6:更新各PE寄存器中的操作数,计算更新后的两个操作数的AD值,并且将clk5时刻产生的AD值按照列进行相加。具体来说,PE03操作数更新为1,PE13操作数更新为45,PE23操作数更新为6,PE02操作数更新为42,PE12操作数更新为58,PE22操作数更新为14,PE01操作数更新为69,PE11操作数更新为72,PE21操作数更新为0,PE00操作数更新为55,PE10操作数更新为80,PE20操作数更新为87。右图②的四个子窗口与右图①的四个子窗口存在可复用的值,因此仅从缓存buffer(PE内部结构)中取出最右侧子窗口的值,即55、80、87,并将其发送至PE00、PE10、PE20即可。其余三个子窗口中的操作数,由左侧PE横向传递给右侧PE即可。以PE00为例,计算寄存器中两个操作数(98和55)的AD值(43)并寄存。AD值的相加分为两步完成:以第一列为例,第一步完成PE00与PE10中AD值的相加并产生中间结果p,第二步完成PE20中AD值与中间结果p的相加。clk6时刻完成上述过程的第一步。PE10的AD值(15)传递给PE00,并与PE00的AD值(26)相加,并将中间结果(41)暂存。其余列过程与第一列相同。
clk7:更新各PE寄存器中的操作数,计算更新后的两个操作数的AD值,并完成clk5时刻产生的AD值相加的第二步,同时完成clk6时刻产生的AD值相加的第一步。PE操作数的更新和操作数的AD值计算与上述过程类似。AD值相加具体来说,以第一列为例,PE20将clk5时刻产生的操作数(34和0)的AD值(34),传递给PE00,并与clk6时刻的中间结果(41)相加,得到PSUM1_1(75)。PE10将clk6时刻产生的操作数(98和80)的AD值(18)传递给PE00,并与PE00在clk6时刻产生的操作数(98和55)的AD值(43)相加,产生中间结果(61)。其余列与第一列相同。
clk8:更新各PE寄存器中的操作数,计算更新后的两个操作数的AD值,并完成clk6时刻产生的AD值相加的第二步,同时完成clk7时刻产生的AD值相加的第一步。PE操作数的更新和操作数的AD值计算与上述过程类似。AD值相加具体来说,以第一列为例,PE20将clk6时刻产生操作数(39和87)的AD值(48),传递给PE00,并与clk7时刻的中间结果(61)相加,得到PSUM2_1(109)。PE10将clk7时刻产生操作数(90和 1)的AD值(89)传递给PE00,并与PE00在clk7时刻产生的操作数(44和56)的AD值(12)相加,产生中间结果(101)。其余列与第一列相同。
clk9:完成clk7时刻产生的AD值相加的第二步,同时将PSUM1_1~4与PSUM2_1~4分别相加。具体来说,以第一列为例,PE20将clk7时刻操作数(45和45)的AD值(0),传递给PE00,并与clk8时刻的中间结果(101)相加,得到PSUM3_1(101)。在PE00中,将PSUM1_1(75)与PSUM2_1(109)相加,得到中间结果q(184)。其余列与第一列相同。
clk10:完成中间结果q与PSUM3_1的相加,产生最终SUM值。具体来说,以第一列为例,在PE00中,完成PSUM3_1(101)与中间结果q(184)的相加,得到最终SUM值(285)。其余列与第一列相同。
其中PSUM值在PE之间垂直向上传递,右图的数据和Ultra PE和Standard PE行生成的SAD计算结果在PE之间横向传递,从而加快SAD匹配代价的计算,减少重复读取数据带来的时间浪费。通过上述PE连接方式,更加有利于SAD匹配代价的多级流水设计实现,达到数据传输与计算的实时性要求。
可选地,PE阵列中包括以下类型PE中的一种或多种:
Ultra PE,用于执行SAD匹配代价计算过程中,对两个操作数作差求绝对值,以及部分和的累加操作;
Standard PE,用于执行SAD匹配代价计算过程中,对两个操作数作差求绝对值,以及部分和的累加操作;
Lite PE,用于执行SAD匹配代价计算过程中,对两个操作数作差求绝对值的操作;
其中,Ultra PE对应的计算资源大于Standard PE。
具体地,在PE阵列的排布时,PE阵列中包括三种类型PE中的一种或多种,例如,可以全部由Ultra PE生成PE阵列;也可以由Ultra PE与Standard PE来生成PE阵列;还可以Ultra PE、Standard PE及Lite PE三种共同生成PE阵列。
图7为本申请提供的Ultra PE内部结构图。从图7可以看出,Ultra PE内部包括如下部件:AD值累加单元701~705,用于完成垂直方向上的PE 传递而来的中间结果(AD值)的累加,并产生PSUM,经过数据互联传递给其他PE;AD计算单元706,用于完成两个操作数的差并取绝对值(以下称为AD值)的计算过程;PSUM缓存单元707,用于完成分时复用时历史PSUM的本地暂时缓存;PSUM累加单元708;匹配代价缓存单元709;操作数缓存单元710,用于分时复用时的操作数暂存。
需要说明的是Ultra PE内计算资源最多,Standard PE次之,Lite PE最少。Standard PE包含图7中的AD值累加单元701~703;AD计算单元706;PSUM缓存单元707;PSUM累加单元708;匹配代价缓存单元709;操作数缓存单元710。Lite PE仅包含图6中的AD计算单元706和操作数缓存单元710。
具体地,AD值累加单元701~705,包括一个加法器、两个寄存器(A和B)、一个MUX和一个DEMUX。具体地,加法器的操作数1来源于纵向的数据总线,由其他PE传递而来;操作数2为历史累加结果或者0;当完成一个PSAD的累加后,将操作数2置零,来进行下一个PSAD计算过程;累加的中间值寄存在A和B中,A和B组成一个深度为2的FIFO;MUX可以选择从A或者B中获取累加值(图中上方的寄存器为B,下方的寄存器为A);从B中获取累加的中间值可以实现寄存一拍(延时一个预设的时钟周期)的效果;5个AD值累加单元的输出连接到MUX,MUX选通有效的PSUM,并且将其分别送入PSUM缓存单元707和PSUM累加单元708。
AD计算单元706,输入为左右图像像素数据,输出经过一个由三个寄存器组成的FIFO,MUX可以选择从三个寄存器中获取AD值,从而达到打一拍(延时一个预设的时钟周期)或者打两拍的效果。
PSUM累加单元708,有三个操作数,分别为当前PSUM值,历史PSUM值与历史匹配代价,其中PSUM值由AD累加单元701~705产生,经MUX选通后传递而来,历史PSUM值缓存在PSUM值缓存707中,历史匹配代价缓存在匹配代价缓存单元中709中。
本领域技术人员应当理解,图中AD值累加单元、AD计算单元等的个数及寄存器个数并非是限制性的,他们可以根据需要调整。
根据SAD多级流水计算对计算资源的不同需求,设计不同的PE结构, 有效提升了阵列利用率,节约了FPGA资源,降低了FPGA功耗。
可选地,PE阵列中的每一列均可配置为包括一个或者多个PU,所述PU用于执行指定窗口大小的SAD匹配代价计算操作。
具体地,为了空间上进行尽可能多的视差并行计算,同时也为了兼容不同的匹配窗口尺寸、不同的视差搜索深度,PE阵列的每一列都可以配置成为一个或者多个一定尺寸的PU,例如PE阵列的每一列有10个PE,可以配置成3个计算3×3窗口匹配代价的PU,并行完成三次SAD匹配代价计算,或配置成计算2个5×5窗口匹配代价的PU,并行完成两次SAD匹配代价计算,或配置成计算1个9×9窗口匹配代价的PU,完成一次SAD匹配代价计算。
可选地,PE阵列每一列有19个PE,这些PE可以被配置为如表1所示的22种计算单元的组合,通过以下配置组合实现不同窗口尺寸的兼容。需要说明的是表格中的数字表示能够计算相应窗口大小的左右窗口匹配代价的计算单元的个数,以组合1为例,意为每一列PE可以配置为5个用于计算尺寸为3×3的左右窗口匹配代价的计算单元。
表1:PE阵列中单列PE的配置组合
On this basis, a larger disparity search depth can be supported by combining time-division multiplexing. For example, if each column of the PE array has 19 PEs and there are 25 columns in total, and each column is configured as 5 PUs computing the matching cost of a 3×3 left/right window, then the 25 columns provide 125 parallel 3×3-window processing units. When the disparity search depth is less than or equal to 125, the array can compute all disparities in parallel; when the disparity search depth is greater than 125, time-division multiplexing is used.
The embodiments of the present application provide three time-multiplexing configurations in total, namely multiplexing by 2, 4 and 8 times; Table 2 shows the maximum search depth that this array can reach under the different time-multiplexing configurations.
Table 2: Theoretical maximum search depth for different window widths and multiplexing factors
(Table 2 is reproduced only as an image in the original publication.)
With the above arrangement, the device is compatible with different window sizes and different search depths, making the window size and the search depth configurable, while the maximally parallel design also allows the device to meet high real-time requirements.
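The relationship between array size, window size and reachable search depth can be sketched as follows (a Python illustration, assuming the maximum depth scales linearly with the multiplexing factor; Table 2 itself is available only as an image):

```python
def parallel_pus(columns, column_height, window_height):
    """Number of window-matching PUs the array provides in parallel."""
    return columns * (column_height // window_height)

def max_search_depth(columns, column_height, window_height, reuse=1):
    """Theoretical maximum disparity search depth for a given window height
    and time-multiplexing factor reuse (1, 2, 4 or 8)."""
    return parallel_pus(columns, column_height, window_height) * reuse

# 25 columns of 19 PEs, 3x3 windows -> 125 disparities fully in parallel
assert max_search_depth(25, 19, 3) == 125
# with 4x time multiplexing the same array would cover up to 500 disparities
assert max_search_depth(25, 19, 3, reuse=4) == 500
```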
Optionally, in the PU, the PEs of the first row are Ultra PEs or Standard PEs.
Specifically, during the computation of the matching cost, the intermediate results are passed upward in the vertical direction, so the final accumulation of each PU takes place in its first row, and the first row can therefore use Ultra PEs or Standard PEs.
Furthermore, when the SAD is computed over the whole PE array, taking both resource consumption and compatibility with different window sizes into account, every row that contains adders should use Ultra PEs or Standard PEs, and when many adders are needed (for example 5), only Ultra PEs can be used; every row that only computes AD values can use Lite PEs.
Based on this design idea, as an example, in a PE array of 19 rows and 25 columns in which each column can be configured as one or more PUs completing the SAD computation, the Ultra PEs are placed in rows 0 and 9, the Standard PEs in rows 2, 4, 6, 8, 11, 13 and 15, and the Lite PEs in rows 1, 3, 5, 7, 10, 12, 14, 16, 17 and 18. This arrangement achieves an optimal resource configuration without creating design redundancy.
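The row layout of this example can be written out directly (a Python sketch of the arrangement quoted above; only the three row sets come from the text, the helper itself is illustrative):

```python
ULTRA_ROWS = {0, 9}
STANDARD_ROWS = {2, 4, 6, 8, 11, 13, 15}

def pe_type(row):
    """PE type of a given row in the 19-row example array."""
    if row in ULTRA_ROWS:
        return "Ultra"
    if row in STANDARD_ROWS:
        return "Standard"
    return "Lite"

layout = [pe_type(r) for r in range(19)]
# 2 Ultra rows, 7 Standard rows and 10 Lite rows, as listed above
assert [layout.count(t) for t in ("Ultra", "Standard", "Lite")] == [2, 7, 10]
```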
Placing Ultra PEs or Standard PEs at the top of each PU keeps the resource consumption low and the power consumption of the device small.
The methods and devices provided in the embodiments of the present application are based on the same conception; since the principles by which the method and the device solve the problem are similar, the implementations of the method and the device may refer to each other, and repeated descriptions are omitted.
FIG. 8 is a schematic flow chart of the configurable real-time disparity point cloud computing method provided by the present application. As shown in FIG. 8, the method includes the following steps:
Step 800: the configuration parsing module parses the received configuration information and generates corresponding control signals, which are respectively input to the buffer controller, the PE array, the result reshaping module and the minimum search module.
Step 801: according to the control signal passed by the configuration parsing module, the buffer controller controls the image buffer unit to output the image window data corresponding to one or more channels of binocular image data, in the specified window size and sliding-window order.
Step 802: according to the control signal passed by the configuration parsing module, the PE array generates a number of PUs of a specified structure, and processes the input image window data on the basis of these PUs to obtain the SAD matching-cost results corresponding to the image window data.
Step 803: according to the control signal passed by the configuration parsing module, the result reshaping module adds fields to the SAD matching-cost results output by the PE array.
Step 804: according to the control signal passed by the configuration parsing module and a minimum search algorithm, the minimum search module searches the field-added SAD matching-cost results for the minimum value stage by stage, and outputs the disparity value corresponding to the minimum matching cost.
The configuration parsing module makes it possible to adapt to different matching parameters, and the pipelined PE-array structure completes the computation of the SAD matching cost and guarantees high real-time performance, thereby overcoming the drawback of the prior art that high real-time performance and adaptation to different matching parameters cannot be achieved at the same time.
Optionally, the configuration information of the configurable real-time disparity point cloud computing method includes the image resolution, the matching window size, the disparity search depth, the number of channels of binocular image data, and the PE working mode.
Specifically, the supported image resolutions include 480p, 720p, 1080p and so on; the matching window sizes include 3×3, 5×5, 13×13 and so on; the disparity search depth depends on the application scenario and is specified by the user; the binocular image data contain the data streams of the left and right images, referred to as a left/right data-stream pair for short, and the number of channels of binocular image data refers to the number of left/right data-stream pairs; the PE working modes include Ultra PE, Standard PE and Lite PE. During the parallel computation of the SAD matching cost, the rows of the PE array that need adders all use Ultra PEs and Standard PEs (Ultra PEs when many adders are needed), and the rows that only compute AD values use Lite PEs.
This configuration information can be passed and parsed by the configuration parsing module, thereby providing the flexibility to adapt to different matching parameters.
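For reference, the configuration fields listed above can be grouped as in the following sketch (a Python illustration; the field names and the example values are assumptions, not the register layout of the device):

```python
from dataclasses import dataclass

@dataclass
class MatchingConfig:
    resolution: tuple     # e.g. (1280, 720) for 720p
    window_size: int      # matching window edge length: 3, 5, ..., 13
    search_depth: int     # disparity search depth, chosen per application
    stream_pairs: int     # number of left/right data-stream pairs
    pe_mode: str          # "Ultra", "Standard" or "Lite"

# a single 720p stream pair matched with a 5x5 window over 96 disparities
cfg = MatchingConfig((1280, 720), 5, 96, 1, "Standard")
```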
Optionally, the configuration information is determined as follows:
determining that the single-data-stream or multi-data-stream performance indicators are met;
allocating computing resources according to the number of allocatable unit computing resources in the PE array, the disparity search depth corresponding to each data stream, and the video frame rate corresponding to each data stream;
generating the configuration information according to the allocated computing resources.
Specifically, a data stream refers to one left/right video data-stream pair, and multiple data streams refer to multiple left/right video data-stream pairs.
A unit computing resource refers to a unit composed of several PEs that can process the matching cost of a window of a certain size; for example, for a 3×3 window, the computing unit is a PE matrix of 3 rows and 1 column, containing 3 PEs.
Each module in FIG. 2 can be configured in a multi-data-stream shared mode. The image buffer unit 200 can support the reshaping of multiple data streams of different resolutions, and the PE array is divided in space among the different data streams by means of the resource-aware configuration-information generation method, according to the data-stream parameters and the frame-rate requirements, thereby improving the utilization of the computing resources.
When computing units are allocated to the data streams, the configuration-generation requirement is first checked, i.e. whether there are multiple data streams. In the multi-data-stream case it is checked, on the basis of the multi-data-stream configuration space of this design, whether the computing power of the architecture meets the required multi-data-stream performance indicators (i.e. whether the frame-rate requirement is met, or nearly met, at the specified window width and disparity search depth); if there are multiple data streams and the performance indicators are met, the configuration-generation flow is entered, otherwise the single-data-stream configuration flow is entered or an error is reported.
In the process of determining the resource configuration, the computing resources are allocated according to the number of allocatable unit computing resources in the PE array, the disparity search depth corresponding to each data stream and the frame-rate requirement corresponding to each data stream. For example, after resources are allocated from the configuration resource space to a data stream, the resources allocated to the data streams are adjusted dynamically according to the difference between the proportion of computing resources already allocated to that data stream and the proportion of computing resources it requires, where the computing resources a data stream requires are characterized by its disparity search depth and video frame rate.
With the above method, computing resources can be allocated to each data stream according to the availability of the array's computing resources and the indicator requirements of each data stream, so that the computing resources are fully utilized and the performance indicators of the different data streams are balanced.
Optionally, when there are two data streams, allocating computing resources according to the number of allocatable unit computing resources in the PE array, the disparity search depth corresponding to each data stream and the video frame rate corresponding to each data stream includes:
determining the computing resources that remain allocatable in the PE array after each data stream has been allocated one unit computing resource;
if the remaining allocatable computing resources can provide at least one unit computing resource for each data stream and a first condition is met, continuing to allocate one unit computing resource to each data stream;
if the remaining allocatable computing resources can provide at least one unit computing resource for each data stream but the first condition is not met, allocating unit computing resources to each data stream individually according to the relative magnitudes of a first value and a second value;
wherein the first value is determined from the disparity search depth and the video frame rate corresponding to each data stream, the second value is determined from the number of unit computing resources currently allocated to each data stream, and the first condition is determined from the first value, the second value and a preset threshold.
Specifically, FIG. 9 is a schematic flow chart of the resource-aware configuration generation provided by the present application.
As can be seen from the figure, taking the allocation of resources to two data streams A and B as an example, the first condition is an inequality (given only as a formula image in the original publication) that compares the first value and the second value against the threshold δ, where δ is a manually set empirical value, a small quantity such as 0.2 or 0.5; the first value is determined from the search depth D and the frame rate F required by each data stream, and the second value from the number I of unit computing resources already allocated to each data stream. The configuration-generation flow is as follows: first, one unit computing resource is allocated to each data stream; then it is checked whether allocatable computing resources exist in the array for both A and B; if so, it is checked whether the first condition holds; if it holds, unit computing resources continue to be allocated to both A and B; if it does not hold, the first value and the second value are compared, and if the comparison indicates that A has received too little, a unit computing resource is allocated to A alone, otherwise to B alone; the first condition is then checked again, and so on until the configuration generation ends and the configuration information is output.
With the above method, the resources allocated to the data streams can be adjusted dynamically, so that the resource allocation is better and the utilization is high.
Optionally, the resource-allocation method further includes:
if the remaining allocatable computing resources can only provide unit computing resources for a target data stream, allocating all of the remaining allocatable computing resources to the target data stream.
Specifically, when resources are being allocated to two data streams, one unit computing resource is first allocated to each data stream, and it is then checked whether allocatable computing resources exist in the array for both data streams A and B. If this condition is not met, it is checked whether the array still has unit computing resources that can be allocated to data stream A, data stream A then being the target data stream; if so, all the remaining computing resources are allocated to data stream A. If no unit computing resources in the array can be allocated to data stream A, it is checked whether the array still has unit computing resources that can be allocated to data stream B, data stream B then being the target data stream; if so, all the remaining computing resources are allocated to data stream B, the configuration is ended and the configuration information is output. If no remaining computing resources can be allocated to either data stream, the configuration is ended and the configuration information is output.
With this allocation method, when the computing resources cannot serve both data streams at the same time but can serve one of them, resources are allocated to the single data stream, maximizing the resource utilization.
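The following Python sketch puts the flow of FIG. 9 and the rule above together for two streams. It is only an illustration: the exact inequality behind the first condition appears only as an image in the publication, so the sketch assumes it compares each stream's share of allocated units I with its share of required capability D×F within the tolerance δ, and it models the array simply as a pool of countable units.

```python
def allocate_two_streams(total_units, depth_a, fps_a, depth_b, fps_b, delta=0.2):
    """Resource-aware allocation sketch for streams A and B.
    Returns (units allocated to A, units allocated to B)."""
    i_a, i_b = 1, 1                               # one unit for each stream first
    need_a, need_b = depth_a * fps_a, depth_b * fps_b
    need_share_a = need_a / (need_a + need_b)     # A's share of required capability

    while total_units - (i_a + i_b) >= 2:         # both streams can still receive a unit
        alloc_share_a = i_a / (i_a + i_b)         # A's share of allocated units
        if abs(alloc_share_a - need_share_a) <= delta:
            i_a += 1                              # first condition holds: grow both
            i_b += 1
        elif alloc_share_a < need_share_a:
            i_a += 1                              # A has received too little
        else:
            i_b += 1                              # B has received too little

    # remaining units can only serve one stream: give them to the more starved one
    rest = total_units - (i_a + i_b)
    if rest > 0:
        if i_a / (i_a + i_b) < need_share_a:
            i_a += rest
        else:
            i_b += rest
    return i_a, i_b


# e.g. 25 unit resources, stream A: depth 128 at 60 fps, stream B: depth 64 at 30 fps
print(allocate_two_streams(25, 128, 60, 64, 30))
```

In the device itself the availability check is spatial (whether a suitable unit can still be carved out of the array for that particular stream), which the count-based model above deliberately simplifies.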
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a device, or a computer program product. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, optical storage and the like) containing computer-usable program code.
The present application is described with reference to flow charts and/or block diagrams of the method, device (apparatus) and computer program product according to the embodiments of the present application. It should be understood that each flow and/or block in the flow charts and/or block diagrams, and combinations of flows and/or blocks in the flow charts and/or block diagrams, can be implemented by computer-executable instructions. These computer-executable instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor or other programmable data-processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data-processing device produce an apparatus for implementing the functions specified in one or more flows of the flow charts and/or one or more blocks of the block diagrams.
These processor-executable instructions may also be stored in a processor-readable memory capable of directing a computer or other programmable data-processing device to operate in a particular manner, so that the instructions stored in the processor-readable memory produce an article of manufacture including an instruction apparatus which implements the functions specified in one or more flows of the flow charts and/or one or more blocks of the block diagrams.
These processor-executable instructions may also be loaded onto a computer or other programmable data-processing device, so that a series of operation steps are performed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flow charts and/or one or more blocks of the block diagrams.
Obviously, those skilled in the art can make various changes and variations to the present application without departing from its spirit and scope. Thus, if these modifications and variations of the present application fall within the scope of the claims of the present application and their equivalent technologies, the present application is also intended to encompass these changes and variations.

Claims (10)

  1. A configurable real-time disparity point cloud computing device, comprising:
    an image buffer unit, a buffer controller, a processing element (PE) array, a result reshaping module, a minimum search module and a configuration parsing module;
    wherein the image buffer unit is connected to the buffer controller and is configured to, under the control of the buffer controller, reshape buffered binocular image data according to a specified window size and sliding-window order and output image window data to the buffer controller;
    the buffer controller is connected to the configuration parsing module and the PE array respectively and is configured to control, according to a control signal passed by the configuration parsing module, the image buffer unit to output the image window data, the image window data being distributed by the buffer controller to the PEs in the PE array;
    the PE array is connected to the configuration parsing module and the result reshaping module respectively and is configured to generate, according to a control signal passed by the configuration parsing module, a number of algorithm processing units (PUs) of a specified structure, and to process the input image window data on the basis of the PUs of the specified structure to obtain SAD matching-cost results, which are output to the result reshaping module;
    the result reshaping module is connected to the configuration parsing module and the minimum search module respectively and is configured to add fields, according to a control signal passed by the configuration parsing module, to the input SAD matching-cost results and output them to the minimum search module;
    the minimum search module is connected to the configuration parsing module and is configured to search the input SAD matching-cost results for the minimum value stage by stage, according to a control signal passed by the configuration parsing module and a minimum search algorithm, and to output the disparity value corresponding to the minimum matching cost;
    the configuration parsing module is configured to parse received configuration information and to generate corresponding control signals, which are respectively input to the buffer controller, the PE array, the result reshaping module and the minimum search module.
  2. The configurable real-time disparity point cloud computing device according to claim 1, wherein the PEs in the PE array are interconnected with their upper, lower, left and right neighbours, intermediate results being passed in the vertical direction and operands and final matching costs being passed in the horizontal direction.
  3. The configurable real-time disparity point cloud computing device according to claim 2, wherein the PE array includes one or more of the following PE types:
    an Ultra PE, configured to perform, in the SAD matching-cost computation, the operation of taking the absolute value of the difference of two operands, and the accumulation of partial sums;
    a Standard PE, configured to perform, in the SAD matching-cost computation, the operation of taking the absolute value of the difference of two operands, and the accumulation of partial sums;
    a Lite PE, configured to perform, in the SAD matching-cost computation, the operation of taking the absolute value of the difference of two operands;
    wherein the computing resources of the Ultra PE are greater than those of the Standard PE.
  4. The configurable real-time disparity point cloud computing device according to claim 3, wherein each column of the PE array can be configured to include one or more PUs, the PU being configured to perform the SAD matching-cost computation for a specified window size.
  5. The configurable real-time disparity point cloud computing device according to claim 4, wherein, in the PU, the PEs of the first row are the Ultra PE or the Standard PE.
  6. A configurable real-time disparity point cloud computing method performed on the basis of the configurable real-time disparity point cloud computing device according to any one of claims 1 to 5, wherein the method comprises:
    the configuration parsing module parsing received configuration information and generating corresponding control signals, which are respectively input to the buffer controller, the PE array, the result reshaping module and the minimum search module;
    the buffer controller controlling, according to the control signal passed by the configuration parsing module, the image buffer unit to output image window data corresponding to one or more channels of binocular image data, in the specified window size and sliding-window order;
    the PE array generating, according to the control signal passed by the configuration parsing module, a number of PUs of a specified structure, and processing the input image window data on the basis of the PUs of the specified structure to obtain SAD matching-cost results corresponding to the image window data;
    the result reshaping module adding fields, according to the control signal passed by the configuration parsing module, to the SAD matching-cost results output by the PE array;
    the minimum search module searching, according to the control signal passed by the configuration parsing module and a minimum search algorithm, the field-added SAD matching-cost results for the minimum value stage by stage, and outputting the disparity value corresponding to the minimum matching cost.
  7. The configurable real-time disparity point cloud computing method according to claim 6, wherein the configuration information comprises:
    an image resolution, a matching window size, a disparity search depth, the number of channels of binocular image data, and a PE working mode.
  8. The configurable real-time disparity point cloud computing method according to claim 7, wherein the configuration information is determined by:
    determining that single-data-stream or multi-data-stream performance indicators are met;
    allocating computing resources according to the number of allocatable unit computing resources in the PE array, the disparity search depth corresponding to each data stream and the video frame rate corresponding to each data stream;
    generating the configuration information according to the allocated computing resources.
  9. The configurable real-time disparity point cloud computing method according to claim 8, wherein, when the number of data streams is two, the allocating of computing resources according to the number of allocatable unit computing resources in the PE array, the disparity search depth corresponding to each data stream and the video frame rate corresponding to each data stream comprises:
    determining the computing resources that remain allocatable in the PE array after each data stream has been allocated one unit computing resource;
    if the remaining allocatable computing resources can provide at least one unit computing resource for each data stream and a first condition is met, continuing to allocate one unit computing resource to each data stream;
    if the remaining allocatable computing resources can provide at least one unit computing resource for each data stream but the first condition is not met, allocating unit computing resources to each data stream individually according to the relative magnitudes of a first value and a second value;
    wherein the first value is determined from the disparity search depth and the video frame rate corresponding to each data stream, the second value is determined from the number of unit computing resources currently allocated to each data stream, and the first condition is determined from the first value, the second value and a preset threshold.
  10. The configurable real-time disparity point cloud computing method according to claim 9, wherein the method further comprises:
    if the remaining allocatable computing resources can only provide unit computing resources for a target data stream, allocating all of the remaining allocatable computing resources to the target data stream.
PCT/CN2022/101751 2022-04-01 2022-06-28 可配置实时视差点云计算装置及方法 WO2023184754A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210348784.1A CN114897665A (zh) 2022-04-01 2022-04-01 可配置实时视差点云计算装置及方法
CN202210348784.1 2022-04-01

Publications (1)

Publication Number Publication Date
WO2023184754A1 true WO2023184754A1 (zh) 2023-10-05

Family

ID=82714560

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/101751 WO2023184754A1 (zh) 2022-04-01 2022-06-28 可配置实时视差点云计算装置及方法

Country Status (2)

Country Link
CN (1) CN114897665A (zh)
WO (1) WO2023184754A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117808861A (zh) * 2022-09-26 2024-04-02 神顶科技(南京)有限公司 一种双目视觉系统的运行方法

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130201283A1 (en) * 2009-12-31 2013-08-08 Cable Television Laboratories, Inc. Method and system for generation of captions over stereoscopic 3d images
CN106780590A (zh) * 2017-01-03 2017-05-31 成都通甲优博科技有限责任公司 一种深度图的获取方法及系统
CN110602474A (zh) * 2018-05-24 2019-12-20 杭州海康威视数字技术股份有限公司 一种图像视差的确定方法、装置及设备

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LYU, NIQI; SONG, GUANGHUA; YANG, BOWEI: "Semi-global Stereo Matching Algorithm Based on Feature Fusion and Its CUDA Implementation", JOURNAL OF IMAGE AND GRAPHICS, ZHONGGUO TUXIANG TUXING XUEHUI, CN, vol. 23, no. 6, 30 June 2018 (2018-06-30), CN , pages 874 - 886, XP009549818, ISSN: 1006-8961, DOI: 10.11834/jig.170157 *

Also Published As

Publication number Publication date
CN114897665A (zh) 2022-08-12

Similar Documents

Publication Publication Date Title
CN108681984B (zh) 一种3*3卷积算法的加速电路
EP3496007B1 (en) Device and method for executing neural network operation
CN111931918B (zh) 神经网络加速器
US20210019594A1 (en) Convolutional neural network accelerating device and method
CN106846235B (zh) 一种利用NVIDIA Kepler GPU汇编指令加速的卷积优化方法及系统
JP6335335B2 (ja) タイルベースのレンダリングgpuアーキテクチャのための任意のタイル形状を有する適応可能なパーティションメカニズム
US9378533B2 (en) Central processing unit, GPU simulation method thereof, and computing system including the same
KR20220047284A (ko) 포비티드 렌더링을 위한 시스템들 및 방법들
CN110246081B (zh) 一种图像拼接方法、装置及可读存储介质
CN109658337A (zh) 一种图像实时电子消旋的fpga实现方法
WO2023184754A1 (zh) 可配置实时视差点云计算装置及方法
US9460489B2 (en) Image processing apparatus and image processing method for performing pixel alignment
CN109472734B (zh) 一种基于fpga的目标检测网络及其实现方法
CN108540689B (zh) 图像信号处理器、应用处理器及移动装置
CN114219699B (zh) 匹配代价处理方法及电路和代价聚合处理方法
CN106952215B (zh) 一种图像金字塔特征提取电路、装置及方法
CN116166185A (zh) 缓存方法、图像传输方法、电子设备及存储介质
CN115346099A (zh) 基于加速器芯片的图像卷积方法、芯片、设备及介质
CN101452572B (zh) 基于三次平移算法的图像旋转vlsi结构
RU168781U1 (ru) Устройство обработки стереоизображений
CN110602426B (zh) 一种视频图像边缘提取系统
WO2021179286A1 (zh) 卷积神经网络的数据处理方法、预测方法、计算装置和存储介质
US20230252600A1 (en) Image size adjustment structure, adjustment method, and image scaling method and device based on streaming architecture
CN112017112B (zh) 图像处理方法、装置和系统以及计算机可读存储介质
US20220292344A1 (en) Processing data in pixel-to-pixel neural networks

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22934577

Country of ref document: EP

Kind code of ref document: A1