Disclosure of Invention
The invention aims to provide a full-field dense-point fast matching system that addresses the problems in the prior art. With an FPGA as the core processor of the hardware platform, the system improves the matching speed, reduces the amount of calculation, and ensures real-time performance.
In order to achieve the purpose, the invention adopts the technical scheme that: the image processing device comprises a reference image memory module and a target image memory module, wherein the reference image memory module inputs reference subarea data in a stored reference image into a reference subarea register group module and a reference subarea full template sum-of-squares module under the control action of a memory control module;
the reference subarea register group module and the search subarea register group module are connected with the multiplier array module, and output results of the multiplier array module are respectively input into the reference subarea local template square sum module, the full template product sum module and the local template product sum module under the control of the state machine control module; the output results of the reference subarea local template square sum module, the search subarea local template square sum module and the local template product sum module are input into a local template correlation coefficient calculation module to obtain a local template correlation coefficient, and the local template correlation coefficient is input into a threshold comparison module to obtain the correlation between the current matching window and the local template; the threshold comparison module is connected with the state machine control module;
the reference subarea full-template square sum module, the full-template product sum module and the searching subarea full-template square sum module are connected with the full-template correlation coefficient calculation module, and the full-template correlation coefficient calculation module outputs the optimal matching point;
the local template correlation coefficient calculation module can record the correlation coefficient value after completing a pair of matching points, and the threshold value in the threshold value comparison module is replaced according to the correlation coefficient value recorded by the completed matching points.
The reference image memory module and the target image memory module are processor external modules, and the rest are processor internal modules; the processor external module also comprises a parameter configuration module which can provide initialization parameters for the processor internal module, and after the local template correlation coefficient calculation module completes the task of the first pair of matching points, a completion flag bit signal can be obtained, and the configuration of the parameter configuration module for the size of the search area is changed according to the displacement of the best matching point.
The reference image memory module and the target image memory module are realized by independent memories.
The search area cache module can convert input serial search area data into parallel data and output the parallel data.
The multiplier array module is constructed by adopting multipliers with the same number as the local template pixel points.
The searching subarea register group module adopts a data cache region to cache data input into the register group, and after the calculation of the local template matching correlation coefficient of a certain searching window is completed, the searching window data enters another register group through the cache region to carry out the calculation of the full template correlation coefficient. The search subarea full template square sum module divides the search subarea full template area into a plurality of parts for parallel calculation, the partial result of a certain part is used as the input of the search subarea local template square sum module, and all the results are summed into the search subarea full template square sum. The full template product sum module divides the residual pixel points of the full template after the partial template is removed into a plurality of parts, and the time division multiplexing multiplier array module carries out product term calculation on the pixel points of each part and sums to obtain the full template product sum.
Compared with the prior art, the invention has the following beneficial effects. The captured image data is cached by a reference image memory module and a target image memory module. The system provides both a search subarea full template square sum module and a search subarea local template square sum module. The output results of the reference subarea local template square sum module, the search subarea local template square sum module and the local template product sum module are input into a local template correlation coefficient calculation module to obtain a local template correlation coefficient, which is input into a threshold comparison module to obtain the correlation between the current matching window and the local template. Local template matching eliminates a large number of non-matching windows and screens out the candidate matching windows that correlate strongly with the matching point; the candidate windows are then finely matched using the full template. This reduces the number of times the full template takes part in the search matching calculation while preserving the matching precision, and therefore improves the matching speed. The local template correlation coefficient calculation module can record the correlation coefficient value after completing a pair of matching points, and the threshold in the threshold comparison module is replaced according to the correlation coefficient values recorded for the completed matching points, which makes the system self-adaptive.
In addition, each multiplier in the multiplier array module is responsible for multiplying one pair of pixel points from the reference subarea and the search subarea. The multiplier array module ensures that the cross-correlation products of all pixel points under the local template are calculated in one clock cycle; all product terms are then summed by a pipeline adder to obtain the cross-correlation product sum under the local template. If the threshold comparison shows that a full template calculation is needed, the multiplier array is reused by time-division multiplexing, so that the full template calculation of a window is completed within three clock cycles, which greatly reduces the hardware cost of implementing the algorithm. The invention greatly reduces the number of multipliers required: local template matching eliminates a large number of non-matching points, the multiplier array is time-multiplexed to perform full template matching on the small number of remaining candidate matching points, and the self-adaptive threshold selection increases the stability and flexibility of the algorithm. Processor hardware resource consumption is reduced by more than a factor of two at a speed loss of less than 15%. The system can be used for precise measurement of the three-dimensional deformation and morphology of objects in fields such as aerospace, quality control and materials science.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
Referring to fig. 1, the reference image memory module M001 and the target image memory module M002 serve as image data buffer modules and cache the captured image data. The two modules are implemented with two separate memories, and a memory control module M004 inside the processor is connected to the memories to control reading and writing of the two image memory modules. The memory control module is mainly divided into the following three parts:
1) A phase-locked loop. The memory chip requires two clock signals: a driving clock and a memory control clock. To ensure that memory read/write operations fall within the stable period of the data output by the controller, a fixed phase difference must be maintained between the driving clock and the control clock, so that read/write data is as stable as possible. A phase-locked loop is therefore added to the design. It keeps the phase jitter small, and the external clock passes through it to produce two clocks with the same frequency and different phases.
2) The control timing of the memory. This part mainly comprises an internal command interface module and a control interface module. All memory timing is driven and processed through commands, and this part mainly handles commands such as precharge, refresh and burst read/write. The control interface module operates the memory directly through the related commands: it first initializes the memory at power-on, starts normal read/write operation once initialization is complete, and performs a refresh operation on the rows and columns of all banks at intervals so that no data is lost.
3) A dual-port read/write interface. The dual port consists of two read/write ports. Because memory reads and writes share one group of data I/O ports, the memory can perform only a read operation or only a write operation at any given time. Reads and writes at the memory data port pass through an asynchronous FIFO, which handles the data exchange across clock domains and keeps the data continuous. If writing or reading is too fast for the memory to respond, data may be lost or corrupted. In addition, the refresh operation inherent to the memory occupies a certain amount of time. Therefore, if data arrives too fast while the memory is not ready to be written, the data written into the FIFO will overflow.
The reference subarea register group module and the search area cache module mainly read the data of the reference subarea and the search area, and the search area cache module M006 performs serial-to-parallel conversion on the search area data. Fig. 2 shows the output timing of the reference subarea data. First, a command is sent to the memory to start reading the reference subarea data in burst mode. Ref_Data_En is the row enable signal of the reference subarea; each row enable period contains a plurality of valid data, and the next row of reference subarea data is read after a certain time interval. Because the reference subarea contains far less data than the search area, its read time is much shorter, and outputting it serially does not affect the speed of the whole system; the reference subarea data is therefore output in serial mode. By contrast, reading the large amount of search area data takes much longer, so the search area data must undergo serial-to-parallel conversion.
The size of the search area is initially set to 256 × 256. After the first matching point has been found, the system adaptively adjusts the position and size of the search area according to the displacement obtained from the first matching point, and the adjusted size of the search area is limited to 128 × 128. The search area cache module outputs data in parallel. It first determines the start address of the search area, then sends a command and an address to the target image memory module to read and cache the first line of data in the search area; the data output timing is shown in fig. 3. When the first line has been read, the command for the next line of the search area is sent to the target image memory module, so that the search area data is output serially from the target image memory module along with the enable signal. The data entering the search area cache module undergoes serial-to-parallel conversion. The cache unit uses 31 dual-port memories, each with a data depth of 256 and a data bit width of 8 bits. The 31 dual-port memories are connected end to end in series, and the read/write addresses of all of them are driven by the same address line.
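The serial-to-parallel conversion performed by the chained dual-port memories can be pictured with a small software model. The following is a minimal sketch, not the circuit of the invention: the class name `LineBufferChain` and the use of `deque` delay lines as stand-ins for the dual-port memories are assumptions made for illustration, and the demonstration uses a tiny 3-row, 4-pixel-wide configuration instead of 31 memories of depth 256.

```python
from collections import deque


class LineBufferChain:
    """Chain of row-length delay lines (hypothetical stand-ins for the
    dual-port memories). One serial pixel goes in, one column comes out."""

    def __init__(self, rows, width):
        self.rows = rows
        self.width = width
        # Each delay line holds exactly one image row, initially zeros.
        self.lines = [deque([0] * width, maxlen=width) for _ in range(rows)]

    def push(self, pixel):
        """Feed one serial pixel; return `rows` pixels of the same column,
        ordered from the oldest row to the newest row."""
        column = []
        carry = pixel
        for line in self.lines:
            oldest = line[0]      # value written one full row earlier
            line.append(carry)    # deque with maxlen drops the oldest entry
            column.append(oldest)
            carry = oldest        # the delayed pixel feeds the next memory
        return column[::-1]


# Small demonstration: 4 rows of a 4-pixel-wide image, 3 chained memories.
image = [[10 * r + c for c in range(4)] for r in range(4)]
chain = LineBufferChain(rows=3, width=4)
outputs = [chain.push(p) for row in image for p in row]
print(outputs[12])  # column 0 of rows 0..2, available while row 3 streams in
```

All delay lines advance together on every pushed pixel, which mirrors the statement that the memories share one address line.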
Most digital image processing is implemented with sliding windows. To ensure good matching precision, the invention selects a square region of 31 × 31 pixels as the matching sliding window. In the matching point search, the reference subarea serves as the template data, remains unchanged throughout the search, and contains only a small amount of data. Therefore, once the position of the reference subarea is determined, its data is immediately read from the reference image memory module and stored in the reference subarea register group module. The reference subarea register group consists of 31 × 31 registers, each of which holds 8-bit grayscale image data. The registers are connected head to tail in series, so that the 31 × 31 data of the reference subarea can be serially shifted into the 31 × 31 register group for template registration over a number of clock cycles.
The search subarea register group module M009 is formed by 31 parallel shift register groups. Each shift register group contains 31 registers, and each register holds 8-bit grayscale image data. The parallel inputs of the shift register groups are the 31 8-bit parallel output signals of the search area cache module, and the data-valid enable signal of the search area cache module's parallel output is used as the shift enable signal of the registers. When the data enable signal is high, valid data is shifted into the register group; when it is low, no shift is performed and the values in the registers remain unchanged. The 31 × 31 parallel shift register group is followed by a data FIFO buffer and a second 31 × 31 parallel shift register group. Because the local-template correlation coefficient result incurs a pipeline delay of dozens of clock cycles, the first 31 × 31 parallel shift register group keeps being updated with subsequent data during that time. By the time it is determined that a window needs full template matching, the data of that window would already have been overwritten, and the full template calculation could not be performed. The unit therefore uses the data FIFO to buffer the data of the parallel shift register group so that the full-template correlation coefficient can be calculated.
The normalized cross-correlation coefficient calculation unit solves the normalized cross-correlation formula with dedicated arithmetic logic circuits. It comprises the following modules: a reference subarea full template square sum module M011, a search subarea full template square sum module M010, a reference subarea local template square sum module M013, a search subarea local template square sum module M015, a full template product sum module M017 and a local template product sum module M018. To maintain consistent data throughput, the invention requires that the correlation coefficient calculation for one window be performed every clock cycle. The sums of squares of the reference subarea and the search subarea under the full template and the local template are easy to implement and consume relatively few resources. The full template product sum and the local template product sum, however, would require a large number of multipliers if each were to be completed in one clock cycle. The invention therefore improves the search matching algorithm so that the proposed fast matching algorithm is better suited to hardware implementation. A number of multipliers equal to the number of local template pixels first completes the calculation of the local-template correlation coefficient. If the threshold comparison shows that a full template calculation is needed, the multiplier array is reused by time-division multiplexing so that the full template calculation of the window is completed in three clock cycles, which greatly reduces the hardware cost of implementing the algorithm.
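The two-stage strategy described above can be illustrated in software. The following is a minimal sketch under stated assumptions: it uses the normalized cross-correlation form built only from a product sum and two sums of squares (product sum divided by the square root of the product of the sums of squares), which matches the six quantities listed in the text; the function names, the 5 × 5 demonstration template, the five-pixel local template and the threshold value are all hypothetical.

```python
import math


def ncc(ref, win, coords):
    """Normalized cross-correlation over the pixel positions in `coords`."""
    s_fg = sum(ref[r][c] * win[r][c] for r, c in coords)   # product sum
    s_ff = sum(ref[r][c] ** 2 for r, c in coords)          # reference squares
    s_gg = sum(win[r][c] ** 2 for r, c in coords)          # search squares
    denom = math.sqrt(s_ff * s_gg)
    return s_fg / denom if denom else 0.0


def two_stage_match(ref, search, local_coords, threshold):
    """Screen every window with the local template, then score the
    survivors with the full template. Returns (position, score, full_evals)."""
    n = len(ref)
    full_coords = [(r, c) for r in range(n) for c in range(n)]
    best_score, best_pos, full_evals = -1.0, None, 0
    for y in range(len(search) - n + 1):
        for x in range(len(search[0]) - n + 1):
            win = [row[x:x + n] for row in search[y:y + n]]
            if ncc(ref, win, local_coords) < threshold:
                continue                      # non-matching window rejected
            full_evals += 1                   # candidate: fine matching
            score = ncc(ref, win, full_coords)
            if score > best_score:
                best_score, best_pos = score, (y, x)
    return best_pos, best_score, full_evals


# Demonstration on a small synthetic image (values are illustrative only).
search = [[50 + (r * 7 + c * 13) % 100 for c in range(16)] for r in range(12)]
search[4][6], search[6][8] = 1, 255          # make the reference patch distinctive
ref = [row[5:10] for row in search[3:8]]     # reference subarea taken at (3, 5)
local = [(0, 0), (0, 4), (2, 2), (4, 0), (4, 4)]
print(two_stage_match(ref, search, local, threshold=0.98))
```

The third returned value counts how many windows reached the full-template stage, which is the quantity the local screening is meant to keep small.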
The hardware implementation of the computing module will be described in detail below.
1) The reference subarea full template square sum module M011. The reference subarea data of a given matching point is fixed throughout the search, so the sum of squares of all pixel points in the reference subarea is also fixed throughout the search, and it needs to be calculated only once per matching point. The circuit shown in fig. 4 is the reference subarea full template square sum calculation unit. It calculates the sum of squares of the reference subarea by accumulation and consists of a multiplier, an adder and a register. As the reference subarea data enters the serial shift register group, the unit accumulates the squares of the reference subarea data one by one, and the resulting sum is the sum of squares of the reference subarea under the full template.
2) The reference subarea local template square sum module M013. Because the reference subarea data is small in amount and constant, the search subarea data is still being prepared when the reference data has been fully shifted serially into the register group, and the multiplier array is idle at that time. The sum of squares of the reference subarea under the local template can therefore be calculated with the multiplier array during this idle time.
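The accumulation performed by the multiplier, adder and register of fig. 4 can be sketched as a simple multiply-accumulate loop. This is an illustrative model only; the function name is hypothetical, and the accumulator width check is an observation derived from the 31 × 31 template size and 8-bit pixels, not a figure stated in the source.

```python
def accumulate_sum_of_squares(pixel_stream):
    """Serial multiply-accumulate: one pixel per step, as the reference
    subarea data is shifted in."""
    acc = 0                      # models the accumulator register
    for pixel in pixel_stream:
        acc += pixel * pixel     # multiplier output feeds the adder
    return acc


# Worst case for a 31 x 31 template of 8-bit pixels: every pixel is 255.
worst_case = accumulate_sum_of_squares([255] * (31 * 31))
print(worst_case, worst_case.bit_length())
```

The worst-case value indicates how wide the accumulator register would need to be for a full template of 8-bit data.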
3) The search subarea full template square sum module M010 and the search subarea local template square sum module M015. The full matching template of the invention is a 31 × 31 region, and the corresponding local matching template regions are divided as shown in fig. 5. The local matching template is composed of R0, R1, R2, R3 and R4, each of which is a 7 × 7 region. As the search area data enters the search subarea register group module M009 in parallel, one beat at a time, the effect is equivalent to moving the sliding window point by point within the search area, from left to right and from top to bottom. The data in the search subarea register group changes dynamically, so at such a large data throughput its sum of squares cannot be calculated by serial accumulation. The invention instead uses a parallel pipeline structure to calculate the sum of squares of the pixel points in the search subarea and divides the full template into four parts that are calculated in parallel, namely the sum of squares for region R1R2, for region R0, for region R3R4, and for the remaining 10 rows of data. The circuit that calculates the sum of squares of R1R2 is shown in fig. 6; it produces the sum of squares of the first 7 lines of the template and the sum of squares of region R1R2 in a single clock cycle. Seven parallel data paths first square each datum through 8-bit multipliers, a pipeline adder then sums the seven squares of each column, and the column result is serially shifted into 31 20-bit shift registers for registration.
The square sum of the region R0 and the square sum of the 7 lines of data occupied thereby, the square sum of the region R3R4 and the square sum of the 7 lines of data occupied thereby, and the square sum of the remaining 10 lines of data are obtained by a similar circuit configuration, respectively.
The calculated sums of squares Sum_R1R2, Sum_R0 and Sum_R3R4 are then added to obtain the sum of squares of a search subarea under the local template, and the calculated sums of squares Sum1, Sum2, Sum3 and Sum4 are added to obtain the sum of squares of the search subarea under the full template.
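The way the partial sums serve both the local and the full template can be sketched as follows. Fig. 5 is not reproduced in this text, so the region layout below is an assumption: R1 and R2 in the top corners, R0 in the centre, and R3 and R4 in the bottom corners, which leaves exactly 10 rows outside the three 7-row bands. The names `BANDS`, `REGION_COLS` and `band_sums` are hypothetical.

```python
N = 31  # full template side length (from the text)

# ASSUMED layout: three 7-row bands, each containing one or two 7 x 7 regions.
BANDS = {"top": range(0, 7), "mid": range(12, 19), "bottom": range(24, 31)}
REGION_COLS = {
    "top": [range(0, 7), range(24, 31)],      # R1, R2 (assumed positions)
    "mid": [range(12, 19)],                   # R0 (assumed position)
    "bottom": [range(0, 7), range(24, 31)],   # R3, R4 (assumed positions)
}
LOCAL_COORDS = [(r, c) for name, rows in BANDS.items() for r in rows
                for cols in REGION_COLS[name] for c in cols]


def band_sums(window):
    """Return (full-template sum of squares, local-template sum of squares),
    both assembled from the same per-band column sums."""
    sq = [[p * p for p in row] for row in window]
    band_rows = set()
    full_parts, local_parts = [], []
    for name, rows in BANDS.items():
        band_rows.update(rows)
        # Column sums over the 7 rows of the band (one value per column).
        col = [sum(sq[r][c] for r in rows) for c in range(N)]
        full_parts.append(sum(col))
        local_parts.append(sum(col[c] for cols in REGION_COLS[name] for c in cols))
    # Fourth part: the rows that belong to no band.
    full_parts.append(sum(sum(sq[r]) for r in range(N) if r not in band_rows))
    return sum(full_parts), sum(local_parts)


window = [[(r * 31 + c) % 256 for c in range(N)] for r in range(N)]
print(band_sums(window), len(LOCAL_COORDS))
```

Under this assumed layout the local template covers 245 pixels, and the column sums of each band are computed once and reused for both results.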
4) The local template product sum module M018. Processing all the data within the window in a single clock cycle would consume a large amount of DSP Slice resources. By optimizing the algorithm and adopting the local template matching strategy, the invention greatly reduces the DSP Slice consumption of the multiplier array. The hardware implementation uses a multi-cycle time-sharing calculation method, which significantly reduces DSP Slice consumption and greatly lowers the hardware implementation cost.
As shown in fig. 7, the hardware structure of the invention builds the multiplier array from as many multipliers as there are local template pixels, that is, each pixel in the local template corresponds to one multiplier, so the multiplier array requires 245 DSP Slices. The reference subarea and search subarea register groups provide the data inputs of the multiplier array, and each multiplier is responsible for multiplying one pair of pixel points from the reference subarea and the search subarea. The multiplier array completes the cross-correlation products of all pixel points under the local template in one clock cycle. All product terms are then summed by a pipeline adder to obtain the cross-correlation product sum under the local template.
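The multiplier array followed by a pipelined adder can be modelled in a few lines. This is a sketch only: the source states that the product terms are summed by a pipeline adder but does not describe its structure, so the pairwise (binary tree) reduction below is an assumption, as are the function names.

```python
def multiplier_array(ref_pixels, search_pixels):
    """One multiplier per pixel pair; all products form in the same step."""
    return [f * g for f, g in zip(ref_pixels, search_pixels)]


def adder_tree(values):
    """Pairwise reduction (assumed structure). Returns (sum, stage_count),
    where each stage would correspond to one pipeline register level."""
    level = list(values)
    if not level:
        return 0, 0
    stages = 0
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)               # pad odd levels with a zero input
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        stages += 1
    return level[0], stages


# 245 pixel pairs, matching the local template size given in the text.
ref_pixels = [i % 256 for i in range(245)]
search_pixels = [(3 * i + 7) % 256 for i in range(245)]
total, stages = adder_tree(multiplier_array(ref_pixels, search_pixels))
print(total, stages)
```

With a tree of this kind the latency grows with the logarithm of the number of inputs, while a new window can still enter every clock cycle.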
5) The full template product sum module M017. For the correlation coefficient under the full template, the full template sum of squares of the reference subarea, the full template sum of squares of the search subarea, the local template sum of squares of the reference subarea and the local template sum of squares of the search subarea have already been calculated at this point, so only the cross-correlation product sum under the full template remains to be calculated. With the existing multiplier array, however, all cross-correlation product terms under the full template cannot be calculated within one clock cycle. The cross-correlation product sum under the full template is therefore calculated over multiple clock cycles. The remaining pixel points of the full template are divided into three parts, as shown in fig. 9, each containing an equal number of pixel points. The multiplier array is time-shared to calculate the product terms of the three parts, and the sum of the results is the full template product sum.
6) The full template correlation coefficient calculation module M016. The calculations above yield the full template sum of squares of the reference subarea, the full template sum of squares of the search subarea, the local template sum of squares of the reference subarea, the local template sum of squares of the search subarea, the full template product sum and the local template product sum. When the full-template correlation coefficient needs to be calculated, the system generates a full template calculation flag bit. This signal stops the data output of the search area cache module, so that the data in the search subarea register group module, and therefore the position of the sliding window, remains unchanged. During a wait of three clock cycles, the state machine control module directs the multiplier array to perform the product operations on the full template data. The full template sum of squares of the reference subarea, the full template sum of squares of the search subarea and the full template product sum are then input to calculate the correlation coefficient under the full template. The state machine control module selects which data is processed by the multiplier array module in each state, and after the calculation is finished the parallel shift register group resumes shifting to load the data of the next window.
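The time-shared reuse of the multiplier array can be sketched as three passes over the pixels that lie outside the local template. The partition below is an assumption, since fig. 9 is not reproduced here: a 31 × 31 template minus 245 local pixels leaves 716 pixels, which is not a multiple of three, so the sketch uses a near-equal split in which every part still fits the 245-multiplier array. The function names are hypothetical.

```python
ARRAY_SIZE = 245           # multipliers available (from the text)
REMAINING = 31 * 31 - 245  # full-template pixels outside the local template


def split_parts(n, parts=3):
    """Near-equal split of n items into `parts` chunks (assumed partition)."""
    base, extra = divmod(n, parts)
    return [base + (1 if i < extra else 0) for i in range(parts)]


def time_shared_product_sum(ref_rest, search_rest, local_product_sum):
    """Add the products of the remaining pixels, one part per clock cycle,
    to the local-template product sum that is already available."""
    total, start, cycles = local_product_sum, 0, 0
    for size in split_parts(len(ref_rest)):
        assert size <= ARRAY_SIZE          # each pass must fit the array
        f = ref_rest[start:start + size]
        g = search_rest[start:start + size]
        total += sum(a * b for a, b in zip(f, g))
        start += size
        cycles += 1
    return total, cycles


ref_rest = [(5 * i + 1) % 256 for i in range(REMAINING)]
search_rest = [(7 * i + 3) % 256 for i in range(REMAINING)]
print(split_parts(REMAINING), time_shared_product_sum(ref_rest, search_rest, 0))
```

The design trade is visible in the cycle count: three extra cycles per candidate window in exchange for not instantiating multipliers for the other 716 pixels.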
7) The local template correlation coefficient calculation module M014. Through the calculation, the full-template square sum of the reference subarea, the full-template square sum of the search subarea, the local-template square sum of the reference subarea, the local-template square sum of the search subarea, the full-template product sum and the local-template product sum can be obtained. By a simple multiplication and division operation unit, as shown in fig. 8, the local template square sum of the reference sub-area, the local template square sum of the search sub-area, and the local template product sum are input to calculate the correlation coefficient under the local template.
Comparing the obtained correlation coefficient value with a threshold value by using a comparator, if the correlation coefficient value is smaller than the threshold value, indicating that the correlation between the search subarea and the reference subarea is poor, and excluding the search subarea to calculate the next search subarea; if the correlation coefficient value is greater than or equal to the threshold value, it indicates that the correlation between the search sub-area and the reference sub-area is better, and the correlation coefficient value under the full template needs to be further calculated to determine whether the window is the best matching window.
8) The state machine control module M005. The state machine has five states in total. In state S0, the products of the search subarea and the reference subarea under the local template are calculated; if the input of the reference subarea template has finished, the state machine jumps to S1; otherwise it remains in S0 while the full template calculation flag bit is 0 and jumps to S2 when the flag bit is 1. In state S1, the squares of the reference subarea under the local template are calculated; when this calculation is complete, the state machine jumps to S0, and this state needs to be executed only once for each matching point. In state S2, the products of the reference subarea and the first part of the search subarea under the full template are calculated, after which the state machine jumps to S3. In state S3, the products for the second part are calculated, after which the state machine jumps to S4. In state S4, the products for the third part are calculated, after which the state machine jumps to S0. States S2, S3 and S4 are one-way sequential execution states.
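The transitions described above can be written down as a small transition function. This is a sketch, not the controller of the invention: the signal names `ref_loaded`, `full_flag` and `squares_done` are hypothetical, `ref_loaded` is treated as a one-cycle pulse, and giving it priority over the full template flag in S0 is an assumption based on the order in which the text lists the conditions.

```python
from enum import Enum


class State(Enum):
    S0 = 0  # local-template products of search and reference subareas
    S1 = 1  # local-template squares of the reference subarea
    S2 = 2  # full-template products, first part
    S3 = 3  # full-template products, second part
    S4 = 4  # full-template products, third part


def next_state(state, ref_loaded=False, full_flag=False, squares_done=False):
    """Return the next state for the given (hypothetical) input signals."""
    if state is State.S0:
        if ref_loaded:                  # reference template input finished
            return State.S1
        return State.S2 if full_flag else State.S0
    if state is State.S1:
        return State.S0 if squares_done else State.S1
    if state is State.S2:               # S2 -> S3 -> S4 -> S0 is one-way
        return State.S3
    if state is State.S3:
        return State.S4
    return State.S0


# A candidate window triggers the three-cycle full-template sequence.
state, trace = State.S0, []
for flag in (False, True, False, False, False):
    state = next_state(state, full_flag=flag)
    trace.append(state.name)
print(trace)
```

The trace shows the machine idling in S0, stepping through S2, S3 and S4 once the flag is raised, and returning to S0.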
The correlation coefficient of the local template correlation coefficient calculation module M014 is output to the threshold comparison module M012, some non-matching points are excluded by comparison with the threshold, the selection of the threshold determines the number of excluded non-matching points, if the threshold is selected too large, the best matching point may be excluded, resulting in matching failure; if the threshold value is selected too small, the number of candidate matching points participating in the full-template calculation is increased, and the matching efficiency is reduced. The invention adopts self-adaptive selection of threshold values.
A histogram statistics module collects statistics on the correlation coefficients under the local template. A memory with a data depth of 100 and a data bit width of 15 bits counts how many correlation coefficients produced by the local template correlation coefficient calculation module fall into each interval; addresses 1 to 100 of the memory represent correlation coefficient values of 0.01 to 1 respectively. The calculated correlation coefficient under the local template is multiplied by the fixed value 100 in a floating-point multiplier, and the result is used as the read address of the memory. The value stored at that address is read out, incremented by 1 and written back to the same address, which completes the count for that coefficient. When the local template has been evaluated over the whole search area, the unit has completed the statistics of the correlation coefficients. The stored values are then read out one by one from the highest address to the lowest and accumulated. The accumulation stops when the running sum reaches 1% to 10% of the total count, and the correlation coefficient corresponding to the address reached at that moment is used as the threshold for the next matching point.
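The histogram-based threshold selection can be sketched as follows. Several details are assumptions, because the source does not specify them: the address is obtained by truncating the product to an integer, coefficients that would map below address 1 are clamped to address 1, and the `fraction` parameter stands for a value chosen from the 1% to 10% range. The class and method names are hypothetical, and the 15-bit counter width is not modelled.

```python
class ThresholdHistogram:
    """100-bin histogram of local-template correlation coefficients."""

    BINS = 100

    def __init__(self):
        # Index 0 is unused so that indices 1..100 match the addresses in the text.
        self.counts = [0] * (self.BINS + 1)

    def record(self, coeff):
        """Count one coefficient: address = coefficient x 100 (truncated, assumed)."""
        addr = int(coeff * 100)
        addr = min(max(addr, 1), self.BINS)   # clamp into 1..100 (assumed)
        self.counts[addr] += 1

    def next_threshold(self, fraction=0.05):
        """Accumulate from the highest address downwards and stop once the
        running sum reaches `fraction` of all recorded coefficients."""
        total = sum(self.counts)
        if total == 0:
            raise ValueError("no coefficients recorded")
        running = 0
        for addr in range(self.BINS, 0, -1):
            running += self.counts[addr]
            if running >= fraction * total:
                return addr / 100
        return 0.01


# Demonstration: most windows correlate weakly, a few correlate strongly.
hist = ThresholdHistogram()
for _ in range(90):
    hist.record(0.505)
for _ in range(10):
    hist.record(0.955)
print(hist.next_threshold(fraction=0.05))
```

A smaller fraction yields a higher threshold and fewer full-template candidates, at a greater risk of excluding the best match; a larger fraction does the opposite.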