WO2013128343A2 - Method for estimating a model on multi-core and many-core MIMD architectures - Google Patents

Method for estimating a model on multi-core and many-core MIMD architectures

Info

Publication number
WO2013128343A2
Authority
WO
WIPO (PCT)
Prior art keywords
model
mss
noi
computation
core
Prior art date
Application number
PCT/IB2013/051382
Other languages
French (fr)
Other versions
WO2013128343A3 (en)
Inventor
Francesco DIOTALEVI
Amir Fijany
Giulio Sandini
Original Assignee
Fondazione Istituto Italiano Di Tecnologia
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fondazione Istituto Italiano Di Tecnologia filed Critical Fondazione Istituto Italiano Di Tecnologia
Publication of WO2013128343A2 publication Critical patent/WO2013128343A2/en
Publication of WO2013128343A3 publication Critical patent/WO2013128343A3/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units


Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention concerns a method for estimating a model on multi-core and many-core MIMD (Multiple Instruction, Multiple Data) architectures including processors and a global memory using the RANSAC (RANdom Sample Consensus) algorithm, particularly in connection with image processing applications for homography model estimation. In accordance with the invention, to accelerate computation the following steps are provided for this method: a) incorporating backtracking into RANSAC, and b) in a parallel environment, implementing the algorithm obtained by backtracking as a cooperative search algorithm for estimating the model.

Description

Method for estimating a model on multi-core and many-core MIMD architectures
The invention concerns a method for estimating a model on multi-core and many-core MIMD (Multiple Instruction, Multiple Data) architectures including processors and a global memory using the RANSAC (RANdom Sample Consensus) algorithm, particularly in connection with image processing applications for homography model estimation.
Background of the invention
Many computer vision applications require a robust estimation algorithm to determine model parameters from a set of data which might contain a significant portion of outliers. The RANSAC (RANdom Sample Consensus) algorithm, originally developed by Fischler and Bolles [1], has become the most widely used robust estimator in the field of computer vision [2]. For example, it has been used in applications such as stereo matching [3], motion segmentation [4], and mosaicing [5].
RANSAC is an iterative method to estimate parameters of a certain mathematical model from a set of data which may contain a large number of outliers. It represents a hypothesize-and-verify framework [6]. Each iteration of RANSAC consists of two steps: first, generation of a hypothetical model based on a sample subset of data, and then evaluation of the hypothetical model by using the whole set of data. This iterative procedure is repeated until the probability of finding a better model drops below a certain threshold and the iterations terminate.
For many applications, a real-time implementation of RANSAC is desirable. However, its computational complexity represents a major obstacle to achieving such real-time performance. The computational complexity of RANSAC is a function of the number of required iterations, i.e., the number of generated hypothetical models, and the size of the data set. In fact, RANSAC can often find the correct model even for high levels of outliers [6].
However, the number of hypothetical models required to achieve such an exact model increases exponentially, leading to substantial computational cost [6]. Consequently, there has been significant effort to improve the performance of RANSAC by either reducing the number of models [7,8,9] or by reducing the size of data set for model evaluation [10].
An efficient alternative to improve the performance of RANSAC is to speed up the computation by exploiting parallelism. However, such a parallel implementation has not been extensively and rigorously considered in the prior art. In fact, it seems that the only reported work on parallel implementation of RANSAC is [11], wherein a very limited parallelism has been exploited. In [11], the implementation of the pRANSAM algorithm, a limited parallelization of the RANSAM algorithm [12], which is an enhancement of the original RANSAC, on an Intel multi-core processor chip has been considered. The pRANSAM is implemented on a system equipped with an Intel Core 2 Quad processor, representing a very limited parallel implementation. The results presented in [11] show that the achievable speedup depends on both the number of processing nodes and the operating system.
However, the emergence of massively parallel architectures provides a unique opportunity to exploit a large degree of parallelism in the computation of RANSAC to speed up its application. In fact, some of the emerging highly parallel architectures such as the Tilera [13], a many-core MIMD architecture, and the ClearSpeed CSX, a SIMD architecture [16], in addition to providing significant computing power have a very low power consumption, making them particularly suitable for embedded applications. In addition, there are radiation-hardened versions of both Tilera and CSX [14,15] which make them extremely suitable for space applications. In fact, the emergence of these high-performance, low-power, and radiation-hardened parallel architectures provides a unique opportunity for achieving a fast computation of RANSAC for many embedded applications and even for space applications. However, the main challenge in efficient application of these novel architectures is in the development of appropriate parallel algorithms to fully exploit their features.
One can consider a rather straightforward parallel implementation of RANSAC by exploiting parallelism in the computation of each iteration. Note that, at each iteration, one and the same model is evaluated for all the elements of the data set. This represents a data-parallel computation since the evaluation for all the elements of the data set can be performed in parallel. However, this approach represents a rather fine-grain parallelism which imposes certain limitations in terms of achievable speedup. These limitations can perhaps explain the lack of more extensive works on parallelization of RANSAC. The inventors of the present invention have found that a more promising approach, used in the present application, is based on a full parallelization of the whole computation of RANSAC. Such a parallel implementation approach for computation of RANSAC is disclosed by the inventors for the Cell Processor, a MIMD-SIMD architecture [17], and the CSX SIMD architecture [16]. In both approaches a significant performance in the computation is achieved. However, both these approaches assume, and indeed require, a high degree of regularity in the computation. That is, the evaluation of a model is performed for the whole data set.
RANSAC Algorithm
As mentioned before, RANSAC has become a fundamental tool in computer vision and image processing applications, and variations to the original algorithm have been presented to improve its speed and accuracy. Despite various modifications, the core of the RANSAC algorithm consists of the following two steps. First, a minimal sample set (MSS) is randomly selected from the dataset. The cardinality of the MSS is the smallest number of data sufficient to determine the model parameters. Then, the parameters of the model are computed using only the MSS elements. Next, RANSAC determines the set of data in the entire dataset which are consistent with the model and parameters estimated from the MSS in the first step. This set of data is called the consensus set (CS). These steps are performed iteratively until the probability of finding a better CS drops below a certain threshold and RANSAC terminates.
To describe RANSAC more formally, assume that the dataset, consisting of N elements, is indicated as
D = {d_1, d_2, ..., d_N} and V denotes the parameter vector. Let S denote a selected MSS, and Err(V, d_i) be an appropriate function which indicates the error of fitting datum d_i in the model with parameter vector V. First, S is randomly selected from dataset D and then the model parameters are computed based on the elements in S, which represents the model generation step. In the next step, model verification, RANSAC checks the elements in D which fit in the model. A datum d_i is considered to fit the model if its corresponding fitting error, Err(V, d_i), is less than a threshold δ. If this is the case, then d_i is added to the consensus set, CS. After that, the CS is compared with the best consensus set CS* obtained so far. If CS is ranked better than CS*, the best consensus set and best model parameters are updated. The sets CS could be ranked with various measures. In the original RANSAC [1], the consensus sets are ranked according to their cardinality. Other measures could also be considered [18]. In the present application, the ranking of the best consensus set is considered to be based on its cardinality, i.e., the maximum number of data which fit the model.
Finally, the algorithm checks if more iterations are needed. As mentioned before, usually a large number of iterations is required. Assume p is the probability of selecting an inlier from dataset D. Thus, the probability of selecting an MSS, S, that produces an accurate estimation of the model parameters will be p^s, where s is the cardinality of S. So, the probability of selecting an MSS which contains at least one outlier is (1 - p^s). If the algorithm iterates h times, the probability that all selected MSSs contain outliers is (1 - p^s)^h. Consequently, h should be picked large enough so that (1 - p^s)^h becomes equal to or smaller than an acceptable failure threshold, denoted by ε. Therefore, the required iteration number, T_itr, is obtained as:
T_itr = log ε / log(1 - p^s) (1)
However, p is not a priori known. A lower bound on p could be estimated as N_I/N, where N_I is the cardinality of CS. The estimation of p is then updated as the algorithm progresses [6].
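As an illustrative numerical example (not part of the reported experiments): for homography estimation the MSS cardinality is s = 4; with an inlier ratio p = 0.5 and failure threshold ε = 0.01, Eq. (1) gives T_itr = log(0.01) / log(1 - 0.5^4) ≈ 72 iterations, whereas p = 0.3 already requires T_itr = log(0.01) / log(1 - 0.3^4) ≈ 567 iterations, showing how quickly the required number of models grows as the outlier level increases.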
However, for any practical problem, a large number of iterations is needed before terminating the process. In fact, for many practical real-time problems with a required fixed computation time, a fixed number of iterations is chosen a priori [10]. This would indicate that a large number of models can be generated and validated independently and in parallel. Our proposed approach to parallel implementation of RANSAC, similar to our previous works in [16,17], can be considered as a multi-stage process wherein, at each stage, a large number of models are generated and evaluated in parallel. A check is then performed at the end of each stage to determine whether more stages are needed. Note that, if the number of hypothetical models is fixed a priori, or for real-time applications wherein a fixed computation time is given, then our parallel implementation can be performed in one single stage wherein all the hypothetical models are generated and evaluated in parallel. This approach leads to a massive parallelism in the computation which might be limited only by the resources of the target parallel computing architecture.
Disclosure of the invention
One object underlying the invention is to provide a method for estimating a model on multi-core and many-core MIMD (Multiple Instruction, Multiple Data) architectures including processors and a global memory using the RANSAC (RANdom Sample Consensus) algorithm which can be easily implemented in an integrated circuit and which allows faster computation than hitherto known methods based on RANSAC.
This object is attained by a method for estimating a model on multi-core and many-core MIMD (Multiple Instruction, Multiple Data) architectures including processors and a global memory using the RANSAC (RANdom Sample Consensus) algorithm, particularly in connection with image processing applications for homography model estimation, including the steps a) incorporating backtracking into RANSAC, and
b) in a parallel environment, implementing the algorithm obtained by backtracking as a cooperative search algorithm for estimating the model. In accordance with the invention, the efficiency of the computation of RANSAC, which represents one of the most computation-intensive image processing tasks since it requires the evaluation of a large number of models from a given data set, is extensively increased by exploiting a massive degree of parallelism, which is the key enabling factor for many of its applications. The method of the invention comprises a novel and fast algorithm for highly parallel implementation of RANSAC on multi-core and many-core MIMD architectures by incorporating the concept of backtracking in the computation. This variant of RANSAC in accordance with the invention is used as a cooperative search algorithm with excellent features for highly parallel implementation. This parallel implementation results in an asynchronous algorithm with a very limited communication requirement. In multi-core and many-core MIMD architectures, any processor performs a global broadcast if and when it finds a partial solution better than the previous one. As discussed below for the case of the Tilera architecture, practical results clearly demonstrate that excellent speedup in the computation can be achieved by using 57 cores of the Tilera. In fact, for certain cases, the cooperative search algorithms used in the method of the invention even achieved super-linear speedup, i.e., a speedup greater than 57.
Generally, the proposed approaches to increase the efficiency of RANSAC can be classified into two groups (see [6] for a detailed discussion). In the first group, which aims at optimizing model generation, an attempt is made to reduce the number of models generated by a more careful sampling of data for model generation. In the second group, which aims at optimizing model verification, an attempt is made to reduce the number of data considered for verification of a given model, that is, by early termination of the verification process.
The approach used in the invention belongs to the second group and attempts to reduce the number of data considered during the verification stage by introducing the concept of backtracking. To see this, consider the evaluation of a given model, denoted M_j, during the model verification with N data. Let CS* denote the best consensus set achieved so far in the computation, and CS_{j,i} the consensus set obtained for model j after evaluating a number i of data. At this point, a simple check can determine whether the verification of model j needs to be continued. In fact, if
|CS_{j,i}| + N - i < |CS*| (2) then the verification can be stopped since, even assuming that the rest of the data are all inliers, the achievable consensus set would not be better than the already achieved CS*. That is, Eq. (2), by providing an upper bound in the computation, determines whether the backtracking, or pruning of the rest of the computation, can take place. It should be emphasized that for a backtracking strategy to be efficient, the test should be as computationally inexpensive as possible. To this end, Eq. (2) is very simple and fast to implement and can indeed be performed as an integer operation.
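For illustration, the test of Eq. (2) reduces to a single integer comparison, as in the following minimal C sketch (the function and variable names are illustrative assumptions, not taken from the original implementation):

/* Backtracking test of Eq. (2), performed as a pure integer operation.
 * noi:     number of inliers found so far for the current model (|CS_{j,i}|)
 * i:       number of data elements evaluated so far
 * n:       total size of the data set (N)
 * best_cs: cardinality of the best consensus set CS* found so far
 * Returns nonzero when even an all-inlier remainder could not beat CS*,
 * so verification of the current model can be pruned. */
static inline int can_backtrack(int noi, int i, int n, int best_cs)
{
    return noi + (n - i) < best_cs;
}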
It should be mentioned that the approach used in the invention is fundamentally different from other approaches such as the T_(d,d) test [2] and the bail-out test [19]. In these approaches the test is performed by considering a subset of data and hence their success depends on the distribution of inliers (see [6] for further discussion). However, in the approach used in the invention, no assumption on the distribution of data is made and no a priori information is needed.
The approach used in the invention can be efficiently implemented on a MIMD parallel architecture wherein each processor performs the computation of RANSAC by using the backtracking strategy. In this case, the whole data set is moved to the memory of all processors but, in order to assure that duplicate computations are avoided, the models are stored in a global memory. Each processor then loads the next available model for verification from the global memory. This would lead to a fully parallel implementation since there is no need for communication among processors. Note that this approach is not suitable for SIMD architectures due to the irregularity of the computation, since each processor would have to perform different computations. However, it is very efficient for MIMD architectures as disclosed below. In the present description and in the claims, this strategy is referred to as the parallel algorithm.
Parallel Cooperative Search Algorithm for Computation of RANSAC
In a parallel implementation environment, the efficiency of the RANSAC-with-backtracking algorithm in accordance with the invention is further improved by introducing the concept of cooperative search. This can be achieved by making the best consensus set, CS*, a global variable accessible to all processors. In this case, if a processor during its computation finds a consensus set which is better than the one previously calculated by all processors, then it updates the best consensus set, which is communicated to all processors. As a result, the whole computation is performed as a cooperative search in which all processors use the best global result to further improve the efficiency by enabling an even earlier backtracking. This would lead to a fully asynchronous parallel computation in which a processor communicates with other processors if it finds a consensus set better than the globally existing one. As will be shown below, this cooperative search strategy leads to much better, and indeed excellent, results than the parallel strategy described above, even enabling super-linear speedup for certain cases. In the present application and in the claims, this strategy is referred to as the cooperative algorithm.
Preferred embodiments of the invention
In the following an embodiment of the invention will be described in more detail along the drawing; in the drawing
Fig. 1a schematically shows the Tilera architecture; Fig. 1b schematically shows the tile architecture;
Fig. 2 schematically shows the error computation caused by the generated model in the Tilera architecture;
Fig. 3 shows generic core activity during the parallel RANSAC algorithm;
Fig. 4 shows generic core activity during the cooperative RANSAC algorithm;
Fig. 5 shows the performance of the parallel and cooperative search algorithms with 1024 ds elements and an MSS_set of: (a) 16384, (b) 8192, (c) 4096, (d) 2048 and (e) 1024;
Fig. 6 shows the performance of the parallel and cooperative search algorithms with 2048 ds elements and an MSS_set of: (a) 16384, (b) 8192, (c) 4096, (d) 2048 and (e) 1024;
Fig. 7 shows a comparison of the traditional and backtracking RANSAC algorithms with 4096 models and a dataset with 1024 data; and
Fig. 8 shows a comparison of the traditional and backtracking RANSAC algorithms with 8192 models and a dataset with 1024 data.
In the following, the target parallel architecture, the TILEPro64 architecture, with emphasis on the key features employed in our implementation, i.e. memory organization and on-chip interconnection network, is briefly reviewed with reference to Fig. 1a and Fig. 1b.
Tilera architecture
TILEPro64 is a many-core chip consisting of 64 processing cores, called tiles, organized in a two-dimensional mesh architecture. Each tile consists of three main parts, i.e. Processor Engine, Cache Engine, and Switch Engine [13]. The processor engine is a 32-bit 3-way Very Long Instruction Word (VLIW) integer processor unit with two/three instructions per bundle. Each processor engine has its own Program Counter (PC) and can run programs independently of the other tiles. The cache engine provides an L1 instruction cache, an L1 data cache and a combined L2 cache for each tile. The cache engine is also equipped with a DMA controller for fast memory data transfer between tiles, and between tiles and external memory. Finally, the last part, the switch engine, is responsible for tile interconnections. The tiles are connected using six different on-chip interconnection networks, including four system accessible networks and two user accessible networks. The on-chip interconnection networks are also used for data transfer between the tiles and the external memory and I/O interfaces. In the following two subsections more details on TILEPro64 memory organization and communication architecture are provided.
Memory Organization
The TILEPro64 architecture provides a 36-bit physical memory addressing space which is globally shared between all 64 tiles for data communication. The physical addressing space has been distributed over four different DDR2 RAMs to balance memory bandwidth. This feature can be employed for parallel read/write operations from different memory modules. Each tile has its own L1 instruction cache (16KB), L1 data cache (8KB) and L2 combined cache (64KB). This provides 5.5MB of on-chip cache for TILEPro64. TILEPro64 supports hardware cache coherency management to guarantee a coherent view of data memory to all tiles. If the requested data is not found in the local L1/L2 cache, at first the adjacent distributed coherent cache is searched and, in case of a data miss, the request is passed to the external memory. Data access times for L1, L2, the adjacent distributed coherent cache and the external memory are equal to 2, 8, 35 and 69 cycles, respectively. As a result, in high-performance applications, data distribution should be performed in such a way as to maximize local memory access. The cache engine has a DMA controller for memory data transfer. The processor engine initializes the DMA controller for data transfer and then continues the program execution. This feature can be used efficiently for background data transfer in high-performance applications. The Tilera API library, iLib, provides a set of functions to allocate memory, synchronize shared memory and perform DMA transfers.
Communication Architecture
The TILEPro64 architecture provides an on-chip interconnection network, iMesh, which is responsible for all data communications between tiles, and between tiles and I/O devices. The iMesh consists of two different classes of independent networks, i.e. Static Network and Dynamic Network. The static network uses a circuit switching mechanism to establish a path between source and destination to send data. This network is user accessible and is suitable for scalar data streaming between tiles. The dynamic network includes the User Dynamic Network (UDN), Memory Dynamic Network (MDN), Tile Dynamic Network (TDN), Coherence Dynamic Network (CDN) and I/O Dynamic Network (IDN). The UDN is used for data communication between tiles. The MDN and TDN are used for memory data transfer between tiles, and between tiles and external memories. The CDN is used for cache coherency data transfer between tiles' caches. And finally, the IDN is responsible for data transfer between I/O devices and memory. Only the UDN is user accessible; the others are dedicated to system level functions. The Tilera API library, iLib, provides a set of functions to use the underlying interconnection network for data transfer between the tiles. From a software development point of view, there are two types of channels, Raw Channels and Buffered Channels [20]. Raw channels use the existing hardware buffers in the output port of the switch engine, so they have a limited buffer size and provide high bandwidth communication (3.93 bytes/cycle). On the other hand, buffered channels use memory to provide a virtually unlimited buffer size, but the bandwidth is very low in comparison with the raw channels (1.25 bytes/cycle).
Implementation of parallel and cooperative search algorithms
In this section, various aspects of the implementation of both the parallel and cooperative algorithms for homography model estimation are discussed. A homography is a linear transformation in projective space which relates two images of a planar scene taken from different views by a pin-hole camera. One of the images is denoted as source and the other as destination. For model generation, the approach in [17] is used. Briefly, the conventional approaches to model generation for homography model estimation need a Singular Value Decomposition (SVD) computation which requires double precision floating-point operations. In contrast, our proposed method in [17] is not only faster than SVD but also requires only single precision floating-point operations.
In describing the search algorithms, the following notations are used. The data set, denoted as ds, consists of a set of source points and corresponding destination points. The i-th element of ds is denoted as:
d_i = [(x_s^i, y_s^i), (x_d^i, y_d^i)] with 0 ≤ i < N (3) where N indicates the size of ds.
Starting from ds, a set of different MSS (the MSS_set) is generated, where each element is constituted by 4 randomly chosen elements of ds, i.e., an element of the MSS_set is given as:
MSS_i = {d_j, d_k, d_l, d_m} (4)
where 0 ≤ j, k, l, m < N and j ≠ k ≠ l ≠ m.
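Purely for illustration, the elements of ds and of the MSS_set can be represented with data structures along the following lines (the type and field names are assumptions introduced here, not taken from the original implementation):

/* One element of ds: a source point and its corresponding destination point. */
typedef struct {
    float xs, ys;   /* source point (x_s, y_s)      */
    float xd, yd;   /* destination point (x_d, y_d) */
} DsElement;

/* One element of the MSS_set: four distinct indices j, k, l, m into ds
 * from which one homography model is estimated. */
typedef struct {
    int idx[4];
} Mss;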
The single core algorithm
For each MSS ∈ MSS_set, the following steps are performed:
- Model estimation
The Model estimation consists of computing the Homography model, denoted as H, by using the MSS selected from the MSS_set. As stated before, the method disclosed in [17] is used for this step. Note that this method requires single precision floating point computation.
- Model verification
The model verification consists of the execution of two different steps performed for each element of the data set ds: first, the test for backtracking as described above and given by Eq. (2), and second the error computation performed for updating the number of inliers for the generated model.
For each point in the data set ds, it is necessary to compute the error caused by the generated model (homography H). The symmetric transfer error is defined as follows:
Error = ||H(x_s) - x_d||^2 + ||H^-1(x_d) - x_s||^2 (5)
i.e., the squared difference of the computed destination point with the expected destination point added to the squared difference of the computed back source point and the true starting point, as shown in Fig. 2.
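A minimal C sketch of the symmetric transfer error of Eq. (5) is given below, assuming the homography is applied in homogeneous coordinates; the function name, the row-major 3x3 layout of H and H^-1, and the parameter names are illustrative assumptions:

/* Symmetric transfer error of Eq. (5) for one correspondence.
 * h is the 3x3 homography H in row-major order, hinv its inverse H^-1.
 * (xs, ys) is the source point and (xd, yd) the destination point. */
static float symmetric_transfer_error(const float h[9], const float hinv[9],
                                      float xs, float ys, float xd, float yd)
{
    /* Forward mapping: computed destination point H(x_s). */
    float w  = h[6] * xs + h[7] * ys + h[8];
    float xf = (h[0] * xs + h[1] * ys + h[2]) / w;
    float yf = (h[3] * xs + h[4] * ys + h[5]) / w;

    /* Backward mapping: computed back source point H^-1(x_d). */
    float wb = hinv[6] * xd + hinv[7] * yd + hinv[8];
    float xb = (hinv[0] * xd + hinv[1] * yd + hinv[2]) / wb;
    float yb = (hinv[3] * xd + hinv[4] * yd + hinv[5]) / wb;

    /* Sum of the two squared Euclidean distances. */
    return (xf - xd) * (xf - xd) + (yf - yd) * (yf - yd)
         + (xb - xs) * (xb - xs) + (yb - ys) * (yb - ys);
}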
This operation is done once for each model verification loop. It has been observed that this approach enables the whole loop to be performed in half the time compared to the case in which H and H^-1 with floating-point values are used. The error induced by the integer computation is negligible; indeed, in test cases it was found that this strategy would lead to at most 2 erroneously estimated points out of a total of 1024 points, i.e., less than 0.2%. If the computed error is less than a fixed threshold, β, the count of inliers found for the estimated model is increased. At the end of the model verification phase, the consensus set CS is set to the number of found inliers (noi). Of course, if this value of CS is better than any previously computed CS, the best consensus set CS* and the best model found so far are updated. At the completion of the computation, i.e., when the MSS_set becomes void, the variable bestModel holds the best model found based on the MSS_set. A pseudo code of the single core computation is given below:
Algorithm 1: Pseudo Code for Single Core Computation
1. While MSS_set ≠ ∅ {
2. choose MSS ∈ MSS_set
3. Estimation of Model using grabbed MSS
// model verification
4. Set noi=0
5. For (i=0; i<#ds; i++) {
6. If (noi + (#ds - i)) < bestCS goto 2
7. If (ErrorComputation() < threshold) noi++
8. }
9. cs=noi;
10. if (cs > bestCS)
11. bestCS=cs and
12. bestModel=Model
13. }
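For illustration, the pseudo code above can be rendered as the following compact C sketch; it reuses the illustrative DsElement and Mss types and the symmetric_transfer_error() function sketched earlier, and assumes a model estimation helper estimate_model() whose name and signature are likewise assumptions:

typedef struct { float h[9], hinv[9]; } Model;   /* illustrative model type */

/* Assumed helper: estimates H (and H^-1) from the 4 correspondences of one MSS. */
void estimate_model(const DsElement *ds, const Mss *mss, Model *out);

void single_core_ransac(const DsElement *ds, int n_ds,
                        const Mss *mss_set, int n_mss,
                        float threshold, int *best_cs, Model *best_model)
{
    *best_cs = 0;
    for (int m = 0; m < n_mss; m++) {
        Model model;
        estimate_model(ds, &mss_set[m], &model);        /* model generation */

        int noi = 0;                                    /* number of inliers */
        for (int i = 0; i < n_ds; i++) {
            if (noi + (n_ds - i) < *best_cs)            /* Eq. (2): prune */
                break;
            if (symmetric_transfer_error(model.h, model.hinv,
                                         ds[i].xs, ds[i].ys,
                                         ds[i].xd, ds[i].yd) < threshold)
                noi++;
        }
        if (noi > *best_cs) {                           /* update bestCS/bestModel */
            *best_cs = noi;
            *best_model = model;
        }
    }
}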
The Parallel Algorithm Implementation
For the implementation of the parallel algorithm, each core reads a different MSS from the MSS_set. In practice, a shared index variable (IndexMSS), common to all cores, is used to point into the shared MSS_set and to read the corresponding MSS. Each core then continues the computation as explained in the previous section.
The difference from the previous single core implementation is essentially at the end of the computation, when the MSS_set does not have any other MSS to read. In this case, each core compares its bestCS with a shared value (bestShCS) and, if its bestCS is better, it updates bestShCS and also updates bestShModel with the bestModel that it has computed. To prevent concurrent accesses to the same resource, a mutex has been used for writing to the shared memory. At the end of the execution of all cores, the best consensus set and the best model found are located in the common memory, as shown in Fig. 3.
It is to be noted that, although the computations performed by the cores are independent, due to the overhead involved in the communication with the shared memory a perfect speedup cannot be achieved. The pseudo code for the parallel implementation is given below:
Algorithm 2 Pseudo Code for Parallel Implementation
1. While MSS_set ≠ ∅ {
2. Read the index and use it for choosing the MSS
3. choose MSS ∈ MSS_set
4. Estimation of Model using grabbed MSS
// model verification
5. Set noi=0
6. For (i=0; i<#ds; i++) {
7. If (noi + (#ds - i)) < bestCS goto 2
8. If (ErrorComputation() < threshold) noi++
9. }
10. cs=noi;
11. if (cs > bestCS)
12. bestCS=cs and
13. bestModel=Model
14. }
15. if (bestCS > bestShCS)
16. bestShCS=bestCS and
17. bestShModel=bestModel
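The two shared-memory mechanisms described above, the shared MSS index and the mutex-protected final update, can be sketched as follows. The original implementation uses the Tilera iLib primitives; POSIX threads and C11 atomics are used here purely as an illustrative stand-in, reusing the Model type sketched earlier:

#include <pthread.h>
#include <stdatomic.h>

static atomic_int index_mss = 0;          /* shared IndexMSS     */
static int   best_sh_cs = 0;              /* shared bestShCS     */
static Model best_sh_model;               /* shared bestShModel  */
static pthread_mutex_t sh_lock = PTHREAD_MUTEX_INITIALIZER;

/* Each core grabs the next unprocessed MSS index; returns -1 when exhausted. */
static int next_mss(int n_mss)
{
    int i = atomic_fetch_add(&index_mss, 1);
    return i < n_mss ? i : -1;
}

/* Final merge, performed once per core after its local loop finishes. */
static void publish_local_best(int best_cs_local, const Model *best_local)
{
    pthread_mutex_lock(&sh_lock);
    if (best_cs_local > best_sh_cs) {
        best_sh_cs = best_cs_local;
        best_sh_model = *best_local;
    }
    pthread_mutex_unlock(&sh_lock);
}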
The Cooperative Search Algorithm Implementation
As stated before, in the cooperative search algorithm each core has knowledge of the best CS obtained globally, i.e., by all other cores. In this scheme, each core broadcasts its bestCS value to all other cores if its bestCS is better than the one previously obtained globally. The pseudo code for the implementation of the cooperative search is given below:
Algorithm 3 Pseudo Code for Cooperative implementation
1. While MSS_set ≠ ∅ {
2. Read the index and use it for choosing the MSS
3. choose MSS ∈ MSS_set
4. Estimation of Model using grabbed MSS
// model verification
5. Set noi=0
6. For (i=0; i<#ds; i++) {
7. If (noi + (#ds - i)) < bestShCS goto 2
8. If (ErrorComputation() < threshold) noi++
9. }
10. cs=noi;
11. if (cs > bestShCS)
12. bestShCS=cs and
13. bestShModel=Model
14. }
Fig. 4 shows the activity of the generic i-th core during the computation of the cooperative search algorithm. The check between the locally computed model and the shared model is done at the end of each model verification loop. The check for backtracking, i.e., pruning, is done at each cycle of the model verification. Also in this implementation, a mutex was used to avoid concurrent updating of bestShCS and bestShModel.
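The key difference from the parallel algorithm, namely that every backtracking test reads the globally shared best consensus set, can be sketched as follows (again an illustrative stand-in using a plain shared variable with a mutex-protected, re-checked update, not the original iLib code):

/* Cooperative verification of one model: the pruning test of Eq. (2)
 * uses the shared best_sh_cs, enabling earlier backtracking. */
static void verify_model_cooperative(const Model *model,
                                     const DsElement *ds, int n_ds,
                                     float threshold)
{
    int noi = 0;
    for (int i = 0; i < n_ds; i++) {
        if (noi + (n_ds - i) < best_sh_cs)   /* global early pruning */
            return;
        if (symmetric_transfer_error(model->h, model->hinv,
                                     ds[i].xs, ds[i].ys,
                                     ds[i].xd, ds[i].yd) < threshold)
            noi++;
    }
    if (noi > best_sh_cs) {                  /* candidate new global best */
        pthread_mutex_lock(&sh_lock);
        if (noi > best_sh_cs) {              /* re-check under the lock */
            best_sh_cs = noi;
            best_sh_model = *model;
        }
        pthread_mutex_unlock(&sh_lock);
    }
}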
Results
The maximum number of cores available in the Tilera to perform the computation is 57. Both the parallel and cooperative algorithms have been run considering different numbers of models: 16384, 8192, 4096, 2048, and 1024. All the computations are performed for ds = 1024 and ds = 2048. For achieving a more accurate result, for each implementation 50 different runs have been performed by randomly permuting the MSS_set. Therefore, any result presented in the following is the average result of 50 different runs. Also, for measuring the speedup, each run is implemented on a single core and is then compared with the parallel and cooperative search implementations.
The performance of the parallel and cooperative search algorithms using 57 cores is presented and compared in Tables 1-5 as well as in Figs. 5 and 6. As can be seen, the cooperative search algorithm always achieves a better result than the parallel algorithm. This better performance is a clear indication of the advantage of cooperation among all cores. Also, both algorithms achieve better results for larger problems, i.e., with a larger number of models. Furthermore, the achieved speedups for all cases indicate that, despite the irregularity of the computation and a fully asynchronous implementation, a very good load balancing is achieved. In fact, some cores might evaluate a larger number of models and/or perform more validation for a given model, but all the cores remain busy computing as long as the MSS_set is not empty.
However, an interesting, but perhaps not surprising, result is the fact that, for certain cases, the cooperative search algorithm achieves super-linear speedup, i.e., a speedup greater than the number of cores, 57. The cases for which the cooperative search algorithm achieves super-linear speedup are shown with bold numbers in Tables 1-3. However, this super-linear speedup could have been somehow expected. Note that the speedup, SP, in the parallel computation is defined as
SP = T_s / T_p (6) where T_p and T_s denote the times of parallel and serial computation, respectively. Neglecting the overhead in the parallel computation, usually the total amount of computation performed by a parallel algorithm is at least equal to (but usually greater than) the amount of computation performed in the serial implementation. Therefore, a speedup greater than the number of processors cannot normally be achieved. However, here one is faced with a special case.
In fact, in the cooperative search algorithm of the invention, due to the introduction of the cooperation concept, for certain cases the total amount of computation performed by all cores can be less than the computation performed in a strictly serial implementation on a single core, resulting in a super-linear speedup. In fact, in this case, a given core might stop some computation for the validation of a given model based on the result received from other cores, computation which would otherwise have been performed by that core.
This can be better described by considering the serial implementation. Assume that a single core is validating 57 models in an arbitrary given order. Now assume that model 41 gives a very good result, much better than the previous 40 models. Here, obviously, the single core cannot backtrack from the already performed computation of the previous 40 models. However, in the cooperative search algorithm with 57 cores evaluating the same 57 models, one core will find this good model and communicate its result to the other cores, thus causing them to backtrack from their current computation.
As mentioned, the manifestation of the super-linear speedup would have been somehow expected. In fact, the situation is very similar to other non-deterministic algorithms for solving NP-complete problems, such as Branch-and-Bound for solving the Integer Programming problem. As is well known, reordering the data, and hence the computation, might often lead to a different computation time and sometimes a much faster computation. However, in a serial computation, only one ordering can be considered. In a parallel implementation, depending on the resources of the parallel architecture, a number of orderings can be considered simultaneously and, by using the concept of cooperative search, a much better efficiency can then be achieved. In fact, it is strongly believed that one of the main application areas for emerging massively parallel MIMD architectures is the faster computation of non-deterministic algorithms.
Figs. 7 and 8 show comparison graphs of the traditional and backtracking algorithms considering 4096 and 8192 models with a dataset composed of 1024 samples, run on a single core on an Intel i7-based PC. As can be taken from these figures, the backtracking strategy helps considerably in speeding up the algorithm. In the case of many outliers (90% and more), the two algorithms have the same performance because the backtracking strategy does not perform pruning during the verification phase.
This means:
  • The backtracking algorithm is faster than the traditional RANSAC algorithm on any processor.
  • The backtracking algorithm achieves super-linear speedup when performed as cooperative search RANSAC on parallel architectures.
Advantageous effects attained by the method of the invention
The emergence of massively parallel, low-power architectures such as the Tilera and the CSX 700 provides a unique opportunity to exploit a large degree of parallelism in the computation and achieve much better performance in embedded applications severely constrained by power consumption. However, the main challenge in the efficient application of these novel architectures is the development of appropriate parallel algorithms to fully exploit their features.
As discussed above, RANSAC is widely used in image processing applications for homography model estimation and indeed represents one of the most computation-intensive image processing tasks, since it requires the evaluation of a large number of models from a given data set. Increasing the efficiency of its computation by exploiting a massive degree of parallelism is therefore the key enabling factor for many of its applications. The method of the invention provides a novel and fast algorithm for the highly parallel implementation of RANSAC on multi-core and many-core MIMD architectures. The invention is further based on a novel variant of RANSAC that incorporates the concept of backtracking in the computation, and additionally makes use of a cooperative search algorithm with excellent features for highly parallel implementation. This parallel implementation results in an asynchronous algorithm with a very limited communication requirement: a processor performs a global broadcast if and when it finds a partial solution better than the previous one. Implementations on 57 cores of the Tilera architecture, for an extensive set of models and data with varying degrees of outliers, showed in practice that an excellent speedup in the computation can be achieved. For certain cases, the cooperative search algorithm even achieved super-linear speedup, i.e., a speedup greater than 57.
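For clarity, restating the conventional definition of speedup used here: with T_s and T_p denoting the serial and parallel computation times and P = 57 cores,

\[
SP = \frac{T_s}{T_p}, \qquad SP > 57 \iff T_p < \frac{T_s}{57},
\]

so super-linear speedup means that the 57 cooperating cores finish in less than one fifty-seventh of the single-core time.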
In addition to its low power consumption and excellent GOPS-per-Watt performance, a radiation-hardened version of the Tilera may be used for implementing the method of the invention, which makes it one of the best candidates for future aerospace applications.
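As an illustrative sketch only, a cooperative worker on a cache-coherent MIMD machine could be organized as follows in C11. The shared variable best_sh_cs mirrors bestShCS of claim 13, while worker(), next_mss, build_and_estimate() and error_computation() are hypothetical names introduced here for illustration, not part of the claimed method.

#include <stdatomic.h>

/* Hypothetical application-specific helpers (assumptions). */
void   build_and_estimate(int model_index, double model[9]);
double error_computation(const double model[9], int data_index);

/* Shared state in global memory, visible to all cores. */
static atomic_int next_mss;    /* index of the next model to verify     */
static atomic_int best_sh_cs;  /* size of the best consensus set so far */

void worker(int n_models, int n_ds, double threshold)
{
    double model[9];
    for (;;) {
        int m = atomic_fetch_add(&next_mss, 1);  /* grab the next model */
        if (m >= n_models)
            break;                               /* no models left      */
        build_and_estimate(m, model);
        int noi = 0, abandoned = 0;
        for (int i = 0; i < n_ds; i++) {
            /* Prune against the shared best: a better model found by
             * any other core immediately tightens this bound here too. */
            if (noi + (n_ds - i) < atomic_load(&best_sh_cs)) {
                abandoned = 1;
                break;
            }
            if (error_computation(model, i) < threshold)
                noi++;
        }
        if (abandoned)
            continue;
        /* Publish an improvement; the CAS loop keeps the update
         * race-free. Publishing the model itself alongside would need
         * a short critical section, omitted here for brevity.          */
        int cur = atomic_load(&best_sh_cs);
        while (noi > cur &&
               !atomic_compare_exchange_weak(&best_sh_cs, &cur, noi))
            ;  /* cur is refreshed on failure; retry while still better */
    }
}

Note that the communication pattern is exactly as described above: a core writes to the shared best only when it improves on it, and all other cores observe the tighter bound on their next pruning test.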
List of cited documents
[1] M. A. Fischler and R. C. Bolles, "Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography," Comm. ACM, vol. 24, no. 6, pp. 381-395, 1981.
[2] O. Chum and J. Matas, "Randomized RANSAC with T_(d,d) test," Proc. British Machine Vision Conference (BMVC'02), pp. 448-457, Sep. 2002.
[3] P. Pritchett and A. Zisserman, "Wide baseline stereo matching," Proc. Int. Conf. on Computer Vision (ICCV'98), Jan 1998, pp. 754-760.
[4] P. H. S. Torr, "Outlier detection and motion segmentation," Ph.D. dissertation, Dept. of Engineering Science, University of Oxford, 1995.
[5] P. McLauchlan and A. Jaenicke, "Image mosaicing using sequential bundle adjustment," Proc. British Machine Vision Conference (BMVC'00), Sep 2000, pp. 751-759.
[6] R. Raguram, J.-M. Frahm, and M. Pollefeys, "A comparative analysis of RANSAC techniques leading to adaptive real-time random sample consensus," Proc. 10th European Conf. on Computer Vision (ECCV '08), pp. 500-513, 2008.
[7] O. Chum and J. Matas, "Matching with PROSAC - progressive sample consensus," Proc. Int. Conf. on Computer Vision and Pattern Recognition (CVPR'05), Volume 1, pp. 220-226, June 2005.
[8] B. J. Tordoff and D. W. Murray, "Guided-MLESAC: Faster image transform estimation by using matching priors," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 10, pp. 1523-1535, 2005.
[9] D. R. Myatt, P. H. S. Torr, S. J. Nasuto, J. M. Bishop, and R. Craddock, "NAPSAC: High noise, high dimensional robust estimation," Proc. British Machine Vision Conference (BMVC'02), pp. 458-467, Sep. 2002.
[10] D. Nister, "Preemptive RANSAC for live structure and motion estimation," Mach. Vision Appl., vol. 16, no. 5, pp. 321-329, 2005.
[11] R. Iser, D. Kubus, and F. M. Wahl, "An efficient parallel approach to random sample matching (pRANSAM)," Proc. International Conf. on Robotics and Automation (ICRA'09), pp. 1199-1206, May 2009.
[12] S. Winkelbach, S. Molkenstruck, and F. M. Wahl, "Low-cost laser range scanner and fast surface registration approach," Proc. 28th Annual Symp. of the German Association for Pattern Recognition (DAGM'06), pp. 718-728, Sep 2006.
[13] Tilera, http://www.tilera.com, 2011.
[14] J. P. Walters, R. Kost, K. Sing, J. Suh, and S. Crago, "Software-Based Fault Tolerance for the Maestro Many-Core Processor," Proc. 2011 IEEE Aerospace Conf., March 2011.
[15] J. R. Marshall, D. Stanley, and J. E. Robertson, "Matching Processor Performance to Mission Application Needs," Proc. InfoTec 2011.
[16] A. Fijany and F. Hosseini, "Image Processing Applications on a Low-Power Highly Parallel SIMD Architecture," Proc. 2011 IEEE Aerospace Conf., Big Sky, MT, March 2011.
[17] A. Khalili, A. Fijany, F. Hosseini, S. Safari, and J.-G. Fontaine, "Fast Parallel Model Estimation on the Cell Broadband Engine," Proc. Int. Symp. on Visual Computing, Las Vegas, Nevada, USA, Nov. 2010.
[18] P. H. S. Torr and A. Zisserman, "MLESAC: a new robust estimator with application to estimating image geometry," Computer Vision and Image Understanding, vol. 78, no. 1, pp. 138-156, 2000.
[19] D. Capel, "An effective bail-out test for RANSAC consensus scoring," Proc. British Machine Vision Conf., pp. 629-638, 2005.
[20] D. Wentzlaff, et al., "On-Chip Interconnection Architecture of the Tile Processor," IEEE Micro, vol. 27 (5), pp. 15-31, 2007.
Tables
Table 1 - Performance of Parallel and Cooperative Search Algorithms for Computing 16384 Models.
Table 2 - Performance of Parallel and Cooperative Search Algorithms for Computing 8192 Models.
Table 3 - Performance of Parallel and Cooperative Search Algorithms for Computing 4096 Models.

                     ds = 2048 pairs of pixels              ds = 1024 pairs of pixels
Outliers        Parallel            Cooperative         Parallel            Cooperative
percentage  Avg.      Avg.      Avg.      Avg.      Avg.      Avg.      Avg.      Avg.
            SpeedUp   Model/s   SpeedUp   Model/s   SpeedUp   Model/s   SpeedUp   Model/s
5           37.76     57385     52.08     79162     42.50     72037     54.48     92345
10          38.47     53869     53.33     74684     42.59     69041     53.69     87032
20          38.73     48534     55.48     69531     42.65     64569     55.12     83443
30          38.58     44060     54.48     62218     42.31     60334     55.69     79411
40          38.06     39706     54.71     57071     41.59     56124     55.52     74915
50          38.43     36561     53.88     51251     42.00     53205     56.69     71818
60          39.16     33598     53.21     45648     42.35     50258     55.83     66248
70          42.05     32418     54.07     41689     44.40     49193     55.41     61385
80          47.40     32823     53.80     37254     47.99     48959     55.28     56401
90          49.44     32508     53.75     35347     50.16     49571     54.89     54250

Table 4 - Performance of Parallel and Cooperative Search Algorithms for Computing 2048 Models.
Table 5 - Performance of Parallel and Cooperative Search Algorithms for Computing 1024 Models.

Claims
1. A method for estimating a model on multi-core and many-core MIMD (Multiple Instruction, Multiple Data) architectures including processors and a global memory using the RANSAC (RANdom Sample Consensus) algorithm, particularly in connection with image processing applications for homography model estimation, characterized by the steps of
a) incorporating backtracking into RANSAC, and
b) in a parallel environment implementing the algorithm obtained by backtracking as a cooperative search algorithm for estimating the model.
2. The method of claim 1, wherein for defining the backtracking of step a), in the evaluation of a given model, denoted Mj, during a model verification with N data, with CS* denoting the best consensus set achieved so far in the computation and CSj,i denoting the consensus set obtained for model j after evaluating a number i of the data, it is determined whether the verification of model j needs to be continued.
3. The method of claim 2, wherein the verification is stopped if CSj,i + N - i < CS*.
4. The method of claim 2 or 3, wherein RANSAC is implemented on the MIMD parallel architecture by each processor performing the computation of the RANSAC by using the backtracking strategy, wherein the whole data set is moved to the memory of all processors.
5. The method of claim 4, wherein to assure that duplicate computations are avoided, the models are stored in the global memory, each processor then loading the next available model for verification from the global memory.
6. The method of one of claims 1 to 5, wherein for defining the cooperative search algorithm of step b) the best consensus set, CS*, is made a global variable accessible to all processors.
7. The method of claim 6, wherein, if a processor during its computation finds a consensus set which is better than the previously calculated one by all processors, then it updates the best consensus set which is communicated to all processors.
8. The method of one of claims 1 to 7, wherein the MIMD architecture is a Tilera MIMD architecture.
9. The method of claim 8, wherein the Tilera MIMD architecture is a TILEPro64 architecture.
10. The method of claim 8 or 9, wherein steps a) and b) are implemented for homography model estimation.
11. The method of claims 8, 9 or 10, wherein the pseudo code for single core computation is defined as:

1. While MSS_set ≠ ∅ {
2.   choose MSS ∈ MSS_set
3.   Estimation of Model using the grabbed MSS
     // model verification
4.   Set noi = 0
5.   For (i=0; i<#ds; i++) {
6.     If (noi + (#ds - i)) < bestCS goto 2   // pruning: model cannot beat bestCS
7.     If (ErrorComputation() < threshold) noi++
8.   }
9.   cs = noi;
10.  if (cs > bestCS)
11.    bestCS = cs and
12.    bestModel = Model
13. }

where an element of the MSS_set is given as:

MSS = {d_j, d_k, d_l, d_m}

where 0 ≤ j, k, l, m < N and j ≠ k ≠ l ≠ m.
12. The method of one of claims 8 to 11, wherein the pseudo code for parallel implementation is defined as:

1. While MSS_set ≠ ∅ {
2.   Read the index and use it for choosing the MSS
3.   choose MSS ∈ MSS_set
4.   Estimation of Model using the grabbed MSS
     // model verification
5.   Set noi = 0
6.   For (i=0; i<#ds; i++) {
7.     If (noi + (#ds - i)) < bestCS goto 2
8.     If (ErrorComputation() < threshold) noi++
9.   }
10.  cs = noi;
11.  if (cs > bestCS)
12.    bestCS = cs and
13.    bestModel = Model
14. }
15. if (bestCS > bestShCS)
16.   bestShCS = bestCS and
17.   bestShModel = bestModel

where an element of the MSS_set is given as:

MSS = {d_j, d_k, d_l, d_m}

where 0 ≤ j, k, l, m < N and j ≠ k ≠ l ≠ m.
13. The method of one of claims 8 to 12, wherein the pseudo code for cooperative implementation is defined as:

1. While MSS_set ≠ ∅ {
2.   Read the index and use it for choosing the MSS
3.   choose MSS ∈ MSS_set
4.   Estimation of Model using the grabbed MSS
     // model verification
5.   Set noi = 0
6.   For (i=0; i<#ds; i++) {
7.     If (noi + (#ds - i)) < bestShCS goto 2   // prune against the shared best
8.     If (ErrorComputation() < threshold) noi++
9.   }
10.  cs = noi;
11.  if (cs > bestShCS)
12.    bestShCS = cs and
13.    bestShModel = Model
14. }

where an element of the MSS_set is given as:

MSS = {d_j, d_k, d_l, d_m}

where 0 ≤ j, k, l, m < N and j ≠ k ≠ l ≠ m.
14. The method of one of claims 8 to 13, wherein the cooperative search algorithm of step b) is adapted to provide a super-linear speedup SP, the speedup in parallel computation being defined as SP = Ts/Tp, where Ts and Tp denote the times of the serial and the parallel computation, respectively.
15. A multi-core or many-core MIMD (Multiple Instruction, Multiple Data) architecture including processors and a global memory, using a method for estimating a model using the RANSAC (RANdom Sample Consensus) algorithm, particularly in connection with image processing applications for homography model estimation, according to any one of claims 1 to 14.