IMAGE FUSION METHODS AND APPARATUS
FIELD OF THE INVENTION
The invention relates to fusing images and, more particularly, to using one or more transformed source images to create a fused image.
BACKGROUND OF THE INVENTION Image fusion approaches based on wavelet transforms have been proposed and studied as demonstrated in a number of articles, for example: A CATEGORIZATION AND STUDY OF MULTISCALE-DECOMPOSITION-BASED IMAGE FUSION SCHEMES by Z Zhang and RS Blum in Proceedings of the IEEE 87(8): 1315-1328, 1999; WAVELET BASED SENSOR FUSION by T Huntsberger and B Jawerth in Proc SPIE 2059:488-498, 1993; WAVELETS AND IMAGE FUSION by LJ Chipman, TM Orr, and LN Graham in Proc SPIE O Ronen, et al.69:208-219, 1995; IMAGE FUSION USING STEERABLE DYADIC WAVELET TRANSFORM by I Koren, A Laine, and F Taylor in Proc IEEE Intl Conf Image Processing, 1995, pp 232-235; PERCEPTUAL BASED HYPERSPECTRAL IMAGE FUSION USING MULTIRESOLUTION ANALYSIS by TA Wilson, SK Rogers, and LR Myers in Optical Engineering 34(11) : 3154-3164, 1995; MULTISPECTRAL IMAGE FUSION USING WAVELET TRANSFORM by X Jiang, L Zhou, and Z Gao in Proc SPIE 2898:35-42, 1996; CONCEALED WEAPON DETECTION: AN IMAGE FUSION APPROACH by MK Uner, LC Ramac, and PK Varshney in Proc SPIE 2942: 123-132, 1997; MULTISENSOR IMAGE FUSION USING THE WAVELET TRANSFORM by H Li, BS Manjunath, and SK Mitrain Graphical Models and Image Processing 57:235-245, 1995; and A REGION BASED IMAGE FUSION SCHEME FOR CONCEALED WEAPON DETECTION by Z Zhang and RS Blum in Proc 31st Annual Conference on Information Sciences and Systems, 1997, pp 168-173.
Although the wavelet transform is sometimes interpreted as a "decorrelator" which attempts to make each wavelet coefficient statistically independent of all others, the research from a number of researchers has demonstrated that there are still some important dependencies between wavelet coefficients, for example: EMBEDDED IMAGE CODING USING ZEROTREES OF WAVELET COEFFICIENTS by J Shapiro in IEEE Trans Signal Processing 41 :3445-3462, 1993; WAVELET-BASED STATISTICAL SIGNAL PROCESSING USING HIDDEN MARKOV MODELS by M Crouse, RD Nowak, and RG Baraniuk in IEEE Trans Signal Processing 46(2) :886-902, 1998; AN INVESTIGATION OF WAVELET-BASED IMAGE CODING USING AN ENTROPY-CONSTRAINED QUANTIZATION FRAMEWORK by K Ramchandran and MT Orchard in IEEE Trans signal processing 46(2) : 342-353, 1998; CHARACTERIZATION OF SIGNALS FROM MULTISCALE EDGES by S Mallat and S Zhong in IEEE Trans Pattern Anal Machine Intell 14: 710-732, 1992; and SINGULARITY DETECTION AND PROCESSING WITH WAVELETS by S Mallat and W Hwang
in IEEE Trans Information Theory 38 :617-643, 1992. The dependencies may be described using the statistical properties called clustering and persistence. Clustering is a property that states that, if a particular wavelet coefficient is large (small), then adjacent coefficients are very likely also to be large (small) . Persistence is a property that states that large (small) values of wavelet coefficients tend to propagate across scales. In J Shapiro and M Crouse, et al., these properties were studied and applied in image coding, signal detection and estimation. A hidden Markov model (HMM) approach was suggested by L Rabiner in A TUTORIAL ON HIDDEN MARKOV MODELS AND SELECTED APPLICATIONS IN SPEECH RECOGNITION (Proceedings of the IEEE 77: 257- 285, 1989) and P Smyth, D Heckerman, and M Jordan in PROBABILISTIC INDEPENDENCE NETWORKS FOR HIDDEN MARKOV PROBABILITY MODELS (Neural Computation 9(2): 227-269, 1997).
SUMMARY OF THE INVENTION An exemplary embodiment of the invention is a method for fusing a plurality of images of a scene to form a fused image. A representation of each image of the scene is created. The representation includes a number of transform coefficients. A set of image fusion parameters are estimated based on dependencies between the transform coefficients. The image fusion parameters are updated using an estimation algorithm and the fused image is formed based on the image fusion parameters. Another exemplary embodiment of the invention is an image fusion apparatus for producing a fused image of a scene. The image fusion apparatus includes: a multiscale transform (MST) processor; a tree processor electrically coupled to the MST processor; a hidden Markov tree (HMT) processor electrically coupled to the tree processor; a Markov chain processor electrically coupled to the HMT processor; an expectation maximization (EM) processor electrically coupled to the Markov chain processor; and an inverse multiscale transform (IMST) processor electrically coupled to the EM processor. The MST processor receives images of a scene and creates an MST representation of each image of the scene. The MST representation includes a set of transform coefficients. The tree processor arranges each MST representation created by the MST processor into tree structures. Each tree structure includes a plurality of nodes with one of the set of transform coefficients in each node. The HMT processor associates each node of the tree structures with a state node, thereby obtaining an HMT. The Markov chain processor computes conditional probabilities from image fusion parameters using an upward-downward algorithm. This calculation is based on the HMT's obtained by the HMT processor. The EM processor updates the image fusion parameters using an EM algorithm based on the image fusion parameters and the conditional probabilities computed by the Markov chain processor. The IMST processor forms the fused image by
taking the IMST of a transformed image resulting from the image fusion parameters updated by the EM processor.
An addition exemplary embodiment of the invention is a method for fusing a number of images of a scene to form a fused image. Each image is segmented into at least one region and the images are fused to form the fused image based on the segmentation of the images.
A further exemplary embodiment of the invention is an image fusion apparatus for producing a fused image of a scene. The image fusion apparatus including : a transform processor; a dependency processor electrically coupled to the transform processor; an estimation processor electrically coupled to the dependency processor; an image fusion processor electrically coupled to the estimation processor. The transform processor receives images of a scene and creates a representation of each image of the scene. The representation includes a set of transform coefficients. The dependency processor estimates image fusion parameters based on dependencies between the transform coefficients. The estimation processor updates the image fusion parameters using an estimation algorithm based on the image fusion parameters estimated by the dependency processor. The image fusion processor forms the fused image based on the image fusion parameters updated by the estimation processor.
Yet a further exemplary embodiment of the invention is an image fusion apparatus for producing a fused image of a scene. The image fusion apparatus includes a segmentation processor and an image fusion processor. The segmentation processor receives images of a scene and segments each image into at least one region. The image fusion processor forms the fused image by fusing the images of the scene based on the segmentation of the images. BRIEF DESCRIPTION OF THE DRAWINGS
The invention is best understood from the following detailed description when read in connection with the accompanying drawings. It is emphasized that, according to common practice, the various features of the drawings are not to scale. On the contrary, the dimensions of the various features are arbitrarily expanded or reduced for clarity. Included in the drawings are the following figures:
Figure 1 is a block diagram illustrating an image fusion method according to an exemplary embodiment of the invention.
Figure 2 is a schematic diagram illustrating a forest of quadtrees of wavelet coefficients according to an exemplary embodiment of the invention. Figure 3 is a schematic diagram illustrating an exemplary hidden Markov tree (HMT) for an image quadtree according to an exemplary embodiment of the invention.
Figure 4 is a block diagram illustrating an iterative fusion procedure (or method) according to an exemplary embodiment of the invention.
Figure 5A is a digital image illustrating an exemplary visual image of a scene that an exemplary embodiment of the invention may use in concealed weapon detection (CWD).
Figure 5B is a digital image illustrating an exemplary millimeter wave (MMW) image of the same scene as illustrated in Figure 5A that an exemplary embodiment of the invention may use in CWD.
Figure SC is a digital image illustrating an exemplary fused image formed from the images of Figures 5A and 5B using an HMT-based expectation maximization (EM) fusion method according to an exemplary embodiment of the invention illustrated in Figures 4 and 8.
Figure 5D is a digital image illustrating an exemplary fused image formed from the images of Figures 5A and 5B using a prior art EM fusion method. Figure 5E is a digital image illustrating an exemplary fused image formed from the images of Figures 5A and 5B using a prior art wavelet fusion method.
Figure 5F is a digital image illustrating an exemplary fused image formed from the images of Figures 5A and 5B using a prior art selecting maximum method.
Figure 6A is a digital image illustrating an exemplary visual image of a scene that an exemplary embodiment of the invention may use in CWD.
Figure 6B is a digital image illustrating an exemplary infrared (IR) image of the same scene as illustrated in Figure 6A that an exemplary embodiment of the invention may use in CWD.
Figure 6C is a digital image illustrating an exemplary fused image formed from the images of Figures 6A and 6B using an HMT-based EM fusion method according to an exemplary embodiment of the invention illustrated in Figures 4 and 8.
Figure 6D is a digital image illustrating an exemplary fused image formed from the images of Figures 6A and 6B using a prior art EM fusion method.
Figure 6E is a digital image illustrating an exemplary fused image formed from the images of Figures 6A and 6B using a prior art wavelet fusion method.
Figure 6F is a digital image illustrating an exemplary fused image formed from the images of Figures 6A and 6B using a prior art selecting maximum method.
Figure 7A is a digital image illustrating an exemplary IR image of a scene that an exemplary embodiment of the invention may use for night vision applications. Figure 7B is a digital image illustrating an exemplary night visual image of the same scene as illustrated in Figure 7A that an exemplary embodiment of the invention may use for night vision applications.
Figure 7C is a digital image illustrating an exemplary fused image formed from the images of Figures 7A and 7B using an HMT-based EM fusion method an exemplary embodiment of the invention illustrated in Figures 4 and 8.
Figure 7D is a digital image illustrating an exemplary fused image formed from the images of Figures 7A and 7B using a prior art EM fusion method.
Figure 7E is a digital image illustrating an exemplary fused image formed from the images of Figures 7A and 7B using a prior art wavelet fusion method.
Figure 7F is a digital image illustrating an exemplary fused image formed from the images of Figures 7A and 7B using a prior art selecting maximum method. Figure 8 is a flow chart illustrating an iterative image fusion procedure according to an exemplary embodiment of the invention.
Figure 9A is a digital image illustrating an exemplary IR video frame of a scene that an exemplary embodiment of the invention may for night vision applications.
Figure 9B is a digital image illustrating an exemplary intensified video frame of the same scene as illustrated in Figure 9A that an exemplary embodiment of the invention may use for night vision applications.
Figure 9C is a digital image illustrating an exemplary fused image formed from the images of Figures 9A and 9B using a single frame EM fusion method according to an exemplary embodiment of the invention. Figure 9D is a digital image illustrating an exemplary fused image formed from the images of Figures 9A and 9B using a two frame EM fusion method according to an exemplary embodiment of the invention.
Figure 9E is a digital image illustrating an exemplary fused image formed from the images of Figures 9A and 9B using a three frame EM fusion method according to an exemplary embodiment of the invention.
Figure 9F is a digital image illustrating an exemplary fused image formed from the images of Figures 9A and 9B using a four frame EM fusion method according to an exemplary embodiment of the invention.
Figure 1OA is a digital image illustrating an exemplary forward looking infrared (FLIR) video frame of a scene that an exemplary embodiment of the invention may use for night vision applications.
Figure 1OB is a digital image illustrating an exemplary intensified video frame of the same scene as illustrated in Figure 1OA that an exemplary embodiment of the invention may use for night vision applications. Figure 1OC is a digital image illustrating an exemplary fused image formed from the images of Figures 1OA and 1OB using a single frame EM fusion method according to an exemplary embodiment of the invention.
Figure 1OD is a digital image illustrating an exemplary fused image formed from the images of Figures 1OA and 1OB using a two frame EM fusion method according to an exemplary embodiment of the invention.
Figure 1OE is a digital image illustrating an exemplary fused image formed 5 from the images of Figures 1OA and 1OB using a three frame EM fusion method according to an exemplary embodiment of the invention.
Figure 1OF is a digital image illustrating an exemplary fused image formed from the images of Figures 1OA and 1OB using a four frame EM fusion method according to an exemplary embodiment of the invention. o Figure 11 is a block diagram illustrating an image fusion method according to an exemplary embodiment of the invention.
Figure 12 is a block diagram illustrating a portion of the method of Figure 11 according to an exemplary embodiment of the invention.
Figures 13A and 13B are schematic diagrams illustrating exemplary pixel s connectivity maps that may be used in the method of Figures 11 and 12 according to an exemplary embodiment of the invention.
Figure 14A is a digital image illustrating an exemplary visual image of a scene that an exemplary embodiment of the invention may use in CWD.
Figure 14B is a digital image illustrating an exemplary MMW image of the 0 same scene as illustrated in Figure 14A that an exemplary embodiment of the invention may use in CWD.
Figures 14C and 14D are digital images illustrating exemplary edge images formed from the images of Figures 14A and 14B, respectively, during the exemplary methods of Figures 11 and 12. 5 Figures 14E and 14F are digital images illustrating exemplary object region images formed from the images of Figures 14C and 14D, respectively, during the exemplary methods of Figures 11 and 12.
Figure 14G is a digital image illustrating an exemplary binary decision map formed from the images of Figures 14E and 14F during the exemplary methods of 0 Figures 11 and 12.
Figure 14H is a digital image illustrating an exemplary fused image formed from the images of Figures 14A and 14B using the exemplary methods of Figures 11 and 12.
Figure 15A is a digital image illustrating an exemplary visual image of a s scene that an exemplary embodiment of the invention may use.
Figure 15B is a digital image illustrating another exemplary visual image of the same scene as illustrated in Figure 15A that an exemplary embodiment of the invention may use.
Figures 15C and 15D are digital images illustrating exemplary edge images 5 formed from the images of Figures 15A and 15B, respectively, during the exemplary methods of Figures 11 and 12.
Figures 15E and 15F are digital images illustrating exemplary object region images formed from the images of Figures 15C and 15D, respectively, during the exemplary methods of Figures 11 and 12. o Figure 15G is a digital image illustrating an exemplary binary decision map formed from the images of Figures 15E and 15F during the exemplary methods of Figures 11 and 12.
Figure 15H is a digital image illustrating an exemplary fused image formed from the images of Figures 15A and 15B using the exemplary methods of Figures 11 and s 12.
Figure 16A is a digital image illustrating an exemplary visual image of a scene that an exemplary embodiment of the invention may use.
Figure 16B is a digital image illustrating another exemplary visual image of the same scene as illustrated in Figure 16A that an exemplary embodiment of the 0 invention may use.
Figures 16C and 16D are digital images illustrating exemplary edge images formed from the images of Figures 16A and 16B, respectively, during the exemplary methods of Figures 11 and 12.
Figures 16E and 16F are digital images illustrating exemplary object region 5 images formed from the images of Figures 16C and 16D, respectively, during the exemplary methods of Figures 11 and 12.
Figure 16G is a digital image illustrating an exemplary binary decision map formed from the images of Figures 16E and 16F during the exemplary methods of Figures 11 and 12. 0 Figure 16H is a digital image illustrating an exemplary fused image formed from the images of Figures 16A and 16B using the exemplary methods of Figures 11 and 12.
DETAILED DESCRIPTION OF THE INVENTION
Image fusion methods based on multiscale transforms (MST) are a popular s choice in recent research. Figure 1 illustrates the block diagram of an image fusion scheme based on multiscale analysis. The basic idea is to perform a multiscale transform (MST) on each source image 100 to produce a corresponding multiscale
representation 102. Fusion process 104 is then applied to the multiscale representations 102 to construct composite multiscale representation 106 from the transformed source images. Fused image 108 is obtained by taking an inverse multiscale transform (IMST) of composite multiscale representation 106.
5 In exemplary embodiments of the invention, an image formation model using a hidden Markov model (HMM) is employed to capture the correlations between wavelet coefficients across wavelet decomposition scales. Based on this exemplary image formation model, an expectation-maximization (EM) algorithm is used to estimate the model parameters and produce the fused image. The fused image may be formed io directly from digital images produced by a number of sensors. For example: digital visible cameras; digital infrared cameras; digital millimeter wavelength detectors; digital x-ray detectors; image intensified digital cameras; digital low light cameras; digital terahertz frequency detectors; visible video cameras; infrared video cameras; image intensified video cameras; and/or low light video cameras may be used in exemplary i5 embodiments of the invention. Alternatively, analog images that have been converted to digital form may be fused using exemplary methods and apparatus of the invention.
EM algorithms have been described for use in some different applications and formation models by a number of authors, including: AP Dempster, NM Laird, and DB Rubin in MAXIMUM LIKELIHOOD FROM INCOMPLETE DATA VIA THE EM ALGORITHM (J Q of the Royal Statistical Soc B 39(1): 1-38, 1977); RA Redner and HF Walker in MIXTURE DENSITIES, MAXIMUM LIKELIHOOD AND THE EM ALGORITHM (SIAM Review 26(2) : 195- 239, 1984); JA Fessler and AO Hero in SPACE-ALTERNATING GENERALIZED EXPECTATION-MAXIMIZATION ALGORITHM (IEEE Trans Signal Processing 42(10):2664- 2677, 1994); RS Blum, RJ Kozick, and BM Sadler in AN ADAPTIVE SPATIAL DIVERSITY 5 RECEIVER FOR NON-GAUSSIAN INTERFERENCE AND NOISE (IEEE Trans Signal
Processing 47(8):2100-2111, 1999); and RS Blum, RJ Kozick, and BM Sadler in EM- BASED APPROACHES TO ADAPTIVE SIGNAL DETECTION ON FADING CHANNELS WITH IMPULSIVE NOISE (31st Annual Conference on Information Sciences and Systems, Baltimore, MD, 1997, pp 112-117). These approaches do not include the HMM in 0 conjunction with the EM algorithm, however.
Exemplary embodiments of the invention include a statistical signal processing approach to multi-sensor image fusion. This approach is based on an image formation model in which the sensor images are described as a true scene corrupted by additive non-Gaussian distortion. The HMM is fitted on wavelet transforms of sensor 5 images from multiple sensors to describe the correlations between the coefficients across wavelet decomposition scales. These multiscale transformations may include pyramid transformations. The EM algorithm is then used to estimate the model parameters and
produce the fused images. In an exemplary embodiment of the invention, an iterative fusion procedure based on these steps is used to produce the maximum likelihood estimates of the parameters and the fused images. An exemplary embodiment of an iterative fusion procedure is illustrated by the block diagram in Figure 4 and the flow chart of Figure 8. The operations of the exemplary methods illustrated in these figures may be performed by a processing unit 110 as illustrated in Figure 1. Embodiments of the invention encompass the processing unit 110 being a single processor, a plurality of processors operating in series and/or parallel, or a combined processing unit. These processors may include special purpose circuitry or ASIC's designed to perform the processing steps, or a general purpose computer programmed to perform the processing steps.
Exemplary embodiments of the invention may be applied on visual and radiometric images for applications such as concealed weapon detection (CWD) applications and night vision image fusion. Exemplary embodiments of the invention may also be applied in image fusion for CWD. Test results for exemplary CWD applications using exemplary methods of the invention along with the results of other methods for comparison are shown in Figures 5A-F and 6A-F. Exemplary embodiments of the invention may also be applied to night vision applications, which may be of particularly applicability in military and automotive uses. Test results for exemplary night vision applications using exemplary methods of the invention along with the results of other methods for comparison are shown in Figures 7A-F. Additionally, it is noted that exemplary embodiments of the invention may be used in medical image fusion to improve diagnostic and training procedures. Although exemplary embodiments of the invention are described herein with regard to certain applications, exemplary embodiments of the invention are more generally applicable and are not limited to the exemplary applications described herein. 1 The Image Formation Model Based on HMM
In exemplary embodiments of the invention, the image fusion process takes into account the dependency among or between coefficients. In contrast, the methods disclosed in A STATISTICAL SIGNAL PROCESSING APPROACH TO IMAGE FUSION FOR CONCEALED WEAPON DETECTION by J Yang and RS Blum (IEEE International Conference on Image Processing, Rochester, NY, 2002, pp I-513-I-516) assumed that each pyramid coefficient was statistically independent to all others. . An image fusion method according to an exemplary embodiment of the invention is illustrated by the flow chart in Figure 8. The exemplary method of Figure 8 begins with creating a multiscale transformation (MST) representation of each image to be fused, step 800. The MST transformation includes transform coefficients, which may
be wavelet and/or pyramid coefficients, on multiple scales. This new image formation model has been created based on allowing correlations to exist between the pyramid coefficients. In this exemplary model, sensor images may be described as true scenes corrupted by additive non-Gaussian distortion. A hidden Markov model (HMM) describes 5 the correlations between the wavelet coefficients in one sensor image. 1.1 Tree Structure of the Wavelet Coefficients
To describe an exemplary HMM for the transform coefficients, it is desirable to use graphs and trees as described, for example, in GRAPHS: THEORY AND ALGORITHMS by K Thulasiraman and MNS Swamy (New York: Wiley, 1992, pp 1-54). o An undirected graph consists of set of nodes {v{ ,v2 ,...,vN } and a set of edges linking the nodes. A path is a set of edges connecting two nodes. A rooted tree is an undirected acyclic graph. All nodes in a tree that lie on a path from v, to the root are called ancestors of v( . All nodes in a tree that lie on paths from v, going away from the root node are called descendants of V1 . A node is called the parent of v( if it is the s immediate ancestor of V1 . The parent of v, may be denoted by vp(l) . A node is called child of v, if V1 is its parent. The children of V1 may be denoted by {Vj}jecU) . Each node in the rooted tree has only one parent but may have several children. The root has no parent node. Nodes with no children are called leaves of the tree. A collection of trees forming a larger structure may be called a forest. 0 In a rooted tree graph structure, if each node that is not a leaf has four children, this tree may be called a quadtree. A collection of quadtrees is called a forest of quadtrees.
Figure 2 illustrates a forest 200 of quadtrees. Based on the persistence property of the transform coefficients, the transform coefficients of a source image may 5 be organized as a forest of quadtrees, step 802, as disclosed by J Shapiro and M Crouse, et al. Each transform coefficient may represent a node in one of the quadtrees. These quadtrees are rooted at the transform coefficients for the high frequency bands (HL, LH, HH bands) in the coarsest scale. In an exemplary embodiment of the invention, the coefficients in the LL band are not included in the quadtrees and these coefficients are o processed separately. Exemplary quadtree 202, rooted in HH3 subband, is shown in Figure 2. In a quadtree, each coefficient in a coarse scale subband has four children coefficients in the corresponding finer scale subband. The arrows in Figure 2 point from the subband of the parents to the subband of the children. The number of quadtrees in the forest depends on the size of the image and the number of decomposition levels in 5 the wavelet (and/or pyramid) transform. For example, if we decompose an NxN image
using a wavelet transform with L decomposition levels (L scales), then in each subband set (LH, HL or HH subband set), (N2'L)2 wavelet trees are obtained.
1.2 Hidden Markov Tree Model
Each transform coefficient from a given sensor image may be treated as a 5 random variable with a Gaussian mixture probability density function. This approximation of the transform data is used in an exemplary embodiment of the invention. Based on the persistence property of transform coefficients discussed above, it is contemplated that a given coefficient may be statistically dependant on its parent and children. This dependency increases the complexity of the image formation model o and the parameter estimation for the model. In order to solve this problem, an HMM is used in an exemplary embodiment of the invention.
In this HMM model a state variable may be associated with each transform coefficient. This random state variable may capture the parent-children relationship of transform coefficients. In the exemplary quadtree representation, each coefficient node s 304 may be associated with a state node 302, as shown in Figure 3. The black nodes represent the transform coefficient and the white nodes represent the state variables. Therefore, the transform coefficients are dependant on the associated state variable and independent of the other coefficients when conditioned on the state variable. Thus, the dependency of transform coefficients may be viewed as "hidden" by the state variables. 0 The state nodes in a quadtree form a Markov chain, as each node is statistically related only to its parent and children. Therefore, the Markov model may be fit on the tree structure (see Figure 2) of the transform coefficients of the MST to form a Hidden Markov Tree (HMT) model, step 804. Figure 3 illustrates an example of a HMT model for four quadtree, including exemplary quadtree 300. The definition of the exemplary state 5 variables in this model is explained in more detail in section 1.3.
1.3 Image Formation Model
As noted above, the sensor images may be described as including the true scene corrupted by additive, possibly non-Gaussian, distortion. To illustrate, assume there are q sensor images to be fused. In the transform representation of each sensor o image, assume there are K quadtrees and, for each quadtree, there are P transform coefficients, or P nodes. Let zi k (j) , block 400 in Figure 4, denote the transform coefficient at the /h node of the kth tree in the Ith sensor image and let wk (j) denote the transform coefficient at the /h node of the /cth tree of the wavelet (and/or pyramid) transform of the image describing the true scene. The resulting image formation model 5 may be written as:
ZiM = βIJC U)MJ) + S1AJ) i = \,...,q\ k = \,...,K; j = \,...,P (1)
where βl k {j) is the sensor selectivity factor and εl k (j) is the possibly non-Gaussian distortion.
A zero-mean, ΛMerm, Gaussian mixture model may be used to fit the distortion. In an exemplary embodiment of this model, a random state variable Sl k (j) with M states is introduced. Sl k (j) denotes the state variable for the/h node in the kth transform coefficient quadtree and p{Sl k {j) = m) is the probability of being in state m. State m (m = 1, ..., M) of variable Sl k (j) may be associated with the distortion coming from the mth term in the Gaussian mixture model for the distortion. Thus, the distortion may be modeled as being in all M states. In each state, the distortion has a zero-mean Gaussian distribution. The state probability, denoted by PSl k(j)(m) = p(Sltk(j) = w) , is the probability that the distortion is in state m. To allow for efficient training given only the source images, it may be assumed that ps (j)(m) is the same for each /, k pair, which may be denoted by pS( l)(m) for simplicity. Because βl k (j)wk (j) 's assumed to be deterministic, zl k (j) also has a Gaussian mixture distribution. If the distortion is in state m, the coefficient zl k (j) may also be thought to be in state m. Thus, it may be seen that zι k (j) has a distribution corresponding to the mth term in the mixture model for the distortion, but with a nonzero mean, and that the state variables characterize the state (mixture term) of the sensor transform coefficients.
As described previously, each sensor coefficient node in a quadtree is associated with a state node to obtain a Hidden Markov Tree (HMT) similar to the one shown in Figure 3. Let Sl k (\) denote the state variable of the root node and let p(j) denote the parent node of node j in the /cth tree of Ith sensor image. Using this exemplary HMT model leads to:
P[S1AJ) = m I S.APU)) = npU),...,s,A2) = n2,sι k(i) = «,}
= P[S1AJ) = ^ i S1APU)) = np{J)} (2) where p(j) ,.-, 1 represent all the nodes between p(j) and root node 1. The state transition probability may be written as α™ p0) = P[S1 k(j) = m \ Sl k (p(j)) = n) . The desired estimation quality, based only on the source images, may be achieved by assuming that a™p' U) is the same for each /, k pair and may be written as aZω = p[SU) = m I S(P(J)) = n} for simplicity.
Given Sl k {j) = m , the distribution of distortion εl k (j) may be written as:
Using Equation (1), the distribution of zl k (j) , given Sl k (j) = m , may be written as:
Thus, removing the conditioning on Sl k (j) , the distribution of ει k (j) is generally non-Gaussian as given by:
but if ε
ι>k (j) is Gaussian, then this may also be modeled using Equation (5) by the proper choice of parameters.
This exemplary M-state Gaussian mixture model may be used to fit the transform coefficient data of the sensor images according to an exemplary embodiment of the invention. Generally speaking, the distribution of each zl k (j) may be different.
However, the coefficients may be statistically related according to the statistical relationships of their state variables. These relationships are described by the HMT model. In this exemplary image formation model, the sensor selectivity factor, βl k (j) , and true scene data, wk (j) , are generally unknown. The Gaussian variance of each state, σl (j) , may also be undetermined. These parameters explicitly appear in the image formation model. Two other variables, the state probability, ps( l) (rn) , and the state transition probabilities, a™£U) l may also be unknown. In an exemplary embodiment of the invention, the state probability and the state transition probabilities may fully specify the image formation model as they describe the unconditioned distribution in Equation (5). 2 Fusion with the Expectation Maximization (EM) algorithm In an exemplary embodiment of the invention, the image formation model of Equations (1) and (5) is based on the transform coefficients in the high frequency bands (HL1 LH, HH bands) and an exemplary iterative algorithm is applied only to these bands. The fusion of coefficients in the LL band is processed separately.
In an exemplary embodiment of the invention, the same distortion model for each transform coefficient quadtree may be assumed. An exemplary iterative algorithm may then be run over the set of wavelet (and/or pyramid) representations of the sensor images to obtain estimates of the set of image parameters, block 404:
For each iteration in the exemplary iterative algorithm:
may be used to denote the updated values of Φ . Figure 4 illustrates a block diagram of this exemplary iterative fusion procedure. This exemplary iterative fusion procedure io begins with parameter initialization, block 402, as well as step 806 in Figure 8. Initial values (generated as described in Section 2.3, below) are provided to initialize Φ . Then with the initial parameter values and observed data, the conditional probabilities list then are computed using an exemplary upward-downward algorithm based on the HMT model, step 808. i
5 The conditional probabilities list, together with the initial parameters and observed data, may used by the exemplary EM algorithm to update the parameters Φ to Φ', step 810. If the difference between Φ and Φ' is less than a given threshold δ , i.e. the image fusion parameters converge, step 812, the final estimates of Φ are set to Φ' and the fusion procedure may be terminated; otherwise, the current estimates may be
20 updated using Φ' and the procedure repeated until the iterative procedure converges or another criteria, such as a maximum number of iterations is reached, step 814. The processes of step 808 and 810 are shown in more detail in the blocks of Figure 4. Each block in Figure 4 is described in detail in the following sections. 2.1 Computing Conditional Probabilities
25 Let z denote the complete set of observations. Equation (4) demonstrates that the transform coefficients of sensor images are conditionally Gaussian, given the hidden states. Hence, to carry out the iterative EM algorithm, it is desirable to know the marginal probability mass functions (pmfs) p(S, k(j) = m | z,Φ) , and parent-child pmfs p(S, k U) = m,SιΛ (p(j)) = n \ z,Φ) , for i = l,...,q , k = \,...,K , and j = \,...,P , block 410.
30 Because it is assumed that each quadtree is independent to others, these conditional probabilities may be calculated independently for a given transform coefficient quadtree and a particular sensor image. Thus, these computations are desirably carried out for each quadtree of each image. For computing efficiency in a particular tree, these
probabilities may be given by some intermediate parameters in the exemplary HMT model. The exemplary upward -downward algorithm developed by M Crouse, et al. and O Ronen, et al. is used to produce these parameters in an exemplary embodiment of the invention. To determine the probabilities for the state variables, the state information is
5 propagated throughout the tree. The upward step of the exemplary algorithm, block 406, calculates the parameters by transmitting information from fine-scale transform coefficients, the leaf nodes of the tree structure, up to the states of the coarse-scale transform coefficients, the root nodes of the tree structures; the downward step of the exemplary algorithm, block 408, calculates the parameters by propagating information io from the coarse-scale transform coefficients, root nodes, down to the fine-scale transform coefficients, leaf nodes. After obtaining these intermediate parameters, the conditional probabilities may be calculated from them, block 412. A detailed discussion of the calculation of exemplary conditional probabilities and the exemplary upward- downward algorithm may be found in Appendix A.
15 2.2 Updating parameters using the EM algorithm
In an exemplary embodiment of the invention, an EM algorithm is used to develop the iterative equations for parameter estimation. The updated estimates are chosen to be those that maximize a likelihood-like function (see Appendix B for the details). The algorithm begins with current estimates Φ of the parameters and
20 produces updated estimates Φ' . The following update equations describe the procedure in detail.
1) Update the state probability ps(j) (m) and state transition probability a'"p(j) for all j = \,...,P;m = 1,...,M , block 414, using :
∑∑P(SαO) = ™ | z,Φ)
(™\ _
S( ι) (m) — = i=l k=\ and (8)
Kq
1 K
∑ ∑ P(S, J1 U) = m, S1J1 (P(J)) = n \ z,Φ)
* < J-^Pi),)) ~ = J≡LΛ≤ - W κw s(Pϋ))(n)
2) Update the Gaussian variance σm 2 (j) for all j = \,...,P ; m = \,,..,M , block 416, using:
∑ ∑ tα U) - βtJt OK (J))2 P(S1, U) = m I z, Φ)
< U) = JΞUΞ1 KqpSϋ)( —m) do)
3) Update βlΛ (j) for i = \,...,q , k = \,...,K , and j = 1,..., P 1 block 418, by selecting β\,k U) = ±1'° t0 maximize:
4) Update the value of wk(j) for all k = \,...,K ; j = 1,..., P , block 420, using:
The previous image fusion parameters and the updated image fusion parameters are compared to determine if they have converged, block 422. If so, the iterative procedure of Figure 4 is complete, block 424. Otherwise, the image fusion parameters in block 404 are replaced and the iterative process repeats. The above
I0 update Equations (8)-(12) may be derived from using a general approach called the SAGE version of the EM algorithm disclosed by JA Fessler, et al. The details of this exemplary derivation are presented in Appendix B. 2.3 Initialization of the fusion algorithm, block 402 and step 806
In an exemplary embodiment of the invention, initial estimates are used is for computing conditional probabilities and for Equations (8)-(12). In an exemplary embodiment of the invention, the initial estimates for the true scene, wk (j) , are chosen from the weighted average of the sensor images as per:
w*0") = έΛ.*0X*0') k = l,...,KJ = l,...,P (13) ι=l
where ∑Λ.l A (y) = l . To determine the X1 k (j) in Equation (13), in an exemplary ι=l
20 embodiment, a salience measure that was discussed in Enhanced image capture through fusion by PJ Burt, RJ Kolczynski (Proc the 4th Intl Conf on Computer Vision, 1993, pp 173-182) is used. For coefficient zl k {j) , let (x, /) denote its coordinates in a subband.
The salience measure for this coefficient may then be computed from the weighted average of the coefficients in a window around it, as per:
where p(x',/) is the weight for each coefficient around (x, y) and z
t (x + x',y + y') denotes the coefficient value at coordinates (x + x',y + /) in the subband. Shown below is an exemplary 5 by 5 window of coefficients centered on (x, y) that may be used to calculate this salience measure of z
l k(j) using:
'1/ 48 1 / 48 1/48 1 / 48 l / 48λ 1 / 48 1 / 24 1/ 24 1/ 24 1 / 48
P= 1/48 1/24 1/3 1/24 1/48 (15) 1/48 1/24 1/24 1/24 1/48 1/48 1/48 1/48 1/48 1/48 where p is the matrix notation of p(x',y') in a 5 by 5 window. The λι k {j) , for i = \,...,q , may then be specified using:
A,*C/) = Ωα O)/∑ΩαO) i = \,...,q (16)
/=i
An exemplary initialization for βι k (j) is to assume that the true scene appears in each sensor image. Hence βl k(j) = l for i = \,..., q ; k = 1,..., K ; and j = \,...,P . The initial state probabilities in an exemplary embodiment are assumed to be pS{j)(m) = \l M for m = \,...,M . At initialization, the parent-child transition probabilities in an exemplary embodiment are assumed equal for all M states. Hence a S(j),S(p(j)) = 1/ M for m = 1,...,M and n = \,...,M . To model the distortion in a robust way the distortion may desirably be initialized as impulsive, as disclosed in RS Blum, et al. (1999). Thus, it is possible to set σm 2 (j) = γσm 2_λ {j) for j = \,...,P such that fixing γ and σ,2 (y') fixes σm 2 (j) for m > 1 . The value for σf (j) may also be chosen in an exemplary embodiment so that the total variance of the mixture model
M σ2(j) — ∑σ m(./)/-^ matches the variance of the observations:
Choosing / = 10 in an exemplary embodiment, leads to an initial distortion model that is fairly impulsive. This initialization scheme has worked very well for numerous experimental cases. It is observed that the algorithm in these experiments generally converges in less than 15 iterations and most often within 5 iterations. In
exemplary embodiments of the invention, the predetermined number of iterations used in step 814 is equal to 15 or less, or 5 or less.
An exemplary region segmentation method is described below with reference to Figures 11 and 12. In an exemplary embodiment, a region segmentation method is used to supply initial estimates of the image fusion parameters, from the decision map calculated in the exemplary methods of Figures 11 and 12, for example.
For fusion of video images, the image fusion parameters may only change slightly from frame to frame. In an exemplary embodiment of the invention, the final converged image fusion parameters from one set of frames taken at approximately the same time may be used as the initial image fusion parameter estimates for the set of frames taken approximately simultaneously at a subsequent time.
The examples are not limiting, it is contemplated that embodiments of the invention encompass initial image fusion parameters that are chosen or generated using other methods. 2.4 Fusion of the LL subband transform coefficients
In an exemplary embodiment of the invention, the exemplary HMT model of the invention is applied to only the high frequency subbands of the wavelet (and/or pyramid) representations (see Figure 2). In this case, the above described exemplary iterative estimation procedure may produce only the fused high frequency subbands. Therefore, the fused LL subband may need to be produced separately. In an exemplary embodiment, the weighted average method described in section 2.3 is used to produce the fused LL subband. Let w(x,y) denote the coefficient with coordinate {x,y) in the LL subband of fused image, and let z,.(x,.y) denote the coefficient with coordinate (x,y) in LL subband of sensor image /, for i = \,...,q . The coefficients in the LL subband may then be determined using:
y*χ,y) = ∑ λ,(.χ>y)Zi(χ>y) (18)
where λt{x,y) is the weight and 2^ A1(X, y) = 1 . In an exemplary embodiment, the
1=1 method described in Equations (14)-(16) is used to determine λt(x,y) for i = \,...,q .
Once the image fusion parameters for all of the subbands have been determined and the transform coefficients have been fused, the fused image may be produced by taking an inverse multiscale transform of these fused transform coefficients, step 816.
3 Experimental Results
This exemplary fusion algorithm has been applied experimentally to concealed weapon detection (CWD) and night vision applications. 3.1 CWD with Visual and millimeter wave (MMW) Images
In an exemplary embodiment of the invention described below, this
5 algorithm is used to fuse visual image 500 and MMW image 502 shown in Figures 5A and 5B, respectively, for a CWD application. The order-3 Daubechies wavelet transform described by I. Daubechies in ORTHONORMAL BASES OF COMPACTLY SUPPORTED WAVELETS (Comm on Pure and Applied Mathematics 41 :909-996, 1998) is used to transform the visual and MMW images into multiscale representations with 4 io decomposition levels. A two-state HMT model based on these wavelet representations is created. The exemplary algorithm described above with reference to Figures 4 and 8 is used to fuse these two images. The resulting exemplary fused image 506 is shown in Figure 5C.
The EM-fusion algorithm presented by J Yang, et al. was also used to fuse
I5 these two images 500, 502 based on the wavelet representations. This fused image 508 resulting from the EM-fusion algorithm is shown in Figure 5D. Figure 5E illustrates fused image 510 resulting from using the same wavelet representations, choosing the maximum sensor wavelet coefficients for the high-pass subband wavelet coefficients, and averaging the sensor wavelet coefficients for the low-pass subband wavelet coefficients
20 as disclosed by Z Zhang, et al. (1999). Figure 5F illustrates fused image 512 resulting from selecting the maximum pixel (i.e., no wavelet transform) from each image 500, 502. Comparing these fused images 506-512, it may be seen that the exemplary HMT- based EM fusion algorithm performs, fused image 506, better than the other three exemplary fusion methods 508-512. The fused image 506 (Figure 5C) generated by the 5 exemplary HMT-based EM fusion algorithm provides improvement over the EM fusion method by considering the correlations of the coefficients at different scales. From fused image 506 of Figure 5C, there is considerable evidence to suspect that the person on the right has concealed gun 504 beneath his clothes. This fused image may be very helpful to a police officer, for example, who must respond promptly.
30 3.2 CWD with Visual and infrared (IR) images
Another exemplary embodiment of the invention is described below with reference to Figures 6A-F. This example illustrates another CWD application employing visual image 600 and IR image 602, Figures 6A and 6B, respectively. An order-3 Daubechies wavelet is used to transform the visual and IR images into wavelet 5 representations employing 6 decomposition levels and a two-state HMT model is created. Figures 6C-F show the fused results obtained by the exemplary HMT-based EM fusion method (fused image 606), the exemplary EM fusion method of J Yang, et al. (fused
image 608), the exemplary wavelet fusion method of Z Zhang, et al. (1999) (fused image 610), and the exemplary selecting maximum algorithm (fused image 612). Fused images 606, 608, 610, and 612 indicate the location of concealed object 604. In this example, the standard EM fusion method of J Yang, et al. (fused image 608) appears slightly inferior to the wavelet fusion method of Z Zhang, et al. (1999) (fused image 610). After considering the correlations between different scale coefficients, the HMT- based EM fusion method (fused image 606) may perform better than the EM fusion method of J Yang, et al. (fused image 608), the wavelet fusion method of Z Zhang, et al. (1999) (fused image 610), and the selecting maximum algorithm (fused image 612). 3.3 Night Vision Image Fusion Applications
In night vision applications, significant information may be contained in thermal images, while some information from visual images complements the information in the thermal images. This may be different from daytime image fusion applications, such as the CWD examples described above with reference to Figures 5A-F and 6A-F, where the majority of the information is typically contained in the visual images. Hence, image fusion methods which are good for daytime fusion applications may not always work as well for night vision applications. A night vision application of an exemplary embodiment of the invention is described below with reference to Figures 7A-F. Figures 7A and 7B show an exemplary IR image 700 and an exemplary visual image 702, respectively. An order-3 Daubechies wavelet is used to transform the IR 700 and visual 702 images into wavelet representations with 5 decomposition levels. A two-state HMT model is created. Figures 7C-F show the fused results obtained by the exemplary HMT-based EM fusion method (fused image 704), the exemplary EM fusion method of J Yang, et al. (fused image 706), the exemplary wavelet fusion method of Z Zhang, et al. (1999) (fused image 708), and the exemplary selecting maximum algorithm (fused image 710). This example shows that the HMT-based EM fusion algorithm performs very well as a night vision image fusion system. Comparison of the fused images 704-710 shows that the HMT-based EM fusion algorithm (fused image 704) may perform better than the EM fusion of J Yang, et al. (fused image 706), the wavelet fusion method of Z Zhang, et al. (1999) (fused image 708), and the selecting maximum algorithm (fused image 710). 4. Multi-frame video fusion
Image fusion approaches that perform video fusion on a frame-by-frame basis do not capitalize the temporal information in the neighboring frames for fusion.
However, this information may be very helpful to improve the fusion results. In a paper, MULTI-FRAME IMAGE FUSION USING THE EXPECTATION-MAXIMIZATION ALGORITHM
(Paper presented at The Eighth International Conference on Information Fusion, Wyndham Philadelphia at Franklin Plaza, Philadelphia, PA, USA. Available at http://fusion2005.no-ip.com/viewabstract.php7id = 119 as of June 15, 2005), the inventors describe an exemplary multi-frame image fusion scheme which considers use of neighboring frames to incorporate temporal as well as sensor fusion. This fusion scheme may produce an improved estimate of each frame of the fused video sequence of the true scene from multiple sensor image frames from the sources. Although, this paper only describes the use of an EM algorithm to perform the image fusion, it is contemplated that, according to an exemplary embodiment of the invention, the transform coefficients may be organized into a quadtree structure and that a HMT may be formed from this quadtree structure in a manner similar to the method described above with reference to Figure 3. 4.1 MST
This multi-frame image fusion method according to an exemplary embodiment of the invention is based on a statistical image formation model. The model parameters and the final fusion result are produced by estimation. Because this approach is based on estimation theory, training data are used to obtain the estimates. However, the true scene is generally unknown and only the information in the source videos may be used for training data. Thus, in this exemplary embodiment, the temporal information contained in the source video frames is used to provide information about the true scene. This approach is analogous to the approach exemplary embodiment of the invention described above with reference to Figures 4 and 8, except that instead of the wavelet and/or pyramid transformation being performed in two spatial dimensions, the MST is desirably performed along one spatial and one temporal dimension. In an exemplary embodiment the video frames are described as the true scene corrupted by additive non-Gaussian distortion. The HMT may be used to estimate conditional probabilities and the EM algorithm may be used to update the image fusion parameters. An inverse transformation may then be used to create the fused frames. 4.1.1 Experiment results The multi-frame EM-based fusion algorithm according to exemplary embodiments of the invention has been applied to the night vision applications as shown in Figures 9A-F and 10A-F. In night vision applications, the majority of the information may be contained in the IR images, while some information from the visual images complements IR images. This may be different from daytime image fusion situations, in which the majority of the information is typically contained in the visual images.
In the exemplary embodiment of the invention described below with reference to the images of Figures 9A-F, an infrared video sequence and an image intensified video
sequence are fused. These two sequences were taken from a still scene by two still video cameras. The first frame of these two videos are shown in Figures 9A (IR frame 900) and 9B (image intensified frame 902). The image intensified video sequence may be seen to have a large amount of sensor noise. Figure 9C illustrates fused image 904 resulting from applying a single-frame EM fusion method. Figure 9D illustrates fused image 906 resulting from applying an exemplary two frame EM fusion method according to an exemplary embodiment of the invention. Figure 9E illustrates fused image 908 resulting from applying an exemplary three frame EM fusion method according to an exemplary embodiment of the invention. Figure 9F illustrates fused image 910 resulting from applying an exemplary four frame EM fusion method according to an exemplary embodiment of the invention.
It may be seen by comparing fused images 906, 908, and 910 to fused image 904 that the exemplary multi-frame EM image fusion methods of exemplary embodiments of the invention may produce better results than the single-frame EM fusion method by removing a significant amount of sensor noise from the fused result. As illustrated above with reference to Figures 9A-F, the inclusion of the HMT algorithm to estimate the conditional probabilities may lead to improved frame fusion results. It may also be noted from fused images 906, 908, and 910 that as more frames are used in exemplary methods of the present invention, the lower the amount of noise remaining in the result. Another exemplary embodiment of the invention is described below with reference to Figures 10A-F. In this example, the two source video sequences were taken from a forward looking infrared (FLIR) sensor (IR frame 1000) and an image intensified video camera (image intensified frame 1002). In these two sequences there is an object on the road moving toward the camera slowly. Figure 1OC illustrates fused image 1004 resulting from applying a single-frame EM fusion method. Figure 1OD illustrates fused image 1006 resulting from applying an exemplary two frame EM fusion method according to an exemplary embodiment of the invention. Figure 1OE illustrates fused image 1008 resulting from applying an exemplary three frame EM fusion method according to an exemplary embodiment of the invention. Figure 1OF illustrates fused image 1010 resulting from applying an exemplary four frame EM fusion method according to an exemplary embodiment of the invention.
The results again show that the exemplary multi-frame EM fusion method of exemplary embodiments of the invention remove a significant amount of noise from the fused video frame. It is also noted that, although there is a moving object presented on the video, this does not adversely affect the multi-frame fusion results for the two frame and three frame methods of Figures 1OD and 1OE. The fused image of the moving object in fused image 1010, resulting from use of an exemplary four frame EM fusion method, is
a little blurred. Thus, according to an exemplary embodiment of the invention, the number of frames to include in the exemplary method is selected depending on the amount of motion expected in the scene. 5. Image fusion using region segmentation Another exemplary embodiment of the invention is an intelligent object/ region based image fusion method using a wavelet/pyramid transform. In most cases, the various objects and regions of interest in the scene are what most users desire to identify within the fused image. Because objects and parts of objects typically carry the information of interest, exemplary embodiments of the invention focus on objects/ regions in the image fusion algorithm.
In many image fusion applications, an object that is clear in one image, taken by a first sensor, may be hard to distinguish or invisible in another source image, taken by another type of sensor, even though both sensors are imaging the same scene. These multiple images of the same scene are desirably fused to exhibit faithfully, in a single image, all the objects/regions of interest. This single fused image may assist the viewer in making a quick decision. Exemplary systems that consider each pixel in the image separately, or only the pixel and its close neighbors, may neglect the fact that each pixel represents a portion of an object or region. By considering the pixel(s) that make up a region in the fused image together, noise and blurring effects may be reduced during the fusion process. Such an exemplary region-based method may be interpreted as trying to determine which image provides a "clearer," or more detailed, presentation of each object. Thus, this approach is related to a criterion that may insure a desirable fused image. 5.1 Overview of the Object/Region Based Fusion Method An exemplary region-based image fusion method according to an embodiment of the invention is illustrated in Figures 11 and 12 that combines aspects of both pixel-level and feature-level fusion. The elements in Figures 11 and 12 with common labels indicate common types of elements and do not necessarily indicate identity. For example, source images 100 are not the same images and do not necessarily originate from the same type of source.
The features used may be objects or regions of interest, which is referred to herein as regions for simplicity and clarity. Additionally, the following description describes the region-based image fusion of two source images, but the choice of only two source images is merely illustrative and one skilled in the art may understand that this exemplary method may be extended to create fused images from a larger number of source images.
The overview schematic diagram of this exemplary image fusion method is shown in Figure 11. Two source images 100 are provided. It is noted that the source images are desirably registered at the beginning of the exemplary method so that the corresponding pixels in the images may be substantially aligned. If the images are 5 frames of video sequence, it is desirable for the video sequences to be substantially temporally registered as well. However, if the motion in the scene is slow relative to the frame rate, this temporal registration may not need to be particularly precise. Any standard registration algorithm may be applied to the source images if they are not initially registered. io A multiscale transform of each of the two registered images is computed to form wavelet coefficient maps 102. A decision map 1100 is then generated. In an exemplary embodiment of the invention, the decision map 1100 is generated according to the exemplary method described in the next section below with reference to Figure 12. Each coefficient/pixel of decision map 1100 includes a relative weight factor for each i5 source image transform coefficient. These relative weight factors may be understood as describing the desired contribution of the corresponding source image transform coefficients to be used to determine the coefficient/pixel in the fused image. Alternatively, as described above with reference to the exemplary method of Figures 4 and 8, the relative weight factors of decision map 1100 may be used as the initial
20 estimates of image fusion parameters to be used in other exemplary image fusion methods of the invention.
Based on decision map 1100, wavelet coefficient maps 102 of the source images may be fused in the wavelet transform domain to produce fused wavelet coefficient map 1102. Thus, each transform coefficient for fused image 1104 is set to be
25 a combination of the transform coefficients of source images 100. Fused image 1104 is obtained by taking the inverse wavelet transform of fused wavelet coefficient map 1102. The operations of the exemplary methods illustrated in Figure 11 may be performed by a processing unit 1110 as illustrated in Figure 11. Embodiments of the invention encompass the processing unit 110 being a single processor, a plurality of
30 processors operating in series and/or parallel, or a combined processing unit. These processors may include special purpose circuitry or ASIC's designed to perform the processing steps, or a general purpose computer programmed to perform the processing steps. 5.2 Fusion Decision Map Base on Object/Region Detection
35 As discussed in the previous section, decision map 1100 indicates how to combine the transform coefficients of sensor images 100 to determine the desired transform coefficients of the fused image. Figure 12 illustrates an exemplary method of
generating decision map 1100. For each source image, edge image 1200, region image 1202, and region activity table 1204 are created from wavelet coefficient map 102, as shown in Figure 12. Region activity tables 1204 and region images 1202 are used in conjunction with fusion criteria 1206 to create fusion decision map 1100. Each pixel in 5 fusion decision map 1100 includes information regarding weights to be given the various transform coefficients related to the corresponding pixel in region images 1202.
5.2.1 Edge Image
After generating the MST representation of wavelet coefficient maps 102, an edge detector is applied to the LL band of the transform coefficients. Although it is IQ contemplated that any edge detection technique may be used according to embodiments of the invention, the Canny edge detection algorithm is used in an exemplary embodiment of the invention. A Canny edge detection algorithm may be desirable for extracting step edges corrupted by white noise. The desirability of this detection routine is related to three criteria : i5 (1) The detection criterion of this routine expresses the goal that important edges should not be missed and that there should be no spurious responses;
(2) The localization criterion of this routine indicates that the distance between the actual and located position of the edge should be minimal;
20 and
(3) The one response criterion minimizes multiple responses to a single edge.
The last of these criteria is partly covered by the first criterion because under this algorithm when there are two responses to a single edge one of the responses 25 should be considered to be false. The third criterion solves the problem of an edge that has been corrupted by noise. It may also work against non-smooth edge operators.
5.2.2 Region Image
After the formation of edge images 1200, region segmentation is performed based on the edge information in these images 1200. Although embodiments
30 of the invention encompass using any region segmentation algorithm, an extended labeling algorithm is used in an exemplary embodiment of the invention as described below. The LL band of wavelet coefficient map 102 may be used as the source image, S, in this example. Edge image 1200 may be denoted as E.
A separate labeling image L1 the same size as source image S, is initialized
35 such that its original pixel values are set to be zero. The entire source image S is searched row by row. A non-zero integer label, x, is assigned to each pixel L(iJ) in the labeling image whose corresponding pixel value E(i,j) in the edge image E is less than an
edge threshold value T1 (i.e. the pixel does not represent an edge or a weak edge). The value x assigned to the non-edge pixel in the labeling image is chosen according to the labels of the neighbors of the pixel. The neighboring pixels of pixel 1300 may desirably be chosen to be 4 connectivity neighbors 1302, as shown in Figure 13A, or 8 connectivity 5 neighbors 1304, as shown in Figure 13B. Exemplary criteria for deciding the value x based on the selected neighbors are:
(A) If all the neighboring pixels have not been labeled (i.e. their labels are all zero), L(I1J) is assigned a new unused label. This new label may be the next unused integer, but other choices may be made as well; and o (B) If there are one or more neighboring pixels that have non-zero labels, the neighboring pixel having an image value closest to the image value S(U) of pixel 1300 is determined. Then:
(i) If the difference between these two values is less than a certain region threshold value threshold T2, assign the label s of this neighboring pixel to L(i,j); and
(ii) Otherwise, L(i,j) is assigned a new unused label.
In an exemplary embodiment of the invention, the thresholds T1 and T2 are desirably selected adaptively based on the source images. For example, T1 and T2 may be chosen as a function of the mean and variance of pixel value in the edge and source o images. A priori knowledge of the content and noise levels of the source images may also affect the selection Of T1 and T2, for example.
The resulting labeling image, thus, includes different values to represent different regions/objects, with zero values corresponding to edges in the source image. 5.2.3 Region Activity Table 5 A region activity table 1204 is generated as described below according to an exemplary embodiment of the invention. Information on salient features of an object in visual images may be partially captured by the magnitude of high frequency transform coefficients (the LH, HL, and HH bands in the MST representation) corresponding to that object. Consider, for example, two regions with similar size and signal-to-noise ratio o (SNR) in two registered visual images, each representing the same object in a real scene. The one having the larger magnitude high frequency components typically contains more detail. The activity level of each region may be calculated as the average of the absolute value of the high frequency band transform coefficients. This calculated activity level represents a measure of the magnitude of the high frequency transform 5 coefficients. The activity level values corresponding to the pixels of each source image may be organized in region activity table 1204, as illustrated in Figure 12. It is noted that region image 1202 corresponds to the LL band of the transform coefficient pyramid
(or other pyramid), while region activity table 1204 depends on the LH, HL, and HH bands.
In an exemplary embodiment of the invention, the decision map is generated according to the activity level, size and position of each object or region. 5 Thus, by using both region image 1202 and region activity table 1204 to determine the fusion decision map, all of the bands of the MST representation may be incorporated into this determination. If the SNR's of the images are known to be different, this difference may be accounted for by introducing a weighting factor in the activity level calculation, for example. io The activity level of region k in visual source image n, Aln(k), is given by
Equation (19):
where N
k is the total number of pixels in region k. P
j is the activity intensity of pixel j in region k, which is given by Equation (20):
I 5
where W is the weight which is determined by the SNR of the image and other factors, M is the number of wavelet decomposition levels, C, is one of the transform coefficients in high frequency bands corresponding to pixel ;. The second sum in Equation (20) is over
20 all the transform coefficients that correspond to pixel j in the high frequency bands of the mth decomposition stage. We note that this exemplary activity level algorithm may be particularly useful for visual images, but determination of the decision map may use other factors to override this information in the final fusion when non-visual sensors are involved. The difference between fusion criteria for visual and non-visual sensors is
25 discussed in more detail in the next subsection. 5.2.4 Decision Map and Fusion Criterion
A decision map 1100 and fusion criterion 1206 according to an exemplary embodiment of the invention are described below. The size of fusion decision map 1100 is desirably the same as the sizes of the region images, which are the same size as the
3o LL band in wavelet coefficient maps 102. Each pixel in the decision map corresponds to a set of transform coefficients in each frequency band of all decomposition levels, such as the sets of quadtree structures associated with each pixel in the exemplary method of Figures 4 and 8. Once decision map 1100 is determined, the desired mapping is determined for determining all the transform coefficients in fused wavelet coefficient map
35 1102 from all the transform coefficients in wavelet coefficient maps 102.
For example, if two registered source images A and B are to be fused, decision map 1100 may be either a binary image or a real valued image with values in the range of 0 to 1. If a very simple approach is desired, a binary decision map (0,1 values) may be used. Alternatively, digital maps with more steps or continuous value maps may be used. In this exemplary embodiment, the value of each pixel in decision map 1100 represents the probability that the pixel should come from the corresponding pixel in source image A. It is noted that, for N source images, N-I decision maps may be desirable to determine the contribution from each source image to the fused image. This probability definition of decision map 1100 allows this generalization to N>2 in a straight forward manner.
Thus, in general, the pixel values in decision map 1100 give the weight to be given to the corresponding pixel in a given source image. For each pixel in the decision map, a value of "1" indicates that image A is to be used, instead of image B, to describe the transform coefficients that correspond to this pixel. Likewise the value "0" indicates that the corresponding transform coefficients from image B are to be used, instead of those from image A. It is noted that, the pixels are related through the MST representation of wavelet coefficient maps 102, thus, each coefficient is computed from a filtering involving the coefficients to which it is related. Therefore, if a given pixel in the decision map is a "1," then, in fused wavelet coefficient map 1102, all the transform coefficients corresponding to this pixel are taken from the wavelet coefficient map of source image A. If the pixel is "0," then, in fused wavelet coefficient map 1102, all the transform coefficients corresponding to this pixel are taken from the wavelet coefficient map of source image B. If a pixel in decision map 1100 has a value z between zero and one, then, in fused wavelet coefficient map 1102, all the transform coefficients corresponding to this pixel are taken as a weighted average of the transform coefficients from the wavelet coefficient maps of both source images. These weights in an exemplary embodiment are z for source image A and (1-z) for source image B.
Fusion criterion 1206 is used to produce fusion decision map 1100 from region images 1202 and region activity tables 1204. When comparing region images 1202 corresponding to different source images 100, three options exists for each pixel of decision map 1100, P(i,j). The pixel may be:
(1) Within region m of image A, and within region n of image B;
(2) An edge point in one image, and within a region in the other image; or
(3) An edge point in both images. The value assigned to each pixel in decision map 1100 may be determined according to one or more of the following exemplary criteria :
(A) When fusing visible images, high activity-level is preferred over low activity-level;
(B) Small regions are generally preferred over large regions when comparing similar activity levels; (C) Edge points are generally preferred over non-edge points when comparing similar activity levels;
(D) In fusing visible images, high activity-level is preferred over low activity-level;
(E) When a non-visual image is included that is anticipated to show a particular object of interest that is not seen in the visible image(s), fusion criterion 1206 may be set to assign a weight of "1" to pixels of the non- visual image that include the particular object of interest in fusion decision map 1100;
(F) Make fusion decisions on non-edge points first and consider the neighboring pixels when making fusion decisions on edge points; and
(G) Avoid isolated pixels in decision map 1100.
In an exemplary embodiment, to judge the appropriate size of the activity level an estimate of the noise/distortion in the source images is compared. If the activity level of a pixel from one source image is high enough compared to the others, a value is assigned to the corresponding pixel in fusion decision map to desirably pass the corresponding coefficients from that source image on to fused image 1104. If the activity levels for a given pixel are similar in two or more images, the corresponding coefficients from those images may be weighted in accordance to the relative size of their respective activity levels. The approach of fusion based on activity level measurement may also be employed for fusion with only visible sensors and also for some cases with some non- visible sensors (for example for fusion of IR and low light images in night vision), however this approach may be less desirable for CWD.
It may be desirable in embodiments of the invention for criterion (E) to take preference over the other criteria, when there is preexisting knowledge about particular objects of interest, such as in many CWD applications. Examples of such situations may include the detection of guns in MMW images used for CWD applications or people, who may not be visible in low light images, in IR images used for night vision applications. These objects may be determined to be present in the non-visual images by checking the edge strengths of region boundaries in edge image 1200. If the edge strength of a boundary is above a threshold, the average, lowest and highest value of
the coefficient intensity over the object may be checked to determine whether they are within predetermined bounds.
In alternate embodiments of the invention, the shape, size, location, and orientation of the object may be checked to determine whether they fit the parameters of an object of interest. In exemplary embodiments of the invention, a complete fused image 1104 may also be checked to determine whether the view of the scene provided is consistent with the normal rules of physics and the normal expectations for the type of scene imaged. For example, in a CWD application, region shapes in the visual image that represent human bodies with high confidence may be identified in visual and/or IR images. Objects from non-visual (MMW for example) images that match the characteristics of a gun that happen to fall in a likely place on a human body shape in the visual image may be identified as well. When this occurs the corresponding transform coefficients describing the gun in the non-visual image are desirably included fused image 1104 by providing a weight of "1" for the corresponding pixels of the non- visual image in fusion decision map 1100.
In exemplary embodiments of the invention, activity levels are incorporated in this approach to fuse together several non-visual images. 5.3 Experimental Results
Figures 14A-H show an exemplary fusion process according to an embodiment of the invention of visual image 1400 and MMW image 1402. The source images in Figures 14A and 14B were obtained from Thermotex Corporation. Figures 14C and 14E illustrate exemplary processing steps based on visual image 1400 and Figures 14D and 14F illustrate exemplary processing steps based on MMW image 1402. Figures 14C and 14D depict edge images 1404 and 1406, respectively, formed from the source images. Figures 14E and 14F depict labeling images 1408 and 1410, respectively, formed from the edge images and the source images. Figure 14G illustrates binary fusion decision map 1412 determined from labeling images 1408 and 1410, using a fusion criterion based on preexisting knowledge of the images.
As illustrated in fused image 1414 of Figure 14H, visual image 1400 provides the outline and the appearance of the people in the scene while MMW image
1402 shows the existence of a gun. From this fused image 1414, a user may clearly and promptly determine that the person on the right has a concealed gun beneath his clothes. Such a fused image may prove very helpful, allowing a police officer to make a quick response. Figures 15A-H show the exemplary fusion process according to an embodiment of the invention of a pair of multi-focused test images 1500 and 1502 for a digital camera application. In test image 1500, the focus is on the soda can. In test
image 1502, the focus is on the testing card. Figures 15C and 15E illustrate exemplary processing steps based on test image 1500 and Figures 15D and 15F illustrate exemplary processing steps based on test image 1502. Figures 15C and 15D depict edge images 1504 and 1506 formed from the test images 1500 and 1502, respectively. Figures 15E and 15F depict labeling images 1508 and 1510 formed from the edge images 1504 and 1506 and the source images 1500 and 1502, respectively. Figure 15G illustrates binary fusion decision map 1512 determined from labeling images 1508 and 1510 and corresponding region activity tables (not shown), using an activity level fusion criterion. The less blurred portions of each test image typically have a higher activity level. Thus, in fused image 1514, both the Pepsi can and the testing card are shown in focus.
Figures 16A-H show the exemplary fusion process of a pair of field images 1600 and 1602 with different brightness. Field image 1600 was taken at noon, while field image 1602 was taken in the dusk. The field images in Figures 16A and 16B were obtained from RSTA data set collected jointly by Colorado State University and Lockheed Martin under sponsorship of DARPA. Figures 16C and 16E illustrate exemplary processing steps based on field image 1600 and Figures 16D and 16F illustrate exemplary processing steps based on field image 1602. Figures 16C and 16D depict edge images 1604 and 1606, respectively, formed from the field images 1600 and 1602. Figures 16E and 16F depict labeling images 1608 and 1610, respectively, formed from the edge images 1604 and 1606 and the source images 1600 and 1602. Figure 16G illustrates binary fusion decision map 1612 determined from labeling images 1608 and 1610 and corresponding region activity tables (not shown), using an activity level fusion criterion. From binary fusion decision map 1612, it is seen that the clearer midday field image 1600 is correctly selected to produce fused image 1614. Although this may be an uncommon use of image fusion, it shows that this exemplary method has a reliable selectivity.
Although the invention is illustrated and described herein with reference to specific exemplary embodiments, the invention is not intended to be limited to the details shown. Rather, various modifications may be made in the details within the scope and range of equivalents of the claims and without departing from the invention. Appendix A. Outline of computing the conditional probabilities A.I Derivation of Conditional Probabilities
Let X1 k denote the vector consisting of the transform coefficients for the kth tree in /th image. Consider the computations of p(S(j) = m | z( jt ,Φ) , j = \,...,P and p(S(j) = m, S(p(j)) = n \ zι k,Φ) , j = l,...,P , where S(j) may be used as the shorthand
notation for Sl k(j) when focusing on a given /cth tree in /th image. To simplify matters, this discussion focuses on the /cth tree using this simplified notation.
For the /cth tree, TJ is defined to be the subtree of observed transform coefficients with root at node j so that the subtree T1 contains coefficient z(j) and all of its descendants. If T1 is a subtree of T (i.e., z(/) and all its descendants are members of Tj ), then TjKl is defined to be the set of transform coefficients obtained by removing the subtree T1 from T} . Let z(l) denote the root node of the /cth tree. Thus, Tx denotes the /cth tree and T1 = zα .
To further simplify the following discussion, additional notation may be defined. Let /(•) denote the probability density function (pdf) of its argument (excusing the abuse of notation for its brevity). For each subtree T 1 the conditional likelihoods are defined as:
B1 (M) = /(T1 \ S(j) = m,Φ) (21)
B J,
P(
JM) = f(
Tj I S(PU)) = m,Φ) (2 2)
= f V
P(
J)M I
S(PU)) = m,Φ) (23)
It is noted that a pdf times the appropriate differential may be thought of as the probability that the random variable or vector is in some incrementally small region. Thus, the joint probability that a discrete random variable SO) equals a particular value m and that a continuous random variable is in some incrementally small region may be expressed as:
AJ (m) = p(S(j) = m,Tι,J \ Φ) (24) after multiplication by the appropriate differential.
From the HMT properties, T1 and T]Sj are independent given the state variable S(j) . Therefore: P(S(J) = mjx I Φ) = A1(Tn)B ^m) (25)
P(S(J) = m,S(p(j)) = n,Tx \ Φ) = B^nήB^^a^A^n) (26) The likelihood of Z1 k may be obtained from:
f (zljc I Φ) = /(T1 I Φ) = ∑p(S(j) = m,Tλ I Φ) = fjBJ (m)AJ (m) ( 27) m=\
The marginal state pmf is obtained from Equations (25) and (27) as:
A1 (Tn)B , (m) p(S(j) = m \ zl k ,Φ) = M (28) n=l and the parent-child pmf is obtained from Equations (26) and (27) as:
A.2 Upward-Downward Algorithm The upward-downward algorithm disclosed by M Crouse, et al. and O
Ronen, et al. may used to produce the quantities needed for calculating the conditional probabilities in Equations (28) and (29). The upward step in the algorithm produces the B coefficients and downward step produces the A coefficients. It is assumed that the wavelet transform uses L decomposition levels and that J=I is the index for the decomposition level with the finest scale, while J=L is the index for the decomposition level with the coarsest scale. Recall that p(j) denotes the parent of node j and c(j) denotes the children of node ;. Define the shorthand notation: g(r,β,W,σ>)= ' eJj^≠] (30)
The upward-downward algorithm may be performed over one transform coefficient quadtree in the following two steps (For simplification, the subscripts in zι k (j),βι k (j) and wk (j) are omitted in the upward and downward steps because this algorithm is applied to a given quadtree for a particular image thus a given /, k). Upward step:
Initialize: For all state variables S(j) at the finest scale J=I, calculate for m = \,...,M :
1) For all state variables S(j) at scale J1 compute for m = \,...,M :
M
^υ)(™) = Σ<;<,A(*) (32)
BpU)(m) (33)
2) Set J = J + 1, move up the tree one scale. 3) If J = L, stop; else return to step 1).
Downward step:
Initialize: For state variable S(Y) at the coarsest scale J = L, set for m = \,...,M :
A(m) = Psm(m) (35)
1) Set J = J - I, move down the tree one scale. 2) For all state variables S(j) at scale J, compute for m = \,...,M :
Λ(™) = Σ<;oΛ(>)^>) (3 6) π=l
3) If J = 1, stop; else return to step 1).
Appendix B. Outline of the derivation of the update equations
The EM algorithm provides a general approach to iterative computation of maximum-likelihood estimates from incomplete data, as disclosed in AP Dempster, et al. Let X denote the incomplete data set and let X
c denote the complete data set. From the exemplary image formation model of Equations (1) and (5) : and (37) x
c
I-.
,-
?}.
(38) Here it is assumed that each quadtree use the same distortion model as shown in Equation (5). Further, p
Sϋ)(tn) identifies the probability that ε
ι k (j) from
Equation (1) comes from a particular term m in the Gaussian mixture pdf of Equation (5), which models εl k (j) , the additive distortion in the observation zl k (j) . The common parameter set is Φ defined in Equation (6). The conditional probabilities used in the iterative EM algorithm, p(Sl k (j) = m \ z,Φ) and p(Sl k (j) = m,Sl k (p(j)) = n \ z,Φ) , are derived in Appendix A.
Let S = {Sα (./) : i = l,...,g; k = l,...,K; j = \,...,P] denote the complete set of state variables. Each iteration of the EM algorithm involves two steps, the expectation step (E-step) and the maximization step (M-step), as described in AP Dempster, et al. The E-step of the EM algorithm performs an average over the complete data, conditioned upon the incomplete data to produce the cost function:
ρ(Φ'|Φ) = E5{in/z(z,s|Φ')|z,Φ}
= Es{lnps(S|Φ') + ln/z(z|S,Φ')|z,Φ}
K P
+ E s \∑∑∑p(S(j)\S(p(j)),Φ')\z,Φ
[ ,=1 k=\ ;=2
= F1 +F2 +F3
Here E5 implies an average over ,S11(I),..., SqK(ϊ), S1 ,(2),..., SqK(P) and given the conditioning, the current (old) parameters in the distribution are used in this averaging and it is noted that the probabilities involving Slk(j) are independent of/, k.
Hence ps tU)(Slk(j)) may be denoted by pSϋ)(S(j)) in Equation (37) for simplicity.
This exemplary EM algorithm updates the parameter estimates to new values Φ' defined by Equation (7) that maximize Q(Φ'\ Φ) in Equation (37). This is called the M-step of the EM algorithm.
In order to maximize Q(Φ'\ Φ) analytically, each parameter is updated one at a time. The first term in Equation (39) may be written as:
= Σ Σ Σ W(S(I) = m) ■ p(Shk (1) = m I z,Φ) (40)
The updated estimate for ps^(m) is obtained from maximizing Equation
M
(40) analytically by solving 9F
1 /φ'
S(1) (m) = 0 subject to ∑ p'
S(i) (m) = 1, for m=l m = \,...,M . The updated estimate for p
sm(m) is:
The second term in Equation (39) may be written out like:
F2 = Esl∑∑∑P(SU) I S(p(/)), Φ') I z,φ|
[l=l k=l J=2 J (42) q K P M M
= Σ ,=1 Σ k=\ Σ J=I Σ m=\ Σ n=\ ln a'Zω -P(S,Λ U) = *n,S,jt (P(J)) = n\z,Φ)
The updated estimate for a"}*{j) in Equation (9) is obtained from maximizing Equation (42) analytically by solving 9F2 / da'"*(j) = 0 , subject to
M M
ΣΣaTPωPs(pθ))(n) = l>for J = l,-,P; /w = l,...,M. For any node j = 2,...,P1 there is m=l n=l the relationship between it and its parent node as:
Therefore, from Equations (9) and (43), together with Equation (41), the update Equation (8) in section 2.2 may be obtained. The third term in Equation (39) may be written out as:
■p(SlΛ(j) = m\z,Φ) (44)
■p(SιΛ(j) = m\z,Φ) where B is a term that is independent of Φ'. The updated estimate for σ^(y') in Equation (10) is obtained from maximizing Equation (44) analytically by solving dF3/dσ'm 2 (j) = 0 for j = \,...,P ; m = 1,...,Af , using the updated p'S(j)(m), a^p n (j)l and other old parameter values. Because βlΛ(j) is discrete, β\Λ{j) is updated to have the value from the set {0, -1, +1} that maximizes Equation (44) with p's(j) (m) , a'"'p''ω, σ'm O) 1 and a" the other parameters set at their old values. The updated estimate for
wk (j) in Equation (12) is obtained from maximizing Equation (44) analytically by solving SF3 / dw\ (j) = 0 for k = l,...,K ; j = \,...,P , using the updated parameters where possible.