WO2005109339A1 - Creating an output image - Google Patents

Creating an output image Download PDF

Info

Publication number
WO2005109339A1
Authority
WO
WIPO (PCT)
Prior art keywords
input images
group
of pixels
pixels
particular object
Prior art date
Application number
PCT/IB2005/051440
Other languages
French (fr)
Inventor
Henricus W. P. Van Der Heijden
Paul M. Hofman
Claus N. Cordes
Original Assignee
Koninklijke Philips Electronics N.V.
Priority date
Filing date
Publication date
Application filed by Koninklijke Philips Electronics N.V. filed Critical Koninklijke Philips Electronics N.V.
Priority to EP05738325A priority Critical patent/EP1751711A1/en
Priority to JP2007512646A priority patent/JP2007536671A/en
Publication of WO2005109339A1 publication Critical patent/WO2005109339A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/14Transformations for image registration, e.g. adjusting or mapping for alignment of images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion

Definitions

  • the invention relates to a method of creating an output image on basis of a sequence of temporally consecutive input images.
  • the invention further relates to a computer program product to be loaded by a computer arrangement, comprising instructions to create an output image on basis of a sequence of temporally consecutive input images.
  • the invention further relates to an image processing apparatus being arranged to create an output image on basis of a sequence of temporally consecutive input images.
  • An advantage of showing a sequence of temporally consecutive input images is that dynamic events can be visualized, e.g. movement of an object relative to its background can be shown. For instance a sports game like football, in which the actual movement of the ball is relevant, can be shown. During broadcasts it is common to repeat portions of a sequence of images corresponding to a football game. Typically these portions correspond to the most exciting moments of the game. However, when it is required to illustrate such an exciting moment in for instance a newspaper or some other kind of printed media, much of the attractiveness of the event is lost. This is because a picture in a newspaper does not indicate the dynamics of the event.
  • This object of the invention is achieved in that the method comprises: identifying a particular part of a particular object in a first one of the input images; fetching a first group of pixels from the first one of the input images, the first group of pixels corresponding to the particular part of the particular object; localizing the particular part of the particular object in a second one of the input images; fetching a second group of pixels from the second one of the input images, the second group of pixels corresponding to the particular part of the particular object; and appending the second group of pixels to the first group of pixels to form the output image.
  • An obvious existing approach to illustrate a dynamic event is to create a schematic drawing e.g. an artificial graphical representation.
  • the method according to the invention differs in that use is made of a sequence of temporally consecutive input images, i.e. spatio-temporal data, to generate a static, i.e. a spatial image, comprising an object of the input images at different moments in time. Portions are selected from the dynamic (x,y,t) input images, and combined to form a single static (x,y) output image. This is done in such a manner that the static output image illustrates a dynamic event, such as for example the motion of an object.
  • a characteristic feature is that the output image comprises image data of a particular part of a particular object sampled at different moments in time.
  • the particular part of the particular object appears multiple times in the output image.
  • the second group of pixels is appended to the first group of pixels, typically directly adjacent to the first group of pixels.
  • sets of spatial image data are used to create a larger output image.
  • portions of consecutive images are combined differently.
  • respective pixels of spatially overlapping image regions are merged. The result is that each object appears only once in the output image.
  • the appending comprises a weighted summation of respective pixel values of the first group of pixels and the second group of pixels.
  • An advantage of a weighted summation is that the transition in the luminance and/or color from the first group of pixels to the second group of pixels is smoothed.
  • the second group of pixels is just put adjacent to the first group of pixels.
  • a combination of placing groups of pixels and using weighted summation for the transitions is used. Thus portions of two images are selected and combined through some form of interpolation, either through weighted averaging or simply placing the portions adjacent to one another.
  • the first group of pixels corresponds to the pixels of a number of columns of pixels of the first one of the input images.
  • the first group of pixels, and also consecutive groups of pixels, extend over the complete height of the pixel matrix corresponding to the input images. That means that all pixels which are located at a column comprising pixels representing the particular part of the particular object are selected and used as a kind of slice to construct the output image.
  • the output image comprises a set of slices which are fetched from the consecutive input images. Each of the slices shows the particular part of the particular object in the respective input images. Typically, the slices also represent a background in front of which the particular object is moving.
  • This embodiment according to the invention is advantageous for creating an output image which illustrates a horizontal movement of the object.
  • the first group of pixels corresponds to the pixels of a number of rows of pixels of the first one of the input images.
  • the first group of pixels, and also consecutive groups of pixels, extend over the complete width of the pixel matrix corresponding to the input images. That means that all pixels which are located at a row comprising pixels representing the particular part of the particular object are selected and used as a kind of slice to construct the output image.
  • the output image comprises a set of slices which are fetched from the consecutive input images. Each of the slices shows the particular part of the particular object in the respective input image.
  • the slices also represent a background in front of which the particular object is moving.
  • This embodiment according to the invention is advantageous for creating an output image which illustrates a vertical movement of the object.
  • the first group of pixels corresponds to the pixels of a number of columns of pixels of the first one of the input images
  • the number of columns of pixels is based on tracking of the particular object.
  • the movement of the particular object is estimated.
  • the estimated movement determines the dimensions of the first group of pixels. For instance, if the estimated movement of the particular part of the particular object is equal to 20 pixels, then the number of columns of pixels is also 20.
  • the number of rows of pixels is based on tracking of the particular object.
  • the movement of the particular object is estimated.
  • the estimated movement determines the dimensions of the first group of pixels. For instance, if the estimated movement of the particular part of the particular object is equal to 20 pixels, then the number of rows of pixels is also 20.
  • the tracking is based on evaluating a number of motion vector candidates, the evaluating comprising establishing of a minimal match error. This technique is generally known as motion estimation.
  • the match error corresponds to a difference between respective pixel values corresponding to the particular object in the first one of the input images and/or the second one of the input images.
  • Movement is a relative quantity. Movement can be expressed relative to the pixel matrices of the consecutive input images. If the consecutive input images were acquired by means of a stationary positioned camera, that approach is appropriate. That means that the coordinates of the particular part of the particular object in the first one of the input images and the coordinates of the particular part of the particular object in the second one of the input images can directly be used to compute the motion of the object. However, in many cases the camera is panning and/or zooming during acquisition of a moving object.
  • the number of columns of pixels is based on tracking motion of the background in the first one of the input images and/or the second one of the input images.
  • the number of rows of pixels is based on tracking motion of the background in the first one of the input images and/or the second one of the input images.
  • compensation according to a background motion model is realized. This may be a so-called pan-zoom model, which models the background motion as a combination of translation and scaling, but it may also be more complex and also cover other aspects such as perspective projections and rotations.
  • the number of fetched columns/rows is based on movement.
  • This movement is relative to the background in front of which the object is moving. In case of a stationary located camera this movement corresponds to movement relative to the various pixel matrices.
  • the particular object can also be tracked semi-manually.
  • the number of columns of pixels is determined by: determining a first pixel coordinate on basis of identifying the particular part of the particular object in the first one of the input images; determining a second pixel coordinate on basis of identifying the particular part of the particular object in a third one of the input images; determining the number of consecutive input images being temporally located between the first one of the input images and the third one of the input images; and computing the number of columns on basis of the first pixel coordinate, the second pixel coordinate and the number of consecutive input images.
  • a user has to indicate in a number of images where the particular part of the particular object is located. This might be done by means of moving a cursor relative to the displayed input images.
  • This object of the invention is achieved in that the computer program product, after being loaded in a computer arrangement comprising processing means and a memory, provides said processing means with the capability to carry out: accepting a location of a particular part of a particular object in a first one of the input images; fetching a first group of pixels from the first one of the input images, the first group of pixels corresponding to the particular part of the particular object; localizing the particular part of the particular object in a second one of the input images; fetching a second group of pixels from the second one of the input images, the second group of pixels corresponding to the particular part of the particular object; and appending the second group of pixels to the first group of pixels to form the output image.
  • the image processing apparatus comprises processing means with the capability to carry out: accepting a location of a particular part of a particular object in a first one of the input images; fetching a first group of pixels from the first one of the input images, the first group of pixels corresponding to the particular part of the particular object; localizing the particular part of the particular object in a second one of the input images; fetching a second group of pixels from the second one of the input images, the second group of pixels corresponding to the particular part of the particular object; and appending the second group of pixels to the first group of pixels to form the output image.
  • Modifications of the method, and variations thereof, may correspond to modifications and variations of the image processing apparatus and the computer program product being described.
  • Fig. 1 schematically shows the method according to the invention, wherein the camera was stationary during acquisition of the input images
  • Fig. 2A schematically shows the method according to the invention, wherein the camera was panning during acquisition of the input images
  • Fig. 2B schematically shows a number of output images according to the invention
  • Fig. 3 schematically shows a number of input images of a football match and an output image which is created according to the invention, based on these input images
  • Fig. 4 schematically shows a first embodiment of the image processing apparatus according to the invention
  • Fig. 5 schematically shows a second embodiment of the image processing apparatus according to the invention.
  • Same reference numerals are used to denote similar parts throughout the Figures.
  • FIG. 1 schematically shows the method according to the invention, wherein the camera was stationary during acquisition of the input images 102, 104 and 106.
  • the input images 102, 104 and 106 represent an object, i.e. a ball 100 which was moving in front of a homogeneous background.
  • the camera was not moving during the acquisition of the input images 102, 104 and 106.
  • the output image 108 which is based on the input images 102, 104 and 106 comprises a number of slices 110, 112 and 114 of the respective input images 102, 104 and 106.
  • by a slice is meant a set of pixels corresponding to a number of columns (or rows) of an input image.
  • the arrows in Fig. 1 depict the relation between the slices as fetched from the input images 102, 104 and 106 and the slices being combined to form the output image 108.
  • the size of these slices is based on the movement of the ball 100 relative to the pixel matrices.
  • the output image 108 also comprises a start portion 116 of the first input image 102 and an end portion 118 of the last input image 106.
  • the size of the start portion 116 and of the end portion 118 is not related to the movement of the ball 100.
  • Fig. 2A schematically shows the method according to the invention, wherein the camera was panning during acquisition of the input images.
  • the input images 102, 104 and 106 represent an object, i.e. a ball 100 which was moving in front of a house.
  • the camera was panning during the acquisition of the input images 102, 104 and 106.
  • the direction of the movement of the camera and that of the ball are equal.
  • the speed of the camera movement is higher than the speed of the ball 100.
  • the output image 208 which is based on the input images 102, 104 and 106 comprises a number of slices 110, 112 and 114 of the respective input images 102, 104 and 106.
  • the arrows in Fig. 2A depict the relation between the slices as fetched from the input images 102, 104 and 106 and the slices being combined to form the output image 208.
  • the size of these slices is based on the movement of the ball 100 relative to the background.
  • the output image 208 also comprises a start portion 116 of the first input image 102 and an end portion 118 of the last input image 106. The size of the start portion 116 and of the end portion 118 is not related to the movement of the ball 100.
  • the output image 208 shows the complete house whereas the different input images each show only a portion of the house. That means that the method according to the invention is such that spatially related image data is also combined, optionally resulting in a relatively large output image.
  • Fig. 2B schematically shows a number of output images 202, 204 and 208 being constructed according to this approach.
  • a first one of the output images 202 shows an overview image in which the ball 100 is visible only once.
  • FIG. 3 schematically shows a number of input images 102, 104 and 106 of a football match and an output image 308 which is created according to the invention, based on these input images 102, 104 and 106. It should be noted that the shown input images 102, 104 and 106 are only a part of a longer sequence of consecutive input images.
  • the input images 102, 104 and 106 represent a football match. In a first one of the input images 102 it can be seen that a player kicks the ball 100 (see the circle).
  • Fig. 3 also shows the output image 308 which is based on the shown input images 102, 104 and 106 and based on approximately 40 not shown input images. The actual trajectory of the ball is clearly visible in the output image 308.
  • Fig. 4 schematically shows a first embodiment of the image processing apparatus 400 according to the invention.
  • the image processing apparatus 400 is provided with a sequence of input images at its image input connector 410 and is arranged to provide a sequence of intermediate output images and a final output image at its image output connector 414.
  • the image processing apparatus is provided by location information which is provided by means of user interaction, e.g. by a user who has indicated the object of interest in a number of input images.
  • the image processing apparatus 400 comprises processing means with the capability to carry out: accepting a location of a particular part of a particular object in a first one of the input images, by means of location information input interface 412; fetching a first group of pixels by means of pixel processor 404 from the first one of the input images which is temporarily stored in an input memory device 402, wherein the first group of pixels corresponds to the particular part of the particular object; localizing the particular part of the particular object in a second one of the input images, by means of the localization unit 408; fetching a second group of pixels by means of pixel processor 404 from the second one of the input images which is temporarily stored in the input memory device 402 after the first one of the input images, wherein the second group of pixels also corresponds to the particular part of the particular object; and appending, by means of the pixel processor 404, the second group of pixels to the first group of pixels to form the output image.
  • Fig. 5 schematically shows a second embodiment of the image processing apparatus 500 according to the invention.
  • This embodiment 500 is basically the same as the embodiment 400 as described in connection with Fig. 4. A difference is that this embodiment 500 is arranged to compensate for camera movement.
  • This embodiment of the image processing apparatus is arranged to perform motion estimation of the background to be able to compensate for the effects of camera movement.
  • This embodiment 500 comprises an additional memory device for temporary storage of a second input image.
  • the localization unit 408 is provided with positional information of a target of interest, i.e. a particular object to be tracked, within the sequence of input images.
  • the localization unit 408 is arranged to compute a global motion vector for the background in front of which the target object is moving.
  • the global motion vector is computed by combining a number of motion vectors being computed on basis of a pair of input images.
  • the motion vectors are computed by means of a standard motion estimator which is preferably incorporated in the localization unit 408.
  • the motion estimator is e.g. as specified in the article "True-Motion Estimation with 3-D Recursive Search Block Matching" by G. de Haan et al., in IEEE Transactions on Circuits and Systems for Video Technology, vol. 3, no. 5, October 1993, pages 368-379.
  • a motion vector for the entire image is computed on basis of a mean image-row (x-component) and a mean image-column (y-component), as disclosed in the article "Feature-based block matching algorithm using integral projections" by J.S. Kim and R.-H. Park, in Electronics Letters, vol. 25, pp. 29-30.
  • the pixel processor 404 and the localization unit 408 may be implemented using one processor. Normally, these functions are performed under control of a software program product. During execution, normally the software program product is loaded into a memory, like a RAM, and executed from there. The program may be loaded from a background memory, like a ROM, hard disk, or magnetically and/or optical storage, or may be loaded via a network like Internet.
  • an application specific integrated circuit provides the disclosed functionality.
  • the working of the embodiment of the image processing apparatus as depicted in Fig. 5 will be explained using an example involving a sequence of input images representing a free kick in football.
  • a few input images, i.e. video frames, are shown in Fig. 3.
  • the camera was panning, with a non-constant speed, from the location of the kick to the goal.
  • the dynamic event to be captured in the output image is the ball flying into the goal and therefore the ball has to be tracked in the sequence of input images.
  • the motion of the ball is approximated by using a constant velocity in the x direction (this is along the left-right axis in the input images). This is a reasonable assumption of the ball motion between the kick and the first following contact with an object such as the goal net.
  • motion in the y direction is disregarded (the top-bottom axis in the input images).
  • the user is required to provide two or more spatio-temporal positions x_screen(n_i) for input images n_i, in order to be able to determine the velocity v, as well as to provide start and end points of the event.
  • the relative camera position x_camera(n) for each input image n is automatically calculated from the video sequence.
  • v is calculated for the event, and for each input image n the horizontal areas of interest, i.e. the slices comprising a number of columns of the input images, in screen coordinates, are centered around x_screen(n), which can be calculated from Equation (1).
  • the method, computer program product and image processing apparatus may be beneficial for several applications, e.g.: professional image processing, like in film studios, broadcast studios or for making newspapers and other types of printed media; consumer electronics devices, like TVs, set-top boxes and personal video recording devices; educational purposes; and consumer video processing software, e.g. for making home videos.
  • professional image processing like in film studios, broadcast studios or for making newspapers and other types of printed media
  • consumer electronics devices like TVs, set-top boxes and personal video recording devices
  • educational purposes and consumer video processing software, e.g. for making home videos.
  • any reference signs placed between parentheses shall not be construed as limiting the claim.
  • the word 'comprising' does not exclude the presence of elements or steps not listed in a claim.
  • the word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements.
  • the invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In the claims enumerating several means, several of these means can be embodied by one and the same item of hardware.
  • the usage of the words first, second and third, etcetera, does not indicate any ordering. These words are to be interpreted as names.
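The semi-manual free-kick workflow described above (user-marked ball positions, automatically computed camera pan, constant horizontal velocity) can be sketched as follows. Equation (1) is referenced but not reproduced in this text, so the world-to-screen mapping below is an assumed plausible form, and all function and variable names are illustrative rather than taken from the patent.

```python
def slice_centres(x_marks, n_marks, x_camera, n_frames):
    """Return the screen x coordinate around which the slice of each
    input image n is centred.

    x_marks  : screen x positions of the ball marked by the user
    n_marks  : indices of the input images in which the user marked it
    x_camera : relative camera pan per input image (computed elsewhere
               from the video sequence, e.g. by background motion
               estimation)
    n_frames : total number of input images in the event
    """
    # World position of the ball at the user-marked images: assumed to
    # be screen position plus accumulated camera pan.
    x_world = [x_marks[i] + x_camera[n_marks[i]] for i in range(len(x_marks))]
    # Constant velocity along the x axis, from the first and last mark.
    v = (x_world[-1] - x_world[0]) / (n_marks[-1] - n_marks[0])
    n0 = n_marks[0]
    # Map the constant-velocity world trajectory back to screen
    # coordinates for every input image.
    return [x_world[0] + v * (n - n0) - x_camera[n] for n in range(n_frames)]
```

For a camera panning 10 pixels per frame and a ball marked at screen positions 50 (frame 0) and 70 (frame 4), the slice centres advance 5 pixels per frame on screen even though the ball moves 15 pixels per frame in world coordinates.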

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)
  • Studio Devices (AREA)

Abstract

A method of creating an output image (108) on basis of a sequence of temporally consecutive input images is disclosed. The method comprises: identifying a particular part of a particular object (100) in a first one of the input images (102); fetching a first group of pixels (110) from the first one of the input images (102), the first group of pixels (110) corresponding to the particular part of the particular object (100); localizing the particular part of the particular object (100) in a second one of the input images (104); fetching a second group of pixels (110) from the second one of the input images (104), the second group of pixels (110) corresponding to the particular part of the particular object (100); and appending the second group of pixels (110) to the first group of pixels (110) to form the output image.

Description

Creating an output image
The invention relates to a method of creating an output image on basis of a sequence of temporally consecutive input images. The invention further relates to a computer program product to be loaded by a computer arrangement, comprising instructions to create an output image on basis of a sequence of temporally consecutive input images. The invention further relates to an image processing apparatus being arranged to create an output image on basis of a sequence of temporally consecutive input images.
An advantage of showing a sequence of temporally consecutive input images is that dynamic events can be visualized, e.g. movement of an object relative to its background can be shown. For instance a sports game like football, in which the actual movement of the ball is relevant, can be shown. During broadcasts it is common to repeat portions of a sequence of images corresponding to a football game. Typically these portions correspond to the most exciting moments of the game. However, when it is required to illustrate such an exciting moment in for instance a newspaper or some other kind of printed media, much of the attractiveness of the event is lost. This is because a picture in a newspaper does not indicate the dynamics of the event.
It is an object of the invention to provide a method of the kind described in the opening paragraph for summarizing a dynamic event in an output image. This object of the invention is achieved in that the method comprises: identifying a particular part of a particular object in a first one of the input images; fetching a first group of pixels from the first one of the input images, the first group of pixels corresponding to the particular part of the particular object; localizing the particular part of the particular object in a second one of the input images; fetching a second group of pixels from the second one of the input images, the second group of pixels corresponding to the particular part of the particular object; and appending the second group of pixels to the first group of pixels to form the output image. An obvious existing approach to illustrate a dynamic event is to create a schematic drawing, e.g. an artificial graphical representation. The method according to the invention differs in that use is made of a sequence of temporally consecutive input images, i.e. spatio-temporal data, to generate a static, i.e. a spatial, image comprising an object of the input images at different moments in time. Portions are selected from the dynamic (x,y,t) input images, and combined to form a single static (x,y) output image. This is done in such a manner that the static output image illustrates a dynamic event, such as for example the motion of an object. A characteristic feature is that the output image comprises image data of a particular part of a particular object sampled at different moments in time. In other words, the particular part of the particular object appears multiple times in the output image. This is because the second group of pixels is appended to the first group of pixels, typically directly adjacent to it.
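The claimed identify-fetch-localize-fetch-append sequence can be illustrated with a minimal sketch. The helper below assumes the particular object part has already been localized in each input image (its column position is given) and extracts vertical slices, one per image; all names are illustrative, not taken from the patent.

```python
import numpy as np

def compose_strobe(images, column_positions, slice_width):
    """Build a static output image from temporally consecutive input
    images by appending one vertical slice per input image.

    images           : list of H x W x 3 uint8 arrays (the input images)
    column_positions : x coordinate of the tracked object part per image
    slice_width      : number of columns to fetch around each position
    """
    half = slice_width // 2
    slices = []
    for img, x in zip(images, column_positions):
        left = max(0, x - half)
        right = min(img.shape[1], left + slice_width)
        # Fetch the group of pixels: all rows of the selected columns.
        slices.append(img[:, left:right])
    # Append each group of pixels to the previous one.
    return np.hstack(slices)
```

With three 10 x 30 input images and the tracked part at columns 5, 15 and 25, a slice width of 4 yields a 10 x 12 output image in which the object part appears three times, once per input image.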
There is a clear distinction with the prior art of creating one large panorama view by stitching together different smaller images. In that case sets of spatial image data are used to create a larger output image. In the method according to the prior art, portions of consecutive images are combined differently. Typically respective pixels of spatially overlapping image regions are merged. The result is that each object appears only once in the output image. In the method according to the invention use is explicitly made of data representing a single object at different moments in time. In an embodiment of the method according to the invention, the appending comprises a weighted summation of respective pixel values of the first group of pixels and the second group of pixels. An advantage of a weighted summation is that the transition in the luminance and/or color from the first group of pixels to the second group of pixels is smoothed. Alternatively, the second group of pixels is just put adjacent to the first group of pixels. Typically, a combination of placing groups of pixels and using weighted summation for the transitions is used. Thus portions of two images are selected and combined through some form of interpolation, either through weighted averaging or simply placing the portions adjacent to one another. In an embodiment of the method according to the invention, the first group of pixels corresponds to the pixels of a number of columns of pixels of the first one of the input images. In this embodiment of the method according to the invention the first group of pixels, and also consecutive groups of pixels, extend over the complete height of the pixel matrix corresponding to the input images. That means that all pixels which are located at a column comprising pixels representing the particular part of the particular object are selected and used as a kind of slice to construct the output image.
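The weighted summation at the transition between two groups of pixels can be realized as a simple linear cross-fade over a few overlapping columns. The sketch below is one plausible implementation of such a smoothing; the overlap width and the linear weighting are illustrative choices, not prescribed by the patent.

```python
import numpy as np

def append_with_crossfade(left_part, right_part, overlap):
    """Append right_part to left_part, cross-fading `overlap` columns
    so the luminance/colour transition between the two groups of
    pixels is smoothed by a weighted summation."""
    l = left_part.astype(np.float64)
    r = right_part.astype(np.float64)
    # Linear weights: 1 -> 0 for the left slice, 0 -> 1 for the right.
    w = np.linspace(1.0, 0.0, overlap)[None, :, None]
    blended = w * l[:, -overlap:] + (1.0 - w) * r[:, :overlap]
    return np.concatenate(
        [l[:, :-overlap], blended, r[:, overlap:]], axis=1
    ).astype(np.uint8)
```

Blending two uniform slices with values 100 and 200 over three columns produces intermediate columns 100, 150 and 200, i.e. a smooth ramp instead of a hard seam.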
In other words the output image comprises a set of slices which are fetched from the consecutive input images. Each of the slices shows the particular part of the particular object in the respective input images. Typically, the slices also represent a background in front of which the particular object is moving. This embodiment according to the invention is advantageous for creating an output image which illustrates a horizontal movement of the object. In an embodiment of the method according to the invention the first group of pixels corresponds to the pixels of a number of rows of pixels of the first one of the input images. In this embodiment of the method according to the invention the first group of pixels, and also consecutive groups of pixels, extend over the complete width of the pixel matrix corresponding to the input images. That means that all pixels which are located at a row comprising pixels representing the particular part of the particular object are selected and used as a kind of slice to construct the output image. In other words the output image comprises a set of slices which are fetched from the consecutive input images. Each of the slices shows the particular part of the particular object in the respective input image. Typically, the slices also represent a background in front of which the particular object is moving. This embodiment according to the invention is advantageous for creating an output image which illustrates a vertical movement of the object. In an embodiment of the method according to the invention, wherein the first group of pixels corresponds to the pixels of a number of columns of pixels of the first one of the input images, the number of columns of pixels is based on tracking of the particular object. The movement of the particular object is estimated. The estimated movement determines the dimensions of the first group of pixels.
For instance, if the estimated movement of the particular part of the particular object is equal to 20 pixels, then the number of columns of pixels is also 20. In an embodiment of the method according to the invention wherein the first group of pixels corresponds to the pixels of a number of rows of pixels of the first one of the input images, the number of rows of pixels is based on tracking of the particular object. The movement of the particular object is estimated. The estimated movement determines the dimensions of the first group of pixels. For instance, if the estimated movement of the particular part of the particular object is equal to 20 pixels, then the number of rows of pixels is also 20. In an embodiment according to the invention, the tracking is based on evaluating a number of motion vector candidates, the evaluating comprising establishing of a minimal match error. This technique is generally known as motion estimation. Preferably, the match error corresponds to a difference between respective pixel values corresponding to the particular object in the first one of the input images and/or the second one of the input images. Movement is a relative quantity. Movement can be expressed relative to the pixel matrices of the consecutive input images. If the consecutive input images were acquired by means of a stationary camera, that approach is appropriate. That means that the coordinates of the particular part of the particular object in the first one of the input images and the coordinates of the particular part of the particular object in the second one of the input images can directly be used to compute the motion of the object. However, in many cases the camera is panning and/or zooming during acquisition of a moving object. If the sequence of temporally consecutive input images is based on such an acquisition, a correction for this camera movement is preferred.
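The evaluation of motion vector candidates by establishing a minimal match error can be sketched as follows; the sum-of-absolute-differences error, the function name and the data layout are illustrative assumptions rather than the exact estimator of the description:

```python
def best_motion_vector(prev, curr, block, candidates):
    """Return the candidate displacement (dy, dx) with the minimal
    match error for the given block of `prev`, together with that error.

    `prev` and `curr` are 2-D lists of pixel values; `block` is
    (y0, x0, height, width) in `prev`. The match error is the sum of
    absolute differences between corresponding pixel values.
    """
    y0, x0, h, w = block
    best, best_err = None, float("inf")
    for dy, dx in candidates:
        err = 0
        for y in range(h):
            for x in range(w):
                err += abs(prev[y0 + y][x0 + x] - curr[y0 + y + dy][x0 + x + dx])
        if err < best_err:
            best, best_err = (dy, dx), err
    return best, best_err
```

The displacement found this way gives the movement of the particular object relative to the pixel matrices, from which the number of columns (or rows) per slice follows.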
In a preferred embodiment according to the invention the number of columns of pixels is based on tracking motion of the background in the first one of the input images and/or the second one of the input images. Alternatively, the number of rows of pixels is based on tracking motion of the background in the first one of the input images and/or the second one of the input images. In general, compensation according to a background motion model is realized. This may be a so-called pan-zoom model, which models the background motion as a combination of translation and scaling, but it may also be more complex and cover other aspects such as perspective projections and rotations. As said, the number of fetched columns/rows is based on movement. This movement is relative to the background in front of which the object is moving. In the case of a stationary camera this movement corresponds to movement relative to the various pixel matrices. As an alternative to tracking the particular object by means of motion estimation on basis of evaluation of motion vectors, the particular object can also be tracked semi-manually. In that case the number of columns of pixels is determined by: determining a first pixel coordinate on basis of identifying the particular part of the particular object in the first one of the input images; determining a second pixel coordinate on basis of identifying the particular part of the particular object in a third one of the input images; determining the number of consecutive input images being temporally located between the first one of the input images and the third one of the input images; and computing the number of columns on basis of the first pixel coordinate, the second pixel coordinate and the number of consecutive input images. In this embodiment according to the invention, a user has to indicate in a number of images where the particular part of the particular object is located.
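For the semi-manual variant, the computation of the number of columns from the two user-indicated pixel coordinates can be sketched as follows; constant horizontal motion between the two marks is assumed, and the function name is illustrative:

```python
def columns_per_slice(x_first, x_third, n_between):
    """Number of columns of pixels to fetch per input image.

    `x_first` and `x_third` are the x coordinates of the particular
    part of the particular object as identified by the user in the
    first and third input images; `n_between` is the number of
    consecutive input images temporally located between them.
    """
    # Image-to-image displacements between the two marked images.
    steps = n_between + 1
    return round(abs(x_third - x_first) / steps)
```

For example, marking the object at x = 100 and, with two intermediate images, at x = 160 gives 20 columns per slice, matching the 20-pixel example given earlier. The coordinates themselves come from the user indicating the object in the displayed images.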
This might be done by means of moving a cursor relative to the displayed input images. It is a further object of the invention to provide a computer program product of the kind described in the opening paragraph for summarizing a dynamic event in an output image. This object of the invention is achieved in that the computer program product, after being loaded into a computer arrangement comprising processing means and a memory, provides said processing means with the capability to carry out: accepting a location of a particular part of a particular object in a first one of the input images; fetching a first group of pixels from the first one of the input images, the first group of pixels corresponding to the particular part of the particular object; localizing the particular part of the particular object in a second one of the input images; fetching a second group of pixels from the second one of the input images, the second group of pixels corresponding to the particular part of the particular object; and appending the second group of pixels to the first group of pixels to form the output image. It is a further object of the invention to provide an image processing apparatus of the kind described in the opening paragraph for summarizing a dynamic event in an output image.
This object of the invention is achieved in that the image processing apparatus comprises processing means with the capability to carry out: accepting a location of a particular part of a particular object in a first one of the input images; fetching a first group of pixels from the first one of the input images, the first group of pixels corresponding to the particular part of the particular object; localizing the particular part of the particular object in a second one of the input images; fetching a second group of pixels from the second one of the input images, the second group of pixels corresponding to the particular part of the particular object; and appending the second group of pixels to the first group of pixels to form the output image. Modifications of the method, and variations thereof, may correspond to modifications and variations of the image processing apparatus and the computer program product being described.
These and other aspects of the image processing apparatus, of the method and of the computer program product, according to the invention will become apparent from and will be elucidated with respect to the implementations and embodiments described hereinafter and with reference to the accompanying drawings, wherein: Fig. 1 schematically shows the method according to the invention, wherein the camera was stationary during acquisition of the input images; Fig. 2A schematically shows the method according to the invention, wherein the camera was panning during acquisition of the input images; Fig. 2B schematically shows a number of output images according to the invention; Fig. 3 schematically shows a number of input images of a football match and an output image which is created according to the invention, based on these input images; Fig. 4 schematically shows a first embodiment of the image processing apparatus according to the invention; and Fig. 5 schematically shows a second embodiment of the image processing apparatus according to the invention. Same reference numerals are used to denote similar parts throughout the
Figures. Fig. 1 schematically shows the method according to the invention, wherein the camera was stationary during acquisition of the input images 102, 104 and 106. The input images 102, 104 and 106 represent an object, i.e. a ball 100 which was moving in front of a homogeneous background. The camera was not moving during the acquisition of the input images 102, 104 and 106. It can clearly be seen that the ball 100 is moving from the left to the right relative to the pixel matrices corresponding to the input images 102, 104 and 106. The output image 108 which is based on the input images 102, 104 and 106 comprises a number of slices 110, 112 and 114 of the respective input images 102, 104 and 106. By a slice is meant a set of pixels corresponding to a number of columns (or rows) of an input image. The arrows in Fig. 1 depict the relation between the slices as fetched from the input images 102, 104 and 106 and the slices being combined to form the output image 108. The size of these slices is based on the movement of the ball 100 relative to the pixel matrices. The output image 108 also comprises a start portion 116 of the first input image 102 and an end portion 118 of the last input image 106. The size of the start portion 116 and of the end portion 118 is not related to the movement of the ball 100. Fig. 2A schematically shows the method according to the invention, wherein the camera was panning during acquisition of the input images. The input images 102, 104 and 106 represent an object, i.e. a ball 100 which was moving in front of a house. The camera was panning during the acquisition of the input images 102, 104 and 106. The directions of the movement of the camera and of the ball are mutually equal. The speed of the camera movement is higher than the speed of the ball 100. The output image 208 which is based on the input images 102, 104 and 106 comprises a number of slices 110, 112 and 114 of the respective input images 102, 104 and 106. The arrows in Fig.
2A depict the relation between the slices as fetched from the input images 102, 104 and 106 and the slices being combined to form the output image 208. The size of these slices is based on the movement of the ball 100 relative to the background. The output image 208 also comprises a start portion 116 of the first input image 102 and an end portion 118 of the last input image 106. The size of the start portion 116 and of the end portion 118 is not related to the movement of the ball 100. By comparing the output image 208 with the input images 102, 104 and 106 it becomes clear that the output image is larger. The output image 208 shows the complete house whereas the different input images each show a portion of the house. That means that the method according to the invention is such that spatially related image data is also combined, optionally resulting in a relatively large output image. It will be clear that each time a new slice of an input image is appended to the output image as constructed until then, a new output image is created. In other words a first output image which is appended with a slice becomes a second output image. Showing such a series of output images under construction gives a user the impression of a live dynamic event combined with the history of the lapsed part of the event. The user is shown a series of output images which differ in size, i.e. a subsequent output image is larger than its predecessor. Alternatively, first a relatively large overview image is constructed on basis of the sequence of input images, wherein the overview image represents the total scene being captured by the input images, however without the duplicates described above. This is preferably done by using strips of pixels which do not comprise pixels representing a moving object. Typically these strips are located at the border of the input images.
The size of these strips is not related to movement of a particular object to be tracked but is related to movement of the background relative to the camera. After having created such a large overview image the method according to the invention is applied. The intermediate results of the method, i.e. subsequent output images, are combined with the overview image. Basically, this means that the subsequent output images are appended with respective portions, i.e. remaining parts, of the overview image. Fig. 2B schematically shows a number of output images 202, 204 and 208 being constructed according to this approach. A first one of the output images 202 shows the said overview image in which the ball 100 is visible only once. In a second one of the output images 204 the ball 100 is visible twice and in a third one of the output images 208 the ball 100 is visible three times. Fig. 3 schematically shows a number of input images 102, 104 and 106 of a football match and an output image 308 which is created according to the invention, based on these input images 102, 104 and 106. It should be noted that the shown input images 102, 104 and 106 are only a part of a longer sequence of consecutive input images. The input images 102, 104 and 106 represent a football match. In a first one of the input images 102 it can be seen that a player kicks the ball 100, as indicated by the circle. In a second one of the input images 104 it can be seen that the ball 100 is flying through the sky, again as indicated by the circle. In a third one of the input images 106 it can be seen that the ball 100 reaches the goal. Fig. 3 also shows the output image 308 which is based on the shown input images 102, 104 and 106 and on approximately 40 input images which are not shown. The actual trajectory of the ball is clearly visible in the output image 308. Fig. 4 schematically shows a first embodiment of the image processing apparatus 400 according to the invention.
The image processing apparatus 400 is provided with a sequence of input images at its image input connector 410 and is arranged to provide a sequence of intermediate output images and a final output image at its image output connector 414. Preferably, the image processing apparatus according to the invention is provided with location information which is provided by means of user interaction, e.g. by a user who has indicated the object of interest in a number of input images. The image processing apparatus 400 comprises processing means with the capability to carry out: accepting a location of a particular part of a particular object in a first one of the input images, by means of the location information input interface 412; fetching a first group of pixels by means of the pixel processor 404 from the first one of the input images which is temporarily stored in an input memory device 402, wherein the first group of pixels corresponds to the particular part of the particular object; localizing the particular part of the particular object in a second one of the input images, by means of the localization unit 408; fetching a second group of pixels by means of the pixel processor 404 from the second one of the input images which is temporarily stored in the input memory device 402 after the first one of the input images, wherein the second group of pixels also corresponds to the particular part of the particular object; and appending the second group of pixels to the first group of pixels to form the output image. The pixel processor 404 is arranged to make a copy of the accessed second group of pixel values and to write the copy to pixel values at the appropriate position in the output memory device 406. Fig. 5 schematically shows a second embodiment of the image processing apparatus 500 according to the invention. This embodiment 500 is basically the same as the embodiment 400 as described in connection with Fig. 4.
A difference is that this embodiment 500 is arranged to compensate for camera movement. This embodiment of the image processing apparatus is arranged to perform motion estimation of the background to be able to compensate for the effects of camera movement. This embodiment 500 comprises an additional memory device for temporary storage of a second input image. The localization unit 408 is provided with positional information of a target of interest, i.e. a particular object to be tracked, within the sequence of input images. Besides that, the localization unit 408 is arranged to compute a global motion vector for the background in front of which the target object is moving. The global motion vector is computed by combining a number of motion vectors being computed on basis of a pair of input images. The motion vectors are computed by means of a standard motion estimator which is preferably incorporated in the localization unit 408. The motion estimator is e.g. as specified in the article "True-Motion Estimation with 3-D Recursive Search Block Matching" by G. de Haan et al. in IEEE Transactions on Circuits and Systems for Video Technology, vol. 3, no. 5, October 1993, pages 368-379. Alternatively, a motion vector for the entire image is computed on basis of a mean image-row (x-component) and a mean image-column (y-component), as disclosed in the article "Feature-based block matching algorithm using integral projections" by J.S. Kim and R.H. Park, in Electronics Letters, Vol. 25, pp. 29-30. The pixel processor 404 and the localization unit 408 may be implemented using one processor. Normally, these functions are performed under control of a software program product. During execution, normally the software program product is loaded into a memory, like a RAM, and executed from there. The program may be loaded from a background memory, like a ROM, hard disk, or magnetic and/or optical storage, or may be loaded via a network like the Internet.
Optionally an application specific integrated circuit provides the disclosed functionality. The working of the embodiment of the image processing apparatus as depicted in Fig. 5 will be explained using an example involving a sequence of input images representing a free kick in football. A few input images, i.e. video frames, are shown in Fig. 3. The camera was panning, with a non-constant speed, from the location of the kick to the goal. The dynamic event to be captured in the output image is the ball flying into the goal and therefore the ball has to be tracked in the sequence of input images. The motion of the ball is approximated by using a constant velocity in the x direction (this is along the left-right axis in the input images). This is a reasonable assumption for the ball motion between the kick and the first following contact with an object such as the goal net. In this example motion in the y direction is disregarded (the top-bottom axis in the input images). For the x position of the football the following can be derived:
x_screen(n) + x_camera(n) = x_screen(n0) + x_camera(n0) + v · (n − n0),    (1)
where n0 is a reference input image number, for which the x position of the ball on screen (x_screen), i.e. on the pixel matrix, and the relative position of the camera (x_camera) are considered known. The actual position of the ball is given by the sum of the screen position and the camera position. For example, if the ball moves to the right in the "real" world, it is possible that the camera pans faster to the right than the ball moves, in which case the ball can be seen moving to the left on screen. To compensate for this effect, the camera position is included in Equation (1). If a second screen position at input image n1 is known, the true velocity v can be calculated using

v = ((x_screen(n1) + x_camera(n1)) − (x_screen(n0) + x_camera(n0))) / (n1 − n0).    (2)

In this embodiment, the user is required to provide two or more spatio-temporal positions x_screen(ni) for input images ni, in order to be able to determine the velocity v, as well as to provide start and end points of the event. Using a global motion estimation algorithm the relative camera position x_camera(n) for each input image n is automatically calculated from the video sequence. Then v is calculated for the event, and for each input image n the horizontal areas of interest, i.e. the slices comprising a number of columns of the input images, in screen coordinates, are centered around x_screen(n), which can be calculated from Equation (1).
x_screen(n) = x_screen(n0) + x_camera(n0) − x_camera(n) + v · (n − n0).    (3)
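Equations (1) to (3) can be sketched directly in code; the dictionary-based representation of the per-image positions is an assumption made for the example:

```python
def true_velocity(x_screen, x_camera, n0, n1):
    """Equation (2): the true velocity v of the ball, from two known
    screen positions and the recovered camera positions."""
    return ((x_screen[n1] + x_camera[n1])
            - (x_screen[n0] + x_camera[n0])) / (n1 - n0)

def predicted_screen_x(x_screen_n0, x_camera, n0, n, v):
    """Equation (3): the on-screen x position of the ball in input
    image n, around which the slice for image n is centered."""
    return x_screen_n0 + x_camera[n0] - x_camera[n] + v * (n - n0)
```

For instance, with a camera panning 6 pixels per image, the camera positions are 0, 30 and 60 at images 0, 5 and 10; a ball seen at screen positions 50 and 90 in images 0 and 10 then has a true velocity of 10 pixels per image, and its predicted screen position in image 5 is the actual position minus the camera position, consistent with Equation (1).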
These areas of interest, i.e. slices, are copied to appropriate parts of the output image. The embodiment presented here is limited in certain regards, which may be overcome with more advanced processing techniques. Most notably, it is dependent on user interaction to provide start and end input images, as well as start and end locations of
"interesting objects". This could be generalized further ("follow the ball") using (object based) motion estimation and with intelligent automatic choices for start and end frames for an event. The method, computer program product and image processing apparatus according to the invention may be beneficial for several applications, e.g.: professional image processing, like in film studios, broadcast studios or for making newspapers and other types of printed media; consumer electronics devices, like TVs, set-top boxes and personal video recording devices; educational purposes; and consumer video processing software, e.g. for making home videos. It should be noted that the above-mentioned embodiments illustrate rather than limit the invention and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word 'comprising' does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In the claims enumerating several means, several of these means can be embodied by one and the same item of hardware. The usage of the words first, second and third, etcetera, does not indicate any ordering. These words are to be interpreted as names.


CLAIMS:
1. A method of creating an output image (108) on basis of a sequence of temporally consecutive input images, the method comprising: identifying a particular part of a particular object (100) in a first one of the input images (102); fetching a first group of pixels (110) from the first one of the input images (102), the first group of pixels (110) corresponding to the particular part of the particular object (100); localizing the particular part of the particular object (100) in a second one of the input images (104); fetching a second group of pixels (110) from the second one of the input images (104), the second group of pixels (110) corresponding to the particular part of the particular object (100); and appending the second group of pixels (110) to the first group of pixels (110) to form the output image.
2. A method as claimed in claim 1, wherein the appending comprises a weighted summation of respective pixel values of the first group of pixels (110) and the second group of pixels (110).
3. A method as claimed in claim 1, wherein the first group of pixels (110) corresponds to the pixels of a number of columns of pixels of the first one of the input images (102).
4. A method as claimed in claim 1, wherein the first group of pixels (110) corresponds to the pixels of a number of rows of pixels of the first one of the input images (102).
5. A method as claimed in claim 3, wherein the number of columns of pixels is based on tracking of the particular object (100).
6. A method as claimed in claim 4, wherein the number of rows of pixels is based on tracking of the particular object (100).
7. A method as claimed in claim 5 or 6, wherein the tracking is based on evaluating a number of motion vector candidates, the evaluating comprising establishing of a minimal match error.
8. A method as claimed in claim 7, wherein the match error corresponds to a difference between respective pixel values corresponding to the particular object (100) in the first one of the input images (102) and/or the second one of the input images (104).
9. A method as claimed in claim 5, wherein the number of columns of pixels is based on tracking motion of the background in the first one of the input images (102) and/or the second one of the input images (104).

10. A method as claimed in claim 6, wherein the number of rows of pixels is based on tracking motion of the background in the first one of the input images (102) and/or the second one of the input images (104).

11. A method as claimed in claim 5, wherein the number of columns of pixels is determined by: determining a first pixel coordinate on basis of identifying the particular part of the particular object (100) in the first one of the input images (102); determining a second pixel coordinate on basis of identifying the particular part of the particular object (100) in a third one of the input images; determining the number of consecutive input images being temporally located between the first one of the input images (102) and the third one of the input images; and computing the number of columns on basis of the first pixel coordinate, the second pixel coordinate and the number of consecutive input images.
12. A computer program product to be loaded by a computer arrangement, comprising instructions to create an output image (108) on basis of a sequence of temporally consecutive input images, the computer arrangement comprising processing means and a memory, the computer program product, after being loaded, providing said processing means with the capability to carry out: accepting a location of a particular part of a particular object (100) in a first one of the input images (102); fetching a first group of pixels (110) from the first one of the input images (102), the first group of pixels (110) corresponding to the particular part of the particular object (100); localizing the particular part of the particular object (100) in a second one of the input images (104); fetching a second group of pixels (110) from the second one of the input images (104), the second group of pixels (110) corresponding to the particular part of the particular object (100); and appending the second group of pixels (110) to the first group of pixels (110) to form the output image.

13. An image processing apparatus being arranged to create an output image (108) on basis of a sequence of temporally consecutive input images, the image processing apparatus comprising processing means with the capability to carry out: accepting a location of a particular part of a particular object (100) in a first one of the input images (102); fetching a first group of pixels (110) from the first one of the input images (102), the first group of pixels (110) corresponding to the particular part of the particular object (100); localizing the particular part of the particular object (100) in a second one of the input images (104); fetching a second group of pixels (110) from the second one of the input images (104), the second group of pixels (110) corresponding to the particular part of the particular object (100); and appending the second group of pixels (110) to the first group of pixels (110) to form the output image.
14. An image processing apparatus as claimed in claim 13, characterized in further comprising a display device for displaying the output image.
PCT/IB2005/051440 2004-05-10 2005-05-03 Creating an output image WO2005109339A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP05738325A EP1751711A1 (en) 2004-05-10 2005-05-03 Creating an output image
JP2007512646A JP2007536671A (en) 2004-05-10 2005-05-03 Output image generation

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP04102018.1 2004-05-10
EP04102018 2004-05-10

Publications (1)

Publication Number Publication Date
WO2005109339A1 true WO2005109339A1 (en) 2005-11-17

Family

ID=34967091

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2005/051440 WO2005109339A1 (en) 2004-05-10 2005-05-03 Creating an output image

Country Status (5)

Country Link
EP (1) EP1751711A1 (en)
JP (1) JP2007536671A (en)
KR (1) KR20070008687A (en)
CN (1) CN1950847A (en)
WO (1) WO2005109339A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101711061B1 (en) * 2010-02-12 2017-02-28 삼성전자주식회사 Method for estimating depth information using depth estimation device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2354388A (en) * 1999-07-12 2001-03-21 Independent Television Commiss System and method for capture, broadcast and display of moving images
US20030076406A1 (en) * 1997-01-30 2003-04-24 Yissum Research Development Company Of The Hebrew University Of Jerusalem Generalized panoramic mosaic


Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
ADRIEN BARTOLI, NAVNEET DALAL, AND RADU HORAUD: "Motion Panoramas", 17 February 2003 (2003-02-17), XP002338671, Retrieved from the Internet <URL:http://www.inrialpes.fr/movi/people/Horaud/> [retrieved on 20050801] *
BARTOLI A ET AL: "From video sequences to motion panoramas", MOTION AND VIDEO COMPUTING, 2002. PROCEEDINGS. WORKSHOP ON 5-6 DEC. 2002, PISCATAWAY, NJ, USA,IEEE, 5 December 2002 (2002-12-05), pages 201 - 207, XP010628802, ISBN: 0-7695-1860-5 *
GRACIAS N R ET AL: "Trajectory reconstruction with uncertainty estimation using mosaic registration", 30 June 2001, ROBOTICS AND AUTONOMOUS SYSTEMS, ELSEVIER SCIENCE PUBLISHERS, AMSTERDAM, NL, PAGE(S) 163-177, ISSN: 0921-8890, XP004245253 *
INAMOTO N ET AL: "Intermediate view generation of soccer scene from multiple videos", 11 August 2002, PATTERN RECOGNITION, 2002. PROCEEDINGS. 16TH INTERNATIONAL CONFERENCE ON QUEBEC CITY, QUE., CANADA 11-15 AUG. 2002, LOS ALAMITOS, CA, USA,IEEE COMPUT. SOC, US, PAGE(S) 713-716, ISBN: 0-7695-1695-X, XP010613981 *
JONES R C ET AL: "Building mosaics from video using MPEG motion vectors", October 1999, ACM MULTIMEDIA, PROCEEDINGS OF THE INTERNATIONAL CONFERENCE, NEW YORK, NY, US, PAGE(S) 29-32, XP002272152 *

Also Published As

Publication number Publication date
CN1950847A (en) 2007-04-18
KR20070008687A (en) 2007-01-17
EP1751711A1 (en) 2007-02-14
JP2007536671A (en) 2007-12-13

Similar Documents

Publication Publication Date Title
US11217006B2 (en) Methods and systems for performing 3D simulation based on a 2D video image
CN103299610B (en) For the method and apparatus of video insertion
Genc et al. Marker-less tracking for AR: A learning-based approach
KR100271384B1 (en) Video merging employing pattern-key insertion
Brostow et al. Image-based motion blur for stop motion animation
US6124864A (en) Adaptive modeling and segmentation of visual image streams
JP2021511729A (en) Extension of the detected area in the image or video data
JPH11508099A (en) Scene Motion Tracking Method for Raw Video Insertion System
JP2009505553A (en) System and method for managing the insertion of visual effects into a video stream
US20200244891A1 (en) Method to configure a virtual camera path
US10764493B2 (en) Display method and electronic device
Hayashi et al. Synthesizing free-viewpoing images from multiple view videos in soccer stadiumadium
Wu et al. Global motion estimation with iterative optimization-based independent univariate model for action recognition
CN111680671A (en) Automatic generation method of camera shooting scheme based on optical flow
Inamoto et al. Free viewpoint video synthesis and presentation from multiple sporting videos
WO1997026758A1 (en) Method and apparatus for insertion of virtual objects into a video sequence
Carrillo et al. Automatic football video production system with edge processing
WO2005109339A1 (en) Creating an output image
Malik Robust registration of virtual objects for real-time augmented reality
Monji-Azad et al. An efficient augmented reality method for sports scene visualization from single moving camera
Shishido et al. Calibration of multiple sparsely distributed cameras using a mobile camera
KR100466587B1 (en) Method of Extrating Camera Information for Authoring Tools of Synthetic Contents
Thanedar et al. Semi-automated placement of annotations in videos
Liang et al. Video2Cartoon: Generating 3D cartoon from broadcast soccer video
TWI594209B (en) Method for automatically deducing motion parameter for control of mobile stage based on video images

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KM KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SM SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LT LU MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: The EPO has been informed by WIPO that EP was designated in this application

WWE Wipo information: entry into national phase

Ref document number: 2005738325

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 2007512646

Country of ref document: JP

WWE Wipo information: entry into national phase

Ref document number: 1020067023312

Country of ref document: KR

WWE Wipo information: entry into national phase

Ref document number: 200580014962.1

Country of ref document: CN

NENP Non-entry into the national phase

Ref country code: DE

WWW Wipo information: withdrawn in national office

Country of ref document: DE

WWP Wipo information: published in national office

Ref document number: 1020067023312

Country of ref document: KR

WWP Wipo information: published in national office

Ref document number: 2005738325

Country of ref document: EP

WWW Wipo information: withdrawn in national office

Ref document number: 2005738325

Country of ref document: EP