CA2347493A1 - Attentive panoramic sensing for visual telepresence - Google Patents

Attentive panoramic sensing for visual telepresence

Info

Publication number
CA2347493A1
CA2347493A1
Authority
CA
Canada
Prior art keywords
sensor
foveal
panoramic
resolution
display
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
CA002347493A
Other languages
French (fr)
Inventor
James H. Elder
Ronen Goldstein
Yuqian Hou
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CA002347493A priority Critical patent/CA2347493A1/en
Priority to CA2386347A priority patent/CA2386347C/en
Publication of CA2347493A1 publication Critical patent/CA2347493A1/en
Abandoned legal-status Critical Current

Classifications

    • G06T3/047
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformation in the plane of the image
    • G06T3/40Scaling the whole image or part thereof
    • G06T3/4038Scaling the whole image or part thereof for image mosaicing, i.e. plane images composed of plane sub-images
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60Control of cameras or camera modules
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60Control of cameras or camera modules
    • H04N23/698Control of cameras or camera modules for achieving an enlarged field of view, e.g. panoramic image capture
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/18Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast
    • H04N7/183Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast for receiving images from a single remote source
    • H04N7/185Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast for receiving images from a single remote source from a mobile camera, e.g. for remote control

Abstract

Sensor and bandwidth constraints limit the instantaneous spatial resolution and field-of-view (FOV) achievable in any visual system. In the human eye, a compromise has evolved in which resolution is high near the optical axis, but falls off with eccentricity. The effective resolution of the system is extended by fast gaze-shifting mechanisms and a memory system that allows a form of integration over multiple fixations.
We have constructed an artificial visual system based on these concepts. The peripheral component consists of a catadioptric video sensor that provides a panoramic FOV. The foveal component is a video pan/tilt camera with 14 deg FOV. Calibration yields a table of projective parameters, indexed by foveal pan/tilt coordinates, that allows rapid transfer of pixels between foveal and panoramic coordinate frames.
A second transform maps between sensor and display frames. Alpha masking yields a circular, smoothly blended fovea embedded in the lower resolution panoramic image.
The system may be operated in 3 modes. In slaved mode, mouse-clicks in the display generate saccade commands to the pan/tilt platform. In autonomous mode, saccades are entirely determined by motion detected in the peripheral sensor. In semi-autonomous mode these two independent motor command streams are arbitrated to produce a system responsive to operator interest as well as autonomously-detected motion events.
The display duration of foveal images from past fixations is determined by a memory parameter. At one extreme, previous foveal data are immediately replaced by more recent low resolution data from the panoramic sensor. At the other extreme, a sequence of fixations builds up a persistent high resolution mosaic. In intermediate modes, foveal data from previous fixations gradually fade into more recent low-resolution data.
The system is presently operating on a Pentium platform at 15 fps.

Description

ATTENTIVE PANORAMIC SENSING FOR VISUAL TELEPRESENCE
FIELD OF THE INVENTION
The present invention relates to panoramic sensing systems for visual telepresence.
1 Introduction
Over the last ten years there has been increasing interest in wide FOV sensing, particularly panoramic sensing (e.g. [6, 11, 3, 5]). The advantages of a panoramic FOV for surveillance and teleconferencing applications are clear; however, these advantages come at the expense of resolution. Switching from the 14 deg FOV of a typical lens to the 360 deg FOV of a panoramic camera results in a 26-fold reduction in linear resolution. For a standard 768 x 494 NTSC camera, horizontal resolution is reduced to roughly 0.5 deg/pixel, a factor of 60 below human foveal resolution.
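As a quick arithmetic check of these figures (a minimal Python sketch; the 120 samples/deg estimate of human foveal sampling is our assumption, not a figure from the text):

```python
# Back-of-the-envelope check of the FOV/resolution trade-off described above.
fov_lens_deg = 14.0    # FOV of a typical lens
fov_pano_deg = 360.0   # panoramic FOV
h_pixels = 768         # horizontal pixels of a standard NTSC camera

linear_reduction = fov_pano_deg / fov_lens_deg  # ~25.7, the ~26-fold figure
pano_res = fov_pano_deg / h_pixels              # ~0.47 deg/pixel, i.e. roughly 0.5
foveal_res = 1.0 / 120.0                        # assumed human foveal sampling (deg/sample)

print(f"linear resolution reduction: {linear_reduction:.1f}x")
print(f"panoramic resolution: {pano_res:.2f} deg/pixel")
print(f"factor below human foveal resolution: {pano_res / foveal_res:.0f}")  # ~56, i.e. ~60
```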
The human visual system has evolved a bipartite solution to the FOV/resolution tradeoff. The FOV of the human eye is roughly 160 x 175 deg, nearly hemispheric. Central vision is served by roughly 6 million photoreceptive cones that provide high resolution, chromatic sensation over a 5 deg FOV, while roughly 100 million rods provide relatively low-resolution achromatic vision over the remainder of the visual field.
The effective resolution of the system is extended by fast gaze-shifting mechanisms and a memory system that allows a form of integration over multiple fixations.
In this paper, we outline the design of an artificial visual system based on these concepts, and report preliminary results from a prototype system we have constructed. The peripheral component of the system consists of a catadioptric video sensor that provides a panoramic FOV. The foveal component is a video pan/tilt camera with 14 x 10 deg FOV. Video streams from the two sensors are fused at 15 fps on a standard video display.
Saccades (rotations of the pan/tilt sensor) may be initiated either manually by a human observer via mouse clicks on the display, or automatically by a motion localization algorithm. Memory parameters govE~rn the tradeoff between the high spatial resolution of the foveal video stream, and the high temporal resolution of the panoramic stream.
Systems of this kind may be useful in both autonomous and semi-autonomous applications. Events detected in the panoramic sensor may generate saccade commands to allow more detailed inspection/verification at foveal resolution. In telepresence applications, foveal data may provide the resolution required to see facial expressions, read text, etc., while the panoramic data may augment the sense of presence and situational awareness.
2 Prior Work
There has been considerable work on space-variant (foveated) sensor chips [9, 1, 7, 10]. However, since the number of photoreceptive elements on these sensors is limited, they do not provide a resolution or FOV advantage over traditional chips. Moreover, it is not clear how such sensors could be used to achieve a panoramic FOV over which the fovea can be rapidly deployed.
Geisler & Perry [2] have demonstrated a wavelet-based video encoding system that progressively subsamples the video stream at image points distant from the viewer-defined region of interest. Recent work with gaze-contingent displays [4] has shown that video data viewed in the periphery of the human visual system can be substantially subsampled with negligible subjective or objective impact. While our attentive panoramic sensor is not eye-slaved, these prior results do suggest that attention-contingent sampling for human-in-the-loop video is feasible and potentially useful.

SUMMARY OF THE INVENTION
The present invention provides a device for panoramic sensing for visual telepresence, comprising:
a video sensor having a panoramic field of view, a motion sensor and a display means connected to said video sensor; and control means connected to said video sensor, said motion sensor and said display means, said control means being operable in either a slaved mode in which an operator controls positioning of said video sensor, an autonomous mode in which saccades are determined by motion detected by said motion sensor, or a semi-autonomous mode in which saccades are determined by a combination of motion detected by said motion sensor and operator interest.
DETAILED DESCRIPTION OF THE INVENTION
3 Design
3.1 Hardware
The prototype sensor is shown in Fig. 1(a). The panoramic component is a parabolic catadioptric sensor [5] purchased from Cyclovision Technologies (now RemoteReality). The parabolic mirror stands roughly 2 metres from the ground, facing down, and thus images the panoramic field below the ceiling of the laboratory. Panoramic images are captured through an orthographic lens system by a Pulnix TMC-7DSP colour CCD camera.
The foveal component consists of a Cohu 1300 colour CCD camera with a 50 mm Fujinon lens, mounted on a Directed Perception PTU-46-17.5 pan/tilt platform. As loaded, the platform travels at an average speed of roughly 60 deg/sec in both pan and tilt directions; typical saccades complete in 150 to 1200 msec. The platform has been modified so that both axes of rotation coincide approximately with the optical centre of the lens system, so that parallax between foveal images at different pan/tilt coordinates is minimized.
The optical centres of the two sensors are separated by 22 cm in the vertical direction. This means that a fixed system of coordinate transforms between the two sensors can be accurate only if viewing distance is large or if dynamic variations in depth are small relative to viewing distance. Since neither condition holds in our laboratory, we currently calibrate the system for intermediate distances and accept the misregistrations that occur at other depths.
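The magnitude of these misregistrations follows from simple parallax geometry; a hedged sketch (the 3 m calibration depth is an assumed illustrative value, not taken from the text):

```python
import math

BASELINE_M = 0.22  # vertical separation of the two optical centres

def misregistration_deg(d_m, d_cal_m=3.0):
    """Approximate angular misregistration (deg) for a point at depth d_m
    when the fixed transforms were calibrated at depth d_cal_m.
    Small-angle approximation of the parallax-angle difference."""
    return math.degrees(BASELINE_M * abs(1.0 / d_m - 1.0 / d_cal_m))

for d in (1.0, 2.0, 3.0, 5.0, 10.0):
    print(f"depth {d:4.1f} m -> error ~ {misregistration_deg(d):5.2f} deg")
```

At the panoramic resolution of roughly 0.5 deg/pixel, a point at 1 m depth under a 3 m calibration would be misregistered by several degrees, a shift of many panoramic pixels, consistent with the compromise described above.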
Video processing, display and control are handled by a single-CPU 800 MHz Pentium III computer.
The two colour 30 fps NTSC video streams are digitized by a 6-channel Data Translation DT3132 frame grabber card into two digital 640 x 480 video streams. The display is driven by an NVidia 64MB GeForce2 GTS graphics card.

3.2 Coordinate Transformations
In the current prototype, we model the scene as static and piecewise planar, and approximate the correspondence between foveal and panoramic coordinate frames using a table of projective transformations indexed by the pan/tilt coordinates of the foveal sensor. We discuss our calibration procedure in Section 4. The system relies upon 4 different coordinate transformations (Fig. 2):
• panorama→display
• fovea→display
• panorama→pan/tilt
• display→pan/tilt
The first two transformations map the two video streams to common display coordinates. The last two transformations map selected interest points from panoramic or display coordinates to pan/tilt coordinates used to effect a saccade.
3.2.1 Panorama→Display Transformation
The panorama→display coordinate transform is a fixed 3-parameter translation/scaling, so that the observer views the scene essentially in panoramic coordinates. In the present configuration we map a 256 x 128 pixel subimage from the upper half of the panorama to a 1280 x 640 pixel window in the display, as sketched below.
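In code, this fixed transform reduces to two offsets and a uniform scale; a minimal sketch (the subimage origin is an assumed placeholder; the 5x scale follows from the 256 x 128 to 1280 x 640 mapping):

```python
# Fixed 3-parameter panorama->display transform: translation plus scaling.
PAN_X0, PAN_Y0 = 0, 0    # top-left of the 256 x 128 subimage (assumed origin)
SCALE = 1280 / 256       # = 640 / 128 = 5.0, uniform scale to the display window

def panorama_to_display(px, py):
    return (px - PAN_X0) * SCALE, (py - PAN_Y0) * SCALE

def display_to_panorama(dx, dy):
    # Inverse mapping, used by the display->pan/tilt transform (Section 3.2.4).
    return dx / SCALE + PAN_X0, dy / SCALE + PAN_Y0
```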

3.2.2 Fovea→Display Transformation
The fovea→display transformation is composed of fovea→panorama and panorama→display transformations. Calibration (Section 4) yields a table of projective matrices, indexed by the pan/tilt coordinates of the foveal sensor platform, that are used to map foveal pixels to panoramic coordinates. Given an arbitrary pan/tilt index, the projective matrix is constructed by bilinearly interpolating the 8 projective parameters stored at neighbouring entries. The result is then mapped to display coordinates using the fixed panorama→display coordinate transform. The rectangular foveal image is thus mapped to a general quadrilateral in the display.
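A sketch of the interpolation and mapping steps, under our reading of the text: the 8 parameters are the entries of a 3 x 3 homography with the bottom-right element fixed at 1 (table layout and function names are assumptions, not the actual implementation):

```python
import numpy as np

def interpolated_homography(table, pan_grid, tilt_grid, pan, tilt):
    """Bilinearly interpolate the 8 projective parameters stored at the four
    table entries neighbouring (pan, tilt), and rebuild the 3x3 matrix.
    table[i, j] holds the 8 parameters calibrated at (pan_grid[i], tilt_grid[j])."""
    i = int(np.clip(np.searchsorted(pan_grid, pan) - 1, 0, len(pan_grid) - 2))
    j = int(np.clip(np.searchsorted(tilt_grid, tilt) - 1, 0, len(tilt_grid) - 2))
    u = (pan - pan_grid[i]) / (pan_grid[i + 1] - pan_grid[i])
    v = (tilt - tilt_grid[j]) / (tilt_grid[j + 1] - tilt_grid[j])
    p = ((1 - u) * (1 - v) * table[i, j] + u * (1 - v) * table[i + 1, j]
         + (1 - u) * v * table[i, j + 1] + u * v * table[i + 1, j + 1])
    return np.append(p, 1.0).reshape(3, 3)

def fovea_to_panorama(H, x, y):
    """Apply the homography to a foveal pixel in homogeneous coordinates."""
    u, v, w = H @ np.array([x, y, 1.0])
    return u / w, v / w
```

Composing fovea_to_panorama with the fixed panorama_to_display scaling of Section 3.2.1 yields the full fovea→display mapping.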
3.2.3 Panorama→Pan/Tilt Transformation
In addition to the table of projective parameters used for the fovea→panorama transformation, the calibration procedure yields a second table used for the panorama→pan/tilt transformation. This table provides the pan/tilt coordinates required for given panoramic coordinates to map to the centre of a foveal image. Thus the table can be used to centre the fovea at a point of interest automatically detected in the panorama.
3.2.4 Display→Pan/Tilt Transformation
The display→pan/tilt transformation is composed of a fixed translation/scaling display→panorama transformation and the panorama→pan/tilt transformation just described. This transformation is used to generate saccades to points of interest detected in the display by the observer.
4 Calibration
The system is calibrated manually, using a simple calibration rig. Since our sensor is located close to the corner of the laboratory, we work within a 90 x 45 deg subfield located at the top of the panorama and facing out from the walls. 21 synchronous pairs of foveal/panoramic frames are captured over a 7 x 3 regularly spaced grid in pan/tilt space. The rig is positioned at an intermediate distance from the sensor to optimize the working range of the coordinate transformations for the given environment. 12-16 point pairs are manually localized in each foveal/panoramic image pair, and the corresponding least-squares projective transformation is estimated using standard techniques. These data are used to form the fovea→panorama coordinate transformation, indexed by the pan/tilt coordinates of the foveal platform.
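The least-squares fit itself is standard; one common formulation (a hedged sketch, not necessarily the one used here) fixes the ninth homography element at 1, so each point pair contributes two linear equations in the 8 unknown parameters:

```python
import numpy as np

def fit_homography(fov_pts, pan_pts):
    """Least-squares projective transform with h33 fixed at 1.
    fov_pts, pan_pts: (N, 2) arrays (N >= 4) of corresponding foveal and
    panoramic coordinates of the manually localized calibration points."""
    A, b = [], []
    for (x, y), (u, v) in zip(fov_pts, pan_pts):
        # u = (h11 x + h12 y + h13) / (h31 x + h32 y + 1), and likewise for v
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h, *_ = np.linalg.lstsq(np.asarray(A, float), np.asarray(b, float), rcond=None)
    return np.append(h, 1.0).reshape(3, 3)  # the 8 stored parameters plus 1
```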
For each image pair obtained we also store the projection of the foveal centre into panoramic coordinates.
This allows construction of a second table, indexed by panoramic coordinates, that provides the pan/tilt coordinates required to centre the fovea at a specific panoramic location.
This table is used to generate saccades from human or machine attention algorithms.
5 Operation
Fig. 2 shows a schematic of how these video streams are processed, combined and displayed. The panoramic video stream is first unwarped by the CPU using Cyclovision software [5] to form a 1024 x 256 colour video stream (Fig. 3(a)). The two video streams are then transformed into common display coordinates prior to fusion.
The fusion algorithm is essentially to display foveal pixels where they exist, and panoramic pixels otherwise (Fig. 3(b)). In order to make the fusion less jarring to the observer, the foveal and panoramic data are blended using a set of concentric alpha masks, yielding a high-resolution circular fovea smoothly inset within a low-resolution periphery (Fig. 3(c)). All coordinate transformations and masking are done by graphics hardware using OpenGL. When not interrupted by saccade commands, the system runs at 15 fps.
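The blend amounts to a per-pixel convex combination under a radial alpha mask. In the actual system the masking runs on the graphics hardware via OpenGL; the NumPy sketch below is an illustrative CPU equivalent that collapses the set of concentric masks into one smooth ramp (the radii are made-up values):

```python
import numpy as np

def radial_alpha(h, w, cx, cy, r_inner=80.0, r_outer=100.0):
    """Alpha mask: 1 inside r_inner, smooth linear ramp to 0 at r_outer."""
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.hypot(xx - cx, yy - cy)
    return np.clip((r_outer - r) / (r_outer - r_inner), 0.0, 1.0)[..., None]

def fuse(fovea_rgb, pano_rgb, cx, cy):
    """Blend the foveal image (already warped into display coordinates and
    padded to the panorama's size) smoothly into the panoramic frame."""
    a = radial_alpha(*pano_rgb.shape[:2], cx, cy)
    return (a * fovea_rgb + (1.0 - a) * pano_rgb).astype(pano_rgb.dtype)
```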
Saccades are initiated in two ways. If the observer clicks the mouse in the display, the location is transformed from display to pan/tilt coordinates, which form the target of an immediate saccade. Saccades may also be initiated by a motion localization algorithm, which we describe below.
6 Motion Localization
6.1 Algorithm
The system may be operated to make saccades to points in the panorama where motion is detected. A fundamental issue in motion processing is how to select the spatial scale of analysis. In our case, the purpose of the detection is to drive the fovea to the point of interest to resolve the change. Thus it is natural to match the scale of analysis to the FOV of the foveal sensor in panoramic coordinates. In this way, saccades will resolve the greatest amount of motion energy.
Successive panoramic RGB image pairs (Fig. 4(a-b)) are differenced, rectified, and summed to form a primitive motion map (Fig. 4(c)). This map is convolved with a separable square kernel that approximates the FOV of the foveal sensor in panoramic coordinates (50 x 50 pixels). The resulting map (Fig. 4(d)) is thresholded to prevent the generation of saccades due to sensor noise and vibration (Fig. 4(e)).
In order to select the appropriate threshold, an experiment was conducted in which motion map statistics were collected for a static scene. Thirty motion maps yielded nearly a million data points. We ran this experiment under two different conditions. In the first condition, saccades were inhibited, so that vibration in the sensor was minimized. The resulting distribution of motion values is shown in Fig. 5(a).
In the second condition, we computed the motion maps immediately following a saccade, at which time we expect vibration to be near its maximum (Fig. 5(b)). The noise distribution can be seen to depend strongly on the state of the sensor. In the present prototype we use the first distribution to determine the threshold (3.0) and simply inhibit motion detection for a 2-second period following each saccade.
The location of the maximum of the thresholded motion map determines the next fixation (Fig. 4(f)).
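Putting the pipeline together (a minimal sketch; SciPy's uniform_filter stands in for the separable square-kernel convolution and averages rather than sums, so the threshold scale is illustrative; all names are ours):

```python
import numpy as np
from scipy.ndimage import uniform_filter

KERNEL = 50      # foveal FOV in panoramic pixels, from the text
THRESHOLD = 3.0  # noise threshold from the static-scene statistics

def next_fixation(prev_rgb, curr_rgb):
    """Difference, rectify, sum over colour channels, smooth at the foveal
    scale, threshold, and return the panoramic (x, y) of the motion peak,
    or None if nothing exceeds the noise threshold."""
    diff = np.abs(curr_rgb.astype(np.float32) - prev_rgb.astype(np.float32))
    motion = diff.sum(axis=2)                       # primitive motion map
    smoothed = uniform_filter(motion, size=KERNEL)  # separable box kernel
    if smoothed.max() < THRESHOLD:
        return None                                 # e.g. post-saccade inhibition
    y, x = np.unravel_index(np.argmax(smoothed), smoothed.shape)
    return x, y
```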
Since the motion computation and the video fusion computations are done by the same CPU, motion computation pauses the update of the display for an average of 400 msec. This need not occur in a true telepresence application, in which the attention algorithms could run on the host computer of the sensor and the fusion algorithms could run on the client computer of the observer.

7 Memory
What information the human visual system retains over a sequence of fixations is a subject of debate in vision science at the present time (e.g. [8]). There is no question, however, that humans have some forms of visual memory (iconic, short-term, long-term).
We have implemented a primitive sort of memory in our own artificial attentive sensor. The display duration of foveal images from past fixations is determined by a memory parameter. At one extreme, previous foveal data are immediately replaced by more recent low resolution data from the peripheral sensor. At the other extreme, a sequence of fixations builds up a persistent high resolution mosaic (Fig. 6(a)). In intermediate modes, foveal data from previous fixations gradually fade into more recent low-resolution data (Fig. 6(b)).
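One way to realize such a memory parameter is a per-pixel foveal weight that decays multiplicatively each frame; a decay of 0 gives immediate replacement and a decay of 1 a persistent mosaic. A hedged sketch (the exponential-decay form is our illustrative choice, not necessarily the implemented rule):

```python
import numpy as np

class FovealMemory:
    """Blend weight for past foveal data, fading over successive frames."""

    def __init__(self, shape, decay):
        self.weight = np.zeros(shape[:2] + (1,), np.float32)  # per-pixel alpha
        self.fovea = np.zeros(shape, np.float32)              # last foveal RGB
        self.decay = decay  # 0 = no memory, 1 = persistent mosaic

    def fixate(self, fovea_rgb, mask):
        """Deposit fresh foveal pixels; mask is the current foveal alpha."""
        self.fovea = np.where(mask > 0, fovea_rgb, self.fovea)
        self.weight = np.maximum(self.weight, mask)

    def render(self, pano_rgb):
        """Compose the display frame, then let old fixations fade."""
        frame = self.weight * self.fovea + (1.0 - self.weight) * pano_rgb
        self.weight *= self.decay
        return frame.astype(np.uint8)
```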
8 Future Work
We see the attentive panoramic sensor as a testbed for a number of important computer vision problems.
For applications where object distances are much greater than the foveal/panoramic baseline, good registration can be achieved by a single calibration prior to operation. For close-range applications, however, registration is more approximate and fails if the scene is dynamic. Errors can be reduced by redesigning the sensor package to shorten the baseline, but our ultimate goal is to solve foveal-panoramic correspondence in real time and thus achieve good registration for close-range dynamic environments. This is a challenging goal, given the 16:1 linear resolution difference between fovea and panorama.
We believe that human eye movement behaviour and attention processing are to a great degree determined by the decline in visual acuity with eccentricity, and thus we feel that the attentive panoramic sensor is an interesting platform on which to test attention algorithms.
It remains to be seen how effective and immersive an experience this kind of sensor can deliver in a telepresence application. Given the speed of eye movements and the intolerance of the visual system to lag, eye-slaved systems are impractical in many telepresence applications. We wish to investigate the degree to which intelligent attention algorithms and system memory can be used to provide an effective visual experience in situations where lags are significant and bandwidth is limited.
9 Conclusion
We have demonstrated what we believe to be the first attentive panoramic visual sensor in which high resolution (foveal) colour video is fused in real time (15 fps) with colour panoramic video. Saccadic behaviour is determined both by the interest of the observer and by autonomous attention (motion) computations. A primitive form of memory permits the accumulation of high resolution information over space, at the expense of temporal resolution. The attentive panoramic sensor is to be used for future research in real-time video fusion, attention and telepresence.
References
[1] F. Ferrari, J. Nielsen, P. Questa, and G. Sandini. Space variant imaging. Sensor Review, 15(2):17-20, 1995.
[2] W.S. Geisler and J.S. Perry. A real-time foveated multi-resolution system for low-bandwidth video communication. In B. Rogowitz and T. Pappas, editors, Human Vision and Electronic Imaging, SPIE Proceedings, volume 3299, pages 294-305, 1998.
[3] H. Ishiguro, M. Yamamoto, and S. Tsuji. Omni-directional stereo. IEEE Trans. Pattern Analysis and Machine Intelligence, 14(2):257-262, 1992.
[4] L. Loschky and G.W. McConkie. Gaze contingent displays: Maximizing display bandwidth efficiency. Army Research Laboratory Advanced Displays and Interactive Displays Federated Laboratory Third Annual Symposium, 1999.
[5] S. Nayar. Catadioptric omnidirectional camera. Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 482-488, 1997.
[6] S.J. Oh and E.L. Hall. Guidance of a mobile robot using an omnidirectional vision navigation system. Proc. Soc. Photo-Optical Instrumentation Engineers (SPIE), 852:288-300, 1987.
[7] F. Pardo, B. Dierickx, and D. Scheffer. CMOS foveated image sensor: Signal scaling and small geometry effects. IEEE Transactions on Electron Devices, 44(10):1731-1737, October 1997.
[8] R.A. Rensink, J.K. O'Regan, and J.J. Clark. To see or not to see: the need for attention to perceive changes in scenes. Psychological Science, 8(5):368-373, September 1997.
[9] J. van der Spiegel, G. Kreider, C. Claeys, I. Debusschere, G. Sandini, P. Dario, F. Fantini, P. Belluti, and G. Soncini. A foveated retina-like sensor using CCD technology. In C. Mead and M. Ismail, editors, Analog VLSI Implementation of Neural Systems, chapter 8, pages 294-305. Kluwer, Boston, 1989.
[10] R. Wodnicki, G.W. Roberts, and M. Levine. Design and evaluation of a log-polar image sensor fabricated using a standard 1.2 um ASIC CMOS process. IEEE Journal of Solid-State Circuits, 32(8):1274-1277, August 1997.
[11] Y. Yagi and S. Kawato. Panoramic scene analysis with conic projection. Proc. Int. Conf. on Robots and Systems (IROS), 1990.
The foregoing description of the preferred embodiments of the invention has been presented to illustrate the principles of the invention and not to limit the invention to the particular embodiment illustrated. It is intended that the scope of the invention be defined by all of the embodiments encompassed within the following claims and their equivalents.

Claims (4)

1. A device for panoramic sensing for visual telepresence, comprising:
a video sensor having a panoramic field of view, a motion sensor and a display means connected to said video sensor; and control means connected to said video sensor, said motion sensor and said display means, said control means being operable in either a slaved mode in which an operator controls positioning of said video sensor, an autonomous mode in which saccades are determined by motion detected by said motion sensor, or a semi-autonomous mode in which saccades are determined by a combination of motion detected by said motion sensor and operator interest.
2. The device according to claim 1 wherein said video sensor includes a foveal component comprising a video camera, and wherein display duration of foveal images from past fixations is determined by a memory parameter.
3. The device according to claim 2 wherein said control means includes a calibration providing a table of projective parameters, indexed by foveal pan/tilt coordinates, that allows rapid transfer of pixels between foveal and panoramic coordinate frames, including a transform means for mapping between the motion sensor and display frames produced on said display means.
4. The device according to claim 3 including alpha masking means for displaying a high resolution smoothly blended fovea embedded in a lower resolution panoramic image.
CA002347493A 2001-05-14 2001-05-14 Attentive panoramic sensing for visual telepresence Abandoned CA2347493A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CA002347493A CA2347493A1 (en) 2001-05-14 2001-05-14 Attentive panoramic sensing for visual telepresence
CA2386347A CA2386347C (en) 2001-05-14 2002-05-14 Attentive panoramic visual sensor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CA002347493A CA2347493A1 (en) 2001-05-14 2001-05-14 Attentive panoramic sensing for visual telepresence

Publications (1)

Publication Number Publication Date
CA2347493A1 true CA2347493A1 (en) 2002-11-14

Family

ID=4169036

Family Applications (1)

Application Number Title Priority Date Filing Date
CA002347493A Abandoned CA2347493A1 (en) 2001-05-14 2001-05-14 Attentive panoramic sensing for visual telepresence

Country Status (1)

Country Link
CA (1) CA2347493A1 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113156649A (en) * 2015-07-17 2021-07-23 奇跃公司 Virtual/augmented reality system with dynamic regional resolution
CN113156649B (en) * 2015-07-17 2023-04-28 奇跃公司 Virtual/augmented reality system with dynamic region resolution
CN106372636A (en) * 2016-08-25 2017-02-01 上海交通大学 HOG-TOP-based video significance detection method
CN113630571A (en) * 2021-07-13 2021-11-09 北京汽车股份有限公司 High altitude parabolic monitoring method and system for vehicle
CN113630571B (en) * 2021-07-13 2024-04-02 北京汽车股份有限公司 High-altitude parabolic monitoring method and system for vehicle
CN116843684A (en) * 2023-08-30 2023-10-03 江西财经大学 End-to-end panoramic image quality evaluation method based on dynamic visual content
CN116843684B (en) * 2023-08-30 2023-11-14 江西财经大学 End-to-end panoramic image quality evaluation method based on dynamic visual content

Similar Documents

Publication Publication Date Title
US10237478B2 (en) System and method for correlating camera views
US9479732B1 (en) Immersive video teleconferencing robot
CN108139799B (en) System and method for processing image data based on a region of interest (ROI) of a user
US6833843B2 (en) Panoramic imaging and display system with canonical magnifier
US6215519B1 (en) Combined wide angle and narrow angle imaging system and method for surveillance and monitoring
EP0610863B1 (en) Omniview motionless camera surveillance system
EP0971540B1 (en) Omniview motionless camera orientation system
US9215358B2 (en) Omni-directional intelligent autotour and situational aware dome surveillance camera system and method
US7382399B1 (en) Omniview motionless camera orientation system
US7161615B2 (en) System and method for tracking objects and obscuring fields of view under video surveillance
US9602700B2 (en) Method and system of simultaneously displaying multiple views for video surveillance
AU2008283196B2 (en) Method and apparatus for obtaining panoramic and rectilinear images using rotationally symmetric wide-angle lens
US6977676B1 (en) Camera control system
US7714936B1 (en) Omniview motionless camera orientation system
US20030026588A1 (en) Attentive panoramic visual sensor
JP2018528509A (en) Projected image generation method and apparatus, and mapping method between image pixel and depth value
WO2005013001A2 (en) Panoramic video system with real-time distortion-free imaging
WO2002065786A1 (en) Method and apparatus for omni-directional image and 3-dimensional data acquisition with data annotation and dynamic range extension method
Bodor et al. Dual-camera system for multi-level activity recognition
US20030193562A1 (en) Natural vision-based video surveillance system
CA2347493A1 (en) Attentive panoramic sensing for visual telepresence
US20050105793A1 (en) Identifying a target region of a three-dimensional object from a two-dimensional image
Zimmermann et al. A video pan/tilt/magnify/rotate system with no moving parts
Barth et al. A fast panoramic imaging system and intelligent imaging technique for mobile robots
Nicolescu et al. Segmentation, tracking and interpretation using panoramic video

Legal Events

Date Code Title Description
EEER Examination request
FZDE Discontinued