GB2524058A - Image manipulation - Google Patents

Image manipulation

Info

Publication number
GB2524058A
Authority
GB
United Kingdom
Prior art keywords
image
initial image
interest
coding
coding entities
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB1404440.8A
Other versions
GB201404440D0 (en)
GB2524058B (en)
Inventor
Stéphane Baron
Romain Guignard
Pascal Viger
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Canon Inc
Original Assignee
Canon Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Canon Inc filed Critical Canon Inc
Priority to GB1404440.8A priority Critical patent/GB2524058B/en
Publication of GB201404440D0 publication Critical patent/GB201404440D0/en
Publication of GB2524058A publication Critical patent/GB2524058A/en
Application granted granted Critical
Publication of GB2524058B publication Critical patent/GB2524058B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/222Studio circuitry; Studio devices; Studio equipment
    • H04N5/262Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
    • H04N5/2628Alteration of picture size, shape, position or orientation, e.g. zooming, rotation, rolling, perspective, translation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/167Position within a video image, e.g. region of interest [ROI]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/30Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/222Studio circuitry; Studio devices; Studio equipment
    • H04N5/262Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
    • H04N5/2624Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects for obtaining an image which is composed of whole input images, e.g. splitscreen
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/222Studio circuitry; Studio devices; Studio equipment
    • H04N5/262Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
    • H04N5/265Mixing
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/015High-definition television systems
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/18Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

A method for constructing a combined image from initial images representing the same scene, a first initial image 510 having a first resolution and a second initial image having a second higher resolution, comprises the following steps. Spatial parameters, such as position and size, of a region of interest in the second image, corresponding to a region of interest (ROI) 515 in the first image are determined. The ROI may be user selected. The first image is then partitioned into coding entities according to the spatial parameters, to define a subset of coding entities corresponding to a zoom area. The second image is partitioned into coding entities based on those of the first image. Coding entities of the second image in the zoom area and of the first image outside the zoom area are then encoded and combined into a bitstream 540. The size of the zoom area in the first image is preferably substantially equal to the region of interest in the second image. The method may be used to reduce the amount of data required to transmit surveillance image data to mobile devices, while depicting a high definition area of interest.

Description

Image Manipulation
FIELD OF THE INVENTION
The invention relates to image manipulation and, in particular, to combining images of different resolutions. Embodiments of the invention may have particular application to the transmission of several versions of a live video to a plurality of devices including a storage device, high definition display devices and low definition display devices. One of these versions may be a low definition version comprising a high definition zoomed region of interest, which is transmitted to a low definition display device.
BACKGROUND OF THE INVENTION
Networks and devices have greatly evolved during the last decade, offering numerous new applications and experiences to users. Video surveillance is one of these applications rendered easily accessible to users thanks to this evolution. In the past, such video surveillance systems were mainly based on analog video cameras connected to analog displays over dedicated coaxial wired networks. Now, these systems use digital cameras providing high quality compressed video streams to digital displays through non-dedicated hybrid networks comprising wired and wireless connections.
Storage of video data in parallel with monitoring of the displayed video is typically desired. Monitoring allows a user to react directly and rapidly to an event while the stored video can be used later for a detailed analysis of an event. In the past, the storage of the video data was an issue, due to the large amount of video data to be stored. To solve this issue, videos were stored in very low definition in a grey scale format. Video compression, using for instance H.264/AVC (ISO/IEC 14496-10 - MPEG-4 Part 10, Advanced Video Coding / ITU-T H.264) or the emerging video standard HEVC (ISO/IEC 23008-2 - MPEG-H Part 2 / ITU-T H.265), has significantly reduced the amount of video data to be stored while preserving the quality of images. However, since in parallel the resolution of images, in terms of image size, frame rate or number of color components, has increased, strategies to further reduce the amount of video data are still necessary. One strategy consists in storing video data in a lower definition than the video data displayed to the user for monitoring. To simultaneously perform both aspects (monitoring and storage), new security cameras typically employ dual coders. Those coders receive the same video to encode but each one is dedicated to a specific purpose. One coder generates a High Definition stream for the monitoring, to allow a good quality and a high interactivity. The other coder produces a Low Definition stream that will be stored by a recording system.
With the arrival and development of multi-function wireless portable devices such as smartphones and tablets, a new need has appeared. It is now desirable to be able to receive video surveillance streams on portable devices, to view a monitored scene in real time. To address this desire, the Low Definition stream, which matches the resolution of the portable devices, is forwarded from a security camera to the portable viewer. However, due to the reduced size of the screen of the portable devices, an advantageous feature would be to allow zooming on a dedicated part of a scene in order to obtain greater detail.
One simple solution to perform such zooming consists in up-scaling a user-selected part of the low definition images displayed on the portable device. However, up-scaling cannot provide good quality, especially when the level of zoom is unconstrained.
Another solution available in the prior art consists in transmitting a low definition stream and a corresponding high definition stream to the portable device and letting the device manage the combination of data from both video streams for display in accordance with user manipulations. The main drawback of this solution is that it requires the transmission of two video streams, which represents a large amount of bandwidth. Such a bandwidth requirement may not be compatible with the network capabilities of a portable device, especially when this device is using a wireless network with limited bandwidth. In addition, the combination of the two videos requires decoding the two received video streams on the portable device, which represents a computational load that may not be compatible with a portable device's computation capabilities.
SUMMARY OF THE INVENTION
The present invention has been devised to address one or more of the foregoing concerns. In particular, embodiments of the present invention aim to address the above-mentioned problems by offering a zooming application on a portable device wirelessly connected to a video surveillance system, providing a good zoom quality without significantly increasing the required bandwidth with respect to the transmission of a low definition video stream.
According to a first aspect, the invention provides a method for constructing a combined image from initial images representing the same scene, a first initial image having a first resolution and a second initial image having a second resolution higher than the first resolution, the method comprising: determining spatial parameters representing a region of interest in the second initial image, corresponding to an identified region of interest of the first initial image; partitioning the first initial image into coding entities in dependence upon the determined spatial parameters, to define a subset of coding entities corresponding to a zoom area; partitioning the second initial image into coding entities based on the coding entities of the first initial image; and encoding coding entities of the second initial image in the zoom area and coding entities of the first initial image outside of the zoom area, and combining the resulting encoded data into a bitstream representing a combined image.
In this way a combined image can be produced, which combined image includes a background area of a low definition image, with a user-selected high definition zoomed image inserted therein. Furthermore, by defining coding entities in the lower definition image by reference to the ROI of the higher definition image (as opposed to defining them directly from the position of the initially selected ROI), image combination is performed in a way which is more compatible with compression.
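The overall flow of this first aspect can be summarised in a minimal sketch. The function names, the 16x16 coding-entity granularity and the example coordinates are illustrative assumptions only, not part of the claimed method:

    # Minimal sketch of the combined-image construction, at macroblock granularity.
    MB = 16  # assumed size of a coding entity, in pixels

    def scale_roi(roi_ld, ratio_x, ratio_y):
        """Spatial parameters (position and size, in MBs) of the ROI in the HD image."""
        x, y, w, h = roi_ld
        return (x * ratio_x, y * ratio_y, w * ratio_x, h * ratio_y)

    def zoom_area_in_ld(roi_ld, roi_hd, ld_w_mb, ld_h_mb):
        """Zoom area in the LD image: same size as the HD ROI, centred on the LD ROI,
        shifted if necessary so that it stays inside the LD picture."""
        xc = roi_ld[0] + roi_ld[2] // 2
        yc = roi_ld[1] + roi_ld[3] // 2
        w, h = roi_hd[2], roi_hd[3]
        x = min(max(xc - w // 2, 0), ld_w_mb - w)
        y = min(max(yc - h // 2, 0), ld_h_mb - h)
        return (x, y, w, h)

    # Example close to the figure 5 geometry: LD 640x480 (40x30 MBs), HD 2560x1440,
    # so RatioX = 4 and RatioY = 3; the LD ROI is 1x2 MBs.
    roi_ld = (38, 1, 1, 2)                      # (x, y, width, height) in LD MBs
    roi_hd = scale_roi(roi_ld, 4, 3)            # -> (152, 3, 4, 6)
    zoom = zoom_area_in_ld(roi_ld, roi_hd, 40, 30)
    print(roi_hd, zoom)                         # HD entities inside 'zoom' replace LD entities there

The HD image is then partitioned so that the coding entities covering the HD ROI match the entities of this zoom area, and the bitstream is built from the HD entities inside the zoom area and the LD entities outside it.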
In embodiments of the invention, the spatial parameters typically comprise information representative of the position and the size of the region of interest in the second initial image.
Preferably the size of the zoom area in the first initial image is substantially equal to the size of the region of interest in the second initial image, and more preferably the coding entities of the first initial image are defined based on the position and size of the region of interest in the second initial image.
In certain embodiments the coding entities of the first initial image are defined so that each coding entity lies entirely inside or entirely outside the zoom area. In this way the boundaries of the zoom area are aligned with coding entities, and the zoom area can be encoded independently of the remaining, background portion, and vice versa. Similarly, the coding entities of the second initial image are defined so that each coding entity lies entirely inside or entirely outside the zoom area. In some embodiments the coding entities of the first initial image are defined by setting the position of the zoom area in the first initial image to be substantially aligned with the position of the region of interest in the first initial image. For example, the centres of the respective areas/regions are aligned.
Alternatively, the coding entities of the first initial image can be defined by minimising the difference between the position of the zoom area in the first initial image and the position of the region of interest in the first initial image. This is useful if, for example, the zoom area is of such a size that when aligned with the region of interest, it would extend outside the edges of the lower definition image and effectively be 'cropped'.
In such a case, it may be displaced or 'shifted' away from the image edges in such a way that the entire zoom region is inside the edges of the lower definition image, but still as close to the originally identified region of interest as possible.
A still further alternative is for the zoom area to be positioned according to a user designated position, which may be manually input at a position different from the region of interest.
At least part of the methods according to the invention may be computer implemented. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a "circuit", "module" or "system". Furthermore, the present invention may take the form of a computer program product embodied in any tangible medium of expression having computer usable program code embodied in the medium.
Since the present invention can be implemented in software, the present invention can be embodied as computer readable code for provision to a programmable apparatus on any suitable carrier medium. A tangible carrier medium may comprise a storage medium such as a floppy disk, a CD-ROM, a hard disk drive, a USB key, a memory card, a magnetic tape device or a solid state memory device and the like. A transient carrier medium may include a signal such as an electrical signal, an electronic signal, an optical signal, an acoustic signal, a magnetic signal or an electromagnetic signal, e.g. a microwave or RF signal.
BRIEF DESCRIPTION OF THE DRAWINGS
Embodiments of the invention will now be described, by way of example only, and with reference to the following drawings in which:
Figure 1 represents a video surveillance system in which the current invention can be implemented.
Figure 2 represents functional modules comprised in a video surveillance camera.
Figure 3 is a block diagram illustrating a communication apparatus adapted to implement embodiments of the invention.
Figure 4 illustrates a partition of an image into slices.
Figure 5 illustrates different steps of the invention.
Figure 6 is a block diagram of an algorithm according to the invention performed by a camera to produce a video bitstream comprising a high definition zoom.
Figure 7 is a detailed description of step 620 of Figure 6.
Figure 8 is a detailed description of step 650 of Figure 6.
Figure 9 represents the main coding entities encountered in an HEVC encoded video stream.
Figure 10 is a block diagram representing an HEVC encoder.
Figure 11 is a block diagram representing an HEVC decoder.
DETAILED DESCRIPTION OF THE EMBODIMENTS
Figure 1 describes a video surveillance system in which the current invention can be implemented. The system is made up of a set of wireless cameras (130, 160, 170 and 180) which cover an area to be surveyed. The security cameras include dual video encoders and each encoder is dedicated to a specific purpose. In the example of Figure 1, one encoder generates a High Definition (HD) stream for the video control 120 and the other encoder produces a Low Definition (LD) stream for the storage 110.
Many video compression formats, such as for example H.263, H.264/AVC, MPEG-1, MPEG-2, MPEG-4, SVC, or HEVC could be used for generating the HD and LD video streams. Each of these formats uses block-based discrete cosine transform (DCT) coding and motion compensation to remove spatial and temporal redundancies. They are often referred to as predictive video formats.
If we take the example of HEVC, as represented in figure 9, each frame or image 902 of the video signal 901 is divided into slices 903 which are encoded and can be decoded independently. A slice is typically a rectangular portion of the frame, or more generally, a portion of a frame or an entire frame.
In HEVC, a slice can be constituted of two types of encoding entities: independent slice segments (ISi in figure 9) and dependent slice segments (DSi). An independent slice segment can be decoded independently of any other slice. A dependent slice segment depends on an independent slice segment, since it comprises only a summarized header and needs to refer to the header of a preceding independent slice segment. A slice necessarily contains one independent slice segment and optionally at least one dependent slice segment. In previous standards such as H.264, the encoding entities covered by the terminology "slice" corresponded only to independent slice segments.
In HEVC, blocks of pixels from 64x64 down to 4x4 can be used. The partitioning is organized according to a quad-tree structure based on the Coding Tree Units (CTU 904), also known as the Largest Coding Unit (LCU) in previous versions of the HEVC standard. A CTU corresponds to a square block of size NxN, N being a power of 2 between 16 and 64. If, for instance, a 64x64 CTU needs to be divided, a split flag indicates that the CTU is split into four 32x32 blocks. In the same way, if any of these four blocks needs to be split, its split flag is set to true and the 32x32 block is divided into four 16x16 blocks, and so on. When a split flag is set to false, the current block is a coding unit CU (905). A CU has a size equal to 64x64, 32x32, 16x16 or 8x8 pixels. A CU can be divided into Prediction Units (PU) 906 for spatial or temporal prediction and into Transform Units (907) for transformation in the frequency domain.
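This split-flag mechanism can be illustrated with a short sketch. The split decision is a placeholder here (a real encoder chooses splits with a rate-distortion criterion), and the minimum size and example are arbitrary:

    # Sketch of the quad-tree CTU partitioning driven by split flags.
    def split_ctu(x, y, size, should_split, min_size=8):
        """Return the list of coding units (x, y, size) obtained from one CTU."""
        if size > min_size and should_split(x, y, size):     # split flag set to true
            half = size // 2
            cus = []
            for dy in (0, half):
                for dx in (0, half):
                    cus += split_ctu(x + dx, y + dy, half, should_split, min_size)
            return cus
        return [(x, y, size)]                                 # split flag false: this block is a CU

    # Example: split every block larger than 32x32 once, yielding four 32x32 CUs.
    print(split_ctu(0, 0, 64, lambda x, y, s: s > 32))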
There are two families of coding modes for coding blocks of an image: coding modes based on spatial prediction, referred to as INTRA prediction and coding modes based on temporal prediction (INTER, Merge, Skip). In both spatial and temporal prediction modes, a residual is computed by subtracting a predictor block from the original block.
An INTRA block is generally predicted using an INTRA prediction process based on its encoded pixels at its causal boundary. In INTRA prediction, a prediction direction is encoded.
Temporal prediction consists in finding in a reference frame, either a previous or a future frame of the video sequence, an image portion or reference area which is the closest to the block to be encoded. This step is typically known as motion estimation. Next, the block to be encoded is predicted using the reference area in a step typically referred to as motion compensation. The difference between the block to be encoded and the reference portion is encoded, along with an item of motion information relative to the motion vector which indicates the reference area to use for motion compensation. In temporal prediction, at least one motion vector is encoded.
In order to further reduce the cost of encoding motion information, rather than directly encoding a motion vector, and assuming that motion is locally homogeneous, motion vectors are encoded in terms of a difference between the motion vector and a motion vector predictor, typically computed from one or more motion vectors of the blocks surrounding the block to be encoded.
In H.264, for instance, motion vectors are encoded with respect to a median predictor computed from the motion vectors situated in a causal neighborhood of the block to be encoded, for example from the three blocks situated above and on the left of the block to be encoded. Only the difference, referred to as a residual motion vector, between the median predictor and the current block motion vector is encoded in the bitstream to reduce the encoding cost.
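The following is a rough sketch of this median prediction; the real H.264 derivation also handles block availability and partition shapes, which are omitted here:

    # Sketch: encode a motion vector as a residual relative to the median predictor.
    def median_mv(mv_left, mv_above, mv_above_right):
        """Component-wise median of three causal neighbour motion vectors."""
        return tuple(sorted(c)[1] for c in zip(mv_left, mv_above, mv_above_right))

    current_mv = (5, -2)
    predictor = median_mv((4, -1), (6, -3), (3, -2))     # -> (4, -2)
    residual_mv = (current_mv[0] - predictor[0], current_mv[1] - predictor[1])
    print(predictor, residual_mv)                         # only the residual is written to the bitstream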
In order to further reduce the encoding cost, a quantization is applied to the residual information. The quantization process may be preceded by a scaling stage: quantization scaling matrices are used to scale the residual before applying the quantization process. Then syntax elements, the quantized residual and prediction information are entropy coded, generally with a context adaptive entropy coder.
The main advantage of prediction and context adaptive entropy coding is to take advantage of the local correlation in the video signal to reduce the coding cost. The counterpart of this compression improvement is the reduction of the error resilience of the video stream due to inter-CU dependencies.
Both the encoding and decoding processes involve a decoding process of an encoded image. This process is typically performed at the encoder side for the purpose of future motion estimation, which enables an encoder and a corresponding decoder to have the same reference frames.
To reconstruct the coded frame, the residual is inverse quantized and inverse transformed in order to provide the "decoded" residual in the pixel domain. The first reconstruction is then filtered by one or several kinds of loop filtering processes. These loop filters are applied on the reconstructed frame at the encoder and decoder sides so that the same reference frame is used at both sides. The aim of this loop filtering is to remove compression artifacts. For example, H.264/AVC uses a deblocking filter. This filter can remove blocking artifacts due to the quantization of the transformed residual and to block motion compensation. In the current HEVC standard, two types of loop filters are used: a deblocking filter and sample adaptive offset (SAO).
In order to facilitate the encapsulation of video data in network packets, H.264/AVC has defined the concept of the NAL (Network Abstraction Layer) unit. The NAL unit structure definition specifies a generic format for use in both packet-oriented and bitstream-oriented transport systems. The first part of each NAL unit is a header that contains an indication of the type of data in the NAL unit, and the remaining part contains payload data of the type indicated by the header. NAL units are classified into VCL and non-VCL NAL units. The VCL NAL units contain the data that represents the values of the samples in the video images. The non-VCL NAL units contain any associated additional information such as parameter sets (important header data that can apply to a large number of VCL NAL units) and supplemental enhancement information (timing information and other supplemental data that may enhance usability of the decoded video signal but are not necessary for decoding the values of the samples in the video images).
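As an illustration, the one-byte H.264/AVC NAL unit header can be parsed as sketched below (assuming the standard field layout of forbidden_zero_bit, nal_ref_idc and nal_unit_type; HEVC uses a two-byte header with different fields, not shown):

    # Sketch: classify an H.264/AVC NAL unit from its one-byte header.
    def parse_nal_header(first_byte):
        forbidden_zero_bit = (first_byte >> 7) & 0x1
        nal_ref_idc = (first_byte >> 5) & 0x3
        nal_unit_type = first_byte & 0x1F
        is_vcl = 1 <= nal_unit_type <= 5          # coded slice data; other types are non-VCL
        return forbidden_zero_bit, nal_ref_idc, nal_unit_type, is_vcl

    print(parse_nal_header(0x65))                 # 0x65: IDR coded slice (type 5)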
Figure 10 is a flow chart illustrating the steps performed by an HEVC encoder.
Each frame of the original video sequence 901 is first divided into a grid of CTUs during stage 1001. This stage also controls the definition of slices.
The subdivision of the CTU into CUs and the partitioning of the CUs into TUs and PUs is determined based on a rate distortion criterion. Each PU of the CU being processed is predicted spatially by an INTRA predictor 1017, or temporally by an INTER predictor 1018. Each predictor is a block of pixels determined using encoded pixels from the same image or blocks of encoded pixels of another image, from which a difference block (or "residual") is computed. An encoded block is represented by an identifier representative of the predictor to be used and a residual block.
The encoded frames are of two types: temporally predicted frames (either predicted from one reference frame, called P-frames, or predicted from two reference frames, called B-frames) and non-temporally predicted frames (called INTRA frames or I-frames). In I-frames, only INTRA prediction is considered for coding CUs/PUs. In P-frames and B-frames, INTRA and INTER prediction are considered for coding CUs/PUs.
In the INTRA prediction module 1017, the current block is predicted by means of an INTRA predictor obtained using encoded pixels at the boundary of the current block. A prediction direction allowing identification of the pixels at the boundary is determined in module 1002 and encoded in the bitstream in module 1003 along with the residual data. A further compression improvement is obtained by predictively encoding the INTRA prediction direction from the INTRA prediction directions of surrounding PUs.
Prediction of intra direction is not allowed from neighbor CUs that are not in the same slice.
With regard to the second processing module 1018, related to INTER coding, two prediction types are possible. Mono-prediction (P-type) consists in predicting a block of pixels (i.e. a PU) by referring to one reference block of pixels from one reference image. Bi-prediction (B-type) consists in predicting a block of pixels (i.e. a PU) by referring to two reference blocks of pixels from one or two reference images. An estimation of motion 1004 between the current PU and reference images 1015 is made in order to identify, in one or several of these reference images, one (P-type) or several (B-type) blocks of pixels to use as predictors of this current PU. In the case where several block predictors are used (B-type), they are merged to generate one single prediction block.
The reference block is identified in the reference frame by a motion vector that is equal to the displacement between the PU in the current frame and the reference block. The next stage (1005) of the inter prediction process consists in computing the difference between the prediction block and the current block to obtain a residual block. At the end of the inter prediction process (1006) the current PU is composed of at least one motion vector and a residual.
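A brute-force sketch of this motion estimation and residual computation is given below. Real encoders use fast search strategies and sub-pixel refinement; the search range and the list-of-lists frame representation are assumptions for illustration:

    # Sketch: full-search block matching followed by residual computation.
    def motion_estimate(cur_block, ref_frame, bx, by, search=8):
        """Return the motion vector of the reference block minimising the SAD."""
        n = len(cur_block)
        best = None
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                ry, rx = by + dy, bx + dx
                if ry < 0 or rx < 0 or ry + n > len(ref_frame) or rx + n > len(ref_frame[0]):
                    continue
                sad = sum(abs(cur_block[j][i] - ref_frame[ry + j][rx + i])
                          for j in range(n) for i in range(n))
                if best is None or sad < best[0]:
                    best = (sad, (dx, dy))
        return best[1]

    def residual_block(cur_block, ref_frame, bx, by, mv):
        """Difference between the current block and its motion-compensated predictor."""
        n = len(cur_block)
        dx, dy = mv
        return [[cur_block[j][i] - ref_frame[by + dy + j][bx + dx + i]
                 for i in range(n)] for j in range(n)]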
In order to reduce the cost of motion information, HEVC proposes, in a process called AMVP (Advanced Motion Vector Prediction), to select a motion vector predictor from a set of candidate motion predictors. The motion information is then encoded in the form of an index of a motion vector predictor and a motion vector difference. Again, this prediction process creates inter-CU dependencies.
Prediction of motion information is not allowed from neighbor CUs that are not in the same slice.
The prediction steps are followed by a selection step (1016) that selects the type of encoding (INTRA or INTER) minimizing a rate distortion criterion.
When selected, the residual obtained is transformed 1007 using a DCT based transform or a DST based transform. The transform applies to a Transform Unit (TU) that is included in a CU and that can be split into smaller TUs using a so-called Residual QuadTree (RQT) decomposition. In HEVC, generally 2 or 3 levels of decomposition are used and the authorized transform sizes are 32x32, 16x16, 8x8 and 4x4.
The residual transformed coefficients are then quantized (1008). The coefficients of the quantized transformed residual are then coded by means of an entropy coding 1009 and then inserted in the compressed bitstream 1010.
Syntax elements are also coded with the help of stage 1009.
The encoder includes a decoding loop (1011, 1012, 1013, 1014, 1015) to ensure that there is no drift between the encoder and the decoder.
Thus the quantized transformed residual is dequantized 1011 by applying the inverse of the quantization provided at step 1008, and reconstructed 1012 by applying the inverse of the transform of step 1007.
If the residual comes from an INTRA coding 1017, the used INTRA predictor is identified thanks to the INTRA prediction information and added to the residual to recover a reconstructed block.
If the residual comes from an INTER coding 1018, the prediction block(s) is (are) identified using the motion information, merged if necessary and then added to the decoded residual.
A final loop filter processing module 1019 is applied to the reconstructed signal in order to reduce the encoding artifacts. In the current HEVC standard, two types of loop filters are used: a deblocking filter 1013 and sample adaptive offset (SAO) 1014. The parameters of the filters are coded and transmitted in a header of the bitstream, typically the slice header, or in an adaptation parameter set.
The filtered images, also called reconstructed images, are then stored as reference images 1015 in order to allow the subsequent INTER predictions.
The resulting bitstream 1010 of the encoder 1000 is encapsulated in VCL NAL units, and complemented by non-VCL NAL units.
Figure 11 is a flow chart illustrating a classical video decoder 1100 of HEVC type. The decoder 1100 receives as an input a bit stream 1010 corresponding to a video sequence 901.
During the decoding process, the bit stream 1010 is first of all parsed with the help of the entropy decoding module (1101). This processing module uses the previously entropy decoded elements to decode the encoded data. It decodes in particular the parameter sets of the video sequence to initialize the decoder. Each VCL NAL unit that corresponds to a coded slice is then decoded.
The parsing process, which consists of stages 1101, 1102 and 1104, can be done in parallel for each slice, but the block prediction modules 1105 and 1103 and the loop filter module 1119 are generally sequential to avoid the issue of neighbor data availability.
The partitioning of CTUs into CUs, PUs and TUs and the coding modes (INTER or INTRA) are obtained from the bitstream 1010 with the help of the entropy decoding module 1101. Depending on the coding mode, either the INTRA prediction processing module 1107 or the INTER prediction processing module 1106 is employed.
If the coding mode of the current block is INTRA, the prediction direction is extracted from the bitstream and decoded with help of neighbors' prediction direction during stage 1103. The intra predicted block is then computed (1103) with the decoded prediction direction and the already decoded pixels at the boundaries of current PU. The residual associated with the current block is recovered from the bitstream 1101 and then entropy decoded.
If the coding mode of the current block is INTER, the motion information is extracted from the bitstream 1101 and decoded (1104) applying the AMVP method. The obtained motion vector(s) is (are) used in the reverse motion compensation module 1105 in order to determine the INTER predictor block contained in the reference image(s) 1115 of the decoder 1100. In a similar manner to the encoder, these reference images 1115 are composed of images that precede in decoding order the image currently being decoded and that are reconstructed from the bit stream.
The next decoding step consists in decoding the residual block that has been transmitted in the bitstream. The parsing module 1101 extracts the residual coefficients from the bitstream and performs successively the inverse quantization (1111) and inverse transform (1112) to obtain the residual block.
This residual block is added to the predicted block.
At the end of the decoding of all the blocks of the current image, the loop filter processing module 1119, comprising a deblocking filter and a SAO module, is used to eliminate the artifacts and improve the signal quality in order to obtain the reference images 1115.
The images thus decoded constitute the output video signal 1108 of the decoder, which can then be displayed and used.
In a preferred embodiment of the invention an INTRA-only HEVC encoder is used, wherein each slice of a frame is encoded as a standalone decodable unit containing only INTRA encoded blocks. In a second embodiment it will be shown that a full HEVC encoder could also be used.
Returning to Figure 1, both HD and LD streams are conveyed through the wired connection 150. The cameras are also able to stream their video to a wireless device 140 via the wireless LAN 100. The camera operates as a proxy and forwards the Low Definition stream, which matches the resolution of the handheld devices, towards the portable viewer.
In order to allow a zoom on a dedicated part of an image so as to have a more detailed view of a scene, the zoomed part must have a good visual quality. Accordingly, the system should transmit a High Definition version of the zoom part while keeping the remaining part in background in Low Definition.
In figure 1, the video storage device 110 and the video control device 120 are represented as separate devices. It can be noted that these two devices could be integrated into a single device.
In addition, in the preferred embodiment the HD and LD video streams are generated continuously by each camera. In an alternative embodiment, more economical in terms of bandwidth usage, only the LD video stream could be generated continuously. The HD video stream could be generated only upon a request from a user on the device 140 for a zoom on a particular area, or from the high definition device 120 for monitoring purposes.
Figure 2 represents functional modules comprised in a video surveillance camera of Figure 1. The bold blocks are of particular relevance to the present invention.
Four main devices are represented in Figure 2: a video surveillance camera 130, a low definition device 140 (handheld device in figure 1), a High Definition device 120 (video control in figure 1) and a video storage 110.
The video surveillance camera 130 is composed of a high definition source module 131 which produces raw video data. The raw video is sent simultaneously to the low definition encoder 133 and the high definition encoder 132.
In a first step of the video surveillance system functioning, the raw video data are encoded in parallel by the two encoders. The HD video stream generated by the HD encoder 132 is transmitted to the high definition device for display. The LD video stream generated by the LD encoder 133 is transmitted to the video storage unit 110 and to the low definition device 140 for display.
In a second step of the video surveillance functioning, a user receiving and displaying the LD video stream on the device 140 requests a zoom on a selected Region Of Interest (ROI).
This ROI of the video is characterized by its coordinates and size.
We suppose in the following that a ROI is a rectangular area. The zoom application controller 142 retrieves these characteristics through a user interface, and produces a message with this zoom information (coordinates and size). This message is sent back to the slicer module 134 of the camera.
The slicer module 134 defines the slice boundaries in the current frames to be encoded by the low and high definition coders. The slice boundaries are based on the zoom information (coordinates and size) of the ROI coming from the Zoom Application Controller module 142.
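The exact format of this zoom message is not specified; the following sketch shows the kind of content it could carry (the field names and the JSON encoding are assumptions for illustration):

    # Illustrative zoom request sent by the zoom application controller 142 to the slicer 134.
    import json

    def make_zoom_request(top_left_x_mb, top_left_y_mb, width_mb, height_mb):
        """Coordinates and size of the user-selected ROI, expressed in LD macroblocks."""
        return json.dumps({
            "roi_x": top_left_x_mb,
            "roi_y": top_left_y_mb,
            "roi_width": width_mb,
            "roi_height": height_mb,
        })

    print(make_zoom_request(38, 1, 1, 2))   # a 1x2 MB ROI, as in the figure 5 example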
When both images (LD and HD versions) are encoded, the image builder module 135 gets the full LD image and the HD slices corresponding to the ROI. Then, the image builder module 135 produces an image including the LD image and the HD slices corresponding to the ROI. This modified stream is sent via the wireless network to the portable device 140.
It is worth noting that in preferred embodiments, the initial image content is modified only on the wireless path. Although the slicing of the LD and HD streams is modified in order to allow the insertion of the HD slices corresponding to the ROI into the LD stream, the initial image content in low and high definition conveyed through the wired connection is unchanged, in order to keep the main functionalities of the video surveillance system, i.e. monitoring and storage, unaltered.
Figure 3 is a block diagram illustrating a communication apparatus adapted to implement embodiments of the invention.
Reference numeral 302 is a RAM which functions as a main memory, a work area, etc., of CPU 301. CPU 301 is capable of executing instructions on powering up the apparatus from program ROM 303. After the powering up, CPU 301 is capable of executing instructions from the main memory 302 relating to a software application after those instructions have been loaded from the program ROM 303 or the hard-disc (HD) 306 for example. Such software application, when executed by the CPU 301, causes the steps of the flowcharts shown in one or more of the figures 6 to 8 and figures 10 and 11 to be performed.
Reference numeral 304 represents the network interfaces that can be a single network interface or composed of a set of different network interfaces like for instance several wireless interfaces, or different kinds of wired or wireless interfaces. Data packets are written to the network interface for transmission or read from the network interface for reception under the control of the software application running in the CPU 301. Reference numeral 305 represents a user interface to display information to, and/or receive inputs from, a user. This user interface could be for instance the zoom application controller 142.
I/O module 307 represents a module able to receive or send data from/to external devices as video sensors or display devices.
Figure 4 represents an example of slice segmentation used to make the zoom part independent in the High Definition and Low Definition video streams.
Embodiments of the invention are adapted to HEVC, where slices are made of dependent and independent slice segments, and to any other video standard with only one type of slice equivalent to independent slice segments (for instance H.264). In addition, in the following, we use the terminologies macroblock and coding unit interchangeably to represent the same coding entity corresponding to a block of pixels. The way of segmenting an image into slices is relatively unconstrained since it is an encoding issue outside the scope of the video compression standards, which specify only the decoder. Slices were originally defined as an error resilience tool, but a second application has arisen, consisting in using slices as a tool for encoding a region of interest independently.
Figure 4 shows an image of a video segmented in N slices. The first slice 410 (slice #1) comprises n macroblocks B1, B2, ... Bn.
In the preferred embodiment, each slice must be decodable on its own. In that case only intra prediction is allowed for the macroblocks of these slices. In addition, predictors for intra prediction of macroblocks in the slice are derived exclusively from pixels belonging to the same slice. No prediction from outside pixels is allowed.
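This same-slice restriction on INTRA reference samples can be sketched as follows, at macroblock granularity; the slice-map representation and the reduced set of neighbours are simplifications:

    # Sketch: only causal neighbour blocks belonging to the same slice may provide
    # reference pixels for INTRA prediction.
    def usable_intra_neighbours(x, y, slice_map):
        """slice_map[y][x] gives the slice index of each macroblock."""
        here = slice_map[y][x]
        neighbours = {"left": (x - 1, y), "above": (x, y - 1), "above_left": (x - 1, y - 1)}
        usable = {}
        for name, (nx, ny) in neighbours.items():
            inside = 0 <= ny < len(slice_map) and 0 <= nx < len(slice_map[0])
            usable[name] = inside and slice_map[ny][nx] == here
        return usable

    slice_map = [[0, 0, 1, 1],
                 [0, 0, 1, 1]]
    print(usable_intra_neighbours(2, 1, slice_map))   # at a slice boundary, 'left' is not usable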
In the following, we will see that an image of the LD sequence embedding a zoomed area will be divided into two sets of slices:
* A first set of LD slices corresponding to the area of the image not covered by the zoomed area inherited from an HD image. These slices will correspond to slices originally belonging to the LD sequence.
* A second set of slices corresponding to the area of the image covered by the zoomed area inherited from an HD image. These slices will correspond to slices belonging to the HD sequence, with modified headers comprising consistent information adapted to the insertion in the LD sequence.
As can be seen, the independence of INTRA slices simplifies their manipulation and their insertion in a new bitstream.
Figure 5 illustrates, through an example, the different steps of the insertion of a zoomed ROI extracted from an HD sequence into an LD image according to the invention. This figure is composed of 4 main parts: 510 represents a portion of an LD image with an ROI designated by the user in 515; 530 is a portion of an HD image in which the ROI 535 corresponding to the ROI 515 has been identified; 520 is an encoded slice extracted from the HD image corresponding to the identified ROI; 540 is the result of the insertion in the LD image of a modified version of the slice 520 with modified header information compatible with the LD image. The image portions comprise numbered squares which represent the macroblocks of an image.
The first part 510 shows the process of selection of a region of interest 515 to zoom in the image. The user selects a region of interest 515 in the low definition video stream displayed on his or her handheld device 140 (tablet or smartphone). This ROI is characterized by its coordinates and size. In our example, this portion of the image is made of two macroblocks, respectively numbered 78 and 118. In the general case, the ROI defined by the user could partially cover some macroblocks. In that case, the ROI will be extended to the smallest rectangular set of macroblocks containing all points of the ROI defined by the user.
The second step of the invention is performed by the slicer module 134.
This step consists in finding the corresponding ROI in the high definition video images. This is done thanks to the knowledge of the ratio between the low definition and the high definition sequences and to the position of the ROI designated by the user in the LD image. In our example, the low definition format is 640 pixels by 480 lines and the high definition format is 2560 pixels by 1440 lines, thus the ratio between both formats is 4 in the horizontal direction and 3 in the vertical direction. Thereby the 78th macroblock in the low definition image becomes the 1112th macroblock in the high definition image. The derivation of the macroblock number in the HD image (MBNumberInHD), taking macroblocks from top to bottom and from left to right, corresponding to a macroblock of coordinates (X, Y) in the LD image is given by the following formula:
MBNumberInHD = Int(Y x RatioY x NumberOfMBInALineHD + X x RatioX)
wherein RatioY is the ratio in the vertical direction between the LD and the HD images, RatioX is the ratio in the horizontal direction between the LD and the HD images, NumberOfMBInALineHD is the number of macroblocks in a line of the HD image and Int(A) is the integer part of a variable A. In the particular example of figure 5, the 78th macroblock (MB) is the 38th MB on the 2nd line of the low definition image. It gives: 2 x 3 x (40 x 4) + 38 x 4 = 1112, i.e. the 1112th macroblock.
We suppose here that all macroblocks have the same size and correspond to CTU NxN.
Then, it remains to compute the area of the ROI in the high definition image. The ROI in the low definition image is equal to NB_MB_ROI_LD_X = 1 MB in width and NB_MB_ROI_LD_Y = 2 MBs in height. Consequently in the high definition stream this area becomes NB_MB_ROI_HD_X = 4 MBs (NB_MB_ROI_HD_X = Int(NB_MB_ROI_LD_X x RatioX)) in width and NB_MB_ROI_HD_Y = 6 MBs (NB_MB_ROI_HD_Y = Int(NB_MB_ROI_LD_Y x RatioY)) in height. The part 520 corresponds to the ROI defined in the low definition stream 515 brought into the high definition domain.
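These derivations can be checked with a short sketch; the variable names follow the formulae above and the example values reproduce the figure 5 geometry:

    # Sketch of the LD-to-HD macroblock mapping used by the slicer module 134.
    def mb_number_in_hd(x_ld, y_ld, ratio_x, ratio_y, mbs_per_line_hd):
        """Raster-scan MB index in the HD image of the MB at LD coordinates (x_ld, y_ld)."""
        return int(y_ld * ratio_y * mbs_per_line_hd + x_ld * ratio_x)

    def roi_size_in_hd(nb_mb_roi_ld_x, nb_mb_roi_ld_y, ratio_x, ratio_y):
        """Width and height, in MBs, of the corresponding ROI in the HD image."""
        return int(nb_mb_roi_ld_x * ratio_x), int(nb_mb_roi_ld_y * ratio_y)

    # Figure 5: LD 640x480, HD 2560x1440, hence RatioX = 4, RatioY = 3 and
    # 160 MBs per HD line (40 x 4); the LD ROI is 1x2 MBs, the 38th MB of line 2.
    print(mb_number_in_hd(38, 2, 4, 3, 160))   # -> 1112
    print(roi_size_in_hd(1, 2, 4, 3))          # -> (4, 6)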
The slicer module 134 adjusts the slicing of the high definition stream in order to make the areas of the video which correspond to the ROI independent from the remainder of the image.
The slicer module 134 also defines the boundaries of the slice which will be replaced by the high definition ROI in the low definition stream. Thereby, the slice in the low definition stream matches the size of the ROI in the HD format, not the size and position of the initial ROI defined by the user. In an embodiment, in order to position the high definition ROI in an LD image, the system first determines the position of the center of the area corresponding to the HD ROI. Then the position of the center of the LD ROI defined by the user is also determined. The HD ROI is then positioned in the LD image so that the two centers correspond.
However, the system should also manage the limits of the LD image. In our example, the superposition of the centers of the two ROIs brings the HD ROI outside the LD image boundaries. In an embodiment, such a case is managed by positioning the center of the HD ROI as close as possible to the center of the LD ROI while respecting the boundaries of the image.
Once the slicing is defined for the low and high definition streams, the images are each encoded in their own domain according to the defined slicing. We can note that the slicing, required to obtain an independent area for the part to zoom, depends on the codec technology and on the options supported by the codec. For instance, in the HEVC standard, in addition to slices, the concept of tiles can be used to define an independent rectangular area. H.264 also allows several possibilities. Indeed, H.264 offers a broad diversity of slice shapes, from slices restricted to lines of MBs to the arbitrary shapes allowed by the Flexible MB Ordering (FMO) functionality.
In the case of slices restricted to lines of MBs, a short slice shall be defined for each new line of macroblocks or portion of a line. In the example of figure 5, a new slice will be started at MBs 1112, 1272, 1432, 1592, 1752 and 1912 respectively.
When the encoded images are available, the zoom image builder module extracts the bitstream portions corresponding to the useless slice(s) from the LD bitstream and replaces them by the bitstream portions of the corresponding HD ROI. This process is illustrated by the part 530 of figure 5 in which the encoded HD ROI 535 is put in the LD bitstream 530. At this step, the numbering of the MBs is not compliant with the usual numbering, i.e. the index passes from 35 to 1112 and so on. This gap between MB numbers will be interpreted as an error by a majority of decoders, or at least as an indication of a missing part in the image.
To give back the compliance of the bitstream with the standard, the zoom image builder 135 has to perform a step of translation of the index of the MBs of the high definition ROI. To do so, the system modifies, in each slice header of each HD slice inserted in the LD image, the value representative of the position of the first MB in the slice. Thus, the index of the MB 1112 in the part 535 becomes 36 in the part 545, the index of the MB 1272 becomes 76, the index of the MB 1432 becomes 116 and so on and so forth. The next MBs in the slice, which are referenced relative to the first MB of the slice, are automatically correctly numbered when the numbering of the first MB in the slice is modified. At the end of the translation of the MB indexes, the video bitstream is compliant again with a standard decoder although it contains a low definition background and a high definition zoom area of the ROI.
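This index translation can be reproduced with a short sketch; representing the inserted slices by the index of their first MB is a simplification, since in practice the value is rewritten in the slice header itself:

    # Sketch: translate the first-MB address of each inserted HD slice into the LD numbering.
    def translate_first_mb(hd_first_mbs, first_mb_in_ld, mbs_per_line_ld):
        """hd_first_mbs: first-MB indices in the HD image, one per inserted slice."""
        return [first_mb_in_ld + i * mbs_per_line_ld for i in range(len(hd_first_mbs))]

    hd_first_mbs = [1112, 1272, 1432, 1592, 1752, 1912]   # figure 5, HD numbering
    print(translate_first_mb(hd_first_mbs, 36, 40))        # -> [36, 76, 116, 156, 196, 236]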
Figure 6 is a high level representation of an algorithm performed by the camera 130 to produce a video bitstream which contains a high definition zoom ROI according to one embodiment of the invention. The functions 620 and 650 are described more precisely in figure 7 and figure 8.
At the start of the algorithm, the system obtains, in 610, the minimum characteristics of the zoomed area coming from the handheld device 140. These minimum characteristics are the coordinates in the LD image of the top left MB and the horizontal and vertical size, in MBs, of the zoomed area.
This information allows an accurate placement of the ROI in the global picture.
By default, the area covered by the zoom part is equal to the high definition ROI and is placed so that the centers of the LD ROI and the HD ROI correspond. In another embodiment, in addition to the minimum characteristics of the zoomed area, the device 140 can transmit optional characteristics corresponding for instance to an expected size of the HD ROI.
After the end of the step 610, the system determines the slice frontiers 620 for the low and high definition video. According to the determination of the slicing performed in step 620, the system encodes the high definition version of the source image in 630 with the high definition coder 132 and the low definition version of the source picture in 640 with the low definition coder 133. The high definition video bitstream, issued from this encoding process 630, is made up of a background and an ROI part. The ROI part corresponds to the high definition version of the ROI defined by the user on the portable device 140 and characterized by the information received in the step 610. Both areas, namely the background and the ROI, are independent and may be split, i.e. the ROI is encoded with dedicated slices in order to be able to decode the ROI independently from the background. The low definition video bitstream, issued from the encoding process 640, is composed of a background part and a "useless" part. The useless part is a slice or slices which cover the area to be replaced by the high definition ROI in the further step 650. In the step 650, the system uses the involved parts of both video bitstreams (LD and HD) to build the zoomed stream. The system mixes the background of the low definition video bitstream and the high definition ROI to build the zoomed stream. When the zoomed stream is built, it is sent to the portable device in 660.
Figure 7 is a detailed description of step 620 of figure 6. This algorithm carries out the definition of the ROI in the low and high definition versions. This algorithm also supplies the slicing to both encoders in order to make the ROIs independent.
The algorithm starts with the determination of a list 621 of low definition MBs corresponding to the LD ROI and identified according to the information supplied by the user via the portable device 140. To that end, two points corresponding to the top left point and the bottom right point of the LD ROI area defined by the user are identified. Supposing that images are divided into 16x16 macroblocks, each of the two points belongs to one 16x16 MB. The horizontal and vertical coordinates of the macroblocks of the LD image MB_LD(X_LD_TL, Y_LD_TL) and MB_LD(X_LD_BR, Y_LD_BR) containing the two points are then stored by the system, X_LD_i and Y_LD_i being expressed in numbers of MBs. The coordinates of these two MBs are sufficient to identify the list of MBs of the LD image corresponding to the ROI area.
The horizontal size Hor_size_LD and vertical size Ver_size_LD of the LD ROI area are respectively given by:
* Hor_size_LD = X_LD_BR - X_LD_TL + 1;
* Ver_size_LD = Y_LD_BR - Y_LD_TL + 1;
Then, the system passes to step 622, to determine a list of HD MBs corresponding to the list of LD MBs. To determine the list of HD MBs, the system uses the coordinates of the two MBs characterizing the list of LD MBs defined in step 621 and the information representative of the ratio between the low and high definition. The MB number of the top left MB of the HD ROI, MBNumberInHD_TL, and the MB number of the bottom right MB of the HD ROI, MBNumberInHD_BR, are given as follows:
* MBNumberInHD_TL = Int(Y_LD_TL x RatioY x NumberOfMBInALineHD + X_LD_TL x RatioX);
* MBNumberInHD_BR = Int(Y_LD_BR x RatioY x NumberOfMBInALineHD + X_LD_BR x RatioX);
The coordinates of the top left MB MB_HD(X_HD_TL, Y_HD_TL) of the HD ROI are given by:
* Y_HD_TL = Int(MBNumberInHD_TL / NumberOfMBInALineHD) + 1;
* X_HD_TL = Rem(MBNumberInHD_TL / NumberOfMBInALineHD);
where Rem(A/B) is the remainder of the division of A by B. The coordinates of the bottom right MB MB_HD(X_HD_BR, Y_HD_BR) of the HD ROI are given by:
* Y_HD_BR = Int(MBNumberInHD_BR / NumberOfMBInALineHD) + 1;
* X_HD_BR = Rem(MBNumberInHD_BR / NumberOfMBInALineHD);
The horizontal and vertical sizes of the HD ROI are respectively computed as follows:
* Hor_size_HD = X_HD_BR - X_HD_TL + 1;
* Ver_size_HD = Y_HD_BR - Y_HD_TL + 1;
As already mentioned above, the position of the ROI areas is fully defined by the position of the center of the area. By default, the HD ROI (zoomed part of the picture) is positioned in the LD image so that its center is as close as possible to the center of the LD ROI.
The next step is an optional step which defines the position of the ROI in the low definition picture 623. As explained with reference to figure 5, the position of the ROI may be received as an optional characteristic from the user.
Then, the system continues with step 624, determining the slice frontiers in the LD image. This step is required to allow insertion of the HD ROI, encoded in the form of slices, into the bitstream corresponding to the LD image. To that end the system starts by positioning the center of the HD ROI in the LD image. The position of the center (xc_LD, yc_LD), expressed in number of pixels, is either specified by the user or derived from the coordinates of the LD ROI area defined by the user. In that second case the position is derived from the position of the bottom right block of the LD ROI:
* xc_LD = int((X_LD_BR - (Hor_size_LD / 2)) x 16);
* yc_LD = int((Y_LD_BR - (Ver_size_LD / 2)) x 16);
The slices constituting the HD ROI are defined by the coordinates of their first and last MB in a line of MBs of the LD image.
The position of the top left MB (X_HD_LD_L(0), Y_HD_LD_L(0)) in the LD image corresponding to the first MB of the top-most slice of the HD ROI is computed as follows:
* X_HD_LD_L(0) = int((xc_LD - ((Hor_size_HD x 16) / 2)) / 16);
* Y_HD_LD_L(0) = int((yc_LD - ((Ver_size_HD x 16) / 2)) / 16);
The position of the top right MB (X_HD_LD_R(0), Y_HD_LD_R(0)) in the LD image corresponding to the last MB of the top-most slice of the HD ROI is computed as follows:
* X_HD_LD_R(0) = X_HD_LD_L(0) + Hor_size_HD;
* Y_HD_LD_R(0) = Y_HD_LD_L(0);
Positions of the first and last MBs of the remaining slices of the HD ROI in the LD image are determined iteratively with respect to the position of the first and last MBs of the top-most slice with the following pseudo-code:
For i from 1 to (Ver_size_HD - 1)
* X_HD_LD_L(i) = X_HD_LD_L(0);
* X_HD_LD_R(i) = X_HD_LD_R(0);
* Y_HD_LD_L(i) = Y_HD_LD_L(0) + i;
* Y_HD_LD_R(i) = Y_HD_LD_R(0) + i;
* i = i + 1;
In order to avoid overlapping the image frontiers, the slice frontiers are compared to the image frontiers in 625 and modified if necessary in step 626. In a first embodiment, the position of the HD ROI is shifted so that it is completely contained in the LD image. The following process is applied:
j = 0; k = 0;
For i from 0 to (Ver_size_HD - 1)
* If X_HD_LD_L(i) < 0 then X_HD_LD_L(i) = 0 and X_HD_LD_R(i) = min(Hor_size_HD, NumberOfMBInALineLD - 1);
* If X_HD_LD_R(i) >= NumberOfMBInALineLD then X_HD_LD_L(i) = max(X_HD_LD_L(i) - (X_HD_LD_R(i) - NumberOfMBInALineLD), 0) and X_HD_LD_R(i) = (NumberOfMBInALineLD - 1);
* If Y_HD_LD_L(i) < 0 then Y_HD_LD_L(i) = j, Y_HD_LD_R(i) = Y_HD_LD_L(i) and j = j + 1;
* If Y_HD_LD_R(Ver_size_HD - 1 - i) >= NumberOfMBInAColLD then Y_HD_LD_L(Ver_size_HD - 1 - i) = (NumberOfMBInAColLD - 1 - k), Y_HD_LD_R(Ver_size_HD - 1 - i) = Y_HD_LD_L(Ver_size_HD - 1 - i) and k = k + 1;
* i = i + 1;
where j and k are two integer variables, NumberOfMBInALineLD is the number of MBs in a line of the LD image, NumberOfMBInAColLD is the number of MBs in a column of the LD image, min(A, B) takes the minimum of A and B, and max(A, B) takes the maximum of A and B. We suppose here that a ROI cannot have a size larger than the size of the image.
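A simplified sketch of steps 624 to 626 is given below: it places the zoom area slices in the LD image and then shifts the whole area so it stays inside the picture. It is a simplification of the per-line pseudo-code above (the whole area is shifted at once, and the last MB of each slice is given as an inclusive coordinate); the example values are assumptions:

    # Sketch of steps 624-626: place the HD ROI slices in the LD image, then shift
    # the whole area so that it does not cross the image frontiers.
    def place_hd_roi_in_ld(xc_ld, yc_ld, hor_size_hd, ver_size_hd, mbs_per_line_ld, mbs_per_col_ld):
        # Top left MB of the top-most slice, centred on (xc_ld, yc_ld) given in pixels.
        x0 = int((xc_ld - (hor_size_hd * 16) / 2) / 16)
        y0 = int((yc_ld - (ver_size_hd * 16) / 2) / 16)
        # Shift so that the whole zoom area stays inside the LD picture.
        x0 = min(max(x0, 0), mbs_per_line_ld - hor_size_hd)
        y0 = min(max(y0, 0), mbs_per_col_ld - ver_size_hd)
        # One slice per line of MBs: (first MB, last MB) coordinates for each line.
        return [((x0, y0 + i), (x0 + hor_size_hd - 1, y0 + i)) for i in range(ver_size_hd)]

    # LD image of 40x30 MBs, 4x6 MB zoom area whose centre falls near the top edge.
    for slice_bounds in place_hd_roi_in_ld(620, 24, 4, 6, 40, 30):
        print(slice_bounds)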
In another embodiment, a simpler process could be applied, consisting in truncating the slice frontiers.
The following process is applied:
j = 0;
For i from 0 to (Ver_size_HD - 1)
* If Y_HD_LD_L(i) < 0 then Y_HD_LD_L(i+j) = 0, j = j - 1 and Ver_size_HD = Ver_size_HD - 1;
* If Y_HD_LD_L(i) >= NumberOfMBInAColLD then j = j - 1, Y_HD_LD_L(i) = (NumberOfMBInAColLD - 1) and Ver_size_HD = Ver_size_HD - 1;
* Y_HD_LD_R(i) = Y_HD_LD_L(i);
* i = i + 1;
For i from 0 to (Ver_size_HD - 1)
* If X_HD_LD_L(i) < 0 then X_HD_LD_L(i) = 0;
* If X_HD_LD_R(i) >= NumberOfMBInALineLD then X_HD_LD_R(i) = (NumberOfMBInALineLD - 1);
* i = i + 1;
Hor_size_HD = X_HD_LD_R(0) - X_HD_LD_L(0);
As can be seen, this process can induce the removal of some slices that are outside the current LD image. The variables Ver_size_HD and Hor_size_HD are therefore modified to represent the new size of the ROI area after truncation.
The refinement of the LD slice frontiers, if any, is followed by the definition of HD slices in the HD stream in step 628. HD slices are defined as a function of the LD slices and must have the same size as the LD slices. The determination of the HD slices starts with the determination of the position of the new center (new_xc_LD, new_yc_LD) of the HD ROI in the LD image. Indeed, if the HD ROI has been truncated, its center is no longer at (xc_LD, yc_LD):
* New_xc_LD = (X_HD_LD_L(0) x 16) + (Hor_size_HD x 16)/2;
* New_yc_LD = (Y_HD_LD_L(0) x 16) + (Ver_size_HD x 16)/2;
Then the position of the center of the HD ROI (xc_HD, yc_HD) in the HD image is given by:
* xc_HD = New_xc_LD x RatioX;
* yc_HD = New_yc_LD x RatioY;
The position of the top left MB of the HD ROI in the HD image is deduced from the position of the center:
* X_HD_HD_L(0) = int(xc_HD/16 - Hor_size_HD/2);
* Y_HD_HD_L(0) = int(yc_HD/16 - Ver_size_HD/2);
The position of the top right MB of the HD ROI in the HD image is deduced in the same way:
* X_HD_HD_R(0) = X_HD_HD_L(0) + Hor_size_HD;
* Y_HD_HD_R(0) = Y_HD_HD_L(0);
The top left and top right MBs define the top most slice of the HD area in the HD image.
Remaining HD slices are defined with respect to the position of the top left and top right MBs:
For i from 1 to (Ver_size_HD - 1)
* X_HD_HD_L(i) = X_HD_HD_L(0);
* X_HD_HD_R(i) = X_HD_HD_R(0);
* Y_HD_HD_L(i) = Y_HD_HD_L(0) + i;
* Y_HD_HD_R(i) = Y_HD_HD_R(0) + i;
* i = i + 1
Information concerning the LD slices and HD slices is provided to the HD coder 132 and the LD coder 133, so that they can adjust the encoding of images according to the defined slices. These coders deduce the positions of the slices not concerned by the ROI (i.e. the background slices) from the positions of the ROI slices.
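The Python sketch below follows the same steps for the HD image: it recomputes the ROI centre from the (possibly truncated) LD frontiers, maps it into the HD image and derives the HD slice frontiers. It is illustrative only; ratio_x and ratio_y stand for the RatioX and RatioY scaling factors between the LD and HD images, the half-extent terms are expressed in pixels for unit consistency, and the right frontier of the top-most slice is taken on the same MB row as the left frontier, consistently with the iteration above.

MB_SIZE = 16

def hd_slice_frontiers(left_ld, hor_size_hd, ver_size_hd, ratio_x, ratio_y):
    # New centre of the HD ROI in the LD image, in pixels.
    x_l0_ld, y_l0_ld = left_ld[0]
    new_xc_ld = x_l0_ld * MB_SIZE + (hor_size_hd * MB_SIZE) / 2
    new_yc_ld = y_l0_ld * MB_SIZE + (ver_size_hd * MB_SIZE) / 2

    # Centre of the HD ROI in the HD image.
    xc_hd, yc_hd = new_xc_ld * ratio_x, new_yc_ld * ratio_y

    # Top-most HD slice, then the remaining slices one MB row below each other.
    x_l0 = int(xc_hd / MB_SIZE - hor_size_hd / 2)
    y_l0 = int(yc_hd / MB_SIZE - ver_size_hd / 2)
    x_r0, y_r0 = x_l0 + hor_size_hd, y_l0

    left = [[x_l0, y_l0 + i] for i in range(ver_size_hd)]
    right = [[x_r0, y_r0 + i] for i in range(ver_size_hd)]
    return left, right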
Figure 8 is a detailed description of step 650 of Figure 6. This algorithm builds the zoomed stream, which is made up of a low definition background and a high definition zoom area corresponding to the ROI.
At the end of the encoding process performed by the LD coder 133 and the HD coder 132, the two generated bitstreams are provided respectively to the Video Storage unit 110 and the High Definition device 120. The LD and HD bitstreams are also provided to the Zoom image builder module 135.
The zoom image builder module deletes the bitstream portion(s) corresponding to the relevant LD slices in the LD image bitstream in step 651 and, in step 652, replaces these slices with the bitstream portion(s) corresponding to the HD slices copied from the HD bitstream.
In order to ensure the compliance of the HD slices with the LD bitstream, the slice header of each HD slice inserted in the LD bitstream is modified. Indeed, the slice header contains information representative of the first MB in the slice. In HEVC, this information is represented by the syntax element slice_segment_address of the slice header. Practically, this information corresponds to the position of the first CTU of the slice. For simplification purposes, we consider here that a CTU contains only one CU and is similar to a MB. In H.264/AVC this information is provided by the syntax element first_mb_in_slice. Since the other MBs in a slice are referenced with respect to the first MB of the slice, only the information representative of the position of the first MB needs to be corrected. The following process applies:
For i from 0 to (Ver_size_HD - 1)
* slice_segment_address(i) = Y_HD_LD_L(i) x NumberOfMBInALineLD + X_HD_LD_L(i);
* i = i + 1
where slice_segment_address(i) represents the corrected value of the syntax element slice_segment_address in the i-th HD slice inserted in the LD image. The same process applies to the syntax element first_mb_in_slice in the case of an H.264 bitstream.
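A high-level Python sketch of steps 651 and 652, including the slice header correction, is given below. It is purely illustrative: slices are modelled as objects with a first_mb field (slice_segment_address in HEVC, first_mb_in_slice in H.264/AVC) and an opaque payload, which is a hypothetical container; real bitstream editing additionally requires NAL unit parsing and entropy-level rewriting of the slice header, which is omitted here.

from dataclasses import dataclass

@dataclass
class Slice:
    first_mb: int   # slice_segment_address / first_mb_in_slice
    payload: bytes  # remaining slice data, kept opaque in this sketch

def build_zoom_stream(ld_slices, hd_roi_slices, left_ld, mbs_per_line_ld):
    # Raster-scan addresses of the LD slices covered by the ROI.
    roi_addresses = {y * mbs_per_line_ld + x for x, y in left_ld}

    # Step 651: drop the LD slices that the HD ROI replaces.
    kept = [s for s in ld_slices if s.first_mb not in roi_addresses]

    # Step 652: insert the HD slices, rewriting their first-MB address so
    # that they point at the correct position in the LD picture.
    for s, (x, y) in zip(hd_roi_slices, left_ld):
        kept.append(Slice(first_mb=y * mbs_per_line_ld + x, payload=s.payload))

    kept.sort(key=lambda s: s.first_mb)
    return kept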
With this modification of the slice header, the bitstream of the LD image containing HD slices is fully compliant with an HEVC decoder (or, respectively, with an H.264/AVC decoder when the syntax element first_mb_in_slice is modified). The LD bitstream so generated is then transmitted to the low definition device 140, where it can be decoded and displayed.
As mentioned above, the preferred embodiment is based on an INTRA-only encoder encoding only intra-predicted MBs in slices. The main advantage of this embodiment is to facilitate the manipulation of the bitstream and to avoid issues related to temporal MB dependencies. However, preventing the usage of INTER MBs significantly reduces the compression performance. In another embodiment providing better compression efficiency, INTER slices could be used for the encoding of HD slices inserted in an LD image. Since HD slices are inserted in an LD sequence, the temporal prediction shall be constrained to data belonging to HD slices. To that end, slices corresponding to the first HD ROI inserted in an LD image shall be encoded in INTRA. HD slices of following images could use INTER prediction. However, reference MBs for temporal prediction shall be restricted to blocks of an HD ROI already inserted in a previous image and received by the decoder of the low definition device 140.
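One possible way for an encoder to enforce this restriction, shown below as a Python sketch, is to test every candidate reference block against the rectangle already covered by HD slices in a previously transmitted picture; this check is an illustration only and not part of the described embodiments.

def reference_block_allowed(ref_x, ref_y, block_w, block_h, roi_rect):
    # roi_rect = (x0, y0, x1, y1), in pixels: area of the HD ROI already
    # inserted in a previously decoded LD picture. INTER prediction is only
    # permitted when the referenced block lies entirely inside that area.
    x0, y0, x1, y1 = roi_rect
    return (x0 <= ref_x and y0 <= ref_y and
            ref_x + block_w <= x1 and ref_y + block_h <= y1)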
Although the present invention has been described hereinabove with reference to specific embodiments, the present invention is not limited to the specific embodiments, and modifications which lie within the scope of the present invention will be apparent to a person skilled in the art. Many further modifications and variations will suggest themselves to those versed in the art upon making reference to the foregoing illustrative embodiments, which are given by way of example only and which are not intended to limit the scope of the invention as determined by the appended claims. In particular different features from different embodiments may be interchanged, where appropriate.

Claims (19)

  1. A method for constructing a combined image from initial images representing the same scene, a first initial image having a first resolution and a second initial image having a second resolution higher than the first resolution, the method comprising: determining spatial parameters representing a region of interest in the second initial image, corresponding to an identified region of interest of the first initial image; partitioning the first initial image into coding entities in dependence upon the determined spatial parameters, to define a subset of coding entities corresponding to a zoom area; partitioning the second initial image into coding entities based on the coding entities of the first initial image; and encoding coding entities of the second initial image in the zoom area and coding entities of the first initial image outside of the zoom area, and combining the resulting encoded data into a bitstream representing a combined image.
  2. A method according to claim 1, wherein the spatial parameters comprise information representative of the position and the size of the region of interest in the second initial image.
  3. A method according to claim 2, wherein the size of the zoom area in the first initial image is substantially equal to the size of the region of interest in the second initial image.
  4. A method according to claim 2 or claim 3, wherein the coding entities of the first initial image are defined based on the position and size of the region of interest in the second initial image.
  5. A method according to claim 4, wherein the coding entities of the first initial image are defined so all parts of each encoding entity lie entirely inside or entirely outside the zoom area.
  6. A method according to claim 4 or claim 5, wherein the coding entities of the second initial image are defined so all parts of each encoding entity lie entirely inside or entirely outside the zoom area.
  7. A method according to claim 1 or 2, wherein the spatial parameters comprise information representative of the coding entities corresponding to the zoom area.
  8. A method according to any one of claims 2 to 7, wherein the coding entities of the first initial image are defined by setting the position of the zoom area in the first initial image to be substantially aligned with the position of the region of interest in the first initial image.
  9. A method according to any one of claims 2 to 7, wherein the coding entities of the first initial image are defined by minimising the difference in position of the zoom area in the first initial image and the position of the region of interest in the first initial image.
  10. A method according to any one of claims 2 to 7, wherein the coding entities of the first initial image are defined by setting the position of the zoom area in the first initial image to a user designated position.
  11. A method according to any previous claim, wherein coding entities of said first and second images are encoded in the same device.
  12. A method according to any previous claim, wherein a coding entity is a slice or a slice segment.
  13. Apparatus for constructing a combined image from initial images representing the same scene, a first initial image having a first resolution and a second initial image having a second resolution higher than the first resolution, the apparatus comprising: a determining unit adapted to determine spatial parameters representing a region of interest in the second initial image, corresponding to an identified region of interest of the first initial image; a partitioning unit adapted to partition the first initial image into coding entities in dependence upon the determined spatial parameters, to define a subset of coding entities corresponding to a zoom area, and adapted to partition the second initial image into coding entities based on the coding entities of the first initial image; and an encoding unit adapted to encode coding entities of the second initial image in the zoom area and coding entities of the first initial image outside of the zoom area, and to combine the resulting encoded data into a bitstream representing a combined image.
  14. Apparatus according to claim 13, further comprising an output unit adapted to output said combined image to a display device.
  15. Apparatus according to claim 14, further comprising an input unit adapted to receive information identifying a region of interest of the first initial image.
  16. A system for transmitting video images, the system comprising an apparatus for constructing a combined image according to claim 14, a video acquisition device providing video images to the combining device, and at least one display device adapted to receive a bitstream corresponding to the first initial image or the combined image.
  17. A system according to claim 16, wherein the display device allows a user to select at least one region of interest of an image and to transmit characteristics identifying the at least one region of interest to the combining device.
  18. A computer program comprising instructions which, when executed on a computer, cause the computer to perform the method of any one of claims 1 to 12.
  19. A computer-readable storage means having stored thereon a computer program according to claim 18.
GB1404440.8A 2014-03-13 2014-03-13 Image manipulation Active GB2524058B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1404440.8A GB2524058B (en) 2014-03-13 2014-03-13 Image manipulation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1404440.8A GB2524058B (en) 2014-03-13 2014-03-13 Image manipulation

Publications (3)

Publication Number Publication Date
GB201404440D0 GB201404440D0 (en) 2014-04-30
GB2524058A true GB2524058A (en) 2015-09-16
GB2524058B GB2524058B (en) 2016-08-10

Family

ID=50634676

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1404440.8A Active GB2524058B (en) 2014-03-13 2014-03-13 Image manipulation

Country Status (1)

Country Link
GB (1) GB2524058B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10958989B2 (en) 2016-02-25 2021-03-23 Synamedia Limited Framework for embedding data in encoded video

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110051808A1 (en) * 2009-08-31 2011-03-03 iAd Gesellschaft fur informatik, Automatisierung und Datenverarbeitung Method and system for transcoding regions of interests in video surveillance
WO2012092472A2 (en) * 2010-12-30 2012-07-05 Pelco Inc. Multi-resolution image display

Also Published As

Publication number Publication date
GB201404440D0 (en) 2014-04-30
GB2524058B (en) 2016-08-10
