EP4278608A1

EP4278608A1 - Method for playback of a video stream by a client

Info

Publication number: EP4278608A1
Application number: EP21823541.4A
Authority: EP
Inventors: Hossein Bakhshi GOLESTANI; Christian Rohlfing; Mathias Wien
Original assignee: Rheinisch-Westfälische Technische Hochschule RWTH
Current assignee: Rheinisch-Westfälische Technische Hochschule RWTH
Priority date: 2021-01-12
Filing date: 2021-11-30
Publication date: 2023-11-22
Also published as: US20240056654A1; WO2022152452A9; DE102021200225A1; WO2022152452A1

Abstract

The invention relates to a method for playback of a video stream by a client (C), wherein the video stream has frames from exactly one camera in relation to an object moving relative thereto, from different positions, the method comprising the steps of • receiving a video stream (VB) from an encoder (SRV), • decoding the received video stream (VB) using camera parameters (CP) and geometry data (GD), • playing back the processed video stream (AVB).

Description

Method for playing a video stream by a client

Background

The digital encoding of video signals is found in many forms today. For example, digital video is made available on data carriers such as DVDs or Blu-ray discs, or as a download or video stream (e.g. also for video communication). The aim of video coding is not only to send representations of the images to be transmitted, but also to keep data consumption low at the same time. On the one hand, this allows more content to be stored on media with limited storage space, such as DVDs; on the other hand, it enables several users to transport (different) video streams at the same time.

A distinction is made between lossless and lossy coding.

What all approaches have in common is that information for subsequent images is predicted from previously transmitted images.

Current analysis estimates that such encoded video will account for 82% of all network traffic by 2022 (up from 75% in 2017), see Cisco Visual Networking Index: Forecast and Trends, 2017-2022 (white paper), Cisco, Feb 2019.

From this it can be seen that any savings that can be achieved here will lead to a large saving in data volume and thus a saving in electrical energy for transport.

As a rule, an encoder, a carrier medium (e.g. a transmission channel) and a decoder are required. The encoder processes raw video data. A single image is usually referred to as a frame, and a frame can in turn be thought of as a collection of pixels. A pixel represents a point in the frame and indicates its color value and/or its brightness. The amount of data for a subsequent frame can be reduced, for example, if a large part of its information is already contained in one or more previously encoded frames. It is then sufficient, for example, to transmit only the difference. This exploits the fact that consecutive frames often show much of the same content. This is the case, for example, when a camera captures a particular scene from one point of view and only a few things change, or when a camera slowly moves or rotates through the scene (translation and/or affine movement of the camera).

However, this concept reaches its limits when a high proportion changes between frames, e.g. when the camera moves (rapidly) within a scene or when objects move within the scene. In this case, in the worst case, every pixel of two frames could be different.

Methods for multi-camera systems are known from the prior art, for example from European patent application EP 2 541 943 A1. However, these multi-camera systems are designed to use a previously known setup of cameras with previously known parameters.

However, if a camera, i.e. a monocular recording system, is used, a completely different requirement profile results. In many areas, e.g. in autonomous driving, in drones, in social media video recordings, or even in body cams or action cams, only a single camera is usually used. But it is precisely here that it is necessary to keep the storage space used and/or the amount of data to be transmitted small.

Object

Proceeding from this, it is an object of the invention to provide an improvement for a single-camera system.

Summary of the invention

The object is achieved by a method according to claim 1. Further advantageous configurations are the subject matter of the dependent claims, the description and the figures.

Brief description of the figures

FIG. 1 shows schematic flow charts according to aspects of the invention,

FIG. 2 shows schematic flow charts according to further aspects of the invention,

FIG. 3 shows schematic flow charts according to further aspects of the invention,

FIG. 4 shows a schematic representation for explaining conceptual issues, and

FIG. 5 shows an exemplary relation of frames according to further aspects of the invention.

Detailed presentation of the invention

In the following the invention will be illustrated in more detail with reference to the figures. It should be noted that different aspects are described, which can be used individually or in combination. That is, each aspect can be used with different embodiments of the invention, unless explicitly presented as purely alternative.

Furthermore, for the sake of simplicity, only one entity will generally be referred to below. Unless explicitly noted, however, the invention can also have several of the entities concerned. In this respect, the use of the words “a” and “an” is only to be understood as an indication that at least one entity is used in a simple embodiment.

Insofar as methods are described below, the individual steps of a method can be arranged and/or combined in any order, unless something different is explicitly stated in the context. Furthermore, the processes can be combined with one another, unless expressly stated otherwise.

In the following we will discuss various aspects of the invention in the context of a complete system of encoder and decoder. Errors that can occur between encoding and decoding are not discussed below because they are not relevant to understanding the (de-)coding. In current video delivery systems, the encoder is based on prediction. This means that the better you can predict a frame to be coded from a previously decoded frame, the less information (bit(s)) has to be transmitted.

Previous approaches predict frames based on similarities between the frames in a two-dimensional model.

However, it should be noted that the recording of videos mostly takes place in three-dimensional space.

With computing power now available, it is possible to determine/estimate depth information on the encoder and/or decoder side.

A three-dimensional motion model can therefore also be made available within the invention. Without restricting the generality of the invention, it is also possible to use the invention with all current video (de)coders as long as they are upgraded accordingly. In particular, Versatile Video Coding ITU-T H.266/ISO/IEC 23090-3 can be added to the invention.

The invention is based on the idea of motion-compensated prediction. To motivate this, reference is made to FIG. 4. Here, a video is viewed as a sequence of (consecutive) encoded two-dimensional frames. A frame is also referred to as a two-dimensional representation. Due to the temporal redundancy between consecutive frames, a frame to be encoded at time t can be predicted from previously encoded frames (t-1, t-2, ...; frames up to t-4 are shown, without being limited to four previous frames). These preceding frames are also referred to as references (reference frames, reference images). It should be noted that the frame order does not necessarily have to be chronological; the display order and the (de)coding order can differ. This means that not only information from chronologically preceding frames but also information from chronologically following frames (future frames in display order) can be used for the (de)coding. If the motion-compensated prediction is precise enough, it is sufficient to transmit only the difference between the prediction and the frame to be coded, the so-called prediction error. The better the prediction, the smaller the prediction error that has to be transmitted, i.e. the less data has to be transmitted or stored between encoder and decoder.

That is, from the coder's point of view, efficiency increases.
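The relation between prediction and prediction error can be illustrated by a minimal sketch. The function names and the tiny 2x2 example frame are assumptions chosen purely for illustration; they do not appear in the application, and a real codec would additionally transform, quantize and entropy-code the error.

```python
def prediction_error(frame, prediction):
    """Element-wise difference between the frame to be coded and its prediction."""
    return [[f - p for f, p in zip(f_row, p_row)]
            for f_row, p_row in zip(frame, prediction)]


def reconstruct(prediction, error):
    """Decoder side: add the transmitted error back onto the same prediction."""
    return [[p + e for p, e in zip(p_row, e_row)]
            for p_row, e_row in zip(prediction, error)]


# Hypothetical 2x2 luminance values; the prediction is off by one in one pixel.
frame = [[10, 12], [11, 13]]
prediction = [[10, 11], [11, 13]]

error = prediction_error(frame, prediction)
nonzero = sum(1 for row in error for e in row if e != 0)
```

The better the prediction, the more entries of `error` are zero, which is what makes the error cheaper to transmit than the frame itself.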

The conventional encoders are based on the similarity of frames in a two-dimensional model, i.e. only translations and/or affine movements are considered. However, there are a number of movements that cannot be easily expressed as a 2D model.

Therefore, this invention uses an approach based on the three-dimensional environment in which the sequence is recorded and from which a 3D motion model can be displayed.

In practical terms, video recording is analogous to projecting a three-dimensional scene into the camera's two-dimensional plane. Since the depth information is lost during the projection, the invention provides several ways of making it available again.

In the example of the flow chart according to FIG. 1, the 3D information is reconstructed on the decoder side, while in the example of the flow chart according to FIG. 2 the coder makes the 3D information available (compressed) and the decoder only uses it. In the flowchart example of Figure 3, a mixed form is provided in which the encoder provides the (coarse) 3D information (compressed) and the decoder further processes the 3D information to improve it.

It is obvious that in the first case the required bandwidth/storage capacity can be lower than in the second or third case. On the other hand, in the first case the demands on computing power are high for both the encoder and the decoder, while in the second case they are lower for the decoder and highest for the encoder. This means that different scenarios can be served on the basis of the available options. In particular, when a video stream is requested, provision can be made for a decoder to make its properties known to the encoder, so that the encoder can, where appropriate, dispense with providing (more precise) 3D information because the decoder implements a method according to FIG. 1 or 3.

In the following we assume that the camera is any camera and is not fixed to a specific type.

In the following, reference is made to a monocular camera with unknown camera parameters as the most difficult application, without thereby excluding other types of cameras, e.g. light field cameras, stereo cameras, etc.

In this way, the camera parameters CP and geometry data GD can be inferred. The camera parameters CP can be inferred, for example, by methods such as Structure from Motion, Simultaneous Localization and Mapping or sensors.

If such data are known from certain camera types, e.g. stereo cameras, and/or from additional sensors such as LIDAR sensors, gyroscopes, etc., they can be transmitted or processed as an alternative or in addition, thus reducing the computing effort or making it unnecessary. The camera parameters CP can typically be determined from sensor data from gyroscopes, inertial measurement units (IMU), location data from a global positioning system (GPS), etc., while the geometry data GD can be determined from sensor data from a LIDAR sensor, stereo cameras, depth sensors, light field sensors, etc. If both camera parameters CP and geometry data GD are available, the (de-)coding becomes easier and usually of better quality.

The encoder SRV can, for example, receive a conventional video signal (Input Video) in step 301. This video signal can advantageously be monitored for movement, i.e. relative movement of the camera. If a relative movement of the camera is detected, the input video signal can be subjected to a coding according to the invention; otherwise, if no relative movement of the camera is detected, the signal can be subjected to conventional coding as before and, as indicated in step 303, 403, 503, be provided to the decoder C. In embodiments, camera movement can be detected by the encoder, for example by visual data processing of the video signal and/or by sensors such as an IMU (Inertial Measurement Unit), a GPS (Global Positioning System), etc.

If, on the other hand, a movement is detected, a corresponding flag Flag_3D or other signaling can be used to signal the presence of inventive content according to step 304, 404, 504 if it is not already per se recognizable from the data stream.

If camera movement is detected, the (intrinsic and extrinsic) camera parameters CP can be estimated/determined in step 306, 406, 506, as indicated in step 305, 405, 505.

For this purpose, visual data processing techniques such as Structure-from-Motion (SfM), Simultaneous Localization and Mapping (SLAM), Visual Odometry (VO), or any other suitable method can be used.

The camera parameters CP can of course also be estimated/determined/taken over as a known value by other sensors.

Without restricting the generality of the invention, these camera parameters CP can be processed and encoded in step 307, 407, 507 and made available to the decoder C separately or embedded in the video stream VB.

The geometry in three-dimensional space can be estimated/determined in step 310, 410, 510. In particular, in step 310, the geometry in three-dimensional space can be estimated from one or more previously encoded frames (step 309). The previously determined camera parameters CP can be included in step 308 for this purpose. In the embodiments of FIGS. 2 and 3, the 3D geometry data can be estimated/determined from "raw" data. In the embodiment of FIG. 1, this data can be estimated/determined from the encoded data. Typically, the visual quality in the embodiments of FIGS. 2 and 3 can be better than in the embodiment of FIG. 1, so that these embodiments can provide higher-quality 3D geometry data. In order to estimate the geometry in three-dimensional space, so-called multi-view computer vision techniques can be used, without thereby excluding other techniques, such as any existing depth sensors (e.g. LiDAR) or other image sensors that allow depth detection (e.g. stereo cameras, RGB+D sensors, light field sensors, etc.).

The geometry determined in this way in three-dimensional space can be represented by a suitable data structure, e.g. a 3D model, a 3D mesh, 2D depth maps, point clouds (sparse or dense), etc.
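Two of the data structures named above, a 2D depth map and a point cloud, are related by back-projection through the camera model. The following is a minimal sketch assuming an ideal pinhole camera without lens distortion; the parameter names `fx`, `fy`, `cx`, `cy` (focal lengths and principal point in pixels) and the function name are assumptions for illustration and are not taken from the application.

```python
def depth_map_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth map into a sparse point cloud (camera coordinates).

    depth[v][u] is the depth z of pixel (u, v); entries <= 0 mean "no depth".
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z > 0:
                x = (u - cx) * z / fx
                y = (v - cy) * z / fy
                points.append((x, y, z))
    return points


# Hypothetical 2x2 depth map with two valid samples.
cloud = depth_map_to_points([[2.0, 0.0], [0.0, 4.0]], fx=2.0, fy=2.0, cx=0.0, cy=0.0)
```

Which representation is preferable depends on the application; a depth map is dense and aligned with a frame, while a point cloud is independent of any single view.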

The video signal VB can now be encoded in the three-dimensional space in steps 312, 412, 512 on the basis of the determined geometry.

The novel motion-based model can now be applied to the reproduced three-dimensional information.

For example, a reference image can be determined/selected in step 311 for this purpose. This can then be presented to the standard video encoder in step 312.

Obviously, the encoding that follows can be applied to one, several or all frames of a predetermined set. Correspondingly, of course, the coding can also be based on one, several or all previous frames of a predetermined set.

Provision can also be made for the coder SRV to process only some spatial regions within a frame in the specified manner according to the invention and others in a conventional manner.

As already stated, a standard video encoder can be used. An additional reference can be added to the list of reference images (in step 311) or an existing reference image can be replaced. Likewise, as already indicated, only a specific spatial area can be overwritten with the new reference. In this way, the standard video coder can be enabled to independently select, on the basis of the available reference images, the reference image that has a favorable property, for example high compression with low distortions (rate-distortion optimization).
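The selection among reference images by rate-distortion optimization can be sketched as follows. The Lagrangian cost J = D + lambda * R is the form commonly used in video encoders; the candidate names, the numbers and the function names below are hypothetical and serve only to illustrate how an additional synthesized reference competes with the existing ones.

```python
def rd_cost(distortion, rate_bits, lam):
    """Lagrangian rate-distortion cost J = D + lambda * R."""
    return distortion + lam * rate_bits


def pick_reference(candidates, lam):
    """Return the name of the reference with the lowest rate-distortion cost.

    candidates maps a reference name to a (distortion, rate_bits) pair.
    """
    return min(candidates, key=lambda name: rd_cost(*candidates[name], lam))


# Hypothetical measurements for one block against three references.
candidates = {"R1": (100.0, 40), "R2": (90.0, 60), "R3D": (40.0, 50)}
best = pick_reference(candidates, lam=1.0)
```

Because the synthesized reference is only one more candidate in the list, the encoder remains free to ignore it wherever a conventional reference has the lower cost.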

The standard video encoder can thus encode the video stream using the synthesized reference and make it available to the decoder C in step 313, 413, 513.

As in previous methods, the coder SRV can start again at corresponding re-entry points with a recognition according to step 301 and run through the method again.

Re-entry points can be set at fixed time intervals, chosen based on channel characteristics, video frame rate, application, etc.

The 3D geometry can be reconstructed in three-dimensional space or an existing one can be further developed. The 3D geometry will continue to grow as new frames are added until it restarts at the next re-entry point.

The decoder side C can be operated in a corresponding manner, with the coder SRV and decoder C in FIGS. 1 to 3 being arranged horizontally at approximately the same height in their functionally corresponding components.

The decoder C can first check whether a corresponding flag Flag_3D or another signaling was used.

If such a signaling (Flag_3D is e.g. 0) is not present, the video stream can be treated in step 316 by default. Otherwise the video stream can be treated in the new inventive way.
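The decoder-side check can be sketched as a simple dispatch. The representation of the signaled values as a dictionary and the returned path names are assumptions for illustration; the application only specifies that a flag such as Flag_3D or another signaling is evaluated.

```python
def select_decoding_path(stream_header):
    """Return which decoding path to take for a received stream.

    stream_header is a hypothetical dict of signaled values. A missing
    Flag_3D is treated like Flag_3D == 0, i.e. a conventional stream.
    """
    if stream_header.get("Flag_3D", 0) == 1:
        return "3d_reference_decoding"  # path guided by CP and GD
    return "standard_decoding"          # default treatment (step 316)
```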

First, camera parameters CP can be obtained in steps 317, 417, 517. The obtained camera parameters CP can be processed and/or decoded in optional step 318. These camera parameters CP can be used, for example, for a depth estimation as well as for generating the geometry in three-dimensional space in step 320 based on previous frames 319.

Overall, the same strategy with regard to the reference images as in the encoder (steps 309...312, 409...412, 509...512) can be used in the corresponding steps 319...332, 419...432, 519...532. It is possible, for example, to render the synthesized reference image in step 321 by transforming the previously decoded frame (step 319) into the frame to be decoded, guided by the decoded camera parameters CP (step 318) and the geometry in three-dimensional space (step 320).

Finally, in step 323, 423, 523, the video stream processed according to the invention can be decoded by a standard video decoder and output as a decoded video stream 324, 424, 524.

Usually, the decoder should be synchronous with the encoder in terms of settings, so that the decoder C uses the same settings (especially for depth determination, reference creation, etc.) as the encoder SRV.

In the embodiment of FIG. 2, in contrast to the embodiment of FIG. 1, the geometry in three-dimensional space can be estimated from raw video frames in step 405. An (additional) bit stream 410.1 can be generated from the data; it can be subject to further processing, e.g. decimation, (lossless/lossy) compression and coding, and can be made available to the decoder C. This bit stream 2.2 can also be converted back at the encoder in step 410.2, as in the decoder C, to ensure the congruence of the data, and made available for processing in step 411.

Likewise, the geometry in three-dimensional space can also be retained beyond a re-entry point. However, the method also allows the geometry in three-dimensional space to be continuously improved on the basis of previous and current frames. This geometry in three-dimensional space can suitably be subject to further processing, e.g. decimation (e.g. mesh decimation) and (lossless/lossy) compression/encoding. In a corresponding manner, the decoder C can receive and decode the bit stream 2.2 obtained in step 419.1 with the data relating to the geometry in three-dimensional space. The decoded geometry in three-dimensional space can then be used in step 420.

The decoder can obviously work faster in this variant, since the decoding requires less effort than the reconstruction of the geometry in three-dimensional space (FIG. 1).

While the embodiment in FIG. 1 presents a very efficient method with regard to bit rate reduction, a smaller bit rate reduction can be achieved with the embodiment in FIG. 2. The embodiment of FIG. 3 combines aspects of the embodiments of FIG. 1 and FIG. 2, spreading the complexity and allowing a flexible and efficient method.

Essentially, the concept of the embodiment in Figure 3 differs from the embodiment in Figure 2 in that the geometry in three-dimensional space, i.e. the 3D data in step 510.1 roughly represents the original geometry in three-dimensional space, i.e. in a slimmed-down version, so that the required bit rate for this decreases. Any suitable data reduction technique such as, but not limited to, sub-sampling, coarse quantization, transform coding, and dimensionality reduction, etc. can be used for this purpose.
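Two of the data reduction techniques mentioned, sub-sampling and coarse quantization, can be sketched for a depth map as follows. The step size, the quantization interval and the function names are assumptions chosen for illustration; the application does not prescribe particular values or a particular representation.

```python
def coarsen_depth(depth, step=2, quant=0.5):
    """Keep every `step`-th sample in both directions and quantize it to
    integer multiples of `quant` (sub-sampling plus coarse quantization)."""
    return [[round(z / quant) for z in row[::step]] for row in depth[::step]]


def restore_depth(coarse, quant=0.5):
    """Map the transmitted integer levels back to (approximate) depth values."""
    return [[level * quant for level in row] for row in coarse]


# Hypothetical 4x4 depth map reduced to 2x2 integer levels.
depth = [
    [1.1, 1.2, 1.9, 2.0],
    [1.0, 1.0, 2.0, 2.0],
    [3.1, 3.0, 3.9, 4.0],
    [3.0, 3.0, 4.0, 4.0],
]
coarse = coarsen_depth(depth)
approx = restore_depth(coarse)
```

The restored values are only a rough version of the original geometry, which matches the role of the decoder in FIG. 3: it starts from this coarse data and refines it further.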

The 3D data minimized in this way can be coded and made available to the decoder C as before in step 510.2. The bit stream 510.1/510.2 can be subject to further processing, e.g. decimation, (lossless/lossy) compression and encoding, and can be provided to the decoder C. This bit stream 2.2 can also be converted back at the encoder in step 510.3, as in the decoder C, to ensure the congruence of the data, and made available for processing in step 511. In this case, the previously encoded frames 509 and the camera parameters 506 can be used for finer processing of the 3D data. Correspondingly, the decoder C can receive the encoded and minimized 3D data in step 519.1, decode them in step 519.2 and, in the same way as the encoder SRV, make them available for processing. The previously encoded frames 519.3 and camera parameters 518 can be used for finer processing of the 3D data.

That is, in all embodiments of the decoder C, a video stream VB is obtained from the encoder SRV, e.g., a streaming server, in a first step 315, 415, 515.

The client C decodes the received video stream VB using camera parameters CP and geometry data GD, and then reproduces this as a prepared video stream AVB in step 324, 424, 524.

As shown in Figures 1 to 3, in embodiments of the invention the camera parameters CP can be obtained from the encoder SRV in step 317, 417, 517 (e.g. as bitstream 2.1) or can be determined from the obtained video stream VB.

In embodiments of the invention, geometry data GD can be obtained from the encoder SRV (e.g. as bitstream 2.2) or determined from the obtained video stream VB.

In particular, it can be provided that before receiving the video stream VB, the decoder C signals the coder SRV that it is able to process it. A set of processing options can also be supplied so that the coder can provide the appropriate format. For this purpose, the format made available can have a corresponding coding with regard to setting data.

In one embodiment of the invention, the geometry data includes depth data.

In summary, it should again be pointed out that in the case of FIG. 1, a 3D reconstruction is used both on the encoder and on the decoder side. In the example in FIG. 2, a 3D reconstruction is carried out only by the encoder and made available to the decoder. This means that the decoder does not have to carry out a 3D reconstruction, but can use the 3D geometry data provided by the encoder. Estimating the 3D geometry data on the part of the encoder is usually easier than on the part of the decoder. The configuration according to FIG. 2 is advantageous in particular when the computing power is limited on the part of the decoder. In the case of FIG. 3, as in FIG. 2, a 3D reconstruction is carried out by the encoder and made available to the decoder. However, only a rough version of the 3D geometry data is provided here. This allows the bit rate for the 3D geometry data to be reduced. At the same time, however, the decoder now has to complete/post-process (refine) the 3D geometry data.

The choice of method (e.g. according to Figure 1, Figure 2 or Figure 3) can be negotiated between the encoder and the decoder. This can be done, for example, on the basis of previous knowledge (e.g. computing power) or via a control/return channel (adaptive) in order to also take changes in the transmission capacity into account, for example. In a broadcast scenario that is aimed at a number of decoders, the configuration according to FIG. 2 will generally be preferred.
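One possible negotiation rule is sketched below. The three-level description of the decoder's computing power and the decision order are assumptions made for illustration only; the application merely states that the choice can be based on prior knowledge or a control/return channel and that FIG. 2 is generally preferred for broadcast.

```python
from enum import Enum


class Mode(Enum):
    DECODER_RECONSTRUCTS = 1    # FIG. 1: decoder reconstructs the 3D geometry
    ENCODER_SENDS_GEOMETRY = 2  # FIG. 2: encoder transmits the 3D geometry
    COARSE_PLUS_REFINE = 3      # FIG. 3: coarse geometry sent, decoder refines


def choose_mode(decoder_compute, broadcast=False):
    """Pick an embodiment from a hypothetical capability report.

    decoder_compute is one of "low", "medium", "high" (illustrative scale).
    """
    if broadcast or decoder_compute == "low":
        return Mode.ENCODER_SENDS_GEOMETRY
    if decoder_compute == "medium":
        return Mode.COARSE_PLUS_REFINE
    return Mode.DECODER_RECONSTRUCTS
```

An adaptive variant could re-run such a rule whenever the return channel reports a change in transmission capacity.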

Even if the invention is described in relation to methods, it is clear to a person skilled in the art that the invention can also be made available in hardware, in particular hardware set up by software. Common (de-)coding units, special computing units such as GPUs and DSPs, as well as ASIC- or FPGA-based solutions can be used for this, without thereby excluding the applicability of general microprocessors.

In particular, the invention can therefore also be embodied in computer program products for setting up a data processing system for carrying out a method.

With the invention, it is possible to achieve significant savings of several percent in the bit rate if there are scenes that can be encoded accordingly.

It is initially assumed below that a continuous video recording is to be encoded at a specific point in time. A few frames have already been encoded and another frame, the "to-be-encoded frame", is now to be encoded. Depending on where the frame is in the sequence and/or depending on the available data rate or video coding setting, this "to-be-encoded frame" can be coded using intra-prediction or inter-prediction tools. Typically, intra-prediction tools would be used for the first frame of each group of pictures (GOP), e.g. every 16th frame (i.e. the frames with order number 0, 16, 32, ...), while inter-prediction tools would be used for the "inter-frames" in between.

Of particular interest within the scope of the invention are the frames to be encoded by means of inter-prediction tools, i.e. in the example frames 1-15, 17-31, 33-47, ... .
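The split into intra-coded and inter-coded frames in this example can be written down directly. The function name and the default period of 16 are taken from the example above only for illustration; real encoders choose the period from their configuration.

```python
def prediction_tool(order_number, intra_period=16):
    """Return "intra" for the first frame of each group of pictures,
    "inter" for all frames in between."""
    return "intra" if order_number % intra_period == 0 else "inter"


# Frames of interest for the invention: the inter-coded ones.
inter_frames = [n for n in range(34) if prediction_tool(n) == "inter"]
```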

The main idea of inter-prediction tools is to exploit temporal similarities between consecutive frames. If a block of a frame to be encoded is similar to a block in a previously encoded frame (e.g. due to relative motion), then instead of re-encoding this block, the already encoded block can simply be referenced. This process can be referred to as motion compensation. For this purpose, a new list of previously encoded frames that can serve as references for the motion compensation is used for each frame. This list is also referred to as a reference picture list. In essence, the encoder divides the frame to be encoded into a number of non-overlapping blocks, and each block can then be compared with previously encoded blocks according to the list in order to find a good, preferably the best, match. The relative 2D position of each block found (i.e. the motion vector) and the difference between the current block and the block found (i.e. the residual signal) can then be encoded (along with the other blocks, their positions and their differences).

Within the scope of the invention, at least one novel reference image is created based on 3D information and added to this list of reference images or inserted instead of an existing reference image. For this purpose, the camera parameters CP for a single (monocular) camera and geometry data GD for the 3D scenes are created from a set of 2D images, which were captured by the moving monocular camera.

From this, a new type of reference image based on 3D information is created for frames to be encoded, e.g. by warping the content of conventional reference images to the position of the image to be encoded. This warping process is guided by the determined/estimated camera parameters CP and the geometry data GD of the 3D scene. The novel reference image synthesized in this way is added to the reference picture list. It allows improved performance with regard to motion compensation, i.e. it requires a lower bit rate than the conventional reference pictures in the reference picture list. In this way, the final bit rate can be reduced and the coding gain increased.
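The warping step can be sketched as a forward warp: each reference pixel is back-projected with its depth, moved by the relative camera pose, and projected into the target view. The sketch assumes an ideal pinhole camera, a per-pixel depth map as geometry data, nearest-pixel rounding and a simple depth test; these choices and all names are assumptions for illustration, not the method of the application. Pixels that receive no sample stay `None`.

```python
def warp_reference(ref, depth, K, R, t):
    """Forward-warp `ref` into the target view.

    K = (fx, fy, cx, cy) are pinhole intrinsics; R (3x3) and t (3,) map
    points from the reference camera to the target camera.
    """
    fx, fy, cx, cy = K
    h, w = len(ref), len(ref[0])
    out = [[None] * w for _ in range(h)]
    zbuf = [[float("inf")] * w for _ in range(h)]
    for v in range(h):
        for u in range(w):
            z = depth[v][u]
            if z <= 0:
                continue
            # Back-project the pixel to a 3D point in the reference camera.
            p = ((u - cx) * z / fx, (v - cy) * z / fy, z)
            # Rigid transform into the target camera.
            q = [sum(R[r][c] * p[c] for c in range(3)) + t[r] for r in range(3)]
            if q[2] <= 0:
                continue
            # Project into the target image and keep the nearest surface.
            u2 = round(fx * q[0] / q[2] + cx)
            v2 = round(fy * q[1] / q[2] + cy)
            if 0 <= u2 < w and 0 <= v2 < h and q[2] < zbuf[v2][u2]:
                zbuf[v2][u2] = q[2]
                out[v2][u2] = ref[v][u]
    return out


# Hypothetical one-row image at constant depth; the scene shifts one pixel.
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
warped = warp_reference([[10, 20, 30]], [[2.0, 2.0, 2.0]],
                        K=(2.0, 2.0, 0.0, 0.0), R=identity, t=(1.0, 0.0, 0.0))
```

Holes such as the first pixel in this example would need to be filled or left to the conventional references; how this is done is outside the scope of the sketch.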

Various approaches can be used to keep the runtime of the decoder low.

On the one hand, it should be noted that the synthesis of reference images on the part of the decoder is time-consuming. However, it may be sufficient for the encoder to use the novel 3D reference image only for one or more sub-areas/regions of the frame to be encoded, e.g. 20%-30% of the area/pixels, namely for those regions where the encoder's optimization finds that it yields a good inter-frame prediction. This is illustrated by way of example as follows. Suppose there are three references R1, R2 and R3D, and a frame to be encoded is divided into non-overlapping blocks. R3D is the reference provided by the invention. The encoder then takes a first block and checks which of the references contains the most similar block; this is repeated for each block in turn. Typically, R3D is selected in 20%-30% of the cases, while R1 or R2 is selected in the remaining cases. The information about which reference picture is used for which block can be written into the video bit stream. The decoder can then simply read this information and create the novel reference image based on 3D information only for these regions, i.e. not for the entire area. That is, unlike for the encoder, it may be sufficient for the decoder if only the used part of the R3D reference is created for the inter-prediction, while the other parts of the R3D reference need not be created.
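The per-block choice among R1, R2 and R3D, and the resulting list of regions the decoder actually has to synthesize, can be sketched as follows. For brevity the sketch compares only co-located blocks using the sum of absolute differences and omits the motion search; the block values and function names are hypothetical.

```python
def sad(a, b):
    """Sum of absolute differences between two flattened blocks."""
    return sum(abs(x - y) for x, y in zip(a, b))


def select_references(blocks, refs):
    """For each block, pick the reference whose co-located block is most similar.

    blocks: list of flattened blocks of the frame to be encoded.
    refs:   dict mapping a reference name to its list of co-located blocks.
    """
    return [min(refs, key=lambda name: sad(block, refs[name][i]))
            for i, block in enumerate(blocks)]


def blocks_needing_r3d(choices):
    """Indices of blocks for which the decoder must synthesize R3D."""
    return [i for i, name in enumerate(choices) if name == "R3D"]


# Hypothetical frame of four 2-pixel blocks and three references.
blocks = [[1, 1], [5, 5], [9, 9], [3, 3]]
refs = {
    "R1": [[1, 1], [0, 0], [0, 0], [3, 2]],
    "R2": [[0, 0], [5, 4], [0, 0], [0, 0]],
    "R3D": [[2, 2], [0, 0], [9, 9], [0, 0]],
}
choices = select_references(blocks, refs)
needed = blocks_needing_r3d(choices)
```

In this example only one of four blocks refers to R3D, so the decoder would have to synthesize just that region of the 3D-based reference.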

On the other hand, it can be seen that frames contribute differently to the final bit rate, depending on their order and position in the coding structure used. Reference is made to FIG. 5 in this regard. The circles represent frames in display order. The arrows indicate the conventionally generated reference images that can be used for motion compensation. For example, the frame F_{i+4} can use F_i and F_{i+8} as references. In the hierarchical coding structure, frames with a temporal identifier 4 (TID=4) contribute less to the final bit rate than frames with a temporal identifier TID<3. This is mainly because they use their respective immediate (previous/subsequent) frames for motion compensation, and these have almost the same content. The frame F_{i+8}, by contrast, uses for example F_i and F_{i+16} as reference images, which are further apart.
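How temporal identifiers and reference distances relate in such a hierarchy can be sketched under the assumption of a dyadic structure with 16 frames per group, in which each level halves the distance to the references. This structure, the group size and the function names are assumptions for illustration; FIG. 5 of the application may differ in detail.

```python
def temporal_id(order_number, gop_size=16):
    """Temporal identifier in a dyadic hierarchy (gop_size a power of two)."""
    offset = order_number % gop_size
    if offset == 0:
        return 0
    tid = gop_size.bit_length() - 1  # log2(gop_size), e.g. 4 for 16
    while offset % 2 == 0:
        offset //= 2
        tid -= 1
    return tid


def reference_frames(order_number, gop_size=16):
    """Previous and subsequent reference of a frame with TID > 0."""
    tid = temporal_id(order_number, gop_size)
    if tid == 0:
        return ()
    distance = gop_size >> tid
    return (order_number - distance, order_number + distance)


tids = [temporal_id(n) for n in range(17)]
share_tid4 = tids[:16].count(4) / 16
```

Under this assumption half of all frames carry TID=4 and refer to their direct neighbours, while frames on lower levels refer to frames that are 2, 4 or 8 positions away.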

For the following consideration, we assume that the procedure proposed within the scope of the invention reduces the bit rate for each frame by 5% compared to the previous procedure. Since frames with TID=4 contribute less to the final coding gain/the final bit rate (assuming 10% of the total bit rate is attributable to TID=4), the share that could be gained here using the invention is correspondingly low (5% of 10%, i.e. 0.5% of the total bit rate). One could therefore dispense with the method according to the invention for these frames, since the contribution is rather small. Computing time/memory can thus be saved in order to keep the speed high, or be reserved for areas in which the method according to the invention makes a greater contribution to the final coding gain/the final bit rate.

For example, assuming that the gain from the inventive method would provide a final coding gain of 3% when applied to all frames, skipping the frames with TID=4 would shrink this final coding gain to about 2.7%. On the other hand, the encoder could be (more than) twice as fast.
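The figures in the two preceding paragraphs can be checked with a short calculation. The sketch assumes, as a simplification, that the gain is spread evenly over the bit rate, so that skipping a class of frames removes a proportional part of the gain; the function names are chosen for illustration.

```python
def class_contribution(per_frame_gain, bitrate_share):
    """Part of the total bit rate saved within one class of frames."""
    return per_frame_gain * bitrate_share


def remaining_gain(total_gain, skipped_share):
    """Overall coding gain left when frames carrying `skipped_share` of the
    bit rate are coded without the method (uniform-gain assumption)."""
    return total_gain * (1.0 - skipped_share)


tid4_saving = class_contribution(0.05, 0.10)  # 5% of 10% of the bit rate
gain_without_tid4 = remaining_gain(0.03, 0.10)  # 3% gain, TID=4 skipped
```

With these inputs the TID=4 frames account for 0.5% of the total bit rate, and a 3% gain shrinks to about 2.7% when they are skipped, which matches the values stated above.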

Even if the method according to the invention were only applied to TID<1, omitting the other frames would reduce this final coding gain to around 1%.

It makes sense for the decoder to be informed if the encoder applies the method only to frames with specific TIDs, or omits it for them. Depending on the design, this can be signaled with a flag (reduced / not reduced) or with a code word (e.g. 1 for TID=1 only, 2 for TID=2 only, 4 for TID=3 only, or 3 for TID 1 and 2, ...).
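The example code words are consistent with a bit mask in which bit (TID - 1) is set for every temporal identifier concerned. The following sketch assumes exactly this reading; the function names are illustrative, and how TID=0 would be signaled is not covered by the example and is therefore left out.

```python
def tids_to_codeword(tids):
    """Encode a set of temporal identifiers (each >= 1) as a bit mask."""
    return sum(1 << (tid - 1) for tid in set(tids))


def codeword_to_tids(codeword):
    """Decode the bit mask back into a sorted list of temporal identifiers."""
    return [bit + 1 for bit in range(codeword.bit_length())
            if (codeword >> bit) & 1]
```

For instance, `tids_to_codeword([1, 2])` yields 3 and `codeword_to_tids(4)` yields `[3]`, matching the code words listed above.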

It should be noted that the camera parameters CP and the geometry data GD of the 3D scene can be made available from the encoder to the decoder not just once but repeatedly, individually or in combination. Either a new set or an update can be made available.

Claims

1. Method for playing back a video stream by a client (C), wherein the video stream has frames from exactly one camera in relation to an object moving relative thereto from different positions, comprising the steps

• receiving a video stream (VB) from an encoder (SRV),

• decoding the received video stream (VB) using camera parameters (CP) and geometry data (GD),

• playing back the processed video stream (AVB).

2. Method according to claim 1, characterized in that the camera parameters (CP) are obtained from the encoder (SRV) or are determined from the obtained video stream (VB).

3. Method according to claim 1 or 2, characterized in that the geometry data (GD) are obtained from the encoder (SRV) or are determined from the obtained video stream (VB).

4. Method according to one of the preceding claims, characterized in that before receiving the video stream (VB), the client (C) signals to the encoder (SRV) that it is able to process it.

5. Method according to one of the preceding claims, characterized in that the geometry data have depth data.

6. Device for performing a method according to any one of claims 1 to 5.

7. Computer program product for setting up a data processing system for performing a method according to any one of the preceding claims 1 to 5.

8. Method for providing a video stream by an encoder (SRV) to a client (C), wherein the video stream has frames from exactly one camera in relation to an object moving relative thereto from different positions, comprising the steps

• obtaining frames from a camera related to an object from different positions in the form of a video stream (eVB),

• encoding the received video stream (eVB) using camera parameters (CP) and geometry data (GD),

• streaming the video stream (VB).

9. The method as claimed in claim 8, characterized in that the camera parameters (CP) are obtained from an external source or are determined from the video stream (VB) obtained.

10. The method as claimed in claim 8 or 9, characterized in that the geometry data (GD) are obtained from an external source or are determined from the video stream (VB) obtained.

11. The method according to any one of the preceding claims 9 or 10, characterized in that the external source is a sensor and/or the cameras.

12. The method according to any one of the preceding claims 8 to 11, characterized in that before streaming the video stream (VB), the encoder (SRV) is signaled by the client (C) that the client is able to process it.

13. The method according to any one of the preceding claims 1 to 12, characterized in that the geometry data have depth data.

14. Device for carrying out a method according to one of claims 8 to 13.

15. Computer program product for setting up a data processing system for carrying out a method according to any one of the preceding claims 8 to 13.