WO2018136045A1 - Spatial marking of an object that is displayed as part of an image - Google Patents

Spatial marking of an object that is displayed as part of an image

Info

Publication number
WO2018136045A1
Authority
WO
WIPO (PCT)
Prior art keywords
coordinates
image
frame
marker
endpoint
Prior art date
Application number
PCT/US2017/013908
Other languages
French (fr)
Inventor
Lawrence Antony KUNNUVILLA
Original Assignee
Tata Communications (America) Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tata Communications (America) Inc. filed Critical Tata Communications (America) Inc.
Priority to PCT/US2017/013908
Publication of WO2018136045A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00 Television systems
    • H04N7/14 Systems for two-way working
    • H04N7/15 Conference systems
    • H04N7/152 Multipoint control units therefor
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00 Television systems
    • H04N7/14 Systems for two-way working
    • H04N7/141 Systems for two-way working between two video terminals, e.g. videophone
    • H04N7/147 Communication arrangements, e.g. identifying the communication as a video-communication, intermediate storage of the signals

Definitions

  • the present invention relates to telecommunications in general, and, more particularly, to the handling, across telecommunication endpoints, of the spatial marking of one or more objects that are displayed as part of an image of a sequence of images.
  • Videoconferencing is becoming a preferred way for conducting both one-on-one and group meetings, as well as for conducting conversations in general. It enables people to participate in a more relaxed and comfortable setting from their respective telecommunication endpoints, such as smartphones or personal computers, whether the people are in the office, at home, or elsewhere.
  • Good video communication systems such as telepresence systems and videoconferencing systems, including desktop video applications, can reduce travel expenditures and greatly increase productivity. This is, in part, because video feeds enable people to interact in real time.
  • the present invention enables the spatial marking on a smartphone display, or on that of a different type of telecommunication endpoint, of an object in an image that can be received from another endpoint.
  • the image can be one in a series of images being captured by a camera or other device at the other endpoint and shared by that endpoint.
  • The marker is displayed not only in the captured image but also in subsequent images.
  • The position of the marker in relation to the marked object is maintained across the series of images, regardless of movement of the device that is capturing the images.
  • the displaying of the marker occurs not only at the endpoint at which the marking occurred - the marker having been created by the endpoint user via a touchscreen or other suitable device - but also at other endpoints engaged in a videoconference and sharing the images, including the endpoint capturing the images being marked.
  • a first telecommunication endpoint continually captures images of a scene and processes those images, as part of a video stream. Each image is captured in the current spatial frame of reference in which the capturing device is operating, such as a camera that is part of the first endpoint.
  • the first endpoint transmits continually video frame representations of one or more images, including that of a first image, to at least a second telecommunication endpoint, along with frame identifications (IDs) of the video frames.
  • the first endpoint also transmits continually depth maps of the one or more images to a server computer, either cloud-based or otherwise, along with the frame IDs of the corresponding video frames.
  • the frame IDs enable the server computer and endpoints to correlate the different information that these devices receive from different sources, as explained below.
  • the second endpoint processes the video frames received from the first endpoint and displays the images represented in the video frames, including the first image.
  • the user of the second endpoint can add a marker to the first image in order to identify an object in the images, and the second endpoint can generate and transmit a representation, such as a set of coordinates, of the created marker to the server computer.
  • a marker can be created at an endpoint that is different from the endpoint capturing the images in the video stream, although the user of the image-capturing first endpoint may mark an image as well and share the marker in the image with the server computer.
  • the marking endpoint transmits coordinates of the marker to the server computer, along with a frame ID that corresponds to the video frame image on which the user created the marker.
  • the server computer uses the received coordinates of a marker and at least the depth map that corresponds to the frame ID of the marker coordinates, in order to detect an object cluster corresponding to the object marked on the display by the user.
  • the server computer can also determine one or more differences between the frames of reference of the first image and a second image, including differences in the spatial dimension of depth.
  • the inventor had the insight that the depth maps themselves could be used for pattern matching, in order to determine a difference between the two images, and that the video frames were not required for pattern matching.
  • At least some differences between frames of reference of different images are presumably attributable to movement of the first endpoint's camera - and, therefore, to movement of the first endpoint itself - from one position to another.
  • the server computer transmits the coordinates of the detected object cluster, which are updated to account for any camera movement, to the telecommunication endpoints, along with the frame ID of the corresponding video frame.
  • the first telecommunication endpoint displays a second image of the video frame that corresponds to the frame ID that it received from the server computer, but with the marker superimposed.
  • the endpoint superimposes the marker based on the updated coordinates of the object cluster that correspond to the received frame ID.
  • a technical support usage scenario can be envisioned that involves a first endpoint user, such as a technician, who is standing with a smartphone in an office room and using the smartphone's camera to share video images with other users at other endpoints.
  • A second endpoint user, such as an office or building manager who is at a remote location, is looking on a display at the video images being transmitted by the first endpoint, and is marking one or more objects in the images.
  • The system and method disclosed herein can also be applied to usage scenarios other than tech support, such as, while not being limited to, maintenance, education, medicine, criminal investigation, combatting terrorism, shopping, booking of travel and lodging, and so on.
  • the marker is displayed such that its relative position and orientation in relation to the marked object is maintained not only in the second image, but also in subsequent images, despite movement of the camera or other device used to capture the images at the first endpoint. Furthermore, after having been out of the camera's view, the marker returns into view when the marked object is brought back into the camera's view.
  • the second endpoint uses a marker to mark an object in an image during a video session. After some time has passed (e.g., a few seconds, a few minutes, etc.), the first endpoint's camera moves and, consequently, the object that was marked is no longer in the image seen by the camera.
  • a server computer processes the coordinates of a marker along with one or more depth maps, in order to detect an object cluster that corresponds to an object on a display marked by a user.
  • a different data-processing system can perform one or more of the actions that are disclosed herein as being performed by the server computer, such as a telecommunication endpoint.
  • An illustrative data-processing system for processing, in multi-dimensional space, a marker on an image comprises: a receiver configured to: a) receive a first depth map of a first video frame and a first frame identification (ID) of the first video frame, wherein the first video frame is of a first image of a scene and captured in a first frame of reference, and b) receive coordinates of a first marker, wherein the coordinates of the first marker correspond to the first video frame; a processor configured to: a) detect an object cluster based on the first depth map and the coordinates of the first marker, and b) generate a first set of coordinates of the object cluster; and a transmitter configured to transmit the first set of coordinates of the object cluster and the first frame ID, to at least one telecommunication endpoint.
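As an illustration of the data-processing system described in the bullet above, the following is a minimal Python sketch of how a server might correlate incoming depth maps and marker coordinates by frame ID before detecting an object cluster. The class name, message shapes, and the injected detect_object_cluster and send_to_endpoints callables are assumptions for exposition, not the patent's actual implementation.

```python
# Illustrative sketch only: names and message shapes are assumptions.
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Coordinates = List[Tuple[int, int]]   # pixel coordinates of a marker
ClusterCoords = Dict[str, float]      # e.g. {"x": .., "y": .., "width": .., "height": .., "z_depth": ..}

@dataclass
class MarkerProcessingServer:
    detect_object_cluster: Callable[[object, Coordinates], ClusterCoords]
    send_to_endpoints: Callable[[ClusterCoords, int], None]
    depth_maps: Dict[int, object] = field(default_factory=dict)   # keyed by frame ID

    def on_depth_map(self, frame_id: int, depth_map) -> None:
        # Store each received depth map under its frame ID (messages 303).
        self.depth_maps[frame_id] = depth_map

    def on_marker(self, frame_id: int, marker_coords: Coordinates) -> None:
        # Match the marker to the depth map captured in the same frame of reference.
        depth_map = self.depth_maps.get(frame_id)
        if depth_map is None:
            return  # depth map not received yet; a real system might queue the marker
        cluster = self.detect_object_cluster(depth_map, marker_coords)
        # Transmit the cluster coordinates together with the frame ID (messages 313/314).
        self.send_to_endpoints(cluster, frame_id)
```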
  • An illustrative method for processing, in multi-dimensional space, a marker on an image comprises: receiving, by a data-processing system, a first depth map of a first video frame and a first frame identification (ID) of the first video frame, wherein the first video frame is of a first image of a scene and captured in a first frame of reference; receiving coordinates of a first marker, wherein the coordinates of the first marker correspond to the first video frame; detecting an object cluster based on the first depth map and the coordinates of the first marker; generating a first set of coordinates of the object cluster; and transmitting the first set of coordinates of the object cluster and the first frame ID, to at least one telecommunication endpoint.
  • An illustrative telecommunication system for processing, in multi-dimensional space, a marker on an image comprising: i) a first telecommunication endpoint configured to: a) capture a first image of a scene and in a first frame of reference, b) receive a first set of coordinates of an object cluster and a first frame ID, c) superimpose a second marker on the first image, based on the first set of coordinates of the object cluster and the first frame ID, and d) display the first image with the second marker superimposed; and ii) a data-processing system configured to: a) receive coordinates of a first marker, wherein the coordinates of the first marker correspond to the first video frame, b) detect the object cluster based on a first depth map and the coordinates of the first marker, wherein the first depth map is of a first video frame of the first image, c) generate the first set of coordinates of the object cluster, and d) transmit the first set of coordinates of the object cluster and the first frame ID, to at least the first telecommunication endpoint.
  • Figure 1 depicts a schematic diagram of telecommunication system 100, in accordance with the illustrative embodiment of the present disclosure.
  • Figures 2A and 2B depict the salient components of telecommunication endpoint 101-m and server computer 103, respectively, within telecommunication system 100.
  • Figure 3 depicts message flow diagram 300 associated with the spatial marking of a portion of an image, in accordance with the illustrative embodiment of the present disclosure.
  • Figure 4 depicts a flowchart of operation 301 associated with capturing and processing one or more images.
  • Figure 5 depicts a flowchart of operation 403 associated with generating a video frame and depth map representations of an image.
  • Figure 6 depicts a flowchart of operation 305 associated with endpoint 101-2 processing and displaying a video frame representation of an image.
  • Figure 7 depicts a flowchart of operation 307 associated with endpoint 101-2 adding one or more markers to a particular image, including adding markers to an object in the image.
  • Figure 8 depicts a flowchart of operation 311 associated with server computer 103 processing, for a given frame ID, at least one depth map and a marker.
  • Figure 9 depicts a flowchart of operation 813 associated with server computer 103 determining one or more differences between two frames of reference: one for a first image and the other for a second image.
  • Figure 10 depicts a flowchart of operation 315 associated with endpoint 101-1 displaying a second image with a marker or markers superimposed on the image.
  • Figures 11A through 11E depict scene 1100 and corresponding images of the scene.
  • the phrase “based on” is defined as “being dependent on” in contrast to "being independent of”.
  • the value of Y is dependent on the value of X when the value of Y is different for two or more values of X.
  • the value of Y is independent of the value of X when the value of Y is the same for all values of X. Being “based on” includes both functions and relations.
  • Coordinate system - For the purposes of this specification, a "coordinate system" is defined as a system that uses one or more numbers, or coordinates, to uniquely determine the position of a point in a space.
  • Depth map - For the purposes of this specification, a "depth map" is defined as an image or image channel that contains information relating to the distance of the surfaces of scene objects from a viewpoint.
  • Frame of reference - For the purposes of this specification, a "frame of reference" is defined as a system of geometric axes in relation to which measurements of size, position, or motion can be made.
  • Image - For the purposes of this specification, an "image" is defined as a visual impression obtained by a camera or other device.
  • Marker - For the purposes of this specification, a "marker" is defined as something that shows the presence or existence of something.
  • Matrix - For the purposes of this specification, a "matrix" is defined as a rectangular array of quantities or expressions in rows and columns that is treated as a single entity and manipulated according to particular rules.
  • Scene - For the purposes of this specification, a "scene" is defined as something seen by a viewer; a view or prospect.
  • FIG. 1 depicts a schematic diagram of telecommunication system 100, in accordance with the illustrative embodiment of the present disclosure.
  • System 100 comprises telecommunication endpoints 101-1 through 101-M, telecommunication network 102, and server computer 103, interconnected as shown.
  • M is a positive integer that has a value of 2 as depicted; however, as those who are skilled in the art will appreciate after reading this specification, M can have a different value (i.e., there can be a different number of endpoints present and interacting with one another).
  • Each telecommunication endpoint 101-m, wherein m can have a value of between 1 and M, is a user device that enables its user (e.g., human, machine, etc.) to telecommunicate with other endpoints, and/or with other resources within telecommunication network 102.
  • Each endpoint can be mobile or immobile.
  • An endpoint can be a wireless terminal, a cellular telephone or cellphone, a wireless transmit/receive unit (WTRU), a user equipment (UE), a mobile station, a fixed or mobile subscriber unit, a pager, a personal digital assistant (PDA), a smartphone, a tablet, a phablet, a smart watch, a (hands-free) wearable device, a desk set, a computer, or any other type of end-user device capable of operating in a telecommunication environment, for example and without limitation.
  • The salient components of endpoint 101-m are described below and in Figure 2A.
  • Endpoint 101-m is capable of providing access to its user via at least one network, in this case network 102.
  • endpoint 101-m is capable of communicating via a local area network (LAN) within telecommunication network 102 (e.g., in accordance with the WiFi standard, etc.).
  • endpoint 101-m is capable of communication via a cellular access network.
  • endpoint 101-m is capable of communicating in accordance with one or more other standards such as the following telecommunications standards, without limitation: IEEE 802.16 WiMax, Bluetooth, LoRa, Global System for Mobile Communications (GSM), Universal Mobile Telecommunications System (UMTS), Long Term Evolution (LTE), CDMA-2000, IS-136 TDMA, IS-95 CDMA, 3G Wideband CDMA, and so on.
  • Endpoint 101-m is capable of storing and executing one or more software applications or "apps".
  • a video display app enables the endpoint, and thus its user, to view one or more images that constitute a video stream.
  • an app enables the endpoint's user to mark one or more of the displayed images via a touch screen, as described below.
  • Telecommunication network 102 is a network that provides connectivity among telecommunication endpoints 101-1 through 101-M and server computer 103.
  • Network 102 comprises computer- and/or telecommunications-networking devices, which can include gateways, routers, network bridges, switches, hubs, and repeaters, as well as other related devices.
  • Network 102 is managed by one or more service providers or operators, and provides bandwidth for various telecommunication services and network access to telecommunication endpoints in one or more communications service provider (CSP) networks and/or one or more enterprise networks.
  • One of the services that can be provided by network 102 is conferencing, including audio, web, and/or videoconferencing.
  • network 102 comprises computer servers, which process appropriate protocols (e.g., TURN, etc.) for media and handle call signaling (e.g., WebRTC, etc.) for the setup and teardown of calls.
  • Server computer 103 is configured to perform at least some of the actions described below and in the figures, including the detecting of object clusters and the generating of one or more sets of coordinates of the object clusters detected.
  • server computer 103 is cloud-based. The salient components of server computer 103 are described below and in Figure 2B.
  • Figure 2A depicts the salient components of telecommunication endpoint 101-m according to the illustrative embodiment of the present disclosure.
  • Telecommunication endpoint 101-m is based on a data-processing apparatus whose hardware platform comprises: camera 201, touchscreen 202, keyboard 203, processor 204, memory 205, display 206, and network interface 207, interconnected as shown.
  • Camera 201, touchscreen 202, and keyboard 203 are input devices and are known in the art.
  • camera 201 can be used to capture one or more images of a scene.
  • Touchscreen 202 or keyboard 203, or both, can be used by a user of the endpoint to create one or more markers associated with an object being displayed on display 206, or in general to create one or more markers somewhere on a particular image.
  • endpoint 101-m can have a different set of input devices for the purposes of capturing one or more images and/or entering one or more markers, in some alternative embodiments of the present disclosure.
  • Processor 204 is hardware, or a combination of hardware and software, that performs computation.
  • Processor 204 is configured such that, when operating in conjunction with the other components of endpoint 101-m, the processor executes software, processes data, and telecommunicates according to the operations described herein.
  • Processor 204 can be one or more computational elements.
  • Computer memory 205 is non-transitory and non-volatile computer storage memory technology as is known in the art (e.g., flash memory, etc.).
  • Memory 205 is configured to store an operating system, application software, and a database.
  • the operating system is a collection of software that manages, in well-known fashion, telecommunication endpoint 101-m's hardware resources and provides common services for computer programs, such as those that constitute the application software.
  • the application software that is executed by processor 204 according to the illustrative embodiment enables telecommunication endpoint 101-m to perform the functions disclosed herein.
  • the database is used to store, among other things, various representations of video frames in various frames of reference, along with the corresponding frame IDs, as described below.
  • Display 206 is an output device used for presenting various captured images that are part of a video stream, both with and without markers being superimposed on the captured images.
  • touchscreen 202 and display 206 occupy at least some of the same physical space and are integrated into the same physical device or unit.
  • Network interface 207 is configured to enable telecommunication endpoint 101-m to telecommunicate with other devices and systems, by receiving signals therefrom and/or transmitting signals thereto via receiver 221 and transmitter 222, respectively.
  • network interface 207 enables its telecommunication endpoint to communicate with one or more other devices, via network 102.
  • Network interface 207 communicates within a local area network (LAN) in accordance with a LAN protocol (e.g., WiFi, etc.) or within a cellular network in accordance with a cellular protocol, or both.
  • network interface 207 communicates via one or more other radio telecommunications protocols or via a wireline protocol.
  • Receiver 221 is a component that enables telecommunication endpoint 101-m to telecommunicate with other components and systems by receiving signals that convey information therefrom. It will be clear to those having ordinary skill in the art how to make and use alternative embodiments that comprise more than one receiver 221.
  • Transmitter 222 is a component that enables telecommunication endpoint 101-m to telecommunicate with other components and systems by transmitting signals that convey information thereto. It will be clear to those having ordinary skill in the art how to make and use alternative embodiments that comprise more than one transmitter 222.
  • Gyro 208 and accelerometer 209 are sensors configured to detect rotational movement and translational movement of endpoint 101-m, respectively.
  • FIG. 2B depicts the salient components of server computer 103 according to the illustrative embodiment of the present disclosure.
  • Server computer 103 is based on a data-processing apparatus whose hardware platform comprises: processor 234, memory 235, and network interface 237, interconnected as shown.
  • Processor 234 is hardware, or a combination of hardware and software, that performs computation.
  • Processor 234 is configured such that, when operating in conjunction with the other components of server computer 103, the processor executes software, processes data, and telecommunicates according to the operations described herein.
  • Processor 234 can be one or more computational elements.
  • Computer memory 235 is non-transitory and non-volatile computer storage memory technology as is known in the art (e.g., flash memory, etc.).
  • Memory 235 is configured to store an operating system, application software, and a database.
  • the operating system is a collection of software that manages, in well-known fashion, server computer 103's hardware resources and provides common services for computer programs, such as those that constitute the application software.
  • the application software that is executed by processor 234 according to the illustrative embodiment enables server computer 103 to perform the functions disclosed herein.
  • the database is used to store, among other things, various representations of depth maps and markers in various frames of reference, along with the corresponding frame IDs, as described below.
  • Network interface 237 is configured to enable server computer 103 to telecommunicate with other devices and systems, by receiving signals therefrom and/or transmitting signals thereto via receiver 251 and transmitter 252, respectively.
  • network interface 237 enables its server computer to communicate with one or more other devices, via network 102.
  • Network interface 237 communicates within a local area network (LAN) in accordance with a LAN protocol (e.g., WiFi, etc.) or within a cellular network in accordance with a cellular protocol, or both.
  • network interface 237 communicates via one or more other radio telecommunications protocols or via a wireline protocol.
  • Receiver 251 is a component that enables server computer 103 to telecommunicate with other components and systems by receiving signals that convey information therefrom. It will be clear to those having ordinary skill in the art how to make and use alternative embodiments that comprise more than one receiver 251.
  • Transmitter 252 is a component that enables server computer 103 to telecommunicate with other components and systems by transmitting signals that convey information thereto. It will be clear to those having ordinary skill in the art how to make and use alternative embodiments that comprise more than one transmitter 252.
  • Figures 3 through 10 depict message flow diagrams and flow charts that represent at least some of the salient, operational logic of one or more telecommunication endpoints 101-1 through 101-M and server computer 103, in accordance with the illustrative embodiment of the present disclosure.
  • endpoints 101-1 and 101-2 can handle a different division of processing than described below. Similarly, at least some of the processing described below can be handled by a different data-processing system entirely, such as one or more server computers within telecommunication network 102.
  • endpoints 101-1 and 101-2 and server computer 103 operate using a multidimensional, Cartesian coordinate system (e.g., "xyz" coordinates, etc.) and on data coordinates specified with respect to a frame of reference and in terms of such a coordinate system.
  • endpoints 101-1 and 101-2 can operate using a type of coordinate system (e.g., polar, spherical, cylindrical, etc.) different than Cartesian.
  • Figure 3 depicts message flow 300 associated with the spatial marking of a portion of an image, such as the marking of one or more objects displayed in the image, in accordance with the illustrative embodiment of the present disclosure.
  • Although message flow 300 features the marking of one or more objects, the techniques disclosed in this specification can be used in general to mark a portion of an image, regardless of whether an object is conspicuously present within or designated by the created marker, or not.
  • a technical support usage scenario can be envisioned that involves scene 1100 depicted in Figure 11A, which is that of an office room with three video monitors on a table, including leftmost monitor 1121 and rightmost monitor 1122.
  • A first endpoint user, such as a technician, is standing in the office room with a smartphone and is using the smartphone's camera (i.e., endpoint 101-1 comprising camera 201) to share video images with other endpoints.
  • A second endpoint user, such as an office or building manager who is currently at a remote location, is looking on a display (i.e., at endpoint 101-2 comprising display 206) at the video images being transmitted by the first endpoint, and is marking one or more objects in the images.
  • telecommunication endpoint 101-1 continually captures images of a scene, such as scene 1100, and processes those images. Operation 301 is described in detail below and in Figure 4. Each image is captured in the current frame of reference that the capturing device (e.g., camera 201, etc.) is in. Endpoint 101-1 transmits continually, via a sequence of messages that comprise both video frames of images and a frame identification (ID) for each video frame, representations of one or more images to telecommunication endpoint 101-2, as well as possibly to other endpoints.
  • endpoint 101-1 transmits, via message 302, a representation (i.e., video frame with frame ID) of a first image such as image 1101 in Figure 11B and, via subsequent messages, representations of subsequent images such as image 1103 in Figure 11D and image 1104 in Figure 11E.
  • Endpoint 101-1 also transmits continually, via a sequence of messages that comprise both depth maps of images and a frame ID for each depth map, representations of one or more images to server computer 103.
  • endpoint 101-1 transmits, via message 303, a representation (i.e., depth map with frame ID) of a first image such as image 1101 in Figure 11B and, via subsequent messages, representations of subsequent images such as image 1103 in Figure 11D and image 1104 in Figure 11E.
  • Endpoint 101-1 also transmits, to server computer 103, information that characterizes movement of its camera 201 via one or more messages 304.
  • such information can comprise at least one of accelerometer values and gyroscope values generated by endpoint 101-1; however, as those who are skilled in the art will appreciate after reading this specification, other types of inertial motion information can be generated and sent by the endpoint.
  • Telecommunication endpoint 101-2 processes and displays the one or more images received from endpoint 101-1.
  • telecommunication endpoint 101-2 processes and displays the first image 1101 in Figure 11B, based on the video frame representation received in message 302. Operation 305 is described below and in Figure 6.
  • telecommunication endpoint 101-2 adds markers to the first image, such as marker 1111 in Figure 11C, resulting in marked image 1102.
  • markers can be used to identify an object in the image.
  • Endpoint 101-2 transmits, via message 309, a representation of the markers (i.e., the marker coordinates along with the corresponding frame ID) to server computer 103.
  • server computer 103 generates one or more sets of coordinates of an object cluster associated with an object identified in the image in accordance with operation 307. Operation 311 is described in detail below and in Figure 8.
  • server computer 103 can also update coordinates of an object cluster, in part by determining a difference between frames of reference of a first image and a second image; within this context an example of a first image is image 1101 in Figure 11B and an example of a second image is image 1103 or image 1104 in Figure 11D or 11E, respectively.
  • At least some differences between frames of reference of different images are presumably attributable to camera 201 - and, therefore, to endpoint 101-1 itself - being moved from one position to another. Such changes in positions can be attributed to translational movement of the camera or rotational movement, or both.
  • Server computer 103 can use the depth maps received from endpoint 101-1 and marker coordinates received from endpoint 101-2, as well as the respective frame IDs received from both endpoints, in order to detect the object cluster and to update the coordinates of the object cluster. Server computer 103 then transmits the object cluster coordinates to endpoints 101-1 and 101-2 via messages 313 and 314, respectively, along with the applicable frame ID.
  • telecommunication endpoint 101-1 displays a second image with the markers superimposed, wherein the markers are superimposed based at least in part on the object cluster coordinates determined in accordance with operation 311. For example, endpoint 101-1 displays image 1104 with superimposed marker 1112 in Figure 11E. Operation 315 is described below and in Figure 10.
  • Telecommunication system 100 ensures proper coordination of the various shared representations through the use of the unique frame IDs, including synchronizing the video frames across the endpoints and the superimposing of markers on those video frames. That is, a representation tagged with a first frame ID corresponds to a first frame of reference, a representation tagged with a second frame ID corresponds to a second frame of reference, and so on.
  • server computer 103 knows which depth map representation (from endpoint 101-1) to match with which marker coordinate representation (from endpoint 101-2), in order to detect an object cluster.
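On the endpoint side, one way to keep superimposed markers synchronized with the video frames they refer to is to buffer recent frames keyed by frame ID, so that cluster coordinates arriving later can be matched to the frame they describe. The sketch below is a hypothetical illustration of that bookkeeping; the class and method names are assumptions, not part of the disclosure.

```python
# Hypothetical endpoint-side frame buffer keyed by frame ID.
from collections import OrderedDict

class FrameBuffer:
    """Keeps the most recent video frames keyed by frame ID so that object cluster
    coordinates tagged with a frame ID can be matched to the frame they refer to."""

    def __init__(self, capacity: int = 60):
        self.capacity = capacity
        self.frames = OrderedDict()        # frame_id -> video frame

    def add_frame(self, frame_id, frame):
        self.frames[frame_id] = frame
        while len(self.frames) > self.capacity:
            self.frames.popitem(last=False)   # drop the oldest buffered frame

    def frame_for(self, frame_id):
        # Returns the buffered frame that matches the received frame ID, if still present.
        return self.frames.get(frame_id)
```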
  • Although message flow 300 depicts a single iteration of processing image representations and marker representations, telecommunication endpoints 101-1 and 101-2 can continue to process additional images and markers in a manner similar to that described above.
  • either or both endpoints 101-1 and 101-2 can record some or all of the image representations and corresponding marker representations, and play back said representations such that the markers are displayed in a coordinated manner with the corresponding images, based in part on the frame IDs.
  • Message flow 300 depicts endpoint 101-1 as capturing the images and endpoint 101-2 as adding the one or more markers to an image.
  • a different combination of image-capturing endpoint and marker-adding endpoint can be provided.
  • more than one endpoint can add markers to the same image, or to different images captured as part of the same video stream, to the same object in one or more images, or to different objects.
  • the user of endpoint 101-1 or the user of a third endpoint, or both can add markers to one or more images, in addition to or instead of the user of endpoint 101-2.
  • FIG. 4 depicts a flowchart of operation 301 associated with capturing and processing one or more images.
  • endpoint 101-1 captures an image and stores it into its computer memory.
  • camera 201 captures the image in its current frame of reference (i.e., that at which the image is captured) and tags it with a unique frame ID.
  • a different device can be used to capture the image or endpoint 101-1 can receive a representation of the captured image from an external source (e.g., endpoint 101-2, etc.), wherein the frame of reference of the representation is known and made available.
  • endpoint 101-1 generates a video frame representation of the image. Endpoint 101-1 also generates a depth-map representation of the image, including z-depth information. Operation 403 is described below and in Figure 5.
  • endpoint 101-1 transmits the video representation of the image with frame ID, including depth information, to endpoint 101-2 via message 302.
  • Endpoint 101-1 transmits the depth-map representation of the image with frame ID to server computer 103 via message 303.
  • Endpoint 101-1 transmits camera movement information (e.g., accelerometer values, gyroscope values, etc.) to server computer 103 via message 304.
  • Figure 5 depicts a flowchart of operation 403 associated with generating a representation of an image.
  • endpoint 101-1 creates a two-dimensional visual representation of the image (i.e., height and width), thereby generating a video frame of the image.
  • Endpoint 101-1 creates the representation according to brightness values for pixels, both initially and regularly afterwards. In other frames, only the changes in the pixel values are included in the representation and transmitted to endpoint 101-2.
  • endpoint 101-1 generates maps in YUV color space for the corresponding image.
  • the endpoints and server computer can operate alternatively in a different color space than YUV (e.g., RGB, etc.) in generating and otherwise processing the various representations disclosed herein.
  • Endpoint 101-1 uses minor focus and defocus to provide two streams of YUV maps with two different predefined focal points.
  • endpoint 101-1 applies a gray scale.
  • Endpoint 101-1 compares the two data streams for the two focal points in gray scale.
  • endpoint 101-1 maps common points based on which lengths are checked and differences are stored.
  • a matrix of the length differences results in z-depth position for each common cluster.
  • the "z-depth” refers to the distance of the surfaces of scene objects from a viewpoint in the image field; it can be calculated for one or more points on the surfaces of the scene objects (i.e., on a pixel-by- pixel basis).
  • the "z" in z-depth relates to a convention that the central axis of view of a camera is in the direction of the camera's z-axis, and not to the absolute z-axis of a scene.
  • endpoint 101-1 creates a depth-map representation of the image, in the form of a transcoded image.
  • endpoint 101-1 represents the z-depths in three-bit format, computed per pixel.
  • Endpoint 101-1 creates the representation according to depth values for pixels, both initially and regularly afterwards. In other frames, only the changes in the pixel values are included in the representation and transmitted to server computer 103.
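The depth-map generation of operation 403 is described above only at a high level (two differently-focused grayscale streams compared, a three-bit per-pixel z-depth, and delta transmission of changed pixels). The following numpy sketch illustrates one plausible way such a pipeline could look; the gradient-based sharpness measure, the quantization scheme, and the function names are assumptions rather than the endpoint's actual algorithm.

```python
import numpy as np

def sharpness(gray: np.ndarray) -> np.ndarray:
    """Crude per-pixel sharpness: magnitude of the local intensity gradient."""
    gy, gx = np.gradient(gray.astype(np.float64))
    return np.hypot(gx, gy) + 1e-6          # avoid division by zero

def depth_map_3bit(near_focus: np.ndarray, far_focus: np.ndarray) -> np.ndarray:
    """Compare two grayscale captures taken at two predefined focal points and
    quantize the relative sharpness into a 3-bit (0-7) per-pixel z-depth proxy."""
    ratio = sharpness(near_focus) / (sharpness(near_focus) + sharpness(far_focus))
    # ratio near 1.0 -> sharper in the near-focus stream (shallow); near 0.0 -> deep
    depth = np.clip(((1.0 - ratio) * 8).astype(np.uint8), 0, 7)
    return depth

def depth_delta(previous: np.ndarray, current: np.ndarray):
    """Delta encoding: only pixels whose 3-bit depth value changed are transmitted."""
    changed = np.argwhere(previous != current)
    return [(int(y), int(x), int(current[y, x])) for y, x in changed]
```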
  • Figure 6 depicts a flowchart of operation 305 associated with endpoint 101-2 processing and displaying one or more representations of an image.
  • endpoint 101-2 receives a representation of a particular image (e.g., "first image", etc.) in a particular video image frame, via message 302.
  • endpoint 101-2 processes the received representation for the purpose of displaying the image for its user.
  • Endpoint 101-2 can use one or more received representations to construct the current image to be displayed, based on a complete representation with all of the pixel values and subsequent updates based on the pixels that have changed.
  • endpoint 101-2 presents the image via its display to its user.
  • Figure 7 depicts a flowchart of operation 307 associated with endpoint 101-2 adding one or more markers to a particular image, including adding markers to an object in the image.
  • endpoint 101-2 detects markers being added to the particular image being displayed in accordance with operation 605.
  • marker 1111 is being used to identify or designate the leftmost video monitor 1121 in scene 1100.
  • Endpoint 101-2 can detect swipes being made by a user to touchscreen 202 or key selections being made to keyboard 203, wherein the swipes, the key selections, or a different type of user action correspond to the adding of a marker to a portion of the particular image being displayed.
  • the user can create a marker in the form of a circle, square, tick mark, text, number, or any other symbol the user wants to use.
  • endpoint 101-2 generates a representation of the markers being created by the user.
  • the representation of a marker can be in the form of coordinates, wherein the frame of reference of the coordinates corresponds to that of the frame ID of the particular video frame on which the user is adding the marker.
  • the set of coordinates making up the representation can correspond to one or more features of a marker, such as one or more pixel points along the marker on the display, one or more vertices of the marker when represented as a polygon (i.e., a "marker polygon"), one or more edges of said polygon, an approximated center of the marker or of said polygon, and so on.
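A hypothetical sketch of the marker representation that a marking endpoint might generate and transmit, consisting of pixel points sampled along the marker stroke and tagged with the frame ID of the video frame on which the marker was drawn; the field and function names are assumptions.

```python
# Hypothetical marker representation; field names are assumptions.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class MarkerRepresentation:
    frame_id: int                         # video frame on which the user drew the marker
    points: List[Tuple[int, int]]         # pixel points sampled along the marker stroke

    @property
    def center(self) -> Tuple[float, float]:
        # Approximated center of the marker polygon.
        xs = [p[0] for p in self.points]
        ys = [p[1] for p in self.points]
        return (sum(xs) / len(xs), sum(ys) / len(ys))

def from_touch_events(frame_id: int, touches: List[Tuple[int, int]]) -> MarkerRepresentation:
    """Convert raw touchscreen samples (e.g., a swipe path) into the transmitted representation."""
    return MarkerRepresentation(frame_id=frame_id, points=list(touches))
```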
  • endpoint 101-2 transmits the representation of the markers (i.e., the marker coordinates along with the corresponding frame ID) to server computer 103 via message 309.
  • FIG. 8 depicts a flowchart of operation 311 associated with server computer 103 generating coordinates of an object cluster and updating the object cluster coordinates as needed.
  • server computer 103 receives a first depth map with first frame ID via message 303 from endpoint 101-1, in the form of a transcoded image.
  • the depth map is of a first video frame and the first frame ID identifies the first video frame.
  • the first video frame is of a first image of a scene and captured in a first frame of reference.
  • Server computer 103 can also receive information that characterizes movement of camera 201 from endpoint 101-1 via message 304.
  • Server computer 103 can also receive subsequent information, including a second depth map with second frame ID from endpoint 101-1.
  • the depth map is of a second video frame and the second frame ID identifies the second video frame.
  • the second video frame is of a second image of a scene and captured in a second frame of reference.
  • server computer 103 receives coordinates of a first marker via message 309 from endpoint 101-2, along with the frame ID that corresponds to the video frame on which the user of endpoint 101-2 created the marker.
  • server computer 103 determines whether this is the first time that coordinates for the marker are being received. If this is the first time, meaning that an object cluster has not yet been detected for the marker, control of execution proceeds to operation 807. Otherwise, control of execution proceeds to operation 813.
  • When the first set of marker coordinates is received, corresponding to the pixels where the user marked the video frame, that set is used to process the depth map and identify the object cluster. Then, for all subsequent frames, pattern matching alone can be used to determine the difference in the object cluster's position between the previous and the next frame.
  • server computer 103 matches the marker coordinates tagged with a frame ID, and received from endpoint 101-2, with a depth map received from endpoint 101-1 and corresponding to the same frame ID.
  • server computer 103 detects an object cluster in the depth map matched in operation 807, in the region of the depth map identified by the marker coordinates received.
  • server computer 103 uses multiple depth maps (e.g., 4-8 transcoded images, etc.) corresponding to video frames that have already been received and stored in memory, in order to detect an object cluster.
  • the detection of an object cluster can be based in part on one or more z-depths that are within the region defined by the marker coordinates, which z-depths are received as part of the depth map of an image.
  • a candidate cluster of z-depths that are similar in value and within the region defined by the marker coordinates can be attributed to the object; in contrast, pixels having z-depth values different from those in the candidate cluster can be ruled out as belonging to an object, certainly if they are outside the region defined by the marker coordinates.
  • One such object cluster can coincide with a particular object marked by the user, such as the video monitor in Figure 11C.
  • server computer 103 generates coordinates of the detected object cluster in well-known fashion, wherein the coordinates of the object cluster comprise a representation of depth, in addition to height and width representations within an image field.
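A rough sketch of detecting an object cluster as a set of z-depths of similar value within the region identified by the marker, and of deriving cluster coordinates that include height, width, and a depth value. The median-based reference depth, the tolerance, and the returned field names are assumptions, not the server's actual method.

```python
import numpy as np

def detect_object_cluster(depth_map: np.ndarray, marker_points, tolerance: int = 1):
    """Find pixels inside the marker's bounding region whose z-depth is close to the
    region's median z-depth, and return bounding coordinates plus a depth value."""
    xs = [x for x, _ in marker_points]
    ys = [y for _, y in marker_points]
    x0, x1 = min(xs), max(xs)
    y0, y1 = min(ys), max(ys)

    region = depth_map[y0:y1 + 1, x0:x1 + 1]
    z_ref = np.median(region)                           # representative depth of the marked object
    mask = np.abs(region.astype(int) - z_ref) <= tolerance

    if not mask.any():
        return None
    rows, cols = np.nonzero(mask)
    return {
        "x": x0 + int(cols.min()), "y": y0 + int(rows.min()),
        "width": int(cols.max() - cols.min() + 1),
        "height": int(rows.max() - rows.min() + 1),
        "z_depth": float(z_ref),
    }
```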
  • server computer 103 transmits the coordinates of the object cluster to endpoints 101-1 and 101-2 via messages 313 and 314, respectively.
  • Server computer 103 also includes the frame ID corresponding to the depth map and video frame of the marker that were matched for the purpose of detecting the object cluster.
  • server computer 103 can also transmit coordinates of the marker itself (e.g., the coordinates received in accordance with operation 803) to one or both of endpoints 101-1 and 101-2.
  • Figure 9 depicts a flowchart of operation 813 associated with server computer 103 determining subsequent coordinates of an object cluster after it has already been detected, including compensating for camera motion.
  • server computer 103 calculates a difference between information captured by camera 201 in a second frame of reference and in a first frame of reference.
  • the second frame of reference corresponds to a second image (e.g., image 1103, image 1104, etc.)
  • the first frame of reference corresponds to a first image (e.g., image 1101, etc.).
  • Server computer 103 calculates the difference by comparing a second depth map of the second image with the first depth map of the first image.
  • One such comparison is pattern matching, in which a shift in the object cluster in the second image with respect to where it appeared in the first image can be attributed to movement of the object in the camera's field of view.
  • server computer 103 selects reference points that define a polygon (i.e., a "reference polygon") within an image, which will be tracked across one or more subsequent images being captured by camera 201.
  • server computer 103 determines the change in length and/or area of the defined polygon. Any change in length and/or area is presumably attributable to camera 201 - and endpoint 101-1 itself - being moved from one position to another. The change in positions can be attributed to translational movement of the camera or rotational movement, or both.
  • In accordance with operation 907, server computer 103 calculates the change in camera position based on the change in length and/or area of the polygon. As part of this operation, server computer 103 applies a bandpass filter in order to remove at least some anomalous results, if present.
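A sketch of estimating camera movement from a tracked reference polygon, as described in operations 905-907: a change in the polygon's area suggests radial (depth) movement, and a centroid shift suggests lateral movement. The shoelace-formula area computation and the simple moving average standing in for the bandpass filter are assumptions.

```python
import numpy as np

def polygon_area(points: np.ndarray) -> float:
    """Shoelace formula for the area of a polygon given as an (N, 2) array of (x, y) points."""
    x, y = points[:, 0], points[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

def camera_change(ref_polygon_prev: np.ndarray, ref_polygon_curr: np.ndarray):
    """Estimate scale change (radial camera motion) and centroid shift (lateral motion)
    of the tracked reference polygon between two frames."""
    scale = np.sqrt(polygon_area(ref_polygon_curr) / polygon_area(ref_polygon_prev))
    shift = ref_polygon_curr.mean(axis=0) - ref_polygon_prev.mean(axis=0)
    return scale, shift       # scale > 1: camera moved closer; shift: (dx, dy) in pixels

def smooth(values, window: int = 5):
    """Simple moving average used here in place of the bandpass filter that removes
    anomalous per-frame estimates."""
    values = np.asarray(values, dtype=float)
    if len(values) < window:
        return values
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")
```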
  • server computer 103 applies the received information that characterizes movement of camera 201.
  • the server computer can apply the gyro values sensed by gyro 208 and/or accelerometer values sensed by accelerometer 209, in order to establish a rotational and/or translational change, respectively.
  • the change in position / rotation of the camera is obtained from its gyro and accelerometer values (i.e., within endpoint 101-1) and the change in these values is correlated with the pixel positions of the object clusters.
  • a relation is established between a movement of the camera that is tracked using the inertial motion unit at endpoint 101-1 and the movement of object clusters through the video frames after learning from multiple video frames (e.g., 20 to 40, etc.). Thenceforth, this relation can be used in conjunction with the depth map cluster data in order to tune the tracking of the marked object.
  • server computer 103 updates the object cluster coordinates based on the determined difference between the first and second frames of reference being considered, i.e., based on the calculated change in camera position and orientation.
  • the server computer generates this second set of object cluster coordinates by adjusting the first set of coordinates with the difference between the frames of reference.
  • Whatever change in the depth-map representation of second image 1104 has occurred relative to the depth-map representation of first image 1101, in terms of position and/or orientation, the same change can also apply in determining the object cluster's position and orientation within second image 1104.
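Applying the determined frame-of-reference difference to the first set of object cluster coordinates can then be as simple as the following sketch, which assumes a simplified model of a uniform scale change plus a pixel shift; the field names match the earlier hypothetical sketches and are likewise assumptions.

```python
def update_cluster_coordinates(cluster: dict, shift, scale: float) -> dict:
    """Adjust the first set of object cluster coordinates by the estimated difference
    between the two frames of reference (pixel shift and scale change).
    Simplified model: uniform scale about the image origin plus a pixel shift."""
    dx, dy = shift
    return {
        "x": int(round(cluster["x"] * scale + dx)),
        "y": int(round(cluster["y"] * scale + dy)),
        "width": int(round(cluster["width"] * scale)),
        "height": int(round(cluster["height"] * scale)),
        # Assumption: a scale > 1 (camera moved closer) corresponds to a shallower z-depth.
        "z_depth": cluster["z_depth"] / scale,
    }
```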
  • FIG. 10 depicts a flowchart of operation 315 associated with endpoint 101-1 displaying a second image with a marker or markers superimposed on the image.
  • endpoint 101-1 receives a representation of an object cluster (e.g., coordinates of the object cluster with frame ID) from server computer 103 via message 313.
  • endpoint 101-1 also receives coordinates of the marker itself.
  • endpoint 101-1 superimposes the markers on a second image, in this case image 1104, captured by camera 201. This is based on i) the first video frame representation of first image 1101 and ii) the object cluster coordinates and corresponding frame ID received from server computer 103.
  • endpoint 101-1 superimposes the marker created by a user on a displayed video frame having a particular frame ID, using the object cluster coordinates for that frame ID, wherein the marker is superimposed on the video frame having that frame ID.
  • Endpoint 101-1 creates a marker from the object cluster coordinates, both in terms of establishing the marker's position as superimposed on the image and in terms of the size and shape of the marker.
  • endpoint 101-1 uses received marker coordinates for a given frame ID, in order to establish the position, size, and/or shape of the marker as superimposed on the image, and then uses the object cluster coordinates to update the position, size, and/or shape of the marker as needed.
  • Endpoint 101-1 compensates for the position, shape, and/or size of the marker, in relation to any previously-superimposed marker, in part by considering the representation of depth in the coordinates of the object cluster. For example, the marker being superimposed on a current video frame can be reduced in size with respect to the marker that was superimposed on a previous video frame, based on the z-depth indicating the cluster being deeper in the image than before; likewise, the marker being superimposed on a current video frame can be increased in size with respect to the marker that was superimposed on a previous video frame, based on the z-depth indicating the cluster being shallower in the image than before.
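The depth-based resizing described above can be expressed as a scale factor derived from the ratio of the cluster's previous and current z-depths, as in the following hypothetical sketch.

```python
def marker_scale_from_depth(previous_z: float, current_z: float) -> float:
    """Scale factor for the superimposed marker: shrink it when the cluster's z-depth
    indicates the object is deeper in the image than before, enlarge it when shallower."""
    if current_z <= 0:
        return 1.0
    return previous_z / current_z     # e.g., object twice as deep -> marker half the size

# usage sketch (names are placeholders):
# size = base_size * marker_scale_from_depth(prev_cluster["z_depth"], cluster["z_depth"])
```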
  • endpoint 101-1 presents second image 1104 to its user via display 206.
  • marker 1112 is seen superimposed on image 1104.
  • Endpoint 101-1 can perform this by storing in a display memory a combination of i) the captured image without the marker and ii) the marker adjusted to the frame of reference, as tracked by frame ID, of the captured image to be displayed.
  • a different technique of displaying the captured image and the marker can be used instead.
  • the marker that is superimposed in the second image in a second frame of reference might appear to have a different shape, owing to camera 201 moving laterally, radially, and/or circumferentially with respect to the object being marked.
  • the representation of a marker generated at operation 703 can provide sufficient constituent points or other data representative of a marker. These constituent points can be individually and sufficiently adjusted, in accordance with operation 813, from the first frame of reference (i.e., that of the image on which the user created the marker) to the second frame of reference of the image on which the marker is to be superimposed. Accordingly, this can result not only in marker 1112 showing up at possibly a different position on the display than marker 1111, but also in its shape being adjusted according to the differences in frames of reference.
  • Marker 1112 can differ from marker 1111 (i.e., the marker initially created by the user of endpoint 101-2) in other ways than shape.
  • marker 1112 can be a different color, different line style (e.g., dotted vs. dashed vs. solid, etc.), different line thickness, or displayed using a different type of presentation (e.g., flashing versus solid, etc.) on display 206.
  • endpoint 101-1 presented image 1102 to its user via display 206, prior to presenting image 1104.
  • marker 1111 is seen superimposed on image 1102.
  • the frame of reference of image 1102 is the same as that of first image 1101.
  • marker 1112 appears adjusted within a second frame of reference - that is, that of image 1104 - and, as a result, appears to continue to coincide with the appearance of the video monitor object in the image.
  • server computer 103 takes the difference between the first and third frame of reference determined at operation 901, and adjusts the representation of the object cluster detected at operation 809 based on the difference.
  • the representation of the marker as provided by endpoint 101-2 is defined with respect to the frame of reference of first image 1101. Therefore, whatever change in representation of third image 1103 has occurred relative to the representation of first image 1101, in terms of position and orientation, the same change also applies in determining the marker's position and orientation within third image 1103. Consequently, because the leftmost video monitor 1121 of scene 1100 is fully outside of third image 1103, so is any marker of that video monitor.
  • Figures 11A through 11E depict scene 1100 and corresponding images of the scene.
  • Figure 11A depicts scene 1100, which is that of an office room with three video monitors on a table, including leftmost monitor 1121 and rightmost monitor 1122.
  • A first endpoint user, who is a technician, is standing in the office room with a smartphone having a camera (i.e., endpoint 101-1 comprising camera 201).
  • A second endpoint user, who is an office or building manager currently at a remote location, is looking on a display (i.e., at endpoint 101-2 comprising display 206) at the video images being transmitted by the first endpoint.
  • the technician can be walking around the room depicted in scene 1100, training the smartphone camera at various objects, and the manager can be marking one or more objects in the video-stream images being received.
  • the system and method disclosed herein can also be applied to usage scenarios other than tech support, such as, while not being limited to, maintenance, education, medicine, criminal investigation, combatting terrorism, shopping, booking of travel and lodging, and so on; can also be applied to scenes other than that of an office location; and can also be applied to marking various objects other than those found in an office location.
  • the images need not be part of a video stream, nor do the images need to be shared as part of a videoconference.
  • Figure 11B depicts image 1101, which includes the leftmost video monitor 1121 on the table in scene 1100.
  • Image 1101 is captured by camera 201 of endpoint 101-1; a video frame representation of the image is transmitted to other endpoints, including endpoint 101-2, and a depth map representation of the image is transmitted to server computer 103.
  • Figure 11C depicts image 1102, which includes the leftmost video monitor 1121 on the table in scene 1100 with marker 1111 superimposed.
  • Images 1101 and 1102 are of the same image, as captured by camera 201, but with marker 1111 appearing in image 1102.
  • Figure 11D depicts image 1103, which includes the rightmost video monitor 1122 on the table in scene 1100, but not the leftmost video monitor 1121, which has appeared in image 1102 as being marked.
  • Image 1103 is captured by camera 201 of endpoint 101-1 and is the result of the user of endpoint 101-1 shifting and/or panning the endpoint toward the right of scene 1100.
  • Endpoint 101-1 transmits a video frame representation of the image to other endpoints, including endpoint 101-2, and a depth map representation of the image to server computer 103.
  • Image 1103 is of a different frame of reference from that of image 1101, as camera 201 has moved.
  • Figure 11E depicts image 1104, which once again includes the leftmost video monitor 1121 on the table in scene 1100.
  • Image 1104 is captured by camera 201 of endpoint 101-1 and is the result of the user of endpoint 101-1 shifting and/or panning the endpoint back toward the left of scene 1100, after having been trained on the right part of scene 1100.
  • Image 1104 is of a different frame of reference from those of images 1101 and 1103, as camera 201 has moved in relation to the camera position and orientation when images 1101 and 1103 were captured.
  • endpoint 101-1 superimposes marker 1112 on its display for its user, as described earlier, in the approximate position in relation to where marker 1111 was displayed, after server computer 103 has accounted for the different frames of reference between that of image 1104 and that of one or more of the previous images, including image 1101.
  • endpoint 101-2 also can superimpose marker 1112 on its display for its user.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

A system and method that enable the spatial marking, on a telecommunication endpoint's display, of an object in an image that can be received from another endpoint. The image can be one in a series of images being captured by a camera at the other endpoint and shared by that endpoint. The marker is displayed not only in the captured image but also in subsequent images. The position of the marker in relation to the marked object is maintained across the series of images, regardless of movement of the camera. The marker is displayed not only at the endpoint at which the marking occurred - the marker having been created by the endpoint user via a touchscreen or other suitable device - but also at other endpoints engaged in a videoconference and sharing the images, including the endpoint capturing the images being marked.

Description

Spatial marking of an object that is displayed as part of an image
Field of the Invention
[0001] The present invention relates to telecommunications in general, and, more particularly, to the handling, across telecommunication endpoints, of the spatial marking of one or more objects that are displayed as part of an image of a sequence of images.
Background of the Invention
[0002] Videoconferencing is becoming a preferred way for conducting both one-on-one and group meetings, as well as for conducting conversations in general. It enables people to participate in a more relaxed and comfortable setting from their respective telecommunication endpoints, such as smartphones or personal computers, whether the people are in the office, at home, or elsewhere. Good video communication systems, such as telepresence systems and videoconferencing systems, including desktop video applications, can reduce travel expenditures and greatly increase productivity. This is, in part, because video feeds enable people to interact in real time.
[0003] There are, however, limitations to videoconferencing in the prior art, in particular to how the participants in a videoconference are able to share information with one another.
Summary of the Invention
[0004] The present invention enables the spatial marking on a smartphone display, or on that of a different type of telecommunication endpoint, of an object in an image that can be received from another endpoint. The image can be one in a series of images being captured by a camera or other device at the other endpoint and shared by that endpoint. The displaying of the marker in not only the captured image, but in subsequent images, is also enabled. The position of the marker being displayed in relation to the marked object is maintained in the series of images and regardless of movement of the device that is capturing the images. The displaying of the marker occurs not only at the endpoint at which the marking occurred - the marker having been created by the endpoint user via a touchscreen or other suitable device - but also at other endpoints engaged in a videoconference and sharing the images, including the endpoint capturing the images being marked.
[0005] In accordance with the telecommunications system disclosed herein, a first telecommunication endpoint continually captures images of a scene and processes those images, as part of a video stream. Each image is captured in the current spatial frame of reference in which the capturing device is operating, such as a camera that is part of the first endpoint. The first endpoint transmits continually video frame representations of one or more images, including that of a first image, to at least a second telecommunication endpoint, along with frame identifications (IDs) of the video frames. The first endpoint also transmits continually depth maps of the one or more images to a server computer, either cloud-based or otherwise, along with the frame IDs of the corresponding video frames. The frame IDs enable the server computer and endpoints to correlate the different information that these devices receive from different sources, as explained below.
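As a non-authoritative illustration of the frame-ID tagging just described, the following minimal Python sketch shows how a capturing endpoint might stamp a video frame and its depth map with one shared frame ID so that the server computer and the other endpoints can correlate them later. The class and function names (VideoFrameMessage, DepthMapMessage, package_capture) are hypothetical and not taken from the specification.

```python
# Minimal sketch (not the patent's implementation) of tagging both
# representations of one captured image with the same frame ID.
from dataclasses import dataclass
from itertools import count
from typing import List, Tuple

@dataclass
class VideoFrameMessage:          # sent to the other endpoints (cf. message 302)
    frame_id: int
    pixels: bytes                 # encoded video frame

@dataclass
class DepthMapMessage:            # sent to the server computer (cf. message 303)
    frame_id: int
    z_depths: List[int]           # per-pixel depth values

_next_frame_id = count(1)

def package_capture(pixels: bytes,
                    z_depths: List[int]) -> Tuple[VideoFrameMessage, DepthMapMessage]:
    """Tag both representations of one captured image with the same frame ID."""
    frame_id = next(_next_frame_id)
    return VideoFrameMessage(frame_id, pixels), DepthMapMessage(frame_id, z_depths)
```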
[0006] Meanwhile, the second endpoint processes the video frames received from the first endpoint and displays the images represented in the video frames, including the first image. The user of the second endpoint can add a marker to the first image in order to identify an object in the images, and the second endpoint can generate and transmit a representation, such as a set of coordinates, of the created marker to the server computer. In other words, a marker can be created at an endpoint that is different from the endpoint capturing the images in the video stream, although the user of the image-capturing first endpoint may mark an image as well and share the marker in the image with the server computer. The marking endpoint transmits coordinates of the marker to the server computer, along with a frame ID that corresponds to the video frame image on which the user created the marker.
[0007] The server computer uses the received coordinates of a marker and at least the depth map that corresponds to the frame ID of the marker coordinates, in order to detect an object cluster corresponding to the object marked on the display by the user. The server computer can also determine one or more differences between the frames of reference of the first image and a second image, including differences in the spatial dimension of depth. In this regard, the inventor had the insight that the depth maps themselves could be used for pattern matching, in order to determine a difference between the two images, and that the video frames were not required for pattern matching. At least some differences between frames of reference of different images are presumably attributable to movement of the first endpoint's camera - and, therefore, to movement of the first endpoint itself - from one position to another. Such changes in positions can be attributed to translational movement of the camera or rotational movement, or both. The server computer transmits the coordinates of the detected object cluster, which are updated to account for any camera movement, to the telecommunication endpoints, along with the frame ID of the corresponding video frame.
[0008] The first telecommunication endpoint displays a second image of the video frame that corresponds to the frame ID that it received from the server computer, but with the marker superimposed. The endpoint superimposes the marker based on the updated coordinates of the object cluster that correspond to the received frame ID.
[0009] As a non-limiting example, a technical support usage scenario can be envisioned that involves a first endpoint user, such as a technician, who is standing with a smartphone in an office room and using the smartphone's camera to share video images with other users at other endpoints. Also in the example, a second endpoint user, such as an office or building manager who is at a remote location, is looking on a display at the video images being transmitted by the first endpoint, and is marking one or more objects in the images. As those who are skilled in the art will appreciate after reading this
specification, however, the system and method disclosed herein can also be applied to usage scenarios other than tech support, such as, while not being limited to, maintenance, education, medicine, criminal investigation, combatting terrorism, shopping, booking of travel and lodging, and so on.
[0010] Advantageously, the marker is displayed such that its relative position and orientation in relation to the marked object is maintained in not only the second image, but also in subsequent images and despite movement of the camera or other device used to capture the images at the first endpoint. Furthermore, after having been out of the camera's view, the marker returns into view when the marked object is brought back into the camera's view. As an example of this, the second endpoint uses a marker to mark an object in an image during a video session. After some time has passed (e.g., a few seconds, a few minutes, etc.), the first endpoint's camera moves and, consequently, the object that was marked is no longer in the image seen by the camera. Because the previously marked object is not present in the camera's view at this time, neither the object nor the marker is displayed at this time. Subsequently during the video session, if the camera position changes and, as a result, the previously marked object is brought back into the camera's view, the marker returns to the displayed view, along with the object, and is displayed such that its relative position and orientation in relation to the object is reflected in the displayed view. These features related to a marker enduring changes in a camera's view over time represent technical improvements over the prior art.
[0011] A broader impact of the aforementioned technical improvements over the prior art is that a new aspect is added to existing videoconferencing by enabling perception and engagement in three-dimensional space, thereby enhancing the video sharing experience.
[0012] In accordance with the illustrative embodiment, and as summarized above, a server computer processes the coordinates of a marker along with one or more depth maps, in order to detect an object cluster that corresponds to an object on a display marked by a user. As those who are skilled in the art will appreciate, after reading this specification, a different data-processing system can perform one or more of the actions that are disclosed herein as being performed by the server computer, such as a telecommunication endpoint.
[0013] An illustrative data-processing system for processing, in multi-dimensional space, a marker on an image comprises: a receiver configured to: a) receive a first depth map of a first video frame and a first frame identification (ID) of the first video frame, wherein the first video frame is of a first image of a scene and captured in a first frame of reference, and b) receive coordinates of a first marker, wherein the coordinates of the first marker correspond to the first video frame; a processor configured to: a) detect an object cluster based on the first depth map and the coordinates of the first marker, and b) generate a first set of coordinates of the object cluster; and a transmitter configured to transmit the first set of coordinates of the object cluster and the first frame ID, to at least one telecommunication endpoint.
[0014] An illustrative method for processing, in multi-dimensional space, a marker on an image comprises: receiving, by a data-processing system, a first depth map of a first video frame and a first frame identification (ID) of the first video frame, wherein the first video frame is of a first image of a scene and captured in a first frame of reference;
receiving, by the data-processing system, coordinates of a first marker, wherein the coordinates of the first marker correspond to the first video frame; detecting, by the data- processing system, an object cluster based on the first depth map and the coordinates of the first marker; generating, by the data-processing system, a first set of coordinates of the object cluster; and transmitting, by the data-processing system, the first set of coordinates of the object cluster and the first frame ID, to at least one telecommunication endpoint.
[0015] An illustrative telecommunication system for processing, in multi-dimensional space, a marker on an image comprising: i) a first telecommunication endpoint configured to: a) capture a first image of a scene and in a first frame of reference, b) receive a first set of coordinates of an object cluster and a first frame ID, c) superimpose a second marker on the first image, based on the first set of coordinates of the object cluster and the first frame ID, and d) display the first image with the second marker superimposed; and ii) a data-processing system configured to: a) receive coordinates of a first marker, wherein the coordinates of the first marker correspond to the first video frame, b) detect the object cluster based on a first depth map and the coordinates of the first marker, wherein the first depth map is of a first video frame of the first image, c) generate the first set of coordinates of the object cluster, and d) transmit the first set of coordinates of the object cluster and the first frame ID, to at least the first telecommunication endpoint.
Brief Description of the Drawings
[0016] Figure 1 depicts a schematic diagram of telecommunication system 100, in accordance with the illustrative embodiment of the present disclosure.
[0017] Figures 2A and 2B depict the salient components of telecommunication endpoint 101-m and server computer 103, respectively, within telecommunication system 100.
[0018] Figure 3 depicts message flow diagram 300 associated with the spatial marking of a portion of an image, in accordance with the illustrative embodiment of the present disclosure.
[0019] Figure 4 depicts a flowchart of operation 301 associated with capturing and processing one or more images.
[0020] Figure 5 depicts a flowchart of operation 403 associated with generating a video frame and depth map representations of an image.
[0021] Figure 6 depicts a flowchart of operation 305 associated with endpoint 101-2 processing and displaying a video frame representation of an image.
[0022] Figure 7 depicts a flowchart of operation 307 associated with endpoint 101-2 adding one or more markers to a particular image, including adding markers to an object in the image.
[0023] Figure 8 depicts a flowchart of operation 311 associated with server computer 103 processing, for a given frame ID, at least one depth map and a marker.
[0024] Figure 9 depicts a flowchart of operation 813 associated with server computer 103 determining one or more differences between two frames of reference: one for a first image and the other for a second image.
[0025] Figure 10 depicts a flowchart of operation 315 associated with endpoint 101-1 displaying a second image with a marker or markers superimposed on the image.
[0026] Figures 11A through 11E depict scene 1100 and corresponding images 1101 through 1104.
Detailed Description
[0027] Based on - For the purposes of this specification, the phrase "based on" is defined as "being dependent on" in contrast to "being independent of". The value of Y is dependent on the value of X when the value of Y is different for two or more values of X. The value of Y is independent of the value of X when the value of Y is the same for all values of X. Being "based on" includes both functions and relations.
[0028] Capture - For the purposes of this specification, the infinitive "to capture" and its inflected forms (e.g., "capturing", "captured", etc.) should be given the ordinary and customary meaning that the terms would have to a person of ordinary skill in the art at the time of the invention.
[0029] Coordinate system - For the purposes of this specification, a "coordinate system" is defined as a system that uses one or more numbers, or coordinates, to uniquely determine the position of a point in a space.
[0030] Depth map - For the purposes of this specification, a "depth map" is defined as an image or image channel that contains information relating to the distance of the surfaces of scene objects from a viewpoint.
[0031] Frame of reference - For the purposes of this specification, a "frame of reference" is defined as a system of geometric axes in relation to which measurements of size, position, or motion can be made.
[0032] Generate - For the purposes of this specification, the infinitive "to generate" and its inflected forms (e.g., "generating", "generated", etc.) should be given the ordinary and customary meaning that the terms would have to a person of ordinary skill in the art at the time of the invention.
[0033] Image - For the purposes of this specification, an "image" is defined as a visual impression obtained by a camera or other device.
[0034] Marker - For the purposes of this specification, a "marker" is defined as something that shows the presence or existence of something.
[0035] Matrix - For the purposes of this specification, a "matrix" is defined as a rectangular array of quantities or expressions in rows and columns that is treated as a single entity and manipulated according to particular rules.
[0036] Present - For the purposes of this specification, the infinitive "to present" and its inflected forms (e.g., "presenting", "presented", etc.) should be given the ordinary and customary meaning that the terms would have to a person of ordinary skill in the art at the time of the invention.
[0037] Receive - For the purposes of this specification, the infinitive "to receive" and its inflected forms (e.g., "receiving", "received", etc.) should be given the ordinary and customary meaning that the terms would have to a person of ordinary skill in the art at the time of the invention.
[0038] Representation - For the purposes of this specification, a "representation" is defined as something that stands for something else.
[0039] Scene - For the purposes of this specification, a "scene" is defined as something seen by a viewer; a view or prospect.
[0040] Store - For the purposes of this specification, the infinitive "to store" and its inflected forms (e.g., "storing", "stored", etc.) should be given the ordinary and customary meaning that the terms would have to a person of ordinary skill in the art at the time of the invention.
[0041] Superimpose - For the purposes of this specification, the infinitive "to superimpose" and its inflected forms (e.g., "superimposing", "superimposed", etc.) should be given the ordinary and customary meaning that the terms would have to a person of ordinary skill in the art at the time of the invention.
[0042] Transmit - For the purposes of this specification, the infinitive "to transmit" and its inflected forms (e.g., "transmitting", "transmitted", etc.) should be given the ordinary and customary meaning that the terms would have to a person of ordinary skill in the art at the time of the invention.
[0043] Figure 1 depicts a schematic diagram of telecommunication system 100, in accordance with the illustrative embodiment of the present disclosure. System 100 comprises telecommunication endpoints 101-1 through 101-M, telecommunication network 102, and server computer 103, interconnected as shown. M is a positive integer that has a value of 2 as depicted; however, as those who are skilled in the art will appreciate after reading this specification, M can have a different value (i.e., there can be a different number of endpoints present and interacting with one another).
[0044] Each telecommunication endpoint 101-m, wherein m can have a value of between 1 and M, is a user device that enables its user (e.g., human, machine, etc.) to telecommunicate with other endpoints, and/or with other resources within telecommunications system 100. Each endpoint can be mobile or immobile. An endpoint can be a wireless terminal, a cellular telephone or cellphone, a wireless transmit/receive unit (WTRU), a user equipment (UE), a mobile station, a fixed or mobile subscriber unit, a pager, a personal digital assistant (PDA), a smartphone, a tablet, a phablet, a smart watch, a (hands-free) wearable device, a desk set, a computer, or any other type of end-user device capable of operating in a telecommunication environment, for example and without limitation. The salient components of endpoint 101-m are described below and in Figure 2A.
[0045] Endpoint 101-m is capable of providing access to its user via at least one network, in this case network 102. In accordance with the illustrative embodiment, endpoint 101-m is capable of communicating via a local area network (LAN) within telecommunication network 102 (e.g., in accordance with the WiFi standard, etc.). In some embodiments of the present disclosure, endpoint 101-m is capable of communication via a cellular access network. In some alternative embodiments of the present disclosure, endpoint 101-m is capable of communicating in accordance with one or more other standards such as the following telecommunications standards, without limitation: IEEE 802.16 WiMax, Bluetooth, LoRa, Global System for Mobile Communications (GSM), Universal Mobile Telecommunications System (UMTS), Long Term Evolution (LTE), CDMA-2000, IS-136 TDMA, IS-95 CDMA, 3G Wideband CDMA, and so on.
[0046] Endpoint 101-m is capable of storing and executing one or more software applications or "apps". For example and without limitation, a video display app enables the endpoint, and thus its user, to view one or more images that constitute a video stream. In addition, such an app enables the endpoint's user to mark one or more of the displayed images via a touch screen, as described below. [0047] Telecommunication network 102 is a network that provides
telecommunications access and connectivity to the depicted endpoints. Network 102 comprises computer- and/or telecommunications-networking devices, which can include gateways, routers, network bridges, switches, hubs, and repeaters, as well as other related devices. Network 102 is managed by one or more service providers or operators, and provides bandwidth for various telecommunication services and network access to telecommunication endpoints in one or more communications service provider (CSP) networks and/or one or more enterprise networks. One of the services that can be provided by network 102 is conferencing, including audio, web, and/or videoconferencing. In order to facilitate the call processing that is associated with videoconferencing, network 102 comprises computer servers, which process appropriate protocols (e.g., TURN, etc.) for media and handle call signaling (e.g., WebRTC, etc.) for the setup and teardown of calls.
[0048] Server computer 103 is configured to perform at least some of the actions described below and in the figures, including the detecting of object clusters and the generating of one or more sets of coordinates of the object clusters detected. In some embodiments of the present disclosure, server computer 103 is cloud-based. The salient components of server computer 103 are described below and in Figure 2B.
[0049] Figure 2A depicts the salient components of telecommunication
endpoint 101-m according to the illustrative embodiment of the present disclosure.
Telecommunication endpoint 101-m is based on a data-processing apparatus whose hardware platform comprises: camera 201, touchscreen 202, keyboard 203, processor 204, memory 205, display 206, and network interface 207, interconnected as shown.
[0050] Camera 201, touchscreen 202, and keyboard 203 are input devices and are known in the art. In regard to the illustrative embodiment, camera 201 can be used to capture one or more images of a scene. Touchscreen 202 or keyboard 203, or both, can be used by a user of the endpoint to create one or more markers associated with an object being displayed on display 206, or in general to create one or more markers somewhere on a particular image. As those who are skilled in the art will appreciate after reading this specification, endpoint 101-m can have a different set of input devices for the purposes of capturing one or more images and/or entering one or more markers, in some alternative embodiments of the present disclosure.
[0051] Processor 204 is hardware or hardware and software that perform
mathematical and/or logical operations, such as a microprocessor as is known in the art. Processor 204 is configured such that, when operating in conjunction with the other components of endpoint 101-m, the processor executes software, processes data, and telecommunicates according to the operations described herein. Processor 204 can be one or more computational elements.
[0052] Computer memory 205 is non-transitory and non-volatile computer storage memory technology as is known in the art (e.g., flash memory, etc.). Memory 205 is configured to store an operating system, application software, and a database. The operating system is a collection of software that manages, in well-known fashion, telecommunication endpoint 101-m's hardware resources and provides common services for computer programs, such as those that constitute the application software. The application software that is executed by processor 204 according to the illustrative embodiment enables telecommunication endpoint 101-m to perform the functions disclosed herein. The database is used to store, among other things, various representations of video frames in various frames of reference, along with the corresponding frame IDs, as described below.
[0053] Display 206 is an output device used for presenting various captured images that are part of a video stream, both with and without markers being superimposed on the captured images. In at least some embodiments of the present disclosure, touchscreen 202 and display 206 occupy at least some of the same physical space and are integrated into the same physical device or unit.
[0054] Network interface 207 is configured to enable telecommunication
endpoint 101-m to telecommunicate with other devices and systems, by receiving signals therefrom and/or transmitting signals thereto via receiver 221 and transmitter 222, respectively. For example, network interface 207 enables its telecommunication endpoint to communicate with one or more other devices, via network 102. Network interface 207 communicates within a local area network (LAN) in accordance with a LAN protocol (e.g., WiFi, etc.) or within a cellular network in accordance with a cellular protocol, or both. In some other embodiments, network interface 207 communicates via one or more other radio telecommunications protocols or via a wireline protocol.
[0055] Receiver 221 is a component that enables telecommunication endpoint 101-m to telecommunicate with other components and systems by receiving signals that convey information therefrom. It will be clear to those having ordinary skill in the art how to make and use alternative embodiments that comprise more than one receiver 221. [0056] Transmitter 222 is a component that enables telecommunication endpoint 101-m to telecommunicate with other components and systems by transmitting signals that convey information thereto. It will be clear to those having ordinary skill in the art how to make and use alternative embodiments that comprise more than one
transmitter 222.
[0057] Gyro 208 and accelerometer 209 are sensors configured to detect rotational movement and translational movement of endpoint 101-m, respectively.
[0058] Figure 2B depicts the salient components of server computer 103 according to the illustrative embodiment of the present disclosure. Server computer 103 is based on a data-processing apparatus whose hardware platform comprises: processor 234,
memory 235, and network interface 237, interconnected as shown.
[0059] Processor 234 is hardware or hardware and software that perform
mathematical and/or logical operations, such as a microprocessor as is known in the art. Processor 234 is configured such that, when operating in conjunction with the other components of server computer 103, the processor executes software, processes data, and telecommunicates according to the operations described herein. Processor 234 can be one or more computational elements.
[0060] Computer memory 235 is non-transitory and non-volatile computer storage memory technology as is known in the art (e.g., flash memory, etc.). Memory 235 is configured to store an operating system, application software, and a database. The operating system is a collection of software that manages, in well-known fashion, server computer 103's hardware resources and provides common services for computer programs, such as those that constitute the application software. The application software that is executed by processor 234 according to the illustrative embodiment enables server computer 103 to perform the functions disclosed herein. The database is used to store, among other things, various representations of depth maps and markers in various frames of reference, along with the corresponding frame IDs, as described below.
[0061] Network interface 237 is configured to enable server computer 103 to telecommunicate with other devices and systems, by receiving signals therefrom and/or transmitting signals thereto via receiver 251 and transmitter 252, respectively. For example, network interface 237 enables its server computer to communicate with one or more other devices, via network 102. Network interface 237 communicates within a local area network (LAN) in accordance with a LAN protocol (e.g., WiFi, etc.) or within a cellular network in accordance with a cellular protocol, or both. In some other embodiments, network interface 237 communicates via one or more other radio telecommunications protocols or via a wireline protocol.
[0062] Receiver 251 is a component that enables server computer 103 to
telecommunicate with other components and systems by receiving signals that convey information therefrom. It will be clear to those having ordinary skill in the art how to make and use alternative embodiments that comprise more than one receiver 251.
[0063] Transmitter 252 is a component that enables server computer 103 to telecommunicate with other components and systems by transmitting signals that convey information thereto. It will be clear to those having ordinary skill in the art how to make and use alternative embodiments that comprise more than one transmitter 252.
[0064] Figures 3 through 9 depict message flow diagrams and flow charts that represent at least some of the salient, operational logic of one or more telecommunication endpoints 101-1 through 101-M and server computer 103, in accordance with the illustrative embodiment of the present disclosure.
[0065] In regard to the methods represented by the disclosed operations and messages, it will be clear to those having ordinary skill in the art, after reading the present disclosure, how to make and use alternative embodiments of the disclosed methods wherein the recited operations, sub-operations, and messages are differently sequenced, grouped, or sub-divided - all within the scope of the present disclosure. It will be further clear to those skilled in the art, after reading the present disclosure, how to make and use alternative embodiments of the disclosed methods wherein some of the described operations, sub-operations, and messages are optional, are omitted, or are performed by other elements and/or systems.
[0066] For example and without limitation, endpoints 101-1 and 101-2 can handle a different division of processing than described below. Similarly, at least some of the processing described below can be handled by a different data-processing system entirely, such as one or more server computers within telecommunication network 102.
[0067] In accordance with the illustrative embodiment, endpoints 101-1 and 101-2 and server computer 103 operate using a multidimensional, Cartesian coordinate system (e.g., "xyz" coordinates, etc.) and on data coordinates specified with respect to a frame of reference and in terms of such a coordinate system. As those who are skilled in the art will appreciate after reading this specification, endpoints 101-1 and 101-2 can operate using a type of coordinate system (e.g., polar, spherical, cylindrical, etc.) different than Cartesian.
[0068] Figure 3 depicts message flow 300 associated with the spatial marking of a portion of an image, such as the marking of one or more objects displayed in the image, in accordance with the illustrative embodiment of the present disclosure. Although message flow 300 features the marking of one or more objects, the techniques disclosed in this specification can be used in general to mark a portion of an image, regardless of whether an object is conspicuously present within or designated by the created marker, or not.
Furthermore, although a single, marked object is featured in the examples below, as those who are skilled in the art will appreciate after reading this specification, more than one object, or more than one marker, or both, can be created and added to one or more images being displayed.
[0069] As an example, a technical support usage scenario can be envisioned that involves scene 1100 depicted in Figure 11A, which is that of an office room with three video monitors on a table, including leftmost monitor 1121 and rightmost monitor 1122. In the example, a first endpoint user, such as a technician, is standing in the office room with a smartphone and is using the smartphone's camera (i.e., endpoint 101-1 comprising camera 201) to share video images with other endpoints. A second endpoint user, such as an office or building manager who is currently at a remote location, is looking on a display (i.e., at endpoint 101-2 comprising display 206) at the video images being transmitted by the first endpoint, and is marking one or more objects in the images.
[0070] In accordance with operation 301, telecommunication endpoint 101-1 continually captures images of a scene, such as scene 1100, and processes those images. Operation 301 is described in detail below and in Figure 4. Each image is captured in the current frame of reference that the capturing device (e.g., camera 201, etc.) is in. Endpoint 101-1 transmits continually, via a sequence of messages that comprise both video frames of images and a frame identification (ID) for each video frame, representations of one or more images to telecommunication endpoint 101-2, as well as possibly to other endpoints. For instance, endpoint 101-1 transmits, via message 302, a representation (i.e., video frame with frame ID) of a first image such as image 1101 in Figure 11B and, via subsequent messages, representations of subsequent images such as image 1103 in Figure 11D and image 1104 in Figure 11E. [0071] Endpoint 101-1 also transmits continually, via a sequence of messages that comprise both depth maps of images and a frame ID for each depth map, representations of one or more images to server computer 103. For instance, endpoint 101-1 transmits, via message 303, a representation (i.e., depth map with frame ID) of a first image such as image 1101 in Figure 11B and, via subsequent messages, representations of subsequent images such as image 1103 in Figure 11D and image 1104 in Figure 11E. Endpoint 101-1 also transmits, to server computer 103, information that characterizes movement of its camera 201 via one or more messages 304. In accordance with the illustrative
embodiment, such information can comprise at least one of accelerometer values and gyroscope values generated by endpoint 101-1; however, as those who are skilled in the art will appreciate after reading this specification, other types of inertial motion information can be generated and sent by the endpoint.
[0072] Telecommunication endpoint 101-2, and possibly other endpoints, meanwhile processes and displays the one or more images received from endpoint 101-1. In accordance with operation 305, telecommunication endpoint 101-2 processes and displays the first image 1101 in Figure 11B, based on the video frame representation received in message 302. Operation 305 is described below and in Figure 6.
[0073] In accordance with operation 307, telecommunication endpoint 101-2 adds markers to the first image, such as marker 1111 in Figure 11C, resulting in marked image 1102. In accordance with the illustrative embodiment, a marker can be used to identify an object in the image. Endpoint 101-2 transmits via message 309 a
representation of the created marker to server computer 103, such as i) coordinates that correspond to the pixels where the user of endpoint 101-2 marked on the video frame and ii) the corresponding frame ID. Operation 307 is described below and in Figure 7.
[0074] In accordance with operation 311, server computer 103 generates one or more sets of coordinates of an object cluster associated with an object identified in the image in accordance with operation 307. Operation 311 is described in detail below and in Figure 8. In accordance with operation 311, server computer 103 can also update coordinates of an object cluster, in part by determining a difference between frames of reference of a first image and a second image; within this context an example of a first image is image 1101 in Figure 11B and an example of a second image is image 1103 or image 1104 in Figures 11D or 11E, respectively. At least some differences between frames of reference of different images are presumably attributable to camera 201 - and, therefore, to endpoint 101-1 itself - being moved from one position to another. Such changes in positions can be attributed to translational movement of the camera or rotational movement, or both.
[0075] Server computer 103 can use the depth maps received from endpoint 101-1 and marker coordinates received from endpoint 101-2, as well as the respective frame IDs received from both endpoints, in order to detect the object cluster and to update the coordinates of the object cluster. Server computer 103 then transmits the object cluster coordinates to endpoints 101-1 and 101-2 via messages 313 and 314, respectively, along with the applicable frame ID.
[0076] In accordance with operation 315, telecommunication endpoint 101-1 displays a second image with the markers superimposed, wherein the markers are superimposed based at least in part on the object cluster coordinates determined in accordance with operation 311. For example, endpoint 101-1 displays image 1104 with superimposed marker 1112 in Figure 11E. Operation 315 is described below and in Figure 10.
[0077] Telecommunication system 100 ensures proper coordination of the various shared representations through the use of the unique frame IDs, including synchronizing the video frames across the endpoints and the superimposing of markers on those video frames. That is, a representation tagged with a first frame ID corresponds to a first frame of reference, a representation tagged with a second frame ID corresponds to a second frame of reference, and so on. For example, by using the frame ID received with a depth map from endpoint 101-1 and the frame ID received with marker coordinates from endpoint 101-2, server computer 103 knows which depth map representation (from endpoint 101-1) to match with which marker coordinate representation (from endpoint 101-2), in order to detect an object cluster.
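The following sketch illustrates, under assumed names, the frame-ID bookkeeping described above: the server keeps incoming depth maps keyed by frame ID and, when marker coordinates arrive tagged with a frame ID, pairs them with the matching depth map. The class and method names are illustrative, not the patent's.

```python
# Hypothetical sketch of correlating depth maps and marker coordinates by frame ID.
from typing import Dict, List, Optional, Tuple

class FrameCorrelator:
    def __init__(self) -> None:
        self._depth_maps: Dict[int, List[int]] = {}   # frame_id -> depth map

    def on_depth_map(self, frame_id: int, depth_map: List[int]) -> None:
        self._depth_maps[frame_id] = depth_map

    def on_marker(self, frame_id: int,
                  marker_coords: List[Tuple[int, int]]) -> Optional[tuple]:
        """Return (depth_map, marker_coords) if the tagged frame has been received."""
        depth_map = self._depth_maps.get(frame_id)
        if depth_map is None:
            return None                                # frame not received yet
        return depth_map, marker_coords
```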
[0078] Although message flow 300 depicts a single iteration of processing image representations and marker representations, telecommunication endpoints 101-1 and 101-2 can continue to process additional images and markers in a manner similar to that described above. In addition, either or both endpoints 101-1 and 101-2 can record some or all of the image representations and corresponding marker representations, and play back said representations such that the markers are displayed in a coordinated manner with the corresponding images, based in part on the frame IDs. [0079] Message flow 300 depicts endpoint 101-1 as capturing the images and endpoint 101-2 as adding the one or more markers to an image. As those who are skilled in the art will appreciate after reading this specification, a different combination of image- capturing endpoint and marker-adding endpoint can be provided. Furthermore, more than one endpoint can add markers to the same image, or to different images captured as part of the same video stream, to the same object in one or more images, or to different objects. For example, the user of endpoint 101-1 or the user of a third endpoint, or both, can add markers to one or more images, in addition to or instead of the user of endpoint 101-2.
[0080] Figure 4 depicts a flowchart of operation 301 associated with capturing and processing one or more images. In accordance with operation 401, endpoint 101-1 captures an image and stores it into its computer memory. In accordance with the illustrative embodiment, camera 201 captures the image in its current frame of reference (i.e., that at which the image is captured) and tags it with a unique frame ID. As those who are skilled in the art will appreciate after reading this specification, however, a different device can be used to capture the image or endpoint 101-1 can receive a representation of the captured image from an external source (e.g., endpoint 101-2, etc.), wherein the frame of reference of the representation is known and made available.
[0081] In accordance with operation 403, endpoint 101-1 generates a video frame representation of the image. Endpoint 101-1 also generates a depth-map representation of the image, including z-depth information. Operation 403 is described below and in
Figure 5.
[0082] In accordance with operation 405, endpoint 101-1 transmits the video representation of the image with frame ID, including depth information, to endpoint 101-2 via message 302. Endpoint 101-1 transmits the depth-map representation of the image with frame ID to server computer 103 via message 303. Endpoint 101-1 transmits camera movement information (e.g., accelerometer values, gyroscope values, etc.) to server computer 103 via message 304.
[0083] Figure 5 depicts a flowchart of operation 403 associated with generating a representation of an image. In accordance with operation 501, endpoint 101-1 creates a two-dimensional visual representation of the image (i.e., height and width), thereby generating a video frame of the image. Endpoint 101-1 creates the representation according to brightness values for pixels, both initially and regularly afterwards. In other frames, only the changes in the pixel values are included in the representation and transmitted to endpoint 101-2.
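A minimal sketch of the update scheme just described follows: a full set of pixel values is sent for an initial (or periodic) frame, and only the changed pixels are sent for other frames. The function name and the use of numpy arrays are assumptions made for brevity.

```python
# Illustrative sketch only: full frame initially, pixel deltas afterwards.
from typing import Optional
import numpy as np

def frame_update(prev: Optional[np.ndarray], curr: np.ndarray) -> dict:
    """Return a full frame, or a list of (row, col, value) changes."""
    if prev is None:
        return {"full": curr.copy()}                    # initial or periodic keyframe
    changed = np.argwhere(prev != curr)                 # coordinates of changed pixels
    deltas = [(int(r), int(c), int(curr[r, c])) for r, c in changed]
    return {"deltas": deltas}
```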
[0084] In accordance with operation 503, endpoint 101-1 generates maps in YUV color space for the corresponding image. As those who are skilled in the art will appreciate after reading this specification, the endpoints and server computer can operate alternatively in a different color space than YUV (e.g., RGB, etc.) in generating and otherwise processing the various representations disclosed herein. Endpoint 101-1 uses minor focus and defocus to provide two streams of YUV maps with two different predefined focal points.
[0085] In accordance with operation 505, endpoint 101-1 applies a gray scale.
Endpoint 101-1 compares the two data streams for the two focal points in gray scale.
[0086] In accordance with operation 507, endpoint 101-1 maps common points based on which lengths are checked and differences are stored. A matrix of the length differences results in z-depth position for each common cluster. The "z-depth" refers to the distance of the surfaces of scene objects from a viewpoint in the image field; it can be calculated for one or more points on the surfaces of the scene objects (i.e., on a pixel-by- pixel basis). The "z" in z-depth relates to a convention that the central axis of view of a camera is in the direction of the camera's z-axis, and not to the absolute z-axis of a scene.
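A highly simplified, non-authoritative sketch of the idea in the paragraph above is given below: common points matched between the two focal streams yield length differences, and a hypothetical monotonic mapping turns each difference into a relative z-depth. The real transformation would depend on the optics and calibration, which are not specified here.

```python
# Assumed-calibration sketch: length differences between matched points in the
# two focal streams mapped to relative z-depths in [0, 1].
import numpy as np

def relative_z_depths(pts_a: np.ndarray, pts_b: np.ndarray) -> np.ndarray:
    """pts_a, pts_b: (N, 2) arrays of the same common points in the two streams."""
    lengths_a = np.linalg.norm(np.diff(pts_a, axis=0), axis=1)   # segment lengths, stream A
    lengths_b = np.linalg.norm(np.diff(pts_b, axis=0), axis=1)   # segment lengths, stream B
    diffs = np.abs(lengths_a - lengths_b)                        # length differences
    return diffs / (diffs.max() + 1e-9)                          # assumed monotonic map to [0, 1]
```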
[0087] In accordance with operation 509, and based on one or more of the foregoing operations in Figure 5, endpoint 101-1 creates a depth-map representation of the image, in the form of a transcoded image. In some embodiments of the present disclosure, endpoint 101-1 represents the z-depths in three-bit format, computed per pixel. Endpoint 101-1 creates the representation according to depth values for pixels, both initially and regularly afterwards. In other frames, only the changes in the pixel values are included in the representation and transmitted to server computer 103.
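One way the three-bit-per-pixel representation contemplated above could be produced is sketched below; the value range and scaling are assumptions made purely for illustration.

```python
# Sketch: quantize each pixel's z-depth to one of 8 levels (3 bits).
import numpy as np

def transcode_depth(z_depth: np.ndarray, max_depth_m: float = 10.0) -> np.ndarray:
    """Quantize per-pixel z-depth (assumed in meters) to levels 0..7."""
    clipped = np.clip(z_depth, 0.0, max_depth_m)
    levels = np.round(clipped / max_depth_m * 7).astype(np.uint8)
    return levels
```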
[0088] Figure 6 depicts a flowchart of operation 305 associated with endpoint 101-2 processing and displaying one or more representations of an image. In accordance with operation 601, endpoint 101-2 receives a representation of a particular image (e.g., "first image", etc.) in a particular video image frame, via message 302.
[0089] In accordance with operation 603, endpoint 101-2 processes the received representation for the purpose of displaying the image for its user. Endpoint 101-2 can use one or more received representations to construct the current image to be displayed, based on a complete representation with all of the pixel values and subsequent updates based on the pixels that have changed. [0090] In accordance with operation 605, endpoint 101-2 presents the image via its display to its user.
[0091] Figure 7 depicts a flowchart of operation 307 associated with endpoint 101-2 adding one or more markers to a particular image, including adding markers to an object in the image. In accordance with operation 701, endpoint 101-2 detects markers being added to the particular image being displayed in accordance with operation 605. As can be seen for example in Figure 11C, marker 1111 is being used to identify or designate the leftmost video monitor 1121 in scene 1100. Endpoint 101-2 can detect swipes being made by a user to touchscreen 202 or key selections being made to keyboard 203, wherein the swipes, the key selections, or a different type of user action correspond to the adding of a marker to a portion of the particular image being displayed. The user can create a marker in the form of a circle, square, tick mark, text, number, or any other symbol the user wants to use.
[0092] In accordance with operation 703, endpoint 101-2 generates a representation of the markers being created by the user. In accordance with the illustrative embodiment, the representation of a marker can be in the form of coordinates, wherein the frame of reference of the coordinates corresponds to that of the frame ID of the particular video frame on which the user is adding the marker.
[0093] For example and without limitation, the set of coordinates making up the representation can correspond to one or more features of a marker, such as one or more pixel points along the marker on the display, one or more vertices of the marker
approximated as a polygon (i.e., a "marker polygon"), one or more edges of said polygon, an approximated center of the marker or of said polygon, and so on.
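A minimal sketch of turning a user's touch trace into the kinds of marker coordinates mentioned above follows: the traced pixel points, an approximating bounding "marker polygon", and an approximate center, all tagged with the frame ID. The function and field names are illustrative.

```python
# Hypothetical marker representation built from a touch trace.
from typing import List, Tuple

def marker_representation(trace: List[Tuple[int, int]], frame_id: int) -> dict:
    xs = [p[0] for p in trace]
    ys = [p[1] for p in trace]
    polygon = [(min(xs), min(ys)), (max(xs), min(ys)),
               (max(xs), max(ys)), (min(xs), max(ys))]      # axis-aligned approximation
    center = (sum(xs) // len(xs), sum(ys) // len(ys))
    return {"frame_id": frame_id, "points": trace,
            "polygon": polygon, "center": center}
```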
[0094] In accordance with operation 705, endpoint 101-2 transmits the
representation of the marker or markers to server computer 103 via message 309, along with the frame ID of the video frame on which the user of endpoint 101-2 created the marker.
[0095] Figure 8 depicts a flowchart of operation 311 associated with server computer 103 generating coordinates of an object cluster and updating the object cluster coordinates as needed. In accordance with operation 801, server computer 103 receives a first depth map with first frame ID via message 303 from endpoint 101-1, in the form of a transcoded image. The depth map is of a first video frame and the first frame ID identifies the first video frame. The first video frame is of a first image of a scene and captured in a first frame of reference. Server computer 103 can also receive information that characterizes movement of camera 201 from endpoint 101-1 via message 304.
[0096] Server computer 103 can also receive subsequent information, including a second depth map with second frame ID from endpoint 101-1. The depth map is of a second video frame and the second frame ID identifies the second video frame. The second video frame is of a second image of a scene and captured in a second frame of reference.
[0097] In accordance with operation 803, server computer 103 receives coordinates of a first marker via message 309 from endpoint 101-2, along with the frame ID that corresponds to the video frame on which the user of endpoint 101-2 created the marker.
[0098] In accordance with operation 805, server computer 103 determines whether this is the first time that coordinates for the marker are being received. If this is the first time, meaning that an object cluster has not yet been detected for the marker, control of execution proceeds to operation 807. Otherwise, control of execution proceeds to operation 813. In other words, when the first set of marker coordinates is received, corresponding to the pixels where the user marked on the video frame, that set is used to process and identify the object cluster from the depth map. Then, for all subsequent frames, pattern matching alone can be used to determine the difference in the object cluster's position between the previous and next frames.
[0099] In accordance with operation 807, server computer 103 matches the marker coordinates tagged with a frame ID, and received from endpoint 101-2, with a depth map received from endpoint 101-1 and corresponding to the same frame ID.
[0100] In accordance with operation 809, server computer 103 detects an object cluster in the depth map matched in operation 807, in the region of the depth map identified by the marker coordinates received. In some embodiments of the present invention, server computer 103 uses multiple depth maps (e.g., 4-8 transcoded images, etc.) corresponding to video frames that have been already received and stored in memory, in order to detect an object cluster.
[0101] The detection of an object cluster can be based in part on one or more z-depths that are within the region defined by the marker coordinates, which z-depths are received as part of the depth map of an image. For example, a candidate cluster of z-depths that are similar in value and within the region defined by the marker coordinates can be attributed to the object; in contrast, pixels having z-depth values different from those in the candidate cluster can be ruled out as belonging to an object, certainly if they are outside the region defined by the marker coordinates. One such object cluster can coincide with a particular object marked by the user, such as the video monitor in Figure 11C.
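A simplified, non-authoritative sketch of this cluster-detection idea follows: within the region bounded by the marker coordinates, pixels whose z-depths are close to the region's median depth are treated as the object cluster. The tolerance value and function name are assumptions.

```python
# Sketch: keep similar-depth pixels inside the marker region as the object cluster.
import numpy as np

def detect_object_cluster(depth_map: np.ndarray, region: tuple, tol: float = 0.5):
    """region = (row_min, row_max, col_min, col_max) derived from the marker coordinates."""
    r0, r1, c0, c1 = region
    patch = depth_map[r0:r1, c0:c1]
    median_depth = np.median(patch)
    mask = np.abs(patch - median_depth) <= tol          # similar-depth pixels belong to the object
    rows, cols = np.nonzero(mask)
    # Return cluster coordinates (row, col, z-depth) in full-image coordinates.
    return [(int(r + r0), int(c + c0), float(patch[r, c])) for r, c in zip(rows, cols)]
```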
[0102] In accordance with operation 811, server computer 103 generates
coordinates of the detected object cluster, in well-known fashion, wherein the coordinates of the object cluster comprise a representation of depth, in addition to height and width representations within an image field.
[0103] In accordance with operation 815, server computer 103 transmits the coordinates of the object cluster to endpoints 101-1 and 101-2 via messages 313 and 314, respectively. Server computer 103 also includes the frame ID corresponding to the depth map and video frame of the marker that were matched for the purpose of detecting the object cluster. In some embodiments of the present disclosure, server computer 103 can also transmit coordinates of the marker itself (e.g., the coordinates received in accordance with operation 803) to one or both of endpoints 101-1 and 101-2.
[0104] Figure 9 depicts a flowchart of operation 813 associated with server computer 103 determining subsequent coordinates of an object cluster after it has already been detected, including compensating for camera motion. In accordance with
operation 901, server computer 103 calculates a difference between information captured by camera 201 in a second frame of reference and in a first frame of reference. The second frame of reference corresponds to a second image (e.g., image 1103, image 1104, etc.), and the first frame of reference corresponds to a first image (e.g., image 1101, etc.).
Server computer 103 calculates the difference by comparing a second depth map
corresponding to the second image to a first depth map corresponding to the first image. An example of such a comparison is pattern matching, in which a shift in the object cluster in the second image with respect to where it appeared in the first image can be attributed to movement of the object in the camera's field of view.
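An illustrative sketch of such pattern matching on the depth maps alone is given below: a small search over candidate offsets that minimizes the sum of absolute depth differences between the cluster region in the previous frame and candidate regions in the current frame. The search radius and cost function are assumptions, not the patent's algorithm.

```python
# Sketch: estimate the (row, column) shift of the cluster between two depth maps.
import numpy as np

def estimate_shift(prev_depth: np.ndarray, curr_depth: np.ndarray,
                   cluster_box: tuple, radius: int = 8) -> tuple:
    r0, r1, c0, c1 = cluster_box                       # cluster region in the previous frame
    template = prev_depth[r0:r1, c0:c1]
    best, best_cost = (0, 0), np.inf
    for dr in range(-radius, radius + 1):
        for dc in range(-radius, radius + 1):
            rr0, cc0 = r0 + dr, c0 + dc
            if (rr0 < 0 or cc0 < 0
                    or rr0 + template.shape[0] > curr_depth.shape[0]
                    or cc0 + template.shape[1] > curr_depth.shape[1]):
                continue
            candidate = curr_depth[rr0:rr0 + template.shape[0],
                                   cc0:cc0 + template.shape[1]]
            cost = np.abs(candidate - template).sum()  # sum of absolute differences
            if cost < best_cost:
                best, best_cost = (dr, dc), cost
    return best                                        # (row shift, column shift)
```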
[0105] In accordance with operation 903, server computer 103 selects reference points that define a polygon (i.e., a "reference polygon") within an image, which will be tracked across one or more subsequent images being captured by camera 201.
[0106] In accordance with operation 905, server computer 103 determines the change in length and/or area of the defined polygon. Any change in length and/or area is presumably attributable to camera 201 - and endpoint 101-1 itself - being moved from one position to another. The change in positions can be attributed to translational movement of the camera or rotational movement, or both. [0107] In accordance with operation 907, server computer 103 calculates the change in camera position based on the change in length and/or area of the polygon. As part of this operation, server computer 103 applies a bandpass filter in order to remove at least some anomalous results, if present.
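A rough sketch of the reference-polygon idea above follows: the polygon's area in two frames (shoelace formula) gives a scale ratio that hints at radial camera movement, and a simple moving average stands in for the filtering of anomalous values. All thresholds, and the substitution of a moving average for the band-pass filter, are assumptions.

```python
# Sketch: polygon area change between frames as a proxy for camera motion.
import numpy as np

def polygon_area(vertices: np.ndarray) -> float:
    x, y = vertices[:, 0], vertices[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

def scale_change(prev_vertices: np.ndarray, curr_vertices: np.ndarray) -> float:
    """>1.0 suggests the camera moved closer; <1.0 suggests it moved away."""
    return polygon_area(curr_vertices) / max(polygon_area(prev_vertices), 1e-9)

def smooth(history: list, new_value: float, window: int = 5) -> float:
    history.append(new_value)
    return float(np.mean(history[-window:]))           # crude stand-in for the filter
```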
[0108] In accordance with operation 909, server computer 103 applies the received information that characterizes movement of camera 201. In particular, the server computer can apply the gyro values sensed by gyro 208 and/or accelerometer values sensed by accelerometer 209, in order to establish a rotational and/or translational change,
respectively, between the camera position in effect when the first image was captured (i.e., in the first frame of reference) and the camera position in effect when the second image was captured (i.e., in the second frame of reference). The change in position / rotation of the camera is obtained from its gyro and accelerometer values (i.e., within endpoint 101-1) and the change in these values is correlated with the pixel positions of the object clusters. Thus, a relation is established between a movement of the camera that is tracked using the inertial motion unit at endpoint 101-1 and the movement of object clusters through the video frames after learning from multiple video frames (e.g., 20 to 40, etc.). Thenceforth, this relation can be used in conjunction with the depth map cluster data in order to tune the tracking of the marked object.
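As a non-authoritative illustration of learning such a relation from a number of frames, the sketch below fits a linear mapping from inertial readings (gyro and accelerometer deltas) to the observed pixel displacement of the tracked cluster; the least-squares model and the assumed layout of the inertial vector are stand-ins for whatever model the system actually uses.

```python
# Sketch: learn a linear relation between IMU deltas and cluster pixel motion.
import numpy as np

def fit_imu_to_pixels(imu_deltas: np.ndarray, pixel_shifts: np.ndarray) -> np.ndarray:
    """
    imu_deltas:   (N, 6) gyro + accelerometer deltas over N frames (assumed layout)
    pixel_shifts: (N, 2) observed (row, col) displacement of the cluster
    Returns a (6, 2) matrix A such that pixel_shift ~= imu_delta @ A.
    """
    A, *_ = np.linalg.lstsq(imu_deltas, pixel_shifts, rcond=None)
    return A

def predict_shift(A: np.ndarray, imu_delta: np.ndarray) -> np.ndarray:
    return imu_delta @ A        # predicted pixel displacement used to tune the tracking
```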
[0109] In accordance with operation 911, server computer 103 updates the object cluster coordinates based on having determined the difference between the first and second frames of reference being considered, including the calculated change in camera position and orientation. The server computer generates this second set of object cluster coordinates by adjusting the first set of coordinates with the difference between the frames of reference. Whatever change in the depth-map representation of second image 1104 has occurred relative to the depth-map representation of first image 1101, in terms of position and/or orientation, the same change can also apply in determining the object cluster's position and orientation within second image 1104.
[0110] Control of execution then proceeds to operation 815.
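A minimal sketch of the coordinate update performed in operation 911, as described above, is given below; for illustration, the difference between the frames of reference is reduced to a pixel shift and a depth-related scale about the cluster center, which is an assumption rather than the patent's formulation.

```python
# Sketch: adjust the first set of cluster coordinates by an estimated shift and scale.
from typing import List, Tuple

def update_cluster(coords: List[Tuple[float, float, float]],
                   shift: Tuple[float, float], scale: float):
    """coords: (row, col, z) triples; shift: (d_row, d_col); scale: size factor."""
    rows = [c[0] for c in coords]
    cols = [c[1] for c in coords]
    cr, cc = sum(rows) / len(rows), sum(cols) / len(cols)      # cluster center
    return [((r - cr) * scale + cr + shift[0],
             (c - cc) * scale + cc + shift[1],
             z / scale if scale else z) for r, c, z in coords]  # assumed depth adjustment
```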
[0111] Figure 10 depicts a flowchart of operation 315 associated with endpoint 101-1 displaying a second image with a marker or markers superimposed on the image. In accordance with operation 1001, endpoint 101-1 receives a representation of an object cluster (e.g., coordinates of the object cluster with frame ID) from server computer 103 via message 313. In some embodiments of the present disclosure, endpoint 101-1 also receives coordinates of the marker itself.
[0112] In accordance with operation 1003, endpoint 101-1 superimposes the markers on a second image, in this case image 1104, captured by camera 201. This is based on i) the first video frame representation of first image 1101 and ii) the
representation of the object cluster received via message 313 and corresponding to the first image, for the frame ID corresponding to the first image. In other words, endpoint 101-1 superimposes the marker created by a user on a displayed video frame having a particular frame ID, using the object cluster coordinates for that frame ID, wherein the marker is superimposed on the video frame having that frame ID. Endpoint 101-1 creates a marker from the object cluster coordinates, both in terms of establishing the marker's position as superimposed on the image and in terms of the size and shape of the marker. In some embodiments of the present invention, endpoint 101-1 uses received marker coordinates for a given frame ID, in order to establish the position, size, and/or shape of the marker as superimposed on the image, and then uses the object cluster coordinates to update the position, size, and/or shape of the marker as needed.
[0113] Endpoint 101-1 compensates for the position, shape, and/or size of the marker, in relation to any previously-superimposed marker, in part by considering the representation of depth in the coordinates of the object cluster. For example, the marker being superimposed on a current video frame can be reduced in size with respect to the marker that was superimposed on a previous video frame, based on the z-depth indicating the cluster being deeper in the image than before; likewise, the marker being superimposed on a current video frame can be increased in size with respect to the marker that was superimposed on a previous video frame, based on the z-depth indicating the cluster being shallower in the image than before.
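A sketch of the depth-based size compensation described above follows: the marker is scaled about the object cluster's center by the ratio of the cluster's previous z-depth to its current z-depth (a deeper object yields a smaller marker). The inverse-proportional model and the function name are assumptions for illustration.

```python
# Sketch: rescale the marker polygon according to the change in z-depth.
from typing import List, Tuple

def rescale_marker(marker: List[Tuple[float, float]], center: Tuple[float, float],
                   prev_z: float, curr_z: float) -> List[Tuple[float, float]]:
    factor = prev_z / curr_z if curr_z else 1.0        # assumed inverse-proportional scaling
    cx, cy = center
    return [((x - cx) * factor + cx, (y - cy) * factor + cy) for x, y in marker]
```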
[0114] In accordance with operation 1005, endpoint 101-1 presents second image 1104 to its user via display 206. As can be seen in Figure 11E, marker 1112 is seen superimposed on image 1104. Endpoint 101-1 can perform this by storing in a display memory a combination of i) the captured image without the marker and ii) the marker adjusted to the frame of reference, as tracked by frame ID, of the captured image to be displayed. As those who are skilled in the art will appreciate after reading this specification, a different technique of displaying the captured image and the marker can be used instead.
[0115] Because of the differences between the second and first frames of reference, in terms of where and how camera 201 has moved, marker 1112 can be different than marker 1111, even beyond their relative positions on the display. For example and without limitation, the marker that is superimposed in the second image in a second frame of reference might appear to have a different shape, owing to camera 201 moving laterally, radially, and/or circumferentially with respect to the object being marked. As those who are skilled in the art will appreciate after reading this specification, the representation of a marker generated at operation 703 can provide sufficient constituent points or other data representative of a marker. These constituent points can be individually and sufficiently adjusted, in accordance with operation 813, from the first frame of reference (i.e., that of the image on which the user created the marker) to the second frame of reference of the image on which the marker is to be superimposed. Accordingly, this can result not only in marker 1112 showing up at possibly a different position on the display than marker 1111, but also in its shape being adjusted according to the differences in frames of reference.
[0116] Marker 1112 can differ from marker 1111 (i.e., the marker initially created by the user of endpoint 101-2) in ways other than shape. In some embodiments, marker 1112 can have a different color, a different line style (e.g., dotted vs. dashed vs. solid, etc.), a different line thickness, or a different type of presentation (e.g., flashing versus solid, etc.) on display 206.
[0117] With regard to the displaying of other images, endpoint 101-1 presented image 1102 to its user via display 206, prior to presenting image 1104. As can be seen in Figure 11C, marker 1111 is seen superimposed on image 1102. However, because the frame of reference of image 1102 is the same as that of first image 1101, there was no need to adjust the position and orientation of marker 1111 in relation to where the user of endpoint 101-2 had created the marker in the first place. In contrast, marker 1112 appears adjusted within a second frame of reference - that is, that of image 1104 - and, as a result, appears to continue to coincide with the appearance of the video monitor object in the image.
[0118] Another way to understand how endpoint 101-1 superimposes the marker is by understanding when the marker does not appear in an image; such is the case in third image 1103, which is captured by camera 201 in a third frame of reference, as in the following example. In particular, server computer 103 takes the difference between the first and third frames of reference determined at operation 901, and adjusts the representation of the object cluster detected at operation 809 based on that difference. This is because the representation of the marker as provided by endpoint 101-2 is defined with respect to the frame of reference of first image 1101. Therefore, whatever change in representation of third image 1103 has occurred relative to the representation of first image 1101, in terms of position and orientation, the same change also applies in determining the marker's position and orientation within third image 1103. Consequently, because the leftmost video monitor 1121 of scene 1100 is fully outside of third image 1103, so is any marker of that video monitor.
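A short sketch of the resulting visibility behavior follows: once the marker outline has been adjusted by the difference between frames of reference, an outline that falls entirely outside the image bounds is simply not drawn. The axis-aligned bounds check and the function names are illustrative assumptions.

    from typing import Callable, Iterable, List, Tuple

    Point2D = Tuple[float, float]

    def marker_visible(outline: Iterable[Point2D], width: int, height: int) -> bool:
        """True if at least one adjusted marker point lies inside the image bounds."""
        return any(0 <= x < width and 0 <= y < height for x, y in outline)

    def maybe_superimpose(outline: List[Point2D], width: int, height: int,
                          draw: Callable[[List[Point2D]], None]) -> None:
        # For image 1103, the adjusted outline of the marker on the leftmost video
        # monitor lies entirely outside the image, so nothing is drawn.
        if marker_visible(outline, width, height):
            draw(outline)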
[0119] Figures 11A through 11E depict scene 1100 and corresponding images 1101 through 1104. Although Figures 11A through 11E have been referred to above and in regard to message flow 300, they are now described from the perspective of the individual figures.
[0120] Figure 11A depicts scene 1100, which is that of an office room with three video monitors on a table, including leftmost monitor 1121 and rightmost monitor 1122. As an example, a first endpoint user, who is a technician, is standing in the office room with a smartphone having a camera (i.e., endpoint 101-1 comprising camera 201); a second endpoint user, who is an office or building manager currently at a remote location, is viewing, on a display (i.e., at endpoint 101-2 comprising display 206), the video images being transmitted by the first endpoint. The technician can be walking around the room depicted in scene 1100, training the smartphone camera on various objects, and the manager can be marking one or more objects in the video-stream images being received.
[0121] As those who are skilled in the art will appreciate after reading this specification, however, the system and method disclosed herein can also be applied to usage scenarios other than technical support, such as, but not limited to, maintenance, education, medicine, criminal investigation, combatting terrorism, shopping, and booking of travel and lodging; can also be applied to scenes other than that of an office location; and can also be applied to marking various objects other than those found in an office location. Also, the images need not be part of a video stream, nor do the images need to be shared as part of a videoconference.
[0122] Figure 11B depicts image 1101, which includes the leftmost video
monitor 1121 on the table in scene 1100. Image 1101 is captured by camera 201 of endpoint 101-1; a video frame representation of the image is transmitted to other endpoints, including endpoint 101-2, and a depth map representation of the image is transmitted to server computer 103.
[0123] Figure 11C depicts image 1102, which includes the leftmost video
monitor 1121 on the table in scene 1100, now appearing as marked with marker 1111. Images 1101 and 1102 depict the same image captured by camera 201, but with marker 1111 appearing in image 1102.
[0124] Figure 11D depicts image 1103, which includes the rightmost video monitor 1122 on the table in scene 1100, but not the leftmost video monitor 1121, which has appeared in image 1102 as being marked. Image 1103 is captured by camera 201 of endpoint 101-1 and is the result of the user of endpoint 101-1 shifting and/or panning the endpoint toward the right of scene 1100. Endpoint 101-1 transmits a video frame representation of the image to other endpoints, including endpoint 101-2, and a depth map representation of the image to server computer 103. Image 1103 is of a different frame of reference from that of image 1101, as camera 201 has moved.
[0125] Figure 11E depicts image 1104, which once again includes the leftmost video monitor 1121 on the table in scene 1100. Image 1104 is captured by camera 201 of endpoint 101-1 and is the result of the user of endpoint 101-1 shifting and/or panning the endpoint back toward the left of scene 1100, after the camera had been trained on the right part of scene 1100. Image 1104 is of a different frame of reference from those of images 1101 and 1103, as camera 201 has moved in relation to the camera position and orientation when images 1101 and 1103 were captured.
[0126] In accordance with the illustrative embodiment, endpoint 101-1 superimposes marker 1112 on its display for its user, as described earlier, in the approximate position in relation to where marker 1111 was displayed, after server computer 103 has accounted for the different frames of reference between that of image 1104 and that of one or more of the previous images, including image 1101. Similarly, endpoint 101-2 also can superimpose marker 1112 on its display for its user.
[0127] It is to be understood that the disclosure teaches just one example of the illustrative embodiment and that many variations of the invention can easily be devised by those skilled in the art after reading this disclosure and that the scope of the present invention is to be determined by the following claims.

Claims

What is claimed is:
1. A data-processing system (103) for processing, in multi-dimensional space, a marker on an image, comprising:
a receiver (251) configured to:
a) receive (801) a first depth map of a first video frame and a first frame identification (ID) of the first video frame, wherein the first video frame is of a first image (1101) of a scene (1100) and captured in a first frame of reference, and
b) receive (803) coordinates of a first marker, wherein the coordinates of the first marker correspond to the first video frame;
a processor (234) configured to:
a) detect (809) an object cluster based on the first depth map and the coordinates of the first marker, and
b) generate (811) a first set of coordinates of the object cluster; and
a transmitter (252) configured to transmit (815) the first set of coordinates of the object cluster and the first frame ID, to at least one telecommunication endpoint (101-1, 101-2).
2. The data-processing system (103) of claim 1 wherein the receiver (251) is further configured to receive the first depth map from a first telecommunication endpoint (101-1) and the coordinates of the first marker from a second telecommunication endpoint (101-2).
3. The data-processing system (103) of any of the preceding claims:
wherein the receiver (251) is further configured to receive (801) a second depth map of a second video frame and a second frame identification (ID) of the second video frame, wherein the second video frame is of a second image (1104) of the scene (1100) and captured in a second frame of reference;
wherein the processor (234) is further configured to:
a) calculate (901) a difference between the second frame of reference and the first frame of reference, by comparing the second depth map to the first depth map, and
b) generate (911) a second set of coordinates of the object cluster, by adjusting the first set of coordinates with the difference between the second and first frame of reference; and
wherein the transmitter (252) is further configured to transmit (815) the second set of coordinates of the object cluster and the second frame ID, to the at least one
telecommunication endpoint (101-1, 101-2).
4. The data-processing system (103) of either of claims 2 and 3:
wherein the receiver (251) is further configured to receive (801) information that characterizes movement of a camera (201) at the first telecommunication endpoint (101-1);
wherein the processor (234) is further configured to generate (911) a second set of coordinates of the object cluster based on the first set of coordinates and the information that characterizes movement of the camera; and
wherein the transmitter (252) is further configured to transmit (815) the second set of coordinates of the object cluster.
5. The data-processing system (103) of claim 4 wherein the information that characterizes movement of the camera comprises at least one of accelerometer values and gyroscope values generated by the first telecommunication endpoint (101-1).
6. The data-processing system (103) of either of claims 4 and 5 wherein the second set of coordinates of the object cluster is further based on movement of the object cluster across depth maps of multiple video frames.
7. The data-processing system (103) of any of the preceding claims wherein the data-processing system (103) is in communication with a first telecommunication endpoint (101-1) configured to:
capture (401) the first image (1101) of the scene (1100),
receive (1001) the first set of coordinates of the object cluster and the first frame ID,
superimpose (1003) a second marker on the first image, based on the first set of coordinates of the object cluster and the first frame ID, and
display (1005) the first image with the second marker superimposed.
8. The data-processing system (103) of claim 7 wherein the first set of coordinates of the object cluster comprise a representation of depth, and wherein the first telecommunication endpoint (101-1) is further configured to superimpose (1003) the second marker on the first image (1101) further based on compensating for the position of the second marker in the first image by considering the representation of depth.
9. A method for processing, in multi-dimensional space, a marker on an image, comprising:
receiving (801), by a data-processing system (103), a first depth map of a first video frame and a first frame identification (ID) of the first video frame, wherein the first video frame is of a first image (1101) of a scene (1100) and captured in a first frame of reference;
receiving (803), by the data-processing system, coordinates of a first marker, wherein the coordinates of the first marker correspond to the first video frame;
detecting (809), by the data-processing system, an object cluster based on the first depth map and the coordinates of the first marker;
generating (811), by the data-processing system, a first set of coordinates of the object cluster; and
transmitting (815), by the data-processing system, the first set of coordinates of the object cluster and the first frame ID, to at least one telecommunication endpoint (101-1, 101-2).
10. The method of claim 9 wherein the first depth map is received by the data-processing system (103) from a first telecommunication endpoint (101-1) and the coordinates of the first marker are received from a second telecommunication endpoint (101-2).
11. The method of any of the preceding claims further comprising:
receiving (801), by the data-processing system, a second depth map of a second video frame and a second frame identification (ID) of the second video frame, wherein the second video frame is of a second image ( 1104) of the scene ( 1100) and captured in a second frame of reference;
calculating (901), by the data-processing system, a difference between the second frame of reference and the first frame of reference, by comparing the second depth map to the first depth map;
generating (911), by the data-processing system, a second set of coordinates of the object cluster, by adjusting the first set of coordinates with the difference between the second and first frame of reference; and
transmitting (815), by the data-processing system, the second set of coordinates of the object cluster and the second frame ID, to the at least one telecommunication endpoint (101-1, 101-2).
12. The method of either of claims 10 and 11 further comprising:
receiving (801), by the data-processing system, information that characterizes movement of a camera (201) at the first telecommunication endpoint (101-1);
generating (911), by the data-processing system, a second set of coordinates of the object cluster based on the first set of coordinates and the information that characterizes movement of the camera; and
transmitting (815), by the data-processing system, the second set of coordinates of the object cluster.
13. The method of claim 12 wherein the information that characterizes movement of the camera comprises at least one of accelerometer values and gyroscope values generated by the first telecommunication endpoint (101-1).
14. The method of either of claims 12 and 13 wherein the second set of coordinates of the object cluster is further based on movement of the object cluster across depth maps of multiple video frames.
15. The method of any of the preceding claims further comprising:
capturing (401), by a first telecommunication endpoint (101-1), the first image (1101) of the scene (1100);
receiving (1001), by the first telecommunication endpoint, the first set of coordinates of the object cluster and the first frame ID;
superimposing (1003), by the first telecommunication endpoint, a second marker on the first image, based on the first set of coordinates of the object cluster and the first frame ID; and
displaying (1005), by the first telecommunication endpoint, the first image with the second marker superimposed.
16. The method of claim 15 wherein the first set of coordinates of the object cluster comprise a representation of depth, and wherein the second marker is superimposed (1003) on the first image (1101) further based on compensating for the position of the second marker in the first image by considering the representation of depth.
17. A telecommunication system (100) for processing, in multi-dimensional space, a marker on an image, comprising:
i) a first telecommunication endpoint (101-1) configured to:
a) capture (401) a first image (1101) of a scene (1100) and in a first frame of reference,
b) receive (1001) a first set of coordinates of an object cluster and a first frame ID,
c) superimpose (1003) a second marker on the first image, based on the first set of coordinates of the object cluster and the first frame ID, and
d) display (1005) the first image with the second marker superimposed; and
ii) a data-processing system (103) configured to:
a) receive (803) coordinates of a first marker, wherein the coordinates of the first marker correspond to the first video frame,
b) detect (809) the object cluster based on a first depth map and the coordinates of the first marker, wherein the first depth map is of a first video frame of the first image,
c) generate (811) the first set of coordinates of the object cluster, and
d) transmit (815) the first set of coordinates of the object cluster and the first frame ID, to at least the first telecommunication endpoint.
18. The telecommunication system (100) of claim 17 further comprising a second telecommunication endpoint (101-2) configured to generate the coordinates of the first marker and transmit the coordinates to the data-processing system (103).
19. The telecommunication system (100) of any of the preceding claims wherein the data-processing system is further configured to:
a) receive (801) a second depth map of a second video frame and a second frame identification (ID) of the second video frame, wherein the second video frame is of a second image (1104) of the scene (1100) and captured in a second frame of reference,
b) calculate (901) a difference between the second frame of reference and the first frame of reference, by comparing the second depth map to the first depth map,
c) generate (911) a second set of coordinates of the object cluster, by adjusting the first set of coordinates with the difference between the second and first frame of reference, and
d) transmit (815) the second set of coordinates of the object cluster and the second frame ID, to at least the first telecommunication endpoint (101-1).
20. The telecommunication system (100) of either of claims 18 and 19 wherein:
i) the first telecommunication endpoint (101-1) is further configured to:
a) transmit information that characterizes movement of a camera (201),
b) receive (1001) a second set of coordinates of the object cluster, and
c) superimpose (1003) the second marker on a second image (1104) based on the second set of coordinates of the object cluster; and
ii) the data-processing system (103) is further configured to:
a) receive (801) the information that characterizes movement of a camera (201) at the first telecommunication endpoint (101-1),
b) generate (911) the second set of coordinates of the object cluster based on the first set of coordinates and the information that characterizes movement of the camera, and
c) transmit (815) the second set of coordinates of the object cluster to the first telecommunication endpoint.
PCT/US2017/013908 2017-01-18 2017-01-18 Spatial marking of an object that is displayed as part of an image WO2018136045A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2017/013908 WO2018136045A1 (en) 2017-01-18 2017-01-18 Spatial marking of an object that is displayed as part of an image

Publications (1)

Publication Number Publication Date
WO2018136045A1 (en) 2018-07-26

Family

ID=57960843

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2017/013908 WO2018136045A1 (en) 2017-01-18 2017-01-18 Spatial marking of an object that is displayed as part of an image

Country Status (1)

Country Link
WO (1) WO2018136045A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1062598A1 (en) * 1998-03-13 2000-12-27 Siemens Corporate Research, Inc. Apparatus and method for collaborative dynamic video annotation
US9088787B1 (en) * 2012-08-13 2015-07-21 Lockheed Martin Corporation System, method and computer software product for providing visual remote assistance through computing systems
US20160292925A1 (en) * 2015-04-06 2016-10-06 Scope Technologies Us Inc. Method and appartus for sharing augmented reality applications to multiple clients

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
STEPHAN GAMMETER ET AL: "Server-side object recognition and client-side object tracking for mobile augmented reality", COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS (CVPRW), 2010 IEEE COMPUTER SOCIETY CONFERENCE ON, IEEE, PISCATAWAY, NJ, USA, 13 June 2010 (2010-06-13), pages 1 - 8, XP031728435, ISBN: 978-1-4244-7029-7 *


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 17703000; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 17703000; Country of ref document: EP; Kind code of ref document: A1)