AU2015264917A1 - Methods for video annotation - Google Patents

Methods for video annotation

Info

Publication number
AU2015264917A1
Authority
AU
Australia
Prior art keywords
video sequence
annotation
display
input
suggestions
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
AU2015264917A
Inventor
Andres Nicolas Kievsky
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Canon Inc
Original Assignee
Canon Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Canon Inc filed Critical Canon Inc
Priority to AU2015264917A
Publication of AU2015264917A1


Abstract

METHODS FOR VIDEO ANNOTATION A computer-implemented method and system of applying an annotation to a portion of a video sequence are disclosed. The method comprises monitoring a plurality of signals associated with the video sequence during display of the video sequence (410), detecting, during the display of the video sequence, an input identifying a point of interest with respect to at least one frame of the video sequence (420), and displaying, during the display of the video sequence, one or more annotation suggestions based on the monitored signals and the identified point of interest (430). The method further comprises refining the one or more of the annotation suggestions based on further input detected with respect to at least one subsequent frame during the display of the video sequence (440); and where the further input corresponds (450) to a single one of the one or more annotation suggestions, applying the single one annotation to the portion of the video sequence corresponding to the input (460). Fig. 1A

Description

METHODS FOR VIDEO ANNOTATION

TECHNICAL FIELD
[0001] The present invention relates to cinematography and digital cinema. In particular, the present invention relates to a system and method for annotating a portion of a video sequence. The present invention also relates to an interactive display device and a computer readable medium for annotating a portion of a video sequence.
BACKGROUND
[0002] The advent of digital imaging technology has altered the behaviour of the film industry, in the sense that more and more films are produced digitally. Digital cinematography, the process of capturing video content as digital content items, has become increasingly prevalent for film production.
[0003] In addition to simplifying the transition of source materials between production and post-production, digital cinematography has improved the work flow of film production. For example, digital cinematography has enabled on-set monitoring, which means directors, clients, and others on set are able to watch the live video sequences of every shot during film production.
[0004] Annotations that are created from freehand interaction can be used on-set to annotate video footage as well as communicate in real-time with others. However, freehand creation of annotations on top of a video sequence is often error prone. A user may try to chase a moving object as the object changes position on a screen display, or may lose track of the shape of an object that has moved away. Accordingly, temporal and positional (spatial) coherence between the freehand annotation and a corresponding object can decrease, resulting in erroneous application of annotations. Freehand annotation in general (even when annotating static objects) is intrinsically error prone and inexact.
[0005] There is presently no easy way for an annotation to be specified with minimum error and maximum temporal and positional coherence. A need exists to facilitate applying annotations to a video sequence.
SUMMARY
[0006] It is an object of the present invention to substantially overcome, or at least ameliorate, one or more disadvantages of existing arrangements.
[0007] A first aspect of the present disclosure provides a computer-implemented method of applying an annotation to a portion of a video sequence, comprising: monitoring a plurality of signals associated with the video sequence during display of the video sequence; detecting, during the display of the video sequence, an input identifying a point of interest with respect to at least one frame of the video sequence; displaying, during the display of the video sequence, one or more annotation suggestions based on the monitored signals and the identified point of interest; refining the one or more of the annotation suggestions based on further input detected with respect to at least one subsequent frame during the display of the video sequence; and where the further input corresponds to a single one of the one or more annotation suggestions, applying the single one annotation to the portion of the video sequence corresponding to the input.
[0008] According to another aspect, the display of the video sequence is one of display during capture of the video sequence and display after capture of the video sequence.
[0009] According to another aspect, at least one of the one or more of the annotation suggestions represents motion of an object or a person in the video sequence.
[0010] According to another aspect, refining the one or more annotation suggestions comprises ceasing display of at least one of the annotation suggestions where the further input does not correspond to the at least one of the annotation suggestions.
[0011] According to another aspect, refining the one or more annotation suggestions comprises displaying one or more additional annotation suggestions based on the monitored signals and the further input.
[0012] According to another aspect, refining the one or more annotation suggestions comprises determining a correspondence score for each suggested annotation based on the further input and refining the number of annotation suggestions based on the correspondence scores.

[0013] According to another aspect, refining the number of annotation suggestions based on the correspondence scores comprises comparing each correspondence score to a predetermined threshold.
[0014] According to another aspect, the monitored signals comprise one or more of a feature detected in the video sequence, frame segmentation of the video sequence, trajectory of camera movements in the video sequence and trajectory of a moving feature of a frame segment of the video sequence.
[0015] According to another aspect, the at least one of the one or more annotation suggestions is displayed as a path suggesting a completion of the input.
[0016] According to another aspect, the input is a contact by a user on a touch-sensitive display, the touch-sensitive display displaying the video sequence.
[0017] According to another aspect, the further input with respect to the one or more subsequent frames comprises progression of the contact on the touch-sensitive display.
[0018] According to another aspect, the one or more annotation suggestions are displayed in association with the video sequence.
[0019] According to another aspect, the detected input is associated with a spatial portion of the at least one video frame.
[0020] According to another aspect, where the further input is determined not to correspond to any of the at least one annotation suggestions, the annotation applied to the portion of the video sequence is an annotation reflecting a trajectory of the user input.
[0021] According to another aspect, applying the single one annotation to the portion comprises displaying the one single annotation in association with the video sequence, and storing the one single annotation in relation to the video sequence.
[0022] According to another aspect, the refining comprises determining an intent of the input.

[0023] Another aspect of the present disclosure provides a non-transitory computer readable storage medium having a computer program stored thereon for applying an annotation to a portion of a video sequence, comprising: code for monitoring a plurality of signals associated with the video sequence during display of the video sequence; code for detecting, during the display of the video sequence, an input identifying a point of interest with respect to at least one frame of the video sequence; code for displaying, during the display of the video sequence, one or more annotation suggestions based on the monitored signals and the identified point of interest; code for refining the one or more of the annotation suggestions based on further input detected with respect to at least one subsequent frame during the display of the video sequence; and code for, where the further input corresponds to a single one of the one or more annotation suggestions, applying the single one annotation to the portion of the video sequence corresponding to the input.
[0024] Another aspect of the present disclosure provides a system, comprising: an interactive display device, the interactive display device being configured to monitor a plurality of signals associated with the video sequence during display of a video sequence; detect, during the display of the video sequence, an input identifying a point of interest with respect to at least one frame of the video sequence; display, during the display of the video sequence, one or more annotation suggestions based on the monitored signals and the identified point of interest; refine the one or more of the annotation suggestions based on further input detected with respect to at least one subsequent frame during the display of the video sequence; and where the further input corresponds to a single one of the one or more annotation suggestions, applying the single one annotation to a portion of the video sequence corresponding to the input.
[0025] A further aspect of the present disclosure provides an interactive display device for applying an annotation to a portion of a video sequence, comprising: a memory; a processor coupled to the memory for executing a computer program, said computer program comprising instructions for: monitoring a plurality of signals associated with the video sequence during display of the video sequence; detecting, during the display of the video sequence, an input associated with a spatial portion of the at least one video frame, and identifying a point of interest with respect to a portion of at least one frame of the video sequence; displaying, during the display of the video sequence, one or more annotation suggestions based on the monitored signals and the identified point of interest, the one or more annotation suggestions being displayed in association with the display of the video sequence; refining the one or more of the annotation suggestions based on further input detected with respect to at least one subsequent frame during the display of the video sequence; and where the further input corresponds to a single one of the one or more annotation suggestions, applying the single one annotation to the portion of the video sequence corresponding to the input.
[0026] Other aspects are also disclosed.
BRIEF DESCRIPTION OF THE DRAWINGS
[0027] Fig. 1A shows an example system for annotating a portion of a video sequence;

[0028] Figs. 1B and 1C collectively form a schematic block diagram representation of an electronic device upon which described arrangements can be practised;

[0029] Fig. 2 shows an example system for annotating a portion of a video sequence in greater detail;

[0030] Figs. 3A to 3C show an example sequence for interactive user input and resultant annotation of a video sequence;

[0031] Fig. 4 shows a method of applying an annotation to a portion of a video sequence;

[0032] Fig. 5 shows a method of monitoring signals associated with the video sequence according to the method of Fig. 4;

[0033] Fig. 6 shows a method of displaying annotation suggestions in accordance with the method of Fig. 4;

[0034] Figs. 7A to 7C show an example of problems that may occur inputting an annotation freehand to a moving object of a video sequence;

[0035] Figs. 8A and 8B show examples of loss of positional and temporal coherence for inputting an annotation freehand to a moving object of a video sequence;

[0036] Figs. 9A to 9D show examples of intended annotations or input gestures typically input by users of touch screens;

[0037] Figs. 10A to 10C show example annotation suggestions for a moving object of a video sequence across a number of frames; and

[0038] Figs. 11A to 11E show an example of annotating a portion of a video sequence in relation to a user input across a number of frames of a video sequence.
DETAILED DESCRIPTION INCLUDING BEST MODE
[0039] The arrangements described are directed to application of an annotation to a portion of a video sequence. Annotations typically comprise or identify information relating to features of a video sequence. Annotations may for example be applied by a director viewing a display of a video sequence during capture of the video sequence, or may be applied by post-production analysts of a video sequence.
[0040] Annotations that directors are interested in can be classified into a number of categories. Typical categories of annotations may comprise performance, camera (image capture apparatus) parameters and quality. The performance category includes annotations relating to characters of the video sequence. Example annotation types include script, voice and character positioning. Camera parameter annotations typically include annotation types such as framing and zoom speed. Framing refers to selection of what to include in the scene captured using the camera. Expressive qualities of framing include an angle of the camera to an object of the scene, an aspect ratio of the projected image, and the like. Zooming means a change of focal length of a lens of the camera while the shot is in progress. Different effects may be created by different zooming speeds. For example, zooming in creates a feeling of seemingly “approaching” a subject of the shot while zooming out makes the audience feel that they are seemingly “distancing” from the subject. Quality annotation types relate to issues of quality of the video sequence captured by the camera such as blur and focus. Different quality requirements may affect the camera movements. For example, a smooth camera pan may allow the scene to be sharp enough for the audience to observe, whereas a fast pan may create motion blur to the scene. Such information may be used in adjusting camera movement when making a next shot of the video sequence. The annotations may for example provide some guidance at a production stage of a video sequence as to how to improve shooting the next shot, or at a post-production stage to improve editing of a video sequence.
[0041] Input of annotations to an interactive electronic device on which the video sequence is displayed can be problematic. In particular, freehand creation of annotations by a user on top of a video sequence is often error prone and inexact. The arrangements described address this problem by displaying and refining annotation suggestions in association with the video sequence based upon received user input and signals associated with the video sequence.
[0042] Figures 1A to 1C illustrate an exemplary system 100 for applying an annotation to a portion of a video sequence. The system 100 is particularly suitable for applying annotations to a portion of a video sequence based upon freehand input by a user. An interactive display device 101 (also referred to as an interactive device) is used to display a video sequence to which an annotation is to be applied. The interactive display device 101 displays or presents the video sequence via a display 114 displaying a video layer 160. The video layer 160 is a representation of a video graphics layer presented by the display 114.
[0043] Figs. 1B and 1C collectively form a schematic block diagram of the interactive display device 101 including embedded components, upon which the methods to be described are desirably practiced.
[0044] In a preferred arrangement the interactive display device 101 is a tablet device. However, the interactive display device 101 may be any electronic device capable of reproducing or displaying images or video and executing a graphical user interface. Examples of interactive display devices include mobile devices, smartphones, a portable media player or any device, in which processing resources are limited. Nevertheless, the methods to be described may also be performed on higher-level devices such as desktop computers, server computers, and other such devices with significantly larger processing resources.
[0045] As seen in Fig. 1B, the interactive display device 101 (also referred to as an electronic device) comprises an embedded controller 102. Accordingly, the electronic device 101 may be referred to as an “embedded device.” In the present example, the controller 102 has a processing unit (or processor) 105 which is bi-directionally coupled to an internal storage module 109. The storage module 109 may be formed from non-volatile semiconductor read only memory (ROM) 160 and semiconductor random access memory (RAM) 170, as seen in Fig. 1B. The RAM 170 may be volatile, non-volatile or a combination of volatile and non-volatile memory.
[0046] The electronic device 101 includes a display controller 107, which is connected to the video display 114, such as a liquid crystal display (LCD) panel or the like. The display controller 107 is configured for displaying graphical images on the video display 114 in accordance with instructions received from the embedded controller 102, to which the display controller 107 is connected. The display controller 107 is configured to control display of the video sequence by the layer 160.
[0047] The interactive device 101 also includes user input devices 113 which are typically formed by keys, a keypad or like controls. In the example described herein, the user input devices 113 include a touch-sensitive display panel physically associated with the display 114, as depicted by the dashed line 195, to collectively form a touch-sensitive screen. For ease of reference, the combination of the display 114 and the user input devices 113 are referred to as a touch screen 114 in the arrangements described, consistent with that type of structure as found in traditional tablet devices, such as the Apple iPad™. The touch screen 114 may thus operate as one form of graphical user interface (GUI) as opposed to a prompt or menu driven GUI typically used with keypad-display combinations. Other forms of user input devices may also be used, such as a microphone (not illustrated) for voice commands or a joystick/thumb wheel (not illustrated) for ease of navigation about menus.
[0048] As seen in Fig. 1B, the electronic device 101 also comprises a portable memory interface 106, which is coupled to the processor 105 via a connection 119. The portable memory interface 106 allows a complementary portable memory device 125 to be coupled to the electronic device 101 to act as a source or destination of data or to supplement the internal storage module 109. Examples of such interfaces permit coupling with portable memory devices such as Universal Serial Bus (USB) memory devices, Secure Digital (SD) cards, Personal Computer Memory Card International Association (PCMCIA) cards, optical disks and magnetic disks.

[0049] The electronic device 101 also typically has a communications interface 108 to permit coupling of the device 101 to a computer or communications network 120 via a connection 121. The connection 121 may be wired or wireless. For example, the connection 121 may be radio frequency or optical. An example of a wired connection includes Ethernet. Further, an example of wireless connection includes Bluetooth™ type local interconnection, Wi-Fi (including protocols based on the standards of the IEEE 802.11 family), Infrared Data Association (IrDa) and the like. The network 120 may also allow communication between the interactive device 101 and a video capture device, such as a camera 190. The network 120 may also permit communications with a cloud server computer, or another type of computer (not shown).
[0050] Typically, the electronic device 101 is configured to perform some special function. The embedded controller 102, possibly in conjunction with further special function components 110, is provided to perform that special function. In the present arrangements where the electronic device 101 is a tablet device, the special functions may relate to the touch-screen 114. The special function components 110 are connected to the embedded controller 102. As another example, the device 101 may be a mobile telephone handset. In this instance, the components 110 may represent those components required for communications in a cellular telephone environment. Where the device 101 is a portable device, the special function components 110 may represent a number of encoders and decoders of a type including Joint Photographic Experts Group (JPEG), Moving Picture Experts Group (MPEG), MPEG-1 Audio Layer 3 (MP3), and the like.
[0051] The methods described hereinafter may be implemented using the embedded controller 102, where the processes of Figs. 4 to 6 may be implemented as one or more software application programs 133 executable within the embedded controller 102. The electronic device 101 of Figs. 1A and 1B implements the described methods. In particular, with reference to Fig. 1C, the steps of the described methods are effected by instructions in the software 133 that are carried out within the controller 102. The software instructions may be formed as one or more code modules, each for performing one or more particular tasks. The software may also be divided into two separate parts, in which a first part and the corresponding code modules performs the described methods and a second part and the corresponding code modules manage a user interface between the first part and the user.

[0052] The software 133 of the embedded controller 102 is typically stored in the non-volatile ROM 160 of the internal storage module 109. The software 133 stored in the ROM 160 can be updated when required from a computer readable medium. The software 133 can be loaded into and executed by the processor 105. In some instances, the processor 105 may execute software instructions that are located in RAM 170. Software instructions may be loaded into the RAM 170 by the processor 105 initiating a copy of one or more code modules from ROM 160 into RAM 170. Alternatively, the software instructions of one or more code modules may be pre-installed in a non-volatile region of RAM 170 by a manufacturer. After one or more code modules have been located in RAM 170, the processor 105 may execute software instructions of the one or more code modules.
[0053] The application program 133 is typically pre-installed and stored in the ROM 160 by a manufacturer, prior to distribution of the electronic device 101. However, in some instances, the application programs 133 may be supplied to the user encoded on one or more CD-ROM (not shown) and read via the portable memory interface 106 of Fig. 1B prior to storage in the internal storage module 109 or in the portable memory 125. In another alternative, the software application program 133 may be read by the processor 105 from the network 120, or loaded into the controller 102 or the portable storage medium 125 from other computer readable media. Computer readable storage media refers to any non-transitory tangible storage medium that participates in providing instructions and/or data to the controller 102 for execution and/or processing. Examples of such storage media include floppy disks, magnetic tape, CD-ROM, a hard disk drive, a ROM or integrated circuit, USB memory, a magneto-optical disk, flash memory, or a computer readable card such as a PCMCIA card and the like, whether or not such devices are internal or external of the device 101. Examples of transitory or non-tangible computer readable transmission media that may also participate in the provision of software, application programs, instructions and/or data to the device 101 include radio or infra-red transmission channels as well as a network connection to another computer or networked device, and the Internet or Intranets including e-mail transmissions and information recorded on Websites and the like. A computer readable medium having such software or computer program recorded on it is a computer program product.
[0054] The second part of the application programs 133 and the corresponding code modules mentioned above may be executed to implement one or more graphical user interfaces (GUIs) to be rendered or otherwise represented upon the touch-screen display 114 of Fig. 1B. Through manipulation of the user input device 113 (e.g., the keypad), a user of the device 101 and the application programs 133 may manipulate the interface in a functionally adaptable manner to provide controlling commands and/or input to the applications associated with the GUI(s). Other forms of functionally adaptable user interfaces may also be implemented, such as an audio interface utilizing speech prompts output via loudspeakers (not illustrated) and user voice commands input via the microphone (not illustrated).
[0055] Fig. 1C illustrates in detail the embedded controller 102 having the processor 105 for executing the application programs 133 and the internal storage 109. The internal storage 109 comprises read only memory (ROM) 160 and random access memory (RAM) 170. The processor 105 is able to execute the application programs 133 stored in one or both of the connected memories 160 and 170. When the electronic device 101 is initially powered up, a system program resident in the ROM 160 is executed. The application program 133 permanently stored in the ROM 160 is sometimes referred to as “firmware”. Execution of the firmware by the processor 105 may fulfil various functions, including processor management, memory management, device management, storage management and user interface.
[0056] The processor 105 typically includes a number of functional modules including a control unit (CU) 151, an arithmetic logic unit (ALU) 152, a digital signal processor (DSP) 153 and a local or internal memory comprising a set of registers 154 which typically contain atomic data elements 156, 157, along with internal buffer or cache memory 155. One or more internal buses 159 interconnect these functional modules. The processor 105 typically also has one or more interfaces 158 for communicating with external devices via system bus 181, using a connection 161.
[0057] The application program 133 includes a sequence of instructions 162 through 163 that may include conditional branch and loop instructions. The program 133 may also include data, which is used in execution of the program 133. This data may be stored as part of the instruction or in a separate location 164 within the ROM 160 or RAM 170.

[0058] In general, the processor 105 is given a set of instructions, which are executed therein. This set of instructions may be organised into blocks, which perform specific tasks or handle specific events that occur in the electronic device 101. Typically, the application program 133 waits for events and subsequently executes the block of code associated with that event. Events may be triggered in response to input from a user, via the user input devices 113 of Fig. 1B, as detected by the processor 105. Events may also be triggered in response to other sensors and interfaces in the electronic device 101.
[0059] The execution of a set of the instructions may require numeric variables to be read and modified. Such numeric variables are stored in the RAM 170. The disclosed method uses input variables 171 that are stored in known locations 172, 173 in the memory 170. The input variables 171 are processed to produce output variables 177 that are stored in known locations 178, 179 in the memory 170. Intermediate variables 174 may be stored in additional memory locations in locations 175, 176 of the memory 170. Alternatively, some intermediate variables may only exist in the registers 154 of the processor 105.
[0060] The execution of a sequence of instructions is achieved in the processor 105 by repeated application of a fetch-execute cycle. The control unit 151 of the processor 105 maintains a register called the program counter, which contains the address in ROM 160 or RAM 170 of the next instruction to be executed. At the start of the fetch execute cycle, the contents of the memory address indexed by the program counter are loaded into the control unit 151. The instruction thus loaded controls the subsequent operation of the processor 105, causing for example, data to be loaded from ROM memory 160 into processor registers 154, the contents of a register to be arithmetically combined with the contents of another register, the contents of a register to be written to the location stored in another register and so on. At the end of the fetch execute cycle the program counter is updated to point to the next instruction in the system program code. Depending on the instruction just executed this may involve incrementing the address contained in the program counter or loading the program counter with a new address in order to achieve a branch operation.
[0061] Each step or sub-process in the processes of the methods described below is associated with one or more segments of the application program 133, and is performed by repeated execution of a fetch-execute cycle in the processor 105 or similar programmatic operation of other independent processor blocks in the electronic device 101.

[0062] Referring back to Fig. 1A, the device 101 executes to display, overlaid on video layer 160, a drawing layer 130 to display the user’s ongoing freehand input, as represented by an input 150. The freehand input 150 extends as, for example, the user moves his or her finger along the touch screen 114 of the interactive device 101. In the arrangements described, the input is made by interacting with the touch screen 114. While use of the touch screen 114 is particularly suited to the arrangements described, other methods for receiving an interaction may nevertheless be appropriate (e.g. a mouse drag, a joystick, hovering of fingers, engaging of buttons, eye tracking, body or extremities tracking, input captured via computer vision, trackpad, trackball and the like) as means for receiving a freehand input or indication, provided the indication or input can be associated with a particular spatial portion of the video frame. Typically, progressive touch interactions are used to operate the touch screen 114 in the arrangements described.
[0063] The video layer 120 presented by the touch screen 114 continues displaying the video sequence as the user interacts with the touch screen 114. Continued display of the video sequence provides subjects (such as objects, faces, bodies, and the like) and context for user input.
[0064] As the freehand input 150 is detected, a record of the input 150 is stored on the interactive device 101, for example on the memory 109. Storing a record of the input 150 includes storing temporal and positional components of the input 150 with reference to at least one video frame being displayed on the touch screen 114 at the time the input 150 occurs. Storing the temporal and positional components allows the application 133 to synchronise the freehand input 150 with the video sequence displayed on video layer 160.
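By way of illustration only, the following is a minimal sketch of one way such a record might be held in memory, storing both the temporal (frame index and timestamp) and positional (x, y) components of each touch sample; the class and field names are assumptions for illustration and do not appear in the described arrangements.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class InputSample:
    """One touch sample of an ongoing freehand input."""
    frame_index: int      # frame of the video sequence displayed when the sample occurred
    timestamp_ms: float   # time of the touch event relative to the start of display
    x: float              # horizontal position on the touch screen, in pixels
    y: float              # vertical position on the touch screen, in pixels

@dataclass
class FreehandInputRecord:
    """Record of a freehand input, synchronised with the displayed video sequence."""
    video_id: str
    samples: List[InputSample] = field(default_factory=list)

    def add_sample(self, frame_index: int, timestamp_ms: float, x: float, y: float) -> None:
        # Keeping both temporal and positional components allows the stored input
        # to be replayed overlaid on the video sequence later.
        self.samples.append(InputSample(frame_index, timestamp_ms, x, y))
```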
[0065] Once stored, the freehand input 150 may be used as freehand annotation of the video sequence. The stored input 150 may also be played back overlaid on the video sequence.
[0066] Figure 2 shows a system 200 for annotating a portion of a video sequence.
[0067] The interactive display device (electronic device) 101 is used to display a video sequence on the touch screen 114 via presentation of the video layer 160. The video sequence may be displayed in near real-time during capture of the video sequence by the camera 190. Alternatively, display of the video sequence may relate to play of the video sequence after capture of the video sequence, such as play of a video file stored on the memory 109, or a video file streamed via the network 120 from an external device such as a cloud server computer.
[0068] Overlaid on presentation of the video layer 160, a drawing layer 130 is used to display the user’s ongoing freehand input 150 made by contact of the user’s finger with the touch screen 114. The freehand input 150 extends as the user contacts and moves his or her finger on the touch screen 114 of the interactive display device 101 as previously described.
[0069] Overlaid on the drawing layer 130, a guide layer 240 is used to display suggested annotations on the touch screen 114. The suggested annotations are displayed in association with the display of the video sequence. In the example of Fig. 2, two indications 260 and 261 are displayed on the touch screen 114 via the layer 240 overlaid on the video layer 160, representing two suggested annotations on the interactive display device 101. The suggested annotations 260 and 261 correspond to the freehand input 150. As the user progresses the freehand input 150, suggested annotations may appear or disappear as correspondence of the annotation suggestions to further input progressively changes, forming a refinement process.
[0070] Similarly to the video layer 160, the drawing layer 130 and the annotation layer 240 are representations of video graphics layers presented by the touch screen display 114 in playing the video sequence. Presentation of the drawing layer 130 and the guide layer 240 on the touch screen 114 is controlled by execution of the display controller 107. The layers 160, 130 and 240 are shown in exploded form in Fig. 2 for ease of reference.
[0071] In other arrangements, display of the suggested annotations in association with the displayed video sequence may take a different form. For example, the annotation suggestions may be displayed on a thumbnail or reduced size display of the video sequence. In some arrangements, the thumbnail may be overlaid on a portion of the overall video sequence displayed on the touch screen 114, or adjacent thereto.
[0072] When the refinement process results in one single annotation suggestion corresponding to the freehand input 150, a corresponding annotation is applied to a portion of the video sequence.

[0073] In other arrangements, the single suggested annotation may depend on a suggested annotation score.
[0074] In other arrangements, the user may use an explicit trigger to signal an end of the freehand input. Examples of an explicit trigger include breaking contact with the touch screen 114 of the interactive display device 101 or touching a separate area of the touch screen 114 or the interactive device 101, and the like.
[0075] In other arrangements, if the end of the freehand input 150 is reached and the number of suggested annotations is zero, the freehand input 150 may be saved as a freehand annotation. In such an event, the annotation applied to the portion of the video sequence reflects a trajectory of the freehand input relative to one or more frames of the video sequence.
[0076] In some arrangements, when the end of freehand input 150 is reached and only one suggested annotation remains, an annotation corresponding to the one suggested annotation is applied to a portion of the video sequence.
[0077] Figures 3A to 3C show an example sequence of interactive user input and resultant annotation of a video sequence. A frame 310 of the video sequence shows display on the touch screen 114 at a first moment in time. In display of the frame 310, a stationary object 320 is displayed on the video layer 120. The user proceeds to start a freehand input 330a, the freehand input 330a being displayed by the drawing layer 130. The input 330a typically does not form part of the frame 310, but may be stored with reference to the frame 310 of the video sequence. The temporal and spatial location of the input 330a identifies a point of interest with respect to the frame 310, being the object 320.
[0078] A point of interest, in the context of the present disclosure, relates to a spatial region of a frame. The point of interest may relate to a region of a frame corresponding directly to a user input, for example a region where the user’s finger makes contact with the touch screen 114. The point of interest, however, typically relates to a region of the frame corresponding to an object or person, or a trajectory of an object or person, as determined relative to the user input.

[0079] Fig. 3B shows a subsequent frame 311 of the video sequence. In Fig. 3B, the object 320 has moved away from an initial spatial position of the object 320 shown in Fig. 3A. The user progressively continues freehand input to form an extended input 330b. As shown in Fig. 3B, the object 320 is not located adjacent to the input 330b due to progression of the video sequence.
[0080] Fig. 3C shows a further subsequent frame 312 of the video sequence. In the frame 312, the object 320 is no longer visible on the touch screen display 114 as the object 320 has moved off-frame. However, the user has further continued their input as indicated by a further extended input 330c. A resultant suggested annotation 340 is displayed on the touch screen 114. The indication 340 corresponds to the freehand input 330c.
[0081] A method 400 of applying an annotation to a portion of a video sequence is shown in Figure 4. The method 400 may be implemented as one or more modules of the application 133 stored on the memory 109 and controlled by execution of the processor 105. The method 400 is normally executed during play of the video sequence. Play of the video sequence may comprise capture, playback or streaming of the video sequence.
[0082] The method 400 starts at a step 410. At step 410, the application 133 executes to monitor signals associated with the video sequence during display of the video sequence. A method 500 of monitoring video signals, as executed at step 410 is described hereafter in relation to Fig. 5.
[0083] The method 400 progresses from step 410 to step 420 when the user provides an input such as a freehand annotation. Step 420 executes to receive the monitored signals from step 410 and to detect the input received from the user. The input received, such as the freehand annotation 150, identifies a point of interest with respect to at least one frame of the video sequence.
[0084] In the arrangements described in relation to Fig. 4, the step 410 is implemented prior to step 420. However, in other arrangements, the step 410 may start upon detection of the user interaction with the touch screen at step 420. In such arrangements, the step 410 may operate to monitor signals across a number of frames displayed prior to the frame at which the input occurred. The prior frames may, for example, be stored in a buffer or temporary memory of the interactive device 101 associated with the video sequence.

[0085] In yet other arrangements, the step 410 may operate to monitor signals associated with the video sequence throughout execution of the method 400. In such arrangements, only monitored signals within a predetermined threshold are provided to the step 420.
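As a rough sketch of the buffering alternative just described, the following keeps a short history of recently displayed frames so that signals can be monitored retrospectively once an input is detected; the buffer length, class and method names are assumptions, not part of the described arrangements.

```python
from collections import deque

class FrameBuffer:
    """Keeps the most recently displayed frames so that monitored signals can be
    computed retrospectively once a user interaction is detected."""

    def __init__(self, max_frames: int = 30):  # assumed buffer length
        self._frames = deque(maxlen=max_frames)

    def push(self, frame_index: int, frame) -> None:
        # Older frames are discarded automatically once the buffer is full.
        self._frames.append((frame_index, frame))

    def history(self):
        # Frames displayed prior to the frame at which the input occurred.
        return list(self._frames)
```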
[0086] Once the input is detected, the method 400 progresses from step 420 to step 430. In execution of step 430, the application 133 receives the monitored signals and detected input from step 420 and proceeds to display one or more annotation suggestions. A method 600 of displaying annotation suggestions, as executed at step 430, is described hereafter in relation to Fig. 6.
[0087] Determination and display of annotation suggestions provide a means of assisting a user in applying their intended annotation. For example, an annotation suggestion may form a visual guide such as a path suggesting completion of the input. A path may comprise a directed path indicating a trajectory, or an outline or polygon representing an object or trajectory of an object in the video sequence.
[0088] Display of a path can clarify to the user how to finish the input to achieve a desired annotation. The user can interact with the touch screen 114 to progress the input to relate to an appropriate annotation suggestion, for example by following a displayed path. This allows the user to apply the annotation in decreased time. Further, this assists in refining the suggested annotations as discussed in relation to steps 440 to 450.
[0089] Figures 11A to 11E show an example of the user input of a freehand annotation across a number of frames of a video sequence. The video sequence displays a trajectory of a bouncing ball in the presence of possible alternate annotation objects. Figure 11A shows display of a frame 1100 of a video sequence on the touch screen 114. In the frame 1100, potential annotations can be associated with a trajectory 1140 of a ball and objects that include a block 1130, a person 1110 and the ball 1120.
[0090] The user interacts with the touch screen 114, as indicated by a cursor 1145, to form a point of contact at a point 1150 of the frame 1100 (shown in Fig. 11B), as detected at step 420. Step 430 executes to display annotation suggestions. In the example of Fig. 11B, the point of contact 1150 could relate to the block 1130, the person 1110, the ball 1120 or the trajectory of the ball 1140. The application 133 executes at step 430 to display an annotation suggestion 1132 corresponding to the block 1130, an annotation suggestion 1112 corresponding to the person 1110, an annotation suggestion 1122 corresponding to the ball 1120, and an annotation suggestion 1142 corresponding to the trajectory of the ball 1140. If the user wants to apply an annotation in relation to the trajectory of the ball 1140 for example, the user can progress the point of contact with the touch screen 114 along the annotation suggestion 1142. Accordingly, the annotation suggestions 1132, 1112, 1122 and 1142 assist in clarifying to the user how to apply the desired annotation.
[0091] Fig. 11C shows a subsequent frame 1105 of the video sequence in which the user further progresses the point of contact to form an extended input 1155 by interaction with the touch screen 114 along the trajectory of the ball. The annotation suggestion 1142 corresponding to the trajectory of the ball 1140 remains prominent while the other annotation suggestions become progressively less prominent. That is, an annotation suggestion 1135 corresponding to the block 1130, an annotation suggestion 1115 corresponding to the person 1110, and an annotation suggestion 1125 corresponding to the ball 1120 are displayed in a less visually prominent manner than the suggestion 1142, in the example of Fig. 11C by use of narrower outlines. Visual prominence can be effected in a number of ways, for example using colours, flashing indicators, dashed lines and the like. A spatial position 1127 of the ball 1120 in Fig. 11C does not coincide with the annotation suggestion 1125, which coincides with the original spatial position of the ball 1120 in Fig. 11A. Change of spatial position across frames of a video sequence is discussed further in relation to Fig. 10.
[0092] Fig. 11D shows a display of a frame 1190 subsequent to the frame 1105 of Fig. 11C in the video sequence. The ball 1120 is partially included in the frame 1190, as indicated by a spatial position 1160. The user further progresses input to a point of contact 1157 (from the point of contact 1155 in Fig. 11C) to form an extended input 1170. The progression of the point of contact remains along an annotation suggestion 1165 corresponding to the trajectory of the ball 1140 (Fig. 11A). Step 430 executes to determine that the annotation suggestion 1165 corresponding to the trajectory of the ball 1140 is to remain while the other annotation suggestions are to be removed. That is, the annotation suggestion 1135 corresponding to the block 1130 and the annotation suggestion 1115 corresponding to the person 1110 are removed from display on the touch screen 114. In one arrangement, an annotation suggestion 1125 corresponding to the ball 1120 continues to be displayed but in a state indicating that the annotation 1125 is not a possible selection. In Fig. 11D for example, the annotation suggestion 1125 is shown as a dashed line but alternate ways of indicating that the ball is not a possible selection can be used. In another arrangement, the suggestion 1125 may be removed from display at execution of step 430 for the frame 1190.
[0093] Fig. 11E shows an annotation 1175 recorded when the person raises a finger (indicated by the cursor 1145) from the touch screen 114 to end user input. The annotation 1175 corresponding to the trajectory of the ball 1140 (Fig. 11A) is applied to the video sequence according to step 460.
[0094] Figures 10A to 10C show example display of one or more annotation suggestions on the touch screen 114 in association with a number of frames of a video sequence. Fig. 10A shows a first frame 1010. The frame shows an object 1020. The user interacts with the interactive display device 101 as indicated by an input 1030a. The input is received at step 420. The method 400 executes to continue to step 430.
[0095] Step 430 executes to determine and display an annotation suggestion. An annotation suggestion 1035 is displayed in association with a subsequent frame of the sequence, frame 1011 shown in Figure 10B. The annotation suggestion 1035 comprises an outline path corresponding to the object. The suggested annotation 1035 corresponds to an original position of the object 1020, although the object 1020 has a different relative spatial position in the frame 1011 compared to the frame 1010. The user progresses input to the touch screen 114, forming extended input 1030b.
[0096] Returning to Fig. 4, the method 400 progresses from step 430 to step 440. In execution of step 440, the application 133 receives the monitored signals, the detected input and the annotation suggestions from step 430. The application 133 proceeds to use further input detected with respect to one or more subsequent frames of the video sequence (that is, new or extended user interaction) to refine the annotation suggestions. The further input typically relates to progression of contact of the user’s finger with the touch screen 114.
[0097] The arrangements disclosed normally relate to progressive user input - input detected at the touch screen 114 at step 420 is normally progressed or continued by the user as a single extended input. For example, the progressive input may relate to an extended swipe gesture by the user on the touch screen 114 following a trajectory, or a drag gesture outlining an object. The refinement of the annotation suggestions involves using the user input received or detected so far (including the further input) to generate and display annotation suggestions in the same manner as at step 430. Execution of step 440 accordingly calculates correspondence scores based on the further input to have an effect of refining the set of annotation suggestions and displaying the refined set of annotation suggestions. For example, the correspondence score may be compared to a predetermined threshold. As an effect of execution of step 440, additional annotation suggestions may be displayed on the touch screen, or some previously displayed annotation suggestions (displayed at step 430) may be removed from display by the touch screen 114.
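A minimal sketch of the refinement just described, assuming a correspondence score in the range 0 to 1 and an arbitrary threshold; the function and parameter names are illustrative only and do not appear in the described arrangements.

```python
def refine_suggestions(suggestions, input_points, score_fn, threshold=0.5):
    """Refine the displayed annotation suggestions (sketch of step 440 only).

    suggestions  : iterable of candidate annotations, each a list of (x, y) points
    input_points : the user input received so far, as a list of (x, y) points
    score_fn     : correspondence score function returning a value in [0, 1]
    threshold    : assumed cut-off; suggestions scoring below it cease to be displayed
    """
    refined = []
    for suggestion in suggestions:
        score = score_fn(input_points, suggestion)
        if score >= threshold:
            refined.append((suggestion, score))   # keep displaying this suggestion
        # otherwise display of the suggestion is ceased
    return refined
```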
[0098] The method 400 progresses from step 440 to step 450. In execution of the method 400, the user input received so far, the monitored signals and the annotation suggestions are passed from step 440 to step 450. At step 450, the application 133 executes to determine if a correspondence exists between the input received so far, the monitored signals and the annotation suggestions. Step 450 executes to create a list of corresponding annotations, that is, annotations which are considered to be in correspondence with the user input received so far.
[0099] In one arrangement, the user input received so far and the suggested annotations are matched using a curve matching algorithm (for example, Frenkel, M., & Basri, R. (2003). Curve matching using the fast marching method. In Energy Minimization Methods in Computer Vision and Pattern Recognition (pp. 35-51). Springer Berlin Heidelberg.) in order to determine the correspondence. Use of such a curve matching algorithm at step 450 returns a distance rather than a score from 0 to 1.
[00100] In another arrangement, the correspondence is determined at step 450 by using a Hausdorff distance between the set of points of the input received so far and the set of points of a given annotation. Equation 1 below expresses the Hausdorff distance “h”, where “A” is a set of points of the user input received so far; “B” is a set of points of a given annotation; “d” is a distance function; “a” is a point belonging to the set A; “b” is a point belonging to the set B; “min” is a function that returns the minimum value of a set; and “max” is a function that returns the maximum value of a set. Arrangements relating to the algorithm using the Hausdorff distance at step 450 return a distance rather than a score from 0 to 1.
[00101] In another arrangement, if a matching algorithm returns a distance rather than a score from 0 to 1, a correspondence score can be determined at step 450 by using Equation 2 10756471 2 21 2015264917 04 Dec 2015 below. In Equation 2, “A” is the user input received so far, “B” is the annotation, “match()” is a matching algorithm that returns a distance, “MIN(x, y)” is the minimum function, “MAX(x,y)” is the maximum function, “M” is the maximum possible distance between two points, and “score()” is the resulting score function. The maximum possible distance between two points “M” may be determined by using the length of the screen diagonal, or querying the signal database for the maximum distance between two stored points.
[00102] In one implementation, the set of points of the input received so far comprises the vertices of a given path, directed path or polygon.
[00103] In another arrangement, the set of points of an annotation comprises the vertices of a given path, directed path or polygon.

$h(A, B) = \max_{a \in A}\left\{\min_{b \in B}\left\{d(a, b)\right\}\right\}$   Equation 1

$\mathrm{score}(A, B) = 1 - \mathrm{MIN}\left(1, \mathrm{MAX}\left(0, \frac{\mathrm{match}(A, B)}{M}\right)\right)$   Equation 2

[00104] In some instances, there may be no correspondence determined at step 450 when the input received so far and the given annotation are both represented as directed paths and the directed paths are found to have opposite directions.
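Equations 1 and 2 above can be sketched directly in code. The sketch below assumes Euclidean distance for d and the screen diagonal for the maximum possible distance M; the function names and default screen size are assumptions, not part of the described arrangements.

```python
import math

def hausdorff(A, B):
    """Directed Hausdorff distance h(A, B) of Equation 1, with Euclidean d."""
    def d(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])
    return max(min(d(a, b) for b in B) for a in A)

def correspondence_score(A, B, match=hausdorff, M=None, screen_w=1024, screen_h=768):
    """Correspondence score of Equation 2, mapping a matching distance into [0, 1]."""
    if M is None:
        # Maximum possible distance between two points, here the screen diagonal.
        M = math.hypot(screen_w, screen_h)
    return 1.0 - min(1.0, max(0.0, match(A, B) / M))

# Example usage: a short input close to a suggested annotation scores near 1.
print(correspondence_score([(0, 0), (10, 10)], [(1, 1), (9, 9)]))
```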
[00105] If no correspondence is found in execution of step 450, the method 400 executes to return to step 440. The application 133 at step 440 receives the monitored signals and annotation suggestions and proceeds to use further input to refine the annotation suggestions. The refinement at step 440 may execute to cease display of an annotation suggestion that no longer corresponds to the input received so far, or to add new annotation suggestions.
[00106] Referring back to Figs. 10A to 10C, Fig. 10C shows a frame 1012 subsequent to the frames 1010 and 1011. Based upon an extended user input 1030c, step 450 operates to display two annotation suggestions - the outline 1035 and a trajectory path 1050. The trajectory path 1050 follows a trajectory of the object 1020 through the frames 1010 to 1012. The object 1020 does not form part of the display of the video sequence displayed at the frame 1012.
[00107] Returning to Figure 4, if a correspondence is found in execution of step 450, method 400 progresses from step 450 to step 460. In execution of step 460, the application 133 receives the input received so far, the monitored signals, the annotation suggestions and the list of corresponding annotations from step 450. The application 133 selects one or more corresponding annotations to apply, that is, the application 133 executes to create or correct an annotation corresponding to the annotation suggestion. In some arrangements, a single annotation suggestion relates to the user input. In other arrangements, a single annotation that has a greatest correspondence to the input is selected, for example by comparison to a threshold. In creating an annotation, the application 133 applies an annotation corresponding to an annotation suggestion to a portion of the video sequence by displaying the single annotation suggestion corresponding to the input. The portion of the video sequence may be a spatial and temporal portion of the video sequence corresponding to the user input across one or more frames of the video sequence. Step 460 also operates to store the annotation applied to the video sequence.
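The selection of the single annotation with the greatest correspondence, as described above, might look roughly as follows; the threshold value and names are assumptions for illustration.

```python
def select_annotation(corresponding, threshold=0.8):
    """Select the single annotation with the greatest correspondence to the input
    (sketch of the selection at step 460; the threshold value is an assumption).

    corresponding : list of (annotation, score) pairs produced by the refinement step
    Returns the selected annotation, or None if no score meets the threshold.
    """
    if not corresponding:
        return None
    best_annotation, best_score = max(corresponding, key=lambda pair: pair[1])
    return best_annotation if best_score >= threshold else None
```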
[00108] Storing the annotations may comprise embedding the determined annotation as metadata, by storing the determined annotation and associated semantics as metadata in an output video file. Alternatively, the applied annotation is stored as an annotation record associated with the video sequence. The output video file, the associated metadata stream, and the annotation record may be stored on the memory 109.
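As an illustration of the annotation-record alternative, the following hedged sketch appends each applied annotation to a sidecar file associated with the video sequence; the JSON layout and field names are assumptions, not a format defined in the described arrangements.

```python
import json

def store_annotation_record(path, video_id, annotation_points, start_frame, end_frame, label=""):
    """Append an applied annotation as a record associated with the video sequence.
    The field names and JSON layout are illustrative assumptions only."""
    record = {
        "video_id": video_id,
        "start_frame": start_frame,       # temporal portion of the video sequence
        "end_frame": end_frame,
        "points": annotation_points,      # spatial portion: path, directed path or polygon
        "label": label,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```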
[00109] As discussed in relation to Figs. 8A and 8B, the annotation is applied to the video sequence corresponding to the input by applying the annotation to the signal, object or trajectory determined to correspond to the user’s intended annotation.
[00110] In other arrangements, the creation of an annotation based on the annotation suggestion is solely based on the information associated with the annotation suggestion. The information associated with the annotation suggestion includes timing information, shape information, spatial information and the like.
[00111] In other arrangements, the creation of an annotation based on the annotation suggestion is based on both the information of the annotation suggestion and the input. In such arrangements, creation is then a correction, as the creation of the annotation corrects the user’s input taking into account the annotation suggestion. The correction may be accomplished by, for instance, using a weighting process between the information of said annotation suggestion and the information of said input, or using some properties (such as temporal information, positional information and the like) from said annotation suggestion, and other properties from said input.
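The weighting process mentioned above could, for example, blend corresponding points of the input and the selected suggestion. The sketch below assumes both point lists have already been resampled to the same length and uses an arbitrary weight; it is one possible realisation, not the specific correction used in the described arrangements.

```python
def correct_input(input_points, suggestion_points, weight=0.7):
    """Blend the user's input with the selected annotation suggestion (sketch only).

    weight is the assumed contribution of the suggestion; both point lists are
    assumed to have been resampled to the same number of points beforehand.
    """
    corrected = []
    for (ix, iy), (sx, sy) in zip(input_points, suggestion_points):
        corrected.append((weight * sx + (1 - weight) * ix,
                          weight * sy + (1 - weight) * iy))
    return corrected
```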
[00112] In some embodiments, the annotation suggestion is stored in association with the freehand annotation input in the memory 109.
[00113] After execution of step 460, the method 400 ends. In other arrangements, the method 400 may return to step 410 to monitor signals during play of the video sequence until further user input is detected.
[00114] The method 500 of monitoring signals of a video sequence, as executed at step 410, is now described in relation to Figure 5. The monitored signals may relate to one or more of a feature detected in the video sequence, frame segmentation of the video sequence, trajectory of camera movements in the video sequence and trajectory of a moving feature of a frame segment of the video sequence.
[00115] Table 1 below shows examples of signals detected and monitored by the method 500. Table 2 below shows examples of methods for detecting features in the video sequence.
Signal: Feature detected
Description: Polygonal areas of frames that contain a feature.
Example detection method: See Table 2.

Signal: Frame segmentation
Description: Find edges as paths; find contours as polygons; areas of similar colour or patterns.
Example detection method: Superpixels (for example: Chang, J., Wei, D., & Fisher, J. W. (2013, June). A video representation using temporal superpixels. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on (pp. 2051-2058). IEEE.); Canny edge detector; line detectors.

Signal: Trajectory of camera movements
Description: Zoom in or out; camera panning. Can be stored as directed paths.
Example detection method: Camera motion (zoom, pan or freehand motion) is determined using techniques such as PTAM, the parallel tracking and mapping algorithm (Klein, G., & Murray, D. (2007, November). Parallel tracking and mapping for small AR workspaces. In Mixed and Augmented Reality, 2007. ISMAR 2007. 6th IEEE and ACM International Symposium on (pp. 225-234). IEEE.).

Signal: Trajectory of moving feature or frame segment
Description: Motion of features or frame segments. Can be stored as directed paths.
Example detection method: Tracking (for example: Chang, J., Wei, D., & Fisher, J. W. (2013, June). A video representation using temporal superpixels. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on (pp. 2051-2058). IEEE.).

Table 1

[00116] The method 500 starts at step 510. The application 133 executes at step 510 to perform frame segmentation. Each segment determined is then added to a signal database, stored for example on the memory 109. Table 2 below shows examples of methods for detecting features in the video sequence, as referenced in Table 1.
Feature: Object
Description: An object in the foreground.
Example detection method: Foreground detection (A. Elgammal, D. Harwood, and L. Davis, "Non-parametric model for background subtraction," in Proc. Eur. Conf. on Computer Vision, Lect. Notes Comput. Sci. 1843, 751-767, 2000.).

Feature: Person
Description: Person detected on a frame, including body detection.
Example detection method: Human body detector (Felzenszwalb, P. F., Girshick, R. B., McAllester, D., & Ramanan, D. (2010). Object detection with discriminatively trained part-based models. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 32(9), 1627-1645.); person identifier (Nakajima, C., Pontil, M., Heisele, B., & Poggio, T. (2003). Full-body person recognition system. Pattern Recognition, 36(9), 1997-2006.).

Feature: Face
Description: Face detected on a frame.
Example detection method: Face detector (Viola, P., & Jones, M. J. (2004). Robust real-time face detection. International Journal of Computer Vision, 57(2), 137-154.).

Feature: Artefact
Description: Out-of-focus or blurry area; over- or under-exposed area.
Example detection method: Blur detection (Tong, H., Li, M., Zhang, H., & Zhang, C. (2004, June). Blur detection for digital images using wavelet transform. In Multimedia and Expo, 2004. ICME'04. 2004 IEEE International Conference on (Vol. 1, pp. 17-20). IEEE.); over/under-exposure detection (Yoon, Y. J., Byun, K. Y., Lee, D. H., Jung, S. W., & Ko, S. J. (2014). A New Human Perception-Based Over-Exposure Detection Method for Color Images. Sensors, 14(9), 17159-17173.).

Table 2

[00117] The method 500 operates to create or update a database of the monitored signals. In one arrangement, this database of signals can be, for example, an in-memory data structure or an embedded SQL database. Table 3 shows an exemplary signal database structure for an embedded database arrangement. In such arrangements, the monitored signals are stored in a list where each entry is in the structure of Table 3.
Name: Signal Type
Description: Type of the signal stored.

Name: Confidence Score
Description: A number from 0 to 1 which serves as a confidence score of the signal monitoring and detection that took place.

Name: Signal Shape
Description: Points that make up the signal’s shape.

Name: Signal Timing
Description: Encodes a unique frame identifier corresponding to each point in the Signal Shape.

Name: Extra Shape Information
Description: Includes any extra information associated with the shape.

Table 3
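Purely as an illustration of the in-memory arrangement mentioned in paragraph [00117], each entry of Table 3 could be represented by a small record such as the following Python sketch; the field names and types are assumptions chosen to mirror the table.

from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple

@dataclass
class MonitoredSignal:
    signal_type: str                 # type of the signal stored, e.g. "face" or "trajectory"
    confidence_score: float          # 0.0 to 1.0, per Table 3
    signal_shape: List[Tuple[float, float]] = field(default_factory=list)  # points making up the shape
    signal_timing: List[str] = field(default_factory=list)                 # frame identifier per shape point
    extra_shape_info: Dict[str, Any] = field(default_factory=dict)         # any extra shape information

# The signal database of method 500 can then simply be a list of such records.
signal_database: List[MonitoredSignal] = []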
[00118] The method 500 progresses from step 510 to step 520. Step 520 operates to detect features of the video signal. Table 2 above shows examples of the features detected at step 520.
[00119] The detected features are each assigned a confidence score and added to the signal database kept during execution of the method 500.
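As a concrete but non-limiting example of the face detection referenced in Table 2, a Viola-Jones style detector from OpenCV could populate the signal database sketched above; the fixed confidence value is a placeholder, since the basic Haar cascade API does not report a per-detection score.

import cv2

# Haar cascade face detector, in the spirit of the Viola-Jones reference in Table 2.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_faces(frame_bgr, frame_id, database):
    """Detect faces in one frame and append them to the signal database.

    `database` is a list of MonitoredSignal records (see the earlier sketch);
    each detection is stored as the four corners of its bounding box.
    """
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5):
        corners = [(x, y), (x + w, y), (x + w, y + h), (x, y + h)]
        database.append(MonitoredSignal(
            signal_type="face",
            confidence_score=0.8,            # placeholder confidence
            signal_shape=corners,
            signal_timing=[frame_id] * len(corners)))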
[00120] In one arrangement, in execution of step 520 each frame of the video sequence is divided into superpixels. An outline of superpixel segments is used as a contour for a detected feature (face, human body, etc.) using superpixel segmentations and tracking (see for example: Chang, J., Wei, D., & Fisher, J. W. (2013, June). A video representation using temporal superpixels. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on (pp. 2051-2058). IEEE.).

[00121] The method 500 progresses from step 520 to step 530. At step 530, the application 133 executes to perform camera movement determination. Any determined trajectory of camera movement is stored as a signal in the database on the memory 109.

[00122] The method 500 continues from step 530 to step 540. At step 540, the application 133 executes to perform linking of equivalent signals across frames. Step 540 takes as an input the signals such as the features detected at step 520 and the segments determined at step 510. Step 540 then proceeds to link objects (that is, features and frame segments) that are sufficiently optically correlated (simulating judgement of a human eye) to be considered a translation of a single object. In one arrangement, superpixel segmentations and tracking are used (for example those described in: Chang, J., Wei, D., & Fisher, J. W. (2013, June). A video representation using temporal superpixels. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on (pp. 2051-2058). IEEE.).

[00123] The method 500 progresses from step 540 to step 550. Each group of linked objects is passed by step 540 to step 550. In execution of step 550, the application 133 determines the trajectory of each group by, for example, calculating the centre of mass of each area where the same object was detected, and using (i) that location as a point of a polyline or path; (ii) the frame where the object was detected as the point’s corresponding temporal information; and (iii) a copy of the monitored signals as Extra Shape Information. Each determined trajectory signal is stored in the database by execution of step 550.
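The trajectory construction of step 550 can be sketched as follows; the dictionary-based grouping of detections per frame, and the use of the mean of the outline points as a simple stand-in for the centre of mass, are assumptions made for the example.

def trajectory_from_group(linked_detections):
    """Build a trajectory signal for one group of linked objects (step 550).

    `linked_detections` is assumed to map a frame identifier to the list of
    (x, y) points outlining the object in that frame.  The centre of each
    outline becomes one point of the trajectory polyline, and the frame
    identifier is kept as that point's temporal information.
    """
    path, timing = [], []
    for frame_id, outline in sorted(linked_detections.items()):
        cx = sum(x for x, _ in outline) / len(outline)
        cy = sum(y for _, y in outline) / len(outline)
        path.append((cx, cy))
        timing.append(frame_id)
    return {"signal_type": "trajectory",
            "signal_shape": path,
            "signal_timing": timing}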
[00124] The method 500 progresses from step 550 to step 560. Execution of step 560 proceeds to limit the number of existing signals by, for example, deleting any signals whose Confidence Score is below a certain threshold. In some arrangements, the step 560 may be omitted from the method 500.
[00125] The method 500 progresses from step 560 to step 570. Execution of step 570 operates to index the monitored signals stored from steps 510 through 560 in order to provide for faster access and modification times when handling the monitored signals. In some arrangements, the step 570 may be omitted from the method 500.
[00126] The method 500 ends after execution of step 570.
[00127] The method 600 of generating and displaying annotation suggestions, as executed at step 430, is now described in relation to Figure 6. The method 600 may be implemented as one or more modules of the application 133 stored on the memory 109, and controlled by execution of the processor.

[00128] The method 600 starts at step 610. Step 610 executes to generate annotation suggestions based on the input received at step 420, including the monitored signals and detected user input from step 420.
[00129] Execution of the step 610 may be considered operation of a group of steps. Step 610 operates to process the list of monitored signals. Step 610 executes in relation to each monitored signal as indicated by step 620. Execution of step 610 operates to repeat steps 630 and 640 for each signal in the list of monitored signals. Execution of steps 630 and 640 for each monitored signal effectively causes the list of monitored signals to be filtered down to a list of signal suggestions. The list of suggestions forms the output of step 610. The signal suggestions identify potential signals to be associated with user input for annotation, effectively operating to identify one or more annotation suggestions.
[00130] Execution of step 620 may operate to pre-filter the list of signals by taking into account only those signals with a defining property. Examples of a defining property include (i) the signal having a particular signal type, (ii) the signal being within a number of frames of the newest available frame, and the like. The defining property may relate to a spatial portion of the user input associated with display of a frame, such as a region surrounding a point contact of the user’s finger with the touch screen 114. The point of interest may, alternatively, be based upon a trajectory across the frame based upon the user input, the signal type and behaviour of the signal type across a number of frames. Accordingly, the defining property identifies a point of interest of the frame according to the user input.
[00131] Step 630 is executed for each signal in the filtered list of suggestions determined in step 620. Execution of step 630 operates to determine a correspondence of each suggested signal to the user input. The correspondence is determined based on an amount by which the user input received so far corresponds to each given signal. The correspondence may be expressed as a score. An example of a correspondence score is a “Suggested Annotation Score”, being a decimal number from 0 to 1, with 0 indicating no correspondence and 1 indicating maximum correspondence between the suggested annotation and a user input. The Suggested Annotation Score is associated with the suggested signal by step 630, for example by storing the Suggested Annotation Score with reference to the suggested signal. Determining a correspondence of each monitored signal effectively operates to use the input to identify a point of interest in the display of the video sequence.

[00132] The method 600 continues from step 630 to step 640 for each filtered signal of step 620. Execution of step 640 operates to filter out annotation suggestions that fail a threshold test by comparing the Suggested Annotation Score (correspondence score) to a given or predetermined threshold. As an example, a threshold of 0.9 retains only those suggested annotations scoring in the top 10% of the score range. Filtering can be implemented by, for example, excluding from a list those suggested annotations which fail the threshold test.
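One possible realisation of steps 630 and 640 is sketched below in Python; the proximity-based scoring function and the pixel radius are assumptions standing in for whatever correspondence measure an implementation actually uses.

import math

def correspondence_score(input_points, signal_points, radius=20.0):
    """Score how well the input received so far matches one monitored signal (step 630).

    A simple proxy measure: the fraction of input points lying within `radius`
    pixels of some point of the signal's shape, giving a value from 0 (no
    correspondence) to 1 (maximum correspondence).
    """
    if not input_points or not signal_points:
        return 0.0
    near = sum(
        1 for ix, iy in input_points
        if any(math.hypot(ix - sx, iy - sy) <= radius for sx, sy in signal_points))
    return near / len(input_points)

def filter_suggestions(signals, input_points, threshold=0.9):
    """Step 640: keep only suggestions whose score passes the threshold test."""
    scored = [(s, correspondence_score(input_points, s["signal_shape"])) for s in signals]
    return [(s, score) for s, score in scored if score >= threshold]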
[00133] After the annotation suggestions have been generated by step 610 (by operation of the steps 620 through 640), the method 600 progresses to step 650. Step 650 executes to receive the filtered annotation suggestions from step 610 and adapt the filtered annotation suggestions for display by, for example, creating an indication for the suggested annotation. An indication for the suggested annotation may comprise, for example, a graphic representation of the suggested annotation such as an outline, a path, a polygon, a vectorial graphic, an icon, representative text, and the like.
[00134] Execution of step 650 may operate to vary some of the attributes of an indication for a suggested annotation (such as alpha level, colour, pen size, font, and the like) in accordance with the suggested annotation’s properties (such as Suggested Annotation Score) or the properties of the suggested annotation’s signal (such as signal type, temporal characteristics and the like).
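A small sketch of one such mapping from suggestion properties to display attributes follows; the particular attribute values are illustrative only.

def indication_style(score, signal_type):
    """Derive display attributes for a suggestion indication from its properties.

    Stronger suggestions are drawn more opaque, and trajectory suggestions are
    drawn with a thinner pen; the numbers are arbitrary choices for the sketch.
    """
    return {
        "alpha": 0.3 + 0.7 * score,                       # 0.3 (weak) to 1.0 (strong)
        "pen_size": 2 if signal_type == "trajectory" else 4,
        "colour": (255, 200, 0),                          # fixed highlight colour
    }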
[00135] A suggested annotation based on a trajectory signal may be adapted by operation of step 650 into an indication for display. Accordingly, the suggested annotation may represent the motion of a feature or frame segment based on the trajectory signal, for example motion of a person or object of the video sequence.
[00136] The method 600 proceeds from step 650 to step 660. The application 133 executes in step 660 to receive the indication and display one or more annotation suggestions in association with the video sequence on the touch screen 114. For example, the annotation suggestions may be displayed overlaid on display of the video sequence using the layer 240 of Fig. 2. The method 600 ends after execution of step 660.
[00137] Figures 7A, 7B and 7C show some of the problems that may occur when trying to freehand annotate a moving object. The display 114 is used to display live video, played in real-time. At a moment in time corresponding to a frame 710, shown in Fig. 7A, a subject 720 is displayed on the display 114. The user draws a freehand sketch or input 730a by interaction with the touch screen 114 whilst the subject 720 is in a stationary position, as shown in the frame 710.
[00138] At a later moment in time, indicated by a subsequent frame 711 shown in Fig. 7B, the subject 720 has started moving away from the stationary position shown in frame 710. Accordingly, although the freehand input 730a has been extended by progressive user interaction to an extended input 730b, the input has lost positional or spatial coherence as the input 730b is no longer consistent with the position and shape of the subject 720. Accordingly, error has been introduced in terms of the temporal and spatial coherence (correspondence) of the freehand input 730b in relation to the subject 720.
[00139] At a later moment in the video sequence, corresponding to a subsequent frame 712 shown in Fig. 7C, the subject 720 has moved off-frame, and does not form part of the display of the video sequence on the display 114. Accordingly, the user has no reference to follow the shape of the object 720. The freehand input still exists for the frame 712, having been extended by progressive user interaction to form an input 730c. However, as the subject 720 is no longer visible, the temporal and spatial coherence of the input 730c relative to the object 720 has been lost.
[00140] Turning to Figure 8A, a frame 810 of a video sequence illustrates a loss of positional coherence (also referred to as spatial coherence) for a user input. The loss of positional coherence is shown by comparison between an ideal contour 840 of a subject 820 when compared to a user interaction with the touch screen 114 received as an approximation 830 of the contour of the subject 820. The loss of positional coherence is due to several factors such as (i) human error, (ii) in touch screens, the user not being able to see the space directly under the user’s finger, and (iii) limits in input device precision (for example, precision of the touch-sensitive screen 114).
[00141] Application of an annotation to the video sequence corresponding to the user input in Fig. 8A may operate to apply to a signal determined at step 450 to correspond to the user input. Given the correspondence between the shape of the approximate contour 830 and the ideal contour 840, and signals associated with the object 820, the annotation may be applied to the object 820.

[00142] A frame 811 of a video sequence, shown in Fig. 8B, illustrates a loss of positional and temporal coherence between an ideal trajectory 850 of the subject 820 when compared to a human input received identifying an approximation 860 of the trajectory 850.
[00143] Application of an annotation to the video sequence corresponding to the user input in Fig. 8B may operate to apply to a signal determined at step 450 to correspond to the user input. Given the correspondence between the shape of the approximate trajectory 860 and the ideal trajectory 850, and signals associated with trajectory of the object 820, the annotation may be applied with respect to signals representing the trajectory of the object 820 across relevant frames of the video sequence.
[00144] In some implementations, the refinement of the annotation suggestions based on further input at step 440 uses detection of intent as additional information in the refining process. Examples of intended annotations or input gestures typically input by users of touch screens are shown in Figures 9A to 9D, including an encircling gesture 910 (Fig. 9A), a cross-out gesture 920 (Fig. 9B), a fill gesture 930 (Fig. 9C) and an arrow gesture 940 (Fig. 9D). Table 4 below shows a list of sample intents.
Intent type: Encircle
Description: Input forms an approximate circular shape around an area.
Example in Figs. 9A-9D: 910

Intent type: Cross out
Description: Input forms an approximate cross.
Example in Figs. 9A-9D: 920

Intent type: Fill
Description: Input significantly fills an area.
Example in Figs. 9A-9D: 930

Intent type: Arrow
Description: Input is in the shape of an arrow.
Example in Figs. 9A-9D: 940
Table 4

[00145] In one arrangement, intents may be detected by using a curve matching algorithm (e.g. Frenkel, M., & Basri, R. (2003, January). Curve matching using the fast marching method. In Energy Minimization Methods in Computer Vision and Pattern Recognition (pp. 35-51). Springer Berlin Heidelberg.) The method may be used to match the freehand input against a set of predefined shapes representative of intents such as the intents in Table 4.
Once an intent is detected, the application 133 can execute to use the detected intent to enhance the refinement mechanism by additionally using the predefined shape as the freehand input.
[00146] In some implementations, detecting an intent of type “cross out” may be done by calculating a number of changes of direction in a predetermined time, and using a predetermined threshold below which no “cross out” activity is taking place. A threshold may be for example 10 changes of direction per second.
[00147] An angle of direction may be determined at step 440 by implementation of Equation 3, where ATAN2 is the C-style ATAN2(Y, X) function, p1 and p2 are input points, and p2 was received more recently than p1.

ANGLE = ATAN2(p2.y - p1.y, p2.x - p1.x)     (Equation 3)

[00148] Further, in some arrangements, a change in direction is detected when the absolute value of the minimum angle between two consecutive angles is higher than a threshold. In one implementation, said threshold may be PI/2.
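The cross-out detection described in paragraphs [00146] to [00148] could be implemented along the following lines; the stroke duration argument and the default rate of 10 direction changes per second simply restate the example threshold given above.

import math

def count_direction_changes(points, angle_threshold=math.pi / 2):
    """Count changes of direction in a stroke, per Equation 3 and paragraph [00148].

    `points` are the (x, y) input points in the order received.  An angle is
    computed for each consecutive pair with atan2, and a change of direction
    is counted when the minimum angle between two consecutive angles exceeds
    the threshold (PI/2 by default).
    """
    angles = [math.atan2(p2[1] - p1[1], p2[0] - p1[0])
              for p1, p2 in zip(points, points[1:])]
    changes = 0
    for a1, a2 in zip(angles, angles[1:]):
        diff = abs(a2 - a1) % (2 * math.pi)
        diff = min(diff, 2 * math.pi - diff)   # minimum angle between the two directions
        if diff > angle_threshold:
            changes += 1
    return changes

def is_cross_out(points, duration_seconds, rate_threshold=10):
    """Cross-out intent: at least `rate_threshold` direction changes per second."""
    if duration_seconds <= 0:
        return False
    return count_direction_changes(points) / duration_seconds >= rate_threshold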
[00149] In some implementations, detecting an intent of type “encircle” may be used during execution of refinement step 440 to additionally refine annotation suggestions that are wholly or significantly within the encircled area.
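For the "encircle" intent, a standard point-in-polygon test can decide whether a suggestion lies wholly or significantly within the encircled area; the 0.8 cut-off for "significantly within" is an assumption made for the sketch.

def point_in_polygon(point, polygon):
    """Ray-casting test for whether `point` lies inside the closed `polygon`."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def refine_by_encircle(suggestions, encircle_points, min_fraction=0.8):
    """Keep suggestions whose shape lies wholly or significantly inside the circled area."""
    kept = []
    for suggestion in suggestions:
        shape = suggestion["signal_shape"]
        inside = sum(1 for p in shape if point_in_polygon(p, encircle_points))
        if shape and inside / len(shape) >= min_fraction:
            kept.append(suggestion)
    return kept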
[00150] In some implementations, detecting an intent of type “cross out” may be used during refinement step 440 to additionally refine annotation suggestions whose centres are close (to within a threshold) to the point where the segments of the “cross out” intent overlap.
[00151] In some implementations, detecting an intent of type “arrow” may be used during refinement step 440 to additionally refine annotation suggestions whose centres are close (to within a threshold) to the point the arrow points at.

[00152] In some implementations, detecting an intent of type “arrow” may be used during refinement step 440 to additionally refine annotation suggestions whose trajectories follow the detected arrow’s orientation and direction.
[00153] In some implementations, detecting an intent of type “fill” may be used during refinement step 440 to additionally refine annotation suggestions which correspond to the convex hull of the points that comprise said intent.
[00154] In displaying annotation suggestions based upon a user input and the monitored signals, and refining the annotation suggestions based upon further input, the arrangements described can reduce error, or loss of spatial and temporal coherence, between a user input for annotation and a relevant object in the video sequence. The arrangements and implementations described allow a user to apply annotations during display of the video sequence. Presenting and refining relevant suggestions of annotations based upon progressive input can assist the user in more accurately and quickly applying a desired annotation. The arrangements described are accordingly particularly suitable for freehand input using a touch screen.
[00155] The arrangements described are applicable to the computer and data processing industries and particularly for the digital cinema industries.
[00156] The foregoing describes only some embodiments of the present invention, and modifications and/or changes can be made thereto without departing from the scope and spirit of the invention, the embodiments being illustrative and not restrictive.
[00157] In the context of this specification, the word “comprising” means “including principally but not necessarily solely” or “having” or “including”, and not “consisting only of”. Variations of the word “comprising”, such as “comprise” and “comprises”, have correspondingly varied meanings.

Claims (19)

1. A computer-implemented method of applying an annotation to a portion of a video sequence, comprising:
monitoring a plurality of signals associated with the video sequence during display of the video sequence;
detecting, during the display of the video sequence, an input identifying a point of interest with respect to at least one frame of the video sequence;
displaying, during the display of the video sequence, one or more annotation suggestions based on the monitored signals and the identified point of interest;
refining the one or more of the annotation suggestions based on further input detected with respect to at least one subsequent frame during the display of the video sequence; and
where the further input corresponds to a single one of the one or more annotation suggestions, applying the single one annotation to the portion of the video sequence corresponding to the input.
2. The method according to claim 1, wherein the display of the video sequence is one of display during capture of the video sequence and display after capture of the video sequence.
3. The method according to claim 1, wherein at least one of the one or more of the annotation suggestions represents motion of an object or a person in the video sequence.
4. The method according to claim 1, wherein refining the one or more annotation suggestions comprises ceasing display of at least one of the annotation suggestions where the further input does not correspond to the at least one of the annotation suggestions.
5. The method according to claim 1, wherein refining the one or more annotation suggestions comprises displaying one or more additional annotation suggestions based on the monitored signals and the further input.
6. The method according to claim 1, wherein refining the one or more annotation suggestions comprises determining a correspondence score for each suggested annotation based on the further input and refining the number of annotation suggestions based on the correspondence scores.
7. The method according to claim 6, wherein refining the number of annotation suggestions based on the correspondence scores comprises comparing each correspondence score to a predetermined threshold.
8. The method according to claim 1, wherein the monitored signals comprise one or more of a feature detected in the video sequence, frame segmentation of the video sequence, trajectory of camera movements in the video sequence and trajectory of a moving feature of a frame segment of the video sequence.
9. The method according to claim 1, wherein the at least one of the one or more annotation suggestions is displayed as a path suggesting a completion of the input.
10. The method according to claim 1, wherein the input is a contact by a user on a touch-sensitive display, the touch-sensitive display displaying the video sequence.
11. The method according to claim 10, wherein the further input with respect to the one or more subsequent frames comprises progression of the contact on the touch-sensitive display.
12. The method according to claim 1, wherein the one or more annotation suggestions are displayed in association with the video sequence.
13. The method according to claim 1, wherein the detected input is associated with a spatial portion of the at least one video frame.
14. The method according to claim 1, wherein, where the further input is determined not to correspond to any of the at least one annotation suggestions, the annotation applied to the portion of the video sequence is an annotation reflecting a trajectory of the user input.
15. The method according to claim 1, wherein applying the single one annotation to the portion comprises displaying the one single annotation in association with the video sequence, and storing the one single annotation in relation to the video sequence.
16. The method according to claim 1, wherein the refining comprises determining an intent of the input.
17. A non-transitory computer readable storage medium having a computer program stored thereon for applying an annotation to a portion of a video sequence, comprising:
code for monitoring a plurality of signals associated with the video sequence during display of the video sequence;
code for detecting, during the display of the video sequence, an input identifying a point of interest with respect to at least one frame of the video sequence;
code for displaying, during the display of the video sequence, one or more annotation suggestions based on the monitored signals and the identified point of interest;
code for refining the one or more of the annotation suggestions based on further input detected with respect to at least one subsequent frame during the display of the video sequence; and
code for, where the further input corresponds to a single one of the one or more annotation suggestions, applying the single one annotation to the portion of the video sequence corresponding to the input.
18. A system, comprising:
an interactive display device, the interactive display device being configured to:
monitor a plurality of signals associated with a video sequence during display of the video sequence;
detect, during the display of the video sequence, an input identifying a point of interest with respect to at least one frame of the video sequence;
display, during the display of the video sequence, one or more annotation suggestions based on the monitored signals and the identified point of interest;
refine the one or more of the annotation suggestions based on further input detected with respect to at least one subsequent frame during the display of the video sequence; and
where the further input corresponds to a single one of the one or more annotation suggestions, apply the single one annotation to a portion of the video sequence corresponding to the input.
19. An interactive display device for applying an annotation to a portion of a video sequence, comprising:
a memory;
a processor coupled to the memory for executing a computer program, said computer program comprising instructions for:
monitoring a plurality of signals associated with the video sequence during display of the video sequence;
detecting, during the display of the video sequence, an input associated with a spatial portion of the at least one video frame, and identifying a point of interest with respect to a portion of at least one frame of the video sequence;
displaying, during the display of the video sequence, one or more annotation suggestions based on the monitored signals and the identified point of interest, the one or more annotation suggestions being displayed in association with the display of the video sequence;
refining the one or more of the annotation suggestions based on further input detected with respect to at least one subsequent frame during the display of the video sequence; and
where the further input corresponds to a single one of the one or more annotation suggestions, applying the single one annotation to the portion of the video sequence corresponding to the input.