AU2018282254A1 - System and method for determining a three-dimensional position of a person - Google Patents


Info

Publication number
AU2018282254A1
Authority
AU
Australia
Prior art keywords
person
homography
plane
determining
scene
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
AU2018282254A
Inventor
Amit Kumar Gupta
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Canon Inc
Original Assignee
Canon Inc
Application filed by Canon Inc
Priority to AU2018282254A
Publication of AU2018282254A1
Status: Abandoned


Abstract

SYSTEM AND METHOD FOR DETERMINING A THREE-DIMENSIONAL POSITION OF A PERSON
A system and method of determining a three-dimensional position of a person in a video. The method comprises identifying (110) a person of interest in a scene of a first frame of a video captured using a camera, the person of interest having an expected movement property in a first homography plane with respect to the camera; detecting (120) a deviation in the expected movement property of the person in the scene at a time of a subsequent frame of the video captured by the camera; and determining (130) a planar position of the person using the first homography plane at the time of the further frame in response to the detected deviation. The method also comprises determining (140) a three-dimensional position of the person in the scene at the time of the further frame using a second homography plane with respect to the camera based on the determined planar position, the second homography plane being non-parallel to the first homography plane.

Description

[Fig. 1: flow chart of the method 100. Start; 110: identify a person of interest; 120: detect deviation in expected movement property; 130: determine planar position of the person using first homography plane; 140: determine 3D position using second homography plane; 199: End]
SYSTEM AND METHOD FOR DETERMINING A THREE-DIMENSIONAL POSITION OF A PERSON
TECHNICAL FIELD
[0001] The present invention relates to a method, system and apparatus for 3D position estimation of athletes using images captured by a single camera.
BACKGROUND
[0002] Three-dimensional (3D) position estimation is an important step in many computer vision applications, such as estimating motion patterns of athletes in a sports scenario, designing visual systems for robots, and the like.
[0003] Many methods have been proposed for estimating the 3D position of objects in a scene. One existing method records the scene with multiple cameras and uses multiple-view geometry to estimate 3D position. The use of multiple cameras incurs the disadvantages of higher cost (more than one camera is required) and of multi-camera calibration.
[0004] In another existing method, a machine learning model is trained to estimate 3D position. However, the machine learning system requires relevant training data, which restricts usage for new viewpoints and scenes.
[0005] In yet another existing method, a planar object (ground plane) is tracked to estimate the 3D position of a moving camera (such as a drone). The method estimates homographies between images taken by the moving camera, and the estimated homographies are used to implement real-time object tracking. The method assumes that the reference object is usually a flat surface atop level ground, and pyramidal optical flow and corner detection are used to find points of correspondence between two frames. The method suffers the disadvantage that images from the viewpoint of the object to be tracked are required, limiting applicability for scenarios where the 3D positions of multiple objects are required and a single camera is available.
[0006] Thus, there is a need for improved 3D position estimation.
SUMMARY
[0007] It is an object of the present invention to substantially overcome, or at least ameliorate, at least one disadvantage of present arrangements.
[0008] The arrangements described relate to a method of estimating 3D position of a person.
[0009] One aspect of the present disclosure provides a method of determining a three dimensional position of a person in a video, the method comprising: identifying a person of interest in a scene of a first frame of a video captured using a camera, the person of interest having an expected movement property in a first homography plane with respect to the camera; detecting a deviation in the expected movement property of the person in the scene at a time of a subsequent frame of the video captured by the camera; determining a planar position of the person using the first homography plane at the time of the further frame in response to the detected deviation; and determining a three-dimensional position of the person in the scene at the time of the further frame using a second homography plane with respect to the camera based on the determined planar position, the second homography plane being non-parallel to the first homography plane.
[00010] According to some aspects, the deviation in the expected movement property is determined using a convolutional neural network.
[00011] According to some aspects, the deviation in the expected movement property is determined based upon known constraints relating to a position of the person in the first frame.
[00012] According to some aspects, determining the three-dimensional position of the person comprises determining a homography cube using the first homography plane and a parallel homography plane.
[00013] According to some aspects, detecting the deviation in the expected movement property of the person comprises determining that the person is on a ground plane of the scene.
[00014] According to some aspects, detecting the deviation in the expected movement property of the person comprises determining that the person is in the air.
[00015] According to some aspects, the deviation in the expected movement property of the person indicates whether the person is jumping or not jumping.
[00016] According to some aspects, the scene is a basketball court and determining the three-dimensional position of the person comprises determining a boundary of the court using line detection.
[00017] According to some aspects, the scene is a basketball court, the camera is a moving camera and determining the three-dimensional position of the person comprises determining a boundary of the court of the scene using line detection for each frame of the video.
[00018] According to some aspects, the deviation relates to the person being on the ground and determining the three-dimensional location of the person relates to determining a location of pixels of feet of the person.
[00019] According to some aspects, the deviation relates to the person being in the air and determining the three-dimensional location of the person comprises identifying a last frame of the video containing the person being in the expected movement position.
[00020] According to some aspects, determining the three-dimensional position comprises determining a vertical homography plane using two horizontal homography planes.
[00021] According to some aspects, determining the three-dimensional position comprises using known structure of the scene to determine a boundary associated with the scene.
[00022] According to some aspects, determining the three-dimensional position of the person comprises determining a homography cube using the first homography plane and a parallel homography plane and determining the second homography plane at any location within the cube.
[00023] Another aspect of the present disclosure provides a non-transitory computer readable medium having a computer program stored thereon to implement a method of determining a three-dimensional position of a person in a video, the program comprising: code for identifying a person of interest in a scene of a first frame of a video captured using a camera, the person of interest having an expected movement property in a first homography plane with respect to the camera; code for detecting a deviation in the expected movement property of the person in the scene at a time of a subsequent frame of the video captured by the camera; code for determining a planar position of the person using the first homography plane at the time of the further frame in response to the detected deviation; and code for determining a three-dimensional position of
the person in the scene at the time of the further frame using a second homography plane with respect to the camera based on the determined planar position, the second homography plane being non-parallel to the first homography plane.
[00024] Another aspect of the present disclosure provides a system, comprising: a camera configured to capture video of a scene; a memory; a processor, wherein the processor is configured to execute code stored on the memory for implementing a method comprising: receiving video captured using the camera; identifying a person of interest in a scene of a first frame of the video, the person of interest having an expected movement property in a first homography plane with respect to the camera; detecting a deviation in the expected movement property of the person in the scene at a time of a subsequent frame of the video captured by the camera; determining a planar position of the person using the first homography plane at the time of the further frame in response to the detected deviation; and determining a three dimensional position of the person in the scene at the time of the further frame using a second homography plane with respect to the camera based on the determined planar position, the second homography plane being non-parallel to the first homography plane.
[00025] Another aspect of the present disclosure provides apparatus, comprising: a memory; a processor configured to execute code stored on the memory to implement a method of determining a three-dimensional position of a person in a video, the method comprising: identifying a person of interest in a scene of a first frame of a video captured using a camera, the person of interest having an expected movement property in a first homography plane with respect to the camera; detecting a deviation in the expected movement property of the person in the scene at a time of a subsequent frame of the video captured by the camera; determining a planar position of the person using the first homography plane at the time of the further frame in response to the detected deviation; and determining a three-dimensional position of the person in the scene at the time of the further frame using a second homography plane with respect to the camera based on the determined planar position, the second homography plane being non-parallel to the first homography plane.
[00026] Other aspects are also described.
BRIEF DESCRIPTION OF THE DRAWINGS
[00027] One or more example embodiments of the invention will now be described with reference to the following drawings, in which:
[00028] Fig. 1 shows a flow chart of a method of estimating 3D position of an athlete;
[00029] Fig. 2 shows a flow chart of a method of identifying a person of interest in the video sequence for 3D position estimation;
[00030] Fig. 3 shows a flow chart of a method of training a person detector;
[00031] Fig. 4 shows an example of image level region of interest for person of interest detection on a court;
[00032] Fig. 5 shows an example sequence of athletes in different poses;
[00033] Fig. 6 shows an example comparison of pose detector output for two different poses;
[00034] Figs. 7A and 7B show a court and marking of known landmarks on the court for homography estimation;
[00035] Figs. 8A and 8B illustrate estimation of court cube homography planes;
[00036] Fig. 9 shows a sequence of three frames showing motion of the athlete from running to jumping;
[00037] Figs. 10A and 10B show estimation of homography planes passing at the location of an athlete;
[00038] Fig. 11 illustrates image and ground planes to explain homography relationship;
[00039] Fig. 12 shows a world coordinate system in relation to an image coordinate system;
[00040] Fig. 13 shows a method for estimation of 3D coordinates of a person of interest bounding box;
[00041] Fig. 14A shows a system for determining a three-dimensional position of an object of interest;
[00042] Figs. 14B and 14C collectively form a schematic block diagram representation of an electronic device upon which described arrangements can be practised; and
[00043] Figs. 15A and 15B show a pole jump scenario to demonstrate a deviation in expected movement property.
DETAILED DESCRIPTION INCLUDING BEST MODE
[00044] Where reference is made in any one or more of the accompanying drawings to steps and/or features, which have the same reference numerals, those steps and/or features have for the purposes of this description the same function(s) or operation(s), unless the contrary intention appears.
[00045] Fig. 1 shows a method 100 of determining 3D position of a person of interest in an image sequence. In the example described, the image sequence is assumed to be of a sports scene such as a basketball game, and the person of interest an athlete partaking in a sports event. The arrangements described may also relate to determining a location of a person of interest in other use cases such as in a street environment, an airport environment, a theatre environment or the like. The arrangements described may also be applied to determine a location of objects of interest in a scene based on the detection of a deviation in expected location or movement of the objects.
[00046] Fig. 14A shows a system 1400 on which the arrangements described can be practiced. An image capture device 1490 captures images including a scene 1495. The image capture device may be stationary (fixed) or moving. The scene 1495 relates to a basketball court in the example of Fig. 14A but may relate to other types of scenes such as a stadium, a street, a shopping centre, or the like. The image capture device 1490 can be any image capture device capable of capturing video or images of the scene 1495, whether a full scene or a partial scene and communicating captured images via a network. For example, the image capture device may be a digital video camera and is referred to hereafter as a camera. The image capture device 1490 is capable of communication with a computing module 1401 via a network connection 1421. The method 100 operates to determine a location of a person of interest in the scene 1495
using video captured by the device 1490. Figs. 14B and 14C depict the system 1400, upon which the various arrangements described can be practiced.
[00047] As seen in Fig. 14B, the computer system 1400 includes: the computer module 1401; input devices such as a keyboard 1402, a mouse pointer device 1403, a scanner 1426, a camera 1427, and a microphone 1480; and output devices including a printer 1415, a display device 1414 and loudspeakers 1417. An external Modulator-Demodulator (Modem) transceiver device 1416 may be used by the computer module 1401 for communicating to and from a communications network 1420 via the connection 1421. The communications network 1420 may be a wide-area network (WAN), such as the Internet, a cellular telecommunications network, or a private WAN. Where the connection 1421 is a telephone line, the modem 1416 may be a traditional "dial-up" modem. Alternatively, where the connection 1421 is a high capacity (e.g., cable) connection, the modem 1416 may be a broadband modem. A wireless modem may also be used for wireless connection to the communications network 1420. As described in relation to Fig. 14A, the module 1401 can communicate with the image capture device 1490 via the connection 1421.
[00048] The computer module 1401 typically includes at least one processor unit 1405, and a memory unit 1406. For example, the memory unit 1406 may have semiconductor random access memory (RAM) and semiconductor read only memory (ROM). The computer module 1401 also includes a number of input/output (I/O) interfaces including: an audio-video interface 1407 that couples to the video display 1414, loudspeakers 1417 and microphone 1480; an I/O interface 1413 that couples to the keyboard 1402, mouse 1403, scanner 1426, camera 1427 and optionally a joystick or other human interface device (not illustrated); and an interface 1408 for the external modem 1416 and printer 1415. In some implementations, the modem 1416 may be incorporated within the computer module 1401, for example within the interface 1408. The computer module 1401 also has a local network interface 1411, which permits coupling of the computer system 1400 via a connection 1423 to a local-area communications network 1422, known as a Local Area Network (LAN). As illustrated in Fig. 14B, the local communications network 1422 may also couple to the wide network 1420 via a connection 1424, which would typically include a so-called "firewall" device or device of similar functionality. The local network interface 1411 may comprise an Ethernet circuit card, a Bluetooth® wireless arrangement or an IEEE 802.11 wireless arrangement; however, numerous other types of interfaces may be practiced for the interface 1411.
[00049] The I/O interfaces 1408 and 1413 may afford either or both of serial and parallel connectivity, the former typically being implemented according to the Universal Serial Bus (USB) standards and having corresponding USB connectors (not illustrated). Storage devices 1409 are provided and typically include a hard disk drive (HDD) 1410. Other storage devices such as a floppy disk drive and a magnetic tape drive (not illustrated) may also be used. An optical disk drive 1412 is typically provided to act as a non-volatile source of data. Portable memory devices, such as optical disks (e.g., CD-ROM, DVD, Blu-ray Disc™), USB-RAM, portable, external hard drives, and floppy disks, for example, may be used as appropriate sources of data to the system 1400.
[00050] The components 1405 to 1413 of the computer module 1401 typically communicate via an interconnected bus 1404 and in a manner that results in a conventional mode of operation of the computer system 1400 known to those in the relevant art. For example, the processor 1405 is coupled to the system bus 1404 using a connection 1418. Likewise, the memory 1406 and optical disk drive 1412 are coupled to the system bus 1404 by connections 1419. Examples of computers on which the described arrangements can be practised include IBM-PC's and compatibles, Sun Sparcstations, Apple Mac™ or like computer systems.
[00051] The method of determining a three-dimensional position of a person of interest may be implemented using the computer system 1400 wherein the processes of Figs. 1 to 3 and 13, to be described, may be implemented as one or more software application programs 1433 executable within the computer system 1400. In particular, the steps of the methods described are effected by instructions 1431 (see Fig. 14C) in the software 1433 that are carried out within the computer system 1400. The software instructions 1431 may be formed as one or more code modules, each for performing one or more particular tasks. The software may also be divided into two separate parts, in which a first part and the corresponding code modules performs the described methods and a second part and the corresponding code modules manage a user interface between the first part and the user.
[00052] The software may be stored in a computer readable medium, including the storage devices described below, for example. The software is loaded into the computer system 1400 from the computer readable medium, and then executed by the computer system 1400. A computer readable medium having such software or computer program recorded on the computer readable medium is a computer program product. The use of the computer program
product in the computer system 1400 preferably effects an advantageous apparatus for determining a three-dimensional position of a person using the arrangements described.
[00053] The software 1433 is typically stored in the HDD 1410 or the memory 1406. The software is loaded into the computer system 1400 from a computer readable medium, and executed by the computer system 1400. Thus, for example, the software 1433 may be stored on an optically readable disk storage medium (e.g., CD-ROM) 1425 that is read by the optical disk drive 1412. A computer readable medium having such software or computer program recorded on it is a computer program product. The use of the computer program product in the computer system 1400 preferably effects an apparatus for determining a location of a person or object of interest.
[00054] In some instances, the application programs 1433 may be supplied to the user encoded on one or more CD-ROMs 1425 and read via the corresponding drive 1412, or alternatively may be read by the user from the networks 1420 or 1422. Still further, the software can also be loaded into the computer system 1400 from other computer readable media. Computer readable storage media refers to any non-transitory tangible storage medium that provides recorded instructions and/or data to the computer system 1400 for execution and/or processing. Examples of such storage media include floppy disks, magnetic tape, CD-ROM, DVD, Blu-ray™ Disc, a hard disk drive, a ROM or integrated circuit, USB memory, a magneto-optical disk, or a computer readable card such as a PCMCIA card and the like, whether or not such devices are internal or external of the computer module 1401. Examples of transitory or non-tangible computer readable transmission media that may also participate in the provision of software, application programs, instructions and/or data to the computer module 1401 include radio or infra-red transmission channels as well as a network connection to another computer or networked device, and the Internet or Intranets including e-mail transmissions and information recorded on Websites and the like.
[00055] The second part of the application programs 1433 and the corresponding code modules mentioned above may be executed to implement one or more graphical user interfaces (GUIs) to be rendered or otherwise represented upon the display 1414. Through manipulation of typically the keyboard 1402 and the mouse 1403, a user of the computer system 1400 and the application may manipulate the interface in a functionally adaptable manner to provide controlling commands and/or input to the applications associated with the GUI(s). Other forms of functionally adaptable user interfaces may also be implemented, such as an audio interface
utilizing speech prompts output via the loudspeakers 1417 and user voice commands input via the microphone 1480.
[00056] Fig. 14C is a detailed schematic block diagram of the processor 1405 and a "memory" 1434. The memory 1434 represents a logical aggregation of all the memory modules (including the HDD 1409 and semiconductor memory 1406) that can be accessed by the computer module 1401 in Fig. 14B.
[00057] When the computer module 1401 is initially powered up, a power-on self-test (POST) program 1450 executes. The POST program 1450 is typically stored in a ROM 1449 of the semiconductor memory 1406 of Fig. 14B. A hardware device such as the ROM 1449 storing software is sometimes referred to as firmware. The POST program 1450 examines hardware within the computer module 1401 to ensure proper functioning and typically checks the processor 1405, the memory 1434 (1409, 1406), and a basic input-output systems software (BIOS) module 1451, also typically stored in the ROM 1449, for correct operation. Once the POST program 1450 has run successfully, the BIOS 1451 activates the hard disk drive 1410 of Fig. 14B. Activation of the hard disk drive 1410 causes a bootstrap loader program 1452 that is resident on the hard disk drive 1410 to execute via the processor 1405. This loads an operating system 1453 into the RAM memory 1406, upon which the operating system 1453 commences operation. The operating system 1453 is a system level application, executable by the processor 1405, to fulfil various high level functions, including processor management, memory management, device management, storage management, software application interface, and generic user interface.
[00058] The operating system 1453 manages the memory 1434 (1409, 1406) to ensure that each process or application running on the computer module 1401 has sufficient memory in which to execute without colliding with memory allocated to another process. Furthermore, the different types of memory available in the system 1400 of Fig. 14B must be used properly so that each process can run effectively. Accordingly, the aggregated memory 1434 is not intended to illustrate how particular segments of memory are allocated (unless otherwise stated), but rather to provide a general view of the memory accessible by the computer system 1400 and how such is used.
[00059] As shown in Fig. 14C, the processor 1405 includes a number of functional modules including a control unit 1439, an arithmetic logic unit (ALU) 1440, and a local or internal
memory 1448, sometimes called a cache memory. The cache memory 1448 typically includes a number of storage registers 1444 - 1446 in a register section. One or more internal busses 1441 functionally interconnect these functional modules. The processor 1405 typically also has one or more interfaces 1442 for communicating with external devices via the system bus 1404, using a connection 1418. The memory 1434 is coupled to the bus 1404 using a connection 1419.
[00060] The application program 1433 includes a sequence of instructions 1431 that may include conditional branch and loop instructions. The program 1433 may also include data 1432 which is used in execution of the program 1433. The instructions 1431 and the data 1432 are stored in memory locations 1428, 1429, 1430 and 1435, 1436, 1437, respectively. Depending upon the relative size of the instructions 1431 and the memory locations 1428-1430, a particular instruction may be stored in a single memory location as depicted by the instruction shown in the memory location 1430. Alternately, an instruction may be segmented into a number of parts each of which is stored in a separate memory location, as depicted by the instruction segments shown in the memory locations 1428 and 1429.
[00061] In general, the processor 1405 is given a set of instructions which are executed therein. The processor 1405 waits for a subsequent input, to which the processor 1405 reacts by executing another set of instructions. Each input may be provided from one or more of a number of sources, including data generated by one or more of the input devices 1402, 1403, data received from an external source across one of the networks 1420, 1422, data retrieved from one of the storage devices 1406, 1409 or data retrieved from a storage medium 1425 inserted into the corresponding reader 1412, all depicted in Fig. 14B. The execution of a set of the instructions may in some cases result in output of data. Execution may also involve storing data or variables to the memory 1434.
[00062] The described arrangements use input variables 1454, which are stored in the memory 1434 in corresponding memory locations 1455, 1456, 1457. The described arrangements produce output variables 1461, which are stored in the memory 1434 in corresponding memory locations 1462, 1463, 1464. Intermediate variables 1458 may be stored in memory locations 1459, 1460, 1466 and 1467.
[00063] Referring to the processor 1405 of Fig. 14C, the registers 1444, 1445, 1446, the arithmetic logic unit (ALU) 1440, and the control unit 1439 work together to perform
sequences of micro-operations needed to perform "fetch, decode, and execute" cycles for every instruction in the instruction set making up the program 1433. Each fetch, decode, and execute cycle comprises: a fetch operation, which fetches or reads an instruction 1431 from a memory location 1428, 1429, 1430; a decode operation in which the control unit 1439 determines which instruction has been fetched; and an execute operation in which the control unit 1439 and/or the ALU 1440 execute the instruction.
[00064] Thereafter, a further fetch, decode, and execute cycle for the next instruction may be executed. Similarly, a store cycle may be performed by which the control unit 1439 stores or writes a value to a memory location 1432.
[00065] Each step or sub-process in the processes of Figs. 1-3 and 13 is associated with one or more segments of the program 1433 and is performed by the register section 1444, 1445, 1446, the ALU 1440, and the control unit 1439 in the processor 1405 working together to perform the fetch, decode, and execute cycles for every instruction in the instruction set for the noted segments of the program 1433.
[00066] The methods described may alternatively be implemented in dedicated hardware such as one or more integrated circuits performing the functions or sub functions of Figs. 1-3 and 13. Such dedicated hardware may include graphic processors, digital signal processors, or one or more microprocessors and associated memories.
[00067] The method 100 is typically implemented as one or more modules of the application 1433, stored in the memory 1406 and controlled under execution of the processor 1405. The method 100 starts with an identification of a person of interest step 110. Step 110 operates to detect a person of interest in a video sequence captured by the camera 1490 and to output a bounding box corresponding to the person of interest, for example representing a location of the person of interest in a current image frame. In one implementation, the person of interest is an input to the system in the form of bounding box specification for the person of interest in each frame of the image (video) sequence.
[00068] In another implementation, the method 100 receives image sequences as input from the camera 1490, and executes to apply a person detector to each video frame at person detection step 110. In one embodiment, the person detector is trained by utilizing a Histograms of Oriented Gradients (HOG) descriptor and a Support Vector Machine (SVM) classifier. Such a person detector is known as a HOG people detector. The HOG descriptor represents an entire person by a single feature vector. The HOG people detector uses a sliding window detection approach for detecting the occurrence of people in each video frame. At each detection window, a HOG descriptor is computed. The descriptor is passed to the trained SVM classifier, which classifies the window as either "person" or "not person". To detect people at different scales, each video frame is sub-sampled to produce multiple copies. The same HOG people detector can then be applied to each sub-sampled video frame.
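By way of illustration only, such a HOG people detector can be sketched with OpenCV's built-in HOG descriptor and its pre-trained pedestrian SVM, rather than a classifier trained as described above; the file name and detection parameters below are illustrative assumptions:

import cv2

# Minimal sketch of a HOG people detector using OpenCV's built-in
# descriptor and pre-trained pedestrian SVM (not a classifier trained
# on the data described above).
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

frame = cv2.imread("frame_0001.png")  # hypothetical video frame

# detectMultiScale slides the detection window over an image pyramid,
# corresponding to the sub-sampling step described above; it returns
# one bounding box per detected person.
boxes, weights = hog.detectMultiScale(frame, winStride=(8, 8),
                                      padding=(8, 8), scale=1.05)
for (x, y, w, h) in boxes:
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)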
[00069] An alternative implementation of step 110 generates a person detector using a trained deep convolutional neural network. Fig. 3 shows a method 300 of training a deep convolutional neural network for use in some implementations of the step 110. The method 300 is typically implemented as one or more modules of the application 1433 stored in the memory 1406 and controlled under execution of the processor 1405.
[00070] Operation of the method 300 of training a deep convolutional neural network based person detector is now described. The method 300 receives images 301 of people and annotated data of bounding boxes of people 302 in the training images 301. The training method 300 involves iterations for a pre-defined number of epochs, total epochs, as determined at a check step 303. If the current number of epochs is less than the predefined number of total epochs ("Yes" at step 303), the method 300 continues to a check step 304. Otherwise ("No" at step 303), the method 300 ends.
[00071] Every epoch is divided into a number of batches, with a pre-defined number of samples in each batch. Step 304 checks whether the current batch number is less than the pre-defined (total) number of batches. If the batch number is less than the total, the method 300 continues to a modelling step 306. Otherwise, once all batches have been processed in an epoch ("No" at step 304), the method 300 continues to an increment epoch step 305.
[00072] At execution of step 306, a pre-decided convolutional neural network (CNN) model is used to process an input image of the samples 301. Operation of the step 306 generates a bounding box and probability of person detected 307. The method 300 continues from step 306
to a generating step 308. The generated results of step 306 are compared with ground truth annotations for the input image data and a corresponding loss is generated in execution of step 308. In one embodiment, a regression loss for the bounding box and an entropy loss for the probability are used at step 308. The method 300 continues from step 308 to a check step 309. Step 309 operates to determine whether the loss generated at step 308 is less than a predetermined threshold. The training method 300 ends if the loss generated is less than the pre-defined loss threshold ("Yes" at step 309) or all epochs are completed ("No" at step 303).
[00073] If the generated loss is not less than the pre-defined loss threshold ("No" at step 309), the method 300 continues to an incrementing step 310. The batch number is incremented at step 310 and the training method continues to step 304. The loop from step 303 to step 310 continues until all epochs are completed. The trained model resultant from operation of the method 300 is used to process the input image sequence to generate bounding boxes of detected persons and probabilities of detection. A pre-defined probability threshold is used to select bounding boxes of detected persons.
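A minimal sketch of the epoch and batch loop of the method 300 follows, in PyTorch-style Python; the model, data loader, optimiser and loss choices are assumptions for illustration, as no particular framework is specified above:

import torch

def train(model, loader, optimiser, total_epochs, loss_threshold):
    # Sketch of method 300: iterate over epochs (step 303) and batches
    # (step 304) until the loss falls below the threshold (step 309)
    # or all epochs are completed.
    bbox_loss = torch.nn.MSELoss()   # regression loss for bounding boxes
    prob_loss = torch.nn.BCELoss()   # entropy-style loss for probability
    for epoch in range(total_epochs):                  # step 303
        for images, gt_boxes, gt_labels in loader:     # step 304
            pred_boxes, pred_probs = model(images)     # step 306
            loss = (bbox_loss(pred_boxes, gt_boxes)    # step 308
                    + prob_loss(pred_probs, gt_labels))
            if loss.item() < loss_threshold:           # step 309
                return model
            optimiser.zero_grad()
            loss.backward()
            optimiser.step()                           # step 310
    return model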
[00074] Fig. 2 shows a method 200 of identifying a person of interest as implemented in some embodiments of step 110. The method 200 is typically implemented as one or more modules of the application 1433, stored in the memory 1406 and controlled under execution of the processor 1405.
[00075] The method 200 starts at an application step 210. The step 210 executes to apply a trained person detector, for example a model trained using the method 300. The trained person detector outputs bounding boxes, each representing detection of a person in the scene of the video sequence. Alternatively, step 210 may relate to receiving an indication of bounding boxes from a user manipulating inputs of the module 1401 such as the keyboard 1402 or the mouse pointer 1403.
[00076] Once persons are detected in an input image sequence, the method 200 continues to a track forming step 220. At step 220, the method 200 operates to form tracks for each of the people detected at step 210. In one arrangement, people detected at adjacent frames may be associated by the processor 1405 executing to compare appearance features of each of the people. For example, a distance measure such as L2 distance, histogram intersection distance, chi-square distance, and the like can be determined by comparing the HOG descriptors or colour histograms for people detected at frame t1 and people detected at frame t2. Two people are considered the same person if the distance measure of their appearance features is lower than an empirically determined threshold. A track can then be formed by connecting the centres of matched people through all frames. The track reveals the movement of a person across the sequence of frames. Other known alternative tracking algorithms may be used in the track forming step 220.
[00077] The track forming step 220 generates a set of people tracks as output. Each track is typically associated with the occurrences of a particular person in all frames of the video. Often a track identifier (ID) is assigned to each generated track.
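As an illustration of the appearance comparison described above, the following sketch computes a chi-square distance between colour histograms of two detected person patches; the histogram bin count and threshold are assumed, illustrative values:

import cv2

def appearance_distance(patch_a, patch_b, bins=16):
    # Chi-square distance between colour histograms of two person
    # patches; one of the distance measures mentioned above.
    hists = []
    for patch in (patch_a, patch_b):
        h = cv2.calcHist([patch], [0, 1, 2], None,
                         [bins] * 3, [0, 256] * 3)
        hists.append(cv2.normalize(h, h).flatten())
    return cv2.compareHist(hists[0], hists[1], cv2.HISTCMP_CHISQR)

# Detections in adjacent frames are associated (joined into one track)
# when their appearance distance is below an empirical threshold.
SAME_PERSON_THRESHOLD = 0.5  # illustrative value only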
[00078] The method 200 continues from step 220 to a selecting step 240. Step 240 selects one of the people in the scene as the person of interest. In one embodiment, a person is selected by a user of the system. In another embodiment, each detected person is used iteratively for three-dimensional position estimation, one person at a time. One of the tracks generated at step 220 is associated with the person of interest and can be considered to be selected when the person of interest is selected. The method 200 ends upon execution of step 240.
[00079] The output of method 200 (the step 110) is the selected track of a person of interest. Equation (1) is used to represent the track of the person of interest:
$T_n = [x_n^1, y_n^1, x_n^2, y_n^2]$ (1)
[00080] In Equation (1), $T_n$ represents the track coordinates, in the image coordinate system, for an nth input image 400 of the image sequence as shown in Fig. 4. The track coordinates consist of $[x_n^1, y_n^1]$, representing a top-left pixel coordinate of a bounding box 410, and $[x_n^2, y_n^2]$, representing a bottom-right pixel coordinate of the bounding box 410. The output of step 110 (method 200) is track information of the selected person of interest whose three-dimensional position is to be determined.
[00081] Returning to Fig. 1, the method 100 continues from step 110 to a detecting step 120. The step 120 detects a deviation in expected movement property of the person of interest. Each person detected in step 110 has an expected movement property. In the context of the present disclosure an expected movement property relates to a property affecting location of a person (or object) in terms of height or the z-dimension in the scene, that is determining whether the person is on a ground plane of the scene or in the air. The expected movement property can be detected in a homography plane with respect to the camera 1490.
[00082] Operation of step 120 is described by way of example with reference to Figs. 5 and 6. Fig. 5 shows a scene 500 of a basketball game, for example a frame captured by the camera 1490. The scene includes two players, shown as 510 and 520, each exhibiting a different movement characteristic. In the example of Fig. 5, a player can be on the ground or in the air. Being on the ground and being in the air represent movement properties of a person in the scene, one of which is expected.
[00083] A deviation from an expected movement property can occur when the person has changed pose between a first or previous frame of the video and a subsequent frame, for example between the current frame and the last frame, or a last frame within a threshold number of frames. Alternatively, the deviation may be from a default expected pose. The deviation is determined to have occurred at a time of the subsequent frame of the video.
[00084] One method of identifying the player movement property in an image is to execute a pose determination module that determines whether a player is on the ground or in the air (jumping). Fig. 6 shows a comparison 600 of poses 610 and 620. Fig. 6 shows a pose detector module 630 applied to each of the poses 610 and 620 to determine a pose from an input image of the corresponding person of interest. Each person of interest relates to a bounding box image extracted using bounding box coordinates from the input image as described in relation to step 110.
[00085] A method of pose estimation is now described. In one arrangement, an approach similar to the convolutional neural network used to detect a person is used. A convolutional neural network is trained based on pre-annotated training data for pose detection, for example in a similar manner to the method 300. The output of execution of step 120 is a sequence of pose detections for an input image sequence for the person of interest. The person of interest can be identified through the sequence of images using the tracks determined at step 110. Table 1 below shows an example output of the module 120 for three image frames, N, N+1, and N+2. The frames N, N+1 and N+2 are shown in Fig. 9 as frames 910, 920 and 930 respectively.
Frame Number    Pose
N               Not Jumping
N+1             Not Jumping
N+2             Jumping
Table 1: Poses for frames N to N+2
[00086] In an alternative arrangement, the pose of the person of interest (and accordingly any deviation in expected movement position) can be determined based upon prior known constraints relating to a position and/or direction of movement of the person.
[00087] For example, in a pole jump use case, the current location and direction of movement of an athlete are known to follow a substantially straight line while the athlete is running to take a jump. In one example, an expectation due to prior information is that the athlete will not move away more than 0.5 metres from the original position. Fig. 15A shows a scene 1500a for a pole jump scenario. The scene 1500a includes a pole obstacle 1510 and a running track 1520. A person 1530 represents the athlete with the pole. A corresponding set of axes 1540 shows coordinate axes where the x-axis points towards the pole obstacle 1510 and the y-axis is perpendicular to the x-axis, across the running track 1520. In the example of Fig. 15A, it is expected that the athlete will stay within the boundary of the running track 1520 when running towards the obstacle 1510. Fig. 15B shows an example scene 1500b where the athlete 1530 is in the middle of a jump. As a position 1550 of the athlete 1530 appears to be outside the boundary of the running track 1520, the athlete 1530 has changed pose and an alternative plane is used for position estimation.
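A minimal sketch of such a prior-constraint test is shown below; the function name is an assumption and the 0.5 metre offset is the example value given above:

def deviates_from_expected_path(position_xy, track_centre_y, max_offset=0.5):
    # Prior constraint from the pole jump example: while running, the
    # athlete is expected to move in a substantially straight line along
    # the x-axis (towards the obstacle 1510), staying within max_offset
    # metres of the track centre line in y (axes 1540).
    _, y = position_xy
    return abs(y - track_centre_y) > max_offset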
[00088] In the example arrangements described, the deviation in expected movement property of the person relates to determining a pose of jumping or not jumping. The change in pose represents a change in height or the z-coordinates of the person (or object) in world coordinates. Accordingly, detection of the deviation in expected movement property indicates that an alternative homography plane must be used to determine the three-dimensional position. The alternative plane is an alternative to planes currently used to determine location. The alternative plane is typically a vertical plane so that the change in the z-coordinates can be determined.
[00089] A vertical plane is used when the pose estimation module differentiates between two pose classes: jumping and not jumping. In alternative implementations, the pose estimation module may also estimate a pose angle with respect to the ground floor. In such cases, a plane at the estimated angle is used. The estimation of a non-vertical homography plane is further described below in relation to Fig. 10B.
[00090] The determination of three-dimensional location (position) is performed in response to the detection of the deviation.
[00091] Returning to Fig. 1, the method 100 continues from step 120 to a determining step 130.
[00092] The step 130 determines a planar position of the person of interest using a first homography plane. Step 130 operates using homography transformation matrices. For reference, the homography matrix and transformation are described.
[00093] Homography refers to a mapping relationship between two images of a plane captured by two pinhole cameras (or a single pinhole camera from two different positions).
[00094] For example, two images $I_1$ and $I_2$ of a plane $P$ containing a point $p$ are captured by a pinhole camera. The point $p$ is represented by points $p_1$ and $p_2$ in the images $I_1$ and $I_2$ respectively. A homography relationship between images $I_1$ and $I_2$, represented as $H_{12}$, is provided by Equation (2).
$p_2 = H_{12} p_1$ (2)
[00095] Equation (2) represents the calculation of point $p_2$ in image 2 using the homography relation $H_{12}$ applied to point $p_1$.
[00096] The homography matrix is defined using 8 degrees of freedom as shown in Equation (3).
$H = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix}$ (3)
[00097] In Equation (3), the entries $h_{ij}$, $i, j \in [1,3]$, follow the constraint of Equation (4):
$\sum_{i,j=1}^{3} h_{ij}^2 = 1$ (4)
[00098] The point $p_1$ is at coordinate $(x, y)$ in image 1 (for example 1110 of Fig. 11) and $p_2$ is at coordinate $(x', y')$ in image 2 (1120 of Fig. 11). The points $p_1$ and $p_2$ are related to each other as per Equation (5), which is equivalent to the relationship shown in Equation (2) above:
$\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = H_{12} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}$ (5)
[00099] To estimate a homography matrix, a minimum of 4 points of correspondence between the two planes is required to solve the linear set of equations arising from Equation (5). Let $P^1$ and $P^2$ represent a set of 4 corresponding points on image 1 1110 (Fig. 11) and image 2 1120 (Fig. 11) respectively. Then,
$H_{12} = f_{HE}(P^1, P^2)$ (6)
[000100] The function fHE represents a method to solve homography estimation. In one arrangement, a non-homogeneous linear solution is used to solve the homography estimation at step 130.
[000101] Equation (6) can also be used to estimate a reverse homography matrix $H_{21}$ as follows:
$H_{21} = f_{HE}(P^2, P^1)$ (7)
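By way of illustration, the function fHE can be realised with OpenCV's findHomography; the correspondences below are invented values for the sketch, not data from the described arrangement:

import cv2
import numpy as np

# With exactly four correspondences, findHomography solves the linear
# system of Equation (5) directly; with more points it computes a
# least-squares fit.
P1 = np.float32([[10, 20], [200, 25], [210, 180], [15, 190]])  # image 1
P2 = np.float32([[12, 22], [198, 30], [205, 175], [18, 185]])  # image 2

H12, _ = cv2.findHomography(P1, P2)  # Equation (6)
H21, _ = cv2.findHomography(P2, P1)  # Equation (7)

# Applying Equation (2): map point p1 from image 1 into image 2.
p1 = np.float32([[[10, 20]]])
p2 = cv2.perspectiveTransform(p1, H12)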
[000102] Step 130 operates to determine a homography matrix for the person of interest between the first and subsequent frames in the video sequence. The homography matrix relates to the positions of the person of interest, thereby determining a planar position or location of the person of interest at the time of the subsequent frame of the video data. The planar position of the person of interest is determined in response to the detection of the deviation in the expected movement property of the person of step 120. In determining the homography matrix, the planar position can be considered to be determined using a first homography plane with respect to the camera 1490.
[000103] The method 100 continues under execution of the processor 1405 from step 130 to a determining step 140. The step 140 operates to determine the three-dimensional position of the
person of interest at the time of the subsequent frame. The step 140 typically comprises two components, being matrix estimation and homography transformation.
[000104] In Fig. 11, the following homography relationships are shown for an arrangement 1100:
H12 = Homography matrix from image 1 (1110) to image 2 (1120)
H21 = Homography matrix from image 2 (1120) to image 1 (1110)
HG2 = Homography matrix from the ground plane (1130) to image 2 (1120)
H2G = Homography matrix from image 2 (1120) to the ground plane (1130)
HG1 = Homography matrix from the ground plane (1130) to image 1 (1110)
H1G = Homography matrix from image 1 (1110) to the ground plane (1130)
[000105] The homography matrix estimation and the use of homography transformation from one image to another are used to determine the three-dimensional position of a person in the scene in operation of step 140 of Fig. 1.
[000106] Estimating a three-dimensional (3D) position in the arrangements described involves estimating the position of the person of interest with respect to a three-dimensional coordinate system on the ground plane using a position of the person in the image. The ground plane is assumed to be in metric units in world coordinates (metres in Wx, Wy, Wz). As described in relation to Equation (5), four (4) point correspondences are required to generate a homography matrix.
[000107] In an example implementation, four corners of a court boundary, as shown by 711, 712, 713 and 714 in an image frame 700 of Fig. 7A, are used as point correspondences to determine a homography matrix between a ground plane 710 and the input image. The points on the image 700 can be marked manually once for stationary cameras. Alternatively, the points on the image can be generated for every image in the image sequence for moving cameras. In one example embodiment, a boundary of a basketball court is determined by detecting straight lines using a Hough Transform. The point correspondences on the ground plane can be determined
due to the known size of a professional basketball court, such as 15 metres wide and 28 metres long. Accordingly, known structure or known physical properties of the scene can be used to generate the point correspondences. The ground coordinate system uses x for width, y for length and z for height. The point correspondences are shown in Table 2.
Image Point    Ground Plane Point [x, y]
711            [0, 0]
712            [0, 28]
713            [15, 28]
714            [15, 0]
Table 2: Point correspondences for Fig. 7A
[000108] The ground plane homographies, represented as H1G and HG1, are generated using Equations (6) and (7). Next, as shown in an image frame 700B of Fig. 7B, real-world structure associated with the scene and/or the use case is used to generate a further homography matrix for a plane which is horizontally parallel to the ground plane. In one embodiment, the further plane is created using a basketball hoop structure shown as 720. Using the known specifications of the hoop structure and the basketball court size, the further horizontal plane is determined. An example of correspondences with the horizontal plane is shown in Table 3, based on the hoop structure having edges at 6.6 metres and 8.4 metres from each of the points 711 and 712.
Image Point    2nd Horizontal Plane Point [x, y]
721            [6.6, 0]
722            [6.6, 28]
723            [8.4, 28]
724            [8.4, 0]
Table 3: Point correspondences for Fig. 7B
[000109] The point correspondences determined for the ground and horizontal planes are used to generate a homography matrix corresponding to the further plane. Fig. 8A shows an example further plane 810 between the upper edges of basketball hoop structures 811 and 812 in a frame 800a. The further homography matrix is denoted by HI2G. The further plane is at a known z-coordinate (depending on the basketball court specifications, z = 3.85 metres, as shown in Fig. 12, using X-cube = 15 metres, Y-cube = 28 metres and Z-cube = 3.85 metres). The plane 810 is substantially parallel to the ground plane.
[000110] Next, using the homography HI2G, the image coordinates are estimated for the following ground points:
Ground point 1 821: [0, 0, 3.85]
Ground point 2 822: [0, 28, 3.85]
Ground point 3 823: [15, 28, 3.85]
Ground point 4 824: [15, 0, 3.85]
[000111] The ground points 821 to 824 are used to generate image points using equation (2) described above. The determination of image points related to the further ground plane leads to a homography cube 820 shown in a frame 800b of Fig. 8B.
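A sketch of the projection used to construct the cube 820 follows; H_G1 and H_G2 denote assumed ground-to-image homographies for the ground plane 710 and the hoop-top plane 810 (z = 3.85 m), estimated as described above, with both planes sharing the same court (x, y) parameterisation:

import cv2
import numpy as np

def project_plane_corners(H_plane_to_image, corners_xy):
    # Project world-plane corners into the image using a plane-to-image
    # homography, as done for the two faces of the cube 820.
    pts = np.float32(corners_xy).reshape(-1, 1, 2)
    return cv2.perspectiveTransform(pts, H_plane_to_image).reshape(-1, 2)

court_xy = [[0, 0], [0, 28], [15, 28], [15, 0]]  # 15 m x 28 m court
# bottom_corners = project_plane_corners(H_G1, court_xy)  # ground plane 710
# top_corners = project_plane_corners(H_G2, court_xy)     # plane 810
# The eight projected points define the homography cube 820 in the image.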
[000112] The method of determining a three-dimensional position of the person of interest, as implemented at step 140, is now described. The method is described separately for two cases. The first case relates to a pose of the person being 'Not jumping' and the second case relates to the pose of the person being 'Jumping'.
[000113] In case 1, the pose of the person is 'Not jumping'. An example scenario for 'not jumping' pose is shown in a sequence 900 in Fig. 9. The 'not jumping' pose relates to the frame 910 (frame N).
[000114] Fig. 13 shows a method 1300 of determining three-dimensional coordinates [Wx, Wy, Wz] for this case. The method 1300 is typically implemented as one or more modules of the application 1433, stored in the memory 1406, and controlled under execution of the processor 1405.
[000115] The method 1300 starts with a determining step 1310. The step 1310 determines the position of the feet of the person of interest in pixels. In step 1310, the feet position is represented as an $(x_f, y_f)$ pixel position. In one embodiment, $(x_f, y_f)$ of the person of interest is estimated using the bounding box coordinates $[x_n^1, y_n^1, x_n^2, y_n^2]$ of the person of interest as follows:
$x_f = \frac{x_n^1 + x_n^2}{2}$ (8)
$y_f = y_n^2$ (9)
After generating the feet pixel position, the method 1300 moves to a generating step 1320. Step 1320 operates to generate or determine the X and Y coordinates in the world coordinate system. The coordinates [Wx, Wy] are generated by applying the homography matrix H1G (generated using Equation (6) above) to the feet position, using the point transformation method of Equation (2), as follows:
[000116] $[W_x, W_y] = H_{1G} \cdot [x_f, y_f]$
[000117] After estimating the x and y coordinates in the world coordinate system, the method 1300 moves to an estimating step 1330. Step 1330 operates to estimate and output the world z coordinate. The z-position of the player corresponds to the centre position of the bounding box of the person. The centre position image coordinates are estimated as follows:
$x_c = \frac{x_n^1 + x_n^2}{2}$ (10)
$y_c = \frac{y_n^1 + y_n^2}{2}$ (11)
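The following sketch implements Equations (8) to (11) and the point transformation giving [Wx, Wy]; the function names are illustrative assumptions, and H_1G is assumed to have been estimated as described above:

import cv2
import numpy as np

def feet_and_centre(bbox):
    # Equations (8)-(11): feet pixel and body-centre pixel from a
    # bounding box [x1, y1, x2, y2] (top-left, bottom-right corners).
    x1, y1, x2, y2 = bbox
    feet = ((x1 + x2) / 2.0, y2)                  # Equations (8), (9)
    centre = ((x1 + x2) / 2.0, (y1 + y2) / 2.0)   # Equations (10), (11)
    return feet, centre

def to_world_xy(H_1G, pixel):
    # [Wx, Wy] = H1G * [xf, yf]; H_1G is the image-to-ground homography.
    p = np.float32([[pixel]])
    return cv2.perspectiveTransform(p, H_1G)[0, 0]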
[000118] To estimate the z position of the world coordinate, the first step is to determine a vertical homography plane 1050 for an image 1000a as shown in Fig. 10A. The vertical plane 1050 represents a second homography plane used to determine the three-dimensional position of the person. The plane 1050 is estimated using two horizontal homography planes, as described in relation to Fig. 8 and in the example below.
[000119] As shown in Fig. 10A, the plane 1050 is not parallel to the first homography plane (compare, for example, the planes 710 of Fig. 7A and 810 of Fig. 8A). Rather, the homography plane used to determine the three-dimensional position is at a substantially non-zero angle to the first homography plane determined in step 130. To determine the homography matrix for the vertical plane, four (4) point correspondences are determined as follows.
Point 1010 on the image 1000a corresponds to world coordinates: [0, Wy, 0]
Point 1020 on the image 1000a corresponds to world coordinates: [0, Wy, Z-cube]
Point 1030 on the image 1000a corresponds to world coordinates: [X-cube, Wy, Z-cube]
Point 1040 on the image 1000a corresponds to world coordinates: [X-cube, Wy, 0]
[000120] To determine the vertical homography matrix, image coordinates for points 1010, 1020, 1030 and 1040 are estimated. The estimation of the points 1010, 1020, 1030 and 1040 is now described.
[000121] The points 1010 and 1040 lie on the ground plane and thus their image coordinates are estimated using the homography matrix HG1. Similarly, image coordinates for points 1020 and 1030 are estimated using the second homography matrix HG2. After estimating the image coordinates of points 1010, 1020, 1030 and 1040, the vertical homography plane 1050 is estimated using the point correspondences to estimate HIGV. HIGV is used to determine the world coordinates of the centre of the body $[x_c, y_c]$ of the person of interest using the point transformation method of Equation (2) as follows:
$[W_{y2}, W_z] = H_{IGV} \cdot [x_c, y_c]$
[000122] Accordingly, a vertical homography plane is estimated using two horizontal homography planes in determining the three-dimensional location of the person. The values Wy2 and Wy1 represent world coordinates in the y direction for corners 1010 and 1040 respectively. The values Wy2 and Wy1 should be similar but might vary slightly due to numerical errors in the homography matrix estimation. In one embodiment, the final world y coordinate is estimated as the average of Wy1 and Wy2:
$W_y = (W_{y1} + W_{y2}) / 2$
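A sketch of the vertical-plane estimation follows, assuming the ground-to-image homographies H_G1 and H_G2 for the two horizontal planes as above; the returned image-to-plane matrix plays the role of HIGV in the text:

import cv2
import numpy as np

def estimate_vertical_plane(H_G1, H_G2, Wy, x_cube=15.0, z_cube=3.85):
    # Project corners 1010 and 1040 (ground plane, z = 0) and corners
    # 1020 and 1030 (top plane, z = z_cube) at ground position Wy into
    # the image, then fit the image-to-plane homography.
    ends = np.float32([[[0.0, Wy]], [[x_cube, Wy]]])
    img_bottom = cv2.perspectiveTransform(ends, H_G1)  # points 1010, 1040
    img_top = cv2.perspectiveTransform(ends, H_G2)     # points 1020, 1030
    image_pts = np.float32([img_bottom[0, 0], img_top[0, 0],
                            img_top[1, 0], img_bottom[1, 0]])
    plane_pts = np.float32([[0, 0], [0, z_cube],       # plane-local (x, z)
                            [x_cube, z_cube], [x_cube, 0]])
    H_IGV, _ = cv2.findHomography(image_pts, plane_pts)
    return H_IGV

# The body-centre pixel [xc, yc] then maps to [Wy2, Wz] via
# cv2.perspectiveTransform with the returned matrix, as in the text.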
[000123] The output of step 1330 is [Wx, Wy, Wz], the world coordinates of the person of interest.
[000124] A method of estimating a non-orthogonal plane 1060 at step 1330 is now described with reference to an image 1000b of Fig. 10B. In the image frame 1000b, the points 1010 and 1040 are the same as used for estimating the plane 1050 in Fig. 10A. The estimation of points 1070 and 1080 in the image and world coordinates is now described. The points 1070 and 1080 have the same values for the x and z coordinates as points 1020 and 1030. The y-coordinates of points 1070 and 1080 are related to those of points 1020 and 1030 using Equations (12) and (13).
$y_{1070} = y_{1020} + \tan\theta \cdot z_{1020}$ (12)
$y_{1080} = y_{1030} + \tan\theta \cdot z_{1030}$ (13)
[000125] In Equations (12) and (13), $\theta$ represents the relative angle with respect to the vertical pose estimated by the pose estimation module, as shown by 1080.
[000126] The image coordinates for the points 1070 and 1080 are estimated using the same formulation as for points 1020 and 1030. Once the points 1070 and 1080 are estimated, the homography matrix for the plane 1060 is estimated in a similar manner to the homography matrix for the plane 1050.
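A small sketch of Equations (12) and (13) follows, with the point subscripts written out in full; theta is the lean angle from the pose estimator and the variable naming is illustrative only.

import math

def tilted_corner_y(y_base, z_height, theta_rad):
    # Equations (12) and (13): the upper corners of the tilted plane 1060 are
    # displaced along y in proportion to their height by tan(theta).
    return y_base + math.tan(theta_rad) * z_height

# e.g. y_1070 = tilted_corner_y(y_1020, z_1020, theta)
#      y_1080 = tilted_corner_y(y_1030, z_1030, theta)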
[000127] Upon determining the world coordinates at step 1330, the method 1300 ends.
[000128] Accordingly, in case 1, step 140 determines the three-dimensional position of the person of interest using the second homography plane, based on the planar position determined at step 130. The second homography plane is determined with respect to the camera and is substantially non-parallel (for example, orthogonal or at a non-zero angle) to the first homography plane.
[000129] The second example case, in which the pose of the person of interest is 'Jumping', such as in frame 930 of Fig. 9, is now described. The nearest frame before the 'Jumping' pose frame which has pose 'Not jumping' is used to estimate world coordinates [Wx, Wy] in the manner described above for case 1, that is using [Wx, Wy] = H1G * [xf, yf].
[000130] Effectively, the last frame of the video in which the expected movement property was detected is identified and used to determine the position. The world coordinates are used to estimate the vertical homography plane and matrix HIGV as described for case 1 using the method 1300. The centre coordinates of the person in image coordinates are estimated for frame N+1 (the subsequent frame in which the deviation is detected) using Equations (9) and (10) as described above. The centre coordinate and the homography matrix HIGV are used to determine the world y and z coordinates as follows:
[Wy2, Wz] = HIGV * [xc, yc]
[000131] The final estimation of the x, y and z coordinates is similar to case 1. Accordingly, similarly to case 1, in implementing step 140 for case 2, the three-dimensional position of the person of interest is determined using the second homography plane based on the planar position determined at step 130.
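Pulling the case 2 steps together, a hedged end-to-end sketch might look as follows. It reuses project, estimate_HIGV and bbox_centre from the earlier sketches; feet_pixels and person_bbox stand for hypothetical detector outputs (the feet-pixel step 1310 and the person detector) and are not functions defined in the description.

def locate_jumping_person(frames, n, H1G, HG1, HG2, X_cube, Z_cube):
    # Frame n is the last 'Not jumping' frame; frame n + 1 is 'Jumping'.
    xf, yf = feet_pixels(frames[n])             # hypothetical feet detector
    Wx, Wy = project(H1G, (xf, yf))             # planar position, as in case 1
    HIGV = estimate_HIGV(HG1, HG2, Wy, X_cube, Z_cube)
    xc, yc = bbox_centre(*person_bbox(frames[n + 1]))  # hypothetical bbox
    Wy2, Wz = project(HIGV, (xc, yc))           # height via the vertical plane
    return Wx, (Wy + Wy2) / 2.0, Wz             # averaged y, as in [000122]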
[000132] Upon determination of the position of the person by determining the x, y, z coordinates in the manner described above (whether for case 1 or case 2), the method 100 ends.
[000133] The methods described are also applicable to detecting a location of an object of interest other than a person. An example of another type of object whose location can be tracked is a product in a shop. The movement property associated with the product can relate to whether the product is located in an expected position on a shelf or elsewhere. For example, the height of the product may change if picked up by a customer and/or placed in a basket. The three-dimensional position of the object can be determined using other contextual features. In the case of a product, for example, the expected change in movement property is estimated using the pose of the human interacting with (for example picking up) the object.
[000134] The arrangements described are applicable to the computer and data processing industries and particularly for the image processing industries.
[000135] Example applications include a sports application such as the basketball application described above, or other sports such as volleyball or pole vaulting where the height of the person or object of interest may change. For example, the camera 1495 can capture video of a game in the scene 1490 and the module 1401 can execute the method 100 to determine the location of a person or object of interest. The location can be used to fine-tune positioning of cameras, for tracking purposes, or to ensure accurate images of the person are captured as their movement changes.
[000136] The arrangements described can be used to determine a three-dimensional position of a person of interest in a scene captured by a single image capture device. By estimating a planar position of the person in a further frame using a first homography plane upon detection of a deviation, the arrangements described allow a single image capture device to be used. A homography plane parallel to the first homography plane is used to create a homography cube (shown as 820 in Fig. 8B). The homography cube allows a second homography plane to be generated at any position or location in the cube. Determining a position of the person using the second homography plane allows a three-dimensional position of the person of interest to be generated. Detecting the location of the person of interest can provide a form of tracking the person of interest, and/or allow positioning of a camera to ensure an image of the person of interest is captured. The arrangements described accordingly can be used in sports broadcast applications as well as security applications.
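As one hedged illustration of generating a homography plane at an arbitrary location in the cube, the sketch below fits exact plane-to-image homographies for two vertical faces of the cube (using the same four-corner trick as HIGV above) and reads off the corners of the horizontal slice at height z. It reuses project and the OpenCV calls from the earlier sketch; the construction is the author's illustration, not a procedure stated in the description.

def homography_at_height(HG1, HG2, z, X_cube, Y_cube, Z_cube):
    # Fit a world-to-image homography for the horizontal plane at height z
    # inside the homography cube.
    def face(y):  # vertical face at fixed y, parameterised by (x, z)
        plane = np.float32([[0, 0], [0, Z_cube], [X_cube, Z_cube], [X_cube, 0]])
        img = np.float32([project(HG1, (0, y)), project(HG2, (0, y)),
                          project(HG2, (X_cube, y)), project(HG1, (X_cube, y))])
        return cv2.getPerspectiveTransform(plane, img)   # face -> image
    H_near, H_far = face(0.0), face(Y_cube)
    world = np.float32([[0, 0], [X_cube, 0], [X_cube, Y_cube], [0, Y_cube]])
    img = np.float32([project(H_near, (0, z)), project(H_near, (X_cube, z)),
                      project(H_far, (X_cube, z)), project(H_far, (0, z))])
    return cv2.getPerspectiveTransform(world, img)       # world (x, y) -> image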
[000137] The three-dimensional position can be estimated without requiring multiple image capture devices, thereby avoiding unnecessary cost in terms of hardware, networking and congestion. Further, the arrangements described can be applied to different scenes. Convolutional neural networks are used for determining the person of interest and estimating the pose of the person of interest, rather than for determining location, reducing training overheads. Additionally, the location of the person of interest can be determined without capturing images from the viewpoint of the person of interest.
[000138] Use of the homography techniques described means that world coordinates and values (such as the dimensions of the basketball court) can be used to determine the location of the person of interest. Further, the determined position may be used to supplement tracking techniques. For example, the determined position may be used to track the person of interest if other tracking techniques are providing spurious results.
[000139] The foregoing describes only some embodiments of the present invention, and modifications and/or changes can be made thereto without departing from the scope and spirit of the invention, the embodiments being illustrative and not restrictive.
[000140] In the context of this specification, the word "comprising" means "including principally but not necessarily solely" or "having" or "including", and not "consisting only of". Variations of the word "comprising", such as "comprise" and "comprises", have correspondingly varied meanings.

Claims (17)

Claims:
1. A method of determining a three-dimensional position of a person in a video, the method comprising:
identifying a person of interest in a scene of a first frame of a video captured using a camera, the person of interest having an expected movement property in a first homography plane with respect to the camera;
detecting a deviation in the expected movement property of the person in the scene at a time of a subsequent frame of the video captured by the camera;
determining a planar position of the person using the first homography plane at the time of the further frame in response to the detected deviation; and
determining a three-dimensional position of the person in the scene at the time of the further frame using a second homography plane with respect to the camera based on the determined planar position, the second homography plane being non-parallel to the first homography plane.
2. The method according to claim 1, wherein the deviation in the expected movement property is determined using a convolutional neural network.
3. The method according to claim 1, wherein the deviation in the expected movement property is determined based upon known constraints relating to a position of the person in the first frame.
4. The method according to claim 1, wherein determining the three-dimensional position of the person comprises determining a homography cube using the first homography plane and a parallel homography plane.
5. The method according to claim 1, wherein detecting the deviation in the expected movement property of the person comprises determining that the person is on a ground plane of the scene.
6. The method according to claim 1, wherein detecting the deviation in the expected movement property of the person comprises determining that the person is in the air.
7. The method according to claim 1, wherein the deviation in the expected movement property of the person indicates whether the person is jumping or not jumping.
8. The method according to claim 1, wherein the scene is a basketball court and determining the three-dimensional position of the person comprises determining a boundary of the court using line detection.
9. The method according to claim 1, wherein the scene is a basketball court, the camera is a moving camera and determining the three-dimensional position of the person comprises determining a boundary of the court of the scene using line detection for each frame of the video.
10. The method according to claim 1, wherein the deviation relates to the person being on the ground and determining the three-dimensional location of the person relates to determining a location of pixels of feet of the person.
11. The method according to claim 1, wherein the deviation relates to the person being in the air and determining the three-dimensional location of the person comprises identifying a last frame of the video containing the person being in the expected movement position.
12. The method according to claim 1, wherein determining the three-dimensional position comprises determining a vertical homography plane using two horizontal homography planes.
13. The method according to claim 1, wherein determining the three-dimensional position comprises using known structure of the scene to determine a boundary associated with the scene.
14. The method according to claim 1, wherein determining the three-dimensional position of the person comprises determining a homography cube using the first homography plane and a parallel homography plane and determining the second homography plane at any location within the cube.
15. A non-transitory computer readable medium having a computer program stored thereon to implement a method of determining a three-dimensional position of a person in a video, the program comprising:
code for identifying a person of interest in a scene of a first frame of a video captured using a camera, the person of interest having an expected movement property in a first homography plane with respect to the camera;

code for detecting a deviation in the expected movement property of the person in the scene at a time of a subsequent frame of the video captured by the camera;

code for determining a planar position of the person using the first homography plane at the time of the further frame in response to the detected deviation; and

code for determining a three-dimensional position of the person in the scene at the time of the further frame using a second homography plane with respect to the camera based on the determined planar position, the second homography plane being non-parallel to the first homography plane.
16. A system, comprising:
a camera configured to capture video of a scene;
a memory;
a processor, wherein the processor is configured to execute code stored on the memory for implementing a method comprising:
receiving video captured using the camera;
identifying a person of interest in a scene of a first frame of the video, the person of interest having an expected movement property in a first homography plane with respect to the camera;
detecting a deviation in the expected movement property of the person in the scene at a time of a subsequent frame of the video captured by the camera;
determining a planar position of the person using the first homography plane at the time of the further frame in response to the detected deviation; and
determining a three-dimensional position of the person in the scene at the time of the further frame using a second homography plane with respect to the camera based on the determined planar position, the second homography plane being non-parallel to the first homography plane.
17. Apparatus, comprising:
a memory;
a processor configured to execute code stored on the memory to implement a method of determining a three-dimensional position of a person in a video, the method comprising:
identifying a person of interest in a scene of a first frame of a video captured using a camera, the person of interest having an expected movement property in a first homography plane with respect to the camera;
detecting a deviation in the expected movement property of the person in the scene at a time of a subsequent frame of the video captured by the camera;
determining a planar position of the person using the first homography plane at the time of the further frame in response to the detected deviation; and
determining a three-dimensional position of the person in the scene at the time of the further frame using a second homography plane with respect to the camera based on the determined planar position, the second homography plane being non-parallel to the first homography plane.
Canon Kabushiki Kaisha
Patent Attorneys for the Applicant
SPRUSON&FERGUSON
[Drawing sheets 1/17 to 17/17. The figures do not survive text extraction; only their captions and reference labels are recoverable:
Fig. 1: flowchart of the method 100 — identify a person of interest (110), detect deviation in expected movement property (120), determine planar position using first homography plane (130), determine 3D position using second homography plane (140), end (199).
Fig. 2: flowchart of the method 200 — apply a trained person detector (210), generate tracks (220), select a person of interest (240), end (299).
Fig. 3: flowchart of the detector training 210 — person image samples (301) and annotation data (302) form training data; an epoch/batch loop trains a 2D CNN model (306), which outputs a bounding box and probability of person detected (307); a loss (308) is generated and compared with a loss threshold (309).
Fig. 4: image 400 with feature 410.
Fig. 5: image 500 with features 510 and 520.
Fig. 6: pose detector examples 600 — outputs 'Not Jumping' (610) and 'Jumping' (630).
Figs. 7A and 7B: images 700 and 700b with horizontal homography planes 710 (corners 711-714) and 720 (corners 721-724).
Figs. 8A and 8B: images 800a and 800b with homography planes 810 (corners 811, 812) and 820 (corners 821-824).
Fig. 9: frames N (910), N+1 (920) and N+2 (930).
Figs. 10A and 10B: image 1000a with vertical plane 1050 and points 1010, 1020, 1030, 1040; image 1000b with non-orthogonal plane 1060 and points 1010, 1040, 1070, 1080.
Fig. 11: homography matrices H12, H21, H1G, HG1, HG2 and H2G relating Image 1 (1110), Image 2 (1120) and the ground plane (1130).
Fig. 12: homography cube 1210 with extents X-cube, Y-cube and Z-cube along the X, Y and Z axes.
Fig. 13: flowchart of the method 1300 — determine feet pixels (1310), generate X and Y coordinates (1320), estimate Z coordinate (1330).
Figs. 14A, 14B and 14C: general-purpose computer system 1400 — module 1401 and camera 1495 (Fig. 14A); peripherals and interfaces including display 1414, keyboard 1402, camera 1427, processor 1405, memory 1406, HDD 1410 and networks 1420, 1422 (Fig. 14B); processor internals including control unit 1439, ALU 1440 and registers 1444-1446 (Fig. 14C).
Figs. 15A and 15B: diagrams 1500a and 1500b with X and Y axes and features 1510, 1520, 1530, 1540 and 1550.]
AU2018282254A 2018-12-17 2018-12-17 System and method for determining a three-dimensional position of a person Abandoned AU2018282254A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2018282254A AU2018282254A1 (en) 2018-12-17 2018-12-17 System and method for determining a three-dimensional position of a person

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
AU2018282254A AU2018282254A1 (en) 2018-12-17 2018-12-17 System and method for determining a three-dimensional position of a person

Publications (1)

Publication Number Publication Date
AU2018282254A1 true AU2018282254A1 (en) 2020-07-02

Family

ID=71120601

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2018282254A Abandoned AU2018282254A1 (en) 2018-12-17 2018-12-17 System and method for determining a three-dimensional position of a person

Country Status (1)

Country Link
AU (1) AU2018282254A1 (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112150558A (en) * 2020-09-15 2020-12-29 北京百度网讯科技有限公司 Method and device for acquiring three-dimensional position of obstacle for roadside computing equipment
CN112150558B (en) * 2020-09-15 2024-04-12 阿波罗智联(北京)科技有限公司 Obstacle three-dimensional position acquisition method and device for road side computing equipment
CN113190643A (en) * 2021-04-13 2021-07-30 安阳师范学院 Information generation method, terminal device, and computer-readable medium
CN113190643B (en) * 2021-04-13 2023-02-03 安阳师范学院 Information generation method, terminal device, and computer-readable medium
CN115439625A (en) * 2022-11-08 2022-12-06 成都云中楼阁科技有限公司 Building sketch auxiliary drawing method and device, storage medium and drawing equipment
CN115439625B (en) * 2022-11-08 2023-02-03 成都云中楼阁科技有限公司 Building sketch auxiliary drawing method and device, storage medium and drawing equipment
CN117201930A (en) * 2023-11-08 2023-12-08 荣耀终端有限公司 Photographing method and electronic equipment
CN117201930B (en) * 2023-11-08 2024-04-16 荣耀终端有限公司 Photographing method and electronic equipment

Similar Documents

Publication Publication Date Title
US9076059B2 (en) Key-frame selection for parallel tracking and mapping
US10235771B2 (en) Methods and systems of performing object pose estimation
US9811733B2 (en) Method, apparatus and system for selecting a frame
US9542753B2 (en) 3D reconstruction of trajectory
AU2018282254A1 (en) System and method for determining a three-dimensional position of a person
JP5822322B2 (en) Network capture and 3D display of localized and segmented images
US20150104066A1 (en) Method for improving tracking in crowded situations using rival compensation
US9317772B2 (en) Method for improving tracking using dynamic background compensation with centroid compensation
US9756261B2 (en) Method for synthesizing images and electronic device thereof
US8718324B2 (en) Method, apparatus and computer program product for providing object tracking using template switching and feature adaptation
US11037325B2 (en) Information processing apparatus and method of controlling the same
US11048944B2 (en) Spatio-temporal features for video analysis
US10825197B2 (en) Three dimensional position estimation mechanism
JP6891283B2 (en) Image processing system, image processing method, and program
EP3540574A1 (en) Eye tracking method, electronic device, and non-transitory computer readable storage medium
Park et al. Neural object learning for 6d pose estimation using a few cluttered images
Bang et al. Camera pose estimation using optical flow and ORB descriptor in SLAM-based mobile AR game
Akman et al. Multi-cue hand detection and tracking for a head-mounted augmented reality system
AU2015258346A1 (en) Method and system of transitioning between images
Garau et al. Unsupervised continuous camera network pose estimation through human mesh recovery
US20220157039A1 (en) Object Location Determination
Gasparini et al. Stereo camera tracking for mobile devices
Cho Augmented reality system with planar homographies using building façade
Spizhevoy et al. Auto-calibration for image mosaicing and stereo vision

Legal Events

Date Code Title Description
MK4 Application lapsed section 142(2)(d) - no continuation fee paid for the application