AU2019200854A1 - A method and system for controlling a camera by predicting future positions of an object - Google Patents
- Publication number
- AU2019200854A1
- Authority
- AU
- Australia
- Prior art keywords
- future positions
- cameras
- positions
- camera
- predicted future
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- A—HUMAN NECESSITIES
- A63—SPORTS; GAMES; AMUSEMENTS
- A63B—APPARATUS FOR PHYSICAL TRAINING, GYMNASTICS, SWIMMING, CLIMBING, OR FENCING; BALL GAMES; TRAINING EQUIPMENT
- A63B24/00—Electric or electronic controls for exercising apparatus of preceding groups; Controlling or monitoring of exercises, sportive games, training or athletic performances
- A63B24/0021—Tracking a path or terminating locations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/292—Multi-camera tracking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
- G06T7/77—Determining position or orientation of objects or cameras using statistical methods
-
- A—HUMAN NECESSITIES
- A63—SPORTS; GAMES; AMUSEMENTS
- A63B—APPARATUS FOR PHYSICAL TRAINING, GYMNASTICS, SWIMMING, CLIMBING, OR FENCING; BALL GAMES; TRAINING EQUIPMENT
- A63B24/00—Electric or electronic controls for exercising apparatus of preceding groups; Controlling or monitoring of exercises, sportive games, training or athletic performances
- A63B24/0021—Tracking a path or terminating locations
- A63B2024/0025—Tracking the path or location of one or more users, e.g. players of a game
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Probability & Statistics with Applications (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Physical Education & Sports Medicine (AREA)
- Multimedia (AREA)
- Studio Devices (AREA)
Abstract
The present disclosure relates to a method (100) of capturing an image of an object (1201, 1202, 1203) using a plurality of cameras (1230, 1231, 1232) in a network. The method (100) comprises receiving orientation information (1410, 1430) of the object (1201, 1202, 1203) whose image is to be captured. The method (100) then predicts future positions of the object (1201, 1202, 1203), wherein each of the future positions is associated with a probability score based on the plurality of cameras (1230, 1231, 1232), wherein the prediction of the future positions is based on the received orientation information (1410, 1430). The plurality of cameras (1230, 1231, 1232) can then be configured to capture an image of the object (1201, 1202, 1203) at one or more of the predicted future positions.
[Fig. 8 (sheet 8/13): schematic of the prediction and camera control system, showing the scene 1210, objects 1201, 1202, 1203, field of view 1222, cameras 1230, 1231, 1232 and pan-tilt units 1240, 1241, 1242.]
Description
The present invention relates to the positioning of cameras to capture events of interest, for example the positioning of video and photographic cameras to photograph players in a sporting event. In particular, the present invention describes a method of positioning several cameras to record events of interest performed by several sports players, by using the previous positions of the players to predict their future positions and positioning the cameras to record events accordingly.
When capturing a sports event using video or still-image cameras, several cameras are commonly used to record the event, in order to capture events of interest in detail and from various angles. Positioning of cameras is commonly performed by human operators, where several photographers are positioned at various locations around a sports field and operate cameras to capture events of interest. There are limitations on the number of camera views that can be captured due to restrictions on the number of camera operators, on the number of cameras that can be controlled by each operator, and on the locations available for a camera operator to be positioned. A further limitation is that a human camera operator has limited ability to position a camera in order to capture content. For example, in order to capture high resolution images, a narrow field of view must be used, and the camera must be accurately aimed to capture the content. As movement of the camera has a delay, aiming the camera requires prediction of where an event will likely occur in the future. Human operators have limited accuracy when observing the event, performing this prediction and accurately moving a narrow field of view camera to capture the event. Events are also inherently difficult to predict, as it is uncertain what events will occur in the future and where they will occur, such as where a player with the ball will be.
Remotely operated cameras controlled by human operators can also be used, which can allow a greater variety of camera viewpoints to be captured, but this approach is limited by cost.
All human-operated methods are limited by the reaction time and the ability of human operators to estimate where to direct the camera. For example, a telephoto camera operator recording close-up images has a limited view of the playing area, which restricts the operator's ability to estimate where to position the camera.
Automated camera systems can also be used. However, conventional automated camera arrangements also have limitations.
Conventional arrangements have not demonstrated the ability to accurately produce close-up photography of moving objects of interest, nor to photograph multiple objects using multiple cameras while controlling the cameras in a way that accurately and reliably produces high quality recordings and images.
It is an object of the present invention to substantially overcome or at least ameliorate one or more of the above disadvantages.
The present disclosure describes a method for predicting possible future positions of objects (e.g., players, a ball, etc.) in a scene and controlling the positioning of several cameras to capture video or still images of the objects. By predicting possible future positions of the objects and controlling the positioning of the cameras based on the predicted positions, the video or still images taken by the cameras do not need to be cropped, and therefore high quality video or still images are captured. A future position refers to a position at a later time to which an object may move.
The present disclosure uses input data such as previous video footage of objects (e.g., players, a ball, etc.) in a sporting field, and data such as orientation information of the position of the objects in the scene, such as that produced by a tracking system. The present disclosure uses the input data to predict the future positions of the objects of interest, for example using a machine-learning method to predict the future positions of the objects. The present disclosure uses the predicted future positions of the objects to select the most probable future positions, and to operate the cameras to position the viewing angle of the cameras to accurately capture the objects in high quality at the selected future positions, for example by controlling the pan, tilt and zoom parameters of a mechanically operated camera apparatus.
In accordance with an aspect of the present disclosure, there is provided a method of capturing an image of an object using a plurality of cameras, the method comprising: receiving orientation information of an object whose image is to be captured; predicting future positions of the object, wherein each of the future positions is associated with a probability score, wherein the prediction of the future positions is based on the received orientation information; and configuring the plurality of cameras to capture an image of the object at one or more of the predicted future positions.
One or more embodiments of the invention will now be described with reference to the following drawings, in which:
Fig. 1 is a schematic flow diagram illustrating a method of training and operating a prediction and camera control system;
Fig. 2 is a schematic flow diagram illustrating a training method of Fig. 1;
Fig. 3 is a schematic flow diagram illustrating a position prediction and camera control method of Fig. 1;
Fig. 4 is a schematic flow diagram illustrating a position prediction method;
Fig. 5 is a schematic block diagram illustrating an artificial neural network for implementing the position prediction method of Fig. 4;
Fig. 6 is a schematic block diagram illustrating a structure of the artificial neural network layers of the artificial neural network shown in Fig. 5;
Fig. 7 shows examples of input data of the artificial neural network layers shown in Fig. 6;
Fig. 8 is a schematic block diagram illustrating the prediction and camera control system on which the method of Fig. 1 can be implemented;
Fig. 9 is a schematic block diagram illustrating the modules of the control unit of the prediction and camera control system shown in Fig. 8;
Fig. 10 shows examples of input data that can be received by the prediction and camera control system of Figs. 8 and 9;
Fig. 11 shows examples of future positions of an object within a field of view at a future point in time; and
Figs. 12A and 12B form a schematic block diagram of a general purpose computer system upon which the control unit shown in Fig. 9 can be practiced.
Where reference is made in any one or more of the accompanying drawings to steps and/or features, which have the same reference numerals, those steps and/or features have for the purposes of this description the same function(s) or operation(s), unless the contrary intention appears.
It is to be noted that the discussions contained in the "Background" section and that above relating to prior art arrangements relate to discussions of documents or devices which form public knowledge through their respective publication and/or use. Such should not be interpreted as a representation by the present inventor(s) or the patent applicant that such documents or devices in any way form part of the common general knowledge in the art.
Aspects of the present disclosure address the problem of performing high quality photography of multiple objects by predicting future positions of the objects and controlling multiple cameras to photograph the objects.
The present disclosure provides a system for observing the environment using several cameras, including a wide-angle camera and one or more close-up cameras, for identifying objects of interest in the environment, for predicting the future positions of the objects of interest in the environment, and using the predictions to control multiple close-up cameras. In this way the present disclosure enables close-up photography of multiple moving objects of interest, such that high quality close-up images of the objects are recorded.
A method and system for prediction and camera control is described below.
Fig. 8 shows a prediction and camera control system having a control unit 1250, one wide-angle camera 1230, and several close-up cameras 1231, 1232. The cameras 1230, 1231, 1232 are connected to respective pan-tilt units 1240, 1241, 1242 that adjust the orientation of the cameras 1230, 1231, 1232.
The wide-angle camera 1230 is configured for capturing a scene 1210. The wide-angle camera 1230 has a field of view 1220 that covers a large region or the entirety of the scene 1210. The wide-angle camera 1230 records video or images that are transmitted to the control unit 1250. Although one wide-angle camera 1230 is shown, it is possible to have more than one wide-angle camera 1230.
The close-up cameras 1231, 1232 are configured by the control unit 1250 to set the zoom configuration, focus, exposure, shutter speed, video frame rate, ISO setting, audio and other camera controls. The close-up cameras 1231, 1232 capture close-up fields of view 1221, 1222 of objects of interest 1201, 1202, 1203. The close-up fields of view 1221, 1222 depend on the configuration of the pan-tilt units 1241, 1242 and configuration of the close-up cameras 1231, 1232.
The control unit 1250 uses the wide-angle camera 1230 to determine orientation data relating to the objects of interest 1201, 1202, 1203. For example, the orientation data relates to the positions of the objects 1201, 1202, 1203. The orientation data can then be used to predict possible future positions of each of the objects 1201, 1202, 1203. Based on the predicted future positions, the control unit 1250 then configures the pan-tilt units 1240, 1241, 1242 and camera settings of the close-up cameras 1231, 1232 to better track the objects (e.g., a ball 1203, persons 1201, 1202).
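As an illustration of the data flow just described, the sketch below shows one iteration of a control loop in Python; the tracker, predictor, pan-tilt unit and camera interfaces are hypothetical placeholders, not components defined in the specification.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    x: float      # horizontal coordinate of a possible future position
    y: float      # vertical coordinate of a possible future position
    score: float  # probability that the object will be at (x, y) at the future time


def control_step(wide_frame, tracker, predictor, pan_tilt_units, close_up_cameras):
    """One iteration of a prediction and camera control loop (illustrative only)."""
    # 1. Derive orientation data (object positions over time) from the wide-angle view.
    tracks = tracker.update(wide_frame)

    # 2. Predict possible future positions of each object, each with a probability score.
    predictions = {obj_id: predictor.predict(track) for obj_id, track in tracks.items()}

    # 3. Keep the most probable predicted position per object and aim a close-up camera at it.
    for (obj_id, candidates), unit, camera in zip(predictions.items(),
                                                  pan_tilt_units, close_up_cameras):
        best = max(candidates, key=lambda p: p.score)
        unit.point_at(best.x, best.y)   # adjust pan/tilt towards the predicted position
        camera.start_capture()          # hypothetical trigger for close-up recording
```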
Structure of Control Unit 1250
Figs. 12A and 12B depict a general-purpose computer system 1600, upon which the control unit 1250 can be practiced.
As seen in Fig. 12A, the computer system 1600 includes: a computer module 1601; input devices such as a keyboard 1602, a mouse pointer device 1603, a scanner 1626, a camera 1627, and a microphone 1680; and output devices including a printer 1615, a display device 1614 and loudspeakers 1617. An external Modulator-Demodulator (Modem) transceiver device 1616 may be used by the computer module 1601 for communicating to and from a communications network 1620 via a connection 1621. The communications network 1620 may be a wide-area network (WAN), such as the Internet, a cellular telecommunications network, or a private WAN. Where the connection 1621 is a telephone line, the modem 1616 may be a traditional "dial-up" modem. Alternatively, where the connection 1621 is a high capacity (e.g., cable) connection, the modem 1616 may be a broadband modem. A wireless modem may also be used for wireless connection to the communications network 1620.
The computer module 1601 typically includes at least one processor unit 1605, and a memory unit 1606. For example, the memory unit 1606 may have semiconductor random access memory (RAM) and semiconductor read only memory (ROM). The computer module 1601 also includes a number of input/output (I/O) interfaces including: an audio-video interface 1607 that couples to the video display 1614, loudspeakers 1617 and microphone 1680; an I/O interface 1613 that couples to the keyboard 1602, mouse 1603, scanner 1626, camera 1627 and optionally a joystick or other human interface device (not illustrated); and an interface 1608 for the external modem 1616 and printer 1615. In some implementations, the modem 1616 may be incorporated within the computer module 1601, for example within the interface 1608. The computer module 1601 also has a local network interface 1611, which permits coupling of the computer system 1600 via a connection 1623 to a local-area communications network 1622, known as a Local Area Network (LAN). As illustrated in Fig. 12A, the local communications network 1622 may also couple to the wide network 1620 via a connection 1624, which would typically include a so-called "firewall" device or device of similar functionality. The local network interface 1611 may comprise an Ethernet circuit card, a Bluetooth™ wireless arrangement or an IEEE 802.11 wireless arrangement; however, numerous other types of interfaces may be practiced for the interface 1611.
The I/O interfaces 1608 and 1613 may afford either or both of serial and parallel connectivity, the former typically being implemented according to the Universal Serial Bus (USB) standards and having corresponding USB connectors (not illustrated). Storage devices 1609 are provided and typically include a hard disk drive (HDD) 1610. Other storage devices such as a floppy disk drive and a magnetic tape drive (not illustrated) may also be used. An optical disk drive 1612 is typically provided to act as a non-volatile source of data. Portable memory devices, such as optical disks (e.g., CD-ROM, DVD, Blu-ray Disc™), USB-RAM, portable external hard drives, and floppy disks, for example, may be used as appropriate sources of data to the system 1600.
The components 1605 to 1613 of the computer module 1601 typically communicate via an interconnected bus 1604 and in a manner that results in a conventional mode of operation of the computer system 1600 known to those in the relevant art. For example, the processor 1605 is coupled to the system bus 1604 using a connection 1618. Likewise, the memory 1606 and optical disk drive 1612 are coupled to the system bus 1604 by connections 1619. Examples of computers on which the described arrangements can be practised include IBM-PCs and compatibles, Sun Sparcstations, Apple Mac™ or like computer systems.
The method of predicting possible future positions of an object 1201, 1202, 1203 and controlling the close-up cameras 1231, 1232 to capture the predicted possible future positions may be implemented using the computer system 1600, wherein the processes of Figs. 1 to 4 (and the modules 1310, 1330, 1360), to be described, may be implemented as one or more software application programs 1633 executable within the computer system 1600. In particular, the steps of the method of predicting possible future positions of an object 1201, 1202, 1203 and controlling the close-up cameras 1231, 1232 are effected by instructions 1631 (see Fig. 12B) in the software 1633 that are carried out within the computer system 1600. The software instructions 1631 may be formed as one or more code modules, each for performing one or more particular tasks. The software may also be divided into two separate parts, in which a first part and the corresponding code modules perform the future position prediction and camera control methods, and a second part and the corresponding code modules manage a user interface between the first part and the user.
The software may be stored in a computer readable medium, including the storage devices described below, for example. The software is loaded into the computer system 1600 from the computer readable medium, and then executed by the computer system 1600. A computer readable medium having such software or computer program recorded on the computer readable medium is a computer program product. The use of the computer program product in the computer system 1600 preferably effects an advantageous apparatus for predicting possible future positions and controlling cameras based on the predicted possible future positions.
The software 1633 is typically stored in the HDD 1610 or the memory 1606. The software is loaded into the computer system 1600 from a computer readable medium, and executed by the computer system 1600. Thus, for example, the software 1633 may be stored on an optically readable disk storage medium (e.g., CD-ROM) 1625 that is read by the optical disk drive 1612. A computer readable medium having such software or computer program recorded on it is a computer program product. The use of the computer program product in the computer system 1600 preferably effects an apparatus for predicting possible future positions and controlling cameras based on the predicted possible future positions.
In some instances, the application programs 1633 may be supplied to the user encoded on one or more CD-ROMs 1625 and read via the corresponding drive 1612, or alternatively may be read by the user from the networks 1620 or 1622. Still further, the software can also be loaded into the computer system 1600 from other computer readable media. Computer readable storage media refers to any non-transitory tangible storage medium that provides recorded instructions and/or data to the computer system 1600 for execution and/or processing. Examples of such storage media include floppy disks, magnetic tape, CD-ROM, DVD, Blu-ray™ Disc, a hard disk drive, a ROM or integrated circuit, USB memory, a magneto-optical disk, or a computer readable card such as a PCMCIA card and the like, whether or not such devices are internal or external of the computer module 1601. Examples of transitory or non-tangible computer readable transmission media that may also participate in the provision of software, application programs, instructions and/or data to the computer module 1601 include radio or infra-red transmission channels as well as a network connection to another computer or networked device, and the Internet or Intranets including e-mail transmissions and information recorded on Websites and the like.
The second part of the application programs 1633 and the corresponding code modules mentioned above may be executed to implement one or more graphical user interfaces (GUIs) to be rendered or otherwise represented upon the display 1614. Through manipulation of typically the keyboard 1602 and the mouse 1603, a user of the computer system 1600 and the application may manipulate the interface in a functionally adaptable manner to provide controlling commands and/or input to the applications associated with the GUI(s). Other forms of functionally adaptable user interfaces may also be implemented, such as an audio interface utilizing speech prompts output via the loudspeakers 1617 and user voice commands input via the microphone 1680.
Fig. 12B is a detailed schematic block diagram of the processor 1605 and a "memory" 1634. The memory 1634 represents a logical aggregation of all the memory modules (including the HDD 1609 and semiconductor memory 1606) that can be accessed by the computer module 1601 in Fig. 12A.
When the computer module 1601 is initially powered up, a power-on self-test (POST) program 1650 executes. The POST program 1650 is typically stored in a ROM 1649 of the semiconductor memory 1606 of Fig. 12A. A hardware device such as the ROM 1649 storing software is sometimes referred to as firmware. The POST program 1650 examines hardware within the computer module 1601 to ensure proper functioning and typically checks the processor 1605, the memory 1634 (1609, 1606), and a basic input-output systems software (BIOS) module 1651, also typically stored in the ROM 1649, for correct operation. Once the POST program 1650 has run successfully, the BIOS 1651 activates the hard disk drive 1610 of Fig. 12A. Activation of the hard disk drive 1610 causes a bootstrap loader program 1652 that is resident on the hard disk drive 1610 to execute via the processor 1605. This loads an operating system 1653 into the RAM memory 1606, upon which the operating system 1653 commences operation. The operating system 1653 is a system level application, executable by the processor 1605, to fulfil various high level functions, including processor management, memory management, device management, storage management, software application interface, and generic user interface.
The operating system 1653 manages the memory 1634 (1609, 1606) to ensure that each process or application running on the computer module 1601 has sufficient memory in which to execute without colliding with memory allocated to another process. Furthermore, the different types of memory available in the system 1600 of Fig. 12A must be used properly so that each process can run effectively. Accordingly, the aggregated memory 1634 is not intended to illustrate how particular segments of memory are allocated (unless otherwise stated), but rather to provide a general view of the memory accessible by the computer system 1600 and how such is used.
As shown in Fig. 12B, the processor 1605 includes a number of functional modules including a control unit 1639, an arithmetic logic unit (ALU) 1640, and a local or internal memory 1648, sometimes called a cache memory. The cache memory 1648 typically includes a number of storage registers 1644 - 1646 in a register section. One or more internal busses 1641 functionally interconnect these functional modules. The processor 1605 typically also has one or more interfaces 1642 for communicating with external devices via the system bus 1604, using a connection 1618. The memory 1634 is coupled to the bus 1604 using a connection 1619.
The application program 1633 includes a sequence of instructions 1631 that may include conditional branch and loop instructions. The program 1633 may also include data 1632 which is used in execution of the program 1633. The instructions 1631 and the data 1632 are stored in memory locations 1628, 1629, 1630 and 1635, 1636, 1637, respectively. Depending upon the relative size of the instructions 1631 and the memory locations 1628-1630, a particular instruction may be stored in a single memory location as depicted by the instruction shown in the memory location 1630. Alternatively, an instruction may be segmented into a number of parts, each of which is stored in a separate memory location, as depicted by the instruction segments shown in the memory locations 1628 and 1629.
In general, the processor 1605 is given a set of instructions which are executed therein. The processor 1605 waits for a subsequent input, to which the processor 1605 reacts by executing another set of instructions. Each input may be provided from one or more of a number of sources, including data generated by one or more of the input devices 1602, 1603, data received from an external source across one of the networks 1620, 1622, data retrieved from one of the storage devices 1606, 1609 or data retrieved from a storage medium 1625 inserted into the corresponding reader 1612, all depicted in Fig. 12A. The execution of a set of the instructions may in some cases result in output of data. Execution may also involve storing data or variables to the memory 1634.
The disclosed position prediction and camera control arrangements use input variables 1654, which are stored in the memory 1634 in corresponding memory locations 1655, 1656, 1657. The position prediction and camera control arrangements produce output variables 1661, which are stored in the memory 1634 in corresponding memory locations 1662, 1663, 1664. Intermediate variables 1658 may be stored in memory locations 1659, 1660, 1666 and 1667.
Referring to the processor 1605 of Fig. 12B, the registers 1644, 1645, 1646, the arithmetic logic unit (ALU) 1640, and the control unit 1639 work together to perform sequences of micro-operations needed to perform "fetch, decode, and execute" cycles for every instruction in the instruction set making up the program 1633. Each fetch, decode, and execute cycle comprises:
a fetch operation, which fetches or reads an instruction 1631 from a memory location 1628, 1629, 1630;
a decode operation in which the control unit 1639 determines which instruction has been fetched; and
an execute operation in which the control unit 1639 and/or the ALU 1640 execute the instruction.
Thereafter, a further fetch, decode, and execute cycle for the next instruction may be executed. Similarly, a store cycle may be performed by which the control unit 1639 stores or writes a value to a memory location 1632.
Each step or sub-process in the processes of Figs. 1 to 4 (or the modules 1310, 1330, 1360) is associated with one or more segments of the program 1633 and is performed by the register section 1644, 1645, 1646, the ALU 1640, and the control unit 1639 in the processor 1605 working together to perform the fetch, decode, and execute cycles for every instruction in the instruction set for the noted segments of the program 1633.
The method of predicting future positions of an object and controlling cameras based on the predicted future positions may alternatively be implemented in dedicated hardware such as one or more integrated circuits performing the functions or sub functions of predicting future position and controlling camera configurations. Such dedicated hardware may include graphic processors, digital signal processors, or one or more microprocessors and associated memories.
Modules of Control Unit 1250
Fig. 9 is a schematic block diagram of the control unit 1250 having a tracking module 1310, a prediction module 1320, a camera control module 1330 and an image content module 1360. Each of the modules 1310, 1320, 1330, 1360 can be implemented as one or more software application programs 1633 that are executable by the processor 1605.
The tracking module 1310 receives image or video data from the wide-angle camera 1230 and produces orientation data representing the position of each object 1201, 1202, 1203 in the scene 1210 for each point in time. The tracking module 1310 represents the position of the objects 1201, 1202, 1203 in image coordinates within the image produced by the wide-angle camera 1230, as shown in Fig. 10. In one arrangement, orientation positions are represented using a horizontal 1410 and vertical 1430 coordinate of the position of objects 1420, 1421, 1440 within the field of view 1415, for a series of frames 1405, 1406 (which correspond with a series of observed points in time). In another arrangement, orientation positions are represented by physical coordinates of the position of the objects 1420, 1421, 1440 within the physical space.
Although the tracking module 1310 is shown as a separate module in Fig. 9, the tracking module 1310 can be integrated as part of the prediction module 1320. As described below in relation to the method 210 (shown in Fig. 4), in that case the tracking module 1310 is part of the prediction module 1320 and its function is performed at step 405.
For example, video data includes 10 image frames over a period of time t1 to t10. The object 1201 may be in a position x1, y1 in a first image frame, which is captured at time t1. Subsequently, the object 1201 may be in a position x2, y2 in a second image frame, which is captured at time t2. The tracking module 1310 then records the positions (e.g., x1, y1, x2, y2, etc.) of the object 1201 over the time period t1 to t10. Therefore, the tracking module 1310 acquires orientation data over a period of time as a number of image frames is received.
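A possible representation of the orientation data accumulated by the tracking module, one track of (x, y) positions per object over the observed frames, is sketched below; the class and field names are illustrative assumptions, not terms from the specification.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Track:
    """Orientation data for one object: its (x, y) position at each observed time."""
    object_id: int
    positions: List[Tuple[float, float]] = field(default_factory=list)

    def add_observation(self, x: float, y: float) -> None:
        self.positions.append((x, y))


# Example: object 1201 observed over frames captured at times t1 and t2.
track_1201 = Track(object_id=1201)
track_1201.add_observation(12.0, 34.0)   # position (x1, y1) at time t1
track_1201.add_observation(13.5, 33.2)   # position (x2, y2) at time t2

# The tracking module would maintain one such track per object in the scene.
tracks: Dict[int, Track] = {1201: track_1201}
```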
The tracking module 1310 then transmits the orientation data to the prediction module 1320. The orientation data is used by the prediction module 1320 to predict possible future positions of the objects 1201, 1202, 1203 at a future point in time. Each predicted possible future position is associated with a probability score of the objects 1201, 1202, 1203 moving to that predicted possible future position.
Fig. 5 shows an illustration of the prediction module 1320 (which is implemented as neural network layers 920) receiving the orientation data of an object (e.g., 1201, 1202, 1203) to predict possible future positions of the object (e.g., 1201, 1202, 1203). Fig. 6 then shows an example structure of the neural network layers 920 for implementing the prediction module 1320. The prediction module 1320 will be further discussed below.
The prediction module 1320 transmits the predicted possible future positions of the objects 1201, 1202, 1203 to the camera control module 1330.
The camera control module 1330 uses the predicted possible future positions of the objects 1201, 1202, 1203 to determine configurations of the close-up pan-tilt units 1241, 1242, 1243 and the close-up cameras 1231, 1232 that would be able to capture one or more of the predicted possible future positions of the objects 1201, 1202, 1203. The object 1201, 1202, 1203 can therefore be captured by one or more of the close-up cameras 1231, 1232 at high quality as discussed above. In one arrangement, the close-up cameras 1231, 1232 capture the predicted possible future positions of the objects 1201, 1202, 1203 with the highest total probability.
The predicted possible future positions received from the prediction module 1320 may be represented as a distribution of probability estimates over a number of possible future positions for each object 1201, 1202, 1203. The camera control module 1330 then selects a fixed number of the predicted possible future positions. In one arrangement, the number of future positions chosen is equal to the number of close-up cameras 1231, 1232 controlled by the camera control module 1330. In another arrangement, a close-up camera 1231, 1232 may be able to capture two of the predicted possible future positions and therefore the number of future positions chosen may be more than the number of close-up cameras 1231, 1232. The selected future positions are chosen based on the probability scores at the predicted possible future positions. For example, a predicted possible future position with a probability score of 0.8 is likely to be selected over another predicted possible future position with a probability score of 0.4. The configurations of the close-up pan-tilt units 1241, 1242, 1243 and close-up cameras 1231, 1232 are chosen such that the fields of view 1221, 1222 cover one or more of the selected possible future positions of the objects 1201, 1202, 1203.
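Selecting a fixed number of predicted future positions with the highest probability scores, for example one per close-up camera, could be done as in the following sketch; the candidates are any objects carrying a probability `score` attribute (such as the Prediction records in the earlier sketch), and the procedure is illustrative rather than prescribed by the disclosure.

```python
def select_positions(predictions, num_cameras):
    """Keep the highest-scoring predicted future positions across all objects,
    limited to the number of available close-up cameras (illustrative only).

    `predictions` maps an object identifier to a list of scored candidate
    positions, each carrying a `score` attribute.
    """
    # Flatten all candidates and sort by probability score, highest first.
    candidates = [p for per_object in predictions.values() for p in per_object]
    candidates.sort(key=lambda p: p.score, reverse=True)
    return candidates[:num_cameras]
```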
The configurations are chosen to account for time delays in adjusting the configurations of the close-up pan-tilt units 1241, 1242, 1243 and close-up cameras 1231, 1232. For example, if the close-up pan-tilt units 1241, 1242, 1243 or the close-up cameras 1231, 1232 are slow to adjust (or move), then the prediction system has to predict further in the future to provide sufficient time to adjust (or move) the close-up pan-tilt units 1241, 1242, 1243 or the close-up cameras 1231, 1232.
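One way to think about this delay compensation is to derive the prediction horizon from the worst-case time the pan-tilt unit needs to reach a new orientation; the sketch below assumes angular units, rates and a processing delay that are not taken from the specification.

```python
def prediction_horizon(max_pan_error_deg: float, slew_rate_deg_per_s: float,
                       processing_delay_s: float = 0.1) -> float:
    """How far into the future to predict so the camera has time to move.

    The horizon must cover the mechanical travel time plus any processing
    delay. The formula and numbers are illustrative assumptions only.
    """
    travel_time = max_pan_error_deg / slew_rate_deg_per_s
    return travel_time + processing_delay_s


# Example: a unit that may need to pan 60 degrees at 90 deg/s should be driven
# by predictions roughly 0.77 s ahead.
print(prediction_horizon(60.0, 90.0))
```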
The image content module 1360 receives image or video data from the close-up cameras 1231, 1232 and uses the received image or video data. The received image or video data can, for example, be stored for future viewing, or be transferred to a broadcasting system.
Operation of Control Unit 1250
The schematic flow diagram in Fig. 1 shows a method 100 for operating the control unit 1250. The method 100 first trains, at step 110, the prediction module 1320 of the control unit 1250.
Once the prediction module 1320 is trained, the method 100 proceeds to a method 120. The method 120 is performed by the control unit 1250 to predict future positions of objects using the prediction module 1320. The prediction of future positions of an object is described in relation to method 210 (shown in Fig. 4).
In the method 120, the control unit 1250 also uses the predicted future positions of the objects to adjust the configuration of the pan-tilt units 1241, 1242, 1243 and close-up cameras 1231, 1232 to perform photography of a scene 1210.
The method 120 (shown in Fig. 3) commences at a method 210 for predicting possible future positions of an object (e.g., 1201, 1202, 1203). The method 210 is described below and is shown in Fig. 4. The method 210 is performed by the prediction module 1320. In summary, the method 210 predicts, for an instance or a set of instances of video, possible future positions of a set of objects. The method 210, as will be described below, uses input from the wide-angle camera 1230 photographing a scene 1210 to perform the prediction. The method 210 produces a set of predicted future positions and associated probability scores for each object 1201, 1202, 1203. In one arrangement, the set of predicted future positions is associated with a continuous distribution over space representing the estimated probability distribution. The method 120 then proceeds from the method 210 to step 310.
In step 310, the method 120 determines the predicted possible future positions with the highest probability scores. Step 310 is performed by the camera control module 1330. Such a determination maximises the probability of capturing the objects 1201, 1202, 1203. Further, by properly capturing the objects 1201, 1202, 1203, the captured images or videos do not need to be manipulated and therefore the quality of the captured images or videos can be preserved. Therefore, the probability of capturing high quality photographs of objects 1201, 1202, 1203 in the scene 1210 is maximised. However, the captured images or videos may still be manipulated afterwards.
Examples of characteristics of high quality photographs include: a narrow field of view 1221, 1222 that allows high resolution photographs to be obtained; a field of view 1221, 1222 wide enough to cover the extent of the objects 1201, 1202, 1203 being photographed; control of focus to ensure the objects 1201, 1202, 1203 are in focus; and control of aperture to produce an appealing image, such as using a narrow depth of field to emphasise the objects 1201, 1202, 1203 or a long depth of field to allow more of the scene in front of and behind the objects to be visible.
The method 120 then proceeds from step 310 to step 320.
In step 320, the method 120 selects camera configurations of the close-up cameras 1231, 1232 based on the determination of step 310. Step 320 is performed by the camera control module 1330.
The prediction and camera control system shown in Fig. 8 has a fixed number of close-up cameras 1231, 1232 to control. Therefore, it is not possible to instruct a camera to capture each and every possible future position of the object 1201, 1202, 1203. Accordingly, the method 120, in step 320, chooses a configuration, including the fields of view 1221, 1222 of the cameras, that maximises the probability that the objects 1201, 1202, 1203 will be present and captured in high quality by the close-up cameras 1231, 1232. An example of maximising the probability is to choose field of view configurations 1221, 1222 such that the sum of the estimated probability values for spatial regions within the chosen fields of view 1221, 1222 is maximised, and such that the field of view configurations 1221, 1222 will produce high-quality photographs, for example by choosing field of view configurations 1221, 1222 that cover the extent of the objects 1201, 1202, 1203 being photographed. In another example, the field of view configurations 1221, 1222 are chosen to cover the possible future positions with the highest probability scores.
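One possible way to express this field-of-view choice, maximising the sum of estimated probability values covered by a field of view, is sketched below; the discrete candidate fields of view, the probability map and all names are assumptions introduced for illustration.

```python
from typing import Dict, List, Tuple

Position = Tuple[float, float]


def best_field_of_view(candidate_fovs: List[Tuple[Position, float]],
                       probability_map: Dict[Position, float]) -> Tuple[Position, float]:
    """Return the candidate field of view covering the most probability mass.

    Each candidate is a (centre, radius) pair; the probability map assigns an
    estimated probability to each discrete position. Illustrative only.
    """
    def covered_mass(centre: Position, radius: float) -> float:
        cx, cy = centre
        return sum(p for (x, y), p in probability_map.items()
                   if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2)

    return max(candidate_fovs, key=lambda fov: covered_mass(*fov))
```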
In one arrangement, one close-up camera 1231, 1232 may be able to capture two or more predicted future positions of the object 1201, 1202, 1203. This arrangement improves the probability of capturing the object 1201, 1202, 1203.
In one arrangement, the configurations of the close-up cameras 1231, 1232 can be selected by first determining content quality of the image of the object. For example, a close-up camera 1231 nearest a selected possible future position is configured to cover the selected possible future position, rather than another close-up camera 1232 that is further away from the selected possible future position.
The method 120 then proceeds from step 320 to step 325.
In step 325, the control module 1330 operates the close-up pan-tilt units 1241, 1242, 1243 and close-up cameras 1231, 1232 to implement the configuration selected in step 320.
The method 120 concludes at the conclusion of step 325.
The method 120 is repeatedly performed to continuously predict future positions of the object 1201, 1202, 1203 so that the configurations of the close-up pan-tilt units 1241, 1242, 1243 and the close-up cameras 1231, 1232 are continuously determined. For example, over the duration of an event, the method 120 is repeatedly performed so that the configurations of the close-up pan-tilt units 1241, 1242, 1243 and the close-up cameras 1231, 1232 are continuously updated to capture the objects 1201, 1202, 1203 in the scene 1210.
Prediction module 1320
Training
As described above, the prediction module 1320 can be implemented as neural network layers 920. The prediction module 1320 is trained to predict possible future positions of each object provided by the tracking module 1310, along with associated probability scores.
Before the prediction module 1320 (which is implemented as neural network layers 920) can be used in the method 120, the prediction module 1320 is first trained using training data.
The schematic flow diagram in Fig. 2 shows the method 110 of performing training of the prediction module 1320. The method 110 commences at the method 210 by predicting possible future positions of an object. The training data is used for the prediction.
The training data includes positions of various objects at various points in time. The prediction module 1320 then predicts possible future positions of the various objects in the training data. The predicted possible future positions are then compared with the actual future positions of the various objects in the training data. The differences between the predicted possible future positions and the actual future positions are known as the loss value. The neural network layers 920 can then be updated based on the loss value. Once a stopping criterion is reached, for example when the loss value on a validation dataset reaches a minimum, the neural network layers 920 can be used to perform the method 120.
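The training procedure just described (predict, compare with the actual future positions, compute a loss value, update the network, and stop when the validation loss reaches a minimum) can be summarised as a generic loop; the model and optimiser interfaces below are placeholders, not components of the disclosure.

```python
def train(model, optimiser, training_batches, validation_batches, max_epochs=100):
    """Generic training loop for the prediction network (illustrative only)."""
    best_val_loss = float("inf")
    for epoch in range(max_epochs):
        for past_positions, actual_future in training_batches:
            predicted = model.forward(past_positions)      # predicted future positions
            loss = model.loss(predicted, actual_future)    # difference is the loss value
            gradients = model.backward(loss)               # gradients via the chain rule
            optimiser.step(model, gradients)               # update weights and biases

        # Stopping criterion: validation loss no longer improving.
        val_loss = sum(model.loss(model.forward(x), y) for x, y in validation_batches)
        if val_loss >= best_val_loss:
            break
        best_val_loss = val_loss
```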
An artificial neural network (e.g., the neural network layers 920), which forms the prediction module 1320, includes several layers, where each layer has a network of nodes and edges. Edges are also known as connections. Each edge is weighted, having an associated real-valued number or weight representing the strength of the connection between the nodes. When the artificial neural network is trained for prediction, each node has an activation function to produce an output from an input. The output is an activation value, which is a real-valued number. In other words, the activation value is a response produced by the node based on the values of the inputs to the node, the activation function of the node, and the connection strengths (edge weights). As discussed above, an example of the neural network layers 920 is shown in Fig. 6, showing input nodes 1010, edges 1020, 1040, nodes 1030 in a second layer and output nodes 1050. Each input node 1010 in the figure is represented by a circle, and each edge 1020, 1040 is represented by a line.
To perform future prediction of an instance of data, the instance of data is provided as input data values to the artificial neural network. In this case, the input data is the spatial positions of one or more objects 1201, 1202, 1203 at one or more points in time. The input data values are processed, for example by normalising such that the mean value is zero and standard deviation is one, to produce activation values for each of the input nodes 1010 of the artificial neural network (i.e., the neural network layers 920). Activation of nodes 1010, 1030, 1050 in each layer of the artificial neural network is performed by propagation of values from the input nodes 1010 to nodes 1030, 1050 in subsequent layers of the artificial neural network, according to the strength of the connections between the nodes, and other functions defined by the artificial neural network, such as an activation function.
Possible activation functions include, but are not limited to: the sigmoid function, the hyperbolic tangent function, and the rectified linear function. An example of the activation value of a node 1030 in the first layer is given as follows, where r1 is the activation value of the first node 1030 in the first layer, a1, a2, a3, a4 are the activation values of the nodes 1010 in the input layer, and w1, w2, w3, w4 are the weight values of the edges 1020 in the first layer. The activation function of the node 1030 is tanh of the sum of the weighted input values. Therefore, the activation value of the first node 1030 is:
r1 = tanh(w1·a1 + w2·a2 + w3·a3 + w4·a4)
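For concreteness, the activation in the formula above can be computed directly as follows; a minimal sketch in which the input and weight values are arbitrary example numbers.

```python
import math


def node_activation(inputs, weights):
    """tanh of the weighted sum of the input activations, as in the formula above."""
    return math.tanh(sum(w * a for w, a in zip(weights, inputs)))


# r1 = tanh(w1*a1 + w2*a2 + w3*a3 + w4*a4), with example values only.
r1 = node_activation([0.5, -0.2, 0.1, 0.9], [0.3, 0.8, -0.5, 0.2])
```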
During training of the artificial neural network, the weight values and parameters of the activation functions are adjusted so that the loss value is low.
The method 110 commences at the method 210 where the method 110 predicts possible future positions of an object 1201, 1202, 1203 from the training data. The method 210 is described below in relation to Fig. 4. The method 110 then proceeds from the method 210 to step 220.
In step 220, the method 110 compares the predicted possible future positions of the object 1201, 1202, 1203 with known future positions. These known future positions are provided in the training data. For each instance of the training data, the activation value of output nodes of the artificial neural network (i.e., the neural network layers 920) is determined and compared with a corresponding training target. In one example, the training target represents a fixed number of estimates of the future position of each of the objects provided as input, along with an associated probability that each estimate will occur. In a second example, the training target encodes a distribution of the probability of each spatial position at the future point in time, representing the probability that each position is the actual position that the object occupies at the future point in time. The relationship between the activation values of the output nodes and the corresponding training target is the loss value. The loss value is used to modify the connections of the artificial neural network. In this manner, the artificial neural network is trained to recognise the relationship between the input data values and the target values. The loss value may, for example, relate to a difference, a mean-square error, a Euclidian distance, or a distribution-based function between the activation values of the output nodes and the corresponding training target.
Therefore, for each training instance, or set of training instances, the gradient ∂c/∂p of the cost c (i.e., a loss value) with respect to each trainable parameter p is determined using the chain rule.
The method 110 then proceeds from step 220 to step 230.
In step 230, parameters of the neural network layers 920 are updated based on the loss value determined at step 220. The values of the trainable parameters of the network are updated according to the gradient of the loss value with respect to the trainable parameters. Examples of trainable parameters in the neural network layers 920 are the weights of the edges 1020, 1040 and the biases, indicated by w1 and w2 in Fig. 6.
Each trainable parameter p is updated using the gradient of the loss value. The update of each trainable parameter p based on the gradient of the cost c (i.e., the loss value) with respect to the parameter p can be performed as follows:
p = p − α·∂c/∂p
where α is a hyper-parameter representing the rate of learning for each training instance; for example, α is set to 0.01.
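The update rule above is plain gradient descent; a minimal sketch using the example learning rate of 0.01 (the parameter and gradient values are arbitrary illustrations).

```python
def gradient_descent_update(parameters, gradients, learning_rate=0.01):
    """Update each trainable parameter p as p = p - alpha * dc/dp (illustrative)."""
    return [p - learning_rate * g for p, g in zip(parameters, gradients)]


# Example: two weights with their loss gradients.
updated = gradient_descent_update([0.30, -0.50], [0.12, -0.04])
```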
The method 110 then proceeds from step 230 to step 240.
In step 240, the method 110 determines whether training of the neural network layers 920 should be stopped. One criterion for such a determination is to determine whether the loss value determined at step 220 on training instances of a validation set is at a minimum.
Another example of the stopping criterion is when a fixed number of instances have been processed. Another example of the stopping criterion is according to the classification values produced, such as when the classification error of values in a set of validation data begins to increase on average.
If the method 110 determines at step 240 that the stopping criterion is reached (YES), then the method 110 proceeds from step 240 to end step 299. Otherwise (NO), the method 110 proceeds from step 240 to the method 210.
Predicting possible future positions of an object
The schematic flow diagram in Fig. 4 shows the method 210 of predicting possible future positions of an object 1201, 1202, 1203, where the prediction system 1320 is implemented by an artificial neural network, such as that shown in Fig. 9.
The method 210 commences at step 405, by first pre-processing the input video data. For example, the tracking module 1310 can be used to produce orientation data having the positions at various points in time of the objects 1201, 1202, 1203 present in the observed scene 1210. This is described above in relation to the tracking module 1310. The method 210 then proceeds from step 405 to step 410.
In step 410, response values for the first, second, and third layers of the neural network layers 920 are determined. If different neural network layers are used, the response values for all the layers are determined. The method 210 proceeds from step 410 to step 450.
In step 450, using the response values, prediction values of the future orientation positions of each object are determined. Fig. 11 shows an example of such a prediction. The object 1420 is predicted to move to possible future positions 1531, 1532 at a future point in time 1505. Each possible future position 1531, 1532 is associated with a probability score. In this example, each predicted future orientation position 1531, 1532 is represented using vertical 1530 and horizontal 1510 position values. The method 210 concludes at the conclusion of step 450.
The method 210 of predicting possible future positions 1531, 1532 of the object 1420 uses activation values of nodes of the neural network layers 920. The method 210 then produces a set of predictions of future positions 1531, 1532 of objects 1420 and associated probability scores. The probability score, as discussed above, is an estimate of the probability that the object will occur at the future position at a future point in time.
The possible future positions of an object 1201, 1202, 1203, as predicted by the prediction module 1320, can be coordinates in a physical space or in the field of view of a camera. The predicted possible future position of an object 1201, 1202, 1203 may be a single possible future position at a future point in time, or a distribution of multiple possible future positions.
For example, the prediction module 1320 may predict that an object 1201, 1202, 1203 may be at a single possible future position x4, y4 at time t4 with a probability score of 0.8. In other words, the probability that the object 1201, 1202, 1203 would be at position x4, y4 at time t4 is 80%.
In another example, the prediction module 1320 may predict a distribution of multiple future positions for an object 1201, 1202, 1203. The multiple future positions are position x4, y4 at time t4 with a probability score of 0.8; position x4-2, y4-2 at time t4 with a probability score of 0.15; and position x4-3, y4-3 at time t4 with a probability score of 0.05. Therefore, the probability that the object 1201, 1202, 1203 at time t4 is at position x4, y4 is 80%, at position x4-2, y4-2 is 15%, and at position x4-3, y4-3 is 5%, where the total probability score comes to 100%.
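Such a distribution can be represented as a list of scored candidate positions whose probability scores sum to one; a small sketch follows, with hypothetical names and example coordinates standing in for x4, y4 and the other positions.

```python
from collections import namedtuple

# A candidate future position with its probability score (names are illustrative).
Candidate = namedtuple("Candidate", ["x", "y", "score"])


def normalise_scores(candidates):
    """Scale raw confidence values so the probability scores sum to 1."""
    total = sum(c.score for c in candidates)
    return [Candidate(c.x, c.y, c.score / total) for c in candidates]


# Distribution of possible positions of one object at time t4, as in the example
# above: 80%, 15% and 5% over three candidate positions.
candidates_t4 = [Candidate(4.0, 4.0, 0.80),
                 Candidate(2.0, 2.0, 0.15),
                 Candidate(1.0, 1.0, 0.05)]
```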
As described above, the orientation data representing objects can be in a coordinate space of the physical space. Fig. 7 shows an example of objects 1150, 1151, 1170 in a playing field 1130, where the position of each object 1150, 1151, 1170 is represented in a coordinate space of the physical dimensions of the playing field, for example using coordinates of northerly 1160 and easterly 1140 values within the dimensions of the playing field 1130 for a number of points in time 1135, 1136.
In one arrangement, several wide-angle cameras 1230 are positioned around a playing field 1130. Each wide-angle camera 1230 then observes the playing field 1130 from a different viewing angle, and the field of view 1220 of each of the wide-angle cameras 1230 does not necessarily cover the extent of the playing field 1130. The prediction module 1320 then predicts future positions of the objects 1150, 1151, 1170 in the coordinates of the playing field 1130, along with associated estimated probability values that the estimated future positions are the correct future positions. The camera control module 1330 projects the estimated future position probability values in the playing field 1130 into the coordinates of the fields of view 1221, 1222 of the close-up cameras 1231, 1232 observing the scene 1130, in order to determine a chosen configuration of the close-up pan-tilt units 1241, 1242, 1243 and close-up cameras 1231, 1232.
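Projecting a predicted position expressed in playing-field coordinates into pan and tilt angles for a particular close-up camera can be done with simple ground-plane geometry when the camera's location and mounting height are known; the sketch below is an assumption-laden illustration (units, coordinate frame and helper names are not from the specification), not the projection actually used.

```python
import math


def pan_tilt_for(target_east: float, target_north: float,
                 cam_east: float, cam_north: float, cam_height: float):
    """Return (pan, tilt) in degrees to aim a camera at a ground-plane position.

    The field uses easterly/northerly coordinates in metres; the camera is
    mounted cam_height metres above the ground. Illustrative geometry only.
    """
    de = target_east - cam_east
    dn = target_north - cam_north
    ground_distance = math.hypot(de, dn)
    pan = math.degrees(math.atan2(de, dn))                          # bearing from north
    tilt = -math.degrees(math.atan2(cam_height, ground_distance))   # downward angle
    return pan, tilt


# Example: a camera 10 m up at the field corner, target near the centre circle.
print(pan_tilt_for(50.0, 30.0, 0.0, 0.0, 10.0))
```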
Example(s)/Use Case(s)
An example of the prediction and camera control system is described. The example is to control multiple close-up cameras to photograph a sports event, consisting of several players on a sports field playing a game of football.
In this example, a wide-angle camera 1230 is positioned such that the extent of the football playing field 1210 is visible to the camera. Training of the prediction module 1320 is performed by receiving images from the wide-angle camera 1230, for example using recorded video of football matches, and performing pre-processing (step 405) with the tracking module 1310 to produce orientation data of the positions of all players 1201, 1202 on the field 1210 and the ball 1203 for each observed point in time. A further pre-processing step selects a fixed number of player orientation tracks, for example three (3) player orientation tracks, choosing the player orientation tracks that are closest to the ball. The chosen player orientation tracks and the ball orientation track are passed to the prediction module 1320, representing the player orientation tracks for a constant number of points in time up to and including the most recent point in time, for example 20 points in time.
The prediction module 1320 selects a fixed number of points in time to use as input for the prediction system, for example the first 10 points in time, referred to as the observed period. The most recent point in time is selected as the ground-truth of the prediction task. The prediction system 1320 used is an artificial neural network 900 such as shown in Fig. 5, and the layers of the neural network 920 consist of three fully-connected layers, as shown in Fig. 6. The number of nodes in each layer in this example is 1024. Training is performed by presenting the orientation positions for each chosen player and the ball for the observed period as input 910 to the neural network layers 920, and operation of the neural network layers 920 produces predictions of the position of each chosen player and the ball for the most recent point in time. A constant number of predictions are produced for each chosen player and the ball, for example four (4) predictions, and a probability value is produced for each prediction, using a determined confidence value for each prediction that is normalised such that the sum of probability values for each chosen player and the ball is equal to 1. The estimated values are compared with the actual values of the positions of each player and the ball at the most recent point in time, and a loss value is determined, for example using the difference between the nearest predicted position and the actual future position for each chosen player and the ball, and the difference between the predicted confidence value of the nearest predicted position and one (1), and the difference between each confidence value of the other predicted positions and zero (0).
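The loss described in this example, which compares the nearest of the predicted positions with the actual position and drives its confidence towards one and the other confidences towards zero, might be sketched as follows; this is an illustrative formulation, not necessarily the exact one used.

```python
def multi_hypothesis_loss(predictions, confidences, actual):
    """Loss over several predicted positions for one object (illustrative).

    predictions: list of (x, y) predicted positions
    confidences: list of confidence values, one per prediction
    actual:      the actual (x, y) position at the future point in time
    """
    def sq_dist(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

    # Position term: squared distance of the nearest prediction to the actual position.
    nearest = min(range(len(predictions)), key=lambda i: sq_dist(predictions[i], actual))
    position_term = sq_dist(predictions[nearest], actual)

    # Confidence term: nearest prediction's confidence pushed towards 1, others towards 0.
    confidence_term = (confidences[nearest] - 1.0) ** 2 + sum(
        c ** 2 for i, c in enumerate(confidences) if i != nearest)
    return position_term + confidence_term
```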
After training the prediction module 1320 using examples of observed football matches, the method 120 can be performed. During a football match, a wide-angle camera 1230 is positioned to record images of the football playing field 1210, and passes the images to the control unit 1250. The tracking module 1310 and prediction module 1320 use a fixed number of the most recent points in time, for example 10, of the received wide-angle camera 1230 images as the observed points in time, and produce predictions of the future positions of a fixed number of chosen player orientation tracks and the ball. In this example, eight (8) close-up cameras 1231, 1232 are used. The camera control module 1330 selects a fixed number of chosen predicted future positions to maximise the sum of the estimated probability values of the chosen positions. In this example, the number of chosen positions matches the number of close-up cameras 1231, 1232, for example eight (8) chosen positions. However, as described above, one close-up camera 1231, 1232 may capture more than one possible future position, which then enables the number of selected possible future positions to exceed the number of close-up cameras 1231, 1232.
The configurations of the close-up pan-tilt units 1241, 1242, 1243 and the close-up cameras 1231, 1232 are controlled such that the close-up cameras 1231, 1232 collect images covering the chosen positions when the future point in time is reached. In this example, two chosen positions are selected for each chosen player, and close-up recording is performed for each chosen position, such that the probability that the chosen player appears in at least one of the close-up camera 1231, 1232 recordings is high. The recorded images from the close-up cameras 1231, 1232 are used by the image content module 1360, for example by recording the images for later use.
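A minimal sketch of directing a close-up camera towards a chosen position is given below, assuming the camera location and the predicted positions are expressed in a common field coordinate system; the function name pan_tilt_for_position and its parameters are hypothetical.

```python
import math

def pan_tilt_for_position(camera_xyz, target_xy, target_height=0.0):
    """Compute pan and tilt angles (degrees) that point a camera at a ground position.

    camera_xyz: (x, y, z) location of the close-up camera in field coordinates.
    target_xy:  predicted (x, y) future position on the playing field.
    """
    cx, cy, cz = camera_xyz
    tx, ty = target_xy
    dx, dy, dz = tx - cx, ty - cy, target_height - cz
    pan = math.degrees(math.atan2(dy, dx))                   # rotation about the vertical axis
    tilt = math.degrees(math.atan2(dz, math.hypot(dx, dy)))  # elevation towards the target
    return pan, tilt
```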
Industrial Applicability
The arrangements described are applicable to the computer and data processing industries and particularly for controlling cameras by predicting possible future positions of an object.
The foregoing describes only some embodiments of the present invention, and modifications and/or changes can be made thereto without departing from the scope and spirit of the invention, the embodiments being illustrative and not restrictive.
In the context of this specification, the word "comprising" means "including principally but not necessarily solely" or "having" or "including", and not "consisting only of". Variations of the word "comprising", such as "comprise" and "comprises", have correspondingly varied meanings.
Claims (20)
1. A method of capturing an image of an object using a plurality of cameras in a network, the method comprising:
receiving orientation information of the object whose image is to be captured;
predicting future positions of the object, wherein each of the future positions is associated with a probability score based on the plurality of cameras, wherein the prediction of the future positions is based on the received orientation information; and
configuring the plurality of cameras to capture an image of the object at one or more of the predicted future positions.
2. The method of claim 1, further comprising:
determining the predicted future positions with the highest probability scores; and
selecting the predicted future positions based on the determination, wherein the plurality of cameras is configured to capture images at the selected predicted future positions.
3. The method of claim 2, wherein the number of selected predicted future positions is based on a number of the plurality of cameras.
4. The method of claim 2, wherein the selection of the predicted future positions is to maximise a probability of capturing the object.
5. The method of claim 1, wherein the object comprises a person or a ball.
6. The method of claim 1, wherein the configuring of the plurality of cameras further comprises determining content quality of the image of the object to be captured by one or more of the plurality of cameras at one or more of the predicted future positions.
7. The method of claim 1, wherein the object is associated with an event of interest.
8. The method of claim 7, wherein the event of interest is a sporting event.
9. A system for capturing an image of an object, the system comprising:
a plurality of cameras; and
a control unit in communication with the plurality of cameras, wherein the control unit comprises a processor and a memory, wherein the memory is in communication with the processor and stores an application program that is executable by the processor, and wherein, when executing the application program, the processor is configured for:
receiving orientation information of the object whose image is to be captured;
predicting future positions of the object, wherein each of the future positions is associated with a probability score based on the plurality of cameras, wherein the prediction of the future positions is based on the received orientation information; and
configuring the plurality of cameras to capture an image of the object at one or more of the predicted future positions.
10. The system of claim 9, wherein the processor is further configured for:
determining the predicted future positions with the highest probability scores; and
selecting the predicted future positions based on the determination, wherein the plurality of cameras is configured to capture images at the selected predicted future positions.
11. The system of claim 10, wherein the number of selected predicted future positions is based on a number of the plurality of cameras.
12. The system of claim 10, wherein the selection of the predicted future positions is to maximise a probability of capturing the object.
13. The system of claim 9, wherein the object comprises a person or a ball.
14. The system of claim 9, wherein the processor is further configured for:
determining content quality of the image of the object to be captured by one or more of the plurality of cameras at one or more of the predicted future positions.
15. The system of claim 9, wherein the object is associated with an event of interest.
16. The system of claim 15, wherein the event of interest is a sporting event.
17. A non-transitory computer readable medium storing a program for performing a method of capturing an image of an object using a plurality of cameras in a network, the method comprising:
receiving orientation information of the object whose image is to be captured;
predicting future positions of the object, wherein each of the future positions is associated with a probability score based on the plurality of cameras, wherein the prediction of the future positions is based on the received orientation information; and
configuring the plurality of cameras to capture an image of the object at one or more of the predicted future positions.
18. The non-transitory computer readable medium of claim 17, wherein the method further comprises:
determining the predicted future positions with the highest probability scores; and
selecting the predicted future positions based on the determination, wherein the plurality of cameras is configured to capture images at the selected predicted future positions.
19. The non-transitory computer readable medium of claim 18, wherein the number of selected predicted future positions is based on a number of the plurality of cameras.
20. The non-transitory computer readable medium of claim 18, wherein the selection of the predicted future positions is to maximise a probability of capturing the object.
Canon Kabushiki Kaisha
Patent Attorneys for the Applicant
SPRUSON & FERGUSON
[Drawing pages 1/13 to 13/13: text extracted from the figures is not reproducible here; the recoverable figure labels are retained below.]
Fig. 1 (method 100): 110 Train predictor; 120 Operate prediction and camera control.
Fig. 2 (training method 110): 210 Predict possible future positions of an object; 220 Compare predicted possible future positions with known future positions; 230 Update parameters of the neural network layers based on a loss value; 240 Stop training?
Fig. 3 (operation method 120): 210 Predict possible future positions of an object; 310 Determine possible future positions with the highest probability scores; 320 Select camera configurations of the cameras based on the determination of step 310; 325 Adjust camera positions.
Fig. 4 (prediction step 210): 405 Perform preprocessing; 410 Find activation values for all nodes in all layers; 450 Find predictions for all objects.
Fig. 5 (artificial neural network 900): orientation data frames 910; neural network layers 920; predicted future positions and probabilities 930.
Fig. 6 (neural network layers 920): elements 1010, 1020, 1030, 1040, 1050.
Fig. 7: elements 1130, 1135, 1136, 1140, 1150, 1151, 1160, 1170.
Fig. 8: football playing field 1210; players 1201, 1202; ball 1203; elements 1220, 1221, 1222; wide-angle camera 1230; close-up cameras 1231, 1232; pan-tilt units 1240, 1241, 1242; control unit 1250.
Fig. 9 (control unit 1250): wide-angle camera 1230; tracking module 1310; prediction module 1320; camera control module 1330; close-up pan-tilt units 1241, 1242, 1243; close-up cameras 1231, 1232; image content module 1360.
Fig. 10: elements 1405, 1406, 1410, 1415, 1420, 1421, 1430, 1440.
Fig. 11: elements 1420, 1505, 1510, 1515, 1530, 1531, 1532.
Fig. 12A (general-purpose computer system 1600, 1601): processor 1605; memory 1606; audio-video interface 1607; I/O interfaces 1608, 1613; storage devices 1609 including HDD 1610; local network interface 1611; optical disk drive 1612; video display 1614; printer 1615; external modem 1616; wide-area communications network 1620; local-area communications network 1622; keyboard 1602; scanner 1626; camera 1627; disk storage medium 1625; microphone 1680; application program 1633.
Fig. 12B (processor 1605 and memory 1634 detail): instructions 1628, 1629, 1630; data 1635, 1636, 1637; ROM 1649 with POST 1650, BIOS 1651, bootstrap loader 1652 and operating system 1653; input variables 1654; output variables 1661; intermediate variables 1658; interface 1642; control unit 1639; ALU 1640; registers 1644 (instruction), 1645, 1646 (data).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2019200854A AU2019200854A1 (en) | 2019-02-07 | 2019-02-07 | A method and system for controlling a camera by predicting future positions of an object |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2019200854A AU2019200854A1 (en) | 2019-02-07 | 2019-02-07 | A method and system for controlling a camera by predicting future positions of an object |
Publications (1)
Publication Number | Publication Date |
---|---|
AU2019200854A1 true AU2019200854A1 (en) | 2020-08-27 |
Family
ID=72139792
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
AU2019200854A Abandoned AU2019200854A1 (en) | 2019-02-07 | 2019-02-07 | A method and system for controlling a camera by predicting future positions of an object |
Country Status (1)
Country | Link |
---|---|
AU (1) | AU2019200854A1 (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220301007A1 (en) * | 2021-03-16 | 2022-09-22 | International Business Machines Corporation | Electronic display systems |
US11610224B2 (en) * | 2021-03-16 | 2023-03-21 | International Business Machines Corporation | Electronic display systems |
CN113408518A (en) * | 2021-07-06 | 2021-09-17 | 世邦通信股份有限公司 | Audio and video acquisition equipment control method and device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20210365707A1 (en) | Maintaining fixed sizes for target objects in frames | |
US20220027632A1 (en) | Methods and systems for multiplayer tagging using artificial intelligence | |
US11159732B2 (en) | Method, system and apparatus for capturing an image | |
CN108369816A (en) | For the device and method from omnidirectional's video creation video clipping | |
CN102105904A (en) | Detection information registration device, object detection device, electronic device, method for controlling detection information registration device, method for controlling object detection device, program for controlling detection information registration device | |
US10003722B2 (en) | Method and system for mimicking human camera operation | |
CN101010942A (en) | Capturing a sequence of images | |
CN107395957B (en) | Photographing method and device, storage medium and electronic equipment | |
CN111837379A (en) | Camera zone locking | |
WO2021026804A1 (en) | Cradle head-based target following method and apparatus, cradle head and computer storage medium | |
JP7282887B2 (en) | System and method for determining virtual camera path | |
JP7146507B2 (en) | Information processing device and its control method | |
US10250803B2 (en) | Video generating system and method thereof | |
US11496670B2 (en) | Electronic device with display screen capable of reliable detection of a user selected displayed eye region in a scene to be captured, and region selection method | |
CN108182746A (en) | Control system, method and apparatus | |
Chen et al. | Where should cameras look at soccer games: Improving smoothness using the overlapped hidden Markov model | |
AU2019200854A1 (en) | A method and system for controlling a camera by predicting future positions of an object | |
CN112689221A (en) | Recording method, recording device, electronic device and computer readable storage medium | |
KR102203109B1 (en) | Method and apparatus of processing image based on artificial neural network | |
CN102087401A (en) | Auto focusing method, recording medium for recording the method, and auto focusing apparatus | |
KR101645427B1 (en) | Operation method of camera apparatus through user interface | |
KR20170008064A (en) | Method and System for Recoding Video Using Object Recognition | |
CN114245023B (en) | Focusing processing method and device, camera device and storage medium | |
CN115623313A (en) | Image processing method, image processing apparatus, electronic device, and storage medium | |
CN112995503B (en) | Gesture control panoramic image acquisition method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
MK4 | Application lapsed section 142(2)(d) - no continuation fee paid for the application |