AU2013263838A1 - Method, apparatus and system for classifying visual elements - Google Patents

Method, apparatus and system for classifying visual elements

Info

Publication number
AU2013263838A1
Authority
AU
Australia
Prior art keywords
foreground
image
visual
visual elements
visual element
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
AU2013263838A
Inventor
Amit Kumar Gupta
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Canon Inc
Original Assignee
Canon Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Canon Inc filed Critical Canon Inc
Priority to AU2013263838A priority Critical patent/AU2013263838A1/en
Publication of AU2013263838A1 publication Critical patent/AU2013263838A1/en
Abandoned legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

METHOD, APPARATUS AND SYSTEM FOR CLASSIFYING VISUAL ELEMENTS

A method of classifying visual elements in an image is disclosed. Mask information about a plurality of objects detected within each of the visual elements is received from a plurality of detectors, each of the detectors being configured to detect a different type of object. A temporal threshold is determined for each of the visual elements based on the type of the objects detected within each of the visual elements. Each visual element is classified as one of foreground and background, using the determined temporal thresholds.

[Fig. 4, flow diagram: for each object detector d, receive mask d; while unprocessed visual elements remain, select an unprocessed visual element v, determine the age threshold, determine the temporal characteristics (age), classify v as foreground/background, and mark v as processed.]

Description

METHOD, APPARATUS AND SYSTEM FOR CLASSIFYING VISUAL ELEMENTS

TECHNICAL FIELD

The present disclosure relates to object detection in video images and, in particular, to a method, apparatus and system for classifying visual elements in an image into foreground and background. The present disclosure also relates to a computer program product including a computer readable medium having recorded thereon a computer program for classifying visual elements in an image into foreground and background.

BACKGROUND

A video is a sequence of images. The images may also be referred to as "frames". The terms 'frame' and 'image' may be used to refer to a single image in an image sequence, or a single frame of a video image sequence. An image (or frame) is made up of visual elements. Visual elements may be, for example, pixels or blocks of wavelet coefficients. As another example, visual elements may be frequency domain 8x8 DCT (Discrete Cosine Transform) coefficient blocks, as used in JPEG images. As still another example, visual elements may be 32x32 DCT-based integer-transform blocks as used in AVC or H.264 coding.

Images of a video image sequence are captured and recorded by cameras, including network cameras providing live streams, e.g. for surveillance. Automatic video analysis assists in the processing and understanding of a video image sequence, e.g. by triggering an alert when a person enters a zone without authorisation.

One of the processing steps in automatic video analysis is foreground separation. Foreground separation may also be described by its inverse (i.e., background separation). Examples of foreground separation applications include activity detection, unusual object or behaviour detection, and scene analysis.

The concept of "foreground" is semantic and subject to various interpretations. One definition of a "foreground object" is that the object is transient. For example, the object is classified as a foreground object if the object is observed moving in images of a video image sequence. In contrast, in a video image sequence showing a parked car, the car is classified as "background". However, when the car drives away, the car becomes "foreground". A question is whether the parked car should be classified as foreground already before the car drives away. Similar difficulties occur in determining whether swaying trees are classified as foreground or background, and whether water ripples are classified as foreground. In practice, however, users of applications are interested in achieving an objective, such as sounding an intruder alert (ignoring water) or sounding a water overflow alert (ignoring intruders). The definition of "foreground" then becomes: whatever the user is interested in.

In addition, users have contextual knowledge about a scene captured in an image and about the application in which foreground separation operates. The contextual knowledge may be used by human observers to determine which objects in an image are classified as foreground and background. For example, detection of a face in an image makes it likely that the face is associated with a foreground object. Knowledge that the location of the face is a museum and that there are paintings of faces may affect assessment of the image as well.

One foreground separation method models a scene captured in an image by temporally modelling the visual content of a pixel or DCT block in the image.
Pixels in a new input image which are similar in visual content to the scene model at the same location can then be considered background, while dissimilar pixels may be considered foreground. In another method, an input image is segmented into similarly coloured regions, and the difference between the input image and the preceding images of a video image sequence is used for motion segmentation. The two segmentations are then combined to provide a foreground separation of the input image.

Other methods of performing foreground segmentation of an image take higher level semantic information into account. One method detects foreground objects as described above, and then performs a recognition method on a resulting mask. If there is a positive recognition in part of the mask, a corresponding scene model is corrected for corresponding visual elements to ensure that a recognised object is classified as foreground. Another method of performing foreground separation of an image incorporates feedback from an object tracking module. However, such feedback methods are specific to the recognition method being used and are unable to handle multiple varying information sources.

Other methods of performing foreground separation of an image use knowledge about a scene to adjust parameter settings in parts of the scene. However, such methods of performing foreground separation do not incorporate higher level information about an application or objects in the scene.

Thus, a need exists for an improved method of performing foreground separation of an image, to achieve increased accuracy.

SUMMARY

It is an object of the present invention to substantially overcome, or at least ameliorate, one or more disadvantages of existing arrangements.

Disclosed are arrangements which seek to address the above problems by providing a method for performing foreground separation which adapts to available contextual information from multiple varying information sources.

According to one aspect of the present disclosure, there is provided a method of classifying visual elements in an image, said method comprising:
receiving mask information about a plurality of objects detected within each of said visual elements, from a plurality of detectors, each of the detectors being configured to detect a different type of object;
determining a temporal threshold for each of said visual elements based on the type of said objects detected within each of said visual elements; and
classifying each said visual element as one of foreground and background, using said determined temporal thresholds.

According to another aspect of the present disclosure, there is provided an apparatus for classifying visual elements in an image, said apparatus comprising:
means for receiving mask information about a plurality of objects detected within each of said visual elements, from a plurality of detectors, each of the detectors being configured to detect a different type of object;
means for determining a temporal threshold for each of said visual elements based on the type of said objects detected within each of said visual elements; and
means for classifying each said visual element as one of foreground and background, using said determined temporal thresholds.
According to still another aspect of the present disclosure, there is provided a system for classifying visual elements in an image, said system comprising:
a memory for storing data and a computer program;
a processor coupled to the memory for executing the computer program, the computer program comprising instructions for:
receiving mask information about a plurality of objects detected within each of said visual elements, from a plurality of detectors, each of the detectors being configured to detect a different type of object;
determining a temporal threshold for each of said visual elements based on the type of said objects detected within each of said visual elements; and
classifying each said visual element as one of foreground and background, using said determined temporal thresholds.

According to still another aspect of the present disclosure, there is provided a non-transitory computer readable medium having a computer program stored thereon for classifying visual elements in an image, said program comprising:
code for receiving mask information about a plurality of objects detected within each of said visual elements, from a plurality of detectors, each of the detectors being configured to detect a different type of object;
code for determining a temporal threshold for each of said visual elements based on the type of said objects detected within each of said visual elements; and
code for classifying each said visual element as one of foreground and background, using said determined temporal thresholds.

According to still another aspect of the present disclosure, there is provided a method of classifying a visual element in an image, said method comprising:
generating a segmentation of the image by combining a plurality of image segmentations based on accuracy of the image segmentations with respect to foreground/background classification;
determining a neighbourhood relationship for one of said visual elements with respect to other visual elements using the generated segmentation; and
classifying said one visual element as one of foreground and background by combining foreground evidence from the determined neighbourhood relationship with foreground evidence for said one visual element.

According to still another aspect of the present disclosure, there is provided an apparatus for classifying a visual element in an image, said apparatus comprising:
means for generating a segmentation of the image by combining a plurality of image segmentations based on accuracy of the image segmentations with respect to foreground/background classification;
means for determining a neighbourhood relationship for one of said visual elements with respect to other visual elements using the generated segmentation; and
means for classifying said one visual element as one of foreground and background by combining foreground evidence from the determined neighbourhood relationship with foreground evidence for said one visual element.
According to still another aspect of the present disclosure, there is provided a system for classifying a visual element in an image, said system comprising:
a memory for storing data and a computer program;
a processor coupled to the memory for executing the computer program, the computer program comprising instructions for:
generating a segmentation of the image by combining a plurality of image segmentations based on accuracy of the image segmentations with respect to foreground/background classification;
determining a neighbourhood relationship for one of said visual elements with respect to other visual elements using the generated segmentation; and
classifying said one visual element as one of foreground and background by combining foreground evidence from the determined neighbourhood relationship with foreground evidence for said one visual element.

According to still another aspect of the present disclosure, there is provided a non-transitory computer readable medium having a computer program stored thereon for classifying a visual element in an image, said program comprising:
code for generating a segmentation of the image by combining a plurality of image segmentations based on accuracy of the image segmentations with respect to foreground/background classification;
code for determining a neighbourhood relationship for one of said visual elements with respect to other visual elements using the generated segmentation; and
code for classifying said one visual element as one of foreground and background by combining foreground evidence from the determined neighbourhood relationship with foreground evidence for said one visual element.

Other aspects of the invention are also disclosed.

BRIEF DESCRIPTION OF THE DRAWINGS

One or more embodiments of the invention will now be described with reference to the following drawings, in which:
Figs. 1 and 2 collectively form a schematic block diagram representation of a camera system upon which described arrangements can be practiced;
Fig. 3A is a block diagram of an input image;
Fig. 3B is a block diagram of a scene model for the input image of Fig. 3A that includes visual element models, each visual element model including mode models and ages and age thresholds;
Fig. 4 is a schematic flow diagram showing a method of classifying visual elements in an image into foreground and background;
Fig. 5 is a schematic flow diagram showing a method of determining a temporal threshold set for a visual element, as executed in the method of Fig. 4;
Fig. 6 is a schematic flow diagram showing a method of classifying visual elements in an image into foreground and background, using still image segmentation inputs;
Fig. 7A shows an example image;
Fig. 7B shows an example of outline masks generated for the image of Fig. 7A;
Fig. 8A shows an outline detection mask for the image of Fig. 7A, as generated by a human body detector;
Fig. 8B shows an outline detection mask for the image of Fig. 7A, as generated by a car detector;
Fig. 8C shows an outline detection mask for the image of Fig. 7A, as generated by a plant detector;
Fig. 9A shows an example of a human body detection mask for the image of Fig. 7A, as generated by a human body detector, where human bodies are represented by bounding boxes;
Fig. 9B shows an example of a car detection mask for the image of Fig. 7A, as generated by a car detector, where cars are represented by bounding boxes;
Fig. 9C shows an example of a plant detector mask for the image of Fig. 7A, as generated by a plant detector, where plants are represented by bounding boxes;
Fig. 10A shows a set of sixteen visual elements of an input image;
Fig. 10B shows the visual elements of Fig. 10A labeled by a first detector;
Fig. 10C shows the visual elements of Fig. 10A labeled by a second detector;
Fig. 10D shows the visual elements of Fig. 10A grouped into several groups;
Fig. 10E shows a segmentation generated for the input image of Fig. 10A;
Fig. 11A shows an example of a confidence distribution on a human body detection mask determined for the image of Fig. 7A;
Fig. 11B shows an example of a confidence distribution on a car detection mask determined for the image of Fig. 7A; and
Fig. 11C shows an example of a mixture of confidence distributions for detection masks determined over a period of time.

DETAILED DESCRIPTION INCLUDING BEST MODE

Where reference is made in any one or more of the accompanying drawings to steps and/or features that have the same reference numerals, those steps and/or features have for the purposes of this description the same function(s) or operation(s), unless the contrary intention appears.

Conventional pixel based modelling methods described above classify transient objects as foreground objects. However, foreground separation is performed at a pixel or block level rather than at an object level in the pixel based modelling described above. The conventional pixel based modelling foreground separation methods do not incorporate object level knowledge or information and require a generic definition for foreground, under the assumption that clusters of neighbouring foreground form objects. The pixel based modelling methods described above are suitable for deployment where contextual information (e.g. scene specific or application specific information) cannot be provided. However, ignoring contextual information limits the accuracy that can be achieved in conventional pixel based modelling methods.

A computer-implemented method, system, and computer program product for separating an input image into foreground and background is described below. In the foreground/background separation methods described below, visual elements in an image are classified as foreground or background using labels to label the visual elements depending on the classification.

Figs. 1 and 2 collectively form a schematic block diagram of a camera system 101 including embedded components, upon which the foreground/background separation methods to be described are desirably practiced. The camera system 101 may be, for example, a digital camera or a mobile phone, in which processing resources are limited. Nevertheless, the methods to be described may also be performed on higher-level devices such as desktop computers, server computers, and other such devices with significantly larger processing resources.

The camera system 101 is used to capture input images representing visual content of a scene appearing in the field of view (FOV) of the camera system 101. Each image captured by the camera system 101 comprises a plurality of visual elements. A visual element is defined as an image sample. In one arrangement, the visual element is a pixel, such as a Red-Green-Blue (RGB) pixel.
In another arrangement, each visual element comprises a group of pixels (e.g., an image block including a number of pixels). In yet another arrangement, the visual element is an 8 by 8 block of transform coefficients, such as Discrete Cosine Transform (DCT) coefficients as acquired by decoding a motion-JPEG frame, or Discrete Wavelet Transformation (DWT) coefficients as used in the JPEG-2000 standard. The colour model is YUV, where the Y component represents luminance, and the U and V components represent chrominance.

As seen in Fig. 1, the camera system 101 comprises an embedded controller 102. In the present example, the controller 102 has a processing unit (or processor) 105 which is bi-directionally coupled to an internal storage module 109. The storage module 109 may be formed from non-volatile semiconductor read only memory (ROM) 160 and semiconductor random access memory (RAM) 170, as seen in Fig. 2. The RAM 170 may be volatile, non-volatile or a combination of volatile and non-volatile memory.

The camera system 101 includes a display controller 107, which is connected to a display 114, such as a liquid crystal display (LCD) panel or the like. The display controller 107 is configured for displaying graphical images on the display 114 in accordance with instructions received from the controller 102, to which the display controller 107 is connected.

The camera system 101 also includes user input devices 113 which are typically formed by a keypad or like controls. In some implementations, the user input devices 113 may include a touch sensitive panel physically associated with the display 114 to collectively form a touch screen. Such a touch-screen may thus operate as one form of graphical user interface (GUI) as opposed to a prompt or menu driven GUI typically used with keypad-display combinations. Other forms of user input devices may also be used, such as a microphone (not illustrated) for voice commands or a joystick/thumb wheel (not illustrated) for ease of navigation about menus.

As seen in Fig. 1, the camera system 101 also comprises a portable memory interface 106, which is coupled to the processor 105 via a connection 119. The portable memory interface 106 allows a complementary portable memory device 125 to be coupled to the electronic device 101 to act as a source or destination of data or to supplement the internal storage module 109. Examples of such interfaces permit coupling with portable memory devices such as Universal Serial Bus (USB) memory devices, Secure Digital (SD) cards, Personal Computer Memory Card International Association (PCMCIA) cards, optical disks and magnetic disks.

The camera system 101 also has a communications interface 108 to permit coupling of the camera system 101 to a computer or communications network 120 via a connection 121. As seen in Fig. 2, external modules 111A, 111B, and 111C may be connected to the camera system 101 via the network 120. The external modules 111A, 111B, and 111C may be object detectors (e.g. a human body detector, a car detector or a plant detector, as described below). Each of the external modules detects a different type of object and is configured for sending mask information including a size and a location of each detected object to the camera system via the network 120. The connection 121 may be wired or wireless. For example, the connection 121 may be radio frequency or optical. An example of a wired connection includes Ethernet.
Further, an example of a wireless connection includes Bluetooth™ type local interconnection, Wi-Fi (including protocols based on the standards of the IEEE 802.11 family), Infrared Data Association (IrDa) and the like.

Typically, the controller 102, in conjunction with an image sensing device 110, is provided to perform the functions of the camera system 101. The image sensing device 110 may include a lens, a focus control unit and an image sensor. In one arrangement, the sensor is a photo-sensitive sensor array. As another example, the camera system 101 may be a mobile telephone handset. In this instance, the camera system 101 may also comprise those components required for communications in a cellular telephone environment. The camera system 101 may also comprise (not shown) a number of encoders and decoders of a type including Joint Photographic Experts Group (JPEG), Moving Picture Experts Group (MPEG), MPEG-1 Audio Layer 3 (MP3), and the like. The image sensing device 110 captures an input image (e.g., 310) in conjunction with the controller 102.

The methods described below may be implemented using the embedded controller 102, where the processes of Figs. 3A to 11B may be implemented as one or more software application programs 133 executable within the embedded controller 102. The camera system 101 of Fig. 1 implements the described methods. In particular, with reference to Fig. 2, the steps of the described methods are effected by instructions in the software 133 that are carried out within the controller 102. The software instructions may be formed as one or more code modules, each for performing one or more particular tasks. The software may also be divided into two separate parts, in which a first part and the corresponding code modules perform the described methods and a second part and the corresponding code modules manage a user interface between the first part and the user.

The software 133 of the embedded controller 102 is typically stored in the non-volatile ROM 160 of the internal storage module 109. The software 133 stored in the ROM 160 can be updated when required from a computer readable medium. The software 133 can be loaded into and executed by the processor 105. In some instances, the processor 105 may execute software instructions that are located in RAM 170. Software instructions may be loaded into the RAM 170 by the processor 105 initiating a copy of one or more code modules from ROM 160 into RAM 170. Alternatively, the software instructions of one or more code modules may be pre-installed in a non-volatile region of RAM 170 by a manufacturer. After one or more code modules have been located in RAM 170, the processor 105 may execute software instructions of the one or more code modules.

The application program 133 is typically pre-installed and stored in the ROM 160 by a manufacturer, prior to distribution of the electronic device 101. However, in some instances, the application programs 133 may be supplied to the user encoded on one or more CD-ROMs (not shown) and read via the portable memory interface 106 of Fig. 1 prior to storage in the internal storage module 109 or in the portable memory 125. In another alternative, the software application program 133 may be read by the processor 105 from the network 120, or loaded into the controller 102 or the portable storage medium 125 from other computer readable media.
Computer readable storage media refers to any non-transitory tangible storage medium that participates in providing instructions and/or data to the controller 102 for execution and/or processing. Examples of such storage media include floppy disks, magnetic tape, CD-ROM, a hard disk drive, a ROM or integrated circuit, USB memory, a magneto-optical disk, flash memory, or a computer readable card such as a PCMCIA card and the like, whether or not such devices are internal or external of the device 101. Examples of transitory or non-tangible computer readable transmission media that may also participate in the provision of software, application programs, instructions and/or data to the device 101 include radio or infra-red transmission channels as well as a network connection to another computer or networked device, and the Internet or Intranets including e-mail transmissions and information recorded on Websites and the like. A computer readable medium having such software or computer program recorded on it is a computer program product.

The second part of the application programs 133 and the corresponding code modules mentioned above may be executed to implement one or more graphical user interfaces (GUIs) to be rendered or otherwise represented upon the display 114 of Fig. 1. Through manipulation of the user input device 113 (e.g., the keypad), a user of the device 101 and the application programs 133 may manipulate the interface in a functionally adaptable manner to provide controlling commands and/or input to the applications associated with the GUI(s). Other forms of functionally adaptable user interfaces may also be implemented, such as an audio interface utilizing speech prompts output via loudspeakers (not illustrated) and user voice commands input via the microphone (not illustrated).

Fig. 2 illustrates in detail the embedded controller 102 having the processor 105 for executing the application programs 133 and the internal storage 109. The internal storage 109 comprises read only memory (ROM) 160 and random access memory (RAM) 170. The processor 105 is able to execute the application programs 133 stored in one or both of the connected memories 160 and 170. When the electronic device 101 is initially powered up, a system program resident in the ROM 160 is executed. The application program 133 permanently stored in the ROM 160 is sometimes referred to as "firmware". Execution of the firmware by the processor 105 may fulfil various functions, including processor management, memory management, device management, storage management and user interface.

The processor 105 typically includes a number of functional modules including a control unit (CU) 151, an arithmetic logic unit (ALU) 152 and a local or internal memory comprising a set of registers 154 which typically contain atomic data elements 156, 157, along with internal buffer or cache memory 155. One or more internal buses 159 interconnect these functional modules. The processor 105 typically also has one or more interfaces 158 for communicating with external devices via system bus 181, using a connection 161.

The application program 133 includes a sequence of instructions 162 through 163 that may include conditional branch and loop instructions. The program 133 may also include data, which is used in execution of the program 133. This data may be stored as part of the instruction or in a separate location 164 within the ROM 160 or RAM 170.
In general, the processor 105 is given a set of instructions, which are executed therein. This set of instructions may be organised into blocks, which perform specific tasks or handle specific events that occur in the electronic device 101. Typically, the application program 133 waits for events and subsequently executes the block of code associated with that event. Events may be triggered in response to input from a user, via the user input devices 113 of Fig. 1, as detected by the processor 105. Events may also be triggered in response to other sensors and interfaces in the electronic device 101.

The execution of a set of the instructions may require numeric variables to be read and modified. Such numeric variables are stored in the RAM 170. The disclosed methods use input variables 171 that are stored in known locations 172, 173 in the RAM memory 170. The input variables 171 are processed to produce output variables 177 that are stored in known locations 178, 179 in the RAM memory 170. Intermediate variables 174 may be stored in additional memory locations in locations 175, 176 of the RAM memory 170. Alternatively, some intermediate variables may only exist in the registers 154 of the processor 105.

The execution of a sequence of instructions is achieved in the processor 105 by repeated application of a fetch-execute cycle. The control unit 151 of the processor 105 maintains a register called the program counter, which contains the address in ROM 160 or RAM 170 of the next instruction to be executed. At the start of the fetch-execute cycle, the contents of the memory address indexed by the program counter are loaded into the control unit 151. The instruction thus loaded controls the subsequent operation of the processor 105, causing, for example, data to be loaded from ROM memory 160 into processor registers 154, the contents of a register to be arithmetically combined with the contents of another register, the contents of a register to be written to the location stored in another register, and so on. At the end of the fetch-execute cycle the program counter is updated to point to the next instruction in the system program code. Depending on the instruction just executed, this may involve incrementing the address contained in the program counter or loading the program counter with a new address in order to achieve a branch operation.

Each step or sub-process in the processes of the methods described below is associated with one or more segments of the application program 133, and is performed by repeated execution of a fetch-execute cycle in the processor 105 or similar programmatic operation of other independent processor blocks in the electronic device 101.

Fig. 3A shows a schematic representation of an input image 310 that includes a plurality of visual elements (e.g., 320). A visual element is the elementary unit at which processing takes place and is based on capture by an image sensor 100 of the camera system 101. In one arrangement, a visual element is a pixel. In another arrangement, a visual element is an 8x8 pixel DCT block.

Fig. 3B shows a schematic representation of a scene model 330 for the image 310, where the scene model 330 includes a plurality of visual element models. In the example shown in Figs. 3A and 3B, the input image 310 includes visual element 320 and the scene model 330 includes a corresponding example visual element model 340.
In one arrangement, the scene model 330 is stored in the RAM memory 170 of the camera system 101. In one arrangement, the processing of the image 310 is executed by the controller 102 of the camera system 101. In an alternative arrangement, processing of an input image is performed by instructions executing on a processor of a general purpose computer.

A scene model includes a plurality of visual element models. As seen in Figs. 3A and 3B, for each input visual element that is modelled, such as the visual element 320, a corresponding visual element model 340 is maintained in the scene model 330. Each visual element model 340 includes a set of one or more mode models 360-1, 360-2 and 360-3. Several mode models may correspond to the same location in the captured input image 310. Each of the mode models 360-1, 360-2, 360-3 is based on the history of visual appearance of the corresponding visual element 320. The visual element model 340 includes a set of mode models that includes "mode model 1" 360-1, "mode model 2" 360-2, up to "mode model N" 360-3.

Each mode model (e.g., 360-1) corresponds to a different state or appearance of a corresponding visual element (e.g., 340). For example, where a flashing neon light is in the scene being modelled, and mode model 1, 360-1, represents "background - light on", mode model 2, 360-2, may represent "background - light off", and mode model N, 360-3, may represent a temporary foreground element such as part of a passing car.

In one arrangement, a mode model represents visual appearance. For example, the visual appearance of the mode model is a mean value of pixel intensity values. In another arrangement, the visual appearance of the mode model is a median or approximated median of observed DCT coefficient values for each DCT coefficient in the location 340 of the scene model 330, and the mode model records temporal characteristics (e.g., age 365-1 of the mode model 360-1 of Fig. 3B). The age 365-1 of the mode model 360-1 of Fig. 3B refers to the period of time since the mode model 360-1 was generated. Age threshold 367-1 is used to determine whether the age 365-1 is considered old enough for the mode model to be considered as background. If the age 365-1 exceeds the age threshold 367-1, then the mode model 360-1 is a background mode model. If the age 365-1 does not exceed the threshold 367-1, then the mode model 360-1 is a foreground mode model. Other mode models 360-2 and 360-3 in the example of Fig. 3B have ages 365-2 and 365-3, and age thresholds 367-2 and 367-3, respectively. The age threshold (e.g., 367-1) is described below as a temporal threshold.

If the visual appearance of an incoming visual element in the input image (e.g., 310) is similar to one of the mode models, then temporal information about the similar mode model, such as the age of the mode model, may be used to classify a corresponding block of a scene into foreground or background. For example, if an incoming visual element has a similar visual appearance to a very old visual element mode model, then the location of the visual element may be considered to be established background. If an incoming visual element has a similar visual appearance to a young visual element mode model, then the visual element location may be considered to represent at least a portion of a background region or a foreground region, depending on an age threshold value. If the visual appearance of the incoming visual element does not match any known mode model, then the visual information at the location of the mode model has changed and the mode model location may be considered to be a foreground region.
If the visual appearance of the Qf 21-7-70,1 lDfAQ2Afl Cr- i Ae EilArl -16 incoming visual element does not match any known mode model, then the visual information at the location of the mode model has changed and the mode model location may be considered to be a foreground region. In one arrangement, there may be one matched mode model in each visual element 5 model. That is, there may be one mode model matched to a new, incoming visual element. In another arrangement, multiple mode models may be matched at the same time by the same visual element. In one arrangement, at least one mode model matches a visual element model. In another arrangement, it is possible for no mode model to be matched in a visual element model. 10 In one arrangement, a visual element may only be matched to the mode models in a corresponding visual element model. In another arrangement, a visual element is matched to a mode model in a neighbouring visual element model. In yet another arrangement, there are visual element models representing a plurality of visual elements and a mode model in that visual element mode model is matched to any one of the plurality of visual elements, or to a 15 plurality of those visual elements. Different visual elements, descriptors and other attributes perform differently in different video scenarios. Context provides rich information about images of a video sequence. For example, if the position of dynamic backgrounds in an input image is known, a modelling scheme can use pixel-level modelling in image regions with dynamic background and block 20 level modelling in remaining image regions, thus achieving a good balance between computational cost and accuracy. As described below, each of the external modules 111A, 111B, and 11 1C is an image and video processing module which process images, including images of a video image sequence, for purposes different to video foreground separation. For example, an object 25 tracking module may be referred to as an external module. By using an external module (e.g., 111A) which generates video specific information, the performance of a foreground separation method in the camera system 101 can be enhanced for a specific video sequence. By using external modules (e.g., 111A) rather than integrated features, independent advancements in computer vision can be utilized without having to change methods of detection as described 30 below. Using external modules (e.g., 111A) allows for a fast and flexible use of new technology. Qf 21-7-70,1 lDfAQ2Afl Cr- i Ae EilArl -17 There are many external modules which generate information which may be used for performing foreground separation. For example, an object tracking module generates track information for object blobs detected in a current image (or frame) and a prediction of the location of each blob in a next image (or frame). The term "object blob" refers to a group of 5 connected pixels (or blocks). Location prediction is useful in performing foreground separation on an image as prior information about foreground at a predicted location means that foreground is more likely at that location. The prior information may be used to accurately estimate probability of a predicted location being foreground. 
Information from external modules, such as the external modules 111A, 111B and 111C, is categorised as follows:

* Visual element and descriptor details: The visual element and descriptor details category refers to information related to visual element, descriptor and classifier settings for use in an area for foreground and background modelling. The visual element, descriptor and classifier settings information is used to accurately model the background and foreground distributions.

* Priors: There are two "priors" categories, as follows:
  - Foreground/background prior: The foreground/background prior category contains prior probability information about an image area being foreground or background. The foreground/background prior information is used to estimate background and foreground match probabilities of an incoming image (or frame).
  - Foreground/background predictions: The foreground/background predictions category contains foreground or background predictions in a current image. The foreground/background predictions information is used to create expected foreground or background modes and spatial neighbourhood. Foreground predictions are useful to accurately detect foreground, especially in scenarios where foreground and background have similar visual characteristics in a region.

* Spatial neighbourhood: The information in the spatial neighbourhood category indicates the spatial neighbourhood of visual elements. For example, an image segmentation input is given where each segment of the image is expected to have the same foreground/background label. For example, image segmentation modules which implement superpixel methods provide spatial neighbourhood information.

* Semantics: The semantics category refers to information about higher level semantics of a scene. Semantics may include a "region feature" or "blob semantics", as follows:
  - Region feature: A region feature describes a region. For example, a region feature may describe a region as a "shadow". Semantic information such as "a region is currently a shadow" does not indicate whether the region is foreground or background. Region features may be integrated by using a rule to map a feature to an expected foreground/background region (the priors). A rule may be given to link the feature to foreground/background status. For example, a rule may specify that shadow in the left of an image (or frame) is foreground, while shadow in the right of the image is background. After combining the rule with the detected shadow, a region map is generated which comprises probable regions for foreground and background.
  - Blob semantics: Blob semantics information includes high level information about object blobs. External modules such as human body detection modules and face detection modules may generate such high level information about object blobs. The high level information may be used to adapt the impact of the age 365-1 on foreground/background classification and the scene model update process. For example, modes corresponding to a blob marked as a human body use a different age feature formulation than other modes, such as requiring more hits for those modes to become background.

Using information from an external module, such as the external modules 111A, 111B and 111C, to enhance foreground separation of images of a video image sequence can be divided into two steps.
The first step is to identify the information required to enhance the performance in a video scenario. For example, to enhance performance in video scenarios containing ripples, a combination of a pixel-based method (i.e., for the ripple areas) and a block-based method (i.e., in non-ripple areas) may be used. Hence, an external module which detects a ripple area is useful in video images containing ripples.

The second step is to integrate the information from the external module to perform the foreground separation on the image. Information from the external modules may be categorised into a finite number of distinct categories. Multiple inputs of each category may be combined and used in foreground separation using a probabilistic method.

A visual element in an input image is classified as follows. Features are extracted from the visual element. In one arrangement, the first six (6) DCT coefficients of the intensity (Y) channel, and the first DCT coefficients of the chroma channels (U and V), in an 8x8 DCT block are used as features. Then an initial foreground/background classification is determined on the visual element. In one arrangement, a trained Support Vector Machine (SVM) may be used to classify the visual element using the eight (8) features as input.

After initial classification, final classification is performed. In one arrangement, a Conditional Random Field (CRF) may be used to perform the final classification. The CRF for an image specifies a probability distribution over all possible foreground/background labellings for the image, conditioned on the input features for the image. In one arrangement, the input features to the final classification are the results of the initial classifications of the visual elements. The probability P of a foreground/background labelling ŷ (of a set of visual elements from an image) being correct is estimated in accordance with Equation (1), as follows:

    P(Y = ŷ | X = x̂) = (1/Z) exp[L(x̂, ŷ)]    (1)

where

    L(x̂, ŷ) = Σ_{i∈VE} (w₀ f₀(xᵢ, yᵢ) + w₁ f₁(xᵢ, yᵢ)) + Σ_{i∈VE, j∈Nᵢ} (w₂ f₂(xᵢ, yᵢ, xⱼ, yⱼ) + w₃ f₃(xᵢ, yᵢ, xⱼ, yⱼ))

- ŷ is a vector of foreground/background labels, one for each visual element in the image;
- x̂ is the output of the initial classification, one for each visual element in the image;
- Z is a normalisation factor, Z = Σ_{ŷ'} exp[L(x̂, ŷ')];
- w₀ to w₃ are parameters learned via offline training, say w₀ = 0.7863976986575667, w₁ = 2.8361295178435353, w₂ = 1.7146068263152774, w₃ = 0.2853931736848092;
- f₀ to f₃ are scalar-valued feature functions, based on the input image and output labels. The feature functions will be described in detail below;
- VE is the set of visual elements from the input image;
- Nᵢ is the neighbourhood of the i-th visual element; the second summation takes neighbourhood foreground evidence into account to separate the image.
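As an illustration of how Equation (1) can be evaluated, the sketch below scores candidate labellings with the pairwise form of L(x̂, ŷ) given above. It is a simplified, assumption-laden example: the feature functions f₀ to f₃ are not specified at this point in the text, so simple agreement-style placeholders are used, and the neighbourhood Nᵢ is assumed to be the 4-connected grid.

```python
from itertools import product

# Illustrative placeholders for the feature functions f0..f3 (their real
# definitions are deferred in the text): f0/f1 reward agreement between the
# initial score x_i and the label y_i, f2/f3 reward smoothness between neighbours.
def f0(x_i, y_i): return x_i if y_i == 1 else (1.0 - x_i)
def f1(x_i, y_i): return 1.0 - abs(x_i - y_i)
def f2(x_i, y_i, x_j, y_j): return 1.0 if y_i == y_j else 0.0
def f3(x_i, y_i, x_j, y_j): return 1.0 - abs(x_i - x_j) if y_i == y_j else 0.0

W = (0.7863976986575667, 2.8361295178435353,
     1.7146068263152774, 0.2853931736848092)   # w0..w3 from the text

def neighbours(i, width, height):
    """4-connected neighbourhood N_i on a width x height grid of visual elements."""
    r, c = divmod(i, width)
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        if 0 <= r + dr < height and 0 <= c + dc < width:
            yield (r + dr) * width + (c + dc)

def L(x, y, width, height):
    """Scoring function L(x^, y^) from Equation (1)."""
    w0, w1, w2, w3 = W
    unary = sum(w0 * f0(x[i], y[i]) + w1 * f1(x[i], y[i]) for i in range(len(x)))
    pairwise = sum(w2 * f2(x[i], y[i], x[j], y[j]) + w3 * f3(x[i], y[i], x[j], y[j])
                   for i in range(len(x)) for j in neighbours(i, width, height))
    return unary + pairwise

def best_labelling(x, width, height):
    """Exhaustive search over the 2^N labellings (only feasible for tiny N)."""
    return max(product((0, 1), repeat=len(x)), key=lambda y: L(x, y, width, height))

# Example: a 2x2 block of initial per-element scores in [0, 1] (1 = foreground).
print(best_labelling([0.9, 0.8, 0.1, 0.2], width=2, height=2))
```

The exhaustive search makes the cost of evaluating all 2^N labellings explicit, which is why the text goes on to describe sliding window and sampling approximations.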
For example, a VGA sized frame has four thousand eight hundred (4800) DCT blocks corresponding to 4800 visual elements. Determining all permutations online (e.g. at a frame rate of 30 frames per second), requires expensive computing resources. Therefore, in another arrangement, a sliding window method may be used to reduce the computational cost in determining the permutations. The sliding window (e.g., 3 by 3 visual elements), centred at 20 visual element i determines the probabilities of a localised subset ' of f. In the window around visual element i, probabilities of 29 permutations of ' are determined and a value for ' is selected. In one arrangement, the window then slides to a next visual element j which hasn't been the window centre yet, and repeats the computations for the window around j to select a value for '. For the example of the VGA image, 4800 sliding window results are generated. In 25 another arrangement, the window then slides to next visual element j which has not been included in any window yet and repeats the computations for the window around j to select a value for '. For the example of the VGA image, twelve hundred (1200) sliding window results are generated. In a further arrangement, the next visual element j can be selected such that the Qf 21-7-70,1 1DfAQ2Al Cr- i Ae EilAr -21 window of the next visual element j does not overlap a previous window. When starting j at a visual element that has eight (8) neighbours (i.e. not having a position at the border of the image), approximately five hundred and forty (540) sliding windows are generated.In one arrangement, j is constructed by taking the foreground/background classification for each 5 visual element i from '. Where the sliding windows were designed to not be overlapping, the values from the visual elements surrounding visual element i are selected from ' as well. In another arrangement, for visual element i, the values for all sliding windows in which the visual element i was represented (i.e., not just as the centre but also elsewhere in the window) are combined. 10 In one arrangement, the values are added and divided by the number of values to determine a resulting number. The resulting number is then rounded to zero (0) if the resulting number is below 0.5. The resulting number is then rounded to one (1) if the resulting number is equal or greater than 0.5. In another arrangement, a median value is selected. In another arrangement, j is initialised with an initial classification. In one arrangement, 15 the initial classification is random. In another arrangement, the initial classification is determined with function L as described above, but where w2 and w3 are set to zero (0), i.e. the second summation in the equation is not determined. The initial classification is then updated in an iterative fashion until j converges with respect to the probability of f being correct. In one arrangement, Gibbs sampling is used to update fY. 20 A probability distribution is determined for the visual element i conditional on the visual elements in a window around the visual element i. In one arrangement, the window comprises the four (4)-connected neighbours of i. For example, if the initial classification for i is background, and two (2) neighbours have an initial classification for background, and two (2) neighbours have an initial classification for foreground, the probability distribution is 0.6 25 background and 0.4 foreground. 
In another example, the conditional distribution for the visual element i is derived from the part of the L equation that mentions the visual element i, keeping only the summands in the summations that include i. A random sample is then drawn from the probability distribution, and the classification of i is updated according to the sample. A counter counter[i] is kept in 30 RAM 170, for example, for the visual element i, initially set to zero (0). If the classification of i was updated to foreground, counter[i] is increased by one (1). The conditional distribution is Qf 21-7-70,1 lDfAQ2Afl Cr- i Ae EilArl -22 determined for each visual element in the frame. Then another iteration is executed, where the probability distribution for a visual element i may have changed because the visual element or neighbours of the visual element were updated in the previous iteration. In one arrangement, the iterations are stopped when a maximum number of iterations, (e.g., 1000 iterations), has been 5 reached. The final probability of visual element i being foreground is then determined by dividing counter[i] by the number of iterations. A binary classification is determined in accordance with Equation (2), as follows: .f ore ground , if n countertio Tu classification(i) number of iterations > Tcou (2) ,background , otherwise where Tcounter is a threshold value, say 0.5. 10 In another arrangement, the initial classification is updated through local optimisation of the probability (or the L function). For each visual element i, the change in value for P ( = y = x ) is determined by flipping the classification for visual element i (i.e. from foreground to background or from background to foreground). The visual element that results in the biggest improvement in the probability of f being correct is then selected to update f. The 15 updated j is used for the next iteration and the process is repeated. In one arrangement, the iterations are stopped when a maximum number of iterations (e.g., 1000 iterations), has been reached. In another arrangement, the iterations are stopped when the change in the probability of f being correct (or the change in L) is smaller than a predetermined threshold value (e.g., 0.001). In yet another arrangement, the iterations are stopped when the change in the 20 probability of f being correct is negative. In the above arrangements, labelling vector j represents a binary classification. In another arrangement, q represents the probabilities for the label of each of the corresponding visual elements i in f. In one arrangement, the probability for a visual element i with a foreground label in j is set to: counter[i] and the probability for a visual element i number of iterations' 25 with a background label in j is set to: 1 - counter[i] . In another arrangement, the number of iterations foreground probability value of the ith element in j is determined in accordance with Equation (3) as follows: Qf 21-7-70,1 1DfAQ2Al Cr- i Ae EilArl -23 exp {wofo(xiyi) + w 1 f 1 (xi,yi) + (wziz(x, yixjyj) + w 3 f 3 (xbY,x;,yj)) . (3) 5 The unnormalised probability for the individual element, similar to the probability used in the equation for L(x, f), is assigned to the visual elements. Fig. 4 is a flow diagram showing a method 400 of classifying visual elements in an input image as foreground or background. The method 400 separates the image into foreground and background. 
Fig. 4 is a flow diagram showing a method 400 of classifying visual elements in an input image as foreground or background. The method 400 separates the image into foreground and background. The method 400 may be implemented as one or more code modules of the software application program 133 resident in the ROM 160 and be controlled in its execution by the controller 102.

The method 400 will be described by way of an example with reference to an object detector in the form of an external "human body detector" (HBD) module. For example, the external module 111A may be a HBD. The HBD module may be implemented as one or more code modules of the software application program 133 resident on the internal storage module 109 and being controlled in its execution by the processor 105. The HBD module provides masks around locations where the HBD module expects people. For example, Fig. 7A shows the image 310 of Fig. 3A including a scene captured in the image 310. The image 310 is one image of a video image sequence. In the image 310, two (2) people 720 and 730 move through the scene captured in the image 310. The scene captured in the image 310 also contains a plant 740 and a car 725.

Fig. 8A shows an outline detection mask 810 for the image 310. The outline detection mask 810 includes blobs 811 and 812 for the two (2) people 720 and 730 of the image 310, respectively. The outline detection mask 810 may be generated by the human body detector (HBD) external module.

A car detector module provides another outline detection mask 820, as seen in Fig. 8B, as a result of car detection by the car detector module. The mask 820 shows the outline 821 of the car 725. Again, the car detector may be implemented as one or more code modules of the software application program 133 resident on the internal storage module 109 and being controlled in its execution by the processor 105.

A plant detector module provides another outline detection mask 830, as seen in Fig. 8C, showing the outline 831 of the plant 740 as a result of the plant detection by the plant detector module. Again, the plant detector may be implemented as one or more code modules of the software application program 133 resident on the internal storage module 109 and being controlled in its execution by the processor 105.

Thus, in the example of Figs. 8A, 8B and 8C, three types of object detectors are used to generate the outline detection masks 810, 820 and 830. In one arrangement, the human body detector, the car detector and the plant detector provide the masks 810, 820 and 830, respectively. One method of implementing an object detector like a human body detector, a car detector or a plant detector is by training a neural network using multiple example images of an object. In another arrangement, the human body detector, the car detector and the plant detector combine the masks 810, 820 and 830 into a multi-value mask 750, as shown in Fig. 7B. In the example of Fig. 7B, human bodies are assigned label 760, cars are assigned label 780, and plants are assigned label 770. In addition, overlap 765 resulting from a two dimensional (2D) projection of a three dimensional (3D) world gets multiple labels (i.e., human and car).

In one arrangement, as shown in Fig. 9A, the human body detector (HBD) module (e.g., 111A) is configured to provide a detection mask 910 including bounding boxes 911 and 912 as human body masks, or sometimes elliptical masks, rather than the object outline mask 810. The human body mask 911 partially includes visual elements that are not part of the human body 720 shown in the image 310.
Similarly, the car detector generates a detection mask 920 including a bounding box 921 as a car mask for the car 725. The plant detector generates a detection mask 930 including a bounding box 931 as a plant mask for the plant 740. Each of the external modules 111A, 111B and 111C, as a detector, is configured for sending information about a size and a location of each mask in the input image (e.g., 310) to the camera system 101 as "mask information".

The method 400 will be described by way of example with reference to the input image 310 and the scene model 330 of Figs. 3A, 3B and 7A.

The method 400 begins at a receiving step 410, where the controller 102 receives masks corresponding to the input image 310. As described above, the controller 102 also receives mask information about a size and location of each of the masks received at step 410. A mask, mask_d, is received at step 410 for each object detector d (e.g., for each external module 111A, 111B and 111C) being used to generate the masks. However, the method 400 will be described by way of example with reference to the human body masks 911, 912 corresponding to the input image 310, which are generated by the human body detector. The input image 310 and the human body masks 911, 912 are received by the controller 102 at step 410 and may be stored in the RAM 170 by the controller 102.

In one arrangement, the method 400 may receive any of the masks 911, 912, 921 or 931 at step 410, as described below, determined by the human body detector, the car detector or the plant detector, respectively. In such an arrangement, the controller 102 is used for receiving the masks corresponding to a plurality of objects detected within visual elements of the input image 310 from the plurality of object detectors, each of the object detectors being configured to detect a different type of object in the image 310.

Control passes to a decision step 420, where if the controller 102 determines that any visual elements 320 of the input image 310, such as pixels or pixel blocks, are yet to be processed, then control passes from step 420 to selecting step 430. Otherwise, if all visual elements 320 of the input image 310 have been classified, the method 400 concludes.

In the following steps of the method 400, the controller 102 is used for determining a temporal threshold, in the form of an age threshold, for each of the visual elements 320 based on a plurality of age thresholds corresponding to the type of the objects detected within each of the visual elements 320 and a type of each of the masks received at step 410.

At selecting step 430, the controller 102 selects a visual element v (e.g., 320) from the input image 310 for further processing and identifies a spatially corresponding visual element model (e.g., 340) from the scene model 330.

Control then passes to age threshold determination step 440, in which the controller 102 determines the age thresholds 367-1, 367-2 and 367-3 of the mode models 360-1, 360-2 and 360-3 for the visual element v 320 selected at step 430. The age threshold 367-1 is determined from a temporal threshold set T configured within RAM 170. A method 500 of determining a temporal threshold set T, as executed at step 440, will be described in detail below with reference to Fig. 5. The value of the age threshold 367-1 determined at step 440 may be stored by the controller 102 in RAM 170.
Control then passes from step 440 to the temporal characteristics determination step 450, where the controller 102 is used for determining temporal characteristics for the visual element v selected at step 430 using the scene model 330. In particular, at step 450, the controller 102 retrieves temporal characteristics, in the form of the ages 365-1, 365-2 and 365-3 corresponding to each of the mode models 360-1, 360-2 and 360-3 for the visual element v 320 selected at step 430, from RAM 170. As described in detail below, the determined temporal characteristics are used to classify the visual element v selected at step 430 as one of foreground and background.

Control then passes to foreground/background classification step 460, where the controller 102 is used for classifying the visual element v 320 as one of foreground and background by comparing the temporal characteristics (e.g., the age 365-1) with the determined temporal thresholds in the form of the age thresholds. As described in detail below, the visual element v 320 is classified based on the age 365-1 of the visual element v 320 and the human body masks 911, 912 received at step 410. In one arrangement, if the age 365-1 is greater than or equal to the age threshold 367-1, then the visual element v 320 is classified at step 460 as background. Otherwise, the visual element v 320 is classified as foreground.

Following step 460, the process is complete for the visual element v 320, and control passes to processing marking step 470, where the visual element v 320 is marked as being processed using the controller 102. Control then passes to step 420, resulting in either termination of the method 400 or processing, at step 430, of another visual element which is not yet marked as being processed, as described above.

In another arrangement, one of two rules can be selected at step 460 in determining the classification of the visual element v 320, as follows:

Rule (i): Each human body mask 911, 912 is foreground and all visual elements (e.g., pixels) falling within the human body masks 911, 912 are expected to be foreground. Rule (i) is an example of a Prior category. The foreground prior category information may be combined as described above in order to classify visual elements.

Rule (ii): A visual element (e.g., a pixel) falling within the human body masks 911, 912 for a human body (e.g., 720) becomes background depending on an age threshold 367 associated with the visual element. The human body detector (HBD) (e.g., external module 111A) uses an age threshold, a_hbd, to determine whether the visual elements within the human body masks 911, 912 become background when the age of the visual elements exceeds the age threshold 367, a_th = a_hbd. In one arrangement, the value of the age threshold a_hbd is a user input parameter (e.g., 100,000 frames, which represents approximately one (1) hour for a thirty (30) frames per second video image sequence).

Rules (i) and (ii) above may be referred to as "inferencing rules" and may be implemented by configuring a logistic function to convert age information of a mode model into a foreground/background probability. In one arrangement, a logistic function l in accordance with Equation (4), as follows, may be used:

    l(t) = 1 / (1 + e^(-t))                                                                      (4)

where t = γ * x_i.a / x_i.a_th - 10, γ is a scaling constant, x_i.a is the age of the mode model for visual element x_i, and x_i.a_th is the selected age threshold for the mode model in the visual element.
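By way of illustration only, the hard age-threshold rule of step 460 and the logistic conversion of Equation (4) may be sketched in Python as follows. The value chosen for the scaling constant and the function names are illustrative assumptions and do not form part of the described arrangements.

```python
import math

def classify_by_age(age, age_threshold):
    """Hard rule of step 460: an old, stable mode model is treated as background."""
    return 'background' if age >= age_threshold else 'foreground'

def foreground_probability(age, age_threshold, gamma=20.0):
    """Soft rule of Equation (4): logistic conversion of age into a probability.

    l(t) = 1 / (1 + exp(-t)) with t = gamma * age / age_threshold - 10, so the
    probability of being background rises smoothly as the age approaches and exceeds
    the selected age threshold; the foreground probability is 1 - l(t).
    """
    t = gamma * age / age_threshold - 10.0
    background_prob = 1.0 / (1.0 + math.exp(-t))
    return 1.0 - background_prob

# Example: a mode model aged 120,000 frames against the HBD threshold of 100,000 frames.
print(classify_by_age(120000, 100000))                    # 'background'
print(round(foreground_probability(120000, 100000), 3))   # small foreground probability
```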
In one arrangement, the age threshold a_hbd is used as the age threshold a_th (i.e., the age threshold 367 of each mode model 360) for visual elements that are positioned within the HBD masks 911, 912. The age threshold a_car is used as the age threshold a_th for visual elements that are positioned within the car mask 921, and the age threshold a_plt is used as the age threshold a_th for visual elements that are positioned within the plant mask 931. Each of the age thresholds a_hbd, a_car and a_plt is defined based on the type of object, and the thresholds are different from each other. For visual elements outside the masks 911, 912, 921 and 931, a default age threshold a_default of nine thousand (9000) frames (i.e., which equates to five (5) minutes for a thirty (30) frames per second video) may be used in classifying the visual elements.

In one arrangement, the local feature functions f_0 to f_3, as used in the equation for L(x, f) described above, for the human body detector example are:

    f_0(x_i, y_i) = 1 - l(t),                           if y_i = foreground; 0, otherwise       (5)

    f_1(x_i, y_i) = 1 - l(20 * x_i.a / x_i.a_th - 19),  if y_i = background; 0, otherwise       (6)

    f_2(x_i, y_i, x_j, y_j) = l(2 * M(x_i.fv, x_j.fv)),      if y_i = y_j; 0, otherwise         (7)

    f_3(x_i, y_i, x_j, y_j) = 1 - l(2 * M(x_i.fv, x_j.fv)),  if y_i ≠ y_j; 0, otherwise         (8)

where x_i.fv is the feature vector describing the selected mode for visual element x_i, and M(fv_1, fv_2) is a measure of the difference between the visual feature vectors fv_1 and fv_2, where M(fv_1, fv_2) = M(fv_2, fv_1). In one arrangement, M(fv_1, fv_2) sums the weighted absolute differences of the elements (represented by an index into the feature vectors fv_1 and fv_2):

    M(fv_1, fv_2) = sum over index of β_index * |fv_1[index] - fv_2[index]|

In another arrangement, M(fv_1, fv_2) sums the weighted squared differences of the elements in the feature vectors:

    M(fv_1, fv_2) = sum over index of β_index * (fv_1[index] - fv_2[index])^2

In one arrangement, the weight β_index is an equal weight (e.g., 0.125 for a feature vector with eight (8) elements). In another arrangement, the value of the weight β_index is specific to the application (e.g., β_0 = 0.3, β_1 = 0.11, β_2 = 0.11, β_3 = 0.06, β_4 = 0.06, β_5 = 0.06, β_6 = 0.15, β_7 = 0.15).

Functions f_2 and f_3 are measured over a neighbourhood N_i. In one arrangement, N_i is the whole image (i.e., all visual elements (e.g., 320) in the image). However, setting N_i to be the whole image leads to practical computational problems and does not take advantage of local context. That is, a global approach to N_i biases all visual elements in an image towards a majority decision, resulting in less accurate outlines. In another arrangement, N_i has been implemented as the 8-connected neighbours of visual element i. The 8-connected neighbourhood region reflects that nearby visual elements are more likely to have the same classification, as an object is usually represented by multiple adjacent visual elements. In yet another arrangement, N_i is a set of visual elements which share a label provided by an external module, as will be described in more detail below.

In the example above, only the HBD detection mask 910 and the corresponding human body masks 911, 912 are used in classifying the visual elements in accordance with the method 400. However, the car detector detection mask 920 and the plant detector mask 930 may also be used in classifying the visual elements as described above.
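By way of illustration only, the weighted distance measure M between mode-model feature vectors described above may be sketched in Python as follows. The eight-element feature vector and the example weights are taken from the arrangements above; the function and variable names are illustrative assumptions.

```python
def feature_distance(fv1, fv2, weights, squared=False):
    """Weighted distance M(fv1, fv2) between two visual feature vectors.

    Sums either the weighted absolute differences or the weighted squared differences
    of the feature-vector elements; M is symmetric in its two arguments.
    """
    if len(fv1) != len(fv2) or len(fv1) != len(weights):
        raise ValueError('feature vectors and weights must have the same length')
    if squared:
        return sum(b * (a1 - a2) ** 2 for b, a1, a2 in zip(weights, fv1, fv2))
    return sum(b * abs(a1 - a2) for b, a1, a2 in zip(weights, fv1, fv2))

# Equal weights for an eight-element feature vector, or application-specific weights.
equal_weights = [0.125] * 8
app_weights = [0.3, 0.11, 0.11, 0.06, 0.06, 0.06, 0.15, 0.15]
```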
Each different type of detection mask 910, 920 and 930 corresponds to a different age threshold. In one example, the HBD detection mask 910 corresponds to an age threshold of a_th = 9000, the car detector detection mask 920 corresponds to an age threshold of a_th = 5000, and the plant detection mask 930 corresponds to an age threshold of a_th = 500, as cars and plants are more likely to be background than humans.

When each visual element v 320 is either not overlapping with any of the masks 911, 912, 921, 931, or overlapping with exactly one of the masks 911, 912, 931 (e.g., for visual elements corresponding to the plant 740), the method 400 described above is followed. However, where a visual element v 320 corresponds to more than one of the masks 912, 921, 931 (e.g., the visual elements corresponding to the car 725 and the first person 720 in the overlap area 765), the age threshold 367-1 is determined from multiple inputs.

The method 500 of determining a temporal threshold set T for each visual element in the input image 310, as executed at step 440, will now be described with reference to Fig. 5. In accordance with the method 500, the temporal threshold set T is built based on the received masks (e.g., the human body masks 911, 912, the car mask 921 and the plant mask 931). The temporal threshold set T may include a default age threshold. An age threshold (e.g., the age threshold 367-1) is determined from the temporal threshold set T as described below. The method 500 may be implemented as one or more code modules of the software application program 133 resident in the ROM 160 and be controlled in its execution by the controller 102.

The method 500 uses the masks received at step 410 and a visual element v selected for processing in step 430. The method 500 will be described by way of example where, as well as the human body masks 911 and 912, the car mask 921 and the plant mask 931 are also received by the controller 102 at step 410.

The method 500 begins at a temporal threshold set initialisation step 510. At step 510, a default age threshold t_default is added to the temporal threshold set T configured within RAM 170.

Control passes to a decision step 520, where if the controller 102 determines that any of the masks 911, 912, 921 and 931 received at step 410 are yet to be processed, then control passes from step 520 to selecting step 530. Otherwise, control passes to temporal threshold selection step 570.

At selecting step 530, the controller 102 selects a mask, mask_d, generated by an object detector d (e.g., the human body detector, the car detector or the plant detector) for further processing.

Control then passes to overlap checking step 540, which uses the received mask information including a size and a location of each mask. If the visual element v from the input image 310 and mask_d overlap spatially (e.g., a part of the human body mask 911 corresponding to the human body label 760 and the car mask corresponding to the car label 780 overlapping the visual element in region 765 seen in Fig. 7B), control passes to threshold addition step 550 as seen in Fig. 5. Otherwise, control passes to mask processing marking step 560.

At threshold addition step 550, the controller 102 is used to add an age threshold t_d corresponding to the object detector d to the temporal threshold set T for the visual element v configured within the RAM 170 (e.g., for the visual element in region 765, the age thresholds a_hbd and a_car are added to the temporal threshold set T).
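By way of illustration only, the accumulation of age thresholds for one visual element (steps 510 to 550) and the subsequent reduction of the temporal threshold set T to a single age threshold (step 570) may be sketched in Python as follows. The overlap test, data layout and example parameter values are illustrative assumptions and do not form part of the described arrangements.

```python
DEFAULT_AGE_THRESHOLD = 9000   # e.g. five minutes of a 30 frames-per-second video

def build_threshold_set(element_xy, masks, thresholds):
    """Steps 510-550: the default threshold plus one threshold per overlapping mask.

    masks maps a detector name to the set of (x, y) visual-element positions covered
    by that detector's mask; thresholds maps a detector name to its age threshold
    (e.g. a_hbd, a_car, a_plt).
    """
    threshold_set = {'default': DEFAULT_AGE_THRESHOLD}
    for detector, covered in masks.items():
        if element_xy in covered:                    # overlap check of step 540
            threshold_set[detector] = thresholds[detector]
    return threshold_set

def select_age_threshold(threshold_set, select_op=max):
    """Step 570: reduce the temporal threshold set T to a single age threshold a_th."""
    detector_thresholds = [t for name, t in threshold_set.items() if name != 'default']
    if not detector_thresholds:                      # no mask overlaps this element
        return threshold_set['default']
    return select_op(detector_thresholds)            # e.g. max, min or a median operator

# Example: a visual element inside both the human body mask and the car mask (region 765).
masks = {'human': {(3, 4)}, 'car': {(3, 4), (3, 5)}}
a_th = select_age_threshold(
    build_threshold_set((3, 4), masks, {'human': 100000, 'car': 5000}))
```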
Then, control passes to mask processing marking step 560.

At mask processing marking step 560, the controller 102 is used to mark mask_d as processed, and the method 500 iterates by passing control back to step 520. When all masks received at step 410 have been marked as processed, control passes from step 520 to temporal threshold selection step 570.

At temporal threshold selection step 570, the controller 102 is used to determine an age threshold a_th 367 based on the temporal threshold set T configured within RAM 170. For visual elements not overlapped by any mask, the age threshold a_th 367 is set to the default threshold t_default, as the default threshold t_default is the only member of the temporal threshold set T. If the temporal threshold set T has two members (i.e., t_default and t_d for one detector d), t_d is selected, the age threshold a_th 367 is set to t_d, and the default threshold t_default is deleted at step 570.

If the temporal threshold set T has more than two members, a strategy is used at step 570 to select an age threshold a_th 367-1 from the temporal threshold set T. In accordance with such a selection strategy, the default threshold t_default is firstly removed from the set T, as the default threshold t_default is considered overridden. In one arrangement, the human body detector, the car detector and the plant detector are considered equally valuable. In one arrangement, the age threshold a_th 367-1 is selected from the temporal threshold set T at step 570 by using a minimum operator on the set T (i.e., the lowest age threshold is selected). In another arrangement, the age threshold a_th 367-1 is selected at step 570 from the temporal threshold set T by using a maximum operator on the set T (i.e., the highest age threshold is selected). In yet another arrangement, a median or average operator is used on the values in the temporal threshold set T in order to select the age threshold at step 570.

At step 570, an accuracy confidence value, confidence_d, of the output of the object detector d is considered. If an object detector d has a confidence value, confidence_d, lower than a confidence threshold (e.g., a confidence threshold equal to 0.5), the age threshold t_d corresponding to the object detector d is removed from the temporal threshold set T. On the remaining members of the temporal threshold set T, the age threshold which has the maximum confidence value is selected as the age threshold a_th. A maximum operator is applied to the temporal threshold set T to select the age threshold a_th at step 570 in accordance with Equation (9), as follows:

    a_th = t_default,               if T has no member other than t_default
           max over d in T of t_d,  otherwise                                                    (9)

The age threshold corresponding to the object detector with the highest associated confidence value (i.e., confidence_d) that the mask (i.e., the selected mask_d) applies to the visual element v is selected as the age threshold a_th at step 570.

The accuracy confidence value, confidence_d, associated with a mask (e.g., 910) is determined by an object detector d based on previous output of the object detector d. In particular, the accuracy confidence value, confidence_d, associated with a mask (e.g., 910) may be determined by the object detector d a priori or online, based on a previous detection mask 910. In one a priori arrangement, a confidence value is manually set for each detector d (e.g., the human body detector, the car detector and the plant detector). The confidence value setting is provided by a user
(e.g., based on application knowledge or domain knowledge about the detectors and the contents of a scene). For example, if the object detector d is a car detector known to have an accuracy of 70% in detecting cars, the confidence value associated with the car detector is set to 0.7.

In another arrangement, the confidence value associated with an object detector d is determined by analysis of a training set with labelled previous observations about foreground/background classification. For example, if the object detector d was correct for 80% of a training set of labels, the confidence value is set to 0.8. In yet another arrangement, an unlabelled training set is used where an oracle provides the foreground/background classification. Such an oracle is very accurate in classification, but the oracle is too expensive computationally to use permanently and can therefore be used during training only.

In another arrangement, the confidence value associated with an object detector d is determined online. As noted before, an object detector (e.g., the human body detector, the car detector or the plant detector) may be correct in its classification of visual elements falling within a bounding box as foreground. However, if the object is not rectangular, the bounding box does include visual elements classified as background as well, especially near the borders of the bounding box. As such, in one arrangement for determining the confidence value online, a multivariate Gaussian modelling of an aspect ratio and size of the bounding box is performed for each bounding box, resulting in a confidence distribution over an image. For example, Fig. 11A shows a multivariate Gaussian modelling 1120 of an aspect ratio and size of the bounding box 911 determined by the human body detector, used to determine a confidence distribution 1110 over the image 310. Fig. 11B shows a multivariate Gaussian modelling 1140 of an aspect ratio and size of the bounding box 921 determined by the car detector, used to determine a confidence distribution 1130 over the image 310. For each mask overlapping a visual element, a confidence_d is determined, since the confidence_d is not the same for each visual element overlapping the mask. The confidence_d is higher for visual elements near the centre of a bounding box. As different masks may overlap, but generally do not completely occlude one another, different confidence values are determined for each object detector for a visual element.

In one arrangement, previous masks (e.g., 810, 820, 830) determined by each object detector (e.g., the human body detector) for a previous image of a video image sequence are considered to determine the confidence_d. In such an arrangement, for each detected blob in previous images (e.g., the past ten (10) images of a video sequence), the aspect ratio and size of the bounding box for the blob are determined. A bounding box Gaussian distribution, similar to the distribution 1150 of Fig. 11C, of the values of aspect ratio and size for the bounding boxes of the blobs is determined. As blobs corresponding to the same object vary only slightly, each object corresponds to a peak (e.g., 1160, 1170) in the mixture-of-Gaussians bounding box distribution. For multiple objects, there are multiple peaks. If different objects in the real world share a peak, the different objects may be modelled as one object. For each peak, a mean and standard deviation for aspect ratio and size are determined from the bounding box distribution.
A bounding box in mask_d in the current image is then compared to each of the determined means and standard deviations, resulting in a matching value used as the confidence_d. If the highest matching value exceeds a threshold (e.g., 0.5), the determined means and standard deviations corresponding to the highest matching value are used as parameters for the multivariate Gaussian modelling described above. Otherwise, default parameters are used (e.g., the centre of the bounding box as the mean, and 0.25 times the bounding box dimension for the standard deviation in the corresponding direction).

In yet another arrangement for determining the confidence value for an object detector online, the ratio of the size between a previously detected object in a previous image and a currently detected object in the current image is taken into account, making use of temporal continuity in the scene captured by the previous images. If an object detector d processing the previous image determined a bounding box with one hundred (100) visual elements, and sixty (60) of the visual elements had been detected as an object (e.g., a human body) corresponding to a prediction determined by the object detector d in the previous image, the confidence_d value for the object detector d is set to 0.6 for the current image. Similarly, the confidence value may be determined over multiple images of a video image sequence. For example, the confidence value may be determined for the whole sequence of images in the video image sequence or for a window of images (e.g., twenty-five (25) images). In another arrangement, the confidence value may be estimated by fitting a parametric model, such as a multivariate Gaussian model, to previous outputs of the object detector. In this case, the confidence value is estimated by determining the probability of the current output of the object detector under the learned parametric model.

When the temporal threshold selection step 570 has been completed, the method 500 concludes, with all detection masks 910, 920, 930 relating to the visual element 320 having been processed.

As described above, in one arrangement, a neighbourhood relationship N_i is a set of visual elements which share a label provided by an external module. For example, if visual element i is labelled as part of a detected face mask, the neighbourhood relationship N_i is all visual elements that are part of the detected face mask. The neighbourhood relationship N_i is not limited to a rectangular window around the visual element i. The neighbourhood relationship N_i is not even limited to connectivity of the visual elements. For example, in one arrangement where one face is detected in two halves separated by an occlusion, the neighbourhood relationship N_i includes visual elements belonging to both halves. An example of an external module that provides labels is a face locator. Another example of an external module that provides labels is a superpixel segmentor, which divides an image into small segments that are visually coherent and that respect object boundaries. Still another example of an external module that provides labels is a semantic image segmentor. When multiple external modules are provided, conflicts may arise (i.e., provided labels may contradict).

For example, Fig. 10A shows a set of sixteen (16) visual elements of an input image 1010, ve_0,0 to ve_3,3, where each visual element is indexed as ve_row,column.
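By way of illustration only, the online, ratio-based confidence estimate described above may be sketched in Python as follows: the confidence of a detector for the current image is the fraction of the visual elements of its predicted bounding box in the previous image that were confirmed as the object. The data layout and function names are illustrative assumptions.

```python
def online_confidence(predicted_box_elements, confirmed_object_elements):
    """Ratio-based confidence_d for the current image from the previous image's output.

    predicted_box_elements: set of visual-element positions inside the bounding box
    predicted by detector d in the previous image; confirmed_object_elements: the
    positions within that box that were actually confirmed as the detected object.
    """
    if not predicted_box_elements:
        return 0.0
    confirmed = len(confirmed_object_elements & predicted_box_elements)
    return confirmed / len(predicted_box_elements)

# Example from the description: 100 predicted elements, 60 confirmed -> confidence 0.6.
box = {(x, y) for x in range(10) for y in range(10)}
confirmed = {(x, y) for x in range(10) for y in range(6)}
assert online_confidence(box, confirmed) == 0.6
```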
Fig. 10B shows a segmentation 1020 of the input image 1010 by a first segmentor. As described below, the segmentation 1020 may be a super-pixel segmentation. The visual elements of the image 1010 are labelled by the first segmentor as one of two labels 1021, 1022, where the labels 1021 and 1022 are represented by different shading as seen in Fig. 10B.

Fig. 10C shows a segmentation 1030 of the input image 1010 by a second segmentor. As described below, the segmentation 1030 may be a semantic image segmentation. The visual elements of the image 1010 are labelled by the second segmentor as one of two labels 1031, 1032, again where the labels 1031 and 1032 are represented by different shading.

Visual element 1023 is labelled with the label 1021 by the first segmentor and with the label 1031 by the second segmentor. The eight (8)-connected neighbours of visual element 1023 have been labelled with the same label (although different for each segmentor) by both segmentors. Similarly, visual element 1024 and its 4-connected neighbouring visual elements share label 1022 and label 1032. However, visual element 1025 has been labelled differently (compared to the neighbours of the visual element 1025), as seen in Figs. 10B and 10C. The neighbourhood relationship N_i for the visual elements of the image 1010 is therefore ambiguous. As seen in Fig. 10D, visual element group 1041 (ve_0,0, ve_0,1, ve_1,0, ve_1,1, ve_2,0, ve_2,1, ve_3,0, ve_3,1) and visual element group 1042 (ve_0,3, ve_1,2, ve_1,3, ve_2,2, ve_2,3, ve_3,3) are all connected to each other (i.e., the labels of the groups 1041 and 1042 are consistent for both the first and second segmentors). The status of the visual elements ve_0,2 and ve_3,2 needs to be resolved.

Fig. 6 is a flow diagram showing a method 600 of classifying visual elements in an image as foreground or background. The method 600 separates the image into foreground and background. The method 600 generates a single input image segmentation result by combining a plurality of input image segmentation results. The method 600 may be implemented as one or more code modules of the software application program 133 resident in the ROM 160 and be controlled in its execution by the controller 102 as previously described.

The method 600 will be described by way of example with reference to the input image 1010 and the image segmentations 1020, 1030 of Figs. 10B and 10C.

The method 600 begins at a segmentation receiving step 610, where the controller 102 receives input segmentations such as the super-pixel segmentation ("seg1") 1020 and the semantic image segmentation ("seg2") 1030. The segmentations 1020, 1030 may be stored in the RAM 170 by the controller 102.

Control passes to a segmentation combination step 620, where the controller 102 is used for generating a single segmentation ("SEG") 1040 for the input image, as seen in Fig. 10D. The single segmentation ("SEG") 1040 for the input image is determined at step 620 by combining (or integrating) the plurality of received image segmentations 1020, 1030 into the single segmentation ("SEG") 1040 and resolving ambiguity where the segmentations 1020 and 1030 do not agree. As described below, the single segmentation ("SEG") 1040 may be determined based on accuracy of the image segmentations 1020 and 1030 with respect to foreground/background classification. In one arrangement, a defensive strategy is used at step 620 to resolve the ambiguity between the segmentations 1020 and 1030. The neighbourhood relationship is determined through an "intersection" of the labels of the segmentations 1020, 1030. In the example of Figs. 10A to 10D, step 620 produces the two groups of visual elements 1041, 1042, with the visual elements ve_0,2 1045 and ve_3,2 1046 being single-element groups. For example, the neighbourhood relationship of the group of visual elements 1041 is { ve_0,0, ve_0,1, ve_1,0, ve_1,1, ve_2,0, ve_2,1, ve_3,0, ve_3,1 }. The neighbourhood relationship of the group of visual elements 1042 is { ve_0,3, ve_1,2, ve_1,3, ve_2,2, ve_2,3, ve_3,3 }. The neighbourhood relationship of the visual element 1046 is { ve_3,2 }, which means there is no point in computing the second term in L(x, f), as the visual element would be compared with itself.

In another arrangement, an aggressive strategy is used at segmentation combination step 620 to resolve the ambiguity between the segmentations 1020 and 1030 by generating a single segmentation 1050 as seen in Fig. 10E. In such an aggressive arrangement, the neighbourhood relationship is determined through a "union" of the labels of the segmentations 1020, 1030. In the example of Figs. 10A, 10B, 10C and 10E, the aggressive strategy leads to two groups of visual elements 1051, 1052 as seen in Fig. 10E, with the visual elements ve_0,2 1055 and ve_3,2 1056 being members of both groups of visual elements 1051, 1052. For example, the neighbourhood relationship for the group 1051 of visual elements is { ve_0,0, ve_0,1, ve_0,2, ve_1,0, ve_1,1, ve_2,0, ve_2,1, ve_3,0, ve_3,1, ve_3,2 }. The neighbourhood relationship for the group of visual elements 1052 is { ve_0,2, ve_0,3, ve_1,2, ve_1,3, ve_2,2, ve_2,3, ve_3,2, ve_3,3 }. The neighbourhood relationship for the image 1010, comprising all of the sixteen visual elements, is { ve_0,0, ve_0,1, ve_0,2, ve_0,3, ve_1,0, ve_1,1, ve_1,2, ve_1,3, ve_2,0, ve_2,1, ve_2,2, ve_2,3, ve_3,0, ve_3,1, ve_3,2, ve_3,3 }.

In yet another arrangement, both the intersection and union strategies described above for step 620 are used, dependent on an estimation of the accuracy of the segmentations determined by the segmentors providing the segmentations 1020, 1030. In such an arrangement, the neighbourhood relationship is determined through a combination of an intersection and a union of the labels of the segmentations 1020, 1030.

In one arrangement, if the accuracy of both the first and second segmentors exceeds a threshold value (e.g., 0.5), the aggressive strategy (i.e., a union of the labels) described above is used at step 620. Otherwise, the defensive strategy (i.e., an intersection of the labels) is used at step 620. In another arrangement, if the accuracy of the combination of segmentors exceeds a threshold value (e.g., 0.5), the aggressive strategy (i.e., a union of the labels) is used. Otherwise, the defensive strategy (i.e., an intersection of the labels) is used.

In one arrangement, the accuracy of a segmentor is set a priori as a system variable. The accuracy is set both for individual segmentors and for combinations of segmentors (i.e., what the accuracy is in case of disagreement). In another arrangement, segmentor outputs are analysed online by comparing the segmentor segmentations 1020, 1030 with foreground classification results. A number of hits may be counted by increasing a hit counter each time a segment 1021, 1022 with more than one visual element has a consistent foreground/background classification. In one arrangement, a segment 1021 has a consistent foreground/background classification when a percentage, say 80%, of the visual elements in the segment 1021 have the same classification (either foreground or background). The accuracy of a segmentor is then determined as:

    accuracy = number of hits / number of samples
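By way of illustration only, the defensive (intersection) and aggressive (union) strategies of step 620 may be sketched in Python as follows, expressed as the neighbourhood relationship of one visual element given two label maps over the same grid. The representation of a segmentation as a dictionary from element position to label is an illustrative assumption and does not form part of the described arrangements.

```python
def neighbourhood(element, seg1, seg2, aggressive=False):
    """Neighbourhood relationship N_i for one visual element (step 620).

    seg1, seg2: dicts mapping each visual-element position to that segmentor's label.
    Defensive (intersection of labels): neighbours must agree with the element under
    BOTH segmentations, so a disputed element such as ve_0,2 keeps only itself.
    Aggressive (union of labels): neighbours agree under EITHER segmentation, so a
    disputed element belongs to the groups of both of its labels.
    """
    if aggressive:
        return {j for j in seg1
                if seg1[j] == seg1[element] or seg2[j] == seg2[element]}
    return {j for j in seg1
            if seg1[j] == seg1[element] and seg2[j] == seg2[element]}
```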
When the segmentation combination step 620 has created a single segmentation SEG 1040, control passes to initialisation step 630. In one arrangement, at step 630, an initial classification c_v for each visual element v of the single segmentation SEG 1040 is determined with the function L(x, f) as described above, but with w_2 and w_3 set to zero (0) (i.e., the second summation in the equation for L(x, f) is not determined).

Control then passes to unprocessed checking step 640, where the controller 102 checks whether any visual elements of the single segmentation SEG 1040 remain unprocessed (i.e., whether there is a visual element which has only an initial classification but no final classification). If there are no unprocessed visual elements at step 640, then the method 600 concludes. If there is an unprocessed visual element, then the method 600 proceeds to selection step 650.

At selection step 650, an unprocessed visual element v0 is selected by the controller 102. Control then passes to segmentation label selection step 660. At the segmentation label selection step 660, the controller 102 retrieves the label identifier seg_v0, which corresponds to the visual element v0, from the single segmentation SEG. Where the aggressive (union) strategy is used, seg_v0 comprises multiple labels. Control then passes to neighbourhood relationship selection step 670, where the controller 102 selects all visual elements of the single segmentation SEG 1040 having the label identifier seg_v0.

Also at step 670, the controller 102 is used for determining a neighbourhood relationship N_v0 for the visual element selected at step 650 with respect to other visual elements of the image using the generated segmentation SEG 1040. The initial classifications c_v1 to c_vn for the visual elements in the neighbourhood relationship N_v0 are selected as foreground evidence from the neighbourhood relationship N_v0, so that the function L(x, f) can be computed following execution of the method 600.

Control then passes to the foreground/background classification step 680, where the controller 102 is used for classifying the visual element v0 selected at step 650 as one of foreground and background by combining foreground evidence from the determined neighbourhood relationship with foreground evidence for the visual element v0. The controller 102 uses the initial classifications with foreground evidence for c_v0 and c_v1 to c_vn (corresponding to the y variables in the equation) to update the foreground/background classification for the visual element v0 in accordance with Equation (10), as follows:

    classification(v0) = foreground, if exp{ w_0*f_0(x_v0, y_v0) + w_1*f_1(x_v0, y_v0)
                                             + sum over j in N_v0 of [ w_2*f_2(x_v0, y_v0, x_j, y_j)
                                                                       + w_3*f_3(x_v0, y_v0, x_j, y_j) ] } > τ_fg
                         background, otherwise                                                   (10)

where τ_fg is a threshold value (e.g., 0.5).

After the foreground/background classification step 680 is completed, control passes to processed marking step 690, where the controller 102 marks the visual element v0 as processed. Control then returns to unprocessed checking step 640, and eventually the method 600 concludes when all visual elements of the single segmentation SEG 1040 have been processed.

Integration of external modules as described above allows systems requiring foreground/background separation, such as surveillance systems, to achieve increased accuracy. The increased accuracy is achieved using additional information without the need for adjustment of the core algorithms. For example, a ripple detector for detecting dynamic background provides over 30% improvement in accuracy, compared to the same system without external modules, on a data set with dynamic background video images. The described methods allow for dynamic configuration based on the availability of external resources, contextual information such as knowledge about the scene and its contents, the application and the domain.

Industrial Applicability

The arrangements described are applicable to the computer and data processing industries and particularly to the image processing industry.

The foregoing describes only some embodiments of the present invention, and modifications and/or changes can be made thereto without departing from the scope and spirit of the invention, the embodiments being illustrative and not restrictive.

In the context of this specification, the word "comprising" means "including principally but not necessarily solely" or "having" or "including", and not "consisting only of". Variations of the word "comprising", such as "comprise" and "comprises", have correspondingly varied meanings.
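By way of illustration only, the classification of Equation (10) performed at step 680, as described above, may be sketched in Python as follows. The feature functions, weights and threshold are supplied by a caller; their concrete forms (for example, the logistic-based f_0 to f_3 above) are assumptions of the sketch rather than fixed by it.

```python
import math

def classify_element(x_v0, y_v0, neighbours, feature_fns, weights, tau_fg=0.5):
    """Equation (10): combine element evidence with neighbourhood evidence.

    feature_fns = (f0, f1, f2, f3); weights = (w0, w1, w2, w3); neighbours is a list
    of (x_j, y_j) pairs for the visual elements in the neighbourhood N_v0, where each
    y is the initial classification used as foreground evidence.
    """
    f0, f1, f2, f3 = feature_fns
    w0, w1, w2, w3 = weights
    score = w0 * f0(x_v0, y_v0) + w1 * f1(x_v0, y_v0)
    for x_j, y_j in neighbours:
        score += w2 * f2(x_v0, y_v0, x_j, y_j) + w3 * f3(x_v0, y_v0, x_j, y_j)
    return 'foreground' if math.exp(score) > tau_fg else 'background'
```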

Claims (15)

1. A method of classifying visual elements in an image, said method comprising: receiving mask information about a plurality of objects detected within each of said visual elements, from a plurality of detectors, each of the detectors being configured to detect a different type of object; determining a temporal threshold for each of said visual elements based on the type of said objects detected within each of said visual elements; and classifying each said visual element as one of foreground and background, using said determined temporal thresholds.

2. A method according to claim 1, wherein the temporal thresholds are determined based on a determined accuracy confidence.

3. A method according to claim 2, wherein said accuracy confidence is determined for a detector based on previous output of said detector.

4. A method according to claim 2, wherein said accuracy confidence is determined based on a ratio of size between a previously detected object and a currently detected object.

5. A method according to claim 1, wherein the mask information is a size and a location of said detected object.

6. An apparatus for classifying visual elements in an image, said apparatus comprising: means for receiving mask information about a plurality of objects detected within each of said visual elements, from a plurality of detectors, each of the detectors being configured to detect a different type of object; means for determining a temporal threshold for each of said visual elements based on the type of said objects detected within each of said visual elements; and means for classifying each said visual element as one of foreground and background, using said determined temporal thresholds.

7. A system for classifying visual elements in an image, said system comprising: a memory for storing data and a computer program; a processor coupled to the memory for executing the computer program, the computer program comprising instructions for: receiving mask information about a plurality of objects detected within each of said visual elements, from a plurality of detectors, each of the detectors being configured to detect a different type of object; determining a temporal threshold for each of said visual elements based on the type of said objects detected within each of said visual elements; and classifying each said visual element as one of foreground and background, using said determined temporal thresholds.

8. A non-transitory computer readable medium having a computer program stored thereon for classifying visual elements in an image, said program comprising: code for receiving mask information about a plurality of objects detected within each of said visual elements, from a plurality of detectors, each of the detectors being configured to detect a different type of object; code for determining a temporal threshold for each of said visual elements based on the type of said objects detected within each of said visual elements; and code for classifying each said visual element as one of foreground and background, using said determined temporal thresholds.

9. A method of classifying a visual element in an image, said method comprising: generating a segmentation of the image by combining a plurality of image segmentations based on accuracy of the image segmentations with respect to foreground/background classification; determining a neighbourhood relationship for one of said visual elements with respect to other visual elements using the generated segmentation; and classifying said one visual element as one of foreground and background by combining foreground evidence from the determined neighbourhood relationship with foreground evidence for said one visual element.

10. A method according to claim 9, wherein the neighbourhood relationship is determined through an intersection of labels of said plurality of segmentations.

11. A method according to claim 9, wherein the neighbourhood relationship is determined through a union of labels of said plurality of segmentations.

12. A method according to claim 9, wherein the neighbourhood relationship is determined through a combination of an intersection and a union of labels of said plurality of segmentations.

13. An apparatus for classifying a visual element in an image, said apparatus comprising: means for generating a segmentation of the image by combining a plurality of image segmentations based on accuracy of the image segmentations with respect to foreground/background classification; means for determining a neighbourhood relationship for one of said visual elements with respect to other visual elements using the generated segmentation; and means for classifying said one visual element as one of foreground and background by combining foreground evidence from the determined neighbourhood relationship with foreground evidence for said one visual element.

14. A system for classifying a visual element in an image, said system comprising: a memory for storing data and a computer program; a processor coupled to the memory for executing the computer program, the computer program comprising instructions for: generating a segmentation of the image by combining a plurality of image segmentations based on accuracy of the image segmentations with respect to foreground/background classification; determining a neighbourhood relationship for one of said visual elements with respect to other visual elements using the generated segmentation; and classifying said one visual element as one of foreground and background by combining foreground evidence from the determined neighbourhood relationship with foreground evidence for said one visual element.

15. A non-transitory computer readable medium having a computer program stored thereon for classifying a visual element in an image, said program comprising: code for generating a segmentation of the image by combining a plurality of image segmentations based on accuracy of the image segmentations with respect to foreground/background classification; code for determining a neighbourhood relationship for one of said visual elements with respect to other visual elements using the generated segmentation; and code for classifying said one visual element as one of foreground and background by combining foreground evidence from the determined neighbourhood relationship with foreground evidence for said one visual element.

CANON KABUSHIKI KAISHA
Patent Attorneys for the Applicant/Nominated Person
SPRUSON & FERGUSON
AU2013263838A 2013-11-29 2013-11-29 Method, apparatus and system for classifying visual elements Abandoned AU2013263838A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2013263838A AU2013263838A1 (en) 2013-11-29 2013-11-29 Method, apparatus and system for classifying visual elements

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
AU2013263838A AU2013263838A1 (en) 2013-11-29 2013-11-29 Method, apparatus and system for classifying visual elements

Publications (1)

Publication Number Publication Date
AU2013263838A1 true AU2013263838A1 (en) 2015-06-18

Family

ID=53370300

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2013263838A Abandoned AU2013263838A1 (en) 2013-11-29 2013-11-29 Method, apparatus and system for classifying visual elements

Country Status (1)

Country Link
AU (1) AU2013263838A1 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110717933A (en) * 2019-10-10 2020-01-21 北京百度网讯科技有限公司 Post-processing method, device, equipment and medium for moving object missed detection
CN110717933B (en) * 2019-10-10 2023-02-07 阿波罗智能技术(北京)有限公司 Post-processing method, device, equipment and medium for moving object missed detection
CN111080562A (en) * 2019-12-06 2020-04-28 合肥科大智能机器人技术有限公司 Substation suspender identification method based on enhanced image contrast
CN111080562B (en) * 2019-12-06 2022-12-20 合肥科大智能机器人技术有限公司 Substation suspender identification method based on enhanced image contrast
CN111160410A (en) * 2019-12-11 2020-05-15 北京京东乾石科技有限公司 Object detection method and device
CN111160410B (en) * 2019-12-11 2023-08-08 北京京东乾石科技有限公司 Object detection method and device

Similar Documents

Publication Publication Date Title
US10346464B2 (en) Cross-modiality image matching method
Tsakanikas et al. Video surveillance systems-current status and future trends
CN106845487B (en) End-to-end license plate identification method
Shah et al. Video background modeling: recent approaches, issues and our proposed techniques
CN111797653B (en) Image labeling method and device based on high-dimensional image
Sobral et al. A comprehensive review of background subtraction algorithms evaluated with synthetic and real videos
US20180307911A1 (en) Method for the semantic segmentation of an image
KR102459221B1 (en) Electronic apparatus, method for processing image thereof and computer-readable recording medium
Rotaru et al. Color image segmentation in HSI space for automotive applications
US10140521B2 (en) Method, system and apparatus for processing an image
US11443454B2 (en) Method for estimating the pose of a camera in the frame of reference of a three-dimensional scene, device, augmented reality system and computer program therefor
US9129397B2 (en) Human tracking method and apparatus using color histogram
US8553931B2 (en) System and method for adaptively defining a region of interest for motion analysis in digital video
Wang et al. A multi-view learning approach to foreground detection for traffic surveillance applications
US10096117B2 (en) Video segmentation method
US20140056519A1 (en) Method, apparatus and system for segmenting an image in an image sequence
CN112912888A (en) Apparatus and method for identifying video activity
Song et al. Background subtraction based on Gaussian mixture models using color and depth information
Roy et al. A comprehensive survey on computer vision based approaches for moving object detection
AU2013263838A1 (en) Method, apparatus and system for classifying visual elements
US9501723B2 (en) Method of classifying objects in scenes
Choudhury et al. Segmenting foreground objects in a multi-modal background using modified Z-score
Delibasoglu et al. Motion detection in moving camera videos using background modeling and FlowNet
Ahn et al. Implement of an automated unmanned recording system for tracking objects on mobile phones by image processing method
Lee et al. Multiple moving object segmentation using motion orientation histogram in adaptively partitioned blocks for high-resolution video surveillance systems

Legal Events

Date Code Title Description
MK4 Application lapsed section 142(2)(d) - no continuation fee paid for the application