AU2021242795A1 - Touch-free interaction with a self-service station in a transit environment


Info

Publication number
AU2021242795A1
Authority
AU
Australia
Prior art keywords
display screen
processor
station
face
predetermined
Prior art date
Legal status
Pending
Application number
AU2021242795A
Inventor
Aaron Jason Hornlimann
Nicolas Peter Osborne
Current Assignee
Elenium Automation Pty Ltd
Original Assignee
Elenium Automation Pty Ltd
Priority date
Filing date
Publication date
Priority claimed from AU2020900882A0
Application filed by Elenium Automation Pty Ltd filed Critical Elenium Automation Pty Ltd
Publication of AU2021242795A1


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F3/012Head tracking input arrangements
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B64AIRCRAFT; AVIATION; COSMONAUTICS
    • B64FGROUND OR AIRCRAFT-CARRIER-DECK INSTALLATIONS SPECIALLY ADAPTED FOR USE IN CONNECTION WITH AIRCRAFT; DESIGNING, MANUFACTURING, ASSEMBLING, CLEANING, MAINTAINING OR REPAIRING AIRCRAFT, NOT OTHERWISE PROVIDED FOR; HANDLING, TRANSPORTING, TESTING OR INSPECTING AIRCRAFT COMPONENTS, NOT OTHERWISE PROVIDED FOR
    • B64F1/00Ground or aircraft-carrier-deck installations
    • B64F1/36Other airport installations
    • GPHYSICS
    • G02OPTICS
    • G02BOPTICAL ELEMENTS, SYSTEMS OR APPARATUS
    • G02B27/00Optical systems or apparatus not provided for by any of the groups G02B1/00 - G02B26/00, G02B30/00
    • G02B27/0093Optical systems or apparatus not provided for by any of the groups G02B1/00 - G02B26/00, G02B30/00 with means for monitoring data relating to the user, e.g. head-tracking, eye-tracking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F3/013Eye tracking input arrangements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/02Reservations, e.g. for tickets, services or events
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0281Customer communication at a business location, e.g. providing product or service information, consulting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/10Services
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/10Services
    • G06Q50/26Government or public services
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • G06V40/167Detection; Localisation; Normalisation using comparisons between temporally consecutive images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/171Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
    • G06Q50/40
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30196Human being; Person
    • G06T2207/30201Face
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/21Server components or server architectures
    • H04N21/214Specialised server platform, e.g. server located in an airplane, hotel, hospital
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/21Server components or server architectures
    • H04N21/218Source of audio or video content, e.g. local disk arrays
    • H04N21/2187Live feed

Abstract

Embodiments relate generally to systems, methods, and processes that may use touch-free interactions at self-service interaction stations. In particular, embodiments relate to use of such stations in transit environments, such as airports or other transport hubs. Some embodiments relate to a self-service interaction station, having a video image recording device with a field of view in a display direction of a display screen, a processor configured to determine a human face within captured images, to identify and track the movement of a tracking feature of a face identified within the captured images, and to cause the display screen to display a cursor over content displayed on the display screen during an interaction process and to move the cursor in response to movement of the tracking feature to interact with the content.

Description

"Touch-free interaction with a self-service station in a transit environment"
Cross-Reference to Related Applications
[0001] The present application claims priority from Australian Provisional Patent Application No 2020900882 filed on 23 March 2020, the contents of which are incorporated herein by reference in their entirety.
Technical Field
[0002] Embodiments relate generally to systems, methods, and processes that may use touch-free interactions at self-service interaction stations. In particular, embodiments relate to use of such stations in transit environments, such as airports or other transport hubs.
Background
[0003] As air travel becomes more affordable, there are greater numbers of passengers passing through airports in order to reach their destinations. Airlines and airports offer self-service channels in order to improve the customer experience and passenger processing volume capabilities, with customer convenience and more efficient use of space in an increasingly busy airport environment. As a consequence of increased people movement across borders, airports, airlines and immigration departments are acutely aware of the increased potential for transmission of contagious disease or illness to other passengers in an airport or aircraft as well as to other people in the country of travel or destination. The self-service channels often utilise computer-driven devices which require a passenger to physically touch or interact with the screen or other components of the device, thereby creating multiple transmission surfaces upon which a contagious passenger may leave residue of a contagious illness. When a subsequent unrelated passenger comes into contact with that surface, they may pick up the contagion and either become ill or pass it to another surface in the airport or aircraft.
[0004] This represents a significant risk in airport situations due to the nature of a single airport having flights leaving to multiple destinations. In scenarios of contagious illness, this can multiply the pandemic potential as passengers flying to different destinations around the world can all contract a contagion from a single self-service touchpoint, and then transport that contagion around the world.
[0005] It is desired to address or ameliorate one or more shortcomings or disadvantages associated with prior techniques for interaction processes in transit environments such as airports, or to at least provide a useful alternative thereto.
[0006] Throughout this specification the word "comprise", or variations such as "comprises" or "comprising", will be understood to imply the inclusion of a stated element, integer or step, or group of elements, integers or steps, but not the exclusion of any other element, integer or step, or group of elements, integers or steps.
[0007] Any discussion of documents, acts, materials, devices, articles or the like which has been included in the present specification is not to be taken as an admission that any or all of these matters form part of the prior art base or were common general knowledge in the field relevant to the present disclosure as it existed before the priority date of each of the appended claims.
Summary
[0008] Some embodiments relate to a self-service station for touch-free interaction in a transit environment, the station including: a display screen having a display direction; a video image recording device with a field of view in the display direction; a processor to control the display of display images on the display screen and to process live video images recorded by the video image recording device; a memory accessible to the processor and storing executable program code that, when executed by the processor, causes the processor to: determine a human face in the video images, determine whether the face is proximate the display screen, identify a tracking feature of the face in the video images and track movement of the tracking feature in the video images, cause the display screen to initiate an interaction process in response to determining that the face is proximate the display screen, and cause the display screen to display a cursor over content displayed on the display screen during the interaction process and to move the cursor in response to movement of the tracking feature to interact with the content.
[0009] The executable program code, when executed by the processor, may cause the processor to: in response to determining that the cursor is positioned over one of at least one predefined area of the content for at least a predetermined dwell time, record a user selection in relation to a screen object associated with the one predefined area. The predetermined dwell time may be between about 2 seconds and about 5 seconds. The predetermined dwell time may be between about 2 seconds and about 4 seconds or between about 3 seconds and about 4 seconds.
[0010] The executable program code, when executed by the processor, may cause the processor to: cause the display screen to visually emphasise the screen object when the cursor is positioned over the one predefined area.
[0011] The executable program code, when executed by the processor, may cause the processor to: cause the display screen to visibly and/or audibly indicate the recording of the user selection.
[0012] The one predefined area may cover between about 5% and about 35% of a visible area of the display screen.
[0013] The executable program code, when executed by the processor, may cause the processor to: cause the display screen to show a progress indicator that is timed to progressively show elapsing of the predetermined dwell time.
[0014] The executable program code, when executed by the processor, may cause the processor to: before causing the display screen to display the cursor over the content, cause the display screen to display a training task to show that face movement correlates to cursor movement.
[0015] The executable program code, when executed by the processor, may cause the processor to: determine that the face is proximate the display screen when a number of pixels in a video image frame of the face exceeds a predetermined pixel count threshold.
[0016] The executable program code, when executed by the processor, may cause the processor to: apply a machine learning model to determine the face in the video images. The machine learning model may be a deep neural network model based on a single shot multibox detector (SSD) framework, for example.
[0017] The tracking feature may be identified by the processor by analysing an image frame of the face to determine a target pixel area of a fixed contrast window size, wherein the target pixel area has a greatest range of colour contrast among pixel areas of the fixed contrast window size in the image frame, wherein the target pixel area is used as the tracking feature. The fixed contrast window size may be selected to correspond to a face area of between about 1 cm² and about 10 cm².
[0018] The executable program code, when executed by the processor, causes the processor to: apply a scaling factor to movement of the tracking feature in the video images in order to cause the display screen to proportionately move the cursor over the content.
[0019] The executable program code, when executed by the processor, may cause the processor to: determine whether movement of the tracking feature in the video images is less than a predetermined minimum number of pixels over a predetermined time period or is greater than a predetermined maximum number of pixels over the predetermined time period; and increase the scaling factor if the movement is less than the predetermined minimum number of pixels or decrease the scaling factor if the movement is greater than the predetermined maximum number of pixels.
[0020] The executable program code, when executed by the processor, may cause the processor to: cause the display screen to statically display the cursor over the content for a first predetermined wait time after the tracking feature can no longer be identified in the video images.
[0021] The executable program code, when executed by the processor, may cause the processor to: store the tracking feature; and for a second predetermined wait time after the tracking feature can no longer be identified in the video images, attempt to determine the face and the tracking feature in the video images. The first predetermined wait time may be between about 5 seconds and about 10 seconds; the second predetermined wait time may be between about 5 seconds and about 10 seconds.
[0022] The station may further comprise a housing that houses the display screen, the video image recording device, the processor and the memory, wherein the housing holds the display screen and the video image recording device at a height above floor level sufficient to allow a face of a person to be generally within the field of view when the person stands between about 1 meter and about 2.5 meters in front of the station.
[0023] The display screen may be non-responsive to touch.
[0024] Some embodiments relate to a system for touch-free interaction in a transit environment, the system including: multiple ones of the station positioned to allow human interaction at one or more transit facilities; and a server in communication with each of the multiple stations to monitor operation of each of the multiple stations.
[0025] Some embodiments relate to a method of facilitating touch-free interaction in a transit environment, including: determining a human face in video images captured by a video image recording device positioned at a station; determining whether the face is proximate a display screen at the station, the display screen facing the same direction as a field of view of the video image recording device; identifying a tracking feature of the face in the video images and tracking movement of the tracking feature in the video images; causing the display screen to initiate an interaction process in response to determining that the face is proximate the display screen; and causing the display screen to display a cursor over content displayed on the display screen during the interaction process and to move the cursor in response to movement of the tracking feature to interact with the content.
[0026] The method may further include: in response to determining that the cursor is positioned over one of at least one predefined area of the content for at least a predetermined dwell time, recording a user selection in relation to a screen object associated with the one predefined area. The predetermined dwell time may be between about 2 seconds and about 5 seconds.
[0027] The method may further include: causing the display screen to visually emphasise the screen object when the cursor is positioned over the one predefined area.
[0028] The method may further include: causing the display screen to visibly and/or audibly indicate the recording of the user selection.
[0029] The method may further include: causing the display screen to show a progress indicator that is timed to progressively show elapsing of the predetermined dwell time.
[0030] The method may further include: before causing the display screen to display the cursor over the content, causing the display screen to display a training task to show that face movement correlates to cursor movement.
[0031] The method may further include: determining that the face is proximate the display screen when a number of pixels in a video image frame of the face exceeds a predetermined pixel count threshold.
[0032] The method may further include: applying a machine learning model to determine the face in the video images, wherein the machine learning model is a deep neural network model based on a single shot multibox detector (SSD) framework.
[0033] Identifying the tracking feature may include analysing an image frame of the face to determine a target pixel area of a fixed contrast window size, wherein the target pixel area has a greatest range of colour contrast among pixel areas of the fixed contrast window size in the image frame, wherein the target pixel area is used as the tracking feature.
[0034] The fixed contrast window size may be selected to correspond to a face area of between about 1 cm² and about 10 cm².
[0035] The method may further include: applying a scaling factor to movement of the tracking feature in the video images in order to cause the display screen to proportionately move the cursor over the content.
[0036] The method may further include: determining whether movement of the tracking feature in the video images is less than a predetermined minimum number of pixels over a predetermined time period or is greater than a predetermined maximum number of pixels over the predetermined time period; and increasing the scaling factor if the movement is less than the predetermined minimum number of pixels or decreasing the scaling factor if the movement is greater than the predetermined maximum number of pixels.
[0037] The method may further include: causing the display screen to statically display the cursor over the content for a first predetermined wait time after the tracking feature can no longer be identified in the video images.
[0038] The method may further include: storing the tracking feature; and for a second predetermined wait time after the tracking feature can no longer be identified in the video images, attempting to determine the face and the tracking feature in the video images. The first predetermined wait time is between about 5 seconds and about 10 seconds; or the second predetermined wait time is between about 5 seconds and about 10 seconds.
[0039] The transit environment may be an airport or other transit hub, for example.
Brief Description of Drawings
[0040] Figure 1 is a block diagram view of an interaction station system according to some embodiments;
[0041] Figure 2 is a block diagram view of an interaction station network according to some embodiments;
[0042] Figure 3 is a schematic illustration of a user at an interaction station according to some embodiments;
[0043] Figure 4 is an example field of view of a video recording device according to some embodiments;
[0044] Figure 5 is an example user interface display at an interaction station according to some embodiments;
[0045] Figure 6 is a flow chart of the operation of the interaction station according to some embodiments;
[0046] Figure 7 is a flow chart of further aspects of the operation of the interaction station according to some embodiments; and
[0047] Figure 8 is a schematic block diagram of a computer system architecture that can be employed according to some embodiments.
Detailed Description
[0048] Embodiments relate generally to systems, methods, and processes that use touch-free interactions at self-service interaction stations. In particular, embodiments relate to use of such stations in transit environments, such as airports or other transport hubs.
[0049] Referring initially to Figures 1 to 5, a self-service station 101 is described, together with systems of which the self-service station may form a part. In some embodiments, a self-service interaction station 101 is provided to facilitate users conducting interaction processes. Such processes may include check-in processes for impending travel, incoming or outgoing immigration or customs processes, travel or event reservation processes or information querying processes, for example.
[0050] Multiple stations 101 may be connected to a client device 145 and database 155 over a network 140. Each station 101 is configured to identify faces of users 1.1 interacting with the station, through a video image recording device 125. The station 101 is further configured to track, through image processing module 114, the head movement of a user 1.1, in order to interact with the user interface 120 to conduct an interaction process.
[0051] Figure 1 is a block diagram of a system 100 for managing self-service interaction stations, comprising a station 101, a server 150, a database 155 accessible to the server 150, and at least one client device 145. Station 101 is in communication with server 150 and client device 145 over a network 140.
[0052] In the embodiments illustrated by Figure 1, station 101 comprises a controller 102. The controller 102 comprises a processor 105 in communication with a memory 110 and arranged to retrieve data from the memory 110 and execute program code stored within the memory 110. The components of station 101 may be housed in a housing 108. Station 101 may be connected to network 140, and in communication with client device 145, server 150, and database 155.
[0053] Processor 105 may include more than one electronic processing device and additional processing circuitry. For example, processor 105 may include multiple processing chips, a digital signal processor (DSP), analog-to-digital or digital-to-analog conversion circuitry, or other circuitry or processing chips that have processing capability to perform the functions described herein. Processor 105 may execute all processing functions described herein locally on the station 101 or may execute some processing functions locally and outsource other processing functions to another processing system, such as server 150.
[0054] The network 140 may comprise at least a portion of one or more networks having one or more nodes that transmit, receive, forward, generate, buffer, store, route, switch, process, or a combination thereof, etc. one or more messages, packets, signals, some combination thereof, or so forth. The network 140 may include, for example, one or more of: a wireless network, a wired network, an internet, an intranet, a public network, a packet-switched network, a circuit-switched network, an ad hoc network, an infrastructure network, a public-switched telephone network (PSTN), a cable network, a cellular network, a satellite network, a fiber optic network, some combination thereof, or so forth.
[0055] Server 150 may comprise one or more computing devices configured to share data or resources among multiple network devices. Server 150 may comprise a physical server, virtual server, or one or more physical or virtual servers in combination.
[0056] Database 155 may comprise a data store configured to store data from network devices over network 140. Database 155 may comprise a virtual data store in a memory of a computing device, connected to network 140 by server 150.
[0057] Station 101 may further comprise a wireless communication device 115, user interface 120, video image recording device 125, and document printer 130.
[0058] Wireless communication device 115 may comprise a wireless Ethernet interface, SIM card module, Bluetooth connection, or other appropriate wireless adapter allowing wireless communication over network 140. Wireless communication device 115 may be configured to facilitate communication with external devices such as client device 145 and server 150. In some embodiments, a wired communication means is used.
[0059] User interface 120 may comprise a reader device 121, and is configured to allow a user to initiate and interact with an interaction process hosted by the user interface 120. In some embodiments, the interaction process comprises a series of steps allowing a user 1.1 to provide identification details to the station 101 to retrieve booking details and/or undertake a check-in process. The interaction process may comprise a series of steps wherein the user 1.1 provides booking details to the station 101 to identify themselves. The interaction process may take between 1 and 20 minutes, for example.
[0060] Reader device 121 may comprise a barcode scanner, QR code scanner, magnetic strip reader, or other appropriate device arranged to allow a user to scan a document (such as a passport, boarding pass, ticket, or other identification document) at the station 101. In such embodiments, the data read by the reader device 121 may be stored in the memory 110, or transmitted to database 155 through the server 150 over the network 140. In other embodiments, the data read by the reader device 121 may trigger the processor 105 to send a request for information associated with the data over network 140 to the server 150. The server 150 may then retrieve additional data associated with the identification data from database 155 and transmit the additional data over network 140 to the processor 105.
[0061] The user interface 120 may further comprise a display screen 122, configured to allow a user to be shown content during the interaction process. Such content may include a series of actionable items, buttons, information related to a booking, or other appropriate information, in order to conduct the interaction process. Display screen 122 may also depict the location of a moveable cursor on the screen to enable the user to interact with the content.
[0062] Video image recording device 125 may comprise a camera, arranged to capture images of an area from which the user interface 120 is accessible. In some embodiments, video image recording device 125 comprises a digital camera device. The video image recording device 125 may have an image resolution of about 1280x720 pixels (known as 720p) or greater, for example. The display resolution of the display screen 122 may be less than the image resolution of the video image recording device 125 since display resolution is not of particular importance. However, various suitable levels of resolution can be used for display screen 122.
[0063] Document printer 130 may comprise a printer configured to allow for printing user documents as a result of the interaction process. In some embodiments, the document printer 130 prints boarding passes, receipts, or other documentation related to the user or the interaction process.
[0064] The memory 110 may further comprise executable program code that defines a communication module 111, user interface (UI) module 112, and image processing module 114. The memory 110 is arranged to store program code relating to the communication of data from memory 110 over the network 140.
[0065] Communication module 111 may comprise program code, which when executed by the processor 105, implements instructions related to initiating and operating the wireless communication device 115. When initiated by the communication module 111, the wireless communication device 115 may send or receive data over network 140. Communication module 111 may be configured to package and transmit data generated by the UI module 112 and/or retrieved from the memory 110 over network 140 to a client device 145, and/or to server 150. In some embodiments, this transmitted data includes an alert, relating to a person identified by the video image recording device 125. In some embodiments, the alert relates to a status of the interaction process. In some embodiments, the alert relates to data sent to the image processing module 114, indicating that a face has been lost or remains undetected from the video feed from video image recording device 125.
[0066] UI module 112 may comprise program code, which when executed by the processor 105, implements instructions relating to the operation of user interface 120. UI module 112 may be configured to implement instructions related to the position of a user’s 1.1 head within a field of view of the video image recording device 125. In such embodiments, the UI module 112 may receive instructions from the user interface 120 or image processing module 114 about advancing, reverting, or otherwise interacting with stages of an interaction process.
[0067] Image processing module 114 may comprise program code, which when executed by the processor 105, implements instructions relating to the operation of the video image recording device 125. When initiated by the image processing module 114, the video image recording device 125 may activate and transmit a stream of captured video frames to the processor 105.
[0068] In some embodiments, image processing module 114 comprises program code, which when executed by the processor 105, implements instructions configured to allow the module 114 to identify pixel regions in images that correspond to faces within the image frame. Image processing module 114 may further comprise an artificial-intelligence (AI) model 116, trained on facial image frames. AI model 116 may be trained using supervised machine learning in order to accurately provide instructions to image processing module 114 to identify faces in image frames. In some embodiments, face images captured by video image recording device 125 are stored in memory 110, or image processing module 114 for verification by a human operator. The human verification of the stored image frames as containing a face or not may be used to generate the AI model 116. AI model 116 may utilise machine learning algorithms to increase accuracy of face detection by image processing module 114. Image processing module 114 may further comprise feature tracking module 117, comprising program code, which when executed by the processor 105, implements instructions related to identifying a feature on the face of a user in an image frame, and to track the movement of the feature within a field of view across multiple image frames.
[0069] AI model 116 may include a machine learning-based model that includes a deep neural net (DNN)-based model to detect the face area of a person. The DNN may be based on a SSD framework (Single Shot MultiBox Detector), for example. The SSD framework may use a reduced ResNet-10 model ("res10_300x300_ssd_iter_140000"), for example. However, any analogous model used for face detection may be used in addition to the aforementioned model.
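By way of illustration only, the following is a minimal sketch of how an SSD-based face detector of this kind can be loaded and run with the OpenCV DNN module; the file names, the 0.5 confidence threshold and the detect_faces helper are assumptions for the example and are not specified in this disclosure.

```python
# Illustrative sketch only: loading and running a reduced ResNet-10 SSD face
# detector with OpenCV's DNN module. File names, the confidence threshold and
# the helper name are assumptions, not values specified by this disclosure.
import cv2
import numpy as np

net = cv2.dnn.readNetFromCaffe(
    "deploy.prototxt",                          # network definition (assumed file name)
    "res10_300x300_ssd_iter_140000.caffemodel"  # pre-trained weights (assumed file name)
)

def detect_faces(frame, conf_threshold=0.5):
    """Return a list of (x1, y1, x2, y2) face boxes found in a BGR frame."""
    h, w = frame.shape[:2]
    # The SSD model expects a 300x300 input with the usual mean subtraction.
    blob = cv2.dnn.blobFromImage(cv2.resize(frame, (300, 300)), 1.0,
                                 (300, 300), (104.0, 177.0, 123.0))
    net.setInput(blob)
    detections = net.forward()  # shape: (1, 1, N, 7)
    boxes = []
    for i in range(detections.shape[2]):
        confidence = detections[0, 0, i, 2]
        if confidence > conf_threshold:
            box = detections[0, 0, i, 3:7] * np.array([w, h, w, h])
            boxes.append(tuple(box.astype(int)))
    return boxes
```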
[0070] In some embodiments, the station 101 may operate in an AI model 116 training mode in order to generate an accurate model for automatic face detection, developed specifically for an individual station 101. In such embodiments, the individual generation of AI model 116 can accommodate the particular location, positioning, angle, and lighting of the field of view 1.5 of a particular station 101.
[0071] The AI model 116 may be trained on a data set of captured images from the video image recording device 125, a pre-existing data set of images from other sources, or some combination thereof. The AI model 116 may be generated by supervised learning, for example, through manual review of control images wherein the AI model 116 identifies a face or does not identify a face.
[0072] The AI model 116 may use images captured during operation of the video image recording device 125 to continually develop a more accurate model throughout its normal operation.
[0073] AI model 116 may use a single-stage dense face localisation image processing technique, such as the RetinaFace (“RetinaFace: Single-stage Dense Face Localisation in the Wild”, Deng et al, 4 May 2019) facial image processing model, to identify facial features in captured images. The AI model 116 is required to consistently locate faces in images across a wide variety of contexts. For example, the AI model 116 is preferably trained on a dataset that includes multiple faces in an image, occluded faces, blurry images, and images where the colour and rotation make face detection more difficult. The relatively high difficulty of the training dataset means that the trained AI model 116 is robust enough to correctly classify partially occluded faces (such as when a person is wearing a face mask) which is a likely and expected occurrence during normal usage of station 101.
[0074] Figure 2 depicts a block diagram of a self-service station network 200 according to some embodiments. The network 200 comprises an individual self-service station bank or array 210, a separately located self-service station bank or array 215, server 150, database 155, and client device array 220. The individual self-service station array 210 may comprise at least one self-service station 101 individually connected to network 140. In some embodiments, the stations 101 of array 210 are located together at a single installation site, such as an airport check-in area, or an airport customs or immigration area. In other embodiments, the stations 101 of array 210 may be separately located throughout a number of individual sites throughout an airport, or may be located at multiple installation sites, such as a series of airports. In some embodiments, the locations of installation of array 210 comprise self-service facilities including, but not limited to, self-service check-in kiosks, self-service bag drop, automated departure boarding gates, automated immigration entry or exit gates, airline lounge gates, or other appropriate self-service areas, for example.
[0075] The client device array 220 may comprise at least one client device 145 connected individually to network 140. In some embodiments, the array 220 comprises any combination of smartphones, tablet computing devices, personal computers, or other devices capable of sending instructions over network 140 and executing instructions from memory 147.
[0076] Figure 3 depicts a diagram of a user 1.1 interacting with a self-service interaction station 101. The self-service station 101 comprises a housing 108 that in some embodiments includes a solid upstanding cabinet 305 defining internal space to house the components of station 101 described herein. The housing 108 houses the display screen 122, the video image recording device 125, the processor 105 and the memory 110. The housing 108 holds the display screen 122 and the video image recording device 125 at a height above floor level (i.e. a bottom extent of the housing 108) sufficient to allow a face of a person to be generally within the field of view when the person stands between about 1 meter and about 2.5 meters in front of the station. The display screen 122 may be held by the housing 108 so that the bottom edge of the display screen 122 is at a height of between about 1.3 and about 1.6 metres above floor level, for example. The display screen 122 may have a top edge about 0.2 to about 0.5 metres above the bottom edge, for example. A light-receiving aperture of the video image recording device 125 may be positioned at or slightly above the top edge of the display screen, for example.
[0077] In Figure 3, user 1.1 is at least partially within the field of view 1.5 of the video image recording device 125. The video image recording device 125 may be positioned to ensure the field of view 1.5 defines an area substantially facing the direction from which the user interface 120 may be accessed by a user 1.1. In some embodiments, the user 1.1 may be an airline passenger, airline or airport staff, or other individual at an airport requiring self-service interaction or check-in processes. In some embodiments, the user 1.1 may be a train, ship or other transport passenger, staff, or other individual requiring self-service interaction or check-in processes for transport purposes. In some embodiments, the user 1.1 may be an event participant, attendee at a secure facility or other person requiring self-service check-in processes.
[0078] In some embodiments, the field of view 1.5 defines a horizontal range of approximately 1 meter either side of the anticipated position of a user 1.1 (standing at between about 1 metre and about 2.5 metres from the display screen) using the user interface 120. In some embodiments, the field of view 1.5 defines a vertical range of about 0.5 meters above and below the anticipated position (standing at between about 1 metre and about 2.5 metres from the display screen) of a user 1.1 using the interface 120.
[0079] In some embodiments, the field of view 1.5 is substantially centred at an anticipated average height of an adult person who would be accessing the user interface 120. The field of view 1.5 may extend in a horizontal and vertical area to cover other people close to the user 1.1. In some embodiments, other appropriate ranges may be defined. In other embodiments, the field of view 1.5 may be arranged to be substantially centred at the anticipated area of the upper portions of a user 1.1. The upper portions of a user 1.1 are intended to include at least the user’s chest, neck, face, and head. The field of view 1.5 may comprise an area aligned with the facing direction of the display screen 122.
[0080] In other embodiments, the field of view 1.5 may be dynamically altered by the video imaging device 125 to be extended, shrunk or laterally or vertically shifted in accordance with user specified requirements. The user specified requirements may be configurable by an operator to allow individual stations 101 to have an optimised field of view 1.5 depending on their installation position, angle, and lighting.
[0081] Figure 4 depicts an example field of view 400 of the video image recording device 125, with a user 1.1 within the field of view 1.5 frame. Figure 5 depicts an example user interface screen 500 with corresponding cursor 2.26 actions mapped to the movement of the head of user 1.1 in Figure 4.
[0082] During operation, a live continuous feed of video image frames may be sent to image processing module 114 by video image recording device 125. The image processing module 114 may then identify a face within the field of view 1.5 using face-detection algorithms within feature tracking module 117, AI model 116, or other appropriate means. The feature tracking module 117 may then identify an initially selected tracking feature (TF) position 2.9 on the user’s 1.1 face.
[0083] The TF position 2.9 is then mapped by feature tracking module 117 to an initial on-screen cursor position 2.25 on display screen 122. In some embodiments, the image processing module 114 may use AI model 116 in order to identify a region of the user’s 1.1 face within the image frame to track throughout the interaction process. This tracked feature may be a user’s 1.1 eye, nose, glasses, or other suitably high contrast facial region.
[0084] Video image frames captured by the video image recording device 125 may be analysed by the feature tracking module 117 by applying a moving (scanning) contrast window of a fixed contrast window size to the area in the captured image frames identified as the face to select a target pixel area for the tracking feature. The moving contrast window may be moved pixel by pixel (or by blocks of pixels) across the face area in the image while calculating contrast values of the group of pixels currently in the contrast window. In some embodiments, the calculation of the contrast values comprises a binary value determination, associating each pixel with a zero or a one, corresponding to a black value or a light value. In other embodiments, a pixel colour determination is made, with specific RGB values from the contrast window mapped to positions within the contrast window. In other embodiments, other contrast calculation methods may be used.
[0085] The target pixel area is selected to have a greatest range of colour contrast among pixel areas of the fixed contrast window size in the image frame. In such embodiments, the target pixel area is used as the tracking feature. The fixed contrast window size may be selected to correspond to a face area of between about 1 cm² and about 10 cm², for example around 2 cm by 2 cm. In other embodiments, other sizes can be used.
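The following is a minimal sketch of the contrast-window scan just described; the window size, stride and the use of a simple max-min greyscale range as the contrast measure are illustrative assumptions.

```python
# Sketch of the moving contrast-window scan: select, within the detected face
# region, the fixed-size patch with the greatest brightness range and use it
# as the tracking feature. Window size and stride are assumed values.
import numpy as np

def select_tracking_feature(face_region, window=32, stride=4):
    """face_region is a greyscale numpy array cropped to the detected face.
    Returns (x, y, w, h) of the highest-contrast patch."""
    best_range, best_box = -1, None
    h, w = face_region.shape[:2]
    for y in range(0, h - window + 1, stride):
        for x in range(0, w - window + 1, stride):
            patch = face_region[y:y + window, x:x + window]
            # Contrast taken as the spread between darkest and brightest pixel;
            # a binary or RGB-based measure could be substituted.
            contrast = int(patch.max()) - int(patch.min())
            if contrast > best_range:
                best_range, best_box = contrast, (x, y, window, window)
    return best_box
```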
[0086] The user 1.1 may then move their head, which in turn moves the TF along the path of 2.10, resting momentarily on position 2.11. In this example, the TF moves in a negative direction on the x-axis 2.2 and a positive direction on the y-axis 2.3 (as shown in Figure 4). The processor 105 may, when executing the feature tracking module 117 of image processing module 114, determine this movement of the tracking feature across successive captured live image frames and control the display screen to move the on-screen cursor accordingly along the path of the TF by performing corresponding (scaled) x/y-axis movements of the on-screen cursor 2.26. The movement of the on-screen cursor 2.26 on the display screen 122 provides feedback to the user 1.1, and allows them to observe the control they are performing, thereby improving usability. In the example of Figure 4, the TF rests on position 2.11, and the processor 105 controls the display to show the cursor resting on position 2.27 in Figure 5.
[0087] The movement of the TF across live successive captured image frames may be calculated by the feature tracking module 117. In such embodiments, the feature tracking module 117 may identify the location of the TF in a first frame of a live video feed, and scan each successive frame within an anticipated region of the image frame for the TF at a subsequent location. The feature tracking module 117 may be configured to perform this subsequent position determination with a TF confidence match of 90%, 95%, or other appropriate percentage in order to successfully track the TF through the frames of a live video feed, for example.
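A sketch of frame-to-frame tracking of the stored tracking feature using normalised template matching is given below; the 0.9 confidence threshold mirrors the 90% figure mentioned above, while the search margin and function names are assumptions.

```python
# Sketch of frame-to-frame tracking of the stored tracking-feature patch using
# normalised cross-correlation. The 0.9 threshold mirrors the 90% confidence
# figure above; the search margin and names are assumptions.
import cv2

def locate_tracking_feature(frame_gray, tf_patch, last_xy, margin=60, min_conf=0.9):
    """Search an anticipated region around the last known position for the
    tracking feature; return its new (x, y), or None if it is considered lost."""
    lx, ly = last_xy
    ph, pw = tf_patch.shape[:2]
    h, w = frame_gray.shape[:2]
    x0, y0 = max(0, lx - margin), max(0, ly - margin)
    x1, y1 = min(w, lx + pw + margin), min(h, ly + ph + margin)
    search = frame_gray[y0:y1, x0:x1]
    if search.shape[0] < ph or search.shape[1] < pw:
        return None  # search region too small to contain the patch
    result = cv2.matchTemplate(search, tf_patch, cv2.TM_CCOEFF_NORMED)
    _, max_val, _, max_loc = cv2.minMaxLoc(result)
    if max_val < min_conf:
        return None  # confidence too low: treat the feature as lost
    return (x0 + max_loc[0], y0 + max_loc[1])
```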
[0088] The process of dwell and interaction event generation is illustrated in Figures 4 and 5. In this example, the user 1.1 has moved the TF to position 2.11, thereby moving the on-screen cursor to position 2.27 in Figure 5, where it is located over an on-screen button (Button A, 2.30). When the cursor 2.26 comes to rest over a location within the interaction area 2.20 on display 122, the image processing module 114 measures the time for which the cursor remains in that location, in order to determine an action event. A predetermined margin of movement is allowed as the user 1.1 will most likely not hold their head exactly still but will instead be moving slightly all the time. In some embodiments, the margin of movement may comprise a distance threshold. In such embodiments, the distance threshold may be between about 1 cm and about 5 cm. In other embodiments, different threshold amounts may be provided in order to accurately capture user 1.1 intent when interacting with user interface 120. In other embodiments, distance thresholds may be measured in pixel distances between movement actions of a user 1.1 tracked feature between video image frames or may be measured as total pixel distance over a number of frames.
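The dwell logic described above may be sketched as follows; the 3 second dwell time and 40 pixel movement margin are assumed values chosen within the ranges discussed in this disclosure.

```python
# Sketch of dwell-based selection: a selection fires when the cursor stays
# within a small movement margin over a selectable area for the configured
# dwell time. The 3 s dwell and 40 px margin are assumed values.
import math
import time

class DwellDetector:
    def __init__(self, dwell_seconds=3.0, margin_px=40):
        self.dwell_seconds = dwell_seconds
        self.margin_px = margin_px
        self._anchor = None   # cursor position where the current dwell started
        self._start = None    # time the current dwell started

    def update(self, cursor_xy, over_selectable):
        """Call once per frame; returns True when a dwell selection fires."""
        if not over_selectable:
            self._anchor = self._start = None
            return False
        if self._anchor is None:
            self._anchor, self._start = cursor_xy, time.monotonic()
            return False
        dx = cursor_xy[0] - self._anchor[0]
        dy = cursor_xy[1] - self._anchor[1]
        if math.hypot(dx, dy) > self.margin_px:
            # Cursor moved beyond the margin: restart the dwell timer here.
            self._anchor, self._start = cursor_xy, time.monotonic()
            return False
        if (time.monotonic() - self._start) >= self.dwell_seconds:
            self._anchor = self._start = None  # reset so a dwell fires only once
            return True
        return False
```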
[0089] In the example of Figure 4, the user 1.1 then moves the TF by moving their head along path 2.12, which is followed by the on-screen cursor (2.28), before the TF comes to rest on position 2.13 and the on-screen cursor accordingly comes to rest on position 2.29, located over an on-screen button (Button B, 2.31). Given the need for the buttons shown as part of the content on the display screen 122 to be easily accessed by head-controlled movement of the cursor, the buttons may be configured to be relatively larger than normal. For example, one or more of the buttons may take up an area of between about 5% and about 35%, optionally about 10% to about 25%, of the visible area of display screen 122. There may also be at least 25% inactive (i.e. non-selectable) space on the display screen to allow a user to put the cursor in that inactive space while considering which active (selectable) screen option to select.
[0090] The process and relationship of TF to on-screen cursor 2.26 is also described in the flow chart of Figure 6, at items 4.9, 4.10 and 4.11. The process 600 may be event-driven, where the image processing module 114 continuously tracks the TF at step 4.8 and reacts to either the movement event of a user 1.1 at step 4.10 or the dwell event at step 4.13 as each occurs, before returning to the cycle of tracking in order to detect the next event.
[0091] Figure 5 depicts an example user interface screen 500 displayed on display screen 122. In this embodiment, the interface screen displays content that comprises button A 2.30 and button B 2.31 within an interaction area 2.20. In Figure 5, cursor 2.26 is provided to allow a user 1.1 to interact with buttons 2.30 and 2.31 through tracked head movements. In the example of Figure 5, cursor 2.26 may begin at a substantially central position within interaction area 2.20. A detected movement of the user’s 1.1 TF by image processing module 114 may then cause the processor 105 (executing UI module 112) to cause cursor 2.26 to move to a second position 2.27. A subsequent movement of the user’s 1.1 TF along path 2.28 may similarly cause cursor 2.26 to move to a third position 2.29.
[0092] In the example of Figure 5, if the image processing module 114 identifies that the user’s 1.1 TF has substantially stopped moving at positions 2.27 and 2.29 for a length of time, then the module 114 may treat the (relative) lack of movement as a dwell action. In such embodiments, a dwell action at 2.27 or 2.29 causes the image processing module 114 to transmit an interaction command to UI module 112 corresponding to the action of the respective button 2.30, 2.31. In some embodiments, the cursor 2.26 or a button (or other selectable screen content) over which the cursor 2.26 rests may display a dynamic graphical indicator of elapsing of a dwell action within display area 2.20 to communicate to a user 1.1 that a dwell action is occurring (i.e. a dwell time is elapsing). In some embodiments, this graphical indicator may comprise a countdown timer, animated progress indicator, or other dynamic graphical progress indicator.
[0093] The feature tracking module 117 tracks tracking features and the UI module 112 determines dwell time of the cursor. These two modules 117, 112 cooperate to determine the “attention” of the user in order to determine whether the user intends to perform a selection interaction. The attention of the user is determined to be positive when the tracking feature (TF) is currently visible and tracked by the feature tracking module 117. If the tracking feature has moved out of the field-of-view 1.5 of the video image recording device 125 (DVC 125), then the feature tracking module 117 changes the status of attention to negative. The tracking feature may be lost if the person has moved outside the DVC field-of-view for some reason, for example, if they have bent down to open their bag, or have turned around to speak to another person, to the point that the TF is not visible to the DVC 125. Making a determination by feature tracking module 117 regarding a binary attention status can help to avoid accidental activation (selection) of a button or other content. For example, if the on-screen cursor 2.26 has moved to a position over a button, and then the TF cannot be found by feature tracking module 117, then selection of the button should not be performed by feature tracking module 117 even though the dwell time may have elapsed. An interaction will only be performed when the dwell time threshold has been reached and feature tracking module 117 has the TF currently acquired (i.e. status of attention is positive).
[0094] The feature tracking module 117 has the capability to re-acquire the tracking feature when it is temporarily lost, without needing to again begin the process of classifying and isolating a face in the DVC 125 field-of-view 1.5. Temporary loss of the tracking feature may occur when the user with TF attention turns their head or briefly moves out of the DVC field-of-view, causing TF attention to change to negative. At this point, feature tracking module 117 will attempt to re-acquire the TF using the latest tracking feature stored in memory by searching in the DVC field-of-view for a short period of time. This time period should be relatively short, for example in the order of 5-10 seconds, but is configurable to a different time period based on user experience. During the period that the system is attempting to reacquire the TF, the cursor 2.26 may be shown on display screen 122 but will not move since no TF is being tracked. The feature tracking module 117 will not interrupt the passenger processing application (that is conducting the interaction process) during the time of attempting to reacquire the TF. If the TF is reacquired during this time, then the process of on-screen cursor control using the TF movement of the user continues as normal. This allows on-screen cursor control to be readily resumed after the user has momentarily moved in a way that the TF attention is lost, for example if the user were to look away for a moment, without any impact to the current interaction process.
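A sketch of the attention and re-acquisition behaviour is given below; the 8 second re-acquisition window is an assumed value within the 5-10 second range mentioned above, and selections would only be permitted while attention is positive.

```python
# Sketch of the binary attention status and the re-acquisition window. The
# 8 s window is an assumed value within the 5-10 s range mentioned above;
# while attention is negative the cursor is held static and no selection fires.
import time

class AttentionTracker:
    def __init__(self, reacquire_seconds=8.0):
        self.reacquire_seconds = reacquire_seconds
        self.attention = False
        self._lost_at = None

    def update(self, tf_found):
        """Call once per frame with whether the tracking feature was located."""
        if tf_found:
            self.attention = True
            self._lost_at = None
        elif self.attention:
            # Feature just lost: attention goes negative, start the window.
            self.attention = False
            self._lost_at = time.monotonic()
        return self.attention

    def reacquisition_expired(self):
        """True once the window has elapsed without the feature being found,
        at which point face detection and feature selection start over."""
        return (self._lost_at is not None and
                time.monotonic() - self._lost_at > self.reacquire_seconds)
```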
[0095] In some embodiments, the dimensions of a live video image frame received by the image processing module 114 are mapped to the interaction area 2.20. In such embodiments, a mapping ratio may be applied, such that a detected pixel movement of a TF in the frame may correspond to a scaled movement of cursor 2.26 in the interaction area 2.20 on display 122. In some embodiments, a mapping (scaling) ratio of around 1:1, 1:5, 1:8, 1:10, 1:15, 1:20, 1:30, 1:40, 1:50, 1:60, 1:70, 1:80, 1:90, 1:100 (or ratios in between such numbers) or other ratios may be applied between image frame and interaction area 2.20, depending on the resolution difference between the video image recording device and the display screen 122. In some embodiments, movement distance thresholds may be provided in order to prevent stray or accidental movements of the TF by a user from registering as cursor 2.26 movements on interaction area 2.20.
[0096] In some embodiments, the mapping ratio may be adjusted by the feature tracking module 117 based on a scaling factor. In such embodiments, the scaling factor may be incrementally increased (up to an upper limit) if the movement of the TF is less than a predetermined minimum number of pixels over a period of time, or incrementally decreased (down to a lower limit) if the movement of the TF is greater than a predetermined maximum number of pixels over the period of time.
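A minimal sketch of this scaling-factor adjustment follows, treating the scaling factor as a multiplier applied to tracking-feature movement before it drives the cursor (the following paragraph describes the same adjustment in terms of a sensitivity ratio between camera pixels and display pixels); the thresholds, step size, limits and observation window are illustrative assumptions.

```python
# Sketch of the scaling-factor mapping and its periodic adjustment, with the
# factor treated as a multiplier on tracking-feature movement. Thresholds,
# step size, limits and the observation window are assumed values.
class CursorScaler:
    def __init__(self, factor=1.0, min_factor=0.5, max_factor=5.0, step=0.25):
        self.factor = factor
        self.min_factor, self.max_factor, self.step = min_factor, max_factor, step
        self._moved_px = 0.0   # accumulated TF movement (camera pixels)
        self._frames = 0

    def map(self, dx_tf, dy_tf, frames_per_period=60,
            min_px_per_period=30, max_px_per_period=600):
        """Scale a per-frame tracking-feature movement into cursor movement,
        re-evaluating the scaling factor once per observation period."""
        self._moved_px += abs(dx_tf) + abs(dy_tf)
        self._frames += 1
        if self._frames >= frames_per_period:
            if self._moved_px < min_px_per_period:
                # Movement too subtle: increase the factor (up to the limit).
                self.factor = min(self.max_factor, self.factor + self.step)
            elif self._moved_px > max_px_per_period:
                # Movement too exaggerated: decrease the factor (down to the limit).
                self.factor = max(self.min_factor, self.factor - self.step)
            self._moved_px, self._frames = 0.0, 0
        return dx_tf * self.factor, dy_tf * self.factor
```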
[0097] The capability of the feature tracking module 117 to dynamically adjust the sensitivity of the interpolation of physical movement to cursor movement may help to compensate for a person being too subtle or too exaggerated with their head movement. In the case of most modern technology, the pixel resolution of DVC 125 will be greater than the display resolution of the display screen 122. For this reason, there will be a sensitivity ratio applied by feature tracking module 117 to compute pixel movement in the DVC field-of-view relative to the display. The default sensitivity ratio is configurable based on observation of user experience and may be set at one of the example mapping ratios listed above, for example. If the detected TF movements are extremely small, then the user may be standing physically farther from the self-service station 101 and hence their movement will cross fewer pixels in the DVC field-of-view. If the detected TF movements are extremely small over a period of time during which tracking feature attention has not been lost, then the feature tracking module 117 may dynamically increase the sensitivity ratio from the default ratio to a higher level in order to add more weight to the smaller pixel movements in the DVC field-of-view. For example, the default sensitivity ratio may be 30:1 where 30 pixels in the DVC field-of-view is interpolated to 1 pixel of movement by the on-screen cursor. If feature tracking module 117 determines that a dynamic increase of sensitivity ratio is required, then it may increase the sensitivity ratio gradually until the on-screen cursor movement over time fits within an expected range. The feature tracking module 117 may therefore increase the sensitivity ratio from 30:1, to 40:1, then 50:1 and so on (for example), to an upper limit configured in the feature tracking module 117. This adjusted sensitivity ratio will be persisted or further adjusted as necessary until the TF attention is completely lost or the passenger processing application completes the current interaction process.
[0098] Figure 6 depicts a flow chart 600 of the operation of a user interaction process at station 101 according to some embodiments. In this embodiment, at step 4.1, the UI module 112 is initialised by processor 105, and the station 101 remains in a standby operation until a user 1.1 interacts with the station 101. The UI module 112 may implement an application at step 4.1, such as a passenger check-in application, immigration control application, vehicle boarding application, passport management application, or other application.
[0099] At step 4.2, the image processing module 114 is initialised by processor 105 and receives a stream of video image frames from video image recording device 125. The image processing module 114 then processes the image frames at step 4.3, using AI model 116 to identify the face of a user 1.1 within the field of view 1.5. If, at step 4.4, a user 1.1 moves within the field of view 1.5 of the video image recording device 125, and the user is determined to be close enough to the station 101, then the image processing module 114 may isolate the face of the user 1.1 at step 4.5 to subsequently determine a tracking feature (TF) of the face at step 4.6. In some embodiments, the processing that is applied by image processing module 114 seeks to determine both the presence of a face in the image frame, and the size of the detected face in the frame.
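Purely as an illustration of the kind of SSD-based face detection that AI model 116 could perform (see also claim 12), the sketch below uses OpenCV's publicly available res10 SSD face detector; the model file names and the confidence threshold are assumptions rather than details of the embodiments.

    import cv2
    import numpy as np

    # Hypothetical paths to OpenCV's SSD-based res10 face detection model files.
    net = cv2.dnn.readNetFromCaffe("deploy.prototxt",
                                   "res10_300x300_ssd_iter_140000.caffemodel")

    def detect_faces(frame, conf_threshold=0.6):
        """Return bounding boxes (x1, y1, x2, y2) of faces found in a video frame."""
        h, w = frame.shape[:2]
        blob = cv2.dnn.blobFromImage(cv2.resize(frame, (300, 300)), 1.0,
                                     (300, 300), (104.0, 177.0, 123.0))
        net.setInput(blob)
        detections = net.forward()
        boxes = []
        for i in range(detections.shape[2]):
            confidence = detections[0, 0, i, 2]
            if confidence >= conf_threshold:
                box = detections[0, 0, i, 3:7] * np.array([w, h, w, h])
                boxes.append(box.astype(int))
        return boxes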
[0100] The size of the detected face may be analysed by the image processing module 114, which compares the number of pixels of the identified face to a range of known values. In some embodiments, the known values correspond to facial size and proximity models within AI model 116.
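For example, the proximity check could be approximated by a pixel-count comparison of the kind sketched below; the threshold value is a placeholder that would be tuned per installation.

    PROXIMITY_PIXEL_THRESHOLD = 40_000   # illustrative value, tuned per installation

    def is_face_proximate(face_box):
        """Treat the face as close enough when its bounding box covers more
        pixels than a predetermined threshold (cf. claim 10)."""
        x1, y1, x2, y2 = face_box
        face_pixels = max(x2 - x1, 0) * max(y2 - y1, 0)
        return face_pixels >= PROXIMITY_PIXEL_THRESHOLD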
[0101] If a detected face is identified as being within a distance threshold of the station 101, then the image processing module 114 may determine that the face belongs to a user 1.1 intending to initiate an interaction process. In such embodiments, the interaction process application implemented by UI module 112 may be activated to allow user 1.1 to conduct the interaction process.
[0102] At step 4.7, the display screen 122 may display a cursor within the interaction area 2.20. In some embodiments, UI module 112 may cause display screen 122 to display a calibration prompt to the user 1.1. In such embodiments, the user 1.1 is informed that the interaction process is conducted through tracked head movements. The user 1.1 may be prompted to undertake a training task, which may include making a series of example head movements, or to place their head within a specific area of the field of view 1.5 corresponding to a bounding box on the display screen 122. In such embodiments, the specific movements or location of the user’s 1.1 head allow the feature tracking module 117 to identify and track a specific feature of the user’s 1.1 face and/or help the user to learn how to use head movement to move the cursor 2.26 on the display screen 122.
[0103] The TF may be selected based on a series of potential factors including, but not limited to, eye, mouth, and ear location, the presence of eyeglasses, or other appropriate features that are easily trackable. In some embodiments, the feature tracking module 117 identifies an area within the face of high contrast between facial elements. Examples include the eyes, which have a high contrast between the “white” of the eye and the iris or pupil, or the presence of eyeglasses, and so on. In some embodiments, feature tracking module 117 may use edge detection algorithms, models from AI model 116, or other means of identifying appropriate facial elements to track.
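One simple way to select such a high-contrast tracking feature, in the spirit of the fixed contrast window of claim 13, is sketched below; the window size and scan stride are assumptions, and face_gray is assumed to be a 2-D greyscale array (e.g. a NumPy array) holding the isolated face crop.

    def select_tracking_feature(face_gray, win=24, stride=8):
        """Scan a greyscale face crop with a fixed-size window and return the
        (x, y, w, h) of the window whose intensity range (contrast) is greatest."""
        best_box, best_range = None, -1
        h, w = face_gray.shape[:2]
        for y in range(0, h - win + 1, stride):
            for x in range(0, w - win + 1, stride):
                patch = face_gray[y:y + win, x:x + win]
                contrast = int(patch.max()) - int(patch.min())
                if contrast > best_range:
                    best_range, best_box = contrast, (x, y, win, win)
        return best_box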
[0104] Once a user 1.1 has successfully engaged with the interaction process, TF movements are continually tracked at steps 4.8 and 4.9 in order to allow the user 1.1 to move cursor 2.26 and engage with the interaction process via the display screen 122.
[0105] When a user 1.1 moves their head at 4.9, the TF is identified by the feature tracking module 117 as having moved location between a series of video image frames provided to the image processing module 114 by the video image recording device. These displacements of the TF are calculated by the feature tracking module 117 at step 4.10, and are then mapped by the feature tracking module 117 to cursor 2.26 position within interaction area 2.20. At step 4.11, the cursor 2.26 position is updated relative to the mapped TF displacement.
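The displacement-to-cursor mapping of steps 4.10 and 4.11 might be tied together as in the sketch below, which reuses the hypothetical map_tf_delta_to_cursor helper from the earlier sketch; the interface is illustrative only.

    def update_cursor_loop(tf_positions, cursor, area_w, area_h):
        """Consume successive TF positions (x, y) in camera pixels and yield the
        updated cursor position after each frame (steps 4.9-4.11)."""
        prev = None
        for x, y in tf_positions:
            if prev is not None:
                dx, dy = x - prev[0], y - prev[1]
                cursor = map_tf_delta_to_cursor(dx, dy, cursor[0], cursor[1],
                                                area_w, area_h)
            prev = (x, y)
            yield cursor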
[0106] If, at step 4.12, a user’s 1.1 face is detected as remaining stationary, causing the cursor 2.26 to remain in the same area of the interaction area 2.20 for a longer period of time than a configured dwell time, then a dwell time threshold is exceeded at 4.13. In some embodiments, the dwell time may be between about 2 seconds and about 5 seconds, or optionally between about 3 and about 4 seconds, for example. The elapsing of the dwell time may be graphically indicated on display screen 122 through an animated countdown, timer, progress bar, or other dynamic graphical indicator or element, for example.
[0107] If a dwell time threshold is exceeded at 4.13, then a simulated ‘touch’ action occurs on the interaction area 2.20 under the current location of cursor 2.26. At 4.15, the simulated touch action is processed by UI module 112, and an interaction process action occurs relative to the simulated touch action. In some embodiments, this comprises a button press action, UI field selection, interaction process progression or a request to revert to a prior stage of the interaction process. In other embodiments, the simulated touch action may interact with other UI elements, or perform different interaction process actions.
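A dwell-based selection of this kind could be detected along the lines of the following sketch; the dwell time, drift radius and the simulated-touch event name are illustrative assumptions.

    import time

    DWELL_TIME_S = 3.0            # e.g. between about 2 and about 5 seconds
    DWELL_RADIUS_PX = 20          # how far the cursor may drift and still count as dwelling

    class DwellDetector:
        """Trigger a simulated touch when the cursor stays within a small radius
        of its anchor point for longer than the configured dwell time."""
        def __init__(self):
            self.anchor = None
            self.since = None

        def update(self, cursor):
            now = time.monotonic()
            if (self.anchor is None or
                    abs(cursor[0] - self.anchor[0]) > DWELL_RADIUS_PX or
                    abs(cursor[1] - self.anchor[1]) > DWELL_RADIUS_PX):
                self.anchor, self.since = cursor, now      # re-anchor on movement
                return None
            if now - self.since >= DWELL_TIME_S:
                self.anchor, self.since = None, None
                return ("SIMULATED_TOUCH", cursor)         # e.g. dispatch a click event
            return None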
[0108] At step 4.16, the action loop of the process 600 is concluded. This may comprise a loop for a single stage of an interaction process at station 101, or may comprise the entire interaction process. After the interaction process at station 101 is concluded, the process 600 reverts to step 4.1 and waits for a face to enter the field of view 1.5.
[0109] In some embodiments, the completion of the process 600 at step 4.16 may be due to the operation of the interaction process, indicating that a user has completed the process and the machine is ready to accept another user 1.1. In other embodiments, the interaction process is suspended or terminated due to requirements of the interaction application in the UI module 112. In other embodiments, the process may conclude due to a lack of an identified face of a user 1.1 within the field of view 1.5.
[0110] In such embodiments, a user 1.1 may have left the field of view 1.5 before the interaction process has concluded, or the feature tracking module 117 may have lost the position of the TF for a configurable amount of time, indicating that the user 1.1 has been lost by the module 117. In such embodiments, the station 101 may display a warning prompt on display screen 122 notifying the user that their face is no longer detected, and/or that the process may be lost unless the TF or their face is detected again. In some embodiments, the feature tracking module 117 may allow for a configurable detection-loss threshold to prevent unintended movements of the user 1.1 from unintentionally ending the interaction process. This detection-loss threshold may be about 5 seconds or about 10 seconds, for example. In other embodiments, the threshold may be other periods of time.
[0111] In some embodiments, the feature tracking module 117 may cause the display screen 122 to statically display the cursor 2.26 over the interaction area 2.20 for a first predetermined wait time after the tracking feature can no longer be identified in the video images. The first predetermined wait time may be between about 5 seconds and about 10 seconds, for example.
[0112] In some embodiments, the feature tracking module 117 may wait for a second predetermined wait time after the tracking feature can no longer be identified in the video images while attempting to again determine the face and locate the previous (stored) tracking feature or a new tracking feature in the video images. The second predetermined wait time may be between about 5 seconds and about 10 seconds.
[0113] Figure 8 illustrates an example computer system 800 according to some embodiments. In particular embodiments, one or more computer systems 800 perform one or more steps of one or more methods described or illustrated herein. In particular embodiments, one or more computer systems 800 provide functionality described or illustrated herein. In particular embodiments, software running on one or more computer systems 800 performs one or more steps of one or more methods described or illustrated herein or provides functionality described or illustrated herein. Particular embodiments include one or more portions of one or more computer systems 800. Herein, reference to a computer system may encompass a computing device, and vice versa, where appropriate. Moreover, reference to a computer system may encompass one or more computer systems, where appropriate. Controller 102 is an example of computer system 800.
[0114] This disclosure contemplates any suitable number of computer systems 800. This disclosure contemplates computer system 800 taking any suitable physical form. As an example and not by way of limitation, computer system 800 may be an embedded computer system, a system-on-chip (SOC), a single-board computer system (SBC) (such as, for example, a computer-on-module (COM) or system-on-module (SOM)), a special-purpose computing device, a desktop computer system, a laptop or notebook computer system, a mobile telephone, a server, a tablet computer system, or a combination of two or more of these. Where appropriate, computer system 800 may: include one or more computer systems 800; be unitary or distributed; span multiple locations; span multiple machines; span multiple data centers; or reside partly or wholly in a computing cloud, which may include one or more cloud computing components in one or more networks. Where appropriate, one or more computer systems 800 may perform without substantial spatial or temporal limitation one or more steps of one or more methods described or illustrated herein. As an example and not by way of limitation, one or more computer systems 800 may perform in real time or in batch mode one or more steps of one or more methods described or illustrated herein. One or more computer systems 800 may perform at different times or at different locations one or more steps of one or more methods described or illustrated herein, where appropriate.
[0115] In particular embodiments, computer system 800 includes at least one processor 810, memory 815, storage 820, an input/output (I/O) interface 825, a communication interface 830, and a bus 835. Processor 105 is an example of processor 810. Memory 110 is an example of memory 815. Memory 110 may also be an example of storage 820. Although this disclosure describes and illustrates a particular computer system having a particular number of particular components in a particular arrangement, this disclosure contemplates any suitable computer system having any suitable number of any suitable components in any suitable arrangement.
[0116] In particular embodiments, processor 810 includes hardware for executing instructions, such as those making up a computer program. As an example and not by way of limitation, to execute instructions, processor 810 may retrieve (or fetch) the instructions from an internal register, an internal cache, memory 815, or storage 820; decode and execute them; and then write one or more results to an internal register, an internal cache, memory 815, or storage 820. In particular embodiments, processor 810 may include one or more internal caches for data, instructions, or addresses. This disclosure contemplates processor 810 including any suitable number of any suitable internal caches, where appropriate. As an example and not by way of limitation, processor 810 may include one or more instruction caches, one or more data caches, and one or more translation lookaside buffers (TLBs). Instructions in the instruction caches may be copies of instructions in memory 815 or storage 820, and the instruction caches may speed up retrieval of those instructions by processor 810. Data in the data caches may be copies of data in memory 815 or storage 820 for instructions executing at processor 810 to operate on; the results of previous instructions executed at processor 810 for access by subsequent instructions executing at processor 810 or for writing to memory 815 or storage 820; or other suitable data. The data caches may speed up read or write operations by processor 810. The TLBs may speed up virtual-address translation for processor 810. In particular embodiments, processor 810 may include one or more internal registers for data, instructions, or addresses. This disclosure contemplates processor 810 including any suitable number of any suitable internal registers, where appropriate. Where appropriate, processor 810 may include one or more arithmetic logic units (ALUs); be a multi-core processor; or include one or more processors 810. Although this disclosure describes and illustrates a particular processor, this disclosure contemplates any suitable processor.
[0117] In particular embodiments, memory 815 includes main memory for storing instructions for processor 810 to execute or data for processor 810 to operate on. As an example and not by way of limitation, computer system 800 may load instructions from storage 820 or another source (such as, for example, another computer system 800) to memory 815. Processor 810 may then load the instructions from memory 815 to an internal register or internal cache. To execute the instructions, processor 810 may retrieve the instructions from the internal register or internal cache and decode them. During or after execution of the instructions, processor 810 may write one or more results (which may be intermediate or final results) to the internal register or internal cache. Processor 810 may then write one or more of those results to memory 815. In particular embodiments, processor 810 executes only instructions in one or more internal registers or internal caches or in memory 815 (as opposed to storage 820 or elsewhere) and operates only on data in one or more internal registers or internal caches or in memory 815 (as opposed to storage 820 or elsewhere). One or more memory buses (which may each include an address bus and a data bus) may couple processor 810 to memory 815. Bus 835 may include one or more memory buses, as described below. In particular embodiments, one or more memory management units (MMUs) reside between processor 810 and memory 815 and facilitate accesses to memory 815 requested by processor 810. In particular embodiments, memory 815 includes random access memory (RAM). This RAM may be volatile memory, where appropriate. Where appropriate, this RAM may be dynamic RAM (DRAM) or static RAM (SRAM). Moreover, where appropriate, this RAM may be single-ported or multi-ported RAM. This disclosure contemplates any suitable RAM. Memory 815 may include one or more memories 815, where appropriate. Although this disclosure describes and illustrates particular memory, this disclosure contemplates any suitable memory.
[0118] In particular embodiments, storage 820 includes mass storage for data or instructions. As an example and not by way of limitation, storage 820 may include a hard disk drive (HDD), a floppy disk drive, flash memory, an optical disc, a magneto-optical disc, magnetic tape, or a Universal Serial Bus (USB) drive or a combination of two or more of these. Storage 820 may include removable or non-removable (or fixed) media, where appropriate. Storage 820 may be internal or external to computer system 800, where appropriate. In particular embodiments, storage 820 is non-volatile, solid-state memory. In particular embodiments, storage 820 includes read-only memory (ROM). Where appropriate, this ROM may be mask-programmed ROM, programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), electrically alterable ROM (EAROM), or flash memory or a combination of two or more of these. This disclosure contemplates mass storage 820 taking any suitable physical form. Storage 820 may include one or more storage control units facilitating communication between processor 810 and storage 820, where appropriate. Where appropriate, storage 820 may include one or more storages 820.
[0119] Although this disclosure describes and illustrates particular storage, this disclosure contemplates any suitable storage. In particular embodiments, I/O interface 825 includes hardware, software, or both, providing one or more interfaces for communication between computer system 800 and one or more I/O devices. Computer system 800 may include one or more of these I/O devices, where appropriate. One or more of these I/O devices may enable communication between a person and computer system 800. As an example and not by way of limitation, an I/O device may include a keyboard, keypad, microphone, monitor, mouse, printer, scanner, speaker, still camera, stylus, tablet, touch screen, trackball, video camera, another suitable I/O device or a combination of two or more of these. An I/O device may include one or more sensors. This disclosure contemplates any suitable I/O devices and any suitable I/O interfaces 825 for them. Where appropriate, I/O interface 825 may include one or more device or software drivers enabling processor 810 to drive one or more of these I/O devices. I/O interface 825 may include one or more I/O interfaces 825, where appropriate. Although this disclosure describes and illustrates a particular I/O interface, this disclosure contemplates any suitable I/O interface.
[0120] In particular embodiments, communication interface 830 includes hardware, software, or both providing one or more interfaces for communication (such as, for example, packet-based communication) between computer system 800 and one or more other computer systems 800 or one or more networks. As an example and not by way of limitation, communication interface 830 may include a network interface controller (NIC), network adapter or wireless adapter for communicating with a wireless network, such as a WI-FI or a cellular network. This disclosure contemplates any suitable network and any suitable communication interface 830 for it. As an example and not by way of limitation, computer system 800 may communicate with an ad hoc network, a personal area network (PAN), a local area network (LAN), a wide area network (WAN), a metropolitan area network (MAN), or one or more portions of the Internet or a combination of two or more of these. One or more portions of one or more of these networks may be wired or wireless. As an example, computer system 800 may communicate with a wireless cellular telephone network (such as, for example, a Global System for Mobile Communications (GSM) network, or a 3G, 4G or 5G cellular network), or other suitable wireless network or a combination of two or more of these. Computer system 800 may include any suitable communication interface 830 for any of these networks, where appropriate. Communication interface 830 may include one or more communication interfaces 830, where appropriate. Although this disclosure describes and illustrates a particular communication interface, this disclosure contemplates any suitable communication interface.
[0121] In particular embodiments, bus 835 includes hardware, software, or both coupling components of computer system 800 to each other. As an example and not by way of limitation, bus 835 may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a frontside bus (FSB), a HYPERTRANSPORT (HT) interconnect, an Industry Standard Architecture (ISA) bus, an INFINIBAND interconnect, a low-pin-count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a serial advanced technology attachment (SATA) bus, a Video Electronics Standards Association local (VLB) bus, or another suitable bus or a combination of two or more of these. Bus 835 may include one or more buses 835, where appropriate. Although this disclosure describes and illustrates a particular bus, this disclosure contemplates any suitable bus or interconnect.
[0122] Herein, a computer-readable non-transitory storage medium or media may include one or more semiconductor-based or other integrated circuits (ICs) (such as, for example, field-programmable gate arrays (FPGAs) or application-specific ICs (ASICs)), hard disk drives (HDDs), hybrid hard drives (HHDs), optical discs, optical disc drives (ODDs), magneto-optical discs, magneto-optical drives, floppy disk drives (FDDs), solid-state drives (SSDs), RAM-drives, or any other suitable computer-readable non-transitory storage media, or any suitable combination of two or more of these, where appropriate. A computer-readable non-transitory storage medium may be volatile, non-volatile, or a combination of volatile and non-volatile, where appropriate.
[0123] It will be appreciated by persons skilled in the art that numerous variations and/or modifications may be made to the above-described embodiments, without departing from the broad general scope of the present disclosure. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive.

Claims (39)

CLAIMS:
1. A self-service station for touch-free interaction in a transit environment, the station including: a display screen having a display direction; a video image recording device with a field of view in the display direction; a processor to control the display of display images on the display screen and to process live video images recorded by the video image recording device; a memory accessible to the processor and storing executable program code that, when executed by the processor, causes the processor to: determine a human face in the video images, determine whether the face is proximate the display screen, identify a tracking feature of the face in the video images and track movement of the tracking feature in the video images, cause the display screen to initiate an interaction process in response to determining that the face is proximate the display screen, and cause the display screen to display a cursor over content displayed on the display screen during the interaction process and to move the cursor in response to movement of the tracking feature to interact with the content.
2. The station of claim 1, wherein the executable program code, when executed by the processor, causes the processor to: in response to determining that the cursor is positioned over one of at least one predefined area of the content for at least a predetermined dwell time, record a user selection in relation to a screen object associated with the one predefined area.
3. The station of claim 1 or claim 2, wherein the predetermined dwell time is between about 2 seconds and about 5 seconds.
4. The station of claim 3, wherein the predetermined dwell time is between about 2 seconds and about 4 seconds.
5. The station of any one of claims 2 to 4, wherein the executable program code, when executed by the processor, causes the processor to: cause the display screen to visually emphasise the screen object when the cursor is positioned over the one predefined area.
6. The station of any one of claims 2 to 5, wherein the executable program code, when executed by the processor, causes the processor to: cause the display screen to visibly and/or audibly indicate the recording of the user selection.
7. The station of any one of claims 2 to 6, wherein the one predefined area covers between about 5% and about 35% of a visible area of the display screen.
8. The station of any one of claims 2 to 7, wherein the executable program code, when executed by the processor, causes the processor to: cause the display screen to show a progress indicator that is timed to progressively show elapsing of the predetermined dwell time.
9. The station of any one of claims 1 to 8, wherein the executable program code, when executed by the processor, causes the processor to: before causing the display screen to display the cursor over the content, cause the display screen to display a training task to show that face movement correlates to cursor movement.
10. The station of any one of claims 1 to 9, wherein the executable program code, when executed by the processor, causes the processor to: determine that the face is proximate the display screen when a number of pixels in a video image frame of the face exceeds a predetermined pixel count threshold.
11. The station of any one of claims 1 to 10, wherein the executable program code, when executed by the processor, causes the processor to: apply a machine learning model to determine the face in the video images.
12. The station of claim 11, wherein the machine learning model is a deep neural network model based on a single shot multibox detector (SSD) framework.
13. The station of any one of claims 1 to 12, wherein the tracking feature is identified by the processor by analysing an image frame of the face to determine a target pixel area of a fixed contrast window size, wherein the target pixel area has a greatest range of colour contrast among pixel areas of the fixed contrast window size in the image frame, wherein the target pixel area is used as the tracking feature.
14. The station of claim 13, wherein the fixed contrast window size is selected to correspond to a face area of between about 1 cm2 and about 10 cm2.
15. The station of any one of claims 1 to 14, wherein the executable program code, when executed by the processor, causes the processor to: apply a scaling factor to movement of the tracking feature in the video images in order to cause the display screen to proportionately move the cursor over the content.
16. The station of claim 15, wherein the executable program code, when executed by the processor, causes the processor to: determine whether movement of the tracking feature in the video images is less than a predetermined minimum number of pixels over a predetermined time period or is greater than a predetermined maximum number of pixels over the predetermined time period; and increase the scaling factor if the movement is less than the predetermined minimum number of pixels or decrease the scaling factor if the movement is greater than the predetermined maximum number of pixels.
17. The station of any one of claims 1 to 16, wherein the executable program code, when executed by the processor, causes the processor to: cause the display screen to statically display the cursor over the content for a first predetermined wait time after the tracking feature can no longer be identified in the video images.
18. The station of claim 17, wherein the executable program code, when executed by the processor, causes the processor to: store the tracking feature; and for a second predetermined wait time after the tracking feature can no longer be identified in the video images, attempt to determine the face and the tracking feature in the video images.
19. The station of claim 18, wherein at least one of: the first predetermined wait time is between about 5 seconds and about 10 seconds; or the second predetermined wait time is between about 5 seconds and about 10 seconds.
20. The station of any one of claims 1 to 19, further including a housing that houses the display screen, the video image recording device, the processor and the memory, wherein the housing holds the display screen and the video image recording device at a height above floor level sufficient to allow a face of a person to be generally within the field of view when the person stands between about 1 meter and about 2.5 meters in front of the station.
21. The station of any one of claims 1 to 20, wherein the display screen is non-responsive to touch.
22. A system for touch-free interaction in a transit environment, the system including: multiple ones of the station of any one of claims 1 to 21 positioned to allow human interaction at one or more transit facilities; and a server in communication with each of the multiple stations to monitor operation of each of the multiple stations.
23. A method of facilitating touch-free interaction in a transit environment, including: determining a human face in video images captured by a video image recording device positioned at a station; determining whether the face is proximate a display screen at the station, the display screen facing a same direction as a field of view of the video image recording device; identifying a tracking feature of the face in the video images and tracking movement of the tracking feature in the video images; causing the display screen to initiate an interaction process in response to determining that the face is proximate the display screen; and causing the display screen to display a cursor over content displayed on the display screen during the interaction process and to move the cursor in response to movement of the tracking feature to interact with the content.
24. The method of claim 23, further including: in response to determining that the cursor is positioned over one of at least one predefined area of the content for at least a predetermined dwell time, recording a user selection in relation to a screen object associated with the one predefined area.
25. The method of claim 23 or claim 24, wherein the predetermined dwell time is between about 2 seconds and about 5 seconds.
26. The method of any one of claims 23 to 25, further including: causing the display screen to visually emphasise the screen object when the cursor is positioned over the one predefined area.
27. The method of any one of claims 23 to 26, further including: causing the display screen to visibly and/or audibly indicate the recording of the user selection.
28. The method of any one of claims 23 to 27, further including: causing the display screen to show a progress indicator that is timed to progressively show elapsing of the predetermined dwell time.
29. The method of any one of claims 23 to 28, further including: before causing the display screen to display the cursor over the content, causing the display screen to display a training task to show that face movement correlates to cursor movement.
30. The method of any one of claims 23 to 29, further including: determining that the face is proximate the display screen when a number of pixels in a video image frame of the face exceeds a predetermined pixel count threshold.
31. The method of any one of claims 23 to 30, further including: applying a machine learning model to determine the face in the video images, wherein the machine learning model is a deep neural network model based on a single shot multibox detector (SSD) framework.
32. The method of any one of claims 23 to 31, wherein identifying the tracking feature includes analysing an image frame of the face to determine a target pixel area of a fixed contrast window size, wherein the target pixel area has a greatest range of colour contrast among pixel areas of the fixed contrast window size in the image frame, wherein the target pixel area is used as the tracking feature.
33. The method of claim 32, wherein the fixed contrast window size is selected to correspond to a face area of between about 1 cm2 and about 10 cm2.
34. The method of any one of claims 23 to 33, further including: applying a scaling factor to movement of the tracking feature in the video images in order to cause the display screen to proportionately move the cursor over the content.
35. The method of claim 34, further including: determining whether movement of the tracking feature in the video images is less than a predetermined minimum number of pixels over a predetermined time period or is greater than a predetermined maximum number of pixels over the predetermined time period; and increasing the scaling factor if the movement is less than the predetermined minimum number of pixels or decreasing the scaling factor if the movement is greater than the predetermined maximum number of pixels.
36. The method of any one of claims 23 to 35, further including: causing the display screen to statically display the cursor over the content for a first predetermined wait time after the tracking feature can no longer be identified in the video images.
37. The method of claim 36, further including: storing the tracking feature; and for a second predetermined wait time after the tracking feature can no longer be identified in the video images, attempting to determine the face and the tracking feature in the video images; wherein at least one of: the first predetermined wait time is between about 5 seconds and about 10 seconds; or the second predetermined wait time is between about 5 seconds and about 10 seconds.
38. The station of any one of claims 1 to 21, the system of claim 22 or the method of any one of claims 23 to 37, wherein the transit environment is an airport.
39. The steps, systems, devices, subsystems, features, integers, methods and/or processes disclosed herein or indicated in the specification of this application individually or collectively, and any and all combinations of two or more of said steps, systems, devices, subsystems, features, integers, methods and/or processes.
AU2021242795A 2020-03-23 2021-03-17 Touch-free interaction with a self-service station in a transit environment Pending AU2021242795A1 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
AU2020900882A AU2020900882A0 (en) 2020-03-23 Touch-free interaction with a self-service station in a transit environment
AU2020900882 2020-03-23
PCT/AU2021/050241 WO2021189099A1 (en) 2020-03-23 2021-03-17 Touch-free interaction with a self-service station in a transit environment

Publications (1)

Publication Number Publication Date
AU2021242795A1 true AU2021242795A1 (en) 2022-11-24

Family

ID=77889870

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2021242795A Pending AU2021242795A1 (en) 2020-03-23 2021-03-17 Touch-free interaction with a self-service station in a transit environment

Country Status (3)

Country Link
EP (1) EP4128031A1 (en)
AU (1) AU2021242795A1 (en)
WO (1) WO2021189099A1 (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020126090A1 (en) * 2001-01-18 2002-09-12 International Business Machines Corporation Navigating and selecting a portion of a screen by utilizing a state of an object as viewed by a camera
US8159458B2 (en) * 2007-12-13 2012-04-17 Apple Inc. Motion tracking user interface
US9507417B2 (en) * 2014-01-07 2016-11-29 Aquifi, Inc. Systems and methods for implementing head tracking based graphical user interfaces (GUI) that incorporate gesture reactive interface objects
US10102358B2 (en) * 2015-12-29 2018-10-16 Sensory, Incorporated Face-controlled liveness verification

Also Published As

Publication number Publication date
WO2021189099A1 (en) 2021-09-30
EP4128031A1 (en) 2023-02-08
