WO2024072410A1 - Real-time hand gesture tracking and recognition - Google Patents

Real-time hand gesture tracking and recognition

Info

Publication number
WO2024072410A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
hand gesture
image frames
roi
hand
Prior art date
Application number
PCT/US2022/045369
Other languages
French (fr)
Inventor
Jie Liu
Yang Zhou
Original Assignee
Innopeak Technology, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Innopeak Technology, Inc. filed Critical Innopeak Technology, Inc.
Priority to PCT/US2022/045369 priority Critical patent/WO2024072410A1/en
Publication of WO2024072410A1 publication Critical patent/WO2024072410A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods

Definitions

  • This application relates generally to computer technology including, but not limited to, methods, systems, and non-transitory computer-readable media for detecting and determining user gestures in electronic devices that may have limited resources.
  • Gesture control is an important component of user interface on modern day electronic devices.
  • touch gestures are used to invoke a graphic, icon, or pointer to point to, select, or trigger user interface elements on two-dimensional displays (e.g., display monitors, computer screens).
  • Common touch gestures include tap, double tap, swipe, pinch, zoom, rotate, etc.
  • Each touch gesture is typically associated with a certain user interface function.
  • touchless air gestures are used to implement certain user interface functions for electronic devices having no touch screens, e.g., head-mounted displays, smart television devices, and Internet of things (IOT) devices.
  • These devices have no touch screens; however, they can include front-facing cameras or miniature sensors to track human hands in real time.
  • some head-mounted displays (HMDs) have implemented hand tracking functions to support user interactions such as selecting, clicking, and typing on a virtual keyboard.
  • Air gestures can also be used on devices with touch screens when a user’s hands are not available to touch the screen.
  • Various embodiments of this application are directed to methods, systems, devices, non-transitory computer-readable media for recognizing hand gestures, which enables user interaction that assists a user’s action on target objects, provides user inputs, and controls an electronic device in a variety of user applications (e.g., an extended reality application, an IOT application).
  • Such computer-vision-based hand gesture recognition includes hand detection, hand tracking, and gesture classification. Hand detection is implemented based on a plurality of images, and temporal information is considered during hand detection to reduce errors caused by occlusion, motion blurriness, and viewpoint variation.
  • Results of hand detection are further compensated by hand tracking in which a unique identification is used to track each of a set of hand detection results as a hand moves with a sequence of image frames in a video.
  • Hand tracking automatically identifies hand gestures and interprets them as one or more output hand gestures and related trajectories with high accuracy.
  • hand tracking and gesture recognition are implemented on resource-constrained IOT devices. Limited computing resources on IOT devices prohibit use of sophisticated hand tracking algorithms. Hand tracking algorithms applied in the IOT devices receive low-resolution and low-quality input images captured by low-end cameras on IOT devices. Hand gestures are often captured in these input images from varying distances.
  • a lightweight appearance embedding network is implemented based on the feature pyramid structure for object tracking, which efficiently extracts multi-scale features.
  • a track-detection mismatch detection method is applied to improve a tracking accuracy when the appearance embedding network generates false predictions. Additionally, a motion-compensated detection method is used to reduce tracking losses when intermittent detection failures occur.
  • a method is implemented by an electronic device for processing images, recognizing hand gestures, and enabling user interaction.
  • the method includes obtaining a sequence of image frames including a current image frame and identifying a bounding box capturing a hand gesture in the current image frame.
  • the method further includes cropping the current image frame to generate a region of interest (ROI) image based on the bounding box and determining a similarity level of the ROI image with a set of image frames.
  • the set of image frames are cropped from a subset of the sequence of image frames including a first number of image frames that precede the current image frame.
  • the method further includes classifying the hand gesture in the ROI image to an output hand gesture based on the similarity level of the ROI image.
  • classifying the hand gesture in the ROI image further includes classifying the hand gesture in the ROI image to a preliminary hand gesture using a classifier neural network and classifying a corresponding hand gesture in each of the set of image frames to a preceding hand gesture using the classifier neural network. Further, classifying the hand gesture in the ROI image further includes determining a dominant hand gesture from preceding hand gestures of the set of image frames; in accordance with a determination that the similarity level of the ROI image satisfies a similarity requirement, selecting the preliminary hand gesture as the output hand gesture; and in accordance with a determination that the similarity level of the ROI image does not satisfy the similarity requirement, selecting the dominant hand gesture as the output hand gesture.
  • the preliminary hand gesture is identical to more than a threshold number of corresponding hand gestures in the set of image frames and used as the output hand gesture. In some situations, the preliminary hand gesture is different from at least a threshold number of corresponding hand gestures in the set of image frames and not used as the output hand gesture. In some situations, the preliminary hand gesture is different from each and every corresponding hand gesture in the set of image frames and not used as the output hand gesture.
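  • As an illustration of this selection logic, the short sketch below chooses between the classifier's preliminary hand gesture and the dominant preceding hand gesture based on the similarity level; the function name and the similarity threshold are illustrative assumptions rather than values taken from this application.

```python
from collections import Counter
from typing import Sequence

def select_output_gesture(
    preliminary_gesture: str,
    preceding_gestures: Sequence[str],
    similarity_level: float,
    similarity_threshold: float = 0.8,  # illustrative threshold, not specified in the application
) -> str:
    """Select the output hand gesture for the current ROI image.

    If the current ROI image is sufficiently similar to the preceding set of
    cropped image frames, trust the classifier's preliminary gesture; otherwise
    fall back to the gesture that dominates the preceding frames.
    """
    dominant_gesture, _ = Counter(preceding_gestures).most_common(1)[0]
    if similarity_level >= similarity_threshold:
        return preliminary_gesture
    return dominant_gesture
```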
  • some implementations include an electronic device that includes one or more processors and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform any of the above methods.
  • some implementations include a non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform any of the above methods.
  • Figure 1 is an example data processing environment having one or more servers communicatively coupled to one or more client devices, in accordance with some embodiments.
  • FIG. 2 is a block diagram illustrating an electronic device configured to process content data (e.g., image data), in accordance with some embodiments.
  • Figure 3 is an example data processing environment for training and applying a neural network-based data processing model for processing visual and/or audio data, in accordance with some embodiments.
  • Figure 4A is an example neural network applied to process content data in an NN-based data processing model, in accordance with some embodiments
  • Figure 4B is an example node in the neural network, in accordance with some embodiments.
  • FIG. 5 is a flow diagram of an example hand gesture recognition process implemented by an electronic device (e.g., an IOT device, an HMD), in accordance with some embodiments.
  • Figures 6A-6D are temporal diagrams of four example sequences of preliminary hand gestures that correspond to a sequence of region of interest (ROI) images cropped from a sequence of image frames, in accordance with some embodiments.
  • Figures 7A and 7B are temporal diagrams of a sequence of preliminary hand gestures determined by a classifier neural network and a sequence of output hand gestures outputted by a tracker, in accordance with some embodiments, respectively.
  • Figure 8 is a block diagram of an example appearance embedding network using a feature pyramid structure, in accordance with some embodiments.
  • Figure 9 is a flow diagram of another example hand gesture recognition process, in accordance with some embodiments.
  • Figure 10 is a flow diagram of an example image processing method for recognizing hand gestures, in accordance with some embodiments.
  • Hand gesture recognition is a natural method for human-computer interaction (HCI) with a goal of interpreting human hand gestures from images or videos via machine learning algorithms.
  • Hand detection is a task for detecting the position of hands by utilizing mathematical models and algorithms.
  • Hand detection is a pre-processing procedure for many human hand related computer vision tasks, such as hand pose estimation, hand gesture recognition, human activity analysis, and the like.
  • Object tracking is an application in which a program takes an initial set of object detections, develops a unique identification for each initial detection, and then tracks the detected objects as the objects move in a sequence of image frames in a video.
  • object tracking is a task of automatically identifying objects in the video and interpreting them as a set of trajectories with high accuracy, and each entire trajectory of a unique object’s moving path is identified in object tracking.
  • a complex learning system is represented by a neural network model for recognizing hand gestures by hand detection, hand tracking, and gesture classification.
  • the neural network model is trained in an end-to-end manner, and called an end-to-end model.
  • This concept originates from neural networks and machine or deep learning, where a structure of the end-to-end model (e.g., a feature pyramid network in Figure 8) reuses multi-scale feature maps from different layers in a forward pass. In some embodiments, this structure reuses higher-resolution maps of a feature hierarchy to improve detection for small objects.
  • FIG. 1 is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, in accordance with some embodiments.
  • the one or more client devices 104 may be, for example, laptop computers 104A, tablet computers 104B, mobile phones 104C, or intelligent, multi-sensing, network-connected IOT devices.
  • the one or more client devices 104 include a head-mounted display (HMD) 104D configured to render extended reality content.
  • Each client device 104 can collect data or user inputs, execute user applications, and present outputs on its user interface. The collected data or user inputs can be processed locally at the client device 104 and/or remotely by the server(s) 102.
  • the one or more servers 102 provide system data (e.g., boot files, operating system images, and user applications) to the client devices 104, and in some embodiments, process the data and user inputs received from the client device(s) 104 when the user applications are executed on the client devices 104.
  • the data processing environment 100 further includes a storage 106 for storing data related to the servers 102, client devices 104, and applications executed on the client devices 104.
  • storage 106 may store video content (including visual and audio content), static visual content, and/or inertial sensor data.
  • the one or more servers 102 can enable real-time data communication with the client devices 104 that are remote from each other or from the one or more servers 102. Further, in some embodiments, the one or more servers 102 can implement data processing tasks that cannot be or are preferably not completed locally by the client devices 104.
  • the client devices 104 include a game console (e.g., formed by the HMD 104D) that executes an interactive online gaming application.
  • the game console receives a user instruction and sends it to a game server 102 with user data.
  • the game server 102 generates a stream of video data based on the user instruction and user data and provides the stream of video data for display on the game console and other client devices that are engaged in the same game session with the game console.
  • the client devices 104 include a networked surveillance camera 104E and a mobile phone 104C.
  • the networked surveillance camera 104E collects video data and streams the video data to a surveillance camera server 102 in real time. While the video data is optionally pre-processed on the surveillance camera 104E, the surveillance camera server 102 processes the video data to identify motion or audio events in the video data and share information of these events with the mobile phone 104C, thereby allowing a user of the mobile phone 104C to monitor the events occurring near the networked surveillance camera 104E in real time and remotely.
  • the one or more servers 102, one or more client devices 104, and storage 106 are communicatively coupled to each other via one or more communication networks 108, which are the medium used to provide communications links between these devices and computers connected together within the data processing environment 100.
  • the one or more communication networks 108 may include connections, such as wire, wireless communication links, or fiber optic cables. Examples of the one or more communication networks 108 include local area networks (LAN), wide area networks (WAN) such as the Internet, or a combination thereof.
  • the one or more communication networks 108 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol.
  • a connection to the one or more communication networks 108 may be established either directly (e.g., using 3G/4G connectivity to a wireless carrier), or through a network interface 110 (e.g., a router, switch, gateway, hub, or an intelligent, dedicated whole-home control node), or through any combination thereof.
  • the one or more communication networks 108 can represent the Internet, a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another.
  • At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational and other electronic systems that route data and messages.
  • IOT devices are networked electronic devices that are wirelessly coupled to, and configured to transmit data via, one or more communication networks 108.
  • the IOT devices include, but are not limited to, a surveillance camera 104E, a smart television device, a drone, a smart speaker, toys, wearables/smart watches, and smart appliances.
  • an IOT device includes a camera, a microphone, or a sensor configured to capture video, audio, or sensor data, which are used to detect and recognize a user hand gesture.
  • the HMD 104D includes one or more cameras (e.g., a visible light camera, a depth camera), a microphone, a speaker, one or more inertial sensors (e.g., gyroscope, accelerometer), and a display.
  • the camera(s) and microphone are configured to capture video and audio data from a scene of the HMD 104D, while the one or more inertial sensors are configured to capture inertial sensor data.
  • the camera captures gestures of a user passing the IOT device or wearing the HMD 104D.
  • the microphone records ambient sound, including the user’s voice commands.
  • both video or static visual data captured by the visible light camera and the inertial sensor data measured by the one or more inertial sensors are applied to determine and predict device poses (i.e., device positions and orientations).
  • the video, static image, audio, or inertial sensor data captured by the HMD 104D are processed by the HMD 104D, server(s) 102, or both to recognize the device poses.
  • both depth data (e.g., a depth map and a confidence map) captured by the depth camera and the inertial sensor data measured by the one or more inertial sensors are applied to determine and predict device poses.
  • the depth and inertial sensor data captured by the HMD 104D are processed by the HMD 104D, server(s) 102, or both to recognize the device poses.
  • the device poses are used to control the HMD 104D itself or interact with an application (e.g., a gaming application) executed by the HMD 104D.
  • the display of the HMD 104D displays a user interface.
  • the recognized or predicted device poses are used to render virtual objects with high fidelity, and the user gestures captured by the camera are used to interact with visual content on the user interface.
  • simultaneous localization and mapping (SLAM) techniques are applied in the data processing environment 100 to process video data, static image data, or depth data captured by the HMD 104D with inertial sensor data. Device poses are recognized and predicted, and a scene in which the HMD 104D is located is mapped and updated.
  • the SLAM techniques are optionally implemented by HMD 104D independently or by both of the server 102 and HMD 104D jointly.
  • FIG. 2 is a block diagram illustrating an electronic device 200 configured to process content data (e.g., image data), in accordance with some embodiments.
  • the electronic device 200 includes a client device 104 (e.g., an IOT device).
  • the electronic device 200 typically includes one or more processing units (CPUs) 202, one or more network interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset).
  • the electronic device 200 includes one or more input devices 210 that facilitate user input, such as a keyboard, a mouse, a voice-command input unit or microphone, a touch screen display, a touch-sensitive input pad, a gesture capturing camera, or other input buttons or controls.
  • the client device 104 of the electronic device 200 uses a microphone for voice recognition or a camera 260 for gesture recognition to supplement or replace the keyboard.
  • the client device 104 includes one or more optical cameras 260 (e.g., an RGB camera), scanners, or photo sensor units for capturing images, for example, of graphic serial codes printed on the electronic devices.
  • the electronic device 200 also includes one or more output devices 212 that enable presentation of user interfaces and display content, including one or more speakers and/or one or more visual displays.
  • Memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices. Memory 206, optionally, includes one or more storage devices remotely located from one or more processing units 202. Memory 206, or alternatively the non-volatile memory within memory 206, includes a non-transitory computer readable storage medium. In some embodiments, memory 206, or the non-transitory computer readable storage medium of memory 206, stores the following programs, modules, and data structures, or a subset or superset thereof:
  • Operating system 214 including procedures for handling various basic system services and for performing hardware-dependent tasks;
  • Network communication module 216 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or storage 106) via one or more network interfaces 204 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;
  • User interface module 218 for enabling presentation of information (e.g., a graphical user interface for application(s) 224, widgets, websites and web pages thereof, and/or games, audio and/or video content, text, etc.) at each client device 104 via one or more output devices 212 (e.g., displays, speakers, etc.);
  • Input processing module 220 for detecting one or more user inputs or interactions from one of the one or more input devices 210 and interpreting the detected input or interaction;
  • Web browser module 222 for navigating, requesting (e.g., via HTTP), and displaying websites and web pages thereof, including a web interface for logging into a user account associated with a client device 104 or another electronic device, controlling the client or electronic device if associated with the user account, and editing and reviewing settings and data that are associated with the user account;
  • One or more user applications 224 for execution by the electronic device 200 (e.g., games, social network applications, smart home applications, and/or other web or non-web based applications for controlling another electronic device and reviewing data captured by such devices);
  • Model training module 302 for receiving training data and establishing a data processing model for processing content data (e.g., video, image, audio, or textual data) to be collected or obtained by a client device 104;
  • Data processing module 228 for processing content data using data processing models 250, thereby identifying information contained in the content data, matching the content data with other data, categorizing the content data, or synthesizing related content data;
  • the one or more databases 240 are stored in one of the server 102, client device 104, and storage 106 of the electronic device 200.
  • the one or more databases 240 are distributed in more than one of the server 102, client device 104, and storage 106 of the electronic device 200.
  • more than one copy of the above data is stored at distinct devices, e.g., two copies of the data processing models 250 are stored at the server 102 and storage 106, respectively.
  • the data processing module 228 is associated with one of the user applications 224 to process the content data in response to a user instruction received from the user application 224.
  • the data processing module 228 includes a hand gesture recognition module 230 configured to detect preliminary hand gestures 512 ( Figure 5) in a sequence of image frames and determine an output hand gesture 516 ( Figure 5) of each image frame based on a similarity level between the respective image frame and a subset of image frames preceding the respective image frame. More details on hand gesture recognition are explained below with reference to Figures 5-10.
  • Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above.
  • the above identified modules or programs i.e., sets of instructions
  • memory 206 optionally, stores a subset of the modules and data structures identified above.
  • memory 206 optionally, stores additional modules and data structures not described above.
  • FIG. 3 is another example of a data processing system 300 for training and applying a neural network based (NN-based) data processing model 250 for processing content data (e.g., video, image, audio, or textual data), in accordance with some embodiments.
  • the data processing system 300 includes a model training module 302 for establishing the data processing model 250 and a data processing module 228 for processing the content data using the data processing model 250.
  • both of the model training module 302 and the data processing module 228 are located on a client device 104 of the data processing system 300, while a training data source 304 distinct from the client device 104 provides training data 306 to the client device 104.
  • the training data source 304 is optionally a server 102 or storage 106.
  • the model training module 302 and the data processing module 228 are both located on a server 102 of the data processing system 300.
  • the training data source 304 providing the training data 306 is optionally the server 102 itself, another server 102, or the storage 106.
  • the model training module 302 and the data processing module 228 are separately located on a server 102 and client device 104 (e.g., an IOT device), and the server 102 provides the trained data processing model 250 to the client device 104.
  • the model training module 302 includes one or more data pre-processing modules 308, a model training engine 310, and a loss control module 312.
  • the data processing model 250 is trained according to the type of content data to be processed.
  • the training data 306 is consistent with the type of the content data, as is the data pre-processing module 308 applied to process the training data 306.
  • an image pre-processing module 308A is configured to process image training data 306 to a predefined image format, e.g., extract a region of interest (ROI) in each training image, and crop each training image to a predefined image size.
  • an audio pre-processing module 308B is configured to process audio training data 306 to a predefined audio format, e.g., converting each training sequence to a frequency domain using a Fourier transform.
  • the model training engine 310 receives pre-processed training data provided by the data pre-processing modules 308, further processes the pre-processed training data using an existing data processing model 250, and generates an output from each training data item.
  • the loss control module 312 can monitor a loss function comparing the output associated with the respective training data item and a ground truth of the respective training data item.
  • the model training engine 310 modifies the data processing model 250 to reduce the loss function, until the loss function satisfies a loss criterion (e.g., a comparison result of the loss function is minimized or reduced below a loss threshold).
  • the modified data processing model 250 is provided to the data processing module 228 to process the content data.
  • the model training module 302 offers supervised learning in which the training data is entirely labelled and includes a desired output for each training data item (also called the ground truth in some situations). Conversely, in some embodiments, the model training module 302 offers unsupervised learning in which the training data are not labelled. The model training module 302 is configured to identify previously undetected patterns in the training data without pre-existing labels and with no or little human supervision. Additionally, in some embodiments, the model training module 302 offers partially supervised learning in which the training data are partially labelled.
  • the data processing module 228 includes a data pre-processing module 314, a model-based processing module 316, and a data post-processing module 318.
  • the data pre-processing module 314 pre-processes the content data based on the type of the content data. Functions of the data pre-processing module 314 are consistent with those of the pre-processing modules 308 and convert the content data to a predefined content format that is acceptable by inputs of the model-based processing module 316. Examples of the content data include one or more of the following: video, image, audio, textual, and other types of data.
  • each image is pre-processed to extract an ROI or cropped to a predefined image size
  • an audio clip is pre-processed to convert to a frequency domain using a Fourier transform.
  • the content data includes two or more types, e.g., video data and textual data.
  • the model-based processing module 316 applies the trained data processing model 250 provided by the model training module 302 to process the pre-processed content data.
  • the model-based processing module 316 can also monitor an error indicator to determine whether the content data has been properly processed in the data processing module 228.
  • the processed content data is further processed by the data post-processing module 318 to present the processed content data in a preferred format or to provide other related information that can be derived from the processed content data.
  • Figure 4A is an exemplary neural network (NN) 400 applied to process content data in an NN-based data processing model 250, in accordance with some embodiments
  • Figure 4B is an example of a node 420 in the neural network (NN) 400, in accordance with some embodiments.
  • the data processing model 250 is established based on the neural network 400.
  • a corresponding model-based processing module 316 applies the data processing model 250 including the neural network 400 to process content data that has been converted to a predefined content format.
  • the neural network 400 includes a collection of nodes 420 that are connected by links 412. Each node 420 receives one or more node inputs and applies a propagation function to generate a node output from the node input(s).
  • a weight w associated with each link 412 is applied to the node output.
  • the node input(s) can be combined based on corresponding weights w1, w2, w3, and w4 according to the propagation function.
  • the propagation function is a product of a non-linear activation function and a linear weighted combination of the node input(s).
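  • As a concrete illustration, the snippet below evaluates such a propagation function for a single node: a weighted combination of the node inputs passed through a non-linear activation. The sigmoid activation and the example weights are assumptions chosen for illustration.

```python
import math

def node_output(inputs, weights, bias=0.0):
    """Propagation function of one node: activation applied to the weighted input combination."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid, one of the activation options described below

# Example: a node with four inputs combined by weights w1..w4.
y = node_output([0.2, -1.0, 0.5, 0.3], [0.4, 0.1, -0.6, 0.9])
```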
  • the collection of nodes 420 is organized into one or more layers in the neural network 400.
  • the layer(s) may include a single layer acting as both an input layer and an output layer.
  • the layer(s) may include an input layer 402 for receiving inputs, an output layer 406 for providing outputs, and zero or more hidden layers 404 (e.g., 404A and 404B) between the input and output layers 402 and 406.
  • a deep neural network has more than one hidden layer 404 between the input and output layers 402 and 406.
  • each layer is only connected with its immediately preceding and/or immediately following layer.
  • a layer 402 or 404B is a fully connected layer because each node 420 in the layer 402 or 404B is connected to every node 420 in its immediately following layer.
  • one of the hidden layers 404 includes two or more nodes that are connected to the same node in its immediately following layer for down sampling or pooling the nodes 420 between these two layers.
  • max pooling uses a maximum value of the two or more nodes in the layer 404B for generating the node of the immediately following layer 406 connected to the two or more nodes.
  • a convolutional neural network is applied in a data processing model 250 to process content data (particularly, video and image data).
  • the CNN employs convolution operations and belongs to a class of deep neural networks 400, i.e., a feedforward neural network that only moves data forward from the input layer 402 through the hidden layers to the output layer 406.
  • the hidden layer(s) of the CNN can be convolutional layers convolving with multiplication or dot product.
  • Each node in a convolutional layer receives inputs from a receptive area associated with a previous layer (e.g., five nodes), and the receptive area is smaller than the entire previous layer and may vary based on a location of the convolution layer in the convolutional neural network.
  • Video or image data is pre-processed to a predefined video/image format corresponding to the inputs of the CNN.
  • the pre-processed video or image data is abstracted by each layer of the CNN to a respective feature map.
  • a recurrent neural network is applied in the data processing model 250 to process content data (particularly, textual and audio data).
  • Nodes in successive layers of the RNN follow a temporal sequence, such that the RNN exhibits a temporal dynamic behavior.
  • each node 420 of the RNN has a time-varying real-valued activation.
  • Examples of the RNN include, but are not limited to, a long short-term memory (LSTM) network, a fully recurrent network, an Elman network, a Jordan network, a Hopfield network, a bidirectional associative memory (BAM) network, an echo state network, an independently recurrent neural network (IndRNN), a recursive neural network, and a neural history compressor.
  • the RNN can be used for hand
  • the training process is a process for calibrating all of the weights wi for each layer of the learning model using a training data set that is provided to the input layer 402.
  • the training process typically includes two steps, forward propagation and backward propagation, which are repeated multiple times until a predefined convergence condition is satisfied.
  • In forward propagation, the set of weights for different layers are applied to the input data and intermediate results from the previous layers.
  • In backward propagation, a margin of error of the output (e.g., a loss function) is measured, and the weights are adjusted accordingly to decrease the error.
  • the activation function is optionally linear, rectified linear unit, sigmoid, hyperbolic tangent, or of other types.
  • a network bias term b is added to the sum of the weighted outputs from the previous layer before the activation function is applied.
  • the network bias b provides a perturbation that helps the NN 400 avoid overfitting the training data.
  • the result of the training includes the network bias parameter b for each layer.
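  • A minimal sketch of this two-step training loop for a single sigmoid node is shown below; the squared-error loss, learning rate, and convergence threshold are illustrative assumptions, not parameters disclosed in this application.

```python
import math
import random

def train_single_node(samples, epochs=1000, lr=0.1, loss_threshold=1e-3):
    """Gradient-descent training of one sigmoid node on (inputs, target) pairs."""
    n = len(samples[0][0])
    weights = [random.uniform(-0.5, 0.5) for _ in range(n)]
    bias = 0.0
    for _ in range(epochs):
        total_loss = 0.0
        for x, target in samples:
            # Forward propagation: weighted sum plus bias, then sigmoid activation.
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            y = 1.0 / (1.0 + math.exp(-z))
            error = y - target
            total_loss += 0.5 * error * error
            # Backward propagation: adjust weights and bias to decrease the error.
            grad = error * y * (1.0 - y)  # d(loss)/dz for squared error with a sigmoid
            weights = [w - lr * grad * xi for w, xi in zip(weights, x)]
            bias -= lr * grad
        if total_loss / len(samples) < loss_threshold:  # predefined convergence condition
            break
    return weights, bias

# Example: learn a noisy AND-like mapping from two inputs.
w, b = train_single_node([([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)])
```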
  • FIG. 5 is a flow diagram of an example hand gesture recognition process 500 implemented by an electronic device 200 (e.g., an IOT device, an HMD 104D), in accordance with some embodiments.
  • the electronic device 200 receives a sequence of image frames 502 including a current image frame 502C.
  • a detector 504 receives the sequence of image frames 502, generates a feature map from each image frame 502, and decodes a bounding box 520 from the feature map.
  • the bounding box 520 identifies a region of interest (ROI) including a hand.
  • Each image frame 502 is cropped (506) to generate a respective ROI image 508 based on the bounding box 520.
  • Each respective ROI image 508 is further processed by a classifier neural network 510, and the classifier neural network 510 identifies a hand gesture in the respective ROI image 508 as a respective preliminary hand gesture 512 including one of a plurality of predefined hand gestures, e.g., a fourth predefined hand gesture.
  • An example of the classifier neural network 510 is MobileNetV2. Likewise, the current image frame 502C is also cropped to generate an ROI image 508C from which a predefined hand gesture is identified as the preliminary hand gesture 512C by the classifier neural network 510.
  • Information of the identified preliminary hand gesture 512 in each image frame 502 is provided to a tracker 514.
  • the tracker 514 tracks the identified preliminary hand gesture 512 among the sequence of ROI images 508 and determines an output hand gesture 516 for the current image frame 502C based on a similarity level of the ROI image 508C of the current image frame 502C with a set of image frames 518C.
  • the set of image frames 518C are cropped from a subset of the sequence of image frames 502 including a first number (N1) of image frames that precede the current image frame 502C.
  • the sequence of ROI images 508 includes the set of image frames 518C.
  • the output hand gesture 516 is identical to the predefined hand gesture of the preliminary hand gesture 512C identified for the current image frame 502C. In some embodiments, the output hand gesture 516 is distinct from the predefined hand gesture of the preliminary hand gesture 512C identified for the current image frame 502C. Given that the output hand gesture 516C is tracked based on the set of image frames 518C, such a hand gesture recognition process 500 considers recent temporal variations of the hand gesture (which may be caused by noise) and reduces errors caused by occlusion of gestures, motion blurriness, and viewpoint variation.
  • In some embodiments, the tracker 514 includes an appearance embedding network 522 and a hand gesture recognition module 524.
  • the appearance embedding network 522 receives the respective ROI image 508 generated by cropping each image frame 502 based on the bounding box 520, and generates a respective embedding vector 526.
  • the hand gesture recognition module 524 is coupled to the appearance embedding network 522 and configured to determine a similarity level of the current ROI image 508C with a set of ROI images that precede the current ROI image 508C based on the embedding vector 526 of each ROI image 508, and to classify the hand gesture in the current ROI image 508C to the output hand gesture 516 based on the similarity level.
  • the appearance embedding network 522 and hand gesture recognition module 524 form a lightweight tracker 514, allowing the hand gesture recognition process 500 to be implemented by a lightweight hand gesture recognition system that is particularly applicable to resource-constrained IOT devices.
  • the appearance embedding network 522 is established based on a feature pyramid structure that extracts a multi-scale embedding vector 526 from each ROI image 508 to detect hands from various distances.
  • the hand gesture recognition module 524 is based on a DeepSort tracking pipeline and includes a plurality of adaptations to improve efficiency, accuracy, and robustness for hand gesture tracking on the IOT devices.
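  • Putting these pieces together, the sketch below illustrates the data flow of process 500 from the detector 504 through the classifier neural network 510 to the tracker 514. Every callable is a placeholder standing in for the corresponding component; this is not the actual implementation.

```python
from typing import Callable, Sequence

def process_current_frame(
    frames: Sequence,      # sequence of image frames 502, current frame last
    detect: Callable,      # detector 504: frame -> bounding box 520
    crop: Callable,        # cropping step 506: (frame, box) -> ROI image 508
    classify: Callable,    # classifier neural network 510: ROI -> preliminary gesture 512
    embed: Callable,       # appearance embedding network 522: ROI -> embedding vector 526
    track: Callable,       # hand gesture recognition module 524
    n1: int = 15,          # first number N1 of preceding frames
):
    """Sketch of hand gesture recognition process 500 for the current (last) frame."""
    rois = [crop(frame, detect(frame)) for frame in frames]
    gestures = [classify(roi) for roi in rois]
    embeddings = [embed(roi) for roi in rois]
    # The tracker compares the current ROI with the N1 preceding ROI images and
    # returns the output hand gesture 516 for the current image frame 502C.
    return track(gestures[-1], gestures[-1 - n1:-1], embeddings[-1], embeddings[-1 - n1:-1])
```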
  • Object tracking is a method of tracking detected objects throughout frames using their spatial and temporal features
  • DeepSORT is a computer vision tracking algorithm for tracking objects (e.g., hands) while assigning an identifier to each object.
  • DeepSORT introduces deep learning into a Simple Online Realtime Tracking (SORT) algorithm by adding an appearance descriptor to reduce identity switches, thereby making gesture tracking more efficient.
  • DeepSORT applies a tracking algorithm that tracks objects based on both the velocity and motion of an object and the appearance of the object.
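  • In a simplified, generic form, that combination can be expressed as a weighted blend of a motion-based distance and an appearance (cosine) distance; the weighting factor below is an illustrative assumption and the snippet is not taken from the original DeepSORT implementation.

```python
import numpy as np

def association_cost(track_embedding, det_embedding, motion_distance, lam=0.5):
    """Blend motion and appearance cues into a single track-detection association cost.

    motion_distance: distance between the track's predicted location and the detection
                     (e.g., a gated Mahalanobis or IoU-based distance).
    lam:             illustrative weighting between the motion and appearance terms.
    """
    a = np.asarray(track_embedding, dtype=float)
    b = np.asarray(det_embedding, dtype=float)
    cosine_distance = 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return lam * motion_distance + (1.0 - lam) * cosine_distance
```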
  • the hand gesture recognition process 500 detects, tracks, and classifies hand gestures in real time using processors of the IOT devices.
  • a MediaTek (MT) 9652 processor is applied in an IOT device to recognize an output hand gesture 516 within a total inference time of 33 milliseconds.
  • Figures 6A-6D are temporal diagrams of four example sequences of preliminary hand gestures 512-1, 512-2, 512-3, and 512-4 that correspond to a sequence of ROI images 508 cropped from a sequence of image frames 502, in accordance with some embodiments, respectively.
  • An electronic device 200 (e.g., an IOT device, an HMD 104D) receives a sequence of image frames 502 including a current image frame 502C.
  • a detector 504 receives the sequence of image frames 502, generates a feature map from each image frame 502, and decodes a bounding box 520 from the feature map.
  • the bounding box 520 identifies an ROI including a hand.
  • Each image frame 502 is cropped to generate a respective ROI image 508 based on the bounding box 520.
  • Each respective ROI image 508 is further processed by a classifier neural network 510, and the classifier neural network 510 identifies a hand gesture in the respective ROI image 508 as a preliminary hand gesture 512 including one of a plurality of predefined hand gestures.
  • a current image frame 502C is cropped to generate a current ROI image 508C from which a preliminary hand gesture 512C is identified by the classifier neural network 510.
  • each sequence of ROI images 508 includes a total number (N) of ROI images that are cropped from the total number of image frames 502 based on respective bounding boxes 520.
  • the current ROI image 508C includes a 16th, 18th, (N-7)-th, or (N-4)-th ROI image in the respective sequence of ROI images 508.
  • each ROI image 508 is classified to a first predefined hand gesture T0 or a second predefined hand gesture T1 by the classifier neural network 510.
  • the preliminary hand gesture 512C identified for the current image frame 502C is one of the first and second predefined hand gestures T0 and T1.
  • the electronic device 200 determines a similarity level of the current ROI image 508C with a set of image frames 602.
  • the set of image frames 602 includes a first number (N1) of ROI images, e.g., which immediately precede the current ROI image 508C within the corresponding sequence of ROI images 508.
  • the set of image frames 602 are cropped from a subset of the sequence of image frames 502 including the first number (N1) of image frames 502 that precede the current image frame 502C.
  • the electronic device 200 determines the first number (N1) based on a refresh rate of a camera that captures the sequence of image frames 502 and an image quality of the sequence of image frames 502. Based on the similarity level of the current ROI image 508C with the set of image frames 602, the hand gesture in the current ROI image 508C is reclassified to an output hand gesture 516 that may be identical to or distinct from the preliminary hand gesture 512C identified directly by the classifier neural network 510.
  • when the preliminary hand gesture 512C identified for the current ROI image 508C by the classifier neural network 510 is consistent with a dominant hand gesture of the set of image frames 602, the preliminary hand gesture 512C identified for the current ROI image 508C is used as the output hand gesture 516C.
  • when the preliminary hand gesture 512C identified for the current ROI image 508C is not consistent with the dominant hand gesture of the set of image frames 602, the preliminary hand gesture 512C is not used as the output hand gesture 516, and the dominant hand gesture of the set of image frames 602 is outputted as the output hand gesture 516 of the current ROI image 508C.
  • the dominant hand gesture of the set of image frames 602 is adopted and used as the output hand gesture 516 of the current ROI image 508C, independently of the preliminary hand gesture 512C.
  • the electronic device 200 determines a dominant hand gesture from preceding hand gestures (e.g., preliminary hand gestures 512) of the set of image frames 602, and selects the dominant hand gesture as the output hand gesture of the current ROI image 508C.
  • the sequence of ROI images 508 includes the total number (N) of ROI images
  • the current ROI image 508C is the 16th ROI image in the sequence of ROI images 508.
  • a similarity level is determined between the current ROI image 508C and the set of image frames 602-1 including 15 successive image frames that immediately precede the current ROI image 508C.
  • the classifier neural network 510 identifies the second predefined hand gesture T1 in the current image frame 502C.
  • the preceding hand gestures of the set of image frames 602-1 include 2 first predefined hand gestures T0 and 13 second predefined hand gestures T1 in the 15 successive image frames of the set of image frames 602-1.
  • the second predefined hand gesture T1 is dominant in the set of image frames 602-1, and is used as a dominant hand gesture of the set of image frames 602-1.
  • the second predefined hand gesture T1 of the current image frame 502C is consistent with the second predefined hand gesture T1 dominant in the set of image frames 602-1, and the second predefined hand gesture T1 is outputted as the output hand gesture 516 of the current image frame 502C.
  • the current ROI image 508C is the 18th ROI image in the sequence of ROI images 508.
  • a similarity level is determined between the current ROI image 508C and the set of image frames 602-2 including 15 successive image frames that immediately precede the current ROI image 508C.
  • the classifier neural network 510 identifies the first predefined hand gesture T0 in the current image frame 502C.
  • the second predefined hand gesture T1 is dominant in the set of image frames 602-2 and used as a dominant hand gesture of the set of image frames 602-2.
  • the first predefined hand gesture T0 of the current image frame 502C is not consistent with the second predefined hand gesture T1 dominant in the set of image frames 602-2, and the second predefined hand gesture T1 is outputted as the output hand gesture 516 of the current image frame 502C.
  • the current ROI image 508C is the (N-7)-th ROI image in the sequence of ROI images 508.
  • the classifier neural network 510 identifies the second predefined hand gesture T1 in the current image frame 502C.
  • the first predefined hand gesture T0 is dominant in the set of image frames 602-3 and used as a dominant hand gesture of the set of image frames 602-3.
  • the second predefined hand gesture T1 of the current image frame 502C is not consistent with the first predefined hand gesture T0 dominant in the set of image frames 602-3, and the first predefined hand gesture T0 is outputted as the output hand gesture 516 of the current image frame 502C.
  • the current ROI image 508C is the (N-4)-th ROI image in the sequence of ROI images 508.
  • the classifier neural network 510 identifies the second predefined hand gesture T1 in the current image frame 502C.
  • the second predefined hand gesture T1 is dominant in the set of image frames 602-4 and used as a dominant hand gesture of the set of image frames 602-4.
  • the second predefined hand gesture T1 of the current image frame 502C is consistent with the second predefined hand gesture T1 dominant in the set of image frames 602-4, and the second predefined hand gesture T1 is outputted as the output hand gesture 516 of the current image frame 502C.
  • the preliminary hand gesture 512C of the current ROI image 508C identified by the classifier neural network 510 is identical to more than a threshold number of corresponding hand gestures (e.g., 8 out of 15 hand gestures) in the set of image frames 602 and used as the output hand gesture 516.
  • the preliminary hand gesture 512C of the current ROI image 508C identified by the classifier neural network 510 is different from at least a threshold number of corresponding hand gestures (e.g., 8 out of 15 hand gestures) in the set of image frames 602 and not used as the output hand gesture 516.
  • the preliminary hand gesture 512C of the current ROI image 508C identified by the classifier neural network 510 is different from each and every corresponding hand gesture in the set of image frames 602 and not used as the output hand gesture.
  • the similarity level between the current ROI image 508C and the set of image frames 602 is not determined based on their corresponding hand gestures identified by the classifier neural network 510. Instead, embedding vectors 526 are extracted from the current ROI image 508C and the set of image frames 602 and used to determine the similarity level. Specifically, in some embodiments, the similarity level includes a cosine similarity value.
  • the electronic device 200 determines a current embedding vector 526C of the current ROI image 508C and a prior embedding vector 526P of the set of image frames 602. The cosine similarity value is determined based on the current embedding vector 526C and the prior embedding vector 526P.
  • a respective embedding vector 526 is determined for the respective one of the set of image frames 602.
  • the respective embedding vectors 526 of the set of image frames 602 are combined to generate the prior embedding vector 526P of the set of image frames 602, e.g., in a weighted manner.
  • weights of the respective embedding vectors 526 of the set of image frames 602 decrease with a temporal distance from the current image frame 502C.
  • the weights are equal, so that the prior embedding vector 526P of the set of image frames 602 is an average or mean of the respective embedding vectors 526 of the set of image frames 602.
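  • One possible realization of this similarity computation is sketched below: the preceding embedding vectors are combined into a prior embedding with exponentially decaying weights (one choice of weights that decrease with temporal distance; a decay of 1.0 reduces to the plain mean), and the cosine similarity with the current embedding is returned. The decay factor is an assumption for illustration.

```python
import numpy as np

def prior_embedding(preceding_embeddings, decay=0.9):
    """Weighted combination of the preceding embedding vectors 526 (ordered oldest to newest)."""
    vectors = np.asarray(preceding_embeddings, dtype=float)
    n = len(vectors)
    weights = np.array([decay ** (n - 1 - i) for i in range(n)])  # newest frame gets weight 1
    weights /= weights.sum()
    return (weights[:, None] * vectors).sum(axis=0)

def similarity_level(current_embedding, preceding_embeddings, decay=0.9):
    """Cosine similarity between the current ROI image and the preceding set of image frames."""
    a = np.asarray(current_embedding, dtype=float)
    b = prior_embedding(preceding_embeddings, decay)
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
```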
  • the electronic device 200 resizes the current ROI image 508C to a predefined image size and generates the current embedding vector 526C using a sequence of neural networks including a convolutional neural network, one or more pyramid network layers, and a fully connected layer. Further, in some embodiments, the electronic device 200 resizes each of the set of image frames 602 and generates the respective embedding vector 526 using the sequence of neural networks.
  • the electronic device 200 predicts a bounding box 520 in a corresponding image frame 502 based on locations of bounding boxes 520 in two or more image frames 502 captured prior to the corresponding image frame 502, e.g., using equation (1).
  • the at least one of the set of image frames 602 is generated by cropping the corresponding image frame 502 based on the predicted bounding box 520.
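  • Equation (1) is not reproduced here; the sketch below shows one simple form of motion-compensated prediction that linearly extrapolates the bounding box from its locations in the two most recent frames, assuming roughly constant box velocity. It is an illustrative stand-in rather than the exact patented equation.

```python
def predict_bounding_box(prev_box, prev_prev_box):
    """Linearly extrapolate a bounding box (x, y, w, h) when detection fails for a frame."""
    return tuple(2 * p - pp for p, pp in zip(prev_box, prev_prev_box))

# Example: the box moved +5 px in x and grew by 2 px between the last two detections.
predicted = predict_bounding_box((105, 60, 82, 82), (100, 60, 80, 80))  # -> (110, 60, 84, 84)
```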
  • FIGS 7A and 7B are temporal diagrams of a sequence of preliminary hand gestures 512 determined by a classifier neural network 510 and a sequence of output hand gestures 516 outputted by a tracker 514, in accordance with some embodiments, respectively.
  • An electronic device 200 (e.g., an IOT device, an HMD 104D) receives a sequence of image frames 502 including a current image frame 502C.
  • a detector 504 receives the sequence of image frames 502, generates a feature map from each image frame 502, and decodes a bounding box 520 from the feature map.
  • the bounding box 520 identifies an ROI including a hand.
  • Each image frame 502 is cropped to generate a respective ROI image 508 based on the bounding box 520.
  • Each respective ROI image 508 is further processed by a classifier neural network 510, and the classifier neural network 510 identifies a preliminary hand gesture 512 in the respective ROI image 508 as one of a plurality of predefined hand gestures.
  • the sequence of ROI images 508 includes a total number (N) of ROI images that are cropped from the total number of image frames 502 based on respective bounding boxes 520.
  • a sequence of preliminary hand gestures 512 corresponds to the sequence of ROI images 508 and has the total number (N) of successive preliminary hand gestures 512, and each preliminary hand gesture 512 is determined from a respective ROI image 508 without any input from other ROI images 508.
  • each ROI image 508 is classified to a first predefined hand gesture T0 or a second predefined hand gesture T1 by the classifier neural network 510.
  • the preliminary hand gesture 512C identified for the current image frame 502C is one of the first and second predefined hand gestures T0 and T1.
  • there are three or more predefined hand gestures and each ROI image 508 is classified to one of the three or more predefined hand gestures. Examples of predefined hand gestures include, but are not limited to, waving, saluting, handshakes, pointing, and a thumbs up.
  • the first 10 preliminary hand gestures 512 of the first 10 ROI images 508 are consistently classified as the second predefined hand gesture T1.
  • the first predefined hand gesture T0 starts to be detected by the classifier neural network 510, and gradually appears in the sequence of hand gestures 512 with an increasing frequency.
  • the first predefined hand gesture T0 entirely replaces the second predefined hand gesture T1 and stabilizes in the sequence of preliminary hand gestures 512.
  • each output hand gesture 516 determined in real time by the tracker 514 is not entirely consistent with the preliminary hand gesture 512 detected by the classifier neural network 510 during a transition stage, e.g., between the 11th and (N-6)-th ROI images.
  • the sequence of preliminary hand gestures 512 varies between the first and second predefined hand gestures T0 and T1, while the sequence of output hand gestures 516 stabilizes at the second predefined hand gesture T1 and switches to the first predefined hand gesture T0 at the (N-8)-th ROI image, when the first predefined hand gesture T0 starts to dominate in the set of image frames 602 associated with the (N-8)-th ROI image.
  • the sequence of output hand gestures 516 takes into account a recent history of each ROI image 508 and automatically eliminates gesture detection noise caused by random events (e.g., occlusion, low resolution images, failures to detect bounding boxes 520) that may compromise accuracy of a result of the classifier neural network 510.
  • the plurality of predefined hand gestures are limited to the first and second predefined hand gestures T0 and T1. For each current ROI image 508C, each hand gesture in the corresponding set of image frames 602 is classified to a respective one of the plurality of predefined hand gestures by the classifier neural network 510.
  • one of the plurality of predefined hand gestures dominates in the corresponding set of image frames 602 if the one of the plurality of predefined hand gestures appears in the set of image frames 602 more frequently than any other predefined hand gesture.
  • one of the plurality of predefined hand gestures dominates in the corresponding set of image frames 602 if the one of the plurality of predefined hand gestures appears in a second number of image frames 602 and the second number is greater than a threshold dominant number (e.g., 8 if the set of image frames 602 includes 15 image frames).
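  • The two dominance criteria above can be expressed compactly as follows; the default threshold of 8 mirrors the example in the text, and the function name is illustrative.

```python
from collections import Counter
from typing import Optional, Sequence

def dominant_gesture(preceding_gestures: Sequence[str],
                     threshold: Optional[int] = None) -> Optional[str]:
    """Return the dominant hand gesture of the preceding set of image frames.

    With threshold=None, the most frequent gesture dominates; otherwise a gesture
    dominates only if it appears in more than `threshold` frames (e.g., 8 out of 15),
    and None is returned when no gesture qualifies.
    """
    gesture, count = Counter(preceding_gestures).most_common(1)[0]
    if threshold is None or count > threshold:
        return gesture
    return None
```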
  • the dominant hand gesture of the set of image frames 602 is adopted and used as the output hand gesture 516 of the current ROI image 508C, independently of the preliminary hand gesture 512C.
  • the electronic device 200 determines the dominant hand gesture from preceding hand gestures of the set of image frames 602, and selects the dominant hand gesture as the output hand gesture of the current ROI image 508C.
  • FIG. 8 is a block diagram of an example appearance embedding network 522 using a feature pyramid structure 802, in accordance with some embodiments.
  • This appearance embedding network 522 is applied to generate an embedding vector 526 from each ROI image 508 generated from a respective one of a sequence of image frames 502.
  • the appearance embedding network 522 includes the feature pyramid structure 802. Introduction of the pyramid structure 802 improves embedding accuracy of the appearance embedding network 522 by combining features from different scales.
  • An ROI image 508 is generated by cropping one of a sequence of image frames 502 based on a bounding box 520.
  • a detector 504 in Figure 5 receives the sequence of image frames 502, generates a feature map from each image frame 502, and decodes the bounding box 520 from the feature map.
  • the ROI image 508 is generated by cropping a corresponding image frame 502 according to the bounding box 520, and a coordinate system of the ROI image 508 is determined by the bounding box 520.
  • an image frame 502 is cropped based on the bounding box 520 to a size of W×H and resized (806) (e.g., to a size of 90×90 pixels, which is smaller than a common image size of 224×224 pixels) to generate the input image 804.
  • each of the ROI image 508 and the input image 804 includes a color image having 3 channels.
  • a convolutional layer 808 is applied to extract an image feature map 810 from the input image 804.
  • the image feature map 810 is further processed by a plurality of feature pyramid network layers (e.g., 802 A, 802B, and 802C) in the feature pyramid structure 802 to generate a pyramid feature map 812.
  • the pyramid feature map 812 is further processed by a fully connected layer 814 to generate an embedding vector 526 for determining an output hand gesture 516 in each ROI image 508 based on gesture tracking.
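As a rough illustration of the convolution → pyramid → fully connected pipeline described above, the PyTorch sketch below builds a small embedding network that fuses feature maps from several scales before pooling and projecting them to an embedding vector. The channel widths, the number of pyramid stages, the nearest-neighbor upsample-and-add fusion, and the 128-dimensional output are assumptions for illustration; only the 90×90 input size and the overall layer ordering come from the description, and the actual backbone is described below as MobileNetV2-style bottlenecks rather than the plain convolutions used here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidEmbeddingNet(nn.Module):
    """Sketch of an appearance embedding network with a small feature
    pyramid: a stem convolution extracts an image feature map, three
    stages produce features at decreasing resolutions, the coarser maps
    are upsampled and fused with the finer ones, and a fully connected
    layer maps the pooled, fused feature to an embedding vector."""

    def __init__(self, embed_dim=128):
        super().__init__()
        self.stem = nn.Sequential(              # 3 x 90 x 90 -> 32 x 45 x 45
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.BatchNorm2d(32), nn.ReLU6())
        self.stage1 = nn.Sequential(            # -> 64 x 23 x 23
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU6())
        self.stage2 = nn.Sequential(            # -> 96 x 12 x 12
            nn.Conv2d(64, 96, 3, stride=2, padding=1), nn.BatchNorm2d(96), nn.ReLU6())
        self.stage3 = nn.Sequential(            # -> 128 x 6 x 6
            nn.Conv2d(96, 128, 3, stride=2, padding=1), nn.BatchNorm2d(128), nn.ReLU6())
        # 1x1 lateral convolutions project every stage to a common width.
        self.lat1 = nn.Conv2d(64, 128, 1)
        self.lat2 = nn.Conv2d(96, 128, 1)
        self.lat3 = nn.Conv2d(128, 128, 1)
        self.fc = nn.Linear(128, embed_dim)

    def forward(self, x):
        c1 = self.stage1(self.stem(x))
        c2 = self.stage2(c1)
        c3 = self.stage3(c2)
        # Top-down fusion: upsample coarse maps and add them to finer ones.
        p3 = self.lat3(c3)
        p2 = self.lat2(c2) + F.interpolate(p3, size=c2.shape[-2:], mode="nearest")
        p1 = self.lat1(c1) + F.interpolate(p2, size=c1.shape[-2:], mode="nearest")
        pooled = F.adaptive_avg_pool2d(p1, 1).flatten(1)
        return F.normalize(self.fc(pooled), dim=1)   # L2-normalized embedding

roi = torch.randn(1, 3, 90, 90)                  # a resized ROI image
print(PyramidEmbeddingNet()(roi).shape)          # torch.Size([1, 128])
```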
  • the appearance embedding network 522 is deployed on a MediaTek (MT) 9652 Processor with an estimated inference time of 30ms.
  • the appearance embedding network 522 uses a bottleneck structure in MobileNetV2 as a backbone of the appearance embedding network, including the neural network structures 802, 808, and 814.
  • the bottleneck structure includes a 1×1 pointwise convolution layer, followed by a depth-wise convolution layer with a 3×3 kernel, and ends with a 1×1 pointwise convolution layer.
  • the bottleneck structure includes a residual skip connection implemented with an element-wise addition (plus) operation.
  • the bottleneck structure includes a depth-wise layer with a stride of 2 to downsample spatial dimensions by half.
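A minimal PyTorch sketch of such a bottleneck block follows. The expansion factor of 6 and the batch-normalization/ReLU6 placement follow the public MobileNetV2 design and are assumptions here; the description above only specifies the pointwise–depthwise–pointwise sequence, the residual addition, and the stride-2 depthwise downsampling.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """MobileNetV2-style bottleneck: 1x1 pointwise expansion, 3x3
    depthwise convolution, 1x1 pointwise projection, with a residual
    skip connection (element-wise addition) when the block keeps the
    spatial size and channel count, and a stride-2 depthwise layer
    when spatial dimensions are halved."""

    def __init__(self, in_ch, out_ch, stride=1, expand=6):
        super().__init__()
        hidden = in_ch * expand
        self.use_residual = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False),               # pointwise expand
            nn.BatchNorm2d(hidden), nn.ReLU6(),
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1,
                      groups=hidden, bias=False),                  # depthwise 3x3
            nn.BatchNorm2d(hidden), nn.ReLU6(),
            nn.Conv2d(hidden, out_ch, 1, bias=False),              # pointwise project
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out

x = torch.randn(1, 32, 45, 45)
print(Bottleneck(32, 32, stride=1)(x).shape)   # torch.Size([1, 32, 45, 45]), with skip
print(Bottleneck(32, 64, stride=2)(x).shape)   # torch.Size([1, 64, 23, 23]), halved
```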
  • FIG. 9 is a flow diagram of another example hand gesture recognition process 900, in accordance with some embodiments.
  • the hand gesture recognition process 900 is implemented by a hand gesture recognition module 524 in Figure 5 based on a DeepSort algorithm that introduces a two-stage matching mechanism, i.e., matching cascade 902 and intersection over union (IoU) matching 904.
  • in a matching cascade module 902, embedding vectors 526 of an appearance embedding network 522 are employed as a data association metric for associating detections with tracks, and dramatically increase tracking stability.
  • An IoU matching module 904 is coupled to the matching cascade module 902, and configured to receive information of an unmatched detection in which a preliminary hand gesture 512C of a current ROI image 508C does not match a set of confirmed hand gestures 512 recognized in a set of image frames 602 based on a similarity requirement. The similarity of the preliminary hand gesture 512C and the set of confirmed hand gestures 512 is determined based on the corresponding embedding vectors 526.
  • the IoU matching module 904 continues to compare an embedding vector 526 of the current ROI image 508C with embedding vectors 526 of a set of tentative hand gestures 512' of the set of image frames 602 and determines the output hand gesture 516 based on a related similarity level.
  • the matching cascade module 902 applies hand gestures recognized in the set of image frames 502 with a first level of confidence.
  • the IoU matching module 904 applies hand gestures recognized in the set of image frames 502 with a second level of confidence that is reduced from the first level of confidence.
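The two-stage matching mechanism can be sketched as follows. This is a simplified, greedy stand-in for DeepSort's cascade (which uses Hungarian assignment and additional gating); the cosine-distance and IoU thresholds, the dictionary keys `"emb"` and `"box"`, and the greedy matching order are assumptions for illustration.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def iou(box_a, box_b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area(box_a) + area(box_b) - inter + 1e-12)

def two_stage_match(detections, confirmed, tentative,
                    max_cos_dist=0.2, min_iou=0.3):
    """Stage 1 (matching cascade): associate each detection with the
    confirmed track whose embedding is closest, if the cosine distance
    satisfies the similarity requirement. Stage 2 (IoU matching): the
    remaining unmatched detections are matched to tentative tracks by
    bounding-box overlap."""
    matches, unmatched = [], []
    for det in detections:
        best = min(confirmed, key=lambda t: cosine_distance(det["emb"], t["emb"]),
                   default=None)
        if best and cosine_distance(det["emb"], best["emb"]) < max_cos_dist:
            matches.append((det, best, "cascade"))
        else:
            unmatched.append(det)
    for det in unmatched:
        best = max(tentative, key=lambda t: iou(det["box"], t["box"]), default=None)
        if best and iou(det["box"], best["box"]) > min_iou:
            matches.append((det, best, "iou"))
    return matches
```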
  • a correction logic 906 is added to remove a subset of tracking results (i.e., one or more output hand gestures 516) that have contradictory labels with the associated detection (i.e., preliminary hand gestures 512) for several consecutive image frames 502.
  • a proposed mismatch detection routine assigns a property called "id_mismatch" 908 to record the occurrences of consecutive mismatches between a track and the corresponding detection associated with the track, i.e., between the preliminary hand gestures 512 and output hand gestures 516. If the track and detection belong to different categories, id_mismatch 908 increases by 1.
  • id_mismatch 908 will be reset to zero when a track has the same label as the associated detection.
  • a mismatching indicator id_mismatch 908 is generated by comparing the preliminary hand gesture 512 and output hand gesture 516 for each ROI image 508. Each mismatch between the gestures 512 and 516 increases the mismatching indicator id_mismatch 908 by 1, and each match between the gestures 512 and 516 resets the mismatching indicator id_mismatch 908 to 0. In accordance with a determination that the mismatching indicator id_mismatch 908 is greater than the threshold id_mismatch_max, a corresponding output hand gesture 516 is deleted. For example, if the threshold id_mismatch_max is equal to 3, the two output hand gestures 516 are deleted because the corresponding mismatching indicator id_mismatch 908 is equal to 4 and 5, respectively.
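A compact sketch of this correction logic, assuming a per-track counter and the threshold of 3 from the example above (the dictionary layout and function name are illustrative):

```python
def update_id_mismatch(track, preliminary_label, id_mismatch_max=3):
    """Count consecutive label mismatches between a track's output gesture
    and the associated detection's preliminary gesture; reset on a match,
    and drop the track once the counter exceeds the threshold."""
    if preliminary_label == track["label"]:
        track["id_mismatch"] = 0                 # labels agree: reset
    else:
        track["id_mismatch"] += 1                # one more consecutive mismatch
    # Keep the track only while the counter stays within the threshold.
    return track["id_mismatch"] <= id_mismatch_max

track = {"label": "OK", "id_mismatch": 0}
for detection in ["OK", "V", "V", "V", "V"]:     # four consecutive mismatches
    keep = update_id_mismatch(track, detection)
print(keep, track["id_mismatch"])                # False 4 -> track is deleted
```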
  • embedding vectors 526 generated by the appearance embedding network 522 cannot differentiate different gestures, and detections belonging to different gestures are associated with similar embedding vectors 526.
  • the underlying reasons are twofold.
  • the appearance embedding network 522 is trained with person Re-ID datasets in DeepSort.
  • the images in person Re-ID datasets are fundamentally different from hand gesture images.
  • the hand images are captured from various distances in practical applications.
  • the scale ambiguities will decrease the accuracy of the appearance embedding network, since one hand gesture captured at varied distances generates different appearance embeddings.
  • the DeepSort algorithm in Figure 9 prioritizes the appearance matching procedure in the two-stage matching mechanism. A detection may be associated with a track belonging to another category when they share similar appearance embeddings.
  • the correction logic 906 is applied to compensate for this loss of accuracy.
  • a motion-compensated detection logic 910 is applied to reduce tracking losses when intermittent detection failures occur, i.e., when the preliminary hand gestures 512 are not generated.
  • a pseudo detection is generated to update the parameters inside the track.
  • the predicted detection allows the tracking algorithm to update parameters to reflect the object movements even when valid detection is unavailable.
  • the motion-compensated detection logic 910 raises an unexpected outcome, i.e., existing tracks will last forever since pseudo detections never perish.
  • a predefined life cycle Tmax is assigned for each generated detection providing a preliminary hand gesture 512 to solve the issue.
  • Tij represents the life cycle of the i-th pseudo detection assigned to the j-th track.
  • both the i-th detection and the j-th track are deleted when Tij > Tmax, where Tmax is the predefined life cycle for each pseudo detection.
  • the predefined life cycle Tmax is reached only when pseudo detection generation lasts for a number of consecutive image frames corresponding to the predefined life cycle Tmax.
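A minimal sketch of this bounded life cycle, assuming a per-track counter, a constant-velocity pseudo detection, and Tmax = 5 (the value, the dictionary keys, and the helper `predict_next` are illustrative assumptions):

```python
def step_track(track, detection, t_max=5):
    """When a real detection is missing, a pseudo detection predicted from
    the track's motion keeps the track updated, but each consecutive pseudo
    detection increases a life-cycle counter; the track (and its pseudo
    detections) are deleted once the counter exceeds Tmax."""
    if detection is not None:
        track["life"] = 0                       # real detection: reset life cycle
        track["pos"] = detection
    else:
        track["life"] += 1                      # pseudo detection for this frame
        track["pos"] = predict_next(track)      # motion-compensated prediction
    return track["life"] <= t_max               # False -> delete the track

def predict_next(track):
    """Hypothetical constant-velocity prediction used for the pseudo detection."""
    (x, y), (vx, vy) = track["pos"], track.get("vel", (0.0, 0.0))
    return (x + vx, y + vy)

track = {"pos": (10.0, 20.0), "vel": (2.0, 0.0), "life": 0}
for frame_detection in [None] * 6:              # six consecutive detection failures
    alive = step_track(track, frame_detection)
print(alive)                                    # False: life cycle exceeded Tmax
```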
  • a bounding box 520 of an image frame 502 corresponding to the missing preliminary hand gesture 512A is missing, thereby making the classifier neural network 510 fail to output the preliminary hand gesture 512A.
  • a position of the bounding box 520 is pt in the image frame 502 corresponding to the missing preliminary hand gesture 512A.
  • Positions of bounding boxes 520 are pt-1 and pt-2 in the image frames 502 that immediately precede the image frame 502 corresponding to the missing preliminary hand gesture 512A and correspond to the preliminary hand gestures 512C and 512D, respectively.
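One simple way to obtain the missing position pt from pt-1 and pt-2 is a constant-velocity extrapolation; the description only states that the bounding box is predicted from prior locations, so the linear rule below is an assumed illustration:

```python
import numpy as np

def predict_bbox(p_t1, p_t2):
    """Predict the missing bounding box position p_t from the two preceding
    positions p_{t-1} and p_{t-2}, assuming roughly constant motion between
    frames: p_t = 2 * p_{t-1} - p_{t-2}."""
    p_t1, p_t2 = np.asarray(p_t1, float), np.asarray(p_t2, float)
    return 2.0 * p_t1 - p_t2

# Boxes as [x1, y1, x2, y2]; the hand moved 5 px to the right between the two
# preceding frames, so the predicted box continues that motion.
print(predict_bbox([105, 50, 155, 100], [100, 50, 150, 100]))
# predicted box: [110., 50., 160., 100.]
```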
  • the classifier neural network 510 in Figure 5 fails to generate one or more preliminary hand gestures 512 (e.g., 512A in Figure 7A), e.g., because of an intermittent detection failure.
  • the missing preliminary hand gesture 512A is used in a set of image frames 602 of each of a plurality of ROI images 508 to determine an output hand gesture 516 in the respective ROI image 508.
  • the set of image frames 602 of each ROI image 508 includes a first number (N1) of ROI images 508 immediately preceding the respective ROI image 508.
  • the missing preliminary hand gesture 512A impacts determination of the output hand gesture 516 in each of the first number (N1) of successive ROI images 508 following the missing preliminary hand gesture 512A.
  • the missing preliminary hand gesture 512A also impacts two preliminary hand gestures 512B that immediately follow the missing preliminary hand gesture 512A, thereby impacting determination of the output hand gesture 516 in each of a second number (N2 = N1 + 2) of successive ROI images 508 following the missing preliminary hand gesture 512A.
  • the first number (N1) or second number (N2) optionally defines the predefined life cycle Tmax assigned for each generated detection.
  • an output hand gesture 516 of each ROI image 508 is generated based on preliminary hand gestures 512 of a set of image frames 602.
  • the set of image frames 602 are captured before the respective ROI image 508, and do not correspond to successive image frames 502. Rather, the first number (N1) of image frames 602 are sampled (e.g., uniformly) from a second subset of image frames captured by a camera immediately prior to each ROI image 508.
  • the second subset of image frames include a third number (N3) of image frames.
  • a missing preliminary hand gesture 512A impacts determination of the output hand gesture 516 in each of a subset of the third number (N3) of successive ROI images 508 following the missing preliminary hand gesture 512A.
  • the third number (N3) or fourth number (N4) optionally defines the predefined life cycle Tmax assigned for each generated detection.
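Uniform sampling of the window from the most recently captured frames, as described above, can be sketched as follows; the function name and the example numbers (N1 = 5 frames drawn from N3 = 15) are illustrative:

```python
import numpy as np

def sample_window(history, n1):
    """Uniformly sample the first number (N1) of frames from the N3 most
    recent frames preceding the current frame. `history` is the list of
    those N3 preceding frames, ordered oldest to newest."""
    n3 = len(history)
    if n3 <= n1:
        return list(history)
    idx = np.linspace(0, n3 - 1, num=n1).round().astype(int)
    return [history[i] for i in idx]

# Example: keep 5 of the 15 frames captured immediately before the current one.
print(sample_window(list(range(15)), 5))   # [0, 4, 7, 10, 14]
```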
  • the hand gesture recognition process 900 improves accuracy of hand gesture recognition.
  • An appearance embedding network 522 applied in a classifier neural network 510 is based on a feature pyramid structure 802, and increases embedding accuracy by extracting features from different scales.
  • Track-detection mismatch detection implemented by the tracker 514 improves tracking accuracy when the appearance embedding network generates false predictions.
  • the motion-compensated detection logic 910 further reduces tracking losses when intermittent detection failures occur.
  • Various embodiments of this application not only apply a classifier neural network 510 to detect a preliminary hand gesture 512 in each image frame 502, but also track a temporal variation of the preliminary hand gesture 512 within a time window corresponding to a set of image frames 602 to determine an output hand gesture 516 for the respective image frame 502.
  • gesture tracking effectively avoids false positives detected by the classifier neural network 510 in individual image frames by mismatching features.
  • the classifier neural network 510 recognizes a "V" gesture in the second one of four consecutive frames and an "OK" gesture in the other three of the four consecutive frames.
  • the "V" gesture in the second frame is recognized in error and can be properly corrected in accordance with a comparison with the hand gestures detected for the preceding frames 602 in the time window.
  • accuracy of hand gesture recognition is improved, and the hand gesture is tracked smoothly without any interruption among frames.
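A standalone illustration of this example, assuming a simple majority vote over the four frames (the same mechanism sketched earlier for the dominant gesture):

```python
from collections import Counter

# The classifier labels four consecutive ROI images as "OK", "V", "OK", "OK";
# the single "V" is an outlier relative to the window, so the tracked output
# stays "OK".
window = ["OK", "V", "OK", "OK"]
gesture, count = Counter(window).most_common(1)[0]
print(gesture, count)   # OK 3
```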
  • one or more hand gestures may be missing in each of a subset of a sequence of image frames 502, particularly when multiple people holding multiple gestures appear in the sequence of image frames 502.
  • the electronic device 200 can easily determine and fill the missing hand gesture(s) based on respective preceding image frames in the corresponding time windows.
  • the hand gesture recognition process 900 enables robust multi-person hand tracking, allowing predefined hand gestures to be detected and tracked for different individuals.
  • Figure 10 is a flow diagram of an example image processing method 1000 for recognizing hand gestures, in accordance with some embodiments.
  • the method 1000 is described as being implemented by an electronic device 200 (e.g., an IOT device).
  • Method 1000 is, optionally, governed by instructions that are stored in a non- transitory computer readable storage medium and that are executed by one or more processors of the computer system.
  • Each of the operations shown in Figure 10 may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., instructions stored in a hand gesture recognition module 230 of memory 206 in Figure 2).
  • the computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other nonvolatile memory device or devices.
  • the instructions stored on the computer readable storage medium may include one or more of source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors. Some operations in method 1000 may be combined and/or the order of some operations may be changed.
  • the electronic device 200 obtains (1002) a sequence of image frames 502 including a current image frame 502C, and identifies (1004) a bounding box 520 capturing a hand gesture in the current image frame 502C.
  • the current image frame 502C is cropped (1006) to generate an ROI image 508C based on the bounding box 520.
  • the electronic device 200 determines (1008) a similarity level of the ROI image 508C with a set of image frames 602.
  • the set of image frames 602 are cropped (1010) from a subset of the sequence of image frames 502 including a first number (N1) of image frames that precede the current image frame 502C.
  • the electronic device 200 classifies (1012) the hand gesture in the ROI image 508C to an output hand gesture 516.
  • the hand gesture in the ROI image 508C is classified by classifying the hand gesture in the ROI image 508C to a preliminary hand gesture 512C using a classifier neural network 510 and classifying a corresponding hand gesture in each of the set of image frames 602 to a preceding hand gesture using the classifier neural network 510.
  • the hand gesture in the ROI image 508C is classified by determining a dominant hand gesture from preceding hand gestures of the set of image frames 602.
  • the electronic device 200 selects the preliminary hand gesture 512C as the output hand gesture 516. In accordance with a determination that the similarity level of the ROI image 508C does not satisfy the similarity requirement, the electronic device 200 selects the dominant hand gesture as the output hand gesture 516.
  • the preliminary hand gesture 512C is identical to more than a threshold number of corresponding hand gestures in the set of image frames 602 and used as the output hand gesture 516. In some situations, the preliminary hand gesture 512C is different from at least a threshold number of corresponding hand gestures in the set of image frames 602 and not used as the output hand gesture 516. In some situations, the preliminary hand gesture 512C is different from each and every corresponding hand gesture in the set of image frames 602 and not used as the output hand gesture 516.
  • the electronic device 200 classifies the hand gesture in the ROI image 508C by determining (1014) a dominant hand gesture from preceding hand gestures of the set of image frames 602 and selecting (1016) the dominant hand gesture as the output hand gesture 516 of the ROI image 508C.
  • the similarity level includes (1008) a cosine similarity value
  • the similarity level of the ROI image 508C with the set of image frames 602 is determined by determining (1018) a current embedding vector 526C for the ROI image 508C, determining (1020) a prior embedding vector 526P of the set of image frames 602, and determining (1022) a cosine similarity value based on the current embedding vector 526C and the prior embedding vector 526P.
  • the electronic device 200 determines the prior embedding vector 526P of the set of image frames 602 by, for each of the set of image frames 602, cropping a corresponding image frame in the subset of the sequence of image frames 502 and determining a respective embedding vector of the respective one of the set of image frames 602, and by combining the respective embedding vectors 526 of the set of image frames 602 to generate the prior embedding vector 526P of the set of image frames 602.
  • the respective embedding vectors 526 of the set of image frames 602 are combined in a weighted manner. For each image frame, a respective weight depends on a temporal distance of the respective image frame from the current image frame 502C. Alternatively, in some embodiments, the set of image frames 602 have equal weights.
  • the prior embedding vector 526P is an average of the respective embedding vectors 526 of the set of image frames 602.
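The cosine-similarity comparison against a combined prior embedding can be sketched as follows; the example weights that grow toward the most recent frame are an illustrative choice, and the unweighted path corresponds to the plain average mentioned above:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between the current and prior embedding vectors."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def prior_embedding(embeddings, weights=None):
    """Combine the embedding vectors of the preceding set of image frames
    into a single prior embedding: a plain average when no weights are
    given, otherwise a weighted average (e.g., with weights that depend on
    each frame's temporal distance from the current frame)."""
    embeddings = np.asarray(embeddings, float)
    if weights is None:
        return embeddings.mean(axis=0)
    weights = np.asarray(weights, float)
    return (weights[:, None] * embeddings).sum(axis=0) / weights.sum()

# Current ROI embedding vs. the weighted prior of three preceding frames,
# with more recent frames weighted more heavily (illustrative values).
current = [0.9, 0.1, 0.0]
prior = prior_embedding([[1, 0, 0], [0.8, 0.2, 0], [0.9, 0.1, 0]],
                        weights=[0.2, 0.3, 0.5])
print(cosine_similarity(current, prior))
```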
  • the electronic device 200 determines the current embedding vector 526C for the ROI image 508C by resizing the ROI image 508C to a predefined image size and generating the current embedding vector 526C using a sequence of neural networks including a convolutional neural network, one or more pyramid network layers, and a fully connected layer.
  • the electronic device 200 predicts a bounding box 520 in a corresponding image frame based on locations of bounding boxes 520 in two or more image frames captured prior to the corresponding image frame.
  • the at least one of the set of image frames 602 is cropped from the corresponding image frame based on the predicted bounding box 520.
  • the electronic device 200 determines the first number based on a refresh rate of a camera that captures the sequence of image frames 502 and an image quality of the sequence of image frames 502.
  • the image quality is optionally determined based on a signal- to-noise ratio.
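The description leaves the exact mapping from refresh rate and image quality to the first number open; the heuristic below is purely an assumed illustration of how such a mapping might look (every constant in it is invented for the example):

```python
def first_number(fps, snr_db, window_seconds=0.5,
                 min_frames=5, max_frames=30, snr_good=30.0):
    """Illustrative heuristic: cover roughly a fixed time window at the
    camera's refresh rate, and enlarge the window when the signal-to-noise
    ratio is low so that more frames vote on the output gesture."""
    base = int(round(fps * window_seconds))
    quality_factor = max(1.0, snr_good / max(snr_db, 1.0))
    return int(min(max_frames, max(min_frames, base * quality_factor)))

print(first_number(fps=30, snr_db=30))   # 15 frames for a clean 30 fps stream
print(first_number(fps=30, snr_db=15))   # a larger window for a noisier stream
```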
  • the electronic device 200 includes a camera and is applied with or integrated in a television device for recognizing the output hand gesture 516.
  • the subset of the sequence of image frames 502 includes the first number of image frames that are uniformly sampled from a second subset of image frames 602 captured by a camera immediately prior to the current image frame 502C.
  • the second subset of image frames 602 includes a subset of successive images.
  • the electronic device 200 classifies hand gestures in the successive images to successive hand gestures using a classifier neural network 510, determines tracked hand gestures in the successive images based on similarity levels with respective sets of image frames, and determines that each of the successive hand gestures does not match a corresponding tracked hand gesture.
  • the electronic device 200 forgoes sampling of a last image in the subset of successive images into the set of image frames 602 for the ROI image 508C. More details on tracking correction are explained above with reference to Figure 9.
  • the term "if" is, optionally, construed to mean "when" or "upon" or "in response to determining" or "in response to detecting" or "in accordance with a determination that," depending on the context.
  • the phrase “if it is determined” or “if [a stated condition or event] is detected” is, optionally, construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event]” or “in accordance with a determination that [a stated condition or event] is detected,” depending on the context.
  • stages that are not order dependent may be reordered and other stages may be combined or broken out. While some reordering or other groupings are specifically mentioned, others will be obvious to those of ordinary skill in the art, so the ordering and groupings presented herein are not an exhaustive list of alternatives. Moreover, it should be recognized that the stages can be implemented in hardware, firmware, software or any combination thereof.

Abstract

This application is directed to hand gesture recognition. An electronic device (e.g., an IOT device) obtains a sequence of image frames including a current image frame and identifies a bounding box capturing a hand gesture in the current image frame. The current image frame is cropped to generate a region of interest (ROI) image based on the bounding box. The electronic device determines a similarity level of the ROI image with a set of image frames. The set of image frames are cropped from a subset of the sequence of image frames including a first number of image frames that precede the current image frame. Based on the similarity level of the ROI image, the hand gesture in the ROI image is classified to an output hand gesture.

Description

Real-time Hand Gesture Tracking and Recognition
TECHNICAL FIELD
[0001] This application relates generally to computer technology including, but not limited to, methods, systems, and non-transitory computer-readable media for detecting and determining user gestures in electronic devices that may have limited resources.
BACKGROUND
[0002] Gesture control is an important component of user interface on modern day electronic devices. For devices with touch screens, touch gestures are used for invoking a graphic, icon, or pointer to point, select, or trigger user interface elements on two- dimensional displays (e.g., display monitors, computer screens). Common touch gestures include tap, double tap, swipe, pinch, zoom, rotate, etc. Each touch gesture is typically associated with a certain user interface function. In contrast, touchless air gestures are used to implement certain user interface functions for electronic devices having no touch screens, e.g., head mounted display, smart television devices, Internet of things (IOT) devices. These devices have no touch screens, and however, can include front-facing cameras or miniature sensors to track human hands in real time. For example, some head mounted displays (HMDs) have implemented hand tracking functions to complete user interaction including selecting, clicking, and typing on a virtual keyboard. Air gestures can also be used on the devices with touch screens when a user’s hands are not available to touch the screen.
[0003] Most existing gesture tracking methods are not specifically designed for the IOT applications that usually operate with constrained resources. Many existing hand detection algorithms require higher resolution input images and demand computing resources that cannot be offered by the IOT devices. Sophisticated architectures, control flows, and neural network structures of many gesture tracking methods cannot be applied in the IOT devices, and inner parameters involved in these methods cannot be updated on the IOT devices, thereby resulting in incorrect tracking results. Specifically, some hand detection algorithms rely heavily on object detection results to update tracking parameters, and suffer from unexpected tracking losses due to intermittent detection failures in IOT applications. In some situations, although meaningful hand gestures are captured by IOT cameras, object detection losses tend to occur and compromise an overall accuracy level of gesture tracking and recognition. It would be beneficial to have an efficient and effective mechanism to track and recognize hand gestures in various user applications (particularly, in the IOT devices and applications).
SUMMARY
[0004] Various embodiments of this application are directed to methods, systems, devices, non-transitory computer-readable media for recognizing hand gestures, which enables user interaction that assists a user’s action on target objects, provides user inputs, and controls an electronic device in a variety of user applications (e.g., an extended reality application, an IOT application). Such computer-vision-based hand gesture recognition includes hand detection, hand tracking, and gesture classification. Hand detection is implemented based on a plurality of images, and temporal information is considered during hand detection to reduce errors caused by occlusion, motion blurriness, and viewpoint variation. Results of hand detection are further compensated by hand tracking in which a unique identification is used to track each of a set of hand detection results as a hand moves with a sequence of image frames in a video. Hand tracking automatically identifies hand gestures and interprets them to one or more output hand gestures and related trajectories with high accuracy.
[0005] In some embodiments, hand tracking and gesture recognition are implemented on resource-contained IOT devices. Limited computing resources on IOT devices prohibit use of sophisticated hand tracking algorithms. Hand tracking algorithms applied in the IOT devices receive low-resolution and low-quality input images captured by low-end cameras on IOT devices. Hand gestures are often captured in these input images from varying distances. A lightweight appearance embedding network is implemented based on the feature pyramid structure for object tracking, which efficiently extracts multi-scale features. A track-detection mismatch detection method is applied to improve a tracking accuracy when the appearance embedding network generates false predictions. Additionally, a motion-compensated detection method is used to reduce tracking losses when intermittent detection failures occur. [0006] Specifically, in one aspect, a method is implemented by an electronic device for processing images, recognizing hand gestures, and enabling user interaction. The method includes obtaining a sequence of image frames including a current image frame and identifying a bounding box capturing a hand gesture in the current image frame. The method further includes cropping the current image frame to generate a region of interest (ROI) image based on the bounding box and determining a similarity level of the ROI image with a set of image frames. The set of image frames are cropped from a subset of the sequence of image frames including a first number of image frames that precede the current image frame. The method further includes classifying the hand gesture in the ROI image to an output hand gesture based on the similarity level of the ROI image.
[0007] In some embodiments, classifying the hand gesture in the ROI image further includes classifying the hand gesture in the ROI image to a preliminary hand gesture using a classifier neural network and classifying a corresponding hand gesture in each of the set of image frames to a preceding hand gesture using the classifier neural network. Further, classifying the hand gesture in the ROI image further includes determining a dominant hand gesture from preceding hand gestures of the set of image frames; in accordance with a determination that the similarity level of the ROI image satisfies a similarity requirement, selecting the preliminary hand gesture as the output hand gesture; and in accordance with a determination that the similarity level of the ROI image does not satisfy the similarity requirement, selecting the dominant hand gesture as the output hand gesture.
[0008] In some situations, the preliminary hand gesture is identical to more than a threshold number of corresponding hand gestures in the set of image frames and used as the output hand gesture. In some situations, the preliminary hand gesture is different from at least a threshold number of corresponding hand gestures in the set of image frames and not used as the output hand gesture. In some situations, the preliminary hand gesture is different from each and every corresponding hand gesture in the set of image frames and not used as the output hand gesture.
[0009] In another aspect, some implementations include an electronic device that includes one or more processors and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform any of the above methods.
[0010] In yet another aspect, some implementations include a non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform any of the above methods.
[0011] These illustrative embodiments and implementations are mentioned not to limit or define the disclosure, but to provide examples to aid understanding thereof. Additional embodiments are discussed in the Detailed Description, and further description is provided there. BRIEF DESCRIPTION OF THE DRAWINGS
[0012] For a better understanding of the various described implementations, reference should be made to the Detailed Description below, in conjunction with the following drawings in which like reference numerals refer to corresponding parts throughout the figures.
[0013] Figure 1 is an example data processing environment having one or more servers communicatively coupled to one or more client devices, in accordance with some embodiments.
[0014] Figure 2 is a block diagram illustrating an electronic device configured to process content data (e.g., image data), in accordance with some embodiments.
[0015] Figure 3 is an example data processing environment for training and applying a neural network-based data processing model for processing visual and/or audio data, in accordance with some embodiments.
[0016] Figure 4A is an example neural network applied to process content data in an NN-based data processing model, in accordance with some embodiments, and Figure 4B is an example node in the neural network, in accordance with some embodiments.
[0017] Figure 5 is a flow diagram of an example hand gesture recognition process implemented by an electronic device (e.g., an IOT device, an HMD), in accordance with some embodiments.
[0018] Figures 6A-6D are temporal diagrams of four example sequences of preliminary hand gestures that correspond to a sequence region of interest (ROI) images cropped from a sequence of image frames, in accordance with some embodiments.
[0019] Figures 7A and 7B are temporal diagrams of a sequence of preliminary hand gestures determined by a classifier neural network and a sequence of output hand gestures outputted by a tracker, in accordance with some embodiments, respectively.
[0020] Figure 8 is a block diagram of an example appearance embedding network using a feature pyramid structure, in accordance with some embodiments.
[0021] Figure 9 is a flow diagram of another example hand gesture recognition process, in accordance with some embodiments.
[0022] Figure 10 is a flow diagram of an example image processing method for recognizing hand gestures, in accordance with some embodiments.
[0023] Like reference numerals refer to corresponding parts throughout the several views of the drawings. DETAILED DESCRIPTION
[0024] Reference will now be made in detail to specific embodiments, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous non-limiting specific details are set forth in order to assist in understanding the subject matter presented herein. But it will be apparent to one of ordinary skill in the art that various alternatives may be used without departing from the scope of claims and the subject matter may be practiced without these specific details. For example, it will be apparent to one of ordinary skill in the art that the subject matter presented herein can be implemented on many types of electronic devices with digital video capabilities.
[0025] Various embodiments of this application are directed to methods, systems, devices, non-transitory computer-readable media for recognizing hand gestures including hand detection, hand tracking, and gesture classification on an electronic device (e.g., an IOT device). Hand gesture recognition (HGR) is a natural method for human-computer interaction (HCI) with a goal of interpreting human hand gestures from images or videos via machine learning algorithms. Hand detection is a task for detecting the position of hands by utilizing mathematical models and algorithms. Hand detection is a pre-processing procedure for many human hand related computer vision tasks, such as hand pose estimation, hand gesture recognition, human activity analysis, and the like. Object tracking (e.g., hand tracking) is an application in which a program takes an initial set of object detections, develops a unique identification for each initial detection, and then tracks the detected objects as the objects move in a sequence of image frames in a video. In other words, object tracking is a task of automatically identifying objects in the video and interpreting them as a set of trajectories with high accuracy, and each entire trajectory of a unique object’s moving path is identified in object tracking.
[0026] In some embodiments, a complex learning system is represented by a neural network model for recognizing hand gestures by hand detection, hand tracking, and gesture classification. The neural network model is trained in an end-to-end manner, and called an end-to-end model. This concept originates from neural networks and machine or deep learning, where a structure of the end-to-end model (e.g., a feature pyramid network in Figure 8) reuses multi-scale feature maps from different layers in a forward pass. In some embodiments, this structure reuses higher-resolution maps of a feature hierarchy to improve detection for small objects.
[0027] Figure 1 is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, in accordance with some embodiments. The one or more client devices 104 may be, for example, laptop computers 104A, tablet computers 104B, mobile phones 104C, or intelligent, multi-sensing, network-connected IOT devices. In some implementations, the one or more client devices 104 include a head-mounted display (HMD) 104D configured to render extended reality content. Each client device 104 can collect data or user inputs, executes user applications, and present outputs on its user interface. The collected data or user inputs can be processed locally at the client device 104 and/or remotely by the server(s) 102. The one or more servers 102 provides system data (e.g., boot files, operating system images, and user applications) to the client devices 104, and in some embodiments, processes the data and user inputs received from the client device(s) 104 when the user applications are executed on the client devices 104. In some embodiments, the data processing environment 100 further includes a storage 106 for storing data related to the servers 102, client devices 104, and applications executed on the client devices 104. For example, storage 106 may store video content (including visual and audio content), static visual content, and/or inertial sensor data.
[0028] The one or more servers 102 can enable real-time data communication with the client devices 104 that are remote from each other or from the one or more servers 102. Further, in some embodiments, the one or more servers 102 can implement data processing tasks that cannot be or are preferably not completed locally by the client devices 104. For example, the client devices 104 include a game console (e.g., formed by the HMD 104D) that executes an interactive online gaming application. The game console receives a user instruction and sends it to a game server 102 with user data. The game server 102 generates a stream of video data based on the user instruction and user data and providing the stream of video data for display on the game console and other client devices that are engaged in the same game session with the game console. In another example, the client devices 104 include a networked surveillance camera 104E and a mobile phone 104C. The networked surveillance camera 104E collects video data and streams the video data to a surveillance camera server 102 in real time. While the video data is optionally pre-processed on the surveillance camera 104E, the surveillance camera server 102 processes the video data to identify motion or audio events in the video data and share information of these events with the mobile phone 104C, thereby allowing a user of the mobile phone 104C to monitor the events occurring near the networked surveillance camera 104E in the real time and remotely. [0029] The one or more servers 102, one or more client devices 104, and storage 106 are communicatively coupled to each other via one or more communication networks 108, which are the medium used to provide communications links between these devices and computers connected together within the data processing environment 100. The one or more communication networks 108 may include connections, such as wire, wireless communication links, or fiber optic cables. Examples of the one or more communication networks 108 include local area networks (LAN), wide area networks (WAN) such as the Internet, or a combination thereof. The one or more communication networks 108 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol. A connection to the one or more communication networks 108 may be established either directly (e.g., using 3G/4G connectivity to a wireless carrier), or through a network interface 110 (e.g., a router, switch, gateway, hub, or an intelligent, dedicated whole-home control node), or through any combination thereof. As such, the one or more communication networks 108 can represent the Internet of a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another. At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational and other electronic systems that route data and messages.
[0030] IOT devices are networked electronic devices that are wirelessly coupled to, and configured to transmit data via, one or more communication networks 108. Examples of the IOT devices include, but are not limited to, a surveillance camera 104E, a smart television device, a drone, a smart speaker, toys, wearables/smart watches, and smart appliances. In some embodiments, an IOT device includes a camera, a microphone, or a sensor configured to capture video, audio, or sensor data, which are used to detect and recognize a user hand gesture.
[0031] The HMD 104D include one or more cameras (e.g., a visible light camera, a depth camera), a microphone, a speaker, one or more inertial sensors (e.g., gyroscope, accelerometer), and a display. The camera(s) and microphone are configured to capture video and audio data from a scene of the HMD 104D, while the one or more inertial sensors are configured to capture inertial sensor data. In some situations, the camera captures gestures of a user passing the IOT or wearing the HMD 104D. In some situations, the microphone records ambient sound, including the user’s voice commands. In some situations, in the HMD 104D, both video or static visual data captured by the visible light camera and the inertial sensor data measured by the one or more inertial sensors are applied to determine and predict device poses (i.e., device positions and orientations). The video, static image, audio, or inertial sensor data captured by the HMD 104D are processed by the HMD 104D, server(s) 102, or both to recognize the device poses. Alternatively, in some embodiments, both depth data (e.g., depth map and confidence map) captured by the depth camera and the inertial sensor data measured by the one or more inertial sensors are applied to determine and predict device poses. The depth and inertial sensor data captured by the HMD 104D are processed by the HMD 104D, server(s) 102, or both to recognize the device poses. The device poses are used to control the HMD 104D itself or interact with an application (e.g., a gaming application) executed by the HMD 104D. In some embodiments, the display of the HMD 104D displays a user interface. The recognized or predicted device poses are used to render virtual objects with high fidelity, and the user gestures captured by the camera are used to interact with visual content on the user interface.
[0032] In some embodiments, SLAM techniques are applied in the data processing environment 100 to process video data, static image data, or depth data captured by the HMD 104D with inertial sensor data. Device poses are recognized and predicted, and a scene in which the HMD 104D is located is mapped and updated. The SLAM techniques are optionally implemented by HMD 104D independently or by both of the server 102 and HMD 104D jointly.
[0033] Figure 2 is a block diagram illustrating an electronic device 200 configured to process content data (e.g., image data), in accordance with some embodiments. The electronic device 200 includes a client device 104 (e.g., an IOT device). The electronic device 200, typically, includes one or more processing units (CPUs) 202, one or more network interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset). In some embodiments, the electronic device 200 includes one or more input devices 210 that facilitate user input, such as a keyboard, a mouse, a voice-command input unit or microphone, a touch screen display, a touch-sensitive input pad, a gesture capturing camera, or other input buttons or controls. Furthermore, in some embodiments, the client device 104 of the electronic device 200 uses a microphone for voice recognition or a camera 260 for gesture recognition to supplement or replace the keyboard. In some embodiments, the client device 104 includes one or more optical cameras 260 (e.g., an RGB camera), scanners, or photo sensor units for capturing images, for example, of graphic serial codes printed on the electronic devices. In some embodiments, the electronic device 200 also includes one or more output devices 212 that enable presentation of user interfaces and display content, including one or more speakers and/or one or more visual displays.
[0034] Memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices. Memory 206, optionally, includes one or more storage devices remotely located from one or more processing units 202. Memory 206, or alternatively the non-volatile memory within memory 206, includes a non-transitory computer readable storage medium. In some embodiments, memory 206, or the non- transitory computer readable storage medium of memory 206, stores the following programs, modules, and data structures, or a subset or superset thereof:
• Operating system 214 including procedures for handling various basic system services and for performing hardware dependent tasks;
• Network communication module 216 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or storage 106) via one or more network interfaces 204 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;
• User interface module 218 for enabling presentation of information (e.g., a graphical user interface for application(s) 224, widgets, websites and web pages thereof, and/or games, audio and/or video content, text, etc.) at each client device 104 via one or more output devices 212 (e.g., displays, speakers, etc.);
• Input processing module 220 for detecting one or more user inputs or interactions from one of the one or more input devices 210 and interpreting the detected input or interaction;
• Web browser module 222 for navigating, requesting (e.g., via HTTP), and displaying websites and web pages thereof, including a web interface for logging into a user account associated with a client device 104 or another electronic device, controlling the client or electronic device if associated with the user account, and editing and reviewing settings and data that are associated with the user account;
• One or more user applications 224 for execution by the electronic device 200 (e.g., games, social network applications, smart home applications, and/or other web or non-web based applications for controlling another electronic device and reviewing data captured by such devices);
• Model training module 302 for receiving training data and establishing a data processing model for processing content data (e.g., video, image, audio, or textual data) to be collected or obtained by a client device 104;
• Data processing module 228 for processing content data using data processing models 250, thereby identifying information contained in the content data, matching the content data with other data, categorizing the content data, or synthesizing related content data; and
• One or more databases 240 for storing at least data including one or more of: o Device settings 242 including common device settings (e.g., service tier, device model, storage capacity, processing capabilities, communication capabilities, etc.) of the one or more servers 102 or client devices 104; o User account information 244 for the one or more user applications 224, e.g., user names, security questions, account history data, user preferences, and predefined account settings; o Network parameters 246 for the one or more communication networks 108, e.g., IP address, subnet mask, default gateway, DNS server and host name; o Training data 248 for training one or more data processing models 250; o Data processing model(s) 250 for processing content data (e.g., video, image, audio, or textual data) using deep learning techniques; o Pose data database 252 for storing pose data of the camera 260; and o Content data and results 254 that are obtained by and outputted to the client device 104 of the electronic device 200 , respectively, where the content data is processed by the data processing models 250 locally at the client device 104 or remotely at the server 102 to provide the associated results to be presented on client device 104, and include the candidate images.
[0035] Optionally, the one or more databases 240 are stored in one of the server 102, client device 104, and storage 106 of the electronic device 200 . Optionally, the one or more databases 240 are distributed in more than one of the server 102, client device 104, and storage 106 of the electronic device 200 . In some embodiments, more than one copy of the above data is stored at distinct devices, e.g., two copies of the data processing models 250 are stored at the server 102 and storage 106, respectively. [0036] In some embodiments, the data processing module 228 is associated with one of the user applications 224 to process the content data in response to a user instruction received from the user application 224. In some embodiments, the data processing module 228 includes a hand gesture recognition module 230 configured to detect preliminary hand gestures 512 (Figure 5) in a sequence of image frames and determine an output hand gesture 516 (Figure 5) of each image frame based on a similarity level between the respective image frame and a subset of image frames preceding the respective image frame. More details on hand gesture recognition are explained below with reference to Figures 5-10.
[0037] Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above. The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, modules or data structures, and thus various subsets of these modules may be combined or otherwise re-arranged in various embodiments. In some embodiments, memory 206, optionally, stores a subset of the modules and data structures identified above. Furthermore, memory 206, optionally, stores additional modules and data structures not described above.
[0038] Figure 3 is another example of a data processing system 300 for training and applying a neural network based (NN-based) data processing model 250 for processing content data (e.g., video, image, audio, or textual data), in accordance with some embodiments. The data processing system 300 includes a model training module 302 for establishing the data processing model 250 and a data processing module 228 for processing the content data using the data processing model 250. In some embodiments, both of the model training module 302 and the data processing module 228 are located on a client device 104 of the data processing system 300, while a training data source 304 distinct from the client device 104 provides training data 306 to the client device 104. The training data source 304 is optionally a server 102 or storage 106. Alternatively, in some embodiments, the model training module 302 and the data processing module 228 are both located on a server 102 of the data processing system 300. The training data source 304 providing the training data 306 is optionally the server 102 itself, another server 102, or the storage 106. Additionally, in some embodiments, the model training module 302 and the data processing module 228 are separately located on a server 102 and client device 104 (e.g., an IOT device), and the server 102 provides the trained data processing model 250 to the client device 104.
[0039] The model training module 302 includes one or more data pre-processing modules 308, a model training engine 310, and a loss control module 312. The data processing model 250 is trained according to the type of content data to be processed. The training data 306 is consistent with the type of the content data, so is a data pre-processing module 308 applied to process the training data 306 consistent with the type of the content data. For example, an image pre-processing module 308A is configured to process image training data 306 to a predefined image format, e.g., extract a region of interest (ROI) in each training image, and crop each training image to a predefined image size. Alternatively, an audio pre-processing module 308B is configured to process audio training data 306 to a predefined audio format, e.g., converting each training sequence to a frequency domain using a Fourier transform. The model training engine 310 receives pre-processed training data provided by the data pre-processing modules 308, further processes the pre-processed training data using an existing data processing model 250, and generates an output from each training data item. During this course, the loss control module 312 can monitor a loss function comparing the output associated with the respective training data item and a ground truth of the respective training data item. The model training engine 310 modifies the data processing model 250 to reduce the loss function, until the loss function satisfies a loss criterion (e.g., a comparison result of the loss function is minimized or reduced below a loss threshold). The modified data processing model 250 is provided to the data processing module 228 to process the content data.
[0040] In some embodiments, the model training module 302 offers supervised learning in which the training data is entirely labelled and includes a desired output for each training data item (also called the ground truth in some situations). Conversely, in some embodiments, the model training module 302 offers unsupervised learning in which the training data are not labelled. The model training module 302 is configured to identify previously undetected patterns in the training data without pre-existing labels and with no or little human supervision. Additionally, in some embodiments, the model training module 302 offers partially supervised learning in which the training data are partially labelled.
[0041] The data processing module 228 includes a data pre-processing module 314, a model-based processing module 316, and a data post-processing module 318. The data preprocessing modules 314 pre-processes the content data based on the type of the content data. Functions of the data pre-processing modules 314 are consistent with those of the preprocessing modules 308 and covert the content data to a predefined content format that is acceptable by inputs of the model-based processing module 316. Examples of the content data include one or more of the following: video, image, audio, textual, and other types of data. For example, each image is pre-processed to extract an ROI or cropped to a predefined image size, and an audio clip is pre-processed to convert to a frequency domain using a Fourier transform. In some situations, the content data includes two or more types, e.g., video data and textual data. The model -based processing module 316 applies the trained data processing model 250 provided by the model training module 302 to process the pre- processed content data. The model -based processing module 316 can also monitor an error indicator to determine whether the content data has been properly processed in the data processing module 228. In some embodiments, the processed content data is further processed by the data post-processing module 318 to present the processed content data in a preferred format or to provide other related information that can be derived from the processed content data.
[0042] Figure 4A is an exemplary neural network (NN) 400 applied to process content data in an NN-based data processing model 250, in accordance with some embodiments, and Figure 4B is an example of a node 420 in the neural network (NN) 400, in accordance with some embodiments. The data processing model 250 is established based on the neural network 400. A corresponding model-based processing module 316 applies the data processing model 250 including the neural network 400 to process content data that has been converted to a predefined content format. The neural network 400 includes a collection of nodes 420 that are connected by links 412. Each node 420 receives one or more node inputs and applies a propagation function to generate a node output from the node input(s). As the node output is provided via one or more links 412 to one or more other nodes 420, a weight w associated with each link 412 is applied to the node output. Likewise, the node input(s) can be combined based on corresponding weights wi, W2, W3, and W4 according to the propagation function. For example, the propagation function is a product of a non-linear activation function and a linear weighted combination of the node input(s).
[0043] The collection of nodes 420 is organized into one or more layers in the neural network 400. Optionally, the layer(s) may include a single layer acting as both an input layer and an output layer. Optionally, the layer(s) may include an input layer 402 for receiving inputs, an output layer 406 for providing outputs, and zero or more hidden layers 404 (e.g., 404A and 404B) between the input and output layers 402 and 406. A deep neural network has more than one hidden layer 404 between the input and output layers 402 and 406. In the neural network 400, each layer is only connected with its immediately preceding and/or immediately following layer. In some embodiments, a layer 402 or 404B is a fully connected layer because each node 420 in the layer 402 or 404B is connected to every node 420 in its immediately following layer. In some embodiments, one of the hidden layers 404 includes two or more nodes that are connected to the same node in its immediately following layer for down sampling or pooling the nodes 420 between these two layers. Particularly, max pooling uses a maximum value of the two or more nodes in the layer 404B for generating the node of the immediately following layer 406 connected to the two or more nodes.
[0044] In some embodiments, a convolutional neural network (CNN) is applied in a data processing model 250 to process content data (particularly, video and image data). The CNN employs convolution operations and belongs to a class of deep neural networks 400, i.e., a feedforward neural network that only moves data forward from the input layer 402 through the hidden layers to the output layer 406. The hidden layer(s) of the CNN can be convolutional layers convolving with multiplication or dot product. Each node in a convolutional layer receives inputs from a receptive area associated with a previous layer (e.g., five nodes), and the receptive area is smaller than the entire previous layer and may vary based on a location of the convolutional layer in the convolutional neural network. Video or image data is pre-processed to a predefined video/image format corresponding to the inputs of the CNN. The pre-processed video or image data is abstracted by each layer of the CNN to a respective feature map. By these means, video and image data can be processed by the CNN for video and image recognition, classification, analysis, imprinting, or synthesis.

[0045] Alternatively and additionally, in some embodiments, a recurrent neural network (RNN) is applied in the data processing model 250 to process content data (particularly, textual and audio data). Nodes in successive layers of the RNN follow a temporal sequence, such that the RNN exhibits a temporal dynamic behavior. For example, each node 420 of the RNN has a time-varying real-valued activation. Examples of the RNN include, but are not limited to, a long short-term memory (LSTM) network, a fully recurrent network, an Elman network, a Jordan network, a Hopfield network, a bidirectional associative memory (BAM) network, an echo state network, an independent RNN (IndRNN), a recursive neural network, and a neural history compressor. In some embodiments, the RNN can be used for handwriting or speech recognition. It is noted that in some embodiments, two or more types of content data are processed by the data processing module 228, and two or more types of neural networks (e.g., both CNN and RNN) are applied to process the content data jointly.
[0046] The training process is a process for calibrating all of the weights w for each layer of the learning model using a training data set which is provided in the input layer 402. The training process typically includes two steps, forward propagation and backward propagation, which are repeated multiple times until a predefined convergence condition is satisfied. In the forward propagation, the set of weights for different layers are applied to the input data and intermediate results from the previous layers. In the backward propagation, a margin of error of the output (e.g., a loss function) is measured, and the weights are adjusted accordingly to decrease the error. The activation function is optionally linear, rectified linear unit, sigmoid, hyperbolic tangent, or of other types. In some embodiments, a network bias term b is added to the sum of the weighted outputs from the previous layer before the activation function is applied. The network bias b provides a perturbation that helps the NN 400 avoid overfitting the training data. The result of the training includes the network bias parameter b for each layer.
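As a rough illustration of the forward/backward cycle described here, the sketch below trains a single-layer model with a mean-squared-error loss; the layer size, learning rate, and convergence threshold are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 4))                            # training inputs
y = (X.sum(axis=1, keepdims=True) > 0).astype(float)     # toy labels

w = rng.normal(size=(4, 1))
b = 0.0
lr = 0.1

for step in range(1000):
    # Forward propagation: weighted sum plus bias, then sigmoid activation.
    z = X @ w + b
    out = 1.0 / (1.0 + np.exp(-z))
    loss = np.mean((out - y) ** 2)                       # margin of error of the output
    if loss < 1e-3:                                      # predefined convergence condition
        break
    # Backward propagation: adjust weights and bias to decrease the error.
    grad_out = 2 * (out - y) / len(X)
    grad_z = grad_out * out * (1 - out)
    w -= lr * (X.T @ grad_z)
    b -= lr * grad_z.sum()
```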
[0047] Figure 5 is a flow diagram of an example hand gesture recognition process 500 implemented by an electronic device 200 (e.g., an IOT device, an HMD 104D), in accordance with some embodiments. The electronic device 200 receives a sequence of image frames 502 including a current image frame 502C. A detector 504 receives the sequence of image frames 502, generates a feature map from each image frame 502, and decodes a bounding box 520 from the feature map. The bounding box 520 identifies a region of interest (ROI) including a hand. Each image frame 502 is cropped (506) to generate a respective ROI image 508 based on the bounding box 520. Each respective ROI image 508 is further processed by a classifier neural network 510, and the classifier neural network 510 identifies a hand gesture in the respective ROI image 508 as a respective preliminary hand gesture 512 including one of a plurality of predefined hand gestures, e.g., a fourth predefined hand gesture. An example of the classifier neural network 510 is MobileNetV2. Specifically, the current image frame 502C is also cropped to generate an ROI image 508C from which a predefined hand gesture is identified as the preliminary hand gesture 512C by the classifier neural network 510.
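A high-level sketch of this pipeline, with the detector, classifier, and tracker treated as opaque callables, might look as follows; all function and parameter names are assumptions.

```python
def recognize_hand_gestures(frames, detector, classifier, tracker, crop_and_resize):
    """Run detection, ROI cropping, per-frame classification, and temporal tracking."""
    outputs = []
    for frame in frames:
        bbox = detector(frame)                             # bounding box decoded from the feature map
        roi = crop_and_resize(frame, bbox)                 # ROI image cropped from the frame
        preliminary = classifier(roi)                      # preliminary hand gesture for this ROI image
        outputs.append(tracker.update(roi, preliminary))   # output hand gesture after tracking
    return outputs
```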
[0048] Information of the identified preliminary hand gesture 512 in each image frame 502 is provided to a tracker 514. The tracker 514 tracks the identified preliminary hand gesture 512 among the sequence of ROI images 508 and determines an output hand gesture 516 for the current image frame 502C based on a similarity level of the ROI image 508C of the current image frame 502C with a set of image frames 518C. The set of image frames 518C are cropped from a subset of the sequence of image frames 502 including a first number (N1) of image frames that precede the current image frame 502C. The sequence of ROI images 508 includes the set of image frames 518C. In some embodiments, the output hand gesture 516 is identical to the predefined hand gesture of the preliminary hand gesture 512C identified for the current image frame 502C. In some embodiments, the output hand gesture 516 is distinct from the predefined hand gesture of the preliminary hand gesture 512C identified for the current image frame 502C. Given that the output hand gesture 516C is tracked based on the set of image frames 518C, such a hand gesture recognition process 500 considers recent temporal variations of the hand gesture (which may be caused by noise) and reduces errors caused by occlusion of gestures, motion blurriness, and viewpoint variation.

[0049] In some embodiments, the tracker 514 includes an appearance embedding network 522 and a hand gesture recognition module 524. The appearance embedding network 522 receives the respective ROI image 508 generated by cropping each image frame 502 based on the bounding box 520, and generates a respective embedding vector 526. The hand gesture recognition module 524 is coupled to the appearance embedding network 522 and configured to determine a similarity level of the current ROI image 508C with a set of ROI images that precede the current ROI image 508C based on the embedding vector 526 of each ROI image 508, and classifies the hand gesture in the current ROI image 508C to the output hand gesture 516 based on the similarity level.
[0050] In some situations, the appearance embedding network 522 and the hand gesture recognition module 524 form a lightweight tracker 514, allowing the hand gesture recognition process 500 to be implemented by a lightweight hand gesture recognition system that is particularly applicable to resource-constrained IOT devices. In an example, the appearance embedding network 522 is established based on a feature pyramid structure that extracts a multi-scale embedding vector 526 from each ROI image 508 to detect hands from various distances. In another example, the hand gesture recognition module 524 is based on a DeepSORT tracking pipeline and includes a plurality of adaptations to improve efficiency, accuracy, and robustness for hand gesture tracking on the IOT devices. Object tracking is a method of tracking detected objects throughout frames using their spatial and temporal features, and DeepSORT is a computer vision tracking algorithm for tracking objects (e.g., hands) while assigning an identifier to each object. DeepSORT introduces deep learning into a Simple Online Realtime Tracking (SORT) algorithm by adding an appearance descriptor to reduce identity switches, thereby making gesture tracking more efficient. Stated another way, DeepSORT applies a tracking algorithm that tracks objects based on both the velocity and motion of an object and the appearance of the object. By these means, the hand gesture recognition process 500 detects, tracks, and classifies hand gestures in real time using processors of the IOT devices. In an example, a MediaTek (MT) 9652 processor is applied in an IOT device to recognize an output hand gesture 516 within a total inference time of 33 milliseconds.

[0051] Figures 6A-6D are temporal diagrams of four example sequences of preliminary hand gestures 512-1, 512-2, 512-3, and 512-4 that correspond to a sequence of ROI images 508 cropped from a sequence of image frames 502, in accordance with some embodiments, respectively. An electronic device 200 (e.g., an IOT device, an HMD 104D) receives a sequence of image frames 502 including a current image frame 502C. A detector 504 receives the sequence of image frames 502, generates a feature map from each image frame 502, and decodes a bounding box 520 from the feature map. The bounding box 520 identifies an ROI including a hand. Each image frame 502 is cropped to generate a respective ROI image 508 based on the bounding box 520. Each respective ROI image 508 is further processed by a classifier neural network 510, and the classifier neural network 510 identifies a hand gesture in the respective ROI image 508 as a preliminary hand gesture 512 including one of a plurality of predefined hand gestures. Specifically, a current image frame 502C is cropped to generate a current ROI image 508C from which a preliminary hand gesture 512C is identified by the classifier neural network 510. Referring to Figures 6A-6D, each sequence of ROI images 508 includes a total number (N) of ROI images that are cropped from the total number of image frames 502 based on respective bounding boxes 520. In each sequence of ROI images 508, the current ROI image 508C is the 16th, 18th, (N-7)-th, or (N-4)-th ROI image in the respective sequence of ROI images 508.
[0052] For simplicity, each ROI image 508 is classified to a first predefined hand gesture T0 or a second predefined hand gesture T1 by the classifier neural network 510. The preliminary hand gesture 512C identified for the current image frame 502C is one of the first and second predefined hand gestures T0 and T1. For each current ROI image 508C, the electronic device 200 determines a similarity level of the current ROI image 508C with a set of image frames 602. The set of image frames 602 includes a first number (N1) of ROI images, e.g., which immediately precede the current ROI image 508C within the corresponding sequence of ROI images 508. For each current ROI image 508C, the set of image frames 602 are cropped from a subset of the sequence of image frames 502 including the first number (N1) of image frames 502 that precede the current image frame 502C. In some embodiments, prior to determining the similarity level of the ROI image, the electronic device 200 determines the first number (N1) based on a refresh rate of a camera that captures the sequence of image frames 502 and an image quality of the sequence of image frames 502. Based on the similarity level of the current ROI image 508C with the set of image frames 602, the hand gesture in the current ROI image 508C is reclassified to an output hand gesture 516 that may be identical to or distinct from the preliminary hand gesture 512C identified directly by the classifier neural network 510.
[0053] According to a similarity requirement, if the preliminary hand gesture 512C identified for the current ROI image 508C by the classifier neural network 510 is consistent with a dominant hand gesture of the set of image frames 602, the preliminary hand gesture 512C identified for the current ROI image 508C is used as the output hand gesture 516C. Conversely, if the preliminary hand gesture 512C identified for the current ROI image 508C is not consistent with the dominant hand gesture of the set of image frames 602, the preliminary hand gesture 512C identified for the current ROI image 508C is not used as the output hand gesture 516, and the dominant hand gesture of the set of image frames 602 is outputted as the output hand gesture 516 of the current ROI image 508C. Stated another way, in some embodiments, the dominant hand gesture of the set of image frames 602 is adopted and used as the output hand gesture 516 of the current ROI image 508C, independently of the preliminary hand gesture 512C. The electronic device 200 determines a dominant hand gesture from preceding hand gestures (e.g., preliminary hand gestures 512) of the set of image frames 602, and selects the dominant hand gesture as the output hand gesture of the current ROI image 508C.
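A minimal sketch of this decision rule, assuming the preceding preliminary gestures are available as labels, might look as follows (function and variable names are assumptions):

```python
from collections import Counter

def decide_output_gesture(preliminary, preceding):
    """Select the output hand gesture for the current ROI image.

    If the classifier's preliminary gesture agrees with the dominant gesture of the
    preceding set of frames, keep it; otherwise output the dominant gesture instead.
    """
    dominant, _ = Counter(preceding).most_common(1)[0]   # dominant gesture of the preceding set
    return preliminary if preliminary == dominant else dominant

# Example: a single noisy "V" detection is overridden by the dominant "OK" gesture.
print(decide_output_gesture("V", ["OK"] * 13 + ["V"] * 2))   # -> "OK"
```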
[0054] Referring to Figure 6A, in an example, the sequence of ROI images 508 includes the total number (N) of ROI images, and the current ROI image 508C is the 16th ROI image in the sequence of ROI images 508. A similarity level is determined between the current ROI image 508C and the set of image frames 602-1 including 15 successive image frames that immediately precede the current ROI image 508C. The classifier neural network 510 identifies the second predefined hand gesture T1 in the current image frame 502C. The preceding hand gestures of the set of image frames 602-1 include 2 first predefined hand gestures T0 and 13 second predefined hand gestures T1 in the 15 successive image frames of the set of image frames 602-1. The second predefined hand gesture T1 is dominant in the set of image frames 602-1, and used as a dominant hand gesture of the set of image frames 602-1. For the sequence of ROI images 508, the second predefined hand gesture T1 of the current image frame 502C is consistent with the second predefined hand gesture T1 dominant in the set of image frames 602-1, and the second predefined hand gesture T1 is outputted as the output hand gesture 516 of the current image frame 502C.
[0055] Referring to Figure 6B, in another example, the current ROI image 508C is the 18th ROI image in the sequence of ROI images 508. A similarity level is determined between the current ROI image 508C and the set of image frames 602-2 including 15 successive image frames that immediately precede the current ROI image 508C. The classifier neural network 510 identifies the first predefined hand gesture T0 in the current image frame 502C. The second predefined hand gesture T1 is dominant in the set of image frames 602-2 and used as a dominant hand gesture of the set of image frames 602-2. For the sequence of ROI images 508, the first predefined hand gesture T0 of the current image frame 502C is not consistent with the second predefined hand gesture T1 dominant in the set of image frames 602-2, and the second predefined hand gesture T1 is outputted as the output hand gesture 516 of the current image frame 502C.
[0056] Further, referring to Figure 6C, in some embodiments, the current ROI image 508C is the (N-7)-th ROI image in the sequence of ROI images 508. The classifier neural network 510 identifies the second predefined hand gesture T1 in the current image frame 502C. The first predefined hand gesture T0 is dominant in the set of image frames 602-3 and used as a dominant hand gesture of the set of image frames 602-3. For the sequence of ROI images 508, the second predefined hand gesture T1 of the current image frame 502C is not consistent with the first predefined hand gesture T0 dominant in the set of image frames 602-3, and the first predefined hand gesture T0 is outputted as the output hand gesture 516 of the current image frame 502C. Additionally, referring to Figure 6D, in some embodiments, the current ROI image 508C is the (N-4)-th ROI image in the sequence of ROI images 508. The classifier neural network 510 identifies the second predefined hand gesture T1 in the current image frame 502C. The second predefined hand gesture T1 is dominant in the set of image frames 602-4 and used as a dominant hand gesture of the set of image frames 602-4. For the sequence of ROI images 508, the second predefined hand gesture T1 of the current image frame 502C is consistent with the second predefined hand gesture T1 dominant in the set of image frames 602-4, and the second predefined hand gesture T1 is outputted as the output hand gesture 516 of the current image frame 502C.
[0057] Stated another way, in some embodiments (512-1 and 512-4), the preliminary hand gesture 512C of the current ROI image 508C identified by the classifier neural network 510 is identical to more than a threshold number of corresponding hand gestures (e.g., 8 out of 15 hand gestures) in the set of image frames 602 and used as the output hand gesture 516. In some embodiments (512-2 and 512-3), the preliminary hand gesture 512C of the current ROI image 508C identified by the classifier neural network 510 is different from at least a threshold number of corresponding hand gestures (e.g., 8 out of 15 hand gestures) in the set of image frames 602 and not used as the output hand gesture 516. In some embodiments not shown, the preliminary hand gesture 512C of the current ROI image 508C identified by the classifier neural network 510 is different from each and every corresponding hand gesture in the set of image frames 602 and not used as the output hand gesture.
[0058] In some embodiments, the similarity level between the current ROI image 508C and the set of image frames 602 is not determined based on their corresponding hand gestures identified by the classifier neural network 510. Instead, embedding vectors 526 are extracted from the current ROI image 508C and the set of image frames 602 and used to determine the similarity level. Specifically, in some embodiments, the similarity level includes a cosine similarity value. The electronic device 200 determines a current embedding vector 526C of the current ROI image 508C and a prior embedding vector 526P of the set of image frames 602. The cosine similarity value is determined based on the current embedding vector 526C and the prior embedding vector 526P. Further, in some embodiments, for each of the set of image frames 602, a respective embedding vector 526 is determined for the respective one of the set of image frames 602. The respective embedding vectors 526 of the set of image frames 602 are combined to generate the prior embedding vector 526P of the set of image frames 602, e.g., in a weighted manner. In an example, weights of the respective embedding vectors 526 of the set of image frames 602 decrease with a temporal distance from the current image frame 502C. In another example, the weights are equal, so that the prior embedding vector 526P of the set of image frames 602 is an average or mean of the respective embedding vectors 526 of the set of image frames 602.
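A minimal sketch of this similarity computation, assuming equal weights by default, is shown below (function names are assumptions):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def prior_embedding(embeddings, weights=None):
    """Combine embeddings of the preceding ROI images into a single prior embedding vector.

    With equal weights this is the mean embedding; weights may instead decrease with
    temporal distance from the current frame.
    """
    stacked = np.stack(embeddings)
    if weights is None:
        return stacked.mean(axis=0)
    weights = np.asarray(weights, dtype=float)
    return (stacked * weights[:, None]).sum(axis=0) / weights.sum()

# Usage: similarity of the current embedding with the combined prior embedding.
# similarity = cosine_similarity(current_embedding, prior_embedding(preceding_embeddings))
```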
[0059] In some embodiments, the electronic device 200 resizes the current ROI image 508C to a predefined image size and generates the current embedding vector 526C using a sequence of neural networks including a convolutional neural network, one or more pyramid network layers, and a fully connected layer. Further, in some embodiments, the electronic device 200 resizes each of the set of image frames 602 and generates the respective embedding vector 526 using the sequence of neural networks. Additionally, in some embodiments, for at least one of the set of image frames 602, the electronic device 200 predicts a bounding box 520 in a corresponding image frame 502 based on locations of bounding boxes 520 in two or more image frames 502 captured prior to the corresponding image frame 502, e.g., using equation (1). The at least one of the set of image frames 602 is generated by cropping the corresponding image frame 502 based on the predicted bounding box 520.
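As a sketch, the bounding-box prediction of equation (1), described later with reference to Figure 7A, amounts to a linear extrapolation of the two most recent box positions (the tuple layout is an assumption):

```python
def predict_bbox(prev1, prev2):
    """Predict the bounding box at time t from the boxes at t-1 and t-2: p_t = 2*p_{t-1} - p_{t-2}."""
    return tuple(2 * a - b for a, b in zip(prev1, prev2))

# Example with boxes given as (x, y, width, height):
predicted = predict_bbox((110, 82, 60, 60), (100, 80, 60, 60))   # -> (120, 84, 60, 60)
```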
[0060] Figures 7A and 7B are temporal diagrams of a sequence of preliminary hand gestures 512 determined by a classifier neural network 510 and a sequence of output hand gestures 516 outputted by a tracker 514, in accordance with some embodiments, respectively. As explained above, an electronic device 200 (e.g., an IOT device, an HMD 104D) receives a sequence of image frames 502 including a current image frame 502C. A detector 504 receives the sequence of image frames 502, generates a feature map from each image frame 502, and decodes a bounding box 520 from the feature map. The bounding box 520 identifies an ROI including a hand. Each image frame 502 is cropped to generate a respective ROI image 508 based on the bounding box 520. Each respective ROI image 508 is further processed by a classifier neural network 510, and the classifier neural network 510 classifies a preliminary hand gesture 512 in the respective ROI image 508 to one of a plurality of predefined hand gestures. The sequence of ROI images 508 includes a total number (N) of ROI images that are cropped from the total number of image frames 502 based on respective bounding boxes 520. A sequence of preliminary hand gestures 512 corresponds to the sequence of ROI images 508 and has the total number (N) of successive preliminary hand gestures 512, and each preliminary hand gesture 512 is determined from a respective ROI image 508 without any input from other ROI images 508.
[0061] For simplicity, each ROI image 508 is classified to a first predefined hand gesture T0 or a second predefined hand gesture T1 by the classifier neural network 510. The preliminary hand gesture 512C identified for the current image frame 502C is one of the first and second predefined hand gestures T0 and T1. In some embodiments, there are three or more predefined hand gestures, and each ROI image 508 is classified to one of the three or more predefined hand gestures. Examples of predefined hand gestures include, but are not limited to, waving, saluting, handshakes, pointing, and a thumbs up.
[0062] Referring to Figure 7A, the first 10 preliminary hand gestures 512 of the first 10 ROI images 508 are consistently classified as the second predefined hand gesture T1. Starting from the 11th ROI image 508, the first predefined hand gesture T0 starts to be detected by the classifier neural network 510, and gradually appears in the sequence of hand gestures 512 with an increasing frequency. From the (N-6)-th ROI image 508, the first predefined hand gesture T0 entirely replaces the second predefined hand gesture T1 and stabilizes in the sequence of preliminary hand gestures 512.
[0063] Referring to Figure 7B, each output hand gesture 516 determined in real time by the tracker 514 is not entirely consistent with the preliminary hand gesture 512 detected by the classifier neural network 510 during a transition stage, e.g., between the 11th and (N-6)-th ROI images. During the transition stage, the sequence of preliminary hand gestures 512 varies between the first and second predefined hand gestures T0 and T1, and the sequence of output hand gestures 516 stabilizes at the second predefined hand gesture T1 and switches to the first predefined hand gesture T0 at the (N-8)-th ROI image when the first predefined hand gesture T0 starts to dominate in the set of image frames 602 associated with the (N-8)-th ROI image. By these means, the sequence of output hand gestures 516 takes into account a recent history of each ROI image 508 and automatically eliminates gesture detection noise caused by random events (e.g., occlusion, low resolution images, failures to detect bounding boxes 520) that may compromise accuracy of a result of the classifier neural network 510.

[0064] In some embodiments, the plurality of predefined hand gestures are limited to the first and second predefined hand gestures T0 and T1. For each current ROI image 508C, each hand gesture in the corresponding set of image frames 602 is classified to a respective one of the plurality of predefined hand gestures by the classifier neural network 510. In some embodiments, one of the plurality of predefined hand gestures dominates in the corresponding set of image frames 602 if the one of the plurality of predefined hand gestures appears in the set of image frames 602 more frequently than any other predefined hand gesture. Alternatively, in some embodiments, one of the plurality of predefined hand gestures dominates in the corresponding set of image frames 602 if the one of the plurality of predefined hand gestures appears in a second number of image frames 602 and the second number is greater than a threshold dominant number (e.g., 8 if the set of image frames 602 includes 15 image frames). In some embodiments, the dominant hand gesture of the set of image frames 602 is adopted and used as the output hand gesture 516 of the current ROI image 508C, independently of the preliminary hand gesture 512C. The electronic device 200 determines the dominant hand gesture from preceding hand gestures of the set of image frames 602, and selects the dominant hand gesture as the output hand gesture of the current ROI image 508C.
[0065] Figure 8 is a block diagram of an example appearance embedding network 522 using a feature pyramid structure 802, in accordance with some embodiments. This appearance embedding network 522 is applied to generate an embedding vector 526 from each ROI image 508 generated from a respective one of a sequence of image frames 502. In some embodiments, the appearance embedding network 522 includes the feature pyramid structure 802. Introduction of the pyramid structure 802 improves embedding accuracy of the appearance embedding network 522 by combining features from different scales.

[0066] An ROI image 508 is generated by cropping one of a sequence of image frames 502 based on a bounding box 520. A detector 504 in Figure 5 receives the sequence of image frames 502, generates a feature map from each image frame 502, and decodes the bounding box 520 from the feature map. The ROI image 508 is generated by cropping a corresponding image frame 502 according to the bounding box 520, and a coordinate system of the ROI image 508 is determined by the bounding box 520. In some embodiments, an image frame 502 cropped based on the bounding box 520 has a size of W×H and is resized (806) (e.g., to the size of 90x90 pixels, which is smaller than a common image size of 224x224 pixels) to generate the input image 804. In an example, each of the ROI image 508 and the input image 804 includes a color image having 3 channels. A convolutional layer 808 is applied to extract an image feature map 810 from the input image 804. The image feature map 810 is further processed by a plurality of feature pyramid network layers (e.g., 802A, 802B, and 802C) in the feature pyramid structure 802 to generate a pyramid feature map 812. The pyramid feature map 812 is further processed by a fully connected layer 814 to generate an embedding vector 526 for determining an output hand gesture 516 in each ROI image 508 based on gesture tracking.
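A rough PyTorch sketch of such an embedding network is given below; the channel counts, number of pyramid levels, and embedding dimension are assumptions, not values from the disclosure.

```python
import torch
import torch.nn as nn

class AppearanceEmbeddingNet(nn.Module):
    """Convolutional stem, feature pyramid stages, and a fully connected embedding head."""

    def __init__(self, embed_dim=128):
        super().__init__()
        self.stem = nn.Sequential(                     # extracts the initial image feature map
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        def stage(cin, cout):                          # halves spatial size, doubles channels
            return nn.Sequential(
                nn.Conv2d(cin, cout, kernel_size=3, stride=2, padding=1),
                nn.ReLU(inplace=True),
            )
        self.pyramid = nn.ModuleList([stage(16, 32), stage(32, 64), stage(64, 128)])
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(16 + 32 + 64 + 128, embed_dim)

    def forward(self, x):
        feats = []
        x = self.stem(x)
        feats.append(self.pool(x).flatten(1))
        for stage in self.pyramid:
            x = stage(x)
            feats.append(self.pool(x).flatten(1))      # collect features from every scale
        return self.fc(torch.cat(feats, dim=1))        # multi-scale embedding vector

# Usage: embed a 90x90 RGB ROI image.
net = AppearanceEmbeddingNet()
embedding = net(torch.randn(1, 3, 90, 90))             # -> shape (1, 128)
```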
[0067] Given the resolution of the current ROI image 508, computational resources required to process the input image 804 in the appearance embedding network 522 are reduced compared with those required for images having the common image resolution. The computational resources are measured by the number of floating-point operations (FLOPs), and the total number of FLOPs is reduced by a factor of six for the input image 804 (e.g., having 90x90 pixels) compared with the images having the common image resolution (e.g., 224x224 pixels). In some embodiments, the appearance embedding network 522 is deployed on a MediaTek (MT) 9652 processor with an estimated inference time of 30 ms.
[0068] In some embodiments, the appearance embedding network 522 uses a bottleneck structure in MobileNetV2 as a backbone of the appearance embedding network including neural network structures 802, 808, and 814. Further, in some embodiments, the bottleneck structure includes a 1x1 pointwise convolution layer followed by a depth-wise convolution layer of kernel 3x3 and ends with a 1x1 pointwise convolution layer. In some embodiments, the bottleneck structure includes a residual block skip connection with a plus operation. Alternatively, the bottleneck structure includes a depth-wise layer with stride 2 to downgrade spatial dimensions by half. For example, during the course of generating the pyramid feature map 812, the feature pyramid network layers 802A, 802B, and 802C successively scale down a size of the image feature map 810 by 2x2, while successively doubling a channel number of the image feature map 810. Additionally, the 1x1 convolution layers tremendously reduce computational complexity and lose little performance compared to larger kernel sizes.

[0069] Figure 9 is a flow diagram of another example hand gesture recognition process 900, in accordance with some embodiments. The hand gesture recognition process 900 is implemented by a hand gesture recognition module 524 in Figure 5 based on a DeepSORT algorithm that introduces a two-stage matching mechanism, i.e., matching cascade 902 and intersection over union (IoU) matching 904. In a matching cascade module 902, embedding vectors 526 of an appearance embedding network 522 are employed as a data association metric for associating detections with tracks, and dramatically increase a tracking stability. An IoU matching module 904 is coupled to the matching cascade module 902, and configured to receive information of an unmatched detection in which a preliminary hand gesture 512C of a current ROI image 508C does not match a set of confirmed hand gestures 512 recognized in a set of image frames 602 based on a similarity requirement. The similarity of the preliminary hand gesture 512C and the set of confirmed hand gestures 512 is determined based on the corresponding embedding vectors 526. The IoU matching module 904 continues to compare an embedding vector 526 of the current ROI image 508C with embedding vectors 526 of a set of tentative hand gestures 512' of the set of image frames 602 and determines the output hand gesture 516 based on a related similarity level. In some embodiments, the matching cascade module 902 applies hand gestures recognized in the set of image frames 502 with a first level of confidence, while the IoU matching module 904 applies hand gestures recognized in the set of image frames 502 with a second level of confidence that is reduced from the first level of confidence.
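A much-simplified sketch of the cost metrics behind this two-stage association is shown below; the use of Hungarian assignment, the box layout, and the function names are assumptions rather than the described modules themselves.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cosine_distance(a, b):
    """1 - cosine similarity between two appearance embedding vectors (matching cascade metric)."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def iou(box_a, box_b):
    """Intersection over union of two (x, y, w, h) boxes (second-stage metric)."""
    ax2, ay2 = box_a[0] + box_a[2], box_a[1] + box_a[3]
    bx2, by2 = box_b[0] + box_b[2], box_b[1] + box_b[3]
    iw = max(0.0, min(ax2, bx2) - max(box_a[0], box_b[0]))
    ih = max(0.0, min(ay2, by2) - max(box_a[1], box_b[1]))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def associate(cost_matrix):
    """Assign detections to tracks by minimizing the total cost; returns (track_idx, det_idx) pairs."""
    rows, cols = linear_sum_assignment(cost_matrix)
    return list(zip(rows, cols))

# Stage 1 (matching cascade): confirmed tracks vs. detections using cosine_distance costs.
# Stage 2 (IoU matching): remaining tentative tracks vs. unmatched detections using 1 - iou costs.
```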
[0070] In some embodiments, a correction logic 906 is added to remove a subset of tracking results (i.e., one or more output hand gestures 516) that have contradictory labels with the associated detections (i.e., preliminary hand gestures 512) for several consecutive image frames 502. A proposed mismatch detection routine assigns a property called "id_mismatch" 908 to record the occurrences of consecutive mismatches between a track and the corresponding detection associated with the track, i.e., between the preliminary hand gestures 512 and output hand gestures 516. If the track and detection belong to different categories, id_mismatch 908 increases by 1. The track will be deleted if id_mismatch 908 exceeds a threshold id_mismatch_max, i.e., id_mismatch > id_mismatch_max. The value of id_mismatch 908 is reset to zero when a track has the same label as the associated detection.
[0071] Referring to Figures 7A and 7B, a mismatching indicator id_mismatch 908 is generated by comparing the preliminary hand gesture 512 and the output hand gesture 516 for each ROI image 508. Each mismatch between the gestures 512 and 516 increases the mismatching indicator id_mismatch 908 by 1, and each match between the gestures 512 and 516 resets the mismatching indicator id_mismatch 908 to 0. In accordance with a determination that the mismatching indicator id_mismatch 908 is greater than the threshold id_mismatch_max, a corresponding output hand gesture 516 is deleted. For example, if the threshold id_mismatch_max is equal to 3, two output hand gestures 516 are deleted because the corresponding mismatching indicator id_mismatch 908 is equal to 4 and 5, respectively.
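A minimal sketch of this correction logic, assuming gesture labels for the track and detection are available each frame, could look like the following (function name is an assumption):

```python
def update_mismatch(track_label, det_label, id_mismatch, id_mismatch_max=3):
    """Update the consecutive-mismatch counter and report whether the track should be deleted."""
    if track_label == det_label:
        return 0, False                  # a match resets the counter
    id_mismatch += 1                      # each mismatch increments the counter
    return id_mismatch, id_mismatch > id_mismatch_max

# Example: four consecutive mismatches exceed a threshold of 3, so the track is deleted.
count, delete = 0, False
for _ in range(4):
    count, delete = update_mismatch("V", "OK", count)
print(count, delete)                      # -> 4 True
```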
[0072] In some situations, embedding vectors 526 generated by the appearance embedding network 522 cannot differentiate different gestures, and detections belonging to different gestures are associated with similar embedding vectors 526. The underlying reasons are twofold. First, the appearance embedding network 522 is trained with person re-identification (Re-ID) datasets in DeepSORT. The images in person Re-ID datasets are fundamentally different from hand gesture images. Second, the hand images are captured from various distances in practical applications. The scale ambiguities will decrease the accuracy of the appearance embedding network since one hand gesture captured at varied distances generates different appearance embeddings. The DeepSORT algorithm in Figure 9 prioritizes the appearance matching procedure in the two-stage matching mechanism. A detection may be associated with a track belonging to another category when they share similar appearance embeddings. The correction logic 906 is applied to compensate for this loss of accuracy.
[0073] Additionally, in some embodiments, a motion-compensated detection logic 910 is applied to reduce tracking losses when intermittent detection failures occur, i.e., when the preliminary hand gestures 512 are not generated. Specifically, when a track is not assigned with a detection (e.g., when a bounding box 520 is missing in an image frame 502), a pseudo detection is generated to update the parameters inside the track. The coordinates of the pseudo detection are interpolated from the previous two detections, i.e., pt = 2pt-1 - pt-2, where pt is the position of the detection at time t. The predicted detection allows the tracking algorithm to update parameters to reflect the object movements even when valid detection is unavailable. However, the motion-compensated detection logic 910 raises an unexpected outcome, i.e., existing tracks would last forever since pseudo detections never perish. A predefined life cycle Tmax is assigned for each generated detection providing a preliminary hand gesture 512 to solve the issue. Let Tij represent the life cycle of the i-th pseudo detection assigned to the j-th track; both the i-th detection and the j-th track are deleted when Tij > Tmax, where Tmax is the predefined life cycle for each pseudo detection. In some embodiments, deletion based on the predefined life cycle Tmax only occurs when pseudo detection lasts for a number of consecutive frames corresponding to the predefined life cycle Tmax.

[0074] In some embodiments, referring to Figure 7A, a bounding box 520 of an image frame 502 corresponding to the missing preliminary hand gesture 512A is missing, thereby making the classifier neural network 510 fail to output the preliminary hand gesture 512A. A position of the bounding box 520 is pt in the image frame 502 corresponding to the missing preliminary hand gesture 512A. Positions of bounding boxes 520 are pt-1 and pt-2 in the image frames 502 that immediately precede the image frame 502 corresponding to the missing preliminary hand gesture 512A and correspond to the preliminary hand gestures 512C and 512D, respectively. In an example, the position pt of the bounding box 520 in the image frame 502 corresponding to the missing preliminary hand gesture 512A is represented as follows: pt = 2pt-1 - pt-2 (1)
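A minimal sketch of this pseudo-detection logic, assuming simple (x, y, w, h) tuples for box positions and an assumed default life cycle, could be:

```python
class Track:
    """Tracks one hand; counts consecutive pseudo detections against a predefined life cycle T_max."""

    def __init__(self, t_max=15):
        self.t_max = t_max          # predefined life cycle for pseudo detections
        self.pseudo_count = 0       # consecutive frames covered only by pseudo detections
        self.history = []           # recent detection positions, e.g., (x, y, w, h) tuples

    def update(self, detection=None):
        """Update with a real detection (or None) and return False when the track should be deleted."""
        if detection is not None:
            self.pseudo_count = 0
        elif len(self.history) >= 2:
            p1, p2 = self.history[-1], self.history[-2]
            detection = tuple(2 * a - b for a, b in zip(p1, p2))  # pseudo detection: p_t = 2*p_{t-1} - p_{t-2}
            self.pseudo_count += 1
        else:
            return False            # no prior detections to extrapolate from
        self.history.append(detection)
        return self.pseudo_count <= self.t_max   # delete once the life cycle T_max is exceeded
```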
[0075] Under some circumstances, the classifier neural network 510 in Figure 5 fails to generate one or more preliminary hand gestures 512 (e.g., 512A in Figure 7A), e.g., because of an intermittent detection failure. The missing preliminary hand gesture 512A is used in a set of image frames 602 of each of a plurality of ROI images 508 to determine an output hand gesture 516 in the respective ROI image 508. In some embodiments, the set of image frames 602 of each ROI image 508 includes a first number (N1) of ROI images 508 immediately preceding the respective ROI image 508. The missing preliminary hand gesture 512A impacts determination of the output hand gesture 516 in each of the first number (N1) of successive ROI images 508 following the missing preliminary hand gesture 512A. Further, in some embodiments, based on the equation pt = 2pt-1 - pt-2, the missing preliminary hand gesture 512A also impacts two preliminary hand gestures 512B that immediately follow the missing preliminary hand gesture 512A, thereby impacting determination of the output hand gesture 516 in each of a second number (N2=N1+2) of successive ROI images 508 following the missing preliminary hand gesture 512A. The first number (N1) or the second number (N2) optionally defines the predefined life cycle Tmax assigned for each generated detection.
[0076] In some embodiments, an output hand gesture 516 of each ROI image 508 is generated based on preliminary hand gestures 512 of a set of image frames 602. The set of image frames 602 are captured before the respective ROI image 508, and do not correspond to successive image frames 502. Rather, the set of image frames 602 includes the first number (N1) of image frames that are sampled (e.g., uniformly) from a second subset of image frames captured by a camera immediately prior to each ROI image 508. The second subset of image frames includes a third number (N3) of image frames. A missing preliminary hand gesture 512A impacts determination of the output hand gesture 516 in each of a subset of the third number (N3) of successive ROI images 508 following the missing preliminary hand gesture 512A. Further, in some embodiments, based on the equation pt = 2pt-1 - pt-2, the missing preliminary hand gesture 512A also impacts two preliminary hand gestures 512B that immediately follow the missing preliminary hand gesture 512A, thereby impacting determination of the output hand gesture 516 in each of a subset of a fourth number (N4=N3+2) of successive ROI images 508 following the missing preliminary hand gesture 512A. The third number (N3) or the fourth number (N4) optionally defines the predefined life cycle Tmax assigned for each generated detection.
[0077] The hand gesture recognition process 900 improves accuracy of hand gesture recognition. An appearance embedding network 522 applied in the tracker 514 is based on a feature pyramid structure 802, and increases embedding accuracy by extracting features from different scales. Track-detection mismatch detection implemented by the tracker 514 improves tracking accuracy when the appearance embedding network generates false predictions. The motion-compensated detection logic 910 further reduces tracking losses when intermittent detection failures occur.
[0078] Various embodiments of this application not only apply a classifier neural network 510 to detect a preliminary hand gesture 512 in each image frame 502, but also track a temporal variation of the preliminary hand gesture 512 within a time window corresponding to a set of image frames 602 to determine an output hand gesture 516 for the respective image frame 502. First, gesture tracking effectively avoids false positives detected by the classifier neural network 510 in individual image frames due to mismatched features. For example, the classifier neural network 510 recognizes a “V” gesture in the second one of four consecutive frames and an “OK” gesture in the other three of the four consecutive frames. The “V” gesture in the second frame is recognized in error and can be properly fixed in accordance with a comparison with hand gestures detected for preceding frames 602 in the time window. Next, accuracy of hand gesture recognition is improved, and the hand gesture is tracked smoothly without any interruption among frames. This allows an application of dynamic gestures that requires continuous recognition among consecutive frames, and ensures that the application of dynamic gestures can be developed in a concise and easy manner. Additionally, one or more hand gestures may be missing in each of a subset of a sequence of image frames 502, particularly when multiple people holding multiple gestures appear in the sequence of image frames 502. The electronic device 200 can easily determine and fill the missing hand gesture(s) based on respective preceding image frames in the corresponding time windows. By these means, the hand gesture recognition process 900 allows robust multi-person hand tracking, allowing predefined hand gestures to be detected and tracked for different individuals.
[0079] Figure 10 is a flow diagram of an example image processing method 1000 for recognizing hand gestures, in accordance with some embodiments. For convenience, the method 1000 is described as being implemented by an electronic device 200 (e.g., an IOT device). Method 1000 is, optionally, governed by instructions that are stored in a non-transitory computer readable storage medium and that are executed by one or more processors of the computer system. Each of the operations shown in Figure 10 may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., instructions stored in a hand gesture recognition module 230 of memory 206 in Figure 2). The computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other nonvolatile memory device or devices. The instructions stored on the computer readable storage medium may include one or more of source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors. Some operations in method 1000 may be combined and/or the order of some operations may be changed.
[0080] The electronic device 200 obtains (1002) a sequence of image frames 502 including a current image frame 502C, and identifies (1004) a bounding box 520 capturing a hand gesture in the current image frame 502C. The current image frame 502C is cropped (1006) to generate an ROI image 508C based on the bounding box 520. The electronic device 200 determines (1008) a similarity level of the ROI image 508C with a set of image frames 602. The set of image frames 602 are cropped (1010) from a subset of the sequence of image frames 502 including a first number (N1) of image frames that precede the current image frame 502C. Based on the similarity level of the ROI image 508C, the electronic device 200 classifies (1012) the hand gesture in the ROI image 508C to an output hand gesture 516.

[0081] In some embodiments, the hand gesture in the ROI image 508C is classified by classifying the hand gesture in the ROI image 508C to a preliminary hand gesture 512C using a classifier neural network 510 and classifying a corresponding hand gesture in each of the set of image frames 602 to a preceding hand gesture using the classifier neural network 510. Further, in some embodiments, the hand gesture in the ROI image 508C is classified by determining a dominant hand gesture from preceding hand gestures of the set of image frames 602. In accordance with a determination that the similarity level of the ROI image 508C satisfies a similarity requirement, the electronic device 200 selects the preliminary hand gesture 512C as the output hand gesture 516. In accordance with a determination that the similarity level of the ROI image 508C does not satisfy the similarity requirement, the electronic device 200 selects the dominant hand gesture as the output hand gesture 516.

[0082] Additionally, in some situations, the preliminary hand gesture 512C is identical to more than a threshold number of corresponding hand gestures in the set of image frames 602 and used as the output hand gesture 516. In some situations, the preliminary hand gesture 512C is different from at least a threshold number of corresponding hand gestures in the set of image frames 602 and not used as the output hand gesture 516. In some situations, the preliminary hand gesture 512C is different from each and every corresponding hand gesture in the set of image frames 602 and not used as the output hand gesture 516.
[0083] Stated another way, in some embodiments, the electronic device 200 classifies the hand gesture in the ROI image 508C by determining (1014) a dominant hand gesture from preceding hand gestures of the set of image frames 602 and selecting (1016) the dominant hand gesture as the output hand gesture 516 of the ROI image 508C.
[0084] In some embodiments, the similarity level includes (1008) a cosine similarity value, and the similarity level of the ROI image 508C with the set of image frames 602 is determined by determining (1018) a current embedding vector 526C for the ROI image 508C, determining (1020) a prior embedding vector 526P of the set of image frames 602, and determining (1022) a cosine similarity value based on the current embedding vector 526C and the prior embedding vector 526P. Further, in some embodiments, the electronic device 200 determines the prior embedding vector 526P of the set of image frames 602 by, for each of the set of image frames 602, cropping a corresponding image frame in the subset of the sequence of image frames 502 and determining a respective embedding vector of the respective one of the set of image frames 602, and by combining the respective embedding vectors 526 of the set of image frames 602 to generate the prior embedding vector 526P of the set of image frames 602. In some embodiments, the respective embedding vectors 526 of the set of image frames 602 are combined in a weighted manner. For each image frame, a respective weight depends on a temporal distance of the respective image frame from the current image frame 502C. Further, in some embodiments, the set of image frames 602 have equal weights. The prior embedding vector 526P is an average of the respective embedding vectors 526 of the set of image frames 602.
[0085] Additionally, in some embodiments, the electronic device 200 determines the current embedding vector 526C for the ROI image 508C by resizing the ROI image 508C to a predefined image size and generating the current embedding vector 526C using a sequence of neural networks including a convolutional neural network, one or more pyramid network layers, and a fully connected layer.
[0086] Further, in some embodiments, for at least one of the set of image frames 602, the electronic device 200 predicts a bounding box 520 in a corresponding image frame based on locations of bounding boxes 520 in two or more image frames captured prior to the corresponding image frame. The at least one of the set of image frames 602 is cropped from the corresponding image frame based on the predicted bounding box 520.
[0087] In some embodiments, prior to determining the similarity level of the ROI image 508C, the electronic device 200 determines the first number based on a refresh rate of a camera that captures the sequence of image frames 502 and an image quality of the sequence of image frames 502. The image quality is optionally determined based on a signal-to-noise ratio.
[0088] In some embodiments, the electronic device 200 includes a camera and is applied with or integrated in a television device for recognizing the output hand gesture 516.

[0089] In some embodiments, the subset of the sequence of image frames 502 includes the first number of image frames that are uniformly sampled from a second subset of image frames 602 captured by a camera immediately prior to the current image frame 502C.

[0090] In some embodiments, the second subset of image frames 602 includes a subset of successive images. For the successive images, the electronic device 200 classifies hand gestures in the successive images to successive hand gestures using a classifier neural network 510, determines tracked hand gestures in the successive images based on similarity levels with respective sets of image frames, and determines that each of the successive hand gestures does not match a corresponding tracked hand gesture. In accordance with a determination that the subset of successive images includes more than a threshold number of images, the electronic device 200 forgoes sampling of a last image in the subset of successive images into the set of image frames 602 for the ROI image 508C. More details on tracking correction are explained above with reference to Figure 9.
[0091] It should be understood that the particular order in which the operations in Figure 10 have been described is merely exemplary and is not intended to indicate that the described order is the only order in which the operations could be performed. One of ordinary skill in the art would recognize various ways to recognize hand gestures in extended reality. Additionally, it should be noted that details of other processes described above with respect to Figures 5-9 are also applicable in an analogous manner to method 1000 described above with respect to Figure 10. For brevity, these details are not repeated here.

[0092] The terminology used in the description of the various described implementations herein is for the purpose of describing particular implementations only and is not intended to be limiting. As used in the description of the various described implementations and the appended claims, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term “and/or” as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms “includes,” “including,” “comprises,” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Additionally, it will be understood that, although the terms “first,” “second,” etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another.
[0093] As used herein, the term “if” is, optionally, construed to mean “when” or “upon” or “in response to determining” or “in response to detecting” or “in accordance with a determination that,” depending on the context. Similarly, the phrase “if it is determined” or “if [a stated condition or event] is detected” is, optionally, construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event]” or “in accordance with a determination that [a stated condition or event] is detected,” depending on the context.
[0094] The foregoing description, for purpose of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the claims to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain principles of operation and practical applications, to thereby enable others skilled in the art.
[0095] Although various drawings illustrate a number of logical stages in a particular order, stages that are not order dependent may be reordered and other stages may be combined or broken out. While some reordering or other groupings are specifically mentioned, others will be obvious to those of ordinary skill in the art, so the ordering and groupings presented herein are not an exhaustive list of alternatives. Moreover, it should be recognized that the stages can be implemented in hardware, firmware, software or any combination thereof.

Claims

What is claimed is:
1. An image processing method, implemented by an electronic device, comprising: obtaining a sequence of image frames including a current image frame; identifying a bounding box capturing a hand gesture in the current image frame; cropping the current image frame to generate a region of interest (ROI) image based on the bounding box; determining a similarity level of the ROI image with a set of image frames, wherein the set of image frames are cropped from a subset of the sequence of image frames including a first number of image frames that precede the current image frame; and based on the similarity level of the ROI image, classifying the hand gesture in the ROI image to an output hand gesture.
2. The method of claim 1, further comprising: classifying the hand gesture in the ROI image to a preliminary hand gesture using a classifier neural network; and classifying a corresponding hand gesture in each of the set of image frames to a preceding hand gesture using the classifier neural network.
3. The method of claim 2, classifying the hand gesture in the ROI image further comprising: determining a dominant hand gesture from preceding hand gestures of the set of image frames; and in accordance with a determination that the similarity level of the ROI image satisfies a similarity requirement, selecting the preliminary hand gesture as the output hand gesture; and in accordance with a determination that the similarity level of the ROI image does not satisfy the similarity requirement, selecting the dominant hand gesture as the output hand gesture.
4. The method of claim 2, wherein the preliminary hand gesture is identical to more than a threshold number of corresponding hand gestures in the set of image frames and used as the output hand gesture.
5. The method of claim 2, wherein the preliminary hand gesture is different from at least a threshold number of corresponding hand gestures in the set of image frames and not used as the output hand gesture.
6. The method of claim 2, wherein the preliminary hand gesture is different from each and every corresponding hand gesture in the set of image frames and not used as the output hand gesture.
7. The method of claim 2, classifying the hand gesture in the ROI image further comprising: determining a dominant hand gesture from preceding hand gestures of the set of image frames; and selecting the dominant hand gesture as the output hand gesture of the ROI image.
8. The method of any of the preceding claims, wherein the similarity level includes a cosine similarity value, determining the similarity level of the ROI image with the set of image frames further comprising: determining a current embedding vector for the ROI image; determining a prior embedding vector of the set of image frames; and determining a cosine similarity value based on the current embedding vector and the prior embedding vector.
9. The method of claim 8, determining the prior embedding vector of the set of image frames further comprising: for each of the set of image frames, cropping a corresponding image frame in the subset of the sequence of image frames and determining a respective embedding vector of the respective one of the set of image frames; and combining the respective embedding vectors of the set of image frames to generate the prior embedding vector of the set of image frames.
10. The method of claim 8 or 9, determining the current embedding vector for the ROI image further comprising: resizing the ROI image to a predefined image size, and generating the current embedding vector using a sequence of neural networks including a convolutional neural network, one or more pyramid network layers, and a fully connected layer.
11. The method of any of claims 8-10, further comprising: for at least one of the set of image frames, predicting a bounding box in a corresponding image frame based on locations of bounding boxes in two or more image frames captured prior to the corresponding image frame; and generating the at least one of the set of image frames by cropping the corresponding image frame based on the predicted bounding box.
12. The method of any of the preceding claims, further comprising, prior to determining the similarity level of the ROI image: determining the first number based on a refresh rate of a camera that captures the sequence of image frames and an image quality of the sequence of image frames.
13. The method of any of the preceding claims, wherein the electronic device includes a camera and is applied with or integrated in a television device for recognizing the output hand gesture.
14. The method of any of the preceding claims, wherein the subset of the sequence of image frames includes the first number of image frames that are uniformly sampled from a second subset of image frames captured by a camera immediately prior to the current image frame.
15. The method of claim 14, wherein the second subset of image frames includes a subset of successive images, the method further comprising: for the successive images, classifying hand gestures in the successive images to successive hand gestures using a classifier neural network, determining tracked hand gestures in the successive images based on similarity levels with respective sets of image frames, and determining that each of the successive hand gestures does not match a corresponding tracked hand gesture; and in accordance with a determination that the subset of successive images includes more than a threshold number of images, forgoing sampling of a last image in the subset of successive images into the set of image frames for the ROI image.
16. An electronic device, comprising: one or more processors; and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform a method of any of claims 1-15.
17. A non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform a method of any of claims 1-15.
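Several of the claimed operations lend themselves to short, non-limiting illustrative sketches. For example, claim 7 selects a dominant hand gesture from the hand gestures of the preceding image frames when the preliminary classification is not adopted. The minimal sketch below assumes a simple majority vote over string gesture labels; the function name and labels are illustrative assumptions, not part of the disclosure.

```python
# Illustrative sketch only (not the claimed implementation): selecting a
# dominant hand gesture by majority vote over the gestures classified in a
# set of preceding image frames, as recited in claim 7.
from collections import Counter
from typing import Optional, Sequence


def dominant_gesture(preceding_gestures: Sequence[str]) -> Optional[str]:
    """Return the most frequent gesture label among the preceding frames."""
    if not preceding_gestures:
        return None
    gesture, _count = Counter(preceding_gestures).most_common(1)[0]
    return gesture


# Example: the preliminary gesture disagrees with the tracked history, so the
# dominant prior gesture "palm" is used as the output gesture.
history = ["palm", "palm", "fist", "palm"]
assert dominant_gesture(history) == "palm"
```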
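Claims 8 and 9 recite a cosine similarity between a current embedding vector of the ROI image and a prior embedding vector combined from the embedding vectors of the set of image frames. The sketch below assumes the per-frame vectors are combined by averaging, which is one possible choice; the claims do not specify the combination, and the 128-dimensional vectors and 0.9 threshold are illustrative assumptions.

```python
# Illustrative sketch only: cosine similarity between a current embedding and
# a prior embedding formed by combining (here, averaging) the embeddings of
# the set of image frames, loosely following claims 8 and 9.
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom > 0 else 0.0


def prior_embedding(frame_embeddings: list) -> np.ndarray:
    """Combine the per-frame embedding vectors (mean used here as an example)."""
    return np.mean(np.stack(frame_embeddings, axis=0), axis=0)


current = np.random.rand(128)                       # embedding of the current ROI image
history = [np.random.rand(128) for _ in range(5)]   # embeddings of the tracked frames
similarity = cosine_similarity(current, prior_embedding(history))
is_same_hand = similarity > 0.9                     # assumed, tunable threshold
```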
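Claim 10 recites generating the current embedding vector with a sequence of neural networks including a convolutional neural network, one or more pyramid network layers, and a fully connected layer, after resizing the ROI image to a predefined size. The sketch below is a hypothetical PyTorch arrangement of that structure; the layer widths, the 96x96 input size, and the 128-dimensional embedding are assumptions for illustration only and do not represent the disclosed network.

```python
# Illustrative sketch only: a small embedding network with a convolutional
# backbone, pyramid-style pooling at several scales, and a fully connected
# layer, loosely matching the structure recited in claim 10.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GestureEmbeddingNet(nn.Module):
    def __init__(self, embedding_dim: int = 128):
        super().__init__()
        # Small convolutional backbone (widths are arbitrary for the sketch).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Pyramid-style pooling at several scales, concatenated.
        self.pyramid_sizes = (1, 2, 4)
        pooled_dim = 64 * sum(s * s for s in self.pyramid_sizes)
        # Fully connected layer producing the embedding vector.
        self.fc = nn.Linear(pooled_dim, embedding_dim)

    def forward(self, roi: torch.Tensor) -> torch.Tensor:
        # Resize the ROI image to a predefined size before feature extraction.
        roi = F.interpolate(roi, size=(96, 96), mode="bilinear", align_corners=False)
        feat = self.backbone(roi)
        pooled = [F.adaptive_avg_pool2d(feat, s).flatten(1) for s in self.pyramid_sizes]
        return self.fc(torch.cat(pooled, dim=1))


net = GestureEmbeddingNet()
embedding = net(torch.rand(1, 3, 120, 80))   # (1, 128) embedding for one ROI crop
```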
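Claim 11 recites predicting a bounding box in a frame from the bounding-box locations in two or more earlier frames. One way such a prediction could be realized is a constant-velocity (linear extrapolation) model, sketched below; the claims are not limited to this motion model, and the box format is an assumption.

```python
# Illustrative sketch only: predicting a bounding box for a frame from the
# boxes in two earlier frames (claim 11), assuming constant velocity.
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height)


def predict_box(prev2: Box, prev1: Box) -> Box:
    """Extrapolate the next box assuming the motion between the two
    preceding frames continues unchanged."""
    return tuple(p1 + (p1 - p2) for p2, p1 in zip(prev2, prev1))


# Example: the hand moved 5 px to the right between the two prior frames, so
# the predicted crop for an occluded or blurred frame shifts another 5 px.
predicted = predict_box((100, 60, 40, 40), (105, 60, 40, 40))
assert predicted == (110, 60, 40, 40)
```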
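Claims 14 and 15 recite uniformly sampling the first number of image frames from a second subset captured immediately before the current image, and forgoing sampling of the last image of a long run of successive images whose classified gestures did not match their tracked gestures. The rough sketch below is written under those assumptions; the buffer layout, field names, and threshold value are hypothetical.

```python
# Illustrative sketch only: uniform sampling of reference frames (claim 14)
# with skipping of the last frame of a long run of mismatched successive
# images (claim 15).
import numpy as np


def sample_reference_frames(buffer, first_number: int, mismatch_threshold: int = 3):
    """buffer: list of dicts like {"frame": ..., "mismatched": bool}, oldest first."""
    # Length of the trailing run of successive mismatched images.
    run = 0
    for entry in buffer:
        run = run + 1 if entry["mismatched"] else 0
    # Forgo sampling the last image when the run exceeds the threshold.
    usable = buffer[:-1] if run > mismatch_threshold else buffer
    # Uniformly sample `first_number` frames from the remaining buffer.
    if len(usable) <= first_number:
        return [e["frame"] for e in usable]
    idx = np.linspace(0, len(usable) - 1, num=first_number).round().astype(int)
    return [usable[i]["frame"] for i in idx]
```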
PCT/US2022/045369 2022-09-30 2022-09-30 Real-time hand gesture tracking and recognition WO2024072410A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2022/045369 WO2024072410A1 (en) 2022-09-30 2022-09-30 Real-time hand gesture tracking and recognition

Publications (1)

Publication Number Publication Date
WO2024072410A1 (en) 2024-04-04

Family

ID=90478880

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2022/045369 WO2024072410A1 (en) 2022-09-30 2022-09-30 Real-time hand gesture tracking and recognition

Country Status (1)

Country Link
WO (1) WO2024072410A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120268376A1 (en) * 2011-04-20 2012-10-25 Qualcomm Incorporated Virtual keyboards and methods of providing the same
US20130329946A1 (en) * 2012-06-08 2013-12-12 Qualcomm Incorporated Fast pose detector
US20210201661A1 (en) * 2019-12-31 2021-07-01 Midea Group Co., Ltd. System and Method of Hand Gesture Detection
US20210397266A1 (en) * 2020-06-19 2021-12-23 Arizona Board Of Regents On Behalf Of Arizona State University Systems and methods for language driven gesture understanding

Similar Documents

Publication Publication Date Title
Jegham et al. Vision-based human action recognition: An overview and real world challenges
US10733431B2 (en) Systems and methods for optimizing pose estimation
WO2023082882A1 (en) Pose estimation-based pedestrian fall action recognition method and device
WO2021184026A1 (en) Audio-visual fusion with cross-modal attention for video action recognition
WO2021047587A1 (en) Gesture recognition method, electronic device, computer-readable storage medium, and chip
US20240037948A1 (en) Method for video moment retrieval, computer system, non-transitory computer-readable medium
WO2021077140A2 (en) Systems and methods for prior knowledge transfer for image inpainting
CN111539897A (en) Method and apparatus for generating image conversion model
US11868523B2 (en) Eye gaze classification
WO2023101679A1 (en) Text-image cross-modal retrieval based on virtual word expansion
WO2021092600A2 (en) Pose-over-parts network for multi-person pose estimation
CN113874877A (en) Neural network and classifier selection system and method
WO2023277888A1 (en) Multiple perspective hand tracking
WO2024072410A1 (en) Real-time hand gesture tracking and recognition
WO2023277877A1 (en) 3d semantic plane detection and reconstruction
WO2023069085A1 (en) Systems and methods for hand image synthesis
Kaushik et al. A Survey of Approaches for Sign Language Recognition System
US20240087344A1 (en) Real-time scene text area detection
WO2023063944A1 (en) Two-stage hand gesture recognition
WO2023023160A1 (en) Depth information reconstruction from multi-view stereo (mvs) images
US11847823B2 (en) Object and keypoint detection system with low spatial jitter, low latency and low power usage
WO2023018423A1 (en) Learning semantic binary embedding for video representations
US20240153184A1 (en) Real-time hand-held markerless human motion recording and avatar rendering in a mobile platform
WO2022250689A1 (en) Progressive video action recognition using scene attributes
WO2022103412A1 (en) Methods for recognition of air-swipe gestures