WO2023219629A1 - Context-based hand gesture recognition - Google Patents

Context-based hand gesture recognition

Info

Publication number
WO2023219629A1
Authority
WO
WIPO (PCT)
Prior art keywords
gesture
hand
image
hand gesture
confidence score
Prior art date
Application number
PCT/US2022/029161
Other languages
French (fr)
Inventor
Jie Liu
Yang Zhou
Original Assignee
Innopeak Technology, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Innopeak Technology, Inc. filed Critical Innopeak Technology, Inc.
Priority to PCT/US2022/029161 priority Critical patent/WO2023219629A1/en
Publication of WO2023219629A1 publication Critical patent/WO2023219629A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017Gesture based interaction, e.g. based on a set of recognized hand gestures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/28Recognition of hand or arm movements, e.g. recognition of deaf sign language

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Multimedia (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

This application is directed to recognizing gestures in an image. An electronic device obtains an image including a hand region and detects the hand region in the image. A first hand gesture is determined from the hand region of the image, and a second hand gesture is determined from the image. In accordance with a determination that the first hand gesture is not any of a plurality of contextual gestures, the electronic device determines that a final hand gesture of the image is the first hand gesture. Conversely, in accordance with a determination that the first hand gesture is one of the plurality of contextual gestures, the electronic device determines the final hand gesture from the image based on the second hand gesture associated with the image and a corresponding confidence score.

Description

Context-Based Hand Gesture Recognition
TECHNICAL FIELD
[0001] This application relates generally to gesture recognition technology including, but not limited to, methods, systems, and non-transitory computer-readable media for recognizing gestures from image data.
BACKGROUND
[0002] Existing solutions for gesture recognition often require that a gesture (e.g., hand gesture, face gesture, etc.) be captured in close proximity to a camera (e.g., less than 1 meter away) to accurately detect the gesture. As the distance between the gesture and imaging device increases, the gesture becomes smaller and is blended with a number of other objects captured concurrently with the gesture. To overcome these challenges, electronic devices applied in gesture recognition are adjusted to focus on a small area containing the gesture, e.g., by zooming in and limiting a field of view to the small area. However, some electronic devices cannot focus on a particular gesture before capturing an image, and have to crop the image to obtain the small area containing the gesture. Other solutions rely on powerful deep learning models or a fusion of detection and classification processes, and therefore demand a large amount of computational resources for gesture detection and recognition. As such, it would be beneficial to have systems and methods for accurately and efficiently detecting gestures captured in images, including gestures that could be far from a camera and blended with a background of the images.
SUMMARY
[0003] Various embodiments of this application are directed to gesture recognition techniques that fuse local gesture information and contextual gesture information to improve the accuracy and efficiency of gesture recognition. The local gesture information includes information concerning the gesture or the portion of the body performing the gesture, and the contextual gesture information includes information concerning surroundings of the gesture, such as an environment (e.g., the office), a position of the gesture relative to a user, and/or other factors. Further, in some embodiments, an initial gesture classification is applied to streamline detection and classification processes, thereby improving an overall efficiency. In some embodiments, one or more gestures are recognized based on contextual information (e.g., moving a hand to a mouth to signify silence). Such contextual information in an image is applied to increase the accuracy of gesture recognition and reduce the number of false positives. In an example, context information is used to distinguish among local gestures, contextual gestures, and/or non-gestures.
[0004] In one aspect, a method for classifying a gesture is provided. The method includes obtaining an image including a hand region, detecting the hand region in the image, determining a first hand gesture from the hand region of the image, and determining a second hand gesture from the image (e.g., the entire image). The method further includes, in accordance with a determination that the first hand gesture is not any of a plurality of contextual gestures, determining that a final hand gesture of the image is the first hand gesture; and, in accordance with a determination that the first hand gesture is one of the plurality of contextual gestures, determining the final hand gesture based on the second hand gesture and a second confidence score, the second hand gesture and the second confidence score being associated with the image (e.g., the entire image).
[0005] In some embodiments, determining the first hand gesture from the hand region of the image further includes generating a first gesture vector from the hand region of the image. Each element of the first gesture vector corresponds to a respective hand gesture and represents a respective first confidence level of the hand region including the respective hand gesture. The method further includes determining the first hand gesture and a first gesture confidence score from the first gesture vector. In some embodiments, the method further includes associating detection of the hand region in the image with a bounding box confidence score and combining the bounding box confidence score with a confidence score associated with the first hand gesture to generate the first gesture confidence score. In some embodiments, the first hand gesture includes the respective hand gesture corresponding to a largest first confidence level of the respective first confidence level of each element of the first gesture vector, and the first gesture confidence score is equal to the largest first confidence level of the respective first confidence level of each element of the first gesture vector.
[0006] In some embodiments, determining the second hand gesture from the image further includes generating a second gesture vector from the image (e.g., the entire image). Each element of the second gesture vector corresponds to a respective hand gesture and represents a respective second confidence level of the image including the respective predefined hand gesture. The method further includes determining the second hand gesture and a second gesture confidence score from the second gesture vector. In some embodiments, the second hand gesture includes the respective hand gesture corresponding to a largest second confidence level of the respective second confidence level of each element of the second gesture vector, and the second gesture confidence score is equal to the largest second confidence level of the respective second confidence level of each element of the second gesture vector.
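The selection of a hand gesture and its confidence score from a normalized gesture vector, as described in the two preceding paragraphs, amounts to taking the element with the largest confidence level. A minimal sketch, assuming NumPy and a hypothetical gesture label set (the labels, vector values, and function name are illustrative, not taken from this disclosure):

```python
import numpy as np

# Hypothetical gesture label set; each vector element corresponds to one label.
GESTURES = ["fist", "palm", "scissors", "thumb_up", "silence"]

def gesture_from_vector(gesture_vector):
    """Return the gesture with the largest confidence level and that confidence."""
    idx = int(np.argmax(gesture_vector))
    return GESTURES[idx], float(gesture_vector[idx])

# First gesture vector (from the hand region) and second gesture vector (from the entire image).
first_vec = np.array([0.05, 0.10, 0.70, 0.10, 0.05])
second_vec = np.array([0.08, 0.12, 0.60, 0.15, 0.05])

first_gesture, first_score = gesture_from_vector(first_vec)     # ("scissors", 0.70)
second_gesture, second_score = gesture_from_vector(second_vec)  # ("scissors", 0.60)
```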
[0007] In some embodiments, the method includes, before determining whether the first hand gesture is at least one of the plurality of contextual gestures, determining whether the first gesture confidence score is greater than a second threshold P2 and, in accordance with a determination that the first gesture confidence score is less than the second threshold P2, determining that the image is not associated with any hand gesture. In some embodiments, the method further includes, before determining whether the first hand gesture is at least one of the plurality of contextual gestures, determining whether the second gesture confidence score of the second hand gesture is greater than a first threshold P1 and, in accordance with a determination that the second gesture confidence score of the second hand gesture is less than the first threshold P1, determining that the image is not associated with any hand gesture.
[0008] In some embodiments, determining the final hand gesture based on the second hand gesture and a second confidence score further includes, in accordance with a determination that the first and second hand gestures are distinct from each other, determining that the image is not associated with any hand gesture. The method further includes, in accordance with a determination that the first and second hand gestures are identical to each other, (1) in accordance with a determination that a third confidence score exceeds a comprehensive confidence threshold, determining that the final hand gesture is the second hand gesture; and (2) in accordance with a determination that the third confidence score does not exceed the comprehensive confidence threshold, determining that the image is not associated with any hand gesture.
[0009] In some embodiments, the method further includes filtering the final hand gesture using a filtering function, the filtering function being configured to identify false positives with the help of temporal information, that is, results from previous images. In some embodiments, the filtering function is one of a convolution function, a Fourier filtering function, or a Kalman filter. In some embodiments, the filtering function is a function of time.
[0010] In another aspect, some implementations include an electronic device that includes one or more processors and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform any of the above methods.
[0011] In yet another aspect, some implementations include a non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform any of the above methods.
[0012] These illustrative embodiments and implementations are mentioned not to limit or define the disclosure, but to provide examples to aid understanding thereof. Additional embodiments are discussed in the Detailed Description, and further description is provided there.
BRIEF DESCRIPTION OF THE DRAWINGS
[0013] For a better understanding of the various described implementations, reference should be made to the Detailed Description below, in conjunction with the following drawings in which like reference numerals refer to corresponding parts throughout the figures.
[0014] Figure 1 is an example data processing environment having one or more servers communicatively coupled to one or more client devices, in accordance with some embodiments.
[0015] Figure 2 is a block diagram illustrating an electronic device configured to process content data (e.g., image data), in accordance with some embodiments.
[0016] Figure 3 is a flow diagram of a gesture detection and classification process using image data, in accordance with some embodiments.
[0017] Figure 4 is a flowchart of an example post processing technique for determining a final gesture from two gestures that are determined from image data, in accordance with some embodiments.
[0018] Figure 5 is a flow diagram of a method of classifying one or more gestures, in accordance with some embodiments.
[0019] Like reference numerals refer to corresponding parts throughout the several views of the drawings.
DETAILED DESCRIPTION
[0020] Reference will now be made in detail to specific embodiments, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous non-limiting specific details are set forth in order to assist in understanding the subject matter presented herein. But it will be apparent to one of ordinary skill in the art that various alternatives may be used without departing from the scope of claims and the subject matter may be practiced without these specific details. For example, it will be apparent to one of ordinary skill in the art that the subject matter presented herein can be implemented on many types of electronic devices with digital video capabilities.
[0021] Figure 1 is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, in accordance with some embodiments. The one or more client devices 104 may be, for example, desktop computers 104A, tablet computers 104B, mobile phones 104C, head-mounted displays (HMD) (also called augmented reality (AR) glasses) 104D, or intelligent, multi-sensing, network-connected home devices (e.g., a surveillance camera 104E, a smart television device, a drone). Each client device 104 can collect data or user inputs, execute user applications, and present outputs on its user interface. The collected data or user inputs can be processed locally at the client device 104 and/or remotely by the server(s) 102. The one or more servers 102 provide system data (e.g., boot files, operating system images, and user applications) to the client devices 104, and in some embodiments, process the data and user inputs received from the client device(s) 104 when the user applications are executed on the client devices 104. In some embodiments, the data processing environment 100 further includes a storage 106 for storing data related to the servers 102, client devices 104, and applications executed on the client devices 104.
[0022] The one or more servers 102 are configured to enable real-time data communication with the client devices 104 that are remote from each other or from the one or more servers 102. Further, in some embodiments, the one or more servers 102 are configured to implement data processing tasks that cannot be or are preferably not completed locally by the client devices 104. For example, the client devices 104 include a game console (e.g., the HMD 104D) that executes an interactive online gaming application. The game console receives a user instruction and sends it to a game server 102 with user data. The game server 102 generates a stream of video data based on the user instruction and user data and provides the stream of video data for display on the game console and other client devices that are engaged in the same game session with the game console. In another example, the client devices 104 include a networked surveillance camera 104E and a mobile phone 104C. The networked surveillance camera 104E collects video data and streams the video data to a surveillance camera server 102 in real time. While the video data is optionally pre-processed on the surveillance camera 104E, the surveillance camera server 102 processes the video data to identify motion or audio events in the video data and share information of these events with the mobile phone 104C, thereby allowing a user of the mobile phone 104C to monitor the events occurring near the networked surveillance camera 104E in real time and remotely.
[0023] The one or more servers 102, one or more client devices 104, and storage 106 are communicatively coupled to each other via one or more communication networks 108, which are the medium used to provide communications links between these devices and computers connected together within the data processing environment 100. The one or more communication networks 108 may include connections, such as wire, wireless communication links, or fiber optic cables. Examples of the one or more communication networks 108 include local area networks (LAN), wide area networks (WAN) such as the Internet, or a combination thereof. The one or more communication networks 108 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol. A connection to the one or more communication networks 108 may be established either directly (e.g., using 3G/4G connectivity to a wireless carrier), or through a network interface 110 (e.g., a router, switch, gateway, hub, or an intelligent, dedicated whole-home control node), or through any combination thereof. As such, the one or more communication networks 108 can represent the Internet, a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another. At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational and other computer systems that route data and messages.
[0024] In some embodiments, deep learning techniques are applied in the data processing environment 100 to process content data (e.g., video data, visual data, audio data) obtained by an application executed at a client device 104 to identify information contained in the content data, match the content data with other data, categorize the content data, or synthesize related content data. The content data may broadly include inertial sensor data captured by inertial sensor(s) of a client device 104. In these deep learning techniques, data processing models are created based on one or more neural networks to process the content data. These data processing models are trained with training data before they are applied to process the content data. Subsequent to model training, the mobile phone 104C or HMD 104D obtains the content data (e.g., captures video data via an internal camera) and processes the content data using the data processing models locally.
[0025] In some embodiments, both model training and data processing are implemented locally at each individual client device 104 (e.g., the mobile phone 104C and HMD 104D). The client device 104 obtains the training data from the one or more servers 102 or storage 106 and applies the training data to train the data processing models. Alternatively, in some embodiments, both model training and data processing are implemented remotely at a server 102 (e.g., the server 102A) associated with a client device 104 (e.g., the client device 104A and HMD 104D). The server 102A obtains the training data from itself, another server 102, or the storage 106 and applies the training data to train the data processing models. The client device 104 obtains the content data, sends the content data to the server 102A (e.g., in an application) for data processing using the trained data processing models, receives data processing results (e.g., recognized hand gestures) from the server 102A, presents the results on a user interface (e.g., associated with the application), renders virtual objects in a field of view based on the poses, or implements some other functions based on the results. The client device 104 itself implements no or little data processing on the content data prior to sending them to the server 102A. Additionally, in some embodiments, data processing is implemented locally at a client device 104 (e.g., the client device 104B and HMD 104D), while model training is implemented remotely at a server 102 (e.g., the server 102B) associated with the client device 104. The server 102B obtains the training data from itself, another server 102, or the storage 106 and applies the training data to train the data processing models. The trained data processing models are optionally stored in the server 102B or storage 106. The client device 104 imports the trained data processing models from the server 102B or storage 106, processes the content data using the data processing models, and generates data processing results to be presented on a user interface or used to initiate some functions (e.g., rendering virtual objects based on device poses) locally.
[0026] In some embodiments, a pair of AR glasses 104D (also called an HMD) are communicatively coupled in the data processing environment 100. The AR glasses 104D include a camera, a microphone, a speaker, one or more inertial sensors (e.g., gyroscope, accelerometer), and a display. The camera and microphone are configured to capture video and audio data from a scene of the AR glasses 104D, while the one or more inertial sensors are configured to capture inertial sensor data. In some situations, the camera captures hand gestures of a user wearing the AR glasses 104D, and recognizes the hand gestures locally and in real time using a hand gesture recognition model. In some situations, the microphone records ambient sound, including the user's voice commands. In some situations, both video or static visual data captured by the camera and the inertial sensor data measured by the one or more inertial sensors are applied to determine and predict device poses. The video, static image, audio, or inertial sensor data captured by the AR glasses 104D is processed by the AR glasses 104D, the server(s) 102, or both to recognize the device poses. Optionally, deep learning techniques are applied by the server(s) 102 and AR glasses 104D jointly to recognize and predict the device poses. The device poses are used to control the AR glasses 104D itself or interact with an application (e.g., a gaming application) executed by the AR glasses 104D. In some embodiments, the display of the AR glasses 104D displays a user interface, and the recognized or predicted device poses are used to render or interact with user selectable display items (e.g., an avatar) on the user interface.
[0027] As explained above, in some embodiments, deep learning techniques are applied in the data processing environment 100 to process video data, static image data, or inertial sensor data captured by the AR glasses 104D. 2D or 3D device poses are recognized and predicted based on such video, static image, and/or inertial sensor data using a first data processing model. Visual content is optionally generated using a second data processing model. Training of the first and second data processing models is optionally implemented by the server 102 or AR glasses 104D. Inference of the device poses and visual content is implemented by each of the server 102 and AR glasses 104D independently or by both of the server 102 and AR glasses 104D jointly.
[0028] Figure 2 is a block diagram illustrating an electronic device 200 configured to process content data (e.g., image data), in accordance with some embodiments. The electronic device 200 is one of a server 102, a client device 104 (e.g., AR glasses 104D in Figure 1), a storage 106, or a combination thereof. In an example, the electronic device 200 is a mobile device including a gesture recognition module 230 that applies a neural network model (e.g., in Figure 3) end-to-end to recognize hand gestures locally at the mobile device. The electronic device 200, typically, includes one or more processing units (CPUs) 202, one or more network interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset). The electronic device 200 includes one or more input devices 210 that facilitate user input, such as a keyboard, a mouse, a voice-command input unit or microphone, a touch screen display, a touch-sensitive input pad, a gesture capturing camera, or other input buttons or controls. Furthermore, in some embodiments, the electronic device 200 uses a microphone for voice recognition or a camera 260 for gesture recognition to supplement or replace the keyboard. In some embodiments, the electronic device 200 includes one or more optical cameras (e.g., an RGB camera 260), scanners, or photo sensor units for capturing images, for example, of graphic serial codes printed on the electronic devices. In some embodiments, the electronic device 200 also includes one or more output devices 212 that enable presentation of user interfaces and display content, including one or more speakers and/or one or more visual displays. Optionally, the electronic device 200 includes a location detection device, such as a GPS (global positioning system) or other geo-location receiver, for determining the location of the electronic device 200. Optionally, the electronic device 200 includes an inertial measurement unit (IMU) 280 integrating sensor data captured by multi-axes inertial sensors to provide estimation of a location and an orientation of the electronic device 200 in space. Examples of the one or more inertial sensors of the IMU 280 include, but are not limited to, a gyroscope, an accelerometer, a magnetometer, and an inclinometer.
[0029] Alternatively, or in addition, in some embodiments, the electronic device 200 is communicatively coupled, via the one or more network interfaces 204, to one or more devices (e.g., a server 102, a client device 104, a storage 106, or a combination thereof) that include one or more input devices 210, output device 212, IMUs 280, or other components described above and provide data to the electronic device 200.
[0030] Memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices. Memory 206, optionally, includes one or more storage devices remotely located from the one or more processing units 202. Memory 206, or alternatively the non-volatile memory within memory 206, includes a non-transitory computer readable storage medium. In some embodiments, memory 206, or the non-transitory computer readable storage medium of memory 206, stores the following programs, modules, and data structures, or a subset or superset thereof:
* Operating system 214 including procedures for handling various basic system services and for performing hardware-dependent tasks;
* Network communication module 216 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or storage 106) via one or more network interfaces 204 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;
* User interface module 218 for enabling presentation of information (e.g., a graphical user interface for application(s) 224, widgets, websites and web pages thereof, and/or games, audio and/or video content, text, etc.) at each client device 104 via one or more output devices 212 (e.g., displays, speakers, etc.);
* Input processing module 220 for detecting one or more user inputs or interactions from one of the one or more input devices 210 and interpreting the detected input or interaction;
* Web browser module 222 for navigating, requesting (e.g., via HTTP), and displaying websites and web pages thereof, including a web interface for logging into a user account associated with a client device 104 or another electronic device, controlling the client or electronic device if associated with the user account, and editing and reviewing settings and data that are associated with the user account;
* One or more user applications 224 for execution by the electronic device 200 (e.g., games, social network applications, smart home applications, and/or other web or non-web based applications for controlling another electronic device and reviewing data captured by such devices);
* Model training module 226 for receiving training data and establishing a data processing model for processing content data (e.g., video, image, audio, or textual data) to be collected or obtained by a client device 104;
* Data processing module 228 for processing content data using data processing models 250, thereby identifying information contained in the content data, matching the content data with other data, categorizing the content data, or synthesizing related content data, where in some embodiments, the data processing module 228 is associated with one of the user applications 224 to process the content data in response to a user instruction received from the user application 224, and in an example, the data processing module 228 is applied to implement a gesture detection and classification process 300 in Figure 3;
* Gesture classification module 230 for classifying one or more gestures in an image (as shown and described below in reference to Figures 3 and 4), where the gesture classification module 230 further includes a detection module 232 for detecting one or more objects in an image and/or a classification module 234 for classifying one or more gestures in a region or portion of the image and/or the entire image, and the image data is processed jointly by the detection process 310 and classification process 320 of the gesture classification module 230 and the data processing module 228; and
* One or more databases 240 for storing at least data including one or more of:
  o Device settings 242 including common device settings (e.g., service tier, device model, storage capacity, processing capabilities, communication capabilities, etc.) of the one or more servers 102 or client devices 104;
  o User account information 244 for the one or more user applications 224, e.g., user names, security questions, account history data, user preferences, and predefined account settings;
  o Network parameters 246 for the one or more communication networks 108, e.g., IP address, subnet mask, default gateway, DNS server, and host name;
  o Training data 248 for training one or more data processing models 250;
  o Data processing model(s) 250 for processing content data (e.g., video, image, audio, or textual data) using deep learning techniques, where the data processing models 250 include an image compression model for implementing an image compression process, a feature extraction model for implementing a multi-scale feature extraction process, and/or one or more classification models and networks as described below in reference to Figures 3 and 4;
  o Gesture database 252 for storing one or more gestures associated with images (e.g., stored in a database in memory 206); and
  o Content data and results 254 that are obtained by and output by the electronic device 200 (or a device communicatively coupled to the electronic device 200 (e.g., a client device 104)), respectively, where the content data is processed by the data processing models 250 locally at the client device 104 or remotely at the server 102 to provide the associated results to be presented on the client device 104.
[0031] Optionally, the one or more databases 240 are stored in one of the server 102, client device 104, and storage 106 of the electronic device 200. Optionally, the one or more databases 240 are distributed in more than one of the server 102, client device 104, and storage 106 of the electronic device 200. In some embodiments, more than one copy of the above data is stored at distinct devices, e.g., two copies of the data processing models 250 are stored at the server 102 and storage 106, respectively.
[0032] Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above. The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, modules or data structures, and thus various subsets of these modules may be combined or otherwise re-arranged in various embodiments. In some embodiments, memory 206, optionally, stores a subset of the modules and data structures identified above. Furthermore, memory 206, optionally, stores additional modules and data structures not described above.
[0033] Figure 3 is a flow diagram of a gesture detection and classification process 300 using image data 312, in accordance with some embodiments. In some embodiments, the gesture detection and classification process 300 is configured to detect and recognize gestures captured at least 0.5 m to 2 m away from an imaging device capturing the image data 312. In some embodiments, the gesture detection and classification process 300 is configured to detect and recognize gestures captured greater than 2 m away from an imaging device capturing the image data 312. The gesture detection and classification process 300 is optionally performed by one or more client devices 104, servers 102, and/or a combination thereof described above in reference to Figures 1 and 2. The gesture detection and classification process 300 includes a detection process 310 followed by a classification process 320, which is followed by a post processing phase 330 that determines a final gesture 340.
[0034] In some embodiments, the detection process 310 includes applying a first detection and classification model 314 to received image data 312. The first detection and classification model 314 generates one or more feature maps 302 to be used for object detection in an object detection phase 316 and gesture classification in a gesture classification phase 318. The detection process 310 is configured to provide a first output via the object detection phase 316 and a second output via the gesture classification phase 318 for the purposes of determining the final gesture 340. The first output of the object detection phase 316 includes information of a bounding box 303 and associated box confidence score 304 for each first gesture 305. In some embodiments, the information of the bounding box 303 is used to generate a cropped image 322 tightly enclosing the gesture, and a second classification network 324 is applied to determine the first gesture 305 from the cropped image 322. The second output of the gesture classification phase 318 includes information of a second gesture 307 and associated second confidence score 308.
[0035] In some embodiments, the image data 312 is captured by an input device 210 (e.g., an RGB camera 260) of an electronic device 200 (Figure 2). The image data 312 is optionally processed locally at the electronic device 200. Alternatively, the image data 312 is uploaded to a server 102 or transferred to a distinct electronic device 200. The distinct electronic device 200 obtains the image 312 from the electronic device 200 having the camera 260, or downloads the image 312 from the server 102 via a web browser module 222 and/or one or more user applications 224. In some embodiments, the image data 312 includes one or more gestures. In some embodiments, a gesture within the image data 312 is at least 4 meters away relative to the electronic device 200 that captures the image data 312. Non-limiting examples of the one or more gestures include one or more hand gestures, facial gestures, and body gestures. The image data 312 is received with an initial resolution, e.g., 1080p.
[0036] The image data 312 is passed through the first detection and classification model 314 to compress the image data 312 and/or generate one or more feature maps 302 from the image data 312. In some embodiments, the image data 312 is processed (e.g., downscaled using one or more neural networks, such as one or more convolutional neural networks) before it is passed through the first detection and classification model 314. The first detection and classification model 314 includes one or more machine learning models. For example, in some embodiments, the first detection and classification model 314 includes one or more convolutional neural networks (CNNs) known in the art. In some embodiments, the one or more machine learning models are configured to identify and enrich (e.g., extract detail from) one or more features from the image data 312, downsize feature resolutions (with respect to the initial resolution of the image data 312) of the one or more features, and/or generate a sequence of (scaled) feature maps based on the image data 312. In some embodiments, the one or more feature maps are provided as an output 302 of the first detection and classification model 314. Alternatively, in some embodiments, the sequence of feature maps is combined into a comprehensive feature map that is provided as an output 302 of the first detection and classification model 314. The output 302 of the first detection and classification model 314 is used by at least the object detection phase 316 and gesture classification phase 318.
[0037] The gesture classification phase 318 identifies the second gesture 307 based on the output 302 of the first detection and classification model 314. In particular, the gesture classification phase 318 is configured to determine the information of the second gesture 307 and associated second confidence score 308 (i.e., Det Cls Conf Score) indicating a confidence level of detecting the second gesture 307 from the one or more feature maps 302 corresponding to the entire image data 312. In some embodiments, the information of the second gesture 307 generated by the gesture classification phase 318 includes a second gesture vector. Each element of the second gesture vector corresponds to a respective gesture and represents a respective probability or confidence level of the second gesture 307 corresponding to the respective gesture. In some embodiments, the second gesture 307 and second confidence score 308 are determined based on the second gesture vector. In some embodiments, the second gesture vector is normalized. In some embodiments, the gesture classification phase 318 is performed in parallel to the object detection phase 316 and the classification process 320.
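One plausible arrangement of this phase is sketched below: a shared convolutional backbone emits the feature maps (output 302), and a whole-image classification head emits a normalized second gesture vector from which the second gesture 307 and second confidence score 308 are taken. PyTorch, the layer sizes, the input resolution, and the number of gesture classes are assumptions made for illustration, not details of this disclosure.

```python
import torch
import torch.nn as nn

NUM_GESTURES = 5  # hypothetical number of gesture classes

class BackboneWithGestureHead(nn.Module):
    def __init__(self, num_gestures=NUM_GESTURES):
        super().__init__()
        # Small convolutional backbone producing a feature map from the full image.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        # Whole-image classification head producing the second gesture vector.
        self.cls_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, num_gestures)
        )

    def forward(self, image):
        feature_map = self.backbone(image)                    # feature map(s) 302
        logits = self.cls_head(feature_map)
        second_gesture_vector = torch.softmax(logits, dim=1)  # normalized (sums to 1)
        return feature_map, second_gesture_vector

model = BackboneWithGestureHead()
frame = torch.rand(1, 3, 224, 224)                            # downscaled RGB frame
feature_map, second_vec = model(frame)
second_conf, second_idx = second_vec.max(dim=1)               # second gesture 307 and score 308
```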
[0038] In some embodiments, the gesture classification phase 318 is configured to recognize gestures for a particular application and/or system. In some embodiments, the second gesture 307 determined by the gesture classification phase 318 includes only local information that can be used to further determine the final gesture 340. Local information for a gesture includes information specific to the body performing the gesture, information specific to the gesture (e.g., the exact class), and/or information specific to a particular application and/or system. More specifically, local information is information based solely on the hand or portion of the body performing the gesture. For example, as shown in Figure 3, local information can be “scissors.”
[0039] The object detection phase 316 is configured to detect one or more gestures in the output of the first detection and classification model 314. In particular, the object detection phase 316 is configured to generate one or more bounding boxes 303 around one or more detected gestures within the image data 312. In some embodiments, each bounding box 303 corresponds to a respective first gesture 305, and is determined with a box confidence score 304 (i.e., BBox Conf Score) indicating a confidence level of the respective bounding box 303 associated with a first gesture 305. Further, in some embodiments, a gesture region 322 is cropped and resized (326) from the image data 312 for each first gesture 305 based on information of the bounding box 303.
[0040] The classification process 320 applies a second classification network 324 to each gesture region 322 (i.e., cropped image 322) to determine the first gesture 305. In some embodiments, the second classification network 324 is one or more neural networks known in the art (e.g., MobileNet v1, MobileNet v2, ShuffleNet). In some embodiments, the second classification network 324 is selected based on expected gesture tasks (e.g., gestures expected for a particular application and/or system) and/or selected based on a number of classes for the classification (e.g., different types of gestures that can be classified by a particular application and/or system).
[0041] For each cropped image 322, the second classification network 324 determines a corresponding first gesture 305 by determining a first gesture vector for each cropped image 322 received. Each element of the first gesture vector determined by the second classification network 324 corresponds to a respective gesture and represents a respective first probability or confidence level of the gesture region 322 including the respective gesture. In some embodiments, the classification process 320 determines the first gesture 305 and first gesture confidence score 306 based on the first gesture vector. In some embodiments, the first gesture confidence score 306 for the first gesture 305 is combined with the box confidence score 304, which is determined by the object detection phase 316, to generate a first confidence score for the first gesture 305. The first gesture vector is normalized, indicating that a sum of the probability values corresponding to all gestures is equal to 1.
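A minimal sketch of this crop-and-classify step is given below. It assumes NumPy, a hypothetical `second_classifier` callable standing in for the second classification network 324, a simple nearest-neighbour resize, and multiplication as one possible (but here unspecified) way of combining the classification confidence with the box confidence score 304.

```python
import numpy as np

def crop_and_resize(image, box, size=96):
    """Crop the bounding box (x0, y0, x1, y1) from the image and resize it to size x size."""
    x0, y0, x1, y1 = box
    crop = image[y0:y1, x0:x1]
    rows = np.linspace(0, crop.shape[0] - 1, size).astype(int)
    cols = np.linspace(0, crop.shape[1] - 1, size).astype(int)
    return crop[rows][:, cols]                    # nearest-neighbour resize

def classify_region(image, box, box_conf, second_classifier):
    """Classify the gesture region 322 and combine its confidence with the box confidence."""
    region = crop_and_resize(image, box)          # gesture region 322
    first_vec = second_classifier(region)         # normalized first gesture vector
    k = int(np.argmax(first_vec))                 # first gesture 305
    cls_score = float(first_vec[k])               # first gesture confidence score 306
    combined_score = cls_score * box_conf         # combination with Pbox (assumption)
    return k, cls_score, combined_score
```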
[0042] Information of the first gesture 305 provided by the object detection phase 316 is used to determine whether the first gesture 305 is associated with contextual information. Such contextual information is used to determine whether the first gesture 305 or the second gesture 307 is selected to determine the final gesture 340 that is associated with the image data 312. If the first gesture 305 is not associated with contextual information, the first gesture 305 is used to determine the final gesture 340. Conversely, if the first gesture 305 is associated with contextual information, the second gesture 307 is used to determine the final gesture 340. In an example, the first gesture 305 or the second gesture 307 is used to distinguish between a gesture performed near a user's face (e.g., raising the finger to their lips to indicate silence) and a gesture performed in space (e.g., raising a finger), respectively. Examples of contextual information include, but are not limited to, performance of the gesture on and/or near a specific portion of the body, performance of the gesture in a particular environment (e.g., at home, in the office, on the bus, in the library, etc.), previous gestures performed (e.g., an “answer phone call” gesture performed before a “hang up phone” gesture), and/or motion surrounding performance of the gesture (e.g., a pinching gesture and/or spread gesture to zoom in and/or out).
[0043] The first and second gestures 305 and 307 are determined from the gesture region 322 and the entire image 312, respectively, and both are used for determining the final gesture 340 via the post processing phase 330. In the post processing phase 330, the outputs of the detection process 310 and classification process 320 are used together to determine the final gesture 340. The post processing phase 330 includes one or more filters applied on the first and second gestures 305 and 307 and associated confidence scores 304, 306, and 308. The filter function is optionally configured to identify false positives via temporal information from the previous state. In some embodiments, the filter is represented as a function of time applied to the current and previous per-frame results, e.g., as follows:

G_t = F(g_t, g_(t-1), ..., g_(t-n)),

where F is the filter function. In some embodiments, in the post processing phase 330, a selected one of the first gesture 305 and second gesture 307 is required to stabilize for at least a predefined number of successive frames prior to being selected as a final gesture 340.
[0044] Any number of methods can be used to construct the filter function. In some embodiments, the filter function is configured to smooth the outputs to avoid jitters and provide fluent user experiences. Non-limiting examples of filters and filtering techniques include moving or weighted average smoothing (i.e., convolution), Fourier filtering, Kalman filtering, and their variants. Although the above filtering process is described for gestures (e.g., Kcls and Kdet), similar filters and filtering techniques can be applied to detection box smoothing.
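One way such a filter could be realized is sketched below: a moving-average smoothing of the per-gesture confidence over a short window, combined with the stabilization requirement over successive frames mentioned above. The window length, the number of stable frames, and the confidence threshold are illustrative assumptions.

```python
from collections import deque

class GestureFilter:
    """Temporal filter over per-frame (gesture, confidence) results."""

    def __init__(self, window=5, stable_frames=3, min_conf=0.5):
        self.history = deque(maxlen=window)   # recent (gesture, confidence) pairs
        self.stable_frames = stable_frames
        self.min_conf = min_conf

    def update(self, gesture, confidence):
        """Return a gesture only if it has been selected for the last `stable_frames`
        frames and its moving-average confidence clears the threshold; else None."""
        self.history.append((gesture, confidence))
        recent = list(self.history)[-self.stable_frames:]
        if len(recent) < self.stable_frames:
            return None
        if any(g != gesture for g, _ in recent):  # not yet stabilized
            return None
        smoothed = sum(c for g, c in self.history if g == gesture) / len(self.history)
        return gesture if smoothed >= self.min_conf else None
```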
[0045] More details on determining a final gesture 340 based on the first and second gestures 305 and 307 and associated confidence scores 304, 306, and 308 are discussed with reference to Figure 4. It is noted that Figure 3 shows an example process 300 for determination of a hand gesture and that the gesture detection and classification process 300 can be similarly applied to detect one or more of face gestures, arm gestures, body gestures, and/or other classifiable gestures performed by a user.
[0046] Figure 4 is a flowchart of an example post processing technique 400 for determining a final gesture 340 from a first gesture 305 and a second gesture 307 that are determined from image data 312, in accordance with some embodiments. The post processing technique 400 is an embodiment of the one or more processes performed by the post processing phase 330 described above in reference to Figure 3. The post processing technique 400 shows two branches for determining the final gesture 340: the branch on the left based on the second gesture 307 (Kdet), a second gesture confidence score 308 (DetClsi) associated with the second gesture 307, and a box confidence score 304 (Pbox) associated with a bounding box 303 of the first gesture 305; and the branch on the right based on the first gesture 305 (Kcls) and first gesture confidence score 306 (Clsi), as described above in reference to Figure 3.
[0047] Starting at operation 410, the post processing technique 400 obtains a first gesture 305 (Kcls) and the first gesture confidence score 306 (Clsi) using the object detection phase 316. The post processing technique 400 determines, at operation 420, whether the first gesture confidence score 306 (Clsi) of the first gesture 305 (Kcls) is greater than or equal to a second threshold probability (P2). In some embodiments, the second threshold probability P2 is at least 0.10. In some embodiments, the second threshold probability P2 is any probability defined by a user and/or a system implementing the gesture detection and classification process 300 (e.g., at least 0.15). In some embodiments, the second threshold probability P2 is adjusted for the best performance and its values are highly dependent on the accuracy from the detection process 310 (e.g., including the classification phase 318).
[0048] If the first gesture confidence score 306 (Clsi) is below the second threshold probability P2 (“No” at operation 420), the corresponding first gesture 305 (Kcls) is determined to be an invalid gesture for an image (i.e., no gesture is detected 480). Conversely, if the first gesture confidence score 306 (Clsi) is greater than or equal to the second threshold probability P2 (“Yes” at operation 420), the corresponding first gesture 305 (Kcls) remains as a candidate gesture for the final gesture 340 and is utilized at operation 430. At operation 430, the post processing technique 400 determines whether the first gesture 305 (Kcls) is contextual (i.e., whether the first gesture 305 is associated with contextual information). If the first gesture 305 (Kcls) is not contextual (“No” at operation 430), the post processing technique 400 determines that the first gesture 305 (Kcls) is the gesture class 490 (i.e., the final gesture 340). Alternatively, if the first gesture 305 (Kcls) is contextual (“Yes” at operation 430), the post processing technique 400 proceeds to operation 460 and utilizes the first gesture 305 (Kcls) in conjunction with a second gesture 307 to determine the final gesture 340 (where the first gesture 305 is focused on the gesture region 322 and the second gesture 307 is based on the entire image).
[0049] Turning to operation 440, the post processing technique 400 obtains the second gesture 307 (Kdet), the second gesture confidence score 308 (DetClsi) associated with the second gesture 307, and the box confidence score 304 (Pbox). It is determined, at operation 450, whether the second gesture confidence score 308 (DetClsi) of the second gesture 307 is greater than or equal to a first threshold probability (P1). In some embodiments, the first threshold probability P1 is at least 0.10. In some embodiments, the first threshold probability P1 is any probability defined by a user and/or a system implementing the gesture detection and classification process 300 (e.g., at least 0.15). In some embodiments, the first threshold probability P1 is adjusted for the best performance and its values are highly dependent on the accuracy from the detection process 310 and classification process 320.
[0050] If the second gesture confidence score 308 (DetClsi) of the second gesture 307 (Kdet) is below the first threshold probability P1 (“No” at operation 450), the second gesture 307 (Kdet) is determined to be an invalid gesture for the image (i.e., no gesture is detected 480). Conversely, if the second gesture confidence score 308 (DetClsi) of the second gesture 307 (Kdet) is greater than or equal to the first threshold probability P1 (“Yes” at operation 450), the second gesture 307 (Kdet) remains as a candidate gesture for the final gesture 340 and is utilized at operation 460.
[0051] At operation 460, the post processing technique 400 determines whether the second gesture 307 (Kdet) is the same as the first gesture 305. If the second gesture 307 (Kdet) and the first gesture 305 (Kcls) are not the same (“No” at operation 460), the post processing technique 400 determines that the potential gestures are invalid (i.e., no gesture is detected 480). Alternatively, if the second gesture 307 (Kdet) and the first gesture 305 (Kcls) are the same (“Yes” at operation 460), the post processing technique 400 proceeds to operation 470 and determines whether a third confidence score 402 is greater than a third threshold probability (P3). The third confidence score 402 is equal to the second gesture confidence score 308 (DetClsi) for the second gesture 307 (Kdet) times the box confidence score 304 (Pbox). In Figure 4, the third confidence score 402 is represented by DetClsKdet*Pbox. In some embodiments, the third threshold probability P3 is at least 0.10. In some embodiments, the third threshold probability P3 is any probability defined by a user and/or a system implementing the gesture detection and classification process 300 (e.g., at least 0.15). In some embodiments, the third threshold probability P3 is adjusted for the best performance and its values are highly dependent on the accuracy from the detection process 310 and classification process 320.
[0052] If the third confidence score 402 is less than the third threshold probability P3 (“No” at operation 470), the second gesture 307 (Kdet) is determined to be invalid (i.e., no gesture is detected 480). If the third confidence score 402 is greater than the third threshold probability P3 (“Yes” at operation 470), the post processing technique 400 determines that the second gesture 307 (Kdet) is the gesture class 490 (i.e., the final gesture 340). By combining the information provided by the detection process 310 and the classification process 320, a holistic determination is made as to whether a gesture is contextual or not (rather than relying solely on the classification process). This holistic approach in the techniques described above improves an electronic device’s ability to detect and recognize gestures compared with existing solutions.
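The decision flow of Figure 4 can be condensed into a short sketch, shown below, using the names from the text (Kcls/Cls from the cropped region, Kdet/DetCls/Pbox from the whole image, thresholds P1, P2, and P3). The figure evaluates the two branches in parallel; this sketch evaluates them sequentially for clarity, and the threshold values and contextual-gesture set are illustrative assumptions.

```python
P1 = P2 = P3 = 0.10                                          # assumed threshold values
CONTEXTUAL_GESTURES = {"silence", "answer_call", "hang_up"}  # hypothetical set

def final_gesture(kcls, cls_score, kdet, detcls_score, pbox):
    """Return the gesture class 490, or None when no gesture is detected (480)."""
    if cls_score < P2:                   # operation 420: first gesture confidence too low
        return None
    if kcls not in CONTEXTUAL_GESTURES:  # operation 430: non-contextual, use first gesture
        return kcls
    if detcls_score < P1:                # operation 450: second gesture confidence too low
        return None
    if kdet != kcls:                     # operation 460: the two branches disagree
        return None
    if detcls_score * pbox < P3:         # operation 470: third confidence score vs. P3
        return None
    return kdet                          # contextual gesture confirmed by both branches
```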
[0053] Figure 5 is a flow diagram of a method 500 of classifying one or more gestures, in accordance with some embodiments. The method 500 includes one or more operations described above in reference to Figures 3 and 4. Method 500 provides a solution for gesture recognition (e.g., hand gestures, face gestures, arm gestures, etc.) across different electronic devices and/or systems (e.g., as described above in reference to Figures 1 and 2). The gesture determination method 500 increases the accuracy of local gesture classification (e.g., classification of gestures that do not rely on context) and contextual gesture classification (e.g., classification of situational gestures based on context) relative to existing solutions. For example, in some embodiments, the gesture determination process 500 has shown at least a 4-10% increase in accuracy for local gesture classification over existing solutions and at least a 43-120% increase in accuracy for contextual gesture classification over existing solutions.
[0054] Operations (e.g., steps) of the method 500 are performed by one or more processors (e.g., CPU 202; Figure 2) of an electronic device (e.g., at a server 102 and/or a client device 104). At least some of the operations shown in Figure 5 correspond to instructions stored in a computer memory or computer-readable storage medium (e.g., memory 206; Figure 2). Operations 502-512 can also be performed, in part, using one or more processors and/or using instructions stored in memory or a computer-readable medium of one or more devices communicatively coupled together (e.g., a laptop, AR glasses (or other head-mounted display), a server, a tablet, a security camera, a drone, a smart television, a smart speaker, a toy, a smartwatch, a smart appliance, or other computing device that can perform operations 502-512 alone or in conjunction with respective processors of communicatively coupled electronic devices 200).
[0055] The method 500 includes obtaining (502) an image 312 including a hand region 322, detecting (504) the hand region 322 in the image 312, and determining (506) a first hand gesture 305 from the hand region of the image. For example, as shown above in reference to Figure 3, image data 312 is obtained and the detection process 310 (which can be used to crop an image) and the classification process 320 are applied to the image data 312 (after the image data 312 is processed by a first detection and classification model 314) to determine a classification gesture vector for a cropped image 322 (i.e., the hand region 322) that is fused with contextual information.
[0056] In some embodiments, determining the first hand gesture 305 from the hand region 322 of the image 312 includes generating a first gesture vector from the hand region 322 of the image, each element of the first gesture vector corresponding to a respective hand gesture and representing a respective first confidence level of the hand region 322 including the respective hand gesture, and determining the first hand gesture 305 and a first gesture confidence score 306 from the first gesture vector. For example, as shown above in reference to Figure 3, an object detection phase 316 is applied to detect one or more gestures of the image data 312 (after it is passed through the first detection and classification model 314), which are cropped and used to determine a first gesture vector. In some embodiments, the method 500 further includes associating detection of the hand region 322 in the image 312 with a bounding box confidence score 304 and combining the bounding box confidence score 304 with the first gesture confidence score 306 (Clsi) of the first hand gesture 305. For example, as shown above in reference to Figure 3, an output of the classification process 320 can be combined with an output of the object detection phase 316 (e.g., the bounding box confidence scores).
[0057] In some embodiments, the first hand gesture 305 includes the respective hand gesture corresponding to a largest first confidence level of the respective first confidence level of each element of the first gesture vector, and the first gesture confidence score 306 is equal to the largest first confidence level of the respective first confidence level of each element of the first gesture vector. In other words, the first hand gesture 305 can have a confidence score (e.g., first gesture confidence score 306) greater than other gestures in a respective set of one or more gestures (e.g., the first set of one or more gestures described above in reference to Figure 3).
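For illustration only, the following Python sketch traces the data flow of paragraphs [0055]-[0057]: a detector proposes a hand bounding box, the cropped hand region is classified into a gesture vector, the top-scoring entry is taken as the first hand gesture, and its score is fused with the bounding-box confidence. The `detector` and `classifier` callables, their output formats, and the multiplicative fusion are assumptions made for illustration, not an implementation required by the disclosure.

```python
import numpy as np

def classify_hand_crop(image, detector, classifier):
    # Detection phase 316 (assumed interface): bounding box plus box confidence P_box.
    (x1, y1, x2, y2), p_box = detector(image)

    # Crop the hand region 322 from the full image 312.
    hand_region = image[y1:y2, x1:x2]

    # Classification phase 320: one confidence per supported gesture
    # (the "first gesture vector" of paragraph [0056]).
    first_gesture_vector = classifier(hand_region)      # shape: (num_gestures,)

    # Paragraph [0057]: the first hand gesture 305 is the index of the largest
    # confidence level, and that confidence is the first gesture confidence score 306.
    k_cls = int(np.argmax(first_gesture_vector))
    cls_score = float(first_gesture_vector[k_cls])

    # One possible way to combine the classification score with the bounding-box
    # confidence 304; multiplication is an assumption, as the exact fusion is
    # not fixed by this paragraph.
    combined_score = cls_score * p_box
    return k_cls, cls_score, p_box, combined_score
```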
[0058] The method 500 includes determining (508) a second hand gesture 307 from the image (e.g., the entire image). In some embodiments, determining the second hand gesture 307 from the image includes generating a second gesture vector from the image (e.g., the entire image), each element of the second gesture vector corresponding to a respective hand gesture and representing a respective second confidence level of the image including the respective hand gesture; and determining the second hand gesture 307 and a second gesture confidence score 308 from the second gesture vector.
[0059] In some embodiments, the second hand gesture 307 includes the respective hand gesture corresponding to a largest second confidence level of the respective second confidence level of each element of the second gesture vector, and the second gesture confidence score 308 is equal to the largest second confidence level of the respective second confidence level of each element of the second gesture vector. In other words, the second hand gesture 307 can have a confidence score (e.g., second confidence score 308) greater than other gestures in a respective set of one or more gestures (e.g., second set of one or more gestures as described above in reference to Figure 3).
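A minimal sketch of the whole-image branch of paragraphs [0058]-[0059] follows. The `context_classifier` callable is a hypothetical stand-in for the model that receives the entire image (and therefore the surrounding context); the same top-1 rule as in the hand-crop branch selects the second hand gesture and its confidence score.

```python
import numpy as np

def classify_whole_image(image, context_classifier):
    # Second gesture vector: one confidence per supported gesture, computed
    # from the entire image rather than from the cropped hand region.
    second_gesture_vector = context_classifier(image)    # shape: (num_gestures,)

    # Paragraph [0059]: the second hand gesture 307 is the index of the largest
    # confidence level, which also serves as the second gesture confidence score 308.
    k_det = int(np.argmax(second_gesture_vector))
    det_score = float(second_gesture_vector[k_det])
    return k_det, det_score
```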
[0060] The method 500 includes, in accordance with a determination that the first hand gesture 305 is not any of a plurality of contextual gestures, determining (510) that a final hand gesture of the image is the first hand gesture 305. For example, as shown above in reference to Figure 4, in accordance with a determination that a gesture Kcls (e.g., a gesture corresponding to the first hand gesture 305) is not contextual ("No" at operation 430), the gesture Kcls is determined to be the final gesture 340. In some embodiments, the method 500 includes, before determining whether the first hand gesture 305 is at least one of the plurality of contextual gestures, determining whether the first gesture confidence score 306 is greater than a second threshold P2 and, in accordance with a determination that the first gesture confidence score 306 is less than the second threshold P2, determining that the image is not associated with any hand gesture. For example, as shown above in reference to Figure 4, in accordance with a determination that a confidence score (e.g., ClsKcls) of a gesture Kcls is less than the second threshold probability P2 ("No" at operation 420), the method 500 determines that there is no gesture present.
[0061] In some embodiments, the method 500 includes, before determining whether the first hand gesture 305 is at least one of the plurality of contextual gestures, determining whether the second gesture confidence score 308 (DetClsi) of the second hand gesture 307 is greater than a first threshold P1 and, in accordance with a determination that the second gesture confidence score 308 (DetClsi) of the second hand gesture 307 is less than the first threshold P1, determining that the image is not associated with any hand gesture. For example, as shown above in reference to Figure 4, in accordance with a determination that a confidence score (e.g., DetClsKdet) of a gesture Kdet is less than the first threshold probability P1 ("No" at operation 450), the method 500 determines that there is no gesture present.
[0062] The method 500 further includes, in accordance with a determination that the first hand gesture 305 is one of the plurality of contextual gestures, determining (512) the final hand gesture based on the second hand gesture 307 and a second gesture confidence score 308, the second hand gesture 307 and the second gesture confidence score 308 being associated with the image (e.g., the entire image). In some embodiments, determining the final hand gesture 340 based on the second hand gesture 307 and the second gesture confidence score 308 further includes, in accordance with a determination that the first and second hand gestures 305 and 307 are distinct from each other, determining that the image 312 is not associated with any hand gesture. For example, as shown above in reference to Figure 4, in accordance with a determination that gesture Kdet and gesture Kcls are not the same ("Yes" at operation 460), the method 500 determines that there is no gesture present.
[0063] Alternatively, in some embodiments, in accordance with a determination that the first and second hand gestures 305 and 307 are identical to each other and in accordance with a determination that a third confidence score 402 does not exceed a comprehensive confidence threshold (e.g., P3 in Figure 4), the method 500 includes determining that the image is not associated with any hand gesture. For example, as shown above in reference to Figure 4, in accordance with a determination that a third confidence score 402 (e.g., DetClsKdet * Pbox, the confidence score of the gesture Kdet multiplied by the box confidence score 304 (Pbox) output by the object detection phase 316 of Figure 3) is less than the third threshold probability P3 ("No" at operation 470), the method 500 determines that there is no gesture present. In some embodiments, in accordance with a determination that the first and second hand gestures 305 and 307 are identical to each other and in accordance with a determination that the third confidence score 402 exceeds the comprehensive confidence threshold P3, the method 500 includes determining that the final hand gesture 340 is the second hand gesture 307. For example, as shown above in reference to Figure 4, in accordance with a determination that the third confidence score 402 is greater than or equal to the third threshold probability P3 ("Yes" at operation 470), the method 500 determines that the second hand gesture (Kdet) is the final gesture 340.
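Putting paragraphs [0060]-[0063] together, a minimal sketch of the decision flow might look like the following. The ordering of the early threshold checks, the variable names, and the use of `None` to represent "no gesture is associated with the image" are illustrative assumptions rather than the required implementation.

```python
def final_gesture(k_cls, cls_score,      # first gesture 305 and its score 306 (hand crop)
                  k_det, det_score,      # second gesture 307 and its score 308 (whole image)
                  p_box,                 # bounding box confidence score 304
                  contextual_gestures,   # the plurality of contextual gestures
                  p1, p2, p3):           # thresholds P1, P2 and comprehensive threshold P3
    # Paragraph [0060]: reject low-confidence crop classifications.
    if cls_score < p2:
        return None
    # Paragraph [0061]: reject low-confidence whole-image classifications.
    if det_score < p1:
        return None
    # Paragraph [0060]: a non-contextual first gesture is accepted directly.
    if k_cls not in contextual_gestures:
        return k_cls
    # Paragraph [0062]: a contextual gesture must agree across both branches.
    if k_det != k_cls:
        return None
    # Paragraph [0063]: comprehensive confidence DetCls_Kdet * P_box compared with P3.
    if det_score * p_box < p3:
        return None
    return k_det
```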
[0064] In some embodiments, the first threshold P1, the second threshold P2, and the comprehensive confidence threshold P3 are tuned for optimal performance and depend strongly on the accuracy of the processes 310 and 320.
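Because paragraph [0064] does not fix the threshold values, one hypothetical way to select them is a coarse grid search against a labeled validation set, as sketched below. The `evaluate` helper is assumed to run the Figure 4 decision flow with the candidate thresholds and return an accuracy figure; both the helper and the grid values are illustrative assumptions.

```python
from itertools import product

def tune_thresholds(evaluate, grid=(0.3, 0.4, 0.5, 0.6, 0.7)):
    best = None
    for p1, p2, p3 in product(grid, repeat=3):
        acc = evaluate(p1=p1, p2=p2, p3=p3)   # accuracy on a validation set (assumed helper)
        if best is None or acc > best[0]:
            best = (acc, p1, p2, p3)
    return best  # (accuracy, P1, P2, P3)
```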
[0065] In some embodiments, the method 500 further includes filtering the final hand gesture using a filtering function. The filtering function is configured to identify false positives. In some embodiments, the filtering function is one of a convolution function (or a moving and/or weighted average smoothing function), a Fourier filtering function, or a Kalman filter. In some embodiments, the filtering function is a function of time, which allows the determination of the first and/or second hand gesture to stabilize (e.g., stabilize for at least 5 successive frames). In some embodiments, the filtering function smooths both the detection boxes and the identified classes, avoids jittering and loss of detection in the processes, and simplifies engineering implementation (e.g., implementation of gesture controls for volume adjustment). Additional information on the filter is provided above with reference to Figure 3.

[0066] It should be understood that the particular order in which the operations in Figure 5 have been described is merely exemplary and is not intended to indicate that the described order is the only order in which the operations could be performed. One of ordinary skill in the art would recognize various ways to perform the operations described herein. Additionally, it should be noted that details of other processes described above with respect to Figures 3 and 4 are also applicable in an analogous manner to the method 500 described above with respect to Figure 5. For brevity, these details are not repeated here.
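Returning to the filtering of paragraph [0065], the sketch below illustrates one possible temporal smoothing: it keeps a short history of per-frame results, reports a gesture only once the same class has been stable for a full window of successive frames, and applies a moving average to the detection box. The window length, the unanimity rule, and the class/box split are assumptions; a weighted average, Fourier, or Kalman filter could be substituted.

```python
from collections import Counter, deque

class GestureSmoother:
    def __init__(self, window: int = 5):
        self.classes = deque(maxlen=window)   # recent final gestures (None = no gesture)
        self.boxes = deque(maxlen=window)     # recent bounding boxes (x1, y1, x2, y2)

    def update(self, gesture, box):
        self.classes.append(gesture)
        self.boxes.append(box)

        # Report a gesture only when the same class fills the whole window,
        # which suppresses single-frame false positives and class jitter.
        gesture_out = None
        if len(self.classes) == self.classes.maxlen:
            candidate, count = Counter(self.classes).most_common(1)[0]
            if candidate is not None and count == len(self.classes):
                gesture_out = candidate

        # Moving-average smoothing of the bounding box coordinates.
        n = len(self.boxes)
        box_out = tuple(sum(b[i] for b in self.boxes) / n for i in range(4))
        return gesture_out, box_out
```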
[0067] The terminology used in the description of the various described implementations herein is for the purpose of describing particular implementations only and is not intended to be limiting. As used in the description of the various described implementations and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms "includes," "including," "comprises," and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Additionally, it will be understood that, although the terms "first," "second," etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another.
[0068] As used herein, the term "if" is, optionally, construed to mean "when" or "upon" or "in response to determining" or "in response to detecting" or "in accordance with a determination that," depending on the context. Similarly, the phrase "if it is determined" or "if [a stated condition or event] is detected" is, optionally, construed to mean "upon determining" or "in response to determining" or "upon detecting [the stated condition or event]" or "in response to detecting [the stated condition or event]" or "in accordance with a determination that [a stated condition or event] is detected," depending on the context.
[0069] The foregoing description, for purpose of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the claims to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain principles of operation and practical applications, to thereby enable others skilled in the art to make use of the described embodiments with various modifications as are suited to the particular uses contemplated.
[0070] Although various drawings illustrate a number of logical stages in a particular order, stages that are not order dependent may be reordered and other stages may be combined or broken out. While some reordering or other groupings are specifically mentioned, others will be obvious to those of ordinary skill in the art, so the ordering and groupings presented herein are not an exhaustive list of alternatives. Moreover, it should be recognized that the stages can be implemented in hardware, firmware, software, or any combination thereof.

Claims

What is claimed is:
1. A method for classifying a gesture, implemented by an electronic device, the method comprising: obtaining an image including a hand region; detecting the hand region in the image; determining a second hand gesture from the image; determining a first hand gesture from the hand region of the image; in accordance with a determination that the first hand gesture is not any of a plurality of contextual gestures, determining that a final hand gesture of the image is the first hand gesture; and in accordance with a determination that the first hand gesture is one of the plurality of contextual gestures, determining the final hand gesture based on the second hand gesture and a second gesture confidence score, the second hand gesture and the second gesture confidence score associated with the image.
2. The method of claim 1, wherein determining the second hand gesture from the image further comprises: generating a second gesture vector from the image, each element of the second gesture vector corresponding to a respective hand gesture and representing a respective second confidence level of the image including the respective hand gesture; and determining the second hand gesture and the second gesture confidence score from the second gesture vector.
3. The method of claim 2, wherein the second hand gesture includes the respective hand gesture corresponding to a largest second confidence level of the respective second confidence level of each element of the second gesture vector, and the second gesture confidence score is equal to the largest second confidence level of the respective second confidence level of each element of the second gesture vector.
4. The method of any of claims 1-3, further comprising: before determining whether the first hand gesture is at least one of the plurality of contextual gestures, determining whether a first gesture confidence score is greater than a second threshold; and in accordance with a determination that the first gesture confidence score is less than the second threshold, determining that the image is not associated with any hand gesture.
5. The method of any of the preceding claims, wherein determining the first hand gesture from the hand region of the image further comprises: generating a first gesture vector from the hand region of the image, each element of the first gesture vector corresponding to a respective hand gesture and representing a respective first confidence level of the hand region including the respective hand gesture; and determining the first hand gesture and a first gesture confidence score from the first gesture vector.
6. The method of claim 5, further comprising: associating detection of the hand region in the image with a bounding box confidence score; and combining the bounding box confidence score with a confidence score associated with the first hand gesture to generate the first gesture confidence score.
7. The method of claim 5 or 6, wherein the first hand gesture includes the respective hand gesture corresponding to a largest first confidence level of the respective first confidence level of each element of the first gesture vector, and the first gesture confidence score is equal to the largest first confidence level of the respective first confidence level of each element of the first gesture vector.
8. The method of any of claims 5-7, further comprising: before determining whether the first hand gesture is at least one of the plurality of contextual gestures, determining whether the second gesture confidence score of the second hand gesture is greater than a first threshold; and in accordance with a determination that the second gesture confidence score of the second hand gesture is less than the first threshold, determining that the image is not associated with any hand gesture.
9. The method of any of the preceding claims, wherein determining the final hand gesture based on the second hand gesture and the second gesture confidence score further comprises: in accordance with a determination that the first and second hand gestures are distinct from each other, determining that the image is not associated with any hand gesture; and in accordance with a determination that the first and second hand gestures are identical to each other: in accordance with a determination that a third confidence score exceeds a comprehensive confidence threshold, determining that the final hand gesture is the second hand gesture; and in accordance with a determination that the third confidence score does not exceed the comprehensive confidence threshold, determining that the image is not associated with any hand gesture.
10. The method of any of the preceding claims, further comprising filtering the final hand gesture using a filtering function, wherein the filtering function is configured to identify false positives.
11. The method of claim 10, wherein the filtering function is one of a convolution function, a Fourier filtering function, or a Kalman filter.
12. The method of claim 10 or 11, wherein the filtering function is a function of time.
13. An electronic device, comprising: one or more processors; and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform a method of any of claims 1-12.
14. A non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the one or more processors to perform a method of any of claims 1-12.

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2022/029161 WO2023219629A1 (en) 2022-05-13 2022-05-13 Context-based hand gesture recognition

Publications (1)

Publication Number Publication Date
WO2023219629A1 true WO2023219629A1 (en) 2023-11-16

Family

ID=88730787

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2022/029161 WO2023219629A1 (en) 2022-05-13 2022-05-13 Context-based hand gesture recognition

Country Status (1)

Country Link
WO (1) WO2023219629A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110234490A1 (en) * 2009-01-30 2011-09-29 Microsoft Corporation Predictive Determination
US20120093360A1 (en) * 2010-10-19 2012-04-19 Anbumani Subramanian Hand gesture recognition
US20130155237A1 (en) * 2011-12-16 2013-06-20 Microsoft Corporation Interacting with a mobile device within a vehicle using gestures
US20150009124A1 (en) * 2013-07-08 2015-01-08 Augumenta Ltd. Gesture based user interface
US20150253863A1 (en) * 2014-03-06 2015-09-10 Avago Technologies General Ip (Singapore) Pte. Ltd. Image Processor Comprising Gesture Recognition System with Static Hand Pose Recognition Based on First and Second Sets of Features

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MOIN ALI, ZHOU ANDY, RAHIMI ABBAS, MENON ALISHA, BENATTI SIMONE, ALEXANDROV GEORGE, TAMAKLOE SENAM, TING JONATHAN, YAMAMOTO NATASH: "A wearable biosensing system with in-sensor adaptive machine learning for hand gesture recognition", NATURE ELECTRONICS, vol. 4, no. 1, pages 54 - 63, XP093110800, ISSN: 2520-1131, DOI: 10.1038/s41928-020-00510-8 *

Similar Documents

Publication Publication Date Title
US9661214B2 (en) Depth determination using camera focus
KR101978299B1 (en) Apparatus for service contents in contents service system
KR102628898B1 (en) Method of processing image based on artificial intelligence and image processing device performing the same
CN115699082A (en) Defect detection method and device, storage medium and electronic equipment
CN111612842A (en) Method and device for generating pose estimation model
WO2023101679A1 (en) Text-image cross-modal retrieval based on virtual word expansion
WO2021092600A2 (en) Pose-over-parts network for multi-person pose estimation
WO2023219629A1 (en) Context-based hand gesture recognition
WO2023277877A1 (en) 3d semantic plane detection and reconstruction
WO2023277888A1 (en) Multiple perspective hand tracking
CN113409204A (en) Method and device for optimizing image to be processed, storage medium and electronic equipment
CN113362260A (en) Image optimization method and device, storage medium and electronic equipment
WO2023211444A1 (en) Real-time on-device large-distance gesture recognition with lightweight deep learning models
CN112565586A (en) Automatic focusing method and device
WO2024076343A1 (en) Masked bounding-box selection for text rotation prediction
US20240087344A1 (en) Real-time scene text area detection
CN117576245B (en) Method and device for converting style of image, electronic equipment and storage medium
WO2023063944A1 (en) Two-stage hand gesture recognition
WO2023219612A1 (en) Adaptive resizing of manipulatable and readable objects
WO2023129162A1 (en) Real-time lightweight video tracking, processing, and rendering
WO2023091129A1 (en) Plane-based camera localization
WO2023023160A1 (en) Depth information reconstruction from multi-view stereo (mvs) images
WO2022103412A1 (en) Methods for recognition of air-swipe gestures
WO2023229590A1 (en) Deep learning based video super-resolution
WO2023091131A1 (en) Methods and systems for retrieving images based on semantic plane features

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22941839

Country of ref document: EP

Kind code of ref document: A1