WO2023063944A1 - Two-stage hand gesture recognition - Google Patents

Two-stage hand gesture recognition

Info

Publication number
WO2023063944A1
Authority
WO
WIPO (PCT)
Prior art keywords
hand
gesture
input
image
network
Prior art date
Application number
PCT/US2021/054825
Other languages
French (fr)
Inventor
Xiang Li
Jie Lu
Yang Zhou
Yuan Tian
Kai Xu
Original Assignee
Innopeak Technology, Inc.
Priority date
Filing date
Publication date
Application filed by Innopeak Technology, Inc. filed Critical Innopeak Technology, Inc.
Priority to PCT/US2021/054825
Publication of WO2023063944A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017Gesture based interaction, e.g. based on a set of recognized hand gestures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/03Arrangements for converting the position or the displacement of a member into a coded form
    • G06F3/0304Detection arrangements using opto-electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/28Recognition of hand or arm movements, e.g. recognition of deaf sign language

Definitions

  • This application relates generally to image data processing technology including, but not limited to, methods, systems, and non-transitory computer-readable media for using deep learning techniques to recognize a hand gesture in an image.
  • Touchless air gestures are used to implement certain user interface functions for electronic devices having no touch screens, e.g., drones, smart television devices, and head-mounted displays (e.g., virtual reality headsets, augmented reality glasses, mixed reality headsets). These devices have no touch screens and include front-facing optical cameras, radar or ultrasound systems, and/or depth sensors to track human hands in real time. Features of a human body and/or face (e.g., key points of the hand) can be extracted from an image to help determine a hand location and improve an accuracy level of gesture recognition. Some head-mounted displays have implemented hand tracking functions to complete user interactions including selecting, clicking, and typing on a virtual keyboard.
  • Air gestures can also be used on devices having touch screens when a user’s hands are not available to touch the screen (e.g., while preparing a meal, the user can use air gestures to scroll down a recipe so that the user does not need to touch the device screen with wet hands).
  • Air gesture recognition increases a cost of an electronic device, particularly when deep learning techniques are applied to enhance an accuracy level.
  • Typical key points or human skeleton algorithms are computationally expensive and rarely work with hand gesture classification in real time on electronic devices that have limited computing resources. These algorithms are often implemented using multi-stage frameworks that either have a high latency or demand powerful hardware. It would be beneficial to have a more efficient air gesture recognition mechanism than the current practice.
  • Hand gesture recognition enables user interfaces in many electronic devices (e.g., drones, smart television devices, and head-mounted displays) where human hand gestures are captured and interpreted as commands.
  • a two-stage hand recognition framework is implemented by an electronic device having limited computation resources to recognize hand gestures from high-resolution inputs in real time. High-resolution cameras are applied to capture the high-resolution inputs, e.g., to preserve more details for distant hands.
  • a high-resolution image is down-sampled to a pre-defined resolution, allowing a detection network to estimate a hand location and determine a first gesture label vector from a low-resolution down-sampled image.
  • the high-resolution image is then cropped according to the hand location, and a second gesture label vector is determined from a high-resolution cropped image.
  • a hand gesture is identified for the high-resolution image based on both the first and second gesture label vectors.
  • a method is implemented at an electronic device (e.g., a mobile device) for recognizing hand gestures in an image.
  • the method includes obtaining an input image including an input hand region where a hand is captured, down-sampling the input image to a first image, and determining a first gesture label vector from the first image.
  • the method further includes detecting in the first image a first hand region where the hand is captured; in accordance with the first hand region in the first image, cropping the input image to the input hand region corresponding to the first hand region of the first image; and determining a second gesture label vector from the input hand region of the input image.
  • the method further includes associating the input hand region with a first hand gesture based on both the first and second gesture label vectors.
  • the first gesture label vector and the second gesture label vector are combined to generate a comprehensive gesture label vector.
  • the first hand gesture is selected from a plurality of predefined hand gestures based on the comprehensive gesture label vector.
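  • As a minimal illustration of the two-stage flow summarized above (not code from the patent), the following Python sketch chains the down-sampling, detection, cropping, second classification, and weighted fusion steps; the two network callables, input sizes, and weights are hypothetical placeholders.

```python
import cv2
import numpy as np

def recognize_hand_gesture(input_image, detection_net, second_gesture_net,
                           first_input_size=(224, 224), second_input_size=(128, 128),
                           w1=0.5, w2=0.5):
    """Two-stage sketch: classify a down-sampled frame, then re-classify a
    high-resolution crop of the detected hand region and fuse the label vectors.
    `detection_net` is assumed to return (gesture_label_vector, bbox) and
    `second_gesture_net` a gesture_label_vector; both vectors sum to 1."""
    # Stage 1: down-sample the high-resolution input to the detector's input size.
    first_image = cv2.resize(input_image, first_input_size)
    first_vector, (x, y, w, h) = detection_net(first_image)  # bbox in first_image coords

    # Map the detected hand region back to input-image coordinates and crop.
    sx = input_image.shape[1] / first_image.shape[1]
    sy = input_image.shape[0] / first_image.shape[0]
    crop = input_image[int(y * sy):int((y + h) * sy), int(x * sx):int((x + w) * sx)]

    # Stage 2: classify the high-resolution crop at the second network's input size.
    second_vector = second_gesture_net(cv2.resize(crop, second_input_size))

    # Fuse the two gesture label vectors and pick the most probable gesture.
    comprehensive = w1 * np.asarray(first_vector) + w2 * np.asarray(second_vector)
    return int(np.argmax(comprehensive)), comprehensive
```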
  • some implementations include an electronic device that includes one or more processors and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform any of the above methods.
  • some implementations include a non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform any of the above methods.
  • Figure 1 is an example data processing environment having one or more servers communicatively coupled to one or more client devices, in accordance with some embodiments.
  • Figure 2 is a block diagram illustrating a data processing system, in accordance with some embodiments.
  • Figure 3 is an example data processing environment for training and applying a neural network-based data processing model for processing visual and/or audio data, in accordance with some embodiments.
  • Figure 4A is an example neural network applied to process content data in an NN-based data processing model, in accordance with some embodiments.
  • Figure 4B is an example node in the neural network, in accordance with some embodiments.
  • Figure 5 is a flow diagram of a hand gesture recognition process in which hand gestures are identified in an image in real time, in accordance with some embodiments.
  • Figure 6 is a flowchart of a method for recognizing a hand gesture in an image, in accordance with some embodiments.
  • FIG. 1 is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, in accordance with some embodiments.
  • the one or more client devices 104 may be, for example, desktop computers 104A, tablet computers 104B, mobile phones 104C, a head-mounted display (HMD) (also called augmented reality (AR) glasses) 104D, or intelligent, multi-sensing, network-connected home devices (e.g., a surveillance camera 104E, a smart television device, a drone).
  • Each client device 104 can collect data or user inputs, execute user applications, and present outputs on its user interface.
  • the collected data or user inputs can be processed locally at the client device 104 and/or remotely by the server(s) 102.
  • the one or more servers 102 provide system data (e.g., boot files, operating system images, and user applications) to the client devices 104, and in some embodiments, process the data and user inputs received from the client device(s) 104 when the user applications are executed on the client devices 104.
  • the data processing environment 100 further includes a storage 106 for storing data related to the servers 102, client devices 104, and applications executed on the client devices 104.
  • the one or more servers 102 can enable real-time data communication with the client devices 104 that are remote from each other or from the one or more servers 102. Further, in some embodiments, the one or more servers 102 can implement data processing tasks that cannot be or are preferably not completed locally by the client devices 104.
  • the client devices 104 include a game console (e.g., the HMD 104D) that executes an interactive online gaming application.
  • the game console receives a user instruction and sends it to a game server 102 with user data.
  • the game server 102 generates a stream of video data based on the user instruction and user data and provides the stream of video data for display on the game console and other client devices that are engaged in the same game session with the game console.
  • the client devices 104 include a networked surveillance camera and a mobile phone 104C.
  • the networked surveillance camera collects video data and streams the video data to a surveillance camera server 102 in real time. While the video data is optionally pre-processed on the surveillance camera, the surveillance camera server 102 processes the video data to identify motion or audio events in the video data and shares information of these events with the mobile phone 104C, thereby allowing a user of the mobile phone 104C to monitor, remotely and in real time, the events occurring near the networked surveillance camera.
  • the one or more servers 102, one or more client devices 104, and storage 106 are communicatively coupled to each other via one or more communication networks 108, which are the medium used to provide communications links between these devices and computers connected together within the data processing environment 100.
  • the one or more communication networks 108 may include connections, such as wire, wireless communication links, or fiber optic cables. Examples of the one or more communication networks 108 include local area networks (LAN), wide area networks (WAN) such as the Internet, or a combination thereof.
  • the one or more communication networks 108 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol.
  • a connection to the one or more communication networks 108 may be established either directly (e.g., using 3G/4G connectivity to a wireless carrier), or through a network interface 110 (e.g., a router, switch, gateway, hub, or an intelligent, dedicated whole-home control node), or through any combination thereof.
  • the one or more communication networks 108 can represent the Internet, a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another.
  • At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational and other computer systems that route data and messages.
  • deep learning techniques are applied in the data processing environment 100 to process content data (e.g., video data, visual data, audio data) obtained by an application executed at a client device 104 to identify information contained in the content data, match the content data with other data, categorize the content data, or synthesize related content data.
  • the content data may broadly include inertial sensor data captured by inertial sensor(s) of a client device 104.
  • data processing models are created based on one or more neural networks to process the content data. These data processing models are trained with training data before they are applied to process the content data. Subsequent to model training, the mobile phone 104C or HMD 104D obtains the content data (e.g., captures video data via an internal camera) and processes the content data using the data processing models locally.
  • both model training and data processing are implemented locally at each individual client device 104 (e.g., the mobile phone 104C and HMD 104D).
  • the client device 104 obtains the training data from the one or more servers 102 or storage 106 and applies the training data to train the data processing models.
  • both model training and data processing are implemented remotely at a server 102 (e.g., the server 102A) associated with a client device 104 (e.g., the client device 104A and HMD 104D).
  • the server 102A obtains the training data from itself, another server 102 or the storage 106 and applies the training data to train the data processing models.
  • the client device 104 obtains the content data, sends the content data to the server 102A (e.g., in an application) for data processing using the trained data processing models, receives data processing results (e.g., recognized hand gestures) from the server 102A, presents the results on a user interface (e.g., associated with the application), renders virtual objects in a field of view based on the poses, or implements some other functions based on the results.
  • the client device 104 itself implements no or little data processing on the content data prior to sending them to the server 102A.
  • data processing is implemented locally at a client device 104 (e.g., the client device 104B and HMD 104D), while model training is implemented remotely at a server 102 (e.g., the server 102B) associated with the client device 104.
  • the server 102B obtains the training data from itself, another server 102 or the storage 106 and applies the training data to train the data processing models.
  • the trained data processing models are optionally stored in the server 102B or storage 106.
  • the client device 104 imports the trained data processing models from the server 102B or storage 106, processes the content data using the data processing models, and generates data processing results to be presented on a user interface or used to initiate some functions (e.g., rendering virtual objects based on device poses) locally.
  • a pair of AR glasses 104D are communicatively coupled in the data processing environment 100.
  • the AR glasses 104D include a camera, a microphone, a speaker, one or more inertial sensors (e.g., gyroscope, accelerometer), and a display.
  • the camera and microphone are configured to capture video and audio data from a scene of the AR glasses 104D, while the one or more inertial sensors are configured to capture inertial sensor data.
  • the camera captures hand gestures of a user wearing the AR glasses 104D, and the hand gestures are recognized locally and in real time using a two-stage hand gesture recognition model.
  • the microphone records ambient sound, including user’s voice commands.
  • both video or static visual data captured by the camera and the inertial sensor data measured by the one or more inertial sensors are applied to determine and predict device poses.
  • the video, static image, audio, or inertial sensor data captured by the AR glasses 104D is processed by the AR glasses 104D, server(s) 102, or both to recognize the device poses.
  • deep learning techniques are applied by the server(s) 102 and AR glasses 104D jointly to recognize and predict the device poses.
  • the device poses are used to control the AR glasses 104D itself or interact with an application (e.g., a gaming application) executed by the AR glasses 104D.
  • the display of the AR glasses 104D displays a user interface, and the recognized or predicted device poses are used to render or interact with user selectable display items (e.g., an avatar) on the user interface.
  • deep learning techniques are applied in the data processing environment 100 to process video data, static image data, or inertial sensor data captured by the AR glasses 104D.
  • 2D or 3D device poses are recognized and predicted based on such video, static image, and/or inertial sensor data using a first data processing model.
  • Visual content is optionally generated using a second data processing model.
  • Training of the first and second data processing models is optionally implemented by the server 102 or AR glasses 104D.
  • Inference of the device poses and visual content is implemented by each of the server 102 and AR glasses 104D independently or by both of the server 102 and AR glasses 104D jointly.
  • FIG. 2 is a block diagram illustrating a data processing system 200, in accordance with some embodiments.
  • the data processing system 200 includes a server 102, a client device 104 (e.g., AR glasses 104D in Figure 1), a storage 106, or a combination thereof.
  • the data processing system 200 typically includes one or more processing units (CPUs) 202, one or more network interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset).
  • the data processing system 200 includes one or more input devices 210 that facilitate user input, such as a keyboard, a mouse, a voice-command input unit or microphone, a touch screen display, a touch-sensitive input pad, a gesture capturing camera, or other input buttons or controls.
  • the client device 104 of the data processing system 200 uses a microphone and voice recognition or a camera and gesture recognition to supplement or replace the keyboard.
  • the client device 104 includes one or more cameras, scanners, or photo sensor units for capturing images, for example, of graphic serial codes printed on the electronic devices.
  • the data processing system 200 also includes one or more output devices 212 that enable presentation of user interfaces and display content, including one or more speakers and/or one or more visual displays.
  • the client device 104 includes a location detection device, such as a GPS (global positioning satellite) or other geo-location receiver, for determining the location of the client device 104.
  • Memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices.
  • Memory 206, optionally, includes one or more storage devices remotely located from the one or more processing units 202.
  • Memory 206, or alternatively the non-volatile memory within memory 206, includes a non-transitory computer readable storage medium.
  • memory 206, or the non-transitory computer readable storage medium of memory 206, stores the following programs, modules, and data structures, or a subset or superset thereof:
  • Operating system 214 including procedures for handling various basic system services and for performing hardware dependent tasks;
  • Network communication module 216 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or storage 106) via one or more network interfaces 204 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;
  • User interface module 218 for enabling presentation of information (e.g., a graphical user interface for application(s) 224, widgets, websites and web pages thereof, and/or games, audio and/or video content, text, etc.) at each client device 104 via one or more output devices 212 (e.g., displays, speakers, etc.);
  • Input processing module 220 for detecting one or more user inputs or interactions from one of the one or more input devices 210 and interpreting the detected input or interaction;
  • Web browser module 222 for navigating, requesting (e.g., via HTTP), and displaying websites and web pages thereof, including a web interface for logging into a user account associated with a client device 104 or another electronic device, controlling the client or electronic device if associated with the user account, and editing and reviewing settings and data that are associated with the user account;
  • One or more user applications 224 for execution by the data processing system 200 (e.g., games, social network applications, smart home applications, and/or other web or non-web based applications for controlling another electronic device and reviewing data captured by such devices);
  • Model training module 226 for receiving training data and establishing a data processing model for processing content data (e.g., video, image, audio, or textual data) to be collected or obtained by a client device 104;
  • Data processing module 228 (e.g., applied in a hand gesture recognition process 500 in Figure 5) for processing content data using data processing models 240 (e.g., a hand gesture recognition model);
  • the data processing module 228 is associated with one of the user applications 224 to process the content data in response to a user instruction received from the user application 224;
  • One or more databases 230 for storing at least data including one or more of:
    o Device settings 232 including common device settings (e.g., service tier, device model, storage capacity, processing capabilities, communication capabilities, etc.) of the one or more servers 102 or client devices 104;
    o User account information 234 for the one or more user applications 224, e.g., user names, security questions, account history data, user preferences, and predefined account settings;
    o Network parameters 236 for the one or more communication networks 108, e.g., IP address, subnet mask, default gateway, DNS server, and host name;
    o Training data 238 for training one or more data processing models 240;
    o Data processing model(s) 240 for processing content data (e.g., video, image, audio, or textual data) using deep learning techniques, where the data processing model 240 is a hand gesture recognition model that includes a detection and classification model 512 and a second hand gesture network 516 and is applied to recognize hand gestures from images in real time and locally on an electronic device.
  • the one or more databases 230 are stored in one of the server 102, client device 104, and storage 106 of the data processing system 200.
  • the one or more databases 230 are distributed in more than one of the server 102, client device 104, and storage 106 of the data processing system 200.
  • more than one copy of the above data is stored at distinct devices, e.g., two copies of the data processing models 240 are stored at the server 102 and storage 106, respectively.
  • Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above.
  • memory 206 optionally, stores a subset of the modules and data structures identified above. Furthermore, memory 206, optionally, stores additional modules and data structures not described above.
  • FIG. 3 is another example data processing system 300 for training and applying a neural network based (NN-based) data processing model 240 for processing content data (e.g., video, image, audio, or textual data), in accordance with some embodiments.
  • the data processing system 300 includes a model training module 226 for establishing the data processing model 240 and a data processing module 228 for processing the content data using the data processing model 240.
  • both of the model training module 226 and the data processing module 228 are located on a client device 104 of the data processing system 300, while a training data source 304 distinct from the client device 104 provides training data 306 to the client device 104.
  • the training data source 304 is optionally a server 102 or storage 106.
  • both of the model training module 226 and the data processing module 228 are located on a server 102 of the data processing system 300.
  • the training data source 304 providing the training data 306 is optionally the server 102 itself, another server 102, or the storage 106.
  • model training module 226 and the data processing module 228 are separately located on a server 102 and client device 104, and the server 102 provides the trained data processing model 240 to the client device 104.
  • the model training module 226 includes one or more data pre-processing modules 308, a model training engine 310, and a loss control module 312.
  • the data processing model 240 is trained according to a type of the content data to be processed.
  • the training data 306 is consistent with the type of the content data, as is the data pre-processing module 308 that is applied to process the training data 306.
  • an image pre-processing module 308A is configured to process image training data 306 to a predefined image format, e.g., extract a region of interest (ROI) in each training image, and crop each training image to a predefined image size.
  • ROI region of interest
  • an audio pre-processing module 308B is configured to process audio training data 306 to a predefined audio format, e.g., converting each training sequence to a frequency domain using a Fourier transform.
  • the model training engine 310 receives pre-processed training data provided by the data pre-processing modules 308, further processes the pre-processed training data using an existing data processing model 240, and generates an output from each training data item.
  • the loss control module 312 can monitor a loss function comparing the output associated with the respective training data item and a ground truth of the respective training data item.
  • the model training engine 310 modifies the data processing model 240 to reduce the loss function, until the loss function satisfies a loss criterion (e.g., a comparison result of the loss function is minimized or reduced below a loss threshold).
  • the modified data processing model 240 is provided to the data processing module 228 to process the content data.
  • the model training module 226 offers supervised learning in which the training data is entirely labelled and includes a desired output for each training data item (also called the ground truth in some situations). Conversely, in some embodiments, the model training module 226 offers unsupervised learning in which the training data are not labelled. The model training module 226 is configured to identify previously undetected patterns in the training data without pre-existing labels and with no or little human supervision. Additionally, in some embodiments, the model training module 226 offers partially supervised learning in which the training data are partially labelled.
  • the data processing module 228 includes one or more data pre-processing modules 314, a model-based processing module 316, and a data post-processing module 318.
  • the data pre-processing modules 314 pre-process the content data based on the type of the content data. Functions of the data pre-processing modules 314 are consistent with those of the pre-processing modules 308 and convert the content data to a predefined content format that is acceptable by inputs of the model-based processing module 316. Examples of the content data include one or more of: video, image, audio, textual, and other types of data.
  • each image is pre-processed to extract an ROI or cropped to a predefined image size
  • an audio clip is pre-processed to convert to a frequency domain using a Fourier transform.
  • the content data includes two or more types, e.g., video data and textual data.
  • the model-based processing module 316 applies the trained data processing model 240 provided by the model training module 226 to process the pre-processed content data.
  • the model-based processing module 316 can also monitor an error indicator to determine whether the content data has been properly processed in the data processing model 240.
  • the processed content data is further processed by the data post-processing module 318 to present the processed content data in a preferred format or to provide other related information that can be derived from the processed content data.
  • Figure 4A is an example neural network (NN) 400 applied to process content data in an NN-based data processing model 240, in accordance with some embodiments.
  • Figure 4B is an example node 420 in the neural network (NN) 400, in accordance with some embodiments.
  • the data processing model 240 is established based on the neural network 400.
  • a corresponding model -based processing module 316 applies the data processing model 240 including the neural network 400 to process content data that has been converted to a predefined content format.
  • the neural network 400 includes a collection of nodes 420 that are connected by links 412. Each node 420 receives one or more node inputs and applies a propagation function to generate a node output from the one or more node inputs. As the node output is provided via one or more links 412 to one or more other nodes 420, a weight w associated with each link 412 is applied to the node output. Likewise, the one or more node inputs are combined based on corresponding weights w1, w2, w3, and w4 according to the propagation function. In an example, the propagation function is a product of a non-linear activation function and a linear weighted combination of the one or more node inputs.
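  • The propagation function described above can be sketched as a weighted combination of the node inputs followed by a non-linear activation; the example below (not from the patent) uses a ReLU activation and four illustrative link weights.

```python
import numpy as np

def node_output(inputs, weights, bias=0.0):
    """One node 420: combine the node inputs with their link weights (w1..w4 in the
    description above), add an optional bias, and apply a non-linear activation."""
    z = np.dot(weights, inputs) + bias   # linear weighted combination of the inputs
    return np.maximum(0.0, z)            # ReLU chosen here as an example activation

# Example: a node with four weighted inputs.
print(node_output(np.array([0.2, 0.5, 0.1, 0.9]), np.array([0.4, -0.3, 0.8, 0.1])))
```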
  • the collection of nodes 420 is organized into one or more layers in the neural network 400.
  • the one or more layers includes a single layer acting as both an input layer and an output layer.
  • the one or more layers includes an input layer 402 for receiving inputs, an output layer 406 for providing outputs, and zero or more hidden layers 404 (e.g., 404A and 404B) between the input and output layers 402 and 406.
  • a deep neural network has more than one hidden layer 404 between the input and output layers 402 and 406. In the neural network 400, each layer is only connected with its immediately preceding and/or immediately following layer.
  • a layer 402 or 404B is a fully connected layer because each node 420 in the layer 402 or 404B is connected to every node 420 in its immediately following layer.
  • one of the one or more hidden layers 404 includes two or more nodes that are connected to the same node in its immediately following layer for down-sampling or pooling the nodes 420 between these two layers.
  • max pooling uses a maximum value of the two or more nodes in the layer 404B for generating the node of the immediately following layer 406 connected to the two or more nodes.
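  • A short sketch of the max pooling step described above, assuming a 2x2 pooling window (the window size is an illustrative choice, not specified here):

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Each output node takes the maximum of a 2x2 group of nodes in the preceding
    layer, down-sampling the feature map by a factor of 2 in each dimension."""
    h, w = feature_map.shape
    blocks = feature_map[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

print(max_pool_2x2(np.array([[1, 3, 2, 0],
                             [4, 2, 1, 5],
                             [0, 1, 3, 2],
                             [6, 0, 2, 4]])))   # -> [[4 5], [6 4]]
```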
  • a convolutional neural network is applied in a data processing model 240 to process content data (particularly, video and image data).
  • the CNN employs convolution operations and belongs to a class of deep neural networks 400, i.e., a feedforward neural network that only moves data forward from the input layer 402 through the hidden layers to the output layer 406.
  • the one or more hidden layers of the CNN are convolutional layers convolving with a multiplication or dot product.
  • Each node in a convolutional layer receives inputs from a receptive area associated with a previous layer (e.g., five nodes), and the receptive area is smaller than the entire previous layer and may vary based on a location of the convolution layer in the convolutional neural network.
  • Video or image data is pre-processed to a predefined video/image format corresponding to the inputs of the CNN.
  • the pre-processed video or image data is abstracted by each layer of the CNN to a respective feature map.
  • a recurrent neural network is applied in the data processing model 240 to process content data (particularly, textual and audio data). Nodes in successive layers of the RNN follow a temporal sequence, such that the RNN exhibits a temporal dynamic behavior.
  • each node 420 of the RNN has a time-varying real-valued activation.
  • the RNN examples include, but are not limited to, a long short-term memory (LSTM) network, a fully recurrent network, an Elman network, a Jordan network, a Hopfield network, a bidirectional associative memory (BAM network), an echo state network, an independently RNN (IndRNN), a recursive neural network, and a neural history compressor.
  • the RNN can be used for handwriting or speech recognition.
  • the training process is a process for calibrating all of the weights w of each layer of the learning model using a training data set that is provided to the input layer 402.
  • the training process typically includes two steps, forward propagation and backward propagation, which are repeated multiple times until a predefined convergence condition is satisfied.
  • in the forward propagation, the set of weights of the different layers is applied to the input data and the intermediate results from the previous layers.
  • in the backward propagation, a margin of error of the output (e.g., a loss function) is measured, and the weights are adjusted accordingly to decrease the error.
  • the activation function is optionally linear, rectified linear unit, sigmoid, hyperbolic tangent, or of other types.
  • a network bias term b is added to the sum of the weighted outputs from the previous layer before the activation function is applied.
  • the network bias b provides a perturbation that helps the NN 400 avoid overfitting the training data.
  • the result of the training includes the network bias parameter b for each layer.
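  • The forward and backward propagation steps and the bias term b can be illustrated with a single fully connected layer trained by gradient descent; this is a generic sketch under a mean-squared-error loss, not the patent's training procedure.

```python
import numpy as np

def train_step(W, b, x, target, lr=0.1):
    """One forward/backward iteration for a linear layer with a bias term b."""
    y = W @ x + b                            # forward propagation: weights and bias applied
    loss = 0.5 * np.sum((y - target) ** 2)   # margin of error of the output
    grad_y = y - target                      # backward propagation: gradient of the loss
    W -= lr * np.outer(grad_y, x)            # adjust weights to decrease the error
    b -= lr * grad_y                         # adjust the bias term as well
    return W, b, loss

# Repeat until a predefined convergence condition is satisfied, e.g. loss below a threshold.
W, b = np.zeros((2, 3)), np.zeros(2)
x, target = np.array([1.0, 2.0, 3.0]), np.array([1.0, -1.0])
for _ in range(100):
    W, b, loss = train_step(W, b, x, target)
    if loss < 1e-6:
        break
```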
  • FIG. 5 is a flow diagram of a hand gesture recognition process 500 in which hand gestures are identified in an image in real time, in accordance with some embodiments.
  • An electronic device 104 is configured to implement the hand gesture recognition process 500. Specifically, the electronic device 104 obtains an input image 502 captured by a camera, and the camera is optionally part of the same electronic device 104 or a distinct electronic device 104.
  • the input image 502 includes an input hand region 504 where a hand is captured.
  • the input image 502 is down-sampled to a first image 506.
  • the input image 502 has a first resolution (e.g., a first number of pixels).
  • the first image 506 has a second resolution smaller than the first resolution (e.g., has a second number of pixels that is scaled from the first number of pixels according to a down-sampling rate).
  • a first gesture label vector 508 is determined from the first image 506, e.g., using a first hand gesture network 512A.
  • the first image 506 is also processed, e.g., using a hand region detection network 512B, to detect a first hand region 510 where the hand is captured in the first image 506.
  • the first hand gesture network 512A and hand region detection network 512B collectively form a detection and classification model 512 configured to generate the first gesture label vector 508 and detect the first hand region 510 from the entire first image 506.
  • the first hand region 510 in the first image 506 corresponds to the input hand region 504 of the input image 502.
  • the input hand region 504 has a third number of pixels that is scaled down to the first hand region 510 having a fourth number of pixels according to the down-sampling ratio.
  • the first hand region 510 has a rectangular shape and tightly encloses the hand in the first image 506, and the input hand region 504 likewise has a rectangular shape and tightly encloses the hand in the input image 502.
  • the input image 502 is cropped to the input hand region 504 corresponding to the first hand region 510 of the first image 506.
  • a second gesture label vector 514 is determined from the input hand region 504 of the input image 502, e.g., using a second hand gesture network 516.
  • the input hand region 504 is associated with a first hand gesture 518 based on both the first and second gesture label vectors 508 and 514.
  • the first hand gesture 518 is selected from a plurality of predefined hand gestures based on the first and second gesture label vectors 508 and 514.
  • the plurality of predefined hand gestures are organized in an ordered sequence of hand gestures, and each of the first and second gesture label vectors 508 and 514 has a respective sequence of probability elements aligned with the ordered sequence of hand gestures.
  • Each probability element of the first and second gesture label vectors 508 and 514 represents a probability of being associated with a respective and distinct predefined hand gesture in the ordered sequence of hand gestures.
  • Each of the first and second gesture label vectors 508 and 514 is normalized to 1, i.e., a total probability of the predefined gestures is equal to 1 for each gesture label vector 508 or 514.
  • the sequence of predefined hand gestures includes 6 static single-hand gestures, e.g., a stopping gesture, a grabbing gesture, a thumb-up gesture, a thumb-down gesture, a high-five gesture, and a peace gesture, which are organized according to a predefined order.
  • the first gesture label vector 508 has six probability elements that have a sum equal to 1, so does the second gesture label vector 514.
  • the first gesture label vector 508 is equal to [0, 0.5, 0.1, 0.4, 0, 0], and indicates that the hand gesture in the first image 506 has 50%, 10%, and 40% chances of being the grabbing, thumb-up, and thumb-down gestures, respectively.
  • the first gesture label vector 508 is equal to [0, 0, 0, 0, 0, 1], and indicates that the hand gesture in the first image 506 corresponds to the peace gesture with a probability of 100%.
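  • The description only requires that each gesture label vector is normalized so its probability elements sum to 1; a softmax over raw per-gesture scores is one common way to produce such a vector (softmax is an assumption here, not stated above).

```python
import numpy as np

def gesture_label_vector(logits):
    """Turn raw per-gesture scores into a probability vector that sums to 1."""
    z = np.exp(logits - np.max(logits))   # subtract the max for numerical stability
    return z / z.sum()

# Six predefined gestures in a fixed order (e.g., stop, grab, thumb up, thumb down, high five, peace).
print(gesture_label_vector(np.array([0.1, 2.0, 0.4, 1.8, 0.0, 0.2])))
```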
  • the second gesture label vector 514 is optionally equal to or distinct from the first gesture label vector 508.
  • the first gesture label vector 508 and second gesture label 514 are combined, e.g., in a weighted manner, to generate a comprehensive gesture label vector 520, and the first hand gesture 518 is selected from a plurality of predefined gestures based on the comprehensive gesture label vector 520.
  • the plurality of predefined hand gestures have a predefined number of predefined hand gestures (e.g., 12 hand gestures) organized in an ordered sequence of hand gestures, and each of the first, second, and comprehensive gesture label vectors 508, 514, and 520 has the predefined number of probability elements.
  • Each probability element of the first, second, or comprehensive gesture label vector 520 represents a probability of being associated with a respective and distinct predefined hand gesture in the ordered sequence of hand gestures.
  • Each of the first, second, and comprehensive gesture label vectors 508, 514, and 520 is normalized to 1, i.e., a total probability of the predefined gestures is equal to 1 for each gesture label vector, and each probability element is equal to or less than 1.
  • one of the predefined number of probability elements of the comprehensive gesture label vector 520 is greater than any other probability element of the comprehensive gesture label vector 520, and the one of the predefined number of probability elements of the comprehensive gesture label vector 520 corresponds to the first hand gesture 518 in the ordered sequence of hand gestures.
  • the plurality of predefined gestures includes 6 static single-hand gestures organized in the predefined order, and the first gesture label vector 508 is equal to [0, 0.5, 0.1, 0.4, 0, 0]. The second gesture label vector 514 is equal to [0, 0, 0, 0, 0, 1]. Weights for the first and second gesture label vectors 508 and 514 are both equal to 0.5, so the comprehensive gesture label vector 520 is [0, 0.25, 0.05, 0.2, 0, 0.5]. The last probability element of the comprehensive gesture label vector 520 corresponds to the peace gesture and is greater than any other probability element. As such, the peace gesture is selected as the first hand gesture 518 associated with the input hand region 504 in the input image 502.
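  • The weighted combination in the example above can be reproduced directly; the sketch below uses the vectors and equal weights from the preceding paragraph and selects the gesture with the largest comprehensive probability.

```python
import numpy as np

# Values from the example above: 6 predefined gestures in a fixed order.
first_vector = np.array([0.0, 0.5, 0.1, 0.4, 0.0, 0.0])    # from the down-sampled image
second_vector = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 1.0])   # from the high-resolution crop
w1 = w2 = 0.5                                               # equal weights

comprehensive = w1 * first_vector + w2 * second_vector
# comprehensive == [0, 0.25, 0.05, 0.2, 0, 0.5]; the last element is largest,
# so the peace gesture is selected as the first hand gesture.
gesture_index = int(np.argmax(comprehensive))
```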
  • the first gesture label vector 508 and the second gesture label vector 514 are combined using a first weight and a second weight, respectively.
  • the first hand region 510 is detected in the first image 506 using a first hand gesture network 512A having a first input size.
  • the second gesture label vector 514 is determined from the input hand region 504 of the input image 502 using the second hand gesture network 516 having a second input size.
  • the first image 506 is resized to match the first input size
  • the input hand region 504 of the input image is resized to match the second input size. For example, a dimension of the first image 506 is expanded by filling black pixels to match a corresponding dimension of the first input size.
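  • One way to expand an image dimension by filling black pixels, as in the resizing example above, is simple zero padding to the network input size; the function below is an illustrative sketch (the padding placement is an assumption).

```python
import numpy as np

def pad_to_input_size(image, target_h, target_w):
    """Expand an image to the network input size by filling the missing area with
    black (zero) pixels; the image is assumed to fit within the target size."""
    h, w = image.shape[:2]
    padded = np.zeros((target_h, target_w) + image.shape[2:], dtype=image.dtype)
    padded[:h, :w] = image   # place the original image in the top-left corner
    return padded
```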
  • the first weight is determined based on the first input size and a size of the first image 506, and the second weight is determined based on the second input size and a size of the input hand region of the input image 502.
  • a sum of the first and second weights is equal to 1.
  • the first image 506 (whether resized or not) has a second number of pixels that matches (e.g., is equal to) the first input size of the first hand gesture network 512A, and the input hand region 504 (whether resized or not) has a third number of pixels that matches the second input size of the second hand gesture network 516.
  • sizes of the input hand region 504 and the first hand region 510 are compared to determine whether the second or first gesture label vector 508 or 514 is more reliable. For example, when the size of the first hand region 510 is greater than the size of the input hand region 504, the first gesture label vector 508 is more reliable than the second gesture label vector 514, and the first weight is greater than the second weight.
  • the first gesture label vector 508 is less reliable than the second gesture label vector 514
  • the first weight is less than the second weight.
  • the second number of pixels of the first image 506 does not match the first input size, and is adjusted prior to being processed by the first hand gesture network 512A of the detection and classification model 512.
  • the input hand region 504 has a third number of pixels that do not match the second input size, and is adjusted prior to being processed by the second hand gesture network 516.
  • the first weight is determined based on the first input size and a size of the first image 506, and the second weight is determined based on the second input size and a size of the input hand region 504 of the input image 502.
  • the input sizes of the first and second hand gesture networks 512A and 516 indicate complexity levels of the first and second hand gesture networks 512A and 516, and an increase in the respective size of the first or second hand gesture network 512A or 516 enhances reliability of the first or second gesture label vector 508 or 514 and increases a magnitude of the first or second weight, respectively.
  • the first gesture label vector 508 is determined from the first image 506 using a first hand gesture network 512A
  • the second gesture label vector 514 is determined from the input hand region 504 of the input image 502 using a second hand gesture network 516.
  • the first hand region 510 is detected in the first image 506 using a hand region detection network 512B.
  • the first hand gesture network 512A, second hand gesture network 516, and hand region detection network 512B are trained end-to-end or independently.
  • the first hand gesture network 512A, second hand gesture network 516, and hand region detection network 512B are trained at a server 102, and provided to an electronic device 104 for recognizing the first hand gesture 518 in the input image 502.
  • the first hand gesture network 512A, second hand gesture network 516, and hand region detection network 512B are trained and applied at a server 102 for recognizing the first hand gesture 518 in the input image 502.
  • the input image 502 is captured by an electronic device 104 and uploaded to the server 102.
  • Information of the first hand gesture 518 identified by the server 102 is sent to the electronic device 104.
  • the process 500 corresponds to a two-stage hand gesture determination method that provides real-time and accurate hand gestures 518.
  • the first hand gesture 518 is determined in real time with the input image 502 being captured, when a latency time between image capturing and gesture determination is less than a threshold latency (e.g., 10 milliseconds).
  • the process 500 down-samples an input image 502 to a predefined resolution of a first image 506 for the detection and classification model 512 in the first stage.
  • the input image 502 has a higher resolution than the first image 506.
  • the detection and classification model 512 outputs an estimated hand location corresponding to a first hand region 510 and the first gesture label vector 508.
  • the process 500 uses the estimated hand location to crop the original high-resolution image 502 and resize a cropped area (i.e., the input hand region 504 of the input image 502) to a predefined resolution for the second hand gesture network 516.
  • the second hand gesture network 516 estimates the second gesture label vector 514 from the input hand region 504.
  • the input hand region 504 provides more detailed information to the second hand gesture network 516 and enables the process 500 to predict a hand gesture in the input image 502 with a higher accuracy compared with the first image 506.
  • the second hand gesture network 516 processes the input image 502 having the higher resolution based on the first hand region 510 determined from the first image 506 in the first stage. This avoids searching the entire input image 502 for additional hand regions, and significantly saves time to enable real-time hand gesture recognition.
  • the second hand gesture network 516 does not identify any hand gesture that is not detected in the first stage, but enhances an accuracy level of the hand gesture recognition from the first stage with the higher-resolution information in the input hand region 504.
  • the two-stage hand gesture determination method enables real-time hand gesture recognition while increasing an overall accuracy level (e.g., beyond a threshold accuracy level).
  • the networks 512 and 516 applied in the two-stage hand gesture determination method are compact and efficient and can be implemented locally at an electronic device (e.g., a drone) having a limited computational capacity.
  • Weights applied to combine the first and second gesture label vectors 508 and 514 form a weight vector that is optionally normalized to 1. These weights reflect qualities of the detection and classification model 512 and second hand gesture network 516, i.e., how reliable the corresponding gesture label vectors 508 and 514 are. The weights are impacted by many factors including a resolution of the input image 502, the down-sampling rate between the input and first images 502 and 506, and the qualities of the networks 512 and 516. In some embodiments, the input image 502 is down-sampled and scaled to the first input size of the detection and classification model 512 in the first stage. A first scaling ratio is determined by a ratio of the size of the input image 502 and the first input size of the model 512.
  • the input image 502 is cropped to the input hand region 504 and scaled to the second input size of the second hand gesture network 516 in the second stage.
  • a second scaling ratio is determined by a ratio of the size of the input hand region 504 of the input image 502 and the second input size of the network 516.
  • the second scaling ratio is greater than 1
  • the input hand region 504 is down-sampled, and information is removed before the region is fed into the second hand gesture network 516.
  • the second weight of the second gesture label vector 514 decreases with an increase of the second scaling ratio that is larger than 1.
  • the increase of the second scaling ratio corresponds to more information being lost, the second gesture label vector 514 being less reliable, and the second weight being smaller.
  • the input hand region 504 is up-sampled prior to being fed into the second hand gesture network 516, and no additional information is added except that a size of the input hand region 504 increases.
  • the second weight of the second gesture label vector 514 varies with the second scaling ratio when the ratio is larger than 1, but does not change after the second scaling ratio drops below 1.
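  • The exact weighting function is not given here; the sketch below only mirrors the qualitative behavior described above: the second weight shrinks as the second scaling ratio grows above 1 (down-sampling loses information) and stays constant once the ratio is at or below 1 (up-sampling adds no information).

```python
def second_weight(second_scaling_ratio, base_weight=0.5):
    """Illustrative heuristic only: weight of the second gesture label vector."""
    if second_scaling_ratio <= 1.0:
        return base_weight                       # no change at or below a ratio of 1
    return base_weight / second_scaling_ratio    # decreases as more detail is lost

def fusion_weights(second_scaling_ratio, base_weight=0.5):
    """Return (first_weight, second_weight) normalized so they sum to 1."""
    w2 = second_weight(second_scaling_ratio, base_weight)
    return 1.0 - w2, w2
```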
  • the process 500 fully utilizes the high-resolution image 502 for real-time hand gesture recognition.
  • the corresponding two-stage framework improves a classification accuracy for small hands in the image 502, and effectively extends the distance range that a single detection and classification model 512 can cover with a down-sampled image (e.g., the first image 506).
  • the process 500 is similarly applied to combine more networks to extract human skeleton and/or face from the images when more computation resources are provided on the electronic device 104.
  • a human skeleton network can take a down-sampled image to find the face/hands area, and an area in the original high-resolution image is cropped for recognition processing based on the face/hands area.
  • the electronic device 104 obtains an input image 502 including an input human or face region 504 where a human body or face is captured.
  • the input image 502 is down-sampled to a first image 506, from which a first gesture label vector 508 is determined.
  • a first human or face region 510 where the human body or face is captured is detected in the first image 506.
  • the input image 502 is cropped to the input human or face region 504 corresponding to the first human or face region 510 of the first image 506.
  • a second gesture label vector 514 is determined from the input human or face region 504 of the input image 502.
  • the electronic device 104 associates the input human or face region 504 with a first body gesture or face expression based on both the first and second gesture label vectors 508 and 514.
  • Figure 6 is a flowchart of a method 600 for recognizing a hand gesture in an image, in accordance with some embodiments.
  • the method 600 is described as being implemented by a computer system (e.g., a client device 104, a server 102, or a combination thereof).
  • the client device 104 is a mobile phone 104C, AR glasses 104D, smart television device, or drone.
  • Method 600 is, optionally, governed by instructions that are stored in a non-transitory computer readable storage medium and that are executed by one or more processors of the computer system.
  • Each of the operations shown in Figure 6 may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., memory 206 of the computer system 200 in Figure 2).
  • the computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other non-volatile memory device or devices.
  • the instructions stored on the computer readable storage medium may include one or more of: source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors. Some operations in method 600 may be combined and/or the order of some operations may be changed.
  • the computer system obtains (602) an input image 502 including an input hand region 504 where a hand is captured.
  • the computer system down-samples (604) the input image 502 to a first image 506, determines (606) a first gesture label vector 508 from the first image 506, and detects (608) in the first image 506 a first hand region 510 where the hand is captured.
  • the input image 502 is down-sampled to the first image 506 having a second resolution, the input image 502 having a first resolution. A ratio of the first resolution and the second resolution is equal to a down-sampling rate applied to down-sample the input image 502 to the first image 506.
  • the computer system crops (610) the input image 502 to the input hand region 504 corresponding to the first hand region 510 of the first image 506.
  • the computer system determines (612) a second gesture label vector 514 from the input hand region 504 of the input image 502, and associates (614) the input hand region 504 with a first hand gesture 518 based on both the first and second gesture label vectors 508 and 514.
  • each of the first and second gesture label vectors 508 and 514 is normalized (616).
  • the first hand region 510 has a rectangular shape and tightly encloses the hand in the first image 506.
  • the first hand gesture 518 is selected from a plurality of predefined hand gestures based on the first and second gesture label vectors 508 and 514.
  • the predefined hand gestures are organized in an ordered sequence of hand gestures, and each of the first and second gesture label vectors 508 and 514 has a respective sequence of probability elements aligned with the ordered sequence of hand gestures.
  • Each probability element of the first and second gesture label vectors 508 and 514 represents a probability of being associated with a respective and distinct predefined hand gesture in the ordered sequence of hand gestures.
  • the first gesture label vector 508 and the second gesture label vector 514 are combined (618) to generate a comprehensive gesture label vector 520, and the first hand gesture 518 is selected (620) from a plurality of predefined hand gestures based on the comprehensive gesture label vector 520.
  • the plurality of predefined hand gestures have a predefined number of predefined hand gestures organized in an ordered sequence of hand gestures, and each of the first, second, and comprehensive gesture label vectors 508, 514, and 520 has the predefined number of probability elements.
  • Each probability element of the first, second, or comprehensive gesture label vector 508, 514, or 520 represents a probability of being associated with a respective and distinct predefined hand gesture in the ordered sequence of hand gestures. Additionally, in some embodiments, when selecting the first hand gesture 518 from the plurality of predefined hand gestures, the computer system determines that one of the predefined number of probability elements of the comprehensive gesture label vector 520 is greater than any other probability element of the comprehensive gesture label vector 520 and that this probability element corresponds to the first hand gesture 518 in the ordered sequence of hand gestures.
  • the first gesture label vector 508 and the second gesture label vector 514 are combined using a first weight and a second weight, respectively.
  • the first hand region 510 is detected in the first image 506 using a first hand gesture network 512A having a first input size.
  • the second gesture label vector 514 is determined from the input hand region 504 of the input image 502 using a second hand gesture network 516 having a second input size.
  • the first weight is determined based on the first input size and a size of the first image 506.
  • the second weight is determined based on the second input size and a size of the input hand region 504 of the input image 502.
  • the first weight is greater than the second weight when a size of the first hand region 510 is greater than the size of the input hand region 504, and the first weight is less than the second weight when the size of the first hand region 510 is less than the size of the input hand region 504.
  • the first gesture label vector 508 is detected from the first image 506 using a first hand gesture network 512A
  • the second gesture label vector 514 is determined from the input hand region 504 of the input image 502 using a second hand gesture network 516.
  • the first hand region 510 is detected in the first image 506 using a hand region detection network 512B.
  • the first hand gesture network 512A, second hand gesture network 516, and hand region detection network 512B are trained end-to-end or independently. Further, in some embodiments, the first hand gesture network 512A, second hand gesture network 516, and hand region detection network 512B are trained at a server 102, and provided to an electronic device 104 for recognizing the first hand gesture 518 in the input image 502.
  • the first hand gesture network 512A, second hand gesture network 516, and hand region detection network 512B are trained and applied at a server 102 for recognizing the first hand gesture 518 in the input image 502.
  • the input image 502 is captured by an electronic device 104 and uploaded to the server 102.
  • Information of the first hand gesture 518 is sent by the server 102 to the electronic device 104.
  • the term “if” is, optionally, construed to mean “when” or “upon” or “in response to determining” or “in response to detecting” or “in accordance with a determination that,” depending on the context.
  • the phrase “if it is determined” or “if [a stated condition or event] is detected” is, optionally, construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event]” or “in accordance with a determination that [a stated condition or event] is detected,” depending on the context.
  • stages that are not order dependent may be reordered and other stages may be combined or broken out. While some reordering or other groupings are specifically mentioned, others will be obvious to those of ordinary skill in the art, so the ordering and groupings presented herein are not an exhaustive list of alternatives. Moreover, it should be recognized that the stages can be implemented in hardware, firmware, software or any combination thereof.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

This application is directed to recognizing hand gestures in images. An electronic device obtains an input image and down-samples the input image to a first image from which a first gesture label vector is determined. A first hand region where a hand is captured is detected in the first image. In accordance with the first hand region in the first image, the input image is cropped to an input hand region corresponding to the first hand region. A second gesture label vector is determined from the input hand region of the input image. The input hand region is associated with a first hand gesture based on both the first and second gesture label vectors. In some embodiments, the first and second gesture label vectors are combined in a weighted manner and applied to select the first hand gesture from a plurality of predefined hand gestures.

Description

Two-Stage Hand Gesture Recognition
TECHNICAL FIELD
[0001] This application relates generally to image data processing technology including, but not limited to, methods, systems, and non-transitory computer-readable media for using deep learning techniques to recognize a hand gesture in an image.
BACKGROUND
[0002] Touchless air gestures are used to implement certain user interface functions for electronic devices having no touch screens, e.g., drones, smart television devices, and head-mounted displays (e.g., virtual reality headsets, augmented reality glasses, mixed reality headsets). These devices include front-facing optical cameras, radar or ultrasound systems, and/or depth sensors to track human hands in real time. Features of a human body and/or face (e.g., key points of the hand) can be extracted from an image to help determine a hand location and improve an accuracy level of gesture recognition. Some head-mounted displays have implemented hand tracking functions to complete user interaction including selecting, clicking, and typing on a virtual keyboard. Air gestures can also be used on devices having touch screens when a user’s hands are not available to touch the screen (e.g., while preparing a meal, the user can use air gestures to scroll down a recipe so that the user does not need to touch the device screen with wet hands).
[0003] Air gesture recognition increases the cost of an electronic device, particularly when deep learning techniques are applied to enhance an accuracy level. Typical key-point or human-skeleton algorithms are computationally expensive and rarely work with hand gesture classification in real time on electronic devices that have limited computing resources. These algorithms are often implemented using multi-stage frameworks that either have a high latency or demand powerful hardware. It would be beneficial to have a more efficient air gesture recognition mechanism than the current practice.
SUMMARY
[0004] Accordingly, there is a need for an efficient air gesture recognition mechanism for estimating and tracking a hand gesture (e.g., a hand location and a hand pose) in real time. Hand gesture recognition enables user interfaces in many electronic devices (e.g., drones, smart television devices, and head-mounted displays) where human hand gestures are captured and interpreted as commands. In various embodiments of this application, a two-stage hand recognition framework is implemented by an electronic device having limited computation resources to recognize hand gestures from high-resolution inputs in real time. High-resolution cameras are applied to capture the high-resolution inputs, e.g., capturing more detail for distant hands. In the two-stage hand recognition framework, a high-resolution image is down-sampled to a pre-defined resolution, allowing a detection network to estimate a hand location and determine a first gesture label vector from a low-resolution down-sampled image. The high-resolution image is then cropped according to the hand location, and a second gesture label vector is determined from a high-resolution cropped image. A hand gesture is identified for the high-resolution image based on both the first and second gesture label vectors.
[0005] In one aspect, a method is implemented at an electronic device (e.g., a mobile device) for recognizing hand gestures in an image. The method includes obtaining an input image including an input hand region where a hand is captured, down-sampling the input image to a first image, and determining a first gesture label vector from the first image. The method further includes detecting in the first image a first hand region where the hand is captured; in accordance with the first hand region in the first image, cropping the input image to the input hand region corresponding to the first hand region of the first image; and determining a second gesture label vector from the input hand region of the input image. The method further includes associating the input hand region with a first hand gesture based on both the first and second gesture label vectors. In some embodiments, the first gesture label vector and the second gesture label vector are combined to generate a comprehensive gesture label vector. The first hand gesture is selected from a plurality of predefined hand gestures based on the comprehensive gesture label vector.
[0006] In another aspect, some implementations include an electronic device that includes one or more processors and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform any of the above methods.
[0007] In yet another aspect, some implementations include a non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform any of the above methods.
BRIEF DESCRIPTION OF THE DRAWINGS
[0008] For a better understanding of the various described implementations, reference should be made to the Detailed Description below, in conjunction with the following drawings in which like reference numerals refer to corresponding parts throughout the figures.
[0009] Figure 1 is an example data processing environment having one or more servers communicatively coupled to one or more client devices, in accordance with some embodiments.
[0010] Figure 2 is a block diagram illustrating a data processing system, in accordance with some embodiments.
[0011] Figure 3 is an example data processing environment for training and applying a neural network-based data processing model for processing visual and/or audio data, in accordance with some embodiments.
[0012] Figure 4A is an example neural network applied to process content data in an NN-based data processing model, in accordance with some embodiments, and Figure 4B is an example node in the neural network, in accordance with some embodiments.
[0013] Figure 5 is a flow diagram of a hand gesture recognition process in which hand gestures are identified in an image in real time, in accordance with some embodiments.
[0014] Figure 6 is a flowchart of a method for recognizing a hand gesture in an image, in accordance with some embodiments.
[0015] Like reference numerals refer to corresponding parts throughout the several views of the drawings.
DETAILED DESCRIPTION
[0016] Reference will now be made in detail to specific embodiments, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous non-limiting specific details are set forth in order to assist in understanding the subject matter presented herein. But it will be apparent to one of ordinary skill in the art that various alternatives may be used without departing from the scope of claims and the subject matter may be practiced without these specific details. For example, it will be apparent to one of ordinary skill in the art that the subject matter presented herein can be implemented on many types of electronic devices with digital video capabilities.
[0017] Figure 1 is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, in accordance with some embodiments. The one or more client devices 104 may be, for example, desktop computers 104A, tablet computers 104B, mobile phones 104C, head-mounted displays (HMDs) (also called augmented reality (AR) glasses) 104D, or intelligent, multi-sensing, network-connected home devices (e.g., a surveillance camera 104E, a smart television device, a drone). Each client device 104 can collect data or user inputs, execute user applications, and present outputs on its user interface. The collected data or user inputs can be processed locally at the client device 104 and/or remotely by the server(s) 102. The one or more servers 102 provide system data (e.g., boot files, operating system images, and user applications) to the client devices 104, and in some embodiments, process the data and user inputs received from the client device(s) 104 when the user applications are executed on the client devices 104. In some embodiments, the data processing environment 100 further includes a storage 106 for storing data related to the servers 102, client devices 104, and applications executed on the client devices 104.
[0018] The one or more servers 102 can enable real-time data communication with the client devices 104 that are remote from each other or from the one or more servers 102. Further, in some embodiments, the one or more servers 102 can implement data processing tasks that cannot be or are preferably not completed locally by the client devices 104. For example, the client devices 104 include a game console (e.g., the HMD 104D) that executes an interactive online gaming application. The game console receives a user instruction and sends it to a game server 102 with user data. The game server 102 generates a stream of video data based on the user instruction and user data and provides the stream of video data for display on the game console and other client devices that are engaged in the same game session with the game console. In another example, the client devices 104 include a networked surveillance camera and a mobile phone 104C. The networked surveillance camera collects video data and streams the video data to a surveillance camera server 102 in real time. While the video data is optionally pre-processed on the surveillance camera, the surveillance camera server 102 processes the video data to identify motion or audio events in the video data and shares information of these events with the mobile phone 104C, thereby allowing a user of the mobile phone 104C to monitor the events occurring near the networked surveillance camera in real time and remotely.
[0019] The one or more servers 102, one or more client devices 104, and storage 106 are communicatively coupled to each other via one or more communication networks 108, which are the medium used to provide communications links between these devices and computers connected together within the data processing environment 100. The one or more communication networks 108 may include connections, such as wire, wireless communication links, or fiber optic cables. Examples of the one or more communication networks 108 include local area networks (LAN), wide area networks (WAN) such as the Internet, or a combination thereof. The one or more communication networks 108 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol. A connection to the one or more communication networks 108 may be established either directly (e.g., using 3G/4G connectivity to a wireless carrier), or through a network interface 110 (e.g., a router, switch, gateway, hub, or an intelligent, dedicated whole-home control node), or through any combination thereof. As such, the one or more communication networks 108 can represent the Internet of a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another. At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational and other computer systems that route data and messages.
[0020] In some embodiments, deep learning techniques are applied in the data processing environment 100 to process content data (e.g., video data, visual data, audio data) obtained by an application executed at a client device 104 to identify information contained in the content data, match the content data with other data, categorize the content data, or synthesize related content data. The content data may broadly include inertial sensor data captured by inertial sensor(s) of a client device 104. In these deep learning techniques, data processing models are created based on one or more neural networks to process the content data. These data processing models are trained with training data before they are applied to process the content data. Subsequent to model training, the mobile phone 104C or HMD 104D obtains the content data (e.g., captures video data via an internal camera) and processes the content data using the data processing models locally.
[0021] In some embodiments, both model training and data processing are implemented locally at each individual client device 104 (e.g., the mobile phone 104C and HMD 104D). The client device 104 obtains the training data from the one or more servers 102 or storage 106 and applies the training data to train the data processing models. Alternatively, in some embodiments, both model training and data processing are implemented remotely at a server 102 (e.g., the server 102A) associated with a client device 104 (e.g., the client device 104A and HMD 104D). The server 102A obtains the training data from itself, another server 102, or the storage 106 and applies the training data to train the data processing models. The client device 104 obtains the content data, sends the content data to the server 102A (e.g., in an application) for data processing using the trained data processing models, receives data processing results (e.g., recognized hand gestures) from the server 102A, presents the results on a user interface (e.g., associated with the application), renders virtual objects in a field of view based on the poses, or implements some other functions based on the results. The client device 104 itself implements no or little data processing on the content data prior to sending them to the server 102A. Additionally, in some embodiments, data processing is implemented locally at a client device 104 (e.g., the client device 104B and HMD 104D), while model training is implemented remotely at a server 102 (e.g., the server 102B) associated with the client device 104. The server 102B obtains the training data from itself, another server 102, or the storage 106 and applies the training data to train the data processing models. The trained data processing models are optionally stored in the server 102B or storage 106. The client device 104 imports the trained data processing models from the server 102B or storage 106, processes the content data using the data processing models, and generates data processing results to be presented on a user interface or used to initiate some functions (e.g., rendering virtual objects based on device poses) locally.
[0022] In some embodiments, a pair of AR glasses 104D (also called an HMD) are communicatively coupled in the data processing environment 100. The AR glasses 104D include a camera, a microphone, a speaker, one or more inertial sensors (e.g., gyroscope, accelerometer), and a display. The camera and microphone are configured to capture video and audio data from a scene of the AR glasses 104D, while the one or more inertial sensors are configured to capture inertial sensor data. In some situations, the camera captures hand gestures of a user wearing the AR glasses 104D, and the AR glasses 104D recognize the hand gestures locally and in real time using a two-stage hand gesture recognition model. In some situations, the microphone records ambient sound, including a user’s voice commands. In some situations, both video or static visual data captured by the camera and the inertial sensor data measured by the one or more inertial sensors are applied to determine and predict device poses. The video, static image, audio, or inertial sensor data captured by the AR glasses 104D is processed by the AR glasses 104D, server(s) 102, or both to recognize the device poses. Optionally, deep learning techniques are applied by the server(s) 102 and AR glasses 104D jointly to recognize and predict the device poses. The device poses are used to control the AR glasses 104D itself or interact with an application (e.g., a gaming application) executed by the AR glasses 104D. In some embodiments, the display of the AR glasses 104D displays a user interface, and the recognized or predicted device poses are used to render or interact with user-selectable display items (e.g., an avatar) on the user interface.
[0023] As explained above, in some embodiments, deep learning techniques are applied in the data processing environment 100 to process video data, static image data, or inertial sensor data captured by the AR glasses 104D. 2D or 3D device poses are recognized and predicted based on such video, static image, and/or inertial sensor data using a first data processing model. Visual content is optionally generated using a second data processing model. Training of the first and second data processing models is optionally implemented by the server 102 or AR glasses 104D. Inference of the device poses and visual content is implemented by each of the server 102 and AR glasses 104D independently or by both of the server 102 and AR glasses 104D jointly.
[0024] Figure 2 is a block diagram illustrating a data processing system 200, in accordance with some embodiments. The data processing system 200 includes a server 102, a client device 104 (e.g., AR glasses 104D in Figure 1), a storage 106, or a combination thereof. The data processing system 200, typically, includes one or more processing units (CPUs) 202, one or more network interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset). The data processing system 200 includes one or more input devices 210 that facilitate user input, such as a keyboard, a mouse, a voice-command input unit or microphone, a touch screen display, a touch-sensitive input pad, a gesture capturing camera, or other input buttons or controls. Furthermore, in some embodiments, the client device 104 of the data processing system 200 uses a microphone and voice recognition or a camera and gesture recognition to supplement or replace the keyboard. In some embodiments, the client device 104 includes one or more cameras, scanners, or photo sensor units for capturing images, for example, of graphic serial codes printed on the electronic devices. The data processing system 200 also includes one or more output devices 212 that enable presentation of user interfaces and display content, including one or more speakers and/or one or more visual displays.
Optionally, the client device 104 includes a location detection device, such as a GPS (global positioning satellite) or other geo-location receiver, for determining the location of the client device 104.
[0025] Memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices. Memory 206, optionally, includes one or more storage devices remotely located from one or more processing units 202. Memory 206, or alternatively the non-volatile memory within memory 206, includes a non-transitory computer readable storage medium. In some embodiments, memory 206, or the non-transitory computer readable storage medium of memory 206, stores the following programs, modules, and data structures, or a subset or superset thereof:
• Operating system 214 including procedures for handling various basic system services and for performing hardware dependent tasks;
• Network communication module 216 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or storage 106) via one or more network interfaces 204 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;
• User interface module 218 for enabling presentation of information (e.g., a graphical user interface for application(s) 224, widgets, websites and web pages thereof, and/or games, audio and/or video content, text, etc.) at each client device 104 via one or more output devices 212 (e.g., displays, speakers, etc.);
• Input processing module 220 for detecting one or more user inputs or interactions from one of the one or more input devices 210 and interpreting the detected input or interaction;
• Web browser module 222 for navigating, requesting (e.g., via HTTP), and displaying websites and web pages thereof, including a web interface for logging into a user account associated with a client device 104 or another electronic device, controlling the client or electronic device if associated with the user account, and editing and reviewing settings and data that are associated with the user account;
• One or more user applications 224 for execution by the data processing system 200 (e.g., games, social network applications, smart home applications, and/or other web or non-web based applications for controlling another electronic device and reviewing data captured by such devices);
• Model training module 226 for receiving training data and establishing a data processing model for processing content data (e.g., video, image, audio, or textual data) to be collected or obtained by a client device 104;
• Data processing module 228 (e.g., applied in a hand gesture recognition process 500 in Figure 5) for processing content data using data processing models 240 (e.g., a hand gesture recognition model), thereby identifying information contained in the content data, matching the content data with other data, categorizing the content data, or synthesizing related content data, where in some embodiments, the data processing module 228 is associated with one of the user applications 224 to process the content data in response to a user instruction received from the user application 224;
• One or more databases 230 for storing at least data including one or more of:
o Device settings 232 including common device settings (e.g., service tier, device model, storage capacity, processing capabilities, communication capabilities, etc.) of the one or more servers 102 or client devices 104;
o User account information 234 for the one or more user applications 224, e.g., user names, security questions, account history data, user preferences, and predefined account settings;
o Network parameters 236 for the one or more communication networks 108, e.g., IP address, subnet mask, default gateway, DNS server, and host name;
o Training data 238 for training one or more data processing models 240;
o Data processing model(s) 240 for processing content data (e.g., video, image, audio, or textual data) using deep learning techniques, where the data processing model 240 is a hand gesture recognition model that includes a detection and classification model 512 and a second hand gesture network 516 and is applied to recognize hand gestures from images in real time and locally in an electronic device, e.g., in Figure 5; and
o Content data and results 242 that are obtained by and outputted to the client device 104 of the data processing system 200, respectively, where the content data is processed by the data processing models 240 locally at the client device 104 or remotely at the server 102 to provide the associated results 242 to be presented on the client device 104.
[0026] Optionally, the one or more databases 230 are stored in one of the server 102, client device 104, and storage 106 of the data processing system 200. Optionally, the one or more databases 230 are distributed in more than one of the server 102, client device 104, and storage 106 of the data processing system 200. In some embodiments, more than one copy of the above data is stored at distinct devices, e.g., two copies of the data processing models 240 are stored at the server 102 and storage 106, respectively.
[0027] Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above. The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, modules or data structures, and thus various subsets of these modules may be combined or otherwise re-arranged in various embodiments. In some embodiments, memory 206, optionally, stores a subset of the modules and data structures identified above. Furthermore, memory 206, optionally, stores additional modules and data structures not described above.
[0028] Figure 3 is another example data processing system 300 for training and applying a neural network-based (NN-based) data processing model 240 for processing content data (e.g., video, image, audio, or textual data), in accordance with some embodiments. The data processing system 300 includes a model training module 226 for establishing the data processing model 240 and a data processing module 228 for processing the content data using the data processing model 240. In some embodiments, both of the model training module 226 and the data processing module 228 are located on a client device 104 of the data processing system 300, while a training data source 304 distinct from the client device 104 provides training data 306 to the client device 104. The training data source 304 is optionally a server 102 or storage 106. Alternatively, in some embodiments, both of the model training module 226 and the data processing module 228 are located on a server 102 of the data processing system 300. The training data source 304 providing the training data 306 is optionally the server 102 itself, another server 102, or the storage 106.
Additionally, in some embodiments, the model training module 226 and the data processing module 228 are separately located on a server 102 and client device 104, and the server 102 provides the trained data processing model 240 to the client device 104.
[0029] The model training module 226 includes one or more data pre-processing modules 308, a model training engine 310, and a loss control module 312. The data processing model 240 is trained according to a type of the content data to be processed. The training data 306 is consistent with the type of the content data, and so is the data pre-processing module 308 applied to process the training data 306. For example, an image pre-processing module 308A is configured to process image training data 306 to a predefined image format, e.g., extract a region of interest (ROI) in each training image, and crop each training image to a predefined image size. Alternatively, an audio pre-processing module 308B is configured to process audio training data 306 to a predefined audio format, e.g., converting each training sequence to a frequency domain using a Fourier transform. The model training engine 310 receives pre-processed training data provided by the data pre-processing modules 308, further processes the pre-processed training data using an existing data processing model 240, and generates an output from each training data item. During this course, the loss control module 312 can monitor a loss function comparing the output associated with the respective training data item and a ground truth of the respective training data item. The model training engine 310 modifies the data processing model 240 to reduce the loss function, until the loss function satisfies a loss criterion (e.g., a comparison result of the loss function is minimized or reduced below a loss threshold). The modified data processing model 240 is provided to the data processing module 228 to process the content data.
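As an illustration of the training loop described above, the following sketch shows a generic supervised iteration of forward pass, loss monitoring, and weight updates repeated until a loss threshold is met. It is a minimal example assuming a PyTorch-style model and data loader; the names model, train_loader, and loss_threshold are placeholders and not part of the disclosed system.

```python
import torch
import torch.nn.functional as F

def train_until_converged(model, train_loader, loss_threshold=0.05, lr=1e-3, max_epochs=100):
    """Hypothetical sketch of the loop run by the model training engine 310:
    forward pass, loss monitoring (loss control module 312), and weight updates
    repeated until the loss criterion is satisfied."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(max_epochs):
        epoch_loss = 0.0
        for images, labels in train_loader:          # pre-processed training data 306
            logits = model(images)                    # forward pass through model 240
            loss = F.cross_entropy(logits, labels)    # compare output with ground truth
            optimizer.zero_grad()
            loss.backward()                           # backward propagation
            optimizer.step()                          # modify weights to reduce the loss
            epoch_loss += loss.item()
        if epoch_loss / len(train_loader) < loss_threshold:   # loss criterion satisfied
            break
    return model
```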
[0030] In some embodiments, the model training module 226 offers supervised learning in which the training data is entirely labelled and includes a desired output for each training data item (also called the ground truth in some situations). Conversely, in some embodiments, the model training module 226 offers unsupervised learning in which the training data are not labelled. The model training module 226 is configured to identify previously undetected patterns in the training data without pre-existing labels and with no or little human supervision. Additionally, in some embodiments, the model training module 226 offers partially supervised learning in which the training data are partially labelled.
[0031] The data processing module 228 includes a data pre-processing module 314, a model-based processing module 316, and a data post-processing module 318. The data pre-processing module 314 pre-processes the content data based on the type of the content data. Functions of the data pre-processing module 314 are consistent with those of the pre-processing modules 308 and convert the content data to a predefined content format that is acceptable by inputs of the model-based processing module 316. Examples of the content data include one or more of: video, image, audio, textual, and other types of data. For example, each image is pre-processed to extract an ROI or cropped to a predefined image size, and an audio clip is pre-processed to convert to a frequency domain using a Fourier transform. In some situations, the content data includes two or more types, e.g., video data and textual data. The model-based processing module 316 applies the trained data processing model 240 provided by the model training module 226 to process the pre-processed content data. The model-based processing module 316 can also monitor an error indicator to determine whether the content data has been properly processed in the data processing model 240. In some embodiments, the processed content data is further processed by the data post-processing module 318 to present the processed content data in a preferred format or to provide other related information that can be derived from the processed content data.
[0032] Figure 4A is an example neural network (NN) 400 applied to process content data in an NN-based data processing model 240, in accordance with some embodiments, and Figure 4B is an example node 420 in the neural network (NN) 400, in accordance with some embodiments. The data processing model 240 is established based on the neural network 400. A corresponding model-based processing module 316 applies the data processing model 240 including the neural network 400 to process content data that has been converted to a predefined content format. The neural network 400 includes a collection of nodes 420 that are connected by links 412. Each node 420 receives one or more node inputs and applies a propagation function to generate a node output from the one or more node inputs. As the node output is provided via one or more links 412 to one or more other nodes 420, a weight w associated with each link 412 is applied to the node output. Likewise, the one or more node inputs are combined based on corresponding weights w1, w2, w3, and w4 according to the propagation function. In an example, the propagation function is a product of a non-linear activation function and a linear weighted combination of the one or more node inputs.
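The propagation function described above, a non-linear activation applied to a linear weighted combination of the node inputs, can be sketched as follows. The choice of tanh and the toy values are illustrative assumptions only.

```python
import numpy as np

def node_output(inputs, weights, activation=np.tanh):
    """Sketch of a single node 420: a linear weighted combination of the node
    inputs followed by a non-linear activation, as described above."""
    z = np.dot(weights, inputs)      # weighted combination with w1, w2, w3, w4, ...
    return activation(z)             # non-linear activation applied to the sum

# Example: four node inputs combined with four link weights
y = node_output(np.array([0.2, -0.5, 1.0, 0.3]), np.array([0.1, 0.4, -0.3, 0.8]))
```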
[0033] The collection of nodes 420 is organized into one or more layers in the neural network 400. Optionally, the one or more layers include a single layer acting as both an input layer and an output layer. Optionally, the one or more layers include an input layer 402 for receiving inputs, an output layer 406 for providing outputs, and zero or more hidden layers 404 (e.g., 404A and 404B) between the input and output layers 402 and 406. A deep neural network has more than one hidden layer 404 between the input and output layers 402 and 406. In the neural network 400, each layer is only connected with its immediately preceding and/or immediately following layer. In some embodiments, a layer 402 or 404B is a fully connected layer because each node 420 in the layer 402 or 404B is connected to every node 420 in its immediately following layer. In some embodiments, one of the one or more hidden layers 404 includes two or more nodes that are connected to the same node in its immediately following layer for down-sampling or pooling the nodes 420 between these two layers. Particularly, max pooling uses a maximum value of the two or more nodes in the layer 404B for generating the node of the immediately following layer 406 connected to the two or more nodes.
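A minimal sketch of the max pooling described above, assuming groups of two nodes feed each node of the following layer; the group size and values are illustrative.

```python
import numpy as np

def max_pool(node_outputs, group_size=2):
    """Each node in the following layer takes the maximum value of a group of
    nodes in the preceding layer, as described above."""
    values = np.asarray(node_outputs, dtype=float)
    values = values[: len(values) - len(values) % group_size]  # drop any remainder
    return values.reshape(-1, group_size).max(axis=1)

print(max_pool([0.1, 0.9, 0.4, 0.2]))  # -> [0.9 0.4]
```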
[0034] In some embodiments, a convolutional neural network (CNN) is applied in a data processing model 240 to process content data (particularly, video and image data). The CNN employs convolution operations and belongs to a class of deep neural networks 400, i.e., a feedforward neural network that only moves data forward from the input layer 402 through the hidden layers to the output layer 406. The one or more hidden layers of the CNN are convolutional layers convolving with a multiplication or dot product. Each node in a convolutional layer receives inputs from a receptive area associated with a previous layer (e.g., five nodes), and the receptive area is smaller than the entire previous layer and may vary based on a location of the convolutional layer in the convolutional neural network. Video or image data is pre-processed to a predefined video/image format corresponding to the inputs of the CNN. The pre-processed video or image data is abstracted by each layer of the CNN to a respective feature map. By these means, video and image data can be processed by the CNN for video and image recognition, classification, analysis, imprinting, or synthesis.
[0035] Alternatively and additionally, in some embodiments, a recurrent neural network (RNN) is applied in the data processing model 240 to process content data (particularly, textual and audio data). Nodes in successive layers of the RNN follow a temporal sequence, such that the RNN exhibits a temporal dynamic behavior. In an example, each node 420 of the RNN has a time-varying real-valued activation. Examples of the RNN include, but are not limited to, a long short-term memory (LSTM) network, a fully recurrent network, an Elman network, a Jordan network, a Hopfield network, a bidirectional associative memory (BAM) network, an echo state network, an independently recurrent neural network (IndRNN), a recursive neural network, and a neural history compressor. In some embodiments, the RNN can be used for handwriting or speech recognition. It is noted that in some embodiments, two or more types of content data are processed by the data processing module 228, and two or more types of neural networks (e.g., both CNN and RNN) are applied to process the content data jointly.
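For illustration only, the following is a small CNN of the kind described above, expressed in PyTorch; the layer counts and channel sizes are assumptions and do not describe the actual networks 512A, 512B, or 516.

```python
import torch.nn as nn

class TinyGestureCNN(nn.Module):
    """Minimal sketch of a feedforward CNN: convolutional layers with small
    receptive areas, pooling between layers, and a classification head."""
    def __init__(self, num_gestures=6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, num_gestures),
        )

    def forward(self, x):   # data moves forward from the input layer to the output layer
        return self.classifier(self.features(x))
```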
[0036] The training process is a process for calibrating all of the weights w for each layer of the learning model using a training data set which is provided in the input layer 402. The training process typically includes two steps, forward propagation and backward propagation, which are repeated multiple times until a predefined convergence condition is satisfied. In the forward propagation, the set of weights for different layers are applied to the input data and intermediate results from the previous layers. In the backward propagation, a margin of error of the output (e.g., a loss function) is measured, and the weights are adjusted accordingly to decrease the error. The activation function is optionally linear, rectified linear unit, sigmoid, hyperbolic tangent, or of other types. In some embodiments, a network bias term b is added to the sum of the weighted outputs from the previous layer before the activation function is applied. The network bias b provides a perturbation that helps the NN 400 avoid overfitting the training data. The result of the training includes the network bias parameter b for each layer.
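The forward and backward propagation steps, including the network bias term b, can be illustrated with a single sigmoid node and one gradient-descent update; the learning rate and input values are toy assumptions.

```python
import numpy as np

def sgd_step(x, target, w, b, lr=0.1):
    """One forward/backward iteration for a single sigmoid node with bias b."""
    z = np.dot(w, x) + b                     # weighted sum plus network bias term b
    y = 1.0 / (1.0 + np.exp(-z))             # sigmoid activation
    error = y - target                       # margin of error of the output
    grad = error * y * (1.0 - y)             # backward propagation through the sigmoid
    w = w - lr * grad * x                    # adjust weights to decrease the error
    b = b - lr * grad
    return w, b, 0.5 * error ** 2            # updated parameters and squared-error loss

w, b = np.array([0.5, -0.2]), 0.0
w, b, loss = sgd_step(np.array([1.0, 2.0]), target=1.0, w=w, b=b)
```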
[0037] Figure 5 is a flow diagram of a hand gesture recognition process 500 in which hand gestures are identified in an image in real time, in accordance with some embodiments. An electronic device 104 is configured to implement the hand gesture recognition process 500. Specifically, the electronic device 104 obtains an input image 502 captured by a camera, and the camera is optionally part of the same electronic device 104 or a distinct electronic device 104. The input image 502 includes an input hand region 504 where a hand is captured. The input image 502 is down-sampled to a first image 506. The input image 502 has a first resolution (e.g., a first number of pixels). The first image 506 has a second resolution smaller than the first resolution (e.g., has a second number of pixels that is scaled from the first number of pixels according to a down-sampling rate). A first gesture label vector 508 is determined from the first image 506, e.g., using a first hand gesture network 512A. The first image 506 is also processed, e.g., using a hand region detection network 512B, to detect a first hand region 510 where the hand is captured in the first image 506. In some embodiments, the first hand gesture network 512A and hand region detection network 512B collectively form a detection and classification model 512 configured to generate the first gesture label vector 508 and detect the first hand region 510 from the entire first image 506.
[0038] The first hand region 510 in the first image 506 corresponds to the input hand region 504 of the input image 502. The input hand region 504 has a third number of pixels that is scaled down to the first hand region 510 having a fourth number of pixels according to the down-sampling ratio. In some embodiments, the first hand region 510 has a rectangular shape and tightly encloses the hand in the first image 506, and the input hand region 504 likewise has a rectangular shape and tightly encloses the hand in the input image 502. In accordance with the first hand region 510 in the first image 506, the input image 502 is cropped to the input hand region 504 corresponding to the first hand region 510 of the first image 506.
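The two-stage flow of process 500 can be summarized with the following sketch. The objects stage1_model and second_gesture_network, and their call signatures, are assumptions made for illustration: the first is taken to return a gesture label vector and a hand box from the down-sampled first image, and the second a gesture label vector from the high-resolution crop. OpenCV is assumed to be available for resizing.

```python
import cv2  # assumed available for resizing and cropping

def two_stage_gesture(input_image, stage1_model, second_gesture_network,
                      first_input_size=(320, 320), second_input_size=(128, 128)):
    """Illustrative sketch of process 500; sizes and models are placeholders."""
    # Stage 1: down-sample the input image 502 to the first image 506
    first_image = cv2.resize(input_image, first_input_size)
    first_vector, (x, y, w, h) = stage1_model(first_image)   # vector 508, region 510

    # Map the first hand region 510 back to the full-resolution input image 502
    sx = input_image.shape[1] / first_input_size[0]
    sy = input_image.shape[0] / first_input_size[1]
    crop = input_image[int(y * sy):int((y + h) * sy), int(x * sx):int((x + w) * sx)]

    # Stage 2: classify the cropped input hand region 504 at higher resolution
    crop = cv2.resize(crop, second_input_size)
    second_vector = second_gesture_network(crop)              # vector 514
    return first_vector, second_vector
```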
[0039] A second gesture label vector 514 is determined from the input hand region 504 of the input image 502, e.g., using a second hand gesture network 516. The input hand region 504 is associated with a first hand gesture 518 based on both the first and second gesture label vectors 508 and 514. In some embodiments, the first hand gesture 518 is selected from a plurality of predefined hand gestures based on the first and second gesture label vectors 508 and 514. Further, in some embodiments, the plurality of predefined hand gestures are organized in an ordered sequence of hand gestures, and each of the first and second gesture label vectors 508 and 514 has a respective sequence of probability elements aligned with the ordered sequence of hand gestures. Each probability element of the first and second gesture label vectors 508 and 514 represents a probability of being associated with a respective and distinct predefined hand gesture in the ordered sequence of hand gestures. Each of the first and second gesture label vectors 508 and 514 is normalized to 1, i.e., a total probability of the predefined gestures is equal to 1 for each gesture label vector 508 or 514.
[0040] In an example, the sequence of predefined hand gestures includes 6 static single-hand gestures, e.g., a stopping gesture, a grabbing gesture, a thumb up gesture, a thumb down gesture, a high-five gesture, and a peace gesture, which are organized according to a predefined order. The first gesture label vector 508 has six probability elements that have a sum equal to 1, as does the second gesture label vector 514. For example, the first gesture label vector 508 is equal to [0, 0.5, 0.1, 0.4, 0, 0], and indicates that the hand gesture in the first image 506 has 50%, 10%, and 40% chances of being the grabbing gesture, thumb up, and thumb down gestures, respectively. In another example, the first gesture label vector 508 is equal to [0, 0, 0, 0, 0, 1], and indicates that the hand gesture in the first image 506 corresponds to the peace gesture with a probability of 100%. The second gesture label vector 514 is optionally equal to or distinct from the first gesture label vector 508.
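A short sketch of the normalized gesture label vectors described above, using the six-gesture example; the gesture names are illustrative labels for the ordered sequence.

```python
import numpy as np

# The six predefined gestures in their ordered sequence (from the example above)
GESTURES = ["stop", "grab", "thumb_up", "thumb_down", "high_five", "peace"]

def normalize(label_vector):
    """Normalize a gesture label vector so its probability elements sum to 1."""
    v = np.asarray(label_vector, dtype=float)
    return v / v.sum()

first_vector = normalize([0.0, 0.5, 0.1, 0.4, 0.0, 0.0])   # grab/thumb up/thumb down
second_vector = normalize([0.0, 0.0, 0.0, 0.0, 0.0, 1.0])  # peace gesture, 100%
```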
[0041] In some embodiments, the first gesture label vector 508 and second gesture label vector 514 are combined, e.g., in a weighted manner, to generate a comprehensive gesture label vector 520, and the first hand gesture 518 is selected from a plurality of predefined gestures based on the comprehensive gesture label vector 520. The plurality of predefined hand gestures have a predefined number of predefined hand gestures (e.g., 12 hand gestures) organized in an ordered sequence of hand gestures, and each of the first, second, and comprehensive gesture label vectors 508, 514, and 520 has the predefined number of probability elements. Each probability element of the first, second, or comprehensive gesture label vector 508, 514, or 520 represents a probability of being associated with a respective and distinct predefined hand gesture in the ordered sequence of hand gestures. Each of the first, second, and comprehensive gesture label vectors 508, 514, and 520 is normalized to 1, i.e., a total probability of the predefined gestures is equal to 1 for each gesture label vector, and each probability element is equal to or less than 1. Under some circumstances, one of the predefined number of probability elements of the comprehensive gesture label vector 520 is greater than any other probability element of the comprehensive gesture label vector 520, and the one of the predefined number of probability elements of the comprehensive gesture label vector 520 corresponds to the first hand gesture 518 in the ordered sequence of hand gestures.
[0042] In the above example, the plurality of predefined hand gestures includes 6 static single-hand gestures organized in the predefined order, and the first gesture label vector 508 is equal to [0, 0.5, 0.1, 0.4, 0, 0]. The second gesture label vector 514 is equal to [0, 0, 0, 0, 0, 1]. Weights for the first and second gesture label vectors 508 and 514 are equal to 0.5, so the comprehensive gesture label vector 520 is [0, 0.25, 0.05, 0.2, 0, 0.5]. The last probability element of the comprehensive gesture label vector 520 corresponds to the peace gesture and is greater than any other probability element. As such, the peace gesture is selected as the first hand gesture 518 associated with the input hand region 504 in the input image 502.
[0043] Further, in some embodiments, the first gesture label vector 508 and the second gesture label vector 514 are combined using a first weight and a second weight, respectively. The first hand region 510 is detected in the first image 506 using a first hand gesture network 512A having a first input size. The second gesture label vector 514 is determined from the input hand region 504 of the input image 502 using the second hand gesture network 516 having a second input size. In some embodiments, the first image 506 is resized to match the first input size, and the input hand region 504 of the input image 502 is resized to match the second input size. For example, a dimension of the first image 506 is expanded by filling black pixels to match a corresponding dimension of the first input size. When the first and second gesture label vectors 508 and 514 are combined in a weighted manner, the first weight is determined based on the first input size and a size of the first image 506, and the second weight is determined based on the second input size and a size of the input hand region 504 of the input image 502. In some embodiments, a sum of the first and second weights is equal to 1.
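The weighted combination and selection step can be sketched as follows, reproducing the numbers of the example above; the equal weights of 0.5 and the gesture names are assumptions of that example.

```python
import numpy as np

def combine_and_select(first_vector, second_vector, first_weight=0.5, second_weight=0.5,
                       gestures=("stop", "grab", "thumb_up", "thumb_down", "high_five", "peace")):
    """Weighted combination of gesture label vectors 508 and 514 into a
    comprehensive gesture label vector 520, then selection of the gesture whose
    probability element is greater than any other element."""
    comprehensive = (first_weight * np.asarray(first_vector, dtype=float)
                     + second_weight * np.asarray(second_vector, dtype=float))
    comprehensive /= comprehensive.sum()            # keep the vector normalized to 1
    return gestures[int(np.argmax(comprehensive))], comprehensive

gesture, vec = combine_and_select([0, 0.5, 0.1, 0.4, 0, 0], [0, 0, 0, 0, 0, 1])
# With equal weights the comprehensive vector is [0, 0.25, 0.05, 0.2, 0, 0.5] -> "peace"
```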
[0044] In some situations, the first image 506 (which is resized or not) has a second number of pixels that matches (e.g., is equal to) the first input size of the first hand gesture network 512A, and the input hand region 504 (which is resized or not) has a third number of pixels that matches the second input size of the second hand gesture network 516. Specifically, sizes of the input hand region 504 and the first hand region 510 are compared to determine whether the first or second gesture label vector 508 or 514 is more reliable. For example, when the size of the first hand region 510 is greater than the size of the input hand region 504, the first gesture label vector 508 is more reliable than the second gesture label vector 514, and the first weight is greater than the second weight. Conversely, when the size of the first hand region 510 is less than the size of the input hand region 504, the first gesture label vector 508 is less reliable than the second gesture label vector 514, and the first weight is less than the second weight.
[0045] Alternatively, in some situations, the second number of pixels of the first image 506 does not match the first input size, and is adjusted prior to being processed by the first hand gesture network 512A of the detection and classification model 512. The input hand region 504 has a third number of pixels that does not match the second input size, and is adjusted prior to being processed by the second hand gesture network 516. When the first and second gesture label vectors 508 and 514 are combined in a weighted manner, the first weight is determined based on the first input size and a size of the first image 506, and the second weight is determined based on the second input size and a size of the input hand region 504 of the input image 502. From a different perspective, the smaller the down-sampling rate of the first image 506 is, the greater the first weight of the first gesture label vector 508, because the smaller down-sampling rate allows the first image 506 to maintain more details of hand gestures in the input image 502 and makes the first gesture label vector 508 more reliable. The input sizes of the first and second hand gesture networks 512A and 516 indicate complexity levels of the first and second hand gesture networks 512A and 516, and an increase in the respective input size of the first or second hand gesture network 512A or 516 enhances reliability of the first or second gesture label vector 508 or 514 and increases a magnitude of the first or second weight, respectively.
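One plausible reading of the size comparison above is sketched below; the particular 0.7/0.3 split is an illustrative assumption, since the description only requires that the weights be ordered by region size and sum to 1.

```python
def choose_weights(first_hand_region_size, input_hand_region_size):
    """Weight the stage whose hand region is larger (and thus assumed more
    reliable) more heavily; the exact values are illustrative."""
    if first_hand_region_size > input_hand_region_size:
        return 0.7, 0.3      # first weight greater than second weight
    if first_hand_region_size < input_hand_region_size:
        return 0.3, 0.7      # first weight less than second weight
    return 0.5, 0.5          # equal sizes: equal weights
```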
[0046] The first gesture label vector 508 is determined from the first image 506 using a first hand gesture network 512A, and the second gesture label vector 514 is determined from the input hand region 504 of the input image 502 using a second hand gesture network 516. The first hand region 510 is detected in the first image 506 using a hand region detection network 512B. In some embodiments, the first hand gesture network 512A, second hand gesture network 516, and hand region detection network 512B are trained end-to-end or independently. In some embodiments, the first hand gesture network 512A, second hand gesture network 516, and hand region detection network 512B are trained at a server 102, and provided to an electronic device 104 for recognizing the first hand gesture 518 in the input image 502. Alternatively, in some embodiments, the first hand gesture network 512A, second hand gesture network 516, and hand region detection network 512B are trained and applied at a server 102 for recognizing the first hand gesture 518 in the input image 502. The input image 502 is captured by an electronic device 104 and uploaded to the server 102. Information of the first hand gesture 518 identified by the server 102 is sent to the electronic device 104.
[0047] The process 500 corresponds to a two-stage hand gesture determination method that provides accurate hand gestures 518 in real time. The first hand gesture 518 is determined in real time as the input image 502 is captured, i.e., a latency between image capture and gesture determination is less than a threshold latency (e.g., 10 milliseconds). To keep real-time performance, the process 500 down-samples an input image 502 to a predefined resolution of a first image 506 for the detection and classification model 512 in the first stage. The input image 502 has a higher resolution than the first image 506. The detection and classification model 512 outputs an estimated hand location corresponding to a first hand region 510 and the first gesture label vector 508. In the second stage, the process 500 uses the estimated hand location to crop the original high-resolution image 502 and resize a cropped area (i.e., the input hand region 504 of the input image 502) to a predefined resolution for the second hand gesture network 516. The second hand gesture network 516 estimates the second gesture label vector 514 from the input hand region 504. In some embodiments, the input hand region 504 provides more detailed information to the second hand gesture network 516 and enables the process 500 to predict the hand gesture in the input image 502 with a higher accuracy than is achievable with the first image 506 alone.
[0048] The second hand gesture network 516 processes the input image 502 having the higher resolution based on the first hand region 510 determined from the first image 506 in the first stage. This avoids searching the entire input image 502 for additional hand regions and significantly reduces processing time, enabling real-time hand gesture recognition. The second hand gesture network 516 does not identify any hand gesture that was not detected in the first stage; rather, it enhances the accuracy of hand gesture recognition from the first stage using the higher-resolution information in the input hand region 504. The two-stage hand gesture determination method thereby enables real-time hand gesture recognition while increasing an overall accuracy level (e.g., beyond a threshold accuracy level). The networks 512 and 516 applied in the two-stage hand gesture determination method are compact and efficient and can be implemented locally at an electronic device (e.g., a drone) having a limited computational capacity.
[0049] Weights applied to combine the first and second gesture label vectors 508 and 514 form a weight vector that is optionally normalized to 1. These weights reflect qualities of the detection and classification model 512 and second hand gesture network 516, i.e., how reliable the corresponding gesture label vectors 508 and 514 are. The weights are impacted by many factors including a resolution of the input image 502, the down-sampling rate between the input and first images 502 and 506, and the qualities of the networks 512 and 516. In some embodiments, the input image 502 is down-sampled and scaled to the first input size of the detection and classification model 512 in the first stage. A first scaling ratio is determined by a ratio of the size of the input image 502 and the first input size of the model 512.
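The scaling ratios described in this paragraph and the next can be computed directly from the image and network input dimensions. The side-length (square-root-of-area) convention below is one assumed way to express "a ratio of the sizes"; an area ratio or a per-axis ratio would work equally well.

```python
def scaling_ratio(content_width, content_height, net_input_width, net_input_height):
    # Ratio of the available content (full image or crop) to the network's fixed
    # input size; a value above 1 means the content is down-sampled before inference.
    return ((content_width * content_height) /
            float(net_input_width * net_input_height)) ** 0.5
```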
[0050] In some embodiments, the input image 502 is cropped to the input hand region 504 and scaled to the second input size of the second hand gesture network 516 in the second stage. A second scaling ratio is determined by a ratio of the size of the input hand region 504 of the input image 502 and the second input size of the network 516. When the second scaling ratio is greater than 1, the input hand region 504 is down-sampled, and information is removed from the input hand region 504 prior to being fed into the second hand gesture network 516. The second weight of the second gesture label vector 514 decreases with an increase of the second scaling ratio when the ratio is larger than 1. An increase of the second scaling ratio corresponds to more information being lost, the second gesture label vector 514 being less reliable, and the second weight being smaller. In contrast, when the second scaling ratio is less than 1, the input hand region 504 is up-sampled prior to being fed into the second hand gesture network 516, and no additional information is added except that a size of the input hand region 504 increases. The second weight of the second gesture label vector 514 therefore varies with the second scaling ratio when the ratio is larger than 1, but does not change once the second scaling ratio drops below 1.
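The piecewise behavior of the second weight described above can be written as a simple function of the second scaling ratio. The inverse-ratio decay below is an assumed example of a monotonically decreasing law, not a formula given in this disclosure; the only properties carried over from the paragraph are the decrease above a ratio of 1 and the flat response at or below 1.

```python
def second_weight(base_weight, second_scaling_ratio):
    """Second-branch weight: decreases while the crop is down-sampled (ratio > 1)
    and stays at its base value once the crop is up-sampled (ratio <= 1)."""
    if second_scaling_ratio > 1.0:
        return base_weight / second_scaling_ratio  # assumed decay: more loss, less weight
    return base_weight
```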
[0051] The process 500 fully utilizes the high-resolution image 502 for real-time hand gesture recognition. In particular, the corresponding two-stage framework improves classification accuracy for small hands in the image 502 and effectively extends the distance range that a single detection and classification model 512 can cover with a down-sampled image (e.g., the first image 506).
[0052] In some embodiments, the process 500 is similarly applied with additional networks to extract a human skeleton and/or a face from images when more computation resources are available on the electronic device 104. For example, a human skeleton network can take a down-sampled image to locate the face/hands area, and an area in the original high-resolution image is cropped for recognition processing based on the face/hands area. Specifically, in some embodiments, the electronic device 104 obtains an input image 502 including an input human or face region 504 where a human body or face is captured. The input image 502 is down-sampled to a first image 506, from which a first gesture label vector 508 is determined. A first human or face region 510 where the human body or face is captured is detected in the first image 506. In accordance with the first human or face region 510 in the first image 506, the input image 502 is cropped to the input human or face region 504 corresponding to the first human or face region 510 of the first image 506. A second gesture label vector 514 is determined from the input human or face region 504 of the input image 502. The electronic device 104 associates the input human or face region 504 with a first body gesture or facial expression based on both the first and second gesture label vectors 508 and 514.
[0053] Figure 6 is a flowchart of a method 600 for recognizing a hand gesture in an image, in accordance with some embodiments. For convenience, the method 600 is described as being implemented by a computer system (e.g., a client device 104, a server 102, or a combination thereof). In some embodiments, the client device 104 is a mobile phone 104C, AR glasses 104D, smart television device, or drone. Method 600 is, optionally, governed by instructions that are stored in a non-transitory computer readable storage medium and that are executed by one or more processors of the computer system. Each of the operations shown in Figure 6 may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., memory 206 of the computer system 200 in Figure 2). The computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other non-volatile memory device or devices. The instructions stored on the computer readable storage medium may include one or more of: source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors. Some operations in method 600 may be combined and/or the order of some operations may be changed.
[0054] The computer system obtains (602) an input image 502 including an input hand region 504 where a hand is captured. In a first stage, the computer system down-samples (604) the input image 502 to a first image 506, determines (606) a first gesture label vector 508 from the first image 506, and detects (608) in the first image 506 a first hand region 510 where the hand is captured. In some embodiments, the input image 502 having a first resolution is down-sampled to the first image 506 having a second resolution, and a ratio of the first resolution and the second resolution is equal to a down-sampling rate applied to down-sample the input image 502 to the first image 506. In a second stage, in accordance with the first hand region 510 in the first image 506, the computer system crops (610) the input image 502 to the input hand region 504 corresponding to the first hand region 510 of the first image 506. The computer system determines (612) a second gesture label vector 514 from the input hand region 504 of the input image 502, and associates (614) the input hand region 504 with a first hand gesture 518 based on both the first and second gesture label vectors 508 and 514. In some embodiments, each of the first and second gesture label vectors 508 and 514 is normalized (616). In some embodiments, the first hand region 510 has a rectangular shape and tightly encloses the hand in the first image 506.
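Step 616 only requires that each gesture label vector be normalized. A softmax over raw network scores, sketched below, is one common way to do this and is an assumption for illustration rather than the method prescribed by this disclosure.

```python
import numpy as np

def normalize_label_vector(scores):
    # Softmax normalization: the probability elements become non-negative
    # and sum to 1, one element per predefined hand gesture.
    scores = np.asarray(scores, dtype=np.float64)
    exp = np.exp(scores - scores.max())  # subtract the max for numerical stability
    return exp / exp.sum()
```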
[0055] In some embodiments, when the input hand region 504 is associated with the first hand gesture 518 based on both the first and second gesture label vectors 508 and 514, the first hand gesture 518 is selected from a plurality of predefined hand gestures based on the first and second gesture label vectors 508 and 514. Further, in some embodiments, the predefined hand gestures are organized in an ordered sequence of hand gestures, and each of the first and second gesture label vectors 508 and 514 has a respective sequence of probability elements aligned with the ordered sequence of hand gestures. Each probability element of the first and second gesture label vectors 508 and 514 represents a probability of being associated with a respective and distinct predefined hand gesture in the ordered sequence of hand gestures.
[0056] In some embodiments, when the input hand region 504 is associated with the first hand gesture 518 based on both the first and second gesture label vectors 508 and 514, the first gesture label vector 508 and the second gesture label vector 514 are combined (618) to generate a comprehensive gesture label vector 520, and the first hand gesture 518 is selected (620) from a plurality of predefined hand gestures based on the comprehensive gesture label vector 520. Further, in some embodiments, the plurality of predefined hand gestures have a predefined number of predefined hand gestures organized in an ordered sequence of hand gestures, and each of the first, second, and comprehensive gesture label vectors 508, 514, and 520 has the predefined number of probability elements. Each probability element of the first, second, or comprehensive gesture label vector 508, 514, or 520 represents a probability of being associated with a respective and distinct predefined hand gesture in the ordered sequence of hand gestures. Additionally, in some embodiments, when selecting the one of the plurality of predefined hand gestures, the computer system determines that one of the predefined number of probability elements of the comprehensive gesture label vector 520 is greater than any other probability element of the comprehensive gesture label vector 520 and that the one of the predefined number of probability elements corresponds to the first hand gesture 518 in the ordered sequence of hand gestures.
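The selection rule in this paragraph amounts to taking the largest probability element of the comprehensive gesture label vector 520 and mapping its index back to the ordered sequence of hand gestures. In the sketch below, the gesture names, default weights, and function name are hypothetical examples, not values drawn from this disclosure.

```python
import numpy as np

ORDERED_GESTURES = ["fist", "open palm", "thumb up", "OK", "point"]  # hypothetical sequence

def select_gesture(first_vec, second_vec, w1=0.5, w2=0.5):
    # Weighted combination of the two gesture label vectors (the comprehensive vector),
    # followed by an argmax over its probability elements.
    comprehensive = w1 * np.asarray(first_vec) + w2 * np.asarray(second_vec)
    best = int(np.argmax(comprehensive))
    return ORDERED_GESTURES[best], comprehensive[best]
```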
[0057] Further, in some embodiments, the first gesture label vector 508 and the second gesture label vector 514 are combined using a first weight and a second weight, respectively. The first hand region 510 is detected in the first image 506 using a first hand gesture network 512A having a first input size, and the second gesture label vector 514 is determined from the input hand region 504 of the input image 502 using a second hand gesture network 516 having a second input size. The first weight is determined based on the first input size and a size of the first image 506. The second weight is determined based on the second input size and a size of the input hand region 504 of the input image 502. In an example, the first weight is greater than the second weight when a size of the first hand region 510 is greater than the size of the input hand region 504, and the first weight is less than the second weight when the size of the first hand region 510 is less than the size of the input hand region 504.
[0058] In some embodiments, the first gesture label vector 508 is detected from the first image 506 using a first hand gesture network 512A, and the second gesture label vector 514 is determined from the input hand region 504 of the input image 502 using a second hand gesture network 516. The first hand region 510 is detected in the first image 506 using a hand region detection network 512B. The first hand gesture network 512A, second hand gesture network 516, and hand region detection network 512B are trained end-to-end or independently. Further, in some embodiments, the first hand gesture network 512A, second hand gesture network 516, and hand region detection network 512B are trained at a server 102, and provided to an electronic device 104 for recognizing the first hand gesture 518 in the input image 502. Alternatively, in some embodiments, the first hand gesture network 512A, second hand gesture network 516, and hand region detection network 512B are trained and applied at a server 102 for recognizing the first hand gesture 518 in the input image 502. The input image 502 is captured by an electronic device 104 and uploaded to the server 102. Information of the first hand gesture 518 is sent by the server 102 to the electronic device 104.
[0059] It should be understood that the particular order in which the operations in Figure 6 have been described is merely exemplary and is not intended to indicate that the described order is the only order in which the operations could be performed. One of ordinary skill in the art would recognize various ways to recognize a hand gesture as described herein. Additionally, it should be noted that details of other processes described above with respect to Figure 5 are also applicable in an analogous manner to the method 600 described above with respect to Figure 6. For brevity, these details are not repeated here.
[0060] The terminology used in the description of the various described implementations herein is for the purpose of describing particular implementations only and is not intended to be limiting. As used in the description of the various described implementations and the appended claims, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term “and/or” as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms “includes,” “including,” “comprises,” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Additionally, it will be understood that, although the terms “first,” “second,” etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another.
[0061] As used herein, the term “if” is, optionally, construed to mean “when” or “upon” or “in response to determining” or “in response to detecting” or “in accordance with a determination that,” depending on the context. Similarly, the phrase “if it is determined” or “if [a stated condition or event] is detected” is, optionally, construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event]” or “in accordance with a determination that [a stated condition or event] is detected,” depending on the context.
[0062] The foregoing description, for purpose of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the claims to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain principles of operation and practical applications, to thereby enable others skilled in the art.
[0063] Although various drawings illustrate a number of logical stages in a particular order, stages that are not order dependent may be reordered and other stages may be combined or broken out. While some reordering or other groupings are specifically mentioned, others will be obvious to those of ordinary skill in the art, so the ordering and groupings presented herein are not an exhaustive list of alternatives. Moreover, it should be recognized that the stages can be implemented in hardware, firmware, software or any combination thereof.

Claims

What is claimed is:
1. A method for recognizing hand gestures, comprising: obtaining an input image including an input hand region where a hand is captured; down-sampling the input image to a first image; determining a first gesture label vector from the first image; detecting in the first image a first hand region where the hand is captured; in accordance with the first hand region in the first image, cropping the input image to the input hand region corresponding to the first hand region of the first image; determining a second gesture label vector from the input hand region of the input image; and associating the input hand region with a first hand gesture based on both the first and second gesture label vectors.
2. The method of claim 1, wherein associating the input hand region with the first hand gesture based on both the first and second gesture label vectors further comprises: selecting the first hand gesture from a plurality of predefined hand gestures based on the first and second gesture label vectors.
3. The method of claim 2, wherein the predefined hand gestures are organized in an ordered sequence of hand gestures, and each of the first and second gesture label vectors has a respective sequence of probability elements aligned with the ordered sequence of hand gestures, each probability element of the first and second gesture label vectors representing a probability of being associated with a respective and distinct predefined hand gesture in the ordered sequence of hand gestures.
4. The method of claim 1, wherein associating the input hand region with the first hand gesture based on both the first and second gesture label vectors further comprises: combining the first gesture label vector and the second gesture label vector to generate a comprehensive gesture label vector; and selecting the first hand gesture from a plurality of predefined hand gestures based on the comprehensive gesture label vector.
5. The method of claim 4, wherein the plurality of predefined hand gestures have a predefined number of predefined hand gestures organized in an ordered sequence of hand gestures, and each of the first, second, and comprehensive gesture label vectors has the predefined number of probability elements, each probability element of the first, second, or comprehensive gesture label vector representing a probability of being associated with a respective and distinct predefined hand gesture in the ordered sequence of hand gestures.
6. The method of claim 5, wherein selecting the one of the plurality of predefined hand gestures further comprises: determining that one of the predefined number of probability elements of the comprehensive gesture label vector is greater than any other probability element of the comprehensive gesture label vector and that the one of the predefined number of probability elements of the comprehensive gesture label vector corresponds to the first hand gesture in the ordered sequence of hand gestures.
7. The method of any of claims 4-6, wherein the first gesture label vector and the second gesture label vector are combined using a first weight and a second weight, respectively, and wherein the first hand region is detected in the first image using a first hand gesture network having a first input size, and the second gesture label vector is determined from the input hand region of the input image using a second hand gesture network having a second input size, the method further comprising: determining the first weight based on the first input size and a size of the first image; and determining the second weight based on the second input size and a size of the input hand region of the input image.
8. The method of claim 7, wherein the first weight is greater than the second weight when a size of the first hand region is greater than the size of the input hand region, and the first weight is less than the second weight when the size of the first hand region is less than the size of the input hand region.
9. The method of any of the preceding claims, wherein the input image is down-sampled to the first image having a second resolution, the input image having a first resolution, a ratio of the first resolution and the second resolution equal to a down-sampling rate.
10. The method of any of the preceding claims, wherein each of the first and second gesture label vectors is normalized.
11. The method of any of the preceding claims, wherein: the first gesture label vector is detected from the first image using a first hand gesture network, and the second gesture label vector is determined from the input hand region of the input image using a second hand gesture network; the first hand region is detected in the first image using a hand region detection network; and the first hand gesture network, second hand gesture network, and hand region detection network are trained end-to-end or independently.
12. The method of claim 11, wherein the first hand gesture network, second hand gesture network, and hand region detection network are trained at a server, and provided to an electronic device for recognizing the first hand gesture in the input image.
13. The method of claim 11, wherein: the first hand gesture network, second hand gesture network, and hand region detection network are trained and applied at a server for recognizing the first hand gesture in the input image; the input image is captured by an electronic device and uploaded to the server; and information of the first hand gesture is sent to the electronic device.
14. The method of any of the preceding claims, wherein the first hand region has a rectangular shape and tightly encloses the hand in the first image.
15. An electronic device, comprising: one or more processors; and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform a method of any of claims 1-14.
16. A non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform a method of any of claims 1-14.
PCT/US2021/054825 2021-10-13 2021-10-13 Two-stage hand gesture recognition WO2023063944A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2021/054825 WO2023063944A1 (en) 2021-10-13 2021-10-13 Two-stage hand gesture recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2021/054825 WO2023063944A1 (en) 2021-10-13 2021-10-13 Two-stage hand gesture recognition

Publications (1)

Publication Number Publication Date
WO2023063944A1 true WO2023063944A1 (en) 2023-04-20

Family

ID=85988799

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2021/054825 WO2023063944A1 (en) 2021-10-13 2021-10-13 Two-stage hand gesture recognition

Country Status (1)

Country Link
WO (1) WO2023063944A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140022164A1 (en) * 2009-11-06 2014-01-23 Sony Corporation Real time hand tracking, pose classification, and interface control
US20140310764A1 (en) * 2013-04-12 2014-10-16 Verizon Patent And Licensing Inc. Method and apparatus for providing user authentication and identification based on gestures
US20160267325A1 (en) * 2015-03-12 2016-09-15 Qualcomm Incorporated Systems and methods for object tracking
US20180189556A1 (en) * 2017-01-03 2018-07-05 Intel Corporation Hand gesture recognition for virtual reality and augmented reality devices

Similar Documents

Publication Publication Date Title
WO2021184026A1 (en) Audio-visual fusion with cross-modal attention for video action recognition
WO2021081562A2 (en) Multi-head text recognition model for multi-lingual optical character recognition
WO2023101679A1 (en) Text-image cross-modal retrieval based on virtual word expansion
US20240203152A1 (en) Method for identifying human poses in an image, computer system, and non-transitory computer-readable medium
WO2021077140A2 (en) Systems and methods for prior knowledge transfer for image inpainting
WO2023102223A1 (en) Cross-coupled multi-task learning for depth mapping and semantic segmentation
WO2022103877A1 (en) Realistic audio driven 3d avatar generation
US20240296697A1 (en) Multiple Perspective Hand Tracking
US20240153184A1 (en) Real-time hand-held markerless human motion recording and avatar rendering in a mobile platform
WO2023027712A1 (en) Methods and systems for simultaneously reconstructing pose and parametric 3d human models in mobile devices
WO2023091131A1 (en) Methods and systems for retrieving images based on semantic plane features
WO2023277877A1 (en) 3d semantic plane detection and reconstruction
WO2023133285A1 (en) Anti-aliasing of object borders with alpha blending of multiple segmented 3d surfaces
WO2023086398A1 (en) 3d rendering networks based on refractive neural radiance fields
WO2023063944A1 (en) Two-stage hand gesture recognition
WO2023018423A1 (en) Learning semantic binary embedding for video representations
US20240087344A1 (en) Real-time scene text area detection
WO2024076343A1 (en) Masked bounding-box selection for text rotation prediction
WO2023091129A1 (en) Plane-based camera localization
WO2024123372A1 (en) Serialization and deserialization of layered depth images for 3d rendering
WO2024072410A1 (en) Real-time hand gesture tracking and recognition
WO2022250689A1 (en) Progressive video action recognition using scene attributes
WO2022103412A1 (en) Methods for recognition of air-swipe gestures
WO2023211444A1 (en) Real-time on-device large-distance gesture recognition with lightweight deep learning models
WO2023167658A1 (en) Image processing with encoder-decoder networks having skip connections

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21960795

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21960795

Country of ref document: EP

Kind code of ref document: A1