WO2022103412A1 - Methods for recognizing air swipe gestures - Google Patents

Methods for recognizing air swipe gestures

Info

Publication number
WO2022103412A1
WO2022103412A1 (application PCT/US2020/063624, US2020063624W)
Authority
WO
WIPO (PCT)
Prior art keywords
hand
pose
moving direction
gesture
sequence
Prior art date
Application number
PCT/US2020/063624
Other languages
English (en)
Inventor
Yang Zhou
Xiang Li
Yu Gao
Jie Liu
Yi Xu
Original Assignee
Innopeak Technology, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Innopeak Technology, Inc. filed Critical Innopeak Technology, Inc.
Priority to CN202080107016.6A (published as CN116457744A)
Publication of WO2022103412A1


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • G06V40/28 - Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 - Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011 - Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 - Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017 - Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/20 - Scenes; Scene-specific elements in augmented reality scenes

Definitions

  • the present application generally relates to artificial intelligence, particularly to methods and systems for using deep learning techniques to recognize air gestures and implement gesture control on user interfaces of an electronic device.
  • Gesture control is an important component of the user interface on modern-day electronic devices.
  • touch gestures are used for invoking various user interface functions.
  • Common touch gestures include tap, double tap, swipe, pinch, zoom, rotate, etc.
  • Each touch gesture is typically associated with a certain user interface function.
  • the swipe gesture is often used to scroll up and down a web page and to switch between photos within a photo album application.
  • touchless air gestures are used to implement certain user interface functions for electronic devices having no touch screens, e.g., head mounted display (e.g., virtual reality headsets, augmented reality glasses, mixed reality headsets).
  • These devices having no touch screens can include front-facing cameras or miniature radars to track human hands in real time.
  • some head mounted displays have implemented hand tracking functions to complete user interaction including selecting, clicking, and typing on a virtual keyboard.
  • Air gestures can also be used on the devices with touch screens when a user's hands are not available to touch the screen (e.g., while preparing a meal, the user can use air gestures to scroll down a recipe so that the user does not need to touch the device screen with wet hands). It would be beneficial to have an efficient air gesture recognition mechanism that can distinguish intended air gestures from alternative air gestures that are not meant to trigger any functions, particularly because air gestures do not occur concurrently with touches.
  • the present application describes embodiments related to air gesture control and, more particularly, to systems and methods for identifying air gestures based on hand poses and movement.
  • Different hand poses of a single hand are associated with two air swipes and their respective returning motions.
  • Each air swipe initiates a user interface function, while the returning motions do not initiate any user interface functions.
  • the two associated air swipes have opposite directions, and a returning motion of a first air swipe has the same direction as a second air swipe, potentially leading to an error in detecting the second air swipe.
  • the returning motion of the first air swipe is associated with a hand pose distinct from that of the second air swipe, and likewise the returning motion of the second air swipe is associated with a hand pose distinct from that of the first air swipe.
  • a method for identifying user gestures is implemented at a computer system having one or more processors and memory.
  • the method includes receiving a first sequence of image frames, identifying a first hand pose of a hand at different locations within the first sequence of image frames, determining a first moving direction of the first hand pose, determining a first hand gesture based on the first hand pose and first moving direction, and performing a first predefined operation on a first one of a set of objects according to the first moving direction and the first hand pose of the first hand gesture.
  • the method further includes after performing the first predefined operation, receiving a second sequence of image frames and identifying a second hand pose of the same hand at different locations within the second sequence of image frames.
  • the second hand pose is different from the first hand pose.
  • the method further includes determining a second moving direction of the second hand pose, determining a second hand gesture based on the second hand pose and second moving direction, and selecting a second one of the set of objects according to the second moving direction and the second hand pose of the second hand gesture.
  • the second moving direction is different from the first moving direction.
  • the method includes disabling (e.g., not initiating) any predefined operations on the second one of the set of objects according to the second moving direction and the second hand pose of the second hand gesture.
  • the method further includes performing a second predefined operation on the second one of the set of objects according to the second moving direction and the second hand pose of the second hand gesture.
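  • As a non-limiting illustration, the following Python sketch outlines how such a method could be structured: a per-frame hand pose classifier and the hand's horizontal displacement yield a (pose, moving direction) pair, which is looked up in a registry of predefined operations. The classifier, distance threshold, pose labels, and operation names are assumptions made for the example rather than the claimed implementation.

```python
from typing import Dict, Iterable, Optional, Tuple

Gesture = Tuple[str, str]  # (hand pose label, moving direction)


def identify_gesture(frames: Iterable,
                     classify_pose,
                     min_travel: float = 0.15) -> Optional[Gesture]:
    """Identify a hand pose maintained across a sequence of image frames and the
    direction in which that pose moves along the image x axis."""
    poses, xs = [], []
    for frame in frames:
        pose, x = classify_pose(frame)   # assumed per-frame detector: pose label + hand center x
        poses.append(pose)
        xs.append(x)
    if not poses or len(set(poses)) != 1:
        return None                      # the hand pose must be maintained throughout
    travel = xs[-1] - xs[0]
    if abs(travel) < min_travel:
        return None                      # the hand must move at least a threshold distance
    return poses[0], ("right" if travel > 0 else "left")


def handle_sequence(frames, classify_pose,
                    registry: Dict[Gesture, Optional[str]]) -> Optional[str]:
    """Look up the identified (pose, direction) gesture in a registry of operations.
    Gestures registered as None are returning motions and trigger no operation."""
    gesture = identify_gesture(frames, classify_pose)
    if gesture is None:
        return None
    return registry.get(gesture)
```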
  • an electronic device includes one or more processing units, memory and a plurality of programs stored in the memory.
  • the programs when executed by the one or more processing units, cause the electronic device to perform the method for identifying user gestures as described above.
  • a non-transitory computer readable storage medium stores a plurality of programs for execution by an electronic apparatus having one or more processing units. The programs, when executed by the one or more processing units, cause the electronic apparatus to perform the method for identifying user gestures as described above.
  • Figure 1 A is an example data processing environment having one or more servers communicatively coupled to one or more client devices, in accordance with some embodiments.
  • Figure 1B illustrates a pair of augmented reality (AR) glasses (also called a head-mounted display) that can be communicatively coupled to a data processing environment, in accordance with some embodiments.
  • Figure 2 is a block diagram illustrating a data processing system, in accordance with some embodiments.
  • Figure 3 is another example data processing system for training and applying a neural network based data processing model for processing content data (e.g., video, image, audio, or textual data), in accordance with some embodiments.
  • Figure 4A is an example neural network (NN) applied to process content data in an NN-based data processing model, in accordance with some embodiments.
  • Figure 4B is an example node 420 in the NN, in accordance with some embodiments.
  • Figures 5A and 5B illustrate an exemplary combination of dynamic air hand gestures for gesture control, in accordance with some embodiments.
  • Figures 6A and 6B illustrate another exemplary combination of dynamic air hand gestures for gesture control, in accordance with some embodiments.
  • Figures 7A-7C illustrate three exemplary sets of solutions for gesture control based on hand movement, in accordance with some embodiments.
  • Figure 8 is a flowchart illustrating an exemplary process for registering sequences of gesture control corresponding to predefined user interface functions, in accordance with some embodiments.
  • Figure 9 is a flowchart illustrating an exemplary process for identifying gestures for gesture control, in accordance with some embodiments.
  • Figure 1A is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, in accordance with some embodiments.
  • the one or more client devices 104 may be, for example, desktop computers 104A, tablet computers 104B, mobile phones 104C, or intelligent, multi-sensing, network-connected home devices (e.g., a camera).
  • Each client device 104 can collect data or user inputs, execute user applications, and present outputs on its user interface. The collected data or user inputs can be processed locally at the client device 104 and/or remotely by the server(s) 102.
  • the one or more servers 102 provides system data (e.g., boot files, operating system images, and user applications) to the client devices 104, and in some embodiments, processes the data and user inputs received from the client device(s) 104 when the user applications are executed on the client devices 104.
  • the data processing environment 100 further includes a storage 106 for storing data related to the servers 102, client devices 104, and applications executed on the client devices 104.
  • the one or more servers 102 can enable real-time data communication with the client devices 104 that are remote from each other or from the one or more servers 102, and implement some data processing tasks that cannot be or are preferably not completed locally by the client devices 104.
  • the client devices 104 include a game console that executes an interactive online gaming application.
  • the game console receives a user instruction and sends it to a game server 102 with user data.
  • the game server 102 generates a stream of video data based on the user instruction and user data and provides the stream of video data for display on the game console and other client devices that are engaged in the same game session with the game console.
  • the client devices 104 include a networked surveillance camera and a mobile phone 104C.
  • the networked surveillance camera collects video data and streams the video data to a surveillance camera server 102 in real time. While the video data is optionally pre-processed on the surveillance camera, the surveillance camera server 102 processes the video data to identify motion or audio events in the video data and share information of these events with the mobile phone 104C, thereby allowing a user of the mobile phone 104C to monitor the events occurring near the networked surveillance camera remotely and in real time.
  • the one or more servers 102, one or more client devices 104, and storage 106 are communicatively coupled to each other via one or more communication networks 108, which are the medium used to provide communications links between these devices and computers connected together within the data processing environment 100.
  • the one or more communication networks 108 may include connections, such as wire, wireless communication links, or fiber optic cables.
  • Examples of the one or more communication networks 108 include local area networks (LAN), wide area networks (WAN) such as the Internet, or a combination thereof.
  • the one or more communication networks 108 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol.
  • a connection to the one or more communication networks 108 may be established either directly (e.g., using 3G/4G connectivity to a wireless carrier), or through a network interface 110 (e.g., a router, switch, gateway, hub, or an intelligent, dedicated whole-home control node), or through any combination thereof.
  • the one or more communication networks 108 can represent the Internet, a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another.
  • At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational and other computer systems that route data and messages.
  • Deep learning techniques are applied in the data processing environment 100 to process content data (e.g., video, image, audio, or textual data) obtained by an application executed at a client device 104 to identify information contained in the content data, match the content data with other data, categorize the content data, or synthesize related content data.
  • data processing models are created based on one or more neural networks to process the content data. These data processing models are trained with training data before they are applied to process the content data.
  • both model training and data processing are implemented locally at each individual client device 104 (e.g., the client device 104C).
  • the client device 104C obtains the training data from the one or more servers 102 or storage 106 and applies the training data to train the data processing models. Subsequent to model training, the client device 104C obtains the content data (e.g., captures video data via an internal camera) and processes the content data locally using the trained data processing models. Alternatively, in some embodiments, both model training and data processing are implemented remotely at a server 102 (e.g., the server 102A) associated with a client device 104 (e.g., the client device 104A). The server 102A obtains the training data from itself, another server 102, or the storage 106 and applies the training data to train the data processing models.
  • the client device 104A obtains the content data, sends the content data to the server 102A (e.g., in an application) for data processing using the trained data processing models, receives data processing results from the server 102A, and presents the results on a user interface (e.g., associated with the application).
  • the client device 104A itself implements little or no data processing on the content data prior to sending it to the server 102A.
  • data processing is implemented locally at a client device 104 (e.g., the client device 104B), while model training is implemented remotely at a server 102 (e.g., the server 102B) associated with the client device 104B.
  • the server 102B obtains the training data from itself another server 102 or the storage 106 and applies the training data to train the data processing models.
  • the trained data processing models are optionally stored in the server 102B or storage 106.
  • the client device 104B imports the trained data processing models from the server 102B or storage 106, processes the content data using the data processing models, and generates data processing results to be presented on a user interface locally.
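  • A minimal sketch of this split (training at a server, inference at a client device) is shown below using PyTorch; the file path, function names, and the assumption that the model outputs class logits for hand poses are illustrative choices for the example, not a specification of the described system.

```python
import torch

MODEL_PATH = "gesture_model.pt"   # hypothetical location in the server 102B or storage 106


def export_trained_model(model: torch.nn.Module, path: str = MODEL_PATH) -> None:
    """Server side: store the trained data processing model's weights."""
    torch.save(model.state_dict(), path)


def load_and_infer(model: torch.nn.Module, image: torch.Tensor, path: str = MODEL_PATH) -> int:
    """Client side (e.g., client device 104B): import the trained weights and process
    content data locally, returning e.g. the index of a recognized hand pose."""
    model.load_state_dict(torch.load(path, map_location="cpu"))
    model.eval()
    with torch.no_grad():
        logits = model(image.unsqueeze(0))   # add a batch dimension
    return int(logits.argmax(dim=1).item())
```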
  • Figure 1B illustrates a pair of augmented reality (AR) glasses 150 (also called a head-mounted display) that can be communicatively coupled to a data processing environment 100, in accordance with some embodiments.
  • the AR glasses 150 can include a camera, a microphone, a speaker, and a display.
  • the camera and microphone are configured to capture video and audio data from a scene of the AR glasses 150. In some situations, the camera captures hand gestures of a user wearing the AR glasses 150. In some situations, the microphone records ambient sound, including user’s voice commands.
  • the video or audio data captured by the camera or microphone is processed by the AR glasses 150, server(s) 102, or both to recognize a user instruction.
  • Figure 2 is a block diagram illustrating a data processing system 200, in accordance with some embodiments.
  • the data processing system 200 includes a server 102, a client device 104 (e.g., the AR glasses 150 in Figure 1B), a storage 106, or a combination thereof.
  • the data processing system 200 typically includes one or more processing units (CPUs) 202, one or more network interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset).
  • the data processing system 200 includes one or more input devices 210 that facilitate user input, such as a keyboard, a mouse, a voice-command input unit or microphone, a touch screen display, a touch-sensitive input pad, a gesture capturing camera, or other input buttons or controls.
  • the client device 104 of the data processing system 200 uses a microphone and voice recognition or a camera and gesture recognition to supplement or replace the keyboard.
  • the client device 104 includes one or more cameras, scanners, or photo sensor units for capturing images, for example, of graphic serial codes printed on the electronic devices.
  • the data processing system 200 also includes one or more output devices 212 that enable presentation of user interfaces and display content, including one or more speakers and/or one or more visual displays.
  • the client device 104 includes a location detection device, such as a GPS (global positioning satellite) or other geo-location receiver, for determining the location of the client device 104.
  • Memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices. Memory 206, optionally, includes one or more storage devices remotely located from one or more processing units 202. Memory 206, or alternatively the non-volatile memory within memory 206, includes a non-transitory computer readable storage medium. In some embodiments, memory 206, or the non-transitory computer readable storage medium of memory 206, stores the following programs, modules, and data structures, or a subset or superset thereof:
  • Operating system 214 including procedures for handling various basic system services and for performing hardware dependent tasks;
  • Network communication module 216 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or storage 106) via one or more network interfaces 204 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;
  • User interface module 218 for enabling presentation of information (e.g., a graphical user interface for application(s) 224, widgets, websites and web pages thereof, and/or games, audio and/or video content, text, etc.) at each client device 104 via one or more output devices 212 (e.g., displays, speakers, etc.);
  • Input processing module 220 for detecting one or more user inputs or interactions from one of the one or more input devices 210 and interpreting the detected input or interaction;
  • Web browser module 222 for navigating, requesting (e.g., via HTTP), and displaying websites and web pages thereof, including a web interface for logging into a user account associated with a client device 104 or another electronic device, controlling the client or electronic device if associated with the user account, and editing and reviewing settings and data that are associated with the user account;
  • One or more user applications 224 for execution by the data processing system 200, e.g., games, social network applications, smart home applications, and/or other web or non-web based applications for controlling another electronic device and reviewing data captured by such devices;
  • Model training module 226 for receiving training data and establishing a data processing model for processing content data (e.g., video, image, audio, or textual data) to be collected or obtained by a client device 104;
  • Data processing module 228 for processing content data using data processing models 240, thereby identifying information contained in the content data, matching the content data with other data, categorizing the content data, or synthesizing related content data, where in some embodiments, the data processing module 228 is associated with one of the user applications 224 to process the content data in response to a user instruction received from the user application 224;
  • One or more databases 230 for storing at least data including one or more of:
    o Device settings 232 including common device settings (e.g., service tier, device model, storage capacity, processing capabilities, communication capabilities, etc.) of the one or more servers 102 or client devices 104;
    o User account information 234 for the one or more user applications 224, e.g., user names, security questions, account history data, user preferences, and predefined account settings;
    o Network parameters 236 for the one or more communication networks 108, e.g., IP address, subnet mask, default gateway, DNS server and host name; and
    o Training data 238 for training one or more data processing models 240.
  • the one or more databases 230 are stored in one of the server 102, client device 104, and storage 106 of the data processing system 200.
  • the one or more databases 230 are distributed in more than one of the server 102, client device 104, and storage 106 of the data processing system 200.
  • more than one copy of the above data is stored at distinct devices, e.g., two copies of the data processing models 240 are stored at the server 102 and storage 106, respectively.
  • Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above.
  • The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, and various subsets of these modules may be combined or otherwise rearranged in some embodiments.
  • memory 206 optionally, stores a subset of the modules and data structures identified above.
  • memory 206 optionally, stores additional modules and data structures not described above.
  • FIG. 3 is another example data processing system 300 for training and applying a neural network based (NN-based) data processing model 240 for processing content data (e.g., video, image, audio, or textual data), in accordance with some embodiments.
  • the data processing system 300 includes a model training module 226 for establishing the data processing model 240 and a data processing module 228 for processing the content data using the data processing model 240.
  • both of the model training module 226 and the data processing module 228 are located on a client device 104 of the data processing system 300, while a training data source 304 distinct from the client device 104 provides training data 306 to the client device 104.
  • the training data source 304 is optionally a server 102 or storage 106.
  • both of the model training module 226 and the data processing module 228 are located on a server 102 of the data processing system 300.
  • the training data source 304 providing the training data 306 is optionally the server 102 itself, another server 102, or the storage 106.
  • the model training module 226 and the data processing module 228 are separately located on a server 102 and client device 104, and the server 102 provides the trained data processing model 240 to the client device 104.
  • the model training module 226 includes one or more data pre-processing modules 308, a model training engine 310, and a loss control module 312.
  • the data processing model 240 is trained according to a type of the content data to be processed.
  • the training data 306 is consistent with the type of the content data, and so is the data pre-processing module 308 applied to process the training data 306.
  • an image pre-processing module 308A is configured to process image training data 306 to a predefined image format, e.g., extract a region of interest (ROI) in each training image, and crop each training image to a predefined image size.
  • an audio pre-processing module 308B is configured to process audio training data 306 to a predefined audio format, e.g., converting each training sequence to a frequency domain using a Fourier transform.
  • the model training engine 310 receives pre-processed training data provided by the data pre-processing modules 308, further processes the pre-processed training data using an existing data processing model 240, and generates an output from each training data item.
  • the loss control module 312 can monitor a loss function comparing the output associated with the respective training data item and a ground truth of the respective training data item.
  • the model training engine 310 modifies the data processing model 240 to reduce the loss function, until the loss function satisfies a loss criterion (e.g., a comparison result of the loss function is minimized or reduced below a loss threshold).
  • the modified data processing model 240 is provided to the data processing module 228 to process the content data.
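  • The training behavior described above could be sketched as follows in PyTorch; the optimizer, learning rate, and loss threshold are assumptions for the example, and the loop simply repeats forward and backward propagation until the monitored loss falls below the threshold.

```python
import torch


def train_until_converged(model: torch.nn.Module, loader, loss_threshold: float = 0.05,
                          max_epochs: int = 100) -> torch.nn.Module:
    """Modify the model to reduce the loss until the loss criterion is satisfied."""
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
    loss_fn = torch.nn.CrossEntropyLoss()            # compares output with the ground truth
    for _ in range(max_epochs):
        epoch_loss = 0.0
        for inputs, ground_truth in loader:
            optimizer.zero_grad()
            output = model(inputs)                   # forward propagation
            loss = loss_fn(output, ground_truth)     # loss monitored by the loss control module
            loss.backward()                          # backward propagation
            optimizer.step()                         # adjust weights to decrease the error
            epoch_loss += loss.item()
        if epoch_loss / max(len(loader), 1) < loss_threshold:
            break                                    # loss criterion satisfied
    return model
```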
  • the model training module 226 offers supervised learning in which the training data is entirely labelled and includes a desired output for each training data item (also called the ground truth in some situations). Conversely, in some embodiments, the model training module 226 offers unsupervised learning in which the training data are not labelled.
  • the model training module 226 is configured to identify previously undetected patterns in the training data without pre-existing labels and with no or little human supervision. Additionally, in some embodiments, the model training module 226 offers partially supervised learning in which the training data are partially labelled.
  • the data processing module 228 includes one or more data pre-processing modules 314, a model-based processing module 316, and a data post-processing module 318.
  • the data pre-processing modules 314 pre-process the content data based on the type of the content data. Functions of the data pre-processing modules 314 are consistent with those of the pre-processing modules 308 and convert the content data to a predefined content format that is acceptable by inputs of the model-based processing module 316. Examples of the content data include one or more of video, image, audio, textual, and other types of data.
  • each image is pre-processed to extract an ROI or cropped to a predefined image size, and an audio clip is pre-processed to convert to a frequency domain using a Fourier transform.
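  • The pre-processing examples above might look like the following sketch; the ROI format, the 224x224 output size, and the nearest-neighbour resize are assumptions chosen to keep the example self-contained.

```python
import numpy as np


def preprocess_image(image: np.ndarray, roi: tuple, size: tuple = (224, 224)) -> np.ndarray:
    """Extract a region of interest (ROI) and resize it to a predefined image size
    using nearest-neighbour sampling (kept dependency-free for the sketch)."""
    x, y, w, h = roi                                  # ROI assumed to come from a hand detector
    crop = image[y:y + h, x:x + w]
    rows = np.linspace(0, crop.shape[0] - 1, size[0]).astype(int)
    cols = np.linspace(0, crop.shape[1] - 1, size[1]).astype(int)
    return crop[rows][:, cols]


def preprocess_audio(clip: np.ndarray) -> np.ndarray:
    """Convert an audio clip to the frequency domain using a Fourier transform."""
    return np.abs(np.fft.rfft(clip))
```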
  • the content data includes two or more types, e.g., video data and textual data.
  • the model-based processing module 316 applies the trained data processing model 240 provided by the model training module 226 to process the pre-processed content data.
  • the model-based processing module 316 can also monitor an error indicator to determine whether the content data has been properly processed in the data processing model 240.
  • the processed content data is further processed by the data postprocessing module 318 to present the processed content data in a preferred format or to provide other related information that can be derived from the processed content data.
  • Figure 4A is an example neural network (NN) 400 applied to process content data in an NN-based data processing model 240, in accordance with some embodiments.
  • Figure 4B is an example node 420 in the neural network (NN) 400, in accordance with some embodiments.
  • the data processing model 240 is established based on the neural network 400.
  • a corresponding model-based processing module 316 applies the data processing model 240 including the neural network 400 to process content data that has been converted to a predefined content format.
  • the neural network 400 includes a collection of nodes 420 that are connected by links 412. Each node 420 receives one or more node inputs and applies a propagation function to generate a node output from the one or more node inputs.
  • the node output is provided via one or more links 412 to one or more other nodes 420
  • a weight w associated with each link 412 is applied to the node output.
  • the one or more node inputs are combined based on corresponding weights w1, w2, w3, and w4 according to the propagation function.
  • the propagation function is a product of a non-linear activation function and a linear weighted combination of the one or more node inputs.
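  • A single node's propagation function, as described above, can be sketched as a non-linear activation applied to a weighted combination of its inputs; the ReLU activation, the bias value, and the sample weights below are assumptions for the example.

```python
import numpy as np


def node_output(inputs: np.ndarray, weights: np.ndarray, bias: float = 0.0) -> float:
    """Propagation function of a single node 420: a non-linear activation applied to a
    linear weighted combination of the node inputs (plus an optional network bias term b)."""
    z = float(np.dot(weights, inputs) + bias)   # linear weighted combination of the inputs
    return max(0.0, z)                          # ReLU chosen here as the non-linear activation


# Example with four node inputs combined by weights w1, w2, w3, and w4.
x = np.array([0.2, 0.5, 0.1, 0.9])
w = np.array([0.4, -0.3, 0.8, 0.1])
print(node_output(x, w, bias=0.05))
```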
  • the collection of nodes 420 is organized into one or more layers in the neural network 400.
  • the one or more layers includes a single layer acting as both an input layer and an output layer.
  • the one or more layers includes an input layer 402 for receiving inputs, an output layer 406 for providing outputs, and zero or more hidden layers 404 (e.g., 404A and 404B) between the input and output layers 402 and 406.
  • a deep neural network has more than one hidden layer 404 between the input and output layers 402 and 406. In the neural network 400, each layer is only connected with its immediately preceding and/or immediately following layer.
  • a layer 402 or 404B is a fully connected layer because each node 420 in the layer 402 or 404B is connected to every node 420 in its immediately following layer.
  • one of the one or more hidden layers 404 includes two or more nodes that are connected to the same node in its immediately following layer for down sampling or pooling the nodes 420 between these two layers.
  • max pooling uses a maximum value of the two or more nodes in the layer 404B for generating the node of the immediately following layer 406 connected to the two or more nodes.
  • a convolutional neural network is applied in a data processing model 240 to process content data (particularly, video and image data).
  • the CNN employs convolution operations and belongs to a class of deep neural networks 400, i.e., a feedforward neural network that only moves data forward from the input layer 402 through the hidden layers to the output layer 406.
  • the one or more hidden layers of the CNN are convolutional layers convolving with a multiplication or dot product.
  • Each node in a convolutional layer receives inputs from a receptive area associated with a previous layer (e.g., five nodes), and the receptive area is smaller than the entire previous layer and may vary based on a location of the convolution layer in the convolutional neural network.
  • Video or image data is pre-processed to a predefined video/image format corresponding to the inputs of the CNN.
  • the pre-processed video or image data is abstracted by each layer of the CNN to a respective feature map.
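  • A small CNN of the kind described above might be sketched as follows in PyTorch; the layer sizes, the 224x224 RGB input resolution, and the number of hand pose classes are assumptions for the example.

```python
import torch
import torch.nn as nn


class HandPoseCNN(nn.Module):
    """A small feedforward CNN: convolutional layers abstract a pre-processed image into
    feature maps, max pooling downsamples them, and a final fully connected layer
    classifies the image into one of `num_poses` hand poses (an assumed count)."""

    def __init__(self, num_poses: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                           # max pooling over 2x2 receptive areas
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 56 * 56, num_poses)   # assumes 224x224 RGB inputs

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feature_maps = self.features(x)                # each layer yields a feature map
        return self.classifier(feature_maps.flatten(1))


logits = HandPoseCNN()(torch.randn(1, 3, 224, 224))    # one pre-processed video frame
```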
  • a recurrent neural network is applied in the data processing model 240 to process content data (particularly, textual and audio data). Nodes in successive layers of the RNN follow a temporal sequence, such that the RNN exhibits a temporal dynamic behavior.
  • each node 420 of the RNN has a time-varying real-valued activation.
  • the RNN examples include, but are not limited to, a long short-term memory (LSTM) network, a fully recurrent network, an Elman network, a Jordan network, a Hopfield network, a bidirectional associative memory (BAM) network, an echo state network, an independently recurrent neural network (IndRNN), a recursive neural network, and a neural history compressor.
  • the RNN can be used for hand gesture recognition.
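  • An RNN of this kind applied to a temporal sequence of per-frame hand features might be sketched as follows; the LSTM variant, the feature dimension (e.g., 21 hand keypoints as (x, y) pairs, giving 42 values), and the number of gesture classes are assumptions for the example.

```python
import torch
import torch.nn as nn


class GestureLSTM(nn.Module):
    """An LSTM that consumes a temporal sequence of per-frame hand features
    (e.g., pose logits or keypoint coordinates) and outputs a gesture class."""

    def __init__(self, feature_dim: int = 42, hidden_dim: int = 64, num_gestures: int = 5):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_gestures)

    def forward(self, sequence: torch.Tensor) -> torch.Tensor:
        outputs, _ = self.lstm(sequence)      # sequence shape: (batch, time, feature_dim)
        return self.head(outputs[:, -1])      # classify from the last time step


logits = GestureLSTM()(torch.randn(1, 30, 42))   # a 30-frame sequence of hand features
```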
  • the training process is a process for calibrating all of the weights w for each layer of the learning model using a training data set which is provided in the input layer 402.
  • the training process typically includes two steps, forward propagation and backward propagation, which are repeated multiple times until a predefined convergence condition is satisfied.
  • In forward propagation, the set of weights for different layers is applied to the input data and intermediate results from the previous layers.
  • In backward propagation, a margin of error of the output (e.g., a loss function) is measured, and the weights are adjusted accordingly to decrease the error.
  • the activation function is optionally linear, rectified linear unit, sigmoid, hyperbolic tangent, or of other types.
  • a network bias term b is added to the sum of the weighted outputs from the previous layer before the activation function is applied.
  • the network bias b provides a perturbation that helps the NN 400 avoid overfitting the training data.
  • the result of the training includes the network bias parameter b for each layer.
  • the deep learning model 400 is trained to recognize different hand gestures and hand movement to trigger predefined user interface functions, as described in more detail in Figures 5A-5B and Figures 6A-6B below.
  • Figures 5A and 5B illustrate an exemplary combination of dynamic air hand gestures 500 for gesture control, in accordance with some embodiments.
  • the combination of air hand gestures 500 includes four types of air swipe motions: (1) an air swipe in a first direction (e.g., right) while maintaining a first hand pose (e.g., open palm) (501), (2) a returning motion in a second direction (e.g., left) while maintaining a second hand pose (e.g., closed fist) after the air swipe in the first direction (503), (3) an air swipe in the second direction while maintaining the first hand pose (505), and (4) returning motion in the first direction while maintaining the second hand pose after the air swipe in the second direction (507).
  • the air hand gestures in the combination 500 are not limited to the order 501-507, and can occur in any order.
  • Figure 5B illustrates in detail an example sequence of bidirectional air hand gestures 550 performed within a field of view of an electronic device that allows the electronic device to distinguish between an intended user gesture (e.g., the air swipe 505) that triggers a user interface function and a returning motion (e.g., the motion 503) that is not intended to trigger the user interface function.
  • a dynamic air hand gesture is a moving gesture represented by a sequence of hand poses or positions (e.g., a waving gesture, a tap gesture, an air swipe motion).
  • An air swipe 501 or 505 is a dynamic air hand gesture that includes swiping a hand in a direction while maintaining a hand pose.
  • a user performs the sequence of air swipe motions 550 within the field of view of an image sensor of the electronic device (e.g., AR glasses 150 in Figure IB) to activate one or more user interface functions on a display of the electronic device.
  • the electronic device can continuously capture images/videos of the user’s gestures and identify in real time the intended user interface functions to be activated (e.g., switching to a different application, switching to a different user interface object, flipping an image page).
  • while performing the air swipes 501 and 505, the user changes the hand pose (e.g., from an open palm in the air swipe 501 to a closed fist in the motion 503) and moves the hand to different relative spatial positions.
  • a rightward air swipe 501 performed within the field of view of the camera of the head-mounted display 150 may trigger the head-mounted display 150 to display a new photo, and a leftward air swipe 505 may trigger the head-mounted display to display a previously-displayed photo.
  • the user intends to continue to display new photos, and therefore, repeats an air swipe 501 to the right.
  • the head-mounted display 150 is configured to detect a difference between the hand poses of the returning motion 503 and the air swipe 505, thereby avoiding erroneously classifying the returning motion 503 that pulls the user’s hand back between two air swipes 501 as an intent to activate another user interface function.
  • the hand poses of the air swipe 505 and returning motion 503 heading to the left are distinct, and so are the hand poses of the air swipe 501 and returning motion 507 heading to the right.
  • the user maintains a first hand pose (e.g., an open palm) 502 and swipes (501) the hand to the right while maintaining the first hand pose 504.
  • the user activates one or more user interface functions on the electronic device (e.g., displaying a new application or user interface objects).
  • the user changes the first hand pose to a different second hand pose (e.g., a closed fist) 506, and returns (503) the hand to the starting position (e.g., moving the hand in an opposite direction (leftward) for at least a threshold amount of distance as the hand does not have to be returned to the exact starting position) while maintaining the second hand pose 508.
  • the second hand pose has previously been registered by an image recognition model implemented on the electronic device such that moving the hand while maintaining the second hand pose does not trigger any user interface functions on the electronic device.
  • After the user's hand returns (503) to the starting position (or the user's hand has traveled at least a threshold amount of distance leftward), the user changes the second hand pose 508 back to the first hand pose (e.g., from closed fist to open palm) 510, and continues moving (505) the hand leftward while maintaining the first hand pose 512.
  • the electronic device activates a different user interface function (e.g., different from the right swipe with an open palm).
  • a left swipe of a closed fist 506-508 and a left swipe of an open palm 510-512 are recognized differently by the electronic device as a returning motion 503 not triggering any user interface function and an air swipe 505 that triggers a corresponding user interface function, respectively.
  • the user changes the hand gesture to the second hand pose (from open palm to closed fist) 514 and returns (507) the hand to the starting position while maintaining the second hand pose 516 (e.g., closed fist) (or traveling at least a threshold amount of distance rightward). Due to the hand maintaining the second hand pose, such returning motion does not trigger any user interface functions on the electronic device. As such, a right swipe of a closed fist 514-516 and a right swipe of an open palm 502-504 are recognized differently by the electronic device as a returning motion 507 not triggering any user interface function and an air swipe 501 that triggers a corresponding user interface function, respectively.
  • The sequence of air hand gestures 550 is merely exemplary, and the air hand gestures 550 can be ordered differently. For example, a right swipe 501 of the first hand pose 502 and 504 is followed sequentially by a change to a different second hand pose 506 and a return 503 of the second hand pose to the starting position. The second hand pose 508 is then changed (518) to the first hand pose 502, and the first hand pose 502 is swiped (501) to the right again.
  • the first hand pose 504 remains unchanged (520) and is used as the first hand pose 510, and the first hand pose 510 is swiped (505) to the left to trigger a user interface function different from the user interface function triggered by the previous right-swipe 501.
  • This new sequence is modified based on the example sequence of air hand gestures 550.
  • the user’s hand travels in a predefined direction while maintaining a predefined first hand pose (e.g., open palm) to activate a user interface function. If the user’s hand maintains a different second hand pose (e.g., closed fist) while moving, no user interface functions are activated on the electronic device.
  • Additional hand poses that can be used in the combination of air hand gestures 500 include an okay sign, peace sign, Chinese number signs, thumb up sign, etc., and the combination 500 implements different hand poses for the air swipes that activate user interface functions and the returning motions.
  • gestures along other directions can also be implemented, as long as a distinct second hand pose (e.g., closed fist) is registered for the returning motions, such that a returning motion that moves the user's hand with the distinct second hand pose does not trigger any user interface functions on the electronic device.
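  • The combination 500 could, for example, be registered as a simple lookup table such as the following sketch, which could be passed to a dispatcher like the handle_sequence sketch earlier; the pose labels and operation names are assumptions, not the registered gestures of the described embodiments.

```python
# Combination 500 (Figures 5A-5B): air swipes use the first hand pose, returning
# motions use a distinct second hand pose and are mapped to None (no function).
COMBINATION_500 = {
    ("open_palm", "right"): "display_next_photo",      # air swipe 501 triggers a function
    ("closed_fist", "left"): None,                      # returning motion 503 triggers nothing
    ("open_palm", "left"): "display_previous_photo",   # air swipe 505 triggers a function
    ("closed_fist", "right"): None,                     # returning motion 507 triggers nothing
}
```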
  • Figures 6A and 6B illustrate another exemplary combination of dynamic air hand gestures 600 for gesture control, in accordance with some embodiments.
  • this combination of dynamic air hand gestures 600 includes four types of air swipe motions: (1) an air swipe in a first direction (e.g., right) while maintaining a first hand pose (e.g., outward open palm) (601), (2) a returning motion in a second direction (e.g., left) while maintaining the first hand pose (e.g., outward open palm) after the air swipe in the first direction (603), (3) an air swipe in the second direction while maintaining a second hand pose (inward open palm) (605), and (4) returning motion in the first direction while maintaining the second hand pose after the air swipe in the second direction (607).
  • the air hand gestures in the combination 600 are not limited to the order 601-607, and can occur in any order.
  • Figure 6B illustrates in detail an example sequence of hand gestures 650 performed within a field of view of an electronic device that allows the electronic device to distinguish between intended user gestures (e.g., air swipes 601 and 605) that trigger user interface functions and returning motions 603 and 607 that are not intended to trigger user interface functions.
  • the sequence of dynamic hand gestures in Figures 6A-6B uses different hand poses for two air swipes 601 and 605 that have different directions and trigger different user interface functions, and the same hand pose for an air swipe 601 (or 605) that activates a user interface function and a corresponding returning motion 603 (or 607).
  • the user maintains a first hand pose (e.g., an outward open palm) 602 and swipes (601) the hand to the right while maintaining the first hand pose 604.
  • the user activates one or more user interface functions on the electronic device (e.g., displaying a new application or user interface objects).
  • the user maintains the first hand pose (e.g., outward open palm) 606, and returns (603) the hand to the starting position (e.g., moving the hand in an opposite direction (leftward) for at least a threshold amount of distance as the hand does not have to be returned to the exact starting position) while maintaining the first hand pose 608.
  • Moving the hand in the second direction while maintaining the first hand pose has previously been registered by an image recognition model implemented on the electronic device such that it does not trigger any user interface functions on the electronic device.
  • After the user's hand returns (603) to the starting position (or the user's hand has traveled at least a threshold amount of distance leftward) 608, the user changes the first hand pose to a second hand pose 610 (e.g., from outward open palm to inward open palm), and continues moving (605) the hand leftward while maintaining the second hand pose 612.
  • the electronic device activates a different user interface function (e.g., different from the right swipe with an open palm).
  • a left swipe 603 of the outward open palm 606-608 and a left swipe 605 of the inward open palm 610-612 are recognized differently by the electronic device as a returning motion 603 not triggering any user interface function and an air swipe 605 that triggers a corresponding user interface function, respectively.
  • After completing the leftward swipe 605, the user maintains the hand gesture in the second hand pose (inward open palm) 614 and returns (607) the hand to the starting position while maintaining the second hand pose 616 (or traveling at least a threshold amount of distance rightward). Due to the hand maintaining the second hand pose while traveling rightward, such returning motion 607 does not trigger any user interface functions on the electronic device.
  • a right swipe 607 of the inward open palm 614-616 and a right swipe 601 of the outward open palm 602-604 are recognized differently by the electronic device as a returning motion 607 not triggering any user interface function and an air swipe 601 that triggers a corresponding user interface function, respectively.
  • The sequence of air hand gestures 650 is merely exemplary, and the air hand gestures in the combination 600 can be ordered differently.
  • a right swipe 601 of the first hand pose 602 and 604 is followed by a return 603 of the first hand pose to the starting position.
  • the first hand pose 608 remains unchanged (618) and is used as the first hand pose 602, which is swiped (601) to the right again.
  • the first hand pose 604 is changed (620) to the second hand pose 610, and applied to implement a left swipe 605 to trigger a user interface function different from the user interface function triggered by the previous right swipe 601.
  • This new sequence is modified based on the example sequence of air hand gestures 650.
  • the user's hand travels in a predefined direction while maintaining a predefined first hand pose (e.g., the outward open palm) to activate a first user interface function, and maintains the same predefined first hand pose during the returning motion. Likewise, the user's hand travels in the opposite direction while maintaining a predefined second hand pose (e.g., the inward open palm) to activate a second user interface function, and maintains the same second hand pose during the corresponding returning motion.
  • Additional hand poses that can be used in the predefined combination of hand gestures 600 include an okay sign, peace sign, Chinese number signs, thumb up sign, etc., and the combination of hand gestures 600 implements the same hand pose for each air swipe that activates a user interface function and a respective returning motion.
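  • By analogy with the earlier sketch for the combination 500, the combination 600 could be registered as follows; the pose labels and operation names are again illustrative assumptions.

```python
# Combination 600 (Figures 6A-6B): each air swipe and its returning motion share a pose,
# while the two opposite-direction air swipes use distinct poses.
COMBINATION_600 = {
    ("outward_open_palm", "right"): "display_next_photo",     # air swipe 601 triggers a function
    ("outward_open_palm", "left"): None,                       # returning motion 603 triggers nothing
    ("inward_open_palm", "left"): "display_previous_photo",   # air swipe 605 triggers a function
    ("inward_open_palm", "right"): None,                       # returning motion 607 triggers nothing
}
```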
  • Figures 7A-7C illustrate three exemplary sets of solutions 700, 720 and 740 for gesture control based on hand movement, in accordance with some embodiments.
  • air swipes 701 and 705 follow the first and second directions and initiate user interface functions, and the returning motions 703 and 707 return a user’s hand positions from the air swipes 701 and 705 without initiating any other user interface functions, respectively.
  • the hand poses of the air swipe 701 and returning motion 707 in the first direction are distinct, and the hand poses of the air swipe 705 and returning motion 703 in the second direction are distinct.
  • solution 1 and solution 2 correspond to predefined combinations of hand gestures described in Figures 5A-5B and Figures 6A-6B, respectively.
  • a combination 708 of hand poses does not work, because an air swipe 710 that has a second direction and triggers a user interface function shares the same hand pose with a returning motion 712 after the air swipe 701 in the first direction.
  • An electronic device recognizes the same hand gesture, but cannot decide whether the hand gesture is associated with the air swipe or the returning motion, thereby failing to initiate the user interface function.
  • a predefined combination of hand gestures for gesture control involves three different hand poses (hand pose 1, hand pose 2, and hand pose 3).
  • solution 3 includes a sequence of dynamic hand gestures similar to that in Figures 5A-5B, except that, after swiping in a second direction while maintaining hand pose 1, the user changes to hand pose 3 to return the hand to the starting position (so that the returning motion does not trigger user interface functions).
  • In solution 4, for the air swipes in the first direction and in the second direction, the user uses different hand poses 1 and 3, respectively, and in both cases returns the hand to the starting position in hand pose 2.
  • Solution 5 includes a sequence of dynamic hand gestures similar to that in Figures 6A-6B, except that, after swiping in a second direction while maintaining hand pose 2, the user changes to hand pose 3 and swipes to return to the starting position.
  • the user uses hand pose 3 for both swiping in the second direction and the corresponding returning motion.
  • the hand poses of the air swipe 701 and returning motion 707 of the first direction are distinct, and the hand poses of the air swipe 705 and returning motion 703 of the second direction are distinct.
  • the hand poses of the air swipes 701 and 705 are optionally identical or distinct, so are the hand poses of the returning motions 703 and 707.
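  • The constraint illustrated by Figures 7A-7C (an air swipe and a returning motion that share a moving direction must use distinct hand poses) could be checked with a sketch such as the following; the registry format matches the earlier illustrative tables and is an assumption for the example.

```python
from typing import Dict, Optional, Tuple


def combination_is_valid(registry: Dict[Tuple[str, str], Optional[str]]) -> bool:
    """Return True if, for every moving direction, the hand poses used by air swipes
    (entries mapped to an operation) and by returning motions (entries mapped to None)
    are disjoint, so the device can always tell the two apart."""
    for direction in {d for (_, d) in registry}:
        swipe_poses = {p for (p, d), op in registry.items() if d == direction and op is not None}
        return_poses = {p for (p, d), op in registry.items() if d == direction and op is None}
        if swipe_poses & return_poses:     # same pose for a swipe and a return, as in combination 708
            return False
    return True


# Example: a solution-1-style combination passes the check.
assert combination_is_valid({
    ("pose_1", "right"): "function_a",   # air swipe 701
    ("pose_2", "left"): None,            # returning motion 703 (distinct pose)
    ("pose_1", "left"): "function_b",    # air swipe 705
    ("pose_2", "right"): None,           # returning motion 707
})
```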
  • Figure 8 is a flowchart illustrating an exemplary process 800 for registering sequences of gesture control corresponding to predefined user interface functions, in accordance with some embodiments.
  • the process 800 is described as being implemented by a computer system (e.g., a client device 104, a server 102, or a combination thereof).
  • An example of the client device 104 is a head-mounted display 150.
  • Method 800 is, optionally, governed by instructions that are stored in a non-transitory computer readable storage medium and that are executed by one or more processors of the computer system.
  • Each of the operations shown in Figure 8 may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., memory 206 of the system 200 in Figure 2).
  • the computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other nonvolatile memory device or devices.
  • the instructions stored on the computer readable storage medium may include one or more of source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors. Some operations in method 800 may be combined and/or the order of some operations may be changed.
  • the process 800 includes two distinct phases: first the computer system registers a sequence of hand gestures, and then captures video to conditionally activate user interface actions based on identified hand gestures.
  • the computer system associates a first hand gesture (e.g., moving the hand rightward while maintaining an open palm) of a hand with a first user interface action (e.g., displaying a user interface object), the first hand gesture (e.g., right swipe) corresponding to a first hand pose (e.g., open palm) moving in a first direction (e.g., rightward) (802). Further, the computer system associates the first hand gesture with one or more second hand gestures of the hand (e.g., a returning motion of the hand).
  • Each second hand gesture corresponds to a respective second hand pose (e.g., a closed fist) moving in a second direction (e.g., leftward movement) substantially opposite to the first direction (e.g., rightward movement), and is configured to reset a hand position (e.g., returning to substantially the same position as the starting position) of the hand from the first hand gesture without initiating any user interface action (e.g., the returning motion of the hand does not cause additional activation of the user interface actions) (804).
  • the computer system associates a third hand gesture of the hand (e.g., moving the hand leftward while maintaining an open palm) with a second user interface action (e.g., re-displaying a previously displayed user interface object), the third hand gesture corresponding to a third hand pose (e.g., an open palm or another hand pose that is different from the closed fist pose) moving in the second direction (e.g., moving the hand leftward), the third hand gesture distinct from each of the one or more second hand gestures (806).
  • the computer system receives a video stream including a sequence of image frames (808).
  • the sequence of image frames captures a sequence of hand gestures within a field of view of a camera associated with the computer system.
  • the computer system identifies a hand gesture in the sequence of image frames (810), and determines that the identified hand gesture corresponds to a hand pose that moves substantially along the second direction (812).
  • the computer system determines whether the identified hand gesture corresponds to the one or more second hand gestures or third hand gesture based on the hand pose (814), and in accordance with a determination that the identified hand gesture corresponds to the third hand gesture, the computer system enables the second user interface action associated with the third gesture (816).
  • Otherwise, in accordance with a determination that the identified hand gesture corresponds to one of the second hand gestures, the computer system disables or does not initiate any user interface action (818).
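The registration-and-dispatch flow of process 800 can be illustrated with the following sketch. It is a minimal illustration and not the claimed implementation; the GestureRegistry class, the pose and direction labels, and the example bindings are assumptions introduced here for clarity.

from dataclasses import dataclass
from typing import Callable, Dict, Set

@dataclass(frozen=True)
class Gesture:
    pose: str       # e.g., "open_palm", "closed_fist" (illustrative labels)
    direction: str  # e.g., "right", "left"

class GestureRegistry:
    """Associates swipe gestures with UI actions and marks returning motions."""

    def __init__(self) -> None:
        self._actions: Dict[Gesture, Callable[[], None]] = {}
        self._reset_gestures: Set[Gesture] = set()

    def associate_action(self, gesture: Gesture, action: Callable[[], None]) -> None:
        # Steps 802 / 806: bind a swipe gesture to a user interface action.
        self._actions[gesture] = action

    def associate_reset(self, gesture: Gesture) -> None:
        # Step 804: mark a gesture as a returning motion that must not trigger anything.
        self._reset_gestures.add(gesture)

    def dispatch(self, gesture: Gesture) -> None:
        # Steps 814-818: ignore returning motions, fire the action for registered swipes.
        if gesture in self._reset_gestures:
            return                       # returning motion: no UI action (818)
        action = self._actions.get(gesture)
        if action is not None:
            action()                     # e.g., the second UI action for the third gesture (816)

# Example registration mirroring the description above (bindings are illustrative only).
registry = GestureRegistry()
registry.associate_action(Gesture("open_palm", "right"), lambda: print("display UI object"))
registry.associate_action(Gesture("open_palm", "left"), lambda: print("re-display previous UI object"))
registry.associate_reset(Gesture("closed_fist", "left"))    # return after a right swipe
registry.associate_reset(Gesture("closed_fist", "right"))   # return after a left swipe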
  • FIG 9 is a flowchart illustrating an exemplary process 900 for identifying gestures for gesture control, in accordance with some embodiments.
  • the process 900 is described as being implemented by a computer system (e.g., a client device 104, a server 102, or a combination thereof).
  • An example of the client device 104 is a head-mounted display 150.
  • Method 900 is, optionally, governed by instructions that are stored in a non-transitory computer readable storage medium and that are executed by one or more processors of the computer system.
  • Each of the operations shown in Figure 9 may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., memory 206 of the system 200 in Figure 2).
  • the computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other non-volatile memory device or devices.
  • the instructions stored on the computer readable storage medium may include one or more of source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors. Some operations in method 900 may be combined and/or the order of some operations may be changed.
  • the computer system receives a first sequence of image frames (902) that captures user hand motions.
  • the computer system identifies a first hand pose of a hand (e.g., an air swipe 501 maintaining an open palm) at different locations within the first sequence of image frames (e.g., the hand is fixed as one of a right hand and a left hand) (904).
  • the computer system determines a first moving direction (e.g., to the right) of the first hand pose, determines a first hand gesture within the first sequence of image frames based on the first hand pose and the first moving direction (906), and performs a first predefined operation on a first one of a set of objects according to the first moving direction and the first hand pose of the first hand gesture (e.g., moving the set of objects to the left in accordance with the hand movement; the operation is uniquely or non-uniquely associated with the first hand gesture) (908).
  • After performing the first predefined operation (910), the computer system receives a second sequence of image frames (912), and identifies a second hand pose of the same hand at different locations within the second sequence of image frames (e.g., the hand moving leftward while maintaining a closed fist), wherein the second hand pose is different from the first hand pose (914).
  • the computer system determines (916) a second moving direction (e.g., to the left) of the second hand pose.
  • the computer system determines a second hand gesture within the second sequence of image frames based on the second hand pose and the second moving direction.
  • the second moving direction is different from the first moving direction.
  • the computer system selects a second one of the set of objects (e.g., objects that are not currently displayed) according to the second moving direction and the second hand pose of the second hand gesture (918).
  • the computer system disables any predefined operation (distinct from the first predefined operation) on the second one of the set of objects according to the second moving direction and the second hand pose of the second hand gesture (e.g., when the second hand gesture corresponds to a returning motion 503 that does not activate any user interface functions) (920).
  • the computer system receives a third sequence of image frames, and identifies a third hand pose of the same hand at different locations within the third sequence of image frames.
  • the computer system determines a third moving direction of the third hand pose.
  • the computer system determines a third hand gesture within the third sequence of image frames based on the third hand pose and third moving direction.
  • the third moving direction is substantially consistent with the second moving direction, and the third hand pose is distinct from the second hand pose.
  • the computer system performs a second predefined operation on a third one of a set of objects according to the third moving direction and the third hand pose of the third hand gesture.
  • the third hand pose is distinct from the first hand pose, e.g., in solution 4 in Figure 7B.
  • the third hand pose is substantially identical to the first hand pose, e.g., in solution 1 in Figure 7A, and an example of the third hand gesture is an air swipe 505.
  • the computer system after receiving the third sequence of image frames, receives a fourth sequence of image frames, identifies a fourth hand pose of the same hand at different locations within the fourth sequence of image frames, and determines a fourth moving direction and a fourth hand gesture according to the different locations within the fourth sequence of image frames.
  • the fourth moving direction is substantially consistent with the first moving direction
  • the computer system selects a fourth one of the set of objects according to the fourth moving direction and the fourth hand pose of the fourth hand gesture.
  • the fourth hand pose (e.g., a returning motion 507 in Figure 5A) is distinct from the first hand pose.
  • the computer system disables any predefined operation on the fourth one of the set of objects according to the fourth moving direction and the fourth hand pose of the fourth hand gesture.
  • the fourth hand pose is identical to the first hand pose.
  • the computer system performs the first predefined operation on the fourth one of the set of objects according to the fourth moving direction and the fourth hand pose of the fourth hand gesture, e.g., when the fourth hand pose corresponds to the air swipe 501.
  • the computer system after selecting the second one of the set of objects, performs a second predefined operation on the second one of the set of objects according to the second moving direction and the second hand pose of the second hand gesture (e.g., when the second hand gesture corresponds to another air swipe 605 that activates a user interface function of displaying the set of objects according to the second moving direction) (922).
  • the first predefined operation and the second predefined operation correspond to two distinct operations along the first and second moving directions, respectively.
  • the first predefined operation and the second predefined operation correspond to moving a plurality of graphic items along the first and second moving directions, respectively.
  • the first predefined operation and the second predefined operation correspond to flipping a display page along the first and second moving directions, respectively.
  • the computer system receives a third sequence of image frames, identifies a third hand pose of the same hand at different locations within the third sequence of image frames, and determines a third moving direction of the third hand pose.
  • the computer system determines a third hand gesture within the third sequence of image frames based on the third hand pose and third moving direction.
  • the third moving direction is substantially consistent with the first moving direction.
  • the computer system selects a third one of the set of objects according to the third moving direction and the third hand pose of the third hand gesture.
  • the third hand pose (e.g., a returning motion 607) is distinct from the first hand pose.
  • the computer system disables any predefined operation on the third one of the set of objects according to the third moving direction and the third hand pose of the third hand gesture.
  • the third hand pose (e.g., of the air swipe 601) is identical to the first hand pose.
  • the computer system performs the first predefined operation on the third one of the set of objects according to the third moving direction and the third hand pose of the third hand gesture.
  • the computer system receives a fourth sequence of image frames, identifies a fourth hand pose of the same hand at different locations within the fourth sequence of image frames, and determines a fourth moving direction of the fourth hand pose.
  • the computer system determines a fourth hand gesture within the fourth sequence of image frames based on the fourth hand pose and fourth moving direction.
  • the fourth moving direction is substantially consistent with the second moving direction, and the fourth hand pose (e.g., a returning motion 603) is distinct from the second hand pose (e.g., an air swipe 605).
  • the computer system selects a fourth one of a set of objects according to the fourth moving direction and the fourth hand pose of the fourth hand gesture, and disables any predefined operation on the fourth one of the set of objects according to the fourth moving direction and the fourth hand pose of the fourth hand gesture.
  • each of the first and second hand poses is one of a first open palm pose of the hand facing substantially away from a face of a person having the hand, a second open palm pose of the hand facing substantially towards the face, a first fist pose of the hand facing substantially away from the face, and a second fist pose of the hand facing substantially towards the face.
  • determining the first moving direction of the first hand gesture further includes selecting the first moving direction from a limited number of predefined directions, e.g., right, left, up, down, in and out directions. Each of the second, third, and fourth directions is selected from these predefined directions.
  • the first hand gesture starts from a first position and terminates at a second position that is located substantially at the first moving direction with respect to the first position. The second position is located substantially at the first moving direction with respect to the first position when the second position is located exactly at the first moving direction, or when a line connecting the first and second positions deviates from the first moving direction by no more than a predefined range of angles (e.g., within ±45°).
  • Each of the second, third, and fourth directions is also broadly defined by this predefined range of angles.
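As a concrete illustration of selecting a moving direction from a limited set of predefined directions with a predefined angular tolerance, consider the sketch below. It operates in 2D image coordinates (the in/out directions would additionally require a depth estimate) and is only an assumption about how such a test could be implemented; the function and constant names do not come from the application.

import math

# Predefined 2D directions in image coordinates (x grows rightward, y grows downward).
PREDEFINED_DIRECTIONS = {
    "right": (1.0, 0.0),
    "left": (-1.0, 0.0),
    "up": (0.0, -1.0),
    "down": (0.0, 1.0),
}

def classify_moving_direction(start, end, max_deviation_deg=45.0):
    """Return the predefined direction whose angle to the start->end vector is within
    max_deviation_deg, or None if the motion matches no predefined direction."""
    dx, dy = end[0] - start[0], end[1] - start[1]
    norm = math.hypot(dx, dy)
    if norm == 0.0:
        return None  # no motion between the first and second positions
    best_name, best_angle = None, max_deviation_deg
    for name, (ux, uy) in PREDEFINED_DIRECTIONS.items():
        cosine = (dx * ux + dy * uy) / norm  # predefined vectors are unit length
        angle = math.degrees(math.acos(max(-1.0, min(1.0, cosine))))
        if angle <= best_angle:
            best_name, best_angle = name, angle
    return best_name

# A swipe ending 200 px to the right and 50 px down still counts as "right" (about 14 degrees off axis).
assert classify_moving_direction((100, 100), (300, 150)) == "right"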
  • each of the first and second hand gestures is defined independently of a static time and a speed of the respective hand gesture at each location on a path of the respective hand gesture.
  • the first and second hand gestures are identified independently of a static time and a speed of the hand gesture at each point located on a path of the first and second hand gestures.
  • each hand gesture is identified using a pre-trained convolutional neural network (CNN) having a plurality of layers.
  • the pre-trained CNN is trained at a server 102.
  • an electronic device 104 (e.g., a head-mounted display 150) executes the pre-trained CNN.
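For context only, a hand-pose classifier of the kind referred to here might resemble the following PyTorch sketch. The layer sizes, input resolution, and number of pose classes are assumptions made for illustration; the application does not specify the network architecture, and in practice the server-trained weights would be loaded before inference on the device.

import torch
import torch.nn as nn

class HandPoseCNN(nn.Module):
    """Tiny CNN mapping a cropped hand image to a hand-pose class
    (e.g., open palm front/back, fist front/back); illustrative only."""

    def __init__(self, num_poses: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, num_poses)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 3, H, W) cropped hand images
        feats = self.features(x).flatten(1)
        return self.classifier(feats)  # raw logits; softmax gives pose probabilities

# Inference on a single 96x96 crop (random weights here; real weights come from the server).
model = HandPoseCNN().eval()
with torch.no_grad():
    logits = model(torch.randn(1, 3, 96, 96))
    pose_id = int(logits.argmax(dim=1))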
  • prior to receiving the first sequence of image frames, the computer system displays, on a user interface, instructions indicating (1) that the second hand gesture is associated with a second predefined operation or (2) that the second hand gesture is used to reset a hand position of the hand from the first hand gesture while not initiating any predefined operation.
  • prior to receiving the first sequence of image frames, the computer system associates the first hand gesture of the hand with the first predefined operation.
  • the first hand gesture corresponds to the first hand pose moving in the first moving direction.
  • the computer system associates the first hand gesture with the second hand gesture of the hand.
  • the second hand gesture corresponds to the second hand pose moving in the second moving direction substantially opposite the first moving direction, and is configured to reset a hand position of the hand from the first hand gesture while not initiating any predefined operation.
  • the computer system associates a third hand gesture of the hand with a second predefined operation, the third hand gesture corresponding to a third hand pose moving in the second direction, the third hand gesture distinct from the second hand gesture.
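Putting the pieces of process 900 together, a per-sequence recognition loop might look like the sketch below. The estimate_hand_pose and classify_moving_direction callables are placeholders (the latter matches the direction sketch earlier in this section), and the gesture-to-operation bindings are illustrative assumptions, not the claimed method.

from typing import Iterable, Optional, Tuple

# (pose, direction) pairs registered as returning motions: they only reset the hand
# position and must not trigger any predefined operation.
RESET_GESTURES = {("closed_fist", "left"), ("closed_fist", "right")}

# (pose, direction) pairs registered for predefined operations (illustrative bindings).
OPERATIONS = {
    ("open_palm", "right"): "flip_to_next_page",
    ("open_palm", "left"): "flip_to_previous_page",
}

def recognize_gesture(frames: Iterable,
                      estimate_hand_pose,          # frame -> (pose_label or None, hand_center)
                      classify_moving_direction,   # (start, end) -> direction or None
                      ) -> Optional[Tuple[str, str]]:
    """Identify one (pose, direction) hand gesture over a sequence of image frames."""
    poses, centers = [], []
    for frame in frames:
        pose, center = estimate_hand_pose(frame)
        if pose is not None:
            poses.append(pose)
            centers.append(center)
    if len(centers) < 2:
        return None
    dominant_pose = max(set(poses), key=poses.count)  # require one dominant hand pose
    direction = classify_moving_direction(centers[0], centers[-1])
    if direction is None:
        return None
    return dominant_pose, direction

def handle_gesture(gesture: Optional[Tuple[str, str]]) -> Optional[str]:
    """Run the bound operation, or do nothing for returning motions."""
    if gesture is None or gesture in RESET_GESTURES:
        return None                  # returning motion: hand-position reset only
    return OPERATIONS.get(gesture)   # name of the predefined operation, if any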
  • Computer-readable media may include computer-readable storage media, which corresponds to a tangible medium such as data storage media, or communication media including any medium that facilitates transfer of a computer program from one place to another, e.g., according to a communication protocol.
  • computer-readable media generally may correspond to (1) tangible computer-readable storage media which is non-transitory or (2) a communication medium such as a signal or carrier wave.
  • Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code and/or data structures for implementation of the embodiments described in the present application.
  • a computer program product may include a computer-readable medium.
  • Although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another.
  • a first electrode could be termed a second electrode, and, similarly, a second electrode could be termed a first electrode, without departing from the scope of the embodiments.
  • the first electrode and the second electrode are both electrodes, but they are not the same electrode.
  • Gesture control is arguably the most important component of the user interface (UI) on modern-day mobile devices.
  • touch gestures are used for invoking various UI functions.
  • Most common touch gestures include tap, double tap, swipe, pinch, zoom, rotate, etc.
  • Each gesture is typically associated with a certain UI function.
  • the swipe gesture is often used to scroll up and down a web page and switch between photos within a photo album application.
  • head-mounted displays, e.g., Virtual Reality (VR) headsets, Augmented Reality (AR) glasses, Mixed Reality (MR) headsets, etc.
  • touchless air-gestures are used to implement certain UI functions.
  • These VR, AR and MR headsets can have hand tracking functions to complete user interaction including selecting, clicking, typing on a virtual keyboard, etc.
  • Air gestures can also be used on mobile phones for certain functions when hands are not available to touch the screen (e.g., when preparing a meal, air gestures can be used to scroll down a recipe so that a user does not need to touch the phone screen with wet hands).
  • A front-facing camera on a mobile phone can be used to implement this function.
  • Miniature radar can also be built for real-time motion tracking of the human hand.
  • a swipe in a first direction triggers the app to display the next photo, while a swipe in a second direction (often opposite to the first direction) triggers the app to display the previous photo.
  • FIG. 1 shows two consecutive left-swipe gestures performed by a user to quickly flip through two pages.
  • Each rectangle represents a frame captured by a sensor on the mobile device (e.g., a camera).
  • the first left-swipe is performed during frames T1 to T4.
  • the second left-swipe is performed during frames T7 to T10.
  • from T4 to T7, the user returns his/her hand to the starting position to perform the second left-swipe.
  • an algorithm cannot distinguish between a right-swipe and the motion of returning to the start position after a left-swipe without additional information about the user's intention. This leads to a left-swipe, right-swipe, left-swipe gesture sequence being detected, which is different from the user's intention.
  • a handedness detection algorithm can be incorporated into the hand tracking pipeline to facilitate appropriate gesture recognition of air-swipes.
  • An object detection CNN can be trained to detect left hand or right hand from the input images.
  • an object detection network is trained offline, then used for online inference.
  • the object detection network produces the following outputs: (1) hand presence probability (ranging from 0 to 1); (2) hand location (e.g., a bounding box on the image surrounding the hand); and (3) handedness probability (i.e., whether it is left or right hand and its probability ranging from 0 to 1).
  • the hand presence probability determines whether the system will execute the subsequent hand pose estimation module. For example, if the probability is below a certain threshold (e.g., 0.5), hand pose estimation will not be executed.
  • the hand location information can be used to crop an image around the hand, which is used as input to the hand pose estimation module. This allows the hand pose estimation module to process images with few background pixels, resulting in better accuracy.
  • the handedness probability determines the execution choice of right- or left-hand pose estimation module. For example, if the left-hand probability is higher than the right-hand probability, only the left-hand pose estimation module is executed.
  • the pose estimation result, combined with handedness detection, is used to recognize a left air-swipe or right air-swipe gesture. Part of this process is shown in Figure 2.
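A sketch of the dispatch logic around the handedness detector described above is given below. The detector and the two pose-estimation modules are passed in as placeholder callables, the image is assumed to be an H x W x C array, and the 0.5 thresholds follow the example values mentioned above.

def run_hand_tracking_step(image,
                           detect_hand,          # image -> (presence_p, bbox, left_hand_p)
                           estimate_left_pose,   # cropped image -> pose result
                           estimate_right_pose,  # cropped image -> pose result
                           presence_threshold=0.5):
    """One frame of the pipeline: detection, crop, then handed pose estimation."""
    presence_p, bbox, left_hand_p = detect_hand(image)

    # Skip pose estimation entirely when no hand is confidently present.
    if presence_p < presence_threshold:
        return None

    # Crop around the detected hand so the pose estimator sees few background pixels.
    x0, y0, x1, y1 = bbox
    hand_crop = image[y0:y1, x0:x1]

    # Choose the pose-estimation module according to the handedness probability.
    if left_hand_p >= 0.5:
        handedness, pose = "left", estimate_left_pose(hand_crop)
    else:
        handedness, pose = "right", estimate_right_pose(hand_crop)

    # The pose result and handedness feed the air-swipe recognizer downstream.
    return handedness, pose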
  • FIG. 2: An example hand pose estimation pipeline with handedness detection.
  • Another solution to this problem is to use a "cool down" period, where the algorithm intentionally omits sensor data after an air-swipe gesture is detected and allows the user to return to the starting position after an air-swipe.
  • One more solution to this problem is to require the user to remain static for a while at the start position of an air-swipe gesture. After a short period of time, the algorithm recognizes the static gesture and determines the starting position of the gesture. A UI element or a sound may be generated by the system to indicate that the system is ready for an air-swipe gesture. Then, the user can perform the gesture accordingly.
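The "cool down" alternative described above can be reduced to a small gate: once an air-swipe has been acted on, detections are ignored for a fixed interval while the user returns to the start position. The interval length below is an arbitrary assumption, not a value taken from the application.

import time
from typing import Optional

class CoolDownGate:
    """Suppresses gesture handling for a fixed period after each recognized air-swipe."""

    def __init__(self, cooldown_seconds: float = 0.7):  # duration is an assumption
        self.cooldown_seconds = cooldown_seconds
        self._blocked_until = 0.0

    def accept(self, now: Optional[float] = None) -> bool:
        """Return True if a newly detected gesture should be acted upon."""
        now = time.monotonic() if now is None else now
        return now >= self._blocked_until

    def notify_swipe(self, now: Optional[float] = None) -> None:
        """Call after acting on a swipe; detections during the cool-down are dropped."""
        now = time.monotonic() if now is None else now
        self._blocked_until = now + self.cooldown_seconds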
  • Hand gestures can be categorized into static hand gestures and dynamic hand gestures.
  • a static hand gesture refers to a specific pose of a hand (e.g., an OK or a thumb-up gesture).
  • a static gesture can convey a meaning on its own.
  • a dynamic gesture is a moving gesture, represented by a sequence of hand poses or positions (e.g., a waving gesture or a tap gesture).
  • Air-swipes are dynamic gestures that consist of swiping a hand in a certain pose across the field of view of the sensor.
  • Our solution is to use a set of at least two different hand poses from the same hand so that there is no ambiguity in the sensor data between the air-swipe in the second direction and the motion of returning to the start position after an air-swipe in the first direction. In this way, the system will not mistakenly classify the motion of returning to the start position after an air-swipe in the first direction as a swipe in the second direction, and it allows repeated and continuous gesture control using air-swipes.
  • the two-directional swipe gestures of one particular hand can be divided into four different phases: (1) air-swipe in the first direction; (2) motion of returning after an air-swipe in the first direction; (3) air-swipe in the second direction; and (4) motion of returning after an air-swipe in the second direction.
  • the goal of our invention is to guarantee that the sensor data of phase (1) is different from phase (4); and that of phase (2) is different from phase (3). It is obvious that if all 4 states are implemented using a single hand pose (e.g., open palm with the palm facing the sensor) swiping across the field of view of the sensor, there will be ambiguities. However, we can eliminate the ambiguities by using at least two hand poses for the four different states.
  • the user needs to use a different hand pose during the motion of returning after an air-swipe.
  • the user uses open palm with the palm facing the sensor to swipe across the field of view of the sensor in both directions.
  • the user uses a different hand pose (e.g., a fist) during the motion of returning to the start position.
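For the implementation just described (an open palm for the swipes in both directions, a different pose such as a fist for both returning motions), the four phases can be resolved unambiguously from the (pose, moving direction) pair alone, as the small lookup below illustrates. The pose and direction labels are illustrative assumptions.

from typing import Optional

# Phase lookup for the "open palm swipes / fist returns" two-pose design.
# Phases: (1) air-swipe in the first direction, (2) return after (1),
#         (3) air-swipe in the second direction, (4) return after (3).
PHASES = {
    ("open_palm", "first_direction"): 1,     # air-swipe: triggers its UI function
    ("closed_fist", "second_direction"): 2,  # returning motion: ignored
    ("open_palm", "second_direction"): 3,    # air-swipe: triggers its UI function
    ("closed_fist", "first_direction"): 4,   # returning motion: ignored
}

def resolve_phase(pose: str, direction: str) -> Optional[int]:
    """Map an observed (pose, moving direction) pair to one of the four phases."""
    return PHASES.get((pose, direction))

# Because each (pose, direction) pair is unique, phase (2) cannot be confused with
# phase (3), and phase (4) cannot be confused with phase (1).
assert resolve_phase("closed_fist", "second_direction") == 2
assert resolve_phase("open_palm", "second_direction") == 3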
  • the user needs to use different hand poses for air-swipes in the two different directions.
  • the hand pose during the return motion remains the same after an air-swipe.
  • open palm with the palm facing the sensor is used for the entire cycle of the left-swipe, while open palm with the palm facing away from the sensor is used for the entire cycle of the right-swipe.
  • the third row in Table 1 shows another implementation using two hand poses, i.e., hand flipping downward and hand flipping upward with the palm always facing down. Because there is ambiguity between the air-swipe in the first direction and the motion of returning after an air-swipe in the second direction, this design cannot be used for continuous air-swipes.
  • the choice of the two hand poses can be any of the static hand gestures the system can recognize. It can also depend on user preference, cultural norms, etc. 3- and 4-pose method:
  • Our invention allows repeated and continuous usage of an air-swipe gesture in one direction. This is especially useful when the user wants to quickly browse through an e-book or a collection of photos in one direction.

Abstract

The present application is directed to identifying gestures of a user. An electronic device receives a first sequence of image frames, identifies a first pose of a hand at different locations, and determines a first moving direction of the first hand pose. The electronic device determines a first hand gesture based on the first hand pose and the first moving direction, and accordingly performs a first predefined operation on a first object of a set of objects. After performing the first predefined operation, the electronic device receives a second sequence of image frames and identifies a second pose of the same hand. The second hand pose is different from the first hand pose. The electronic device determines a second moving direction that is different from the first moving direction, and selects a second object from the set of objects based on the second hand pose and the second moving direction.
PCT/US2020/063624 2020-11-13 2020-12-07 Procédés de reconnaissance de gestes par balayage dans l'air WO2022103412A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202080107016.6A CN116457744A (zh) 2020-11-13 2020-12-07 隔空摆动手势的识别方法

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202063113673P 2020-11-13 2020-11-13
US63/113,673 2020-11-13

Publications (1)

Publication Number Publication Date
WO2022103412A1 true WO2022103412A1 (fr) 2022-05-19

Family

ID=81602596

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2020/063624 WO2022103412A1 (fr) 2020-11-13 2020-12-07 Procédés de reconnaissance de gestes par balayage dans l'air

Country Status (2)

Country Link
CN (1) CN116457744A (fr)
WO (1) WO2022103412A1 (fr)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100040292A1 (en) * 2008-07-25 2010-02-18 Gesturetek, Inc. Enhanced detection of waving engagement gesture
US20140040835A1 (en) * 2008-02-27 2014-02-06 Qualcomm Incorporated Enhanced input using recognized gestures
US20140118244A1 (en) * 2012-10-25 2014-05-01 Pointgrab Ltd. Control of a device by movement path of a hand
US20150009124A1 (en) * 2013-07-08 2015-01-08 Augumenta Ltd. Gesture based user interface
US20170060254A1 (en) * 2015-03-03 2017-03-02 Nvidia Corporation Multi-sensor based user interface


Also Published As

Publication number Publication date
CN116457744A (zh) 2023-07-18

Similar Documents

Publication Publication Date Title
US11783496B2 (en) Scalable real-time hand tracking
US11526713B2 (en) Embedding human labeler influences in machine learning interfaces in computing environments
WO2017129149A1 (fr) Procédé et dispositif d'interaction multimodale à base d'entrées
US10540055B2 (en) Generating interactive content items based on content displayed on a computing device
KR20230157274A (ko) 관련 이미지를 검색하기 위한 전자 장치 및 이의 제어 방법
US11435845B2 (en) Gesture recognition based on skeletal model vectors
Kim et al. Watch & Do: A smart iot interaction system with object detection and gaze estimation
WO2021184026A1 (fr) Fusion audiovisuelle avec attention intermodale pour la reconnaissance d'actions vidéo
WO2023101679A1 (fr) Récupération inter-modale d'image de texte sur la base d'une expansion de mots virtuels
WO2021092632A2 (fr) Récupération de moment vidéo à base de texte faiblement supervisé par modélisation de l'attention croisée
WO2021092600A2 (fr) Réseau pose-over-parts pour estimation de pose multi-personnes
Madhiarasan et al. A comprehensive review of sign language recognition: Different types, modalities, and datasets
WO2023277888A1 (fr) Suivi de la main selon multiples perspectives
WO2022103412A1 (fr) Procédés de reconnaissance de gestes par balayage dans l'air
WO2023277877A1 (fr) Détection et reconstruction de plan sémantique 3d
WO2023055466A1 (fr) Techniques de génération de données pour un détecteur de gestes intelligent
WO2023027712A1 (fr) Procédés et systèmes permettant de reconstruire simultanément une pose et des modèles humains 3d paramétriques dans des dispositifs mobiles
Anitha et al. Implementation of touch-less input recognition using convex hull segmentation and bitwise and approach
US20240153184A1 (en) Real-time hand-held markerless human motion recording and avatar rendering in a mobile platform
WO2023063944A1 (fr) Reconnaissance de gestes de la main en deux étapes
WO2024072410A1 (fr) Suivi et reconnaissance de gestes de la main en temps réel
US20230012426A1 (en) Camera control using system sensor data
WO2023091129A1 (fr) Localisation de caméra sur la base d'un plan
WO2024076343A1 (fr) Sélection de zone de délimitation masquée pour une prédiction de rotation de texte
US20240087344A1 (en) Real-time scene text area detection

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20961773

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 202080107016.6

Country of ref document: CN

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20961773

Country of ref document: EP

Kind code of ref document: A1