WO2019199967A1 - Systems and methods for gamification of drone behavior using artificial intelligence - Google Patents
- Publication number
- WO2019199967A1 (PCT/US2019/026783)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- drone
- neural network
- image
- data stream
- dnn
- Prior art date
Classifications
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/10—Simultaneous control of position or course in three dimensions
- G05D1/101—Simultaneous control of position or course in three dimensions specially adapted for aircraft
- G05D1/102—Simultaneous control of position or course in three dimensions specially adapted for aircraft specially adapted for vertical take-off of aircraft
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Definitions
- Drones are relatively 'old' machines, introduced decades ago under the name of Remote Piloted Aircraft (RPA). Recently, drones have experienced a surge in popularity thanks to improved hardware, compute power, and connectivity. Nevertheless, drone usage is still mostly a human-controlled activity, where the human controls the drone's flight path and interprets and analyzes the content of video or other sensor information collected by the drone.
- AI-controlled drone With the introduction of inexpensive drones able to stream high-definition (HD) data to a controlling device equipped with one or more powerful processors (e.g., a smart phone with multi-core central processor units (CPUs) and graphics processor units (GPUs)) or with onboard computing, it is possible to exploit the use of artificial intelligence (AI) for controlling the drones.
- One application of an AI-controlled drone is creating an interactive, game-like situation where the AI provides semantic understanding of the environment for the purpose of re-creating the human ability to play a game.
- an AI-controlled drone can be used to mimic a human for playing games and other physical activities.
- Examples of this use of AI-controlled drones include a method of controlling a drone.
- a sensor collects a data stream representing an object in the environment.
- a neural network running on a processor operably coupled to the sensor extracts a convolutional output from the data stream.
- This convolutional output represents features of the object and is used by a classifier operably coupled to the neural network to classify the object.
- the drone is controlled in response to classifying the object according to pre-defined logic.
- the sensor can be an image sensor on the drone, in which case the data stream includes imagery acquired by the image sensor. Controlling the drone may include following the object with the drone.
- the classifier can be trained to recognize the object on the fly or in real time. In some cases, the processor may determine that the object has disappeared from the data stream, in which case the neural network and/or the classifier may automatically recognize a reappearance of the object in the data stream.
- An AI-controlled drone can be implemented as a system that includes a drone and at least one processor.
- the drone is equipped with a sensor, such as an image sensor, lidar, radar, or acoustic sensor, that acquires a data stream that represents an object in an environment.
- the processor is operably coupled to the sensor and can be on the drone or on a smart phone that is wirelessly coupled to the drone (e.g., via a cellular, Wi-Fi, or Bluetooth link).
- the processor (1) executes a neural network that produces a convolutional output from the data stream; (2) classifies the object based on the convolutional output, which represents features of the object; and (3) controls the drone in response to the object.
- Another method includes playing a game with a drone.
- An image sensor on the drone acquires an input data stream.
- At least one processor communicatively coupled to the image sensor detects an object of interest in an input data stream.
- An artificial neural network executed by the processor identifies the object of interest.
- the processor determines an action to be taken by the drone in the context of the game based on the object of interest, and the drone performs the action. This action may be tracking the object of interest in the input data stream, following the object of interest with the drone, or blurring at least a portion of an image of the object of interest.
- the method can also include detecting another object in the input data stream and identifying the other object with the artificial neural network.
- the neural network may recognize the object by extracting features of the object from the second image to generate a convolutional output and classifying the object with a classifier coupled to the neural network based on the convolutional output.
- the classifier may be trained in real-time, without backpropagation, to recognize the object.
- the processor may transmit the second image to the smart phone.
- the smart phone executes the neural network, which recognizes the object, and transmits a command to the drone, which executes the command.
- the technology disclosed herein includes, but is not limited to:
- use of a drone, such as a low-cost consumer drone, as an interactive device rather than a flying device or a flying camera;
- FIG. 1 illustrates various components of an artificial intelligence (AI)-enabled interactive game-playing system, including a drone, a controller (optional), and AI residing in either or both the drone and the controller.
- FIG. 2 illustrates an example L-DNN architecture suitable for implementing the AI in the system shown in FIG. 1.
- FIG. 3 illustrates a VGG-16 based L-DNN classifier as one example implementation of the L-DNN architecture in FIG. 2.
- FIG. 4 illustrates non-uniform multiscale object detection with an L-DNN for recognizing and tracking objects with a drone.
- FIG. 5 illustrates a Mask R-CNN-based L-DNN for object segmentation in a video stream acquired with a camera on a drone, smart phone, or other imaging device.
- FIG. 6 shows a flowchart illustrating a method of detecting and tracking an object of interest using an AI-enabled game playing system.
- FIG. 7 illustrates an implementation of hide-and-seek with the system shown in FIG. 1.
- FIG. 8 illustrates an implementation of an 'alarm drone' with the system shown in FIG. 1.
- the present technology enables gamification, or the application of typical elements of game playing, to the interaction between humans and drones (e.g., quadcopters) to increase and encourage user engagement with drones via the use of Artificial Intelligence (AI) and AI processes.
- Suitable AI processes include Artificial Neural Networks (ANNs), Deep Neural Networks (DNNs), Lifelong Deep Neural Networks (L-DNNs), and other machine vision processes executed either at the compute Edge (the drone) and/or on a controlling device (e.g., a smart phone) and/or a server (via a cellular or other wireless connection), or a combination thereof.
- AI can provide information about people and objects surrounding the drone, and this information (semantic information about items in the environment, their location, and numerosity) can be used to script a game interaction between the drone and the humans participating in the game.
- An AI device or process can be used to enhance drone user engagement by intelligently processing data captured by a sensor on or associated with the drone and intelligently controlling the behavior of the drone so as to reproduce a typical game that would normally involve people.
- the drone 'plays' the role of a human player in the game, introducing novelty into the user engagement: the new 'player' is actually a drone-embodied imitation of a human player, where the imitation is provided by the AI executed by a processor on or connected to the drone.
- the AI can also enable the drone to learn and track objects in a robust fashion for photography, videography, etc.
- FIG. 1 provides an overview of an AI-enabled game-playing system comprising a drone 100 with or without onboard compute power.
- the drone 100 includes a sensor suite 101, which may include a red-green-blue (RGB) camera, structured light sensor, radar, lidar, and/or microphone, that acquires sensor data 103, such as a stream of image or audio data.
- the drone 100 has one or more processors 102, which may provide enough compute power to execute the AI processes or may only manage more basic drone functions, such as managing the power supply, actuating the drone’s flight control in response to external commands, and controlling the sensor suite.
- the drone 100 also includes an antenna 112, such as a Wi-Fi, Bluetooth, or cellular antenna, that can be used to establish and maintain a wireless communications link to a smart phone 120, tablet, laptop, or other controller.
- the smart phone 120 has its own processors 122, such as multi-core CPUs and GPUs, which can be used to process data from the drone 100 and send commands to the drone 100, including commands generated by AI processes in response to the data.
- the input images 103 (e.g., from an RGB camera or other sensors, such as Infrared or Structured Light sensors, in the sensor suite 110), AI module 104, game logic 105, and drone control 106 can all be hosted on board the drone 100.
- the smart phone 120 may host an AI module 124, game logic 125, and drone control 126 that communicate with the drone 100 via the wireless communications link 110.
- the sensor suite 101 on the drone 100 acquires a video data stream 103 showing objects in the drone’s environment. If the drone 100 has on-board compute power, its on-board AI module 104 recognizes objects in the data stream 103 as explained below. Game logic 105 uses information about the objects recognized by the AI module 104 (e.g., which object(s), the object position(s), etc.) to determine the drone’s next action and instructs the drone control 106 accordingly. For instance, if the AI module 104 recognizes a person moving through the scene, the game logic 105 may instruct the drone control 106 to cause the drone to follow the person. If the AI module 104 recognizes a person hiding, then the game logic 105 may instruct the drone control 106 to cause the drone to hover over the person or send an appropriate signal to the smart phone 120.
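- The sense-recognize-decide-act loop described above can be sketched in a few lines of code. The sketch below is illustrative only: the `Detection` structure, the `decide` function, and the command strings are hypothetical placeholders standing in for the AI module 104, game logic 105, and drone control 106, not part of the disclosed system.
```python
from dataclasses import dataclass
from typing import List

# Hypothetical, simplified stand-ins for the components in FIG. 1.
@dataclass
class Detection:
    label: str        # e.g. "person"
    x: float          # horizontal position in the frame, 0..1
    y: float          # vertical position in the frame, 0..1

def decide(detections: List[Detection]) -> str:
    """Toy game logic: map recognized objects to a drone action."""
    people = [d for d in detections if d.label == "person"]
    if not people:
        return "hover"          # nothing recognized: hold position
    # follow the first recognized person by steering toward them
    target = people[0]
    return f"follow dx={target.x - 0.5:+.2f} dy={target.y - 0.5:+.2f}"

# Example: the AI module reports a person left of and below center.
print(decide([Detection("person", 0.3, 0.6)]))   # -> follow dx=-0.20 dy=+0.10
```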
- the AI module 124, game logic 125, and/or drone control 126 provided by the smart phone 120 may process the image data 103 acquired by the drone’s sensor suite 101 and control the drone 100 in response to the object(s) recognized in the image data 103.
- the smart phone 120 may perform all or only some of these functions, depending on the drone’s capabilities and the speed and reliability of the wireless communications link 110 between the drone 100 and the smart phone 120.
- the smart phone 120 or another device, including another drone, may perform all of the data acquisition and processing, and the drone 100 may simply respond to commands issued by AI implemented by the smart phone 120. For instance, a person may acquire imagery of the drone 100 and/or the drone's location with a camera on the smart phone 120. The smart phone 120 may recognize objects in that image data using its own AI module 124, then determine how the drone 100 should respond to the object according to the game logic 125. The drone control 126 translates the game logic's output into commands that the smart phone 120 transmits to the drone 100 via the wireless communications link 110. The drone 100 responds to these commands, e.g., by moving in a particular way.
- the AI in the interactive drone system shown in FIG. 1 can be implemented using a neural network, such as an ANN or DNN.
- ANNs and DNNs can be trained to learn and identify objects of interest relevant to the AI-enabled game-playing system.
- a properly trained ANN or DNN can recognize an object each and every time it appears in the data stream, regardless of the object’s orientation or relative size.
- the interactive drone system in FIG. 1 can simply recognize the object each time it appears in the data stream— there is no need for the user to draw a bounding box around the object in order for the drone to track the object.
- the drone does not have to keep the object in view— the object can disappear from the data stream, then reappear in the data stream some time later, and the ANN or DNN will recognize the object automatically (without user intervention).
- a conventional neural network is pre-trained in order to recognize people and objects in data of the environment surrounding the drone. This pre-training is typically accomplished using backpropagation with a set of training data (e.g., tagged images). Unfortunately, training with backpropagation can take hours or longer, making it impractical for a user to train a conventional neural network to recognize a new object. As a result, if an AI-enabled game playing system that uses a traditional neural network encounters a new, unfamiliar object, the system may fail to recognize the object correctly.
- a Lifelong Deep Neural Network can recognize objects like a conventional neural network and learn to recognize new objects on the fly (e.g., in near real time).
- An L-DNN enables continuous, online, lifelong learning on a lightweight compute device (e.g., a drone or smart phone) without time-consuming, computationally intensive learning through backpropagation.
- An L-DNN enables real-time learning from continuous data streams, bypassing the need to store input data for multiple iterations of backpropagation learning.
- L-DNN technology combines a representation-rich, DNN-based subsystem (Module A), also called a backbone, with a fast-learning subsystem (Module B), also called a classifier, to achieve fast, yet stable learning of features that represent entities or events of interest.
- These feature sets can be pre-trained by slow learning methodologies, such as backpropagation.
- the high-level feature extraction layers of the DNN serve as inputs into the fast learning system in Module B to classify familiar entities and events and add knowledge of unfamiliar entities and events on the fly.
- Module B is able to learn important information and capture descriptive and highly predictive features of the environment without the drawback of slow learning.
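- As a rough illustration of the Module A/Module B split, the following sketch pairs a frozen "backbone" feature extractor with a fast-learning classifier that stores one prototype per label. It is a deliberate simplification: a fixed random projection stands in for the pre-trained DNN, and a nearest-prototype rule stands in for Module B's classifier (an ART network in the implementations described below).
```python
import numpy as np

class FastClassifier:
    """Toy Module B: one prototype feature vector per label (one-shot learning),
    classifying new feature vectors by nearest prototype."""
    def __init__(self):
        self.prototypes = {}                      # label -> feature vector

    def learn(self, features, label):
        self.prototypes[label] = np.asarray(features, dtype=float)

    def classify(self, features):
        if not self.prototypes:
            return None
        f = np.asarray(features, dtype=float)
        return min(self.prototypes,
                   key=lambda k: np.linalg.norm(self.prototypes[k] - f))

def backbone(image):
    """Stand-in for Module A: a frozen, pre-trained DNN would map an image to a
    feature vector here. A fixed random projection is used for the sketch."""
    rng = np.random.default_rng(0)                # fixed weights = "factory pre-trained"
    w = rng.standard_normal((8, image.size))
    return w @ image.ravel()

module_b = FastClassifier()
img_a, img_b = np.ones((4, 4)), np.zeros((4, 4))
module_b.learn(backbone(img_a), "ball")           # learned from a single example
module_b.learn(backbone(img_b), "tree")
print(module_b.classify(backbone(np.ones((4, 4)) * 0.9)))   # -> "ball"
```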
- L-DNN techniques can be applied to visual, structured light, LIDAR, SONAR, RADAR, or audio data, among other modalities.
- L-DNN techniques can be applied to visual processing, such as enabling whole-image classification (e.g., scene detection), bounding box-based object recognition, pixel-wise segmentation, and other visual recognition tasks. They can also perform non-visual recognition tasks, such as classification of non-visual signals, and other tasks, such as updating Simultaneous Localization and Mapping (SLAM) generated maps by incrementally adding knowledge as the drone is navigating the environment.
- An L-DNN enables an AI-enabled game playing system to learn on the fly at the edge without the necessity of learning on a central server or cloud. This eliminates network latency, increases real-time performance, and ensures privacy when desired.
- AI-enabled game playing systems can be updated for specific tasks in the field using an L-DNN.
- For example, inspection drones can learn how to identify problems at the top of cell towers or solar panel arrays.
- AI-enabled game playing systems can be personalized based on user preferences without worry about privacy issues, since data is not shared outside the local device. Smart phones can share knowledge learned at the edge (peer to peer or globally with all devices) without shipping information to a central server for lengthy learning.
- An L-DNN also enables learning new knowledge without forgetting old knowledge, thereby mitigating or eliminating catastrophic forgetting.
- the present technology enables AI-enabled game playing systems to continually and optimally adjust behavior at the edge based on user input without a) needing to send or store input images, b) time-consuming training, or c) large computing resources.
- Learning after deployment with an L-DNN allows an AI-enabled game playing system to adapt to changes in its environment and to user interactions, handle imperfections in the original data set, and provide a customized experience for a user.
- An L-DNN implements a heterogeneous Neural Network architecture characterized by two modules:
- Slow learning Module A which includes a neural network (e.g., a Deep Neural Network) that is either factory pre-trained and fixed or configured to learn via backpropagation or other learning algorithms based on sequences of data inputs; and
- Module B which provides an incremental classifier able to change synaptic weights and representations instantaneously, with very few training samples.
- Example instantiations of this incremental classifier include, for example, an Adaptive Resonance Theory (ART) network or Restricted Boltzmann Machine (RBM) with contrastive divergence training neural networks, as well as non-neural methods, such as Support Vector Machines (SVMs) or other fast-learning supervised classification processes.
- FIG. 2 illustrates an example L-DNN architecture used by an AI-enabled game playing system.
- the L-DNN 226 uses two subsystems, slow learning Module A 222 and fast learning Module B 224.
- Module A includes a pre-trained DNN
- Module B is based on a fast-learning Adaptive Resonance Theory (ART) paradigm, where the DNN feeds the ART the output of one of the later feature layers (typically, the last or penultimate layer before the DNN's own classifying fully connected layers).
- Other configurations are possible, where multiple DNN layers can provide inputs to one or more Modules B (e.g., in a multiscale, voting, or hierarchical form).
- An input source 103, such as a digital camera, detector array, or microphone, acquires information/data from the environment (e.g., video data, structured light data, audio data, a combination thereof, and/or the like). If the input source 103 includes a camera system, it can acquire a video stream of the environment surrounding the AI-enabled game playing system or drone.
- the input data from the input source 103 is processed in real-time by Module A 222, which provides a compressed feature signal as input to Module B 224.
- the video stream can be processed as a series of image frames in real-time by Modules A and B.
- Module A and Module B can be implemented in suitable computer processors, such as graphics processor units, field-programmable gate arrays, or application-specific integrated circuits, with appropriate volatile and non-volatile memory and appropriate input/output interfaces.
- the input data is fed to a pre-trained Deep Neural Network (DNN) 200 in Module A.
- the DNN 200 includes a stack 202 of convolutional layers 204 used to extract features that can be employed to represent an input information/data as detailed in the example implementation section.
- the DNN 200 can be factory pre-trained before deployment to achieve the desired level of data representation. It can be completely defined by a configuration file that determines its architecture and by a corresponding set of weights that represents the knowledge acquired during training.
- the L-DNN system 226 takes advantage of the fact that weights in the DNN are excellent feature extractors.
- To provide input to Module B 224, which includes one or more fast-learning neural network classifiers, some of the DNN's upper layers that are engaged only in classification by the original DNN (e.g., layers 206 and 208 in FIG. 2) are ignored or even stripped from the system altogether.
- Instead, a desired raw convolutional output of a high-level feature extraction layer 204 is accessed to serve as input to Module B 224.
- the original DNN 200 usually includes a number of fully connected, averaging, and pooling layers 206 plus a cost layer 208 that is used to enable the gradient descent technique to optimize its weights during training.
- These layers are used during DNN training or for getting direct predictions from the DNN 200, but they aren't necessary for generating an input for Module B 224 (the shading in FIG. 2 indicates that layers 206 and 208 are unnecessary). Instead, the input for the neural network classifier in Module B 224 is taken from a subset of the convolutional layers 204 of the DNN. Different layers, or multiple layers, can be used to provide input to Module B 224.
- Each convolutional layer on the DNN 200 contains filters that use local receptive fields to gather information from a small region in the previous layer. These filters maintain spatial information through the convolutional layers in the DNN.
- the output from one or more late-stage convolutional layers 204 in the feature extractor (represented pictorially as a tensor 210) is fed to input neural layers 212 of a neural network classifier (e.g., an ART classifier) in Module B 224.
- the initial Module B neural network classifier can be pre-trained with arbitrary initial knowledge or with a trained classification of Module A 222 to facilitate learning on-the-fly after deployment.
- the neural network classifier continuously processes data (e.g., tensor 210) from the DNN 200 as the input source 103 provides data relating to the environment to the L-DNN 106.
- the Module B neural network classifier uses fast, preferably one-shot learning.
- An ART classifier uses bottom-up (input) and top-down (feedback) associative projections between neuron-like elements to implement match-based pattern learning as well as horizontal projections to implement competition between categories.
- The ART-based Module B 224 puts the features as an input vector in the F1 layer 212 and computes a distance operation between this input vector and existing weight vectors 214 to determine the activations of all category nodes in the F2 layer 216.
- the distance is computed either as a fuzzy AND (in the default version of ART), dot product, or Euclidean distance between vector ends.
- the category nodes are then sorted from highest activation to lowest to implement competition between them and considered in this order as winning candidates.
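- The choice step described above can be illustrated with a simplified fuzzy-ART-style computation. This sketch covers only the activation (choice) function and the sorting of category nodes; complement coding, the vigilance/match test, and the weight update that a full ART network performs are omitted.
```python
import numpy as np

def fuzzy_and(a, b):
    return np.minimum(a, b)

def art_candidates(input_vec, weights, alpha=0.001):
    """Simplified fuzzy-ART choice step: compute an activation for every
    category node (F2) from the F1 input vector, then sort the candidates."""
    acts = []
    for j, w in enumerate(weights):
        overlap = fuzzy_and(input_vec, w).sum()        # fuzzy AND with weight vector
        acts.append((overlap / (alpha + w.sum()), j))  # choice function T_j
    return sorted(acts, reverse=True)                  # highest activation first

# Two existing category weight vectors and one new input.
W = [np.array([0.9, 0.1, 0.8]), np.array([0.1, 0.9, 0.2])]
I = np.array([0.8, 0.2, 0.7])
print(art_candidates(I, W))   # category 0 wins; category 1 is the next candidate
```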
- Module B 224 serves as an output of L-DNN 226 either by itself or as a combination with an output from a specific DNN layer from Module A 222, depending on the task that the L-DNN 226 is solving.
- For whole-image classification, the Module B output may be sufficient by itself, as it classifies the whole image.
- For object detection, Module B 224 provides class labels that are superimposed on bounding boxes determined from Module A activity, so that each object is located correctly by Module A 222 and labeled correctly by Module B 224.
- For segmentation, the bounding boxes from Module A 222 may be replaced by pixel-wise masks, with Module B 224 providing labels for these masks.
- FIG. 3 represents an example L-DNN implementation for whole-image classification using a modified VGG-16 DNN as the core of Module A. The softmax layer and the last two fully connected layers are removed from the original VGG-16 DNN, and an ART-based Module B is connected to the first fully connected layer of the VGG-16 DNN.
- a similar but much simpler L-DNN can be created using AlexNet instead of VGG-16. This is a very simple and computationally cheap system that runs on any modern smart phone or drone, does not require a GPU or any other specialized processor, and can learn any set of objects from a few frames of input provided by the smart phone camera or a camera attached to a drone.
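- A sketch of the Module A trunk for the FIG. 3 configuration is shown below using the standard torchvision VGG-16 layout (assuming torchvision 0.13 or later for the `weights` argument). It keeps the convolutional stack and the first fully connected layer, drops the last two fully connected layers and the softmax, and produces the 4096-dimensional vector that would feed an ART-based Module B (not shown here).
```python
import torch
import torchvision.models as models

# Keep the convolutional feature pathway plus the first fully connected layer
# (and its ReLU); the last two fully connected layers and the softmax are dropped.
vgg = models.vgg16(weights=None)          # a weights argument would load pre-trained filters
trunk = torch.nn.Sequential(
    vgg.features,                         # convolutional layers (Module A backbone)
    vgg.avgpool,
    torch.nn.Flatten(),
    *list(vgg.classifier.children())[:2], # first FC layer + ReLU only
)
trunk.eval()

with torch.no_grad():
    x = torch.randn(1, 3, 224, 224)       # stand-in for one RGB frame from the drone camera
    features = trunk(x)                   # 4096-dim vector fed to Module B (e.g., ART)
print(features.shape)                     # torch.Size([1, 4096])
```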
- One way to detect objects of interest in an image is to divide the image into a grid and run classification on each grid cell.
- In convolutional neural networks (CNNs), each layer processes data while maintaining a topographic organization. This means that, irrespective of how deep in the network the layer is, or of the kernel, stride, or pad sizes, features corresponding to a particular area of interest on an image can be found on every layer, at various resolutions, in a similar area of the layer. For example, when an object is in the upper left corner of an image, the features representing it appear near the upper left corner of every layer.
- Only one Module B must be created per DNN layer (or scale) used as input, because the same feature vector represents the same object irrespective of its position in the image. Learning one object in the upper right corner thus allows Module B to recognize it anywhere in the image.
- Using multiple DNN layers of different sizes (scales) as inputs to separate Modules B allows detection on multiple scales. This can be used to fine tune the position of the object in the image without processing the whole image at finer scale as in the following process.
- Module A provides the coarsest scale (for example, 7 × 7 in the publicly available ExtractionNet) image to Module B for classification. If Module B says that an object is located in the cell that is second from the left edge and fourth from the top edge, only the corresponding part of the finer DNN input (for example, 14 × 14 in the same ExtractionNet) should be analyzed to further refine the location of the object.
- Another application of multiscale detection can use a DNN design where the layer sizes are not multiples of each other. For example, if a DNN has a 30 × 30 layer, it can be reduced to layers that are 2 × 2 (compression factor of 15), 3 × 3 (compression factor of 10), and 5 × 5 (compression factor of 6). As shown in FIG. 4, attaching Modules B to each of these compressed DNNs gives coarse locations of an object (indicated as 402, 404, 406). But if the output of these Modules B is combined (indicated as 408), then the spatial resolution becomes a non-uniform 8 × 8 grid with higher resolution in the center and lower resolution towards the edges.
- the resolution in the multiscale grid in FIG. 4 for the central 36 locations is equal to or finer than the resolution in the uniform 8 × 8 grid.
- the system is able to pinpoint the location of an object (410) more precisely using only 60% of the computational resources of a comparable uniform grid. This performance difference increases for larger layers because the square of the sum (representing the number of computations for a uniform grid) grows faster than sum of squares (representing the number of computations for a non-uniform grid).
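- The roughly 60% figure can be checked directly from the grid sizes given above: the non-uniform grid evaluates the sum of squares of the scales, while a uniform grid at the combined resolution evaluates the square of their sum.
```python
# Compute cost comparison for the non-uniform multiscale grid of FIG. 4.
# A 30x30 layer is compressed to 2x2, 3x3, and 5x5 grids (factors 15, 10, 6).
scales = [2, 3, 5]

nonuniform_cells = sum(s * s for s in scales)   # sum of squares: 4 + 9 + 25 = 38
uniform_cells = sum(scales) ** 2                # square of the sum: 8 x 8 = 64

print(nonuniform_cells, uniform_cells, nonuniform_cells / uniform_cells)
# 38 64 0.59375  -> roughly 60% of the classifications of a uniform 8x8 grid
```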
- Non-uniform (multiscale) detection can be especially beneficial for AI-enabled game playing systems as the objects in the center of view are most likely to be in the path of the drone and benefit from more accurate detection than objects in the periphery that do not present a collision threat.
- object detection is commonly defined as the task of placing a bounding box around an object and labeling it with an associated class (e.g., "dog").
- object detection techniques are commonly implemented by selecting one or more regions of an image with a bounding box, and then classifying the features within that box as a particular class, while simultaneously regressing the bounding box location offsets.
- Algorithms that implement this method of object detection include Region-based CNN (R-CNN), Fast R-CNN, and Faster R-CNN, although any method that does not make the localization depend directly on classification information may be substituted as the detection module.
- Image segmentation is the task of determining a class label for all or a subset of pixels in an image. Segmentation may be split into semantic segmentation, where individual pixels from two separate objects of the same class are not disambiguated, and instance segmentation, where individual pixels from two separate objects of the same class are uniquely identified or instanced. Image segmentation is commonly implemented by taking the bounding box output of an object detection method (such as R-CNN, Fast R-CNN, or Faster R-CNN) and segmenting the most prominent object in that box. The class label that is associated with the bounding box is then associated with segmented object. If no class label can be attributed to the bounding box, the segmentation result is discarded. The resulting segmented object may or may not have instance information.
- An algorithm that implements this method of segmentation is Mask R-CNN.
- FIG. 5 shows an L-DNN design for image detection or segmentation based on the R-CNN family of networks.
- a static classification module 500 may be replaced with an L-DNN Module B 224. That is, the segmentation pathway of the network remains unchanged; region proposals are made as usual, and subsequently segmented.
- If the L-DNN Module B 224 returns no positive class predictions that pass threshold, the segmentation results are discarded.
- If the L-DNN Module B 224 returns an acceptable class prediction, the segmentation results are kept, just as with the static classification module.
- the L-DNN Module B 224 offers continual adaptation to change state from the former case to the latter (i.e., from an unrecognized object to a recognized one) via user feedback.
- User feedback may be provided directly through bounding box and class labels, such as is the case when the user selects and tags an object on a social media profile, or through indirect feedback, such as is the case when the user selects an object in a video, which may then be tracked throughout the video to provide continuous feedback to the L-DNN on the new object class.
- This feedback is used to train the L-DNN to classify novel classes over time. This process does not affect the segmentation component of the network.
- Module B 224 in this paradigm also has some flexibility.
- the input to Module B 224 should be directly linked to the output of Module A convolutional layers 202, so that class labels may be combined with the segmentation output to produce a segmented, labeled output 502. This constraint may be fulfilled by having both Modules A and B take the output of a region proposal stage. Module A should not depend on any dynamic portion of Module B.
- Because Module B adapts its network's weights while Module A is static, if Module B were to change its weights and then pass its output to Module A, Module A would likely see a performance drop due to the inability of most static neural networks to handle a sudden change in the input representation of its network.
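- The following sketch shows how a fast-learning classification head can gate the segmentation output as described for FIG. 5. All of the stage names (`propose_regions`, `roi_features`, `segment_region`, `module_b`) are hypothetical placeholders for the corresponding network components; only the control flow (classify each proposal, then keep or discard its mask) is illustrated.
```python
# Sketch of swapping a static classification head for Module B: region proposals
# and mask generation are untouched; Module B only decides whether each proposal
# gets a label or its segmentation is discarded.

def label_segments(image, propose_regions, roi_features, segment_region,
                   module_b, threshold=0.5):
    results = []
    for box in propose_regions(image):            # region proposal stage
        feats = roi_features(image, box)          # Module A features for this ROI
        label, score = module_b.classify(feats)   # fast-learning classifier
        if score < threshold:
            continue                              # no acceptable class: drop the mask
        mask = segment_region(image, box)         # segmentation pathway unchanged
        results.append((box, mask, label))        # segmented, labeled output
    return results
```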
- An AI-enabled game playing system can include a camera input or other sensory input (e.g., a sensor suite 110 as in FIG. 1) to capture information about people or objects surrounding a drone.
- the L-DNN included in the AI-enabled game playing system uses the input data to first extract features of an object or a person using Module A.
- the one-shot classifier, or Module B, in the L-DNN uses these extracted features to classify the object. In this manner, the L-DNN identifies objects on the fly.
- subsequent input data can be used to continue tracking the object. If the drone encounters new objects or people as the game progresses, it can learn these new objects/people. L-DNNs enable learning new knowledge on-the-fly without catastrophic forgetting.
- the object of interest might go out of the drone's field of view. If the object of interest returns to the drone's field of view, the L-DNN can identify the object again without user intervention. In other words, if an object of interest moves back into the drone's field of view after having moved out of the drone's field of view, the AI-enabled game playing system can identify the object and resume tracking the object automatically. If the L-DNN has been trained to recognize the object from different angles, the object's orientation upon reappearing in the data stream is irrelevant; the L-DNN should recognize it no matter what. Likewise, the time elapsed between the object's disappearance from and reappearance in the data stream is irrelevant; the L-DNN can wait indefinitely, so long as it is not reset during the wait.
- With a conventional tracking system, by contrast, the user has to identify the object before tracking begins. Typically, the user does this by positioning the drone to acquire an image of the object and drawing a bounding box around the object. The drone correlates the pixels in this bounding box with pixels in subsequent images; the correlation gives the shift of the object's geometric center from frame to frame. The drone can track the object so long as the object remains in the field of view and so long as the object's appearance doesn't change suddenly, e.g., because the object has turned sideways.
- the L-DNN-based AI-enabled game playing system can also enable improved performance accuracy by combining contextual information with current object information.
- the contextual L-DNN may learn that certain objects are likely to co-occur in the input stream. For example, camels, palm trees, sand dunes, and off-road vehicles are typical objects in a desert scene, whereas houses, sports cars, oak trees, and dogs are typical objects in a suburban scene.
- Locally ambiguous information at the pixel level and acquired as a drone input can be mapped to two object classes (e.g., camel or dog) depending on the context. In both cases, the object focus of attention has an ambiguous representation, which is often the case in low-resolution images.
- the pixelated image of the camel can be disentangled by global information about the scene and past associations learned between objects, even though 'camel' is only the fourth most likely class inferred by the L-DNN, the most probable being 'horse' based on local pixel information alone.
- Contextual objects (e.g., a sand dune, off-road vehicle, or palm tree) bias the interpretation, so the contextual classifier can overturn the 'horse' class in favor of the 'camel' class.
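- A toy version of this contextual re-ranking is sketched below. The class scores and co-occurrence weights are invented for illustration; they simply show how multiplying local evidence by a context prior can overturn 'horse' in favor of 'camel' when desert objects are present.
```python
# Local pixel evidence favors "horse", but objects co-occurring in the scene
# (sand dune, palm tree, off-road vehicle) shift the decision to "camel".
# All scores and co-occurrence weights below are made-up illustration values.

local_scores = {"horse": 0.40, "deer": 0.25, "cow": 0.20, "camel": 0.15}

context = ["sand dune", "palm tree", "off-road vehicle"]
cooccurrence = {
    "camel": {"sand dune": 0.9, "palm tree": 0.8, "off-road vehicle": 0.6},
    "horse": {"sand dune": 0.2, "palm tree": 0.2, "off-road vehicle": 0.3},
    "deer":  {"sand dune": 0.1, "palm tree": 0.1, "off-road vehicle": 0.1},
    "cow":   {"sand dune": 0.1, "palm tree": 0.1, "off-road vehicle": 0.2},
}

def rerank(local, ctx):
    scored = {}
    for cls, p in local.items():
        prior = sum(cooccurrence[cls].get(c, 0.0) for c in ctx) / len(ctx)
        scored[cls] = p * prior          # combine local evidence with context prior
    return max(scored, key=scored.get)

print(rerank(local_scores, context))     # -> "camel"
```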
- an L-DNN based AI-enabled game playing system can provide seamless identification, detection, and tracking of an object of interest.
- a drone implementing an AI-enabled game playing system disclosed herein can identify one or more objects of interest and track these objects. If the drone loses sight of the object of interest and the object of interest moves back into the line of sight of the drone, the AI-enabled game playing system can re-track the object. If the drone encounters new objects that it hasn't encountered before, the AI-enabled game playing system can learn these new objects.
- FIG. 6 shows a method 600 of detecting, tracking, and interacting with an object of interest using an AI-enabled interactive drone.
- the user may train the AI (e.g., an L-DNN) to recognize new objects (box 602), such as the people who will be playing a game using the AI-enabled interactive drone.
- the user can train the AI with previously acquired and tagged images, e.g., from the players’ social media accounts, or with images acquired in real time using a smart phone or the camera on the drone.
- the AI Once the AI has been trained, it can be loaded onto the drone or onto a smart phone that controls the drone (box 604) if it does not already reside on the drone or smart phone.
- a sensor such as a camera on the drone acquires a data stream (box 606), such as an image or video stream during the game or other interaction (e.g., a 'follow-me' interaction where the drone tracks and films a user).
- Within the field of view of a typical sensor (e.g., a camera), multiple objects can be present, and an AI module in or coupled to the drone recognizes one or more objects in that sensor stream (box 608) without the need for any bounding boxes drawn by the user.
- If the AI module includes an L-DNN, the L-DNN uses a pre-trained neural network (Module A) to extract features of the objects present in the sensor stream (box 610).
- a one-shot classifier uses these extracted features to classify the objects (box 612).
- the game logic module (e.g., game logic modules 105 and 125 in FIG. 1) can then provide guidance to the drone control module (e.g., drone control modules 106 and 126 in FIG. 1) to execute an action (box 620).
- executing the action can include avoiding obstacles that are in the path of the drone.
- the action can be to track a particular object of interest (box 622). For example, the drone may be prompted to track a ball with its camera.
- the action can be to follow an object of interest.
- For example, if the object of interest is a particular person, the drone may be prompted to follow that person.
- the action can also include blurring or replacing an object of interest in the images (box 624).
- the drone may be prompted to blur or segment a face in an image for privacy purposes before providing the image data to a display, server, or other memory.
- the drone may also take off, perform a particular maneuver (e.g., a flip or roll), hover, or land if it detects a specific object.
- the relationship between detected objects and action is contained in the game logic module (e.g., game logic 105 in FIG. 1).
- the object may disappear from the data stream (box 630). For instance, if the object is a person, and the person hides behind another object, the person may no longer be visible in the image stream captured by the drone’s camera. Similarly, as the drone flies through a given area, other objects in the area may occlude some or all of the drone camera’s field of view, potentially causing the person to disappear from the data stream, at least momentarily. If the game logic senses that the person has disappeared, it may command the drone to hover, return to a previous position, or follow a search pattern to re-acquire the person. In any event, if the person re-appears in the data stream (box 632), even from a different perspective, the AI module can recognize the person as described above (box 608).
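- The disappearance and re-acquisition behavior of boxes 630 and 632 can be sketched as a simple per-frame decision. The `classify_frame` callable and the command strings below are hypothetical placeholders; the point is that the classifier, not the user, decides when the target has re-appeared.
```python
# If the tracked person drops out of the data stream, the drone falls back to a
# search behavior and resumes tracking as soon as the classifier recognizes the
# person again, without any new bounding box from the user.

def track_with_reacquisition(frames, classify_frame, target="person_1"):
    commands = []
    for frame in frames:
        labels = classify_frame(frame)            # AI module output for this frame
        if target in labels:                      # person visible (boxes 608/632)
            commands.append("follow")
        else:                                     # person disappeared (box 630)
            commands.append("hover_and_search")   # e.g. hover or fly a search pattern
    return commands

frames = ["f0", "f1", "f2", "f3"]
seen = {"f0": {"person_1"}, "f1": set(), "f2": set(), "f3": {"person_1"}}
print(track_with_reacquisition(frames, lambda f: seen[f]))
# ['follow', 'hover_and_search', 'hover_and_search', 'follow']
```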
- a drone using AI can be used to imitate the role of a human player.
- the AI-enabled game playing system can provide enhanced drone user engagement by intelligently processing the camera input or other sensory input and intelligently controlling the behavior of the drone so as to reproduce a typical game that would normally involve people.
- Hide-and-Seek: In a traditional hide-and-seek game, which can be played indoors or outdoors, the seeker closes his eyes for a brief period while the other players hide. The seeker then opens his eyes and tries to find the hiders, with the last person to be found being the winner of the round.
- the technology described herein transforms the traditional hide-and-seek game into one where a drone plays the role of the seeker.
- AI embedded onboard a drone or on a controlling device (e.g., a smart phone) with sufficient compute power provides a new user experience where the seeker is not a human player, but an intelligent drone, via appropriate scripting of the AI in the context of the game.
- FIG. 7 illustrates 'hide-and-seek,' where the drone imitates the homonymous game.
- a drone 100 slowly spins on its axis and/or performs small, semi-random movements to seek human players partially occluded (e.g., by a tree 1100) in an outdoor or indoor space.
- When the AI detects a person 1000, the identified player may be captured in a picture and eliminated from the game. The drone may continue this behavior until a time limit is reached or all but one person has been found.
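- One possible scripting of this seeker behavior is sketched below. The drone and detector interfaces are hypothetical placeholders; only the game flow (seek, photograph found players, stop at the time limit or when one player remains) is illustrated.
```python
import time

# Toy hide-and-seek seeker logic for FIG. 7. The drone and detector objects are
# hypothetical placeholders for the drone control and AI module interfaces.

def play_hide_and_seek(drone, detector, players, time_limit_s=300):
    found = set()
    start = time.time()
    while time.time() - start < time_limit_s and len(found) < len(players) - 1:
        drone.spin_slowly()                       # small, semi-random seek motion
        for person in detector.detect_people(drone.current_frame()):
            if person in players and person not in found:
                drone.take_picture(person)        # 'capture' the found player
                found.add(person)                 # eliminated from the game
    return found                                  # the last player not found wins
```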
- the technology described herein transforms the traditional capture-the-flag game into one where a drone plays the role of a player.
- AI embedded onboard a drone or on a controlling device (e.g., a smart phone) with sufficient compute power provides a new user experience where one of the players is not a human player, but an intelligent drone.
- the drone can imitate a player by tagging enemy team’s players in the drone’s team territory.
- the drone can detect, identify, and track enemy team players in order to do so.
- the drone can also search and detect the opposing team’s flag. If the drone detects the flag, the drone may be prompted to perform an action, such as hovering over the flag, so that another player in the same team can steal the opposing team’s flag.
- the drone can imitate the player chosen to chase other players.
- the drone can identify and track the other players. When it is at a pre-defined distance from at least one other player, it can be prompted to tag that player.
- the technology described herein can be used to track a person of interest.
- the drone can therefore also be prompted to follow a person of interest.
- a drone may be prompted to follow a user when the user is performing an activity and capture pictures or videos of the user.
- a drone can be prompted to follow a user when the user is skiing in order to capture pictures and videos of the user.
- the user could be running, biking, kayaking, snowboarding, climbing, etc.
- the drone may maneuver based on the user’s trajectory, e.g., the drone may follow the user at a predetermined distance or keep the user positioned at a particular point in the sensor’s field of view.
- Because the drone uses a neural-network classifier, if the drone loses track of the person, e.g., because the person is temporarily occluded or leaves the sensor's field of view, the drone can automatically re-acquire the person as soon as the occlusion vanishes or the person returns to the sensor's field of view.
- the neural-network classifier recognizes the person in the data stream, regardless of the person's orientation or how long the person has been absent from the data stream, so long as the drone has not been restored to its factory settings. For instance, the drone can be taught to recognize a person during a first session, then turned off and on and used to recognize the same person during a subsequent session.
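- One simple way to keep the recognized person at a chosen point in the camera frame is a proportional controller on the pixel offset, as sketched below. The gains, setpoint, and command dictionary are illustrative assumptions, not part of the disclosure.
```python
# Convert the tracked person's position in the frame into yaw and climb
# commands that keep the person near a chosen setpoint in the image.

def follow_command(bbox_center, frame_size, setpoint=(0.5, 0.4),
                   k_yaw=1.0, k_climb=0.5):
    """bbox_center and frame_size are (x, y) in pixels."""
    cx = bbox_center[0] / frame_size[0]           # normalized horizontal position
    cy = bbox_center[1] / frame_size[1]           # normalized vertical position
    yaw_rate = k_yaw * (cx - setpoint[0])         # turn toward the person
    climb_rate = -k_climb * (cy - setpoint[1])    # move person toward the setpoint
    return {"yaw_rate": yaw_rate, "climb_rate": climb_rate}

print(follow_command((800, 300), (1280, 720)))
# {'yaw_rate': 0.125, 'climb_rate': -0.0083...}
```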
- Drone alarm: FIG. 8 illustrates another example game, termed 'drone alarm'.
- a drone 100 is placed in such a position as to monitor the entrance 2000 of a room (similar concepts can be applied to outdoor spaces).
- When the AI module detects an intruder (here, a person 1000, but it can also be a pet), the drone takes off.
- Other events can happen on the controlling device 120, such as a picture or a scan of the intruder being taken, or a sound alarm being played.
- Inventive embodiments are presented by way of example only and, within the scope of the appended claims and equivalents thereto, inventive embodiments may be practiced otherwise than as specifically described and claimed.
- inventive embodiments of the present disclosure are directed to each individual feature, system, article, material, kit, and/or method described herein.
- embodiments can be implemented in any of numerous ways. For example, embodiments may be implemented using hardware, software or a combination thereof. When implemented in software, the software code can be executed on any suitable processor or collection of processors, whether provided in a single computer or distributed among multiple computers.
- a computer may be embodied in any of a number of forms, such as a rack-mounted computer, a desktop computer, a laptop computer, or a tablet computer. Additionally, a computer may be embedded in a device not generally regarded as a computer but with suitable processing capabilities, including a Personal Digital Assistant (PDA), a smart phone or any other suitable portable or fixed electronic device.
- PDA Personal Digital Assistant
- a computer may have one or more input and output devices. These devices can be used, among other things, to present a user interface. Examples of output devices that can be used to provide a user interface include printers or display screens for visual presentation of output and speakers or other sound generating devices for audible presentation of output.
- Examples of input devices that can be used for a user interface include keyboards, and pointing devices, such as mice, touch pads, and digitizing tablets.
- a computer may receive input information through speech recognition or in other audible format.
- Such computers may be interconnected by one or more networks in any suitable form, including a local area network or a wide area network, such as an enterprise network, an intelligent network (IN), or the Internet.
- networks may be based on any suitable technology and may operate according to any suitable protocol and may include wireless networks, wired networks or fiber optic networks.
- the various methods or processes outlined herein may be coded as software that is executable on one or more processors that employ any one of a variety of operating systems or platforms. Additionally, such software may be written using any of a number of suitable programming languages and/or programming or scripting tools, and also may be compiled as executable machine language code or intermediate code that is executed on a framework or virtual machine.
- inventive concepts may be embodied as one or more methods, of which an example has been provided.
- the acts performed as part of the method may be ordered in any suitable way. Accordingly, embodiments may be constructed in which acts are performed in an order different than illustrated, which may include performing some acts simultaneously, even though shown as sequential acts in illustrative embodiments.
- a reference to "A and/or B", when used in conjunction with open-ended language such as "comprising" can refer, in one embodiment, to A only (optionally including elements other than B); in another embodiment, to B only (optionally including elements other than A); in yet another embodiment, to both A and B (optionally including other elements); etc.
- the phrase "at least one," in reference to a list of one or more elements, should be understood to mean at least one element selected from any one or more of the elements in the list of elements, but not necessarily including at least one of each and every element specifically listed within the list of elements and not excluding any combinations of elements in the list of elements.
- This definition also allows that elements may optionally be present other than the elements specifically identified within the list of elements to which the phrase "at least one" refers, whether related or unrelated to those elements specifically identified.
- “at least one of A and B” can refer, in one embodiment, to at least one, optionally including more than one, A, with no B present (and optionally including elements other than B); in another embodiment, to at least one, optionally including more than one, B, with no A present (and optionally including elements other than A); in yet another embodiment, to at least one, optionally including more than one, A, and at least one, optionally including more than one, B (and optionally including other elements); etc.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201862655490P | 2018-04-10 | 2018-04-10 | |
US62/655,490 | 2018-04-10 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2019199967A1 true WO2019199967A1 (en) | 2019-10-17 |
Family
ID=68163805
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2019/026783 WO2019199967A1 (en) | 2018-04-10 | 2019-04-10 | Systems and methods for gamification of drone behavior using artificial intelligence |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2019199967A1 (en) |
- 2019-04-10: PCT/US2019/026783 filed as WO2019199967A1 (active, Application Filing)
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180074519A1 (en) * | 2016-09-13 | 2018-03-15 | Hangzhou Zero Zero Technology Co., Ltd. | Unmanned aerial vehicle system and method with environmental sensing |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11561540B2 (en) * | 2019-02-26 | 2023-01-24 | Intel Corporation | Augmenting autonomous driving with remote viewer recommendation |
US20230315093A1 (en) * | 2019-02-26 | 2023-10-05 | Mobileye Vision Technologies Ltd. | Augmenting autonomous driving with remote viewer recommendation |
US11899457B1 (en) * | 2019-02-26 | 2024-02-13 | Mobileye Vision Technologies Ltd. | Augmenting autonomous driving with remote viewer recommendation |
CN111027427A (en) * | 2019-11-29 | 2020-04-17 | 大连理工大学 | Target gate detection method for small unmanned aerial vehicle race |
CN111027427B (en) * | 2019-11-29 | 2023-07-18 | 大连理工大学 | Target gate detection method for small unmanned aerial vehicle racing match |
CN111257507A (en) * | 2020-01-16 | 2020-06-09 | 清华大学合肥公共安全研究院 | Gas concentration detection and accident early warning system based on unmanned aerial vehicle |
CN113724295A (en) * | 2021-09-02 | 2021-11-30 | 中南大学 | Unmanned aerial vehicle tracking system and method based on computer vision |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108961312B (en) | High-performance visual object tracking method and system for embedded visual system | |
US10846873B2 (en) | Methods and apparatus for autonomous robotic control | |
Bux et al. | Vision based human activity recognition: a review | |
US11869237B2 (en) | Modular hierarchical vision system of an autonomous personal companion | |
WO2019199967A1 (en) | Systems and methods for gamification of drone behavior using artificial intelligence | |
KR102287460B1 (en) | Artificial intelligence moving agent | |
Ahad | Motion history images for action recognition and understanding | |
CN107851191B (en) | Context-based priors for object detection in images | |
US11430124B2 (en) | Visual object instance segmentation using foreground-specialized model imitation | |
CN110914836A (en) | System and method for implementing continuous memory bounded learning in artificial intelligence and deep learning for continuously running applications across networked computing edges | |
Pons et al. | Assessing machine learning classifiers for the detection of animals’ behavior using depth-based tracking | |
CN102301311B (en) | Standard gestures | |
KR102414602B1 (en) | Data recognition model construction apparatus and method for constructing data recognition model thereof, and data recognition apparatus and method for recognizing data thereof | |
CN103608844A (en) | Fully automatic dynamic articulated model calibration | |
CN102222431A (en) | Hand language translator based on machine | |
Lakshmi et al. | Neuromorphic vision: From sensors to event‐based algorithms | |
Pavel et al. | Object class segmentation of RGB-D video using recurrent convolutional neural networks | |
US20210204785A1 (en) | Artificial intelligence moving agent | |
Majumder et al. | A review of real-time human action recognition involving vision sensing | |
Othman et al. | Challenges and Limitations in Human Action Recognition on Unmanned Aerial Vehicles: A Comprehensive Survey. | |
Bukht et al. | A review of video-based human activity recognition: Theory, methods and applications | |
Mohamed | A novice guide towards human motion analysis and understanding | |
Peng | Object recognition in videos utilizing hierarchical and temporal objectness with deep neural networks | |
Bhaidasna et al. | A Survey on Different Deep Learning Model for Human Activity Recognition Based on Application | |
EP4290478A1 (en) | Method for processing image acquired from imaging device linked with computing device, and system using same |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 19784619 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 19784619 Country of ref document: EP Kind code of ref document: A1 |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 01.04.2021) |
|