WO2020129025A1 - Method and system for detecting holding in images - Google Patents
- Publication number
- WO2020129025A1 (PCT/IB2019/061224)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- interest
- region
- holding
- sequence
- features
Classifications
-
- G—PHYSICS
- G08—SIGNALLING
- G08B—SIGNALLING OR CALLING SYSTEMS; ORDER TELEGRAPHS; ALARM SYSTEMS
- G08B13/00—Burglar, theft or intruder alarms
- G08B13/18—Actuation by interference with heat, light, or radiation of shorter wavelength; Actuation by intruding sources of heat, light, or radiation of shorter wavelength
- G08B13/189—Actuation by interference with heat, light, or radiation of shorter wavelength; Actuation by intruding sources of heat, light, or radiation of shorter wavelength using passive radiation detection systems
- G08B13/194—Actuation by interference with heat, light, or radiation of shorter wavelength; Actuation by intruding sources of heat, light, or radiation of shorter wavelength using passive radiation detection systems using image scanning and comparing systems
- G08B13/196—Actuation by interference with heat, light, or radiation of shorter wavelength; Actuation by intruding sources of heat, light, or radiation of shorter wavelength using passive radiation detection systems using image scanning and comparing systems using television cameras
- G08B13/19602—Image analysis to detect motion of the intruder, e.g. by frame subtraction
- G08B13/19613—Recognition of a predetermined image pattern or behaviour pattern indicating theft or intrusion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V40/28—Recognition of hand or arm movements, e.g. recognition of deaf sign language
-
- G—PHYSICS
- G08—SIGNALLING
- G08B—SIGNALLING OR CALLING SYSTEMS; ORDER TELEGRAPHS; ALARM SYSTEMS
- G08B29/00—Checking or monitoring of signalling or alarm systems; Prevention or correction of operating errors, e.g. preventing unauthorised operation
- G08B29/18—Prevention or correction of operating errors
- G08B29/185—Signal analysis techniques for reducing or preventing false alarms or for enhancing the reliability of the system
- G08B29/186—Fuzzy logic; neural networks
Definitions
- the present invention relates to the field of methods and systems for detecting objects being held in images, and more particularly in frames of a video.
- the ability to automatically determine if a person is holding an item or not, regardless of context, from standard RGB video data facilitates a variety of other kinds of subsequent recognition and analysis tasks (e.g. deciding if someone is armed, detecting completion of a pass in certain sports, identifying shoplifters, etc.) without requiring domain-specific algorithms or training data. This may be important because collection of domain-specific data may be costly and domain-specific algorithms may not generalize as well to uncommon situations as those designed for general applicability.
- Embodiments of the present technology may provide and/or broaden the scope of approaches to and/or methods of achieving the aims and objects of the present technology.
- a computer-implemented method for detecting a holding in images comprising: receiving a sequence of initial images representing at least partially an entity; for each initial image, determining a respective region of interest around a predefined anatomical region of the entity, thereby obtaining a sequence of regions of interest; extracting latent spatial features from each respective region of interest, thereby obtaining a sequence of latent spatial features for each region of interest, the latent spatial features being indicative of the holding; generating a set of spatiotemporal features from the sequences of latent spatial features, the set of spatiotemporal features representing a sequence of visual states; determining from the set of spatiotemporal features whether the region of interest comprises the holding; and outputting an indication as to whether the holding is associated with the region of interest.
- the step of determining a respective region of interest around a predefined anatomical region of the entity comprises, for each initial image, identifying an anatomical region of the entity as corresponding to the predefined anatomical region, defining the respective region of interest that contains the identified anatomical region and cropping the initial image to obtain a respective cropped image containing the respective region of interest.
- the predefined anatomical region of the entity comprises one of a hand, an elbow and an underarm.
- the method further comprises resizing the respective region of interest to a predefined resolution.
- each one of the latent spatial features is represented by an N-dimensional block of data.
- the latent spatial features are chosen so as to localize at least one of salient visual cues and visual forms and relationships.
- the salient visual cues comprise at least one of edges, colors, patterns and shapes.
- the sequence of initial images corresponds to video frames.
- the step of receiving the sequence of initial images comprises receiving a live video.
- a system for detecting a holding in images comprising: a region of interest (ROI) extractor configured for receiving a sequence of initial images representing at least partially an entity and, for each initial image, determining a respective region of interest around a predefined anatomical region of the entity to obtain a sequence of regions of interest; a spatial feature extractor configured for extracting latent spatial features from each region of interest to obtain a sequence of latent spatial features for each region of interest, the latent spatial features being indicative of the holding; a temporal feature extractor configured for generating a set of latent spatiotemporal features from the sequences of latent spatial features, the set of latent spatiotemporal features representing a sequence of visual states; and a classifier configured for determining from the set of spatiotemporal features whether the region of interest comprises the holding and outputting an indication as to whether the holding is associated with the region of interest.
- the ROI extractor is configured for, for each initial image, identifying an anatomical region of the entity as corresponding to the predefined anatomical region, defining the respective region of interest that contains the identified anatomical region and cropping the initial image to obtain a respective cropped image containing the respective region of interest.
- the predefined anatomical region of the entity comprises one of a hand, an elbow and an underarm.
- the ROI extractor is further configured for resizing the respective region of interest to a predefined resolution.
- each one of the latent spatial features is represented by an N-dimensional block of data.
- the latent spatial features are chosen so as to localize at least one of salient visual cues and visual forms and relationships.
- the salient visual cues comprise at least one of edges, colors, patterns and shapes.
- the sequence of initial images corresponds to video frames.
- the video frames are part of a live video.
- a method and system for detecting holding in images, i.e. a method and system for detecting whether an entity such as a human being, an animal, a robot or the like is holding at least one object.
- the expression “holding an object” is not limitative and may be interpreted as including grabbing, grasping, touching, etc. an object. It should be understood that an object may be held within a hand, using an elbow, under an underarm, etc. and therefore, a holding is not limited to a hand holding an object.
- a machine learning algorithm is a process or set of procedures that helps a mathematical model adapt to data given an objective.
- an MLA normally specifies the way the feedback is used to enable the model to learn the appropriate mapping from input to output.
- the model specifies the mapping function and holds the parameters while the learning algorithm updates the parameters to help the model satisfy the objective.
- MLAs may generally be divided into broad categories such as supervised learning, unsupervised learning and reinforcement learning.
- Supervised learning involves presenting a machine learning algorithm with training data consisting of inputs and outputs labelled by assessors, where the objective is to train the machine learning algorithm such that it learns a general rule for mapping inputs to outputs.
- Unsupervised learning involves presenting the machine learning algorithm with unlabeled data, where the objective is for the machine learning algorithm to find a structure or hidden patterns in the data.
- Reinforcement learning involves having an algorithm evolving in a dynamic environment guided only by positive or negative reinforcement.
- Models used by the MLAs include neural networks (including deep learning), decision trees, support vector machines (SVMs), Bayesian networks, and genetic algorithms.
- Neural Networks (NNs)
- Neural networks, also known as artificial neural networks (ANNs), are a class of non-linear models mapping from inputs to outputs and comprised of layers that can potentially learn useful representations for predicting the outputs.
- Neural networks are typically organized in layers, which are made of a number of interconnected nodes that contain activation functions. Patterns may be presented to the network via an input layer connected to hidden layers, and processing may be done via the weighted connections of nodes. The answer is then output by an output layer connected to the hidden layers.
- Non-limiting examples of neural networks include: perceptrons, back-propagation networks, and Hopfield networks.
- Multilayer Perceptron (MLP)
- a multilayer perceptron is a class of feedforward artificial neural networks.
- an MLP consists of at least three layers of nodes: an input layer, a hidden layer and an output layer. Except for the input nodes, each node is a neuron that uses a nonlinear activation function.
- an MLP uses a supervised learning technique called backpropagation for training.
- an MLP can distinguish data that is not linearly separable.
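- As an illustration only, the following is a minimal sketch of a small MLP trained with backpropagation on labelled data, written in Python with PyTorch; the layer sizes, data and training settings are arbitrary assumptions and are not part of the claimed technology.

```python
import torch
import torch.nn as nn

# A three-layer MLP: input layer -> hidden layer (nonlinear activation) -> output layer.
mlp = nn.Sequential(
    nn.Linear(16, 32),
    nn.ReLU(),
    nn.Linear(32, 2),
)

x = torch.randn(8, 16)          # batch of 8 input vectors
y = torch.randint(0, 2, (8,))   # labels provided by assessors (supervised learning)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(mlp.parameters(), lr=0.01)

for _ in range(100):            # backpropagation training loop
    optimizer.zero_grad()
    loss = criterion(mlp(x), y)
    loss.backward()             # gradients propagate backwards through the layers
    optimizer.step()            # parameters are updated to better satisfy the objective
```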
- Convolutional Neural Network (CNN)
- a convolutional neural network (CNN or ConvNet) is an NN which is a regularized version of an MLP.
- a CNN uses convolution in place of general matrix multiplication in at least one layer.
- a recurrent neural network (RNN) is an NN where connections between nodes form a directed graph along a temporal sequence. This allows it to exhibit temporal dynamic behavior.
- Each node in a given layer is connected with a directed (one-way) connection to every other node in the next successive layer.
- Each node (neuron) has a time-varying real-valued activation.
- Each connection (synapse) has a modifiable real-valued weight.
- Nodes are either input nodes (receiving data from outside the network), output nodes (yielding results), or hidden nodes (that modify the data en route from input to output).
- Gradient boosting is one approach to building an MLA based on decision trees, whereby a prediction model in the form of an ensemble of trees is generated.
- the ensemble of trees is built in a stage-wise manner
- Each subsequent decision tree in the ensemble of decision trees focuses training on the cases for which the previous iteration(s) of the decision trees ensemble were "weak learners" (i.e. those associated with poor prediction/high error).
- boosting is a method aimed at enhancing prediction quality of the MLA.
- the system uses many trained algorithms (i.e. an ensemble of decision trees), and makes a final decision based on multiple prediction outcomes of those algorithms.
- In boosting of decision trees, the MLA first builds a first tree, then a second tree, which enhances the prediction outcome of the first tree, then a third tree, which enhances the prediction outcome of the first two trees, and so on.
- the MLA in a sense is creating an ensemble of decision trees, where each subsequent tree is better than the previous, specifically focusing on the weak learners of the previous iterations of the decision trees.
- each tree is built on the same training set of training objects; however, training objects for which the first tree made "mistakes" in predicting are prioritized when building the second tree, etc.
- These "tough" training objects are weighted with higher weights than those for which a previous tree made a satisfactory prediction.
- Examples of deep learning MLAs include: Deep Boltzmann Machine (DBM), Deep Belief Networks (DBN), Convolutional Neural Network (CNN), and Stacked Auto-Encoders.
- ensemble MLAs examples include: Random Forest, Gradient Boosting Machines (GBM), Boosting, Bootstrapped Aggregation (Bagging), AdaBoost, Stacked Generalization (Blending), Gradient Boosted Decision Trees (GBDT) and Gradient Boosted Regression Trees (GBRT).
- NN MLAs include: Radial Basis Function Network (RBFN), Perceptron, Back-Propagation, and Hopfield Network.
- Regularization MLAs include: Ridge Regression, Least Absolute Shrinkage and Selection Operator (LASSO), Elastic Net, and Least Angle Regression (LARS).
- Rule system MLAs include: Cubist, One Rule (OneR), Zero Rule (ZeroR), and Repeated Incremental Pruning to Produce Error Reduction (RIPPER).
- Regression MLAs include: Linear Regression, Ordinary Least Squares Regression (OLSR), Stepwise Regression, Multivariate Adaptive Regression Splines (MARS), Locally Estimated Scatterplot Smoothing (LOESS), and Logistic Regression.
- Bayesian MLAs include: Naive Bayes, Averaged One-Dependence Estimators (AODE), Bayesian Belief Network (BBN), Gaussian Naive Bayes, Multinomial Naive Bayes, and Bayesian Network (BN).
- Decision Trees MLAs include: Classification and Regression Tree (CART), Iterative Dichotomiser 3 (ID3), C4.5, C5.0, Chi-squared Automatic Interaction Detection (CHAID), Decision Stump, Conditional Decision Trees, and M5.
- Dimensionality Reduction MLAs include: Principal Component Analysis (PCA), Partial Least Squares Regression (PLSR), Sammon Mapping, Multidimensional Scaling (MDS), Projection Pursuit, Principal Component Regression (PCR), Partial Least Squares Discriminant Analysis, Mixture Discriminant Analysis (MDA), Quadratic Discriminant Analysis (QDA), Regularized Discriminant Analysis (RDA), Flexible Discriminant Analysis (FDA), and Linear Discriminant Analysis (LDA).
- Instance Based MLAs include: k-Nearest Neighbour (kNN), Learning Vector Quantization (LVQ).
- Clustering MLAs include: k-Means, k-Medians, Expectation Maximization, and Hierarchical Clustering.
- a "server” is a computer program that is running on appropriate hardware and is capable of receiving requests (e.g., from electronic devices) over a network (e.g., a communication network), and carrying out those requests, or causing those requests to be carried out.
- the hardware may be one physical computer or one physical computer system, but neither is required to be the case with respect to the present technology.
- a "server” is not intended to mean that every task (e.g., received instructions or requests) or any particular task will have been received, carried out, or caused to be carried out, by the same server (i.e., the same software and/or hardware); it is intended to mean that any number of software elements or hardware devices may be involved in receiving/sending, carrying out or causing to be carried out any task or request, or the consequences of any task or request; and all of this software and hardware may be one server or multiple servers, both of which are included within the expressions "at least one server” and "a server”.
- electronic device is any computing apparatus or computer hardware that is capable of running software appropriate to the relevant task at hand.
- electronic devices include general purpose personal computers (desktops, laptops, netbooks, etc.), mobile computing devices, smartphones, and tablets, and network equipment such as routers, switches, and gateways.
- an electronic device in the present context is not precluded from acting as a server to other electronic devices.
- the use of the expression “an electronic device” does not preclude multiple electronic devices being used in receiving/sending, carrying out or causing to be carried out any task or request, or the consequences of any task or request, or steps of any method described herein.
- a “client device” refers to any of a range of end-user client electronic devices, associated with a user, such as personal computers, tablets, smartphones, and the like.
- the expression “computer readable storage medium” (also referred to as “storage medium” and “storage”) is intended to include non-transitory media of any nature and kind whatsoever, including without limitation RAM, ROM, disks (CD-ROMs, DVDs, floppy disks, hard drives, etc.), USB keys, solid-state drives, tape drives, etc.
- a plurality of components may be combined to form the computer information storage media, including two or more media components of a same type and/or two or more media components of different types.
- a “database” is any structured collection of data, irrespective of its particular structure, the database management software, or the computer hardware on which the data is stored, implemented or otherwise rendered available for use.
- a database may reside on the same hardware as the process that stores or makes use of the information stored in the database or it may reside on separate hardware, such as a dedicated server or plurality of servers.
- an "indication" of an information element may be the information element itself or a pointer, reference, link, or other indirect mechanism enabling the recipient of the indication to locate a network, memory, database, or other computer-readable medium location from which the information element may be retrieved.
- an indication of a document could include the document itself (i.e. its contents), or it could be a unique document descriptor identifying a file with respect to a particular file system, or some other means of directing the recipient of the indication to a network location, memory address, database table, or other location where the file may be accessed.
- the degree of precision required in such an indication depends on the extent of any prior understanding about the interpretation to be given to information being exchanged as between the sender and the recipient of the indication. For example, if it is understood prior to a communication between a sender and a recipient that an indication of an information element will take the form of a database key for an entry in a particular table of a predetermined database containing the information element, then the sending of the database key is all that is required to effectively convey the information element to the recipient, even though the information element itself was not transmitted as between the sender and the recipient of the indication.
- the expression "communication network” is intended to include a telecommunications network such as a computer network, the Internet, a telephone network, a Telex network, a TCP/IP data network (e.g., a WAN network, a LAN network, etc.), and the like.
- the term "communication network” includes a wired network or direct-wired connection, and wireless media such as acoustic, radio frequency (RF), infrared and other wireless media, as well as combinations of any of the above.
- the words “first”, “second”, “third”, etc. have been used as adjectives only for the purpose of allowing for distinction between the nouns that they modify from one another, and not for the purpose of describing any particular relationship between those nouns.
- the use of the expressions “first server” and “third server” is not intended to imply any particular order, type, chronology, hierarchy or ranking (for example) of/between the servers, nor is their use (by itself) intended to imply that any “second server” must necessarily exist in any given situation.
- reference to a "first” element and a “second” element does not preclude the two elements from being the same actual real-world element.
- a "first” server and a “second” server may be the same software and/or hardware, in other cases they may be different software and/or hardware.
- Implementations of the present technology each have at least one of the above-mentioned object and/or aspects, but do not necessarily have all of them. It should be understood that some aspects of the present technology that have resulted from attempting to attain the above- mentioned object may not satisfy this object and/or may satisfy other objects not specifically recited herein.
- Figure 1 illustrates a schematic diagram of an electronic device in accordance with non-limiting embodiments of the present technology
- Figure 2 depicts a schematic diagram of a system in accordance with non-limiting embodiments of the present technology
- Figure 3 is a flow chart of a method for detecting a holding in a sequence of images, in accordance with an embodiment
- Figure 4 is a block diagram of a system for detecting a holding in a sequence of images, in accordance with an embodiment
- Figure 5 illustrates the extraction of two regions of interest localized around a left hand and a right hand of a person from 10 successive video frames, in accordance with an embodiment
- Figure 6 is a block diagram illustrating an exemplary architecture for a holding classifier system comprising a VGG16 type convolutional neural network spatial feature extractor, in accordance with an embodiment
- Figure 7 is a block diagram illustrating an exemplary architecture for a holding classifier system comprising a VGG19 type convolutional neural network spatial feature extractor, in accordance with an embodiment
- Figure 8 is a block diagram illustrating an exemplary architecture for a holding classifier system comprising a YOLO type convolutional neural network spatial feature extractor, in accordance with an embodiment
- Figure 9 is a block diagram illustrating an exemplary architecture for a holding classifier system comprising an encoder-recurrent-decoder (ERD) type temporal feature extractor, in accordance with an embodiment
- Figure 10 presents two exemplary pictorial representations of possible inputs and outputs to and from a spatial feature extractor and a temporal feature extractor, in each example (top and bottom), a single frame image being inputted into the spatial feature extractor, which extracts a set (256) of spatial feature maps that are inputted into the temporal feature extractor, resulting in a vector of (128) spatiotemporal features, in accordance with an embodiment; and
- Figure 11 is a block diagram of a processing module adapted to execute at least some of the steps of the method of Figure 3, in accordance with an embodiment.
- any functional block labeled as a "processor” or a “graphics processing unit” may be provided through the use of dedicated hardware as well as hardware capable of executing software in association with appropriate software.
- the functions may be provided by a single dedicated processor, by a single shared processor, or by a plurality of individual processors, some of which may be shared.
- the processor may be a general purpose processor, such as a central processing unit (CPU) or a processor dedicated to a specific purpose, such as a graphics processing unit (GPU).
- the terms “processor” or “controller” should not be construed to refer exclusively to hardware capable of executing software, and may implicitly include, without limitation, digital signal processor (DSP) hardware, network processor, application specific integrated circuit (ASIC), field programmable gate array (FPGA), read-only memory (ROM) for storing software, random access memory (RAM), and non-volatile storage.
- Other hardware, conventional and/or custom, may also be included.
- an electronic device 100 suitable for use with some implementations of the present technology, the electronic device 100 comprising various hardware components including one or more single or multi-core processors collectively represented by processor 110, a graphics processing unit (GPU) 111, a solid-state drive 120, a random access memory 130, a display interface 140, and an input/output interface 150.
- Communication between the various components of the electronic device 100 may be enabled by one or more internal and/or external buses 160 (e.g. a PCI bus, universal serial bus, IEEE 1394 "Firewire” bus, SCSI bus, Serial-ATA bus, etc.), to which the various hardware components are electronically coupled.
- the input/output interface 150 may be coupled to a touchscreen 190 and/or to the one or more internal and/or external buses 160.
- the touchscreen 190 may be part of the display. In some embodiments, the touchscreen 190 is the display.
- the touchscreen 190 may equally be referred to as a screen 190.
- the touchscreen 190 comprises touch hardware 194 (e.g., pressure-sensitive cells embedded in a layer of a display allowing detection of a physical interaction between a user and the display) and a touch input/output controller 192 allowing communication with the display interface 140 and/or the one or more internal and/or external buses 160.
- the input/output interface 150 may be connected to a keyboard (not shown), a mouse (not shown) or a trackpad (not shown) allowing the user to interact with the electronic device 100 in addition or in replacement of the touchscreen 190.
- the solid-state drive 120 stores program instructions suitable for being loaded into the random-access memory 130 and executed by the processor 110 and/or the GPU 111 for detecting a holding in a sequence of images.
- the program instructions may be part of a library or an application.
- the electronic device 100 may be implemented as a server, a desktop computer, a laptop computer, a tablet, a smartphone, a personal digital assistant or any device that may be configured to implement the present technology, as it may be understood by a person skilled in the art.
- FIG. 2 there is illustrated a schematic diagram of a system 200, the system 200 being suitable for implementing non-limiting embodiments of the present technology.
- the system 200 as shown is merely an illustrative implementation of the present technology.
- the description thereof that follows is intended to be only a description of illustrative examples of the present technology. This description is not intended to define the scope or set forth the bounds of the present technology.
- what are believed to be helpful examples of modifications to the system 200 may also be set forth below. This is done merely as an aid to understanding, and, again, not to define the scope or set forth the bounds of the present technology.
- the system 200 comprises inter alia a server 220 and a database 230 communicatively coupled over a communications network 250.
- the server 220 is configured to determine whether a holding of an object occurs in a sequence of images. How the server 220 is configured to do so will be explained in more detail herein below.
- the server 220 can be implemented as a conventional computer server and may comprise some or all of the features of the electronic device 100 depicted in Figure 1.
- the server 220 can be implemented as a Dell™ PowerEdge™ Server running the Microsoft™ Windows Server™ operating system. Needless to say, the server 220 can be implemented in any other suitable hardware and/or software and/or firmware or a combination thereof.
- the server 220 is a single server. In alternative non-limiting embodiments of the present technology, the functionality of the server 220 may be distributed and may be implemented via multiple servers (not depicted).
- the server 220 comprises a communication interface (not depicted) structured and configured to communicate with various entities (such as the database 230, for example and other devices potentially coupled to the network) via the network.
- the server 220 further comprises at least one computer processor (e.g., the processor 110 of the electronic device 100) operationally connected with the communication interface and structured and configured to execute various processes to be described herein.
- the server 220 has access to a plurality of machine learning algorithms (MLA).
- a database 230 is communicatively coupled to the server 220 via the communications network 250 but, in alternative implementations, the database 230 may be communicatively coupled to the server 220 without departing from the teachings of the present technology.
- the database 230 is illustrated schematically herein as a single entity, it is contemplated that the database 230 may be configured in a distributed manner, for example, the database 230 could have different components, each component being configured for a particular kind of retrieval therefrom or storage therein.
- the database 230 may be a structured collection of data, irrespective of its particular structure or the computer hardware on which data is stored, implemented or otherwise rendered available for use.
- the database 230 may reside on the same hardware as a process that stores or makes use of the information stored in the database 230 or it may reside on separate hardware, such as on the server 220.
- the database 230 may receive data from the server 220 for storage thereof and may provide stored data to the server 220 for use thereof.
- the database 230 may also be configured to store information for training the plurality of MLAs, such as training datasets, which may include training objects such as sequences of digital images or sequences of video frames as well as labels.
- the communications network 250 is the Internet.
- the communication network 250 can be implemented as any suitable local area network (LAN), wide area network (WAN), a private communication network or the like. It should be expressly understood that implementations for the communication network 250 are for illustration purposes only. How a communication link 255 (not separately numbered) between the server 220, the database 230 and/or another electronic device (not shown) and the communications network 250 is implemented will depend inter alia on how each electronic device is implemented.
- Figure 3 illustrates one embodiment of a computer-implemented method 300 for detecting a holding in images, i.e. determining whether an entity contained in the images holds an object.
- the method 300 is implemented by a computer machine comprising at least a processing unit or processor, a memory or storing unit and communication means.
- the method 300 is executed by a server.
- a sequence of initial images is received.
- the initial images comprise a representation of at least a portion of the entity.
- the initial images correspond to a plurality of frames of a video.
- the initial images received at step 302 correspond to frames of a live video which is received in substantially real-time.
- the initial images received at step 302 correspond to the frames of a video stored in a file on a memory.
- the step 302 corresponds to uploading the frames of the stored video from the memory.
- a region of interest is located within each initial image and extracted from each initial image.
- the initial images may be cropped to obtain a sequence of cropped images each containing the region of interest.
- a cropped initial image only contains the region of interest and optionally a given region surrounding the region of interest.
- the region of interest is selected according to a set of predefined anatomical regions of the entity.
- the predefined anatomical regions may include hands, elbows, underarms, and/or the like.
- the region of interest corresponds to the region of the initial image that contains at least one of the predefined anatomical regions and optionally also surrounds the anatomical region. Therefore, at step 304, at least one anatomical region corresponding to at least one of the predefined anatomical regions is first searched within the initial images in order to identify and extract the region of interest.
- a region of interest containing a predefined anatomical region of the entity is determined in the first image of the sequence of initial images and the same region of interest is subsequently identified in the other images of the sequence of initial images from one initial image to a subsequent or following one.
- any adequate method for detecting a region of interest may be used.
- when the anatomical region of interest is a hand, any adequate method for identifying a hand in an image may be used.
- more than one region of interest may be detected within the sequence of initial images. For example, if the initial images represent an entity provided with two arms and two hands, four regions of interest may be determined, i.e. a region of interest for each hand and a region of interest for each elbow.
- latent spatial features are extracted from each region of interest, thereby obtaining a sequence of latent spatial features.
- a latent spatial feature is an image feature that was found to be indicative of a holding during training of a machine learning system used to perform step 306.
- Latent spatial features may localize salient visual cues such as edges, colors, patterns, shapes and/or the like, and other more complex visual forms and relationships that are relevant for identifying a holding in images. The person skilled in the art will understand that the exact form for the latent spatial features will be defined by the machine learning system itself during the training of the machine learning system.
- a set of spatiotemporal features is generated from the sequences of latent spatial features.
- the set of spatiotemporal features represents a sequence of visual states from which it can be determined whether the entity represented in the initial images holds an object.
- at step 310, it is determined from the set of spatiotemporal features whether the region of interest identified in the initial images comprises a holding, i.e. whether the entity represented in the initial images holds an object or not.
- the step 310 is performed using a machine learning method executed by a machine or system that is trained to recognize a holding from sets of spatiotemporal features.
- an indication as to whether or not the region of interest comprises a holding is outputted.
- the indication is stored in memory.
- the indication may be transmitted to a display unit to be displayed thereon and/or to another computer machine.
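- The following is a minimal, hypothetical sketch in Python of the overall flow of method 300 (steps 302 to 312); the component functions are simplistic stand-ins for the ROI extractor, feature extractors and classifier described below and do not implement the trained models themselves.

```python
from typing import List, Sequence
import numpy as np

def extract_roi(image: np.ndarray) -> np.ndarray:
    # Step 304 (stand-in): crop a region of interest around a predefined anatomical region.
    return image[:128, :128]

def spatial_features(roi: np.ndarray) -> np.ndarray:
    # Step 306 (stand-in): extract latent spatial features from one region of interest.
    return roi.mean(axis=(0, 1))

def spatiotemporal_features(seq: Sequence[np.ndarray]) -> np.ndarray:
    # Step 308 (stand-in): combine the sequence of latent spatial features over time.
    return np.stack(seq).mean(axis=0)

def classify_holding(features: np.ndarray) -> bool:
    # Step 310 (stand-in): decide whether the region of interest comprises a holding.
    return bool(features.mean() > 0.5)

def detect_holding(frames: List[np.ndarray]) -> bool:
    rois = [extract_roi(f) for f in frames]          # sequence of regions of interest
    latent = [spatial_features(r) for r in rois]     # sequence of latent spatial features
    st = spatiotemporal_features(latent)             # set of spatiotemporal features
    return classify_holding(st)                      # step 312: output the indication

frames = [np.random.rand(720, 1280, 3) for _ in range(10)]  # stand-in for 10 video frames (step 302)
print("holding" if detect_holding(frames) else "not holding")
```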
- the detection system pertains to real-time detection of an entity holding at least one object or item in video frames.
- the system comprises a region of interest (ROI) extractor that extracts at least one region of interest (e.g. cropped images from bounding boxes around an anatomical part of the entity such as a hand, an elbow, etc.) from a video feed, and a holding detector that receives the region(s) of interest from the ROI extractor and classifies each region of interest as containing a held object or not.
- the detection system may accept sequential video frames from a video file or live video stream. It may output a binary label or tag representing holding or not holding for each ROI detected by the holding ROI extractor.
- Figure 4 illustrates one embodiment of a system 350 for detecting a holding in a sequence of images such as in frames of a video.
- the system 350 receives as input a sequence of initial images 352 which contains the representation of an entity.
- the system 350 is configured for determining whether the entity holds an object or not.
- the system 350 comprises a ROI extractor 354, a spatial features extractor 356, a temporal features extractor 358 and a classifier 360.
- the ROI extractor 354 is configured for receiving the sequence of initial images, extracting at least one region of interest from the received initial images and optionally generating a sequence of cropped images containing the region of interest.
- the spatial features extractor 356 is configured for extracting a sequence of spatial features from the sequence of cropped images.
- the temporal features extractor 358 is configured for generating a set of spatiotemporal features from the sequence of spatial features and the classifier 360 is configured for determining if a holding is associated with the initial images using the spatiotemporal features and outputting an indication of the holding, i.e. an indication as to whether the entity contained in the initial images holds an object.
- the ROI extractor 354 may be any adequate device configured for receiving an image or a video frame (or images or video frames) as input and providing, as output, regions of interest, i.e. regions of the image or video frame that contain a predefined "holding target" such as a predefined anatomical part of an entity that can hold an object.
- a ROI extractor may be configured for extracting regions of interest containing a hand, an elbow, an underarm, or the like.
- the regions of interest may be represented in memory as cropped images.
- the ROI extractor 354 may be configured for extracting more than one region of interest from the initial images.
- the ROI extractor 354 may be configured for identifying hands, elbows, underarms, and the like in the initial images.
- the ROI extractor 354 may comprise a plurality of modules each adapted to identify a specific anatomical part of an entity.
- the ROI extractor 354 identifies the region(s) of interest in each image so that the anatomical part corresponding to the region of interest may be tracked between subsequent images or video frames to ensure consistency between the identified regions of interest through the images or frames, as illustrated in Figure 5.
- the ROI extractor 354 may be a hand detector system that localizes and tracks the bounding boxes of the hands of a person in a video frame and extracts the image data within these bounding boxes as ROIs.
- the regions of interest outputted by the ROI extractor 354 may be resized to a consistent resolution (e.g. 128x128) such that input to the spatial feature extractor 356 is consistent for any size of the entity, for any size of the region of interest in the images and for any resolution of the input images or frames.
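- By way of illustration, the sketch below crops a hypothetical hand bounding box from a frame and resizes it to 128x128 using OpenCV; the bounding-box coordinates are assumptions and any equivalent cropping/resizing routine could be used.

```python
import cv2
import numpy as np

frame = np.random.randint(0, 255, (720, 1280, 3), dtype=np.uint8)  # stand-in video frame
x, y, w, h = 600, 300, 90, 110                                     # hypothetical hand bounding box

roi = frame[y:y + h, x:x + w]        # crop the region of interest
roi = cv2.resize(roi, (128, 128))    # resize so the spatial feature extractor input is consistent
```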
- the spatial feature extractor 356 receives the region(s) of interest identified by the ROI extractor 354 for each initial image or video frame.
- the spatial feature extractor 356 is configured for determining a set of latent spatial features for each received region of interest.
- a region of interest identified by the ROI extractor 354 comprises the pixels located within the bounding-box surrounding the predefined anatomical region corresponding to the region of interest.
- the spatial feature extractor 356 may output a latent feature representation of the region of interest that may be encoded as an N-dimensional block of data such as a stack of 2D spatial feature maps, as illustrated in Figure 10.
- the spatial feature extractor 356 uses a machine learning method for extracting the latent spatial features.
- the output of the spatial feature extractor 356 may comprise a set of latent spatial features that, through training of the spatial feature extractor 356, have been found to be informative for deciding if the region of interest contains an object being held. It should be understood that the extraction of the latent spatial features is performed for each region of interest of each image independently from the previous images so that the extracted latent spatial features comprise no temporal information.
- the spatial feature extractor 356 may comprise NxN convolutional filters in which each 1x1 element is a scalar weight learned through training, and where multiple filters comprise a single layer in the stacked model.
- a latent spatial feature is an MxM scalar image derived by applying the feature extractor to an initial region of interest (or to the output of a previous layer), and a stack of latent spatial features is output by each feature extractor layer in the model.
- a value is associated with a latent spatial feature and the value is a scalar image represented as an MxM grid of numbers (such as 16- or 32-bit floating point values), a stack of which comprises the output of a given layer.
- the extracted latent spatial features may localize, within a 2D feature map of spatial coordinates corresponding to those of the initial image for example, salient visual cues such as edges, colors, patterns, shapes and other more complex visual forms and relationships that are relevant for identifying holding.
- latent spatial features may have a high value in regions corresponding to hands, fingers, common holdable objects, etc.
- the spatial feature extractor 356 may be implemented as a convolutional neural network (CNN) comprising a stack of convolutional layers, fully-connected layers, pooling layers, activation layers (such as ReLU layers), and/or other adequate layers.
- Convolutional layers, fully-connected layers and possibly other kinds of layers contain learnable parameters that are set with a training procedure over a set of relevant training data.
- the specific CNN architecture of the spatial feature extractor 356 may be similar to the architecture of a VGG16 network, comprising 5 blocks of convolutional layers (blocks: two layers of size 64, two layers of size 128, three layers of size 256, three layers of size 512, three layers of size 512), with ReLU activations between each block and spatial max pooling layers, as illustrated in Figure 6.
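- A minimal sketch of a VGG16-style spatial feature extractor with the block structure described above is given below in Python/PyTorch; it is illustrative only and omits the training procedure and any pre-trained weights.

```python
import torch
import torch.nn as nn

def vgg_block(in_ch, out_ch, n_convs):
    # A block of 3x3 convolutions with ReLU activations, followed by spatial max pooling.
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1), nn.ReLU()]
    layers.append(nn.MaxPool2d(2))
    return layers

spatial_extractor = nn.Sequential(
    *vgg_block(3, 64, 2),
    *vgg_block(64, 128, 2),
    *vgg_block(128, 256, 3),
    *vgg_block(256, 512, 3),
    *vgg_block(512, 512, 3),
)

roi = torch.randn(1, 3, 128, 128)        # one resized region of interest
feature_maps = spatial_extractor(roi)    # stack of 2D latent spatial feature maps
print(feature_maps.shape)                # torch.Size([1, 512, 4, 4])
```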
- Figures 7 and 8 illustrate other adequate CNN architectures capable of extracting spatial features, including a VGG19 network and a YOLO network, respectively.
- Other adequate networks, such as a ResNet50 network, may also be used, depending on particular training and deployment considerations such as latency, memory constraints and/or available processing power.
- the person skilled in the art will understand that the exact forms of the latent spatial features are defined by the operation of the spatial feature extractor on input data.
- the spatial feature extractor is in turn defined by a training process during which the parameters of the spatial feature extractor (e.g. convolutional filter weights) are learned by examining a set of training data, as described in greater detail below.
- the temporal feature extractor 358 receives the sets of latent spatial features as they are produced by the spatial feature extractor 356, i.e. it receives a temporally ordered sequence of sets of latent spatial features provided by the spatial feature extractor 356 (e.g. operating over a sequence of regions of interest).
- the temporal feature extractor 358 is configured for outputting a set of latent spatiotemporal features that may also be encoded as an N-dimensional block of data per input initial image or video frame.
- a set of latent spatiotemporal features is represented by a 1D feature vector per input frame, where a given element of the 1D feature vector has a high value when the feature associated with the given element is activated/detected, as illustrated in Figure 10.
- the set of latent spatiotemporal features outputted by the temporal feature extractor 358 may represent a sequence of visual states (e.g. the image of a hand open followed by the image of a hand gripped) that are informative for deciding the holding state of a region of interest given the current initial image and all previously received images.
- the latent spatiotemporal features are defined by the operation of the temporal feature extractor on the input features, which in turn are defined by the training process on a given set of sequential training data (e.g. videos) and the input data, respectively.
- the temporal feature extractor 358 may store current and previous state information to facilitate the extraction of the latent spatiotemporal features.
- a separate temporal feature extractor may be instantiated for each input region of interest in order for temporal information to be relevant to each region of interest respectively.
- the temporal feature extractor 358 may be implemented as a recurrent neural network (RNN) that may comprise long-short-term memory (LSTM) layers, fully connected layers, activation layers and/or the like. LSTM layers contain learnable parameters to be set during training. Unlike the temporal state, which may be instantiated per region of interest, a single set of learned parameters may be trained for use with any input.
- the specific RNN architecture of the temporal feature extractor 358 may comprise two fully connected layers, each with 1024 outputs and ReLU activations, followed by an LSTM layer with 256 outputs. It should be understood that other RNN architectures, such as an encoder-recurrent-decoder (ERD) architecture, may be implemented for the temporal feature extractor 358.
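- For illustration, a minimal sketch of such a temporal feature extractor (two fully connected layers with 1024 outputs and ReLU activations followed by an LSTM layer with 256 outputs) is shown below in Python/PyTorch; the input size, which assumes flattened 512x4x4 feature maps from the spatial extractor sketch above, is an assumption.

```python
import torch
import torch.nn as nn

class TemporalFeatureExtractor(nn.Module):
    def __init__(self, in_features=512 * 4 * 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(in_features, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
        )
        self.lstm = nn.LSTM(1024, 256, batch_first=True)  # carries state across frames

    def forward(self, spatial_seq):
        # spatial_seq: (batch, time, in_features) flattened latent spatial features
        x = self.fc(spatial_seq)
        out, _ = self.lstm(x)
        return out                                        # (batch, time, 256) spatiotemporal features

seq = torch.randn(1, 10, 512 * 4 * 4)                     # 10 frames of flattened feature maps
spatiotemporal = TemporalFeatureExtractor()(seq)
print(spatiotemporal.shape)                               # torch.Size([1, 10, 256])
```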
- the classifier 360 receives the set of spatiotemporal features from the temporal feature extractor 358 as input and outputs an indication of a holding, i.e. whether the image contains a held object or not.
- the classifier 360 may generate an output only when a holding has been detected.
- the classifier 360 may generate an output only when no holding has been detected.
- the classifier 360 may generate a respective and different output when a holding has been detected and when a holding has not been detected. For example, the classifier 360 may output a binary label of 1 or 0, which may represent 'holding' or 'not holding' respectively.
- the classifier 360 may be implemented as a neural network, comprising one or more fully-connected layers followed by a soft-max activation layer, which transforms the output to a class probability.
- the classifier 360 may comprise a single fully-connected layer followed by a soft-max layer. The probability may be thresholded at 0.5 (or any other appropriate value) to produce the final inferred binary label.
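- A minimal sketch of such a classifier (a single fully-connected layer followed by a soft-max, thresholded at 0.5) is given below in Python/PyTorch; the 256-feature input is an assumption that matches the temporal extractor sketch above.

```python
import torch
import torch.nn as nn

classifier = nn.Linear(256, 2)                      # two classes: 'not holding' / 'holding'

features = torch.randn(1, 256)                      # spatiotemporal feature vector for one frame
probs = torch.softmax(classifier(features), dim=1)  # soft-max turns the output into class probabilities
label = int(probs[0, 1] > 0.5)                      # 1 = 'holding', 0 = 'not holding'
print(label)
```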
- the system 350 may be based on a trained, machine learning-based model that may accept a sequence of images/frames localized to one anatomical holding region of interest of an entity in a sequence of images or video and may output an indication as to whether the entity is holding an object in the region of interest.
- the holding classifier may be run once for each region of interest.
- the spatial feature extractor 356, the temporal feature extractor 358 and the classifier 360 may be separate software modules, separate hardware units or portions of one or more software or hardware components.
- the software modules may be written in the Python programming language (or C++, C#, etc.) with suitable modules, such as Caffe (or TensorFlow, Torch, etc.), and run on modern GPU hardware (or CPU hardware or implemented on an FPGA, etc.).
- the system may be run on an electronic device such as a desktop or a mobile device, or other platforms (e.g. embedded systems) and may accept video feeds from webcams, phone cameras, television feeds, stored video files, or other stored or live video sources (e.g. over the internet).
- the system may be provided with videos at a resolution high enough to represent predefined anatomical parts (e.g. hands, elbows) of entities in frame with sufficient fidelity to distinguish if objects are being held or not (e.g. 720p).
- the system may be provided video frames at a frequency high enough to capture state transitions between holding and not holding, such as picking-up and putting-down objects (e.g. 30 Hz).
- the learnable parameters of the components i.e. the spatial feature extractor 356, the temporal feature extractor 358 and the classifier 360, may be trained with appropriate training data such as sequences of images or videos of entities holding and not holding objects paired with associated ground-truth 'holding' or 'not holding' labels for each relevant holding region of interest in each frame.
- the training data set may be large and highly varied (a property often referred to as 'in-the-wild').
- the proposed architecture may be trained in two stages, relaxing this requirement: First, the spatial-feature extractor 356 may be trained on a large (e.g. tens of thousands to hundreds of thousands of images), annotated, in-the-wild image dataset, since it does not require temporal information. These kinds of datasets are readily available. To pre-train the spatial feature extractor 356 alone, a secondary atemporal classifier may be stacked onto the spatial feature extractor 356 and then discarded after training.
- the entire holding classifier system may be refined (i.e. tuned) by training on a much smaller (e.g. hundreds of videos each comprising thousands of frames), annotated video-dataset.
- This video-dataset may not be highly-varied in spatial appearance but may capture a wide variety of temporal features (e.g. different motions of holding ROIs and different transitions between 'holding' and 'not holding' states).
- Each video may be split into short sequences of frames (e.g. 32) to facilitate batching during training and to reduce memory requirements.
- both images and sequences of video frames may be stochastically transformed during training (i.e. data augmentation) to improve generalization of the trained classifier.
- Scaling, rotation, translation and color-shift (i.e. scaling values of individual color channels) and other transformations may be used for augmentation.
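- As an illustration, the stochastic transformations mentioned above (scaling, rotation, translation and color shifts) might be expressed with torchvision transforms as in the sketch below; the parameter ranges are assumptions and other augmentation libraries could equally be used.

```python
from torchvision import transforms

# Random scaling, rotation, translation and color perturbation applied at training time.
augment = transforms.Compose([
    transforms.RandomAffine(degrees=15, translate=(0.1, 0.1), scale=(0.9, 1.1)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
])
# For video sequences, the same randomly drawn transform can be applied to every
# frame of a short clip so that the clip remains temporally consistent.
```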
- the above-described architecture provides:
- a classifier architecture that facilitates training on smaller and more readily available video and image training datasets.
- Figure 11 is a block diagram illustrating an exemplary processing module 400 for executing the steps 302 to 312 of the method 300, in accordance with some embodiments.
- the processing module 400 typically includes one or more CPUs and/or GPUs 402 for executing modules or programs and/or instructions stored in memory 404 and thereby performing processing operations, memory 404, and one or more communication buses 406 for interconnecting these components.
- the communication buses 406 optionally include circuitry (sometimes called a chipset) that interconnects and controls communications between system components.
- the memory 404 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM or other random access solid state memory devices, and may include non-volatile memory, such as one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, or other non-volatile solid state storage devices.
- the memory 404 optionally includes one or more storage devices remotely located from the CPU(s) and/or GPUs 402.
- the memory 404, or alternately the non-volatile memory device(s) within the memory 404 comprises a non- transitory computer readable storage medium.
- the memory 404 or the computer readable storage medium of the memory 404 stores the following programs, modules, and data structures, or a subset thereof: a ROI extractor module 410 for extracting at least one region of interest from a sequence of initial images and cropping the initial images; a spatial feature extractor module 412 for extracting spatial features from the cropped initial images; a temporal feature extractor module 414 for generating temporal features; and a classifier module 416 for determining whether a holding is present from the temporal features.
- Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above.
- the above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures or modules, and various subsets of these modules may be combined or otherwise rearranged in various embodiments.
- the memory 404 may store a subset of the modules and data structures identified above.
- the memory 404 may store additional modules and data structures not described above.
- Figure 11 is intended more as a functional description of the various features which may be present in the processing module 400 than as a structural schematic of the embodiments described herein. In practice, and as recognized by those of ordinary skill in the art, items shown separately could be combined and some items could be separated.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Multimedia (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Software Systems (AREA)
- Databases & Information Systems (AREA)
- Human Computer Interaction (AREA)
- Computing Systems (AREA)
- Social Psychology (AREA)
- Medical Informatics (AREA)
- Psychiatry (AREA)
- Data Mining & Analysis (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Biology (AREA)
- General Engineering & Computer Science (AREA)
- Automation & Control Theory (AREA)
- Fuzzy Systems (AREA)
- Mathematical Physics (AREA)
- Computer Security & Cryptography (AREA)
- Image Analysis (AREA)
Abstract
A method for detecting a holding in images, comprising: receiving a sequence of initial images representing at least partially an entity; for each initial image, determining a respective region of interest around a predefined anatomical region of the entity, thereby obtaining a sequence of regions of interest; extracting latent spatial features from each respective region of interest, thereby obtaining a sequence of latent spatial features for each region of interest, the latent spatial features being indicative of the holding; generating a set of spatiotemporal features from the sequences of latent spatial features, the set of spatiotemporal features representing a sequence of visual states; determining from the set of spatiotemporal features whether the region of interest comprises the holding; and outputting an indication as to whether the holding is associated with the region of interest.
Description
METHOD AND SYSTEM FOR DETECTING HOLDING IN IMAGES
CROSS-REFERENCE TO RELATED APPLICATIONS
The present application claims priority on US Provisional Application No. 62/782,503 filed on December 20, 2018.
TECHNICAL FIELD
The present invention relates to the field of methods and systems for detecting objects being held in images, and more particularly in frames of a video.
BACKGROUND
Understanding interactions between humans and other objects is of importance in computer vision and may have applications in different fields such as entertainment, security, retail or robotics. Central to human-object interactions are the notions of grabbing and holding. In the literature, there exist computer vision-based methods to categorize activities in video data, such as interacting with specific items; however, many of these methods cannot be run efficiently on modern hardware in real-time or are not causal and therefore require an entire video sequence to infer (i.e. they cannot be run in an online fashion). There also exist robust methods to localize both objects and persons (e.g. YOLO™, FastRCNN™) but object-person interactions are not typically examined or identified. The task of identifying a person holding or not holding an item in any context (i.e. any item category, any reasonable appearance of the person and scene), especially in real-time, remains underexamined.
The ability to automatically determine if a person is holding an item or not, regardless of context, from standard RGB video data facilitates a variety of other kinds of subsequent recognition and analysis tasks (e.g. deciding if someone is armed, detecting completion of a pass in certain sports, identifying shoplifters, etc.) without requiring domain-specific algorithms or training data. This may be important because collection of domain-specific data may be costly and domain-specific algorithms may not generalize as well to uncommon situations as those designed for general applicability.
Therefore, there is a need for an improved method and system for detecting holding in images.
SUMMARY
It is an object of the present technology to ameliorate at least some of the inconveniences present in the prior art. Embodiments of the present technology may provide and/or broaden the scope of approaches to and/or methods of achieving the aims and objects of the present technology.
According to a first broad aspect, there is provided a computer-implemented method for detecting a holding in images, comprising: receiving a sequence of initial images representing at least partially an entity; for each initial image, determining a respective region of interest around a predefined anatomical region of the entity, thereby obtaining a sequence of regions of interest; extracting latent spatial features from each respective region of interest, thereby obtaining a sequence of latent spatial features for each region of interest, the latent spatial features being indicative of the holding; generating a set of spatiotemporal features from the sequences of latent spatial features, the set of spatiotemporal features representing a sequence of visual states; determining from the set of spatiotemporal features whether the region of interest comprises the holding; and outputting an indication as to whether the holding is associated with the region of interest.
In one embodiment, the step of determining a respective region of interest around a predefined anatomical region of the entity comprises, for each initial image, identifying an anatomical region of the entity as corresponding to the predefined anatomical region, defining the respective region of interest that contains the identified anatomical region and cropping the initial image to obtain a respective cropped image containing the respective region of interest.
In one embodiment, the predefined anatomical region of the entity comprises one of a hand, an elbow and an underarm.
In one embodiment, the method further comprises resizing the respective region of interest to a predefined resolution.
In one embodiment, each one of the latent spatial features is represented by an N-dimensional block of data.
In one embodiment, the latent spatial features are chosen so as to localize at least one of salient visual cues and visual forms and relationships.
In one embodiment, the salient visual cues comprise at least one of edges, colors, patterns and shapes.
In one embodiment, the sequence of initial images corresponds to video frames.
In one embodiment, the step of receiving the sequence of initial images comprises receiving a live video.
According to another broad aspect, there is provided a system for detecting a holding in images, the system comprising: a region of interest (ROI) extractor configured for receiving a sequence of initial images representing at least partially an entity and, for each initial image, determining a respective region of interest around a predefined anatomical region of the entity to obtain a sequence of regions of interest; a spatial feature extractor configured for extracting latent spatial features from each region of interest to obtain a sequence of latent spatial features for each region of interest, the latent spatial features being indicative of the holding; a temporal feature extractor configured for generating a set of latent spatiotemporal features from the sequences of latent spatial features, the set of latent spatiotemporal features representing a sequence of visual states; and a classifier configured for determining from the set of spatiotemporal features whether the region of interest comprises the holding and outputting an indication as to whether the holding is associated with the region of interest.
In one embodiment, the ROI extractor is configured for, for each initial image, identifying an anatomical region of the entity as corresponding to the predefined anatomical region, defining the respective region of interest that contains the identified anatomical region and cropping the initial image to obtain a respective cropped image containing the respective region of interest.
In one embodiment, the predefined anatomical region of the entity comprises one of a hand, an elbow and an underarm.
In one embodiment, the ROI extractor is further configured for resizing the respective region of interest to a predefined resolution.
In one embodiment, each one of the latent spatial features is represented by an N-dimensional block of data.
In one embodiment, the latent spatial features are chosen so as to localize at least one of salient visual cues and visual forms and relationships.
In one embodiment, the salient visual cues comprise at least one of edges, colors, patterns and shapes.
In one embodiment, the sequence of initial images corresponds to video frames.
In one embodiment, the video frames are part of a live video.
While the present description refers to a method and system for detecting holding in images, i.e. a method and system for detecting whether an entity such as a human being, an animal, a robot or the like is holding at least one object, it should be understood that the expression "holding an object" is not limitative and may be interpreted as including grabbing, grasping, touching, etc. an object. It should also be understood that an object may be held within a hand, using an elbow, under an underarm, etc. and that, therefore, a holding is not limited to a hand holding an object.
Machine Learning Algorithms (MLA)
A machine learning algorithm is a process or set of procedures that helps a mathematical model adapt to data given an objective. A MLA normally specifies the way the feedback is used to enable the model to learn the appropriate mapping from input to output. The model specifies the mapping function and holds the parameters while the learning algorithm updates the parameters to help the model satisfy the objective.
MLAs may generally be divided into broad categories such as supervised learning, unsupervised learning and reinforcement learning. Supervised learning involves presenting a machine learning algorithm with training data consisting of inputs and outputs labelled by assessors,
where the objective is to train the machine learning algorithm such that it learns a general rule for mapping inputs to outputs. Unsupervised learning involves presenting the machine learning algorithm with unlabeled data, where the objective is for the machine learning algorithm to find a structure or hidden patterns in the data. Reinforcement learning involves having an algorithm evolving in a dynamic environment guided only by positive or negative reinforcement.
Models used by the MLAs include neural networks (including deep learning), decision trees, support vector machines (SVMs), Bayesian networks, and genetic algorithms.
Neural Networks (NNs)
Neural networks (NNs), also known as artificial neural networks (ANNs), are a class of non-linear models mapping from inputs to outputs and comprised of layers that can potentially learn useful representations for predicting the outputs. Neural networks are typically organized in layers, which are made of a number of interconnected nodes that contain activation functions. Patterns may be presented to the network via an input layer connected to hidden layers, and processing may be done via the weighted connections of nodes. The answer is then output by an output layer connected to the hidden layers. Non-limiting examples of neural networks include: perceptrons, back-propagation networks, and Hopfield networks.
Multilayer Perceptron (MLP)
A multilayer perceptron (MLP) is a class of feedforward artificial neural networks. A MLP consists of at least three layers of nodes: an input layer, a hidden layer and an output layer. Except for the input nodes, each node is a neuron that uses a nonlinear activation function. A MLP uses a supervised learning technique called backpropagation for training. A MLP can distinguish data that is not linearly separable.
Convolutional Neural Network (CNN)
A convolutional neural network (CNN or ConvNet) is a NN which is a regularized version of a MLP. A CNN uses convolution in place of general matrix multiplication in at least one layer.
Recurrent Neural Network (RNN)
A recurrent neural network (RNN) is a NN where connections between nodes form a directed graph along a temporal sequence. This allows it to exhibit temporal dynamic behavior. Each node in a given layer is connected with a directed (one-way) connection to every other node in the next successive layer. Each node (neuron) has a time-varying real-valued activation. Each connection (synapse) has a modifiable real-valued weight. Nodes are either input nodes (receiving data from outside the network), output nodes (yielding results), or hidden nodes (that modify the data en route from input to output).
Gradient Boosting
Gradient boosting is one approach to building an MLA based on decision trees, whereby a prediction model in the form of an ensemble of trees is generated. The ensemble of trees is built in a stage-wise manner. Each subsequent decision tree in the ensemble focuses training on the training objects that the previous iteration(s) of the decision tree ensemble predicted poorly, i.e. the "weak learners" (those associated with poor prediction/high error).
Generally speaking, boosting is a method aimed at enhancing prediction quality of the MLA. In this scenario, rather than relying on a prediction of a single trained algorithm (i.e. a single decision tree) the system uses many trained algorithms (i.e. an ensemble of decision trees), and makes a final decision based on multiple prediction outcomes of those algorithms.
In boosting of decision trees, the MLA first builds a first tree, then a second tree, which enhances the prediction outcome of the first tree, then a third tree, which enhances the prediction outcome of the first two trees and so on. Thus, the MLA in a sense is creating an ensemble of decision trees, where each subsequent tree is better than the previous, specifically focusing on the weak learners of the previous iterations of the decision trees. Put another way, each tree is built on the same training set of training objects, however training objects, in which the first tree made "mistakes" in predicting are prioritized when building the second tree, etc. These "tough" training objects (the ones that previous iterations of the decision trees predict less accurately) are weighted with higher weights than those where a previous tree made satisfactory prediction.
Examples of deep learning MLAs include: Deep Boltzmann Machine (DBM), Deep Belief Networks (DBN), Convolutional Neural Network (CNN), and Stacked Auto-Encoders.
Examples of ensemble MLAs include: Random Forest, Gradient Boosting Machines (GBM), Boosting, Bootstrapped Aggregation (Bagging), AdaBoost, Stacked Generalization (Blending), Gradient Boosted Decision Trees (GBDT) and Gradient Boosted Regression Trees (GBRT).
Examples of NN MLAs include: Radial Basis Function Network (RBFN), Perceptron, Back-Propagation, and Hopfield Network.
Examples of Regularization MLAs include: Ridge Regression, Least Absolute Shrinkage and Selection Operator (LASSO), Elastic Net, and Least Angle Regression (LARS).
Examples of Rule system MLAs include: Cubist, One Rule (OneR), Zero Rule (ZeroR), and Repeated Incremental Pruning to Produce Error Reduction (RIPPER).
Examples of Regression MLAs include: Linear Regression, Ordinary Least Squares Regression (OLSR), Stepwise Regression, Multivariate Adaptive Regression Splines (MARS), Locally Estimated Scatterplot Smoothing (LOESS), and Logistic Regression.
Examples of Bayesian MLAs include: Naive Bayes, Averaged One-Dependence Estimators (AODE), Bayesian Belief Network (BBN), Gaussian Naive Bayes, Multinomial Naive Bayes, and Bayesian Network (BN).
Examples of Decision Trees MLAs include: Classification and Regression Tree (CART), Iterative Dichotomiser 3 (ID3), C4.5, C5.0, Chi-squared Automatic Interaction Detection (CHAID), Decision Stump, Conditional Decision Trees, and M5.
Examples of Dimensionality Reduction MLAs include: Principal Component Analysis (PCA), Partial Least Squares Regression (PLSR), Sammon Mapping, Multidimensional Scaling (MDS), Projection Pursuit, Principal Component Regression (PCR), Partial Least Squares Discriminant Analysis, Mixture Discriminant Analysis (MDA), Quadratic Discriminant Analysis (QDA), Regularized Discriminant Analysis (RDA), Flexible Discriminant Analysis (FDA), and Linear Discriminant Analysis (LDA).
Examples of Instance Based MLAs include: k-Nearest Neighbour (kNN), Learning Vector Quantization (LVQ), Self-Organizing Map (SOM), Locally Weighted Learning (LWL).
Examples of Clustering MLAs include: k-Means, k-Medians, Expectation Maximization, and Hierarchical Clustering.
In the context of the present specification, a "server" is a computer program that is running on appropriate hardware and is capable of receiving requests (e.g., from electronic devices) over a network (e.g., a communication network), and carrying out those requests, or causing those requests to be carried out. The hardware may be one physical computer or one physical computer system, but neither is required to be the case with respect to the present technology. In the present context, the use of the expression a "server" is not intended to mean that every task (e.g., received instructions or requests) or any particular task will have been received, carried out, or caused to be carried out, by the same server (i.e., the same software and/or hardware); it is intended to mean that any number of software elements or hardware devices may be involved in receiving/sending, carrying out or causing to be carried out any task or request, or the consequences of any task or request; and all of this software and hardware may be one server or multiple servers, both of which are included within the expressions "at least one server" and "a server".
In the context of the present specification, "electronic device" is any computing apparatus or computer hardware that is capable of running software appropriate to the relevant task at hand. Thus, some (non-limiting) examples of electronic devices include general purpose personal computers (desktops, laptops, netbooks, etc.), mobile computing devices, smartphones, and tablets, and network equipment such as routers, switches, and gateways. It should be noted that an electronic device in the present context is not precluded from acting as a server to other electronic devices. The use of the expression "an electronic device" does not preclude multiple electronic devices being used in receiving/sending, carrying out or causing to be carried out any task or request, or the consequences of any task or request, or steps of any method described herein. In the context of the present specification, a "client device" refers to any of a range of end-user client electronic devices, associated with a user, such as personal computers, tablets, smartphones, and the like.
In the context of the present specification, the expression "computer readable storage medium" (also referred to as "storage medium" and "storage") is intended to include non-transitory media of any nature and kind whatsoever, including without limitation RAM, ROM, disks (CD-ROMs, DVDs, floppy disks, hard drives, etc.), USB keys, solid-state drives, tape drives, etc. A plurality of components may be combined to form the computer information storage media, including two or more media components of a same type and/or two or more media components of different types.
In the context of the present specification, a "database" is any structured collection of data, irrespective of its particular structure, the database management software, or the computer hardware on which the data is stored, implemented or otherwise rendered available for use. A database may reside on the same hardware as the process that stores or makes use of the information stored in the database or it may reside on separate hardware, such as a dedicated server or plurality of servers.
In the context of the present specification, unless expressly provided otherwise, an "indication" of an information element may be the information element itself or a pointer, reference, link, or other indirect mechanism enabling the recipient of the indication to locate a network, memory, database, or other computer-readable medium location from which the information element may be retrieved. For example, an indication of a document could include the document itself (i.e. its contents), or it could be a unique document descriptor identifying a file with respect to a particular file system, or some other means of directing the recipient of the indication to a network location, memory address, database table, or other location where the file may be accessed. As one skilled in the art would recognize, the degree of precision required in such an indication depends on the extent of any prior understanding about the interpretation to be given to information being exchanged as between the sender and the recipient of the indication. For example, if it is understood prior to a communication between a sender and a recipient that an indication of an information element will take the form of a database key for an entry in a particular table of a predetermined database containing the information element, then the sending of the database key is all that is required to effectively convey the information element to the recipient, even though the information element itself was not transmitted as between the sender and the recipient of the indication.
In the context of the present specification, the expression "communication network" is intended to include a telecommunications network such as a computer network, the Internet, a telephone network, a Telex network, a TCP/IP data network (e.g., a WAN network, a LAN network, etc.), and the like. The term "communication network" includes a wired network or direct-wired connection, and wireless media such as acoustic, radio frequency (RF), infrared and other wireless media, as well as combinations of any of the above.
In the context of the present specification, the words "first", "second", "third", etc. have been used as adjectives only for the purpose of allowing for distinction between the nouns that they modify from one another, and not for the purpose of describing any particular relationship between those nouns. Thus, for example, it should be understood that the use of the terms "server" and "third server" is not intended to imply any particular order, type, chronology, hierarchy or ranking (for example) of/between the server, nor is their use (by itself) intended to imply that any "second server" must necessarily exist in any given situation. Further, as is discussed herein in other contexts, reference to a "first" element and a "second" element does not preclude the two elements from being the same actual real-world element. Thus, for example, in some instances, a "first" server and a "second" server may be the same software and/or hardware, in other cases they may be different software and/or hardware.
Implementations of the present technology each have at least one of the above-mentioned object and/or aspects, but do not necessarily have all of them. It should be understood that some aspects of the present technology that have resulted from attempting to attain the above- mentioned object may not satisfy this object and/or may satisfy other objects not specifically recited herein.
Additional and/or alternative features, aspects and advantages of implementations of the present technology will become apparent from the following description, the accompanying drawings and the appended claims.
BRIEF DESCRIPTION OF THE DRAWINGS
Further features and advantages of the present invention will become apparent from the following detailed description, taken in combination with the appended drawings, in which:
Figure 1 illustrates a schematic diagram of an electronic device in accordance with non-limiting embodiments of the present technology;
Figure 2 depicts a schematic diagram of a system in accordance with non-limiting embodiments of the present technology;
Figure 3 is a flow chart of a method for detecting a holding in a sequence of images, in accordance with an embodiment;
Figure 4 is a block diagram of a system for detecting a holding in a sequence of images, in accordance with an embodiment;
Figure 5 illustrates the extraction of two regions of interest localized around a left hand and a right hand of a person from 10 successive video frames, in accordance with an embodiment;
Figure 6 is a block diagram illustrating an exemplary architecture for a holding classifier system comprising a VGG16 type convolutional neural network spatial feature extractor, in accordance with an embodiment;
Figure 7 is a block diagram illustrating an exemplary architecture for a holding classifier system comprising a VGG19 type convolutional neural network spatial feature extractor, in accordance with an embodiment;
Figure 8 is a block diagram illustrating an exemplary architecture for a holding classifier system comprising a YOLO type convolutional neural network spatial feature extractor, in accordance with an embodiment;
Figure 9 is a block diagram illustrating an exemplary architecture for a holding classifier system comprising an encoder-recurrent-decoder (ERD) type temporal feature extractor, in accordance with an embodiment;
Figure 10 presents two exemplary pictorial representations of possible inputs and outputs to and from a spatial feature extractor and a temporal feature extractor, in each example (top and bottom), a single frame image being inputted into the spatial feature extractor, which extracts a
set (256) of spatial feature maps that are inputted into the temporal feature extractor, resulting in a vector of (128) spatiotemporal features, in accordance with an embodiment; and
Figure 11 is a block diagram of a processing module adapted to execute at least some of the steps of the method of Figure 3, in accordance with an embodiment.
It will be noted that throughout the appended drawings, like features are identified by like reference numerals.
DETAILED DESCRIPTION
The examples and conditional language recited herein are principally intended to aid the reader in understanding the principles of the present technology and not to limit its scope to such specifically recited examples and conditions. It will be appreciated that those skilled in the art may devise various arrangements which, although not explicitly described or shown herein, nonetheless embody the principles of the present technology and are included within its spirit and scope.
Furthermore, as an aid to understanding, the following description may describe relatively simplified implementations of the present technology. As persons skilled in the art would understand, various implementations of the present technology may be of a greater complexity.
In some cases, what are believed to be helpful examples of modifications to the present technology may also be set forth. This is done merely as an aid to understanding, and, again, not to define the scope or set forth the bounds of the present technology. These modifications are not an exhaustive list, and a person skilled in the art may make other modifications while nonetheless remaining within the scope of the present technology. Further, where no examples of modifications have been set forth, it should not be interpreted that no modifications are possible and/or that what is described is the sole manner of implementing that element of the present technology.
Moreover, all statements herein reciting principles, aspects, and implementations of the present technology, as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof, whether they are currently known or developed in the future. Thus, for example, it will be appreciated by those skilled in the art that any block diagrams
herein represent conceptual views of illustrative circuitry embodying the principles of the present technology. Similarly, it will be appreciated that any flowcharts, flow diagrams, state transition diagrams, pseudo-code, and the like represent various processes which may be substantially represented in computer-readable media and so executed by a computer or processor, whether or not such computer or processor is explicitly shown.
The functions of the various elements shown in the figures, including any functional block labeled as a "processor" or a "graphics processing unit", may be provided through the use of dedicated hardware as well as hardware capable of executing software in association with appropriate software. When provided by a processor, the functions may be provided by a single dedicated processor, by a single shared processor, or by a plurality of individual processors, some of which may be shared. In some non-limiting embodiments of the present technology, the processor may be a general purpose processor, such as a central processing unit (CPU) or a processor dedicated to a specific purpose, such as a graphics processing unit (GPU). Moreover, explicit use of the term "processor" or "controller" should not be construed to refer exclusively to hardware capable of executing software, and may implicitly include, without limitation, digital signal processor (DSP) hardware, network processor, application specific integrated circuit (ASIC), field programmable gate array (FPGA), read-only memory (ROM) for storing software, random access memory (RAM), and non-volatile storage. Other hardware, conventional and/or custom, may also be included.
Software modules, or simply modules which are implied to be software, may be represented herein as any combination of flowchart elements or other elements indicating performance of process steps and/or textual description. Such modules may be executed by hardware that is expressly or implicitly shown.
With these fundamentals in place, we will now consider some non-limiting examples to illustrate various implementations of aspects of the present technology.
Electronic device
Referring to Figure 1, there is shown an electronic device 100 suitable for use with some implementations of the present technology, the electronic device 100 comprising various hardware components including one or more single or multi-core processors collectively
represented by processor 110, a graphics processing unit (GPU) 111, a solid-state drive 120, a random access memory 130, a display interface 140, and an input/output interface 150.
Communication between the various components of the electronic device 100 may be enabled by one or more internal and/or external buses 160 (e.g. a PCI bus, universal serial bus, IEEE 1394 "Firewire" bus, SCSI bus, Serial-ATA bus, etc.), to which the various hardware components are electronically coupled.
The input/output interface 150 may be coupled to a touchscreen 190 and/or to the one or more internal and/or external buses 160. The touchscreen 190 may be part of the display. In some embodiments, the touchscreen 190 is the display. The touchscreen 190 may equally be referred to as a screen 190. In the embodiments illustrated in Figure 1, the touchscreen 190 comprises touch hardware 194 (e.g., pressure-sensitive cells embedded in a layer of a display allowing detection of a physical interaction between a user and the display) and a touch input/output controller 192 allowing communication with the display interface 140 and/or the one or more internal and/or external buses 160. In some embodiments, the input/output interface 150 may be connected to a keyboard (not shown), a mouse (not shown) or a trackpad (not shown) allowing the user to interact with the electronic device 100 in addition or in replacement of the touchscreen 190.
According to implementations of the present technology, the solid-state drive 120 stores program instructions suitable for being loaded into the random-access memory 130 and executed by the processor 110 and/or the GPU 111 for detecting a holding in a sequence of images. For example, the program instructions may be part of a library or an application.
The electronic device 100 may be implemented as a server, a desktop computer, a laptop computer, a tablet, a smartphone, a personal digital assistant or any device that may be configured to implement the present technology, as it may be understood by a person skilled in the art.
System
Referring to Figure 2, there is illustrated a schematic diagram of a system 200, the system 200 being suitable for implementing non-limiting embodiments of the present technology. It is to be expressly understood that the system 200 as shown is merely an illustrative implementation of the present technology. Thus, the description thereof that follows is intended to be only a description of illustrative examples of the present technology. This description is not intended to define the scope or set forth the bounds of the present technology. In some cases, what are believed to be helpful examples of modifications to the system 200 may also be set forth below. This is done merely as an aid to understanding, and, again, not to define the scope or set forth the bounds of the present technology. These modifications are not an exhaustive list, and, as a person skilled in the art would understand, other modifications are likely possible. Further, where this has not been done (i.e., where no examples of modifications have been set forth), it should not be interpreted that no modifications are possible and/or that what is described is the sole manner of implementing that element of the present technology. As a person skilled in the art would understand, this is likely not the case. In addition, it is to be understood that the system 200 may provide in certain instances simple implementations of the present technology, and that where such is the case they have been presented in this manner as an aid to understanding. As persons skilled in the art would understand, various implementations of the present technology may be of a greater complexity.
The system 200 comprises inter alia a server 220 and a database 230 communicatively coupled over a communications network 250.
Server
Generally speaking, the server 220 is configured to determine whether a holding of an object occurs in a sequence of images. How the server 220 is configured to do so will be explained in more detail herein below.
The server 220 can be implemented as a conventional computer server and may comprise some or all of the features of the electronic device 100 depicted in Figure 1. In a non-limiting example of an embodiment of the present technology, the server 220 can be implemented as a Dell™ PowerEdge™ Server running the Microsoft™ Windows Server™ operating system. Needless to say, the server 220 can be implemented in any other suitable hardware and/or software and/or
firmware or a combination thereof. In the depicted non-limiting embodiment of present technology, the server 220 is a single server. In alternative non-limiting embodiments of the present technology, the functionality of the server 220 may be distributed and may be implemented via multiple servers (not depicted).
The implementation of the server 220 is well known to the person skilled in the art of the present technology. However, briefly speaking, the server 220 comprises a communication interface (not depicted) structured and configured to communicate with various entities (such as the database 230, for example and other devices potentially coupled to the network) via the network. The server 220 further comprises at least one computer processor (e.g., the processor 110 of the electronic device 100) operationally connected with the communication interface and structured and configured to execute various processes to be described herein.
The server 220 has access to a plurality of machine learning algorithms (MLA).
Database
A database 230 is communicatively coupled to the server 220 via the communications network 250 but, in alternative implementations, the database 230 may be communicatively coupled to the server 220 without departing from the teachings of the present technology. Although the database 230 is illustrated schematically herein as a single entity, it is contemplated that the database 230 may be configured in a distributed manner, for example, the database 230 could have different components, each component being configured for a particular kind of retrieval therefrom or storage therein.
The database 230 may be a structured collection of data, irrespective of its particular structure or the computer hardware on which data is stored, implemented or otherwise rendered available for use. The database 230 may reside on the same hardware as a process that stores or makes use of the information stored in the database 230 or it may reside on separate hardware, such as on the server 220. Generally speaking, the database 230 may receive data from the server 220 for storage thereof and may provide stored data to the server 220 for use thereof.
The database 230 may also be configured to store information for training the plurality of MLAs, such as training datasets, which may include training objects such as sequences of digital images or sequences of video frames, as well as labels.
Communication Network
In some embodiments of the present technology, the communications network 250 is the Internet. In alternative non-limiting embodiments, the communication network 250 can be implemented as any suitable local area network (LAN), wide area network (WAN), a private communication network or the like. It should be expressly understood that implementations for the communication network 250 are for illustration purposes only. How a communication link 255 (not separately numbered) between the first server 220, the database 230, the second server 240 and/or another electronic device (not shown) and the communications network 250 is implemented will depend inter alia on how each electronic device is implemented.
Figure 3 illustrates one embodiment of a computer-implemented method 300 for detecting a holding in images, i.e. determining whether an entity contained in the images holds an object. The method 300 is implemented by a computer machine comprising at least a processing unit or processor, a memory or storing unit and communication means. In one embodiment, the method 300 is executed by a server.
At step 302, a sequence of initial images is received. The initial images comprise a representation of at least a portion of the entity. In one embodiment, the initial images correspond to a plurality of frames of a video.
In one embodiment, the initial images received at step 302 correspond to frames of a live video which is received in substantially real-time. In another embodiment, the initial images received at step 302 correspond to the frames of a video stored in a file on a memory. In this case, the step 302 corresponds to uploading the frames of the stored video from the memory.
At step 304, a region of interest is located within each initial image and extracted from each initial image. In one embodiment, the initial images may be cropped to obtain a sequence of cropped images each containing the region of interest. In one embodiment, a cropped initial
image only contains the region of interest and optionally a given region surrounding the region of interest.
The region of interest is selected according to a set of predefined anatomical regions of the entity. For example, the predefined anatomical regions may include hands, elbows, underarms, and/or the like. The region of interest corresponds to the region of the initial image that contains at least one of the predefined anatomical regions and optionally also surrounds the anatomical region. Therefore, at step 304, at least one anatomical region corresponding to at least one of the predefined anatomical regions is first searched for within the initial images in order to identify and extract the region of interest.
In one embodiment, a region of interest containing a predefined anatomical region of the entity is determined in the first image of the sequence of initial images and the same region of interest is subsequently identified in the other images of the sequence of initial images from one initial image to a subsequent or following one.
It should be understood that any adequate method for detecting a region of interest may be used. For example, when the anatomical region of interest is a hand, any adequate method for identifying a hand in an image may be used.
It should also be understood that more than one region of interest may be detected within the sequence of initial images. For example, when the initial images represent an entity provided with two arms and two hands, four regions of interest may be determined, i.e. a region of interest for each hand and a region of interest for each elbow.
At step 306, latent spatial features are extracted from each region of interest, thereby obtaining a sequence of latent spatial features. As described below in greater detail, a latent spatial feature is an image feature that was found to be indicative of a holding during training of a machine learning system used to perform step 306. Latent spatial features may localize salient visual cues such as edges, colors, patterns, shapes and/or the like, and other more complex visual forms and relationships that are relevant for identifying a holding in images.
The person skilled in the art will understand that the exact form for the latent spatial features will be defined by the machine learning system itself during the training of the machine learning system.
At step 308, a set of spatiotemporal features is generated from the sequences of latent spatial features. As described in further detail below, the set of spatiotemporal features represents a sequence of visual states from which it can be determined whether the entity represented in the initial images holds an object.
At step 310, it is determined from the set of spatiotemporal features whether the region of interest identified in the initial images comprises a holding, i.e. whether the entity represented in the initial images holds an object or not.
In one embodiment and as described in greater detail below, the step 310 is performed using a machine learning method executed by a machine or system that is trained to recognize a holding from sets of spatiotemporal features.
At step 312, an indication as to whether or not the region of interest comprises a holding is outputted. In one embodiment, the indication is stored in memory. In the same or another embodiment, the indication may be transmitted to a display unit to be displayed thereon and/or to another computer machine.
In the following, there is described one embodiment of a detection system for executing the method 300.
In one embodiment, the detection system pertains to real-time detection of an entity holding at least one object or item in video frames. The system comprises a region of interest (ROI) extractor that extracts at least one region of interest (e.g. cropped images from bounding boxes around an anatomical part of the entity such as a hand, an elbow, etc.) from a video feed, and a holding detector that receives the region(s) of interest from the ROI extractor and classifies each region of interest as containing a held object or not.
The detection system may accept sequential video frames from a video file or live video stream. It may output a binary label or tag representing holding or not holding for each ROI detected by the holding ROI extractor.
Figure 4 illustrates one embodiment of a system 350 for detecting a holding in a sequence of images such as in frames of a video. The system 350 receives as input a sequence of initial images 352 which contains the representation of an entity. The system 350 is configured for determining whether the entity holds an object or not. The system 350 comprises a ROI extractor 354, a spatial features extractor 356, a temporal features extractor 358 and a classifier 360.
The ROI extractor 354 is configured for receiving the sequence of initial images, extracting at least one region of interest from the received initial images and optionally generating a sequence of cropped images containing the region of interest. The spatial features extractor 356 is configured for extracting a sequence of spatial features from the sequence of cropped images. The temporal features extractor 358 is configured for generating a set of spatiotemporal features from the sequence of spatial features and the classifier 360 is configured for determining if a holding is associated with the initial images using the spatiotemporal features and outputting an indication of the holding, i.e. an indication as to whether the entity contained in the initial images holds an object.
The ROI extractor 354 may be any adequate device configured for receiving an image or a video frame (or images or video frames) as input and providing, as output, regions of interest, i.e. regions of the image or video frame that contain a predefined "holding target" such as a predefined anatomical part of an entity that can hold an object. For example, a ROI extractor may be configured for extracting regions of interest containing a hand, an elbow, an underarm, or the like. The regions of interest may be represented in memory as cropped images.
It should be understood that the ROI extractor 354 may be configured for extracting more than one region of interest from the initial images. For example, the ROI extractor 354 may be configured for identifying hands, elbows, underarms, and the like in the initial images. In one embodiment, the ROI extractor 354 may comprise a plurality of modules, each adapted to identify a specific anatomical part of an entity.
When a sequence of images is inputted in the ROI extractor 354, the ROI extractor 354 identifies the region(s) of interest in each image so that the anatomical part corresponding to the region of interest may be tracked between subsequent images or video frames to ensure consistency between the identified regions of interest through the images or frames, as illustrated in Figure 5.
For example, the ROI extractor 354 may be a hand detector system that localizes and tracks the bounding boxes of the hands of a person in a video frame and extracts the image data within these bounding boxes as ROIs.
In one embodiment, the regions of interest outputted by the ROI extractor 354 may be resized to a consistent resolution (e.g. 128x128) such that input to the spatial feature extractor 356 is consistent for any size of the entity, for any size of the region of interest in the images and for any resolution of the input images or frames.
Referring back to Figure 4, the spatial feature extractor 356 receives the region(s) of interest identified by the ROI extractor 354 for each initial image or video frame. The spatial feature extractor 356 is configured for determining a set of latent spatial features for each received region of interest.
In one embodiment, a region of interest identified by the ROI extractor 354 comprises the pixels located within the bounding-box surrounding the predefined anatomical region corresponding to the region of interest. In this case, the spatial feature extractor 356 may output a latent feature representation of the region of interest that may be encoded as an N-dimensional block of data such as a stack of 2D spatial feature maps, as illustrated in Figure 10.
In one embodiment, the spatial feature extractor 356 uses a machine learning method for extracting the latent spatial features. The output of the spatial feature extractor 356 may comprise a set of latent spatial features that, through training of the spatial feature extractor 356, have been found to be informative for deciding if the region of interest contains an object being held. It should be understood that the extraction of the latent spatial features is performed for each region of interest of each image independently from the previous images so that the extracted latent spatial features comprise no temporal information.
In one embodiment, the spatial feature extractor 356 is an NxN convolutional filter in which each 1x1 element is a scalar weight learned through training and where multiple filters comprise a single layer in the stacked model. In this case, a latent spatial feature is an MxM scalar image derived by applying the feature extractor to an initial region of interest (or to the output of a previous layer), and a stack of latent spatial features is output by each feature extractor layer in the model. A value is associated with a latent spatial feature; the value is a scalar image represented as an MxM grid of numbers (such as 16- or 32-bit floating point values), a stack of which comprises the output of a given layer.
In one embodiment, the extracted latent spatial features may localize, within a 2D feature map of spatial coordinates corresponding to those of the initial image for example, salient visual cues such as edges, colors, patterns, shapes and other more complex visual forms and relationships that are relevant for identifying holding. For example, latent spatial features may have a high value in regions corresponding to hands, fingers, common holdable objects, etc.
In one embodiment, the spatial feature extractor 356 may be implemented as a convolutional neural network (CNN) comprising a stack of convolutional layers, fully-connected layers, pooling layers, activation layers (such as ReLU layers), and/or other adequate layers. Convolutional layers, fully-connected layers and possibly other kinds of layers contain learnable parameters that are set with a training procedure over a set of relevant training data.
In one embodiment, the specific CNN architecture of the spatial feature extractor 356 may be similar to the architecture of a VGG16 network, comprising 5 blocks of convolutional layers (blocks: two layers of size 64, two layers of size 128, three layers of size 256, three layers of size 512, two layers of size 512), with ReLU activations between each block and spatial max pooling layers, as illustrated in Figure 6.
Figures 7 and 8 illustrate other adequate CNN architectures capable of extracting spatial features, including a VGG19 network and a YOLO network, respectively.
Other adequate networks, such as a ResNet50 network, may also be used, depending on particular training and deployment considerations such as latency, memory constraints and/or available processing power.
The person skilled in the art will understand that the exact forms of the latent spatial features are defined by the operation of the spatial feature extractor on input data. The spatial feature extractor is in turn defined by a training process during which the parameters of the spatial feature extractor (e.g. convolutional filter weights) are learned by examining a set of training data, as described in greater detail below.
Referring back to Figure 4, the temporal feature extractor 358 receives the sets of latent spatial features as they are produced by the spatial feature extractor 356, i.e. it receives a temporally ordered sequence of sets of latent spatial features provided by the spatial feature extractor 356 (e.g. operating over a sequence of regions of interest).
The temporal feature extractor 358 is configured for outputting a set of latent spatiotemporal features that may also be encoded as an N-dimensional block of data per input initial image or video frame. In one embodiment, a set of latent spatiotemporal features is represented by a 1D feature vector per input frame, where a given element of the 1D feature vector has a high value when the feature associated with the given element is activated/detected, as illustrated in Figure 10.
In one embodiment, the set of latent spatiotemporal features outputted by the temporal feature extractor 358 may represent a sequence of visual states (e.g. the image of a hand open followed by the image of a hand gripped) that are informative for deciding the holding state of a region of interest given the current initial image and all previously received images.
Similarly to the latent spatial features, the person skilled in the art will understand that the latent spatiotemporal features are defined by the operation of the temporal feature extractor on the input features, which in turn are defined by the training process on a given set of sequential training data (e.g. videos) and the input data, respectively.
In one embodiment, the temporal feature extractor 358 may store current and previous state information to facilitate the extraction of the latent spatiotemporal features.
In an embodiment in which more than one region of interest is determined, a separate temporal feature extractor, with its own stored state, may be instantiated for each input region of interest in order for temporal information to be relevant to each region of interest respectively.
In one embodiment, the temporal feature extractor 358 may be implemented as a recurrent neural network (RNN) that may comprise long-short-term memory (LSTM) layers, fully connected layers, activation layers and/or the like. LSTM layers contain learnable parameters to be set during training. Unlike the temporal state, which may be instantiated per region of interest, a single set of learned parameters may be trained for use with any input.
In one embodiment, the specific RNN architecture of the temporal feature extractor 358 may comprise two fully connected layers, each with 1024 outputs and ReLU activations, followed by an LSTM layer with 256 outputs. It should be understood that other RNN architecture such as an encoder-recurrent-decoder (ERD) architecture may be implemented for the temporal feature extractor 358.
Referring back to Figure 4, for each sequential image and each region of interest extracted for the sequential image, the classifier 360 receives the set of spatiotemporal features from the temporal feature extractor 358 as input and outputs an indication of a holding, i.e. whether the image contains a held object or not. In one embodiment, the classifier 360 may generate an output only when a holding has been detected. In another embodiment, the classifier 360 may generate an output only when no holding has been detected. In a further embodiment, the classifier 360 may generate a respective and different output when a holding has been detected and when a holding has not been detected. For example, the classifier 360 may output a binary label of 1 or 0, which may represent 'holding' or 'not holding' respectively.
In one embodiment, the classifier 360 may be implemented as a neural network, comprising one or more fully-connected layers followed by a soft-max activation layer, which transforms the output to a class probability. For instance, the classifier 360 may comprise a single fully-connected layer followed by a soft-max layer. The probability may be thresholded at 0.5 (or any other appropriate value) to produce the final inferred binary label.
As described above, the system 350 may be based on a trained, machine learning-based model that may accept a sequence of images/frames localized to one anatomical holding region of interest of an entity in a sequence of images or video and may output an indication as to whether the entity is holding an object in the region of interest. At the time of inferencing the holding
state of the regions of interest within a frame, the holding classifier may be run once for each region of interest.
In one embodiment, the spatial feature extractor 356, the temporal feature extractor 358 and the classifier 360 may be separate software modules, separate hardware units or portions of one or more software or hardware components. For example, the software modules may be written in the Python programming language (or C++, C#, etc.) with suitable modules, such as Caffe (or TensorFlow, Torch, etc.), and run on modern GPU hardware (or CPU hardware or implemented on an FPGA, etc.). The system may be run on an electronic device such as a desktop or a mobile device, or other platforms (e.g. embedded systems) and may accept video feeds from webcams, phone cameras, television feeds, stored video files, or other stored or live video sources (e.g. over the internet).
In one embodiment, the system may be provided with videos at a resolution high enough to represent predefined anatomical parts (e.g. hands, elbows) of entities in frame with sufficient fidelity to distinguish whether objects are being held (e.g. 720p). Similarly, the system may be provided with video frames at a frequency high enough to capture state transitions between holding and not holding, such as picking up and putting down objects (e.g. 30 Hz).
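For illustration only (the minimum values below simply restate the example figures given above and are not prescribed limits), a simple input sanity check could be:

```python
def source_is_suitable(height_px: int, frame_rate_hz: float,
                       min_height: int = 720, min_rate: float = 30.0) -> bool:
    """Check that a video source roughly meets the suggested resolution
    and frame-rate examples above (e.g. 720p at 30 Hz)."""
    return height_px >= min_height and frame_rate_hz >= min_rate
```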
In the following, there is described one embodiment of a method for training the system 350. The learnable parameters of the components, i.e. the spatial feature extractor 356, the temporal feature extractor 358 and the classifier 360, may be trained with appropriate training data such as sequences of images or videos of entities holding and not holding objects paired with associated ground-truth 'holding' or 'not holding' labels for each relevant holding region of interest in each frame.
In one embodiment and in order to train the components of the system 350 to be robust to different contexts and not over-fit to any particular kinds of scenes or objects, the training data set may be large and highly varied (a property often referred to as 'in-the-wild').
In one embodiment, while large, annotated, in-the-wild video training sets may be difficult or costly to obtain, the proposed architecture may be trained in two stages, relaxing this requirement:
First, the spatial feature extractor 356 may be trained on a large (e.g. tens of thousands to hundreds of thousands of images), annotated, in-the-wild image dataset, since it does not require temporal information. These kinds of datasets are readily available. To pre-train the spatial feature extractor 356 alone, a secondary atemporal classifier may be stacked onto the spatial feature extractor 356 and then discarded after training.
Second, the entire holding classifier system may be refined (i.e. tuned) by training on a much smaller (e.g. hundreds of videos, each comprising thousands of frames) annotated video dataset. This video dataset may not be highly varied in spatial appearance but may capture a wide variety of temporal features (e.g. different motions of holding ROIs and different transitions between 'holding' and 'not holding' states). Each video may be split into short sequences of frames (e.g. 32) to facilitate batching during training and to reduce memory requirements.
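A hedged outline of this two-stage procedure follows; the optimizer, learning rates, epoch counts and the data-loader and model interfaces are all assumptions introduced for the sketch, not values given in the document:

```python
import torch
import torch.nn as nn

def pretrain_spatial(spatial_fe, atemporal_head, image_loader, epochs=10):
    """Stage 1: train the spatial feature extractor on an annotated in-the-wild
    image dataset, through a temporary atemporal classification head."""
    opt = torch.optim.Adam(list(spatial_fe.parameters()) +
                           list(atemporal_head.parameters()), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in image_loader:        # per-image holding / not-holding labels
            loss = loss_fn(atemporal_head(spatial_fe(images)), labels)
            opt.zero_grad(); loss.backward(); opt.step()
    # the atemporal head is discarded once this stage is complete

def finetune_full(full_system, clip_loader, epochs=5):
    """Stage 2: refine the whole spatiotemporal classifier on short
    (e.g. 32-frame) annotated video sequences."""
    opt = torch.optim.Adam(full_system.parameters(), lr=1e-5)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for clips, labels in clip_loader:          # clips: (batch, 32, C, H, W)
            loss = loss_fn(full_system(clips), labels)
            opt.zero_grad(); loss.backward(); opt.step()
```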
In one embodiment, both images and sequences of video frames may be stochastically transformed during training (i.e. data augmentation) to improve generalization of the trained classifier. Scaling, rotation, translation and color-shift (i.e. scaling values of individual color channels) and other transformations may be used for augmentation.
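By way of a hedged example (the transform ranges are arbitrary illustrative values and torchvision is an assumed dependency, not one named by the document), a per-sequence augmentation might be:

```python
import random
import torch
import torchvision.transforms.functional as TF

def augment_sequence(frames):
    """Sample one random scale / rotation / translation / colour-shift and apply it
    consistently to every frame of a sequence (frames: list of (3, H, W) tensors in [0, 1])."""
    angle = random.uniform(-10.0, 10.0)
    scale = random.uniform(0.9, 1.1)
    max_dx = int(0.05 * frames[0].shape[-1])
    max_dy = int(0.05 * frames[0].shape[-2])
    translate = [random.randint(-max_dx, max_dx), random.randint(-max_dy, max_dy)]
    gain = torch.tensor([random.uniform(0.8, 1.2) for _ in range(3)]).view(3, 1, 1)
    out = []
    for f in frames:
        f = TF.affine(f, angle=angle, translate=translate, scale=scale, shear=0.0)
        f = (f * gain).clamp(0.0, 1.0)             # per-channel colour scaling
        out.append(f)
    return out
```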
In one embodiment, the above-described architecture provides:
Robust detection of holding, relying on both spatial and temporal cues;
Flexibility to be run in real-time or on stored video;
Classifier architecture that facilitates training on smaller and more readily available video and image training datasets; and
Efficient use of compute, by restricting inferencing to the regions local to relevant holding ROIs.
Figure 11 is a block diagram illustrating an exemplary processing module 400 for executing the steps 302 to 312 of the method 300, in accordance with some embodiments. The processing module 400 typically includes one or more CPUs and/or GPUs 402 for executing modules or programs and/or instructions stored in memory 404 and thereby performing processing operations, memory 404, and one or more communication buses 406 for interconnecting these
components. The communication buses 406 optionally include circuitry (sometimes called a chipset) that interconnects and controls communications between system components. The memory 404 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM or other random access solid state memory devices, and may include non-volatile memory, such as one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, or other non-volatile solid state storage devices. The memory 404 optionally includes one or more storage devices remotely located from the CPU(s) and/or GPUs 402. The memory 404, or alternately the non-volatile memory device(s) within the memory 404, comprises a non-transitory computer readable storage medium. In some embodiments, the memory 404, or the computer readable storage medium of the memory 404, stores the following programs, modules, and data structures, or a subset thereof: a ROI extractor module 410 for extracting at least one region of interest from a sequence of initial images and cropping the initial images; a spatial feature extractor module 412 for extracting spatial features from the cropped initial images; a temporal feature extractor module 414 for generating temporal features; and a classifier module 416 for determining whether a holding is present from the temporal features.
Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above. The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures or modules, and thus various subsets of these modules may be combined or otherwise re-arranged in various embodiments. In some embodiments, the memory 404 may store a subset of the modules and data structures identified above. Furthermore, the memory 404 may store additional modules and data structures not described above.
Although it shows a processing module 400, Figure 11 is intended more as a functional description of the various features which may be present in a processing module than as a
structural schematic of the embodiments described herein. In practice, and as recognized by those of ordinary skill in the art, items shown separately could be combined and some items could be separated.
The embodiments of the invention described above are intended to be exemplary only. The scope of the invention is therefore intended to be limited solely by the scope of the appended claims.
Claims
1. A computer-implemented method for detecting a holding in images, comprising: receiving a sequence of initial images representing at least partially an entity; for each initial image, determining a respective region of interest around a predefined anatomical region of the entity, thereby obtaining a sequence of regions of interest; extracting latent spatial features from each respective region of interest, thereby obtaining a sequence of latent spatial features for each region of interest, the latent spatial features being indicative of the holding; generating a set of spatiotemporal features from the sequences of latent spatial features, the set of spatiotemporal features representing a sequence of visual states; determining from the set of spatiotemporal features whether the region of interest comprises the holding; and outputting an indication as to whether the holding is associated with the region of interest.
2. The computer-implemented method of claim 1, wherein said determining a respective region of interest around a predefined anatomical region of the entity comprises, for each initial image, identifying an anatomical region of the entity as corresponding to the predefined anatomical region, defining the respective region of interest that contains the identified anatomical region and cropping the initial image to obtain a respective cropped image containing the respective region of interest.
3. The computer-implemented method of claim 1 or 2, wherein the predefined anatomical region of the entity comprises one of a hand, an elbow and an underarm.
4. The computer-implemented method of any one of claims 1 to 3, further comprising resizing the respective region of interest to a predefined resolution.
5. The computer-implemented method of any one of claims 1 to 4, wherein each one of the latent spatial features is represented by an N-dimensional block of data.
6. The computer-implemented method of any one of claims 1 to 5, wherein the latent spatial features are chosen so as to localize at least one of salient visual cues and visual forms and relationships.
7. The computer-implemented method of claim 6, wherein the salient visual cues comprise at least one of edges, colors, patterns and shapes.
8. The computer-implemented method of any one of claims 1 to 7, wherein the sequence of initial images corresponds to video frames.
9. The computer-implemented method of claim 8, wherein said receiving the sequence of initial images comprises receiving a live video.
10. A system for detecting a holding in images, the system comprising: a region of interest (ROI) extractor configured for receiving a sequence of initial images representing at least partially an entity and, for each initial image, determining a respective region of interest around a predefined anatomical region of the entity to obtain a sequence of regions of interest; a spatial feature extractor configured for extracting latent spatial features from each region of interest to obtain a sequence of latent spatial features for each region of interest, the latent spatial features being indicative of the holding; a temporal feature extractor configured for generating a set of latent spatiotemporal features from the sequences of latent spatial features, the set of latent spatiotemporal features representing a sequence of visual states; and
a classifier configured for determining from the set of spatiotemporal features whether the region of interest comprises the holding and outputting an indication as to whether the holding is associated with the region of interest.
11. The system of claim 10, wherein the ROI extractor is configured for, for each initial image, identifying an anatomical region of the entity as corresponding to the predefined anatomical region, defining the respective region of interest that contains the identified anatomical region and cropping the initial image to obtain a respective cropped image containing the respective region of interest.
12. The system of claim 10 or 11, wherein the predefined anatomical region of the entity comprises one of a hand, an elbow and an underarm.
13. The system of any one of claims 10 to 12, wherein the ROI extractor is further configured for resizing the respective region of interest to a predefined resolution.
14. The system of any one of claims 10 to 13, wherein each one of the latent spatial features is represented by an N-dimensional block of data.
15. The system of any one of claims 10 to 14, wherein the latent spatial features are chosen so as to localize at least one of salient visual cues and visual forms and relationships.
16. The system of claim 15, wherein the salient visual cues comprise at least one of edges, colors, patterns and shapes.
17. The system of any one of claims 10 to 16, wherein the sequence of initial images corresponds to video frames.
18. The system of claim 17, wherein the video frames are part of a live video.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201862782503P | 2018-12-20 | 2018-12-20 | |
US62/782,503 | 2018-12-20 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020129025A1 (en) | 2020-06-25
Family
ID: 71100245
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/IB2019/061224 (WO2020129025A1) | Method and system for detecting holding in images | 2018-12-20 | 2019-12-20
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2020129025A1 (en) |
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10133933B1 (en) * | 2017-08-07 | 2018-11-20 | Standard Cognition, Corp | Item put and take detection using image recognition |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114186589A (en) * | 2021-12-08 | 2022-03-15 | 国网上海市电力公司 | Superconducting cable partial discharge mode identification method based on residual error network Resnet50 |
CN114360073A (en) * | 2022-01-04 | 2022-04-15 | 腾讯科技(深圳)有限公司 | Image identification method and related device |
Legal Events
Date | Code | Title | Description
---|---|---|---
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 19900439; Country of ref document: EP; Kind code of ref document: A1
 | NENP | Non-entry into the national phase | Ref country code: DE
 | 122 | Ep: pct application non-entry in european phase | Ref document number: 19900439; Country of ref document: EP; Kind code of ref document: A1