WO2018076122A1 - System and method for improving the prediction accuracy of a neural network - Google Patents

System and method for improving the prediction accuracy of a neural network

Info

Publication number
WO2018076122A1
Authority
WO
WIPO (PCT)
Prior art keywords
training
target
source
network
dataset
Prior art date
Application number
PCT/CA2017/051293
Other languages
French (fr)
Inventor
Roland MEMISEVIC
Peter Yianilos
Sumeet Sobti
Original Assignee
Twenty Billion Neurons GmbH
Priority date
Filing date
Publication date
Application filed by Twenty Billion Neurons GmbH
Priority to CA3041726A (CA3041726A1)
Priority to EP17864131.2A (EP3533002A4)
Priority to CN201780081578.6A (CN110431567A)
Publication of WO2018076122A1

Classifications

    • H04N21/23418 Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • H04N21/231 Content storage operation, e.g. caching movies for short term storage, replicating data over plural servers, prioritizing data for deletion
    • H04N21/251 Learning process for intelligent management, e.g. learning user preferences for recommending movies
    • H04N21/2743 Video hosting of uploaded data from client
    • H04N21/854 Content authoring

Definitions

  • the models are trained by using gradient-based optimization to minimize a cost function that quantifies how close the network output is to the desired output.
  • the desired output is determined by the labels.
  • the task is phrased as a question answering task where the input to the network is a video as well as a question or configuration string, and the output is a label or caption.
  • label may refer to class-labels or textual descriptions (captions) in the following.
  • the source network is trained on video clips of typically up to a few seconds duration. Each clip is associated with at least one label.
  • the label is typically "atomic" in that it describes the content of the clip as a whole, rather than describing a temporal sequence of events.
  • the source network aggregates frame-wise predictions into a single prediction for the clip. This makes it possible to train the source network on clips that have multiple different durations, and it makes it possible to run the network (or a target network that is initialized with the source network parameters) in real-time (a minimal sketch of this aggregation is given after this list).
  • the source network is trained by minimizing a cost function that depends on the network output for the clip and the ground-truth label for the clip. Parts of the target network are shared with the source network.
  • the target network is trained on videos that may have different durations than the source network. It may be trained by making and aggregating frame-wise predictions like the source network (as shown in the figure), for example, if it has to run in real-time, or it may be trained using different training criteria. It may also be trained with different learning approaches, including reinforcement learning.
  • the data is used to train neural networks
  • the primary purpose of the data or the trained networks is not to recognize the underlying visual concepts. Instead, after training the networks, the internal representations and routines are used to facilitate the developments of, or improve the prediction performance of, other networks on other tasks.
  • the source training data may be generated to take on the form of video clips, such that the label associated with each clip is temporally "atomic” in the sense that it describes or characterises the content of the video in its entirety.
  • This kind of atomicity is not meant to exclude juxtapositions of events, such as "Dropping a box, then dropping a pencil, then dropping a paper" or "Putting a cup on the table and putting a wallet on the table”.
  • "atomic” here means that the label describes the content of the video completely instead of referring in detail to the exact temporal location of parts of the action.
  • the whole auxiliary dataset or parts of it may be used. A part of the auxiliary dataset may suffice if the target task requires features that give rise to specific abilities, such as the ability to localize a moving person or a particular type of object.
  • the source network takes as input a video of a certain length and can output labels or captions.
  • the network typically contains 2d- and/or 3d-convolutional layers.
  • the network may have recurrent connections.
  • the parameters of the trained source network are used as an initialization for the parameters of the network trained on the target task (referred to herein as the target network).
  • the target network is simultaneously trained on both the source task and the target task, by minimizing a weighted combination of the cost functions associated with the source and target tasks.
  • the auxiliary data accordingly may improve the prediction accuracy on the target task, and in some cases make it possible to obtain satisfactory prediction accuracy from a very small number of training examples, such as a single or few training examples ("one-shot learning” or "few-shot learning”).
  • transfer learning can be applied as follows: the input image is replicated T times, where T is the number of frames (video length) that the source network expects as the input. The resulting tensor of repeated images is then provided as input to the target network. In other words, the image is treated as a (still) video.
  • image data can also be used as additional source training data by generating a (still) video from each image by replicating it T times, and by adding the training label for the image to the set of source training labels.
  • the target task may be a reinforcement learning task, such as training a robot control policy that involves visual feedback.
  • Video-based representations can capture information that is not present in images, such as actions, 3-D structure, partial occlusions, dynamics and affordances. Multi-modal representations of text and video can therefore correlate visual data with not only nouns but complex phrases and sentences containing verbs, adjectives, conjunctions and prepositions. This allows for a much more detailed level of grounding of natural language than image-based grounding of nouns.
  • Captioning models trained on datasets such as these fall short of learning fine-grained details about the depicted scenes: an "oracle" trained to turn individual, unordered nouns, verbs or adjectives into captions (without access to the visual input) can reach a higher accuracy than that of full captioning models.
  • video captioning can also provide visual grounding for natural language concepts. Specifically, training a neural network to generate textual descriptions of video leads to textual representations that carry information about the physical world.
  • a special case of the approach described in the previous section is human motion recognition.
  • a large database of random human motions and/or poses is used to generate internal representations that correspond to an implicit model of the human body and the physical constraints that govern its motions (such as the set of joints and corresponding degrees of freedom).
  • workers are instructed to film themselves performing random motions of a certain length using their bodies.
  • the resulting videos are then assigned random labels.
  • Individual videos are shown to other workers who are instructed to perform the same motion they saw in the videos. That way, a large labelled database of random human body motions is created that can be used as an auxiliary dataset specifically for target tasks that involve recognition of human motions.
  • workers are instructed to perform motions using just their hands, giving rise to an auxiliary dataset specifically for gesture recognition.
  • a dataset of labeled bounding boxes may be used as a second auxiliary task (that is added with its own weight to the weighted combination of cost functions) to support learning a representation of not only human motion patterns but also roughly the position of the person.
  • recurrent neural network (RNN)
  • the task of predicting natural language descriptions of the input video is used as an auxiliary task that improves the accuracy of recurrent neural networks performing other prediction tasks involving the video.
  • the tasks can include, but are not limited to, predicting future video features, future video pixels, and actions (such as robotic control signals) in reinforcement learning.
  • the approach can be viewed as a kind of model-based reinforcement learning, where the natural language decoder forces the recurrent neural network policy to represent a world model at the same time.
  • the systems and methods as described herein may be implemented as a non-transitory computer-readable storage medium configured with a computer program, wherein the storage medium so configured causes a computer to operate in a specific and predefined manner to perform at least some of the functions as described herein.
  • the videos that may be crowd-sourced generally span a variety of use cases, including, for example, human gestures (for automatic gesture recognition) and/or aggressive behavior (for video surveillance). Unlike gathering videos online, the use of video collection as described herein may also make it possible to generate video data for training generic visual feature extractors, which may be applicable across multiple different use-cases.
  • the system and method as described herein may use action groups and contrastive examples to ensure that video data is suitable for training machine learning models with minimal overfitting (i.e., making the model too complex).
  • Label templates (e.g. "Dropping [something] onto [something]”) may be used to sample the otherwise large space of (action/object)-combinations. Label templates may exploit the fact that (action/object)-combinations are highly unevenly distributed. They may make it possible for video providers to choose themselves appropriate objects in response to a given action template.
  • because label templates may interpolate between simple one-of-K labels and full-fledged video captions (textual descriptions), they may make it possible to collect videos incrementally and with increasing complexity.
  • the degree of complexity of the labels may be a function of the performance of machine learning models on the data collected so far.
  • a machine learning model trained on videos may learn to overfit on a given task by representing labels using tangential aspects of the input videos that do not really correspond to the meaning of the label at hand.
  • a model may learn to predict the label "dropping [something]", for example, as a function of whether a hand is visible in the top of the frames, in case the videos corresponding to other labels do not share this property.
  • a contrastive example may be an action which is very similar to a given action to be learned by the model, but which may contain one or several, potentially subtle, visual differences to that class, forcing the model to learn the true meaning of the action instead of tangential aspects.
  • Examples may be the "pretending"-classes.
  • a neural network model may learn to represent the "picking-up" action using the characteristic hand-motion of that action.
  • the class "Pretending to pick up" may contain the same hand-motion, and may just differ from the original class in that the object does not move.
  • contrastive class "Pretending to pick up" may force a neural network to capture the true meaning of the action "Picking up", preventing it from wrongly associating the mere hand-motion as the true information-carrying aspect of that class.
  • contrastive examples may be training examples that are close to the examples from the underlying class to be learned (like "Picking up”). Since they belong to a different class (here "Pretending to pick up”) they may force neural network models trained on the data to learn sharper decision boundaries.
  • contrastive classes may simply form an action group together with the underlying action class to which they provide contrast.
  • Action groups may be designed such that a fine-grained understanding of the activity may be required in order to distinguish the actions within a group.
  • An important type of action group may be obtained by combining an action type with a pretending-action, where the video provider may be prompted to pretend to perform an action without actually performing it.
  • an action group may consist of the actions "Picking up an object” and "Pretending to pick up an object (without actually picking it up)”.
  • Action groups may force neural networks trained on the data to closely observe the object instead of secondary cues such as hand positions. They may also force networks to learn and represent indirect visual cues, such as whether an object is present or not present in a particular region in the image.
  • action groups may be: "Putting something behind something / Pretending to put something behind something (but not actually leaving it there)”; “Putting something on top of something / Putting something next to something / Putting something behind something”; “Poking something so lightly that it does not or almost does not move / Poking something so it slightly moves / Poking something so that it falls over / Pretending to poke something”; “Pushing something to the left / turning the camera right while filming something”; “Pushing something to the right / turning the camera left while filming something”; “Pushing something so that it falls off the table / Pushing something so that it almost falls off the table”; “Stuffing/Taking out”, “Folding something”, “Holding something”, “crowd of things”, “Collisions of objects”, “Tearing something”, “Lifting/Tilting objects with other objects on them”, “Spinning something”, “Moving two objects relative to each other”, etc.
  • labels typically take the form of a one-of-K encoding, such that a given input image is assigned to one of K labels.
  • the labels typically correspond to actions.
  • most actions in a video typically involve one or more objects, and the roles of actions and objects may be naturally intertwined.
  • the task of predicting or of acting out an action verb may be closely related to the task of predicting or acting out the involved objects.
  • the phrase “opening NOUN” may have drastically different visual appearances, depending on whether “NOUN” in this phrase is replaced by "door”, “zipper”, “blinds", “bag”, or “mouth”.
  • label templates may be viewed as approximations to full natural language descriptions, and they may dynamically increase in complexity in response to the learning success of machine learning models, for example by incrementally introducing parts of speech such as adjectives or adverbs. This may make it possible to generate output phrases whose complexity may vary from very simple ("pushing a pencil") to very complex ("pulling a blue pencil on the table so hard that it falls down").
  • Slowly increasing the complexity of the data used to train a machine learning model is known as "curriculum learning".
  • the proposed method may be used to improve accuracy in different special purpose use-cases, such as:
  • a posture-monitoring system, i.e. a system that observes and ranks a user's posture and possibly notifies the user of bad posture
  • training data for building systems that recognize gaze-direction or changes in gaze-direction, as used, for example, in cars to determine if "auto-pilot"-functions are to be engaged or not
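The clip-level aggregation of frame-wise predictions mentioned in the list above could be sketched as follows. This is a minimal illustration only: the frame encoder, feature size, class count and the use of mean pooling as the aggregation are assumptions, not part of the disclosed networks.

    import torch
    import torch.nn as nn

    class FrameAggregatingClassifier(nn.Module):
        """Apply a per-frame encoder and classifier, then average the frame-wise
        logits into a single clip-level prediction (illustrative sketch)."""

        def __init__(self, frame_encoder: nn.Module, feature_dim: int, num_classes: int):
            super().__init__()
            self.frame_encoder = frame_encoder        # e.g. a 2d-convolutional network
            self.classifier = nn.Linear(feature_dim, num_classes)

        def forward(self, clip: torch.Tensor) -> torch.Tensor:
            # clip: (num_frames, channels, height, width); clips of any duration are accepted
            frame_logits = self.classifier(self.frame_encoder(clip))  # (num_frames, num_classes)
            return frame_logits.mean(dim=0)                           # aggregate over frames

Because the trunk operates frame by frame, the same network can in principle also be run incrementally on a live stream, which corresponds to the real-time property noted above.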

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computing Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Library & Information Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A system and method for improving the prediction accuracy of a neural network is proposed. Prediction tasks derived from labeled video data are used to regularize the feature space of the neural network so that it encodes constraints of the physical world while also learning to solve the original task at hand. The videos are generated by instructing humans to perform actions according to predefined labels or descriptions, so that a wide variety of physically relevant motion patterns are available to regularize the network.

Description

System and method for improving the prediction accuracy of a neural network
Cross-Reference to Related Applications
[0001] The present patent application claims the benefits of priority of US Patent Application No. 62/414,949, entitled "SYSTEM AND METHOD FOR TRAINING NEURAL NETWORKS FROM VIDEOS", and filed at the US Patent Office on October 31, 2016, and of US Patent Application No. 15/608,059, entitled "SYSTEM AND METHOD FOR VIDEO DATA COLLECTION", and filed at the US Patent Office on May 30, 2017, the contents of which are incorporated herein by reference.
Field of the Invention
[0002] The present invention generally relates to a system and method for improving the prediction accuracy of a neural network using transfer learning.
Background of the Invention
[0003] There has been an increasing interest recently in learning more about representations of physical aspects of the world using neural networks. Such representations are sometimes referred to as "intuitive physics" to contrast them with the symbolic/mathematical descriptions of the world developed in physics.
[0004] Many intelligent video analysis systems are based on a machine learning model, such as a neural network.
[0005] A neural network is a system that can be trained to perform a task (such as recognizing an object in an image or predicting the translation of a sentence from one language to another).
[0006] Although images still largely dominate research in visual deep learning, a variety of sizable labeled video datasets have been introduced. A dominating application domain has been action recognition, where the task is to predict a global action label for a given video. A potential drawback of action recognition datasets is that they are targeted at fairly high-level aspects of videos. Typically, a long video sequence is taken as input, producing a relatively small number of global class-labels as output. These datasets require features that can condense a long sequence, often including many scene changes, into a single label.
[0007] The predominant way of creating large, labeled datasets for training machine learning models is by starting with a large collection of input items, such as images or videos. Usually, these are found using online resources, such as Google Image Search or YouTube. Subsequently, the gathered input examples are labeled by human workers. Since the number of labels may be very large, it is common to use crowdworkers from services like Amazon Mechanical Turk (AMT) or Crowdflower to perform labeling.
[0008] To train a neural network to predict a label when given a video as the input, training data in the form of pairs including video and label (video, label) is needed. The number of these pairs has to be large to prevent the machine learning model from overfitting and to facilitate generalization.
[0009] The label may be in the form of one of K possible discrete values (this is commonly referred to as "classification"), or in the form of a sequence of multiple such values (this is commonly referred to as "structured prediction" and it subsumes the case that the label is a natural language sentence, which is also known as "video captioning").
[0010] One approach to training a neural network (commonly referred to as supervised learning) is to minimize a cost function that quantifies how badly the network performs the task on a training dataset D, consisting of pairs (x, y), where x is an input example and y is the corresponding desired output. Common cost functions include, but are not limited to, the cross-entropy loss (when the task is classification) or the squared error (when the task is regression). One way to minimize the cost function is gradient-based optimization, using gradients that are computed with the error backpropagation algorithm. If the cost function or parts of the network are not differentiable, gradient-based optimization with backpropagation can be combined with reinforcement learning.
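For illustration, a minimal sketch of such a supervised training loop in PyTorch is shown below; the model, data loader and hyperparameters are placeholders rather than part of the disclosure, and the cross-entropy criterion corresponds to the classification case mentioned above.

    import torch
    import torch.nn as nn

    def train_supervised(model, loader, epochs=10, lr=1e-3):
        """Minimize a cost function over a dataset D of (x, y) pairs (illustrative sketch)."""
        criterion = nn.CrossEntropyLoss()                      # cost function for classification
        optimizer = torch.optim.SGD(model.parameters(), lr=lr)
        for _ in range(epochs):
            for x, y in loader:                                # x: input example, y: desired output
                optimizer.zero_grad()
                loss = criterion(model(x), y)                  # how badly the network performs on (x, y)
                loss.backward()                                # gradients via error backpropagation
                optimizer.step()                               # gradient-based parameter update
        return model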
[0011] Prediction accuracy is generally a function of the size of the training dataset that the network is trained on. Since collecting training data can be costly or difficult, several approaches exist to improving prediction accuracy given a fixed training set size.
[0012] One approach, referred to as regularization, amounts to restricting the capacity of the network, which increases its ability to generalize. One way to achieve this is by adding to the cost function an additional cost function that quantifies the capacity of the network.
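As one hedged illustration of such a capacity-quantifying term, the sketch below augments the task cost with an L2 penalty on the weights; the weighting factor lambda_reg is a hypothetical hyperparameter and not prescribed by the description.

    def regularized_loss(model, criterion, x, y, lambda_reg=1e-4):
        """Task cost plus an additional cost that quantifies network capacity (L2 sketch)."""
        task_cost = criterion(model(x), y)
        capacity_cost = sum(p.pow(2).sum() for p in model.parameters())
        return task_cost + lambda_reg * capacity_cost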
[0013] A related approach, referred to as transfer learning, amounts to complementing the training dataset for the task at hand with additional training data that is merely related to the task at hand. The additional training data allows the network to develop an improved internal representation of the input data, which in turn allows it to yield improved performance on the target task. One way to perform transfer learning is to first train the network on the additional training data and subsequently train part of the network (often the last layer in the network) on data from the actual task at hand. The task associated with the additional training data can be referred to as source task or auxiliary task and the additional training data itself as source data or auxiliary data. The original task can be referred to as target task and the associated training data as target data. Collecting a training dataset for the source task is difficult, because the training dataset generally needs to be large to have a sufficiently strong impact on performance on the target task. A dataset that is often used for transfer learning is the "ImageNet"™ dataset created and managed by Stanford and Princeton Universities.
[0014] Transfer learning may be viewed as a special case of regularization, because forcing a network to solve multiple tasks at the same time effectively reduces the network's representational capacity in solving the target task in comparison to a network that is trained only on the target task.
[0015] In its standard form, the ImageNet dataset contains approximately one million images classified into 1000 object-classes. The primary task associated with the ImageNet dataset, and its original use, has been classification of the dominant object in the image into these 1000 classes. However, it has been observed that after training a convolutional neural network on the task, the network can be used as a feature extractor, and the features can be applied to other tasks. A common way to implement an ImageNet-based feature extractor is as follows: a convolutional neural network is first trained on the ImageNet dataset. The final (classification) layer of the network is then discarded. If it is a fully connected layer, the size of the weight matrix associated with this layer is (1000 x H), where H is the number of hidden units in the layer before (often referred to as the "penultimate layer"). The network up to the penultimate layer is then used as a component in another (target) network. When training the target network on the target task, the parameters of the ImageNet-based network can either be fixed or trained along with the additional parameters of the target network.
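A hedged sketch of this feature-extractor pattern is given below, using a torchvision ResNet as a stand-in for an ImageNet-trained convolutional network; the backbone choice and the number of target classes are placeholders.

    import torch.nn as nn
    import torchvision

    # Load a convolutional network trained on ImageNet, then discard its final layer.
    backbone = torchvision.models.resnet50(pretrained=True)   # ImageNet-trained stand-in
    H = backbone.fc.in_features                                # size of the penultimate layer
    backbone.fc = nn.Identity()                                # drop the final (classification) layer

    # Option 1: keep the pretrained parameters fixed during target training.
    for p in backbone.parameters():
        p.requires_grad = False

    # The network up to the penultimate layer becomes a component of the target network.
    num_target_classes = 42                                    # placeholder
    target_network = nn.Sequential(backbone, nn.Linear(H, num_target_classes))

Option 2 (fine-tuning) would instead leave requires_grad set to True, so that the pretrained parameters are trained along with the additional parameters of the target network.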
Summary of the Invention
[0016] The shortcomings of the prior art are generally mitigated by a system and method for improving the prediction accuracy of a neural network as described herein.
[0017] In a first aspect, a method performed by one or more computers for improving the prediction accuracy of a neural network is disclosed. The method comprises:
a. providing a neural network that is configured to receive an input video and predict a target label for the input video;
b. pre-training the neural network that includes a source training phase.
[0018] In at least one embodiment, the method may further comprise a pre-training step that includes a target training phase.
[0019] In another embodiment, the source training phase comprises training that adjusts the network's parameters to optimize performance on source labels that are different from the target labels based on a corresponding labeled video source action dataset.
[0020] In at least one embodiment, the target training phase comprises training to optimize performance on the target labels that performs significantly less parameter adjustment based on a corresponding labeled video target action dataset.
[0021] In another embodiment, the target training phase relies only on the target dataset and adjusts only a subset of the network's parameters, or a small number of newly introduced parameters.
[0022] In still another embodiment, the target training phase relies only on the target dataset and adjusts all or most of the network's parameters using regularization techniques to limit the magnitude of the adjustments.
[0023] In another embodiment, the target training phase relies on both the target and source datasets but employs a blended cost function to limit the magnitude of the adjustments resulting from the target dataset.
[0024] In another embodiment, the source training dataset comprises source training videos showing humans performing actions, and each source training video being associated with a question and an answer.
[0025] In another embodiment, the labels are generated by first showing a worker what other worker(s) have said, and then asking for a description, or question/answer, that is relevant but different.
[0026] In yet another embodiment, the source training dataset comprises source training videos showing humans performing sequences of one or more actions, and each source training video being associated with a corresponding sequence of one or more phrases.
[0027] In another embodiment, the source training dataset comprises random motion patterns performed by humans, the random motion patterns being generated by first instructing workers to invent and perform a random motion in front of the camera, and by subsequently instructing other workers to watch and then get filmed repeating the same motion.
[0028] In another embodiment, the method performed by one or more computers for improving the prediction accuracy of a neural network comprises creating a labeled video action dataset (either source or target) that is generated by asking humans to perform one or more actions, and recording the result.
[0029] In yet another embodiment, one or more discrete labels identifies the action, and the label is a caption and the caption contains a detailed textual description of the action and of the objects involved in the action.
[0030] In another embodiment, the target videos consist of a repetition of a single frame and some of the source videos may consist of a repetition of a single frame to allow for the use of labeled images as additional source training data.
[0031] In still another embodiment, a method for improving the prediction accuracy of a neural network performed by one or more computers is disclosed, the method comprising: a. providing a neural network that is configured to receive an input video and predict a target label for the input video, the label being a class or caption;
b. generating a source training dataset comprising source training videos showing humans performing actions, and each source training video being associated with a respective label;
c. pre-training the neural network using the source training dataset.
[0032] In another embodiment, the source training dataset is used to improve generalization performance of the neural network trained on the target task, by initializing the network using training on the source dataset and subsequently training a subset or all parameters on the target dataset.
[0033] In another embodiment, the source training dataset is used to improve generalization performance of the neural network trained on the target task, by regularizing the network by simultaneously training on the source and target datasets using a weighted combination of cost functions.
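The weighted combination of cost functions referred to in paragraphs [0023] and [0033] could be sketched as follows; the shared trunk, the two task heads and the weight alpha are illustrative placeholders under the assumption that source and target tasks share most of the network.

    def blended_loss(shared_net, source_head, target_head,
                     source_batch, target_batch, criterion, alpha=0.5):
        """Weighted combination of source-task and target-task costs (illustrative sketch)."""
        xs, ys = source_batch                       # labeled source (auxiliary) videos
        xt, yt = target_batch                       # labeled target videos
        source_cost = criterion(source_head(shared_net(xs)), ys)
        target_cost = criterion(target_head(shared_net(xt)), yt)
        return alpha * source_cost + (1.0 - alpha) * target_cost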
[0034] In another embodiment, the label is a caption and the caption contains a detailed textual description of the action and of the objects involved in the action.
[0035] In yet another embodiment, the input data for the target task is images not videos and the images are provided to the neural network as a "still" video of repeated images.
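A minimal sketch of this image-to-"still"-video conversion, assuming PyTorch tensors, is shown below; T is whatever clip length the network expects as input.

    import torch

    def image_to_still_video(image: torch.Tensor, T: int) -> torch.Tensor:
        """Repeat a single image T times along a new time axis: (C, H, W) -> (T, C, H, W)."""
        return image.unsqueeze(0).expand(T, -1, -1, -1).clone()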
[0036] In yet another embodiment, the source training dataset comprises random motion patterns performed by humans.
[0037] In another embodiment, the target network is a recurrent neural network that is configured to receive an input video stream and to generate, as each frame is processed, a corresponding output (not necessarily linguistic in nature) relevant to the ultimate purpose of the network (such as actions or robotic control signals); the source task being used to regularize the recurrent network by asking it to generate a series of linguistic outputs (a narrative) while solving the target task.
[0038] Other and further aspects and advantages of the present invention will be obvious upon an understanding of the illustrative embodiments about to be described or will be indicated in the appended claims, and various advantages not referred to herein will occur to one skilled in the art upon employment of the invention in practice.
Brief Description of the Drawings
[0039] The above and other aspects, features and advantages of the invention will become more readily apparent from the following description, reference being made to the accompanying drawings in which:
[0040] Figure 1 is a schematic diagram of transfer learning using videos, in accordance with at least one embodiment.
Detailed Description of the Preferred Embodiment
[0041] A novel system and method for improving the prediction accuracy of a neural network will be described hereinafter. Although the invention is described in terms of specific illustrative embodiments, it is to be understood that the embodiments described herein are by way of example only and that the scope of the invention is not intended to be limited thereby.
[0042] The degree to which supplementing a training dataset for a given task with additional data (transfer learning) improves recognition performance is dependent on the supplementary training data, and on how the supplementary training data relates to the target task. Described herein is an approach to generating supplementary training data that can improve accuracy robustly across a variety of recognition tasks.
[0043] It has been found that, in general, transfer learning works well if the source task requires a network to distinguish subtle characteristics in the input. For example, one of the most commonly used datasets for transfer learning is ImageNet. A key characteristic of the ImageNet dataset, besides its large size, is that it contains a comparably large number of classes, many of which are similar to one another, such that they can be distinguished from one another only by focussing on subtle aspects of the visual input. For example, the labels associated with the ImageNet dataset contain a variety of different dog breeds, and networks trained on this data need to be able to distinguish, among other attributes, those that allow the different dog breeds to be differentiated.
[0044] Correspondingly, it is described herein that a training dataset can be generated to create optimal representations of visual input, thereby improving recognition accuracy via transfer learning. In contrast to ImageNet and other existing datasets, the sole purpose of such a dataset is to optimize transfer learning performance.
[0045] The features that a neural network learns in response to being trained on a task are ultimately related to physical aspects of the world. For example, a network trained on ImageNet may learn to distinguish dog breeds using features that represent the relative sizes, distances and shapes of the eyes and other facial features of the animal. The features are useful in transfer learning, because such representations of sizes, distances and shapes are useful in other tasks.
[0046] Images as an input domain are suboptimal for learning to represent physical aspects of the world, because they fail to reveal some aspects and can reveal others only implicitly. For example, an object that is partly occluded may be represented partly correctly by an image-trained network in that a subset of relevant features are present in the representation. But the representation will fail to encode the fact that the presence of another, occluding object makes it likely that the missing features are present even though they are not visible. The insufficiency of images is addressed by using video as the input domain.
[0047] A dataset that forces a network to learn representations of physical aspects of the world needs to contain a wide variety of spatio-temporal patterns that are representative of those aspects. These patterns need to include, but are not limited to, occlusion effects, motion patterns, pose-variations, reflections, effects of gravity, object materials, and object structure (for example, articulated versus rigid objects).
[0048] Humans are highly trained in manipulating objects and moving their bodies in controlled ways. Especially, dextrous manipulation skills are highly developed in humans, and by moving objects with their hands humans are able to generate a wide variety of the required spatio-temporal patterns.
[0049] The approach described is based upon the crowd-sourced generation of videos described and claimed in co-pending US application no. 15/608,059. Crowd workers (or other persons filming the training videos) are requested to provide video clips of themselves and/or of other persons performing actions in front of the camera.
[0050] Data collected with crowd workers is used to train discriminative machine learning models, which may take a series of videos and associated labels as input.
[0051] The videos may show a whole person or just part of a person's body. A preference is given to hands in order to profit from the strong dextrous manipulation skills humans exhibit. Some of the actions that workers are instructed to perform may involve objects, some may involve other persons, some may involve both, and some may involve neither of those (pure body motions). Actions are chosen such that they cover a wide spectrum of spatio-temporal patterns. Training a neural network on prediction targets associated with the videos, which will be discussed below, will force the network - to the degree that it succeeds in making the right predictions - to implicitly develop internal representations of the physical constraints that govern the spatio-temporal patterns.
[0052] More specifically, videos and prediction tasks that show persons impose a model of the physical human body in the trained networks. For example, a network that can correctly predict if or where in a video two hands touch has to develop an internal representation of spatial relations. A network that can more generally correctly distinguish a large number of random hand gestures has to develop an internal representation of arms and hands to the degree that it can track their three-dimensional positions and distinguish these from the background.
[0053] Videos and prediction tasks that show manipulations of objects similarly enforce encodings of physical aspects of the world and the objects. These include information about cardinality, relative positions of objects over time, material, weight, or shape, and how these affect the objects' behaviour in the physical world. For example, a network that can predict whether a ball will bounce or stay on the ground when dropped has to infer an implicit representation of the ball's material, such as rubber versus metal, from the appearance of the object.
[0054] Besides representations, networks can learn computational procedures, or "routines", when trained on supervised tasks. These can include counting of objects or of events, localization in space or in time, or memorization procedures. Like representations, routines can be common building blocks that can be shared across tasks.
[0055] A sufficient condition for a network to develop useful internal representations is that the network is able to predict the exact appearance of future frames in the video from past frames. However, predicting frames is computationally demanding. The proposal is thus to generate training data consisting of videos and prediction targets that are sufficiently varied that, without the internal representations or routines, the network would be unable to solve the task. By learning to solve the task, on the other hand, the network will implicitly acquire the required representations and routines.
[0056] Actions may include, but are not limited to: round objects rolling along a surface; non-round objects getting stuck on the same surface; non-round objects sliding across the surface if it is steep enough; objects rolling or sliding down fast or slowly depending on the steepness of the surface; objects being moved across a surface; objects being moved until they fall down; objects being pushed so that they pass other objects; objects being pushed so that they move other objects; objects falling slowly (for example, papers or feathers); objects falling fast (for example, stones); objects spinning on a surface but stopping quickly due to strong friction; objects spinning on a surface for a long time due to the lack of strong friction; objects being moved behind other objects so that they get occluded; objects being moved from behind other objects so that they become visible; objects being moved in front of other objects such that the other objects get occluded; objects being moved partly from behind other objects so that they become partly visible; objects (such as paper) being folded, unfolded, torn, or torn partly; any action being performed N times; any combination of actions being performed N times in any prespecified order; etc.
[0057] In at least one embodiment of the invention, contrastive examples are used to ensure that the recognition tasks are difficult and involve highly specific aspects of the visual scene. The learned representations thereby have to enable sharp decision boundaries in the feature space.
[0058] In at least one embodiment of the invention, placeholders can be used to enrich the set of classes so that they involve not only actions but also nouns, adjectives, etc. Free-form captions that describe actions in detail can also be used instead of action classes. A known problem with existing image and video captioning datasets is that the captions are broad descriptions of the scene. This is in part because captioning datasets are sourced from descriptions that consumers add to existing images or videos.
[0059] In at least one embodiment of the invention, computer graphics may be used to render videos of actions or their effects synthetically.
[0060] The models are trained by using gradient-based optimization to minimize a cost function that quantifies how close the network output is to the desired output. The desired output is determined by the labels. [0061] In at least one embodiment of the invention, the task is phrased as a question answering task where the input to the network is a video as well as a question or configuration string, and the output is a label or caption.
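The following is a minimal sketch of such a gradient-based training step, assuming a PyTorch-style setup. The VideoClassifier architecture, tensor shapes and optimizer are illustrative assumptions rather than details of the invention; the question-answering variant of [0061] would add an encoded question or configuration string as a second network input.

```python
# Minimal sketch (illustrative, not the claimed implementation) of one
# gradient-based training step that minimizes a cost function measuring how
# close the network output for a clip is to the desired label.
import torch
import torch.nn as nn

class VideoClassifier(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        # Small, assumed 3d-convolutional encoder followed by a linear classifier.
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
        )
        self.classifier = nn.Linear(16, num_classes)

    def forward(self, clips):                       # clips: (batch, 3, T, H, W)
        return self.classifier(self.encoder(clips))

def training_step(model, optimizer, clips, labels):
    """One optimization step; `labels` encode the desired outputs for the clips."""
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(clips), labels)   # the cost function
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage with a stochastic-gradient optimizer:
# model = VideoClassifier(num_classes=100)
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
```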
Utilizing learned representations
[0062] In the following, the term label may refer to class labels or textual descriptions (captions).
[0063] As shown in Figure 1, the source network is trained on video clips of typically up to a few seconds duration. Each clip is associated with at least one label. The label is typically "atomic" in that it describes the content of the clip as a whole, rather than describing a temporal sequence of events. The source network aggregates frame-wise predictions into a single prediction for the clip. This makes it possible to train the source network on clips of different durations, and it makes it possible to run the network (or a target network that is initialized with the source network parameters) in real time. The source network is trained by minimizing a cost function that depends on the network output for the clip and the ground-truth label for the clip. Parts of the target network are shared with the source network. These may be parameters in lower layers of the network, as shown in the figure, or they may be other parts of the network. The target network is trained on videos that may have different durations than those used for the source network. It may be trained by making and aggregating frame-wise predictions like the source network (as shown in the figure), for example if it has to run in real time, or it may be trained using different training criteria. It may also be trained with different learning approaches, including reinforcement learning.
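A hedged sketch of the frame-wise aggregation described above follows; the per-frame encoder and the use of a simple average over time are assumptions chosen for brevity, not requirements of the invention.

```python
# Sketch: a per-frame encoder produces one prediction per frame, and the
# clip-level prediction is their average, so clips of different durations can
# be handled by the same network and predictions can be emitted in real time.
import torch
import torch.nn as nn

class FrameWiseAggregator(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        self.frame_encoder = nn.Sequential(           # assumed 2d-convolutional encoder
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(16, num_classes)

    def forward(self, clip):                           # clip: (T, 3, H, W), any T
        per_frame_logits = self.head(self.frame_encoder(clip))   # (T, num_classes)
        return per_frame_logits.mean(dim=0)            # single prediction for the clip
```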
[0064] Although the data is used to train neural networks, the primary purpose of the data or the trained networks is not to recognize the underlying visual concepts. Instead, after training the networks, the internal representations and routines are used to facilitate the development of, or improve the prediction performance of, other networks on other tasks.
[0065] To this end, the source training data may be generated to take on the form of video clips, such that the label associated with each clip is temporally "atomic" in the sense that it describes or characterises the content of the video in its entirety. This kind of atomicity is not meant to exclude juxtapositions of events, such as "Dropping a box, then dropping a pencil, then dropping a paper" or "Putting a cup on the table and putting a wallet on the table". Rather, "atomic" here means that the label describes the content of the video completely instead of referring in detail to the exact temporal location of parts of the action. For training the source network, the whole auxiliary dataset or parts of it may be used. A part of the auxiliary dataset may suffice if the target task requires features that give rise to specific abilities, such as the ability to localize a moving person or a particular type of object.
[0066] The source network takes as input a video of a certain length and can output labels or captions. The network typically contains 2d- and/or 3d-convolutional layers. The network may have recurrent connections.
[0067] To utilize the representations and routines learned from the auxiliary dataset in a video recognition target task, in at least one embodiment of the invention, the parameters of the trained network, or of part of the trained network, are used as an initialization for the parameters of the network trained on the target task (referred to herein as the target network).
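A minimal illustration of this initialization, assuming the shared part is a lower-layer encoder and using generic stand-in architectures (not the actual networks of the invention):

```python
import torch.nn as nn

def build_network(num_classes: int) -> nn.Module:
    # Stand-in architecture: index 0 is the shared encoder, index 1 the task head.
    return nn.Sequential(
        nn.Sequential(nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool3d(1), nn.Flatten()),
        nn.Linear(16, num_classes),
    )

source_net = build_network(num_classes=200)   # assumed already trained on the source task
target_net = build_network(num_classes=10)    # fresh head for the target labels

# Copy the trained source parameters into the shared layers of the target network;
# the target head (and optionally the encoder) is then fine-tuned on the target task.
target_net[0].load_state_dict(source_net[0].state_dict())
```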
[0068] In another embodiment, the target network is simultaneously trained on both the source task and the target task, by minimizing a weighted combination of the cost functions associated with the source and target tasks.
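As a sketch, the weighted combination can be written as a single cost evaluated on one source batch and one target batch; the weight is a hyperparameter, not a value specified by the invention.

```python
import torch.nn as nn

def combined_loss(shared_encoder, source_head, target_head,
                  source_batch, target_batch, source_weight=0.5):
    """Weighted combination of the source-task and target-task cost functions."""
    src_clips, src_labels = source_batch
    tgt_clips, tgt_labels = target_batch
    src_loss = nn.functional.cross_entropy(source_head(shared_encoder(src_clips)), src_labels)
    tgt_loss = nn.functional.cross_entropy(target_head(shared_encoder(tgt_clips)), tgt_labels)
    return source_weight * src_loss + (1.0 - source_weight) * tgt_loss
```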
[0069] The auxiliary data accordingly may improve the prediction accuracy on the target task, and in some cases make it possible to obtain satisfactory prediction accuracy from a very small number of training examples, such as a single or few training examples ("one-shot learning" or "few-shot learning").
[0070] If the input data in the target task is a still image (such as in an object classification task), then transfer learning can be applied as follows: the input image is replicated T times, where T is the number of frames (video length) that the source network expects as the input. The resulting tensor of repeated images is then provided as input to the target network. In other words, the image is treated as a (still) video.
[0071] Conversely, image data can also be used as additional source training data by generating a (still) video from each image by replicating it T times, and by adding the training label for the image to the set of source training labels.
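Both directions reduce to the same tensor operation; a minimal sketch, assuming channel-first image tensors, is shown below.

```python
import torch

def image_to_still_video(image: torch.Tensor, num_frames: int) -> torch.Tensor:
    """Replicate a (C, H, W) image num_frames times, yielding a (C, T, H, W)
    "still" video that can be fed to a video network, or added (with its label)
    to the source training data."""
    return image.unsqueeze(1).repeat(1, num_frames, 1, 1)

# Example: a 224x224 RGB image becomes a 16-frame still video.
# still_clip = image_to_still_video(torch.rand(3, 224, 224), num_frames=16)
```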
[0072] The target task may be a reinforcement learning task, such as training a robot control policy that involves visual feedback.
Visual Grounding of natural language
[0073] Training a neural network to generate textual descriptions of video leads to textual representations that carry information about the visual world. Unlike multi-modal (image, word)-representations, video-based representations can capture information that is not present in images, such as actions, 3-D structure, partial occlusions, dynamics and affordances. Multi-modal representations of text and video can therefore correlate visual data with not only nouns but complex phrases and sentences containing verbs, adjectives, conjunctions and prepositions. This allows for a much more detailed level of grounding of natural language than image-based grounding of nouns.
[0074] Existing video captioning datasets contain phrases rather than single words, but they are based on context-dependent, high-level descriptions.
[0075] Captioning models trained on datasets such as these (as well as on similar image-based caption datasets) fall short of learning fine-grained details about the depicted scenes: an "oracle" trained to turn individual, unordered nouns, verbs or adjectives into captions (without access to the visual input) can reach a higher accuracy than that of full captioning models.
[0076] This shows that a large part of the predictive capability that has been achieved so far on captioning is due to a strong language model rather than a strong image or video recognition model.
[0077] By contrast, to enforce representations within the network that capture physical details, narrow descriptions that cover fine-grained details of the unfolding scene are needed.
[0078] By encoding the basic physical aspects of the visual world as opposed to complex cultural phenomena, video captioning can also provide visual grounding for natural language concepts. Specifically, training a neural network to generate textual descriptions of video leads to textual representations that carry information about the physical world.
[0079] A large fraction of the predictive capability that has been achieved so far on captioning is due to a strong language model rather than a strong image or video recognition model.
Human motion and gesture recognition
[0080] A special case of the approach described in the previous section is human motion recognition. A large database of random human motions and/or poses is used to generate internal representations that correspond to an implicit model of the human body and the physical constraints that govern its motions (such as the set of joints and corresponding degrees of freedom).
[0081] In at least one embodiment, workers are instructed to film themselves performing random motions of a certain length using their bodies. The resulting videos are then assigned random labels. Individual videos are shown to other workers who are instructed to perform the same motion they saw in the videos. That way, a large labelled database of random human body motions is created that can be used as an auxiliary dataset specifically for target tasks that involve recognition of human motions.
[0082] In at least one embodiment, workers are instructed to perform motions using just their hands, giving rise to an auxiliary dataset specifically for gesture recognition.
[0083] If the task involves localizing persons who are not necessarily shown prominently in the center of the video, a dataset of labeled bounding boxes may be used as a second auxiliary task (added with its own weight to the weighted combination of cost functions) to support learning a representation not only of human motion patterns but also of the rough position of the person.
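A minimal sketch of such a cost with a separately weighted bounding-box term follows; the weight and loss choices are assumptions.

```python
import torch.nn as nn

def motion_and_localization_loss(motion_logits, motion_labels,
                                 box_preds, box_targets, box_weight=0.3):
    """Auxiliary bounding-box task added with its own weight to the combined cost."""
    motion_term = nn.functional.cross_entropy(motion_logits, motion_labels)
    box_term = nn.functional.smooth_l1_loss(box_preds, box_targets)
    return motion_term + box_weight * box_term
```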
Regularizing the state-space of a recurrent neural network (RNN)
[0084] The task of generating a sequence or of continuing a sequence with a recurrent neural network has been successful in the domain of natural language. But it has been unsuccessful so far in other domains, including video. It has also been unsuccessful in the context of reinforcement learning, where the goal is to generate a sequence of actions. As a result, reinforcement learning policies are not commonly represented using recurrent neural networks but using (feedforward) Q-functions instead.
[0085] The fact that a recurrent network can output syntactically well-formed language demonstrates that training on language generation can successfully force hidden states of an RNN onto trajectories in the RNN feature space which correspond (with a very high frequency) to well-formed language. This amounts to letting the computation in the RNN be accompanied by a continuous "narrative" that describes the state of the system.
[0086] The state-space of a recurrent network that is trained to generate natural language descriptions of video is naturally forced onto stable trajectories corresponding to well-formed language, which at the same time constitute natural language descriptions of the video content.
[0087] In one embodiment of the invention, the task of predicting natural language descriptions of the input video is used as an auxiliary task that improves the accuracy of recurrent neural networks performing other prediction tasks involving the video. The tasks can include, but are not limited to, predicting future video features, future video pixels, and actions (such as robotic control signals) in reinforcement learning. In the latter case, the approach can be viewed as a kind of model-based reinforcement learning, where the natural language decoder forces the recurrent neural network policy to represent a world model at the same time.
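A sketch of this idea, assuming a GRU whose hidden state feeds both a primary prediction head (for example, control signals or future features) and an auxiliary word decoder; the layer sizes, loss choices and auxiliary weight below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CaptionRegularizedRNN(nn.Module):
    def __init__(self, feat_dim=128, hidden=256, vocab_size=1000, action_dim=8):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.primary_head = nn.Linear(hidden, action_dim)   # e.g. control signals or future features
        self.caption_head = nn.Linear(hidden, vocab_size)   # auxiliary word logits (the "narrative")

    def forward(self, frame_features):          # frame_features: (batch, T, feat_dim)
        states, _ = self.rnn(frame_features)    # (batch, T, hidden)
        return self.primary_head(states), self.caption_head(states)

def regularized_loss(primary_out, caption_logits, primary_targets, word_targets, aux_weight=0.1):
    # The captioning term keeps the hidden-state trajectory on "well-formed" paths.
    main = nn.functional.mse_loss(primary_out, primary_targets)
    aux = nn.functional.cross_entropy(caption_logits.flatten(0, 1), word_targets.flatten())
    return main + aux_weight * aux
```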
Other aspects
[0088] In at least one embodiment, the systems and methods as described herein may be implemented as a non-transitory computer-readable storage medium configured with a computer program, wherein the storage medium so configured causes a computer to operate in a specific and predefined manner to perform at least some of the functions as described herein.
[0089] The videos that may be crowd-sourced generally span a variety of use cases, including, for example, human gestures (for automatic gesture recognition) and/or aggressive behavior (for video surveillance). Unlike gathering videos online, the use of video collection as described herein may also make it possible to generate video data for training generic visual feature extractors, which may be applicable across multiple different use-cases.
[0090] The system and method as described herein may use action groups and contrastive examples to ensure that video data is suitable for training machine learning models with minimal overfitting (making the model too complex).
[0091] Label templates (e.g. "Dropping [something] onto [something]") may be used to sample the otherwise large space of (action/object)-combinations. Label templates may exploit the fact that (action/object)-combinations are highly unevenly distributed. They may make it possible for video providers to choose appropriate objects themselves in response to a given action template.
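As a purely hypothetical illustration, instantiating a template with provider-chosen objects amounts to simple placeholder substitution.

```python
def fill_template(template: str, chosen_objects: list[str]) -> str:
    """Replace each [something] placeholder, left to right, with a chosen object."""
    for obj in chosen_objects:
        template = template.replace("[something]", obj, 1)
    return template

label = fill_template("Dropping [something] onto [something]", ["a key", "a pillow"])
# -> "Dropping a key onto a pillow"
```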
[0092] Through the use of label templates (previous point), it is also possible to use curriculum learning. Since label templates may interpolate between simple one-of-K labels and full-fledged video captions (textual descriptions), they may make it possible to collect videos incrementally and with increasing complexity. The degree of complexity of the labels may be a function of the performance of machine learning models on the data collected so far.
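One way to sketch such a curriculum is a schedule that releases more complex template tiers as validation accuracy improves; the thresholds and example templates below are assumptions, not values from the invention.

```python
# Hypothetical curriculum schedule over label-template complexity.
TEMPLATE_TIERS = [
    ["Pushing [something]"],                                          # close to one-of-K labels
    ["Pushing [something] to the left", "Pushing [something] off the table"],
    ["Pulling [something] on the table so hard that it falls down"],  # close to full captions
]

def templates_for(validation_accuracy: float) -> list[str]:
    if validation_accuracy < 0.5:
        return TEMPLATE_TIERS[0]
    if validation_accuracy < 0.8:
        return TEMPLATE_TIERS[1]
    return TEMPLATE_TIERS[2]
```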
[0093] It may be possible to track, and, if applicable, react to, similarities between videos that may be harmful to the machine learning models, i.e. that affect the ability of machine learning models to generalize.
[0094] A machine learning model trained on videos may learn to overfit on a given task by representing labels using tangential aspects of the input videos that do not really correspond to the meaning of the label at hand. A model may learn to predict the label "dropping [something]", for example, as a function of whether a hand is visible at the top of the frame, if the videos corresponding to other labels do not share this property.
[0095] A contrastive example (or "contrastive class") may be an action which is very similar to a given action to be learned by the model, but which contains one or several, potentially subtle, visual differences from that class, forcing the model to learn the true meaning of the action instead of tangential aspects. Examples are the "pretending" classes. For example, a neural network model may learn to represent the "picking-up" action using the characteristic hand-motion of that action. The class "Pretending to pick up" may contain the same hand-motion, and may differ from the original class only in that the object does not move. That way, the contrastive class "Pretending to pick up" may force a neural network to capture the true meaning of the action "Picking up", preventing it from wrongly treating the mere hand-motion as the true information-carrying aspect of that class. Geometrically, contrastive examples may be training examples that are close to the examples from the underlying class to be learned (like "Picking up"). Since they belong to a different class (here "Pretending to pick up"), they may force neural network models trained on the data to learn sharper decision boundaries.
[0096] Technically, contrastive classes may simply form an action group together with the underlying action class to which they provide contrast.
[0097] In order to prevent a machine learning model from overfitting, and to force networks to develop a fine-grained understanding of the true underlying visual concepts, labels may be grouped into action groups.
[0098] Action groups may be designed such that a fine-grained understanding of the activity may be required in order to distinguish the actions within a group.
[0099] An important type of action group may be obtained by combining an action type with a pretending-action, where the video provider may be prompted to pretend to perform an action without actually performing it.
[00100] For example, an action group may consist of the actions "Picking up an object" and "Pretending to pick up an object (without actually picking it up)". Action groups may force neural networks trained on the data to closely observe the object instead of secondary cues such as hand positions. They may also force networks to learn and represent indirect visual cues, such as whether an object is present or not present in a particular region in the image. [00101] Other examples of action groups may be: "Putting something behind something / Pretending to put something behind something (but not actually leaving it there)"; "Putting something on top of something / Putting something next to something / Putting something behind something"; "Poking something so lightly that it does not or almost does not move / Poking something so it slightly moves / Poking something so that it falls over / Pretending to poke something"; "Pushing something to the left / turning the camera right while filming something"; "Pushing something to the right / turning the camera left while filming something"; "Pushing something so that it falls off the table / Pushing something so that it almost falls off the table"; "Stuffing/Taking out", "Folding something", "Holding something", "crowd of things", "Collisions of objects", "Tearing something", "Lifting/Tilting objects with other objects on them", "Spinning something", "Moving two objects relative to each other", etc.
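A hypothetical encoding of action groups as data, pairing each action with its contrastive "pretending" counterpart, could look as follows; the class names echo the examples above, but the structure itself is an assumption.

```python
ACTION_GROUPS = {
    "pick_up": [
        "Picking up [something]",
        "Pretending to pick up [something] (without actually picking it up)",
    ],
    "put_behind": [
        "Putting [something] behind [something]",
        "Pretending to put [something] behind [something] (but not actually leaving it there)",
    ],
}

# The training label set is the union of all group members; classes within a
# group differ only in fine-grained visual details, sharpening decision boundaries.
ALL_LABELS = [label for group in ACTION_GROUPS.values() for label in group]
```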
[00102] In image recognition systems and datasets, labels typically take the form of a one-of-K encoding, such that a given input image is assigned to one of K labels. In currently existing video recognition datasets the labels typically correspond to actions. However, most actions in a video typically involve one or more objects, and the roles of actions and objects may be naturally intertwined. As a result, the task of predicting or of acting out an action verb may be closely related to the task of predicting or acting out the involved objects.
[00103] For example, the phrase "opening NOUN" may have drastically different visual appearances, depending on whether "NOUN" in this phrase is replaced by "door", "zipper", "blinds", "bag", or "mouth". There may be also commonalities between these instances of "opening", like the fact that parts are moved to the sides giving way to what is behind. It is, of course, exactly these commonalities which may define the concept of "opening". Therefore, understanding of the underlying meaning of the action word "opening" depends on the ability to generalize across these different use cases.
[00104] What may make collection of video data in terms of actions and objects challenging is that the Cartesian product of actions and objects constitutes a space so large that it may be hard to sample it sufficiently densely, as needed for most practical applications. However, the probability density of real-world cases in the space of permissible actions and objects is far from uniform.
[00105] For example, many actions, such as "Moving an elephant on the table" or "Pouring paper from a cup", for example, may have almost zero density. And the combinations that are more reasonable can nevertheless have highly variable probabilities. Consider, for example, "drinking from a plastic bag" (highly rare) vs. "dropping a piece of paper" (highly common).
[00106] In order to obtain samples from the Cartesian product of actions and objects, it may be possible to exploit this highly non-uniform (equivalently, low-entropy) distribution over actions and objects.
[00107] Label templates may be viewed as approximations to full natural language descriptions, and they may dynamically increase in complexity in response to the learning success of machine learning models, for example by incrementally introducing parts of speech such as adjectives or adverbs. This may make it possible to generate output phrases whose complexity varies from very simple ("pushing a pencil") to very complex ("pulling a blue pencil on the table so hard that it falls down").
[00108] Slowly increasing the complexity of the data used to train a machine learning model is known as "curriculum learning".
Training data sets
[00109] The proposed method may be used to improve accuracy in different special purpose use-cases, such as:
• building systems that detect that a person fell down (for example, as used in elderly care applications);
• for building systems that provide personal exercise-coaching by observing the quality of, or counting, physical exercises, such as push-ups, sit-ups, "crunches";
• for building systems that provide meditation-, yoga-, or concentration-coaching by observing the pose, posture and/or motion patterns of a person;
• to collect training data for building gesture recognition systems using RGB cameras;
• to collect training data for building controllers for video games, so that these games may be played without the need to hold, or otherwise keep close to the body, any physical device. Examples include driving games, fighting games and dancing games;
• to collect training data for building interfaces to music or sound generation programs or systems, for example, an "air guitar" system that generates sounds in response to, and as a function of, imaginary guitar, drum, or other musical instrument play;
• to collect training data for building a posture-monitoring system, i.e. a system that observes and ranks a user's posture and possibly notifies the user of bad posture;
• to collect training data for building systems that recognize gaze-direction or changes in gaze-direction, as used, for example, in cars to determine whether "auto-pilot" functions are to be engaged;
• to collect training data for building systems that may recognize that an object was left behind, as used in video surveillance applications of public spaces;
• to collect training data for building systems that recognize that an object was carried away, as used, for example, in domestic surveillance applications.
[001 10] While illustrative and presently preferred embodiment(s) of the invention have been described in detail hereinabove, it is to be understood that the inventive concepts may be otherwise variously embodied and employed and that the appended claims are intended to be construed to include such variations except insofar as limited by the prior art.

Claims
1) A method performed by one or more computers for improving the prediction accuracy of a neural network, the method comprising:
a. providing a neural network that is configured to receive an input video and predict a target label for the input video;
b. pre-training the neural network, wherein the pre-training includes a source training phase.
2) A method as claimed in claim 1), wherein the pre-training step includes a target training phase.
3) A method as claimed in claim 1) or 2), wherein the source training phase comprises training that adjusts the network's parameters to optimize performance on source labels that are different from the target labels based on a corresponding labeled video source action dataset.
4) A method as claimed in claim 1) or 2), wherein the target training phase comprises training to optimize performance on the target labels that performs significantly less parameter adjustment based on a corresponding labeled video target action dataset.
5) A method as claimed in claim 3) or 4), wherein the target training phase comprises training to optimize performance on the target labels that performs significantly less parameter adjustment based on a corresponding labeled source action dataset.
6) A method as claimed in claim 3) or 4), wherein the target training phase relies only on the target dataset and adjusts only a subset of the network's parameters, or a small number of newly introduced parameters.
7) A method as claimed in claim 3) or 4), wherein the target training phase relies only on the target dataset and adjusts all or most of the network's parameters using regularization techniques to limit the magnitude of the adjustments.
8) A method as claimed in claim 3) or 4), wherein the target training phase relies on both the target and source datasets but employs a blended cost function to limit the magnitude of the adjustments resulting from the target dataset.
9) A method as claimed in claim 3) or 4), wherein the target training phase relies on combinations of the methods described in claims 7) and 8).
10) A method as claimed in claim 3) or 4), wherein the source training dataset comprises source training videos showing humans performing actions, and each source training video being associated with a question and an answer.
11) A method as claimed in claim 10), wherein the question and answer are encoded in a text-string.
12) A method as claimed in claim 10), wherein the labels are generated by first showing a worker what other worker(s) have said, and then asking for a description, or question/answer, that is relevant but different.
13) A method as claimed in claim 3) or 4), wherein the source training dataset comprises source training videos showing humans performing sequences of one or more actions, and each source training video being associated with a corresponding sequence of one or more phrases.
14) A method as claimed in claim 3) or 4), wherein the source training dataset comprises random motion patterns performed by humans.
15) A method as claimed in claim 14), wherein the random motion patterns are generated by first instructing workers to invent and perform a random motion in front of the camera, and by subsequently instructing other workers to watch and then get filmed repeating the same motion.
16) A method performed by one or more computers for improving the prediction accuracy of a neural network, the method comprising creating a labeled video action dataset (either source or target) that is generated by asking humans to perform one or more actions, and recording the result.
17) A method as claimed in claim 16), wherein one or more discrete labels identify the action, and the label is a caption and the caption contains a detailed textual description of the action and of the objects involved in the action.
18) A method as claimed in claim 16), wherein one or more textual captions describe the label.
19) A method as claimed in claim 3) or 4), wherein the target videos consist of a repetition of a single frame.
20) A method as claimed in claim 3) or 4), wherein some of the source videos consist of a repetition of a single frame to allow for the use of labeled images as additional source training data.
21) A method for improving the prediction accuracy of a neural network performed by one or more computers, the method comprising:
a. providing a neural network that is configured to receive an input video and predict a target label for the input video, the label being a class or caption;
b. generating a source training dataset comprising source training videos showing humans performing actions, and each source training video being associated with a respective label;
c. pre-training the neural network using the source training dataset.
22) A method as claimed in claim 21), wherein the source training dataset is used to improve generalization performance of the neural network trained on the target task, by initializing the network using training on the source dataset and subsequently training a subset or all parameters on the target dataset.
23) A method as claimed in claim 21), wherein the source training dataset is used to improve generalization performance of the neural network trained on the target task, by regularizing the network by simultaneously training on the source and target datasets using a weighted combination of cost functions.
24) A method as claimed in claim 22) or 23), wherein the label is a caption and the caption contains a detailed textual description of the action and of the objects involved in the action.
25) A method as claimed in claim 22) or 23), wherein the input data for the target task is images distinct from videos and the images are provided to the neural network as a "still" video of repeated images.
26) A method as claimed in claim 22) or 23), wherein the source training dataset comprises random motion patterns performed by humans.
27) A method as claimed in claim 21), wherein the target network is a recurrent neural network being configured to receive an input video stream and to generate, as each frame is processed, a corresponding output relevant to the ultimate purpose of the network; the source task being used to regularize the recurrent network by asking the recurrent network to generate a series of linguistic outputs while solving the target task.
28) A method as claimed in claim 27), wherein the corresponding output is not linguistic in nature.
29) A method as claimed in claim 27), wherein the corresponding output relevant to the ultimate purpose of the network being actions or robotic control signals.
* * *