WO2018094294A1 - Spatial attention model for image captioning - Google Patents

Spatial attention model for image captioning

Info

Publication number
WO2018094294A1
WO2018094294A1 PCT/US2017/062433 US2017062433W WO2018094294A1 WO 2018094294 A1 WO2018094294 A1 WO 2018094294A1 US 2017062433 W US2017062433 W US 2017062433W WO 2018094294 A1 WO2018094294 A1 WO 2018094294A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
decoder
caption
attention
image feature
Prior art date
Application number
PCT/US2017/062433
Other languages
English (en)
Inventor
Jiasen LU
Caiming Xiong
Richard Socher
Original Assignee
Salesforce.Com, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US15/817,161 external-priority patent/US10565305B2/en
Application filed by Salesforce.Com, Inc. filed Critical Salesforce.Com, Inc.
Priority to CA3040165A priority Critical patent/CA3040165C/fr
Priority to JP2019526275A priority patent/JP6689461B2/ja
Priority to CN201780071579.2A priority patent/CN110168573B/zh
Priority to EP17821750.1A priority patent/EP3542314B1/fr
Priority to EP21167276.1A priority patent/EP3869416A1/fr
Publication of WO2018094294A1 publication Critical patent/WO2018094294A1/fr

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Definitions

  • the technology disclosed relates to artificial intelligence type computers and digital data processing systems and corresponding data processing methods and products for emulation of intelligence (i.e., knowledge based systems, reasoning systems, and knowledge acquisition systems); and including systems for reasoning with uncertainty (e.g., fuzzy logic systems), adaptive systems, machine learning systems, and artificial neural networks.
  • the technology disclosed generally relates to a novel visual attention-based encoder-decoder image captioning model.
  • One aspect of the technology disclosed relates to a novel spatial attention model for extracting spatial image features during image captioning.
  • the spatial attention model uses current hidden state information of a decoder long short-term memory (LSTM) to guide attention, rather than using a previous hidden state or a previously emitted word.
  • Another aspect of the technology disclosed relates to a novel adaptive attention model for image captioning that mixes visual information from a convolutional neural network (CNN) and linguistic information from an LSTM.
  • the adaptive attention model automatically decides how heavily to rely on the image, as opposed to the linguistic model, to emit the next caption word.
  • Yet another aspect of the technology disclosed relates to adding a new auxiliary sentinel gate to an LSTM architecture and producing a sentinel LSTM (Sn-LSTM).
  • the sentinel gate produces a visual sentinel at each timestep, which is an additional representation, derived from the LSTM's memory, of long and short term visual and linguistic information.
  • Image captioning is drawing increasing interest in computer vision and machine learning. It requires machines to automatically describe the content of an image using a natural language sentence. While this task seems easy for human beings, it is complicated for machines since it requires the language model to capture various semantic features within an image, such as objects' motions and actions. Another challenge for image captioning, especially for generative models, is that the generated output should be human-like natural sentences.
  • FIG. 2A shows an attention leading decoder that uses previous hidden state information to guide attention and generate an image caption (prior art).
  • Deep neural networks have been successfully applied to many areas, including speech and vision.
  • a long short-term memory (LSTM) neural network is an extension of an RNN that solves this problem.
  • In an LSTM, a memory cell has a linear dependence of its current activity on its past activity.
  • a forget gate is used to modulate the information flow between the past and the current activities.
  • LSTMs also have input and output gates to modulate their input and output.
  • LSTMs have been configured to condition their output on auxiliary inputs, in addition to the current input and the previous hidden state.
  • LSTMs incorporate external visual information provided by image features to influence linguistic choices at different stages.
  • In image caption generators, LSTMs take as input not only the most recently emitted caption word and the previous hidden state, but also regional features of the image being captioned (usually derived from the activation values of a hidden layer in a convolutional neural network (CNN)).
  • the auxiliary input carries auxiliary information, which can be visual or textual. It can be generated externally by another LSTM, or derived externally from a hidden state of another LSTM. It can also be provided by an external source such as a CNN, a multilayer perceptron, an attention network, or another LSTM.
  • the auxiliary information can be fed to the LSTM just once at the initial timestep or fed successively at each timestep.
  • FIG. 1 illustrates an encoder that processes an image through a convolutional neural network (abbreviated CNN) and produces image features for regions of the image.
  • FIG. 2A shows an attention leading decoder that uses previous hidden state information to guide attention and generate an image caption (prior art).
  • FIG. 2B shows the disclosed attention lagging decoder which uses current hidden state information to guide attention and generate an image caption.
  • FIG. 3A depicts a global image feature generator that generates a global image feature for an image by combining image features produced by the CNN encoder of FIG. 1.
  • FIG. 3B is a word embedder that vectorizes words in a high-dimensional embedding space.
  • FIG. 3C is an input preparer that prepares and provides input to a decoder.
  • FIG. 4 depicts one implementation of modules of an attender that is part of the spatial attention model disclosed in FIG. 6.
  • FIG. 5 shows one implementation of modules of an emitter that is used in various aspects of the technology disclosed.
  • Emitter comprises a feed-forward neural network (also referred to herein as multilayer perceptron (MLP)), a vocabulary softmax (also referred to herein as vocabulary probability mass producer), and a word embedder (also referred to herein as embedder).
  • FIG. 6 illustrates the disclosed spatial attention model for image captioning rolled across multiple timesteps.
  • the attention lagging decoder of FIG.2B is embodied in and implemented by the spatial attention model.
  • FIG. 7 depicts one implementation of image captioning using spatial attention applied by the spatial attention model of FIG. 6.
  • FIG. 8 illustrates one implementation of the disclosed sentinel LSTM (Sn-LSTM) that comprises an auxiliary sentinel gate which produces a sentinel state.
  • FIG. 9 shows one implementation of modules of a recurrent neural network (abbreviated RNN) that implements the Sn-LSTM of FIG. 8.
  • FIG. 10 depicts the disclosed adaptive attention model for image captioning that automatically decides how heavily to rely on visual information, as opposed to linguistic information, to emit a next caption word.
  • the sentinel LSTM (Sn-LSTM) of FIG. 8 is embodied in and implemented by the adaptive attention model as a decoder.
  • FIG. 11 depicts one implementation of modules of an adaptive attender that is part of the adaptive attention model disclosed in FIG. 12.
  • the adaptive attender comprises a spatial attender, an extractor, a sentinel gate mass determiner, a sentinel gate mass softmax, and a mixer (also referred to herein as an adaptive context vector producer or an adaptive context producer).
  • the spatial attender in turn comprises an adaptive comparator, an adaptive attender softmax, and an adaptive convex combination accumulator.
  • FIG. 12 shows the disclosed adaptive attention model for image captioning rolled across multiple timesteps.
  • the sentinel LSTM (Sn-LSTM) of FIG. 8 is embodied in and implemented by the adaptive attention model as a decoder.
  • FIG. 13 illustrates one implementation of image captioning using adaptive attention applied by the adaptive attention model of FIG. 12.
  • FIG. 14 is one implementation of the disclosed visually hermetic decoder that processes purely linguistic information and produces captions for an image.
  • FIG. 15 shows a spatial attention model that uses the visually hermetic decoder of FIG. 14 for image captioning. In FIG. 15, the spatial attention model is rolled across multiple timesteps.
  • FIG. 16 illustrates one example of image captioning using the technology disclosed.
  • FIG. 17 shows visualization of some example image captions and image/spatial attention maps generated using the technology disclosed.
  • FIG. 18 depicts visualization of some example image captions, word-wise visual grounding probabilities, and corresponding image/spatial attention maps generated using the technology disclosed.
  • FIG. 19 illustrates visualization of some other example image captions, word-wise visual grounding probabilities, and corresponding image spatial attention maps generated using the technology disclosed.
  • FIG. 20 is an example rank-probability plot that illustrates performance of the technology disclosed on the COCO (common objects in context) dataset.
  • FIG. 21 is another example rank-probability plot that illustrates performance of the technology disclosed on the Flickr30k dataset.
  • FIG. 22 is an example graph that shows localization accuracy of the technology disclosed on the COCO dataset.
  • the blue colored bars show localization accuracy of the spatial attention model and the red colored bars show localization accuracy of the adaptive attention model.
  • FIG. 23 is a table that shows performance of the technology disclosed on the Flickr30k and COCO datasets based on various natural language processing metrics, including BLEU (bilingual evaluation understudy), METEOR (metric for evaluation of translation with explicit ordering), CIDEr (consensus-based image description evaluation), ROUGE-L (recall-oriented understudy for gisting evaluation-longest common subsequence), and SPICE (semantic propositional image caption evaluation).
  • FIG. 25 is a simplified block diagram of a computer system that can be used to implement the technology disclosed.
  • Attention-based visual neural encoder-decoder models use a convolutional neural network (CNN) to encode an input image into feature vectors and a long short-term memory network (LSTM) to decode the feature vectors into a sequence of words.
  • the LSTM relies on an attention mechanism that produces a spatial map highlighting the image regions relevant for generating words. Attention-based models leverage either previous hidden state information of the LSTM or previously emitted caption word(s) as input to the attention mechanism.
  • each conditional probability is modeled as:
  • $\log p(y_t \mid y_1, \ldots, y_{t-1}, I) = f(h_t, c_t)$
  • $f$ is a nonlinear function that outputs the probability of $y_t$. $c_t$ is the visual context vector at time $t$ extracted from image $I$.
  • $h_t$ is the current hidden state of the RNN at time $t$.
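Assuming the conditional probabilities above are those of each caption word given the previously emitted words and the image, the log-likelihood of a whole caption follows from the chain rule. The symbols $T$ (caption length) and $y = (y_1, \ldots, y_T)$ are notation introduced here for illustration and are not taken from the text:

```latex
\log p(y \mid I) \;=\; \sum_{t=1}^{T} \log p\left(y_t \mid y_1, \ldots, y_{t-1}, I\right)
```

Each summand is the per-timestep quantity that the decoder models from its hidden state and the visual context vector.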
  • the technology disclosed uses a long short-term memory network (LSTM) as the RNN.
  • LSTMs are gated variants of a vanilla RNN and have
  • $x_t$ is the current input at time $t$ and $m_{t-1}$ is the previous memory cell state at time $t-1$.
  • Context vector $c_t$ is an important factor in the neural encoder-decoder framework because it provides visual evidence for caption generation.
  • Different ways of modeling the context vector fall into two categories: vanilla encoder-decoder and attention-based encoder-decoder frameworks.
  • context vector $c_t$ is only dependent on a convolutional neural network (CNN) that serves as the encoder.
  • the input image $I$ is fed into the CNN, which extracts the last fully connected layer as a global image feature.
  • the context vector $c_t$ remains constant, and does not depend on the hidden state of the decoder.
  • context vector $c_t$ is dependent on both the encoder and the decoder.
  • the decoder attends to specific regions of the image and determines context vector $c_t$ using the spatial image features from a convolution layer of a CNN. Attention models can significantly improve the performance of image captioning.
  • our model uses the current hidden state information of the decoder LSTM to guide attention, instead of using the previous hidden state or a previously emitted word.
  • our model supplies the LSTM with a time-invariant global image representation, instead of a progression by timestep of attention-variant image representations.
  • the attention mechanism of our model uses current instead of prior hidden state information to guide attention, which requires a different structure and different processing steps.
  • the current hidden state information is used to guide attention to image regions and generate, in a timestep, an attention-variant image representation.
  • the current hidden state information is computed at each timestep by the decoder LSTM, using a current input and previous hidden state information. Information from the LSTM, the current hidden state, is fed to the attention mechanism, instead of output of the attention mechanism being fed to the LSTM.
  • the current input combines word(s) previously emitted with a time-invariant global image representation, which is determined from the encoder CNN's image features.
  • the first current input word fed to decoder LSTM is a special start (<start>) token.
  • the global image representation can be fed to the LSTM once, in a first timestep, or repeatedly at successive timesteps.
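  • the spatial attention model determines context vector $c_t$, which is defined as:
  • $c_t = g(V, h_t)$
  • $g$ is the attention function, which is embodied in and implemented by the attender of FIG. 4. $V = [v_1, \ldots, v_k]$, $v_i \in \mathbb{R}^d$, comprises the image features.
  • Each image feature is a $d$-dimensional representation corresponding to a part or region of the image produced by the CNN encoder. $h_t$ is the current hidden state of the LSTM decoder at time $t$, shown in FIG. 2B.
  • the disclosed spatial attention model feeds the image features and the current hidden state through a comparator followed by an attender softmax to generate attention probability masses over the $k$ regions of the image.
  • the context vector $c_t$ is obtained by a convex combination accumulator as:
  • $c_t = \sum_{i=1}^{k} \alpha_{ti} v_i$, where $\alpha_{ti}$ is the attention probability mass assigned to image region $i$ at time $t$.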
  • the attender comprises the comparator, the attender softmax (also referred to herein as attention probability mass producer), and the convex combination accumulator (also referred to herein as context vector producer or context producer).
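The three attender modules named above can be sketched as follows. This is a minimal illustration, not the publication's implementation: the additive tanh form of the comparator, the weight names `W_v`, `W_g`, `w_h`, and all sizes are assumptions made here for the example.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def spatial_attention(V, h_t, W_v, W_g, w_h):
    """Return (c_t, alpha_t) for image features V (d x k) and hidden state h_t (d,)."""
    # Comparator: one score per image region (assumed additive tanh form).
    z_t = w_h @ np.tanh(W_v @ V + (W_g @ h_t)[:, None])   # shape (k,)
    # Attender softmax: attention probability masses over the k regions.
    alpha_t = softmax(z_t)
    # Convex combination accumulator: weighted sum of the region features.
    c_t = V @ alpha_t                                      # shape (d,)
    return c_t, alpha_t

rng = np.random.default_rng(0)
d, k, m = 8, 49, 16   # hypothetical sizes: feature dim, regions, comparator width
V = rng.standard_normal((d, k))
h_t = rng.standard_normal(d)
W_v = rng.standard_normal((m, d))
W_g = rng.standard_normal((m, d))
w_h = rng.standard_normal(m)
c_t, alpha_t = spatial_attention(V, h_t, W_v, W_g, w_h)
```

Because the attention masses are non-negative and sum to one, every coordinate of the context vector lies between the smallest and largest value of that coordinate across the image regions.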
  • FIG. 1 illustrates an encoder that processes an image through a convolutional neural network (abbreviated CNN) and produces the image features for regions of the image.
  • the encoder CNN is a pretrained ResNet.
  • the image features are spatial feature outputs of the last
  • the image features have a dimension of 2048 x 7 x 7.
  • the technology disclosed uses $A = \{a_1, \ldots, a_k\}$, $a_i \in \mathbb{R}^{2048}$, to represent the spatial CNN features at each of the $k$ grid locations. Following this, in some implementations, a global image feature generator produces a global image feature, as discussed below.
  • FIG. 2B shows the disclosed attention lagging decoder which uses current hidden state information $h_t$ to guide attention and generate an image caption.
  • the attention lagging decoder uses current hidden state information $h_t$ to analyze where to look in the image, i.e., for generating the context vector $c_t$.
  • the decoder then combines both sources of information $h_t$ and $c_t$ to predict the next word.
  • the generated context vector $c_t$ embodies the residual visual information of current hidden state $h_t$, which diminishes the uncertainty or complements the informativeness of the current hidden state for next word prediction. Since the decoder is recurrent, LSTM-based and operates sequentially, the current hidden state $h_t$ embodies the previous hidden state $h_{t-1}$ and the current input $x_t$, which form the current visual and linguistic context.
  • the attention lagging decoder attends to the image using this current visual and linguistic context rather than stale, prior context (FIG. 2A).
  • the image is attended after the current visual and linguistic context is determined by the decoder, i.e., the attention lags the decoder. This produces more accurate image captions.
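The ordering of operations in the attention lagging decoder can be illustrated with a toy timestep. This is a sketch under loud assumptions: a plain tanh RNN stands in for the LSTM, dot-product scoring stands in for the attender, and summing the hidden state and context vector before the vocabulary softmax is one possible way of combining them. None of the names or sizes come from the text.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def lagging_decode_step(x_t, h_prev, V, params):
    """One timestep: update the hidden state first, then attend with it."""
    W_x, W_h, W_p = params
    # 1. Recurrent update (a plain tanh RNN stands in for the LSTM here).
    h_t = np.tanh(W_x @ x_t + W_h @ h_prev)
    # 2. Attention is guided by the current hidden state h_t, not by h_prev.
    alpha_t = softmax(V.T @ h_t)      # dot-product scores, a simplification
    c_t = V @ alpha_t
    # 3. Combine h_t and c_t to score vocabulary words (the sum is an assumption).
    p_t = softmax(W_p @ (c_t + h_t))
    return h_t, c_t, p_t

rng = np.random.default_rng(1)
d, k, n_in, vocab = 6, 9, 10, 20      # hypothetical sizes
V = rng.standard_normal((d, k))
params = (rng.standard_normal((d, n_in)) * 0.1,
          rng.standard_normal((d, d)) * 0.1,
          rng.standard_normal((vocab, d)))
h_t, c_t, p_t = lagging_decode_step(rng.standard_normal(n_in), np.zeros(d), V, params)
```

An attention leading decoder in the sense of FIG. 2A would instead compute the attention from `h_prev` before step 1 and feed the resulting context into the recurrent update.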
  • FIG. 3A depicts a global image feature generator that generates a global image feature for an image by combining image features produced by the CNN encoder of FIG. 1.
  • Global image feature generator first produces a preliminary global image feature as follows:
  • $a^g = \frac{1}{k} \sum_{i=1}^{k} a_i$, where $a_i$ is the image feature produced by the CNN encoder for region $i$.
  • $a^g$ is the preliminary global image feature that is determined by averaging the image features produced by the CNN encoder.
  • the global image feature generator uses a single layer perceptron with rectifier activation function to transform the image feature vectors into new vectors with dimension $d$:
  • $v_i = \mathrm{ReLU}(W_a a_i)$, $v^g = \mathrm{ReLU}(W_b a^g)$
  • $W_a$ and $W_b$ are the weight parameters.
  • $v^g$ is the global image feature.
  • Global image feature $v^g$ is time-invariant because it is not sequentially or recurrently produced, but instead determined from non-recurrent, convolved image features.
  • the transformed spatial image features $v_i$ form the image features $V = [v_1, \ldots, v_k]$.
  • Transformation of the image features is embodied in and implemented by the image feature rectifier of the global image feature generator, according to one implementation. Transformation of the preliminary global image feature is embodied in and implemented by the global image feature rectifier of the global image feature generator, according to one implementation.
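The averaging and rectifier transformations described for the global image feature generator can be sketched as below. The array layout (one row per region), the output dimension `d = 32`, and the random weights are assumptions for illustration; only the 2048-dimensional features on a 7 x 7 grid come from the text.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def global_image_feature(A, W_a, W_b):
    """A: (k, D) CNN features for k regions. Returns (V, v_g)."""
    a_g = A.mean(axis=0)     # preliminary global feature: average over regions
    V = relu(A @ W_a.T)      # (k, d) transformed spatial image features
    v_g = relu(W_b @ a_g)    # (d,) global image feature
    return V, v_g

rng = np.random.default_rng(2)
k, D, d = 49, 2048, 32       # 7 x 7 grid of 2048-d features; d is hypothetical
A = rng.standard_normal((k, D))
W_a = rng.standard_normal((d, D)) * 0.01
W_b = rng.standard_normal((d, D)) * 0.01
V, v_g = global_image_feature(A, W_a, W_b)
```

Nothing in this computation depends on the timestep, which is why the resulting global feature can be fed to the decoder unchanged at every step.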
  • FIG. 3B is a word embedder that vectorizes words in a high-dimensional embedding space.
  • the technology disclosed uses the word embedder to generate word embeddings of vocabulary words predicted by the decoder. $w_t$ denotes the word embedding of a vocabulary word predicted by the decoder at time $t$.
  • $w_{t-1}$ denotes the word embedding of a vocabulary word predicted by the decoder at time $t - 1$.
  • word embedder generates word embeddings $w_{t-1}$ of dimensionality $d$ using an embedding matrix $E \in \mathbb{R}^{d \times |v|}$, where $v$ represents the size of the vocabulary.
  • word embedder first transforms a word into a one-hot encoding and then converts it into a continuous representation using the embedding matrix $E$.
  • the word embedder initializes word embeddings using pretrained word embedding models like GloVe and word2vec and obtains a fixed word embedding of each word in the vocabulary.
  • word embedder generates character embeddings and/or phrase embeddings.
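  • FIG. 3C is an input preparer that prepares and provides input to a decoder. At each time step, the input preparer concatenates the word embedding vector $w_{t-1}$ (predicted by the decoder in the immediately previous timestep) with the global image feature vector. The concatenation forms the input $x_t$ that is fed to the decoder at the current timestep $t$.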
  • the input preparer is also referred to herein as concatenator.
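The word embedder and the input preparer (concatenator) can be sketched together. The toy vocabulary, the embedding size, and the random embedding matrix are hypothetical; a real system would learn the matrix or initialize it from pretrained embeddings as the text describes.

```python
import numpy as np

vocab = ["<start>", "a", "dog", "runs", "<end>"]   # toy vocabulary (hypothetical)
word_to_id = {w: i for i, w in enumerate(vocab)}

rng = np.random.default_rng(3)
d = 4
E = rng.standard_normal((d, len(vocab)))           # embedding matrix, one column per word

def embed(word):
    """One-hot encode the word, then project it through the embedding matrix."""
    one_hot = np.zeros(len(vocab))
    one_hot[word_to_id[word]] = 1.0
    return E @ one_hot                             # equals the column E[:, id]

def prepare_input(prev_word, v_g):
    """Concatenate the previous word's embedding with the global image feature."""
    return np.concatenate([embed(prev_word), v_g])

v_g = rng.standard_normal(d)                       # stand-in global image feature
x_1 = prepare_input("<start>", v_g)                # decoder input at the first timestep
```

In practice the one-hot product is replaced by a direct column lookup, which gives the same vector without materializing the one-hot encoding.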
  • a long short-term memory is a cell in a neural network that is repeatedly exercised in timesteps to produce sequential outputs from sequential inputs.
  • the output is often referred to as a hidden state, which should not be confused with the cell's memory.
  • Inputs are a hidden state and memory from a prior timestep and a current input.
  • the cell has an input activation function, memory, and gates.
  • the input activation function maps the input into a range, such as -1 to 1 for a tanh activation function.
  • the gates determine weights applied to updating the memory and generating a hidden state output result from the memory.
  • the gates are a forget gate, an input gate, and an output gate.
  • the forget gate attenuates the memory.
  • the input gate mixes activated inputs with the attenuated memory.
  • the output gate controls hidden state output from the memory.
  • the hidden state output can directly label an input or it can be processed by another component to emit a word or other label or generate a probability distribution over
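The gate behavior described above corresponds to the widely used LSTM update, sketched here for a single timestep. The stacked weight layout `W`, the bias `b`, and the sizes are choices made for this example rather than details from the text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, m_prev, W, b):
    """One LSTM timestep. W: (4n, n_in + n), b: (4n,). Returns (h_t, m_t)."""
    n = h_prev.shape[0]
    pre = W @ np.concatenate([x_t, h_prev]) + b
    i_t = sigmoid(pre[0:n])            # input gate
    f_t = sigmoid(pre[n:2 * n])        # forget gate: attenuates the old memory
    o_t = sigmoid(pre[2 * n:3 * n])    # output gate
    g_t = np.tanh(pre[3 * n:4 * n])    # activated input, in the range -1 to 1
    m_t = f_t * m_prev + i_t * g_t     # mix activated input with attenuated memory
    h_t = o_t * np.tanh(m_t)           # hidden state output from the memory
    return h_t, m_t

rng = np.random.default_rng(4)
n_in, n = 5, 3                         # hypothetical input and state sizes
W = rng.standard_normal((4 * n, n_in + n))
b = np.zeros(4 * n)
h_t, m_t = lstm_step(rng.standard_normal(n_in), np.zeros(n), np.zeros(n), W, b)
```

The memory `m_t` and the hidden state `h_t` are returned separately, which mirrors the distinction the text draws between the cell's memory and its output.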
  • An auxiliary input can be added to the LSTM that introduces a different kind of information than the current input, in a sense orthogonal to current input. Adding such a different kind of auxiliary input can lead to overfitting and other training artifacts.
  • the technology disclosed adds a new gate to the LSTM cell architecture that produces a second sentinel state output from the memory, in addition to the hidden state output. This sentinel state output is used to control mixing between different neural network processing models in a post-LSTM component.
  • a visual sentinel for instance, controls mixing between analysis of visual features from a CNN and of word sequences from a predictive language model.
  • the new gate that produces the sentinel state output is called "auxiliary sentinel gate".
  • the auxiliary input contributes to both accumulated auxiliary information in the LSTM memory and to the sentinel output.
  • the sentinel state output encodes parts of the accumulated auxiliary information that are most useful for next output prediction.
  • the sentinel gate conditions current input, including the previous hidden state and the auxiliary information, and combines the conditioned input with the updated memory, to produce the sentinel state output.
  • An LSTM that includes the auxiliary sentinel gate is referred to herein as a "sentinel LSTM (Sn-LSTM)".
  • prior to being accumulated in the Sn-LSTM, the auxiliary information is often subjected to a "tanh" (hyperbolic tangent) function that produces output in the range of -1 to 1 (e.g., tanh function following the fully-connected layer of a CNN).
  • the auxiliary sentinel gate gates the pointwise tanh of the Sn-LSTM's memory cell.
  • tanh is selected as the non-linearity function applied to the Sn-LSTM's memory cell because it matches the form of the stored auxiliary information.
  • FIG. 8 illustrates one implementation of the disclosed sentinel LSTM (Sn-LSTM) that comprises an auxiliary sentinel gate which produces a sentinel state or visual sentinel.
  • the Sn-LSTM receives inputs at each of a plurality of timesteps.
  • the inputs include at least an input for a current timestep $x_t$, a hidden state from a previous timestep $h_{t-1}$, and an auxiliary input $a_t$.
  • the Sn-LSTM can run on at least one of the numerous parallel processors.
  • the auxiliary input $a_t$ is not separately provided, but instead encoded as auxiliary information in the previous hidden state $h_{t-1}$ and/or the input $x_t$.
  • the auxiliary input $a_t$ can be visual input comprising image data and the input can be a text embedding of a most recently emitted word and/or character.
  • the auxiliary input $a_t$ can be a text encoding from another long short-term memory network (abbreviated LSTM) of an input document and the input can be a text embedding of a most recently emitted word and/or character.
  • the auxiliary input $a_t$ can be a hidden state vector from another LSTM that encodes sequential data and the input can be a text embedding of a most recently emitted word and/or character.
  • the auxiliary input $a_t$ can be a prediction derived from a hidden state vector from another LSTM that encodes sequential data and the input can be a text embedding of a most recently emitted word and/or character.
  • the auxiliary input $a_t$ can be an output of a convolutional neural network (abbreviated CNN).
  • the auxiliary input can be an output of an
  • the Sn-LSTM generates outputs at each of the plurality of timesteps by processing the inputs through a plurality of gates.
  • the gates include at least an input gate, a forget gate, an output gate, and an auxiliary sentinel gate. Each of the gates can run on at least one of the numerous parallel processors.
  • the input gate controls how much of the current input and the previous hidden
  • the forget gate operates on the current memory cell state and the previous
  • the output gate scales the output from the memory cell and is represented as:
  • the Sn-LSTM can also include an activation gate (also referred to as cell update gate or input transformation gate) that transforms the current input $x_t$ and previous hidden state $h_{t-1}$ to be taken into account into the current memory cell state and is represented as:
  • the Sn-LSTM can also include a current hidden state producer that outputs the current hidden state scaled by a tanh (squashed) transformation of the current memory cell
  • a memory cell updater (FIG. 9) updates the memory cell of the Sn-LSTM from the previous memory cell state $m_{t-1}$ to the current memory cell state $m_t$ as follows:
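The gate formulas referred to above did not survive in this text. For orientation only, the standard LSTM formulation is given below; the weight names $W_{x\cdot}$, $W_{h\cdot}$ and biases $b_{\cdot}$ are assumptions, and the publication's own equations may differ, for example in how the auxiliary input enters each gate:

```latex
\begin{aligned}
i_t &= \sigma\left(W_{xi} x_t + W_{hi} h_{t-1} + b_i\right) && \text{(input gate)} \\
f_t &= \sigma\left(W_{xf} x_t + W_{hf} h_{t-1} + b_f\right) && \text{(forget gate)} \\
o_t &= \sigma\left(W_{xo} x_t + W_{ho} h_{t-1} + b_o\right) && \text{(output gate)} \\
g_t &= \tanh\left(W_{xg} x_t + W_{hg} h_{t-1} + b_g\right) && \text{(activation gate)} \\
m_t &= f_t \odot m_{t-1} + i_t \odot g_t && \text{(memory cell update)} \\
h_t &= o_t \odot \tanh(m_t) && \text{(current hidden state)}
\end{aligned}
```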
  • the auxiliary sentinel gate produces a sentinel state or visual sentinel which is a latent representation of what the Sn-LSTM decoder already knows.
  • the Sn-LSTM decoder's memory stores both long and short term visual and linguistic information.
  • the adaptive attention model learns to extract a new component from the Sn-LSTM that the model can fall back on when it chooses to not attend to the image. This new component is called the visual sentinel.
  • the gate that decides whether to attend to the image or to the visual sentinel is the auxiliary sentinel gate.
  • the visual and linguistic contextual information is stored in the Sn-LSTM decoder's memory cell.
  • $\text{aux}_t$ is the auxiliary sentinel gate applied to the current memory cell state $m_t$. $\odot$ represents the element-wise product and $\sigma$ is the logistic sigmoid activation.
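A minimal sketch of how a sentinel state could be produced, assuming the auxiliary sentinel gate is a sigmoid over the current input and previous hidden state that gates the pointwise tanh of the updated memory cell. The weight names `W_x` and `W_h`, the absence of a bias, and the sizes are assumptions made here, not details confirmed by the text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sentinel_state(x_t, h_prev, m_t, W_x, W_h):
    """Sentinel from the current input, previous hidden state and updated memory."""
    aux_t = sigmoid(W_x @ x_t + W_h @ h_prev)   # auxiliary sentinel gate, values in (0, 1)
    return aux_t * np.tanh(m_t)                 # element-wise product with squashed memory

rng = np.random.default_rng(5)
n_in, n = 5, 3                                  # hypothetical input and state sizes
s_t = sentinel_state(rng.standard_normal(n_in),
                     rng.standard_normal(n),
                     rng.standard_normal(n),
                     rng.standard_normal((n, n_in)),
                     rng.standard_normal((n, n)))
```

The sentinel has the same shape as the hidden state, so a downstream component can mix it with a context vector of matching dimension.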
  • the Sn-LSTM can be used as a decoder that receives auxiliary information from another encoder LSTM.
  • the encoder LSTM can process an input document to produce a document encoding.
  • the document encoding or an alternative representation of the document encoding can be fed to the Sn-LSTM as auxiliary information.
  • Sn-LSTM can use its auxiliary sentinel gate to determine which parts of the document encoding (or its alternative representation) are most important at a current timestep, considering a previously generated summary word and a previous hidden state.
  • the important parts of the document encoding (or its alternative representation) can then be encoded into the sentinel state.
  • the sentinel state can be used to generate the next summary word.
  • the Sn-LSTM can be used as a decoder that receives auxiliary information from another encoder LSTM.
  • the encoder LSTM can process an input question to produce a question encoding.
  • the question encoding or an alternative representation of the question encoding can be fed to the Sn-LSTM as auxiliary information.
  • Sn-LSTM can use its auxiliary sentinel gate to determine which parts of the question encoding (or its alternative representation) are most important at a current timestep, considering a previously generated answer word and a previous hidden state. The important parts of the question encoding (or its alternative representation) can then be encoded into the sentinel state.
  • the sentinel state can be used to generate the next answer word.
  • the Sn-LSTM can be used as a decoder that receives auxiliary information from another encoder LSTM.
  • the encoder LSTM can process a source language sequence to produce a source encoding.
  • the source encoding or an alternative representation of the source encoding can be fed to the Sn-LSTM as auxiliary information.
  • Sn-LSTM can use its auxiliary sentinel gate to determine which parts of the source encoding (or its alternative representation) are most important at a current timestep, considering a previously generated translated word and a previous hidden state.
  • the important parts of the source encoding (or its alternative representation) can then be encoded into the sentinel state.
  • the sentinel state can be used to generate the next translated word.
  • the Sn-LSTM can be used as a decoder that receives auxiliary information from an encoder comprising a CNN and an LSTM.
  • the encoder can process video frames of a video to produce a video encoding.
  • the video encoding or an alternative representation of the video encoding can be fed to the Sn-LSTM as auxiliary information.
  • Sn-LSTM can use its auxiliary sentinel gate to determine which parts of the video encoding (or its alternative representation) are most important at a current timestep, considering a previously generated caption word and a previous hidden state.
  • the important parts of the video encoding (or its alternative representation) can then be encoded into the sentinel state.
  • the sentinel state can be used to generate the next caption word.
  • the Sn-LSTM can be used as a decoder that receives auxiliary information from an encoder CNN.
  • the encoder can process an input image to produce an image encoding.
  • the image encoding or an alternative representation of the image encoding can be fed to the Sn-LSTM as auxiliary information.
  • Sn-LSTM can use its auxiliary sentinel gate to determine which parts of the image encoding (or its alternative representation) are most important at a current timestep, considering a previously generated caption word and a previous hidden state.
  • the important parts of the image encoding (or its alternative representation) can then be encoded into the sentinel state.
  • the sentinel state can be used to generate the next caption word.
  • a long short-term memory (LSTM) decoder can be extended to generate image captions by attending to regions or features of a target image and conditioning word predictions on the attended image features.
  • attending to the image is only half of the story; knowing when to look is the other half. That is, not all caption words correspond to visual signals; some words, such as stop words and linguistically correlated words, can be better inferred from textual context.
  • Existing attention-based visual neural encoder-decoder models force visual attention to be active for every generated word. However, the decoder likely requires little to no visual information from the image to predict non-visual words such as "the" and "of."
  • FIG. 10 depicts the disclosed adaptive attention model for image captioning that automatically decides how heavily to rely on visual information, as opposed to linguistic information, to emit a next caption word.
  • the sentinel LSTM (Sn-LSTM) of FIG. 8 is embodied in and implemented by the adaptive attention model as a decoder.
  • the sentinel gate produces a so-called visual sentinel/sentinel state $s_t$ at each timestep, which is an additional representation, derived from the Sn-LSTM's memory, of long and short term visual and linguistic information.
  • the visual sentinel $s_t$ encodes information that can be relied on by the linguistic model without reference to the visual information from the CNN.
  • the visual sentinel $s_t$ is used, in combination with the current hidden state from the Sn-LSTM, to generate a sentinel gate mass/gate probability mass $\beta_t$ that controls mixing of image and linguistic context.
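The derivation of the sentinel state from the decoder memory can be sketched as follows. This is a minimal illustration, not the disclosed implementation: it assumes a sentinel gate of the form g_t = sigmoid(W_x x_t + W_h h_{t-1}) applied to tanh of the memory cell m_t, and the names `visual_sentinel`, `W_x`, `W_h`, and all dimensions are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def visual_sentinel(x_t, h_prev, m_t, W_x, W_h):
    """Sentinel state derived from the decoder memory cell m_t.

    Assumed form (not confirmed by the text above):
        g_t = sigmoid(W_x x_t + W_h h_prev)   # auxiliary sentinel gate
        s_t = g_t * tanh(m_t)                 # gated view of the memory
    """
    g_t = sigmoid(W_x @ x_t + W_h @ h_prev)
    return g_t * np.tanh(m_t)

# Toy dimensions, chosen only for illustration.
rng = np.random.default_rng(0)
d_in, d = 6, 4
x_t = rng.normal(size=d_in)      # current decoder input
h_prev = rng.normal(size=d)      # previous hidden state
m_t = rng.normal(size=d)         # current memory cell
W_x = rng.normal(size=(d, d_in))
W_h = rng.normal(size=(d, d))

s_t = visual_sentinel(x_t, h_prev, m_t, W_x, W_h)
print(s_t.shape)
```

Because the gate lies in (0, 1) and tanh lies in (-1, 1), each component of the sentinel state is bounded in magnitude by 1.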
  • FIG. 14 is one implementation of the disclosed visually hermetic decoder that processes purely linguistic information and produces captions for an image.
  • FIG. 15 shows a spatial attention model that uses the visually hermetic decoder of FIG. 14 for image captioning. In FIG. 15, the spatial attention model is rolled across multiple timesteps.
  • a visually hermetic decoder can be used that processes purely linguistic information $w$, which is not mixed with image data during image captioning.
  • This alternative visually hermetic decoder does not receive the global image representation as input. That is, the current input to the visually hermetic decoder is just its most recently emitted caption word $w_{t-1}$, and the initial input is only the <start> token.
  • a visually hermetic decoder can be implemented as an LSTM, a gated recurrent unit (GRU), or a quasi-recurrent neural network (QRNN). Words, with this alternative decoder, are still emitted after application of the attention mechanism.
  • the technology disclosed also provides a system and method of evaluating performance of an image captioning model.
  • the technology disclosed generates a spatial attention map of attention values for mixing image region vectors of an image using a convolutional neural network (abbreviated CNN) encoder and a long short-term memory (abbreviated LSTM) decoder and produces a caption word output based on the spatial attention map.
  • the technology disclosed segments regions of the image above a threshold attention value into a segmentation map.
  • the technology disclosed projects a bounding box over the image that covers a largest connected image component in the segmentation map.
  • the technology disclosed determines an intersection over union (abbreviated IOU) of the projected bounding box and a ground truth bounding box.
  • the technology disclosed determines a localization accuracy of the spatial attention map based on the calculated IOU.
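The evaluation steps above (threshold, largest connected component, bounding box, IOU) can be sketched as below. This is an illustrative sketch only: the 4-connectivity rule, the inclusive box convention, the toy attention map, and the function names `largest_component_bbox` and `iou` are assumptions, not details taken from the disclosure.

```python
import numpy as np
from collections import deque

def largest_component_bbox(attention_map, threshold):
    """Bounding box (row0, col0, row1, col1), inclusive, of the largest
    4-connected region with attention above threshold; None if empty."""
    mask = attention_map > threshold          # segmentation map
    seen = np.zeros_like(mask, dtype=bool)
    rows, cols = mask.shape
    best = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r, c] or seen[r, c]:
                continue
            # Breadth-first flood fill of one connected component.
            component, queue = [], deque([(r, c)])
            seen[r, c] = True
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny, nx] and not seen[ny, nx]):
                        seen[ny, nx] = True
                        queue.append((ny, nx))
            if len(component) > len(best):
                best = component
    if not best:
        return None
    ys, xs = zip(*best)
    return (min(ys), min(xs), max(ys), max(xs))

def iou(box_a, box_b):
    """Intersection over union of two inclusive boxes."""
    r0, c0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    r1, c1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, r1 - r0 + 1) * max(0, c1 - c0 + 1)
    area_a = (box_a[2] - box_a[0] + 1) * (box_a[3] - box_a[1] + 1)
    area_b = (box_b[2] - box_b[0] + 1) * (box_b[3] - box_b[1] + 1)
    return inter / (area_a + area_b - inter)

# Toy 4x4 attention map and a hypothetical ground truth box.
attention = np.array([
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.3, 0.3, 0.0],
    [0.0, 0.2, 0.1, 0.0],
    [0.1, 0.0, 0.0, 0.0],
])
predicted = largest_component_bbox(attention, threshold=0.15)
ground_truth = (1, 1, 2, 3)
score = iou(predicted, ground_truth)
print(predicted, round(score, 3))
```

A localization accuracy can then be derived by comparing the IOU against a cutoff; the cutoff value is left open here because the text above does not state one.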
  • the technology disclosed presents a system.
  • the system includes numerous parallel processors coupled to memory.
  • the memory is loaded with computer instructions to generate a natural language caption for an image.
  • the instructions, when executed on the parallel processors, implement the following actions.
  • the encoder can be a convolutional neural network (abbreviated CNN).
  • the decoder can be a long short-term memory network (abbreviated LSTM).
  • the feed-forward neural network can be a multilayer perceptron (abbreviated MLP).
  • This system implementation and other systems disclosed optionally include one or more of the following features.
  • The system can also include features described in connection with the methods disclosed. In the interest of conciseness, alternative combinations of system features are not individually enumerated. Features applicable to systems, methods, and articles of manufacture are not repeated for each statutory class set of base features. The reader will understand how features identified in this section can readily be combined with base features in other statutory classes.
  • the system can be a computer-implemented system.
  • the system can be a neural network-based system.
  • the current hidden state of the decoder can be determined based on a current input to the decoder and a previous hidden state of the decoder.
  • the image context vector can be a dynamic vector that determines at each timestep an amount of spatial attention allocated to each image region, conditioned on the current hidden state of the decoder.
  • the system can use weakly-supervised localization to evaluate the allocated spatial attention.
  • the attention values for the image feature vectors can be determined by processing the image feature vectors and the current hidden state of the decoder through a single layer neural network.
  • the system can cause the feed-forward neural network to emit the next caption word at each timestep.
  • the feed-forward neural network can produce an output based on the image context vector and the current hidden state of the decoder and use the output to determine a normalized distribution of vocabulary probability masses over words in a vocabulary that represent a respective likelihood that a vocabulary word is the next caption word.
  • implementations may include a non-transitory computer readable storage medium storing instructions executable by a processor to perform actions of the system described above.
  • the technology disclosed presents a system.
  • the system includes numerous parallel processors coupled to memory.
  • the memory is loaded with computer instructions to generate a natural language caption for an image.
  • the instructions, when executed on the parallel processors, implement the following actions.
  • the system can be a computer-implemented system.
  • the system can be a neural network-based system.
  • the current hidden state information can be determined based on a current input to the decoder and previous hidden state information.
  • the system can use weakly-supervised localization to evaluate the attention map.
  • the encoder can be a convolutional neural network (abbreviated CNN) and the image feature vectors can be produced by a last convolutional layer of the CNN.
  • the attention lagging decoder can be a long short-term memory network (abbreviated LSTM).
  • implementations may include a non-transitory computer readable storage medium storing instructions executable by a processor to perform actions of the system described above.
  • the technology disclosed presents a system.
  • the system includes numerous parallel processors coupled to memory.
  • the memory is loaded with computer instructions to generate a natural language caption for an image.
  • the instructions, when executed on the parallel processors, implement the following actions.
  • the encoder can be a convolutional neural network (abbreviated CNN).
  • the decoder can be a long short-term memory network (abbreviated LSTM).
  • the system can be a computer-implemented system.
  • the system can be a neural network-based system.
  • the system does not supply the global image feature vector to the decoder and processes words through the decoder by beginning at the initial timestep with the start-of-caption token <start> and continuing in successive timesteps using the most recently emitted caption word $w_{t-1}$ as input to the decoder.
  • the system does not supply the image feature vectors to the decoder, in some implementations.
  • the technology disclosed presents a system for machine generation of a natural language caption for an image.
  • the system runs on numerous parallel processors.
  • the system can be a computer-implemented system.
  • the system can be a neural network-based system.
  • the system comprises an attention lagging decoder.
  • the attention lagging decoder can run on at least one of the numerous parallel processors.
  • the attention lagging decoder uses at least current hidden state information to generate an attention map for image feature vectors produced by an encoder from an image.
  • the encoder can be a convolutional neural network (abbreviated CNN) and the image feature vectors can be produced by a last convolutional layer of the CNN.
  • the attention lagging decoder can be a long short-term memory network (abbreviated LSTM).
  • the attention lagging decoder causes generation of an output caption word based on a weighted sum of the image feature vectors, with the weights determined from the attention map.
  • implementations may include a non-transitory computer readable storage medium storing instructions executable by a processor to perform actions of the system described above.
  • FIG. 6 illustrates the disclosed spatial attention model for image captioning rolled across multiple timesteps.
  • the attention lagging decoder of FIG. 2B is embodied in and implemented by the spatial attention model.
  • the technology disclosed presents an image-to-language captioning system that implements the spatial attention model of FIG. 6 for machine generation of a natural language caption for an image.
  • the system runs on numerous parallel processors.
  • the system comprises an encoder (FIG. 1) for processing an image through a convolutional neural network (abbreviated CNN) and producing image features for regions of the image.
  • the encoder can run on at least one of the numerous parallel processors.
  • the system comprises a global image feature generator (FIG. 3A) for generating a global image feature for the image by combining the image features.
  • the global image feature generator can run on at least one of the numerous parallel processors.
  • the system comprises an input preparer (FIG. 3C) for providing input to a decoder as a combination of a start-of-caption token <start> and the global image feature at an initial decoder timestep and a combination of a most recently emitted caption word $w_{t-1}$ and the global image feature at successive decoder timesteps.
  • the input preparer can run on at least one of the numerous parallel processors.
  • the system comprises the decoder (FIG. 2B) for processing the input through a long short-term memory network (abbreviated LSTM) to generate a current decoder hidden state at each decoder timestep.
  • the decoder can run on at least one of the numerous parallel processors.
  • the system comprises an attender (FIG. 4) for accumulating, at each decoder timestep, an image context as a convex combination of the image features scaled by attention probability masses determined using the current decoder hidden state.
  • the attender can run on at least one of the numerous parallel processors.
  • FIG. 4 depicts one implementation of modules of the attender that is part of the spatial attention model disclosed in FIG. 6.
  • the attender comprises the comparator, the attender softmax (also referred to herein as attention probability mass producer), and the convex combination accumulator (also referred to herein as context vector producer or context producer).
  • the system comprises a feed-forward neural network (also referred to herein as multilayer perceptron (MLP)) (FIG. 5) for processing the image context and the current decoder hidden state to emit a next caption word at each decoder timestep.
  • the feed-forward neural network can run on at least one of the numerous parallel processors.
  • the system comprises a controller (FIG. 25) for iterating the input preparer, the decoder, the attender, and the feed-forward neural network to generate the natural language caption for the image until the next caption word emitted is an end-of-caption token <end>.
  • the controller can run on at least one of the numerous parallel processors.
  • the system can be a computer-implemented system.
  • the system can be a neural network-based system.
  • the attender can further comprise an attender softmax (FIG. 4) for exponentially normalizing attention values to produce the attention probability masses.
  • the attender softmax can run on at least one of the numerous parallel processors.
  • the attender can further comprise a comparator (FIG. 4) for producing at each decoder timestep the attention values as a result of interaction between the current decoder hidden state and the image features.
  • the comparator can run on at least one of the numerous parallel processors.
  • the attention values are determined by processing the current decoder hidden state $h_t$ and the image features through a single layer neural network.
  • the attention values are determined by processing the current decoder hidden state
  • the attention values are determined by processing
  • the decoder can further comprise at least an input gate, a forget gate, and an output gate for determining at each decoder timestep the current decoder hidden state based on a current decoder input and a previous decoder hidden state.
  • the input gate, the forget gate, and the output gate can each run on at least one of the numerous parallel processors.
  • the attender can further comprise a convex combination accumulator (FIG. 4) for producing the image context to identify an amount of spatial attention allocated to each image region at each decoder timestep, conditioned on the current decoder hidden state.
  • the convex combination accumulator can run on at least one of the numerous parallel processors.
  • the system can further comprise a localizer (FIG. 25) for evaluating the allocated spatial attention based on weakly-supervised localization.
  • the localizer can run on at least one of the numerous parallel processors.
  • the system can further comprise the feed-forward neural network (FIG. 5) for producing at each decoder timestep an output based on the image context and the current decoder hidden state.
  • the system can further comprise a vocabulary softmax (FIG. 5) for determining at each decoder timestep a normalized distribution of vocabulary probability masses over words in a vocabulary using the output.
  • the vocabulary softmax can run on at least one of the numerous parallel processors.
  • the vocabulary probability masses can identify respective likelihood that a vocabulary word is the next caption word.
  • implementations may include a non-transitory computer readable storage medium storing instructions executable by a processor to perform actions of the system described above.
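One decoder timestep of the attender and vocabulary softmax described above can be sketched as follows. This is a minimal sketch under stated assumptions: the single layer network is taken to be a tanh layer followed by a projection, the feed-forward network is reduced to one linear layer over the sum of the image context and hidden state, and all weight names (`W_v`, `W_g`, `w_h`, `W_p`), dimensions, and the toy vocabulary are hypothetical.

```python
import numpy as np

def softmax(z):
    """Exponentially normalize a vector of scores."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def attend(V, h_t, W_v, W_g, w_h):
    """V: (k, d) image features, one row per region; h_t: (d,) hidden state."""
    # Comparator: single layer network over features and hidden state.
    z_t = np.tanh(V @ W_v.T + W_g @ h_t) @ w_h      # (k,) attention values
    # Attender softmax: attention probability masses.
    alpha_t = softmax(z_t)
    # Convex combination accumulator: image context.
    c_t = alpha_t @ V                               # (d,)
    return alpha_t, c_t

def vocabulary_distribution(c_t, h_t, W_p):
    """Assumed one-layer stand-in for the feed-forward network."""
    return softmax(W_p @ (c_t + h_t))

rng = np.random.default_rng(0)
k, d, a = 5, 8, 6                                   # regions, feature dim, attention dim
vocabulary = ["<end>", "a", "dog", "runs"]
V = rng.normal(size=(k, d))
h_t = rng.normal(size=d)
W_v = rng.normal(size=(a, d))
W_g = rng.normal(size=(a, d))
w_h = rng.normal(size=a)
W_p = rng.normal(size=(len(vocabulary), d))

alpha_t, c_t = attend(V, h_t, W_v, W_g, w_h)
probs = vocabulary_distribution(c_t, h_t, W_p)
next_word = vocabulary[int(np.argmax(probs))]
print(alpha_t.round(3), next_word)
```

Since the attention probability masses are non-negative and sum to one, each component of the image context stays within the range spanned by the corresponding components of the region features.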
  • FIG. 7 depicts one implementation of image captioning using spatial attention applied by the spatial attention model of FIG. 6.
  • the technology disclosed presents a method that performs the image captioning of FIG. 7 for machine generation of a natural language caption for an image.
  • the method can be a computer-implemented method.
  • the method can be a neural network-based method.
  • the method includes processing an image $I$ through an encoder (FIG. 1) to produce image feature vectors for regions of the image $I$ and determining a global image feature vector from the image feature vectors.
  • the encoder can be a convolutional neural network (abbreviated CNN), as shown in FIG. 1.
  • the method includes processing words through a decoder (FIGs. 2B and 6) by beginning at an initial timestep with a start-of-caption token <start> and the global image feature vector and continuing in successive timesteps using a most recently emitted caption word and the global image feature vector as input to the decoder.
  • the decoder can be a long short-term memory network (abbreviated LSTM), as shown in FIGs. 2B and 6.
  • the method includes, at each timestep, using at least a current hidden state of the decoder $h_t$ to determine unnormalized attention values for the image feature vectors and exponentially normalizing the attention values to produce attention probability masses.
  • the method includes applying the attention probability masses to the image feature vectors to accumulate in an image context vector $c_t$ a weighted sum of the image feature vectors.
  • the method includes submitting the image context vector $c_t$ and the current hidden state of the decoder $h_t$ to a feed-forward neural network and causing the feed-forward neural network to emit a next caption word $w_t$.
  • the feed-forward neural network can be a multilayer perceptron (abbreviated MLP).
  • the method includes repeating the processing of words through the decoder, the using, the applying, and the submitting until the caption word emitted is an end-of-caption token <end>.
  • the iterations are performed by a controller, shown in FIG. 25.
  • implementations may include a non-transitory computer readable storage medium (CRM) storing instructions executable by a processor to perform the method described above.
  • implementations may include a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform the method described above.
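The iteration performed by the controller in the method above can be sketched as a simple loop. This is an illustration only: `generate_caption`, the `step` callback, and the `max_len` safety cap are hypothetical names and choices not stated in the text, and the scripted `toy_step` stands in for the decoder, attention, and feed-forward network.

```python
def generate_caption(step, initial_state, max_len=20,
                     start="<start>", end="<end>"):
    """Repeat one decoding step until the end-of-caption token is emitted.

    step(prev_word, state) -> (next_word, new_state) is assumed to wrap
    the decoder, the attention computation, and the feed-forward network.
    max_len is an added safety cap, not part of the described method.
    """
    words, state, prev_word = [], initial_state, start
    for _ in range(max_len):
        next_word, state = step(prev_word, state)
        if next_word == end:
            break
        words.append(next_word)
        prev_word = next_word   # most recently emitted word feeds the next step
    return words

def toy_step(prev_word, state):
    """Scripted stand-in for a trained model; state is just a position."""
    script = ["a", "dog", "runs", "<end>"]
    return script[state], state + 1

caption = generate_caption(toy_step, initial_state=0)
print(" ".join(caption))
```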
  • the technology disclosed presents a method of machine generation of a natural language caption for an image.
  • the method can be a computer-implemented method.
  • the method can be a neural network-based method.
  • the method includes using current hidden state information $h_t$ of an attention lagging decoder (FIGs. 2B and 6) to generate an attention map for image feature vectors produced by an encoder from an image.
  • implementations may include a non-transitory computer readable storage medium (CRM) storing instructions executable by a processor to perform the method described above.
  • implementations may include a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform the method described above.
  • the technology disclosed presents a method of machine generation of a natural language caption for an image.
  • This method uses a visually hermetic LSTM.
  • the method can be a computer-implemented method.
  • the method can be a neural network-based method.
  • the method includes processing an image $I$ through an encoder (FIG. 1) to produce image feature vectors for $k$ regions of the image $I$.
  • the encoder can be a convolutional neural network (abbreviated CNN).
  • the method includes processing words through a decoder by beginning at an initial timestep with a start-of-caption token <start> and continuing in successive timesteps using a most recently emitted caption word as input to the decoder.
  • the decoder can be a visually hermetic long short-term memory network (abbreviated LSTM), shown in FIGs. 14 and 15.
  • the method includes, at each timestep, using at least a current hidden state $h_t$ of the decoder to determine, from the image feature vectors, an image context vector $c_t$.
  • the method includes not supplying the image context vector $c_t$ to the decoder.
  • the method includes submitting the image context vector $c_t$ and the current hidden state of the decoder $h_t$ to a feed-forward neural network and causing the feed-forward neural network to emit a caption word.
  • the method includes repeating the processing of words through the decoder, the using, the not supplying, and the submitting until the caption word emitted is an end-of-caption token <end>.
  • implementations may include a non-transitory computer readable storage medium (CRM) storing instructions executable by a processor to perform the method described above.
  • implementations may include a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform the method described above.
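The difference between the visually hermetic decoder input and the image-conditioned decoder input can be sketched as below. This sketch assumes that "combination" of a word and the global image feature means vector concatenation, which the text does not confirm; `prepare_decoder_input` and the embedding sizes are hypothetical.

```python
import numpy as np

def prepare_decoder_input(word_embedding, global_image_feature=None):
    """Build the decoder input for one timestep.

    With global_image_feature=None the decoder is visually hermetic:
    it sees only the most recently emitted word (or <start>).
    Otherwise the word is combined with the global image feature,
    here by concatenation (an assumption).
    """
    if global_image_feature is None:
        return word_embedding
    return np.concatenate([word_embedding, global_image_feature])

word = np.ones(4)            # toy embedding of the previous caption word
global_feature = np.zeros(3) # toy global image feature

hermetic_input = prepare_decoder_input(word)
conditioned_input = prepare_decoder_input(word, global_feature)
print(hermetic_input.shape, conditioned_input.shape)
```

In the hermetic case the image still influences the emitted word, but only through the image context vector that is submitted to the feed-forward network alongside the decoder hidden state.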
  • FIG. 12 shows the disclosed adaptive attention model for image captioning rolled across multiple timesteps.
  • the sentinel LSTM (Sn-LSTM) of FIG. 8 is embodied in and implemented by the adaptive attention model as a decoder.
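  • FIG. 13 illustrates one implementation of image captioning using adaptive attention applied by the adaptive attention model of FIG. 12.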
  • the technology disclosed presents a system that performs the image captioning of FIGs. 12 and 13.
  • the system includes numerous parallel processors coupled to memory.
  • the memory is loaded with computer instructions to automatically caption an image.
  • the instructions, when executed on the parallel processors, implement the following actions.
  • the actions include mixing results of an image encoder (FIG. 1) and a language decoder (FIG. 8) to emit a sequence of caption words for an input image $I$.
  • the mixing is governed by a gate probability mass/sentinel gate mass $\beta_t$ determined from a visual sentinel vector $s_t$ of the language decoder and a current hidden state vector $h_t$ of the language decoder.
  • the image encoder can be a convolutional neural network (abbreviated CNN).
  • the language decoder can be a sentinel long short-term memory network (abbreviated Sn-LSTM), as shown in FIGs. 8 and 9.
  • the language decoder can be a sentinel bi-directional long short-term memory network (abbreviated Sn-Bi-LSTM).
  • the language decoder can be a sentinel gated recurrent unit network (abbreviated Sn-GRU).
  • the language decoder can be a sentinel quasi-recurrent neural network (abbreviated Sn-QRNN).
  • Determining the results of the language decoder by processing words through the language decoder includes - (1) beginning at an initial timestep with a start-of-caption token <start> and the global image feature vector, (2) continuing in successive timesteps using a most recently emitted caption word and the global image feature vector as input to the language decoder, and (3) at each timestep, generating a visual sentinel vector $s_t$ that combines the most recently emitted caption word, the global image feature vector, a previous hidden state vector of the language decoder, and memory contents of the language decoder.
  • the adaptive context vector $\hat{c}_t$ is embodied in and implemented by the mixer of the adaptive attender, shown in FIGs. 11 and 13.
  • the system can be a computer-implemented system.
  • the system can be a neural network-based system.
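  • the adaptive context vector at timestep $t$ can be determined as $\hat{c}_t = \beta_t s_t + (1 - \beta_t) c_t$, where $\hat{c}_t$ denotes the adaptive context vector and $c_t$ denotes the image context vector.
  • $s_t$ denotes the visual sentinel vector.
  • $\beta_t$ denotes the gate probability mass/sentinel gate mass.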
  • the visual sentinel vector $s_t$ can encode visual sentinel information that includes visual context determined from the global image feature vector and textual context determined from previously emitted caption words.
  • the gate probability mass/sentinel gate mass $\beta_t$ being unity can result in the adaptive context vector $\hat{c}_t$ being equal to the visual sentinel vector $s_t$.
  • in such a case, the next caption word $w_t$ is emitted only in dependence upon the visual sentinel information.
  • the image context vector $c_t$ can encode spatial image information conditioned on the current hidden state vector $h_t$ of the language decoder.
  • the gate probability mass/sentinel gate mass $\beta_t$ being zero can result in the adaptive context vector $\hat{c}_t$ being equal to the image context vector $c_t$.
  • in such a case, the next caption word $w_t$ is emitted only in dependence upon the spatial image information.
  • the gate probability mass/sentinel gate mass $\beta_t$ can be a scalar value between unity and zero that enhances when the next caption word $w_t$ is a visual word and diminishes when the next caption word $w_t$ is a non-visual word or linguistically correlated to the previously emitted caption word $w_{t-1}$.
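The two limiting cases stated above (a gate mass of unity yields the visual sentinel, a gate mass of zero yields the image context) are consistent with a linear interpolation between the two vectors. The sketch below assumes that linear form; the function name `adaptive_context` and the toy vectors are hypothetical.

```python
import numpy as np

def adaptive_context(beta_t, s_t, c_t):
    """Mix the visual sentinel s_t and the image context c_t.

    Assumed linear form: beta_t * s_t + (1 - beta_t) * c_t,
    with beta_t a scalar in [0, 1].
    """
    return beta_t * s_t + (1.0 - beta_t) * c_t

s_t = np.array([1.0, 0.0, 2.0])   # toy visual sentinel vector
c_t = np.array([0.0, 4.0, 2.0])   # toy image context vector

only_sentinel = adaptive_context(1.0, s_t, c_t)   # equals s_t
only_image = adaptive_context(0.0, s_t, c_t)      # equals c_t
blended = adaptive_context(0.25, s_t, c_t)        # mostly image context
print(only_sentinel, only_image, blended)
```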
  • the system can further comprise a trainer (FIG. 25), which in turn further comprises a preventer (FIG. 25).
  • the preventer prevents, during training, backpropagation of gradients from the language decoder to the image encoder when the next caption word is a non-visual word or linguistically correlated to the previously emitted caption word.
  • the trainer and the preventer can each run on at least one of the numerous parallel processors.
  • implementations may include a non-transitory computer readable storage medium storing instructions executable by a processor to perform actions of the system described above.
  • the technology disclosed presents a method of automatic image captioning.
  • the method can be a computer-implemented method.
  • the method can be a neural network-based method.
  • the method includes mixing results of an image encoder (FIG. 1) and a language decoder (FIGs. 8 and 9) to emit a sequence of caption words for an input image $I$.
  • the mixing is embodied in and implemented by the mixer of the adaptive attender of FIG. 11.
  • the mixing is governed by a gate probability mass (also referred to herein as the sentinel gate mass) determined from a visual sentinel vector of the language decoder and a current hidden state vector of the language decoder.
  • the image encoder can be a convolutional neural network (abbreviated CNN).
  • the language decoder can be a sentinel long short-term memory network (abbreviated Sn-LSTM).
  • the language decoder can be a sentinel bi-directional long short-term memory network (abbreviated Sn-Bi-LSTM).
  • the language decoder can be a sentinel gated recurrent unit network (abbreviated Sn-GRU).
  • the language decoder can be a sentinel quasi-recurrent neural network (abbreviated Sn-QRNN).
  • the method includes determining the results of the image encoder by processing the image through the image encoder to produce image feature vectors for regions of the image and computing a global image feature vector from the image feature vectors.
  • the method includes determining the results of the language decoder by processing words through the language decoder. This includes - (1) beginning at an initial timestep with a start-of-caption token <start> and the global image feature vector, (2) continuing in successive timesteps using a most recently emitted caption word $w_{t-1}$ and the global image feature vector as input to the language decoder, and (3) at each timestep, generating a visual sentinel vector that combines the most recently emitted caption word, the global image feature vector, a previous hidden state vector of the language decoder, and memory contents of the language decoder.
  • the method includes, at each timestep, using at least a current hidden state vector of the language decoder to determine unnormalized attention values for the image feature vectors and an unnormalized gate value for the visual sentinel vector.
  • the method includes concatenating the unnormalized attention values and the unnormalized gate value and exponentially normalizing the concatenated attention and gate values to produce a vector of attention probability masses and the gate probability mass/sentinel gate mass.
  • the method includes applying the attention probability masses to the image feature vectors to accumulate in an image context vector $c_t$ a weighted sum of the image feature vectors.
  • the method includes determining an adaptive context vector $\hat{c}_t$ as a mix of the image context vector and the visual sentinel vector $s_t$ according to the gate probability mass/sentinel gate mass $\beta_t$.
  • the method includes submitting the adaptive context vector $\hat{c}_t$ and the current hidden state of the language decoder $h_t$ to a feed-forward neural network (MLP) and causing the feed-forward neural network to emit a next caption word $w_t$.
  • the method includes repeating the processing of words through the language decoder, the using, the concatenating, the applying, the determining, and the submitting until the next caption word emitted is an end-of-caption token <end>.
  • the iterations are performed by a controller, shown in FIG. 25.
  • implementations may include a non-transitory computer readable storage medium (CRM) storing instructions executable by a processor to perform the method described above.
  • implementations may include a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform the method described above.
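The concatenation and joint exponential normalization step of the method above can be sketched as follows. The sketch is illustrative: `attention_and_gate` is a hypothetical name, the score values are made up, and the unnormalized values are assumed to have been computed already from the current hidden state.

```python
import numpy as np

def softmax(z):
    """Exponentially normalize a vector of scores."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def attention_and_gate(attention_values, gate_value):
    """Concatenate k unnormalized attention values with one unnormalized
    gate value and normalize them jointly.

    Returns the k attention probability masses and the gate probability
    mass (the last element of the normalized vector).
    """
    extended = softmax(np.append(attention_values, gate_value))
    return extended[:-1], extended[-1]

z_t = np.array([2.0, 0.5, -1.0])          # toy scores for k = 3 regions
alpha_t, beta_t = attention_and_gate(z_t, gate_value=1.0)
print(alpha_t.round(3), round(float(beta_t), 3))
```

Because the normalization is joint, the attention probability masses and the gate probability mass together sum to one, so a larger gate mass necessarily leaves less mass for the image regions.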
  • the technology disclosed presents an automated image captioning system. The system runs on numerous parallel processors.
  • the system comprises a convolutional neural network (abbreviated CNN) encoder (FIG. 11).
  • the CNN encoder can run on at least one of the numerous parallel processors.
  • the CNN encoder processes an input image through one or more convolutional layers to generate image features by image regions that represent the image.
  • the system comprises a sentinel long short-term memory network (abbreviated Sn-LSTM) decoder (FIG. 8).
  • the Sn-LSTM decoder can run on at least one of the numerous parallel processors.
  • the Sn-LSTM decoder processes a previously emitted caption word combined with the image features to emit a sequence of caption words over successive timesteps.
  • the system comprises an adaptive attender (FIG. 11).
  • the adaptive attender can run on at least one of the numerous parallel processors.
  • the adaptive attender spatially attends to the image features and produces an image context conditioned on a current hidden state of the Sn-LSTM decoder.
  • the adaptive attender extracts, from the Sn-LSTM decoder, a visual sentinel that includes visual context determined from previously processed image features and textual context determined from previously emitted caption words.
  • the adaptive attender mixes the image context $c_t$ and the visual sentinel $s_t$ for emittance of the next caption word $w_t$.
  • the mixing is governed by a sentinel gate mass $\beta_t$ determined from the visual sentinel $s_t$ and the current hidden state of the Sn-LSTM decoder $h_t$.
  • the system can be a computer-implemented system.
  • the system can be a neural network-based system.
  • the adaptive attender enhances attention directed to the image context when a next caption word is a visual word, as shown in FIGs. 16, 18, and 19.
  • the adaptive attender enhances attention directed to the visual sentinel when a next caption word is a non-visual word or linguistically correlated to the previously emitted caption word, as shown in FIGs. 16, 18, and 19.
  • the system can further comprise a trainer, which in turn further comprises a preventer.
  • the preventer prevents, during training, backpropagation of gradients from the Sn-LSTM decoder to the CNN encoder when a next caption word is a non-visual word or linguistically correlated to the previously emitted caption word.
  • the trainer and the preventer can each run on at least one of the numerous parallel processors.
  • implementations may include a non-transitory computer readable storage medium storing instructions executable by a processor to perform actions of the system described above.
  • the technology disclosed presents an automated image captioning system.
  • the system runs on numerous parallel processors.
  • the system can be a computer-implemented system.
  • the system can be a neural network-based system.
  • the system comprises an image encoder (FIG. 1).
  • the image encoder can run on at least one of the numerous parallel processors.
  • the image encoder processes an input image through a convolutional neural network (abbreviated CNN) to generate an image representation.
  • the system comprises a language decoder (FIG. 8).
  • the language decoder can run on at least one of the numerous parallel processors.
  • the language decoder processes a previously emitted caption word combined with the image representation through a recurrent neural network (abbreviated RNN) to emit a sequence of caption words.
  • the system comprises an adaptive attender (FIG. 11).
  • the adaptive attender can run on at least one of the numerous parallel processors.
  • the adaptive attender enhances attention directed to the image representation when a next caption word is a visual word.
  • the adaptive attender enhances attention directed to memory contents of the language decoder when the next caption word is a non-visual word or linguistically correlated to the previously emitted caption word.
  • implementations may include a non-transitory computer readable storage medium storing instructions executable by a processor to perform actions of the system described above.
  • the technology disclosed presents an automated image captioning system.
  • the system runs on numerous parallel processors.
  • the system can be a computer-implemented system.
  • the system can be a neural network-based system.
  • the system comprises an image encoder (FIG. 1).
  • the image encoder can run on at least one of the numerous parallel processors.
  • the image encoder processes an input image through a convolutional neural network (abbreviated CNN) to generate an image representation.
  • the system comprises a language decoder (FIG. 8).
  • the language decoder can run on at least one of the numerous parallel processors.
  • the language decoder processes a previously emitted caption word combined with the image representation through a recurrent neural network (abbreviated RNN) to emit a sequence of caption words.
  • the system comprises a sentinel gate mass/gate probability mass $\beta_t$.
  • the sentinel gate mass can run on at least one of the numerous parallel processors.
  • the sentinel gate mass controls accumulation of the image representation and memory contents of the language decoder for next caption word emittance.
  • the sentinel gate mass is determined from a visual sentinel of the language decoder and a current hidden state of the language decoder.
  • implementations may include a non-transitory computer readable storage medium storing instructions executable by a processor to perform actions of the system described above.
  • the technology disclosed presents a system that automates a task.
  • the system runs on numerous parallel processors.
  • the system can be a computer-implemented system.
  • the system can be a neural network-based system.
  • the system comprises an encoder.
  • the encoder can run on at least one of the numerous parallel processors.
  • the encoder processes an input through at least one neural network to generate an encoded representation.
  • the system comprises a decoder.
  • the decoder can run on at least one of the numerous parallel processors.
  • the decoder processes a previously emitted output combined with the encoded representation through at least one neural network to emit a sequence of outputs.
  • the system comprises an adaptive attender.
  • the adaptive attender can run on at least one of the numerous parallel processors.
  • the adaptive attender uses a sentinel gate mass to mix the encoded representation and memory contents of the decoder for emitting a next output.
  • the sentinel gate mass is determined from the memory contents of the decoder and a current hidden state of the decoder.
  • the sentinel gate mass can run on at least one of the numerous parallel processors.
  • when the task is text summarization, the system comprises a first recurrent neural network (abbreviated RNN) as the encoder that processes an input document to generate a document encoding and a second RNN as the decoder that uses the document encoding to emit a sequence of summary words.
  • when the task is question answering, the system comprises a first RNN as the encoder that processes an input question to generate a question encoding and a second RNN as the decoder that uses the question encoding to emit a sequence of answer words.
  • when the task is machine translation, the system comprises a first RNN as the encoder that processes a source language sequence to generate a source encoding and a second RNN as the decoder that uses the source encoding to emit a target language sequence of translated words.
  • when the task is video captioning, the system comprises a combination of a convolutional neural network (abbreviated CNN) and a first RNN as the encoder that processes video frames to generate a video encoding and a second RNN as the decoder that uses the video encoding to emit a sequence of caption words.
  • when the task is image captioning, the system comprises a CNN as the encoder that processes an input image to generate an image encoding and an RNN as the decoder that uses the image encoding to emit a sequence of caption words.
  • the system can determine an alternative representation of the input from the encoded representation. The system can then use the alternative representation, instead of the encoded representation, for processing by the decoder and mixing by the adaptive attender.
  • the alternative representation can be a weighted summary of the encoded representation conditioned on the current hidden state of the decoder.
  • the alternative representation can be an averaged summary of the encoded representation.
  • implementations may include a non-transitory computer readable storage medium storing instructions executable by a processor to perform actions of the system described above.
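The two alternative representations mentioned above (an averaged summary and a weighted summary conditioned on the decoder's current hidden state) can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function names are hypothetical, and scoring each encoded vector by its dot product with the hidden state is an assumption made only to keep the example small.

```python
import math

def softmax(scores):
    """Exponentially normalize a list of scores into probabilities."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def averaged_summary(encoded):
    """Element-wise mean of the encoded vectors."""
    n = len(encoded)
    return [sum(column) / n for column in zip(*encoded)]

def weighted_summary(encoded, hidden):
    """Weighted mean of the encoded vectors, conditioned on the hidden state.

    Assumption: weights come from a softmax over dot products with `hidden`.
    """
    scores = [sum(e * h for e, h in zip(vec, hidden)) for vec in encoded]
    weights = softmax(scores)
    dim = len(encoded[0])
    return [sum(w * vec[d] for w, vec in zip(weights, encoded)) for d in range(dim)]

# Hypothetical toy data: three encoded vectors and one decoder hidden state.
encoded = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden = [2.0, 0.0]
avg = averaged_summary(encoded)
wtd = weighted_summary(encoded, hidden)
```

With this toy hidden state, the weighted summary leans toward the encoded vectors that align with the first dimension, whereas the averaged summary treats all three equally.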
  • the technology disclosed presents a system for machine generation of a natural language caption for an input image $I$.
  • the system runs on numerous parallel processors.
  • the system can be a computer-implemented system.
  • the system can be a neural network-based system.
  • FIG. 10 depicts the disclosed adaptive attention model for image captioning that automatically decides how heavily to rely on visual information, as opposed to linguistic information, to emit a next caption word.
  • the sentinel LSTM (Sn-LSTM) of FIG. 8 is embodied in and implemented by the adaptive attention model as a decoder.
  • FIG. 11 depicts one implementation of modules of an adaptive attender that is part of the adaptive attention model disclosed in FIG. 12.
  • the adaptive attender comprises a spatial attender, an extractor, a sentinel gate mass determiner, a sentinel gate mass softmax, and a mixer (also referred to herein as an adaptive context vector producer or an adaptive context producer).
  • the spatial attender in turn comprises an adaptive comparator, an adaptive attender softmax, and an adaptive convex combination accumulator.
  • the system comprises a convolutional neural network (abbreviated CNN) encoder (FIG. 1) for processing the input image through one or more convolutional layers to generate image features for $k$ image regions that represent the image $I$.
  • the CNN encoder can run on at least one of the numerous parallel processors.
  • the system comprises a sentinel long short-term memory network (abbreviated Sn-LSTM) decoder (FIG. 8) for processing a previously emitted caption word $w_{t-1}$ combined with the image features to produce a current hidden state $h_t$ of the Sn-LSTM decoder at each decoder timestep.
  • the Sn-LSTM decoder can run on at least one of the numerous parallel processors.
  • the system comprises an adaptive attender, shown in FIG. 11.
  • the adaptive attender can run on at least one of the numerous parallel processors.
  • the adaptive attender further comprises a spatial attender (FIGs. 11 and 13) for spatially attending to the image features to produce an image context $c_t$ at each decoder timestep.
  • the adaptive attender further comprises an extractor (FIGs. 11 and 13) for extracting, from the Sn-LSTM decoder, a visual sentinel $s_t$ at each decoder timestep.
  • the visual sentinel $s_t$ includes visual context determined from previously processed image features and textual context determined from previously emitted caption words.
  • the adaptive attender further comprises a mixer (FIGs. 11 and 13) for mixing the image context $c_t$ and the visual sentinel $s_t$ to produce an adaptive context $\hat{c}_t$ at each decoder timestep. The mixing is governed by a sentinel gate mass $\beta_t$ determined from the visual sentinel $s_t$ and the current hidden state $h_t$ of the Sn-LSTM decoder.
  • the spatial attender, the extractor, and the mixer can each run on at least one of the numerous parallel processors.
  • the system comprises an emitter (FIGs. 5 and 13) for generating the natural language caption for the input image $I$ based on the adaptive contexts $\hat{c}_t$ produced over successive decoder timesteps by the mixer.
  • the emitter can run on at least one of the numerous parallel processors.
  • the Sn-LSTM decoder can further comprise an auxiliary sentinel gate (FIG. 8) for producing the visual sentinel $s_t$ at each decoder timestep.
  • the auxiliary sentinel gate can run on at least one of the numerous parallel processors.
  • the adaptive attender can further comprise a sentinel gate mass softmax (FIGs. 11 and 13) for exponentially normalizing attention values of the image features and a gate value of the visual sentinel to produce an adaptive sequence of attention probability masses and the sentinel gate mass $\beta_t$ at each decoder timestep.
  • the sentinel gate mass softmax can run on at least one of the numerous parallel processors.
  • the adaptive sequence is the attention distribution over both the spatial image features and the visual sentinel vector. The weight parameter $W_g$ used to determine it can be the same weight parameter as in equation (6).
  • the last element of the adaptive sequence is the sentinel gate mass $\beta_t$.
  • the probability over a vocabulary of possible words at time $t$ can be determined by the vocabulary softmax of the emitter (FIG. 5), where $W_p$ is a weight parameter that is learned.
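For orientation, one plausible way to write the adaptive sequence, the sentinel gate mass, the adaptive context, and the vocabulary distribution is sketched below. Several symbols are assumptions that this passage does not define: $z_t$ stands for the vector of $k$ attention values over the image features, $w_h$ and $W_s$ stand for learned weights, and $\hat{\alpha}_t$ names the adaptive sequence. The additive form inside the vocabulary softmax and the convex form of the mixing are likewise assumed rather than quoted.

```latex
\begin{aligned}
\hat{\alpha}_t &= \operatorname{softmax}\!\big(\big[\, z_t \,;\; w_h^{\top} \tanh(W_s s_t + W_g h_t) \,\big]\big) \in \mathbb{R}^{k+1} \\
\beta_t &= \hat{\alpha}_t[k+1] \\
\hat{c}_t &= \beta_t\, s_t + (1 - \beta_t)\, c_t \\
p_t &= \operatorname{softmax}\!\big(W_p\,(\hat{c}_t + h_t)\big)
\end{aligned}
```

Under this reading, a sentinel gate mass near 1 makes the emitter rely mostly on the visual sentinel, and a value near 0 makes it rely mostly on the image context.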
  • the adaptive attender can further comprise a sentinel gate mass determiner (FIGs. 11 and 13) for producing at each decoder timestep the sentinel gate mass $\beta_t$ as a result of interaction between the current decoder hidden state $h_t$ and the visual sentinel $s_t$.
  • the sentinel gate mass determiner can run on at least one of the numerous parallel processors.
  • the spatial attender can further comprise an adaptive comparator (FIGs. 11 and 13) for producing at each decoder timestep the attention values as a result of interaction between the current decoder hidden state $h_t$ and the image features.
  • the adaptive comparator can run on at least one of the numerous parallel processors.
  • the attention and gate values are determined by processing the current decoder hidden state $h_t$, the image features, and the sentinel state vector $s_t$ through a dot producter or inner producter.
  • the spatial attender can further comprise an adaptive attender softmax (FIGs. 11 and 13) for exponentially normalizing the attention values for the image features to produce the attention probability masses at each decoder timestep.
  • the adaptive attender softmax can run on at least one of the numerous parallel processors.
  • the spatial attender can further comprise an adaptive convex combination accumulator (also referred to herein as mixer or adaptive context producer or adaptive context vector producer) (FIGs. 11 and 13) for accumulating, at each decoder timestep, the image context as a convex combination of the image features scaled by attention probability masses determined using the current decoder hidden state.
  • the sentinel gate mass can run on at least one of the numerous parallel processors.
  • the system can further comprise a trainer (FIG. 25).
  • the trainer in turn further comprises a preventer for preventing backpropagation of gradients from the Sn-LSTM decoder to the CNN encoder when a next caption word is a non-visual word or linguistically correlated to a previously emitted caption word.
  • the trainer and the preventer can each run on at least one of the numerous parallel processors.
  • the adaptive attender further comprises the sentinel gate mass/gate probability mass $\beta_t$ for enhancing attention directed to the image context when a next caption word is a visual word.
  • the adaptive attender further comprises the sentinel gate mass/gate probability mass $\beta_t$ for enhancing attention directed to the visual sentinel when a next caption word is a non-visual word or linguistically correlated to the previously emitted caption word.
  • the sentinel gate mass can run on at least one of the numerous parallel processors.
  • Other implementations may include a non-transitory computer readable storage medium storing instructions executable by a processor to perform actions of the system described above.
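The modules of the adaptive attender described above (adaptive comparator, adaptive attender softmax, convex combination accumulator, sentinel gate mass determiner, sentinel gate mass softmax, and mixer) can be tied together in a short sketch of one decoder timestep. It assumes the dot-product variant of the comparator and a convex mix of the visual sentinel and the image context; the function and variable names are hypothetical, and the toy vectors are made up for illustration.

```python
import math

def softmax(scores):
    """Exponentially normalize a list of scores into probabilities."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def adaptive_attend(image_features, hidden, sentinel):
    """One decoder timestep of an adaptive attender (dot-product comparator assumed)."""
    # Adaptive comparator: one attention value per image region.
    attention_values = [dot(v, hidden) for v in image_features]
    # Adaptive attender softmax: attention probability masses over the k regions.
    alpha = softmax(attention_values)
    # Convex combination accumulator: image context from the weighted features.
    dim = len(image_features[0])
    image_context = [sum(a * v[d] for a, v in zip(alpha, image_features))
                     for d in range(dim)]
    # Sentinel gate mass determiner: gate value from hidden state and sentinel.
    gate_value = dot(sentinel, hidden)
    # Sentinel gate mass softmax: normalize the k attention values plus the gate value.
    adaptive_sequence = softmax(attention_values + [gate_value])
    beta = adaptive_sequence[-1]  # last element is the sentinel gate mass
    # Mixer (assumed convex form): beta weights the sentinel, 1 - beta the image context.
    adaptive_context = [beta * s + (1.0 - beta) * c
                        for s, c in zip(sentinel, image_context)]
    return alpha, image_context, beta, adaptive_context

# Hypothetical toy inputs: k = 2 image regions with 2-dimensional features.
image_features = [[1.0, 0.0], [0.0, 1.0]]
hidden = [1.0, 0.5]
sentinel = [0.2, 0.2]
alpha, image_context, beta, adaptive_context = adaptive_attend(
    image_features, hidden, sentinel)
```

Because the gate value competes with the attention values inside a single softmax, raising it automatically lowers the share of probability mass left for the image regions.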
  • the technology disclosed presents a recurrent neural network system (abbreviated RNN).
  • the RNN runs on numerous parallel processors.
  • the RNN can be a computer-implemented system.
  • the RNN comprises a sentinel long short-term memory network (abbreviated Sn-LSTM) that receives inputs at each of a plurality of timesteps.
  • the inputs include at least an input for a current timestep, a hidden state from a previous timestep, and an auxiliary input for the current timestep.
  • the Sn-LSTM can run on at least one of the numerous parallel processors.
  • the RNN generates outputs at each of the plurality of timesteps by processing the inputs through gates of the Sn-LSTM.
  • the gates include at least an input gate, a forget gate, an output gate, and an auxiliary sentinel gate. Each of the gates can run on at least one of the numerous parallel processors.
  • the RNN stores in a memory cell of the Sn-LSTM auxiliary information accumulated over time from (1) processing of the inputs by the input gate, the forget gate, and the output gate and (2) updating of the memory cell with gate outputs produced by the input gate, the forget gate, and the output gate.
  • the memory cell can be maintained and persisted in a database (FIG. 9).
  • the auxiliary sentinel gate modulates the stored auxiliary information from the memory cell for next prediction.
  • the modulation is conditioned on the input for the current timestep, the hidden state from the previous timestep, and the auxiliary input for the current timestep.
  • the auxiliary input can be visual input comprising image data and the input can be a text embedding of a most recently emitted word and/or character.
  • the auxiliary input can be a text encoding from another long short-term memory network (abbreviated LSTM) of an input document and the input can be a text embedding of a most recently emitted word and/or character.
  • the auxiliary input can be a hidden state vector from another LSTM that encodes sequential data and the input can be a text embedding of a most recently emitted word and/or character.
  • the auxiliary input can be a prediction derived from a hidden state vector from another LSTM that encodes sequential data and the input can be a text embedding of a most recently emitted word and/or character.
  • the auxiliary input can be an output of a convolutional neural network (abbreviated CNN).
  • the auxiliary input can be an output of an attention network.
  • the prediction can be a classification label embedding.
  • the Sn-LSTM can be further configured to receive multiple auxiliary inputs at a timestep, with at least one auxiliary input comprising concatenated vectors.
  • the auxiliary input can be received only at an initial timestep.
  • the auxiliary sentinel gate can produce a sentinel state at each timestep as an indicator of the modulated auxiliary information.
  • the outputs can comprise at least a hidden state for the current timestep and a sentinel state for the current timestep.
  • the RNN can be further configured to use at least the hidden state for the current timestep and the sentinel state for the current timestep for making the next prediction.
  • the inputs can further include a bias input and a previous state of the memory cell.
  • the Sn-LSTM can further include an input activation function.
  • the auxiliary sentinel gate can gate a pointwise hyperbolic tangent (abbreviated tanh) of the memory cell.
  • the auxiliary sentinel gate at the current timestep $t$ can be defined as $\mathrm{aux}_t = \sigma(W_x x_t + W_h h_{t-1})$, where $W_x$ and $W_h$ are weight parameters to be learned, $x_t$ is the input for the current timestep, $h_{t-1}$ is the hidden state from the previous timestep, $\mathrm{aux}_t$ is the auxiliary sentinel gate applied on the memory cell $m_t$, and $\sigma$ denotes logistic sigmoid activation.
  • the sentinel state/visual sentinel at the current timestep $t$ is defined as $s_t = \mathrm{aux}_t \odot \tanh(m_t)$, where $s_t$ is the sentinel state, $\mathrm{aux}_t$ is the auxiliary sentinel gate applied on the memory cell $m_t$, $\odot$ represents element-wise product, and $\tanh$ denotes hyperbolic tangent activation.
  • implementations may include a non-transitory computer readable storage medium storing instructions executable by a processor to perform actions of the system described above.
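The auxiliary sentinel gate and the sentinel state described in this group can be sketched numerically. The sketch assumes the gate is a logistic sigmoid of a learned linear function of the current input and the previous hidden state, and that the sentinel state is the element-wise product of that gate with the hyperbolic tangent of the memory cell. The helper names and the toy weights are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def sentinel_state(x_t, h_prev, m_t, W_x, W_h):
    """Return (gate, s_t) for one timestep.

    gate = sigmoid(W_x x_t + W_h h_prev)   (assumed additive form)
    s_t  = gate * tanh(m_t), element-wise
    """
    pre_activation = [a + b for a, b in zip(matvec(W_x, x_t), matvec(W_h, h_prev))]
    gate = [sigmoid(p) for p in pre_activation]
    s_t = [g * math.tanh(m) for g, m in zip(gate, m_t)]
    return gate, s_t

# Hypothetical toy values: 2-dimensional input, hidden state, and memory cell.
W_x = [[1.0, 0.0], [0.0, 1.0]]
W_h = [[0.0, 0.0], [0.0, 0.0]]
gate, s_t = sentinel_state([1.0, -1.0], [0.5, 0.0], [2.0, -3.0], W_x, W_h)
```

Since the gate lies strictly between 0 and 1 and the hyperbolic tangent lies strictly between -1 and 1, every component of the sentinel state stays within (-1, 1).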
  • the technology disclosed presents a sentinel long short-term memory network (abbreviated Sn-LSTM) that processes auxiliary input combined with input and previous hidden state.
  • the Sn-LSTM runs on numerous parallel processors.
  • the Sn-LSTM can be a computer-implemented system.
  • the Sn-LSTM comprises an auxiliary sentinel gate that applies on a memory cell of the Sn-LSTM and modulates use of auxiliary information during next prediction.
  • the auxiliary information is accumulated over time in the memory cell at least from the processing of the auxiliary input combined with the input and the previous hidden state.
  • the auxiliary sentinel gate can run on at least one of the numerous parallel processors.
  • the memory cell can be maintained and persisted in a database (FIG. 9).
  • the auxiliary sentinel gate can produce a sentinel state at each timestep as an indicator of the modulated auxiliary information, conditioned on an input for a current timestep, a hidden state from a previous timestep, and an auxiliary input for the current timestep.
  • the auxiliary sentinel gate can gate a pointwise hyperbolic tangent (abbreviated tanh) of the memory cell.
  • implementations may include a non-transitory computer readable storage medium storing instructions executable by a processor to perform actions of the system described above.
  • the technology disclosed presents a method of extending a long short-term memory network (abbreviated LSTM).
  • the method can be a computer-implemented method.
  • the method can be a neural network-based method.
  • the method includes extending a long short-term memory network (abbreviated LSTM) to include an auxiliary sentinel gate.
  • the auxiliary sentinel gate applies on a memory cell of the LSTM and modulates use of auxiliary information during next prediction.
  • the auxiliary information is accumulated over time in the memory cell at least from the processing of auxiliary input combined with current input and previous hidden state.
  • the auxiliary sentinel gate can produce a sentinel state at each timestep as an indicator of the modulated auxiliary information, conditioned on an input for a current timestep, a hidden state from a previous timestep, and an auxiliary input for the current timestep.
  • the auxiliary sentinel gate can gate a pointwise hyperbolic tangent (abbreviated tanh) of the memory cell.
  • implementations may include a non-transitory computer readable storage medium (CRM) storing instructions executable by a processor to perform the method described above.
  • implementations may include a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform the method described above.
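The method of extending an LSTM with an auxiliary sentinel gate can be illustrated by subclassing a plain LSTM cell. This is a sketch under stated assumptions: the base cell uses the standard textbook gate equations with biases omitted, the class names and the weight layout are hypothetical, and the auxiliary input is assumed to be concatenated into the input vector before the step.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, U, x, h):
    """Compute W x + U h for list-of-rows matrices W and U."""
    return [sum(w * xi for w, xi in zip(w_row, x)) +
            sum(u * hi for u, hi in zip(u_row, h))
            for w_row, u_row in zip(W, U)]

class LSTMCell:
    """Plain LSTM cell (textbook gates, biases omitted for brevity)."""

    def __init__(self, weights):
        # weights maps a gate name to a (W, U) pair: input and recurrent matrices
        self.weights = weights

    def gate(self, name, x, h, activation):
        W, U = self.weights[name]
        return [activation(a) for a in affine(W, U, x, h)]

    def step(self, x, h_prev, m_prev):
        i = self.gate("input", x, h_prev, sigmoid)
        f = self.gate("forget", x, h_prev, sigmoid)
        o = self.gate("output", x, h_prev, sigmoid)
        g = self.gate("candidate", x, h_prev, math.tanh)
        # Memory cell update followed by the hidden state.
        m = [fj * mj + ij * gj for fj, mj, ij, gj in zip(f, m_prev, i, g)]
        h = [oj * math.tanh(mj) for oj, mj in zip(o, m)]
        return h, m

class SentinelLSTMCell(LSTMCell):
    """LSTM cell extended with an auxiliary sentinel gate on the memory cell."""

    def step(self, x, h_prev, m_prev):
        h, m = super().step(x, h_prev, m_prev)
        aux = self.gate("sentinel", x, h_prev, sigmoid)
        # Sentinel state: the gate applied to a pointwise tanh of the memory cell.
        s = [a * math.tanh(mj) for a, mj in zip(aux, m)]
        return h, m, s

def constant_matrix(rows, cols, value):
    return [[value] * cols for _ in range(rows)]

# Hypothetical sizes: 3-dimensional combined input, 2 hidden units.
names = ["input", "forget", "output", "candidate", "sentinel"]
weights = {n: (constant_matrix(2, 3, 0.1), constant_matrix(2, 2, 0.1)) for n in names}
cell = SentinelLSTMCell(weights)
h, m, s = cell.step([1.0, 0.5, -0.5], [0.0, 0.0], [0.0, 0.0])
```

The extension leaves the hidden state and memory cell of the base cell untouched and only adds the sentinel state as a third output, which is why it can be layered onto an existing LSTM.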
  • the technology disclosed presents a recurrent neural network system (abbreviated RNN) for machine generation of a natural language caption for an image.
  • the RNN runs on numerous parallel processors.
  • the RNN can be a computer-implemented system.
  • FIG. 9 shows one implementation of modules of a recurrent neural network (abbreviated RNN) that implements the Sn-LSTM of FIG. 8.
  • the RNN comprises an input provider (FIG. 9) for providing a plurality of inputs to a sentinel long short-term memory network (abbreviated Sn-LSTM) over successive timesteps.
  • the inputs include at least an input for a current timestep, a hidden state from a previous timestep, and an auxiliary input for the current timestep.
  • the input provider can run on at least one of the numerous parallel processors.
  • the RNN comprises a gate processor (FIG. 9) for processing the inputs through each gate in a plurality of gates of the Sn-LSTM.
  • the gates include at least an input gate (FIGs. 8 and 9), a forget gate (FIGs. 8 and 9), an output gate (FIGs. 8 and 9), and an auxiliary sentinel gate (FIGs. 8 and 9).
  • the gate processor can run on at least one of the numerous parallel processors.
  • Each of the gates can run on at least one of the numerous parallel processors.
  • the RNN comprises a memory cell (FIG. 9) of the Sn-LSTM for storing auxiliary information accumulated over time from processing of the inputs by the gate processor.
  • the memory cell can be maintained and persisted in a database (FIG. 9).
  • the RNN comprises a memory cell updater (FIG. 9) for updating the memory cell with gate outputs produced by the input gate (FIGs. 8 and 9), the forget gate (FIGs. 8 and 9), and the output gate (FIGs. 8 and 9).
  • the memory cell updater can run on at least one of the numerous parallel processors.
  • the RNN comprises the auxiliary sentinel gate (FIGs. 8 and 9) for modulating the stored auxiliary information from the memory cell to produce a sentinel state at each timestep.
  • the modulation is conditioned on the input for the current timestep, the hidden state from the previous timestep, and the auxiliary input for the current timestep.
  • the RNN comprises an emitter (FIG. 5) for generating the natural language caption for the image based on the sentinel states produced over successive timesteps by the auxiliary sentinel gate.
  • the emitter can run on at least one of the numerous parallel processors.
  • the auxiliary sentinel gate can further comprise an auxiliary nonlinearity layer (FIG. 9) for squashing results of processing the inputs within a predetermined range.
  • the auxiliary nonlinearity layer can run on at least one of the numerous parallel processors.
  • the Sn-LSTM can further comprise a memory nonlinearity layer (FIG. 9) for applying a nonlinearity to contents of the memory cell.
  • the memory nonlinearity layer can run on at least one of the numerous parallel processors.
  • the Sn-LSTM can further comprise a sentinel state producer (FIG. 9) for combining the squashed results from the auxiliary sentinel gate with the nonlinearized contents of the memory cell to produce the sentinel state.
  • the sentinel state producer can run on at least one of the numerous parallel processors.
  • the input provider (FIG. 9) can provide the auxiliary input that is visual input comprising image data and the input is a text embedding of a most recently emitted word and/or character.
  • the input provider (FIG. 9) can provide the auxiliary input that is a text encoding from another long short-term memory network (abbreviated LSTM) of an input document and the input is a text embedding of a most recently emitted word and/or character.
  • the input provider (FIG. 9) can provide the auxiliary input that is a hidden state from another LSTM that encodes sequential data and the input is a text embedding of a most recently emitted word and/or character.
  • the input provider (FIG. 9) can provide the auxiliary input that is a prediction derived from a hidden state from another LSTM that encodes sequential data and the input is a text embedding of a most recently emitted word and/or character.
  • the input provider (FIG. 9) can provide the auxiliary input that is an output of a convolutional neural network (abbreviated CNN).
  • the input provider (FIG. 9) can provide the auxiliary input that is an output of an attention network.
  • the input provider can further provide multiple auxiliary inputs to the Sn-LSTM at a timestep, with at least one auxiliary input further comprising concatenated features.
  • the Sn-LSTM can further comprise an activation gate (FIG. 9).
  • implementations may include a non-transitory computer readable storage medium storing instructions executable by a processor to perform actions of the system described above.
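The role of the input provider can be sketched as a generator that, at each timestep, pairs the text embedding of the most recently emitted word with one or more auxiliary inputs. Concatenation is assumed here as the way the inputs are combined, and the function name and the toy vectors are hypothetical.

```python
def provide_inputs(word_embeddings, auxiliary_inputs):
    """Yield one combined input vector per timestep.

    word_embeddings: one embedding per timestep (most recently emitted word).
    auxiliary_inputs: feature vectors appended at every timestep
                      (for example, a global image feature).
    """
    for embedding in word_embeddings:
        x_t = list(embedding)
        for aux in auxiliary_inputs:
            x_t.extend(aux)  # concatenate each auxiliary feature vector
        yield x_t

# Hypothetical toy data: two timesteps, one 3-dimensional auxiliary input.
word_embeddings = [[0.1, 0.2], [0.3, 0.4]]
auxiliary_inputs = [[1.0, 1.0, 1.0]]
inputs = list(provide_inputs(word_embeddings, auxiliary_inputs))
```

Each yielded vector would then be passed, together with the previous hidden state, to the gate processor of the Sn-LSTM.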
  • a visual sentinel vector can represent, identify, and/or embody a visual sentinel.
  • a sentinel state vector can represent, identify, and/or embody a sentinel state.
  • This application uses the phrases “sentinel gate” and “auxiliary sentinel gate” interchangeably.
  • a hidden state vector can represent, identify, and/or embody a hidden state.
  • a hidden state vector can represent, identify, and/or embody hidden state information.
  • An input vector can represent, identify, and/or embody an input.
  • An input vector can represent, identify, and/or embody a current input.
  • a memory cell vector can represent, identify, and/or embody a memory cell state.
  • a memory cell state vector can represent, identify, and/or embody a memory cell state.
  • An image feature vector can represent, identify, and/or embody an image feature.
  • An image feature vector can represent, identify, and/or embody a spatial image feature.
  • a global image feature vector can represent, identify, and/or embody a global image feature.
  • This application uses the phrases “word embedding” and “word embedding vector” interchangeably.
  • a word embedding vector can represent, identify, and/or embody a word embedding.
  • an image context vector can represent, identify, and/or embody an image context.
  • a context vector can represent, identify, and/or embody an image context.
  • An adaptive image context vector can represent, identify, and/or embody an adaptive image context.
  • An adaptive context vector can represent, identify, and/or embody an adaptive image context.
  • This application uses the phrases “gate probability mass” and “sentinel gate mass” interchangeably.
  • FIG. 17 illustrates some example captions and spatial attentional maps for the specific words in the caption. It can be seen that our model learns alignments that correspond with human intuition. Even in the examples in which incorrect captions were generated, the model looked at reasonable regions in the image.
  • FIG. 18 shows visualization of some example image captions, word-wise visual grounding probabilities, and corresponding image spatial attention maps generated by our model.
  • the model successfully learns how heavily to attend to the image and adapts the attention accordingly. For example, for non-visual words such as “of” and “a”, the model attends less to the images. For visual words like “red”, “rose”, “doughnuts”, “woman”, and “snowboard”, our model assigns high visual grounding probabilities (over 0.9). Note that the same word can be assigned different visual grounding probabilities when generated in different contexts. For example, the word “a” typically has a high visual grounding probability at the beginning of a sentence, since without any language context, the model needs the visual information to determine plurality (or not). On the other hand, the visual grounding probability of “a” in the phrase “on a table” is much lower, since it is unlikely for something to be on more than one table.
  • FIG. 19 presents similar results as shown in FIG. 18 on another set of example image captions, word-wise visual grounding probabilities, and corresponding image/spatial attention maps generated using the technology disclosed.
  • FIGs. 20 and 21 are example rank-probability plots that illustrate performance of our model on the COCO (common objects in context) and Flickr30k datasets, respectively. It can be seen that our model attends to the image more when generating object words like “dishes”, “people”, “cat”, and “boat”; attribute words like “giant”, “metal”, and “yellow”; and number words like “three”. When the word is non-visual, such as “the”, “of”, and “to”, our model learns not to attend to the image. For more abstract words such as “crossing” and “during”, our model attends less than for the visual words and more than for the non-visual words. The model does not rely on any syntactic features or external knowledge. It discovers these trends automatically through learning.
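A rank-probability view of this kind can be approximated by averaging a per-word visual grounding probability and sorting the words by that average. The sketch below assumes the visual grounding probability of an emitted word is one minus the sentinel gate mass at that timestep, which this passage does not state explicitly. The function name is hypothetical and the gate values are invented purely for illustration.

```python
from collections import defaultdict

def rank_words_by_grounding(emissions):
    """Average (1 - beta) per word and sort from most to least visually grounded.

    emissions: iterable of (word, beta) pairs, one per emitted caption word.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for word, beta in emissions:
        totals[word] += 1.0 - beta  # assumed visual grounding probability
        counts[word] += 1
    averages = {word: totals[word] / counts[word] for word in totals}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)

# Hypothetical emissions: (word, sentinel gate mass) pairs, not measured data.
emissions = [("cat", 0.05), ("of", 0.80), ("cat", 0.15), ("yellow", 0.20), ("of", 0.70)]
ranking = rank_words_by_grounding(emissions)
```

With these made-up values the object word ranks first, the attribute word second, and the function word last, which mirrors the qualitative trend described above.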
  • FIG. 22 is an example graph that shows localization accuracy over the generated caption for top 45 most frequent COCO object categories.
  • the blue colored bars show localization accuracy of the spatial attention model and the red colored bars show localization accuracy of the adaptive attention model.
  • FIG. 22 shows that both models perform well on categories such as “cat”, “bed”, “bus”, and “truck”. On smaller objects, such as “sink”, both models perform relatively poorly.
  • FIG. 23 is a table that shows performance of the technology disclosed on the Flickr30k and COCO datasets based on various natural language processing metrics, including BLEU (bilingual evaluation understudy), METEOR (metric for evaluation of translation with explicit ordering), CIDEr (consensus-based image description evaluation), ROUGE-L (recall-oriented understudy for gisting evaluation-longest common subsequence), and SPICE (semantic propositional image caption evaluation).
  • the table in FIG. 23 shows that our adaptive attention model significantly outperforms our spatial attention model.
  • the CIDEr score of our adaptive attention model is 0.531 versus 0.493 for the spatial attention model on the Flickr30k dataset.
  • the CIDEr scores of the adaptive attention model and the spatial attention model on the COCO dataset are 1.085 and 1.029, respectively.
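To put the CIDEr scores above on a common footing, the relative improvement of the adaptive attention model over the spatial attention model can be computed per dataset. The helper name is hypothetical; the scores are the ones quoted in the text.

```python
def relative_gain(adaptive, spatial):
    """Relative improvement of the adaptive model over the spatial model."""
    return (adaptive - spatial) / spatial

# (adaptive, spatial) CIDEr scores as quoted for each dataset.
cider = {"Flickr30k": (0.531, 0.493), "COCO": (1.085, 1.029)}
gains = {name: relative_gain(a, s) for name, (a, s) in cider.items()}

for name, gain in gains.items():
    print(f"{name}: {gain:.1%}")
```

This works out to a relative gain of roughly 7.7% on Flickr30k and roughly 5.4% on COCO.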
  • FIG. 25 is a simplified block diagram of a computer system that can be used to implement the technology disclosed.
  • Computer system includes at least one central processing unit (CPU) that communicates with a number of peripheral devices via bus subsystem.
  • peripheral devices can include a storage subsystem including, for example, memory devices and a file storage subsystem, user interface input devices, user interface output devices, and a network interface subsystem. The input and output devices allow user interaction with computer system.
  • Network interface subsystem provides an interface to outside networks, including an interface to corresponding interface devices in other computer systems.
  • At least the spatial attention model, the controller, the localizer (FIG. 25), the trainer (which comprises the preventer), the adaptive attention model, and the sentinel LSTM (Sn-LSTM) are communicably linked to the storage subsystem and to the user interface input devices.
  • User interface input devices can include a keyboard; pointing devices such as a mouse, trackball, touchpad, or graphics tablet; a scanner; a touch screen incorporated into the display; audio input devices such as voice recognition systems and microphones; and other types of input devices.
  • In general, use of the term “input device” is intended to include all possible types of devices and ways to input information into computer system.
  • User interface output devices can include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices.
  • the display subsystem can include a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), a projection device, or some other mechanism for creating a visible image.
  • the display subsystem can also provide a non-visual display such as audio output devices.
  • In general, use of the term “output device” is intended to include all possible types of devices and ways to output information from computer system to the user or to another machine or computer system.
  • The storage subsystem stores programming and data constructs that provide the functionality of some or all of the modules and methods described herein. These software modules are generally executed by deep learning processors.
  • Deep learning processors can be graphics processing units (GPUs) or field-programmable gate arrays (FPGAs). Deep learning processors can be hosted by a deep learning cloud platform such as Google Cloud Platform™, Xilinx™, and Cirrascale™.
  • Examples of deep learning processors include Google's Tensor Processing Unit (TPU)™, rackmount solutions like GX4 Rackmount Series™, GX8 Rackmount Series™, NVIDIA DGX-1™, Microsoft's Stratix V FPGA™, Graphcore's Intelligent Processor Unit (IPU)™, Qualcomm's Zeroth Platform™ with Snapdragon processors™, NVIDIA's Volta™, NVIDIA's DRIVE PX™, NVIDIA's JETSON TX1/TX2 MODULE™, Intel's Nirvana™, Movidius VPU™, Fujitsu DPI™, ARM's
  • The memory subsystem used in the storage subsystem can include a number of memories, including a main random access memory (RAM) for storage of instructions and data during program execution and a read only memory (ROM) in which fixed instructions are stored.
  • A file storage subsystem can provide persistent storage for program and data files, and can include a hard disk drive, a floppy disk drive along with associated removable media, a CD-ROM drive, an optical drive, or removable media cartridges.
  • The modules implementing the functionality of certain implementations can be stored by the file storage subsystem in the storage subsystem, or in other machines accessible by the processor.
  • The bus subsystem provides a mechanism for letting the various components and subsystems of the computer system communicate with each other as intended. Although the bus subsystem is shown schematically as a single bus, alternative implementations of the bus subsystem can use multiple busses.
  • The computer system itself can be of varying types, including a personal computer, a portable computer, a workstation, a computer terminal, a network computer, a television, a mainframe, a server farm, a widely-distributed set of loosely networked computers, or any other data processing system or user device. Due to the ever-changing nature of computers and networks, the description of the computer system depicted in FIG. 13 is intended only as a specific example for purposes of illustrating the preferred embodiments of the present invention. Many other configurations of the computer system are possible, having more or fewer components than the computer system depicted in FIG. 13.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The technology disclosed presents a novel spatial attention model that uses current hidden state information of a decoder long short-term memory (LSTM) to guide attention and to extract spatial image features for use in image captioning. The technology disclosed also presents a novel adaptive attention model for image captioning that mixes visual information from a convolutional neural network (CNN) and linguistic information from an LSTM. At each timestep, the adaptive attention model automatically decides how heavily to rely on the image, as opposed to the linguistic model, to emit the next caption word. The technology disclosed further adds a new auxiliary sentinel gate to an LSTM architecture and produces a sentinel LSTM (Sn-LSTM). The sentinel gate produces, at each timestep, a visual sentinel, which is an additional representation, derived from the LSTM's memory, of long-term and short-term visual and linguistic information.
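The adaptive mixing summarized in the abstract can be illustrated with a minimal sketch. It assumes, as an illustration only, that the attention scores for the image regions and a score for the visual sentinel are normalized jointly with a softmax, so that the weight on the sentinel (called `beta` here) expresses how much the model relies on linguistic memory rather than on the image. The function names, argument names, and the toy inputs are hypothetical and do not come from the text above.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_context(region_features, region_scores, sentinel, sentinel_score):
    """Mix attended image features with a visual sentinel vector.

    region_features: list of k feature vectors, one per image region
    region_scores:   list of k unnormalized attention scores
    sentinel:        sentinel vector with the same dimension as a feature
    sentinel_score:  unnormalized score for the sentinel
    Returns (context, beta), where beta is the weight placed on the sentinel.
    """
    # Assumption: regions and sentinel compete in one softmax of size k + 1.
    weights = softmax(list(region_scores) + [sentinel_score])
    beta = weights[-1]
    dim = len(sentinel)
    # Start from the sentinel contribution, then add each weighted region.
    context = [beta * sentinel[d] for d in range(dim)]
    for w, feat in zip(weights[:-1], region_features):
        for d in range(dim):
            context[d] += w * feat[d]
    return context, beta


# Toy example: two image regions, a 2-dimensional sentinel, equal scores.
features = [[1.0, 0.0], [0.0, 1.0]]
ctx, beta = adaptive_context(features, [0.0, 0.0], [0.5, 0.5], 0.0)
```

With equal scores, each of the three candidates receives the same weight. Raising `sentinel_score` relative to the region scores pushes `beta` toward 1, which in this sketch corresponds to relying on the language model rather than the image for the next word.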
PCT/US2017/062433 2016-11-18 2017-11-18 Modèle d'attention spatiale pour sous-titrage d'image WO2018094294A1 (fr)

Priority Applications (5)

Application Number Priority Date Filing Date Title
CA3040165A CA3040165C (fr) 2016-11-18 2017-11-18 Modele d'attention spatiale pour sous-titrage d'image
JP2019526275A JP6689461B2 (ja) 2016-11-18 2017-11-18 画像キャプション生成のための空間的注目モデル
CN201780071579.2A CN110168573B (zh) 2016-11-18 2017-11-18 用于图像标注的空间注意力模型
EP17821750.1A EP3542314B1 (fr) 2016-11-18 2017-11-18 Modèle d'attention spatiale pour le sous-titrage d'images
EP21167276.1A EP3869416A1 (fr) 2016-11-18 2017-11-18 Modèle d'attention spatiale pour sous-titrage d'image

Applications Claiming Priority (8)

Application Number Priority Date Filing Date Title
US201662424353P 2016-11-18 2016-11-18
US62/424,353 2016-11-18
US15/817,161 US10565305B2 (en) 2016-11-18 2017-11-17 Adaptive attention model for image captioning
US15/817,161 2017-11-17
US15/817,153 US10558750B2 (en) 2016-11-18 2017-11-17 Spatial attention model for image captioning
US15/817,153 2017-11-17
US15/817,165 US10565306B2 (en) 2016-11-18 2017-11-18 Sentinel gate for modulating auxiliary information in a long short-term memory (LSTM) neural network
US15/817,165 2017-11-18

Publications (1)

Publication Number Publication Date
WO2018094294A1 (fr) 2018-05-24

Family

ID=60629823

Family Applications (3)

Application Number Title Priority Date Filing Date
PCT/US2017/062433 WO2018094294A1 (fr) 2016-11-18 2017-11-18 Modèle d'attention spatiale pour sous-titrage d'image
PCT/US2017/062434 WO2018094295A1 (fr) 2016-11-18 2017-11-18 Modèle d'attention adaptatif pour sous-titrage d'image
PCT/US2017/062435 WO2018094296A1 (fr) 2016-11-18 2017-11-18 Réseau sentinel long short-term memory

Family Applications After (2)

Application Number Title Priority Date Filing Date
PCT/US2017/062434 WO2018094295A1 (fr) 2016-11-18 2017-11-18 Modèle d'attention adaptatif pour sous-titrage d'image
PCT/US2017/062435 WO2018094296A1 (fr) 2016-11-18 2017-11-18 Réseau sentinel long short-term memory

Country Status (1)

Country Link
WO (3) WO2018094294A1 (fr)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109034373A (zh) * 2018-07-02 2018-12-18 鼎视智慧(北京)科技有限公司 卷积神经网络的并行处理器及处理方法
CN110163299A (zh) * 2019-05-31 2019-08-23 合肥工业大学 一种基于自底向上注意力机制和记忆网络的视觉问答方法
CN110175979A (zh) * 2019-04-08 2019-08-27 杭州电子科技大学 一种基于协同注意力机制的肺结节分类方法
JP2020047191A (ja) * 2018-09-21 2020-03-26 ソニーセミコンダクタソリューションズ株式会社 固体撮像システム、固体撮像装置、情報処理装置、画像処理方法及びプログラム
CN112052906A (zh) * 2020-09-14 2020-12-08 南京大学 一种基于指针网络的图像描述优化方法
CN112927255A (zh) * 2021-02-22 2021-06-08 武汉科技大学 一种基于上下文注意力策略的三维肝脏影像语义分割方法
JP2021520002A (ja) * 2019-03-29 2021-08-12 ベイジン センスタイム テクノロジー デベロップメント カンパニー, リミテッド テキスト認識方法及び装置、電子機器並びに記憶媒体
US11244111B2 (en) 2016-11-18 2022-02-08 Salesforce.Com, Inc. Adaptive attention model for image captioning
CN115544259A (zh) * 2022-11-29 2022-12-30 城云科技(中国)有限公司 一种长文本分类预处理模型及其构建方法、装置及应用
CN112307769B (zh) * 2019-07-29 2024-03-15 武汉Tcl集团工业研究院有限公司 一种自然语言模型的生成方法和计算机设备

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102323548B1 (ko) * 2016-09-26 2021-11-08 구글 엘엘씨 신경 기계 번역 시스템
CN108898639A (zh) * 2018-05-30 2018-11-27 湖北工业大学 一种图像描述方法及系统
CN109086779B (zh) * 2018-07-28 2021-11-09 天津大学 一种基于卷积神经网络的注意力目标识别方法
CN109376246B (zh) * 2018-11-07 2022-07-08 中山大学 一种基于卷积神经网络和局部注意力机制的句子分类方法
US20220046206A1 (en) * 2020-08-04 2022-02-10 Vingroup Joint Stock Company Image caption apparatus
CN112528989B (zh) * 2020-12-01 2022-10-18 重庆邮电大学 一种图像语义细粒度的描述生成方法
CN112529857B (zh) * 2020-12-03 2022-08-23 重庆邮电大学 基于目标检测与策略梯度的超声图像诊断报告生成方法
CN114782702A (zh) * 2022-03-23 2022-07-22 成都瑞数猛兽科技有限公司 一种基于三层lstm推敲网络的图像语义理解算法

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016077797A1 (fr) * 2014-11-14 2016-05-19 Google Inc. Génération de descriptions d'images en langage naturel

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
KELVIN XU ET AL: "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 10 February 2015 (2015-02-10), XP080677655 *
LONG CHEN ET AL: "SCA-CNN: Spatial and Channel-wise Attention in Convolutional Networks for Image Captioning", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 17 November 2016 (2016-11-17), XP080732428 *
LU JIASEN ET AL: "Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning", IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION. PROCEEDINGS, IEEE COMPUTER SOCIETY, US, 21 July 2017 (2017-07-21), pages 3242 - 3250, XP033249671, ISSN: 1063-6919, [retrieved on 20171106], DOI: 10.1109/CVPR.2017.345 *
RYAN KIROS ET AL: "Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models", 10 November 2014 (2014-11-10), pages 1 - 13, XP055246385, Retrieved from the Internet <URL:http://arxiv.org/pdf/1411.2539v1.pdf> [retrieved on 20160201] *
STEPHEN MERITY ET AL: "Pointer Sentinel Mixture Models", 26 September 2016 (2016-09-26), XP055450460, Retrieved from the Internet <URL:https://arxiv.org/pdf/1609.07843.pdf> [retrieved on 20180213] *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11244111B2 (en) 2016-11-18 2022-02-08 Salesforce.Com, Inc. Adaptive attention model for image captioning
CN109034373A (zh) * 2018-07-02 2018-12-18 鼎视智慧(北京)科技有限公司 卷积神经网络的并行处理器及处理方法
CN109034373B (zh) * 2018-07-02 2021-12-21 鼎视智慧(北京)科技有限公司 卷积神经网络的并行处理器及处理方法
JP2020047191A (ja) * 2018-09-21 2020-03-26 ソニーセミコンダクタソリューションズ株式会社 固体撮像システム、固体撮像装置、情報処理装置、画像処理方法及びプログラム
JP7153088B2 (ja) 2019-03-29 2022-10-13 ベイジン・センスタイム・テクノロジー・デベロップメント・カンパニー・リミテッド テキスト認識方法及びテキスト認識装置、電子機器、記憶媒体並びにコンピュータプログラム
US12014275B2 (en) 2019-03-29 2024-06-18 Beijing Sensetime Technology Development Co., Ltd. Method for text recognition, electronic device and storage medium
JP2021520002A (ja) * 2019-03-29 2021-08-12 ベイジン センスタイム テクノロジー デベロップメント カンパニー, リミテッド テキスト認識方法及び装置、電子機器並びに記憶媒体
CN110175979A (zh) * 2019-04-08 2019-08-27 杭州电子科技大学 一种基于协同注意力机制的肺结节分类方法
CN110163299A (zh) * 2019-05-31 2019-08-23 合肥工业大学 一种基于自底向上注意力机制和记忆网络的视觉问答方法
CN112307769B (zh) * 2019-07-29 2024-03-15 武汉Tcl集团工业研究院有限公司 一种自然语言模型的生成方法和计算机设备
CN112052906A (zh) * 2020-09-14 2020-12-08 南京大学 一种基于指针网络的图像描述优化方法
CN112052906B (zh) * 2020-09-14 2024-02-02 南京大学 一种基于指针网络的图像描述优化方法
CN112927255B (zh) * 2021-02-22 2022-06-21 武汉科技大学 一种基于上下文注意力策略的三维肝脏影像语义分割方法
CN112927255A (zh) * 2021-02-22 2021-06-08 武汉科技大学 一种基于上下文注意力策略的三维肝脏影像语义分割方法
CN115544259A (zh) * 2022-11-29 2022-12-30 城云科技(中国)有限公司 一种长文本分类预处理模型及其构建方法、装置及应用
CN115544259B (zh) * 2022-11-29 2023-02-17 城云科技(中国)有限公司 一种长文本分类预处理模型及其构建方法、装置及应用

Also Published As

Publication number Publication date
WO2018094296A1 (fr) 2018-05-24
WO2018094295A1 (fr) 2018-05-24

Similar Documents

Publication Publication Date Title
US10846478B2 (en) Spatial attention model for image captioning
WO2018094294A1 (fr) Modèle d'attention spatiale pour sous-titrage d'image
JP6972265B2 (ja) ポインタセンチネル混合アーキテクチャ
US11520998B2 (en) Neural machine translation with latent tree attention
US11023210B2 (en) Generating program analysis rules based on coding standard documents
US11797822B2 (en) Neural network having input and hidden layers of equal units
EP3535706A1 (fr) Réseau de co-attention dynamique pour répondre à des questions
US11481609B2 (en) Computationally efficient expressive output layers for neural networks
US10671909B2 (en) Decreasing neural network inference times using softmax approximation
Han et al. Latent variable autoencoder
US11837000B1 (en) OCR using 3-dimensional interpolation
US20230124177A1 (en) System and method for training a sparse neural network whilst maintaining sparsity
US20210050019A1 (en) Reducing latency and improving accuracy of work estimates utilizing natural language processing
CN117609779A (zh) 一种目标检测方法、系统、存储介质及计算设备
CN112347196A (zh) 基于神经网络的实体关系抽取方法及装置

Legal Events

Date Code Title Description
DPE1 Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 17821750; Country of ref document: EP; Kind code of ref document: A1)
ENP Entry into the national phase (Ref document number: 3040165; Country of ref document: CA)
ENP Entry into the national phase (Ref document number: 2019526275; Country of ref document: JP; Kind code of ref document: A)
NENP Non-entry into the national phase (Ref country code: DE)
ENP Entry into the national phase (Ref document number: 2017821750; Country of ref document: EP; Effective date: 20190618)