AU2016225819B2 - Structured knowledge modeling and extraction from images - Google Patents

Structured knowledge modeling and extraction from images

Info

Publication number
AU2016225819B2
Authority
AU
Australia
Prior art keywords
text
image
images
structured
computing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
AU2016225819A
Other versions
AU2016225819A1 (en)
Inventor
Walter Chang
Scott Cohen
Mohamed Elhoseiny
Brian Price
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Adobe Inc
Original Assignee
Adobe Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority to US201562254147P priority Critical
Priority to US201562254143P priority
Priority to US14/978,350 priority patent/US20170132526A1/en
Application filed by Adobe Inc filed Critical Adobe Inc
Publication of AU2016225819A1 publication Critical patent/AU2016225819A1/en
Assigned to ADOBE INC. reassignment ADOBE INC. Alteration of Name(s) of Applicant(s) under S113 Assignors: ADOBE SYSTEMS INCORPORATED
Application granted granted Critical
Publication of AU2016225819B2 publication Critical patent/AU2016225819B2/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06NCOMPUTER SYSTEMS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computer systems using knowledge-based models
    • G06N5/02Knowledge representation
    • G06N5/022Knowledge engineering; Knowledge acquisition
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06NCOMPUTER SYSTEMS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computer systems based on biological models
    • G06N3/02Computer systems based on biological models using neural network models
    • G06N3/04Architectures, e.g. interconnection topology
    • G06N3/0427Architectures, e.g. interconnection topology in combination with an expert system
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06NCOMPUTER SYSTEMS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computer systems based on biological models
    • G06N3/02Computer systems based on biological models using neural network models
    • G06N3/04Architectures, e.g. interconnection topology
    • G06N3/0454Architectures, e.g. interconnection topology using a combination of multiple neural nets
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06NCOMPUTER SYSTEMS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computer systems based on biological models
    • G06N3/02Computer systems based on biological models using neural network models
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06NCOMPUTER SYSTEMS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computer systems based on biological models
    • G06N3/02Computer systems based on biological models using neural network models
    • G06N3/04Architectures, e.g. interconnection topology
    • G06N3/0445Feedback networks, e.g. hopfield nets, associative networks

Abstract

Techniques and systems are described to model and extract knowledge from images. A digital medium environment is configured to learn and use a model to compute a descriptive summarization of an input image automatically and without user intervention. Training data is obtained to train a model using machine learning in order to generate a structured image representation that serves as the descriptive summarization of an input image. The images and associated text are processed to extract structured semantic knowledge from the text, which is then associated with the images. The structured semantic knowledge is processed along with corresponding images to train a model using machine learning such that the model describes a relationship between text features within the structured semantic knowledge. Once the model is learned, the model is usable to process input images to generate a structured image representation of the image.

[FIG. 1: environment 100 including a computing device 102, knowledge extraction system 104, structured image representation 106, image 108, image search module 110, database 112, network 116, and caption generation system 118.]

Description

Structured Knowledge Modeling and Extraction
from Images
Inventors: Scott D. Cohen Walter Wei-Tuh Chang Brian L. Price Mohamed Hamdy Mahmoud Abdelbaky Elhoseiny
RELATED APPLICATIONS
[0001] This Application claims priority to U.S. Provisional Patent Application No.
62/254,143, filed November 11, 2015, and titled "Structured Knowledge Modeling
and Extraction from Images," the disclosure of which is incorporated by reference
in its entirety.
BACKGROUND
[0002] Image searches involve the challenge of matching text in a search request
with text associated with the image, e.g., tags and so forth. For example, a
creative professional may capture an image and associate tags having text that is
used to locate the image. On the other side, a user trying to locate the image in an
image search enters one or more keywords. Accordingly, this requires that the
creative professional and the user reach agreement as to how to describe the image
using text, such that the user can locate the image and the creative professional can
make the image available to the user. Conventional tag and keyword search techniques may be prone to error, misunderstandings, and different interpretations due to this requirement that a common understanding is reached between the creative professional and the user in order to locate the images.
[0003] Further, conventional search techniques for images do not support high
precision semantic image search due to limitations of conventional image tagging
and search. This is because conventional techniques merely associate tags with
the images, but do not define relationships between the tags nor with the image
itself. As such, conventional search techniques cannot achieve accurate search
results for complex search queries, such as "a man feeding a baby in a high chair
with the baby holding a toy." Consequently, these conventional search techniques
force users to navigate through tens, hundreds, and even thousands of images,
oftentimes using multiple search requests, in order to locate a desired image. This
is caused by forcing the user in conventional techniques to gain an understanding
as to how the creative professional described the image for location as part of the
search.
SUMMARY
[0004] Techniques and systems to extract and model structured knowledge from
images are described. In one or more implementations, a digital medium
environment is configured to learn and use a model to compute a descriptive
summarization of an input image automatically and without user intervention.
Training data (e.g., images and unstructured text such as captions) is first obtained
to train a model using machine learning in order to generate a structured image
representation that serves as the descriptive summarization of an input image.
[0005] The images and associated text are then processed to extract structured
semantic knowledge from the text, which is then associated with the images.
Structured semantic knowledge may take a variety of forms, such as <subject,
attribute> and <subject, predicate, object> tuples that function as a statement
linking the subject to the object via the predicate. This may include association
with the image as a whole and/or objects within the image through a process called
"localization."
[0006] The structured semantic knowledge is then processed along with
corresponding images to train a model using machine learning such that the model
describes a relationship between text features within the structured semantic
knowledge (e.g., subjects and objects) and image features of images, e.g., portions
of the image defined in bounding boxes that include the subjects or objects.
[0007] Once the model is learned, the model is then usable to process input images
to generate a structured image representation of the image. The structured image
representation may include text that is structured in a way that describes
relationships between objects in the image and the image itself. The structured
image representation may be used to support a variety of functionality, including
image searches, automatic caption and metadata generation, object tagging, and so
forth.
[0008] This Summary introduces a selection of concepts in a simplified form that
are further described below in the Detailed Description. As such, this Summary is
not intended to identify essential features of the claimed subject matter, nor is it
intended to be used as an aid in determining the scope of the claimed subject
matter.
BRIEF DESCRIPTION OF THE DRAWINGS
[0009] The detailed description is described with reference to the accompanying
figures. In the figures, the left-most digit(s) of a reference number identifies the
figure in which the reference number first appears. The use of the same reference
numbers in different instances in the description and the figures may indicate
similar or identical items. Entities represented in the figures may be indicative of
one or more entities and thus reference may be made interchangeably to single or
plural forms of the entities in the discussion.
[0010] FIG. 1 is an illustration of an environment in an example implementation
that is operable to employ knowledge extraction techniques from images as
described herein.
[0011] FIG. 2 depicts another example of an image from which knowledge is
extracted using a knowledge extraction system of FIG. 1.
[0012] FIG. 3 depicts a system showing the knowledge extraction system of FIG. 1
in greater detail.
[0013] FIG. 4 depicts an example implementation showing an extractor module of
FIG. 3 in greater detail.
[0014] FIG. 5 depicts an example system in which an extractor module of FIG. 4 is
shown as including localization functionality as part of knowledge extraction.
[0015] FIG. 6 depicts an example of localization of structured semantic knowledge
to portions of images.
[0016] FIG. 7 depicts an example implementation showing a model training
module of FIG. 3 in greater detail as employing a machine learning module to
model a relationship between the structured semantic knowledge and images.
[0017] FIG. 8 depicts an example implementation showing training of a model
using a two module machine learning system.
[0018] FIG. 9 is a flow diagram depicting a procedure in an example
implementation in which a digital medium environment is employed to extract
knowledge from an input image automatically and without user intervention.
[0019] FIG. 10 is a flow diagram depicting a procedure in an example
implementation in which a digital medium environment is employed to extract
knowledge and localize text features to image features of an input image.
[0020] FIG. 11 depicts a system for structured face image embedding.
[0021] FIG. 12 depicts Model 1 and Model 2 as part of machine learning.
[0022] FIG. 13 illustrates an example system including various components of an
example device that can be implemented as any type of computing device as
described and/or utilized with reference to FIGS. 1-12 to implement embodiments of the techniques described herein.
DETAILED DESCRIPTION
Overview
[0023] Techniques and systems are described that support knowledge extraction
from an image in order to generate a descriptive summarization of the image,
which may then be used to support image search, automatic generation of
captions and metadata for the image, and a variety of other uses. The descriptive
summarization, for instance, may describe qualities of the image as a whole as
well as attributes, objects, and interaction of the objects, one to another, within the
image as further described below. Accordingly, although examples involving
image searches are described in the following, these techniques are equally
applicable to a variety of other examples such as automated structured image
tagging, caption generation, and so forth.
[0024] Training data is first obtained to train a model using machine learning in
order to generate a structured image representation. Techniques are described
herein in which training data is obtained that uses images and associated text (e.g.,
captions of the images, which include any type of text configuration that describes
a scene captured by the image) that may be readily obtained from a variety of
sources. The images and associated text are then processed automatically and
without user intervention to extract structured semantic knowledge from the text,
which is then associated with the images. This may include association with the image as a whole and/or objects within the image through a process called
"localization" in the following. Use of this training data differs from conventional
techniques that rely on crowd sourcing in which humans manually label images,
which can be expensive, prone to error, and inefficient.
[0025] In one example, structured semantic knowledge is extracted from the text
using natural language processing. Structured semantic knowledge may take a
variety of forms, such as <subject, attribute> and <subject, predicate, object>
tuples that function as a statement linking the subject to the object via the
predicate. The structured semantic knowledge is then processed along with
corresponding images to train a model using machine learning such that the model
describes a relationship between text features within the structured semantic
knowledge (e.g., subjects and objects) and image features of images, e.g., portions
of the image defined in bounding boxes that include the subjects or objects. In one
example, the model is a joint probabilistic model that is built without requiring
reduction of a large vocabulary of individual words to a small pre-defined set of
concepts, and as such the model may directly address this large vocabulary, which
is not possible using conventional techniques.
[0026] For example, localization techniques may be employed such that the
structured semantic knowledge is mapped to a corresponding object within an
image. A <baby, holding, toy> tuple, for instance, may explicitly map the subject
"baby" in an image to the object "toy" in the image using the predicate "holding,"
and thus provides a structure to describe "what is going on" in the image. This is not possible in conventional unstructured tagging techniques, which are not explicit in that a correspondence between a particular object in the image and the tag is not established, such that no distinction is made when multiple objects (e.g., multiple babies) are included in the image. Thus, the explicit, structured knowledge provided by the techniques described herein may be leveraged in a way that is searchable by a computing device.
[0027] If one searches for images of a "red flower", for instance, a conventional
bag-of-words approach considers "red" and "flower" separately, which may return
images of flowers that are not red, but have red elsewhere in the image. However,
use of the techniques described herein recognizes that a user is looking for the concept
of <flower, red> from the structure of the search request, which is then used to locate
images having a corresponding structure. In this way, the model may achieve
increased accuracy over techniques that rely on description of the image as a
whole, as further described in relation to FIGS. 5 and 6 in the following.
[0028] Further, this mapping may employ a common vector space that penalizes
differences such that similar semantic concepts are close to each other within this
space. For example, this may be performed for feature vectors for text such that
"curvy road" and "winding road" are relatively close to each other in the vector
space. Similar techniques are usable to promote similar concepts for image
vectors as well as to adapt the image and text vectors to each other. A variety of
machine learning techniques may be employed to train the model to perform this
mapping. In one such example, a two-column deep network is used to learn the correlation between the structured semantic information and an image or portion of an image, e.g., a bounding box, an example of which is shown in FIG. 8.
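A minimal sketch of such a two-column embedding network, assuming PyTorch and placeholder feature dimensions (the actual architecture of FIG. 8 is not specified here): one column embeds a structured-text vector, the other embeds an image feature vector, and a loss penalizes differences between matched pairs in the common space.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoColumnEmbedding(nn.Module):
    """Two-column network: one column maps a text-tuple vector and the
    other maps an image feature vector into a shared embedding space."""
    def __init__(self, text_dim=900, image_dim=4096, embed_dim=256):
        super().__init__()
        self.text_column = nn.Sequential(
            nn.Linear(text_dim, 512), nn.ReLU(), nn.Linear(512, embed_dim))
        self.image_column = nn.Sequential(
            nn.Linear(image_dim, 512), nn.ReLU(), nn.Linear(512, embed_dim))

    def forward(self, text_vec, image_vec):
        t = F.normalize(self.text_column(text_vec), dim=-1)
        x = F.normalize(self.image_column(image_vec), dim=-1)
        return t, x

def similarity_loss(t, x):
    # Penalize matched text/image pairs that are far apart in the common
    # space (1 - cosine similarity); a full training setup would also push
    # non-matching pairs apart, e.g., with a contrastive or ranking loss.
    return (1.0 - (t * x).sum(dim=-1)).mean()
```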
[0029] Once the model is learned, the model is then usable to process input images
to generate a structured image representation of the image through calculation of a
confidence value to describe which text best corresponds with the image. The
model, for instance, may loop over bounding boxes of parts of the image to
determine which structured text (e.g., <flower, red>) likely describes that part of
the image such as objects, attributes, and relationships there between through
calculation of probabilities (i.e., the confidence values) that the structured text
describes a same concept as image features in the image. In this way, the
structured image representation provides a descriptive summary of the image that
uses structured text to describe the images and portions of the image. The
structured image representation may thus be calculated for an image to include
text that is structured in a way that describes relationships between objects in the
image (e.g., flower), attributes of the object (e.g., red), relationships between (e.g.,
<flower, red> <baby, holding, toy>) and the image itself as described above. The
structured image representation may be used to support a variety of functionality,
including image searches, automatic caption and metadata generation, automated
object tagging, and so forth. Further discussion of these and other examples is
included in the following sections.
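A minimal sketch of the scoring loop just described, assuming a hypothetical model.score(tuple, region) call that returns the learned confidence value and region objects carrying a box attribute; the threshold and data shapes are illustrative assumptions.

```python
def structured_image_representation(image_regions, candidate_tuples, model,
                                    threshold=0.5):
    """Loop over image regions (e.g., bounding-box crops) and candidate
    structured-text tuples, keeping pairs the model scores as likely
    describing the same concept."""
    representation = []
    for region in image_regions:
        for tup in candidate_tuples:
            confidence = model.score(tup, region)  # probability-like score
            if confidence >= threshold:
                representation.append({"tuple": tup,
                                       "region": region.box,
                                       "confidence": confidence})
    return representation
```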
[0030] In the following discussion, an example environment is first described that
may employ the knowledge extraction techniques described herein. Example
procedures are then described which may be performed in the example
environment as well as other environments. Consequently, performance of the
example procedures is not limited to the example environment and the example
environment is not limited to performance of the example procedures.
Example Environment
[0031] FIG. 1 is an illustration of an environment 100 in an example
implementation that is operable to employ knowledge extraction techniques
described herein. The illustrated environment 100 includes a computing device
102, which may be configured in a variety of ways.
[0032] The computing device 102, for instance, may be configured as a desktop
computer, a laptop computer, a mobile device (e.g., assuming a handheld
configuration such as a tablet or mobile phone as illustrated), wearables, and so
forth. Thus, the computing device 102 may range from full resource devices with
substantial memory and processor resources (e.g., personal computers, game
consoles) to a low-resource device with limited memory and/or processing
resources (e.g., mobile devices). Additionally, although a single computing device
102 is shown, the computing device 102 may be representative of a plurality of
different devices, such as multiple servers utilized by a business to perform
operations "overthe cloud" as further described in relationto FIG. 13.
[0033] The computing device 102 is illustrated as including a knowledge
extraction system 104 that is representative of functionality to form a structured
image representation 106 from an image 108 that descriptively summarizes the
image 108. The structured image representation 106 is usable to support a variety
of functionality, such as to be employed by an image search module 110 to search
a database 112 of images 114 based on corresponding structured image
representations. As previously described, other uses of the structured image
representation 106 are also contemplated, such as automatic generation of captions
and metadata for images as represented by a caption generation system 118.
Additionally, although the knowledge extraction system 104 and image search
module 110 and database 112 are illustrated as implemented using computing
device 102, this functionality may be further divided "over the cloud" via a
network 116 as further described in relation to FIG. 13.
[0034] The structured image representation 106 provides a set of concepts with
structure that describes a relationship between entities included in the concepts.
Through this, the structured image representation may function as an intermediate
representation of the image 108 using text to describe not only "what is included"
in the image 108 but also a relationship, one to another, of entities and concepts
included in the image 108. This may be used to support a higher level of semantic
precision in an image search that is not possible using conventional techniques that
relied on unstructured tags.
[0035] A high precision semantic image search, for instance, involves finding
images with the specific content requested in a textual search query. For example,
a user may input a search query of "a man feeding a baby in a high chair with the
baby holding a toy" to an image sharing service to locate an image of interest that
is available for licensing. Conventional techniques that relied on unstructured
tags, however, are not able to accurately satisfy this query. In practice,
conventional image searches provide images that typically satisfy some, but not all, of
the elements in the query, such as a man feeding a baby but the baby is not
holding a toy, a baby in a high chair but there is no man in the picture, a picture of
a woman feeding a baby holding a toy, and so forth, due to this lack of structure.
[0036] A structured image representation 106, however, provides an explicit
representation of what is known about an image 108. This supports an ability to
determine which concepts in a search query are missing in a searched database
image and thus improve accuracy of search results. Accordingly, a measure of
similarity between the search query and an image 114 in a database 112 can
incorporate which and how many concepts are missed. Also, if there is an image
that is close to satisfying the query but misses a concept, techniques may be
employed to synthesize a new image using the close image and content from
another image that contains the missing concept as further described in the
following.
[0037] Consider an example of use of the structured image representation 106 in
which the extracted knowledge of the image 108 includes the following:
{<man, smiling>, <baby, smiling>, <baby, holding, toy>, <man, sitting at,
table>, <baby, sitting in, high chair>, <man, feeding, baby>, <baby,
wearing, blue clothes>}
The caption generation system 118 is configured to use this extracted knowledge
to generate a caption as follows:
"A man is feeding a smiling babywhile the baby holds a toy. The baby is
sitting in a high chair. The man is happy too. It is probably a dad feeding
his son. The dad and his son are having fun together while mom is away."
[0038] Thus, the explicit representation of knowledge of the structured image
representation 106 allows for a multiple sentence description of the scene of the
image 108 as a caption in this example that is formed automatically and without
user intervention. The first two sentences are a straightforward inclusion of the
concepts <man, feeding, baby>, <baby, holding, toy>, and <baby, sitting in, high
chair>. The third sentence involves reasoning based on the concepts <man,
smiling> and <baby, smiling> to deduce by the caption generation system 118 that
the man is happy and to add the "too" because both the baby and man are smiling.
The fourth sentence also uses reasoning on the extracted concept that the baby is
wearing blue to deduce that the baby is a boy.
[0039] The caption generation system 118 may also use external statistical
knowledge, e.g., that most of the time when a man is feeding a baby boy, it is a
father feeding his son. The generated fourth sentence above is tempered with "it
is probably..." because statistics may indicate a reasonable amount of uncertainty
in that deduction and because there may also be uncertainty in the deduction that the baby is a boy because the baby is wearing blue clothes. Since the structured image representation 106 may be used to extract all relevant information about the scene, the absence of information may also be used as part of deductions performed by the caption generation system 118. In this case, the structured image representation 106 does not mention a woman as being present in the image
108. Thus, the caption generation system 118 may deduce that the "mom is away"
and combined with the concepts that the man and baby are smiling, generate the
final sentence "The dad and his son are having fun together while mom is away."
[0040] Note that a caption generation system 118 may avoid use of some of the
extracted information. In this case, the caption did not mention that the man was
sitting at the table because the caption generation system 118 deemed that concept
uninteresting or unimportant in describing the scene or that it could be deduced
with high probability from another concept such as that the baby is sitting in a
high chair. This reasoning is made possible through use of the structured image
representation 106 as a set of structured knowledge that functions as a descriptive
summarization of the image 108 using text.
[0041] The structured image representation 106 may also include part-of-speech
(POS) tags such as singular noun, adjective, adverb, and so on for the extracted
subjects, predicates, actions, attributes, and objects. The part-of-speech tags can
be used as part of reasoning as described above as well as slot filling in a
grammar-based caption generation approach, and to ensure that a valid sentence is generated as further described below.
[0042] Additionally, explicit extraction of knowledge of images 108 at the level of
objects within the image 108 and corresponding attributes and interactions allows
for further reasoning about middle and higher level scene properties. The
deductions about the baby being a boy, the man being happy, and the dad and son
having fun while mom is away are examples.
[0043] FIG. 2 depicts another example of an image 200. In this example, the
structured image representation 106 may include the following knowledge that is
extracted from the image 200:
{<soccer ball>, <person 1, wearing, blue shirt>, <person 2, wearing, red
shirt>, <person 3, wearing, red shirt>, <person 4, wearing, red shirt>,
<person 5, wearing, blue shirt>, <person 6, wearing, blue shirt>, <field>,
<person 5, kicking, soccer ball>, <person 6, running>, <person 4, chasing,
person 5>, <person 3, running>, <person 1, running> }.
The existence of a soccer ball indicates that the people are playing soccer, which is
further supported by knowledge that one of the people is kicking the soccer ball.
That there are only two different color shirts indicates that there are two teams
playing a game. This is backed up by the knowledge that a person in red is
actually chasing the person in blue that is kicking the ball, and that other people
are running on a field. From this extracted object level knowledge, scene level
properties may be deduced by the caption generation system 118 with enhanced
object level descriptions, such as "A soccer match between a team in red and a team in blue".
[0044] Further reasoning and deduction about scenes and their constituent objects
and actions may also be achieved by building a knowledge base about the content
of images where the knowledge base is then used by a reasoning engine. The
construction of a knowledge base, for instance, may take as an input structured
knowledge describing images such as <subject, attribute, ->, <subject, predicate,
object>, <subject, -, ->, and <-, action, ->. Input data for constructing the knowledge
base can be taken from existing image caption databases and image captions and
surrounding text in documents. The ability of the techniques described herein to
extract such knowledge from any image allows the image knowledge base to
include much more data from uncaptioned and untagged images, which is most
images. The image knowledge base and corresponding reasoning engine can
make deductions such as those needed in the man feeding baby captioning
example above. The image knowledge base can also provide the statistics to
support the probabilistic reasoning used in that example such as deducing that the
man is likely the baby's father. If the example had included an attribute like
<man, old>, then a more likely deduction may include that the man is the baby's
grandfather.
[0045] Having described examples of an environment in which a structured image
representation 106 is used to descriptively summarize images 114, further
discussion of operation of the knowledge extraction system 104 to generate and
use a model as part of knowledge extraction from images is included in the following.
[0046] FIG. 3 depicts a system 300 in an example implementation showing the
knowledge extraction system 104 of FIG. 1 in greater detail. In this example, the
knowledge extraction system 104 employs a machine learning approach to
generate the structured image representation 106. Accordingly, training data 302
is first obtained by the knowledge extraction system 104 that is to be used to train
the model that is then used to form the structured image representation 106.
Conventional techniques that are used to train models in similar scenarios (e.g.,
image understanding problems) rely on users to manually tag the images to form
the training data 302, which may be inefficient, expensive, time-consuming, and
prone to error. In the techniques described herein, however, the model is trained
using machine learning using techniques that are performable automatically and
without user intervention.
[0047] In the illustrated example, the training data 302 includes images 304 and
associated text 306, such as captions or metadata associated with the images 304.
An extractor module 308 is then used to extract structured semantic knowledge
310, e.g., "<Subject,Attribute>, Image" and "<Subject,PredicateObject>, Image"
using natural language processing as further described in relation to FIG. 4.
Extraction may also include localization of the structured semantic knowledge 310
to objects within the image as further described in relation to FIGS. 5 and 6.
[0048] The images 304 and corresponding structured semantic knowledge 310 are
then passed to a model training module 312. The model training module 312 is
illustrated as including a machine learning module 314 that is representative of
functionality to employ machine learning (e.g., neural networks, convolutional
neural networks, and so on) to train the model 316 using the images 304 and
structured semantic knowledge 310. The model 316 is trained to define a
relationship between text features included in the structured semantic knowledge
310 with image features in the images as further described in relation to FIG. 7.
[0049] The model 316 is then used by a structured logic determination module 318
to generate a structured image representation 106 for an input image 108. The
structured image representation 106, for instance, may include text that is structured
to define concepts of the image 108, even in instances in which the image 108
does not have text. Rather, the model 316 is usable to generate this text as part of
the structured image representation 106, which is then employed by the structured
image representation use module 320 to control a variety of functionality, such as
image searches, caption and metadata generation, and so on automatically and
without user intervention. Having described example modules and functionality
of the knowledge extraction system 104 generally, the following discussion
includes a description of these modules in greater detail.
[0050] FIG. 4 depicts an example implementation 400 showing the extractor
module 308 of FIG. 3 in greater detail. The extractor module 308 includes a
natural language processing module 402 that is representative of functionality to
use natural language processing (NLP) for semantic knowledge extraction from free-form (i.e., unstructured) text 306 associated with images 304 in the training data 302. Such free-form descriptions are readily available in existing image caption databases and documents with images such as web pages and PDF documents, and thus the natural language processing module 402 may take advantage of this availability, which is not possible using conventional manual techniques. However, manual techniques may also be employed in which a worker generates text 306 captions for images 304 to describe the images 304.
[0051] The structured semantic knowledge 310 is configurable in a variety of ways
as previously described, such as "<subject, attribute>, image" 406 and/or
"<subject, predicate, object>, image" 408 tuples. Examples of captions and
structured knowledge tuples as performed by the extractor module 308 include "A
boy is petting a dog while watching TV" which is then extracted as "<boy, petting,
dog>, <boy, watching, tv>." In another example, a caption "A brown horse is
eating grass in a big green field" is then extracted as "<horse, brown>, <field,
green>, <horse, eating, grass>, <horse, in, field>."
[0052] A variety of tuple extraction solutions may be employed by the natural
language processing module 402. Additionally, in some instances a plurality of
tuple extraction techniques may be applied to the same image caption and
consensus used among the techniques to correct mistakes in tuples, remove bad
tuples, and identify high confidence tuples or assign confidences to tuples. A
similar technique may be employed in which a tuple extraction technique is used
to perform tuple extraction jointly on a set of captions for the same image, and consensus is used to correct mistakes in tuples, remove bad tuples, and identify high confidence tuples or assign confidences to tuples. This data is readily available from existing databases as images oftentimes have multiple captions.
Additionally, inputs obtained from crowd sourcing may also be used to confirm good
tuples and to remove bad tuples.
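A rough sketch of one possible tuple extraction pass over a caption such as those above, assuming spaCy as the NLP library (the description does not name a specific toolkit); a production extractor, AMR parsing, or consensus over several extractors would be more robust than this single dependency-parse heuristic.

```python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def extract_tuples(caption):
    """Very rough <subject, predicate, object> and <subject, attribute>
    extraction from a dependency parse of a single caption."""
    doc = nlp(caption)
    tuples = []
    for token in doc:
        # <subject, predicate, object> from a verb with nsubj/dobj children
        if token.pos_ == "VERB":
            subjects = [c for c in token.children if c.dep_ == "nsubj"]
            objects = [c for c in token.children if c.dep_ in ("dobj", "obj")]
            for s in subjects:
                for o in objects:
                    tuples.append((s.lemma_, token.lemma_, o.lemma_))
        # <subject, attribute> from adjectival modifiers on nouns
        if token.pos_ == "NOUN":
            for c in token.children:
                if c.dep_ == "amod":
                    tuples.append((token.lemma_, c.lemma_))
    return tuples

print(extract_tuples("A brown horse is eating grass in a big green field"))
```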
[0053] In one or more implementations, abstract meaning representation (AMR)
techniques are used by the natural language processing module 402 to aid in tuple
extraction. AMR is aimed at achieving a deeper semantic understanding of free
form text. Although it does not explicitly extract knowledge tuples of the form
<Subject, Attribute> or <Subject, Predicate, Object>, a tuple representation may
be extracted from an AMR output. Additionally, knowledge tuples may be
extracted from a scene graph (e.g., a Stanford Scene Graph dataset) which is a
type of image representation for capturing object attributes and relationships for
use in semantic image retrieval.
[0054] FIG. 5 depicts an example system 500 in which the extractor module 308 of
FIG. 4 is shown as including localization functionality as part of knowledge
extraction. In addition to extraction of structured semantic knowledge 310 to
describe an image as a whole as part of the training data 302, structured semantic
knowledge 310 may also be localized within an image to promote efficient and
correct machine learning.
[0055] If there is a complex scene with a man walking a dog, for instance, then the
structured semantic knowledge 310 may be configured as "<man, walking, dog>,
image data" with the image data referring to a portion of the image 304 that
includes the man walking the dog, which is referred to as a bounding box 504 in
the following. Thus, tuples of the structured semantic knowledge 310 may refer to
portions within the image, examples of which are represented as "<subject,
attribute>, portion" 506 and"<subject, predicate, object>, portion" 508.
[0056] Accordingly, this may promote accuracy in training and subsequent use for
images having multiple entities and corresponding actions. For example, if an
entirety of an image is captioned and the image includes multiple concepts, e.g., a
woman jogging or a boy climbing a tree, then any machine learning performed
will be confronted with a determination of which part of the image is actually
correlated with <man, walking, dog>. Therefore, the more the structured semantic
knowledge 310 is localized, the easier it will be to fit a high quality model that
correlates images and structured text by the model training module 312. The
problem of associating parts of a textual description with parts of an image is also
called "grounding".
[0057] The grounding and localization module 502 may employ a variety of
techniques to perform localization. In one example, object detector and classifier
modules that are configured to identify particular objects and/or classify objects
are used to process portions of images 304. A region-CNN (convolutional neural
network) or a semantic segmentation technique may also be used to localize objects in an image.
[0058] In another example, structured semantic knowledge 310 tuples such as
<Subject, Attribute> and <Subject, Predicate, Object> and localized objects are
identified by calculating how many class occurrences have been localized for the
subject and object classes as further described below. This may also include
identifying subjects or objects that indicate that the tuple describes an entire scene,
in which case the entire training image 304 is associated with the tuple of the
structured semantic knowledge 310. To do so, an external list of scene types is
used, e.g., bathroom.
[0059] Before the grounding and localization module 502 can look up the
bounding boxes for an object class mentioned in the subject or object of a tuple,
the text used for the subject or object is mapped to a pre-defined subset of
database objects since bounding boxes are typically stored according to those class
labels. For example, the mapping problem may be solved from subject or object
text "guy" to a pre-defined class such as "man" by using a hierarchy to perform
the matching.
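A minimal sketch of mapping free-form subject or object text such as "guy" onto a pre-defined class label such as "man" by walking up a lexical hierarchy; WordNet via NLTK is an assumed stand-in for whichever hierarchy is actually used, and the class list is illustrative.

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")

def map_to_known_class(word, known_classes):
    """Map subject/object text onto a pre-defined class label by searching
    the word's synsets and their hypernym ancestors for a known class."""
    known = set(known_classes)
    for synset in wn.synsets(word, pos=wn.NOUN):
        frontier = [synset]
        while frontier:
            current = frontier.pop(0)
            for lemma in current.lemma_names():
                if lemma in known:
                    return lemma
            frontier.extend(current.hypernyms())
    return None

print(map_to_known_class("guy", ["man", "woman", "car", "dog"]))  # expected: "man"
```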
[0060] Once a set of bounding boxes 504 in an image 304 for the subject and
object classes in a <Subject, Predicate, Object> triple or the bounding boxes 504
for a <Subject, Attribute> double are obtained, rules and heuristics are then
employed by the grounding and localization module 502 to localize a tuple of the
structured semantic knowledge 310 within the training image 304. In a first such
example, for a <Subject, Attribute> tuple, if there is only a single occurrence of a subject class in the image 304 (e.g., just one car), then the tuple is associated with the single bounding box for that subject class since the bounding box 504 contains the subject and the attribute describes the subject within that box, e.g., "<car, shiny>."
[0061] For a <Subject, Predicate, Object> tuple with only a single occurrence of
the subject class and one occurrence of the object class, the tuple is associated
with the smallest rectangular image area that covers the bounding box for the
subject and the bounding box for the object, i.e., the bounding box of the two
bounding boxes. For example, if there is a single person and a single dog in the
image, then <person, walking, dog> is localized to the person and dog bounding
boxes. This likely contains the leash connecting the person and dog. In general,
the tacit assumption here is that the predicate relating the subject and object is
visible near the subject and object.
[0062] For a <Subject, Predicate, Object> tuple with a singular subject and a
singular object ("car" not "cars") and more than one occurrence of either the
subject class or the object class, the following is determined. If a nearest pair of
bounding boxes 504 with one from the subject class and one from the object class
is within a threshold distance, then this tuple is associated with the bounding box
of the nearest pair of bounding boxes. The assumption here is that the relationship
between a subject and object can be well localized visually. The distribution of
the distances between each of the pairs may also be used to determine if there is
uncertainty in this choice because of a second or third pair that also has a small distance.
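A minimal sketch of the two bounding-box heuristics just described (union of boxes when each class occurs once, nearest pair within a threshold otherwise); the (x0, y0, x1, y1) box format, the center-distance measure, and the threshold are illustrative assumptions.

```python
def union_box(box_a, box_b):
    """Smallest rectangle covering both boxes; boxes are (x0, y0, x1, y1)."""
    return (min(box_a[0], box_b[0]), min(box_a[1], box_b[1]),
            max(box_a[2], box_b[2]), max(box_a[3], box_b[3]))

def center(box):
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)

def distance(box_a, box_b):
    (ax, ay), (bx, by) = center(box_a), center(box_b)
    return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5

def localize_tuple(subject_boxes, object_boxes, max_distance):
    """Heuristic grounding of a <Subject, Predicate, Object> tuple.

    Single occurrence of each class: union of the two boxes.
    Multiple occurrences: union of the nearest subject/object pair,
    but only if that pair is within a threshold distance."""
    if len(subject_boxes) == 1 and len(object_boxes) == 1:
        return union_box(subject_boxes[0], object_boxes[0])
    pairs = [(distance(s, o), s, o)
             for s in subject_boxes for o in object_boxes]
    d, s, o = min(pairs)
    return union_box(s, o) if d <= max_distance else None
```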
[0063] The above heuristics give examples of types of information considered in
localization. Additional techniques may also be used to aid localization performed
by the grounding and localization module 502. An example of this is illustrated
by a text semantic module 510 that is representative of functionality of use of text
understanding to aid in grounding subjects and objects in the image. In one
example, positional attributes associated with a subject are used to select or
narrow down the correct bounding box for that subject. If there are several cars in
a scene, for instance, but the caption states "There is a child sitting on the hood of
the leftmost car", then the text semantic module 510 may aid in selecting the
bounding box with the minimum horizontal coordinate to ground as the leftmost
car in this caption and in the <child, sitting on, car> tuple extracted from it.
Instead of using the bounding box of all bounding boxes for cars in the example
above, the bounding box of just the grounded car or of the subset of cars that
match the "leftmost" criterion may be used. This determination may be
generalized to other criteria that may be measured, such as color.
[0064] In grounding a tuple, the grounding and localization module 502 first
reduces a set of bounding boxes for the subject and the object using their attributes
to filter out bounding boxes 504 that do not include these attributes. Such
attributes include position, color, and proximity to other identifiable regions, e.g.,
for "the car on the grass" the grass region is discoverable using a semantic segmentation algorithm.
[0065] Relative positional information is also used to select the correct pair of
subject class and object class bounding boxes for a positional relationship. For
example, if the caption is "A baby sits on top of a table", then the baby and table
are grounded to rectangles in the image with the baby rectangle above the table
rectangle. As such, this uniquely identifies the image area to associate with this
tuple if there are multiple babies and/or multiple tables in the scene.
[0066] For a <Subject, Predicate, Object> tuple with the subject and object
grounded in the image, the tuple is associated with the smallest rectangular image area that covers
the bounding box for the subject and the bounding box for the object. A variety of
other examples are also contemplated, such as to add an amount of context to
bounding boxes through inclusion of a larger area than would otherwise be
included in a "tight" bounding box.
[0067] FIG. 6 depicts an example implementation 600 of localization between
portions of an image 108 and structured semantic knowledge 310. As illustrated, a
bounding box 602 for "<man, sitting on, chair>" includes the man and the chair.
A bounding box 604 for "<man, feeding, baby>" includes both the man and the
baby. A bounding box 606 for "<baby, holding, toy>" includes the baby and the
toy. Having described extraction of structured semantic knowledge 310, the
following includes discussion of use of this extracted structured semantic
knowledge 310 to train a model 316 by the model training module 312.
[0068] FIG. 7 depicts an example implementation 700 showing the model training
module 312 in greater detail as employing a machine learning module 314 to
model a relationship between the structured semantic knowledge 310 that was
extracted from the text 306 and the images 304. In this example, the machine
learning module 314 is configured to model a relationship 702 between text
features 704 of the structured semantic knowledge 310 with image features of the
image 304 of the training data 302 in order to train the model 316.
[0069] The model 316, for instance, may be formed as a joint probabilistic model
having a form "P(<Subject, Attribute>, Image I)" or "P(<Subject, Predicate, Object>, Image
I)." The model 316 is built in this example to output a probability that image "I"
and structured text <Subject, Attribute> or <Subject, Predicate, Object> represent
the same real world concept visually and textually. The model 316 in this
example is configured to generalize well to unseen or rarely seen combinations of
subjects, attributes, predicates, and objects, and does not require explicit reduction
of a large vocabulary of individual words to a small, pre-defined set of concepts
through use of identification of latent concepts and matching of this structure as
further described below.
[0070] The model 316, once trained, is configured to locate images based on
structured text by computing probabilities that images correspond to the structured
text. For instance, a text-based image search involves mapping a text query (e.g.,
represented as a set of structured knowledge using a natural language tuple
extraction technique) to an image. This is supported by a joint model as further
described in relation to FIG. 8 by looping over images "I" and checking which gives a high probability "P(structured text <S,P,O>, image I)" for a given concept
<S,P,O>. Knowledge extraction/tagging is supported by looping over possible
concepts <S,P,O> and checking which gives a high probability "P(structured text
<S,P,O>, image I)" for a given image or image portion "I."
[0071] There are two parts to forming the model 316 by the model training module
312. The first is to generate a feature representation for the structured text
"<S,P,O>," "<S,A,->," and "<S,-,->" (where "-" indicates an unused slot to represent
all concepts as triples) and for images. The second part is to correlate the feature
representation of the structured text with image features, e.g., to correlate text
feature "t" 704 and image feature "x" 706 via "P(t,x)." For example, this defines a
relationship 702 of text features (e.g., <man, holding, ball>) with image features
706 of an image that shows a ball being held by a man. These two parts are further
described in greater detail below.
[0072] The structured semantic knowledge 310 "<S,P,O>" and "<S,A>" tuples are
configured such that similar structured knowledge concepts have nearby and
related representations, e.g., as vectors in a vector space. This supports
generalization and use of a large vocabulary. For example, text feature 704
representations of "<road, curvy>" and "<road, winding>" are configured to be
similar and the representations between "<dog, walking>" and "<person,
walking>" are related by the common action of walking. This may be performed
such that similar words are nearby in the space and the vector space captures some
relationships between words. For example, vec("man") + (vec("queen") - vec("woman")) = vec("king").
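A small illustration of this analogy property using pre-trained word vectors; gensim and the GoogleNews embedding file are assumptions for illustration only, not part of the described system.

```python
from gensim.models import KeyedVectors

# Path and embedding choice are illustrative; any word2vec-format file works.
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# vec("man") + (vec("queen") - vec("woman")) is expected to land near vec("king").
print(vectors.most_similar(positive=["man", "queen"], negative=["woman"], topn=3))
```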
[0073] The model training module 312 may also be configured to build upon
semantic vector representations of single words to develop a vector representation
of knowledge tuples which captures the relationship between two concepts
"<S1,P1,O1>" and "<S2,P2,O2>." Specifically, a feature vector is built for an
"<S,P,O>" triple as a function of single word representations "vec(S)," "vec(P),"
and "vec(O)." The "vec(<S,P,O>)" is built as a concatenation of the individual
word vectors "vec(<S,P,O>) = [vec(S) vec(P) vec(O)]."
[0074] When an "<S,P,0>" element is missing, such as the object "O" when
representing a "<Subject, Attribute>" or both a predicate "P" and object "O" when
representing a "<Subject>," the corresponding vector slot is filled using zeros.
Thus the vector representation for a subject, solely, lies along the "S" axis in
"S,P,0" space. Visual attributes may be addressed as modifiers for an unadorned
subject that move the representation of "<S,P>" into the "SP" plane of "S,P,O"
space. Another option involves summing the vector representations of the
individual words.
[0075] For a compound "S" or "P" or "O," the vector representation for each individual word in the phrase is averaged to insert a single vector into a target slot of a "[vec(S) vec(P) vec(O)]" representation. For example, "vec("running toward")" is equal to "0.5*(vec("running") + vec("toward"))." A non-uniform weighted average may also be used when some words in the phrase carry more meaning than others. In an implementation, a semantic representation (e.g., vector or probability distribution) is learned directly for compound phrases such as "running toward" or "running away from" by treating these phrases atomically as new vocabulary elements in an existing semantic word embedding model.
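A minimal sketch of this tuple representation, assuming a hypothetical "embed" lookup for single-word vectors, is as follows.

import numpy as np

def phrase_vector(phrase, embed, dim):
    """Average the word vectors of a (possibly compound) phrase; zeros for an unused slot."""
    if phrase is None:           # missing P or O slot for <S,A> or <S> concepts
        return np.zeros(dim)
    words = phrase.split()
    return np.mean([embed(w) for w in words], axis=0)

def tuple_vector(subject, predicate, obj, embed, dim):
    """vec(<S,P,O>) = [vec(S) vec(P) vec(O)], with zero-filled unused slots."""
    return np.concatenate([phrase_vector(subject, embed, dim),
                           phrase_vector(predicate, embed, dim),
                           phrase_vector(obj, embed, dim)])

# Toy usage with a random embedding table standing in for a learned word embedding.
rng = np.random.default_rng(0)
table = {}
embed = lambda w: table.setdefault(w, rng.normal(size=50))
t = tuple_vector("man", "running toward", "ball", embed, 50)   # third-order concept
t2 = tuple_vector("road", "curvy", None, embed, 50)            # <subject, attribute>
print(t.shape, t2.shape)  # (150,) (150,)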
[0076] There are a variety of choices of techniques that are usable to capture semantics of image features 706. In one such example, a deep machine learning network is used that has a plurality of levels of features that are learned directly from the data. In particular, convolutional neural networks (CNNs) with convolution, pooling, and activation layers (e.g., rectified linear units that threshold activity) have been proven for image classification. Examples include AlexNet, VGGNet, and GoogLeNet.
[0077] Additionally, classification features from deep classification nets have been shown to give high quality results on other tasks (e.g., segmentation), especially after fine tuning these features for the other task. Thus, starting from features learned for classification and then fine tuning these features for another image understanding task may exhibit increased training efficiency compared to starting training from scratch for a new task. For the reasons above, CNN features are adopted as fixed features in a baseline linear CCA model. The machine learning module 314 then fine tunes the model 316 from a CNN in a deep network for correlating text and image features 704, 706.
[0078] The machine learning module 314 is configured to map text features "t" 704 and image features "x" 706 into a common vector space and penalize differences in the mapped features when the same or similar concepts are represented by "t" and "x."
[0079] One technique that may be leveraged to do so is a linear mapping referred to as Canonical Correlation Analysis (CCA), which is applied to text and image features 704, 706. In CCA, matrices "T" and "X" are discovered that map feature vectors "t" and "x," respectively, into a common vector space "t' = Tt" and "x' = Xx." If the mapping is performed into a common space of dimension "D," and "t" is a vector in "D_t-dimensional space," and "x" is a vector in "D_x-dimensional space," then "T" is a "(D by D_t)" matrix, "X" is a "(D by D_x)" matrix, and the mapped representations t' and x' are D-dimensional vectors.
[0080] Loss functions may be employed for model fitting using training pairs "(t,x)" based on squared Euclidean distance "||t' - x'||_2^2," a cosine similarity "dot_product(t',x')," or the "angle_between(t',x')," which removes the vector length from the cosine similarity measure. When the dot product is used, the CCA correlation function is expressed as follows:

f(t,x) = f_CCA_dp(t,x) = tr(Tt) * Xx = tr(t) * M * x = sum_{i,j} t_i M_{ij} x_j,

where "tr" equals transpose, "M = tr(T)*X" is "(D_t by D_x)," and subscripts indicate vector components. This form supports a faster than exhaustive search for images or text given the other. For example, in text-based image search, images with feature vectors "x" are found such that "dot_prod(v,x)" is large, where "v = tr(t)*M."
[0081] For a squared Euclidean loss, the CCA correlation function may be expressed as follows:

f(t,x) = f_CCA_E(t,x) = ||Tt - Xx||_2^2.

Again, the simple closed form of the correlation function above may also support faster than exhaustive search for images or text given the other. For example, in text-based image search, images with feature vectors "x" are found such that "f_CCA_E(t,x)" is small for a given text vector "t." Given "(T,X)" from fitting the CCA model and the query "t," linear algebra provides a set of vectors that minimize "f(t,x)" and images are found with feature vector "x" close to this set.
[0082] FIG. 8 depicts an example of a deep network 800 for correlating text and images as part of machine learning. The deep network 800 includes a text machine learning module 802 (e.g., a column) and an image machine learning module 804 (e.g., a column that is separate from the column that implements the text machine learning module 802) that are configured to learn the correlation "f(<S,P,O>, I)" between structured semantic knowledge "<S,P,O>" and an image or image portion "I" by non-linear mapping into a common space.
[0083] The text machine learning module 802 starts with a semantic text vector representation "t" that includes vec(S) 806, vec(P) 808, and vec(O) 810, which is then passed through sets of fully connected and activation layers 812 to output a non-linear mapping t->t' as a feature vector for the text 814.
[0084] The image machine learning module 804 is configured as a deep convolutional neural network 814 (e.g., as AlexNet or VGGNet or GoogLeNet with the final layers mapping to probabilities of class removed) that starts from image pixels of the image 816 and outputs a feature vector x' for the image 814. The image module is initialized as the training result of an existing CNN and the image features are fine tuned to correlate images with structured text capturing image attributes and interactions instead of just object class discrimination as in the existing CNN.
[0085] Adaptation layers 822, 824 in the text and image machine learning modules 802, 804 adapt the representations according to a non-linear function to map them into a common space with image features representing the same concept. A loss layer 828 joins the modules and penalizes differences in the outputs t' and x' of the text and image machine learning modules 802, 804 to encourage mapping into a common space for the same concept.
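An illustrative PyTorch-style sketch of such a two-column arrangement follows; the layer sizes, the choice of ResNet-18 as the image column, and the simple squared-difference loss are assumptions for purposes of the sketch rather than the specific configuration of the deep network 800.

import torch
import torch.nn as nn
import torchvision.models as models

class TextColumn(nn.Module):
    """Maps t = [vec(S), vec(P), vec(O)] to t' through fully connected + activation layers."""
    def __init__(self, in_dim=900, embed_dim=256):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(),
            nn.Linear(512, embed_dim))
    def forward(self, t):
        return self.layers(t)

class ImageColumn(nn.Module):
    """CNN with the final class-probability layer removed, followed by an adaptation layer."""
    def __init__(self, embed_dim=256):
        super().__init__()
        cnn = models.resnet18()   # pretrained weights would typically be used in practice
        cnn.fc = nn.Identity()    # drop the classification layer; output is a 512-dim feature
        self.cnn = cnn
        self.adapt = nn.Linear(512, embed_dim)
    def forward(self, images):
        return self.adapt(self.cnn(images))

text_net, image_net = TextColumn(), ImageColumn()
t = torch.randn(4, 900)                 # batch of structured-text vectors
images = torch.randn(4, 3, 224, 224)    # batch of image tensors
t_prime, x_prime = text_net(t), image_net(images)

# Loss layer: penalize differences between t' and x' for matched pairs.
loss = nn.functional.mse_loss(t_prime, x_prime)
loss.backward()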
[0086] A discriminative loss function such as a ranking loss may be used to ensure that mismatched text and images have smaller correlation or larger distance than correctly matched text and images. For example, a simple ranking loss function may require correlations "dot_prod(t_i', x_i') > dot_prod(t_j', x_i')" for a training example "(t_i, x_i)" where the original tuple for training tuple "t_j" does not match training image "x_i." A ranking loss may also use a semantic text similarity or an external object hierarchy such as ImageNet to formulate the loss to non-uniformly penalize different mismatches.
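A minimal sketch of one such margin-based ranking loss, assuming in-batch negatives and an arbitrary margin value, is shown below.

import torch

def ranking_loss(t_prime, x_prime, margin=0.2):
    """Encourage dot_prod(t_i', x_i') > dot_prod(t_j', x_i') + margin for j != i.

    t_prime, x_prime: (N, D) batches of matched text/image embeddings where
    row i of t_prime corresponds to row i of x_prime."""
    scores = t_prime @ x_prime.t()                # scores[j, i] = dot_prod(t_j', x_i')
    pos = scores.diag().unsqueeze(0)              # matched correlations dot_prod(t_i', x_i')
    hinge = (margin + scores - pos).clamp(min=0)  # violations of the ranking constraint
    hinge.fill_diagonal_(0.0)                     # do not penalize the matched pair itself
    return hinge.mean()

loss = ranking_loss(torch.randn(4, 256), torch.randn(4, 256))
print(loss.item())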
[0087] Other loss functions and architectures are possible, for example with fewer or more adaptation layers between the semantic text representation "t = [vec(S), vec(P), vec(O)]" and the embedding space t', or with connections between text and image layers before the common embedding space. In one example, a wild card loss is used that ignores the object part of embedding vectors for second order facts <S, P> and the predicate and object parts of embedding vectors for first order facts <S>.
[0088] Returning again to FIG. 3, at this point structured semantic knowledge 310 is obtained by the model training module 312 to solve the problem of extracting a concept relevant to an image region. The modeling above is now applied for "P(Concept <S,P,O>, Image I)" to extract all high probability concepts about a portion of an image. This may be performed without choosing the most probable concept. For example, consider an image region that contains a smiling man who is wearing a blue shirt. Image pixel data "I" for this region will have high correlation with both "<man, smiling>" and "<man, wearing, blue shirt>" and thus both these concepts may be extracted for the same image region.
[0089] The knowledge extraction task may be solved by applying the above model with image pixel data from regions identified by an object proposal algorithm or object regions identified by the R-CNN algorithm, or even in a sliding window approach that more densely samples image regions. To capture object interactions, bounding boxes are generated from pairs of object proposals or pairs of R-CNN object regions. One approach is to try all pairs of potential object regions to test for possible interactions. Another approach is to apply some heuristics to be more selective, such as to not examine pairs that are distant in the image. Since the model may be applied to extract zero, one, or more high probability concepts about an image region, the extracted <S,P,O> concepts may be localized to image regions that provide the corresponding visual data.
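By way of illustration only, the following sketch enumerates candidate region pairs, optionally skips pairs whose centers are far apart, and forms the union bounding box used to test for interactions; the proposals themselves are assumed to come from an object proposal or R-CNN stage not shown here.

from itertools import combinations

def union_box(a, b):
    """Union of two (x1, y1, x2, y2) boxes."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def center_distance(a, b):
    ax, ay = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    bx, by = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5

def interaction_candidates(proposals, max_dist=None):
    """Pairs of proposals whose union box is tested for <S,P,O> interactions."""
    for a, b in combinations(proposals, 2):
        if max_dist is not None and center_distance(a, b) > max_dist:
            continue  # heuristic: skip pairs that are distant in the image
        yield a, b, union_box(a, b)

proposals = [(10, 10, 60, 120), (70, 30, 140, 150), (400, 400, 450, 440)]
for a, b, u in interaction_candidates(proposals, max_dist=200):
    print(a, b, "->", u)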
Example Procedures
[0090] The following discussion describes knowledge extraction techniques that may be implemented utilizing the previously described systems and devices. Aspects of each of the procedures may be implemented in hardware, firmware, or software, or a combination thereof. The procedures are shown as a set of blocks that specify operations performed by one or more devices and are not necessarily limited to the orders shown for performing the operations by the respective blocks. In portions of the following discussion, reference will be made to FIGS. 1-8.
[0091] FIG. 9 depicts a procedure 900 in an example implementation in which a
digital medium environment is employed to extract knowledge from an input
image automatically and without user intervention. A digital medium
environment is described to learn a model that is usable to compute a descriptive
summarization of an input image automatically and without user intervention.
Training data is obtained that includes images and associated text (block 902).
The training data 320, for instance, may include images 304 and unstructured text
306 that is associated with the images 304, e.g., as captions, metadata, and so
forth.
[0092] Structured semantic knowledge is extracted from the associated text using natural language processing by the at least one computing device, the structured semantic knowledge describing text features (block 904). The structured semantic knowledge 310, for instance, may be extracted using natural language processing to generate tuples, such as <subject, attribute>, <subject, predicate, object>, and so forth.
[0093] A model is trained using the structured semantic knowledge and the images
as part of machine learning (block 906). A model training module 312, for
instance, may train a neural network using the images 304 and structured semantic
knowledge 310. This knowledge may also be localized as described in greater
detail in relation to FIG. 10.
[0094] The model is used to form a structured image representation of the input image that explicitly correlates at least part of the text features with image features of the input image as the descriptive summarization of the input image (block 908). The structured image representation, for instance, may correlate concepts in the text with portions of the images along with addressing a structure of the knowledge to describe "what is going on" in the images as a descriptive summarization. This descriptive summarization may be employed in a variety of ways, such as to locate images as part of an image search, perform automated generation of captions, and so on.
[0095] FIG. 10 depicts a procedure 1000 in an example implementation in which a digital medium environment is employed to extract knowledge and localize text features to image features of an input image. A digital medium environment is described to learn a model that is usable to compute a descriptive summarization of an object within an input image automatically and without user intervention. Structured semantic knowledge is extracted from text associated with images using natural language processing by the at least one computing device (block 1002). Image features of objects within respective said images are localized as corresponding to the text features of the structured semantic knowledge (block 1004). As before, structured semantic knowledge 310 is extracted. However, in this case this knowledge is localized to particular portions of the image and thus may improve accuracy of subsequent modeling by potentially differentiating between multiple concepts in an image, e.g., the baby holding the toy and the man feeding the baby as shown in FIG. 1.
[0096] A model is trained using the localized image and text features as part of machine learning (block 1006). A variety of different techniques may be used, such as to perform probabilistic modeling. The model is used to form a structured image representation of the input image that explicitly correlates at least one of the textual features with at least one image feature of the object included in the input image (block 1008). For example, the structured logic determination module 318 may take an input image 108 and form a structured image representation 106, especially in instances in which the input image 108 does not include associated text. Further, the structured image representation 106 may be localized to correlate concepts included in the text and image to each other. As before, the structured image representation 106 may be used to support a variety of functionality, such as image searches, automated caption generation, and so forth.
Implementation Example
[0097] FIG. 11 depicts an example system 1100 usable to perform structured fact image embedding. This system 1100 supports properties such as an ability to (1) be continuously fed with new facts without changing the architecture, (2) learn with wild cards to support all types of facts, (3) generalize to unseen or otherwise not-directly observable facts, and (4) allow two-way retrieval, such as to retrieve relevant facts in a language view given an image and to retrieve relevant images given a fact in a language view. This system 1100 aims to model structured knowledge in images as a problem having views in the visual domain V and the language domain L. Let "f" be a structured "fact" (i.e., concept) and let "f_l ∈ L" denote the view of "f" in the language domain. For instance, an annotated fact with language view "f_l = <S: girl, P: riding, O: bike>" would have a corresponding visual view "f_v" as an image where the fact occurs, as shown in FIG. 11.
[0098] The system is configured to learn a representation that covers first-order facts <S> (objects), second-order facts <S, P> (actions and attributes), and third-order facts <S, P, O> (interaction and positional facts). These types of facts are represented as an embedding problem into a "structured fact space." The structured fact space is configured as a learned representation of three hyper-dimensions that are denoted as follows:
φ_S ∈ R^{d_S}, φ_P ∈ R^{d_P}, and φ_O ∈ R^{d_O}.

[0099] The embedding functions from a visual view of a fact "f_v" are denoted as the following, respectively:

φ_S^v(f_v), φ_P^v(f_v), and φ_O^v(f_v).

[00100] Similarly, the embedding functions from a language view of a fact "f_l" are denoted as the following, respectively:

φ_S^l(f_l), φ_P^l(f_l), and φ_O^l(f_l).

[00101] The concatenation of the visual view hyper-dimensions' embedding is denoted as:

φ^v(f_v) = [φ_S^v(f_v), φ_P^v(f_v), φ_O^v(f_v)].

[00102] The concatenation of the language view hyper-dimensions' embedding is denoted as:

φ^l(f_l) = [φ_S^l(f_l), φ_P^l(f_l), φ_O^l(f_l)],

where the above are the visual embedding and the language embedding of "f," respectively, thereby forming the pair (φ^v(f_v), φ^l(f_l)).

[00103] Thus, as is apparent from the above, third-order facts <S, P, O> can be directly embedded into the structured fact space by:

φ^v(f_v) = [φ_S^v(f_v), φ_P^v(f_v), φ_O^v(f_v)]

for the image view and:

φ^l(f_l) = [φ_S^l(f_l), φ_P^l(f_l), φ_O^l(f_l)]

for the language view.
[00104] First-order facts are facts that indicate an object, like <S: person>. Second-order facts are more specific about the subject, e.g., <S: person, P: playing>. Third-order facts are even more specific, e.g., <S: person, P: playing, O: piano>. In the following, higher order facts are defined as lower order facts with an additional modifier applied. For example, adding the modifier "P: eating" to the fact <S: kid> constructs the fact <S: kid, P: eating>. Further, applying the modifier "O: ice cream" to the fact <S: kid, P: eating> constructs the fact <S: kid, P: eating, O: ice cream>. Similarly, attributes may be addressed as modifiers to a subject, e.g., applying "P: smiling" to the fact <S: baby> constructs the fact <S: baby, P: smiling>.
[00105] Based on the fact modifier observation above, both first and second order facts may be represented as wild cards, as illustrated in the following for first-order and second-order facts, respectively:

<S, *, *> and <S, P, *>.

Setting "φ_P" and "φ_O" to "*" for first-order facts is interpreted to mean that the "P" and "O" modifiers are not of interest for first-order facts. Similarly, setting "φ_O" to "*" for second-order facts indicates that the "O" modifier is not of interest for single-frame actions and attributes.
[00106] Both first and second-order facts are named wild-card facts. Since modeling structured facts in visual data potentially allows logical reasoning over facts from images, the described problem is also referenced as a "Sherlock" problem in the following.
[00107] In order to train a machine learning model that connects the structured fact language view in L with its visual view in V, data is collected in the form of (f_v, f_l) pairs. Data collection for large scale problems has become increasingly challenging, especially in the below examples as the model relies on localized association of a structured language fact "f_l" with an image "f_v" when such facts occur. In particular, it is a complex task to collect annotations, especially for second-order facts <S, P> and third-order facts <S, P, O>. Also, multiple structured language facts may be assigned to the same image, e.g., <S: man, P: smiling> and <S: man, P: wearing, O: glass>. If these facts refer to the same man, the same image example could be used to learn about both facts.
[00108] As previously described, techniques are discussed in which fact annotations are automatically collected from datasets that come in the form of image/caption pairs. For example, a large quantity of high quality facts may be obtained from caption datasets using natural language processing. Since caption writing is free form, these descriptions are typically readily available, e.g., from social networks, preconfigured databases, and so forth.
[00109] In the following example, a two-step automatic annotation process is described: (i) fact extraction from captions, which includes any text associated with an image that describes the image; and (ii) fact localization in images. First, the captions associated with the given image are analyzed to extract sets of clauses that are considered as candidate <S, P> and <S, P, O> facts in the image. Clauses form facts but are not necessarily facts by themselves.
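A rough sketch of the first step, assuming the spaCy library and its small English model as the natural language processor (one possible choice rather than the specific parser used by the described techniques), could extract candidate clauses as follows.

# Sketch only: a very simplified subject-verb-object clause extractor built on spaCy.
# Requires the "en_core_web_sm" model to be installed.
import spacy

nlp = spacy.load("en_core_web_sm")

def candidate_facts(caption):
    facts = []
    doc = nlp(caption)
    for token in doc:
        if token.pos_ == "VERB":
            subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c for c in token.children if c.dep_ in ("dobj", "attr")]
            for s in subjects:
                if objects:
                    facts.extend((s.text, token.lemma_, o.text) for o in objects)
                else:
                    facts.append((s.text, token.lemma_, None))  # candidate <S, P> fact
    return facts

print(candidate_facts("A smiling man is holding a ball."))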
[00110] Captions can provide rich amounts of information to image understanding systems. However, developing natural language processing systems to accurately and completely extract structured knowledge from free-form text is challenging due to (1) spelling and punctuation mistakes; (2) word sense ambiguity within clauses; and (3) a spatial preposition lexicon that may include hundreds of terms such as "next to" and "on top of," as well as collection phrase adjectives such as "group of," "bunch of," and so forth.
[00111] The process of localizing facts in an image is constrained by information in the dataset. For example, a database may contain object annotations for different objects by training and validation sets. This allows first-order facts to be localized for objects using bounding box information. In order to locate higher-order facts in images, visual entities are defined as any noun that is either a dataset object or a noun in a predefined ontology that is an immediate or indirect hypernym of one of the objects. It is expected that visual entities appear either in the S or the O part, if it exists, of a candidate fact "f_l," which allows for the localization of facts for images. Given a candidate third-order fact, an attempt is first made to assign each "S" and "O" to one of the visual entities. If "S" and "O" are not visual entities, then the clause is ignored. Otherwise, the clauses are processed by several heuristics. The heuristics, for instance, may take into account whether the subject or the object is singular or plural, or a scene. For example, in the fact <S: men, P: chasing, O: soccer ball> the techniques described herein may identify that "men" may involve a union of multiple candidate bounding boxes, while for "soccer ball" it is expected that there is a single bounding box.
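A simplified sketch of such localization heuristics follows; the object annotations, the plural-noun test, and the single-box assumption for the object are illustrative assumptions rather than the dataset-specific rules described above.

def union_boxes(boxes):
    """Union of a list of (x1, y1, x2, y2) boxes."""
    xs1, ys1, xs2, ys2 = zip(*boxes)
    return (min(xs1), min(ys1), max(xs2), max(ys2))

def localize_fact(fact, annotations, plural_nouns=("men", "people", "children")):
    """Map the S and O parts of an <S, P, O> fact to annotated boxes, if possible.

    annotations: dict mapping a visual-entity noun to a list of boxes in the image."""
    subject, _, obj = fact
    if subject not in annotations or (obj is not None and obj not in annotations):
        return None  # S or O is not a visual entity; ignore the clause
    s_boxes = annotations[subject]
    s_box = union_boxes(s_boxes) if subject in plural_nouns else s_boxes[0]
    if obj is None:
        return s_box
    o_box = annotations[obj][0]  # e.g., a single "soccer ball" box is expected
    return union_boxes([s_box, o_box])

annotations = {"men": [(10, 10, 50, 120), (60, 15, 100, 125)],
               "soccer ball": [(120, 90, 140, 110)]}
print(localize_fact(("men", "chasing", "soccer ball"), annotations))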
[00112] A straightforward way to model facts in images is to learn a classifier for each separate fact. However, there is a clear scalability limitation in this technique as the number of facts is significant, e.g., |S| x |P| x |O|, where |S|, |P|, and |O| are the number of subjects, predicates, and objects, respectively. Thus, this number could reach millions for possible facts in the real world. In addition to scalability problems, this technique discards semantic relationships between facts, which is a significant property that allows generalization to unseen facts or facts with few examples. For instance, during training there might be a second-order fact like <S: boy, P: playing> and first-order facts like <S: girl> and <S: boy>. At run time, the model trained using the techniques described herein understands an image with the fact <girl, playing> even if this fact is not seen during training, which is clearly not captured by learning a model for each fact in the training.
[00113] Accordingly, a two-view embedding problem is described in this example that is used to model structured facts. For example, a structured fact embedding model may include (1) two-way retrieval (i.e., retrieve relevant facts in a language view given an image, and retrieve relevant images given a fact in a language view); and (2) support for wild-card facts, i.e., first and second order facts.
[00114] The first property is satisfied in this example by using a generative model p(f_v, f_l) that connects the visual and the language views of "f." This technique first models the following:

p(f_v, f_l) ∝ s(φ^v(f_v), φ^l(f_l)),

where "s(·, ·)" is a similarity function defined over the structured fact space denoted by "S," which is a discriminative space of facts. This is performed such that two views of the same fact are embedded close to each other.
[00115] To model and train the visual view "f_v," a CNN encoder is used, and to model and train the language view "f_l," an RNN encoder is used. Two models are proposed for learning facts, denoted by Model 1 and Model 2 in an example implementation 1200 of FIG. 12. Models 1 and 2 share the same structured fact language embedding and encoder but differ in the structured fact image encoder.
[00116] This process starts by defining an activation operator "Ψ(θ, a)," where "a" is an input and "θ" is a series of one or more neural network layers, which may include different layer types such as four convolution, one pooling, and another convolution and pooling. The operator "Ψ(θ, a)" applies the "θ" parameters layer by layer to compute the activation of the "θ" subnetwork given "a." The operator "Ψ" is used to define the Model 1 and Model 2 structured fact image encoders.
[00117] In Model 1, a structured fact is visually encoded by sharing convolutional layer parameters (denoted by θ_c) and fully connected layer parameters (denoted by θ_fc). Then, "W_S," "W_P," and "W_O" transformation matrices are applied to produce "φ_S^v(f_v)," "φ_P^v(f_v)," and "φ_O^v(f_v)" as follows:

φ_S^v(f_v) = W_S · Ψ(θ_fc, Ψ(θ_c, f_v)), φ_P^v(f_v) = W_P · Ψ(θ_fc, Ψ(θ_c, f_v)), and φ_O^v(f_v) = W_O · Ψ(θ_fc, Ψ(θ_c, f_v)).
[00118] In contrast to Model 1, different convolutional layers are used in Model 2 for "S" than for "P" and "O," consistent with the above discussion that "P" and "O" are modifiers to "S" as previously described. Starting from "f_v," there is a common set of convolutional layers, denoted by "θ_c^0," then the network splits into two branches, producing two sets of convolutional layers "θ_c^S" and "θ_c^PO," followed by two sets of fully connected layers "θ_fc^S" and "θ_fc^PO." Finally, "φ_S^v(f_v)," "φ_P^v(f_v)," and "φ_O^v(f_v)" are computed by transformation matrices "W_S," "W_P," and "W_O" as follows:

φ_S^v(f_v) = W_S · Ψ(θ_fc^S, Ψ(θ_c^S, Ψ(θ_c^0, f_v))), φ_P^v(f_v) = W_P · Ψ(θ_fc^PO, Ψ(θ_c^PO, Ψ(θ_c^0, f_v))), and φ_O^v(f_v) = W_O · Ψ(θ_fc^PO, Ψ(θ_c^PO, Ψ(θ_c^0, f_v))).
[00119] In both models, a structured language fact is encoded using RNN word embedding vectors for "S," "P," and "O." Hence, "φ_S^l(f_l) = RNN_θL(f_l^S)," "φ_P^l(f_l) = RNN_θL(f_l^P)," and "φ_O^l(f_l) = RNN_θL(f_l^O)," where "f_l^S," "f_l^P," and "f_l^O" are the subject, predicate, and object parts of "f_l." For each of these, the literals are dropped, and if any of "f_l^S," "f_l^P," and "f_l^O" contain multiple words, the average vector is computed as the representation of that part. The RNN language encoder parameters are denoted by "θ_L." In one or more implementations, "θ_L" is fixed to a pre-trained word vector embedding model for "f_l^S," "f_l^P," and "f_l^O."
[00120] One way to model "p(f_v, f_l)" for Model 1 and Model 2 is to assume that "p(f_v, f_l) ∝ exp(-loss_d(f_v, f_l))" and to minimize the "loss_d(f_v, f_l)" distance loss, which is defined as follows:

loss_d(f_v, f_l) = w_S ||φ_S^v(f_v) - φ_S^l(f_l)||_2^2 + w_P ||φ_P^v(f_v) - φ_P^l(f_l)||_2^2 + w_O ||φ_O^v(f_v) - φ_O^l(f_l)||_2^2,

which minimizes the distances between the embedding of the visual view and the language view. A solution to penalize wild-card facts is to ignore the wild-card modifiers in the loss through use of a weighted Euclidean distance, the weighting of which is based on whether corresponding parts of the feature vectors are present, which is called a "wild card" loss. Here "w_S = 1," "w_P = 1," and "w_O = 1" for <S, P, O> facts, "w_S = 1," "w_P = 1," and "w_O = 0" for <S, P> facts, and "w_S = 1," "w_P = 0," and "w_O = 0" for <S> facts. Hence "loss_d" does not penalize the "O" modifier for the second order facts or the "P" and "O" modifiers for first order facts, which follows the above definition of a wild-card modifier.
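A minimal sketch of such a wild-card weighted distance loss, with assumed tensor shapes, is given below.

import torch

def wildcard_loss(phi_v, phi_l, weights):
    """Weighted Euclidean distance between visual and language fact embeddings.

    phi_v, phi_l: tuples (phi_S, phi_P, phi_O) of (N, D) embeddings per hyper-dimension.
    weights: (N, 3) tensor of (w_S, w_P, w_O) per fact, e.g. (1, 1, 0) for <S, P> facts."""
    loss = 0.0
    for k, (v, l) in enumerate(zip(phi_v, phi_l)):
        loss = loss + weights[:, k] * ((v - l) ** 2).sum(dim=1)
    return loss.mean()

N, D = 4, 300
phi_v = tuple(torch.randn(N, D) for _ in range(3))
phi_l = tuple(torch.randn(N, D) for _ in range(3))
w = torch.tensor([[1., 1., 1.],   # <S, P, O> fact
                  [1., 1., 0.],   # <S, P> fact: O modifier ignored
                  [1., 0., 0.],   # <S> fact: P and O modifiers ignored
                  [1., 1., 1.]])
print(wildcard_loss(phi_v, phi_l, w).item())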
[00121] Accordingly, this example describes a problem of associating high-order visual and language facts. A neural network approach is described for mapping visual facts and language facts into a common, continuous structured fact space that allows natural language facts to be associated with images and images to be associated with natural language structured descriptions.
Example System and Device
[00122] FIG. 13 illustrates an example system generally at 1300 that includes an example computing device 1302 that is representative of one or more computing systems and/or devices that may implement the various techniques described herein. This is illustrated through inclusion of the knowledge extraction system 104. The computing device 1302 may be, for example, a server of a service provider, a device associated with a client (e.g., a client device), an on-chip system, and/or any other suitable computing device or computing system.
[00123] The example computing device 1302 as illustrated includes a processing system 1304, one or more computer-readable media 1306, and one or more I/O interfaces 1308 that are communicatively coupled, one to another. Although not
shown, the computing device 1302 may further include a system bus or other data
and command transfer system that couples the various components, one to
another. A system bus can include any one or combination of different bus
structures, such as a memory bus or memory controller, a peripheral bus, a
universal serial bus, and/or a processor or local bus that utilizes any of a variety of
bus architectures. A variety of other examples are also contemplated, such as
control and data lines.
[00124] The processing system 1304 is representative of functionality to perform one or more operations using hardware. Accordingly, the processing system 1304 is illustrated as including hardware elements 1310 that may be configured as
processors, functional blocks, and so forth. This may include implementation in
hardware as an application specific integrated circuit or other logic device formed
using one or more semiconductors. The hardware elements 1310 are not limited
by the materials from which they are formed or the processing mechanisms
employed therein. For example, processors may be comprised of
semiconductor(s) and/or transistors (e.g., electronic integrated circuits (ICs)). In
such a context, processor-executable instructions may be electronically-executable
instructions.
[00125] The computer-readable storage media 1306 is illustrated as including
memory/storage 1312. The memory/storage 1312 represents memory/storage
capacity associated with one or more computer-readable media. The
memory/storage component 1312 may include volatile media (such as random
access memory (RAM)) and/or nonvolatile media (such as read only memory
(ROM), Flash memory, optical disks, magnetic disks, and so forth). The
memory/storage component 1312 may include fixed media (e.g., RAM, ROM, a
fixed hard drive, and so on) as well as removable media (e.g., Flash memory, a
removable hard drive, an optical disc, and so forth). The computer-readable media
1306 may be configured in a variety of other ways as further described below.
[00126] Input/output interface(s) 1308 are representative of functionality to allow a
user to enter commands and information to computing device 1302, and also allow
information to be presented to the user and/or other components or devices using
various input/output devices. Examples of input devices include a keyboard, a
cursor control device (e.g., a mouse), a microphone, a scanner, touch functionality
(e.g., capacitive or other sensors that are configured to detect physical touch), a
camera (e.g., which may employ visible or non-visible wavelengths such as
infrared frequencies to recognize movement as gestures that do not involve touch),
and so forth. Examples of output devices include a display device (e.g., a monitor
or projector), speakers, a printer, a network card, tactile-response device, and so
forth. Thus, the computing device 1302 may be configured in a variety of ways as
further described below to support user interaction.
[00127] Various techniques may be described herein in the general context of software, hardware elements, or program modules. Generally, such modules include routines, programs, objects, elements, components, data structures, and so
forth that perform particular tasks or implement particular abstract data types. The
terms "module," "functionality," and "component" as used herein generally
represent software, firmware, hardware, or a combination thereof. The features of
the techniques described herein are platform-independent, meaning that the
techniques may be implemented on a variety of commercial computing platforms
having a variety of processors.
[00128] An implementation of the described modules and techniques may be stored on or transmitted across some form of computer-readable media. The computer-readable media may include a variety of media that may be accessed by the computing device 1302. By way of example, and not limitation, computer-readable media may include "computer-readable storage media" and "computer-readable signal media."
100129] "Computer-readable storage media" may refer to media and/or devices that
enable persistent and/or non-transitory storage of information in contrast to mere
signal transmission, carrier waves, or signals per se. Thus, computer-readable
storage media refers to non-signal bearing media. The computer-readable storage
media includes hardware such as volatile and non-volatile, removable and non-removable media and/or storage devices implemented in a method or technology
suitable for storage of information such as computer readable instructions, data
structures, program modules, logic elements/circuits, or other data. Examples of
computer-readable storage media may include, but are not limited to, RAM, ROM,
EEPROM, flash memory or other memory technology, CD-ROM, digital versatile
disks (DVD) or other optical storage, hard disks, magnetic cassettes, magnetic
tape, magnetic disk storage or other magnetic storage devices, or other storage
device, tangible media, or article of manufacture suitable to store the desired
information and which may be accessed by a computer.
100130]"Computer-readable signal media" may refer to a signal-bearing medium
that is configuredto transmit instructions to the hardware of the computing device
1302, such as via a network. Signal media typically may embody computer
readable instructions, data structures, program modules, or other data in a
modulated data signal, such as carrier waves, data signals, or other transport
mechanism. Signal media also include any information delivery media. The term
"modulated data signal" means a signal that has one or more of its characteristics
set or changed in such a manner as to encode information in the signal. By way of
example, and not limitation, communication media include wired media such as a
wired network or direct-wired connection, and wireless media such as acoustic,
RF, infrared, and other wireless media.
[00131] As previously described, hardware elements 1310 and computer-readable media 1306 are representative of modules, programmable device logic and/or
fixed device logic implemented in a hardware form that may be employed in some
embodiments to implement at least some aspects of the techniques described
herein, such as to perform one or more instructions. Hardware may include
components of an integrated circuit or on-chip system, an application-specific
integrated circuit (ASIC), a field-programmable gate array (FPGA), a complex
programmable logic device (CPLD), and other implementations in silicon or other
hardware. In this context, hardware may operate as a processing device that
performs program tasks defined by instructions and/or logic embodied by the
hardware as well as hardware utilized to store instructions for execution, e.g., the
computer-readable storage media described previously.
[00132] Combinations of the foregoing may also be employed to implement various
techniques described herein. Accordingly, software, hardware, or executable
modules may be implemented as one or more instructions and/or logic embodied
on some form of computer-readable storage media and/or by one or more
hardware elements 1310. The computing device 1302 may be configured to
implement particular instructions and/or functions corresponding to the software
and/or hardware modules. Accordingly, implementation of a module that is
executable by the computing device 1302 as software may be achieved at least
partially in hardware, e.g., through use of computer-readable storage media and/or
hardware elements 1310 of the processing system 1304. The instructions and/or
functions may be executable/operable by one or more articles of manufacture (for
example, one or more computing devices 1302 and/or processing systems 1304) to
implement techniques, modules, and examples described herein.
[00133] The techniques described herein may be supported by various
configurations of the computing device 1302 and are not limited to the specific
examples of the techniques described herein. This functionality may also be
implemented all or in part through use of a distributed system, such as over a
"cloud" 1314 via a platform 1316 as described below.
[00134] The cloud 1314 includes and/or is representative of a platform 1316 for
resources 1318. The platform 1316 abstracts underlying functionality of hardware
(e.g., servers) and software resources of the cloud 1314. The resources 1318 may
include applications and/or data that can be utilized while computer processing is
executed on servers that are remote from the computing device 1302. Resources
1318 can also include services provided over the Internet and/or through a
subscriber network, such as a cellular or Wi-Fi network.
[00135] The platform 1316 may abstract resources and functions to connect the
computing device 1302 with other computing devices. The platform 1316 may
also serve to abstract scaling of resources to provide a corresponding level of scale
to encountered demand for the resources 1318 that are implemented via the
platform 1316. Accordingly, in an interconnected device embodiment,
implementation of functionality described herein may be distributed throughout
the system 1300. For example, the functionality may be implemented in part on
the computing device 1302 as well as via the platform 1316 that abstracts the
functionality of the cloud 1314.
Conclusion
[00136] Although the invention has been described in language specific to structural
features and/or methodological acts, it is to be understood that the invention
defined in the appended claims is not necessarily limited to the specific features or
acts described. Rather, the specific features and acts are disclosed as example
forms of implementing the claimed invention.

Claims (20)

  1. CLAIMS What is claimed is: 1. In a digital medium environment to learn a model that is usable to explicitly correlate image features of an input image with text features automatically and without user intervention, a method implemented by at least one computing device comprising: obtaining training data by the at least one computing device, the training data including images and associated text; extracting, by the at least one computing device, a plurality of text features resulting from natural language processing of the associated text of the training data, the plurality of text features resulting, respectively, from a plurality of different extraction techniques, the plurality of text features corresponding to an object within a respective said image of the training data; determining, by the at least one computing device, a correction to at least one said text feature of the plurality of text features of the training data using a consensus reached by majority agreement between the plurality of text features from the plurality of different extraction techniques; correcting, by the at least one computing device, the at least one said text feature in the training data using the correction thereby generating corrected training data; and training, by the at least one computing device, a model including the corrected at least one said text feature in the corrected training data and the image features of the object as part of machine learning, the model once trained is configured to correlate the image features of the object within an input image with the plurality of text features.
  2. 2. The method as described in claim 1, further comprising generating a descriptive summarization of the object of the input image using the model.
  3. 3. The method as described in claim 1, wherein the associated text is free form.
  4. 4. The method as described in claim 3, wherein the associated text is a caption or metadata of the respective said image.
  5. 5. The method as described in claim 1, wherein the plurality of text features are in a form of <subject, predicate, object> tuple.
  6. 6. The method as described in claim 1, further comprising: removing at least one of the plurality of the text features from use as part of the training.
  7. 7. The method as described in claim 1, further comprising: identifying a degree of confidence in the extracting.
  8. 8. The method as described in claim 1, wherein the training includes adapting the plurality of text features or the image features one to another, within a vector space.
  9. 9. The method as described in claim 1, wherein the model explicitly correlates the image features of the input image with the plurality of text features such that at least one of the image features is explicitly correlated with a first one of the plurality of text features but not a second one of the plurality of text features.
  10. 10. The method as described in claim 1, the plurality of text features are explicitly correlated to the image features.
  11. 11. In a digital medium environment, a system by at least one computing device comprising: an extractor module to extract a plurality of text features from text associated with images in training data using natural language processing, the plurality of text features extracted, respectively, using a plurality of different extraction techniques, the extractor module configured to remove at least one said text feature of the plurality of the text features from the training data using a consensus reached by majority agreement between the plurality of text features resulting from the plurality of different extraction techniques as corresponding to an object within a respective said image; and a model training module to train a model using the training data having the at least one said text feature removed and image features as part of machine learning, the model configured for determining probabilities of how well image features of an input image correlate to the plurality of text features.
  12. 12. The system as described in claim 11, wherein the associated text is unstructured.
  13. 13. The system as described in claim 11, wherein the plurality of text features are in a form of <subject, predicate, object> tuple.
  14. 14. The system as described in claim 11, wherein the extractor module is configured to localize at least part of the plurality of text features as corresponding to respective portions within respective said images and as not corresponding to other portions within respective said images.
  15. 15. The system as described in claim 11, further comprising a module configured to use the plurality of text features to locate the input image as part of an image search as part of a determination of how well a caption generated using the plurality of text features corresponds to a search query of the image search based on the determined probabilities.
  16. 16. The system as described in claim 11, further comprising a module configured to generate a caption for the input image based on the plurality of text features.
  17. 17. The system as described in claim 11, further comprising a use module configured to deduce, based on the plurality of text features, scene properties of the input image.
  18. 18. In a digital medium environment, a method implemented by at least one computing device, the method comprising: obtaining, by the at least one computing device, training data including images and associated text; extracting, by the at least one computing device, a plurality of text features using natural language processing from the associated text resulting from a plurality of different extraction techniques, the plurality of text features corresponding to image features of an object within a respective said image of the training data; assigning, by the at least one computing device, a degree of confidence to the plurality of text features using a consensus reached by the plurality of text features resulting from the plurality of different extraction techniques; and training, by the at least one computing device, a model using the plurality of text features, the image features of the object, and the degree of confidence as part of machine learning, the model once trained is configured to correlate the image features of the object within input image with the plurality of text features.
  19. 19. The method of claim 18, wherein the associated text is free form.
  20. 20. The method as described in claim 18, wherein the plurality of text features are in a form of a <subject, predicate, object> tuple.
AU2016225819A 2015-11-11 2016-09-07 Structured knowledge modeling and extraction from images Active AU2016225819B2 (en)

Priority Applications (6)

Application Number Priority Date Filing Date Title
US201562254147P true 2015-11-11 2015-11-11
US201562254143P true 2015-11-11 2015-11-11
US62/254,143 2015-11-11
US62/254,147 2015-11-11
US14/978,350 2015-12-22
US14/978,350 US20170132526A1 (en) 2015-11-11 2015-12-22 Structured Knowledge Modeling and Extraction from Images

Publications (2)

Publication Number Publication Date
AU2016225819A1 AU2016225819A1 (en) 2017-05-25
AU2016225819B2 true AU2016225819B2 (en) 2021-04-22

Family

ID=58735575

Family Applications (2)

Application Number Title Priority Date Filing Date
AU2016225819A Active AU2016225819B2 (en) 2015-11-11 2016-09-07 Structured knowledge modeling and extraction from images
AU2016225820A Active AU2016225820B2 (en) 2015-11-11 2016-09-07 Structured knowledge modeling, extraction and localization from images

Family Applications After (1)

Application Number Title Priority Date Filing Date
AU2016225820A Active AU2016225820B2 (en) 2015-11-11 2016-09-07 Structured knowledge modeling, extraction and localization from images

Country Status (1)

Country Link
AU (2) AU2016225819B2 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10922713B2 (en) * 2017-01-03 2021-02-16 Facebook, Inc. Dynamic creative optimization rule engine for effective content delivery

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140328570A1 (en) * 2013-01-09 2014-11-06 Sri International Identifying, describing, and sharing salient events in images and videos
US9330296B2 (en) * 2013-03-15 2016-05-03 Sri International Recognizing entity interactions in visual media

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JIN, J. et al., "Aligning where to see and what to tell: image caption with region-based attention and scene factorization", arXiv.org, published 20 June 2015 <URL: https://arxiv.org/pdf/1506.06272.pdf> *
MAO, J. et al., "Deep captioning with multimodal recurrent neural network (M-RNN)", arXiv.org, published 11 June 2015 <URL: https://arxiv.org/pdf/1412.6632.pdf> *

Also Published As

Publication number Publication date
AU2016225820A1 (en) 2017-05-25
AU2016225820B2 (en) 2021-04-15
AU2016225819A1 (en) 2017-05-25

Similar Documents

Publication Publication Date Title
US20170132526A1 (en) Structured Knowledge Modeling and Extraction from Images
US10460033B2 (en) Structured knowledge modeling, extraction and localization from images
Song et al. Tvsum: Summarizing web videos using titles
US10909459B2 (en) Content embedding using deep metric learning algorithms
Tu et al. Joint video and text parsing for understanding events and answering queries
Kong et al. Interactive phrases: Semantic descriptionsfor human interaction recognition
EP2823410B1 (en) Entity augmentation service from latent relational data
Zhang et al. Social image tagging using graph-based reinforcement on multi-type interrelated objects
Rastegar et al. Mdl-cw: A multimodal deep learning framework with cross weights
CN107209762A (en) Visual interactive formula is searched for
GB2544853A (en) Structured knowledge modeling and extraction from images
US10755128B2 (en) Scene and user-input context aided visual search
Guadarrama et al. Understanding object descriptions in robotics by open-vocabulary object retrieval and detection
Kumar et al. Effective information retrieval and feature minimization technique for semantic web data
Chen et al. Probabilistic semantic retrieval for surveillance videos with activity graphs
AU2016225819B2 (en) Structured knowledge modeling and extraction from images
Tran A survey of machine learning and data mining techniques used in multimedia systems
US9881023B2 (en) Retrieving/storing images associated with events
Tellex et al. Grounding spatial language for video search
Gilbert et al. Image and video mining through online learning
Rangel et al. Lextomap: lexical-based topological mapping
Situ et al. Cross-modal event retrieval: A dataset and a baseline using deep semantic learning
Jia et al. Semantically-enhanced kernel canonical correlation analysis: a multi-label cross-modal retrieval
GB2544379B (en) Structured knowledge modeling, extraction and localization from images
Cho et al. Recognizing human–human interaction activities using visual and textual information

Legal Events

Date Code Title Description
HB Alteration of name in register

Owner name: ADOBE INC.

Free format text: FORMER NAME(S): ADOBE SYSTEMS INCORPORATED

FGA Letters patent sealed or granted (standard patent)