US20190325342A1 - Embedding multimodal content in a common non-euclidean geometric space - Google Patents

Embedding multimodal content in a common non-euclidean geometric space Download PDF

Info

Publication number
US20190325342A1
Authority
US
United States
Prior art keywords
content
modality
multimodal
feature vector
embedded
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
US16/383,429
Inventor
Karan Sikka
Ajay Divakaran
Julia Kruk
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SRI International Inc
Original Assignee
SRI International Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SRI International Inc filed Critical SRI International Inc
Priority to US16/383,429 priority Critical patent/US20190325342A1/en
Assigned to SRI INTERNATIONAL reassignment SRI INTERNATIONAL ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: KRUK, JULIA, DIVAKARAN, AJAY, SIKKA, KARAN
Publication of US20190325342A1 publication Critical patent/US20190325342A1/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217Validation; Performance evaluation; Active pattern learning techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/2431Multiple classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/002D [Two Dimensional] image generation
    • G06T11/20Drawing from basic elements, e.g. lines or circles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/22Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/255Detecting or recognising potential candidate objects based on visual cues, e.g. shapes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/191Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19173Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/26Techniques for post-processing, e.g. correcting the recognition result
    • G06V30/262Techniques for post-processing, e.g. correcting the recognition result using context analysis, e.g. lexical, syntactic or semantic context
    • G06V30/274Syntactic or semantic context, e.g. balancing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00Computing arrangements based on specific mathematical models
    • G06N7/01Probabilistic graphical models, e.g. probabilistic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2210/00Indexing scheme for image generation or computer graphics
    • G06T2210/12Bounding box
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/759Region-based matching

Definitions

  • Machine learning relies on models and inference to perform tasks in a computing environment without having explicit instructions.
  • a mathematical model of sample data is constructed using training data to make predictions or choices based on the learned data.
  • the machine learning may be supervised by using training data composed of both input data and output data or may be unsupervised by using training data with only input data. Since machine learning uses computers that operate by interpreting and manipulating numbers, the training data is typically numerical in nature or transformed into numerical values. The numerical values allow the mathematical model to learn the input data. Input or output information that is not in a numerical form may be first transformed into a numerical representation so that it can be processed through machine learning.
  • the purpose of using machine learning is to infer additional information given a set of data.
  • inferred data becomes more difficult to model due to the complexity of the required mathematical model or the incompatibility of the model with a given form of input information.
  • Embodiments of the present principles generally relate to embedding multimodal content into a common geometric space.
  • a method of creating a semantic embedding space for multimodal content for improved recognition of at least one of content, content-related information and events may comprise for each of a plurality of content of the multimodal content, creating a respective, first modality feature vector representative of content of the multimodal content having a first modality using a first machine learning model; for each of a plurality of content of the multimodal content, creating a respective, second modality feature vector representative of content of the multimodal content having a second modality using a second machine learning model; and semantically embedding the respective, first modality feature vectors and the respective, second modality feature vectors in a common geometric space that provides logarithm-like warping of distance space in the geometric space to capture hierarchical relationships between seemingly disparate, embedded modality feature vectors of content in the geometric space; wherein embedded modality feature vectors that are related, across modalities, are closer together in the geometric space than unrelated modality feature vectors.
  • the method may further comprise for each of a plurality of first modality feature vector and second modality feature vector multimodal content pairs, forming a combined multimodal feature vector from the first modality feature vector and the second modality feature vector and semantically embedding the respective, combined multimodal feature vectors in the common geometric space to capture relationships between at least two of the embedded, first modality feature vectors, the embedded, second modality feature vectors and the embedded combined multimodal feature vectors; semantically embedding content-related information, including at least one of user information and user grouping information, in the common geometric space based upon a relationship between the content-related information and at least one embedded, first modality feature vector, one embedded, second modality feature vector and one embedded combined multimodal feature vector; projecting at least one of content, content-related information, and an event into the geometric space and determining at least one embedded feature vector in the geometric space close to the projection as being related to the projected at least one of the content, the content-related information, and the event; wherein a second modality feature vector representative of content of the multimodal content having a second modality is created using information relating to respective content having a first modality.
  • an apparatus for creating a semantic embedding space for multimodal content for improved recognition of at least one of content, content-related information and events may comprise a processor; and a memory coupled to the processor, the memory having stored therein at least one of programs or instructions executable by the processor to configure the apparatus to: for each of a plurality of content of the multimodal content, create a respective, first modality feature vector representative of content of the multimodal content having a first modality using a first machine learning model; for each of a plurality of content of the multimodal content, create a respective, second modality feature vector representative of content of the multimodal content having a second modality using a second machine learning model; and semantically embed the respective, first modality feature vectors and the respective, second modality feature vectors in a common geometric space that provides logarithm-like warping of distance space in the geometric space to capture hierarchical relationships between seemingly disparate, embedded modality feature vectors of content in the geometric space; wherein embedded modality feature vectors that are related, across modalities, are closer together in the geometric space than unrelated modality feature vectors.
  • the apparatus may further comprise wherein the apparatus is further configured to: for each of a plurality of first modality feature vector and second modality feature vector multimodal content pairs, form a combined multimodal feature vector from the first modality feature vector and the second modality feature vector and semantically embed the respective, combined multimodal feature vectors in the common geometric space to capture relationships between at least two of the embedded, first modality feature vectors, the embedded, second modality feature vectors and the embedded combined multimodal feature vectors; wherein the apparatus is further configured to: semantically embed content-related information, including at least one of user information and user grouping information, in the common geometric space based upon a relationship between the content-related information and at least one embedded feature vector; wherein the apparatus is further configured to: project at least one of content, content-related information, and an event into the geometric space and determine at least one embedded feature vector in the geometric space close to the projection as being related to the projected at least one of the content, the content-related information, and the event.
  • a non-transitory computer-readable medium having stored thereon at least one program, the at least one program including instructions which, when executed by a processor, cause the processor to perform a method for creating a semantic embedding space for multimodal content for improved recognition of at least one of content, content-related information and events may comprise for each of a plurality of content of the multimodal content, creating a respective, first modality feature vector representative of content of the multimodal content having a first modality using a first machine learning model; for each of a plurality of content of the multimodal content, creating a respective, second modality feature vector representative of content of the multimodal content having a second modality using a second machine learning model and semantically embedding the respective, first modality feature vectors and the respective, second modality feature vectors in a common geometric space that provides logarithm-like warping of distance space in the geometric space to capture hierarchical relationships between seemingly disparate, embedded modality feature vectors of content in the geometric space; wherein embedded modality feature vectors that are related, across modalities, are closer together in the geometric space than unrelated modality feature vectors.
  • the non-transitory computer-readable medium may further include wherein the processor further, for each of a plurality of first modality feature vector and second modality feature vector multimodal content pairs, forms a combined multimodal feature vector from the first modality feature vector and the second modality feature vector and semantically embeds the respective, combined multimodal feature vectors in the common geometric space to capture relationships between at least two of the embedded, first modality feature vectors, the embedded, second modality feature vectors and the embedded combined multimodal feature vectors; wherein the processor further semantically embeds content-related information, including at least one of user information and user grouping information, in the common geometric space based upon a relationship between the content-related information and at least one embedded feature vector; and/or wherein the processor further: projects at least one of content, content-related information, and an event into the geometric space and determines at least one embedded feature vector in the geometric space close to the projection as being related to the projected at least one of the content, the content-related information, and the event.
  • FIG. 1A depicts embeddings in a Euclidean space in accordance with an embodiment of the present principles.
  • FIG. 1B depicts embeddings in a non-Euclidean space in accordance with an embodiment of the present principles.
  • FIG. 2 depicts a graphical representation of hierarchy information for images that is preserved in non-Euclidean space in accordance with an embodiment of the present principles.
  • FIG. 3 is a method for embedding multimodal content in a common non-Euclidean geometric space according to an embodiment of the present principles.
  • FIG. 4A illustrates that a Euclidean space does not inherently preserve hierarchies in accordance with an embodiment of the present principles.
  • FIG. 4B illustrates that a non-Euclidean space inherently preserves the hierarchies in accordance with an embodiment of the present principles.
  • FIG. 5 depicts a non-Euclidean embedding process in accordance with an embodiment of the present principles.
  • FIG. 6 shows examples of non-Euclidean embedding spaces in accordance with an embodiment of the present principles.
  • FIG. 7 is a graph illustrating results of non-Euclidean embedding versus Euclidean embedding in accordance with an embodiment of the present principles.
  • FIG. 8 is a method of embedding multimodal content and agent information from social media in a common non-Euclidean geometric space in accordance with an embodiment of the present principles.
  • FIG. 9 shows how a standard loss is extended by adding a ranking loss term for cluster center vectors along with a clustering loss in accordance with an embodiment of the present principles.
  • FIG. 10 depicts a deep learning framework in accordance with an embodiment of the present principles.
  • FIG. 11 is an example of three users and the images pinned by the users in accordance with an embodiment of the present principles.
  • FIG. 12 depicts a high level block diagram of a computing device in which a multimodal content embedding system can be implemented in accordance with an embodiment of the present principles.
  • Embodiments of the present principles generally relate to methods, apparatuses, and systems for embedding multimodal content into a common geometric space. While the concepts of the present principles are susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and are described in detail below. It should be understood that there is no intent to limit the concepts of the present principles to the particular forms disclosed. On the contrary, the intent is to cover all modifications, equivalents, and alternatives consistent with the present principles and the appended claims. For example, although embodiments of the present principles will be described primarily with respect to visual concepts, such teachings should not be considered limiting.
  • Machine learning may be expanded to accept inputs that are not numerically based. For example, if an input is the word ‘apple,’ the word may be transformed into a series of numbers such as ⁇ 3, 2, 1 ⁇ where a ‘3’ represents a color red, a ‘2’ represents a round shape, and a ‘1’ represents a fruit. In this manner, non-numerical information may be converted into a numerical representation that can be processed by a mathematical model.
  • Word2vec is a machine learning process/model that produces word embedding vectors, associating each word with numbers that capture a numerical essence of the word. Word2vec produces word embeddings (arrays of numbers) in which words with similar meanings or contexts are physically close to each other in the embedded space.
  • Euclidean distance is the number of graph units between two graphed points or words (i.e., the similarity of words is based on the physical closeness of their graphed points).
  • the distance between graphed words can be described as vectors or a distance with a direction. For example, a vector representing “royal” can be added to a vector representing “man” which yields the result “king.”
  • Moving from one graphed word to another graphed word in space allows one to represent/graph the idea of word relationships which are hard coded “word vectors.” By increasing the number of dimensions, more information may be obtained, but the increased number of dimensions makes it very complex for humans to comprehend/visualize.
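  • As an illustrative sketch (not part of the original disclosure), the word-relationship arithmetic described above can be reproduced with the gensim library; the vector file name below is a placeholder assumption.

```python
# Hypothetical sketch: exploring word2vec-style relationships with gensim.
# The vector file name is an assumption; any pretrained word2vec vectors work.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("word2vec-vectors.bin", binary=True)

# Adding a vector for "royal" to a vector for "man" lands near "king".
print(vectors.most_similar(positive=["royal", "man"], topn=3))

# Physical closeness in the embedded space reflects similarity of meaning/context.
print(vectors.similarity("apple", "fruit"))
```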
  • Word2vec is limited to a single modality (text) for its input and output.
  • the inventors have learned that valuable additional information may be obtained when the content being processed is expanded to include multiple modalities such as text, images, video, audio, etc.
  • the inventors have discovered a way to better understand, exploit, and organize multimodal content in order to extract semantics and to make high-level sense of the multimodal content. This includes extracting short and long-term events from the multimodal content even from unorganized, but structured content, such as social media content.
  • the embodiments of the present principles described herein allow data in multiple modalities to provide rich information that enables interpolation of information, supplies complementary information to fill gaps in individual modalities, and provides performance benefits.
  • the multimodal embeddings provided by the present principles exploit correlations in different modalities and provide explicit vectorial representations for data points without relying on fixed architectures such as late fusion techniques.
  • Non-Euclidean embeddings provide more relevant information than Euclidean embeddings such as word2vec.
  • Real-world data has inherent semantic hierarchies such as plants-flowers-roses or animals-land animals-mammals-horses.
  • In Euclidean multimodal embeddings, unfortunately, such hierarchies are lost because Euclidean mappings are not able to capture distances that grow exponentially in a compact manner.
  • Euclidean space is a flat space that doesn't capture intrinsic (hierarchical) structures.
  • semantics extraction based on Euclidean embeddings tends to be insightful in a small neighborhood of the point of interest but fails to capture the aforementioned inherent semantic hierarchies.
  • the classes in Euclidean space are more tightly packed and are scattered throughout the embedding.
  • a non-Euclidean space such as, for example, a Poincaré space allows distinct classes to form in broader categories such as “plants” and “land animals.”
  • the hierarchical structure of the data is retained in the structure of the embedding.
  • the inventors have found that there is a need to devise embedding methods that retain such semantic hierarchies.
  • Non-Euclidean embeddings such as, for example, hyperbolic embeddings provide a way to capture distances that grow exponentially through a logarithm-like warping of distance space.
  • hyperbolic spaces have tree-like properties that naturally indicate hierarchies and learning can be easily integrated into gradient based optimization.
  • the top of the hierarchy such as plant in plant-flowers-roses, is brought much closer to the bottom of the hierarchy.
  • the hyperbolic embeddings provide a way to capture real-world semantic hierarchies while retaining all of the advantages of unsupervised continuous representation that Euclidean embeddings provide.
  • Hyperbolic embeddings do capture real-world semantic hierarchies, even when realized through a simple approximation that is not completely hyperbolic, and match or exceed the state of the art in retrieval accuracy.
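  • A minimal sketch of the Poincaré (hyperbolic) distance illustrates the logarithm-like warping referenced above: comparable Euclidean steps correspond to rapidly growing hyperbolic distances near the boundary of the unit ball. The specific points below are illustrative only.

```python
# Minimal sketch of the Poincare (hyperbolic) distance between points in the
# open unit ball; distances blow up near the boundary, leaving room to encode
# exponentially growing hierarchies compactly.
import numpy as np

def poincare_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Geodesic distance between two points inside the open unit ball."""
    sq_diff = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return float(np.arccosh(1.0 + 2.0 * sq_diff / denom))

origin = np.zeros(2)
mid = np.array([0.5, 0.0])
near_boundary = np.array([0.95, 0.0])

# Comparable Euclidean steps cover very different hyperbolic distances.
print(poincare_distance(origin, mid))         # moderate
print(poincare_distance(mid, near_boundary))  # much larger
```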
  • the hierarchies provided by non-Euclidean embeddings may be used to contract and/or expand a class (e.g., user community) as well. This is particularly useful during clustering (see below for clustering processes).
  • the hierarchy may be traversed upwards or downwards (e.g., from a community of users to a single user or from a community of users to a select group of users, etc.). Similarly, the hierarchy may be leveraged to move from a single user or select group of users to find a more general group of users or a community of users.
  • attributes may be passed up or down a hierarchy. For example, a user living in a particular state such as, for example, Maine, also lives in the United States. Other attributes may only pass in one direction as, for example, some users may live on the coast of Maine while other users in Maine may not. Other attributes may be metric in nature and others may be continuously variable such as, for example, incomes.
  • FIG. 3 is a method 300 for embedding multimodal content in a common non-Euclidean geometric space according to some embodiments of the present principles.
  • the following examples are based on content with two modalities—text and images.
  • the concepts may be expanded to include any number of modalities including, for example, video, audio, etc.
  • a first machine learning model is trained using content relating to a first modality.
  • the first modality may be, for example, text.
  • the model performance improves as the data set size used for training increases.
  • a first modality feature vector is created using the first machine learning model from an input that has multimodal content.
  • the first modality feature vector represents a first modality feature of the multimodal content.
  • for example, if the multimodal content is an image with a caption, the first modality feature vector may represent the first modality (text) of the multimodal content (the caption portion).
  • a pre-existing single modality model such as, for example, word2vec (Euclidean space) may be used to provide the first feature vector for text modalities.
  • performance may be increased by retraining the word2vec with vectors from a non-Euclidean space.
  • a second machine learning model is trained using content relating to a second modality.
  • the second machine learning model may be trained using images (visual representations—photos, drawings, paintings, etc.) as the content of the second modality.
  • the model performance improves as the data set size used for training increases.
  • a second modality feature vector is created using the second machine learning model from the input that has multimodal content.
  • the second modality feature vector represents a second modality feature of the multimodal content. For example, if the multimodal content is an image with a caption, the second modality feature vector may represent a second modality (image) of the multimodal content.
  • a deep learning neural network is used to create the second modality feature vector.
  • the first modality feature vector of the multimodal content and the second modality feature vector of the multimodal content are semantically embedded in a non-Euclidean geometric space, ending the flow 312 .
  • the mapping of the first feature vector and the second feature vector in a common geometric space allows the inventors to exploit additional meanings obtained from, for example, an image and text that would not be obtainable from the text alone or from the image alone.
  • content, content-related information, and/or an event are projected into the common geometric space. An embedded feature vector in the common geometric space close to the projection is then determined as being related to the projected content, content-related information, and/or event.
  • a multimodal feature vector based on the first modality feature vector and the second modality feature vector is created that represents both the first modality feature of the multimodal content and the second modality feature of the multimodal content.
  • the multimodal feature vector of the multimodal content is embedded in (mapped to) a non-Euclidean geometric space.
  • the embedded multimodal feature vector represents both the image and text in a singular notion that allows the inventors to exploit additional meanings obtained from the combination of image and text that would not be obtainable from the text alone or from the image alone.
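  • The following hedged sketch, which is not the exact architecture of the disclosure, shows one simple way a text feature vector and an image feature vector could be fused into a single multimodal vector and constrained to the open unit ball so it can live in a Poincaré-style embedding space; the layer sizes are assumptions.

```python
# Hedged sketch (not the disclosure's exact architecture): fuse a text feature
# vector and an image feature vector into one multimodal vector and map it
# into the open unit ball for a Poincare-style embedding space.
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    def __init__(self, text_dim=300, image_dim=4096, embed_dim=300):
        super().__init__()
        self.project = nn.Sequential(
            nn.Linear(text_dim + image_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, text_vec, image_vec, eps=1e-5):
        fused = self.project(torch.cat([text_vec, image_vec], dim=-1))
        # Constrain the output to the open unit ball (norm < 1).
        norm = fused.norm(dim=-1, keepdim=True).clamp_min(eps)
        scale = torch.clamp(norm, max=1.0 - eps) / norm
        return fused * scale

fusion = MultimodalFusion()
text_vec = torch.randn(2, 300)    # e.g., caption features
image_vec = torch.randn(2, 4096)  # e.g., CNN image features
print(fusion(text_vec, image_vec).norm(dim=-1))  # all norms < 1
```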
  • the non-Euclidean space retains hierarchies of the different modality features from the multimodal content.
  • the non-Euclidean common geometric space is an embedding space that represents meaning multiplication of the content from the different modalities while preserving hierarchies of the multimodal content.
  • Meaning multiplication denotes that, for example, when someone posts a meme on social media, the meaning is a combination of the image and its caption—the combination yielding meaning greater than the meaning of the image alone or the meaning of the caption alone. Meaning multiplication is discussed further in the examples that follow.
  • the images and text may be embedded using:
  • the Poincaré ball is a realization of hyperbolic space (open d dimensional unit ball) and, in an embodiment, the Poincaré ball can be used to model the hyperbolic embedding space.
  • the image embedding layer is trained in the n-dimensional Poincaré ball.
  • the structure of the loss function and the gradient descent are altered to create a linear projection layer that constrains embedding vectors to the manifold.
  • a pre-trained word2vec model may be used and the results can be projected to the manifold via a few non-linear layers.
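  • Two ingredients commonly used when optimizing embeddings in the Poincaré ball are sketched below: a projection (retraction) that keeps vectors inside the open unit ball, and the Riemannian rescaling of Euclidean gradients. This follows the usual Poincaré-embedding recipe and is not necessarily the exact alteration of the loss and gradient descent described above.

```python
# Hedged sketch of the usual Poincare-ball optimization recipe; this mirrors
# common practice rather than the filing's exact alterations.
import torch

EPS = 1e-5

def project_to_ball(x: torch.Tensor) -> torch.Tensor:
    """Retract points back inside the open unit ball after a gradient step."""
    norm = x.norm(dim=-1, keepdim=True).clamp_min(EPS)
    factor = torch.clamp(norm, max=1.0 - EPS) / norm
    return x * factor

def riemannian_grad(x: torch.Tensor, euclidean_grad: torch.Tensor) -> torch.Tensor:
    """Rescale the Euclidean gradient by the inverse of the Poincare metric."""
    scale = ((1.0 - x.pow(2).sum(dim=-1, keepdim=True)) ** 2) / 4.0
    return scale * euclidean_grad

# One manual "Riemannian SGD" step on a batch of embedding vectors.
emb = torch.randn(8, 300) * 0.1
grad = torch.randn(8, 300)
lr = 0.01
emb = project_to_ball(emb - lr * riemannian_grad(emb, grad))
```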
  • a database with images having semantic tags may be used. Keywords from image captions may then be extracted and used as labels for training the images.
  • a ranking loss algorithm can be adjusted to push similar images and tags (words) together and vice-versa.
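  • A hedged sketch of such an adjusted ranking loss is shown below: matching image/tag pairs are pulled together and mismatched pairs are pushed apart under the hyperbolic distance. The margin value is an assumption for illustration.

```python
# Hedged sketch of a ranking loss that pulls an image embedding toward a
# matching tag embedding and pushes it away from a mismatched one, using the
# hyperbolic (Poincare) distance. The margin value is illustrative.
import torch

def poincare_dist(u, v, eps=1e-5):
    sq = ((u - v) ** 2).sum(dim=-1)
    den = (1 - (u ** 2).sum(dim=-1)).clamp_min(eps) * (1 - (v ** 2).sum(dim=-1)).clamp_min(eps)
    return torch.acosh(1 + 2 * sq / den)

def tag_ranking_loss(image_emb, pos_tag_emb, neg_tag_emb, margin=0.2):
    pos = poincare_dist(image_emb, pos_tag_emb)
    neg = poincare_dist(image_emb, neg_tag_emb)
    return torch.clamp(margin + pos - neg, min=0.0).mean()

img = torch.rand(4, 300) * 0.01
good_tag = torch.rand(4, 300) * 0.01
bad_tag = torch.rand(4, 300) * 0.01
print(tag_ranking_loss(img, good_tag, bad_tag))
```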
  • the mean average precision (MAP) may be output for evaluating the training results.
  • the inventors found that during an evaluation test, the MAP of the Poincaré embedding space rapidly reaches that of a Euclidean embedding space (based on number of iterations) and oftentimes exceeded the Euclidean embedding space as shown in graph 700 of FIG. 7 .
  • the inventors also found that using a pre-trained word2vec was not optimal since the vectors preserve Euclidean structure. In some embodiments, higher performance may be obtained by pre-training the word2vec model with hyperbolic embeddings.
  • FIG. 8 is a method 800 of embedding multimodal content and agent information from social media in a common non-Euclidean geometric space according to an embodiment of the present principles.
  • an agent may be a user that is a human being and/or a machine derived entity (e.g., a bot) and the like.
  • agent and user may be used interchangeably.
  • extracting information from agent postings can prove invaluable. In some embodiments, this is achieved by embedding agent information along with multimodal content in a non-Euclidean geometric space.
  • a social media posting of multimodal content and information relating to the agent (user, bot, etc.) who posted the posting on social media is obtained.
  • postings from Instagram or other social media sites that include multimodal content postings can be used.
  • the agent who did the posting is also readily obtainable (e.g., user avatar, user tagging, etc.).
  • a first machine learning model is trained with content relating to a first modality that may be found in the social media posting (e.g., caption with text, etc.).
  • a first modality feature vector is created from the social media posting that represents a first modality feature of the social media posting using the first machine learning model.
  • the first modality feature vector may represent the words from a caption posted with an image.
  • a second machine learning model is trained with content relating to a second modality that may be found in the social media posting (e.g., an image, etc.).
  • a second modality feature vector is created from the social media posting that represents a second modality feature of the social media posting using the second machine learning model.
  • the second modality feature vector may represent a photo from the social media posting.
  • the method 800 may be used to determine feature vectors for any number of modalities exhibited by the content of the social media posting (e.g., audio, video etc.).
  • the first modality feature vector of the social media posting, the second modality feature vector of the social media posting, and the posting agent information are then semantically embedded in (mapped to) a non-Euclidean geometric space, ending the flow 814.
  • a multimodal feature vector is created based on the first modality feature vector, the second modality feature vector, and the posting agent information; the multimodal feature vector represents the first modality feature of the social media posting, the second modality feature of the social media posting, and the posting agent information.
  • the posting agent information may also include cluster information (e.g., agents/users belonging to a group or distinguishable as a group, etc.).
  • the posting agent information may not be combined into the multimodal feature vector but may be appended as an attribute to the multimodal feature vector.
  • the multimodal feature vector of the social media posting is then embedded in (mapped to) a non-Euclidean geometric space.
  • the inventors have found that by using a model that jointly learns agent and content embedding, additional information can be extracted with regard to the original poster of the content and/or other agents who appear nearby in the embedding space.
  • the model may also be adjusted such that agents are clustered based on their posted and/or associated content.
  • the image-text joint embedding framework is leveraged to create content to user embeddings.
  • Each training example is a tuple of an image and a list of users who posted the image.
  • the objective is to minimize the distance between the image embedding and the user embeddings by altering the ranking loss algorithm.
  • the standard loss 902 is extended by adding a ranking loss term 904 for the cluster center vectors 906 along with a clustering loss 908 .
  • the problem of content recommendation for users in a social network is an important one and is central to fields such as social media marketing.
  • content-centric platforms are solely interest-based, allowing users to easily collect and group content based on their interest.
  • An embedding framework is developed in which every user, described by a set of images, and the images themselves are mapped to a common vector space that ideally captures interest-based similarity between users and images. User embeddings are closer to image embeddings with similar interests and vice-versa. As illustrated in FIG. 10, a deep learning framework 1000 is used to learn these embeddings. Each user is described by a set of images. Deep learning is used to learn a vector for each such user and a mapping from the image space to the interest space such that the user and the images with similar interests are closer together in the embedding space while dissimilar users and images are farther away.
  • the dataset used is a combination of users and images.
  • the following information is available: for each user U_i, there is a given set of images I_i that is posted/pinned by the user, and for each image I_j, there is a given set of users U_j that have posted/pinned this particular image. For many images, there can be multiple users associated with the image and, thus, the image sets of these users will have some overlap.
  • the goal is, for each user U_i and each image I_j, to obtain vectors u_i ∈ ℝ^D and v_j ∈ ℝ^D, respectively, where D is the embedding dimension, set to 300 in this example.
  • the goal is to provide an embedding space such that user and image embeddings that represent similar interests are closer together in the embedding space and dissimilar users and images are farther away. These embeddings can then be utilized for retrieval of similar users, image recommendation, for grouping of similar users, etc.
  • the user category information, although available in the dataset, is not employed while training the networks. It is employed only for evaluating the models.
  • an embedding model referred to as DeViSE is used (see A. Frome, G. Corrado, and J. Shlens, “DeViSE: A deep visual-semantic embedding model,” Advances in Neural Information Processing Systems, pp. 1-11, 2013).
  • an image embedding is trained that maps every image to a word embedding space. This may be achieved using a deep convolutional neural network (DCNN) such as, for example, a VGG-16 (Visual Geometry Group) network, attaching a fully connected layer to transform, for example, a 4096-dimensional image feature to a 300 dimensional vector.
  • This final image embedding layer is trained from scratch while the rest of the network is fine-tuned using a model trained for ImageNet classification.
  • Let the weight matrix for the image embedding layer be denoted by w_I.
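  • A hedged PyTorch/torchvision sketch of this image branch is shown below: a VGG-16 backbone truncated at its 4096-dimensional fully connected layer, followed by a 300-dimensional embedding layer playing the role of w_I. Weight loading and fine-tuning details are assumptions.

```python
# Hedged sketch of the image branch: VGG-16 up to its 4096-d feature, then one
# fully connected layer mapping to a 300-d embedding (the w_I layer above).
import torch
import torch.nn as nn
import torchvision.models as models

class ImageEmbedder(nn.Module):
    def __init__(self, embed_dim=300):
        super().__init__()
        vgg = models.vgg16(weights=None)  # pretrained ImageNet weights could be loaded instead
        # Keep VGG-16 up to its penultimate 4096-d fully connected layer.
        vgg.classifier = nn.Sequential(*list(vgg.classifier.children())[:-1])
        self.backbone = vgg
        self.w_I = nn.Linear(4096, embed_dim)  # embedding layer trained from scratch

    def forward(self, images):
        return self.w_I(self.backbone(images))

model = ImageEmbedder()
print(model(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 300])
```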
  • a dataset such as Microsoft's COCO which contains multiple captions for each image, may be used.
  • the embedding space dimension is set to 300.
  • the word embeddings are not learned but are initialized using GloVe (see J. Pennington, R. Socher, and C. D. Manning, “GloVe: Global Vectors for Word Representation,” EMNLP, 2014). As per DeViSE, ranking loss is employed as the loss function.
  • the embedding vector for each user U_i is obtained as the arithmetic average of the image embedding vectors of the images in I_i. This serves as a strong baseline for this example.
  • the method learns the user embeddings and the image embeddings jointly in a single neural network.
  • the architecture of this network is similar to that of DeViSE and, thus, allows for fair comparison with the baseline.
  • instead of the embedding layer for words, there is an embedding layer for users. That is, initially each user is represented as a one-hot vector and the embedding layer is a fully connected layer, represented by a matrix w_U ∈ ℝ^(N_U × D), that converts this one-hot vector into the desired user embedding. Since the input to this layer is a one-hot vector, the user embedding for user U_i is simply the i-th column of w_U.
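  • Because multiplying a one-hot vector by w_U is equivalent to a table lookup, the user branch can be written as an embedding layer, as in the hedged sketch below (the number of users is illustrative).

```python
# Hedged sketch: a one-hot vector followed by a fully connected layer is
# equivalent to an embedding lookup, so the user branch can be an nn.Embedding
# table of shape (num_users, D). num_users here is illustrative.
import torch
import torch.nn as nn

num_users, D = 10000, 300
user_embedding = nn.Embedding(num_users, D)   # rows play the role of w_U

user_ids = torch.tensor([3, 17, 3])           # users in a mini-batch
print(user_embedding(user_ids).shape)         # torch.Size([3, 300])
```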
  • the ranking loss function (Eq. 3) is minimized similar to DeViSE:
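  • The equation itself is rendered as an image in this text version; a hedged reconstruction of the DeViSE-style ranking loss over users, with the margin symbol γ and the dot-product similarity as assumptions, is:

```latex
% Hedged reconstruction of Eq. 3 (the filing renders the equation as an image);
% the margin \gamma and the dot-product similarity are assumptions.
L_{\text{rank-users}}(I_j) \;=\; \sum_{k \in \mathrm{pos}(I_j)} \; \sum_{i \in \mathrm{neg}(I_j)}
\max\!\bigl(0,\; \gamma - u_k^{\top} v_j + u_i^{\top} v_j\bigr)
```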
  • the negative users indexed by i are all the users in the mini-batch who have not pinned I_j and the positive users indexed by k are all the users who have pinned I_j.
  • a reconstruction loss is also employed; it has been shown to be useful in the case of visual-semantic embeddings in that it makes them more resilient to domain transfer.
  • another fully connected layer is added that takes as input the image embedding; the desired output of the layer is the output image feature of the VGG-16 network, denoted by v_N.
  • the reconstruction loss (Eq. 4) for a single image is given by:
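  • The equation is likewise rendered as an image in this text version; a hedged reconstruction, where f_recon denotes the added fully connected layer and the squared-error form is an assumption, is:

```latex
% Hedged reconstruction of Eq. 4; the name f_recon and the squared-error form
% are assumptions.
L_{\text{recon}}(I_j) \;=\; \bigl\lVert\, v_N - f_{\text{recon}}(v_j) \,\bigr\rVert_2^{2}
```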
  • the loss is computed for a given image embedding vector. Another loss in terms of a given user embedding can also be computed.
  • the final loss function (Eq. 5) used is:
  • L = L_rank-users + λ*L_recon (Eq. 5)
  • some embodiments have a modification which allows learning of a clustering of users in addition to the user embeddings inside the same network. Learning the clusters jointly allows for a better and automatic sharing of information about similar images between user embeddings than what is available explicitly in the dataset.
  • an additional matrix w_C ∈ ℝ^(C × D) is maintained, where C is the number of clusters, a hyperparameter.
  • the loss function proposed previously is further modified. Two other terms are added into the loss function (Eq. 6) given by:
  • L_rank-clusters(·) is the cluster-center analogue of L_rank-users. In effect, it tries to push the image features closer to the right clusters and farther away from the wrong clusters.
  • L_K-means is similar to the K-means loss function, which is used to ensure that the cluster centers are indeed representative of the users assigned to that cluster. Since nearest neighbor computation is not a differentiable operation, the cluster assignments cannot be found inside the network. As an approximation, the cluster assignments are recomputed only once every 1000 iterations, and the user embeddings and the cluster centers are optimized with the fixed cluster assignments.
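  • Because Eq. 6 is rendered as an image in this text version, the following is a hedged sketch of how the combined objective and the periodic cluster reassignment could be organized; the term weights and the choice of negative cluster are assumptions for illustration.

```python
# Hedged sketch of a joint objective with user clusters; the weights lam/alpha/
# beta and the exact term forms are assumptions. Cluster assignments are only
# refreshed periodically because nearest-neighbour assignment is not
# differentiable.
import torch

def rank_loss(anchors, positives, negatives, margin=0.1):
    pos = (anchors * positives).sum(dim=-1)
    neg = (anchors * negatives).sum(dim=-1)
    return torch.clamp(margin - pos + neg, min=0.0).mean()

def kmeans_loss(user_emb, cluster_centers, assignments):
    return ((user_emb - cluster_centers[assignments]) ** 2).sum(dim=-1).mean()

def total_loss(img_emb, pos_users, neg_users, recon, vgg_feat,
               cluster_centers, assignments, lam=1.0, alpha=1.0, beta=1.0):
    l_rank_users = rank_loss(img_emb, pos_users, neg_users)
    l_recon = ((recon - vgg_feat) ** 2).sum(dim=-1).mean()
    # Push images toward the assigned cluster center and away from a wrong one
    # (the "next" cluster is used as an illustrative negative).
    wrong = cluster_centers[(assignments + 1) % len(cluster_centers)]
    l_rank_clusters = rank_loss(img_emb, cluster_centers[assignments], wrong)
    l_kmeans = kmeans_loss(pos_users, cluster_centers, assignments)
    return l_rank_users + lam * l_recon + alpha * l_rank_clusters + beta * l_kmeans

# Every ~1000 iterations the (non-differentiable) assignments are recomputed:
def reassign(user_emb, cluster_centers):
    return torch.cdist(user_emb, cluster_centers).argmin(dim=1)
```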
  • FIG. 11 shows an example of three users 1102 , 1104 , 1106 and the images 1108 , 1110 , 1112 pinned by the users 1102 , 1104 , 1106 , respectively.
  • each user has only one interest.
  • normalized discounted cumulative gain (NDCG) is used to evaluate the retrieval results, where NDCG_k = DCG_k/IDCG_k.
  • IDCG_k is the normalizing factor which corresponds to the best possible retrieval results for a given query.
  • r_i denotes the relevance of the i-th retrieval result, with r_i ∈ {0, 1, 2}, calculated as follows:
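  • A hedged sketch of the NDCG computation described above follows; the 2^r - 1 gain form is a common convention and an assumption here.

```python
# Hedged sketch of NDCG@k: r_i in {0, 1, 2} is the graded relevance of the
# i-th retrieved item and IDCG_k normalizes by the best possible ordering.
import math

def dcg(relevances):
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(relevances))

def ndcg_at_k(relevances, k):
    ideal = sorted(relevances, reverse=True)
    idcg = dcg(ideal[:k])
    return dcg(relevances[:k]) / idcg if idcg > 0 else 0.0

print(ndcg_at_k([2, 0, 1, 2, 0], k=5))
```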
  • cluster homogeneity: a cluster is said to be fully homogeneous if all the elements in that cluster belong to the same class.
  • cluster completeness: a cluster assignment is said to be complete if, for each class, all elements of that class belong to the same cluster.
  • V-measure: the harmonic mean of cluster homogeneity and cluster completeness. This is also equal to the normalized mutual information between the true class labels and the cluster assignments for all the points in the dataset.
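  • These three clustering measures can be computed directly with scikit-learn, as in the hedged usage sketch below (the toy labels are illustrative).

```python
# Hedged usage sketch: scikit-learn computes cluster homogeneity, completeness
# and V-measure from true class labels and predicted cluster assignments.
from sklearn.metrics import homogeneity_completeness_v_measure

true_classes = [0, 0, 1, 1, 2, 2]   # illustrative ground-truth categories
cluster_ids = [1, 1, 0, 0, 0, 2]    # illustrative cluster assignments

h, c, v = homogeneity_completeness_v_measure(true_classes, cluster_ids)
print(f"homogeneity={h:.2f} completeness={c:.2f} v-measure={v:.2f}")
```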
  • Table 1 shows how well the joint embedding and clustering framework performs in terms of the clustering results and how the number of clusters affects the performance of clustering, measured in terms of homogeneity, completeness and V-measure.
  • K-means clustering is performed with random users as the initial cluster centers. As the number of clusters increases, the quality of clusters tends to improve.
  • the baseline approach yields consistently poorer results compared to the proposed frameworks of the present principles.
  • using K-means offline tends to produce better clusters. However, this effect is seen to diminish as the number of clusters is increased to 468 (number of categories) from 32 (number of parent categories).
  • Embodiments in accordance with the teachings can be implemented in hardware, firmware, software, or any combination thereof. Embodiments may also be implemented as instructions stored using one or more machine-readable media, which may be read and executed by one or more processors.
  • a machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computing device or a “virtual machine” running on one or more computing devices).
  • a machine-readable medium may include any suitable form of volatile or non-volatile memory.
  • Modules, data structures, blocks, and the like are referred to as such for ease of discussion, and are not intended to imply that any specific implementation details are required.
  • any of the described modules and/or data structures may be combined or divided into sub-modules, sub-processes or other units of computer code or data as may be required by a particular design or implementation.
  • references herein to rules or templates are not meant to imply any specific implementation details. That is, the multimodal content embedding systems can store rules, templates, etc. in any suitable machine-readable format.
  • In FIG. 12, a simplified high level block diagram of an embodiment of the computing device 1200, in which a multimodal content embedding system can be implemented, is shown. While the computing device 1200 is shown as involving multiple components and devices, it should be understood that in some embodiments, the computing device 1200 can constitute a single computing device (e.g., a mobile electronic device, laptop or desktop computer) alone or in combination with other devices.
  • the illustrative computing device 1200 can be in communication with one or more other computing systems or devices 542 via one or more networks 540.
  • In the embodiment of FIG. 12, a portion 110 A of the multimodal content embedding system can be local to the computing device 510, while another portion 110 B can be distributed across one or more other computing systems or devices 542 that are connected to the network(s) 540.
  • portions of the multimodal content embedding system can be incorporated into other systems or interactive software applications.
  • Such applications or systems can include, for example, operating systems, middleware or framework software, and/or applications software.
  • portions of the multimodal content embedding system can be incorporated into or accessed by other, more generalized system(s) or intelligent assistance applications.
  • the illustrative computing device 1200 of FIG. 12 includes at least one processor 512 (e.g. a microprocessor, microcontroller, digital signal processor, etc.), memory 514 , and an input/output (I/O) subsystem 516 .
  • the computing device 1200 can be embodied as any type of computing device such as a personal computer (e.g., desktop, laptop, tablet, smart phone, body-mounted device, etc.), a server, an enterprise computer system, a network of computers, a combination of computers and other electronic devices, or other electronic devices.
  • the I/O subsystem 516 typically includes, among other things, an I/O controller, a memory controller, and one or more I/O ports.
  • the processor 512 and the I/O subsystem 516 are communicatively coupled to the memory 514 .
  • the memory 514 can be embodied as any type of suitable computer memory device (e.g., volatile memory such as various forms of random access memory).
  • the I/O subsystem 516 is communicatively coupled to a number of hardware components and/or other computing systems including one or more user input devices 518 (e.g., a touchscreen, keyboard, virtual keypad, microphone, etc.), and one or more storage media 520 .
  • the storage media 520 may include one or more hard drives or other suitable data storage devices (e.g., flash memory, memory cards, memory sticks, and/or others).
  • portions of systems software (e.g., an operating system, etc.), framework/middleware (e.g., application-programming interfaces, object libraries, etc.), and the multimodal content embedding system reside at least temporarily in the storage media 520.
  • Portions of systems software, framework/middleware, the multimodal content embedding system can also exist in the memory 514 during operation of the computing device 1200 , for faster processing or other reasons.
  • the one or more network interfaces 532 can communicatively couple the computing device 1200 to a local area network, wide area network, a personal cloud, enterprise cloud, public cloud, and/or the Internet, for example.
  • the network interfaces 532 can include one or more wired or wireless network interface cards or adapters, for example, as may be needed pursuant to the specifications and/or design of the particular computing device 1200 .
  • the other computing device(s) 542 can be embodied as any suitable type of computing device such as any of the aforementioned types of devices or other electronic devices.
  • the other computing devices 542 can include one or more server computers used with the multimodal content embedding system.
  • the computing device 1200 can further optionally include an optical character recognition (OCR) system 528 and an automated speech recognition (ASR) system 530 .
  • each of the foregoing components and/or systems can be integrated with the computing device 1200 or can be a separate component or system that is in communication with the I/O subsystem 516 (e.g., over a network).
  • the computing device 1200 can include other components, subcomponents, and devices not illustrated in FIG. 12 for clarity of the description.
  • the components of the computing device 1200 are communicatively coupled as shown in FIG. 12 by signal paths, which may be embodied as any type of wired or wireless signal paths capable of facilitating communication between the respective devices and components.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)
  • Editing Of Facsimile Originals (AREA)

Abstract

Embedding multimodal content in a common geometric space includes for each of a plurality of content of the multimodal content, creating a respective, first modality feature vector representative of content of the multimodal content having a first modality using a first machine learning model; for each of a plurality of content of the multimodal content, creating a respective, second modality feature vector representative of content of the multimodal content having a second modality using a second machine learning model; and semantically embedding the respective, first modality feature vectors and the respective, second modality feature vectors in a common geometric space that provides logarithm-like warping of distance space in the geometric space to capture hierarchical relationships between seemingly disparate, embedded modality feature vectors of content in the geometric space; wherein embedded modality feature vectors that are related, across modalities, are closer together in the geometric space than unrelated modality feature vectors.

Description

    CROSS-REFERENCE TO RELATED APPLICATIONS
  • This application claims priority to and the benefit of U.S. Provisional Patent Application Ser. No. 62/660,863, filed Apr. 20, 2018, which is incorporated herein by this reference in its entirety.
  • GOVERNMENT RIGHTS
  • This invention was made with Government support under contract number N00014-17-C-1008 awarded by the Office of Naval Research. The Government has certain rights in this invention.
  • BACKGROUND
  • Machine learning relies on models and inference to perform tasks in a computing environment without having explicit instructions. A mathematical model of sample data is constructed using training data to make predictions or choices based on the learned data. The machine learning may be supervised by using training data composed of both input data and output data or may be unsupervised by using training data with only input data. Since machine learning uses computers that operate by interpreting and manipulating numbers, the training data is typically numerical in nature or transformed into numerical values. The numerical values allow the mathematical model to learn the input data. Input or output information that is not in a numerical form may be first transformed into a numerical representation so that it can be processed through machine learning.
  • The purpose of using machine learning is to infer additional information given a set of data. However, as the type of input information becomes more and more diverse, inferred data becomes more difficult to model due to the complexity of the required mathematical model or the incompatibility of the model with a given form of input information.
  • SUMMARY
  • Embodiments of the present principles generally relate to embedding multimodal content into a common geometric space.
  • In some embodiments, a method of creating a semantic embedding space for multimodal content for improved recognition of at least one of content, content-related information and events may comprise for each of a plurality of content of the multimodal content, creating a respective, first modality feature vector representative of content of the multimodal content having a first modality using a first machine learning model; for each of a plurality of content of the multimodal content, creating a respective, second modality feature vector representative of content of the multimodal content having a second modality using a second machine learning model; and semantically embedding the respective, first modality feature vectors and the respective, second modality feature vectors in a common geometric space that provides logarithm-like warping of distance space in the geometric space to capture hierarchical relationships between seemingly disparate, embedded modality feature vectors of content in the geometric space; wherein embedded modality feature vectors that are related, across modalities, are closer together in the geometric space than unrelated modality feature vectors.
  • In some embodiments, the method may further comprise for each of a plurality of first modality feature vector and second modality feature vector multimodal content pairs, forming a combined multimodal feature vector from the first modality feature vector and the second modality feature vector and semantically embedding the respective, combined multimodal feature vectors in the common geometric space to capture relationships between at least two of the embedded, first modality feature vectors, the embedded, second modality feature vectors and the embedded combined multimodal feature vectors; semantically embedding content-related information, including at least one of user information and user grouping information, in the common geometric space based upon a relationship between the content-related information and at least one embedded, first modality feature vector, one embedded, second modality feature vector and one embedded combined multimodal feature vector; projecting at least one of content, content-related information, and an event into the geometric space and determining at least one embedded feature vector in the geometric space close to the projection as being related to the projected at least one of the content, the content-related information, and the event; wherein a, second modality feature vector representative of content of the multimodal content having a second modality is created using information relating to respective content having a first modality; appending content-related information, including at least one of user information and user grouping information, to at least one embedded, first modality feature vector, one embedded, second modality feature vector and one embedded, combined multimodal feature vector; wherein content-related information comprises at least one of agent information or agent grouping information for at least one embedded, first modality feature vector, one embedded, second modality feature vector and one embedded, combined multimodal feature vector; wherein the common geometric space comprises a non-Euclidean space; wherein the non-Euclidean space comprises at least one of a hyperbolic, a Lorentzian, and a Poincaré ball; wherein the multimodal content comprises multimodal content posted by an agent on a social media network; wherein the agent comprises at least one of a robot, a person with a social media account, and a participant in a social media network; and/or inferring information for feature vectors embedded in the common geometric space based on a proximity of the feature vectors to at least one other feature vector embedded in the common geometric space.
  • In some embodiments, an apparatus for creating a semantic embedding space for multimodal content for improved recognition of at least one of content, content-related information and events may comprise a processor; and a memory coupled to the processor, the memory having stored therein at least one of programs or instructions executable by the processor to configure the apparatus to: for each of a plurality of content of the multimodal content, create a respective, first modality feature vector representative of content of the multimodal content having a first modality using a first machine learning model; for each of a plurality of content of the multimodal content, create a respective, second modality feature vector representative of content of the multimodal content having a second modality using a second machine learning model; and semantically embed the respective, first modality feature vectors and the respective, second modality feature vectors in a common geometric space that provides logarithm-like warping of distance space in the geometric space to capture hierarchical relationships between seemingly disparate, embedded modality feature vectors of content in the geometric space; wherein embedded modality feature vectors that are related, across modalities, are closer together in the geometric space than unrelated modality feature vectors.
  • In some embodiments, the apparatus may further comprise wherein the apparatus is further configured to: for each of a plurality of first modality feature vector and second modality feature vector multimodal content pairs, form a combined multimodal feature vector from the first modality feature vector and the second modality feature vector and semantically embed the respective, combined multimodal feature vectors in the common geometric space to capture relationships between at least two of the embedded, first modality feature vectors, the embedded, second modality feature vectors and the embedded combined multimodal feature vectors; wherein the apparatus is further configured to: semantically embed content-related information, including at least one of user information and user grouping information, in the common geometric space based upon a relationship between the content-related information and at least one embedded feature vector; wherein the apparatus is further configured to: project at least one of content, content-related information, and an event into the geometric space and determine at least one embedded feature vector in the geometric space close to the projection as being related to the projected at least one of the content, the content-related information, and the event.
  • In some embodiments, a non-transitory computer-readable medium having stored thereon at least one program, the at least one program including instructions which, when executed by a processor, cause the processor to perform a method for creating a semantic embedding space for multimodal content for improved recognition of at least one of content, content-related information and events may comprise for each of a plurality of content of the multimodal content, creating a respective, first modality feature vector representative of content of the multimodal content having a first modality using a first machine learning model; for each of a plurality of content of the multimodal content, creating a respective, second modality feature vector representative of content of the multimodal content having a second modality using a second machine learning model and semantically embedding the respective, first modality feature vectors and the respective, second modality feature vectors in a common geometric space that provides logarithm-like warping of distance space in the geometric space to capture hierarchical relationships between seemingly disparate, embedded modality feature vectors of content in the geometric space; wherein embedded modality feature vectors that are related, across modalities, are closer together in the geometric space than unrelated modality feature vectors.
  • The non-transitory computer-readable medium may further include wherein the processor further, for each of a plurality of first modality feature vector and second modality feature vector multimodal content pairs, forms a combined multimodal feature vector from the first modality feature vector and the second modality feature vector and semantically embeds the respective, combined multimodal feature vectors in the common geometric space to capture relationships between at least two of the embedded, first modality feature vectors, the embedded, second modality feature vectors and the embedded combined multimodal feature vectors; wherein the processor further semantically embeds content-related information, including at least one of user information and user grouping information, in the common geometric space based upon a relationship between the content-related information and at least one embedded feature vector; and/or wherein the processor further: projects at least one of content, content-related information, and an event into the geometric space and determines at least one embedded feature vector in the geometric space close to the projection as being related to the projected at least one of the content, the content-related information, and the event.
  • Other and further embodiments in accordance with the present principles are described below.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • So that the manner in which the above recited features of the present principles can be understood in detail, a more particular description of the principles, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only typical embodiments in accordance with the present principles and are therefore not to be considered limiting of its scope, for the principles may admit to other equally effective embodiments.
  • FIG. 1A depicts embeddings in a Euclidean space in accordance with an embodiment of the present principles.
  • FIG. 1B depicts embeddings in a non-Euclidean space in accordance with an embodiment of the present principles.
  • FIG. 2 depicts a graphical representation of hierarchy information for images that is preserved in non-Euclidean space in accordance with an embodiment of the present principles.
  • FIG. 3 is a method for embedding multimodal content in a common non-Euclidean geometric space according to an embodiment of the present principles.
  • FIG. 4A illustrates that a Euclidean space does not inherently preserve hierarchies in accordance with an embodiment of the present principles.
  • FIG. 4B illustrates that a non-Euclidean space inherently preserves the hierarchies in accordance with an embodiment of the present principles.
  • FIG. 5 depicts a non-Euclidean embedding process in accordance with an embodiment of the present principles.
  • FIG. 6 shows examples of non-Euclidean embedding spaces in accordance with an embodiment of the present principles.
  • FIG. 7 is a graph illustrating results of non-Euclidean embedding versus Euclidean embedding in accordance with an embodiment of the present principles.
  • FIG. 8 is a method of embedding multimodal content and agent information from social media in a common non-Euclidean geometric space in accordance with an embodiment of the present principles.
  • FIG. 9 shows how a standard loss is extended by adding a ranking loss term for cluster center vectors along with a clustering loss in accordance with an embodiment of the present principles.
  • FIG. 10 depicts a deep learning framework in accordance with an embodiment of the present principles.
  • FIG. 11 is an example of three users and the images pinned by the users in accordance with an embodiment of the present principles.
  • FIG. 12 depicts a high level block diagram of a computing device in which a multimodal content embedding system can be implemented in accordance with an embodiment of the present principles.
  • To facilitate understanding, identical reference numerals have been used, where possible, to designate identical elements that are common to the figures. The figures are not drawn to scale and may be simplified for clarity. It is contemplated that elements and features of one embodiment may be beneficially incorporated in other embodiments without further recitation.
  • DETAILED DESCRIPTION
  • Embodiments of the present principles generally relate to methods, apparatuses, and systems for embedding multimodal content into a common geometric space. While the concepts of the present principles are susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and are described in detail below. It should be understood that there is no intent to limit the concepts of the present principles to the particular forms disclosed. On the contrary, the intent is to cover all modifications, equivalents, and alternatives consistent with the present principles and the appended claims. For example, although embodiments of the present principles will be described primarily with respect to visual concepts, such teachings should not be considered limiting.
  • Machine learning may be expanded to accept inputs that are not numerically based. For example, if an input is the word ‘apple,’ the word may be transformed into a series of numbers such as {3, 2, 1}, where ‘3’ represents the color red, ‘2’ represents a round shape, and ‘1’ represents a fruit. In this manner, non-numerical information may be converted into a numerical representation that can be processed by a mathematical model. Word2vec is a machine learning process/model that produces word embeddings, that is, arrays of numbers capturing the essence of each word, such that words with similar meanings or contexts lie close to each other in the embedded space. Arranging the numbers in arrays allows mathematical operations to be performed on them. For example, adding the vector for “royal” to the vector for “man” combines the essence of the two words to yield a result near “king.” Quantifying words as series of numbers allows machine learning to find a new word similar to the other two based on the numerical properties of each word under the model. The words can then be graphed and compared based on their mathematical properties.
  • Graphing allows mathematical analysis using spatial properties such as Euclidean distance. Euclidean distance is the number of graph units between two graphed points or words; words are judged similar when their graphed points are physically close. The distance between graphed words can be described as a vector, that is, a distance with a direction. For example, a vector representing “royal” can be added to a vector representing “man” to yield the result “king.” Moving from one graphed word to another in space allows one to represent/graph the idea of word relationships as hard-coded “word vectors.” Increasing the number of dimensions captures more information, but it also makes the representation very complex for humans to comprehend/visualize.
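  • The vector-arithmetic idea described above can be illustrated with a minimal NumPy sketch. The three-dimensional vectors below are made up purely for illustration and are not taken from any trained word2vec model; real embeddings are learned and typically have hundreds of dimensions.

```python
import numpy as np

# Toy 3-dimensional "word vectors" (hypothetical values for illustration only).
vectors = {
    "royal": np.array([0.9, 0.1, 0.0]),
    "man":   np.array([0.1, 0.8, 0.1]),
    "king":  np.array([1.0, 0.9, 0.1]),
    "apple": np.array([0.0, 0.1, 0.9]),
}

def cosine(a, b):
    # Cosine similarity: 1.0 means the vectors point in the same direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Vector arithmetic: "royal" + "man" lands closer to "king" than to "apple".
query = vectors["royal"] + vectors["man"]
scores = {w: cosine(query, v) for w, v in vectors.items() if w not in ("royal", "man")}
print(max(scores, key=scores.get))  # -> "king" with these toy values
```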
  • However, Word2vec is limited to a single modality (text) for its input and output. The inventors have learned that valuable additional information may be obtained when the content being processed is expanded to include multiple modalities such as text, images, video, audio, etc. Thus, the inventors have discovered a way to better understand, exploit, and organize multimodal content in order to extract semantics and to make high-level sense of the multimodal content. This includes extracting short and long-term events from the multimodal content even from unorganized, but structured content, such as social media content. The embodiments of the present principles described herein allow data in multiple modalities to provide rich information that allows for interpolation of information by providing complementary information, to fill gaps in individual modalities, and to provide performance benefits. The multimodal embeddings provided by the present principles exploit correlations in different modalities and provide explicit vectorial representations for data points without relying on fixed architectures such as late fusion techniques.
  • The inventors have found that non-Euclidean embeddings provide more relevant information than Euclidean embeddings such as word2vec. Real-world data has inherent semantic hierarchies such as plants-flowers-roses or animals-land animals-mammals-horses. With Euclidean multimodal embeddings, unfortunately, such hierarchies are lost because Euclidean mappings are not able to capture distances that grow exponentially in a compact manner. Euclidean space is a flat space that doesn't capture intrinsic (hierarchical) structures. Thus, semantics extraction based on Euclidean embeddings tends to be insightful in a small neighborhood of the point of interest but fails to capture the aforementioned inherent semantic hierarchies. As illustrated in a view 100A of FIG. 1A, the classes in Euclidean space are more tightly packed and are scattered throughout the embedding. As shown in a view 100B of FIG. 1B, using a non-Euclidean space such as, for example, a Poincaré space allows distinct classes to form in broader categories such as “plants” and “land animals.” The hierarchical structure of the data is retained in the structure of the embedding. Thus, the inventors have found that there is a need to devise embedding methods that retain such semantic hierarchies.
  • Non-Euclidean embeddings such as, for example, hyperbolic embeddings provide a way to capture distances that grow exponentially through a logarithm-like warping of distance space. As shown in a view 200 of FIG. 2, hyperbolic spaces have tree-like properties that naturally indicate hierarchies, and learning can be easily integrated into gradient-based optimization. Thus, the top of the hierarchy, such as plant in plant-flowers-roses, is brought much closer to the bottom of the hierarchy. The hyperbolic embeddings provide a way to capture real-world semantic hierarchies while retaining all of the advantages of the unsupervised continuous representation that Euclidean embeddings provide. Hyperbolic embeddings capture real-world semantic hierarchies even when realized through a simple approximation that is not completely hyperbolic, and they match or exceed the state of the art in retrieval accuracy.
  • The hierarchies provided by non-Euclidean embeddings may be used to contract and/or expand a class (e.g., user community) as well. This is particularly useful during clustering (see below for clustering processes). The hierarchy may be traversed upwards or downwards (e.g., from a community of users to a single user or from a community of users to a select group of users, etc.). Similarly, the hierarchy may be leveraged to move from a single user or select group of users to find a more general group of users or a community of users. In some instances, attributes may be passed up or down a hierarchy. For example, a user living in a particular state such as, for example, Maine, also lives in the United States. Other attributes may only pass in one direction as, for example, some users may live on the coast of Maine while other users in Maine may not. Other attributes may be metric in nature and others may be continuously variable such as, for example, incomes.
  • FIG. 3 is a method 300 for embedding multimodal content in a common non-Euclidean geometric space according to some embodiments of the present principles. For the sake of brevity, the following examples are based on content with two modalities—text and images. However, the concepts may be expanded to include any number of modalities including, for example, video, audio, etc. In block 302, a first machine learning model is trained using content relating to a first modality. In some embodiments, the first modality may be, for example, text. In general, the model performance improves as the data set size used for training increases. In block 304, a first modality feature vector is created using the first machine learning model from an input that has multimodal content. The first modality feature vector represents a first modality feature of the multimodal content. For example, if the multimodal content is an image with a caption, the first modality feature vector may represent a first modality (text) of the multimodal content (caption portion). In some embodiments, a pre-existing single modality model such as, for example, word2vec (Euclidean space) may be used to provide the first feature vector for text modalities. However, the inventors have found that performance may be increased by retraining the word2vec with vectors from a non-Euclidean space.
  • In block 306, a second machine learning model is trained using content relating to a second modality. In some embodiments, the second machine learning model may be trained using images (visual representations such as photos, drawings, paintings, etc.) as the content of the second modality. In general, the model performance improves as the data set size used for training increases. In block 308, a second modality feature vector is created using the second machine learning model from the input that has multimodal content. The second modality feature vector represents a second modality feature of the multimodal content. For example, if the multimodal content is an image with a caption, the second modality feature vector may represent a second modality (image) of the multimodal content. In some embodiments, a deep learning neural network is used to create the second modality feature vector. In block 310, the first modality feature vector of the multimodal content and the second modality feature vector of the multimodal content are semantically embedded in a non-Euclidean geometric space, ending the flow at 312. The mapping of the first feature vector and the second feature vector in a common geometric space allows the inventors to exploit additional meanings obtained from, for example, an image and text that would not be obtainable from the text alone or from the image alone. In some embodiments, content, content-related information, and/or an event are projected into the common geometric space. An embedded feature vector in the common geometric space close to the projection is then determined as being related to the projected content, content-related information, and/or event.
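  • As a concrete, non-limiting illustration of blocks 302-312, the PyTorch sketch below pairs a text encoder with an image-feature encoder and then maps both outputs into the open unit ball so they can be treated as points of a Poincaré model. The module layouts, dimensions, and the simple norm-based mapping are assumptions made for illustration; they are not the specific networks or training procedure of the embodiments.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Stand-in first-modality model (blocks 302/304), e.g. word embeddings + Bi-LSTM."""
    def __init__(self, vocab_size=10000, emb_dim=300, out_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, out_dim, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * out_dim, out_dim)

    def forward(self, token_ids):                 # token_ids: (batch, seq_len)
        h, _ = self.lstm(self.embed(token_ids))   # (batch, seq_len, 2*out_dim)
        return self.proj(h.mean(dim=1))           # one vector per caption

class ImageEncoder(nn.Module):
    """Stand-in second-modality model (blocks 306/308), e.g. CNN features + linear layer."""
    def __init__(self, feat_dim=4096, out_dim=128):
        super().__init__()
        self.fc = nn.Linear(feat_dim, out_dim)

    def forward(self, cnn_features):              # cnn_features: (batch, feat_dim)
        return self.fc(cnn_features)

def to_poincare_ball(x, eps=1e-5):
    """Map arbitrary vectors into the open unit ball (norm < 1) so they can be
    compared with a hyperbolic distance in the common geometric space (block 310)."""
    norm = x.norm(dim=-1, keepdim=True)
    return x / (1.0 + norm + eps)

text_vec = to_poincare_ball(TextEncoder()(torch.randint(0, 10000, (2, 12))))
image_vec = to_poincare_ball(ImageEncoder()(torch.randn(2, 4096)))
```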
  • In some embodiments, a multimodal feature vector based on the first modality feature vector and the second modality feature vector is created that represents both the first modality feature of the multimodal content and the second modality feature of the multimodal content. The multimodal feature vector of the multimodal content is embedded in (mapped to) a non-Euclidean geometric space. The embedded multimodal feature vector, for example, represents both the image and text in a singular notion that allows the inventors to exploit additional meanings obtained from the combination of image and text that would not be obtainable from the text alone or from the image alone. In addition, the inventors have discovered that the non-Euclidean space retains hierarchies of the different modality features from the multimodal content.
  • There are no limits on the number of modalities of the content nor on the number of machine learning models that may be used. Similarly, there are no limits on the number of feature vectors included in the embedded multimodal feature vector (e.g., it may include associated data from more than two modality sources). In some embodiments, additional information may be infused into the multimodal feature vector during embedding processes and/or may be included as an attribute of the embedded multimodal feature vector. Because the content of the first and second modalities is embedded together, the non-Euclidean common geometric space is an embedding space that represents meaning multiplication of the content from the different modalities while preserving hierarchies of the multimodal content. Meaning multiplication denotes that, for example, when someone posts a meme on social media, the meaning is a combination of the image and its caption, with the combination yielding meaning greater than the meaning of the image alone or the meaning of the caption alone. Meaning multiplication is discussed further in the examples that follow.
  • In this manner, words and images are transformed into vectors that are embedded into a common non-Euclidean space. Distance between vectors is small when the vectors are semantically similar. Hierarchies are reflected in the norms of the embedded vectors. As illustrated in a view 400A of FIG. 4A, Euclidean space does not inherently preserve hierarchies as the content is spread across a single plane. However, as shown in a view 400B of FIG. 4B, non-Euclidean space inherently preserves the natural hierarchies of the content. In one example of an embodiment of the present principles, Microsoft's COCO (Common Objects in Context) database (see, T. Lin, C. L. Zitnick, and P. Dollár, "Microsoft COCO: Common Objects in Context," pp. 1-15) was used for training the model. A contrastive loss function with Riemannian SGD (Stochastic Gradient Descent) was used for the embedding. A Lorentzian space model or Poincaré n-ball space model was used as the common non-Euclidean embedding space. A deep learning convolutional neural network (CNN) 504 with fully connected (FC) and linear layers 506 was used to process the images 502. A bidirectional long short-term memory (Bi-LSTM) network 512 was used to process the descriptive text 510 of the image. The results from both processes were then mapped to the non-Euclidean common space 508 as shown in a view 500 of FIG. 5.
  • For a Poincare n-ball model 600A illustrated in FIG. 6, the images and text may be embedded using:
  • $d_p(x, y) = \operatorname{arcosh}\!\left(1 + \dfrac{2\,\lVert x - y \rVert^2}{\left(1 - \lVert x \rVert^2\right)\left(1 - \lVert y \rVert^2\right)}\right)$  (Eq. 1)
  • For a Lorentzian model 600B illustrated in FIG. 6, the images and text may be embedded using:

  • $d_\ell(x, y) = \operatorname{arcosh}\!\left(-\langle x, y \rangle_{\mathcal{L}}\right)$, where $\langle \cdot, \cdot \rangle_{\mathcal{L}}$ denotes the Lorentzian inner product  (Eq. 2)
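  • A small NumPy sketch of the two distance functions is given below. The explicit form of the Lorentzian inner product (a negated first coordinate) is the standard convention for the hyperboloid model and is an assumption here, since only the arcosh form is stated above.

```python
import numpy as np

def poincare_distance(x, y, eps=1e-9):
    """Eq. 1: geodesic distance between two points x, y inside the unit ball."""
    sq = np.sum((x - y) ** 2)
    denom = (1.0 - np.sum(x ** 2)) * (1.0 - np.sum(y ** 2))
    return np.arccosh(1.0 + 2.0 * sq / (denom + eps))

def lorentz_inner(x, y):
    """Lorentzian inner product <x, y>_L = -x0*y0 + sum_i xi*yi (assumed convention)."""
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def lorentz_distance(x, y):
    """Eq. 2: distance on the hyperboloid model, d(x, y) = arcosh(-<x, y>_L)."""
    return np.arccosh(np.clip(-lorentz_inner(x, y), 1.0, None))

# Points near the boundary of the ball are exponentially "far" from the origin,
# which is the logarithm-like warping of distance referred to above.
origin = np.zeros(2)
print(poincare_distance(origin, np.array([0.5, 0.0])))   # moderate distance (~1.10)
print(poincare_distance(origin, np.array([0.99, 0.0])))  # much larger distance (~5.29)
```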
  • The Poincaré ball is a realization of hyperbolic space (the open d-dimensional unit ball) and, in an embodiment, the Poincaré ball can be used to model the hyperbolic embedding space. The image embedding layer is trained in the n-dimensional Poincaré ball. The structure of the loss function and the gradient descent are altered, and a linear projection layer is created to constrain embedding vectors to the manifold. In one example, a pre-trained word2vec model may be used and the results can be projected to the manifold via a few non-linear layers. For training, a database with images having semantic tags may be used. Keywords from image captions may then be extracted and used as labels for training the images. A ranking loss algorithm can be adjusted to push similar images and tags (words) together and vice-versa. The mean average precision (MAP) may be output for evaluating the training results. The inventors found that during an evaluation test, the MAP of the Poincaré embedding space rapidly reached that of a Euclidean embedding space (based on the number of iterations) and oftentimes exceeded it, as shown in graph 700 of FIG. 7. However, the inventors also found that using a pre-trained word2vec model was not optimal since its vectors preserve Euclidean structure. In some embodiments, higher performance may be obtained by pre-training the word2vec model with hyperbolic embeddings.
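  • The constraint that embedding vectors remain on the manifold can be sketched as follows. The metric rescaling factor used here, ((1 - ||v||^2)^2)/4, is the usual Riemannian correction for the Poincaré ball and is stated as an assumption; the exact update rule is not spelled out above.

```python
import numpy as np

def project_to_ball(v, max_norm=1.0 - 1e-3):
    """Re-project a vector into the open unit ball after a gradient update,
    so that it remains a valid point of the Poincaré model."""
    norm = np.linalg.norm(v)
    if norm >= max_norm:
        v = v * (max_norm / norm)
    return v

def riemannian_scale(grad, v):
    """Rescale a Euclidean gradient by ((1 - ||v||^2)^2) / 4, the usual first step
    of Riemannian SGD on the Poincaré ball (an assumption, not the exact update)."""
    factor = ((1.0 - np.dot(v, v)) ** 2) / 4.0
    return factor * grad

# One toy update step for a single embedding vector.
v = np.array([0.3, 0.4])
grad = np.array([2.0, -1.0])
v = project_to_ball(v - 0.1 * riemannian_scale(grad, v))
```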
  • FIG. 8 is a method 800 of embedding multimodal content and agent information from social media in a common non-Euclidean geometric space according to an embodiment of the present principles. For purposes of the discussions herein, an agent may be a user that is a human being and/or a machine derived entity (e.g., a bot) and the like. In some embodiments, agent and user may be used interchangeably. In social media settings, extracting information from agent postings can prove invaluable. In some embodiments, this is achieved by embedding agent information along with multimodal content in a non-Euclidean geometric space. In block 802, a social media posting of multimodal content and information relating to the agent (user, bot, etc.) who posted the posting on social media is obtained. For example, postings from Instagram or other social media sites that include multimodal content postings (e.g., image with a caption, etc.) can be used. Generally, the agent who did the posting is also readily obtainable (e.g., user avatar, user tagging, etc.). In block 804, a first machine learning model is trained with content relating to a first modality that may be found in the social media posting (e.g., caption with text, etc.). In block 806, a first modality feature vector is created from the social media posting that represents a first modality feature of the social media posting using the first machine learning model. For example, the first modality feature vector may represent the words from a caption posted with an image.
  • In block 808, a second machine learning model is trained with content relating to a second modality that may be found in the social media posting (e.g., an image, etc.). In block 810, a second modality feature vector is created from the social media posting that represents a second modality feature of the social media posting using the second machine learning model. For example, the second modality feature vector may represent a photo from the social media posting. There is no limit on the number of possible modalities that may be extracted from the social media posting nor on the number of machine learning models that may be implemented to create modality feature vectors for those modalities. Thus, the method 800 may be used to determine feature vectors for any number of modalities exhibited by the content of the social media posting (e.g., audio, video, etc.). In block 812, the first modality feature vector of the social media posting, the second modality feature vector of the social media posting, and the posting agent information (as a vector and/or attribute) are then semantically embedded in (mapped to) a non-Euclidean geometric space, ending the flow at 814.
  • In some embodiments, a multimodal feature vector is created from the first modality feature vector, the second modality feature vector, and the posting agent information, and represents the first modality feature of the social media posting, the second modality feature of the social media posting, and the posting agent information. In some embodiments, the posting agent information may also include cluster information (e.g., agents/users belonging to a group or distinguishable as a group, etc.). In some embodiments, the posting agent information may not be combined into the multimodal feature vector but may instead be appended as an attribute to the multimodal feature vector. The multimodal feature vector of the social media posting is then embedded in (mapped to) a non-Euclidean geometric space.
  • The inventors have found that by using a model that jointly learns agent and content embeddings, additional information can be extracted with regard to the original poster of the content and/or other agents who appear nearby in the embedding space. The model may also be adjusted such that agents are clustered based on their posted and/or associated content. The image-text joint embedding framework is leveraged to create content-to-user embeddings. Each training example is a tuple of an image and a list of users who posted the image. The objective is to minimize the distance between the image embedding and the user embeddings by altering the ranking loss algorithm. As shown in a view 900 of FIG. 9, the standard loss 902 is extended by adding a ranking loss term 904 for the cluster center vectors 906 along with a clustering loss 908.
  • The problem of content recommendation for users in a social network is an important one and is central to fields such as social media marketing. A distinction is made between user-centric social networks like Twitter and content-centric social networks like Pinterest. Unlike user-centric networks where the primary purpose is developing connections and communication, content-centric platforms are solely interest-based, allowing users to easily collect and group content based on their interest.
  • The following example focuses on content-centric networks such as Pinterest, where each user pins a set of images that highlights their interests. An embedding framework is developed in which every user (described by a set of images) and the images themselves are mapped to a common vector space that ideally captures interest-based similarity between users and images. User embeddings are closer to image embeddings with similar interests and vice-versa. As illustrated in FIG. 10, a deep learning framework 1000 is used to learn these embeddings. Each user is described by a set of images. Deep learning is used to learn a vector for each such user and a mapping from the image space to the interest space such that users and images with similar interests are closer together in the embedding space while dissimilar users and images are farther away.
  • The dataset used is a combination of users and images. For example, let the users in the dataset be represented by $\mathcal{U} = \{U_i,\ i = 1, 2, \ldots, N_U\}$ and let all the images in the dataset be represented by $\mathcal{I} = \{I_j,\ j = 1, 2, \ldots, N_I\}$. The following information is available: for each user $U_i$, there is a given set of images $\mathcal{I}_i \subseteq \mathcal{I}$ that is posted/pinned by the user, and for each image $I_j$, there is a given set of users $\mathcal{U}_j$ that have posted/pinned this particular image. For many images, there can be multiple users associated with the image and, thus, the image sets of these users will have some overlap. The goal is, for each user $U_i$ and each image $I_j$, to obtain vectors $u_i \in \mathbb{R}^D$ and $v_j \in \mathbb{R}^D$ respectively, where $D$ is the embedding dimension, set to 300 in this example. The goal is to provide an embedding space such that user and image embeddings that represent similar interests are closer together in the embedding space and dissimilar users and images are farther away. These embeddings can then be utilized for retrieval of similar users, image recommendation, grouping of similar users, etc. It is noted that the user category information, although available in the dataset, is not employed while training the networks; it is employed only for evaluating the models.
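  • The per-user and per-image sets above are easy to hold in memory as a pair of inverse indexes, as in the short sketch below. The pin records and identifiers are invented for illustration.

```python
from collections import defaultdict

# Hypothetical pin records: (user_id, image_id) pairs standing in for the
# Pinterest-style data described above.
pins = [("u1", "i1"), ("u1", "i2"), ("u2", "i2"), ("u3", "i3")]

images_of_user = defaultdict(set)   # I_i: images pinned by user U_i
users_of_image = defaultdict(set)   # U_j: users who pinned image I_j
for user, image in pins:
    images_of_user[user].add(image)
    users_of_image[image].add(user)

# Overlapping image sets (here, u1 and u2 share image i2) are what ties
# different users together in the learned embedding space.
print(users_of_image["i2"])   # {'u1', 'u2'}
```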
  • In order to construct a simple baseline for this example, an embedding model referred to as DeViSE is used (see, A. Frome, G. Corrado, and J. Shlens, "DeViSE: A deep visual-semantic embedding model," Adv. Neural Inf. Process. Syst., pp. 1-11, 2013). Specifically, an image embedding is trained that maps every image to a word embedding space. This may be achieved using a deep convolutional neural network (DCNN) such as, for example, a VGG-16 (Visual Geometry Group) network, attaching a fully connected layer to transform, for example, a 4096-dimensional image feature to a 300-dimensional vector. This final image embedding layer is trained from scratch while the rest of the network is fine-tuned using a model trained for ImageNet classification. Let the weight matrix for the image embedding layer be denoted by $w_I$. A dataset, such as Microsoft's COCO, which contains multiple captions for each image, may be used. The embedding space dimension, as stated previously, is set to 300. The word embeddings are not learned but are initialized using GloVe (see, J. Pennington, R. Socher, and C. D. Manning, "GloVe: Global Vectors for Word Representation."). As per DeViSE, ranking loss is employed as the loss function. For each positive example, instead of summing over all negative examples, the sum is taken only over the negative examples in the mini-batch; done empirically, this serves as a good approximation to the original loss function. Once the embedding vectors of the images are obtained, the embedding vector for each user $U_i$ is obtained as the arithmetic average of the image embedding vectors of the images in $\mathcal{I}_i$. This serves as a strong baseline for this example.
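  • The baseline's user vectors are simply per-user means of image vectors, as in the sketch below; the identifiers and random vectors are placeholders rather than trained embeddings.

```python
import numpy as np

# image_vecs: hypothetical 300-d image embeddings produced by a trained
# image-embedding layer; user_images: the per-user image sets I_i.
image_vecs = {"i1": np.random.randn(300),
              "i2": np.random.randn(300),
              "i3": np.random.randn(300)}
user_images = {"u1": ["i1", "i2"], "u2": ["i2"], "u3": ["i3"]}

# Baseline user embedding: arithmetic mean of the embeddings of the user's images.
user_vecs = {
    u: np.mean([image_vecs[i] for i in imgs], axis=0)
    for u, imgs in user_images.items()
}
```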
  • In some embodiments, the method learns the user embeddings and the image embeddings jointly in a single neural network. The architecture of this network is similar to that of DeViSE and, thus, allows for fair comparison with the baseline. Instead of the embedding layer for words, there is an embedding layer for users. That is, initially each user is represented as a one-hot vector and the embedding layer is a fully connected layer, represented by a matrix $w_U \in \mathbb{R}^{N_U \times D}$, that converts this one-hot vector into the desired user embedding. Since the input to this layer is a one-hot vector, the user embedding for user $U_i$ is simply the $i$-th column of $w_U$. The ranking loss function (Eq. 3) is minimized similar to DeViSE:
  • $L_{\text{rank-users}}(v_j) = \displaystyle\sum_{k}\sum_{i} \max\left(0,\ 1 - u_k^{T} v_j + u_i^{T} v_j\right)$  (Eq. 3)
  • For each image $I_j$, the negative users indexed by $i$ are all the users in the mini-batch who have not pinned $I_j$ and the positive users indexed by $k$ are all the users who have pinned $I_j$. In addition, a reconstruction loss is employed; it has been shown to be useful in the case of visual-semantic embeddings in that the resulting embeddings are more resilient to domain transfer. In order to incorporate the reconstruction loss, another fully connected layer is added that takes as input the image embedding, and the desired output of the layer is the output image feature of the VGG-16 network, denoted by $v_N$. The reconstruction loss (Eq. 4) for a single image is given by:

  • $L_{\text{recon}} = \left\lVert v_N - w_R\, w_I\, v_N \right\rVert^{2}$  (Eq. 4)
  • In the above loss function, the loss is computed for a given image embedding vector. Another loss in terms of a given user embedding can also be computed. Thus, the final loss function (Eq. 5) used is:

  • $L_1 = L_{\text{rank-users}} + \lambda \cdot L_{\text{recon}}$  (Eq. 5)
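  • A compact PyTorch sketch of Eq. 3 through Eq. 5 follows. The batch handling, layer shapes, and the weight lambda = 0.1 are illustrative assumptions rather than values taken from the description above.

```python
import torch
import torch.nn as nn

emb_dim, feat_dim = 300, 4096
w_I = nn.Linear(feat_dim, emb_dim, bias=False)    # image embedding layer
w_R = nn.Linear(emb_dim, feat_dim, bias=False)    # reconstruction layer

def rank_users_loss(v_j, pos_users, neg_users, margin=1.0):
    """Eq. 3 for one image: hinge ranking loss over all (positive, negative)
    user pairs, with negatives drawn from the current mini-batch.
    v_j: (D,), pos_users: (P, D), neg_users: (N, D)."""
    pos = pos_users @ v_j                          # u_k^T v_j
    neg = neg_users @ v_j                          # u_i^T v_j
    return torch.clamp(margin - pos.unsqueeze(1) + neg.unsqueeze(0), min=0).sum()

def recon_loss(v_N):
    """Eq. 4: reconstruct the CNN feature v_N from its embedding w_I v_N."""
    return ((v_N - w_R(w_I(v_N))) ** 2).sum()

def total_loss(v_N, pos_users, neg_users, lam=0.1):
    """Eq. 5: L_1 = L_rank-users + lambda * L_recon (lambda chosen arbitrarily here)."""
    v_j = w_I(v_N)                                 # image embedding
    return rank_users_loss(v_j, pos_users, neg_users) + lam * recon_loss(v_N)

loss = total_loss(torch.randn(feat_dim), torch.randn(4, emb_dim), torch.randn(12, emb_dim))
```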
  • In addition to the approach outlined above, some embodiments have a modification which allows learning of a clustering of users in addition to the user embeddings inside the same network. Learning the clusters jointly allows for a better and automatic sharing of information about similar images between user embeddings than what is available explicitly in the dataset. To this end, an additional matrix $w_C \in \mathbb{R}^{K \times D}$ is maintained, where $K$, the number of clusters, is a hyperparameter. Each row of $w_C$, represented by $c_l$, $l = 1, 2, \ldots, K$, is the vector representing the cluster center of the $l$-th cluster. Let the clusters of users be denoted by $C_l$, $l = 1, 2, \ldots, K$. In order to learn the cluster centers, the loss function proposed previously is further modified. Two other terms are added into the loss function (Eq. 6), given by:
  • $L_{\text{rank-clusters}}(v_j) = \displaystyle\sum_{l}\sum_{i} \max\left(0,\ 1 - c_l^{T} v_j + c_i^{T} v_j\right)$, $\quad L_{K\text{-means}} = \displaystyle\sum_{l} \sum_{u_c \in C_l} \left(1 - u_c^{T} c_l\right)$  (Eq. 6)
  • The first term, $L_{\text{rank-clusters}}(\cdot)$, is the cluster-center analogue of $L_{\text{rank-users}}$. In effect, it tries to push the image features closer to the right clusters and farther away from the wrong clusters. The second term, $L_{K\text{-means}}$, is similar to the K-means loss function and is used to ensure that the cluster centers are indeed representative of the users assigned to each cluster. Since nearest-neighbor computation is not a differentiable operation, the cluster assignments cannot be found inside the network. As an approximation, the cluster assignments are recomputed only once every 1000 iterations, and the user embeddings and the cluster centers are optimized with the fixed cluster assignments.
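  • The two additional terms and the periodic reassignment step can be sketched as below; the cluster count, dimensions, and random inputs are placeholders.

```python
import torch

def rank_clusters_loss(image_vec, correct_centers, wrong_centers, margin=1.0):
    """First term of Eq. 6: push an image embedding toward its assigned
    cluster centers and away from the others."""
    pos = correct_centers @ image_vec
    neg = wrong_centers @ image_vec
    return torch.clamp(margin - pos.unsqueeze(1) + neg.unsqueeze(0), min=0).sum()

def kmeans_loss(user_vecs, centers, assignments):
    """Second term of Eq. 6: keep each cluster center representative of the
    users assigned to it, i.e. sum of (1 - u_c^T c_l) over assigned users."""
    return (1.0 - (user_vecs * centers[assignments]).sum(dim=1)).sum()

def reassign_clusters(user_vecs, centers):
    """Nearest-center assignment, recomputed only occasionally (e.g., every
    1000 iterations) since the argmax is not differentiable."""
    sims = user_vecs @ centers.t()            # (N_U, K) similarity matrix
    return sims.argmax(dim=1)

users = torch.randn(100, 300)
centers = torch.randn(8, 300)
assign = reassign_clusters(users, centers)
loss = kmeans_loss(users, centers, assign)
```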
  • In one test example, the Pinterest dataset released by Geng et al. is used as a representative dataset (see, X. Geng, H. Zhang, J. Bian, and T. S. Chua, "Learning image and user features for recommendation in social networks," Proc. IEEE Int. Conf. Comput. Vis. (ICCV), pp. 4274-4282, 2015). It contains 46,000 users belonging to 468 interest categories and about 900,000 images pinned by these users. When a user "pins" an image, it indicates that the user has a preference for that image. However, the inverse may not be true. Using the main category list provided by Pinterest, sibling categories with 32 parent categories can be formed. For each user, a list of images pinned by that user is provided. Categories of images are not provided. As discussed above, each image may be pinned by multiple users and this information is also known. FIG. 11 shows an example of three users 1102, 1104, 1106 and the images 1108, 1110, 1112 pinned by the users 1102, 1104, 1106, respectively. In the dataset provided by Geng et al., each user has only one interest.
  • Once the embeddings for each user are obtained, user retrieval accuracy can be measured as follows. An $N_U \times N_U$ distance matrix is computed, from which, for each user $U_j$, the $M$ closest users can be found. As discussed before, the category to which each user belongs is known; thus, the top-$M$ accuracy based on the ground-truth categories can be computed. Normalized discounted cumulative gain (NDCG) is a widely used metric in information retrieval that takes into account the position of the retrieved elements. The formula for calculating NDCG at $k$, i.e., with $k$ retrieved elements (Eq. 7), is given by:
  • $\text{NDCG}_k = \dfrac{1}{\text{IDCG}_k} \displaystyle\sum_{i=1}^{k} \dfrac{2^{r_i} - 1}{\log_2(i + 1)}$,  (Eq. 7)
  • where $\text{IDCG}_k$ is the normalizing factor which corresponds to the best possible retrieval result for a given query, and $r_i$ denotes the relevance of the $i$-th retrieval result, with $r_i \in \{0, 1, 2\}$ calculated as follows:
      • ri=2 if the query and the retrieved result belong to the same category.
      • ri=1 if the query and the retrieved result belong to sibling categories.
      • ri=0 if the query and the retrieved result belong to unrelated categories.
        In the case of image recommendation, the above assumes that the image categories are known. However, in the dataset currently employed, no image-level class labels are available. Thus, the image labels are first determined using the categories of the users who pinned that image. Based on how the image category is determined, there are two variants: (1) the image category is chosen as the most common category among all the user categories; this is a strict metric and is referred to as NDCG-strict; (2) a more relaxed version is obtained by letting the image belong to all the categories of the users who pinned it, and is referred to as NDCG-relaxed. In all the experiments, results are shown for k=5 retrieved elements.
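  • A minimal NumPy sketch of the NDCG computation is shown below. For simplicity it computes the ideal DCG from the retrieved list itself, which is a simplifying assumption rather than the exact normalization used in the evaluation.

```python
import numpy as np

def ndcg_at_k(relevances, k=5):
    """Eq. 7 for one query, given graded relevances (0, 1 or 2) of the
    retrieved results in ranked order."""
    rel = np.asarray(relevances, dtype=float)[:k]
    positions = np.arange(2, rel.size + 2)                  # i + 1 for i = 1..k
    dcg = ((2.0 ** rel - 1.0) / np.log2(positions)).sum()
    ideal = np.sort(rel)[::-1]                              # best possible ordering
    idcg = ((2.0 ** ideal - 1.0) / np.log2(positions)).sum()
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg_at_k([2, 0, 1, 0, 2]))   # retrieval with mixed relevance
print(ndcg_at_k([2, 2, 1, 0, 0]))   # better-ordered retrieval, higher NDCG
```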
  • Three measures, all in the range [0,1], are used; they are popularly used to measure the quality of clusters when the ground truth is known. The first is cluster homogeneity: a cluster is said to be fully homogeneous if all the elements in that cluster belong to the same class. The second is cluster completeness: a cluster assignment is said to be complete if, for each class, all elements of that class belong to the same cluster. The third is the V-measure: the harmonic mean of cluster homogeneity and cluster completeness, which is also equal to the normalized mutual information between the true class labels and the cluster assignments for all the points in the dataset.
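  • These three measures are available off the shelf, for example in scikit-learn; the labels below are invented simply to show the calling convention.

```python
from sklearn.metrics import homogeneity_score, completeness_score, v_measure_score

# Hypothetical ground-truth interest categories and predicted cluster ids
# for six users, just to show how the three measures are computed.
true_categories = [0, 0, 1, 1, 2, 2]
cluster_ids     = [0, 0, 1, 1, 1, 2]

print(homogeneity_score(true_categories, cluster_ids))
print(completeness_score(true_categories, cluster_ids))
print(v_measure_score(true_categories, cluster_ids))     # harmonic mean of the two
```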
  • Table 1 shows how well the joint embedding and clustering framework performs in terms of the clustering results and how the number of clusters affects the performance of clustering, measured in terms of homogeneity, completeness and V-measure. Both in the case of the baseline and in the case where only the user embeddings are learned, K-means clustering is performed with random users as the initial cluster centers. As the number of clusters increases, the quality of clusters tends to improve. The baseline approach yields consistently poorer results compared to the proposed frameworks of the present principles. Also, using K-means offline tends to produce better clusters. However, this effect is seen to diminish as the number of clusters is increased to 468 (number of categories) from 32 (number of parent categories).
  • TABLE 1
    Joint Embedding & Clustering Framework Performance
    Framework                                NDCG-strict    NDCG-relaxed
    Baseline                                 0.1108         0.2047
    Learning user embeddings                 0.1380         0.2318
    Joint user embedding and clustering      0.1426         0.2494
  • In the foregoing description, numerous specific details, examples, and scenarios are set forth in order to provide a more thorough understanding of the present principles. It will be appreciated, however, that embodiments of the principles can be practiced without such specific details. Further, such examples and scenarios are provided for illustration, and are not intended to limit the teachings in any way. Those of ordinary skill in the art, with the included descriptions, should be able to implement appropriate functionality without undue experimentation. References in the specification to “an embodiment,” etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is believed to be within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly indicated. Embodiments in accordance with the teachings can be implemented in hardware, firmware, software, or any combination thereof. Embodiments may also be implemented as instructions stored using one or more machine-readable media, which may be read and executed by one or more processors.
  • A machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computing device or a "virtual machine" running on one or more computing devices). For example, a machine-readable medium may include any suitable form of volatile or non-volatile memory. Modules, data structures, blocks, and the like are referred to as such for ease of discussion, and are not intended to imply that any specific implementation details are required. For example, any of the described modules and/or data structures may be combined or divided into sub-modules, sub-processes or other units of computer code or data as may be required by a particular design or implementation. Further, references herein to rules or templates are not meant to imply any specific implementation details. That is, the multimodal content embedding systems can store rules, templates, etc. in any suitable machine-readable format.
  • Referring to FIG. 12, a simplified high level block diagram of an embodiment of the computing device 1200 in which a multimodal content embedding system can be implemented is shown. While the computing device 1200 is shown as involving multiple components and devices, it should be understood that in some embodiments, the computing device 1200 can constitute a single computing device (e.g., a mobile electronic device, laptop or desktop computer) alone or in combination with other devices. The illustrative computing device 1200 can be in communication with one or more other computing systems or devices 542 via one or more networks 540. In the embodiment of FIG. 12, illustratively, a portion 110A of the multimodal content embedding system can be local to the computing device 510, while another portion 110B can be distributed across one or more other computing systems or devices 542 that are connected to the network(s) 540.
  • In some embodiments, portions of the multimodal content embedding system can be incorporated into other systems or interactive software applications. Such applications or systems can include, for example, operating systems, middleware or framework software, and/or applications software. For example, portions of the multimodal content embedding system can be incorporated into or accessed by other, more generalized system(s) or intelligent assistance applications. The illustrative computing device 1200 of FIG. 12 includes at least one processor 512 (e.g. a microprocessor, microcontroller, digital signal processor, etc.), memory 514, and an input/output (I/O) subsystem 516. The computing device 1200 can be embodied as any type of computing device such as a personal computer (e.g., desktop, laptop, tablet, smart phone, body-mounted device, etc.), a server, an enterprise computer system, a network of computers, a combination of computers and other electronic devices, or other electronic devices.
  • Although not specifically shown, it should be understood that the I/O subsystem 516 typically includes, among other things, an I/O controller, a memory controller, and one or more I/O ports. The processor 512 and the I/O subsystem 516 are communicatively coupled to the memory 514. The memory 514 can be embodied as any type of suitable computer memory device (e.g., volatile memory such as various forms of random access memory). In the embodiment of FIG. 12, the I/O subsystem 516 is communicatively coupled to a number of hardware components and/or other computing systems including one or more user input devices 518 (e.g., a touchscreen, keyboard, virtual keypad, microphone, etc.), and one or more storage media 520. The storage media 520 may include one or more hard drives or other suitable data storage devices (e.g., flash memory, memory cards, memory sticks, and/or others).
  • In some embodiments, portions of systems software (e.g., an operating system, etc.), framework/middleware (e.g., application-programming interfaces, object libraries, etc.), and/or the multimodal content embedding system reside at least temporarily in the storage media 520. Portions of the systems software, the framework/middleware, and/or the multimodal content embedding system can also exist in the memory 514 during operation of the computing device 1200, for faster processing or other reasons. The one or more network interfaces 532 can communicatively couple the computing device 1200 to a local area network, a wide area network, a personal cloud, an enterprise cloud, a public cloud, and/or the Internet, for example. Accordingly, the network interfaces 532 can include one or more wired or wireless network interface cards or adapters, for example, as may be needed pursuant to the specifications and/or design of the particular computing device 1200. The other computing device(s) 542 can be embodied as any suitable type of computing device such as any of the aforementioned types of devices or other electronic devices. For example, in some embodiments, the other computing devices 542 can include one or more server computers used with the multimodal content embedding system.
  • The computing device 1200 can further optionally include an optical character recognition (OCR) system 528 and an automated speech recognition (ASR) system 530. It should be understood that each of the foregoing components and/or systems can be integrated with the computing device 1200 or can be a separate component or system that is in communication with the I/O subsystem 516 (e.g., over a network). The computing device 1200 can include other components, subcomponents, and devices not illustrated in FIG. 12 for clarity of the description. In general, the components of the computing device 1200 are communicatively coupled as shown in FIG. 12 by signal paths, which may be embodied as any type of wired or wireless signal paths capable of facilitating communication between the respective devices and components.
  • In the drawings, specific arrangements or orderings of schematic elements may be shown for ease of description. However, the specific ordering or arrangement of such elements is not meant to imply that a particular order or sequence of processing, or separation of processes, is required in all embodiments. In general, schematic elements used to represent instruction blocks or modules may be implemented using any suitable form of machine-readable instruction, and each such instruction may be implemented using any suitable programming language, library, application-programming interface (API), and/or other software development tools or frameworks. Similarly, schematic elements used to represent data or information may be implemented using any suitable electronic arrangement or data structure. Further, some connections, relationships or associations between elements may be simplified or not shown in the drawings so as not to obscure the teachings herein. While the foregoing is directed to embodiments in accordance with the present principles, other and further embodiments in accordance with the principles described herein may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.

Claims (20)

1. A method of creating a semantic embedding space for multimodal content for improved recognition of at least one of content, content-related information and events, the method comprising:
for each of a plurality of content of the multimodal content, creating a respective, first modality feature vector representative of content of the multimodal content having a first modality, using a first machine learning model;
for each of a plurality of content of the multimodal content, creating a respective, second modality feature vector representative of content of the multimodal content having a second modality, using a second machine learning model; and
semantically embedding the respective, first modality feature vectors and the respective, second modality feature vectors in a common geometric space that provides logarithm-like warping of distance space in the common geometric space to capture hierarchical relationships between seemingly disparate, embedded modality feature vectors of content in the common geometric space;
wherein embedded modality feature vectors that are related, across modalities, are closer together in the common geometric space than unrelated modality feature vectors.
2. The method of claim 1, further comprising:
for each of a plurality of first modality feature vector and second modality feature vector multimodal content pairs, forming a combined multimodal feature vector from the first modality feature vector and the second modality feature vector; and
semantically embedding the respective, combined multimodal feature vectors in the common geometric space to capture relationships between at least two of the embedded, first modality feature vectors, the embedded, second modality feature vectors and the embedded combined multimodal feature vectors.
3. The method of claim 1, further comprising:
semantically embedding content-related information, including at least one of user information and user grouping information, in the common geometric space based upon a relationship between the content-related information and at least one embedded, first modality feature vector, one embedded, second modality feature vector and one embedded combined multimodal feature vector.
4. The method of claim 1, further comprising:
projecting at least one of content, content-related information, and an event into the common geometric space; and
determining at least one embedded feature vector in the common geometric space close to the projection as being related to the projected at least one of the content, the content-related information, and the event.
5. The method of claim 1, wherein a second modality feature vector representative of content of the multimodal content having a second modality is created using information relating to respective content having a first modality.
6. The method of claim 1, further comprising:
appending content-related information, including at least one of user information and user grouping information, to at least one embedded, first modality feature vector, one embedded, second modality feature vector and one embedded, combined multimodal feature vector.
7. The method of claim 1, wherein content-related information comprises at least one of agent information or agent grouping information for at least one embedded, first modality feature vector, one embedded, second modality feature vector and one embedded, combined multimodal feature vector.
8. The method of claim 1, wherein the common geometric space comprises a non-Euclidean space.
9. The method of claim 8, wherein the non-Euclidean space comprises at least one of a hyperbolic, a Lorentzian, and a Poincaré ball.
10. The method of claim 1, wherein the multimodal content comprises multimodal content posted by an agent on a social media network.
11. The method of claim 10, wherein the agent comprises at least one of a computer, robot, a person with a social media account, and a participant in a social media network.
12. The method of claim 1, further comprising:
inferring information for feature vectors embedded in the common geometric space based on a proximity of the feature vectors to at least one other feature vector embedded in the common geometric space.
13. An apparatus for creating a semantic embedding space for multimodal content for improved recognition of at least one of content, content-related information and events, the apparatus comprising:
a processor; and
a memory coupled to the processor, the memory having stored therein at least one of programs or instructions executable by the processor to configure the apparatus to:
for each of a plurality of content of the multimodal content, create a respective, first modality feature vector representative of content of the multimodal content having a first modality using a first machine learning model;
for each of a plurality of content of the multimodal content, create a respective, second modality feature vector representative of content of the multimodal content having a second modality using a second machine learning model; and
semantically embed the respective, first modality feature vectors and the respective, second modality feature vectors in a common geometric space that provides logarithm-like warping of distance space in the common geometric space to capture hierarchical relationships between seemingly disparate, embedded modality feature vectors of content in the common geometric space;
wherein embedded modality feature vectors that are related, across modalities, are closer together in the common geometric space than unrelated modality feature vectors.
14. The apparatus of claim 13, wherein the apparatus is further configured to:
for each of a plurality of first modality feature vector and second modality feature vector multimodal content pairs, form a combined multimodal feature vector from the first modality feature vector and the second modality feature vector; and
semantically embed the respective, combined multimodal feature vectors in the common geometric space to capture relationships between at least two of the embedded, first modality feature vectors, the embedded, second modality feature vectors and the embedded combined multimodal feature vectors.
15. The apparatus of claim 13, wherein the apparatus is further configured to:
semantically embed content-related information, including at least one of user information and user grouping information, in the common geometric space based upon a relationship between the content-related information and at least one embedded feature vector.
16. The apparatus of claim 13, wherein the apparatus is further configured to:
project at least one of content, content-related information, and an event into the common geometric space; and
determine at least one embedded feature vector in the common geometric space close to the projection as being related to the projected at least one of the content, the content-related information, and the event.
17. A non-transitory computer-readable medium having stored thereon at least one program, the at least one program including instructions which, when executed by a processor, cause the processor to perform a method for creating a semantic embedding space for multimodal content for improved recognition of at least one of content, content-related information and events, comprising:
for each of a plurality of content of the multimodal content, creating a respective, first modality feature vector representative of content of the multimodal content having a first modality using a first machine learning model;
for each of a plurality of content of the multimodal content, creating a respective, second modality feature vector representative of content of the multimodal content having a second modality using a second machine learning model; and
semantically embedding the respective, first modality feature vectors and the respective, second modality feature vectors in a common geometric space that provides logarithm-like warping of distance space in the common geometric space to capture hierarchical relationships between seemingly disparate, embedded modality feature vectors of content in the common geometric space;
wherein embedded modality feature vectors that are related, across modalities, are closer together in the common geometric space than unrelated modality feature vectors.
18. The non-transitory computer-readable medium of claim 17, wherein the processor further, for each of a plurality of first modality feature vector and second modality feature vector multimodal content pairs, forms a combined multimodal feature vector from the first modality feature vector and the second modality feature vector; and
semantically embeds the respective, combined multimodal feature vectors in the common geometric space to capture relationships between at least two of the embedded, first modality feature vectors, the embedded, second modality feature vectors and the embedded combined multimodal feature vectors.
19. The non-transitory computer-readable medium of claim 17, wherein the processor further semantically embeds content-related information, including at least one of user information and user grouping information, in the common geometric space based upon a relationship between the content-related information and at least one embedded feature vector.
20. The non-transitory computer-readable medium of claim 17, wherein the processor further:
projects at least one of content, content-related information, and an event into the common geometric space; and
determines at least one embedded feature vector in the common geometric space close to the projection as being related to the projected at least one of the content, the content-related information, and the event.
US16/383,429 2018-04-20 2019-04-12 Embedding multimodal content in a common non-euclidean geometric space Pending US20190325342A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/383,429 US20190325342A1 (en) 2018-04-20 2019-04-12 Embedding multimodal content in a common non-euclidean geometric space

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201862660863P 2018-04-20 2018-04-20
US16/383,429 US20190325342A1 (en) 2018-04-20 2019-04-12 Embedding multimodal content in a common non-euclidean geometric space

Publications (1)

Publication Number Publication Date
US20190325342A1 (en) 2019-10-24

Family

ID=68236900

Family Applications (3)

Application Number Title Priority Date Filing Date
US16/383,447 Active 2039-10-11 US11055555B2 (en) 2018-04-20 2019-04-12 Zero-shot object detection
US16/383,429 Pending US20190325342A1 (en) 2018-04-20 2019-04-12 Embedding multimodal content in a common non-euclidean geometric space
US17/337,093 Active 2039-07-27 US11610384B2 (en) 2018-04-20 2021-06-02 Zero-shot object detection

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US16/383,447 Active 2039-10-11 US11055555B2 (en) 2018-04-20 2019-04-12 Zero-shot object detection

Family Applications After (1)

Application Number Title Priority Date Filing Date
US17/337,093 Active 2039-07-27 US11610384B2 (en) 2018-04-20 2021-06-02 Zero-shot object detection

Country Status (1)

Country Link
US (3) US11055555B2 (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111985520A (en) * 2020-05-15 2020-11-24 南京智谷人工智能研究院有限公司 Multi-mode classification method based on graph convolution neural network
US20200380307A1 (en) * 2019-05-31 2020-12-03 Radiant Analytic Solutions Inc. Techniques for deriving and/or leveraging application-centric model metric
CN114139063A (en) * 2022-01-30 2022-03-04 北京淇瑀信息科技有限公司 User tag extraction method and device based on embedded vector and electronic equipment
US20220261430A1 (en) * 2019-12-19 2022-08-18 Fujitsu Limited Storage medium, information processing method, and information processing apparatus
US20220277344A1 (en) * 2021-02-26 2022-09-01 Fulian Precision Electronics (Tianjin) Co., Ltd. Advertising method and electronic device using the same
US11605019B2 (en) * 2019-05-30 2023-03-14 Adobe Inc. Visually guided machine-learning language model
US11604822B2 (en) 2019-05-30 2023-03-14 Adobe Inc. Multi-modal differential search with real-time focus adaptation
JP7332238B2 (en) 2020-03-10 2023-08-23 SRI International Methods and Apparatus for Physics-Guided Deep Multimodal Embedding for Task-Specific Data Utilization
US11775578B2 (en) 2019-05-30 2023-10-03 Adobe Inc. Text-to-visual machine learning embedding techniques
US11874899B2 (en) 2020-12-15 2024-01-16 International Business Machines Corporation Automated multimodal adaptation of multimedia content
US11907339B1 (en) * 2018-12-13 2024-02-20 Amazon Technologies, Inc. Re-identification of agents using image analysis and machine learning

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11475248B2 (en) * 2018-10-30 2022-10-18 Toyota Research Institute, Inc. Auto-labeling of driving logs using analysis-by-synthesis and unsupervised domain adaptation
US10755099B2 (en) * 2018-11-13 2020-08-25 Adobe Inc. Object detection in images
JP7363107B2 (en) * 2019-06-04 2023-10-18 コニカミノルタ株式会社 Idea support devices, idea support systems and programs
CN110414447B (en) * 2019-07-31 2022-04-15 京东方科技集团股份有限公司 Pedestrian tracking method, device and equipment
CN110750987B (en) * 2019-10-28 2021-02-05 腾讯科技(深圳)有限公司 Text processing method, device and storage medium
US11461953B2 (en) * 2019-12-27 2022-10-04 Wipro Limited Method and device for rendering object detection graphics on image frames
US11328170B2 (en) * 2020-02-19 2022-05-10 Toyota Research Institute, Inc. Unknown object identification for robotic device
CN111428733B (en) * 2020-03-12 2023-05-23 山东大学 Zero sample target detection method and system based on semantic feature space conversion
CN113591872A (en) * 2020-04-30 2021-11-02 华为技术有限公司 Data processing system, object detection method and device
US11836930B2 (en) * 2020-11-30 2023-12-05 Accenture Global Solutions Limited Slip-to-slip connection time on oil rigs with computer vision
CN112749738B (en) * 2020-12-30 2023-05-23 之江实验室 Zero sample object detection method for performing superclass reasoning by fusing context
US11961314B2 (en) * 2021-02-16 2024-04-16 Nxp B.V. Method for analyzing an output of an object detector
CN114166204A (en) * 2021-12-03 2022-03-11 东软睿驰汽车技术(沈阳)有限公司 Repositioning method and device based on semantic segmentation and electronic equipment
CN114863207A (en) * 2022-04-14 2022-08-05 北京百度网讯科技有限公司 Pre-training method and device of target detection model and electronic equipment
CN115641510B (en) * 2022-11-18 2023-08-08 中国人民解放军战略支援部队航天工程大学士官学校 Remote sensing image ship detection and identification method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10163043B2 (en) * 2017-03-31 2018-12-25 Clarifai, Inc. System and method for facilitating logo-recognition training of a recognition model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Amir et al., "Modelling Context with User Embeddings for Sarcasm Detection in Social Media", 2016, arXiv, v1607.00976v2, pp 1-11 (Year: 2016) *
Amir et al., "Quantifying Mental Health from Social Media with Neural User Embeddings", 2017, Proceedings of the 2nd Machine Learning for Healthcare Conference, vol 2 (2017), pp 306-321 (Year: 2017) *
Vukotic et al., "Multimodal and Crossmodal Representation Learning from Textual and Visual Features with Bidirectional Deep Neural Networks for Video Hyperlinking", 2016, Proceedings of the 2016 ACM workshop on Vision and Language Integration Meets Multimedia Fusion, vol 2016, pp 37-44 (Year: 2016) *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11907339B1 (en) * 2018-12-13 2024-02-20 Amazon Technologies, Inc. Re-identification of agents using image analysis and machine learning
US11775578B2 (en) 2019-05-30 2023-10-03 Adobe Inc. Text-to-visual machine learning embedding techniques
US11605019B2 (en) * 2019-05-30 2023-03-14 Adobe Inc. Visually guided machine-learning language model
US11604822B2 (en) 2019-05-30 2023-03-14 Adobe Inc. Multi-modal differential search with real-time focus adaptation
US11699108B2 (en) * 2019-05-31 2023-07-11 Maxar Mission Solutions Inc. Techniques for deriving and/or leveraging application-centric model metric
US20200380307A1 (en) * 2019-05-31 2020-12-03 Radiant Analytic Solutions Inc. Techniques for deriving and/or leveraging application-centric model metric
US20200380308A1 (en) * 2019-05-31 2020-12-03 Radiant Analytic Solutions Inc. Techniques for deriving and/or leveraging application-centric model metric
US11657334B2 (en) * 2019-05-31 2023-05-23 Maxar Mission Solutions Inc. Techniques for deriving and/or leveraging application-centric model metric
US20220261430A1 (en) * 2019-12-19 2022-08-18 Fujitsu Limited Storage medium, information processing method, and information processing apparatus
JP7332238B2 (en) 2020-03-10 2023-08-23 SRI International Methods and Apparatus for Physics-Guided Deep Multimodal Embedding for Task-Specific Data Utilization
WO2021227091A1 (en) * 2020-05-15 2021-11-18 南京智谷人工智能研究院有限公司 Multi-modal classification method based on graph convolutional neural network
CN111985520A (en) * 2020-05-15 2020-11-24 南京智谷人工智能研究院有限公司 Multi-mode classification method based on graph convolution neural network
US11874899B2 (en) 2020-12-15 2024-01-16 International Business Machines Corporation Automated multimodal adaptation of multimedia content
US20220277344A1 (en) * 2021-02-26 2022-09-01 Fulian Precision Electronics (Tianjin) Co., Ltd. Advertising method and electronic device using the same
CN114139063A (en) * 2022-01-30 2022-03-04 北京淇瑀信息科技有限公司 User tag extraction method and device based on embedded vector and electronic equipment

Also Published As

Publication number Publication date
US20210295082A1 (en) 2021-09-23
US11610384B2 (en) 2023-03-21
US20190325243A1 (en) 2019-10-24
US11055555B2 (en) 2021-07-06

Similar Documents

Publication Publication Date Title
US20190325342A1 (en) Embedding multimodal content in a common non-euclidean geometric space
WO2020238293A1 (en) Image classification method, and neural network training method and apparatus
CN106682059B (en) Modeling and extraction from structured knowledge of images
Geman et al. Visual Turing test for computer vision systems
GB2544379B (en) Structured knowledge modeling, extraction and localization from images
US20210342643A1 (en) Method, apparatus, and electronic device for training place recognition model
WO2019100724A1 (en) Method and device for training multi-label classification model
US10460033B2 (en) Structured knowledge modeling, extraction and localization from images
US20200380403A1 (en) Visually Guided Machine-learning Language Model
Garcia et al. A dataset and baselines for visual question answering on art
Jiao et al. SAR images retrieval based on semantic classification and region-based similarity measure for earth observation
CN112819023B (en) Sample set acquisition method, device, computer equipment and storage medium
US20200311542A1 (en) Encoder Using Machine-Trained Term Frequency Weighting Factors that Produces a Dense Embedding Vector
Song et al. Multimodal similarity gaussian process latent variable model
US20210406324A1 (en) System and method for providing a content item based on computer vision processing of images
Xu et al. Multi-label learning with fused multimodal bi-relational graph
CN114298122B (en) Data classification method, apparatus, device, storage medium and computer program product
Bai et al. Automatic ensemble diffusion for 3D shape and image retrieval
CN109284414B (en) Cross-modal content retrieval method and system based on semantic preservation
Liang et al. Semisupervised online multikernel similarity learning for image retrieval
Muneesawang et al. Multimedia Database Retrieval
AU2016225819A1 (en) Structured knowledge modeling and extraction from images
Rao et al. Deep learning-based image retrieval system with clustering on attention-based representations
Tonde Supervised feature learning via dependency maximization
Liang et al. AMEMD-FSL: fuse attention mechanism and earth mover’s distance metric network to deep learning for few-shot image recognition

Legal Events

Date Code Title Description
AS Assignment

Owner name: SRI INTERNATIONAL, CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:SIKKA, KARAN;DIVAKARAN, AJAY;KRUK, JULIA;SIGNING DATES FROM 20190410 TO 20190411;REEL/FRAME:048958/0093

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER

STPP Information on status: patent application and granting procedure in general

Free format text: FINAL REJECTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER

STPP Information on status: patent application and granting procedure in general

Free format text: FINAL REJECTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: FINAL REJECTION MAILED

STCV Information on status: appeal procedure

Free format text: NOTICE OF APPEAL FILED

STCV Information on status: appeal procedure

Free format text: NOTICE OF APPEAL FILED

STCV Information on status: appeal procedure

Free format text: APPEAL BRIEF (OR SUPPLEMENTAL BRIEF) ENTERED AND FORWARDED TO EXAMINER

STCV Information on status: appeal procedure

Free format text: EXAMINER'S ANSWER TO APPEAL BRIEF MAILED

STCV Information on status: appeal procedure

Free format text: APPEAL READY FOR REVIEW

STCV Information on status: appeal procedure

Free format text: ON APPEAL -- AWAITING DECISION BY THE BOARD OF APPEALS