US20190325342A1 - Embedding multimodal content in a common non-euclidean geometric space - Google Patents

Embedding multimodal content in a common non-euclidean geometric space Download PDF

Info

Publication number
US20190325342A1
Authority
US
United States
Prior art keywords
content
modality
multimodal
feature vector
embedded
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
US16/383,429
Inventor
Karan Sikka
Ajay Divakaran
Julia Kruk
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SRI International Inc
Original Assignee
SRI International Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SRI International Inc filed Critical SRI International Inc
Priority to US16/383,429 priority Critical patent/US20190325342A1/en
Assigned to SRI INTERNATIONAL reassignment SRI INTERNATIONAL ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: KRUK, JULIA, DIVAKARAN, AJAY, SIKKA, KARAN
Publication of US20190325342A1 publication Critical patent/US20190325342A1/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217Validation; Performance evaluation; Active pattern learning techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/2431Multiple classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/002D [Two Dimensional] image generation
    • G06T11/20Drawing from basic elements, e.g. lines or circles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/22Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/255Detecting or recognising potential candidate objects based on visual cues, e.g. shapes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/191Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19173Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/26Techniques for post-processing, e.g. correcting the recognition result
    • G06V30/262Techniques for post-processing, e.g. correcting the recognition result using context analysis, e.g. lexical, syntactic or semantic context
    • G06V30/274Syntactic or semantic context, e.g. balancing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00Computing arrangements based on specific mathematical models
    • G06N7/01Probabilistic graphical models, e.g. probabilistic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2210/00Indexing scheme for image generation or computer graphics
    • G06T2210/12Bounding box
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/759Region-based matching

Definitions

  • Machine learning relies on models and inference to perform tasks in a computing environment without having explicit instructions.
  • a mathematical model of sample data is constructed using training data to make predictions or choices based on the learned data.
  • the machine learning may be supervised by using training data composed of both input data and output data or may be unsupervised by using training data with only input data. Since machine learning uses computers that operate by interpreting and manipulating numbers, the training data is typically numerical in nature or transformed into numerical values. The numerical values allow the mathematical model to learn the input data. Input or output information that is not in a numerical form may be first transformed into a numerical representation so that it can be processed through machine learning.
  • the purpose of using machine learning is to infer additional information given a set of data.
  • inferred data becomes more difficult to model due to the complexity of the required mathematical model or the incompatibility of the model with a given form of input information.
  • Embodiments of the present principles generally relate to embedding multimodal content into a common geometric space.
  • a method of creating a semantic embedding space for multimodal content for improved recognition of at least one of content, content-related information and events may comprise for each of a plurality of content of the multimodal content, creating a respective, first modality feature vector representative of content of the multimodal content having a first modality using a first machine learning model; for each of a plurality of content of the multimodal content, creating a respective, second modality feature vector representative of content of the multimodal content having a second modality using a second machine learning model; and semantically embedding the respective, first modality feature vectors and the respective, second modality feature vectors in a common geometric space that provides logarithm-like warping of distance space in the geometric space to capture hierarchical relationships between seemingly disparate, embedded modality feature vectors of content in the geometric space; wherein embedded modality feature vectors that are related, across modalities, are closer together in the geometric space than unrelated modality feature vectors.
  • the method may further comprise for each of a plurality of first modality feature vector and second modality feature vector multimodal content pairs, forming a combined multimodal feature vector from the first modality feature vector and the second modality feature vector and semantically embedding the respective, combined multimodal feature vectors in the common geometric space to capture relationships between at least two of the embedded, first modality feature vectors, the embedded, second modality feature vectors and the embedded combined multimodal feature vectors; semantically embedding content-related information, including at least one of user information and user grouping information, in the common geometric space based upon a relationship between the content-related information and at least one embedded, first modality feature vector, one embedded, second modality feature vector and one embedded combined multimodal feature vector; projecting at least one of content, content-related information, and an event into the geometric space and determining at least one embedded feature vector in the geometric space close to the projection as being related to the projected at least one of the content, the content-related information, and the event; wherein a second modality feature vector representative of content of the multimodal content having a second modality is created using information relating to respective content having a first modality.
  • an apparatus for creating a semantic embedding space for multimodal content for improved recognition of at least one of content, content-related information and events may comprise a processor; and a memory coupled to the processor, the memory having stored therein at least one of programs or instructions executable by the processor to configure the apparatus to: for each of a plurality of content of the multimodal content, create a respective, first modality feature vector representative of content of the multimodal content having a first modality using a first machine learning model; for each of a plurality of content of the multimodal content, create a respective, second modality feature vector representative of content of the multimodal content having a second modality using a second machine learning model; and semantically embed the respective, first modality feature vectors and the respective, second modality feature vectors in a common geometric space that provides logarithm-like warping of distance space in the geometric space to capture hierarchical relationships between seemingly disparate, embedded modality feature vectors of content in the geometric space; wherein embedded modality feature vectors that are related, across modalities, are closer together in the geometric space than unrelated modality feature vectors.
  • the apparatus may further comprise wherein the apparatus is further configured to: for each of a plurality of first modality feature vector and second modality feature vector multimodal content pairs, form a combined multimodal feature vector from the first modality feature vector and the second modality feature vector and semantically embed the respective, combined multimodal feature vectors in the common geometric space to capture relationships between at least two of the embedded, first modality feature vectors, the embedded, second modality feature vectors and the embedded combined multimodal feature vectors; wherein the apparatus is further configured to: semantically embed content-related information, including at least one of user information and user grouping information, in the common geometric space based upon a relationship between the content-related information and at least one embedded feature vector; wherein the apparatus is further configured to: project at least one of content, content-related information, and an event into the geometric space and determine at least one embedded feature vector in the geometric space close to the projection as being related to the projected at least one of the content, the content-related information, and the event.
  • a non-transitory computer-readable medium having stored thereon at least one program, the at least one program including instructions which, when executed by a processor, cause the processor to perform a method for creating a semantic embedding space for multimodal content for improved recognition of at least one of content, content-related information and events may comprise for each of a plurality of content of the multimodal content, creating a respective, first modality feature vector representative of content of the multimodal content having a first modality using a first machine learning model; for each of a plurality of content of the multimodal content, creating a respective, second modality feature vector representative of content of the multimodal content having a second modality using a second machine learning model and semantically embedding the respective, first modality feature vectors and the respective, second modality feature vectors in a common geometric space that provides logarithm-like warping of distance space in the geometric space to capture hierarchical relationships between seemingly disparate, embedded modality feature vectors of content in the geometric space; wherein embedded modality feature vectors that are related, across modalities, are closer together in the geometric space than unrelated modality feature vectors.
  • the non-transitory computer-readable medium may further include wherein the processor further, for each of a plurality of first modality feature vector and second modality feature vector multimodal content pairs, forms a combined multimodal feature vector from the first modality feature vector and the second modality feature vector and semantically embeds the respective, combined multimodal feature vectors in the common geometric space to capture relationships between at least two of the embedded, first modality feature vectors, the embedded, second modality feature vectors and the embedded combined multimodal feature vectors; wherein the processor further semantically embeds content-related information, including at least one of user information and user grouping information, in the common geometric space based upon a relationship between the content-related information and at least one embedded feature vector; and/or wherein the processor further: projects at least one of content, content-related information, and an event into the geometric space and determines at least one embedded feature vector in the geometric space close to the projection as being related to the projected at least one of the content, the content-related information, and the event.
  • FIG. 1A depicts embeddings in a Euclidean space in accordance with an embodiment of the present principles.
  • FIG. 1B depicts embeddings in a non-Euclidean space in accordance with an embodiment of the present principles.
  • FIG. 2 depicts a graphical representation of hierarchy information for images that is preserved in non-Euclidean space in accordance with an embodiment of the present principles.
  • FIG. 3 is a method for embedding multimodal content in a common non-Euclidean geometric space according to an embodiment of the present principles.
  • FIG. 4A illustrates that a Euclidean space does not inherently preserve hierarchies in accordance with an embodiment of the present principles.
  • FIG. 4B illustrates that a non-Euclidean space inherently preserves the hierarchies in accordance with an embodiment of the present principles.
  • FIG. 5 depicts a non-Euclidean embedding process in accordance with an embodiment of the present principles.
  • FIG. 6 shows examples of non-Euclidean embedding spaces in accordance with an embodiment of the present principles.
  • FIG. 7 is a graph illustrating results of non-Euclidean embedding versus Euclidean embedding in accordance with an embodiment of the present principles.
  • FIG. 8 is a method of embedding multimodal content and agent information from social media in a common non-Euclidean geometric space in accordance with an embodiment of the present principles.
  • FIG. 9 shows how a standard loss is extended by adding a ranking loss term for cluster center vectors along with a clustering loss in accordance with an embodiment of the present principles.
  • FIG. 10 depicts a deep learning framework in accordance with an embodiment of the present principles.
  • FIG. 11 is an example of three users and the images pinned by the users in accordance with an embodiment of the present principles.
  • FIG. 12 depicts a high level block diagram of a computing device in which a multimodal content embedding system can be implemented in accordance with an embodiment of the present principles.
  • Embodiments of the present principles generally relate to methods, apparatuses, and systems for embedding multimodal content into a common geometric space. While the concepts of the present principles are susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and are described in detail below. It should be understood that there is no intent to limit the concepts of the present principles to the particular forms disclosed. On the contrary, the intent is to cover all modifications, equivalents, and alternatives consistent with the present principles and the appended claims. For example, although embodiments of the present principles will be described primarily with respect to visual concepts, such teachings should not be considered limiting.
  • Machine learning may be expanded to accept inputs that are not numerically based. For example, if an input is the word ‘apple,’ the word may be transformed into a series of numbers such as ⁇ 3, 2, 1 ⁇ where a ‘3’ represents a color red, a ‘2’ represents a round shape, and a ‘1’ represents a fruit. In this manner, non-numerical information may be converted into a numerical representation that can be processed by a mathematical model.
  • Word2vec is a machine learning process/model that produces word embedding vectors, associating each word with numbers that capture a numerical essence of the word. Word2vec produces word embeddings (arrays of numbers) in which words with similar meanings or contexts are physically close to each other in the embedded space.
  • Euclidean distance is the number of graph units between two graphed points or words (i.e., the similarity of words is based on the physical closeness of their graphed points).
  • the distance between graphed words can be described as vectors or a distance with a direction. For example, a vector representing “royal” can be added to a vector representing “man” which yields the result “king.”
  • Moving from one graphed word to another graphed word in space allows one to represent/graph the idea of word relationships which are hard coded “word vectors.” By increasing the number of dimensions, more information may be obtained, but the increased number of dimensions makes it very complex for humans to comprehend/visualize.
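  • As an illustrative sketch (not part of the original disclosure), the word-relationship arithmetic described above can be reproduced with the gensim library; the vector file name below is a placeholder assumption.

```python
# Hypothetical sketch: exploring word2vec-style relationships with gensim.
# The vector file name is an assumption; any pretrained word2vec vectors work.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("word2vec-vectors.bin", binary=True)

# Adding a vector for "royal" to a vector for "man" lands near "king".
print(vectors.most_similar(positive=["royal", "man"], topn=3))

# Physical closeness in the embedded space reflects similarity of meaning/context.
print(vectors.similarity("apple", "fruit"))
```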
  • Word2vec is limited to a single modality (text) for its input and output.
  • the inventors have learned that valuable additional information may be obtained when the content being processed is expanded to include multiple modalities such as text, images, video, audio, etc.
  • the inventors have discovered a way to better understand, exploit, and organize multimodal content in order to extract semantics and to make high-level sense of the multimodal content. This includes extracting short and long-term events from the multimodal content even from unorganized, but structured content, such as social media content.
  • the embodiments of the present principles described herein allow data in multiple modalities to provide rich information that enables interpolation of information, supplies complementary information to fill gaps in individual modalities, and provides performance benefits.
  • the multimodal embeddings provided by the present principles exploit correlations in different modalities and provide explicit vectorial representations for data points without relying on fixed architectures such as late fusion techniques.
  • Non-Euclidean embeddings provide more relevant information than Euclidean embeddings such as word2vec.
  • Real-world data has inherent semantic hierarchies such as plants-flowers-roses or animals-land animals-mammals-horses.
  • In Euclidean multimodal embeddings, unfortunately, such hierarchies are lost because Euclidean mappings are not able to capture distances that grow exponentially in a compact manner.
  • Euclidean space is a flat space that doesn't capture intrinsic (hierarchical) structures.
  • semantics extraction based on Euclidean embeddings tends to be insightful in a small neighborhood of the point of interest but fails to capture the aforementioned inherent semantic hierarchies.
  • the classes in Euclidean space are more tightly packed and are scattered throughout the embedding.
  • a non-Euclidean space such as, for example, a Poincaré space allows distinct classes to form in broader categories such as “plants” and “land animals.”
  • the hierarchical structure of the data is retained in the structure of the embedding.
  • the inventors have found that there is a need to devise embedding methods that retain such semantic hierarchies.
  • Non-Euclidean embeddings such as, for example, hyperbolic embeddings provide a way to capture distances that grow exponentially through a logarithm-like warping of distance space.
  • hyperbolic spaces have tree-like properties that naturally indicate hierarchies and learning can be easily integrated into gradient based optimization.
  • the top of the hierarchy such as plant in plant-flowers-roses, is brought much closer to the bottom of the hierarchy.
  • the hyperbolic embeddings provide a way to capture real-world semantic hierarchies while retaining all of the advantages of unsupervised continuous representation that Euclidean embeddings provide.
  • Hyperbolic embeddings do capture real-world semantic hierarchies, even when realized through a simple approximation that is not completely hyperbolic, and match or exceed the state of the art in retrieval accuracy.
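  • A minimal sketch of the Poincaré (hyperbolic) distance illustrates the logarithm-like warping referenced above: comparable Euclidean steps correspond to rapidly growing hyperbolic distances near the boundary of the unit ball. The specific points below are illustrative only.

```python
# Minimal sketch of the Poincare (hyperbolic) distance between points in the
# open unit ball; distances blow up near the boundary, leaving room to encode
# exponentially growing hierarchies compactly.
import numpy as np

def poincare_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Geodesic distance between two points inside the open unit ball."""
    sq_diff = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return float(np.arccosh(1.0 + 2.0 * sq_diff / denom))

origin = np.zeros(2)
mid = np.array([0.5, 0.0])
near_boundary = np.array([0.95, 0.0])

# Comparable Euclidean steps cover very different hyperbolic distances.
print(poincare_distance(origin, mid))         # moderate
print(poincare_distance(mid, near_boundary))  # much larger
```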
  • the hierarchies provided by non-Euclidean embeddings may be used to contract and/or expand a class (e.g., user community) as well. This is particularly useful during clustering (see below for clustering processes).
  • the hierarchy may be traversed upwards or downwards (e.g., from a community of users to a single user or from a community of users to a select group of users, etc.). Similarly, the hierarchy may be leveraged to move from a single user or select group of users to find a more general group of users or a community of users.
  • attributes may be passed up or down a hierarchy. For example, a user living in a particular state such as, for example, Maine, also lives in the United States. Other attributes may only pass in one direction as, for example, some users may live on the coast of Maine while other users in Maine may not. Other attributes may be metric in nature and others may be continuously variable such as, for example, incomes.
  • FIG. 3 is a method 300 for embedding multimodal content in a common non-Euclidean geometric space according to some embodiments of the present principles.
  • the following examples are based on content with two modalities—text and images.
  • the concepts may be expanded to include any number of modalities including, for example, video, audio, etc.
  • a first machine learning model is trained using content relating to a first modality.
  • the first modality may be, for example, text.
  • the model performance improves as the data set size used for training increases.
  • a first modality feature vector is created using the first machine learning model from an input that has multimodal content.
  • the first modality feature vector represents a first modality feature of the multimodal content.
  • for example, if the multimodal content is an image with a caption, the first modality feature vector may represent the first modality (text) of the multimodal content (the caption portion).
  • a pre-existing single modality model such as, for example, word2vec (Euclidean space) may be used to provide the first feature vector for text modalities.
  • performance may be increased by retraining the word2vec with vectors from a non-Euclidean space.
  • a second machine learning model is trained using content relating to a second modality.
  • the second machine learning model may be trained using images (visual representations—photos, drawings, paintings, etc.) as the content of the second modality.
  • the model performance improves as the data set size used for training increases.
  • a second modality feature vector is created using the second machine learning model from the input that has multimodal content.
  • the second modality feature vector represents a second modality feature of the multimodal content. For example, if the multimodal content is an image with a caption, the second modality feature vector may represent a second modality (image) of the multimodal content.
  • a deep learning neural network is used to create the second modality feature vector.
  • the first modality feature vector of the multimodal content and the second modality feature vector of the multimodal content are semantically embedded in a non-Euclidean geometric space, ending the flow 312 .
  • the mapping of the first feature vector and the second feature vector in a common geometric space allows the inventors to exploit additional meanings obtained from, for example, an image and text that would not be obtainable from the text alone or from the image alone.
  • content, content-related information, and/or an event are projected into the common geometric space. An embedded feature vector in the common geometric space close to the projection is then determined as being related to the projected content, content-related information, and/or event.
  • a multimodal feature vector based on the first modality feature vector and the second modality feature vector is created that represents both the first modality feature of the multimodal content and the second modality feature of the multimodal content.
  • the multimodal feature vector of the multimodal content is embedded in (mapped to) a non-Euclidean geometric space.
  • the embedded multimodal feature vector represents both the image and text in a singular notion that allows the inventors to exploit additional meanings obtained from the combination of image and text that would not be obtainable from the text alone or from the image alone.
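  • The following hedged sketch, which is not the exact architecture of the disclosure, shows one simple way a text feature vector and an image feature vector could be fused into a single multimodal vector and constrained to the open unit ball so it can live in a Poincaré-style embedding space; the layer sizes are assumptions.

```python
# Hedged sketch (not the disclosure's exact architecture): fuse a text feature
# vector and an image feature vector into one multimodal vector and map it
# into the open unit ball for a Poincare-style embedding space.
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    def __init__(self, text_dim=300, image_dim=4096, embed_dim=300):
        super().__init__()
        self.project = nn.Sequential(
            nn.Linear(text_dim + image_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, text_vec, image_vec, eps=1e-5):
        fused = self.project(torch.cat([text_vec, image_vec], dim=-1))
        # Constrain the output to the open unit ball (norm < 1).
        norm = fused.norm(dim=-1, keepdim=True).clamp_min(eps)
        scale = torch.clamp(norm, max=1.0 - eps) / norm
        return fused * scale

fusion = MultimodalFusion()
text_vec = torch.randn(2, 300)    # e.g., caption features
image_vec = torch.randn(2, 4096)  # e.g., CNN image features
print(fusion(text_vec, image_vec).norm(dim=-1))  # all norms < 1
```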
  • the non-Euclidean space retains hierarchies of the different modality features from the multimodal content.
  • the non-Euclidean common geometric space is an embedding space that represents meaning multiplication of the content from the different modalities while preserving hierarchies of the multimodal content.
  • Meaning multiplication denotes that, for example, when someone posts a meme on social media, the meaning is a combination of the image and its caption—the combination yielding meaning greater than the meaning of the image alone or the meaning of the caption alone. Meaning multiplication is discussed further in the examples that follow.
  • the images and text may be embedded using:
  • the Poincaré ball is a realization of hyperbolic space (open d dimensional unit ball) and, in an embodiment, the Poincaré ball can be used to model the hyperbolic embedding space.
  • the image embedding layer is trained in the n-dimensional Poincaré ball.
  • the structure of the loss function and the gradient descent are altered to create a linear projection layer that constrains embedding vectors to the manifold.
  • a pre-trained word2vec model may be used and the results can be projected to the manifold via a few non-linear layers.
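  • Two ingredients commonly used when optimizing embeddings in the Poincaré ball are sketched below: a projection (retraction) that keeps vectors inside the open unit ball, and the Riemannian rescaling of Euclidean gradients. This follows the usual Poincaré-embedding recipe and is not necessarily the exact alteration of the loss and gradient descent described above.

```python
# Hedged sketch of the usual Poincare-ball optimization recipe; this mirrors
# common practice rather than the filing's exact alterations.
import torch

EPS = 1e-5

def project_to_ball(x: torch.Tensor) -> torch.Tensor:
    """Retract points back inside the open unit ball after a gradient step."""
    norm = x.norm(dim=-1, keepdim=True).clamp_min(EPS)
    factor = torch.clamp(norm, max=1.0 - EPS) / norm
    return x * factor

def riemannian_grad(x: torch.Tensor, euclidean_grad: torch.Tensor) -> torch.Tensor:
    """Rescale the Euclidean gradient by the inverse of the Poincare metric."""
    scale = ((1.0 - x.pow(2).sum(dim=-1, keepdim=True)) ** 2) / 4.0
    return scale * euclidean_grad

# One manual "Riemannian SGD" step on a batch of embedding vectors.
emb = torch.randn(8, 300) * 0.1
grad = torch.randn(8, 300)
lr = 0.01
emb = project_to_ball(emb - lr * riemannian_grad(emb, grad))
```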
  • a database with images having semantic tags may be used. Keywords from image captions may then be extracted and used as labels for training the images.
  • a ranking loss algorithm can be adjusted to push similar images and tags (words) together and vice-versa.
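  • A hedged sketch of such an adjusted ranking loss is shown below: matching image/tag pairs are pulled together and mismatched pairs are pushed apart under the hyperbolic distance. The margin value is an assumption for illustration.

```python
# Hedged sketch of a ranking loss that pulls an image embedding toward a
# matching tag embedding and pushes it away from a mismatched one, using the
# hyperbolic (Poincare) distance. The margin value is illustrative.
import torch

def poincare_dist(u, v, eps=1e-5):
    sq = ((u - v) ** 2).sum(dim=-1)
    den = (1 - (u ** 2).sum(dim=-1)).clamp_min(eps) * (1 - (v ** 2).sum(dim=-1)).clamp_min(eps)
    return torch.acosh(1 + 2 * sq / den)

def tag_ranking_loss(image_emb, pos_tag_emb, neg_tag_emb, margin=0.2):
    pos = poincare_dist(image_emb, pos_tag_emb)
    neg = poincare_dist(image_emb, neg_tag_emb)
    return torch.clamp(margin + pos - neg, min=0.0).mean()

img = torch.rand(4, 300) * 0.01
good_tag = torch.rand(4, 300) * 0.01
bad_tag = torch.rand(4, 300) * 0.01
print(tag_ranking_loss(img, good_tag, bad_tag))
```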
  • the mean average precision (MAP) may be output for evaluating the training results.
  • the inventors found that during an evaluation test, the MAP of the Poincaré embedding space rapidly reaches that of a Euclidean embedding space (based on number of iterations) and oftentimes exceeded the Euclidean embedding space as shown in graph 700 of FIG. 7 .
  • the inventors also found that using a pre-trained word2vec was not optimal since the vectors preserve Euclidean structure. In some embodiments, higher performance may be obtained by pre-training the word2vec model with hyperbolic embeddings.
  • FIG. 8 is a method 800 of embedding multimodal content and agent information from social media in a common non-Euclidean geometric space according to an embodiment of the present principles.
  • an agent may be a user that is a human being and/or a machine derived entity (e.g., a bot) and the like.
  • agent and user may be used interchangeably.
  • extracting information from agent postings can prove invaluable. In some embodiments, this is achieved by embedding agent information along with multimodal content in a non-Euclidean geometric space.
  • a social media posting of multimodal content and information relating to the agent (user, bot, etc.) who posted the posting on social media is obtained.
  • postings from Instagram or other social media sites that include multimodal content postings can be used.
  • the agent who did the posting is also readily obtainable (e.g., user avatar, user tagging, etc.).
  • a first machine learning model is trained with content relating to a first modality that may be found in the social media posting (e.g., caption with text, etc.).
  • a first modality feature vector is created from the social media posting that represents a first modality feature of the social media posting using the first machine learning model.
  • the first modality feature vector may represent the words from a caption posted with an image.
  • a second machine learning model is trained with content relating to a second modality that may be found in the social media posting (e.g., an image, etc.).
  • a second modality feature vector is created from the social media posting that represents a second modality feature of the social media posting using the second machine learning model.
  • the second modality feature vector may represent a photo from the social media posting.
  • the method 800 may be used to determine feature vectors for any number of modalities exhibited by the content of the social media posting (e.g., audio, video etc.).
  • the first modality feature vector of the social media posting, the second modality feature vector of the social media posting, and the posting agent information are then semantically embedded in (mapped to) a non-Euclidean geometric space, ending the flow 814.
  • a multimodal feature vector is created based on the first modality feature vector, the second modality feature vector, and the posting agent information; the multimodal feature vector represents the first modality feature of the social media posting, the second modality feature of the social media posting, and the posting agent information.
  • the posting agent information may also include cluster information (e.g., agents/users belonging to a group or distinguishable as a group, etc.).
  • the posting agent information may not be combined into the multimodal feature vector but may be appended as an attribute to the multimodal feature vector.
  • the multimodal feature vector of the social media posting is then embedded in (mapped to) a non-Euclidean geometric space.
  • the inventors have found that by using a model that jointly learns agent and content embedding, additional information can be extracted with regard to the original poster of the content and/or other agents who appear nearby in the embedding space.
  • the model may also be adjusted such that agents are clustered based on their posted and/or associated content.
  • the image-text joint embedding framework is leveraged to create content to user embeddings.
  • Each training example is a tuple of an image and a list of users who posted the image.
  • the objective is to minimize the distance between the image embedding and the user embeddings by altering the ranking loss algorithm.
  • the standard loss 902 is extended by adding a ranking loss term 904 for the cluster center vectors 906 along with a clustering loss 908 .
  • the problem of content recommendation for users in a social network is an important one and is central to fields such as social media marketing.
  • content-centric platforms are solely interest-based, allowing users to easily collect and group content based on their interest.
  • An embedding framework is developed in which every user, described by a set of images, and the images themselves are mapped to a common vector space that ideally captures interest-based similarity between users and images. User embeddings are closer to image embeddings with similar interests and vice-versa. As illustrated in FIG. 10, a deep learning framework 1000 is used to learn these embeddings. Each user is described by a set of images. Deep learning is used to learn a vector for each such user and a mapping from the image space to the interest space such that the user and the images with similar interests are closer together in the embedding space while dissimilar users and images are farther away.
  • the dataset used is a combination of users and images.
  • the following information is available: for each user U_i, there is a given set of images I_i that is posted/pinned by the user, and for each image I_j, there is a given set of users U_j that have posted/pinned this particular image. For many images, there can be multiple users associated with the image and, thus, the image sets of these users will have some overlap.
  • the goal is, for each user U_i and each image I_j, to obtain vectors u_i ∈ ℝ^D and v_j ∈ ℝ^D, respectively, where D is the embedding dimension, set to 300 in this example.
  • the goal is to provide an embedding space such that user and image embeddings that represent similar interests are closer together in the embedding space and dissimilar users and images are farther away. These embeddings can then be utilized for retrieval of similar users, image recommendation, for grouping of similar users, etc.
  • the user category information, although available in the dataset, is not employed while training the networks. It is employed only for evaluating the models.
  • an embedding model referred to as DeViSE is used (see A. Frome, G. Corrado, and J. Shlens, “DeViSE: A deep visual-semantic embedding model,” Advances in Neural Information Processing Systems, pp. 1-11, 2013).
  • an image embedding is trained that maps every image to a word embedding space. This may be achieved using a deep convolutional neural network (DCNN) such as, for example, a VGG-16 (Visual Geometry Group) network, attaching a fully connected layer to transform, for example, a 4096-dimensional image feature to a 300 dimensional vector.
  • This final image embedding layer is trained from scratch while the rest of the network is fine-tuned using a model trained for ImageNet classification.
  • Let the weight matrix for the image embedding layer be denoted by w_I.
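  • A hedged PyTorch/torchvision sketch of this image branch is shown below: a VGG-16 backbone truncated at its 4096-dimensional fully connected layer, followed by a 300-dimensional embedding layer playing the role of w_I. Weight loading and fine-tuning details are assumptions.

```python
# Hedged sketch of the image branch: VGG-16 up to its 4096-d feature, then one
# fully connected layer mapping to a 300-d embedding (the w_I layer above).
import torch
import torch.nn as nn
import torchvision.models as models

class ImageEmbedder(nn.Module):
    def __init__(self, embed_dim=300):
        super().__init__()
        vgg = models.vgg16(weights=None)  # pretrained ImageNet weights could be loaded instead
        # Keep VGG-16 up to its penultimate 4096-d fully connected layer.
        vgg.classifier = nn.Sequential(*list(vgg.classifier.children())[:-1])
        self.backbone = vgg
        self.w_I = nn.Linear(4096, embed_dim)  # embedding layer trained from scratch

    def forward(self, images):
        return self.w_I(self.backbone(images))

model = ImageEmbedder()
print(model(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 300])
```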
  • a dataset such as Microsoft's COCO which contains multiple captions for each image, may be used.
  • the embedding space dimension is set to 300.
  • the word embeddings are not learned but are initialized using GloVe (see J. Pennington, R. Socher, and C. D. Manning, “GloVe: Global Vectors for Word Representation,” EMNLP, 2014). As per DeViSE, ranking loss is employed as the loss function.
  • the embedding vector for each user U_i is obtained as the arithmetic average of the image embedding vectors of the images in I_i. This serves as a strong baseline for this example.
  • the method learns the user embeddings and the image embeddings jointly in a single neural network.
  • the architecture of this network is similar to that of DeViSE and, thus, allows for fair comparison with the baseline.
  • instead of the embedding layer for words, there is an embedding layer for users. That is, initially each user is represented as a one-hot vector and the embedding layer is a fully connected layer, represented by a matrix w_U ∈ ℝ^(N_U × D), that converts this one-hot vector into the desired user embedding. Since the input to this layer is a one-hot vector, the user embedding for user U_i is simply the i-th column of w_U.
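  • Because multiplying a one-hot vector by w_U is equivalent to a table lookup, the user branch can be written as an embedding layer, as in the hedged sketch below (the number of users is illustrative).

```python
# Hedged sketch: a one-hot vector followed by a fully connected layer is
# equivalent to an embedding lookup, so the user branch can be an nn.Embedding
# table of shape (num_users, D). num_users here is illustrative.
import torch
import torch.nn as nn

num_users, D = 10000, 300
user_embedding = nn.Embedding(num_users, D)   # rows play the role of w_U

user_ids = torch.tensor([3, 17, 3])           # users in a mini-batch
print(user_embedding(user_ids).shape)         # torch.Size([3, 300])
```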
  • the ranking loss function (Eq. 3) is minimized similar to DeViSE:
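  • The equation itself is rendered as an image in this text version; a hedged reconstruction of the DeViSE-style ranking loss over users, with the margin symbol γ and the dot-product similarity as assumptions, is:

```latex
% Hedged reconstruction of Eq. 3 (the filing renders the equation as an image);
% the margin \gamma and the dot-product similarity are assumptions.
L_{\text{rank-users}}(I_j) \;=\; \sum_{k \in \mathrm{pos}(I_j)} \; \sum_{i \in \mathrm{neg}(I_j)}
\max\!\bigl(0,\; \gamma - u_k^{\top} v_j + u_i^{\top} v_j\bigr)
```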
  • the negative users indexed by i are all the users in the mini-batch who have not pinned I_j and the positive users indexed by k are all the users who have pinned I_j.
  • a reconstruction loss is also employed; it has been shown to be useful in the case of visual-semantic embeddings in that it makes them more resilient to domain transfer.
  • another fully connected layer is added that takes as input the image embedding; the desired output of the layer is the output image feature of the VGG-16 network, denoted by v_N.
  • the reconstruction loss (Eq. 4) for a single image is given by:
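  • The equation is likewise rendered as an image in this text version; a hedged reconstruction, where f_recon denotes the added fully connected layer and the squared-error form is an assumption, is:

```latex
% Hedged reconstruction of Eq. 4; the name f_recon and the squared-error form
% are assumptions.
L_{\text{recon}}(I_j) \;=\; \bigl\lVert\, v_N - f_{\text{recon}}(v_j) \,\bigr\rVert_2^{2}
```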
  • the loss is computed for a given image embedding vector. Another loss in terms of a given user embedding can also be computed.
  • the final loss function (Eq. 5) used is:
  • L = L_rank-users + λ*L_recon (Eq. 5)
  • some embodiments have a modification which allows learning of a clustering of users in addition to the user embeddings inside the same network. Learning the clusters jointly allows for a better and automatic sharing of information about similar images between user embeddings than what is available explicitly in the dataset.
  • an additional matrix w_C ∈ ℝ^(C × D) is maintained, where C is the number of clusters, a hyperparameter.
  • the loss function proposed previously is further modified. Two other terms are added into the loss function (Eq. 6) given by:
  • L_rank-clusters(·) is the cluster-center analogue of L_rank-users. In effect, it tries to push the image features closer to the right clusters and farther away from the wrong clusters.
  • L_K-means is similar to the K-means loss function, which is used to ensure that the cluster centers are indeed representative of the users assigned to that cluster. Since nearest neighbor computation is not a differentiable operation, the cluster assignments cannot be found inside the network. As an approximation, the cluster assignments are recomputed only once every 1000 iterations, and the user embeddings and the cluster centers are optimized with the fixed cluster assignments.
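  • Because Eq. 6 is rendered as an image in this text version, the following is a hedged sketch of how the combined objective and the periodic cluster reassignment could be organized; the term weights and the choice of negative cluster are assumptions for illustration.

```python
# Hedged sketch of a joint objective with user clusters; the weights lam/alpha/
# beta and the exact term forms are assumptions. Cluster assignments are only
# refreshed periodically because nearest-neighbour assignment is not
# differentiable.
import torch

def rank_loss(anchors, positives, negatives, margin=0.1):
    pos = (anchors * positives).sum(dim=-1)
    neg = (anchors * negatives).sum(dim=-1)
    return torch.clamp(margin - pos + neg, min=0.0).mean()

def kmeans_loss(user_emb, cluster_centers, assignments):
    return ((user_emb - cluster_centers[assignments]) ** 2).sum(dim=-1).mean()

def total_loss(img_emb, pos_users, neg_users, recon, vgg_feat,
               cluster_centers, assignments, lam=1.0, alpha=1.0, beta=1.0):
    l_rank_users = rank_loss(img_emb, pos_users, neg_users)
    l_recon = ((recon - vgg_feat) ** 2).sum(dim=-1).mean()
    # Push images toward the assigned cluster center and away from a wrong one
    # (the "next" cluster is used as an illustrative negative).
    wrong = cluster_centers[(assignments + 1) % len(cluster_centers)]
    l_rank_clusters = rank_loss(img_emb, cluster_centers[assignments], wrong)
    l_kmeans = kmeans_loss(pos_users, cluster_centers, assignments)
    return l_rank_users + lam * l_recon + alpha * l_rank_clusters + beta * l_kmeans

# Every ~1000 iterations the (non-differentiable) assignments are recomputed:
def reassign(user_emb, cluster_centers):
    return torch.cdist(user_emb, cluster_centers).argmin(dim=1)
```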
  • FIG. 11 shows an example of three users 1102 , 1104 , 1106 and the images 1108 , 1110 , 1112 pinned by the users 1102 , 1104 , 1106 , respectively.
  • each user has only one interest.
  • normalized discounted cumulative gain (NDCG) is used to evaluate the retrieval results, where NDCG_k = DCG_k/IDCG_k.
  • IDCG_k is the normalizing factor which corresponds to the best possible retrieval results for a given query.
  • r_i denotes the relevance of the i-th retrieval result, with r_i ∈ {0, 1, 2}, calculated as follows:
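  • A hedged sketch of the NDCG computation described above follows; the 2^r - 1 gain form is a common convention and an assumption here.

```python
# Hedged sketch of NDCG@k: r_i in {0, 1, 2} is the graded relevance of the
# i-th retrieved item and IDCG_k normalizes by the best possible ordering.
import math

def dcg(relevances):
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(relevances))

def ndcg_at_k(relevances, k):
    ideal = sorted(relevances, reverse=True)
    idcg = dcg(ideal[:k])
    return dcg(relevances[:k]) / idcg if idcg > 0 else 0.0

print(ndcg_at_k([2, 0, 1, 2, 0], k=5))
```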
  • cluster homogeneity: a cluster is said to be fully homogeneous if all the elements in that cluster belong to the same class.
  • cluster completeness: a cluster assignment is said to be complete if, for each class, all elements of that class belong to the same cluster.
  • V-measure: the harmonic mean of cluster homogeneity and cluster completeness. This is also equal to the normalized mutual information between the true class labels and the cluster assignments for all the points in the dataset.
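  • These three clustering measures can be computed directly with scikit-learn, as in the hedged usage sketch below (the toy labels are illustrative).

```python
# Hedged usage sketch: scikit-learn computes cluster homogeneity, completeness
# and V-measure from true class labels and predicted cluster assignments.
from sklearn.metrics import homogeneity_completeness_v_measure

true_classes = [0, 0, 1, 1, 2, 2]   # illustrative ground-truth categories
cluster_ids = [1, 1, 0, 0, 0, 2]    # illustrative cluster assignments

h, c, v = homogeneity_completeness_v_measure(true_classes, cluster_ids)
print(f"homogeneity={h:.2f} completeness={c:.2f} v-measure={v:.2f}")
```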
  • Table 1 shows how well the joint embedding and clustering framework performs in terms of the clustering results and how the number of clusters affects the performance of clustering, measured in terms of homogeneity, completeness and V-measure.
  • K-means clustering is performed with random users as the initial cluster centers. As the number of clusters increases, the quality of clusters tends to improve.
  • the baseline approach yields consistently poorer results compared to the proposed frameworks of the present principles.
  • using K-means offline tends to produce better clusters. However, this effect is seen to diminish as the number of clusters is increased to 468 (number of categories) from 32 (number of parent categories).
  • Embodiments in accordance with the teachings can be implemented in hardware, firmware, software, or any combination thereof. Embodiments may also be implemented as instructions stored using one or more machine-readable media, which may be read and executed by one or more processors.
  • a machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computing device or a “virtual machine” running on one or more computing devices).
  • a machine-readable medium may include any suitable form of volatile or non-volatile memory.
  • Modules, data structures, blocks, and the like are referred to as such for ease of discussion, and are not intended to imply that any specific implementation details are required.
  • any of the described modules and/or data structures may be combined or divided into sub-modules, sub-processes or other units of computer code or data as may be required by a particular design or implementation.
  • references herein to rules or templates are not meant to imply any specific implementation details. That is, the multimodal content embedding systems can store rules, templates, etc. in any suitable machine-readable format.
  • In FIG. 12, a simplified high level block diagram of an embodiment of the computing device 1200, in which a multimodal content embedding system can be implemented, is shown. While the computing device 1200 is shown as involving multiple components and devices, it should be understood that in some embodiments, the computing device 1200 can constitute a single computing device (e.g., a mobile electronic device, laptop or desktop computer) alone or in combination with other devices.
  • the illustrative computing device 1200 can be in communication with one or more other computing systems or devices 542 via one or more networks 540.
  • In the embodiment of FIG. 12, a portion 110 A of the multimodal content embedding system can be local to the computing device 510, while another portion 110 B can be distributed across one or more other computing systems or devices 542 that are connected to the network(s) 540.
  • portions of the multimodal content embedding system can be incorporated into other systems or interactive software applications.
  • Such applications or systems can include, for example, operating systems, middleware or framework software, and/or applications software.
  • portions of the multimodal content embedding system can be incorporated into or accessed by other, more generalized system(s) or intelligent assistance applications.
  • the illustrative computing device 1200 of FIG. 12 includes at least one processor 512 (e.g. a microprocessor, microcontroller, digital signal processor, etc.), memory 514 , and an input/output (I/O) subsystem 516 .
  • the computing device 1200 can be embodied as any type of computing device such as a personal computer (e.g., desktop, laptop, tablet, smart phone, body-mounted device, etc.), a server, an enterprise computer system, a network of computers, a combination of computers and other electronic devices, or other electronic devices.
  • the I/O subsystem 516 typically includes, among other things, an I/O controller, a memory controller, and one or more I/O ports.
  • the processor 512 and the I/O subsystem 516 are communicatively coupled to the memory 514 .
  • the memory 514 can be embodied as any type of suitable computer memory device (e.g., volatile memory such as various forms of random access memory).
  • the I/O subsystem 516 is communicatively coupled to a number of hardware components and/or other computing systems including one or more user input devices 518 (e.g., a touchscreen, keyboard, virtual keypad, microphone, etc.), and one or more storage media 520 .
  • the storage media 520 may include one or more hard drives or other suitable data storage devices (e.g., flash memory, memory cards, memory sticks, and/or others).
  • portions of systems software (e.g., an operating system, etc.), framework/middleware (e.g., application-programming interfaces, object libraries, etc.), and the multimodal content embedding system reside at least temporarily in the storage media 520.
  • Portions of systems software, framework/middleware, the multimodal content embedding system can also exist in the memory 514 during operation of the computing device 1200 , for faster processing or other reasons.
  • the one or more network interfaces 532 can communicatively couple the computing device 1200 to a local area network, wide area network, a personal cloud, enterprise cloud, public cloud, and/or the Internet, for example.
  • the network interfaces 532 can include one or more wired or wireless network interface cards or adapters, for example, as may be needed pursuant to the specifications and/or design of the particular computing device 1200 .
  • the other computing device(s) 542 can be embodied as any suitable type of computing device such as any of the aforementioned types of devices or other electronic devices.
  • the other computing devices 542 can include one or more server computers used with the multimodal content embedding system.
  • the computing device 1200 can further optionally include an optical character recognition (OCR) system 528 and an automated speech recognition (ASR) system 530 .
  • each of the foregoing components and/or systems can be integrated with the computing device 1200 or can be a separate component or system that is in communication with the I/O subsystem 516 (e.g., over a network).
  • the computing device 1200 can include other components, subcomponents, and devices not illustrated in FIG. 12 for clarity of the description.
  • the components of the computing device 1200 are communicatively coupled as shown in FIG. 12 by signal paths, which may be embodied as any type of wired or wireless signal paths capable of facilitating communication between the respective devices and components.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)
  • Editing Of Facsimile Originals (AREA)

Abstract

Embedding multimodal content in a common geometric space includes for each of a plurality of content of the multimodal content, creating a respective, first modality feature vector representative of content of the multimodal content having a first modality using a first machine learning model; for each of a plurality of content of the multimodal content, creating a respective, second modality feature vector representative of content of the multimodal content having a second modality using a second machine learning model; and semantically embedding the respective, first modality feature vectors and the respective, second modality feature vectors in a common geometric space that provides logarithm-like warping of distance space in the geometric space to capture hierarchical relationships between seemingly disparate, embedded modality feature vectors of content in the geometric space; wherein embedded modality feature vectors that are related, across modalities, are closer together in the geometric space than unrelated modality feature vectors.

Description

    CROSS-REFERENCE TO RELATED APPLICATIONS
  • This application claims priority to and the benefit of U.S. Provisional Patent Application Ser. No. 62/660,863, filed Apr. 20, 2018, which is incorporated herein by this reference in its entirety.
  • GOVERNMENT RIGHTS
  • This invention was made with Government support under contract number N00014-17-C-1008 awarded by the Office of Naval Research. The Government has certain rights in this invention.
  • BACKGROUND
  • Machine learning relies on models and inference to perform tasks in a computing environment without having explicit instructions. A mathematical model of sample data is constructed using training data to make predictions or choices based on the learned data. The machine learning may be supervised by using training data composed of both input data and output data or may be unsupervised by using training data with only input data. Since machine learning uses computers that operate by interpreting and manipulating numbers, the training data is typically numerical in nature or transformed into numerical values. The numerical values allow the mathematical model to learn the input data. Input or output information that is not in a numerical form may be first transformed into a numerical representation so that it can be processed through machine learning.
  • The purpose of using machine learning is to infer additional information given a set of data. However, as the type of input information becomes more and more diverse, inferred data becomes more difficult to model due to the complexity of the required mathematical model or the incompatibility of the model with a given form of input information.
  • SUMMARY
  • Embodiments of the present principles generally relate to embedding multimodal content into a common geometric space.
  • In some embodiments, a method of creating a semantic embedding space for multimodal content for improved recognition of at least one of content, content-related information and events may comprise for each of a plurality of content of the multimodal content, creating a respective, first modality feature vector representative of content of the multimodal content having a first modality using a first machine learning model; for each of a plurality of content of the multimodal content, creating a respective, second modality feature vector representative of content of the multimodal content having a second modality using a second machine learning model; and semantically embedding the respective, first modality feature vectors and the respective, second modality feature vectors in a common geometric space that provides logarithm-like warping of distance space in the geometric space to capture hierarchical relationships between seemingly disparate, embedded modality feature vectors of content in the geometric space; wherein embedded modality feature vectors that are related, across modalities, are closer together in the geometric space than unrelated modality feature vectors.
  • In some embodiments, the method may further comprise for each of a plurality of first modality feature vector and second modality feature vector multimodal content pairs, forming a combined multimodal feature vector from the first modality feature vector and the second modality feature vector and semantically embedding the respective, combined multimodal feature vectors in the common geometric space to capture relationships between at least two of the embedded, first modality feature vectors, the embedded, second modality feature vectors and the embedded combined multimodal feature vectors; semantically embedding content-related information, including at least one of user information and user grouping information, in the common geometric space based upon a relationship between the content-related information and at least one embedded, first modality feature vector, one embedded, second modality feature vector and one embedded combined multimodal feature vector; projecting at least one of content, content-related information, and an event into the geometric space and determining at least one embedded feature vector in the geometric space close to the projection as being related to the projected at least one of the content, the content-related information, and the event; wherein a, second modality feature vector representative of content of the multimodal content having a second modality is created using information relating to respective content having a first modality; appending content-related information, including at least one of user information and user grouping information, to at least one embedded, first modality feature vector, one embedded, second modality feature vector and one embedded, combined multimodal feature vector; wherein content-related information comprises at least one of agent information or agent grouping information for at least one embedded, first modality feature vector, one embedded, second modality feature vector and one embedded, combined multimodal feature vector; wherein the common geometric space comprises a non-Euclidean space; wherein the non-Euclidean space comprises at least one of a hyperbolic, a Lorentzian, and a Poincaré ball; wherein the multimodal content comprises multimodal content posted by an agent on a social media network; wherein the agent comprises at least one of a robot, a person with a social media account, and a participant in a social media network; and/or inferring information for feature vectors embedded in the common geometric space based on a proximity of the feature vectors to at least one other feature vector embedded in the common geometric space.
  • In some embodiments, an apparatus for creating a semantic embedding space for multimodal content for improved recognition of at least one of content, content-related information and events may comprise a processor; and a memory coupled to the processor, the memory having stored therein at least one of programs or instructions executable by the processor to configure the apparatus to: for each of a plurality of content of the multimodal content, create a respective, first modality feature vector representative of content of the multimodal content having a first modality using a first machine learning model; for each of a plurality of content of the multimodal content, create a respective, second modality feature vector representative of content of the multimodal content having a second modality using a second machine learning model; and semantically embed the respective, first modality feature vectors and the respective, second modality feature vectors in a common geometric space that provides logarithm-like warping of distance space in the geometric space to capture hierarchical relationships between seemingly disparate, embedded modality feature vectors of content in the geometric space; wherein embedded modality feature vectors that are related, across modalities, are closer together in the geometric space than unrelated modality feature vectors.
  • In some embodiments, the apparatus may further comprise wherein the apparatus is further configured to: for each of a plurality of first modality feature vector and second modality feature vector multimodal content pairs, form a combined multimodal feature vector from the first modality feature vector and the second modality feature vector and semantically embed the respective, combined multimodal feature vectors in the common geometric space to capture relationships between at least two of the embedded, first modality feature vectors, the embedded, second modality feature vectors and the embedded combined multimodal feature vectors; wherein the apparatus is further configured to: semantically embed content-related information, including at least one of user information and user grouping information, in the common geometric space based upon a relationship between the content-related information and at least one embedded feature vector; wherein the apparatus is further configured to: project at least one of content, content-related information, and an event into the geometric space and determine at least one embedded feature vector in the geometric space close to the projection as being related to the projected at least one of the content, the content-related information, and the event.
  • In some embodiments, a non-transitory computer-readable medium having stored thereon at least one program, the at least one program including instructions which, when executed by a processor, cause the processor to perform a method for creating a semantic embedding space for multimodal content for improved recognition of at least one of content, content-related information and events may comprise for each of a plurality of content of the multimodal content, creating a respective, first modality feature vector representative of content of the multimodal content having a first modality using a first machine learning model; for each of a plurality of content of the multimodal content, creating a respective, second modality feature vector representative of content of the multimodal content having a second modality using a second machine learning model and semantically embedding the respective, first modality feature vectors and the respective, second modality feature vectors in a common geometric space that provides logarithm-like warping of distance space in the geometric space to capture hierarchical relationships between seemingly disparate, embedded modality feature vectors of content in the geometric space; wherein embedded modality feature vectors that are related, across modalities, are closer together in the geometric space than unrelated modality feature vectors.
  • The non-transitory computer-readable medium may further include wherein the processor further, for each of a plurality of first modality feature vector and second modality feature vector multimodal content pairs, forms a combined multimodal feature vector from the first modality feature vector and the second modality feature vector and semantically embeds the respective, combined multimodal feature vectors in the common geometric space to capture relationships between at least two of the embedded, first modality feature vectors, the embedded, second modality feature vectors and the embedded combined multimodal feature vectors; wherein the processor further semantically embeds content-related information, including at least one of user information and user grouping information, in the common geometric space based upon a relationship between the content-related information and at least one embedded feature vector; and/or wherein the processor further: projects at least one of content, content-related information, and an event into the geometric space and determines at least one embedded feature vector in the geometric space close to the projection as being related to the projected at least one of the content, the content-related information, and the event.
  • Other and further embodiments in accordance with the present principles are described below.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • So that the manner in which the above recited features of the present principles can be understood in detail, a more particular description of the principles, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only typical embodiments in accordance with the present principles and are therefore not to be considered limiting of its scope, for the principles may admit to other equally effective embodiments.
  • FIG. 1A depicts embeddings in a Euclidean space in accordance with an embodiment of the present principles.
  • FIG. 1B depicts embeddings in a non-Euclidean space in accordance with an embodiment of the present principles.
  • FIG. 2 depicts a graphical representation of hierarchy information for images that is preserved in non-Euclidean space in accordance with an embodiment of the present principles.
  • FIG. 3 is a method for embedding multimodal content in a common non-Euclidean geometric space according to an embodiment of the present principles.
  • FIG. 4A illustrates that a Euclidean space does not inherently preserve hierarchies in accordance with an embodiment of the present principles.
  • FIG. 4B illustrates that a non-Euclidean space inherently preserves the hierarchies in accordance with an embodiment of the present principles.
  • FIG. 5 depicts a non-Euclidean embedding process in accordance with an embodiment of the present principles.
  • FIG. 6 shows examples of non-Euclidean embedding spaces in accordance with an embodiment of the present principles.
  • FIG. 7 is a graph illustrating results of non-Euclidean embedding versus Euclidean embedding in accordance with an embodiment of the present principles.
  • FIG. 8 is a method of embedding multimodal content and agent information from social media in a common non-Euclidean geometric space in accordance with an embodiment of the present principles.
  • FIG. 9 shows how a standard loss is extended by adding a ranking loss term for cluster center vectors along with a clustering loss in accordance with an embodiment of the present principles.
  • FIG. 10 depicts a deep learning framework in accordance with an embodiment of the present principles.
  • FIG. 11 is an example of three users and the images pinned by the users in accordance with an embodiment of the present principles.
  • FIG. 12 depicts a high level block diagram of a computing device in which a multimodal content embedding system can be implemented in accordance with an embodiment of the present principles.
  • To facilitate understanding, identical reference numerals have been used, where possible, to designate identical elements that are common to the figures. The figures are not drawn to scale and may be simplified for clarity. It is contemplated that elements and features of one embodiment may be beneficially incorporated in other embodiments without further recitation.
  • DETAILED DESCRIPTION
  • Embodiments of the present principles generally relate to methods, apparatuses, and systems for embedding multimodal content into a common geometric space. While the concepts of the present principles are susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and are described in detail below. It should be understood that there is no intent to limit the concepts of the present principles to the particular forms disclosed. On the contrary, the intent is to cover all modifications, equivalents, and alternatives consistent with the present principles and the appended claims. For example, although embodiments of the present principles will be described primarily with respect to visual concepts, such teachings should not be considered limiting.
  • Machine learning may be expanded to accept inputs that are not numerically based. For example, if an input is the word ‘apple,’ the word may be transformed into a series of numbers such as {3, 2, 1}, where ‘3’ represents the color red, ‘2’ represents a round shape, and ‘1’ represents a fruit. In this manner, non-numerical information may be converted into a numerical representation that can be processed by a mathematical model. Word2vec is a machine learning process/model that produces word embeddings, that is, arrays of numbers capturing the essence of each word, such that words with similar meanings or contexts lie close to each other in the embedded space. Arranging the numbers in arrays allows mathematical operations to be performed on them. For example, adding the vector for “royal” to the vector for “man” combines the essence of the two words to yield a result near “king.” Quantifying words as series of numbers allows machine learning to find a new word similar to the other two based on the numerical properties of each word under the model. The words can then be graphed and compared based on their mathematical properties.
  • Graphing allows mathematical analysis using spatial properties such as Euclidean distance. Euclidean distance is the number of graph units between two graphed points or words; words are judged similar when their graphed points are physically close. The distance between graphed words can be described as a vector, that is, a distance with a direction. For example, a vector representing “royal” can be added to a vector representing “man” to yield the result “king.” Moving from one graphed word to another in space allows one to represent/graph the idea of word relationships as hard-coded “word vectors.” Increasing the number of dimensions captures more information, but it also makes the representation very complex for humans to comprehend/visualize.
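  • The vector-arithmetic idea described above can be illustrated with a minimal NumPy sketch. The three-dimensional vectors below are made up purely for illustration and are not taken from any trained word2vec model; real embeddings are learned and typically have hundreds of dimensions.

```python
import numpy as np

# Toy 3-dimensional "word vectors" (hypothetical values for illustration only).
vectors = {
    "royal": np.array([0.9, 0.1, 0.0]),
    "man":   np.array([0.1, 0.8, 0.1]),
    "king":  np.array([1.0, 0.9, 0.1]),
    "apple": np.array([0.0, 0.1, 0.9]),
}

def cosine(a, b):
    # Cosine similarity: 1.0 means the vectors point in the same direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Vector arithmetic: "royal" + "man" lands closer to "king" than to "apple".
query = vectors["royal"] + vectors["man"]
scores = {w: cosine(query, v) for w, v in vectors.items() if w not in ("royal", "man")}
print(max(scores, key=scores.get))  # -> "king" with these toy values
```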
  • However, Word2vec is limited to a single modality (text) for its input and output. The inventors have learned that valuable additional information may be obtained when the content being processed is expanded to include multiple modalities such as text, images, video, audio, etc. Thus, the inventors have discovered a way to better understand, exploit, and organize multimodal content in order to extract semantics and to make high-level sense of the multimodal content. This includes extracting short and long-term events from the multimodal content even from unorganized, but structured content, such as social media content. The embodiments of the present principles described herein allow data in multiple modalities to provide rich information that allows for interpolation of information by providing complementary information, to fill gaps in individual modalities, and to provide performance benefits. The multimodal embeddings provided by the present principles exploit correlations in different modalities and provide explicit vectorial representations for data points without relying on fixed architectures such as late fusion techniques.
  • The inventors have found that non-Euclidean embeddings provide more relevant information than Euclidean embeddings such as word2vec. Real-world data has inherent semantic hierarchies such as plants-flowers-roses or animals-land animals-mammals-horses. With Euclidean multimodal embeddings, unfortunately, such hierarchies are lost because Euclidean mappings are not able to capture distances that grow exponentially in a compact manner. Euclidean space is a flat space that doesn't capture intrinsic (hierarchical) structures. Thus, semantics extraction based on Euclidean embeddings tends to be insightful in a small neighborhood of the point of interest but fails to capture the aforementioned inherent semantic hierarchies. As illustrated in a view 100A of FIG. 1A, the classes in Euclidean space are more tightly packed and are scattered throughout the embedding. As shown in a view 100B of FIG. 1B, using a non-Euclidean space such as, for example, a Poincaré space allows distinct classes to form in broader categories such as “plants” and “land animals.” The hierarchical structure of the data is retained in the structure of the embedding. Thus, the inventors have found that there is a need to devise embedding methods that retain such semantic hierarchies.
  • Non-Euclidean embeddings such as, for example, hyperbolic embeddings provide a way to capture distances that grow exponentially through a logarithm-like warping of distance space. As shown in a view 200 of FIG. 2, hyperbolic spaces have tree-like properties that naturally indicate hierarchies, and learning can be easily integrated into gradient-based optimization. Thus, the top of the hierarchy, such as plant in plant-flowers-roses, is brought much closer to the bottom of the hierarchy. The hyperbolic embeddings provide a way to capture real-world semantic hierarchies while retaining all of the advantages of the unsupervised continuous representation that Euclidean embeddings provide. Hyperbolic embeddings capture real-world semantic hierarchies even when realized through a simple approximation that is not completely hyperbolic, and they match or exceed the state of the art in retrieval accuracy.
  • The hierarchies provided by non-Euclidean embeddings may be used to contract and/or expand a class (e.g., user community) as well. This is particularly useful during clustering (see below for clustering processes). The hierarchy may be traversed upwards or downwards (e.g., from a community of users to a single user or from a community of users to a select group of users, etc.). Similarly, the hierarchy may be leveraged to move from a single user or select group of users to find a more general group of users or a community of users. In some instances, attributes may be passed up or down a hierarchy. For example, a user living in a particular state such as, for example, Maine, also lives in the United States. Other attributes may only pass in one direction as, for example, some users may live on the coast of Maine while other users in Maine may not. Other attributes may be metric in nature and others may be continuously variable such as, for example, incomes.
  • FIG. 3 is a method 300 for embedding multimodal content in a common non-Euclidean geometric space according to some embodiments of the present principles. For the sake of brevity, the following examples are based on content with two modalities—text and images. However, the concepts may be expanded to include any number of modalities including, for example, video, audio, etc. In block 302, a first machine learning model is trained using content relating to a first modality. In some embodiments, the first modality may be, for example, text. In general, the model performance improves as the data set size used for training increases. In block 304, a first modality feature vector is created using the first machine learning model from an input that has multimodal content. The first modality feature vector represents a first modality feature of the multimodal content. For example, if the multimodal content is an image with a caption, the first modality feature vector may represent a first modality (text) of the multimodal content (caption portion). In some embodiments, a pre-existing single modality model such as, for example, word2vec (Euclidean space) may be used to provide the first feature vector for text modalities. However, the inventors have found that performance may be increased by retraining the word2vec with vectors from a non-Euclidean space.
  • In block 306, a second machine learning model is trained using content relating to a second modality. In some embodiments, the second machine learning model may be trained using images (visual representations such as photos, drawings, paintings, etc.) as the content of the second modality. In general, the model performance improves as the data set size used for training increases. In block 308, a second modality feature vector is created using the second machine learning model from the input that has multimodal content. The second modality feature vector represents a second modality feature of the multimodal content. For example, if the multimodal content is an image with a caption, the second modality feature vector may represent a second modality (image) of the multimodal content. In some embodiments, a deep learning neural network is used to create the second modality feature vector. In block 310, the first modality feature vector of the multimodal content and the second modality feature vector of the multimodal content are semantically embedded in a non-Euclidean geometric space, ending the flow at 312. The mapping of the first feature vector and the second feature vector in a common geometric space allows the inventors to exploit additional meanings obtained from, for example, an image and text that would not be obtainable from the text alone or from the image alone. In some embodiments, content, content-related information, and/or an event are projected into the common geometric space. An embedded feature vector in the common geometric space close to the projection is then determined as being related to the projected content, content-related information, and/or event.
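  • As a concrete, non-limiting illustration of blocks 302-312, the PyTorch sketch below pairs a text encoder with an image-feature encoder and then maps both outputs into the open unit ball so they can be treated as points of a Poincaré model. The module layouts, dimensions, and the simple norm-based mapping are assumptions made for illustration; they are not the specific networks or training procedure of the embodiments.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Stand-in first-modality model (blocks 302/304), e.g. word embeddings + Bi-LSTM."""
    def __init__(self, vocab_size=10000, emb_dim=300, out_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, out_dim, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * out_dim, out_dim)

    def forward(self, token_ids):                 # token_ids: (batch, seq_len)
        h, _ = self.lstm(self.embed(token_ids))   # (batch, seq_len, 2*out_dim)
        return self.proj(h.mean(dim=1))           # one vector per caption

class ImageEncoder(nn.Module):
    """Stand-in second-modality model (blocks 306/308), e.g. CNN features + linear layer."""
    def __init__(self, feat_dim=4096, out_dim=128):
        super().__init__()
        self.fc = nn.Linear(feat_dim, out_dim)

    def forward(self, cnn_features):              # cnn_features: (batch, feat_dim)
        return self.fc(cnn_features)

def to_poincare_ball(x, eps=1e-5):
    """Map arbitrary vectors into the open unit ball (norm < 1) so they can be
    compared with a hyperbolic distance in the common geometric space (block 310)."""
    norm = x.norm(dim=-1, keepdim=True)
    return x / (1.0 + norm + eps)

text_vec = to_poincare_ball(TextEncoder()(torch.randint(0, 10000, (2, 12))))
image_vec = to_poincare_ball(ImageEncoder()(torch.randn(2, 4096)))
```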
  • In some embodiments, a multimodal feature vector based on the first modality feature vector and the second modality feature vector is created that represents both the first modality feature of the multimodal content and the second modality feature of the multimodal content. The multimodal feature vector of the multimodal content is embedded in (mapped to) a non-Euclidean geometric space. The embedded multimodal feature vector, for example, represents both the image and text in a singular notion that allows the inventors to exploit additional meanings obtained from the combination of image and text that would not be obtainable from the text alone or from the image alone. In addition, the inventors have discovered that the non-Euclidean space retains hierarchies of the different modality features from the multimodal content.
  • There are no limits on the number of modalities of the content nor on the number of machine learning models that may be used. Similarly, there are no limits on the number of feature vectors included in the embedded multimodal feature vector (e.g., it may include associated data from more than two modality sources). In some embodiments, additional information may be infused into the multimodal feature vector during embedding processes and/or may be included as an attribute of the embedded multimodal feature vector. Because the content of the first and second modalities is embedded together, the non-Euclidean common geometric space is an embedding space that represents meaning multiplication of the content from the different modalities while preserving hierarchies of the multimodal content. Meaning multiplication denotes that, for example, when someone posts a meme on social media, the meaning is a combination of the image and its caption, with the combination yielding meaning greater than the meaning of the image alone or the meaning of the caption alone. Meaning multiplication is discussed further in the examples that follow.
  • In this manner, words and images are transformed into vectors that are embedded into a common non-Euclidean space. Distance between vectors is small when the vectors are semantically similar. Hierarchies are reflected in the norms of the embedded vectors. As illustrated in a view 400A of FIG. 4A, Euclidean space does not inherently preserve hierarchies as the content is spread across a single plane. However, as shown in a view 400B of FIG. 4B, non-Euclidean space inherently preserves the natural hierarchies of the content. In one example of an embodiment of the present principles, Microsoft's COCO (Common Objects in Context) database (see, T. Lin, C. L. Zitnick, and P. Dollár, "Microsoft COCO: Common Objects in Context," pp. 1-15) was used for training the model. A contrastive loss function with Riemannian SGD (Stochastic Gradient Descent) was used for the embedding. A Lorentzian space model or Poincaré n-ball space model was used as the common non-Euclidean embedding space. A deep learning convolutional neural network (CNN) 504 with fully connected (FC) and linear layers 506 was used to process the images 502. A bidirectional long short-term memory (Bi-LSTM) network 512 was used to process the descriptive text 510 of the image. The results from both processes were then mapped to the non-Euclidean common space 508 as shown in a view 500 of FIG. 5.
  • For a Poincare n-ball model 600A illustrated in FIG. 6, the images and text may be embedded using:
  • $d_p(x, y) = \operatorname{arcosh}\!\left(1 + \dfrac{2\,\lVert x - y \rVert^2}{\left(1 - \lVert x \rVert^2\right)\left(1 - \lVert y \rVert^2\right)}\right)$  (Eq. 1)
  • For a Lorentzian model 600B illustrated in FIG. 6, the images and text may be embedded using:

  • $d_\ell(x, y) = \operatorname{arcosh}\!\left(-\langle x, y \rangle_{\mathcal{L}}\right)$, where $\langle \cdot, \cdot \rangle_{\mathcal{L}}$ denotes the Lorentzian inner product  (Eq. 2)
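  • A small NumPy sketch of the two distance functions is given below. The explicit form of the Lorentzian inner product (a negated first coordinate) is the standard convention for the hyperboloid model and is an assumption here, since only the arcosh form is stated above.

```python
import numpy as np

def poincare_distance(x, y, eps=1e-9):
    """Eq. 1: geodesic distance between two points x, y inside the unit ball."""
    sq = np.sum((x - y) ** 2)
    denom = (1.0 - np.sum(x ** 2)) * (1.0 - np.sum(y ** 2))
    return np.arccosh(1.0 + 2.0 * sq / (denom + eps))

def lorentz_inner(x, y):
    """Lorentzian inner product <x, y>_L = -x0*y0 + sum_i xi*yi (assumed convention)."""
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def lorentz_distance(x, y):
    """Eq. 2: distance on the hyperboloid model, d(x, y) = arcosh(-<x, y>_L)."""
    return np.arccosh(np.clip(-lorentz_inner(x, y), 1.0, None))

# Points near the boundary of the ball are exponentially "far" from the origin,
# which is the logarithm-like warping of distance referred to above.
origin = np.zeros(2)
print(poincare_distance(origin, np.array([0.5, 0.0])))   # moderate distance (~1.10)
print(poincare_distance(origin, np.array([0.99, 0.0])))  # much larger distance (~5.29)
```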
  • The Poincaré ball is a realization of hyperbolic space (the open d-dimensional unit ball) and, in an embodiment, the Poincaré ball can be used to model the hyperbolic embedding space. The image embedding layer is trained in the n-dimensional Poincaré ball. The structure of the loss function and the gradient descent are altered, and a linear projection layer is created to constrain embedding vectors to the manifold. In one example, a pre-trained word2vec model may be used and the results can be projected to the manifold via a few non-linear layers. For training, a database with images having semantic tags may be used. Keywords from image captions may then be extracted and used as labels for training the images. A ranking loss algorithm can be adjusted to push similar images and tags (words) together and vice-versa. The mean average precision (MAP) may be output for evaluating the training results. The inventors found that during an evaluation test, the MAP of the Poincaré embedding space rapidly reached that of a Euclidean embedding space (based on the number of iterations) and oftentimes exceeded it, as shown in graph 700 of FIG. 7. However, the inventors also found that using a pre-trained word2vec model was not optimal since its vectors preserve Euclidean structure. In some embodiments, higher performance may be obtained by pre-training the word2vec model with hyperbolic embeddings.
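  • The constraint that embedding vectors remain on the manifold can be sketched as follows. The metric rescaling factor used here, ((1 - ||v||^2)^2)/4, is the usual Riemannian correction for the Poincaré ball and is stated as an assumption; the exact update rule is not spelled out above.

```python
import numpy as np

def project_to_ball(v, max_norm=1.0 - 1e-3):
    """Re-project a vector into the open unit ball after a gradient update,
    so that it remains a valid point of the Poincaré model."""
    norm = np.linalg.norm(v)
    if norm >= max_norm:
        v = v * (max_norm / norm)
    return v

def riemannian_scale(grad, v):
    """Rescale a Euclidean gradient by ((1 - ||v||^2)^2) / 4, the usual first step
    of Riemannian SGD on the Poincaré ball (an assumption, not the exact update)."""
    factor = ((1.0 - np.dot(v, v)) ** 2) / 4.0
    return factor * grad

# One toy update step for a single embedding vector.
v = np.array([0.3, 0.4])
grad = np.array([2.0, -1.0])
v = project_to_ball(v - 0.1 * riemannian_scale(grad, v))
```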
  • FIG. 8 is a method 800 of embedding multimodal content and agent information from social media in a common non-Euclidean geometric space according to an embodiment of the present principles. For purposes of the discussions herein, an agent may be a user that is a human being and/or a machine derived entity (e.g., a bot) and the like. In some embodiments, agent and user may be used interchangeably. In social media settings, extracting information from agent postings can prove invaluable. In some embodiments, this is achieved by embedding agent information along with multimodal content in a non-Euclidean geometric space. In block 802, a social media posting of multimodal content and information relating to the agent (user, bot, etc.) who posted the posting on social media is obtained. For example, postings from Instagram or other social media sites that include multimodal content postings (e.g., image with a caption, etc.) can be used. Generally, the agent who did the posting is also readily obtainable (e.g., user avatar, user tagging, etc.). In block 804, a first machine learning model is trained with content relating to a first modality that may be found in the social media posting (e.g., caption with text, etc.). In block 806, a first modality feature vector is created from the social media posting that represents a first modality feature of the social media posting using the first machine learning model. For example, the first modality feature vector may represent the words from a caption posted with an image.
  • In block 808, a second machine learning model is trained with content relating to a second modality that may be found in the social media posting (e.g., an image, etc.). In block 810, a second modality feature vector is created from the social media posting that represents a second modality feature of the social media posting using the second machine learning model. For example, the second modality feature vector may represent a photo from the social media posting. There is no limit on the number of possible modalities that may be extracted from the social media posting nor on the number of machine learning models that may be implemented to create modality feature vectors for those modalities. Thus, the method 800 may be used to determine feature vectors for any number of modalities exhibited by the content of the social media posting (e.g., audio, video, etc.). In block 812, the first modality feature vector of the social media posting, the second modality feature vector of the social media posting, and the posting agent information (as a vector and/or attribute) are then semantically embedded in (mapped to) a non-Euclidean geometric space, ending the flow at 814.
  • In some embodiments, a multimodal feature vector is created from the first modality feature vector, the second modality feature vector, and the posting agent information, and represents the first modality feature of the social media posting, the second modality feature of the social media posting, and the posting agent information. In some embodiments, the posting agent information may also include cluster information (e.g., agents/users belonging to a group or distinguishable as a group, etc.). In some embodiments, the posting agent information may not be combined into the multimodal feature vector but may instead be appended as an attribute to the multimodal feature vector. The multimodal feature vector of the social media posting is then embedded in (mapped to) a non-Euclidean geometric space.
  • The inventors have found that by using a model that jointly learns agent and content embeddings, additional information can be extracted with regard to the original poster of the content and/or other agents who appear nearby in the embedding space. The model may also be adjusted such that agents are clustered based on their posted and/or associated content. The image-text joint embedding framework is leveraged to create content-to-user embeddings. Each training example is a tuple of an image and a list of users who posted the image. The objective is to minimize the distance between the image embedding and the user embeddings by altering the ranking loss algorithm. As shown in a view 900 of FIG. 9, the standard loss 902 is extended by adding a ranking loss term 904 for the cluster center vectors 906 along with a clustering loss 908.
  • The problem of content recommendation for users in a social network is an important one and is central to fields such as social media marketing. A distinction is made between user-centric social networks like Twitter and content-centric social networks like Pinterest. Unlike user-centric networks where the primary purpose is developing connections and communication, content-centric platforms are solely interest-based, allowing users to easily collect and group content based on their interest.
  • The following example focuses on content-centric networks such as Pinterest, where each user pins a set of images that highlights their interests. An embedding framework is developed in which every user (described by a set of images) and the images themselves are mapped to a common vector space that ideally captures interest-based similarity between users and images. User embeddings are closer to image embeddings with similar interests and vice-versa. As illustrated in FIG. 10, a deep learning framework 1000 is used to learn these embeddings. Each user is described by a set of images. Deep learning is used to learn a vector for each such user and a mapping from the image space to the interest space such that users and images with similar interests are closer together in the embedding space while dissimilar users and images are farther away.
  • The dataset used is a combination of users and images. For example, let the users in the dataset be represented by $\mathcal{U} = \{U_i,\ i = 1, 2, \ldots, N_U\}$ and let all the images in the dataset be represented by $\mathcal{I} = \{I_j,\ j = 1, 2, \ldots, N_I\}$. The following information is available: for each user $U_i$, there is a given set of images $\mathcal{I}_i \subseteq \mathcal{I}$ that is posted/pinned by the user, and for each image $I_j$, there is a given set of users $\mathcal{U}_j$ that have posted/pinned this particular image. For many images, there can be multiple users associated with the image and, thus, the image sets of these users will have some overlap. The goal is, for each user $U_i$ and each image $I_j$, to obtain vectors $u_i \in \mathbb{R}^D$ and $v_j \in \mathbb{R}^D$ respectively, where $D$ is the embedding dimension, set to 300 in this example. The goal is to provide an embedding space such that user and image embeddings that represent similar interests are closer together in the embedding space and dissimilar users and images are farther away. These embeddings can then be utilized for retrieval of similar users, image recommendation, grouping of similar users, etc. It is noted that the user category information, although available in the dataset, is not employed while training the networks; it is employed only for evaluating the models.
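  • The per-user and per-image sets above are easy to hold in memory as a pair of inverse indexes, as in the short sketch below. The pin records and identifiers are invented for illustration.

```python
from collections import defaultdict

# Hypothetical pin records: (user_id, image_id) pairs standing in for the
# Pinterest-style data described above.
pins = [("u1", "i1"), ("u1", "i2"), ("u2", "i2"), ("u3", "i3")]

images_of_user = defaultdict(set)   # I_i: images pinned by user U_i
users_of_image = defaultdict(set)   # U_j: users who pinned image I_j
for user, image in pins:
    images_of_user[user].add(image)
    users_of_image[image].add(user)

# Overlapping image sets (here, u1 and u2 share image i2) are what ties
# different users together in the learned embedding space.
print(users_of_image["i2"])   # {'u1', 'u2'}
```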
  • In order to construct a simple baseline for this example, an embedding model referred to as DeViSE is used (see, A. Frome, G. Corrado, and J. Shlens, "DeViSE: A deep visual-semantic embedding model," Adv. Neural Inf. Process. Syst., pp. 1-11, 2013). Specifically, an image embedding is trained that maps every image to a word embedding space. This may be achieved using a deep convolutional neural network (DCNN) such as, for example, a VGG-16 (Visual Geometry Group) network, attaching a fully connected layer to transform, for example, a 4096-dimensional image feature to a 300-dimensional vector. This final image embedding layer is trained from scratch while the rest of the network is fine-tuned using a model trained for ImageNet classification. Let the weight matrix for the image embedding layer be denoted by $w_I$. A dataset, such as Microsoft's COCO, which contains multiple captions for each image, may be used. The embedding space dimension, as stated previously, is set to 300. The word embeddings are not learned but are initialized using GloVe (see, J. Pennington, R. Socher, and C. D. Manning, "GloVe: Global Vectors for Word Representation."). As per DeViSE, ranking loss is employed as the loss function. For each positive example, instead of summing over all negative examples, the sum is taken only over the negative examples in the mini-batch; done empirically, this serves as a good approximation to the original loss function. Once the embedding vectors of the images are obtained, the embedding vector for each user $U_i$ is obtained as the arithmetic average of the image embedding vectors of the images in $\mathcal{I}_i$. This serves as a strong baseline for this example.
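  • The baseline's user vectors are simply per-user means of image vectors, as in the sketch below; the identifiers and random vectors are placeholders rather than trained embeddings.

```python
import numpy as np

# image_vecs: hypothetical 300-d image embeddings produced by a trained
# image-embedding layer; user_images: the per-user image sets I_i.
image_vecs = {"i1": np.random.randn(300),
              "i2": np.random.randn(300),
              "i3": np.random.randn(300)}
user_images = {"u1": ["i1", "i2"], "u2": ["i2"], "u3": ["i3"]}

# Baseline user embedding: arithmetic mean of the embeddings of the user's images.
user_vecs = {
    u: np.mean([image_vecs[i] for i in imgs], axis=0)
    for u, imgs in user_images.items()
}
```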
  • In some embodiments, the method learns the user embeddings and the image embeddings jointly in a single neural network. The architecture of this network is similar to that of DeViSE and, thus, allows for fair comparison with the baseline. Instead of the embedding layer for words, there is an embedding layer for users. That is, initially each user is represented as a one-hot vector and the embedding layer is a fully connected layer, represented by a matrix $w_U \in \mathbb{R}^{N_U \times D}$, that converts this one-hot vector into the desired user embedding. Since the input to this layer is a one-hot vector, the user embedding for user $U_i$ is simply the $i$-th column of $w_U$. The ranking loss function (Eq. 3) is minimized similar to DeViSE:
  • $L_{\text{rank-users}}(v_j) = \displaystyle\sum_{k}\sum_{i} \max\left(0,\ 1 - u_k^{T} v_j + u_i^{T} v_j\right)$  (Eq. 3)
  • For each image $I_j$, the negative users indexed by $i$ are all the users in the mini-batch who have not pinned $I_j$ and the positive users indexed by $k$ are all the users who have pinned $I_j$. In addition, a reconstruction loss is employed; it has been shown to be useful in the case of visual-semantic embeddings in that the resulting embeddings are more resilient to domain transfer. In order to incorporate the reconstruction loss, another fully connected layer is added that takes as input the image embedding, and the desired output of the layer is the output image feature of the VGG-16 network, denoted by $v_N$. The reconstruction loss (Eq. 4) for a single image is given by:

  • $L_{\text{recon}} = \left\lVert v_N - w_R\, w_I\, v_N \right\rVert^{2}$  (Eq. 4)
  • In the above loss function, the loss is computed for a given image embedding vector. Another loss in terms of a given user embedding can also be computed. Thus, the final loss function (Eq. 5) used is:

  • $L_1 = L_{\text{rank-users}} + \lambda \cdot L_{\text{recon}}$  (Eq. 5)
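  • A compact PyTorch sketch of Eq. 3 through Eq. 5 follows. The batch handling, layer shapes, and the weight lambda = 0.1 are illustrative assumptions rather than values taken from the description above.

```python
import torch
import torch.nn as nn

emb_dim, feat_dim = 300, 4096
w_I = nn.Linear(feat_dim, emb_dim, bias=False)    # image embedding layer
w_R = nn.Linear(emb_dim, feat_dim, bias=False)    # reconstruction layer

def rank_users_loss(v_j, pos_users, neg_users, margin=1.0):
    """Eq. 3 for one image: hinge ranking loss over all (positive, negative)
    user pairs, with negatives drawn from the current mini-batch.
    v_j: (D,), pos_users: (P, D), neg_users: (N, D)."""
    pos = pos_users @ v_j                          # u_k^T v_j
    neg = neg_users @ v_j                          # u_i^T v_j
    return torch.clamp(margin - pos.unsqueeze(1) + neg.unsqueeze(0), min=0).sum()

def recon_loss(v_N):
    """Eq. 4: reconstruct the CNN feature v_N from its embedding w_I v_N."""
    return ((v_N - w_R(w_I(v_N))) ** 2).sum()

def total_loss(v_N, pos_users, neg_users, lam=0.1):
    """Eq. 5: L_1 = L_rank-users + lambda * L_recon (lambda chosen arbitrarily here)."""
    v_j = w_I(v_N)                                 # image embedding
    return rank_users_loss(v_j, pos_users, neg_users) + lam * recon_loss(v_N)

loss = total_loss(torch.randn(feat_dim), torch.randn(4, emb_dim), torch.randn(12, emb_dim))
```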
  • In addition to the approach outlined above, some embodiments have a modification which allows learning of a clustering of users in addition to the user embeddings inside the same network. Learning the clusters jointly allows for a better and automatic sharing of information about similar images between user embeddings than what is available explicitly in the dataset. To this end, an additional matrix $w_C \in \mathbb{R}^{K \times D}$ is maintained, where $K$, the number of clusters, is a hyperparameter. Each row of $w_C$, represented by $c_l$, $l = 1, 2, \ldots, K$, is the vector representing the cluster center of the $l$-th cluster. Let the clusters of users be denoted by $C_l$, $l = 1, 2, \ldots, K$. In order to learn the cluster centers, the loss function proposed previously is further modified. Two other terms are added into the loss function (Eq. 6), given by:
  • $L_{\text{rank-clusters}}(v_j) = \displaystyle\sum_{l}\sum_{i} \max\left(0,\ 1 - c_l^{T} v_j + c_i^{T} v_j\right)$, $\quad L_{K\text{-means}} = \displaystyle\sum_{l} \sum_{u_c \in C_l} \left(1 - u_c^{T} c_l\right)$  (Eq. 6)
  • The first term, $L_{\text{rank-clusters}}(\cdot)$, is the cluster-center analogue of $L_{\text{rank-users}}$. In effect, it tries to push the image features closer to the right clusters and farther away from the wrong clusters. The second term, $L_{K\text{-means}}$, is similar to the K-means loss function and is used to ensure that the cluster centers are indeed representative of the users assigned to each cluster. Since nearest-neighbor computation is not a differentiable operation, the cluster assignments cannot be found inside the network. As an approximation, the cluster assignments are recomputed only once every 1000 iterations, and the user embeddings and the cluster centers are optimized with the fixed cluster assignments.
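  • The two additional terms and the periodic reassignment step can be sketched as below; the cluster count, dimensions, and random inputs are placeholders.

```python
import torch

def rank_clusters_loss(image_vec, correct_centers, wrong_centers, margin=1.0):
    """First term of Eq. 6: push an image embedding toward its assigned
    cluster centers and away from the others."""
    pos = correct_centers @ image_vec
    neg = wrong_centers @ image_vec
    return torch.clamp(margin - pos.unsqueeze(1) + neg.unsqueeze(0), min=0).sum()

def kmeans_loss(user_vecs, centers, assignments):
    """Second term of Eq. 6: keep each cluster center representative of the
    users assigned to it, i.e. sum of (1 - u_c^T c_l) over assigned users."""
    return (1.0 - (user_vecs * centers[assignments]).sum(dim=1)).sum()

def reassign_clusters(user_vecs, centers):
    """Nearest-center assignment, recomputed only occasionally (e.g., every
    1000 iterations) since the argmax is not differentiable."""
    sims = user_vecs @ centers.t()            # (N_U, K) similarity matrix
    return sims.argmax(dim=1)

users = torch.randn(100, 300)
centers = torch.randn(8, 300)
assign = reassign_clusters(users, centers)
loss = kmeans_loss(users, centers, assign)
```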
  • In one test example, the Pinterest dataset released by Geng et al. is used as a representative dataset (see, X. Geng, H. Zhang, J. Bian, and T. S. Chua, "Learning image and user features for recommendation in social networks," Proc. IEEE Int. Conf. Comput. Vis. (ICCV), pp. 4274-4282, 2015). It contains 46,000 users belonging to 468 interest categories and about 900,000 images pinned by these users. When a user "pins" an image, it indicates that the user has a preference for that image. However, the inverse may not be true. Using the main category list provided by Pinterest, sibling categories with 32 parent categories can be formed. For each user, a list of images pinned by that user is provided. Categories of images are not provided. As discussed above, each image may be pinned by multiple users and this information is also known. FIG. 11 shows an example of three users 1102, 1104, 1106 and the images 1108, 1110, 1112 pinned by the users 1102, 1104, 1106, respectively. In the dataset provided by Geng et al., each user has only one interest.
  • Once the embeddings for each user are obtained, user retrieval accuracy can be measured as follows. An $N_U \times N_U$ distance matrix is computed, from which, for each user $U_j$, the $M$ closest users can be found. As discussed before, the category to which each user belongs is known; thus, the top-$M$ accuracy based on the ground-truth categories can be computed. Normalized discounted cumulative gain (NDCG) is a widely used metric in information retrieval that takes into account the position of the retrieved elements. The formula for calculating NDCG at $k$, i.e., with $k$ retrieved elements (Eq. 7), is given by:
  • $\text{NDCG}_k = \dfrac{1}{\text{IDCG}_k} \displaystyle\sum_{i=1}^{k} \dfrac{2^{r_i} - 1}{\log_2(i + 1)}$,  (Eq. 7)
  • where $\text{IDCG}_k$ is the normalizing factor which corresponds to the best possible retrieval result for a given query, and $r_i$ denotes the relevance of the $i$-th retrieval result, with $r_i \in \{0, 1, 2\}$ calculated as follows:
      • ri=2 if the query and the retrieved result belong to the same category.
      • ri=1 if the query and the retrieved result belong to sibling categories.
      • ri=0 if the query and the retrieved result belong to unrelated categories.
        In the case of image recommendation, the above assumes that the image categories are known. However, in the dataset currently employed, no image-level class labels are available. Thus, the image labels are first determined using the categories of the users who pinned that image. Based on how the image category is determined, there are two variants: (1) the image category is chosen as the most common category among all the user categories; this is a strict metric and is referred to as NDCG-strict; (2) a more relaxed version is obtained by letting the image belong to all the categories of the users who pinned it, and is referred to as NDCG-relaxed. In all the experiments, results are shown for k=5 retrieved elements.
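  • A minimal NumPy sketch of the NDCG computation is shown below. For simplicity it computes the ideal DCG from the retrieved list itself, which is a simplifying assumption rather than the exact normalization used in the evaluation.

```python
import numpy as np

def ndcg_at_k(relevances, k=5):
    """Eq. 7 for one query, given graded relevances (0, 1 or 2) of the
    retrieved results in ranked order."""
    rel = np.asarray(relevances, dtype=float)[:k]
    positions = np.arange(2, rel.size + 2)                  # i + 1 for i = 1..k
    dcg = ((2.0 ** rel - 1.0) / np.log2(positions)).sum()
    ideal = np.sort(rel)[::-1]                              # best possible ordering
    idcg = ((2.0 ** ideal - 1.0) / np.log2(positions)).sum()
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg_at_k([2, 0, 1, 0, 2]))   # retrieval with mixed relevance
print(ndcg_at_k([2, 2, 1, 0, 0]))   # better-ordered retrieval, higher NDCG
```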
  • Three measures, all in the range [0,1], are used; they are popularly used to measure the quality of clusters when the ground truth is known. The first is cluster homogeneity: a cluster is said to be fully homogeneous if all the elements in that cluster belong to the same class. The second is cluster completeness: a cluster assignment is said to be complete if, for each class, all elements of that class belong to the same cluster. The third is the V-measure: the harmonic mean of cluster homogeneity and cluster completeness, which is also equal to the normalized mutual information between the true class labels and the cluster assignments for all the points in the dataset.
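  • These three measures are available off the shelf, for example in scikit-learn; the labels below are invented simply to show the calling convention.

```python
from sklearn.metrics import homogeneity_score, completeness_score, v_measure_score

# Hypothetical ground-truth interest categories and predicted cluster ids
# for six users, just to show how the three measures are computed.
true_categories = [0, 0, 1, 1, 2, 2]
cluster_ids     = [0, 0, 1, 1, 1, 2]

print(homogeneity_score(true_categories, cluster_ids))
print(completeness_score(true_categories, cluster_ids))
print(v_measure_score(true_categories, cluster_ids))     # harmonic mean of the two
```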
  • Table 1 shows how well the joint embedding and clustering framework performs in terms of the clustering results and how the number of clusters affects the performance of clustering, measured in terms of homogeneity, completeness and V-measure. Both in the case of the baseline and in the case where only the user embeddings are learned, K-means clustering is performed with random users as the initial cluster centers. As the number of clusters increases, the quality of clusters tends to improve. The baseline approach yields consistently poorer results compared to the proposed frameworks of the present principles. Also, using K-means offline tends to produce better clusters. However, this effect is seen to diminish as the number of clusters is increased to 468 (number of categories) from 32 (number of parent categories).
  • TABLE 1
    Joint Embedding & Clustering Framework Performance
    Framework                                NDCG-strict    NDCG-relaxed
    Baseline                                 0.1108         0.2047
    Learning user embeddings                 0.1380         0.2318
    Joint user embedding and clustering      0.1426         0.2494
  • In the foregoing description, numerous specific details, examples, and scenarios are set forth in order to provide a more thorough understanding of the present principles. It will be appreciated, however, that embodiments of the principles can be practiced without such specific details. Further, such examples and scenarios are provided for illustration, and are not intended to limit the teachings in any way. Those of ordinary skill in the art, with the included descriptions, should be able to implement appropriate functionality without undue experimentation. References in the specification to “an embodiment,” etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is believed to be within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly indicated. Embodiments in accordance with the teachings can be implemented in hardware, firmware, software, or any combination thereof. Embodiments may also be implemented as instructions stored using one or more machine-readable media, which may be read and executed by one or more processors.
  • A machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computing device or a "virtual machine" running on one or more computing devices). For example, a machine-readable medium may include any suitable form of volatile or non-volatile memory. Modules, data structures, blocks, and the like are referred to as such for ease of discussion, and are not intended to imply that any specific implementation details are required. For example, any of the described modules and/or data structures may be combined or divided into sub-modules, sub-processes or other units of computer code or data as may be required by a particular design or implementation. Further, references herein to rules or templates are not meant to imply any specific implementation details. That is, the multimodal content embedding systems can store rules, templates, etc. in any suitable machine-readable format.
  • Referring to FIG. 12, a simplified high level block diagram of an embodiment of the computing device 1200 in which a multimodal content embedding system can be implemented is shown. While the computing device 1200 is shown as involving multiple components and devices, it should be understood that in some embodiments, the computing device 1200 can constitute a single computing device (e.g., a mobile electronic device, laptop or desktop computer) alone or in combination with other devices. The illustrative computing device 1200 can be in communication with one or more other computing systems or devices 542 via one or more networks 540. In the embodiment of FIG. 12, illustratively, a portion 110A of the multimodal content embedding system can be local to the computing device 510, while another portion 110B can be distributed across one or more other computing systems or devices 542 that are connected to the network(s) 540.
  • In some embodiments, portions of the multimodal content embedding system can be incorporated into other systems or interactive software applications. Such applications or systems can include, for example, operating systems, middleware or framework software, and/or applications software. For example, portions of the multimodal content embedding system can be incorporated into or accessed by other, more generalized system(s) or intelligent assistance applications. The illustrative computing device 1200 of FIG. 12 includes at least one processor 512 (e.g. a microprocessor, microcontroller, digital signal processor, etc.), memory 514, and an input/output (I/O) subsystem 516. The computing device 1200 can be embodied as any type of computing device such as a personal computer (e.g., desktop, laptop, tablet, smart phone, body-mounted device, etc.), a server, an enterprise computer system, a network of computers, a combination of computers and other electronic devices, or other electronic devices.
  • Although not specifically shown, it should be understood that the I/O subsystem 516 typically includes, among other things, an I/O controller, a memory controller, and one or more I/O ports. The processor 512 and the I/O subsystem 516 are communicatively coupled to the memory 514. The memory 514 can be embodied as any type of suitable computer memory device (e.g., volatile memory such as various forms of random access memory). In the embodiment of FIG. 12, the I/O subsystem 516 is communicatively coupled to a number of hardware components and/or other computing systems including one or more user input devices 518 (e.g., a touchscreen, keyboard, virtual keypad, microphone, etc.), and one or more storage media 520. The storage media 520 may include one or more hard drives or other suitable data storage devices (e.g., flash memory, memory cards, memory sticks, and/or others).
  • In some embodiments, portions of systems software (e.g., an operating system, etc.), framework/middleware (e.g., application-programming interfaces, object libraries, etc.), and/or the multimodal content embedding system reside at least temporarily in the storage media 520. Portions of the systems software, the framework/middleware, and/or the multimodal content embedding system can also exist in the memory 514 during operation of the computing device 1200, for faster processing or other reasons. The one or more network interfaces 532 can communicatively couple the computing device 1200 to a local area network, a wide area network, a personal cloud, an enterprise cloud, a public cloud, and/or the Internet, for example. Accordingly, the network interfaces 532 can include one or more wired or wireless network interface cards or adapters, for example, as may be needed pursuant to the specifications and/or design of the particular computing device 1200. The other computing device(s) 542 can be embodied as any suitable type of computing device such as any of the aforementioned types of devices or other electronic devices. For example, in some embodiments, the other computing devices 542 can include one or more server computers used with the multimodal content embedding system.
  • The computing device 1200 can further optionally include an optical character recognition (OCR) system 528 and an automated speech recognition (ASR) system 530. It should be understood that each of the foregoing components and/or systems can be integrated with the computing device 1200 or can be a separate component or system that is in communication with the I/O subsystem 516 (e.g., over a network). The computing device 1200 can include other components, subcomponents, and devices not illustrated in FIG. 12 for clarity of the description. In general, the components of the computing device 1200 are communicatively coupled as shown in FIG. 12 by signal paths, which may be embodied as any type of wired or wireless signal paths capable of facilitating communication between the respective devices and components.
  • In the drawings, specific arrangements or orderings of schematic elements may be shown for ease of description. However, the specific ordering or arrangement of such elements is not meant to imply that a particular order or sequence of processing, or separation of processes, is required in all embodiments. In general, schematic elements used to represent instruction blocks or modules may be implemented using any suitable form of machine-readable instruction, and each such instruction may be implemented using any suitable programming language, library, application-programming interface (API), and/or other software development tools or frameworks. Similarly, schematic elements used to represent data or information may be implemented using any suitable electronic arrangement or data structure. Further, some connections, relationships or associations between elements may be simplified or not shown in the drawings so as not to obscure the teachings herein. While the foregoing is directed to embodiments in accordance with the present principles, other and further embodiments in accordance with the principles described herein may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.

Claims (20)

1. A method of creating a semantic embedding space for multimodal content for improved recognition of at least one of content, content-related information and events, the method comprising:
for each of a plurality of content of the multimodal content, creating a respective, first modality feature vector representative of content of the multimodal content having a first modality, using a first machine learning model;
for each of a plurality of content of the multimodal content, creating a respective, second modality feature vector representative of content of the multimodal content having a second modality, using a second machine learning model; and
semantically embedding the respective, first modality feature vectors and the respective, second modality feature vectors in a common geometric space that provides logarithm-like warping of distance space in the common geometric space to capture hierarchical relationships between seemingly disparate, embedded modality feature vectors of content in the common geometric space;
wherein embedded modality feature vectors that are related, across modalities, are closer together in the common geometric space than unrelated modality feature vectors.
2. The method of claim 1, further comprising:
for each of a plurality of first modality feature vector and second modality feature vector multimodal content pairs, forming a combined multimodal feature vector from the first modality feature vector and the second modality feature vector; and
semantically embedding the respective, combined multimodal feature vectors in the common geometric space to capture relationships between at least two of the embedded, first modality feature vectors, the embedded, second modality feature vectors and the embedded combined multimodal feature vectors.
3. The method of claim 1, further comprising:
semantically embedding content-related information, including at least one of user information and user grouping information, in the common geometric space based upon a relationship between the content-related information and at least one embedded, first modality feature vector, one embedded, second modality feature vector and one embedded combined multimodal feature vector.
4. The method of claim 1, further comprising:
projecting at least one of content, content-related information, and an event into the common geometric space; and
determining at least one embedded feature vector in the common geometric space close to the projection as being related to the projected at least one of the content, the content-related information, and the event.
5. The method of claim 1, wherein a second modality feature vector representative of content of the multimodal content having a second modality is created using information relating to respective content having a first modality.
6. The method of claim 1, further comprising:
appending content-related information, including at least one of user information and user grouping information, to at least one embedded, first modality feature vector, one embedded, second modality feature vector and one embedded, combined multimodal feature vector.
7. The method of claim 1, wherein content-related information comprises at least one of agent information or agent grouping information for at least one embedded, first modality feature vector, one embedded, second modality feature vector and one embedded, combined multimodal feature vector.
8. The method of claim 1, wherein the common geometric space comprises a non-Euclidean space.
9. The method of claim 8, wherein the non-Euclidean space comprises at least one of a hyperbolic, a Lorentzian, and a Poincaré ball.
10. The method of claim 1, wherein the multimodal content comprises multimodal content posted by an agent on a social media network.
11. The method of claim 10, wherein the agent comprises at least one of a computer, robot, a person with a social media account, and a participant in a social media network.
12. The method of claim 1, further comprising:
inferring information for feature vectors embedded in the common geometric space based on a proximity of the feature vectors to at least one other feature vector embedded in the common geometric space.
13. An apparatus for creating a semantic embedding space for multimodal content for improved recognition of at least one of content, content-related information and events, the apparatus comprising:
a processor; and
a memory coupled to the processor, the memory having stored therein at least one of programs or instructions executable by the processor to configure the apparatus to:
for each of a plurality of content of the multimodal content, create a respective, first modality feature vector representative of content of the multimodal content having a first modality using a first machine learning model;
for each of a plurality of content of the multimodal content, create a respective, second modality feature vector representative of content of the multimodal content having a second modality using a second machine learning model; and
semantically embed the respective, first modality feature vectors and the respective, second modality feature vectors in a common geometric space that provides logarithm-like warping of distance space in the common geometric space to capture hierarchical relationships between seemingly disparate, embedded modality feature vectors of content in the common geometric space;
wherein embedded modality feature vectors that are related, across modalities, are closer together in the common geometric space than unrelated modality feature vectors.
14. The apparatus of claim 13, wherein the apparatus is further configured to:
for each of a plurality of first modality feature vector and second modality feature vector multimodal content pairs, form a combined multimodal feature vector from the first modality feature vector and the second modality feature vector; and
semantically embed the respective, combined multimodal feature vectors in the common geometric space to capture relationships between at least two of the embedded, first modality feature vectors, the embedded, second modality feature vectors and the embedded combined multimodal feature vectors.
15. The apparatus of claim 13, wherein the apparatus is further configured to:
semantically embed content-related information, including at least one of user information and user grouping information, in the common geometric space based upon a relationship between the content-related information and at least one embedded feature vector.
16. The apparatus of claim 13, wherein the apparatus is further configured to:
project at least one of content, content-related information, and an event into the common geometric space; and
determine at least one embedded feature vector in the common geometric space close to the projection as being related to the projected at least one of the content, the content-related information, and the event.
17. A non-transitory computer-readable medium having stored thereon at least one program, the at least one program including instructions which, when executed by a processor, cause the processor to perform a method for creating a semantic embedding space for multimodal content for improved recognition of at least one of content, content-related information and events, comprising:
for each of a plurality of content of the multimodal content, creating a respective, first modality feature vector representative of content of the multimodal content having a first modality using a first machine learning model;
for each of a plurality of content of the multimodal content, creating a respective, second modality feature vector representative of content of the multimodal content having a second modality using a second machine learning model; and
semantically embedding the respective, first modality feature vectors and the respective, second modality feature vectors in a common geometric space that provides logarithm-like warping of distance space in the common geometric space to capture hierarchical relationships between seemingly disparate, embedded modality feature vectors of content in the common geometric space;
wherein embedded modality feature vectors that are related, across modalities, are closer together in the common geometric space than unrelated modality feature vectors.
18. The non-transitory computer-readable medium of claim 17, wherein the processor further, for each of a plurality of first modality feature vector and second modality feature vector multimodal content pairs, forms a combined multimodal feature vector from the first modality feature vector and the second modality feature vector; and
semantically embeds the respective, combined multimodal feature vectors in the common geometric space to capture relationships between at least two of the embedded, first modality feature vectors, the embedded, second modality feature vectors and the embedded combined multimodal feature vectors.
19. The non-transitory computer-readable medium of claim 17, wherein the processor further semantically embeds content-related information, including at least one of user information and user grouping information, in the common geometric space based upon a relationship between the content-related information and at least one embedded feature vector.
20. The non-transitory computer-readable medium of claim 17, wherein the processor further:
projects at least one of content, content-related information, and an event into the common geometric space; and
determines at least one embedded feature vector in the common geometric space close to the projection as being related to the projected at least one of the content, the content-related information, and the event.
US16/383,429 2018-04-20 2019-04-12 Embedding multimodal content in a common non-euclidean geometric space Pending US20190325342A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/383,429 US20190325342A1 (en) 2018-04-20 2019-04-12 Embedding multimodal content in a common non-euclidean geometric space

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201862660863P 2018-04-20 2018-04-20
US16/383,429 US20190325342A1 (en) 2018-04-20 2019-04-12 Embedding multimodal content in a common non-euclidean geometric space

Publications (1)

Publication Number Publication Date
US20190325342A1 (en) 2019-10-24

Family

ID=68236900

Family Applications (3)

Application Number Title Priority Date Filing Date
US16/383,447 Active 2039-10-11 US11055555B2 (en) 2018-04-20 2019-04-12 Zero-shot object detection
US16/383,429 Pending US20190325342A1 (en) 2018-04-20 2019-04-12 Embedding multimodal content in a common non-euclidean geometric space
US17/337,093 Active 2039-07-27 US11610384B2 (en) 2018-04-20 2021-06-02 Zero-shot object detection

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US16/383,447 Active 2039-10-11 US11055555B2 (en) 2018-04-20 2019-04-12 Zero-shot object detection

Family Applications After (1)

Application Number Title Priority Date Filing Date
US17/337,093 Active 2039-07-27 US11610384B2 (en) 2018-04-20 2021-06-02 Zero-shot object detection

Country Status (1)

Country Link
US (3) US11055555B2 (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111985520A (en) * 2020-05-15 2020-11-24 南京智谷人工智能研究院有限公司 Multi-mode classification method based on graph convolution neural network
US20200380307A1 (en) * 2019-05-31 2020-12-03 Radiant Analytic Solutions Inc. Techniques for deriving and/or leveraging application-centric model metric
CN114139063A (en) * 2022-01-30 2022-03-04 北京淇瑀信息科技有限公司 User tag extraction method and device based on embedded vector and electronic equipment
US20220261430A1 (en) * 2019-12-19 2022-08-18 Fujitsu Limited Storage medium, information processing method, and information processing apparatus
US20220277344A1 (en) * 2021-02-26 2022-09-01 Fulian Precision Electronics (Tianjin) Co., Ltd. Advertising method and electronic device using the same
US11605019B2 (en) * 2019-05-30 2023-03-14 Adobe Inc. Visually guided machine-learning language model
US11604822B2 (en) 2019-05-30 2023-03-14 Adobe Inc. Multi-modal differential search with real-time focus adaptation
JP7332238B2 (en) 2020-03-10 2023-08-23 SRI International Methods and Apparatus for Physics-Guided Deep Multimodal Embedding for Task-Specific Data Utilization
US11775578B2 (en) 2019-05-30 2023-10-03 Adobe Inc. Text-to-visual machine learning embedding techniques
US11874899B2 (en) 2020-12-15 2024-01-16 International Business Machines Corporation Automated multimodal adaptation of multimedia content
US11907339B1 (en) * 2018-12-13 2024-02-20 Amazon Technologies, Inc. Re-identification of agents using image analysis and machine learning

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11475248B2 (en) * 2018-10-30 2022-10-18 Toyota Research Institute, Inc. Auto-labeling of driving logs using analysis-by-synthesis and unsupervised domain adaptation
US10755099B2 (en) * 2018-11-13 2020-08-25 Adobe Inc. Object detection in images
JP7363107B2 (en) * 2019-06-04 2023-10-18 コニカミノルタ株式会社 Idea support devices, idea support systems and programs
CN110414447B (en) * 2019-07-31 2022-04-15 京东方科技集团股份有限公司 Pedestrian tracking method, device and equipment
CN110750987B (en) * 2019-10-28 2021-02-05 腾讯科技(深圳)有限公司 Text processing method, device and storage medium
US11461953B2 (en) * 2019-12-27 2022-10-04 Wipro Limited Method and device for rendering object detection graphics on image frames
US11328170B2 (en) * 2020-02-19 2022-05-10 Toyota Research Institute, Inc. Unknown object identification for robotic device
CN111428733B (en) * 2020-03-12 2023-05-23 山东大学 Zero sample target detection method and system based on semantic feature space conversion
CN113591872A (en) * 2020-04-30 2021-11-02 华为技术有限公司 Data processing system, object detection method and device
US11836930B2 (en) * 2020-11-30 2023-12-05 Accenture Global Solutions Limited Slip-to-slip connection time on oil rigs with computer vision
CN112749738B (en) * 2020-12-30 2023-05-23 之江实验室 Zero sample object detection method for performing superclass reasoning by fusing context
US11961314B2 (en) * 2021-02-16 2024-04-16 Nxp B.V. Method for analyzing an output of an object detector
CN114166204A (en) * 2021-12-03 2022-03-11 东软睿驰汽车技术(沈阳)有限公司 Repositioning method and device based on semantic segmentation and electronic equipment
CN114863207A (en) * 2022-04-14 2022-08-05 北京百度网讯科技有限公司 Pre-training method and device of target detection model and electronic equipment
CN115641510B (en) * 2022-11-18 2023-08-08 中国人民解放军战略支援部队航天工程大学士官学校 Remote sensing image ship detection and identification method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10163043B2 (en) * 2017-03-31 2018-12-25 Clarifai, Inc. System and method for facilitating logo-recognition training of a recognition model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Amir et al., "Modelling Context with User Embeddings for Sarcasm Detection in Social Media", 2016, arXiv, v1607.00976v2, pp 1-11 (Year: 2016) *
Amir et al., "Quantifying Mental Health from Social Media with Neural User Embeddings", 2017, Proceedings of the 2nd Machine Learning for Healthcare Conference, vol 2 (2017), pp 306-321 (Year: 2017) *
Vukotic et al., "Multimodal and Crossmodal Representation Learning from Textual and Visual Features with Bidirectional Deep Neural Networks for Video Hyperlinking", 2016, Proceedings of the 2016 ACM workshop on Vision and Language Integration Meets Multimedia Fusion, vol 2016, pp 37-44 (Year: 2016) *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11907339B1 (en) * 2018-12-13 2024-02-20 Amazon Technologies, Inc. Re-identification of agents using image analysis and machine learning
US11775578B2 (en) 2019-05-30 2023-10-03 Adobe Inc. Text-to-visual machine learning embedding techniques
US11605019B2 (en) * 2019-05-30 2023-03-14 Adobe Inc. Visually guided machine-learning language model
US11604822B2 (en) 2019-05-30 2023-03-14 Adobe Inc. Multi-modal differential search with real-time focus adaptation
US11699108B2 (en) * 2019-05-31 2023-07-11 Maxar Mission Solutions Inc. Techniques for deriving and/or leveraging application-centric model metric
US20200380307A1 (en) * 2019-05-31 2020-12-03 Radiant Analytic Solutions Inc. Techniques for deriving and/or leveraging application-centric model metric
US20200380308A1 (en) * 2019-05-31 2020-12-03 Radiant Analytic Solutions Inc. Techniques for deriving and/or leveraging application-centric model metric
US11657334B2 (en) * 2019-05-31 2023-05-23 Maxar Mission Solutions Inc. Techniques for deriving and/or leveraging application-centric model metric
US20220261430A1 (en) * 2019-12-19 2022-08-18 Fujitsu Limited Storage medium, information processing method, and information processing apparatus
JP7332238B2 (en) 2020-03-10 2023-08-23 SRI International Methods and Apparatus for Physics-Guided Deep Multimodal Embedding for Task-Specific Data Utilization
WO2021227091A1 (en) * 2020-05-15 2021-11-18 南京智谷人工智能研究院有限公司 Multi-modal classification method based on graph convolutional neural network
CN111985520A (en) * 2020-05-15 2020-11-24 南京智谷人工智能研究院有限公司 Multi-mode classification method based on graph convolution neural network
US11874899B2 (en) 2020-12-15 2024-01-16 International Business Machines Corporation Automated multimodal adaptation of multimedia content
US20220277344A1 (en) * 2021-02-26 2022-09-01 Fulian Precision Electronics (Tianjin) Co., Ltd. Advertising method and electronic device using the same
CN114139063A (en) * 2022-01-30 2022-03-04 北京淇瑀信息科技有限公司 User tag extraction method and device based on embedded vector and electronic equipment

Also Published As

Publication number Publication date
US20210295082A1 (en) 2021-09-23
US11610384B2 (en) 2023-03-21
US20190325243A1 (en) 2019-10-24
US11055555B2 (en) 2021-07-06

Similar Documents

Publication Publication Date Title
US20190325342A1 (en) Embedding multimodal content in a common non-euclidean geometric space
WO2020238293A1 (en) Image classification method, and neural network training method and apparatus
CN106682059B (en) Modeling and extraction from structured knowledge of images
Geman et al. Visual Turing test for computer vision systems
GB2544379B (en) Structured knowledge modeling, extraction and localization from images
US20210342643A1 (en) Method, apparatus, and electronic device for training place recognition model
WO2019100724A1 (en) Method and device for training multi-label classification model
US10460033B2 (en) Structured knowledge modeling, extraction and localization from images
US20200380403A1 (en) Visually Guided Machine-learning Language Model
Garcia et al. A dataset and baselines for visual question answering on art
Jiao et al. SAR images retrieval based on semantic classification and region-based similarity measure for earth observation
CN112819023B (en) Sample set acquisition method, device, computer equipment and storage medium
US20200311542A1 (en) Encoder Using Machine-Trained Term Frequency Weighting Factors that Produces a Dense Embedding Vector
Song et al. Multimodal similarity gaussian process latent variable model
US20210406324A1 (en) System and method for providing a content item based on computer vision processing of images
Xu et al. Multi-label learning with fused multimodal bi-relational graph
CN114298122B (en) Data classification method, apparatus, device, storage medium and computer program product
Bai et al. Automatic ensemble diffusion for 3D shape and image retrieval
CN109284414B (en) Cross-modal content retrieval method and system based on semantic preservation
Liang et al. Semisupervised online multikernel similarity learning for image retrieval
Muneesawang et al. Multimedia Database Retrieval
AU2016225819A1 (en) Structured knowledge modeling and extraction from images
Rao et al. Deep learning-based image retrieval system with clustering on attention-based representations
Tonde Supervised feature learning via dependency maximization
Liang et al. AMEMD-FSL: fuse attention mechanism and earth mover’s distance metric network to deep learning for few-shot image recognition

Legal Events

Date Code Title Description
AS Assignment

Owner name: SRI INTERNATIONAL, CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:SIKKA, KARAN;DIVAKARAN, AJAY;KRUK, JULIA;SIGNING DATES FROM 20190410 TO 20190411;REEL/FRAME:048958/0093

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER

STPP Information on status: patent application and granting procedure in general

Free format text: FINAL REJECTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER

STPP Information on status: patent application and granting procedure in general

Free format text: FINAL REJECTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: FINAL REJECTION MAILED

STCV Information on status: appeal procedure

Free format text: NOTICE OF APPEAL FILED

STCV Information on status: appeal procedure

Free format text: NOTICE OF APPEAL FILED

STCV Information on status: appeal procedure

Free format text: APPEAL BRIEF (OR SUPPLEMENTAL BRIEF) ENTERED AND FORWARDED TO EXAMINER

STCV Information on status: appeal procedure

Free format text: EXAMINER'S ANSWER TO APPEAL BRIEF MAILED

STCV Information on status: appeal procedure

Free format text: APPEAL READY FOR REVIEW

STCV Information on status: appeal procedure

Free format text: ON APPEAL -- AWAITING DECISION BY THE BOARD OF APPEALS