WO2019011936A1 - Method for evaluating an image - Google Patents

Method for evaluating an image

Info

Publication number
WO2019011936A1
Authority
WO
WIPO (PCT)
Prior art keywords
query
image
fragment
candidate
intermodal
Prior art date
Application number
PCT/EP2018/068707
Other languages
French (fr)
Inventor
Katrien LAENEN
Marie-Francine Moens
Susana ZOGHBI
Original Assignee
Katholieke Universiteit Leuven
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from GBGB1711040.4A external-priority patent/GB201711040D0/en
Priority claimed from GBGB1711715.1A external-priority patent/GB201711715D0/en
Priority claimed from GBGB1712758.0A external-priority patent/GB201712758D0/en
Application filed by Katholieke Universiteit Leuven filed Critical Katholieke Universiteit Leuven
Publication of WO2019011936A1 publication Critical patent/WO2019011936A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43Querying
    • G06F16/432Query formulation
    • G06F16/433Query formulation using audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43Querying
    • G06F16/432Query formulation
    • G06F16/434Query formulation using image data, e.g. images, photos, pictures taken by a user
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/48Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5838Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using colour

Definitions

  • the present invention relates to a multimodal image search method and a device for implementing a multimodal image search method.
  • a user may also select filters to narrow the search. However, the user is limited to the filters provided by the website and the desired product attributes may not be among the available filters.
  • An alternative searching method is text-based.
  • the user inputs keywords into a search bar and the website finds relevant products by matching the keywords to words in product descriptions. This requires the user to know which terms are most likely to be used in the product descriptions of the desired products and does not account for different terms used to describe the same attribute. For example, a target product description may specify "distressed jeans" and a user may search for "jeans with holes”; the search will not return the target product.
  • a search for a product attribute which is not present in the product description will not return the product.
  • a further alternative searching method is an image-based search. A user provides an image of a product and the website provides visually similar products. However, the user is restricted to the content of the provided image (used for the search).
  • a computer-implemented searching method for evaluating an image in dependence on a multimodal query.
  • the method includes receiving a multimodal query comprising a query image and a query modifier.
  • the query modifier modifies the query image and/or adds attributes to it, i.e. the query modifier modifies the query represented by the query image and/or adds attributes to it.
  • the query image comprises at least one query image fragment, each of the at least one query image fragments having a corresponding query image fragment intermodal representation in a multimodal space.
  • the query modifier comprises at least one query modifier fragment, each of the at least one query modifier fragments having a corresponding query modifier intermodal representation in the multimodal space.
  • the method further includes receiving a candidate image having at least one candidate image fragment, each of the at least one candidate image fragments having a corresponding candidate image fragment intermodal representation in the multimodal space.
  • the method includes calculating a first similarity between the query image and the candidate image in dependence upon the at least one query image fragment intermodal representation and the at least one candidate image fragment intermodal representation.
  • the method includes calculating a second similarity between the query modifier and the candidate image in dependence upon the at least one query modifier intermodal representation and the at least one candidate image fragment intermodal representation.
  • the method includes calculating an overall similarity between the multimodal query and the candidate image in dependence upon the first similarity and the second similarity.
  • the query image and query modifier are not semantically related.
  • the query modifier modifies an attribute present in the query image and/or adds an attribute not present in the query image. Therefore, the searching method can involve regulating how much the query modifier can change the query image. Relevant images are found by computing the visual similarity between a candidate image and the query image and computing the semantic similarity between a candidate image and the query modifier. Thereto, the multimodal search method can capture the latent semantic correspondences between image regions and query modifiers, such as words.
  • Embodiments of the present invention advantageously provide alternative image-based searching methods.
  • a user provides an image of a product and the website provides visually similar products.
  • the user may be interested in modifying, e.g. changing, removing and/or adding product attributes (that is, modifying attributes in the image), in order to obtain a result which is not visually similar to the search image in all respects, but differs in the modified attribute.
  • the intermodal representation of the desired image is calculated as the sum of the intermodal representations of all of the at least one query image fragments, plus the intermodal representations of all of the query modifier fragments, minus the intermodal representations of attributes that the query modifier fragments will replace, as sketched below.
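  • As an illustration of this arithmetic, a minimal numpy sketch follows; the helper function and the toy attribute vectors are illustrative assumptions, not taken from the patent, and assume the fragments have already been projected into the multimodal space.

```python
import numpy as np

def compose_target_representation(query_image_fragments, modifier_fragments, replaced_attributes):
    """Sum the query image fragment vectors, add the modifier vectors,
    and subtract the vectors of the attributes the modifier replaces."""
    target = np.sum(query_image_fragments, axis=0)
    target = target + np.sum(modifier_fragments, axis=0)
    if replaced_attributes:
        target = target - np.sum(replaced_attributes, axis=0)
    return target

# Hypothetical 4-dimensional toy attribute vectors for illustration only.
stripes, spots, long_sleeves, _ = np.eye(4)
query_fragments = [stripes + long_sleeves]          # striped, long-sleeved query image
target = compose_target_representation(query_fragments, [spots], [stripes])
# target now emphasises spots and long sleeves, with stripes removed.
```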
  • the method may comprise outputting the candidate image in dependence upon the overall similarity.
  • the candidate image may be output if the overall similarity is higher than a predetermined (or relative) threshold.
  • the candidate image may be withheld from outputting if the overall similarity is lower than a predetermined (or relative) threshold.
  • the query modifier may comprise a query text.
  • the query text can modify the query represented by the query image and/or add one or more attributes to it.
  • the query modifier may comprise an audio query including at least one spoken word.
  • the method may further comprise, after receiving the query modifier, converting the audio query to a query text.
  • An intermodal representation may correspond to a vector in the multimodal space.
  • Calculating the first similarity or the second similarity may comprise calculating a similarity metric of vectors in the multimodal space.
  • calculating the similarity includes calculating an inner product of a first vector representing the candidate image and a second vector representing the query.
  • the inner product of a query modifier fragment and candidate image fragment is used to require every query modifier attribute to be present in the retrieved images.
  • the method may comprise determining the query image fragment intermodal representation(s) in dependence upon at least one image attribute of the query image.
  • the method may further comprise, after receiving the query image, extracting the at least one image attribute of the query image using an image segmentation method, optionally a rule-based image segmentation method.
  • the method may comprise determining the query modifier fragment intermodal representation(s) in dependence upon at least one attribute of the query modifier.
  • the method may further comprise, after receiving the query modifier, extracting the at least one attribute of the query modifier using a segmentation method, optionally a rule-based text segmentation method.
  • the method may comprise determining the candidate image fragment intermodal representation(s) in dependence upon at least one image attribute of the candidate image.
  • the method may further comprise, after receiving the candidate image, extracting the at least one image attribute of the candidate image using an image segmentation method, optionally a rule-based image segmentation method.
  • the overall similarity may be equal to the aggregation of the first similarity and the second similarity.
  • the overall similarity may be equal to a weighted aggregation of the first similarity and the second similarity.
  • the query image and candidate image may each illustrate at least one item which can be provided in various implementations (e.g. have variations) or have several attributes.
  • the query image and candidate image may each illustrate an object of a type wherein the object has at least one attribute which is visually distinguishable between objects of the type.
  • an image of a dress is an image which illustrates an object: the dress.
  • the object has a type: that is, being an object of the class of 'dresses'.
  • the object has at least one attribute: for example, the length of the dress.
  • the at least one attribute is visually distinguishable between objects of the type: for example, a dress having a short length and a dress having a long length are visually distinguishable, that is, they can be distinguished by a human who views the images, or by a computer which can perform image analysis on the images.
  • the query image and candidate image may each illustrate at least one fashion item.
  • the query image and the candidate image may each illustrate at least one toy, car, item of furniture, food item, house, electronic device or accessory, for example a laptop bag or mobile telephone or tablet case or cover.
  • the multimodal space may be induced by a neural network.
  • a computer-implemented method for selecting an image from a plurality of images in dependence on a multimodal query includes receiving a multimodal query comprising a query image and a query modifier, wherein the query modifier modifies the query represented by the query image and/or adds attributes to it.
  • the query image comprises at least one query image fragment, each of the at least one query image fragments having a corresponding query image fragment intermodal representation in a multimodal space.
  • the query modifier comprises at least one query modifier fragment, each of the at least one query modifier fragments having a corresponding query modifier fragment intermodal representation in the multimodal space.
  • the method includes receiving a plurality of candidate images, each comprising at least one candidate image fragment, each of the at least one candidate image fragments having a corresponding candidate image fragment intermodal representation in the multimodal space.
  • the method includes for each of the candidate images calculating a first similarity between the query image and said candidate image in dependence upon at least one query image fragment intermodal representation and at least one candidate image fragment intermodal representation.
  • the method includes calculating a second similarity between the query modifier and said candidate image in dependence upon at least one query modifier fragment intermodal representation and at least one candidate image fragment intermodal representation.
  • the method includes calculating an overall similarity between the multimodal query and said candidate image in dependence upon the first similarity and the second similarity.
  • the method includes selecting from the plurality of candidate images at least one candidate image having highest overall similarity.
  • the method includes for each candidate image adding to a candidate image fragment intermodal representation of said candidate image an attribute retrieved from candidate image fragment intermodal representations of other candidate images that are visually similar to the said candidate image.
  • the method includes for each candidate image removing from a candidate image fragment intermodal representation of said candidate image an attribute not occurring in any image fragment intermodal representation of other candidate images that are visually similar to the said candidate image.
  • a computer-implemented method for selecting an image from a plurality of images in dependence on a multimodal query includes receiving a multimodal query comprising a query image and a query modifier, wherein the query modifier modifies the query represented by the query image and/or adds attributes to it, wherein the query image has a corresponding query image intermodal representation in a multimodal space and the query modifier has a corresponding query modifier intermodal representation in the multimodal space.
  • the method includes receiving a plurality of candidate images, each having a corresponding candidate image intermodal representation in the multimodal space.
  • the method includes ranking the candidate images in dependence upon their relevance to the multimodal query which includes the query image and the query modifier.
  • the method includes selecting, from the plurality of candidate images at least one candidate image having highest ranking.
  • the method includes segmenting the query image and the candidate images, producing one or more query image fragments for the query image and one or more candidate image fragments for the candidate image.
  • the method includes segmenting the query modifier, producing one or more query modifier fragments, the query modifier fragments referring to attributes to be added and/or interchanged with attributes of the query image.
  • the query image fragments, query modifier fragments and candidate image fragments can be represented using intermodal representations as e.g. inferred by a neural network.
  • determining the relevance of the candidate image includes determining a cosine similarity measure, to measure both the visual similarity of the query image, or a fragment thereof, and the candidate image, or a fragment thereof, and the semantic similarity of the candidate image, or a fragment thereof, and the query modifier, or a fragment thereof.
  • a candidate image is considered to be relevant if it is visually similar to the query image and exhibits every attribute expressed by the query modifier. However, in some embodiments, it may be that every attribute of the query image and/or the query modifier need not be present in the candidate image for the candidate image to be considered to be relevant.
  • a candidate image satisfies a query modifier if every query modifier fragment is shown in at least one image fragment of the candidate image.
  • the relevance of a candidate image may depend on the number of candidate image fragments which display attributes corresponding to query modifier fragments.
  • a weight may be given to the query modifier relative to the query image. A smaller weight will give more relevance to a candidate image which is similar to the query image, whereas a larger weight will give more relevance to a candidate image which satisfies the query modifier but is less similar to the query image.
  • the weighting term may be chosen based on a validation set.
  • the searching method may be implemented using a multimodal search system.
  • the multimodal search system may include a central processing unit, memory, input devices (e.g., keyboard and pointing devices), output devices (e.g., display devices), and storage devices (e.g., disk drives).
  • the memory and storage devices are computer-readable media that may contain instructions that implement the multimodal search system.
  • the data structures may be stored or transmitted via a data transmission medium, such as a signal on a communications link.
  • Various communications links may be used to connect components of the system, such as the Internet, a local area network, a wide area network, a point-to-point dial-up connection, a cell phone network, and so on.
  • the multimodal search system may be implemented in various operating environments that include personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, programmable consumer electronics, digital cameras, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and so on.
  • the devices may include cell phones, personal digital assistants, smart phones, personal computers, programmable consumer electronics, digital cameras, and so on.
  • a computer-implemented method for identifying an image includes receiving a multimodal query comprising a query image and a query modifier, wherein the query modifier modifies the query represented by the query image and/or adds attributes to it, wherein the query image has a corresponding query image intermodal representation in a multimodal space and the query modifier has a corresponding query modifier intermodal representation in the multimodal space.
  • the method includes receiving a plurality of candidate images, each having a corresponding candidate image intermodal representation in the multimodal space.
  • the method includes determining for each of the candidate images a measure of correspondence to the multimodal query which includes the query image and the query modifier.
  • the method includes identifying, from the plurality of candidate images at least one candidate image having a highest measure of correspondence.
  • the method allows identification, from the plurality of candidate images, of the at least one candidate image having the closest conformity to the query image as modified according to the query modifier.
  • According to a fifth aspect of the present invention, there is provided a computer-readable medium containing instructions for performing a method according to the first aspect.
  • a user can provide a multimodal query and receive one or more query results using a mobile computing device, such as a smartphone.
  • This can allow a user to search 'on-the-go', for example, by taking a photograph of an object or of an image of an object seen in a physical shop, on the street, on a television screen or advertising billboard, in a magazine or any other location, and searching for objects which are similar to the object but have one or more attributes modified and/or added.
  • a user can provide a multimodal query to a computer- readable medium containing instructions for performing a method according to the first aspect without requiring that the computer-readable medium is comprised in a mobile computing device.
  • a multimodal query may be composed by a user on a mobile computing device and transmitted wirelessly to a server which comprises a computer-readable medium containing instructions for performing a method according to the first aspect. This means that computationally intensive steps are not required to take place in the mobile computing device. This can allow search results to be evaluated on a server which may have greater processing power, and thus search results may be provided more quickly than if the search is performed within the mobile computing device.
  • a system including functional units arranged for performing the steps of the above methods.
  • Such system may include a user device and optionally a server.
  • the system may include a processor and software code portions arranged for causing the processor to perform the method steps.
  • According to a seventh aspect of the present invention, there is provided use of a method according to the first aspect or a device according to the second aspect in an e-commerce setting.
  • Figure 1 is a schematic representation of a multimodal query
  • Figure 2 illustrates image fragments comprised in an image
  • Figure 3 illustrates text fragments comprised in a query text
  • Figure 4 illustrates a candidate image
  • Figure 5 illustrates image fragments comprised in a candidate image
  • Figure 6 is a flow chart showing a multimodal image search process
  • Figure 7 shows a table listing example approximate positions of example image fragments and corresponding attributes which may be expected to be found in image fragments
  • Figure 8 schematically illustrates image and text fragments, the image fragment cluster representative, their embedding or neural network representation, and a representation of their inner products
  • Figure 9 is a table showing a comparison of different multimodal retrieval models
  • Detailed description of preferred embodiments
  • first, second, third and the like in the description and in the claims are used for distinguishing between similar elements and not necessarily for describing a sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and that the embodiments of the invention described herein are capable of operation in other sequences than described or illustrated herein. Moreover, the terms top, bottom, over, under and the like in the description and the claims are used for descriptive purposes and not necessarily for describing relative positions. It is to be understood that the terms so used are interchangeable under appropriate circumstances and that the embodiments of the invention described herein are capable of operation in other orientations than described or illustrated herein.
  • the query comprises a query image 1 and a query modifier 2.
  • the query modifier 2 takes the form of query text 3 which specifies "find it with spots and long sleeves".
  • the query modifier 2 may take the form of an audio or speech query, for example a spoken phrase.
  • the query image 1 includes a dress 4 which has characteristics, or attributes, such as dress length, colour, pattern (in this case, stripes), sleeve length.
  • the query modifier 2 modifies the query represented by the query image by specifying at least one characteristic or attribute, in this case "spots", which for example is not present in the query image 1.
  • the query modifier 2 may also include words, speech, audio, or terms which are not characteristics or attributes.
  • the query text 3 includes the words "find it with” which are not characteristics or attributes of a product.
  • the query image 1 may be divided into at least one image fragment 5, in which at least one image attribute is likely to be found.
  • the fragments 5 may overlap, that is, a portion of a first image fragment 5₁ may be present in a second image fragment 5₂.
  • the query text 3 may be divided into at least one text fragment 6, each text fragment containing a textual attribute.
  • a textual attribute can be any word or phrase which describes an attribute of a product to be searched for and is in general an attribute which is not present in the query image 1.
  • text fragment 6₁ specifies "spots"
  • text fragment 6₂ specifies "long sleeves".
  • each image fragment 5 is represented by a corresponding query image fragment intermodal representation.
  • Each modifier, e.g. text fragment 6, is represented by a corresponding query modifier fragment intermodal representation, e.g. text fragment intermodal representation.
  • the intermodal representations are preferably representations within a multimodal space wherein an image fragment and a text fragment have a high similarity as computed in the multimodal space if they represent the same attribute.
  • the multimodal space may be a vector space and calculating the similarity may include calculating an inner product.
  • the image fragment 5₁ includes the top of the dress 4.
  • the top of the dress shows a dress pattern which includes stripes but does not include spots.
  • the text fragment 6₁ specifies "spots".
  • the intermodal representation of image fragment 5₁ and the intermodal representation of text fragment 6₁ would have a low value of similarity in the multimodal representation as they do not represent the same attribute, that is, the image fragment 5₁ represents the attribute of stripes and the text fragment 6₁ represents the attribute of spots.
  • an intermodal representation of a text fragment specifying "stripes" would have a high value of similarity with the intermodal representation of image fragment 5₁, as both fragments represent the attribute of stripes.
  • a similarity of the candidate image 7 with the multimodal query is a measure of how closely attributes of the candidate image 7 match both the query image attributes and the query text attributes.
  • the candidate image 7 may be divided into at least one image fragment 8, in which at least one image attribute is likely to be found.
  • Each image fragment 8 is represented by a corresponding candidate image fragment intermodal representation in the multimodal space.
  • the similarity of the candidate image 7 with the multimodal query may then be calculated based upon both a visual similarity of query image fragments 5 and candidate image fragments 8, and a semantic similarity of query text fragments 6 and candidate image fragments 8.
  • the visual similarity can be calculated in dependence on a similarity value of at least one query image fragment intermodal representation and at least one candidate image fragment intermodal representation.
  • the semantic similarity can be calculated in dependence on a similarity value of at least one query text fragment intermodal representation and at least one candidate image fragment intermodal representation.
  • An image fragment and a text fragment are considered to be semantically similar if they represent the same object and/or concept. For example, the phrase "long sleeves" is semantically similar to a fragment of an image showing the sleeves of a long-sleeved dress.
  • the multimodal query is received (step Si).
  • the multimodal query comprises the query image 1 and query modifier 2.
  • the query image 1 has at least one corresponding query image fragment intermodal representation in a multimodal space.
  • the query modifier 2 has at least one corresponding query modifier fragment intermodal representation in the multimodal space.
  • a candidate image is received (step S2).
  • the candidate image has at least one candidate image fragment intermodal representation in the multimodal space.
  • the candidate image may be selected from a database of potential candidate images.
  • the candidate image may be selected at random.
  • the candidate image may be selected from a subset of candidate images in a database of potential candidate images.
  • a first similarity between the query image and the candidate image is calculated in dependence upon at least one of the at least one query image fragment intermodal representation(s) and at least one of the at least one candidate image fragment intermodal representation(s) (step S3).
  • a second similarity between the query modifier and the candidate image is calculated in dependence upon at least one of the at least one query modifier fragment intermodal representation(s) and at least one of the at least one candidate image fragment intermodal representation(s) (step S4).
  • An overall similarity between the multimodal query and the candidate image is calculated in dependence upon the first similarity and the second similarity (step S5).
  • the candidate image may be provided as an output image which satisfies the multimodal query.
  • an output image may be an image having an overall similarity which is greater than or equal to a similarity threshold.
  • the similarity threshold may be predetermined.
  • the similarity threshold may be controllable, that is, a user may be able to change the similarity threshold if desired. If the candidate image has an overall similarity with the multimodal query which is less than a similarity threshold, the candidate image may not be provided as output. A further, different candidate image may be received and the method may repeat from step S3.
  • a set of candidate images may each be evaluated for overall similarity with the multimodal query and the set of candidate images may be ranked in accordance with their overall similarity.
  • the candidate images may be provided in a list where the position of a candidate image in the list depends upon the overall similarity of that candidate image with the multimodal query.
  • a set of K candidate images may each be evaluated for overall similarity with the multimodal query and a subset of K' images may be provided as output, where K' is less than K and images in the subset are chosen as the K' images having the largest value of overall similarity. For example, if the K images are ranked in descending order of similarity, then the first K' images in the ranking would form the subset.
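  • As a minimal sketch of this ranking and top-K' selection, assuming an overall_similarity function that implements steps S3 to S5 (the function name and data structures are illustrative, not part of the patent):

```python
def rank_candidates(query, candidates, overall_similarity, k_out):
    """Evaluate every candidate, rank by overall similarity, keep the top k_out."""
    scored = [(overall_similarity(query, candidate), candidate) for candidate in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)   # descending overall similarity
    return [candidate for _, candidate in scored[:k_out]]
```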
  • the method may be repeated for a plurality of candidate images and the candidate images may be ranked in order of overall similarity and may all be provided as output images.
  • a subset of a plurality of candidate images may be provided as output images in dependence upon the overall similarity of each of the plurality of candidate images.
  • the multimodal space in preferred embodiments is induced by a neural network which learns parameters to project an image fragment and a text fragment to their corresponding intermodal representations in the common, multimodal space.
  • the inner product or other similarity metric of an image fragment representation and a text fragment representation is a measure of their semantic similarity. If the inner product is positive, the image fragment and the text fragment are considered to represent the same attribute. If the inner product is negative, the image fragment and the text fragment are considered to represent different attributes. The more positive or negative the inner product, the greater is the certainty of the semantic similarity.
  • the neural network preferably learns parameters for projection to the intermodal representations by using a set of training data.
  • the training data includes images and associated textual descriptions of the images.
  • an image segmentation or a text segmentation can be performed as described in the following.
  • any appropriate image segmentation method may be used in order to determine attributes of an image, regardless of whether the image is a query image 1, a candidate image 7, or an image in the set of test, training, or validation data (not shown).
  • the Selective Search method may be used.
  • a first image segmentation method may be used for segmenting images in the training data and the same image segmentation method may be used for segmenting a query image.
  • a second, different segmentation method may be used for segmenting a query image.
  • the attributes of an image 1, 7 can be extracted using an image segmentation method according to embodiments of the present invention.
  • the exemplary images 1, 7 each include a dress and the segmentation method exploits the geometry common to images of dresses.
  • the overall shape of dresses is generally similar, but a variety of attributes such as length, neckline, colors can vary between individual dresses.
  • a bounding box is preferably determined which encloses the full dress. This may be achieved through a thresholding process which determines which pixels belong to the dress and which to the background. The region inside the bounding box is assigned to be a first image fragment.
  • the region inside the bounding box is further divided into six image fragments containing the top, the full skirt, the part of the skirt above the knee, the neckline, the left sleeve, and the right sleeve, where each fragment is a region of the image where attributes of the dress are likely to be found.
  • the left sleeve fragment is a region of the image where the attribute of sleeve length may be found.
  • Other attributes that may be found in the left sleeve fragment are, for example, one or more of pattern, material, or colour attributes.
  • Figure 7 shows a table listing example approximate positions of example image fragments and corresponding attributes which may be expected to be found in image fragments.
  • a location may be a rectangle represented as (x, y), w, h, with (x, y) the coordinates of the upper left corner of the rectangle, w the width and h the height of the rectangle. W is the width of the bounding box and H is the height of the bounding box.
  • the approximate location of the neckline of a dress is expected to be a rectangle having its upper left corner at the upper left corner of the bounding box, having a width equal to the width of the bounding box, and a height equal to 0.20 times the height of the bounding box.
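  • A rule-based fragment extraction of this kind can be sketched as follows; only the neckline rectangle is taken from the example above, and the remaining rectangles are illustrative placeholders that would in practice be filled in from a table such as that of Figure 7.

```python
def dress_fragments(bounding_box):
    """Return named fragment rectangles ((x, y), w, h) relative to the dress bounding box.

    bounding_box is ((bx, by), W, H): upper left corner, width and height.
    Only the neckline rule follows the description; the other values are assumptions.
    """
    (bx, by), W, H = bounding_box
    return {
        "full_dress":       ((bx, by), W, H),
        "neckline":         ((bx, by), W, 0.20 * H),                   # as in the example above
        "top":              ((bx, by), W, 0.45 * H),                   # assumed
        "left_sleeve":      ((bx, by), 0.30 * W, 0.45 * H),            # assumed
        "right_sleeve":     ((bx + 0.70 * W, by), 0.30 * W, 0.45 * H), # assumed
        "full_skirt":       ((bx, by + 0.45 * H), W, 0.55 * H),        # assumed
        "skirt_above_knee": ((bx, by + 0.45 * H), W, 0.30 * H),        # assumed
    }
```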
  • the location and dimensions of image fragments may be determined in dependence upon the expected geometry of subjects of an image.
  • embodiments may be directed to searching for images of cars.
  • a bounding box may be chosen which encloses the car and preferably excludes other objects which may be present in the image, such as trees or buildings.
  • Fragments and corresponding expected locations may be chosen, for example, for one or more wheels of the car, one or more doors of the car, a front and/or rear window.
  • the image fragments are represented, in preferred embodiments, with the BVLC CaffeNet convolutional neural network (CNN) model, a variant of the AlexNet convolutional neural network.
  • the CaffeNet CNN may be pre-trained on ImageNet.
  • the image fragment representations are acquired as the activation weights (or inputs) of the last fully connected layer before the softmax layer, which have dimension 4096 in the CNN architecture.
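  • A sketch of this feature extraction follows, using the torchvision AlexNet as a stand-in for the BVLC CaffeNet Caffe model (the use of torchvision and the exact layer slicing are assumptions for illustration); in eval mode the output is the 4096-dimensional activation feeding the final classification layer.

```python
import torch
from torchvision import models, transforms

# AlexNet pre-trained on ImageNet; its penultimate fully connected layer is 4096-dimensional.
alexnet = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
alexnet.eval()

# Drop the final classification layer so the forward pass ends at the 4096-d activations.
feature_extractor = torch.nn.Sequential(
    alexnet.features,
    alexnet.avgpool,
    torch.nn.Flatten(),
    *list(alexnet.classifier.children())[:-1],
)

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def fragment_features(fragment_image):
    """Return the 4096-d representation of a PIL image crop (one image fragment)."""
    with torch.no_grad():
        return feature_extractor(preprocess(fragment_image).unsqueeze(0)).squeeze(0)
```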
  • a convolutional neural network includes at least one convolutional layer and may include one or more of at least one pooling layer, at least one normalization layer, at least one fully connected layer, a softmax layer.
  • any suitable architecture of a neural network may be used to represent the image fragments.
  • Any suitable architecture of a convolutional neural network may be used to represent the image fragments.
  • a neural network is not used to represent the image fragments.
  • a Scale-Invariant Feature Transform method may be used to represent the image fragments.
  • Word embeddings, or vector representations for words may be trained on text descriptions using a distributional semantic model, for example the Skip-gram model, or a latent word language model.
  • This allows the learning of a single word embedding for multiword fashion expressions (for example, long sleeves) and better captures the syntax and semantics of phrases likely to be included in the text descriptions.
  • any neural network architecture, or other method which can be used to train word embeddings for example a Latent Semantic Analysis method, may be used.
  • the syntax and semantics of text descriptions may be different for different implementations. For example, the style of writing of clothes descriptions on a fashion e-commerce website is likely to be different to that of descriptions of cars on a second-hand car sales website.
  • all words are first converted to lowercase and all non-alphanumeric characters are preferably removed. Words occurring at a low frequency may also be removed, for example, words which occur less than 5 times in the training data set.
  • the text descriptions are preferably filtered to remove phrases which are not related to the subject of the associated images. For example, in the case of images of dresses with corresponding text descriptions, the glossary of an online clothing shop, e.g. the online clothing shop Zappos, may be used, which contains both single word and multiword expressions related to fashion.
  • Remaining words or phrases may refer to parts of the subject of an image which are not visible in the image.
  • an image may include a front view of a dress and the associated text description may refer to the back of the dress.
  • each phrase is considered as a text fragment.
  • the number of text fragments may differ for different text descriptions, and some text descriptions may not result in any text fragments.
  • Zappos glossary approach is only one example of a method of acquiring text fragments.
  • Other glossaries may be chosen which are appropriate for the subject of the images and/or text descriptions.
  • a segmentation method need not use a glossary.
  • the neural network preferably learns projection parameters
  • v_i = W_v ṽ_i + b_v (1), and correspondingly s_j = W_s s̃_j + b_s (2), where ṽ_i is the CNN representation of image fragment i and s̃_j is the word embedding of text fragment j.
  • W v has dimensions h x 4096 and W s has dimensions h x dim, where h is the size of the common, multimodal space and dim is the dimension of the word embeddings.
  • Parameters b v and b s are bias terms.
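  • A minimal PyTorch sketch of these two projections follows (the module name is illustrative; the dimensions follow the description above).

```python
import torch.nn as nn

class MultimodalProjection(nn.Module):
    """Project 4096-d image fragment features and dim-d word embeddings
    into a common h-dimensional multimodal space."""

    def __init__(self, h, dim):
        super().__init__()
        self.image_proj = nn.Linear(4096, h)   # weight W_v (h x 4096) and bias b_v
        self.text_proj = nn.Linear(dim, h)     # weight W_s (h x dim) and bias b_s

    def forward(self, image_fragments, text_fragments):
        v = self.image_proj(image_fragments)   # v_i = W_v ṽ_i + b_v
        s = self.text_proj(text_fragments)     # s_j = W_s s̃_j + b_s
        return v, s
```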
  • C_F(θ) is a fragment alignment objective
  • C_G(θ) is a global ranking objective
  • C_I(θ) is an image cluster consistency objective. θ refers to the network parameters and α, β and γ are hyperparameters to be cross-validated, which are set in dependence on a validation set.
  • a hyper-parameter is a parameter which is set in advance of the training of the neural network, and is not a parameter which is learned by the model.
  • a validation set is a subset of the training data set, which is used to determine appropriate values for the hyper parameters.
  • Cross-validation aims to assess whether a value chosen for a hyper parameter is appropriate.
  • An appropriate value for a hyperparameter is one which assigns a weight to an objective (C_F(θ), C_G(θ), or C_I(θ)) such that the objective function C(θ) guides the neural network model to learn projection parameters which provide a multimodal space which reflects the semantic similarity of image and text fragments.
  • Each of the objectives C_F(θ), C_G(θ), C_I(θ) is concerned with a different aspect or characteristic of the training data set.
  • an objective function may include product-dependent constraints.
  • Fragment alignment objective
  • the fragment alignment objective C_F(θ) uses fragment co-occurrence information to infer the semantic similarity v_i^T s_j of image fragment v_i and text fragment s_j.
  • the neural network initially has no knowledge that an image fragment 8₂ (Figure 5) and a text fragment "long sleeves" refer to the same attribute, that is, are semantically similar.
  • the fragment alignment objective C_F(θ) uses these assumptions to learn an intermodal representation.
  • the fragment alignment objective may be expressed in terms of alignment variables y_ij (equations 4 and 5), with
  • y_ij = -1 ∀i, subject to m_v(i) ≠ m_s(j)   (7)
  • the value for the variable y_ij is determined based on the assumptions above. For text fragment s_j in a given text description, at least one image fragment v_i in the image associated with the given text description is expected to show the attribute expressed by s_j.
  • In equation 6, the set indexed by j is the collection of image fragments which occur with s_j, that is, all image fragments in the image which is associated with the text to which s_j belongs.
  • the fragment alignment objective C_F(θ) attempts to find the variables y_ij which minimize equation 5 (see also equation 4).
  • An image fragment v_i and a text fragment s_j belonging to a non-corresponding image-text pair (m_v(i) ≠ m_s(j)) are expected to represent different attributes (equation 7).
  • m_v(i) is the identification number of the image in the training set to which image fragment v_i belongs.
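  • A plausible sketch of a fragment alignment objective of this general shape follows (an assumption; equations 4 to 6 may take a different exact form): a hinge loss over the alignment variables y_ij, with y_ij fixed to -1 for fragments from non-corresponding image-text pairs and heuristically set to +1 for the best-scoring image fragment of the corresponding image.

```python
import torch

def fragment_alignment_loss(v, s, image_ids, text_ids):
    """Hedged sketch of a fragment alignment objective.

    v: (n_v, h) projected image fragment vectors, s: (n_s, h) projected text fragment vectors.
    image_ids[i] = m_v(i) and text_ids[j] = m_s(j) identify the training pair each fragment belongs to.
    """
    scores = v @ s.t()                                        # inner products v_i^T s_j
    same_pair = image_ids.unsqueeze(1) == text_ids.unsqueeze(0)
    y = -torch.ones_like(scores)                              # y_ij = -1 for non-corresponding pairs
    # Heuristic: within the corresponding image, align s_j to its best-scoring fragment.
    masked = scores.masked_fill(~same_pair, float("-inf"))
    best = masked.argmax(dim=0)
    y[best, torch.arange(scores.shape[1])] = 1.0
    return torch.clamp(1.0 - y * scores, min=0.0).sum()       # hinge loss
```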
  • An epoch is a single forward pass of the entire training set through the neural network, followed by a single backward pass of the training set through the neural network. In general terms, an epoch may be referred to as one full training cycle. Subsequently, the fragment alignment objective is changed to equation 4 for refinement of the intermodal representations.
  • a good initialization is one which already partially captures the semantics of image and text fragments in the intermodal representations. This is preferable to, for instance, an initialization which uses randomly chosen values.
  • the model learns that all image and text fragments which occur together are semantically similar and all image and text fragments which do not occur together are not semantically similar. This is not correct, because a text fragment specifying "V-neck" is only semantically similar to an image fragment showing a V-neck and not to other image fragments of an image showing, for example, sleeves or a skirt. However, if the model is initially trained in this way, it is expected that the model will already learn something about the semantics of image fragments and text fragments. This is because some combinations of image and text fragments may occur more frequently.
  • the phrase "V-neck" may occur in a text description which is associated with an image having different image fragments (which may show, for example, different sleeve lengths, different forms and lengths of skirts, etc.).
  • the phrase "V-neck” will always occur with an image fragment showing a V-neck. So, the combination of the text fragment "V-neck” with an image fragment showing a V-neck is expected to occur in the dataset with a greater frequency than a combination of the text fragment "V-neck” and an image fragment which does not show a V-neck.
  • the model can be trained for any number of epochs.
  • noisy training data may be extracted from, for example, an e-commerce website and used to train the neural network.
  • Semantic similarity can also be derived from global image-text correspondence.
  • a first image and a first corresponding text description, forming a first image-text pair should have a higher total semantic similarity than the first image and a second text description which belongs to a second, different image-text pair. That is, the first image does not correspond to the second text description, as they belong to different image-text pairs.
  • the total semantic similarity score S_kl of an image k and a text l is computed based on the semantic similarity scores of their respective fragments according to the following equation, where n is a smoothing term and hyperparameter which is chosen to prevent shorter texts from having an advantage over longer texts. For example, consider a first text t₁ including one text fragment, a second text t₂ including five text fragments, and an image including at least one image fragment, and the requirement to determine the similarity between each of the texts and the image.
  • the text fragment of the short text t₁ may have only a relatively small positive inner product with the image fragment of image k which best matches the text fragment of the short text t₁.
  • the long text t₂ may have two text fragments which each have a relatively large positive inner product with an image fragment, and three text fragments which each have a relatively small, or a negative, inner product.
  • the similarity between the second text and the image would be greater than the similarity between the first text and the image.
  • equation 10 requires division by the number of text fragments in the text (f_l).
  • the smoothing term n aims to counterbalance this discrepancy.
  • the smoothing term may alternatively be chosen in dependence upon a validation set.
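  • A plausible sketch of such a smoothed aggregation follows (an assumption; the exact form of equation 10 may differ): the summed fragment scores are divided by the number of text fragments plus the smoothing term n.

```python
import torch

def total_semantic_similarity(v_k, s_l, n=5.0):
    """Plausible sketch of a smoothed image-text similarity S_kl.

    v_k: (num_image_fragments, h) fragments of image k.
    s_l: (num_text_fragments, h) fragments of text l.
    n: smoothing term (hyperparameter) preventing very short texts from being
       favoured simply because the divisor is small.
    """
    scores = v_k @ s_l.t()                         # pairwise fragment inner products
    per_text_fragment = scores.max(dim=0).values   # best-matching image fragment per text fragment
    return per_text_fragment.clamp(min=0).sum() / (s_l.shape[0] + n)
```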
  • Text descriptions in the training data set may be noisy and/or incomplete.
  • the image cluster consistency objective C_I(θ) attempts to deal with this noise and incompleteness by exploiting the fact that image fragments which look similar probably have at least one attribute in common. For example, if image fragments v_i and v_k look similar and image fragment v_i has a high positive semantic similarity score with the text fragment "v-neck", then v_k probably also shows a v-neck.
  • the image cluster consistency objective may infer the semantic relatedness of image fragment v_k and text fragment "v-neck".
  • the image cluster consistency objective may infer that this colour attribute is incorrect when considering the negative semantic similarity of a similar image fragment v_i with the text fragment "blue". This can allow the recovery of information which may be otherwise lost.
  • the image fragments are clustered in C clusters based on cosine distance with k-means clustering.
  • any suitable clustering method may be used, for example hierarchical agglomerative clustering.
  • a clustering separates vectors into groups (or clusters) of vectors, where the vectors in each group are similar to each other.
  • a high cosine similarity indicates high similarity, however any suitable similarity metric may be used.
  • the image cluster consistency objective may then be expressed as follows.
  • This objective considers all M image-text description pairs in the training set, and for each pair sums over its image fragments v_i and text fragments s_j.
  • This objective encourages the difference between the semantic similarity score of a first image fragment v_i with corresponding first text fragment s_j and the semantic similarity score of a second image fragment c_i (which is similar to image fragment v_i) with the first text fragment s_j to be as small as possible.
  • the second, similar image fragment c_i is taken to be the centroid of the cluster of v_i.
  • the centroid of a cluster is the average of all vectors in that cluster.
  • the second image fragment c_i may be taken to be the medoid of the cluster or the nearest neighbor in the same cluster.
  • the medoid of a cluster is the vector having the maximum average similarity with all other vectors in the cluster.
  • the nearest neighbor of a vector is the vector which is most similar.
  • the difference in semantic similarity scores is weighted by a factor based on cosine similarity of the image fragment and its centroid. This weighting factor allows the image cluster consistency objective to attempt to prevent image fragments in the same cluster from being semantically related to the same text fragment in the case where the image fragments do not have a high degree of visual similarity. This can help to prevent the introduction of errors due to defects in the clustering.
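  • A sketch of a cluster consistency term of this shape follows (an assumption illustrating the description above, not the patent's exact objective): the squared difference between v_i^T s_j and c_i^T s_j, weighted by the cosine similarity of v_i to its cluster centroid c_i.

```python
import torch

def cluster_consistency_loss(v, centroids, cluster_ids, s, pair_mask):
    """Sketch of an image cluster consistency term.

    v: (n_v, h) image fragment vectors, centroids: (C, h) cluster centroids,
    cluster_ids: (n_v,) cluster index of each image fragment,
    s: (n_s, h) text fragment vectors,
    pair_mask: (n_v, n_s) float mask, 1.0 where image and text fragment come
    from the same image-text pair.
    """
    c = centroids[cluster_ids]                                    # centroid of each fragment's cluster
    weight = torch.nn.functional.cosine_similarity(v, c, dim=1)  # down-weights loosely clustered fragments
    diff = (v @ s.t() - c @ s.t()).pow(2)                         # squared score difference
    return (weight.unsqueeze(1) * diff * pair_mask).sum()
```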
  • Figure 8 schematically illustrates an embodiment of the present invention, for example image and text fragments, their embedding or neural network representation, and inner product.
  • the neural network preferably learns intermodal representations for the image and text fragments, such that semantically related image and text fragments have a high inner product (shown as heavier shading).
  • a multimodal search according to embodiments of the present invention may be carried out using the intermodal representations inferred by the neural network. It will be understood that the training of the neural network does not need to be carried out each time a search is required. The training may occur once and may occur at a different time and/or on a different device to that used to request or carry out the multimodal search.
  • the query text t q is considered to be a query modifier.
  • the query modifier may be first provided in a different format, for example in audio format, and subsequently converted to text format by, for example, a speech-to-text conversion method.
  • the term "query modifier" can refer to a query which is initially provided in text format or in another format. It is also possible that the query modifier is in a non-text format, such as speech, and has a non-text, such as speech, fragment intermodal representation in the multimodal space.
  • If the query modifier is provided in a non-text format, it is preferably first converted to text format using any suitable method. The resulting text is then the query text t_q.
  • a candidate image i_c is preferably evaluated for its similarity to the query q as follows.
  • the query image i_q and the candidate image i_c are segmented using any suitable segmentation method, producing n image fragments v_i^q and v_i^c for the query image and the candidate image respectively, where i runs from 1 to n.
  • n is a positive integer.
  • the query text t_q is segmented using any suitable segmentation method, producing query text fragments s_j^q.
  • a candidate image i_c is considered to be relevant if it is visually similar to query image i_q and exhibits every attribute expressed by query text t_q (equation 12). However, in some embodiments, it may be that every attribute of query image i_q and/or query text t_q need not be present in the candidate image i_c for the candidate image i_c to be considered to be relevant. In these embodiments, a different retrieval model may be used.
  • Candidate image i_c resembles query image i_q if image fragments v_i^q and v_i^c of corresponding image parts i are similar (equation 13).
  • Candidate image i_c satisfies query text t_q if every query text fragment s_j^q is shown in at least one image fragment v_i^c of the candidate image.
  • candidate image i_c satisfies query text t_q if every s_j^q has, for example, a positive inner product with at least one v_i^c.
  • other similarity metrics may be used. Consequently, a candidate image which is similar to the query image but which does not show one or more of the attributes specified in the query text will receive a similarity of negative infinity.
  • the relevance of a candidate image depends on the number of candidate image fragments which display attributes corresponding to query text fragments (equation 14).
  • w is a weighting term, which is a positive number, which controls the weight given to the query text relative to the query image. A smaller weight will give more relevance to a candidate image which is similar to the query image, whereas a larger weight will give more relevance to images which satisfy the query text but are less similar to the query image.
  • the weighting term may be chosen based on a validation set.
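  • Putting the retrieval model described above together, a hedged sketch follows (the function and variable names are illustrative and the exact normalisations of equations 12 to 14 may differ); the candidate is assumed to be segmented with the same rule-based scheme as the query image, so fragments correspond part by part.

```python
import torch

def relevance(query_image_frags, query_text_frags, candidate_frags, w=1.0):
    """Sketch of the relevance of a candidate image to a multimodal query.

    query_image_frags, candidate_frags: (n, h) fragment vectors of corresponding image parts.
    query_text_frags: (m, h) query text fragment vectors.
    w: weighting term controlling the influence of the query text.
    """
    # Visual similarity: corresponding image parts should look alike (cf. equation 13).
    visual = torch.nn.functional.cosine_similarity(query_image_frags, candidate_frags, dim=1).sum()

    # Every query text fragment must be shown in at least one candidate fragment
    # (positive inner product); otherwise the candidate receives negative infinity.
    scores = query_text_frags @ candidate_frags.t()
    if (scores.max(dim=1).values <= 0).any():
        return float("-inf")

    # Semantic similarity: number of candidate fragments displaying a requested attribute (cf. equation 14).
    semantic = (scores > 0).any(dim=0).float().sum()
    return (visual + w * semantic).item()
```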
  • the searching method may be implemented using a multimodal search system.
  • the multimodal search system may include a central processing unit, memory, input devices (e.g., keyboard and pointing devices), output devices (e.g., display devices), and storage devices (e.g., disk drives).
  • the memory and storage devices are computer-readable media that may contain instructions that implement the multimodal search system.
  • the data structures may be stored or transmitted via a data transmission medium, such as a signal on a communications link.
  • Various communications links may be used to connect components of the system, such as the Internet, a local area network, a wide area network, a point-to-point dial-up connection, a cell phone network, and so on.
  • Embodiments of the multimodal search system may be implemented in various operating environments that include personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, programmable consumer electronics, digital cameras, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and so on.
  • the devices may include cell phones, personal digital assistants, smart phones, personal computers, programmable consumer electronics, digital cameras, and so on.
  • the multimodal search system may be described in the general context of computer-executable instructions, such as program modules, executed by one or more computers or other devices.
  • program modules include routines, programs, objects, components, data structures, and so on that perform particular tasks or implement particular abstract data types.
  • functionality of the program modules may be combined or distributed as desired in various embodiments.
  • the storage devices may include a database of images which may be evaluated as candidate images for a multimodal query.
  • a mobile computing device may be used to provide a multimodal query.
  • a mobile computing device may include central processing unit, memory, input devices (e.g., buttons or a touchscreen), output devices (e.g., a display screen), and storage devices (e.g., disk drives).
  • a mobile computing device may include a camera.
  • a user may find an image of a product, for example by taking a picture of the product using a camera or by selecting an image seen online, for example on a social networking website or a news website, and may wish to search for a product similar to the product shown in the image, but with one or more features changed or added.
  • the user provides the query image and a query modifier to the mobile computing device.
  • the user may provide the query modifier by providing a text input to the mobile computing device.
  • the mobile computing device may include a microphone and the user may provide the query modifier by speaking into the microphone.
  • the mobile computing device is configured to receive a multimodal query and to provide one or more output images which satisfy the query.
  • the mobile computing device provides the multimodal query comprising the query image and query modifier to a multimodal search system.
  • the multimodal search system may be included in the mobile computing device. Alternatively, the multimodal search system may be a remote system, that is, at a different physical location to the mobile computing device.
  • the mobile computing device may provide the multimodal query through, for example, a wired or wireless internet connection, a Bluetooth connection.
  • the multimodal search system is configured to receive the multimodal query and to perform the multimodal search method and provide one or more output images which satisfy the query.
  • the one or more output images may be provided to the mobile computing device through, for example, a wired or wireless internet connection, a Bluetooth connection.
  • the training dataset used is a dataset of 53,689 image- text pairs collected from the Amazon webshop between January and February
  • Each pair consists of an image of a dress and the corresponding textual product description.
  • the images in the dataset illustrate dresses for different occasions, such as bridesmaid, casual, cocktail, wedding, work, and thus a variety of fine-grained fashion attributes are displayed.
  • the corresponding text descriptions include surrounding natural language text of the webshop, for example the name of the product, product features, and editorial content.
  • the text descriptions describe the image content but may be incomplete and/or noisy, and may include misspellings, incorrect grammar, and incorrect punctuation.
  • the neural network is trained using 48,689 image-text pairs in the dataset. 4000 image-text pairs are used for validation and 1000 are used for testing.
  • the quality of the inferred intermodal representations is evaluated in a multimodal retrieval setting.
  • 100 multimodal queries were collected from 10 test subjects. Two men and eight women each created 10 multimodal queries comprising real search requests created whilst browsing the dataset.
  • the query images are taken from the test dataset and the query modifiers comprise one to three phrases from the Zappos glossary, each phrase denoting a fashion attribute to be modified or added to the corresponding query image.
  • the Zappos glossary includes more than 200 phrases, 47 of which were chosen by the test subjects in their multimodal queries.
  • a retrieved image is considered to be relevant if it resembles the query image and satisfies the query text.
  • For training the neural network, a rule-based segmentation approach based on garment geometry is used for the images in the training dataset. C groups of similar image fragments are found using k-means clustering on the image fragments. In this example, the test values of C were 500, 2500, 5000, 7500, 10000, 12500, 15000, 17500, and 20000, and the most suitable value was found to be C = 10000. This was evaluated by clustering the image fragments into C clusters. For each cluster of image fragments, the frequency of occurrence of all text fragments in that cluster is calculated. To annotate an image in the validation set, the image fragments of this image and the clusters to which these image fragments belong are considered, and the frequencies of all text fragments are aggregated over all of these clusters (a sketch of this annotation procedure is given after this list).
  • the image is annotated with the K text fragments having the highest frequency.
  • 300-dimensional word embeddings were trained using the Skip-gram model.
  • the product (text) descriptions are concatenated, all words are converted to lower case, and non-alphanumeric characters are removed.
  • the Skip-gram model is trained on the resulting text, where each fashion phrase, or group of words relating to a fashion attribute likely to be found in an image of a fashion item, is treated as a single word.
  • a context size of 5 is considered.
  • the context size is the number of phrases occurring before and after each phrase which are considered to make up the context of that phrase when training the word embeddings.
  • word embeddings learn semantic representations for words or phrases; the context in which a word or phrase occurs provides information about the meaning of that word or phrase.
  • the product descriptions are then filtered, using the Zappos glossary, to retain only fashion phrases.
  • the phrases which remain after the filtering step are considered as the textual fashion attributes, and are text fragments.
  • the neural network then induces a 1000-dimensional common multimodal space.
  • the neural network is trained with an objective function (equation 3), which is optimized with a stochastic gradient descent method with mini-batches of 100, a fixed learning rate of 10⁻⁵, and a momentum of 0.90 during 20 epochs through the training data.
  • a mini-batch is a batch with a small size.
  • the batch size is the number of training examples used for one forward and backward pass.
  • the learning rate is a parameter which determines how much the network parameters are updated.
  • the momentum is a percentage which influences the size of the steps taken towards the optimal network parameters.
  • the top K most relevant images are retrieved. Images are evaluated by computing precision@K for K values of 1 and 5, and MAP (mean average precision). A retrieved image is considered to be relevant if it resembles the query image and satisfies the query text. In this example, it is required that all fashion attributes of the query image are present in the retrieved image, and that all fashion attributes requested in the query text are either added or interchanged with the necessary query image fashion attributes. If a particular dress fabric is requested, the product descriptions may be used to determine the presence of the fabric in addition to or instead of the product images.
  • the results of the multimodal retrieval model of this example are compared with those of a simple multimodal retrieval model.
  • an intermodal representation of the desired image is created based on the multimodal query, and candidate images are retrieved which are visually similar to the desired image.
  • the intermodal representation of the desired image is calculated as the intermodal representation of an image fragment of the query image showing the full fashion item (in this example, a dress), plus the intermodal representations of the query text fragments, minus the intermodal representations of attributes that the query text fragments will replace.
  • the simple multimodal retrieval model takes the intermodal representation of the first image fragment of the query image (that is, the image fragment showing the whole dress), adds the intermodal representations of "spots" and "long sleeves", and subtracts the intermodal representation for "sleeveless". Then, it retrieves candidate images with a high cosine similarity to the resulting vector (a sketch of this vector arithmetic is given after this list).
  • This simple multimodal retrieval model requires the user to also specify in the query text which fashion attribute(s) they would like to be replaced. For example, for the multimodal query in Figure 1, the user must specify "Find similar images but with long sleeves instead of sleeveless and with spots".
  • a qualitative evaluation is performed, wherein relevance is expressed as a percentage of attributes required by the multimodal query which are present in the retrieved image.
  • a refined precision@K score is computed, expressing relevance as a percentage of the required attributes which are present.
  • Figure 9 shows a comparison of performance of the simple (referred to in the table as 'standard') multimodal retrieval model and the proposed multimodal retrieval model.
  • P@K represents precision@K for multimodal retrieval.
  • "Attribute P@5" is precision@5 for each individual textual attribute.
  • "macro AP@5" is the average attribute precision@5 over all textual attributes.
  • the proposed retrieval model achieves an increase of 267% on precision@1, of 158% on precision@5, of 253% on MAP, and of 239% on macro AP@5.
  • the simple multimodal retrieval model may be thought of as being intuitive, and it in fact creates the intermodal representation of the desired image.
  • the proposed multimodal retrieval model uses the inner product of a query text fragment and candidate image fragment to explicitly require every query text attribute to be present in the retrieved images.
  • a qualitative assessment may also be performed.
  • a multimodal query comprising the query "find similar images but with a v-neck" in combination with a query image of a dress.
  • the query image has attributes which include sleeveless, black, white, short, sheath, casual. Five images showing dresses were retrieved by the model and assessed for relevance.
  • the first dress preserves the query image attributes except for the neckline attribute, which is a v-neck as requested, and so this dress satisfies the multimodal query perfectly.
  • the remaining dresses exhibit the requested query text attribute "v-neck” but do not exhibit all of the remaining query image attributes.
  • the second dress has short sleeves, the third and fifth dresses do not have the attribute 'white', the fourth dress is blue.
  • the refined precision@l for the retrieved set of images for this query was calculated to be 100% and the refined precision@5 to be 71.43%.
  • a multimodal query comprising the query "find similar images but shift" in combination with a query image of a dress.
  • the query image has attributes which include jewel neckline, short sleeves, short, pink, floral print, summer.
  • Five images showing dresses were retrieved by the model and assessed for relevance. None of the dresses has all of the requested attributes. For example, the first dress does not have a floral print, the second dress does not have short sleeves. However, each dress has some of the required attributes and can be considered similar to the query image.
  • the refined precision@l for the retrieved set of images for this query was calculated to be 71.43% and the refined precision@5 to be 57.14%.
  • a multimodal query comprising the query "find similar images but with rhinestones" in combination with a query image of a dress.
  • the query image has attributes which include white, strapless, sweetheart, long, A-line, pleated, bridesmaid. Five images showing dresses were retrieved by the model and assessed for relevance. None of the dresses has all of the requested attributes. However, each dress has some of the required attributes and can be considered similar to the query image.
  • the first dress has all of the requested attributes (rhinestones, strapless, sweetheart, long, A-line, pleated, bridesmaid) except for the color (white).
  • the refined precision@l for the retrieved set of images for this query was calculated to be 87.50% and the refined precision@5 to be 77.14%.
  • a multimodal query comprising the query "find similar images but strapless and short” in combination with a query image of a dress.
  • the query image has attributes which include red, single strap, long, bridesmaid and accented with an ornament at the waist.
  • Five images showing dresses were retrieved by the model and assessed for relevance.
  • the first, second, and third dresses show all of the requested attributes, that is, red, strapless, short, bridesmaid and accented with an ornament at the waist.
  • the third image does not have the attribute 'red' and the fifth image does not have the attribute 'short'.
  • the refined precision@l for the retrieved set of images for this query was calculated to be 100% and the refined precision@5 to be 80%.
  • the results show that the proposed multimodal retrieval model is capable of retrieving relevant candidate images for the provided multimodal queries.
  • the results also indicate that the neural network has learned what certain fashion attributes look like.
  • a component of the intermodal representation is either zero or positive (as a consequence of using ReLU activation function).
  • For all textual attributes in the vocabulary (the Zappos glossary) in this example, it is observed that only a few of the 1000 components are non-zero. This indicates that the model has learned which components of the image fragments to focus on when looking for a specific fashion attribute.
  • the neckline "sweetheart" has only 7 non-zero components. Hence, image fragments in the common, multimodal space which have positive values for these components will result in a positive inner product with the intermodal representation of "sweetheart".
  • the query, candidate, and test or validation images may illustrate any type of item having attributes which are visual and which represent features which may vary between items of that type. For example, a user may provide a multimodal query comprising an image of a toy, for example a teddy bear, and a query modifier such as "find it with a red hat", and the search method may be used to evaluate the similarity of candidate images of toys with the multimodal query.
  • the multimodal space need not be induced by a neural network.
  • a multimodal space may be induced by canonical correlation analysis or bilingual latent Dirichlet allocation.
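To illustrate the cluster-based annotation procedure referred to earlier in this list (clustering image fragments with k-means and annotating a validation image with the most frequent text fragments of its fragment clusters), a minimal sketch follows. It assumes scikit-learn's KMeans; the names train_fragments, frag_to_image and train_texts are hypothetical placeholders for the training data, not part of the described system.

```python
# Sketch of the cluster-based annotation used when choosing the number of clusters C.
# Assumes `train_fragments` (N x D array of image-fragment features), `train_texts`
# (list of sets of text fragments, one per training image), and `frag_to_image`
# (index of the training image each fragment belongs to) -- all hypothetical names.
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans

def build_cluster_vocab(train_fragments, frag_to_image, train_texts, C=10000, seed=0):
    """Cluster image fragments and count text-fragment occurrences per cluster."""
    km = KMeans(n_clusters=C, random_state=seed).fit(train_fragments)
    cluster_counts = [Counter() for _ in range(C)]
    for frag_idx, cluster_id in enumerate(km.labels_):
        # Every text fragment of the owning image is counted for this cluster.
        cluster_counts[cluster_id].update(train_texts[frag_to_image[frag_idx]])
    return km, cluster_counts

def annotate(image_fragments, km, cluster_counts, K=5):
    """Annotate a validation image with the K most frequent text fragments,
    aggregated over the clusters of its image fragments."""
    total = Counter()
    for cluster_id in km.predict(image_fragments):
        total.update(cluster_counts[cluster_id])
    return [phrase for phrase, _ in total.most_common(K)]
```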
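The "simple" baseline retrieval model compared against earlier in this list can be sketched as plain vector arithmetic on intermodal representations followed by cosine ranking. This is only an illustrative sketch; embed_text and the candidate vectors are assumed inputs, and the exact baseline implementation may differ.

```python
# Sketch of the 'simple' baseline retrieval model used for comparison.
# `embed_text(phrase)` is a hypothetical helper returning the intermodal vector
# of a text fragment; `query_image_vec` is the intermodal vector of the fragment
# showing the whole item; `candidate_vecs` are intermodal vectors of candidates.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def simple_retrieval(query_image_vec, add_phrases, remove_phrases,
                     candidate_vecs, embed_text, top_k=5):
    """Vector arithmetic on intermodal representations, then cosine ranking."""
    target = query_image_vec.copy()
    for phrase in add_phrases:          # e.g. "spots", "long sleeves"
        target += embed_text(phrase)
    for phrase in remove_phrases:       # e.g. "sleeveless"
        target -= embed_text(phrase)
    scores = [cosine(target, c) for c in candidate_vecs]
    return np.argsort(scores)[::-1][:top_k]   # indices of the top-k candidates
```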


Abstract

A computer-implemented searching method for evaluating an image in dependence on a multimodal query is disclosed. The method comprises receiving a multimodal query comprising a query image and a query modifier (e.g., query text). The query modifier modifies the query image or adds attributes to the query image. The query image comprises at least one query image fragment, each of the at least one query image fragment having a corresponding query image fragment intermodal representation in a multimodal space. The query modifier comprises at least one query modifier fragment, each of the at least one query modifier fragment having a corresponding query modifier fragment intermodal representation in the multimodal space. The method comprises receiving a candidate image comprising at least one candidate image fragment, each of the at least one candidate image fragment having a corresponding candidate image fragment intermodal representation in the multimodal space. The method comprises calculating a first similarity between the query image and the candidate image in dependence upon at least one query image fragment intermodal representation and at least one candidate image fragment intermodal representation. The method comprises calculating a second similarity between the query modifier and the candidate image in dependence upon at least one query modifier fragment intermodal representation and at least one candidate image fragment intermodal representation. The method comprises calculating an overall similarity between the multimodal query and the candidate image in dependence upon the first similarity and the second similarity.

Description

Title: Method for evaluating an image
Field of the invention
The present invention relates to a multimodal image search method and a device for implementing a multimodal image search method.
Background of the invention
Current e-commerce search mechanisms are often too limited to provide users with their desired products. A common way of searching for products on an e-commerce website is to navigate through a product category hierarchy. Often, a user is then required to search through the entire contents of a category or subcategory with no guarantee that the desired product will be present.
A user may also select filters to narrow the search. However, the user is limited to the filters provided by the website and the desired product attributes may not be among the available filters.
An alternative searching method is text-based. The user inputs keywords into a search bar and the website finds relevant products by matching the keywords to words in product descriptions. This requires the user to know which terms are most likely to be used in the product descriptions of the desired products and does not account for different terms used to describe the same attribute. For example, a target product description may specify "distressed jeans" and a user may search for "jeans with holes"; the search will not return the target product.
Additionally, a search for a product attribute which is not present in the product description will not return the product. A further alternative searching method is an image-based search. A user provides an image of a product and the website provides visually similar products. However, the user is restricted to the content of the provided image (used for the search).
Therefore, there is a need for improved search methods.
Summary of the invention
It is an object of the present invention to provide alternative search methods. It is another object of the present invention to provide quick and efficient search methods.
This object is met by the method and device according to the independent claims of the present invention. The dependent claims relate to preferred embodiments.
According to a first aspect of the present invention there is provided a computer-implemented searching method for evaluating an image in dependence on a multimodal query. The method includes receiving a multimodal query comprising a query image and a query modifier. The query modifier modifies the query image and/or adds attributes to it, i.e. the query modifier modifies the query represented by the query image and/or adds attributes to it. The query image comprises at least one query image fragment, each of the at least one query image fragments having a corresponding query image fragment intermodal
representation in a multimodal space. The query modifier comprises at least one query modifier fragment, each of the at least one query modifier fragments having a corresponding query modifier intermodal representation in the multimodal space. The method further includes receiving a candidate image having at least one candidate image fragment, each of the at least one candidate image fragments having a corresponding candidate image fragment intermodal representation in the multimodal space. The method includes calculating a first similarity between the query image and the candidate image in dependence upon the at least one query image fragment intermodal representation and the at least one candidate image fragment intermodal representation. The method includes calculating a second similarity between the query modifier and the candidate image in dependence upon the at least one query modifier intermodal representation and the at least one candidate image intermodal representation. The method includes calculating an overall similarity between the multimodal query and the candidate image in dependence upon the first similarity and the second similarity.
In this searching method the query image and query modifier are semantically not related. The query modifier modifies an attribute present in the query image and/or adds an attribute not present in the query image. Therefore, the searching method can involve regulating how much the query modifier can change the query image. Relevant images are found by computing the visual similarity between a candidate image and the query image and computing the semantic similarity between a candidate image and the query modifier. Thereto, the multimodal search method can capture the latent semantic correspondences between image regions and query modifiers, such as words.
Embodiments of the present invention advantageously provide alternative image-based searching methods. In prior art methods, a user provides an image of a product and the website provides visually similar products. However, the user may be interested in modifying, e.g. changing, removing and/or adding product attributes (that is, modifying attributes in the image), in order to obtain a result which is not visually similar to the search image in all respects, but differs in the modified attribute. The latter advantageously is enabled by embodiments of the present invention, which can allow a more fine-grained search and to provide more relevant results.
Optionally, the intermodal representation of the desired image is calculated as the intermodal representation of all of the at least one image query fragment, plus the intermodal representations of all of the query modifier fragments, minus the intermodal representations of attributes that the query modifier fragments will replace.
The method may comprise outputting the candidate image in dependence upon the overall similarity. The candidate image may be output if the overall similarity is higher than a predetermined (or relative) threshold. The candidate image may be withheld from outputting if the overall similarity is lower than a predetermined (or relative) threshold.
The query modifier may comprise a query text. The query text can modify the query represented by the query image and/or add one or more attributes to it.
The query modifier may comprise an audio query including at least one spoken word. The method may further comprise, after receiving the query modifier, converting the audio query to a query text.
An intermodal representation may correspond to a vector in the multimodal space. Calculating the first similarity or the second similarity may comprise calculating a similarity metric of vectors in the multimodal space. Optionally, calculating the similarity includes calculating an inner product of a first vector representing the candidate image and a second vector representing the query.
Optionally, the inner product of a query modifier fragment and candidate image fragment is used to require every query modifier attribute to be present in the retrieved images.
The method may comprise determining the query image fragment intermodal representation(s) in dependence upon at least one image attribute of the query image. The method may further comprise, after receiving the query image, extracting the at least one image attribute of the query image using an image segmentation method, optionally a rule-based image segmentation method.
The method may comprise determining the query modifier fragment intermodal representation(s) in dependence upon at least one attribute of the query modifier. The method may further comprise, after receiving the query modifier, extracting the at least one attribute of the query modifier using a segmentation method, optionally a rule-based text segmentation method.
The method may comprise determining the candidate image fragment intermodal representation(s) in dependence upon at least one image attribute of the candidate image. The method may further comprise, after receiving the candidate image, extracting the at least one image attribute of the candidate image using an image segmentation method, optionally a rule-based image segmentation method.
The overall similarity may be equal to the aggregation of the first similarity and the second similarity.
The overall similarity may be equal to a weighted aggregation of the first similarity and the second similarity.
The query image and candidate image may each illustrate at least one item which can be provided in various implementations (e.g. have variations) or have several attributes. The query image and candidate image may each illustrate an object of a type wherein the object has at least one attribute which is visually distinguishable between objects of the type. For example, an image of a dress is an image which illustrates an object: the dress. The object has a type: that is, being an object of the class of 'dresses'. The object has at least one attribute: for example, the length of the dress. The at least one attribute is visually distinguishable between objects of the type: for example, a dress having a short length and a dress having a long length are visually distinguishable, that is, they can be distinguished by a human who views the images, or by a computer which can perform image analysis on the images.
The query image and candidate image may each illustrate at least one fashion item.
The query image and the candidate image may each illustrate at least one toy, car, item of furniture, food item, house, electronic device or accessory, for example a laptop bag or mobile telephone or tablet case or cover.
The multimodal space may be induced by a neural network.
It is an advantage of embodiments of the present invention that image fragment similarity information across a plurality of images may be exploited, in contrast to only using global and local alignments of individual image-text pairs. This mechanism allows to recover information that may be lost due to noise and incompleteness in text descriptions.
According to a second aspect is provided a computer-implemented method for selecting an image from a plurality of images in dependence on a multimodal query. The method includes receiving a multimodal query comprising a query image and a query modifier, wherein the query modifier modifies the query represented by the query image and/or adds attributes to it. The query image comprises at least one query image fragment, each of the at least one query image fragments having a corresponding query image fragment intermodal
representation in a multimodal space. The query modifier comprises at least one query modifier fragment, each of the at least one query modifier fragments having a corresponding query modifier fragment intermodal representation in the multimodal space. The method includes receiving a plurality of candidate images, each comprising at least one candidate image fragment, each of the at least one candidate image fragments having a corresponding candidate image fragment intermodal representation in the multimodal space. The method includes for each of the candidate images calculating a first similarity between the query image and said candidate image in dependence upon at least one query image fragment intermodal representation and at least one candidate image fragment intermodal representation. The method includes calculating a second similarity between the query modifier and said candidate image in dependence upon at least one query modifier fragment intermodal representation and at least one candidate image fragment intermodal representation. The method includes calculating an overall similarity between the multimodal query and said candidate image in dependence upon the first similarity and the second similarity. The method includes selecting from the plurality of candidate images at least one candidate image having highest overall similarity.
Optionally, the method includes for each candidate image adding to a candidate image fragment intermodal representation of said candidate image an attribute retrieved from candidate image fragment intermodal representations of other candidate images that are visually similar to the said candidate image.
Optionally, the method includes for each candidate image removing from a candidate image fragment intermodal representation of said candidate image an attribute not occurring in any image fragment intermodal representation of other candidate images that are visually similar to the said candidate image.
According to a third aspect is provided a computer-implemented method for selecting an image from a plurality of images in dependence on a multimodal query. The method includes receiving a multimodal query comprising a query image and a query modifier, wherein the query modifier modifies the query represented by the query image and/or adds attributes to it, wherein the query image has a corresponding query image intermodal representation in a multimodal space and the query modifier has a corresponding query modifier intermodal representation in the multimodal space. The method includes receiving a plurality of candidate images, each having a corresponding candidate image intermodal representation in the multimodal space. The method includes ranking the candidate images in dependence upon their relevance to the multimodal query which includes the query image and the query modifier. The method includes selecting, from the plurality of candidate images at least one candidate image having highest ranking. Optionally, the method includes segmenting the query image and the candidate images, producing one or more query image fragments for the query image and one or more candidate image fragments for the candidate image.
Optionally, the method includes segmenting the query modifier, producing one or more query modifier fragments, the query modifier fragments referring to attributes to be added and/or interchanged with attributes of the query image. The query image fragments, query modifier fragments and candidate image fragments can be represented using intermodal representations as e.g. inferred by a neural network.
Optionally, determining the relevance of the candidate image includes determining a cosine similarity measure, to measure both the visual similarity of the query image, or a fragment thereof, and the candidate image, or a fragment thereof, and the semantic similarity of the candidate image, or a fragment thereof, and the query modifier, or a fragment thereof.
Optionally, a candidate image is considered to be relevant if it is visually similar to query image and exhibits every attribute expressed by query modifier. However, in some embodiments, it may be that every attribute of query image and/or query modifier need not be present in the candidate image for the candidate image to be considered to be relevant.
Optionally, a candidate image satisfies a query modifier if every query modifier fragment is shown in at least one image fragment of the candidate image.
The relevance of a candidate image may depend on the number of candidate image fragments which display attributes corresponding to query modifier fragments. A weight may be given to the query modifier relative to the query image. A smaller weight will give more relevance to a candidate image which is similar to the query image, whereas a larger weight will give more relevance to a candidate image which satisfies the query modifier but is less similar to the query image. The weighting term may be chosen based on a validation set.
The searching method may be implemented using a multimodal search system. The multimodal search system may include a central processing unit, memory, input devices (e.g., keyboard and pointing devices), output devices (e.g., display devices), and storage devices (e.g., disk drives). The memory and storage devices are computer-readable media that may contain instructions that
implement the multimodal search system. In addition, the data structures may be stored or transmitted via a data transmission medium, such as a signal on a communications link. Various communications links may be used to connect components of the system, such as the Internet, a local area network, a wide area network, a point-to-point dial-up connection, a cell phone network, and so on. The multimodal search system may be implemented in various operating environments that include personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, programmable consumer electronics, digital cameras, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and so on. The devices may include cell phones, personal digital assistants, smart phones, personal computers, programmable consumer electronics, digital cameras, and so on.
According to a fourth aspect is provided a computer-implemented method for identifying an image. The method includes receiving a multimodal query comprising a query image and a query modifier, wherein the query modifier modifies the query represented by the query image and/or adds attributes to it, wherein the query image has a corresponding query image intermodal
representation in a multimodal space and the query modifier has a corresponding query modifier intermodal representation in the multimodal space. The method includes receiving a plurality of candidate images, each having a corresponding candidate image intermodal representation in the multimodal space. The method includes determining for each of the candidate images a measure of correspondence to the multimodal query which includes the query image and the query modifier. The method includes identifying, from the plurality of candidate images at least one candidate image having a highest measure of correspondence.
Hence, the method allows to identify from the plurality of candidate images the at least one candidate image having closest conformity to the query image as modified according to the query modifier. According to a fifth aspect of the present invention there is provided a computer-readable medium containing instructions for performing a method according to the first aspect.
It is an advantage of the present invention that a user can provide a multimodal query and receive one or more query results using a mobile computing device, such as a smartphone. This can allow a user to search 'on-the-go', for example, by taking a photograph of an object or of an image of an object seen in a physical shop, on the street, on a television screen or advertising billboard, in a magazine or any other location, and searching for objects which are similar to the object but have one or more attributes modified and/or added.
It is an advantage of the present invention that a user can provide a multimodal query to a computer-readable medium containing instructions for performing a method according to the first aspect without requiring that the computer-readable medium is comprised in a mobile computing device. For example, a multimodal query may be composed by a user on a mobile computing device and transmitted wirelessly to a server which comprises a computer-readable medium containing instructions for performing a method according to the first aspect. This can allow that computationally intensive steps are not required to take place in the mobile computing device. This can allow search results to be evaluated on a server which may have greater processing power and thus search results may be provided more quickly than if the search is performed within the mobile computing device.
According to a sixth aspect of the present invention there is provided a system including functional units arranged for performing the steps of the above methods. Such system may include a user device and optionally a server. The system may include a processor and software code portions arranged for causing the processor to perform the method steps.
According to a seventh aspect of the present invention there is provided use of a method according to the first aspect or a device according to the second aspect in an e-commerce setting.
Particular and preferred aspects of the invention are set out in the accompanying independent and dependent claims. Features from the dependent claims may be combined with features of the independent claims and with features of other dependent claims as appropriate and not merely as explicitly set out in the claims. It will be appreciated that any of the aspects, features and options described in view of one of the methods apply equally to the other methods, the medium, the system and the use. It will also be clear that any one or more of the above aspects, features and options can be combined.
Brief description of the drawings
Further features of the present invention will become apparent from the examples and figures, wherein:
Figure 1 is a schematic representation of a multimodal query;
Figure 2 illustrates image fragments comprised in an image;
Figure 3 illustrates text fragments comprised in a query text;
Figure 4 illustrates a candidate image;
Figure 5 illustrates image fragments comprised in a candidate image;
Figure 6 is a flow chart showing a multimodal image search process;
Figure 7 shows a table listing example approximate positions of example image fragments and corresponding attributes which may be expected to be found in image fragments;
Figure 8 schematically illustrates image and text fragments, the image fragment cluster representative, their embedding or neural network representation, and a representation of their inner products;
Figure 9 is a table showing a comparison of different multimodal retrieval models
Detailed description of preferred embodiments
The present invention will be described with respect to particular embodiments and with reference to certain drawings but the invention is not limited thereto but only by the claims. The drawings described are only schematic and are non-limiting. In the drawings, the size of some of the elements may be exaggerated and not drawn on scale for illustrative purposes. Where the term
"comprising" is used in the present description and claims, it does not exclude other elements or steps. Where an indefinite or definite article is used when referring to a singular noun e.g. "a" or "an", "the", this includes a plural of that noun unless something else is specifically stated. The term "comprising", used in the claims, should not be interpreted as being restricted to the means listed thereafter; it does not exclude other elements or steps. Thus, the scope of the expression "a device comprising means A and B" should not be limited to devices consisting only of components A and B. It means that with respect to the present invention, the only relevant components of the device are A and B. Furthermore, the terms first, second, third and the like in the description and in the claims, are used for distinguishing between similar elements and not necessarily for describing a sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and that the embodiments of the invention described herein are capable of operation in other sequences than described or illustrated herein. Moreover, the terms top, bottom, over, under and the like in the description and the claims are used for descriptive purposes and not necessarily for describing relative positions. It is to be understood that the terms so used are interchangeable under appropriate circumstances and that the
embodiments of the invention described herein are capable of operation in other orientations than described or illustrated herein. In the drawings, like reference numerals indicate like features; and, a reference numeral appearing in more than one figure refers to the same element. The drawings and the following detailed descriptions show specific embodiments of devices and methods for evaluating an image in dependence on a multimodal query.
Referring to Figure 1, a schematic representation of a multimodal query according to embodiments of the present invention is shown. The query comprises a query image 1 and a query modifier 2. In this representation and embodiment, the query modifier 2 takes the form of query text 3 which specifies "find it with spots and long sleeves". However, in alternative embodiments the query modifier 2 may take the form of an audio or speech query, for example a spoken phrase.
In the example of Figure 1, the query image 1 includes a dress 4 which has characteristics, or attributes, such as dress length, colour, pattern (in this case, stripes), sleeve length. The query modifier 2 modifies the query represented by the query image by specifying at least one characteristic or attribute, in this case "spots", which for example is not present in the query image 1. The query modifier 2 may also include words, speech, audio, or terms which are not characteristics or attributes. For example, in Figure 1, the query text 3 includes the words "find it with" which are not characteristics or attributes of a product.
Referring to Figure 2, the query image 1 may be divided into at least one image fragment 5, in which at least one image attribute is likely to be found. The fragments 5 may overlap, that is, a portion of a first image fragment 5₁ may be present in a second image fragment 5₂. Referring to Figure 3, the query text 3 may be divided into at least one text fragment 6, each text fragment containing a textual attribute. A textual attribute can be any word or phrase which describes an attribute of a product to be searched for and is in general an attribute which is not present in the query image 1. For example, in the example shown in Figure 3, text fragment 6₁ specifies "spots" and text fragment 6₂ specifies "long sleeves".
As will be described in more detail hereinafter, each image fragment 5 is represented by a corresponding query image fragment intermodal representation. Each modifier, e.g. text fragment 6, is represented by a corresponding query modifier fragment intermodal representation, e.g. text fragment intermodal representation. The intermodal representations are preferably representations within a multimodal space wherein an image fragment and a text fragment have a high similarity as computed in the multimodal space if they represent the same attribute. In certain embodiments, the multimodal space may be a vector space and calculating the similarity may include calculating an inner product.
For example, the image fragment 5₁ includes the top of the dress 4. The top of the dress shows a dress pattern which includes stripes but does not include spots. The text fragment 6₁ specifies "spots". The intermodal representation of image fragment 5₁ and the intermodal representation of text fragment 6₁ would have a low value of similarity in the multimodal representation as they do not represent the same attribute, that is, the image fragment 5₁ represents the attribute of stripes and the text fragment 6₁ represents the attribute of spots.
On the other hand, an intermodal representation of a text fragment specifying "stripes" would have a high value of similarity with the intermodal representation of image fragment 5₁, as both fragments represent the attribute of stripes.
Referring to Figure 4, a candidate image 7 is shown. A similarity of the candidate image 7 with the multimodal query 1 is a measure of how closely attributes of the candidate image 7 match both the query image attributes and the query text attributes.
Referring to Figure 5, in a similar manner to that described in relation to the query image 1, the candidate image 7 may be divided into at least one image fragment 8, in which at least one image attribute is likely to be found. Each image fragment 8 is represented by a corresponding candidate image fragment
intermodal representation.
The similarity of the candidate image 7 with the multimodal query may then be calculated based upon both a visual similarity of query image fragments 5 and candidate image fragments 8, and a semantic similarity of query text fragments 6 and candidate image fragments 8. However, other methods for obtaining a measure of similarity known by the skilled person may be used as well. The visual similarity can be calculated in dependence on a similarity value of at least one query image fragment intermodal representation and at least one candidate image fragment intermodal representation. The semantic similarity can be calculated in dependence on a similarity value of at least one query text fragment intermodal representation and at least one candidate image fragment intermodal representation. An image fragment and a text fragment are considered to be semantically similar if they represent the same object and/or concept. For example, the phrase "long sleeves" is semantically similar to a fragment of an image showing the sleeves of a long-sleeved dress.
This allows a query result to be found which exhibits the same attributes as the query image 1, but with those attributes specified by the query modifier 2 added and/or changed. Put differently, the method described herein allows to find images which contain desired attributes of the query image 1 and contain attributes which are not present or are different in the query image 1, which are specified by the query modifier 2. Referring to Figure 6, a flowchart of a method according to embodiments of the present invention described herein is shown.
The multimodal query is received (step Si). The multimodal query comprises the query image 1 and query modifier 2. The query image 1 has at least one corresponding query image fragment intermodal representation in a multimodal space. The query modifier 2 has at least one corresponding query modifier fragment intermodal representation in the multimodal space.
A candidate image is received (step S2). The candidate image has at least one candidate image fragment intermodal representation in the multimodal space. The candidate image may be selected from a database of potential candidate images. The candidate image may be selected at random. The candidate image may be selected from a subset of candidate images in a database of potential candidate images.
A first similarity between the query image and the candidate image is calculated in dependence upon at least one of the at least one query image fragment intermodal representation(s) and at least one of the at least one candidate image fragment intermodal representation(s) (step S3).
A second similarity between the query modifier and the candidate image is calculated in dependence upon at least one of the at least one query modifier fragment intermodal representation(s) and at least one of the at least one candidate image fragment intermodal representation(s) (step S4).
An overall similarity between the multimodal query and the candidate image is calculated in dependence upon the first similarity and the second similarity (step S5).
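As a rough illustration of steps S3 to S5, the sketch below computes the two similarities from fragment intermodal representations and combines them. The aggregation used here (the best-matching candidate fragment per query fragment, averaged) and the weight lam are assumptions for illustration only; the text does not fix a particular aggregation at this point.

```python
# Illustrative sketch of steps S3-S5, assuming fragment intermodal representations
# are given as NumPy arrays with one row per fragment. The aggregation (max over
# candidate fragments, then mean over query fragments) and `lam` are assumptions.
import numpy as np

def fragment_similarity(query_frags, cand_frags):
    # Inner products between every query fragment and every candidate fragment.
    scores = query_frags @ cand_frags.T          # shape: (n_query, n_candidate)
    return scores.max(axis=1).mean()             # best-matching candidate fragment per query fragment

def overall_similarity(query_image_frags, query_modifier_frags, cand_frags, lam=1.0):
    s_visual = fragment_similarity(query_image_frags, cand_frags)       # step S3
    s_semantic = fragment_similarity(query_modifier_frags, cand_frags)  # step S4
    return s_visual + lam * s_semantic                                  # step S5
```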
In dependence on the overall similarity, the candidate image may be provided as an output image which satisfies the multimodal query. For example, an output image may be an image having an overall similarity which is greater than or equal to a similarity threshold. The similarity threshold may be
predetermined, that is, may not be controllable by a user of the search method. The similarity threshold may be controllable, that is, a user may be able to change the similarity threshold if desired. If the candidate image has an overall similarity with the multimodal query which is less than a similarity threshold, the candidate image may not be provided as output. A further, different candidate image may be received and the method may repeat from step S3.
In some embodiments, a set of candidate images may each be evaluated for overall similarity with the multimodal query and the set of candidate images may be ranked in accordance with their overall similarity. For example, the candidate images may be provided in a list where the position of a candidate image in the list depends upon the overall similarity of that candidate image with the multimodal query.
In some embodiments, a set of K candidate images may each be evaluated for overall similarity with the multimodal query and a subset of K' images may be provided as output, where K' is less than K and images in the subset are chosen as the K' images having the largest value of overall similarity. For example, if the K images are ranked in descending order of similarity, then the first K' images in the ranking would form the subset.
The method may be repeated for a plurality of candidate images and the candidate images may be ranked in order of overall similarity and may all be provided as output images. A subset of a plurality of candidate images may be provided as output images in dependence upon the overall similarity of each of the plurality of candidate images.
Aspects of the present invention will now be described in further detail.
Multimodal space
The multimodal space in preferred embodiments is induced by a neural network which learns parameters to project an image fragment and a text fragment to their corresponding intermodal representations in the common, multimodal space. In the multimodal space, the inner product or other similarity metric of an image fragment representation and a text fragment representation is a measure of their semantic similarity. If the inner product is positive, the image fragment and the text fragment are considered to represent the same attribute. If the inner product is negative, the image fragment and the text fragment are considered to represent different attributes. The more positive or negative the inner product, the greater is the certainty of the semantic similarity.
The neural network preferably learns parameters for projection to the intermodal representations by using a set of training data. The training data includes images and associated textual descriptions of the images. In order to extract attributes from images and textual descriptions, an image segmentation or a text segmentation can be performed as described in the following. However, it will be understood that any appropriate image segmentation method may be used in order to determine attributes of an image, regardless of whether the image is a query image 1, a candidate image 7, or an image in the set of test, training, or validation data (not shown). For example, the Selective Search method may be used. A first image segmentation method may be used for segmenting images in the training data and the same image segmentation method may be used for segmenting a query image. Alternatively, a second, different segmentation method may be used for segmenting a query image.
Image segmentation
The attributes of an image 1, 7 can be extracted using an image segmentation method according to embodiments of the present invention.
The exemplary images 1, 7 each include a dress and the segmentation method exploits the geometry common to images of dresses. For example, the overall shape of dresses is generally similar, but a variety of attributes such as length, neckline, colors can vary between individual dresses.
A bounding box is preferably determined which encloses the full dress. This may be achieved through a thresholding process which determines which pixels belong to the dress and which to the background. The region inside the bounding box is assigned to be a first image fragment.
In this example, the region inside the bounding box is further divided into six image fragments containing the top, the full skirt, the part of the skirt above the knee, the neckline, the left sleeve, and the right sleeve, where each fragment is a region of the image where attributes of the dress are likely to be found. For example, the left sleeve fragment is a region of the image where the attribute of sleeve length may be found. Other attributes that may be found in the left sleeve fragment are, for example, one or more of pattern, material, or colour attributes. With this rule-based segmentation approach each image has a plurality of image fragments corresponding to image regions where attributes are likely to be found. Figure 7 shows a table listing example approximate positions of example image fragments and corresponding attributes which may be expected to be found in image fragments. A location may be a rectangle represented as (x, y), w, h, with (x,y) the coordinates of the upper left corner of the rectangle, w the width and h the height of the rectangle. W is the width of the bounding box and H is the height of the bounding box. For example, the approximate location of the neckline of a dress is expected to be a rectangle having its upper left corner at the upper left corner of the bounding box, having a width equal to the width of the bounding box, and a height equal to 0.20 times the height of the bounding box.
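A minimal sketch of this rule-based segmentation is given below. Only the neckline rectangle (full width, top 0.20 of the bounding box height) follows from the text; the remaining fractions are illustrative placeholders rather than the values of Figure 7.

```python
# Sketch of the rule-based dress segmentation: fragments are fixed sub-rectangles
# of the dress bounding box. Only the neckline fractions (full width, top 0.20*H)
# come from the text; the other fractions are illustrative placeholders.
def dress_fragments(bx, by, W, H):
    """Return named fragment rectangles as (x, y, w, h) inside the bounding box (bx, by, W, H)."""
    return {
        "full_dress":       (bx, by, W, H),
        "neckline":         (bx, by, W, int(0.20 * H)),
        "top":              (bx, by, W, int(0.45 * H)),                              # placeholder fraction
        "full_skirt":       (bx, by + int(0.45 * H), W, H - int(0.45 * H)),          # placeholder fraction
        "skirt_above_knee": (bx, by + int(0.45 * H), W, int(0.30 * H)),              # placeholder fraction
        "left_sleeve":      (bx, by, int(0.30 * W), int(0.40 * H)),                  # placeholder fraction
        "right_sleeve":     (bx + int(0.70 * W), by, int(0.30 * W), int(0.40 * H)),  # placeholder fraction
    }
```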
In other embodiments of the present invention, the location and dimensions of image fragments may be determined in dependence upon the expected geometry of subjects of an image. For example, embodiments may be directed to searching for images of cars. A bounding box may be chosen which encloses the car and preferably excludes other objects which may be present in the image, such as trees or buildings. Fragments and corresponding expected locations may be chosen, for example, for one or more wheels of the car, one or more doors of the car, a front and/or rear window.
The image fragments are represented, in preferred embodiments, with the BVLC CaffeNet convolutional neural network (CNN) model. Alternatively, the AlexNet model may be used or any other convolutional neural network
architecture. In embodiments, the CaffeNet CNN may be pre-trained on ImageNet. The image fragment representations are acquired as the activation weights (or inputs) of the last fully connected layer before the softmax layer, which have dimension 4096 in the CNN architecture.
A convolutional neural network according to embodiments of the present invention includes at least one convolutional layer and may include one or more of at least one pooling layer, at least one normalization layer, at least one fully connected layer, a softmax layer. However, any suitable architecture of a neural network may be used to represent the image fragments. Any suitable architecture of a convolutional neural network may be used to represent the image fragments.
In some embodiments, a neural network is not used to represent the image fragments. A Scale-Invariant Feature Transform method may be used to represent the image fragments.
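A sketch of extracting 4096-dimensional fragment representations is shown below. Since CaffeNet weights are not bundled with torchvision, the sketch uses torchvision's AlexNet (mentioned above as an alternative) with a recent torchvision API; this is an assumption, not the exact model used.

```python
# Sketch: 4096-d fragment representation from the last fully connected layer
# before the softmax, using torchvision's AlexNet as a stand-in for CaffeNet.
import torch
from torchvision import models, transforms
from PIL import Image

alexnet = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
alexnet.eval()
# Keep the classifier up to (and including) the activation after the second 4096-d layer.
feature_head = torch.nn.Sequential(*list(alexnet.classifier.children())[:-1])

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def fragment_features(fragment: Image.Image) -> torch.Tensor:
    """Return a 4096-d feature vector for one image fragment (a cropped PIL image)."""
    x = preprocess(fragment).unsqueeze(0)
    with torch.no_grad():
        conv = alexnet.avgpool(alexnet.features(x)).flatten(1)
        return feature_head(conv).squeeze(0)      # shape: (4096,)
```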
Text segmentation
Word embeddings, or vector representations for words, according to embodiments of the present invention may be trained on text descriptions using a distributional semantic model, for example the Skip-gram model, or a latent word language model. This allows the learning of a single word embedding for multiword fashion expressions (for example, long sleeves) and allows to better capture the syntax and semantics of phrases likely to be included in the text descriptions. However, any neural network architecture, or other method which can be used to train word embeddings, for example a Latent Semantic Analysis method, may be used.
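A minimal sketch of training such Skip-gram embeddings with gensim is given below, assuming the product descriptions have already been tokenized and multiword fashion phrases merged into single tokens (for example "long_sleeves"); the toy sentences and the gensim 4.x parameter names are illustrative assumptions.

```python
# Sketch: 300-d Skip-gram embeddings over product descriptions, with multiword
# fashion phrases already merged into single tokens such as "long_sleeves".
# Uses gensim 4.x parameter names; `sentences` is a tiny hypothetical corpus.
from gensim.models import Word2Vec

sentences = [
    ["sleeveless", "sheath", "dress", "with", "v_neck", "and", "floral_print"],
    ["casual", "shift", "dress", "with", "long_sleeves"],
]

model = Word2Vec(
    sentences,
    vector_size=300,   # dimension of the word embeddings
    window=5,          # context size of 5 phrases before and after
    sg=1,              # Skip-gram (rather than CBOW)
    min_count=1,       # keep rare tokens in this toy example
)
vec = model.wv["long_sleeves"]   # 300-d embedding for the phrase
```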
The syntax and semantics of text descriptions may be different for different implementations. For example, the style of writing of clothes descriptions on a fashion e-commerce website is likely to be different to that of descriptions of cars on a second-hand car sales website.
To acquire the text fragments, all words are first converted to lowercase and all non-alphanumeric characters are preferably removed. Words occurring at a low frequency may also be removed, for example, words which occur less than 5 times in the training data set. Next, the text descriptions are preferably filtered to remove phrases which are not related to the subject of the associated images. For example, in the case of images of dresses with corresponding text descriptions, the glossary of an online clothing shop, e.g. the online clothing shop Zappos, may be used, which contains both single word and multiword expressions related to fashion.
Although this can remove much noise from the text descriptions, some noise may remain. Remaining words or phrases may refer to parts of the subject of an image which are not visible in the image. For example, an image may include a front view of a dress and the associated text description may refer to the back of the dress.
Next, each phrase is considered as a text fragment. Thus, the number of text fragments may differ for different text descriptions, and some text descriptions may not result in any text fragments.
It will be understood that the Zappos glossary approach is only one example of a method of acquiring text fragments. Other glossaries may be chosen which are appropriate for the subject of the images and/or text descriptions. In other embodiments, a segmentation method need not use a glossary.
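The preprocessing and glossary filtering can be sketched as follows. The tiny GLOSSARY below is a hypothetical stand-in for the Zappos glossary, and the longest-match-first strategy is one simple choice, not necessarily the one used.

```python
# Sketch of text-fragment extraction: lowercase, strip non-alphanumeric characters,
# then keep only glossary phrases (longest phrases matched first). The glossary
# below is a tiny stand-in for the Zappos glossary.
import re

GLOSSARY = ["long sleeves", "floral print", "sleeveless", "v-neck", "sheath"]

def text_fragments(description: str) -> list[str]:
    text = description.lower()
    text = re.sub(r"[^a-z0-9\- ]+", " ", text)          # keep hyphens for terms like v-neck
    fragments = []
    for phrase in sorted(GLOSSARY, key=len, reverse=True):
        if phrase in text:
            fragments.append(phrase)
            text = text.replace(phrase, " ")            # avoid matching sub-phrases twice
    return fragments

print(text_fragments("Sleeveless sheath dress with a V-neck and floral print!"))
# -> ['floral print', 'sleeveless', 'v-neck', 'sheath']
```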
Neural network
The neural network preferably learns projection parameters
{W_v, b_v, W_s, b_s} which project an image fragment v_i and a text fragment s_j to their intermodal representations in the common multimodal space according to:
v_i = f(W_v v_i + b_v)   (1)
s_j = f(W_s s_j + b_s)   (2)
The activation function f is set to the rectified linear unit (ReLU), which computes f(x) = max(0, x). W_v has dimensions h x 4096 and W_s has dimensions h x dim, where h is the size of the common, multimodal space and dim is the dimension of the word embeddings. Parameters b_v and b_s are bias terms.
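A minimal NumPy sketch of equations (1) and (2) is given below; the weight matrices are random placeholders standing in for learned parameters.

```python
# Minimal sketch of the projections in equations (1) and (2): image fragments
# (4096-d CNN features) and text fragments (dim-d word embeddings) are mapped
# into a common h-dimensional multimodal space with a ReLU activation.
# The weights below are random placeholders, not learned parameters.
import numpy as np

h, dim = 1000, 300
rng = np.random.default_rng(0)
W_v, b_v = rng.standard_normal((h, 4096)) * 0.01, np.zeros(h)
W_s, b_s = rng.standard_normal((h, dim)) * 0.01, np.zeros(h)

relu = lambda x: np.maximum(0.0, x)

def project_image_fragment(v):          # v: (4096,) CNN feature
    return relu(W_v @ v + b_v)          # equation (1)

def project_text_fragment(s):           # s: (dim,) word embedding
    return relu(W_s @ s + b_s)          # equation (2)

# A positive inner product indicates the two fragments represent the same attribute.
score = project_image_fragment(rng.standard_normal(4096)) @ project_text_fragment(rng.standard_normal(dim))
```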
If the inner product v_i^T s_j of image fragment v_i and text fragment s_j is positive, then the image fragment and the text fragment are considered to represent the same attribute. When the inner product is negative, the image fragment and the text fragment are not considered to represent the same attribute. The more positive or negative the inner product, the more certainty there is about the semantic similarity between the image fragment and the text fragment. The neural network is trained with the objective function
C(θ) = α C_F(θ) + β C_G(θ) + γ C_I(θ)   (3)
where C_F(θ) is a fragment alignment objective, C_G(θ) is a global ranking objective, and C_I(θ) is an image cluster consistency objective. θ refers to the network parameters and α, β, and γ are hyperparameters to be cross-validated, which are set in dependence on a validation set. A hyperparameter is a parameter which is set in advance of the training of the neural network, and is not a parameter which is learned by the model. A validation set is a subset of the training data set, which is used to determine appropriate values for the hyperparameters. Cross-validation aims to assess whether a value chosen for a hyperparameter is appropriate. An appropriate value for a hyperparameter is one which assigns a weight to an objective (C_F(θ), C_G(θ), or C_I(θ)) such that the objective function C(θ) guides the neural network model to learn projection parameters which provide a multimodal space which reflects the semantic similarity of image and text fragments.
Each of the objectives CF(9), CG (9), C/(0) is concerned with a different aspect or characteristic of the training data set.
However, the present invention is not limited to use of this objective function. For example, in some embodiments an objective function may include product-dependent constraints.
Fragment alignment objective
The fragment alignment objective C_F(θ) uses fragment co-occurrence information to infer the semantic similarity ṽ_i^T s̃_j of image fragment v_i and text fragment s_j.
In the training data set, for a given image comprising a set of at least one image fragment and its associated text description comprising a set of at least one text fragment, it is not known which of the image fragments and text fragments refer to the same attribute. For example, the neural network initially has no knowledge that an image fragment 82 (Figure 5) and a text fragment "long sleeves" refer to the same attribute, that is, are semantically similar.
However, for a given text fragment it can be assumed that there is at least one corresponding image fragment in the set of image fragments of the given image. Additionally, the image fragments of all images which do not have a particular text fragment present in their associated text descriptions can be expected to not show the corresponding attribute in their image fragments.
For example, if a text description contains the text fragment "v-neck", then it may be assumed that at least one image fragment in the set of image fragments of the associated image shows a v-neck neckline. Conversely, text descriptions which do not include the text fragment "v-neck" are probably associated with images which do not have any image fragments showing the attribute "v-neck".
The fragment alignment objective C_F(θ) uses these assumptions to learn an intermodal representation. The fragment alignment objective may be formulated as

C_F(θ) = min_{y_ij} C_0(θ)    (4)

C_0(θ) = Σ_{i,j} max(0, 1 - y_ij ṽ_i^T s̃_j)    (5)

subject to Σ_{i ∈ P_j} (y_ij + 1)/2 ≥ 1    ∀j    (6)

y_ij = -1    ∀i, j subject to m_v(i) ≠ m_s(j)    (7)

and y_ij ∈ {-1, 1}.    (8)
All image fragments v_i and text fragments s_j in the training set are considered by the fragment alignment objective. The variable y_ij reflects whether v_i and s_j are expected to show the same attribute (y_ij = 1) or to not show the same attribute (y_ij = -1), and consequently whether their semantic similarity score ṽ_i^T s̃_j should be encouraged to be more than 1 or less than -1 (equation 5). The value for the variable y_ij is determined based on the assumptions above. For text fragment s_j in a given text description, at least one image fragment v_i in the image associated with the given text description is expected to show the attribute expressed by s_j (equation 6). In equation 6, P_j is the collection of image fragments which occur with s_j, that is, all image fragments in the image which is associated with the text to which s_j belongs. In order to determine which image fragment shows the attribute, the fragment alignment objective C_F(θ) attempts to find the variables y_ij which minimize equation 5 (see also equation 4). An image fragment v_i and a text fragment s_j belonging to a non-corresponding image-text pair (m_v(i) ≠ m_s(j)) are expected to represent different attributes (equation 7). In equation 7, m_v(i) is the identification number of the image in the training set to which image fragment v_i belongs. m_s(j) is the identification number of the text description in the training set to which text fragment s_j belongs. That is, for a given image fragment and a given text fragment, if m_v(i) = m_s(j) then the image fragment and the text fragment belong to an image and a text description, respectively, which form a corresponding image-text pair.
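The following numpy sketch illustrates equations (4) to (8) under a simplifying assumption: constraint (6) is satisfied by aligning each text fragment with the single best-scoring image fragment of its corresponding image, while all fragments of non-corresponding pairs receive y_ij = -1. This is one possible heuristic, not necessarily the exact optimisation used.

import numpy as np

def fragment_alignment_loss(V_tilde, S_tilde, img_id, txt_id):
    """Simplified sketch of equations (4) to (8).

    V_tilde: (N, h) intermodal image fragment representations
    S_tilde: (T, h) intermodal text fragment representations
    img_id:  length-N array, identification number m_v(i) of each image fragment's image
    txt_id:  length-T array, identification number m_s(j) of each text fragment's description
    """
    img_id = np.asarray(img_id)
    scores = V_tilde @ S_tilde.T                      # v_i^T s_j for all fragment pairs
    y = -np.ones_like(scores)                         # equation (7): default to -1
    for j in range(S_tilde.shape[0]):
        P_j = np.where(img_id == txt_id[j])[0]        # fragments of the corresponding image
        if P_j.size:
            # Heuristic satisfying constraint (6): align s_j with its best-scoring
            # image fragment within the corresponding image.
            best = P_j[np.argmax(scores[P_j, j])]
            y[best, j] = 1.0
    return float(np.maximum(0.0, 1.0 - y * scores).sum())   # equations (4) and (5)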
Since the fragment alignment objective benefits from a good initialization of the intermodal representations, it is trained with y_ij = 1 for all v_i and s_j of corresponding image-text pairs during the first 15 epochs. An epoch is a single forward pass of the entire training set through the neural network, followed by a single backward pass of the training set through the neural network. In general terms, an epoch may be referred to as one full training cycle. Subsequently, the fragment alignment objective is changed to equation 4 for refinement of the intermodal representations.
A good initialization is one which already partially captures the semantics of image and text fragments in the intermodal representations. This is preferable to, for instance, an initialization which uses randomly chosen
intermodal representations, as these do not capture any semantic information at the initialization stage.
In this example, when the fragment alignment objective is trained with y_ij = 1 for all v_i and s_j of corresponding image-text pairs during the first 15 epochs, the model learns that all image and text fragments which occur together are semantically similar and all image and text fragments which do not occur together are not semantically similar. This is not correct, because a text fragment specifying "V-neck" is only semantically similar to an image fragment showing a V-neck and not to other image fragments of an image showing, for example, sleeves or a skirt. However, if the model is initially trained in this way, it is expected that the model will already learn something about the semantics of image fragments and text fragments. This is because some combinations of image and text fragments may occur more frequently. In the example of the V-neck, in the complete dataset, the phrase "V-neck" may occur in a text description which is associated with an image having different image fragments (which may show, for example, different sleeve lengths, different forms and lengths of skirts, ...). However, the phrase "V-neck" will always occur with an image fragment showing a V-neck. So, the combination of the text fragment "V-neck" with an image fragment showing a V-neck is expected to occur in the dataset with a greater frequency than a combination of the text fragment "V-neck" and an image fragment which does not show a V-neck.
Therefore, training the model in this way for the first 15 epochs may result in a good initialization.
The number of epochs during which the model is trained with y_ij = 1 for all v_i and s_j of corresponding image-text pairs is not limited to 15. The model can be trained for any number of epochs.
It will be noted that the assumption made by the fragment alignment objective may not always be true. For example, noise in a text description may result in text fragments which are not semantically related to any of the image fragments of the image associated with the text description. If a text description is incomplete, an image fragment may show an attribute which does not have a corresponding text fragment in the text description. However, in many cases, an intermodal representation may be inferred even when the training data available is noisy. This can be an advantage in applications where there exists no high-quality set of training data. In these cases, noisy training data may be extracted from, for example, an e-commerce website and used to train the neural network.
Global ranking objective
Semantic similarity can also be derived from global image-text correspondence. A first image and a first corresponding text description, forming a first image-text pair, should have a higher total semantic similarity than the first image and a second text description which belongs to a second, different image-text pair. That is, the first image does not correspond to the second text description, as they belong to different image-text pairs. This is encoded by the global ranking objective C_G(θ) as follows:

C_G(θ) = Σ_k [ Σ_l max(0, S_kl - S_kk + Δ)    (rank images)
             + Σ_l max(0, S_lk - S_kk + Δ) ]    (rank texts)    (9)
The global ranking objective forces corresponding image-text pairs (k = l) to have a higher total semantic similarity score S_kk (by a margin Δ) than non-corresponding image-text pairs. Here, the total semantic similarity score S_kl of an image k and text l is computed based on the semantic similarity scores of their respective fragments f_k and f_l according to the following equation:

S_kl = (1 / (|f_l| + n)) Σ_{s_j ∈ f_l} Σ_{v_i ∈ f_k} ṽ_i^T s̃_j    (10)
where n is a smoothing term and hyper-parameter which is chosen to prevent shorter texts from having an advantage over longer texts. For example, consider a first text t1 including one text fragment, a second text t2 including five text fragments, and an image including at least one image fragment, and the requirement to determine the similarity between each of the texts and the image. In this example, the text fragment of the short text t1 may have only a relatively small positive inner product with the image fragment of image k which best matches the text fragment of the short text t1. The long text t2 may have two text fragments which each have a relatively large positive inner product with an image fragment, and three text fragments which each have a relatively small, or a negative, inner product. In this example, the similarity between the second text and the image would be greater than the similarity between the first text and the image. However, equation 10 requires division by the number of text fragments in the text (|f_l|). If the smoothing term n is not included, then the similarity value for the short text t1 is divided by 1 and the score for the long text t2 is divided by 5, which can lead to the similarity between t1 and the image being greater than the similarity between t2 and the image, which is not desirable. The smoothing term n aims to counterbalance this discrepancy. The smoothing term may alternatively be chosen in dependence upon a validation set.
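A sketch of the smoothed total similarity of equation (10) and the margin ranking loss of equation (9) is given below, assuming numpy and the unthresholded sum of fragment inner products implied by the discussion above; the default values n = 10 and Δ = 40 are those reported later in the example implementation.

import numpy as np

def total_similarity(img_frags, txt_frags, n=10):
    """Equation (10): smoothed total similarity of an image and a text.

    img_frags: (a, h) intermodal image fragment representations of image k
    txt_frags: (b, h) intermodal text fragment representations of text l
    n:         smoothing term preventing short texts from having an advantage
    """
    return float((img_frags @ txt_frags.T).sum()) / (txt_frags.shape[0] + n)

def global_ranking_loss(images, texts, delta=40.0, n=10):
    """Equation (9): corresponding pairs (k == l) must outscore non-corresponding
    pairs by a margin delta, both when ranking images and when ranking texts."""
    K = len(images)
    S = np.array([[total_similarity(images[k], texts[l], n) for l in range(K)]
                  for k in range(K)])
    diag = np.diag(S)
    rank_images = np.maximum(0.0, S - diag[:, None] + delta)   # S_kl vs S_kk, fixed image k
    rank_texts = np.maximum(0.0, S - diag[None, :] + delta)    # S_lk vs S_kk, fixed text k
    np.fill_diagonal(rank_images, 0.0)
    np.fill_diagonal(rank_texts, 0.0)
    return float(rank_images.sum() + rank_texts.sum())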
Image cluster consistency objective
Text descriptions in the training data set may be noisy and/or incomplete.
This may interfere with the assumption made by the fragment alignment objective. The image cluster consistency objective C_I(θ) attempts to deal with this noise and incompleteness by exploiting the fact that image fragments which look similar probably have at least one attribute in common. For example, if image fragments v_i and v_k look similar and image fragment v_i has a high positive semantic similarity score with the text fragment "v-neck", then v_k probably also shows a v-neck.

Conversely, if v_i has a negative semantic similarity score with the text fragment "blue", then the semantic similarity score of v_k and the text fragment "blue" should also be negative.

The idea is that visually similar image fragments probably show the same product attributes. Therefore, if a product attribute occurs in the description of multiple image fragments that are visually similar to image fragment v_i but not in the description of image fragment v_i itself, it can be assumed it also describes a property of image fragment v_i. This can aid in solving incompleteness of data.

Conversely, if a product attribute occurs in the description of image fragment v_i, but is never used to describe image fragments visually similar to v_i, it can be assumed that the product attribute should not be in the description of image fragment v_i. This can aid in removing noise from the data set.

Thus, if image fragment v_k shows a v-neck neckline, but there is no text fragment in the corresponding text description which specifies "v-neck", the image cluster consistency objective may infer the semantic relatedness of image fragment v_k and text fragment "v-neck".

Conversely, if image fragment v_k has an associated text description having a text fragment "blue" and image fragment v_k does not display the attribute of blue, the image cluster consistency objective may infer that this colour attribute is incorrect when considering the negative semantic similarity of a similar image fragment v_i with the text fragment "blue". This can allow the recovery of information which may be otherwise lost.
To identify similar image fragments, the image fragments are clustered in C clusters based on cosine distance with k-means clustering. However, any suitable clustering method may be used, for example hierarchical agglomerative clustering. A clustering separates vectors into groups (or clusters) of vectors, where the vectors in each group are similar to each other. A high cosine similarity indicates high similarity; however, any suitable similarity metric may be used. The image cluster consistency objective may then be expressed as
C_I(θ) = Σ_{m=1}^{M} Σ_i Σ_j cos(v_i, c_i) |ṽ_i^T s̃_j - c̃_i^T s̃_j|    (11)
This objective considers all M image-text description pairs in the training set, and for each pair sums over its image fragments v_i and text fragments s_j. This objective encourages the difference between the semantic similarity score of a first image fragment v_i with corresponding first text fragment s_j and the semantic similarity score of a second image fragment c_i (which is similar to image fragment v_i) with the first text fragment s_j to be as small as possible. The second, similar image fragment c_i is taken to be the centroid of the cluster of v_i. The centroid of a cluster is the average of all vectors in that cluster. However, the second image fragment c_i may be taken to be the medoid of the cluster or the nearest neighbor in the same cluster. The medoid of a cluster is the vector having the maximum average similarity with all other vectors in the cluster. The nearest neighbor of a vector is the vector which is most similar. The difference in semantic similarity scores is weighted by a factor based on the cosine similarity of the image fragment and its centroid. This weighting factor allows the image cluster consistency objective to attempt to prevent image fragments in the same cluster from being semantically related to the same text fragment in the case where the image fragments do not have a high degree of visual similarity. This can help to prevent the introduction of errors due to defects in the clustering.
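A sketch of the clustering step and the centroid-based consistency penalty is given below. It assumes scikit-learn's KMeans on L2-normalised fragments (so that Euclidean k-means approximates cosine-distance clustering) and approximates the centroid's intermodal representation by averaging the intermodal representations of the cluster members; both are assumptions of this sketch rather than requirements of the method.

import numpy as np
from sklearn.cluster import KMeans   # assumed available; any clustering method may be used

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def image_cluster_consistency_loss(V_raw, V_tilde, S_tilde, frag_pairs, C=10000):
    """Sketch of equation (11).

    V_raw:      (N, 4096) raw image fragment features used for clustering
    V_tilde:    (N, h) intermodal image fragment representations
    S_tilde:    (T, h) intermodal text fragment representations
    frag_pairs: list of (i, j) index pairs belonging to the same image-text pair
    C:          number of clusters (10000 in the example implementation)
    """
    # Cluster L2-normalised fragments so that Euclidean k-means approximates
    # clustering on cosine distance.
    normed = V_raw / (np.linalg.norm(V_raw, axis=1, keepdims=True) + 1e-12)
    km = KMeans(n_clusters=min(C, len(V_raw)), n_init=10).fit(normed)
    # Approximate the centroid's intermodal representation by the mean of the
    # intermodal representations of the cluster members (an assumption of this sketch).
    centroids_tilde = np.vstack([V_tilde[km.labels_ == c].mean(axis=0)
                                 for c in range(km.n_clusters)])
    loss = 0.0
    for i, j in frag_pairs:
        c_tilde = centroids_tilde[km.labels_[i]]
        weight = cosine(V_raw[i], km.cluster_centers_[km.labels_[i]])   # visual similarity weight
        loss += weight * abs(float(V_tilde[i] @ S_tilde[j]) - float(c_tilde @ S_tilde[j]))
    return loss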
Figure 8 schematically illustrates an embodiment of the present invention, for example image and text fragments, their embedding or neural network representation, and inner product. The neural network preferably learns intermodal representations for the image and text fragments, such that
semantically related image and text fragments have a high inner product (shown as heavier shading).
Multimodal search
A multimodal search according to embodiments of the present invention may be carried out using the intermodal representations inferred by the neural network. It will be understood that the training of the neural network does not need to be carried out each time a search is required. The training may occur once and may occur at a different time and/or on a different device to that used to request or carry out the multimodal search.
The multimodal search may rank candidate images i_c in dependence upon their relevance to a multimodal query q = {i_q, t_q} which includes a query image i_q and query text t_q. The query text t_q is considered to be a query modifier. In some embodiments, the query modifier may be first provided in a different format, for example in audio format, and subsequently converted to text format by, for example, a speech-to-text conversion method. Thus, the term "query modifier" can refer to a query which is initially provided in text format or in another format. It is also possible that the query modifier is in a non-text format, such as speech, and has a non-text, such as speech, fragment intermodal representation in the multimodal space.
If the query modifier is provided in a non-text format, it is preferably first converted to text format using any suitable method. The resulting text is then the query text t_q.
A candidate image i_c is preferably evaluated for its similarity to the query q as follows. The query image i_q and the candidate image i_c are segmented using any suitable segmentation method, producing n image fragments v_i,q and v_i,c for the query image and the candidate image respectively, where i runs from 1 to n and n is a positive integer. The query text t_q is segmented using any suitable segmentation method, producing m text fragments s_j,q, where j runs from 1 to m. The text fragments s_j,q refer to attributes to be added and/or interchanged with attributes of the query image i_q. Image and text fragments are represented using intermodal representations as inferred by the neural network. Candidate image fragment intermodal representations are denoted as ṽ_i,c. Query image fragment intermodal representations are denoted as ṽ_i,q. Query text fragment intermodal representations are denoted as s̃_j,q. A cosine similarity measure is used, which may be for example the inner product scaled by normed vectors, to measure both the visual similarity of two image fragments and the semantic similarity of an image and text fragment. However, any suitable similarity measure could be used.
The similarity of the query q = {i_q, t_q} with a candidate image i_c may preferably then be calculated as follows:

sim(q, i_c) = sim(i_q, i_c) + sim(t_q, i_c)    (12)

sim(i_q, i_c) = Σ_{i=1}^{n} cos(ṽ_i,q, ṽ_i,c)    (13)

sim(t_q, i_c) = w Σ_{j=1}^{m} |{ i : s̃_j,q^T ṽ_i,c > 0 }| if every s̃_j,q has a positive inner product with at least one ṽ_i,c, and -∞ otherwise    (14)
A candidate image i_c is considered to be relevant if it is visually similar to query image i_q and exhibits every attribute expressed by query text t_q (equation 12). However, in some embodiments, it may be that every attribute of query image i_q and/or query text t_q need not be present in the candidate image i_c for the candidate image i_c to be considered to be relevant. In these embodiments, a different retrieval model may be used. Candidate image i_c resembles query image i_q if image fragments v_i,q and v_i,c of corresponding image parts i are similar (equation 13).

Candidate image i_c satisfies query text t_q if every query text fragment s_j,q is shown in at least one image fragment v_i,c of the candidate image. Put differently, candidate image i_c satisfies query text t_q if every s̃_j,q has, for example, a positive inner product with at least one ṽ_i,c. In some embodiments, other similarity metrics may be used. Consequently, a candidate image which is similar to the query image but which does not show one or more of the attributes specified in the query text will receive a similarity of minus infinity.
The relevance of a candidate image depends on the number of candidate image fragments which display attributes corresponding to query text fragments (equation 14). In equation 14, w is a weighting term, which is a positive number, which controls the weight given to the query text relative to the query image. A smaller weight will give more relevance to a candidate image which is similar to the query image, whereas a larger weight will give more relevance to images which satisfy the query text but are less similar to the query image. The weighting term may be chosen based on a validation set.
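A minimal sketch of the retrieval score of equations (12) to (14), as reconstructed above, is given below. It assumes that the fragments have already been projected into the multimodal space and that query and candidate image fragments correspond by index; the default w = 2.5 is the value reported in the example implementation.

import numpy as np

def cos_sim(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def query_similarity(V_q, V_c, S_q, w=2.5):
    """Equations (12) to (14): score a candidate image against a multimodal query.

    V_q: (n, h) query image fragment representations
    V_c: (n, h) candidate image fragment representations (corresponding parts, by index)
    S_q: (m, h) query text fragment representations
    w:   weight given to the query text relative to the query image
    """
    # Equation (13): visual similarity over corresponding image parts.
    sim_image = sum(cos_sim(vq, vc) for vq, vc in zip(V_q, V_c))
    # Equation (14): every query text fragment must match at least one candidate fragment.
    sim_text = 0.0
    for s in S_q:
        matches = sum(1 for vc in V_c if cos_sim(s, vc) > 0)
        if matches == 0:
            return float("-inf")      # a missing attribute disqualifies the candidate
        sim_text += matches
    # Equation (12): overall similarity of the multimodal query and the candidate image.
    return sim_image + w * sim_text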
The searching method may be implemented using a multimodal search system. The multimodal search system may include a central processing unit, memory, input devices (e.g., keyboard and pointing devices), output devices (e.g., display devices), and storage devices (e.g., disk drives). The memory and storage devices are computer-readable media that may contain instructions that
implement the multimodal search system. In addition, the data structures may be stored or transmitted via a data transmission medium, such as a signal on a communications link. Various communications links may be used to connect components of the system, such as the Internet, a local area network, a wide area network, a point-to-point dial-up connection, a cell phone network, and so on.
Embodiments of the multimodal search system may be implemented in various operating environments that include personal computers, server
computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, programmable consumer electronics, digital cameras, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and so on. The devices may include cell phones, personal digital assistants, smart phones, personal computers,
programmable consumer electronics, digital cameras, and so on.
The multimodal search system may be described in the general context of computer-executable instructions, such as program modules, executed by one or more computers or other devices. Generally, program modules include routines, programs, objects, components, data structures, and so on that perform particular tasks or implement particular abstract data types. Typically, the functionality of the program modules may be combined or distributed as desired in various embodiments.
The storage devices may include a database of images which may be evaluated as candidate images for a multimodal query.
A mobile computing device may be used to provide a multimodal query. A mobile computing device may include a central processing unit, memory, input devices (e.g., buttons or a touchscreen), output devices (e.g., a display screen), and storage devices (e.g., disk drives). A mobile computing device may include a camera. A user may find an image of a product, for example by taking a picture of the product using a camera or by selecting an image seen online, for example on a social networking website or a news website, and may wish to search for a product similar to the product shown in the image, but with one or more features changed or added.
The user provides the query image and a query modifier to the mobile computing device. The user may provide the query modifier by providing a text input to the mobile computing device. The mobile computing device may include a microphone and the user may provide the query modifier by speaking into the microphone.
The mobile computing device is configured to receive a multimodal query and to provide one or more output images which satisfy the query. The mobile computing device provides the multimodal query comprising the query image and query modifier to a multimodal search system. The multimodal search system may be included in the mobile computing device. Alternatively, the multimodal search system may be a remote system, that is, at a different physical location to the mobile computing device. The mobile computing device may provide the multimodal query through, for example, a wired or wireless internet connection or a Bluetooth connection. The multimodal search system is configured to receive the multimodal query, to perform the multimodal search method, and to provide one or more output images which satisfy the query. The one or more output images may be provided to the mobile computing device through, for example, a wired or wireless internet connection or a Bluetooth connection.
Example implementation
In the following, an example implementation of the multimodal search method is described. It will be understood that parameters are provided as examples only and not as limiting values. Models used are provided as examples and alternatives may be used as disclosed hereinbefore.
In this example, the training dataset used is a dataset of 53,689 image-text pairs collected from the Amazon webshop between January and February
2015. Each pair consists of an image of a dress and the corresponding textual product description.
The images in the dataset illustrate dresses for different occasions, such as bridesmaid, casual, cocktail, wedding, work, and thus a variety of fine-grained fashion attributes are displayed. The corresponding text descriptions include surrounding natural language text of the webshop, for example the name of the product, product features, and editorial content. The text descriptions describe the image content but may be incomplete and/or noisy, and may include misspellings, incorrect grammar, and incorrect punctuation.
The neural network is trained using 48,689 image-text pairs in the dataset. 4000 image-text pairs are used for validation and 1000 are used for testing.
During testing, the quality of the inferred intermodal representations is evaluated in a multimodal retrieval setting. 100 multimodal queries were collected from 10 test subjects. Two men and eight women each created 10 multimodal queries comprising real search requests created whilst browsing the dataset. The query images are taken from the test dataset and the query modifiers comprise one to three phrases from the Zappos glossary, each phrase denoting a fashion attribute to be modified or added to the corresponding query image. The Zappos glossary includes more than 200 phrases, 47 of which were chosen by the test subjects in their multimodal queries. A retrieved image is considered to be relevant if it resembles the query image and satisfies the query text. In this example, it is required that all fashion attributes of the query image are present in the retrieved image, and that all fashion attributes requested in the query text are either added or interchanged with the necessary query image fashion attributes. However, since the test dataset in this example included only 1000 images, each displaying a wide variety of fashion attributes, it would be expected that for some multimodal queries there may be no relevant images.
For training the neural network, for the images in the training dataset, a rule-based segmentation approach based on garment geometry is used. C groups of similar image fragments are found using k-means clustering on the image fragments. In this example, test values of C were 500, 2500, 5000, 7500, 10000, 12500, 15000, 17500, and 20000. In this example, it was found that the most suitable value for C was 10000. This was evaluated by clustering the image fragments in C clusters. For each cluster of image fragments, the frequency of occurrence of all text fragments in that cluster is calculated. To annotate an image in the validation set, the image fragments of this image and the clusters of these image fragments are considered. The frequencies of all text fragments are aggregated over all of these clusters. Finally, the image is annotated with the K text fragments having the highest frequency. Then the value for C is selected as the value which results in the largest precision@K and recall@K across a selection of values of K from K=1 to V, where V is equal to the number of text fragments in the Zappos glossary.
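The validation procedure for selecting C described above might be sketched as follows; the data structures and helper names are hypothetical and only illustrate the annotate-then-score loop.

from collections import Counter

def annotate_image(fragment_cluster_ids, cluster_phrase_counts, K):
    """Annotate a validation image with the K most frequent text fragments, aggregated
    over the clusters of the image's fragments.

    fragment_cluster_ids:  cluster id of each image fragment of this image
    cluster_phrase_counts: dict mapping a cluster id to a Counter of the text fragments
                           observed in that cluster on the training set
    """
    counts = Counter()
    for c in fragment_cluster_ids:
        counts.update(cluster_phrase_counts.get(c, {}))
    return [phrase for phrase, _ in counts.most_common(K)]

def precision_recall_at_k(predicted, gold):
    """precision@K and recall@K of a predicted annotation against the gold text fragments."""
    hits = len(set(predicted) & set(gold))
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall

The value of C would then be chosen as the one which maximises these scores averaged over the validation images and over the selected values of K.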
For the text descriptions in the training dataset, 300-dimensional word embeddings were trained using the Skip-gram model. The product (text) descriptions are concatenated, all words are converted to lower case, and non-alphanumeric characters are removed. The Skip-gram model is trained on the resulting text, where each fashion phrase, or group of words relating to a fashion attribute likely to be found in an image of a fashion item, is treated as a single word. A context size of 5 is considered. The context size is the number of phrases occurring before and after each phrase which are considered to make up the context of a particular phrase when training the word embeddings. As word embeddings learn semantic representations for words or phrases, the context in which a word or phrase occurs provides information about the meaning of that word or phrase. The product descriptions are then filtered, using the Zappos glossary, to retain only fashion phrases. The phrases which remain after the filtering step are considered as the textual fashion attributes, and are text fragments.
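As a sketch, the Skip-gram training step could look as follows using the gensim library (an assumption; any Skip-gram implementation would do). The phrase-merging helper is hypothetical, and min_count=1 is used only so that the toy example runs; the actual preprocessing removes words occurring fewer than 5 times.

from gensim.models import Word2Vec   # assumed; any Skip-gram implementation could be used

def merge_phrases(tokens, glossary_phrases):
    """Hypothetical helper: join multiword glossary phrases into single tokens,
    e.g. ['long', 'sleeves'] -> ['long_sleeves']."""
    text = " ".join(tokens)
    for phrase in sorted(glossary_phrases, key=len, reverse=True):
        text = text.replace(phrase, phrase.replace(" ", "_"))
    return text.split()

# `sentences` stands in for the concatenated, lowercased, phrase-merged product descriptions.
sentences = [merge_phrases("sleeveless shift dress with v neck".split(), {"v neck"})]

# sg=1 selects the Skip-gram architecture; min_count=1 only so the toy example runs.
model = Word2Vec(sentences, vector_size=300, window=5, sg=1, min_count=1)
embedding = model.wv["v_neck"]   # 300-dimensional embedding of the fashion phrase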
The neural network then induces a 1000-dimensional common multimodal space, where image and text fragments have corresponding intermodal representations which reflect their semantic similarity. The neural network is trained with an objective function (equation 3), which is optimized with a stochastic gradient descent method with mini-batches of 100, a fixed learning rate of 10^-5, and a momentum of 0.90 during 20 epochs through the training data. A mini-batch is a batch with a small size. The batch size is the number of training examples used for one forward and backward pass. The learning rate is a parameter which determines how much the network parameters are updated. The momentum is a percentage which influences the size of the steps taken towards the optimal network parameters.
Parameters which were found to work well in this step were n = 10 for the smoothing term in equation 10, Δ = 40 for the margin term in equation 9, and β = 0.50 and γ = 0.25 in equation 3. A weighting term w = 2.5 was found to be appropriate based upon a small validation set.
To evaluate this proposed multimodal retrieval model, for each multimodal query the top K most relevant images are retrieved. Images are evaluated by computing precision@K for K values of 1 and 5, and MAP (mean average precision). A retrieved image is considered to be relevant if it resembles the query image and satisfies the query text. In this example, it is required that all fashion attributes of the query image are present in the retrieved image, and that all fashion attributes requested in the query text are either added or interchanged with the necessary query image fashion attributes. If a particular dress fabric is requested, the product descriptions may be used to determine the presence of the fabric in addition to or instead of the product images. Additionally, precision@K is computed for K = 5 for each of the 47 query text fragments individually, and macro-average precision@K (macro AP@K) for K = 5 across all 47 query text fragments. The results of the multimodal retrieval model of this example are compared with those of a simple multimodal retrieval model.
In the simple multimodal retrieval model, an intermodal representation of the desired image is created based on the multimodal query, and candidate images are retrieved which are visually similar to the desired image. The intermodal representation of the desired image is calculated as the intermodal representation of an image fragment of the query image showing the full fashion item (in this example, a dress), plus the intermodal representations of the query text fragments, minus the intermodal representations of attributes that the query text fragments will replace. For example, for the query in Figure 1, the simple multimodal retrieval model takes the intermodal representation of the first image fragment of the query image (that is, the image fragment showing the whole dress), adds the intermodal representations of "spots" and "long sleeves", and subtracts the intermodal representation for "sleeveless". Then, it retrieves candidate images with a high cosine similarity with the resulting vector.
This simple multimodal retrieval model requires the user to also specify in the query text which fashion attribute(s) they would like to be replaced. For example, for the multimodal query in Figure 1, the user must specify "Find similar images but with long sleeves instead of sleeveless and with spots".
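For comparison, the simple retrieval model described above can be sketched in a few lines; the function and argument names are illustrative only.

import numpy as np

def cos_sim(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def simple_retrieval(query_item_rep, add_reps, remove_reps, candidate_reps, top_k=5):
    """Baseline: build one target vector by vector arithmetic and rank candidates by cosine.

    query_item_rep: intermodal representation of the query image fragment showing the whole item
    add_reps:       representations of query text fragments to add (e.g. 'spots', 'long sleeves')
    remove_reps:    representations of attributes to be replaced (e.g. 'sleeveless')
    candidate_reps: list of intermodal representations of candidate images
    """
    target = query_item_rep + sum(add_reps) - sum(remove_reps)
    ranked = sorted(range(len(candidate_reps)),
                    key=lambda i: cos_sim(target, candidate_reps[i]),
                    reverse=True)
    return ranked[:top_k]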
In addition to the quantitative evaluation described above, where it is required that all fashion attributes of the query image are present in the retrieved image, and that all fashion attributes requested in the query text are either added or interchanged with the necessary query image fashion attributes, a qualitative evaluation is performed, wherein relevance is expressed as a percentage of attributes required by the multimodal query which are present in the retrieved image. In this qualitative evaluation, a refined precision@K score is computed, expressing relevance as a percentage of the required attributes which are present.
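One way to compute the refined precision@K described here is sketched below; averaging the per-image attribute fractions over the top K results is an interpretation of the description, and the example attribute values are illustrative.

def refined_precision_at_k(retrieved_attrs, required_attrs, k):
    """Refined precision@K: average, over the top K retrieved images, of the fraction
    of attributes required by the multimodal query which each image displays.

    retrieved_attrs: list of attribute sets, one per retrieved image (length >= k)
    required_attrs:  set of attributes required by the multimodal query
    """
    required = set(required_attrs)
    fractions = [len(required & set(attrs)) / len(required)
                 for attrs in retrieved_attrs[:k]]
    return sum(fractions) / k

# Illustrative values only: one retrieved dress showing all required attributes gives 1.0.
print(refined_precision_at_k([{"v-neck", "sleeveless", "black", "white", "short", "sheath", "casual"}],
                             {"v-neck", "sleeveless", "black", "white", "short", "sheath", "casual"}, 1))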
Figure 9 shows a comparison of performance of the simple (referred to in the table as 'standard') multimodal retrieval model and the proposed multimodal retrieval model. P@K represents precision@K for multimodal retrieval. "Attribute P@5" is precision@5 for each individual textual attribute, "macro AP@5" is the average attribute precision@5 over all textual attributes. These results show that the proposed multimodal retrieval model outperforms the simple multimodal retrieval model.
Compared to the simple retrieval model, the proposed retrieval model achieves an increase of 267% on precision@1, of 158% on precision@5, of 253% on MAP, and of 239% on macro AP@5. The simple multimodal retrieval model may be thought of as being intuitive, and it in fact creates the intermodal
representation of the desired image based on the multimodal query. However, it lacks a mechanism to focus on the desired fashion attributes. Simply calculating an overall cosine similarity between the desired image and the candidate images is not sufficient to retrieve relevant images exhibiting both the query image attributes and the query text attributes. In contrast, the proposed multimodal retrieval model uses the inner product of a query text fragment and candidate image fragment to explicitly require every query text attribute to be present in the retrieved images.
A qualitative assessment may also be performed.
The following example multimodal queries were tested.
Query 1
A multimodal query comprising the query "find similar images but with a v-neck" in combination with a query image of a dress. The query image has attributes which include sleeveless, black, white, short, sheath, casual. Five images showing dresses were retrieved by the model and assessed for relevance. The first dress preserves the query image attributes except for the neckline attribute, which is a v-neck as requested, and so this dress satisfies the multimodal query perfectly. The remaining dresses exhibit the requested query text attribute "v-neck" but do not exhibit all of the remaining query image attributes. For example, the second dress has short sleeves, the third and fifth dresses do not have the attribute 'white', the fourth dress is blue. The refined precision@1 for the retrieved set of images for this query was calculated to be 100% and the refined precision@5 to be 71.43%.
Query 2
A multimodal query comprising the query "find similar images but shift" in combination with a query image of a dress. The query image has attributes which include jewel neckline, short sleeves, short, pink, floral print, summer. Five images showing dresses were retrieved by the model and assessed for relevance. None of the dresses has all of the requested attributes. For example, the first dress does not have a floral print, the second dress does not have short sleeves. However, each dress has some of the required attributes and can be considered similar to the query image. The refined precision@1 for the retrieved set of images for this query was calculated to be 71.43% and the refined precision@5 to be 57.14%.
Query 3
A multimodal query comprising the query "find similar images but with rhinestones" in combination with a query image of a dress. The query image has attributes which include white, strapless, sweetheart, long, A-line, pleated, bridesmaid. Five images showing dresses were retrieved by the model and assessed for relevance. None of the dresses has all of the requested attributes. However, each dress has some of the required attributes and can be considered similar to the query image. The first dress has all of the requested attributes (rhinestones, strapless, sweetheart, long, A-line, pleated, bridesmaid) except for the color (white). The refined precision@1 for the retrieved set of images for this query was calculated to be 87.50% and the refined precision@5 to be 77.14%.
Query 4
A multimodal query comprising the query "find similar images but strapless and short" in combination with a query image of a dress. The query image has attributes which include red, single strap, long, bridesmaid and accented with an ornament at the waist. Five images showing dresses were retrieved by the model and assessed for relevance. The first, second, and third dresses show all of the requested attributes, that is, red, strapless, short, bridesmaid and accented with an ornament at the waist. The fourth image does not have the attribute 'red' and the fifth image does not have the attribute 'short'. The refined precision@1 for the retrieved set of images for this query was calculated to be 100% and the refined precision@5 to be 80%.
The results show that the proposed multimodal retrieval model is capable of retrieving relevant candidate images for the provided multimodal queries. The results also indicate that the neural network has learned what certain fashion attributes look like. Hence, it is possible to analyze the 1000-dimensional intermodal representations to acquire insight into the meaning of their different components. In the intermodal representations of textual attributes, a component of the intermodal representation is either zero or positive (as a consequence of using the ReLU activation function). For all textual attributes in the vocabulary (the Zappos glossary) in this example, it is observed that only a few of the 1000 components are non-zero. This indicates that the model has learned which components of the image fragments to focus on when looking for a specific fashion attribute.
For example, the neckline "sweetheart" has only 7 non-zero components. Hence, image fragments in the common, multimodal space which have positive values for these components will result in a positive inner product with
"sweetheart" and thus are expected to show a sweetheart neckline. Visually similar fashion attributes share some of their non-zero components. This is logical, as some of the same components matter when determining the presence of these fashion attributes. For example, "strapless" has 10 non-zero components of which 3 are also non-zero for "sweetheart". This may explain why previous neural network models, for example that described in Laenen et al., Cross-modal search for fashion attributes, 23rd SIGKDD Conference on Knowledge Discovery and Data Mining, Workshop on 'Machine learning meets fashion', sometimes has trouble
distinguishing between visually similar fashion attributes: a large value for a shared component might produce a positive inner product for an incorrect fashion attribute.
Modifications
It will be appreciated that many modifications may be made to the embodiments herein described.
The query, candidate, and test or validation images may illustrate any type of item having attributes which are visual and which represent features which may vary between items of that type. For example, a user may provide a
multimodal query comprising an image of a toy, for example a teddy bear, and a query modifier such as "find it with a red hat" and the search method may be used to evaluate the similarity of candidate images of toys with the multimodal query. The multimodal space need not be induced by a neural network. For example, a multimodal space may be induced by canonical correlation analysis or bilingual latent Dirichlet allocation.

Claims
1. A computer-implemented searching method for evaluating an image in dependence on a multimodal query; comprising:
receiving a multimodal query comprising a query image and a query modifier, wherein the query modifier modifies the query represented by the query image and/or adds attributes to it, wherein the query image comprises at least one query image fragment, each of the at least one query image fragments having a corresponding query image fragment intermodal representation in a multimodal space and the query modifier comprises at least one query modifier fragment, each of the at least one query modifier fragments having a corresponding query modifier fragment intermodal representation in the multimodal space;
receiving a candidate image comprising at least one candidate image fragment, each of the at least one candidate image fragments having a
corresponding candidate image fragment intermodal representation in the multimodal space;
calculating a first similarity between the query image and the candidate image in dependence upon at least one query image fragment intermodal representation and at least one candidate image fragment intermodal
representation;
calculating a second similarity between the query modifier and the candidate image in dependence upon at least one query modifier fragment intermodal representation and at least one candidate image fragment intermodal representation; and
calculating an overall similarity between the multimodal query and the candidate image in dependence upon the first similarity and the second similarity.
2. A method according to claim 1, further comprising outputting the candidate image in dependence upon the overall similarity.
3. A method according to claim 1 or 2, wherein the query modifier comprises a query text.
4. A method according to claim 1 or 2, wherein the query modifier comprises an audio query including at least one spoken word.
5. A method according to claim 4, further comprising, after receiving the query modifier, converting the audio query to a query text.
6. A method according to any preceding claim, wherein an intermodal representation corresponds to a vector in the multimodal space.
7. A method according to claim 6 wherein calculating the first similarity or the second similarity comprises calculating a similarity metric of vectors in the multimodal space.
8. A method according to any preceding claim, further comprising determining the query image intermodal representation in dependence upon at least one image attribute of the query image.
9. A method according to claim 8 further comprising, after receiving the query image, extracting the at least one image attribute of the query image using an image segmentation method, optionally a rule-based image segmentation method.
10. A method according to any preceding claim, further comprising determining the query modifier intermodal representation in dependence upon at least one attribute of the query modifier.
11. A method according to claim 10 further comprising, after receiving the query modifier, extracting the at least one attribute of the query modifier using a segmentation method, optionally a rule-based text segmentation method.
12. A method according to any preceding claim, further comprising determining the candidate image intermodal representation in dependence upon at least one image attribute of the candidate image.
13. A method according to claim 12 further comprising, after receiving the candidate image, extracting the at least one image attribute of the candidate image using an image segmentation method, optionally a rule-based image segmentation method.
14. A method according to any preceding claim, wherein the overall similarity is equal to the aggregation of the first similarity and the second similarity.
15. A method according to any one of claims 1 to 14, wherein the overall similarity is equal to a weighted aggregation of the first similarity and the second similarity.
16. A method according to any preceding claim, wherein the query image and candidate image each illustrate at least one fashion item.
17. A method according to any one of claims 1 to 16, wherein the query image and the candidate image each illustrate at least one object of a type wherein the object has at least one attribute which is visually distinguishable between objects of the type.
18. A method according to any preceding claim, wherein the multimodal space is induced by a neural network.
19. A computer-implemented method for selecting an image from a plurality of images in dependence on a multimodal query, comprising:
receiving a multimodal query comprising a query image and a query modifier, wherein the query modifier modifies the query represented by the query image and/or adds attributes to it, wherein the query image comprises at least one query image fragment, each of the at least one query image fragments having a corresponding query image fragment intermodal representation in a multimodal space and the query modifier comprises at least one query modifier fragment, each of the at least one query modifier fragments having a corresponding query modifier fragment intermodal representation in the multimodal space;
receiving a plurality of candidate images, each comprising at least one candidate image fragment, each of the at least one candidate image fragments having a corresponding candidate image fragment intermodal representation in the multimodal space;
for each of the candidate images calculating a first similarity between the query image and said candidate image in dependence upon at least one query image fragment intermodal representation and at least one candidate image fragment intermodal representation, and calculating a second similarity between the query modifier and said candidate image in dependence upon at least one query modifier fragment intermodal representation and at least one candidate image fragment intermodal representation, calculating an overall similarity between the multimodal query and said candidate image in dependence upon the first similarity and the second similarity; and
selecting, from the plurality of candidate images, at least one candidate image having highest overall similarity.
20. A computer-implemented method according to claim 19, including for each candidate image adding to a candidate image fragment intermodal
representation of said candidate image an attribute retrieved from candidate image fragment intermodal representations of other candidate images that are visually similar to the said candidate image.
21. A computer-implemented method according to claim 19 or 20, including for each candidate image removing from a candidate image fragment intermodal representation of said candidate image an attribute not occurring in any image fragment intermodal representation of other candidate images that are visually similar to the said candidate image.
22. A computer-implemented method for selecting an image from a plurality of images in dependence on a multimodal query; comprising:
receiving a multimodal query comprising a query image and a query modifier, wherein the query modifier modifies the query represented by the query image and/or adds attributes to it, wherein the query image has a corresponding query image intermodal representation in a multimodal space and the query modifier has a corresponding query modifier intermodal representation in the multimodal space;
receiving a plurality of candidate images, each having a corresponding candidate image intermodal representation in the multimodal space;
ranking the candidate images in dependence upon their relevance to the multimodal query which includes the query image and the query modifier; and
selecting, from the plurality of candidate images, at least one candidate image having highest ranking.
23. A computer-implemented method according to claim 22, including:
segmenting the query image and the candidate images, producing one or more query image fragments for the query image and one or more candidate image fragments for the candidate image;
segmenting the query modifier, producing one or more query modifier fragments, the query modifier fragments referring to attributes to be added and/or interchanged with attributes of the query image;
wherein the query image fragments, query modifier fragments and candidate image fragments are represented using intermodal representations as e.g. inferred by a neural network.
24. A computer-implemented method according to claim 22 or 23, wherein determining the relevance of the candidate image includes determining a cosine similarity measure, to measure both the visual similarity of the query image, or a fragment thereof, and the candidate image, or a fragment thereof, and the semantic similarity of the candidate image, or a fragment thereof, and the query modifier, or a fragment thereof.
25. A computer-implemented method for identifying an image; comprising:
receiving a multimodal query comprising a query image and a query modifier, wherein the query modifier modifies the query represented by the query image and/or adds attributes to it, wherein the query image has a corresponding query image intermodal representation in a multimodal space and the query modifier has a corresponding query modifier intermodal representation in the multimodal space;
receiving a plurality of candidate images, each having a corresponding candidate image intermodal representation in the multimodal space;
determining for each of the candidate images a measure of
correspondence to the multimodal query which includes the query image and the query modifier; and
identifying, from the plurality of candidate images, at least one candidate image having a highest measure of correspondence.
26. A computer-readable medium containing instructions for performing a method according to any preceding claim.
27. Use of a method or computer-readable medium according to any preceding claim in an e-commerce setting.
PCT/EP2018/068707 2017-07-10 2018-07-10 Method for evaluating an image WO2019011936A1 (en)

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
GBGB1711040.4A GB201711040D0 (en) 2017-07-10 2017-07-10 Search method
GB1711040.4 2017-07-10
GBGB1711715.1A GB201711715D0 (en) 2017-07-20 2017-07-20 Search Method
GB1711715.1 2017-07-20
GB1712758.0 2017-08-09
GBGB1712758.0A GB201712758D0 (en) 2017-08-09 2017-08-09 Search method

Publications (1)

Publication Number Publication Date
WO2019011936A1 true WO2019011936A1 (en) 2019-01-17

Family

ID=63036015

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2018/068707 WO2019011936A1 (en) 2017-07-10 2018-07-10 Method for evaluating an image

Country Status (1)

Country Link
WO (1) WO2019011936A1 (en)



Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130282687A1 (en) * 2010-12-15 2013-10-24 Xerox Corporation System and method for multimedia information retrieval
US20170147906A1 (en) * 2015-11-20 2017-05-25 Adobe Systems Incorporated Techniques for enhancing content memorability of user generated video content

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
KATRIEN LAENEN ET AL: "Cross-modal search for fashion attributes", PROCEEDINGS OF THE KDD 2017 WORKSHOP ON MACHINE LEARNING MEETS FASHION, 14 August 2017 (2017-08-14), XP055502261 *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10949706B2 (en) 2019-01-16 2021-03-16 Microsoft Technology Licensing, Llc Finding complementary digital images using a conditional generative adversarial network
WO2020154192A1 (en) * 2019-01-22 2020-07-30 Amazon Technologies, Inc. Search result image selection techniques
US11176191B2 (en) 2019-01-22 2021-11-16 Amazon Technologies, Inc. Search result image selection techniques
WO2020226753A1 (en) * 2019-05-09 2020-11-12 Microsoft Technology Licensing, Llc Plural-mode image-based search
US11010421B2 (en) 2019-05-09 2021-05-18 Microsoft Technology Licensing, Llc Techniques for modifying a query image
CN114117104B (en) * 2020-08-28 2023-06-16 四川大学 Image text matching method based on multi-relation perception reasoning
CN114117104A (en) * 2020-08-28 2022-03-01 四川大学 Image text matching method based on multi-relation perception inference
CN112559820A (en) * 2020-12-17 2021-03-26 中国科学院空天信息创新研究院 Sample data set intelligent question setting method, device and equipment based on deep learning
CN112559820B (en) * 2020-12-17 2022-08-30 中国科学院空天信息创新研究院 Sample data set intelligent question setting method, device and equipment based on deep learning
WO2022268094A1 (en) * 2021-06-22 2022-12-29 Huawei Technologies Co., Ltd. Methods, systems, and media for image searching
US11954145B2 (en) 2021-06-22 2024-04-09 Huawei Technologies Co., Ltd. Methods, systems, and media for image searching
EP4134840A1 (en) * 2021-08-12 2023-02-15 Naver Corporation Data compatibility computation for text-enhanced visual retrieval using a text-image pair vector representation
WO2023117041A1 (en) * 2021-12-20 2023-06-29 Huawei Technologies Co., Ltd. Computing device and methods for upgrading search service through feature translation

Similar Documents

Publication Publication Date Title
WO2019011936A1 (en) Method for evaluating an image
US10824942B1 (en) Visual similarity and attribute manipulation using deep neural networks
Tautkute et al. Deepstyle: Multimodal search engine for fashion and interior design
US10942966B2 (en) Textual and image based search
US9075825B2 (en) System and methods of integrating visual features with textual features for image searching
CN103678335B (en) The method of method, apparatus and the commodity navigation of commodity sign label
CN104035927B (en) Search method and system based on user behaviors
KR102317432B1 (en) Method, apparatus and program for fashion trend prediction based on integrated analysis of image and text
CN111695022B (en) Interest searching method based on knowledge graph visualization
CN107944035B (en) Image recommendation method integrating visual features and user scores
KR102227896B1 (en) System, method and program for obtaining appearance descriptive data from image data
Laenen et al. Web search of fashion items with multimodal querying
US10776417B1 (en) Parts-based visual similarity search
US11037071B1 (en) Cross-category item associations using machine learning
CN113330455A (en) Finding complementary digital images using conditional generative countermeasure networks
US11126653B2 (en) Mixed type image based search results
US11783408B2 (en) Computer vision based methods and systems of universal fashion ontology fashion rating and recommendation
CN113204636B (en) Knowledge graph-based user dynamic personalized image drawing method
JP5455232B2 (en) Image selection apparatus, method and program
US20230022712A1 (en) Method, apparatus, and computer program for recommending fashion product
CN117420998A (en) Client UI interaction component generation method, device, terminal and medium
KR102130448B1 (en) Method, apparatus and computer program for searching image
KR20200141387A (en) System, method and program for searching image data by using deep-learning algorithm
CN108804416B (en) Training method for film evaluation emotion tendency analysis based on machine learning
CN111223014B (en) Method and system for online generation of subdivision scene teaching courses from a large number of subdivision teaching contents

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18746110

Country of ref document: EP

Kind code of ref document: A1