WO2023117041A1 - Computing device and methods for upgrading search service through feature translation - Google Patents

Computing device and methods for upgrading search service through feature translation

Info

Publication number
WO2023117041A1
Authority
WO
WIPO (PCT)
Prior art keywords
visual
speech
visual representation
image
structured
Prior art date
Application number
PCT/EP2021/086851
Other languages
French (fr)
Inventor
Mert KILICKAYA
Baiqiang XIA
Original Assignee
Huawei Technologies Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to CN202180101284.1A (published as CN117751357A)
Priority to PCT/EP2021/086851
Publication of WO2023117041A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53 Querying
    • G06F16/532 Query formulation, e.g. graphical querying

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A computing device for performing a visual search includes a vision encoder module configured to receive a query image and generate a structured visual representation, representing the query image as a linear combination of visual attributes. The computing device further includes a speech encoder module configured to receive a speech interaction including one or more visual attribute modifications and generate a corresponding speech embedding. The computing device further includes a transformation module configured to transform the structured visual representation by replacing one or more visual attributes with modified visual attributes based on the speech embedding. The computing device further includes a search module configured to generate an image search query based on the transformed structured visual representation and output at least one target image based on the image search query.

Description

COMPUTING DEVICE AND METHODS FOR UPGRADING SEARCH SERVICE THROUGH FEATURE TRANSLATION
TECHNICAL FIELD
The present disclosure relates generally to the field of visual search; and more specifically to a computing device and a method for upgrading a search service (e.g., performing and upgrading a visual search via conversational interaction) through feature translation.
BACKGROUND
Many online search platforms, systems, and technologies have emerged in the last few decades. Visual search, among other search technologies, is rapidly gaining importance as it improves users' experience of finding content on the web. However, existing visual search technologies are inflexible, which makes finding relevant results through visual search very cumbersome. For example, in certain scenarios, a consumer may need to search for a specific product on an online search platform. For example, the consumer may require apparel of a specific colour and design on the search platform. In such a case, navigating through an entire catalogue of products to visually search for the required apparel on the e-commerce platform is a cumbersome task. Furthermore, the search platform may provide sorting of the products based on one or more filters that may be set by the consumers for a refined search result. However, such refined search results may include a large number of product listings that are irrelevant to the consumer, making it difficult for the consumer to find the required product. Moreover, in some cases, to help the consumers, the search platforms may provide assistance, in which human users or bots are employed to guide the consumers through the process of finding and buying the right product. The interaction between the consumers and the human users or bots generally occurs through online chat platforms via simple textual conversations. However, such textual conversations can lead to lexical ambiguity and may further be inaccessible to many consumers. Thus, there exists a technical problem of the inflexibility of existing visual search methods, which makes performing a visual search on search platforms to find relevant results very cumbersome, thereby resulting in a poor user experience and increasing the difficulty of arriving at final search results that are accurate and relevant.
SUMMARY
The present disclosure provides a computing device and methods for upgrading a search service through feature translation (i.e., a method to perform and upgrade a visual search via conversational interaction and a method of training the computing device to perform the visual search via conversational interaction). An aim of the present disclosure is to provide a solution that overcomes the problems encountered in the prior art and provides improved methods for upgrading a search service through feature translation.
According to an aspect of the present disclosure, there is provided a computing device for performing a visual search. The computing device comprises a vision encoder module configured to receive a query image and generate a structured visual representation, representing the query image as a linear combination of visual attributes. The computing device further comprises a speech encoder module configured to receive a speech interaction including one or more visual attribute modifications and generate a corresponding speech embedding. The computing device further comprises a transformation module configured to transform the structured visual representation by replacing one or more visual attributes with modified visual attributes based on the speech embedding. The computing device further comprises a search module configured to generate an image search query based on the transformed structured visual representation and output at least one target image based on the image search query.
The disclosed computing device enables an enhanced search service for a user. By virtue of generating the structured visual representation, representing the query image as a linear combination of visual attributes, composability, identifiability, and separability of the attributes associated with the query image are achieved, resulting in flexibility in the later transformation of the attributes for improved search performance and user experience. As compared to conventional search methods, such as text-based interactive search, which provide single or multiple rounds of refinement via textual input only, the disclosed computing device enables an enhanced search service by allowing multiple rounds of conversational input using speech and the original image features for refining image search results. Speech-based refining is more user-friendly and convenient, as speaking is generally faster and more natural for user interaction. The transformation of the structured visual representation, by replacing one or more visual attributes with modified visual attributes based on the speech embedding, provides the technical effect of refining search results by proposing modifications to the input query image via speech interaction (or conversation), thereby improving the accuracy and relevancy of the final search results. The refinement of image search results can be performed iteratively via the speech interaction, simplifying ease-of-use for an enhanced user experience.
In an implementation form, the visual encoder module is a convolutional network or visual transformer network.
The convolutional network or visual transformer network enables an N-dimensional representation of the one or more visual attributes, contributing to the flexibility of the search service.
In a further implementation form, replacing a visual attribute includes identifying a corresponding visual attribute in the structured visual representation, removing the identified attribute and adding the modified visual attribute based on the speech embedding.
It is advantageous to remove the identified attribute and add the modified visual attribute based on the speech embedding, as it improves flexibility of the search service and enables refining search results by proposing modifications to the input query image via conversations.
In a further implementation form, the transformation module includes a multi-layer perceptron including one or more ReLU activations between each layer.
The multi-layer perceptron receives the N-dimensional visual attribute representation as well as the N-dimensional speech embedding and outputs the transformed structured visual representation.
In a further implementation form, in response to receiving a second speech interaction, the speech encoder is configured to generate a second speech embedding based on the second speech interaction. The transformation module is configured to generate a second transformed structured visual representation by replacing one or more visual attributes of the transformed structured visual representation with second modified visual attributes based on the second speech embedding. The search module is configured to generate a second image search query based on the second transformed structured visual representation and output at least one second target image based on the second image search query. It is advantageous to iteratively perform the speech interaction between the user and the computing device to provide the target image based on a user preference. The user may provide one or more inputs in the speech interaction to receive the target image based on the user preference. In other words, the iterative transformation of the structured visual representation enables further refining of search results by modifying some attributes of the input query image via the second speech interaction.
In a further implementation form, a method of training the computing device is provided. The method comprises sampling triplets of a query image, a speech interaction and a target image. The method comprises, for each triplet, processing the query image and target image using the vision encoder module to generate a structured visual representation and a target structured visual representation, and processing the speech interaction using the speech encoder module to generate a speech embedding. The method further comprises, for each triplet, transforming the structured visual representation based on the speech embedding to generate a transformed structured visual representation. The method further comprises, for each triplet, determining one or more loss functions, including a transformation loss function based on a difference between the transformed structured visual representation and the target structured visual representation. The method further comprises, for each triplet, updating parameters of the vision encoder module, speech encoder module and transformation module to reduce the one or more loss functions.
The disclosed method trains the computing device to improve the functionality of the computing device to achieve composability, identifiability, and separability of attributes associated with the query image, resulting in flexibility in later transformation of the input feature (i.e., one or more attributes associated with input query image) for improved search performance and user experience. The computing device is trained to generate the refined input feature, and then output at least one target image with improved accuracy based on the query image and the speech interaction, where the input feature is refined based on speech interaction.
In a further implementation form, the loss functions further include one or more independent loss functions based on composite loss, identifiability loss and/or separability loss.
The one or more independent loss functions based on composite loss, identifiability loss and/or separability loss are utilized as constraints for the visual encoder module, which is the convolutional network or the visual transformer network. Having one or more independent loss functions based on composite loss, identifiability loss and/or separability loss further improves the disclosed method by transforming the query image feature conditioned on the speech feedback from the user to obtain the desired target query.
In a further implementation form, the composite loss is configured to reduce a difference between the structured visual representation and a sum of all the visual attributes.
The composite loss enables the transformation of the structured visual representation via elementary operations, such as addition operation or removal operation.
In a further implementation form, the identifiability loss is configured to reduce a cross-entropy of the structured visual representation.
The identifiability loss ensures that the visual attributes of the query image are identified.
In a further implementation form, the separability loss is configured to reduce a dot product between all of the visual attributes.
The separability loss ensures that abstract visual attributes of the query image are decorrelated or separated from each other.
According to another aspect of the present disclosure, there is provided a method for performing a visual search. The method comprises receiving, by a vision encoder module, a query image and generating a structured visual representation, representing the query image as a linear combination of visual attributes. The method further comprises receiving, by a speech encoder module, a speech interaction including one or more visual attribute modifications and generating a corresponding speech embedding. The method further comprises transforming, by a transformation module, the structured visual representation by replacing one or more visual attributes with modified visual attributes based on the speech embedding. The method further comprises generating, by a search module, an image search query based on the transformed structured visual representation and outputting at least one target image based on the image search query.
The method achieves all the advantages and technical effects of the computing device of the present disclosure. In another aspect, the present disclosure provides a computer program comprising instructions which, when executed by a processor, cause the processor to carry out the steps of the method.
The computer program achieves all the advantages and technical effects of the computing device and method of the present disclosure.
It has to be noted that all devices, elements, circuitry, units, and means described in the present application could be implemented in the software or hardware elements or any kind of combination thereof. All steps which are performed by the various entities described in the present application as well as the functionalities described to be performed by the various entities are intended to mean that the respective entity is adapted to or configured to perform the respective steps and functionalities. Even if, in the following description of specific embodiments, a specific functionality or step to be performed by external entities is not reflected in the description of a specific detailed element of that entity which performs that specific step or functionality, it should be clear for a skilled person that these methods and functionalities can be implemented in respective software or hardware elements, or any kind of combination thereof. It will be appreciated that features of the present disclosure are susceptible to being combined in various combinations without departing from the scope of the present disclosure as defined by the appended claims.
Additional aspects, advantages, features, and objects of the present disclosure would be made apparent from the drawings and the detailed description of the illustrative implementations construed in conjunction with the appended claims that follow.
BRIEF DESCRIPTION OF THE DRAWINGS
The summary above, as well as the following detailed description of illustrative embodiments, is better understood when read in conjunction with the appended drawings. For the purpose of illustrating the present disclosure, exemplary constructions of the disclosure are shown in the drawings. However, the present disclosure is not limited to specific methods and instrumentalities disclosed herein. Moreover, those in the art will understand that the drawings are not to scale. Wherever possible, like elements have been indicated by identical numbers. Embodiments of the present disclosure will now be described, by way of example only, with reference to the following diagrams wherein:
FIG. 1 is a diagram illustrating a computing device with exemplary components, in accordance with an embodiment of the present disclosure;
FIG. 2 is a block diagram that depicts a determination of loss functions, in accordance with an embodiment of the present disclosure;
FIG. 3 is a block diagram that depicts transformation of structured visual representation, in accordance with another embodiment of the present disclosure;
FIG. 4 is a flowchart of a method for training a computing device, in accordance with an embodiment of the present disclosure; and
FIG. 5 is a flowchart of a method for performing a visual search based on feature translation, in accordance with an embodiment of the present disclosure.
FIG. 6 illustrates an example of a hardware implementation for the computing device of FIG. 1, in accordance with an embodiment of the present disclosure.
In the accompanying drawings, an underlined number is employed to represent an item over which the underlined number is positioned or an item to which the underlined number is adjacent. When a number is non-underlined and accompanied by an associated arrow, the non-underlined number is used to identify a general item at which the arrow is pointing.
DETAILED DESCRIPTION OF EMBODIMENTS
The following detailed description illustrates embodiments of the present disclosure and ways in which they can be implemented. Although some modes of carrying out the present disclosure have been disclosed, those skilled in the art would recognize that other embodiments for carrying out or practising the present disclosure are also possible.
FIG. 1 is a diagram illustrating a computing device with exemplary components, in accordance with an embodiment of the present disclosure. With reference to FIG. 1, there is shown a block diagram 100 of a computing device 102. The computing device 102 includes a vision encoder module 104, a speech encoder module 106, a transformation module 108 and a search module 110. The search module 110 includes a database 110A. Further, the vision encoder module 104 receives a query image 112, and the speech encoder module 106 receives speech interaction 114. The search module 110 performs a visual search and outputs a target image 116.
Throughout the present disclosure, the term computing device 102 refers to an electronic device used by a user, such as a consumer of a search platform. Furthermore, the computing device 102 is intended to be broadly interpreted to include any electronic device that may be used for at least voice, image, and other data communication over a wireless communication network. Examples of the computing device 102 include, but are not limited to, a server, a smartphone, a laptop computer, a handheld device, personal computers, etc. Additionally, the computing device 102 includes a casing, a memory, a processor, a network interface card, a microphone, a speaker, a keypad, and a display.
In operation, the vision encoder module 104 of the computing device 102 is configured to receive the query image 112 and generate the structured visual representation, representing the query image 112 as a linear combination of visual attributes. The query image 112 is received as an input by the vision encoder module 104. For example, a user may upload the query image 112 on the computing device 102. The vision encoder module 104 processes the received query image 112 to extract the visual attributes from the query image 112. For example, the visual attributes include characteristics of a product shown in the query image 112, such as a colour, a shape, a design, a style and so forth of the product. The vision encoder module 104 generates the structured visual representation as the linear combination of the extracted visual attributes. The structured visual representation may be an image-based visual representation that includes the linear combination of the extracted visual attributes.
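By way of a non-limiting illustration only, the following is a minimal PyTorch sketch of how such a vision encoder might produce a structured visual representation as a weighted linear sum of learnable attribute vectors. The tiny convolutional backbone, the attribute vocabulary size and the embedding dimension are illustrative assumptions, not details taken from the disclosure; in practice a deeper convolutional network or a visual transformer network would be used, as described below.

```python
# Minimal sketch (assumed architecture): an image is encoded into a
# structured visual representation x_q that is a weighted linear sum of
# learnable attribute vectors (e.g. "blue", "dress", "one shoulder").
import torch
import torch.nn as nn


class VisionEncoder(nn.Module):
    def __init__(self, num_attributes: int = 64, embed_dim: int = 256):
        super().__init__()
        # Tiny stand-in backbone; the disclosure uses a convolutional
        # network or a visual transformer network instead.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Learnable bank of attribute vectors and a head that scores how
        # strongly each attribute is present in the image.
        self.attribute_bank = nn.Parameter(torch.randn(num_attributes, embed_dim))
        self.attribute_scores = nn.Linear(64, num_attributes)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(image)                            # (B, 64) pooled image features
        weights = torch.sigmoid(self.attribute_scores(feats))   # (B, K) attribute presence scores
        return weights @ self.attribute_bank                    # (B, N) linear combination of attributes


if __name__ == "__main__":
    x_q = VisionEncoder()(torch.randn(1, 3, 224, 224))   # stand-in query image
    print(x_q.shape)                                      # torch.Size([1, 256])
```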
By virtue of generating the structured visual representation, representing the query image 112 as the linear combination of the visual attributes, composability, identifiability, and separability of the attributes associated with the query image 112 are achieved, resulting in flexibility in the later transformation of the attributes for improved search performance and user experience. Therefore, as compared to conventional search methods, which provide an unstructured representation, the computing device 102 possesses the technical advancement of providing a structured visual representation, which in turn aids in providing enhanced search results to the user. Composability ensures that the feature representation is composed of sub-atomic representations. For example, if a clothing item is visible in the query image 112, such as a blue one-shoulder dress, composability segregates the clothing-item feature into sub-atomic visual attributes, such as a “blue” representation, followed by a “dress” representation and then a “one-shoulder” representation. Identifiability ensures that the visual attributes associated with the query image 112 can be determined from the feature. For example, the visual attributes representing the “blue” colour of the product, that the product is a “dress”, and that the dress is “one-shoulder” can be identified from the extracted feature of the query image 112. Moreover, separability of the attributes associated with the query image 112 ensures that the sub-atomic representations are de-correlated, to ease their combination and transformation at a later stage. For example, the visual attributes “blue”, “dress”, and “one-shoulder” can be de-correlated.
In an implementation, the vision encoder module 104 utilizes loss functions to generate the structured visual representation. For example, the loss functions include one or more independent loss functions based on composite loss, identifiability loss and/or separability loss. The loss functions may be utilized as constraints by the vision encoder module 104 to generate the structured visual representation. Examples of the loss functions are further described, for example, in FIG. 2.
In an implementation, the vision encoder module 104 is a convolutional network or a visual transformer network. The convolutional network may be an artificial neural network configured to receive an image-based input (such as the query image 112) and generate a feature map, such as the structured visual representation, based on the analyzed image-based input. The visual transformer network may employ a transformer-like architecture for the task of image classification. For example, the visual transformer network may receive the query image 112, generate fixed-size patches of the query image 112 and linearly embed the fixed-size patches. The embedded patches of the query image 112 may be utilized to extract the visual attributes and generate the structured visual representation. For example, the vision encoder module 104, implemented as the convolutional network or the visual transformer network, generates an N-dimensional representation as the structured visual representation. The N-dimensional representation may include the linear combination of the extracted visual attributes.
The speech encoder module 106 is configured to receive a speech interaction 114 that includes one or more visual attribute modifications and generate a corresponding speech embedding. For example, the user may provide a verbal message as the speech interaction 114 via a chat user interface (UI). In an example, the chat UI may be a part of the search platform utilized by the user. The speech interaction 114 may include the one or more visual attribute modifications required in the query image 112 by the user to receive a desired search result (such as the target image 116). For example, the speech interaction 114 includes modifications, such as a different colour of a product than the colour of the product in the query image 112, a different pattern of the product than the pattern of the product in the query image 112, and the like. The speech encoder module 106 processes the speech interaction 114 to determine which one or more visual attributes of the query image 112 need modification. Based on the one or more visual attribute modifications, the speech encoder module 106 generates the corresponding speech embedding.
In an implementation, the speech encoder module 106 is implemented by use of a neural network, such as a convolutional network. The speech encoder module 106, implemented as the neural network, processes the visual attribute modifications and generates an N-dimensional abstract representation as the speech embedding. The N-dimensional abstract representation may include the combination of the processed visual attributes. The generation of the structured visual representation by the vision encoder module 104 and the generation of the speech embedding by the speech encoder module 106 enable the establishment of a relationship between the visual attributes of the query image 112 and the visual attribute modifications included in the corresponding speech embedding. Further, such a structured visual representation and speech embedding are utilized to train the vision encoder module 104. Advantageously, the speech interaction 114 may further be utilized by individuals with special needs, for whom textual interaction may be a complex task.
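As an illustration only, such a speech encoder could be sketched as a small 1-D convolutional network over log-mel features of the utterance. The input representation, layer sizes and pooling below are assumptions; the disclosure only requires a neural network that maps the speech interaction to an N-dimensional embedding.

```python
# Minimal sketch (assumed design): a 1-D convolutional speech encoder that
# maps a spoken modification such as "black and striped instead" to an
# N-dimensional speech embedding x_s.
import torch
import torch.nn as nn


class SpeechEncoder(nn.Module):
    def __init__(self, n_mels: int = 80, embed_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.proj = nn.Linear(128, embed_dim)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (B, n_mels, T) log-mel spectrogram of the speech interaction
        h = self.conv(mel).mean(dim=-1)   # temporal average pooling -> (B, 128)
        return self.proj(h)               # (B, N) speech embedding x_s


if __name__ == "__main__":
    x_s = SpeechEncoder()(torch.randn(1, 80, 300))   # ~3 s of assumed audio features
    print(x_s.shape)                                  # torch.Size([1, 256])
```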
The transformation module 108 is configured to transform the structured visual representation by replacing one or more visual attributes with modified visual attributes based on the speech embedding. The transformation module 108 receives the structured visual representation from the vision encoder module 104 as an input. The transformation module 108 further receives the speech embedding from the speech encoder module 106 as an input. Based on the modified visual attributes included in the speech embedding, the transformation module 108 transforms the one or more visual attributes included in the structured visual representation. The replacement of the one or more visual attributes with the modified visual attributes is based on elementary operations, such as addition and removal of the visual attributes based on the speech embedding.
In an implementation, replacing a visual attribute includes identifying a corresponding visual attribute in the structured visual representation, removing the identified attribute and adding the modified visual attribute based on the speech embedding. For example, a user may say in the speech interaction 114 “a dress like this (i.e., a dress like the query image 112) but black and striped instead”, where the visual attribute “blue” colour identified from the query image 112 is removed from the structured visual representation and replaced with the visual attribute “black” colour, based on the speech embedding, as shown in FIG. 1. For example, an additional visual attribute (such as a striped pattern) which is absent in the query image 112 may be required by the user to obtain a desired target image of a product, for example, the target image 116. In such a case, the visual attribute “striped” pattern is added based on the speech embedding, as shown in this case. In other words, for a particular image feature or a particular group of image features (represented by different patterned squares in the vision encoder module 104) of the query image 112, transformations are learned jointly from the speech features (i.e., obtained from the speech interaction 114) and the original image features of the query image 112. The learned transformations can then be applied by directly adding or subtracting those new speech-based features (e.g., “black” and “striped” in this example).
In an implementation, the transformation module 108 includes a multi-layer perceptron including one or more rectified linear unit (ReLU) activations between each layer. The multi-layer perceptron is a feedforward artificial neural network. In an example, the multi-layer perceptron utilizes a supervised learning technique for training purposes. The ReLU activations may be non-linear activations present between each layer of the multi-layer perceptron. The multi-layer perceptron is utilized to obtain a refined visual feature or refined visual attribute. The refined visual attribute is the modified visual attribute. For example, each layer performs an operation, such as an addition or removal, to obtain the modified visual attribute. Details of the operations of the transformation module 108 are further described, for example, in FIG. 3.
The search module 110 is configured to generate an image search query based on the transformed structured visual representation and output at least one target image, such as the target image 116, based on the image search query. The search module 110 processes the transformed structured visual representation that includes the modified visual attributes to generate the image search query. For example, the image search query may include a visual attribute, such as “black colour patterned product”. The search module 110 applies a similarity metric, for example, a cosine similarity using the refined feature or attribute over a set of images in the database 110A, to output at least one target image, such as the target image 116.
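A hedged sketch of these two steps follows: a multi-layer perceptron with ReLU activations that infers a removal vector and an addition vector from the concatenated visual representation and speech embedding, and a cosine-similarity ranking over a database of precomputed image representations. The hidden size, the packing of the two vectors into one output, and the database shape are illustrative assumptions rather than details from the disclosure.

```python
# Sketch (assumed sizes): an MLP with ReLU activations infers a removal
# vector T_r and an addition vector T_a from the concatenated visual
# representation and speech embedding; the search step ranks a database
# of precomputed image representations by cosine similarity.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TransformationModule(nn.Module):
    def __init__(self, embed_dim: int = 256, hidden: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * embed_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * embed_dim),   # packed [T_r, T_a]
        )

    def forward(self, x_q: torch.Tensor, x_s: torch.Tensor) -> torch.Tensor:
        t_remove, t_add = self.mlp(torch.cat([x_q, x_s], dim=-1)).chunk(2, dim=-1)
        return (x_q - t_remove) + t_add          # transformed structured representation


def search(x_transformed: torch.Tensor, database: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Return indices of the k most similar database images (cosine similarity)."""
    sims = F.cosine_similarity(x_transformed, database, dim=-1)   # (num_images,)
    return sims.topk(k).indices


if __name__ == "__main__":
    x_q = torch.randn(1, 256)                # structured visual representation of the query
    x_s = torch.randn(1, 256)                # speech embedding ("black and striped instead")
    x_t = TransformationModule()(x_q, x_s)
    catalogue = torch.randn(10_000, 256)     # assumed precomputed index of product images
    print(search(x_t, catalogue, k=3))       # indices of candidate target images
```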
In an implementation, the computing device 102 receives a second speech interaction. For example, the user requires more refined target images as compared to the at least one target image. In such a case, the second speech interaction includes a verbal message with more modifications in the visual attributes required in the target image 116. For example, a request for modification of an existing visual attribute “black colour” in the target image 116 with the visual attribute “royal blue colour” is included in the second speech interaction. In response to receiving the second speech interaction, the speech encoder module 106 is configured to generate a second speech embedding based on the second speech interaction. The second speech embedding may include “royal blue colour”.
The transformation module 108 is configured to generate a second transformed structured visual representation by replacing one or more visual attributes of the transformed structured visual representation with second modified visual attributes based on the second speech embedding. For example, the transformation module 108 replaces the visual attribute “black colour” of the target image 116 with the second modified visual attribute that includes “royal blue colour”.
Based on the second transformed structured visual representation, the search module 110 is configured to generate a second image search query and output at least one second target image based on the second image search query. The second target image may include the visual attribute, such as “royal blue colour”. Thus, in such a manner, the computing device 102 enables reception of conversational interaction or conversational feedback from the user, to provide accurate and refined search results to the user.
In an implementation, the computing device 102 enables reception of two different query images that differ in one visual attribute. For example, the computing device 102 receives a first query image that includes a first product in blue colour. The computing device 102 receives a second query image that includes the first product in black colour. The computing device 102 determines a difference in the visual attributes (such as the colour) of the two query images. Based on the determined difference, the computing device 102 outputs one or more target images. The output one or more target images include different images of the first product in different colours.
In another implementation, the computing device 102 provides a target image based on each speech interaction received from the user. For example, the computing device 102 receives the query image 112 that includes a blue dress. The computing device 102 further receives a first speech interaction. The first speech interaction includes “change blue to black”. Based on the first speech interaction, the computing device 102 provides a first target image that includes a black dress. The computing device 102 further receives a second speech interaction. The second speech interaction includes “how about pink?”. Based on the second speech interaction, the computing device 102 identifies the modification to be made and provides a second target image that includes a pink dress. Further, the computing device 102 receives a third speech interaction. The third speech interaction includes “add ruffles”. Based on the third speech interaction, the computing device 102 provides a third target image that includes a pink dress with ruffles. In a similar manner, the computing device 102 provides different target images based on different speech interactions iteratively.
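This multi-turn behaviour can be summarised as a simple loop that carries the edited representation from one speech interaction to the next. The sketch below is purely illustrative; the callables are assumed to behave like the module sketches given earlier and none of these names come from the disclosure itself.

```python
# Illustrative multi-turn refinement loop (assumed helper callables).
def refine_iteratively(query_image, utterances, vision_enc, speech_enc,
                       transform, search_fn, database, k: int = 1):
    """Run one retrieval per speech interaction, carrying the edited
    representation forward so each turn refines the previous result."""
    x = vision_enc(query_image)                 # structured visual representation
    targets = []
    for mel in utterances:                      # e.g. "change blue to black", "how about pink?", "add ruffles"
        x = transform(x, speech_enc(mel))       # edit the current representation
        targets.append(search_fn(x, database, k=k))
    return targets
```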
The disclosed computing device 102 enables an enhanced search service for a user. By virtue of generating the structured visual representation, representing the query image 112 as the linear combination of visual attributes, composability, identifiability, and separability of the attributes associated with the query image 112 are achieved, resulting in flexibility in the later transformation of the attributes for improved search performance and user experience. As compared to conventional search methods, such as text-based interactive search, which provide only a single round of refinement via textual input, the disclosed computing device 102 enables an enhanced search service by allowing multiple rounds of conversational input for refining image search results. The transformation of the structured visual representation, by replacing one or more visual attributes with modified visual attributes based on the speech embedding, provides the technical effect of refining search results by proposing modifications to the input query image 112 via speech interaction (or conversation), thereby improving the accuracy and relevancy of the final search results. The refinement of image search results can be achieved iteratively via the speech interaction, simplifying ease-of-use for an enhanced user experience. Moreover, the conventional search methods may require human intervention (such as customer care support) for providing the desired search results to the user. In contrast, the computing device 102 provides a fully automated environment, composed of the vision encoder module 104, the speech encoder module 106, the transformation module 108 and the search module 110, for the user to obtain the desired search results.
FIG. 2 is a block diagram that depicts determination of loss functions, in accordance with an embodiment of the present disclosure. With reference to FIG. 2, there is shown a diagram 200 that depicts determination of loss functions. The diagram 200 includes the vision encoder module 104 and the query image 112 received by the vision encoder module 104. Further, the diagram 200 includes loss functions 202. In some embodiments, the loss functions 202 include one or more independent loss functions based on a composite loss 204, an identifiability loss 206 and a separability loss 208.
The vision encoder module 104 of the computing device 102 utilizes the loss functions 202 to generate the structured visual representation, based on the received query image 112. The loss functions 202 are utilized by the vision encoder module 104 as constraints to generate the structured visual representation. For example, the vision encoder module 104 implemented by use of the convolutional neural network utilizes the loss functions 202 as the constraints to generate the N-dimensional structured visual representation.
In an exemplary scenario, the query image 112 includes a dress. For example, the dress may be a one shoulder dress and blue in colour. The composite loss 204 ensures that the structured visual representation is a linear sum of all the visual attributes. In an example, the structured visual representation is the sum of the visual attributes “blue”, “dress”, and “one shoulder”. By such representation, the structured visual representation may be transformed via elementary operations, such as addition and removal operations.
The identifiability loss 206 ensures that the visual attributes, such as “blue”, “dress”, and “one shoulder” in the query image 112 are recognized from the generated structured visual representation. Thus, the identifiability loss 206 ensures that the visual attribute (or feature) is informative about the entities within the query image 112. The separability loss 208 ensures that the abstract representations of each visual attribute, such as “blue”, “dress”, and “one shoulder” in the query image 112 are de-correlated from one another. The separability loss 208 is utilized for generating the structured visual representation as de-correlation of each of the visual attributes allows ease of combination.
Thus, the vision encoder module 104 receives the query image 112, utilizes the loss functions 202 such as the composite loss 204, the identifiability loss 206 and the separability loss 208 as constraints to generate the structured visual representation.
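For illustration, these three constraints can be written down compactly; the sketch below is one plausible reading in PyTorch. In particular, the multi-label form of the identifiability term and the use of per-attribute component vectors of shape (batch, attributes, dimension) are interpretive assumptions rather than details from the disclosure.

```python
# One plausible reading of the three constraints (assumed shapes and
# supervision): x_q is the structured visual representation (B, N) and
# attr_vectors are its per-attribute components (B, K, N).
import torch
import torch.nn.functional as F


def composite_loss(x_q: torch.Tensor, attr_vectors: torch.Tensor) -> torch.Tensor:
    # Lc: the representation should equal the linear sum of its attribute components.
    return F.mse_loss(x_q, attr_vectors.sum(dim=1))


def identifiability_loss(logits: torch.Tensor, attr_labels: torch.Tensor) -> torch.Tensor:
    # Li: the attributes present in the image ("blue", "dress", "one shoulder")
    # must be recoverable from the representation; read here as a multi-label
    # cross-entropy over attribute-classifier logits.
    return F.binary_cross_entropy_with_logits(logits, attr_labels)


def separability_loss(attr_vectors: torch.Tensor) -> torch.Tensor:
    # Ls: dot products between different attribute components should be small,
    # pushing the components towards mutual de-correlation.
    gram = attr_vectors @ attr_vectors.transpose(1, 2)                      # (B, K, K)
    off_diag = gram - torch.diag_embed(torch.diagonal(gram, dim1=1, dim2=2))
    return off_diag.abs().mean()
```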
FIG. 3 is a block diagram that depicts transformation of structured visual representation, in accordance with another embodiment of the present disclosure. With reference to FIG. 3, there is shown a diagram 300 that depicts transformation of structured visual representation. The diagram 300 includes the transformation module 108, the speech interaction 114 and the target image 116. The transformation module 108 further includes a transformation inference unit 302 and a transformation application unit 304.
Once the structured visual representation is generated, the transformation module 108 utilizes the corresponding speech embedding to transform the structured visual representation. The transformation is performed by use of the transformation inference unit 302 and the transformation application unit 304.
The transformation inference unit 302 is utilized to infer two types of transformations. The types of transformations include a removal transformation vector and an addition transformation vector. The removal transformation vector and the addition transformation vector are conditioned jointly on the structured visual representation as well as the speech embedding.
The transformation application unit 304 is configured to receive the removal transformation vector and the addition transformation vector inferred based on the speech embedding. The transformation application unit 304 applies the removal transformation vector on the structured visual representation to remove a visual attribute. For example, the transformation application unit 304 removes the “blue colour” visual attribute from the structured visual representation based on the speech embedding. The transformation application unit 304 further applies the addition transformation vector on the structured visual representation to add a visual attribute. For example, the transformation application unit 304 adds the “black colour” visual attribute to the structured visual representation based on the speech embedding. In case the speech embedding does not include addition or removal of the visual attributes from the query image 112, values of the removal transformation vector and the addition transformation vector may correspond to zero. Thus, in such a manner the transformation module 108 transforms the structured visual representation by replacing one or more visual attributes with the modified visual attributes based on the speech embedding.
In an exemplary scenario, an e-commerce platform may be accessed by a user, such as a consumer. The user may provide an input of the query image 112. In one example, the query image 112 may be an image of a one shoulder dress which may be blue in colour. In such a case, the visual attributes are “dress”, “blue” and “one shoulder”. The structured visual representation includes the linear combination of the visual attributes, such as “dress”, “blue” and “one shoulder”. The consumer may require a dress similar to the dress in the query image 112. For example, the speech interaction 114 may include “similar dress but black in colour with stripes”. Based on the speech interaction 114, the speech encoder module 106 generates the corresponding speech embedding. The transformation module 108 receives the structured visual representation as well as the speech embedding to transform the structured visual representation. For example, the transformation module 108 modifies the visual attribute “blue” in the structured visual representation with the visual attribute “black” and adds the visual attribute “striped” based on the speech embedding. The search module 110 receives the transformed structured visual representation that includes the visual attribute “black” as specified in the speech embedding. Moreover, the database 110A includes the set of product listings (such as dresses) on the search portal (e.g., an e-commerce website). The set of product listings may include one or more images of the product, the information associated with the product and a link to purchase the product. The search module 110 searches for the image search query in the database 110A to output at least one target image, such as the target image 116. The at least one target image may include similar dresses with common visual attributes such as “one shoulder dress”, “black in colour” and “striped pattern”. Therefore, at least one target image is a refined or a filtered image required by the user, such as the consumer.
FIG. 4 is a flowchart of a method for training a computing device, in accordance with an embodiment of the present disclosure. With reference to FIG. 4, there is shown a method 400 for training a computing device. The method 400 includes steps 402 to 410. The method 400 introduces an efficient way of training the computing device 102. The method 400 is described in detail, in following steps.
At 402, the method 400 comprises sampling triplets of a query image, a speech interaction and a target image. For example, a dataset may be created with a plurality of triplets (or samples) of the query image, the speech interaction and the target image. The speech interaction specifies the modification to be done to the query image to obtain the target image as the output.
At 404, the method 400 further comprises for each triplet, processing the query image (Iq) and the target image (Ir) using the vision encoder module 104 to generate the structural visual representation (Xq) and the target structural visual representation (Xr), and processing the speech interaction (s) using the speech encoder module 106 to generate the speech embedding (Xs).
The processing operations are represented as:
Xq = vision encoder (Iq) (1)
Xr = vision encoder (Ir) (2)
Xs = speech encoder (S) (3)
At 406, the method 400 further comprises for each triplet, transforming the structured visual representation (Xq) based on the speech embedding (Xs) to generate the transformed structured visual representation. Transformation of the structured visual representation (Xq) includes generation of the transformation vectors, such as the removal transformation vector (Tr) and the addition transformation vector (Ta).
The transformation operation is represented as:
(Tr, Ta) = transformation (Xq, Xs) (4)
At 408, the method 400 further comprises, for each triplet, determining the one or more loss functions 202 that include a transformation loss function based on a difference between the transformed structured visual representation and the target structured visual representation. The loss functions 202 are utilized as constraints for the vision encoder module 104. The transformation loss (Lt) ensures a resemblance between the modified visual attributes of the query image 112 and the target image 116 by minimization of a mean-squared error.
The transformation loss (Lt) is expressed as:
Lt = || Xq - Xr || (5)
In some embodiments, the loss functions 202 further includes one or more independent loss functions based on the composite loss 204, the identifiability loss 206 and/or the separability loss 208. The loss functions 202 are utilized by the vision encoder module 104 as the constraints to determine the structured visual representation.
In an implementation, the composite loss 204 (Lc) is configured to reduce a difference between the structured visual representation and a sum of all the visual attributes. The difference between the structured visual representation and the sum of all the visual attributes is reduced based on the minimization of the mean-squared error.
The composite loss (Lc) is expressed as:
Lc = || Xq - sum(Xwi) || (6), where Xwi denotes the i-th visual attribute and sum(Xwi) is the sum of all the visual attributes.
In an implementation, the identifiability loss 206 (Li) is configured to reduce a cross-entropy of the structured visual representation. The identifiability loss 206 ensures that the existing attribute words can be obtained from the structured visual representation.
The identifiability loss (Li) is expressed as:
Li = cross entropy (Xq) (7)
In an implementation, the separability loss 208 (Ls) is configured to reduce a dot product between all of the visual attributes. The reduction in the dot product between all of the visual attributes ensures that each of the visual attributes are as orthogonal to each other as possible.
Ls = (Xwi * Xwj * ...) (8), where i, j represent different visual attributes.
At 410, the method 400 further comprises, for each triplet, updating parameters of the vision encoder module 104, the speech encoder module 106 and the transformation module 108 to reduce the one or more loss functions 202. The transformation vectors are applied on the structured visual representation to obtain the at least one target image as the output.
The update of the structured visual representation is expressed as:
Xq = (Xq - Tr) + Ta (9)
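Putting equations (1) to (9) together, a single training step over one sampled triplet might look as follows. This is an assumed end-to-end sketch: the module objects, the optimizer and any loss weighting are placeholders carried over from the earlier sketches, and the structural losses of equations (6) to (8) are only indicated, using the helpers sketched with FIG. 2.

```python
# Assumed training step over one sampled triplet, following equations (1)-(9).
import torch.nn.functional as F


def training_step(vision_enc, speech_enc, transform, optimizer,
                  query_img, speech_mel, target_img):
    x_q = vision_enc(query_img)      # (1) structured visual representation of the query
    x_r = vision_enc(target_img)     # (2) target structured visual representation
    x_s = speech_enc(speech_mel)     # (3) speech embedding

    x_t = transform(x_q, x_s)        # (4) infer T_r, T_a and (9) apply x_t = (x_q - T_r) + T_a

    loss = F.mse_loss(x_t, x_r)      # (5) transformation loss L_t
    # loss = loss + w_c * L_c + w_i * L_i + w_s * L_s   # (6)-(8), if attribute supervision is available

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                 # jointly update vision, speech and transformation parameters
    return float(loss.detach())
```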
During the testing phase of the computing device 102, the sampled triplets of the query image, the speech interaction and the target image are utilized to generate the target visual representation. The target visual representation is used to query a large database of image visual attributes which have been extracted by the vision encoder module 104 and stored prior to the search.
The steps 402 to 410 are only illustrative, and other alternatives can also be provided where one or more steps are added, one or more steps are removed, or one or more steps are provided in a different sequence without departing from the scope of the claims herein.
FIG. 5 is a flowchart of a method for performing a visual search based on feature translation, in accordance with an embodiment of the present disclosure. With reference to FIG. 5, there is shown a method 500 for performing a visual search based on feature translation. The method 500 includes steps 502 to 508.
The method 500 introduces an efficient way of performing the visual search based on feature translation. The method 500 is described in detail, in following steps.
At 502, the method 500 comprises receiving, by the vision encoder module 104, the query image 112 and generating the structured visual representation, representing the query image 112 as the linear combination of the visual attributes.
At 504, the method 500 comprises receiving, by the speech encoder module 106, the speech interaction including one or more visual attribute modifications and generating the corresponding speech embedding. At 506, the method 500 comprises transforming, by the transformation module 108, the structured visual representation by replacing one or more visual attributes with modified visual attributes based on the speech embedding.
At 508, the method 500 comprises generating, by the search module 110, the image search query based onto the transformed structured visual representation and outputting at least one target image, such as the target image 116 based on the image search query.
The steps 502 to 508 are only illustrative, and other alternatives can also be provided where one or more steps are added, one or more steps are removed, or one or more steps are provided in a different sequence without departing from the scope of the claims herein.
In an implementation, replacing the visual attribute includes identifying the corresponding visual attribute in the structure visual representation, removing the identified attribute, and adding the modified visual attribute based on the speech embedding.
In an implementation, in response to receiving the second speech interaction, the method further comprises generating the second speech embedding based on the second speech interaction. The method further comprises generating the second transformed structured visual representation by replacing one or more visual attributes of the transformed structured visual representation with second modified visual attributes based on the second speech embedding. The method further comprises generating the second image search query based onto the second transformed structured visual representation and outputting at least one second target image based on the second image search query.
FIG. 6 illustrates an example of a hardware implementation for the computing device of FIG. 1, in accordance with an embodiment of the present disclosure. With reference to FIG. 6, there is shown a block diagram 600 that includes the computing device 102. The computing device 102 further includes a processor 602, a memory 604 and a network interface 606, in addition to the vision encoder module 104, the speech encoder module 106, the transformation module 108 and the search module 110.
The vision encoder module 104 is configured to receive an image input, process the image input, and generate an output. In an embodiment, the vision encoder module 104 is implemented as a vision encoder circuit. The speech encoder module 106 is configured to receive a speech input, process the speech input and generate the output. In an embodiment, the speech encoder module 106 is implemented as a speech encoder circuit. The transformation module 108 is configured to transform the received structured visual representation. In an embodiment, the transformation module 108 is implemented as a transformation circuit. The search module 110 is configured to receive an image-based input, generate an image search query based on the input and generate at least one image-based output. In an embodiment, the search module 110 is implemented as a search module circuit.
The processor 602 refers to a computational element that is configured to respond to and process instructions that drive the computing device 102. The processor 602 may cause the vision encoder module 104, the speech encoder module 106, the transformation module 108 and the search module 110 to perform their respective functions as described. In operation, the processor 602 is configured to perform all the operations of the computing device 102. Examples of implementation of the processor 602 may include, but are not limited to, a central processing unit (CPU), a microprocessor, a microcontroller, a complex instruction set computing (CISC) microprocessor, a reduced instruction set (RISC) microprocessor, a very long instruction word (VLIW) microprocessor, or any other type of processing circuit. Furthermore, the processor 602 may refer to one or more individual processors, processing devices and various elements associated with a processing device that may be shared by other processing devices.
The memory 604 refers to a storage medium in which data or software may be stored. For example, the memory 604 may store the instructions that drive the computing device 102. Examples of implementation of the memory 604 may include, but are not limited to, Electrically Erasable Programmable Read-Only Memory (EEPROM), Read-Only Memory (ROM), Hard Disk Drive (HDD), Solid-State Drive (SSD), and/or CPU cache memory.
The network interface 606 includes suitable logic, circuitry, and interfaces that may be configured to communicate with one or more external devices, such as a server or another computing device. Examples of the network interface 606 may include, but are not limited to, an antenna, a network interface card (NIC), a transceiver, one or more amplifiers, one or more oscillators, a digital signal processor, and/or a coder-decoder (CODEC) chipset.
Modifications to embodiments of the present disclosure described in the foregoing are possible without departing from the scope of the present disclosure as defined by the accompanying claims. Expressions such as "including", "comprising", "incorporating", "have", "is" used to describe and claim the present disclosure are intended to be construed in a non-exclusive manner, namely allowing for items, components or elements not explicitly described also to be present. Reference to the singular is also to be construed to relate to the plural. The word "exemplary" is used herein to mean "serving as an example, instance or illustration". Any embodiment described as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments and/or to exclude the incorporation of features from other embodiments. The word "optionally" is used herein to mean "is provided in some embodiments and not provided in other embodiments". It is appreciated that certain features of the present disclosure, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the present disclosure, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable combination or as suitable in any other described embodiment of the disclosure.

Claims

1. A computing device (102) for performing a visual search, comprising:
a vision encoder module (104) configured to receive a query image (112) and generate a structured visual representation, representing the query image (112) as a linear combination of visual attributes;
a speech encoder module (106) configured to receive a speech interaction including one or more visual attribute modifications and generate a corresponding speech embedding;
a transformation module (108) configured to transform the structured visual representation by replacing one or more visual attributes with modified visual attributes based on the speech embedding; and
a search module (110) configured to generate an image search query based on the transformed structured visual representation and output at least one target image (116) based on the image search query.
2. The computing device (102) of claim 1, wherein the vision encoder module (104) is a convolutional network or a visual transformer network.
3. The computing device (102) of claim 1 or claim 2, wherein replacing a visual attribute includes identifying a corresponding visual attribute in the structured visual representation, removing the identified attribute and adding the modified visual attribute based on the speech embedding.
4. The computing device (102) of any preceding claim, wherein the transformation module (108) includes a multi-layer perceptron including one or more ReLU activations between each layer.
5. The computing device (102) of any preceding claim, wherein in response to receiving a second speech interaction:
the speech encoder module (106) is configured to generate a second speech embedding based on the second speech interaction;
the transformation module (108) is configured to generate a second transformed structured visual representation by replacing one or more visual attributes of the transformed structured visual representation with second modified visual attributes based on the second speech embedding; and
the search module (110) is configured to generate a second image search query based on the second transformed structured visual representation and output at least one second target image based on the second image search query.
6. A method (400) of training the computing device (102) of any preceding claim, comprising:
sampling triplets of a query image (112), a speech interaction (114) and a target image (116), and for each triplet:
processing the query image (112) and target image (116) using the vision encoder module (104) to generate a structured visual representation and a target structured visual representation, and processing the speech interaction using the speech encoder module (106) to generate a speech embedding;
transforming the structured visual representation based on the speech embedding to generate a transformed structured visual representation;
determining one or more loss functions (202) including a transformation loss function based on a difference between the transformed structured visual representation and the target structured visual representation; and
updating parameters of the vision encoder module (104), speech encoder module (106) and transformation module (108) to reduce the one or more loss functions (202).
7. The method (400) of claim 6, wherein the loss functions (202) further include one or more independent loss functions based on composite loss (204), identifiability loss (206) and/or separability loss (208).
8. The method (400) of claim 7, wherein the composite loss (204) is configured to reduce a difference between the structured visual representation and a sum of all the visual attributes.
9. The method (400) of claim 7 or claim 8, wherein the identifiability loss (206) is configured to reduce a cross entropy of the structured visual representation.
10. The method (400) of any one of claims 6 to 9, wherein the separability loss (208) is configured to reduce a dot product between all of the visual attributes.
11. The computing device (102) of any one of claims 1 to 5, trained according to the method (400) of any one of claims 6 to 10.
12. A method (500) of performing a visual search, comprising:
receiving, by a vision encoder module (104), a query image (112) and generating a structured visual representation, representing the query image (112) as a linear combination of visual attributes;
receiving, by a speech encoder module (106), a speech interaction (114) including one or more visual attribute modifications and generating a corresponding speech embedding;
transforming, by a transformation module (108), the structured visual representation by replacing one or more visual attributes with modified visual attributes based on the speech embedding; and
generating, by a search module (110), an image search query based on the transformed structured visual representation and outputting at least one target image (116) based on the image search query.
13. The method (500) of claim 12, wherein replacing a visual attribute includes identifying a corresponding visual attribute in the structured visual representation, removing the identified attribute and adding the modified visual attribute based on the speech embedding.
14. The method (500) of claim 12 or claim 13, further comprising, in response to receiving a second speech interaction:
generating a second speech embedding based on the second speech interaction;
generating a second transformed structured visual representation by replacing one or more visual attributes of the transformed structured visual representation with second modified visual attributes based on the second speech embedding; and
generating a second image search query based on the second transformed structured visual representation and outputting at least one second target image based on the second image search query.
15. A computer readable medium comprising instructions which, when executed by a processor, cause the processor to perform the method of any one of claims 12 to 14.