WO2023101679A1

WO2023101679A1 - Text-image cross-modal retrieval based on virtual word expansion

Info

Publication number: WO2023101679A1
Application number: PCT/US2021/061657
Authority: WO
Inventors: Jenhao Hsiao; Yikang Li
Original assignee: Innopeak Technology, Inc.
Priority date: 2021-12-02
Filing date: 2021-12-02
Publication date: 2023-06-08

Abstract

This application is directed to image retrieval. An electronic system obtains a textual query and an image, and determines a plurality of virtual word embeddings associated with a plurality of objects in the image. The image is divided to a plurality of non-overlapping image patches. The electronic system generates a plurality of feature vectors associated with the non-overlapping image patches. The feature vectors and virtual word embeddings are fused to generate mi augmented visual embedding associated with the image. The electronic system generates a text embedding from the textual query, and a similarity level between the text embedding and the augmented visual embedding. The image is retrieved in response to the textual query based on the similarity level. In some embodiments, the similarity level is greater than other similarity levels of a plurality of other images, and the image is retrieved as an image search result to the textual query.

Description

Text-Image Cross-Modal Retrieval Based on Virtual Word Expansion

TECHNICAL FIELD

[0001] This application relates generally to data processing technology including, but not limited to, methods, systems, and non-transitory computer-readable media for using deep learning techniques to retrieve images in response to textual queries.

BACKGROUND

[0002] Text-image cross-modal retrieval aims to retrieve a list of relevant images from an image corpus given textual queries. This has remained a challenge because it is difficult to eliminate a large visual-semantic discrepancy between language and vision. Most previous work embeds images and textual queries independently into the same embedding space and measures text-image similarities using feature distances in a visual-semantic joint space. However, such a visual-semantic embedding does not take into account cross-modal messages during text-image retrieval, nor does it suppress information unrelated to text- image matching (e.g., background regions that are not described in the textual queries) for matched images and textual queries during message passing. Specifically, many photo albums use pre-defined categories as the indexed tags to build their search engines, and keyword matching is performed between category names and a textual query to retrieve photos having the same category name as the user’s search term. For these reasons, text- image cross-modal retrieval has suffered from a limited accuracy level. It would be beneficial to develop systems and methods to retrieve images from an image corpus accurately and efficiently in response to textual queries.

SUMMARY

[0003] Accordingly, there is a need for an accurate and efficient image retrieval mechanism for responding to textual queries. A large number of photos are taken by a user of a smartphone and create a challenge on photo organization and access of a photo album. A photo search engine is implemented based on a text-to-image retrieval method to help users efficiently retrieve related photos by keywords. Specifically, salient objects in an image can be accurately detected and are often mentioned in a textual query. Virtual words expansion (VWE) is applied to facilitate learning of visual-semantic embedding and alignments. In VWE, virtual words describing objects are detected in images as anchor points to bridge a gap between the images and descriptive text of the textual query. For a benchmark dataset, a VWE-based network improves an accuracy level for image retrieval and outperforms existing blind alignment methods. The VWE-based network receives a text-image pair (T, I) as an input data pair and represents each image-text pair as a text-VWords-image triple (T, V, I), where T is a sentence of a textual query, I is an image, and V is virtual words of object tags (in text) that are detected from the image and used to bridge corresponding visual and textual spaces.

[0004] In one aspect, a method is implemented at an electronic system (e.g., a mobile phone) for image retrieval. The method includes obtaining a textual query and an image. The method further includes determining a plurality of virtual word embeddings associated with a plurality of objects in the image, dividing the image to a plurality of non-overlapping image patches, generating a plurality of feature vectors associated with the plurality of non- overlapping image patches, and fusing the plurality of feature vectors and the plurality of virtual word embeddings to generate an augmented visual embedding associated with the image. The method further includes generating a text embedding from the textual query, and generating a similarity level between the text embedding and the augmented visual embedding. The method further includes retrieving the image in response to the textual query based on the similarity level.

[0005] In some embodiments, the image includes a first image, and the similarity level includes a first similarity level between the first image and the textual query. The method includes in accordance with a determination that the first similarity level is greater than a plurality of second similarity levels of a plurality of second images, identifying the first image as an image search result to the textual query. Further, in some embodiments, the method includes obtaining the plurality of second images and for each respective second image, fusing a plurality of second feature vectors and a plurality of second virtual word embeddings of the respective second image to generate a second augmented visual embedding associated with the respective second image. The method further includes generating the plurality of second similarity levels from the text embedding and the second augmented visual embedding associated with each second image.

[0006] In another aspect, some implementations include an electronic system that includes one or more processors and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform any of the above methods.

[0007] In yet another aspect, some implementations include a non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform any of the above methods.

[0008] These illustrative embodiments and implementations are mentioned not to limit or define the disclosure, but to provide examples to aid understanding thereof.

Additional embodiments are discussed in the Detailed Description, and further description is provided there.

BRIEF DESCRIPTION OF THE DRAWINGS

[0009] For a better understanding of the various described implementations, reference should be made to the Detailed Description below, in conjunction with the following drawings in which like reference numerals refer to corresponding parts throughout the figures.

[0010] Figure 1 is an example data processing environment having one or more servers communicatively coupled to one or more client devices, in accordance with some embodiments.

[0011] Figure 2 is a block diagram illustrating a data processing system, in accordance with some embodiments.

[0012] Figure 3 is an example data processing environment for training and applying a neural network-based data processing model for processing visual and/or audio data, in accordance with some embodiments.

[0013] Figure 4A is an example neural network applied to process content data in an NN-based data processing model, in accordance with some embodiments, and Figure 4B is an example node in the neural network, in accordance with some embodiments.

[0014] Figure 5 is a flow diagram of an image retrieval process in which an image is retrieved according to a textual query, in accordance with some embodiments.

[0015] Figure 6 is a block diagram of a visual image encoder configured to generate an augmented visual embedding from an image based on virtual word embeddings, in accordance with some embodiments. [0016] Figure 7 is a flow diagram of a training process in which an comprehensive visual-semantic model applied in Figures 5 and 6 are trained, in accordance with some embodiments.

[0017] Figure 8 is a flowchart of an image retrieval method implemented by an electronic system, in accordance with some embodiments.

[0018] Like reference numerals refer to corresponding parts throughout the several views of the drawings.

DETAILED DESCRIPTION

[0019] Reference will now be made in detail to specific embodiments, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous non-limiting specific details are set forth in order to assist in understanding the subject matter presented herein. But it will be apparent to one of ordinary skill in the art that various alternatives may be used without departing from the scope of claims and the subject matter may be practiced without these specific details. For example, it will be apparent to one of ordinary skill in the art that the subject matter presented herein can be implemented on many types of electronic devices with digital video capabilities.

[0020] Figure 1 is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, in accordance with some embodiments. The one or more client devices 104 may be, for example, desktop computers 104A, tablet computers 104B, mobile phones 104C, head-mounted display (HMD) (also called augmented reality (AR) glasses) 104D, or intelligent, multi-sensing, network- connected home devices (e.g., a surveillance camera 104E, a smart television device, a drone). Each client device 104 can collect data or user inputs, executes user applications, and present outputs on its user interface. The collected data or user inputs can be processed locally at the client device 104 and/or remotely by the server(s) 102. The one or more servers 102 provide system data (e.g., boot files, operating system images, and user applications) to the client devices 104, and in some embodiments, processes the data and user inputs received from the client device(s) 104 when the user applications are executed on the client devices 104. In some embodiments, the data processing environment 100 further includes a storage 106 for storing data related to the servers 102, client devices 104, and applications executed on the client devices 104. [0021] The one or more servers 102 can enable real-time data communication with the client devices 104 that are remote from each other or from the one or more servers 102. Further, in some embodiments, the one or more servers 102 can implement data processing tasks that cannot be or are preferably not completed locally by the client devices 104. For example, the client devices 104 include a game console (e.g., the HMD 104D) that executes an interactive online gaming application. The game console receives a user instruction and sends it to a game server 102 with user data. The game server 102 generates a stream of video data based on the user instruction and user data and providing the stream of video data for display on the game console and other client devices that are engaged in the same game session with the game console. In another example, the client devices 104 include a networked surveillance camera and a mobile phone 104C. The networked surveillance camera collects video data and streams the video data to a surveillance camera server 102 in real time. While the video data is optionally pre-processed on the surveillance camera, the surveillance camera server 102 processes the video data to identify motion or audio events in the video data and share information of these events with the mobile phone 104C, thereby allowing a user of the mobile phone 104 to monitor the events occurring near the networked surveillance camera in the real time and remotely.

[0022] The one or more servers 102, one or more client devices 104, and storage 106 are communicatively coupled to each other via one or more communication networks 108, which are the medium used to provide communications links between these devices and computers connected together within the data processing environment 100. The one or more communication networks 108 may include connections, such as wire, wireless communication links, or fiber optic cables. Examples of the one or more communication networks 108 include local area networks (LAN), wide area networks (WAN) such as the Internet, or a combination thereof. The one or more communication networks 108 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol. A connection to the one or more communication networks 108 may be established either directly (e.g., using 3G/4G connectivity to a wireless carrier), or through a network interface 110 (e.g., a router, switch, gateway, hub, or an intelligent, dedicated whole-home control node), or through any combination thereof. As such, the one or more communication networks 108 can represent the Internet of a worldwide collection of networks and gateways that use the Transmission Control Protocol/Intemet Protocol (TCP/IP) suite of protocols to communicate with one another. At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational and other computer systems that route data and messages.

[0023] In some embodiments, deep learning techniques are applied in the data processing environment 100 to process content data (e.g., video data, visual data, audio data) obtained by an application executed at a client device 104 to identify information contained in the content data, match the content data with other data, categorize the content data, or synthesize related content data. The content data may broadly include inertial sensor data captured by inertial sensor(s) of a client device 104. In these deep learning techniques, data processing models are created based on one or more neural networks to process the content data. These data processing models are trained with training data before they are applied to process the content data. Subsequently to model training, the mobile phone 104C or HMD 104D obtains the content data (e.g., captures video data via an internal camera) and processes the content data using the data processing models locally.

[0024] In some embodiments, both model training and data processing are implemented locally at each individual client device 104 (e.g., the mobile phone 104C and HMD 104D). The client device 104 obtains the training data from the one or more servers 102 or storage 106 and applies the training data to train the data processing models. Alternatively, in some embodiments, both model training and data processing are implemented remotely at a server 102 (e.g., the server 102A) associated with a client device 104 (e.g. the client device 104A and HMD 104D). The server 102A obtains the training data from itself, another server 102 or the storage 106 and applies the training data to train the data processing models. The client device 104 obtains the content data, sends the content data to the server 102A (e.g., in an application) for data processing using the trained data processing models, receives data processing results (e.g., recognized hand gestures) from the server 102A, presents the results on a user interface (e.g., associated with the application), rending virtual objects in a field of view based on the poses, or implements some other functions based on the results. The client device 104 itself implements no or little data processing on the content data prior to sending them to the server 102A. Additionally, in some embodiments, data processing is implemented locally at a client device 104 (e.g., the client device 104B and HMD 104D), while model training is implemented remotely at a server 102 (e.g., the server 102B) associated with the client device 104. The server 102B obtains the training data from itself, another server 102 or the storage 106 and applies the training data to train the data processing models. The trained data processing models are optionally stored in the server 102B or storage 106. The client device 104 imports the trained data processing models from the server 102B or storage 106, processes the content data using the data processing models, and generates data processing results to be presented on a user interface or used to initiate some functions (e.g., rendering virtual objects based on device poses) locally.

[0025] In some embodiments, a pair of AR glasses 104D (also called an HMD) are communicatively coupled in the data processing environment 100. The AR glasses 104D can be includes a camera, a microphone, a speaker, one or more inertial sensors (e.g., gyroscope, accelerometer), and a display. The camera and microphone are configured to capture video and audio data from a scene of the AR glasses 104D, while the one or more inertial sensors are configured to capture inertial sensor data. In some situations, the camera captures hand gestures of a user wearing the AR glasses 104D, and recognizes the hand gestures locally and in real time using a two-stage hand gesture recognition model. In some situations, the microphone records ambient sound, including user’s voice commands. In some situations, both video or static visual data captured by the camera and the inertial sensor data measured by the one or more inertial sensors are applied to determine and predict device poses. The video, static image, audio, or inertial sensor data captured by the AR glasses 104D is processed by the AR glasses 104D, server(s) 102, or both to recognize the device poses. Optionally, deep learning techniques are applied by the server(s) 102 and AR glasses 104D jointly to recognize and predict the device poses. The device poses are used to control the AR glasses 104D itself or interact with an application (e.g., a gaming application) executed by the AR glasses 104D. In some embodiments, the display of the AR glasses 104D displays a user interface, and the recognized or predicted device poses are used to render or interact with user selectable display items (e.g., an avatar) on the user interface.

[0026] As explained above, in some embodiments, deep learning techniques are applied in the data processing environment 100 to process video data, static image data, or inertial sensor data captured by the AR glasses 104D. 2D or 3D device poses are recognized and predicted based on such video, static image, and/or inertial sensor data using a first data processing model. Visual content is optionally generated using a second data processing model. Training of the first and second data processing models is optionally implemented by the server 102 or AR glasses 104D. Inference of the device poses and visual content is implemented by each of the server 102 and AR glasses 104D independently or by both of the server 102 and AR glasses 104D jointly.

[0027] Figure 2 is a block diagram illustrating an electronic system 200 for data processing, in accordance with some embodiments. The electronic system 200 includes a server 102, a client device 104 (e.g., AR glasses 104D in Figure 1), a storage 106, or a combination thereof. The electronic system 200, typically, includes one or more processing units (CPUs) 202, one or more network interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset). The electronic system 200 includes one or more input devices 210 that facilitate user input, such as a keyboard, a mouse, a voice-command input unit or microphone, a touch screen display, a touch-sensitive input pad, a gesture capturing camera, or other input buttons or controls. Furthermore, in some embodiments, the client device 104 of the electronic system 200 uses a microphone and voice recognition or a camera and gesture recognition to supplement or replace the keyboard. In some embodiments, the client device 104 includes one or more cameras, scanners, or photo sensor units for capturing images, for example, of graphic serial codes printed on the electronic devices. The electronic system 200 also includes one or more output devices 212 that enable presentation of user interfaces and display content, including one or more speakers and/or one or more visual displays.

Optionally, the client device 104 includes a location detection device, such as a GPS (global positioning satellite) or other geo-location receiver, for determining the location of the client device 104.

[0028] Memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices. Memory 206, optionally, includes one or more storage devices remotely located from one or more processing units 202. Memory 206, or alternatively the non-volatile memory within memory 206, includes a non-transitory computer readable storage medium. In some embodiments, memory 206, or the non- transitory computer readable storage medium of memory 206, stores the following programs, modules, and data structures, or a subset or superset thereof:

• Operating system 214 including procedures for handling various basic system services and for performing hardware dependent tasks;

• Network communication module 216 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or storage 106) via one or more network interfaces 204 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;

• User interface module 218 for enabling presentation of information (e.g., a graphical user interface for applications) 224, widgets, websites and web pages thereof, and/or games, audio and/or video content, text, etc.) at each client device 104 via one or more output devices 212 (e.g., displays, speakers, etc.);

• Input processing module 220 for detecting one or more user inputs or interactions from one of the one or more input devices 210 and interpreting the detected input or interaction;

• Web browser module 222 for navigating, requesting (e.g., via HTTP), and displaying websites and web pages thereof, including a web interface for logging into a user account associated with a client device 104 or another electronic device, controlling the client or electronic device if associated with the user account, and editing and reviewing settings and data that are associated with the user account;

• One or more user applications 224 for execution by the electronic system 200 (e.g., games, social network applications, smart home applications, and/or other web or non-web based applications for controlling another electronic device and reviewing data captured by such devices);

• Model training module 226 for receiving training data and establishing a data processing model for processing content data (e.g., video, image, audio, or textual data) to be collected or obtained by a client device 104;

• Data processing module 228 (e.g., applied in an image retrieval process 500 in Figure 5) for processing content data using data processing models 240 (e.g., a comprehensive visual-semantic model), thereby identifying information contained in the content data, matching the content data with other data, categorizing the content data, or synthesizing related content data, where in some embodiments, the data processing module 228 is associated with one of the user applications 224 to process the content data in response to a user instruction received from the user application 224;

• One or more databases 230 for storing at least data including one or more of: o Device settings 232 including common device settings (e.g., service tier, device model, storage capacity, processing capabilities, communication capabilities, etc.) of the one or more servers 102 or client devices 104; o User account information 234 for the one or more user applications 224, e.g., user names, security questions, account history data, user preferences, and predefined account settings; o Network parameters 236 for the one or more communication networks 108, e.g., IP address, subnet mask, default gateway, DNS server and host name; o Training data 238 for training one or more data processing models 240; o Data processing model(s) 240 for processing content data (e.g., video, image, audio, or textual data) using deep learning techniques, where in an example, the data processing models 240 includes a comprehensive visual-semantic model that includes one or more of: an object detection module 532, a text encoder 534, a virtual words transformer encoder 534, a linear projection layer 540, a fusion transformer 542, a linear layer 544, and a text encoder 528, e.g., in Figure 5; o Content data and results 242 that are obtained by and outputted to the client device 104 of the electronic system 200, respectively, where the content data is processed by the data processing models 240 locally at the client device 104 or remotely at the server 102 to provide the associated results 242 to be presented on client device 104.

[0029] Optionally, the one or more databases 230 are stored in one of the server 102, client device 104, and storage 106 of the electronic system 200. Optionally, the one or more databases 230 are distributed in more than one of the server 102, client device 104, and storage 106 of the electronic system 200. In some embodiments, more than one copy of the above data is stored at distinct devices, e.g., two copies of the data processing models 240 are stored at the server 102 and storage 106, respectively.

[0030] Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above. The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, modules or data structures, and thus various subsets of these modules may be combined or otherwise re-arranged in various embodiments. In some embodiments, memory 206, optionally, stores a subset of the modules and data structures identified above. Furthermore, memory 206, optionally, stores additional modules and data structures not described above.

[0031] Figure 3 is another example data processing system 300 for training and applying a neural network based (NN-based) data processing model 240 for processing content data (e.g., video, image, audio, or textual data), in accordance with some embodiments. The data processing system 300 includes a model training module 226 for establishing the data processing model 240 and a data processing module 228 for processing the content data using the data processing model 240. In some embodiments, both of the model training module 226 and the data processing module 228 are located on a client device 104 of the data processing system 300, while a training data source 304 distinct form the client device 104 provides training data 306 to the client device 104. The training data source 304 is optionally a server 102 or storage 106. Alternatively, in some embodiments, both of the model training module 226 and the data processing module 228 are located on a server 102 of the data processing system 300. The training data source 304 providing the training data 306 is optionally the server 102 itself, another server 102, or the storage 106.

Additionally, in some embodiments, the model training module 226 and the data processing module 228 are separately located on a server 102 and client device 104, and the server 102 provides the trained data processing model 240 to the client device 104.

[0032] The model training module 226 includes one or more data pre-processing modules 308, a model training engine 310, and a loss control module 312. The data processing model 240 is trained according to a type of the content data to be processed. The training data 306 is consistent with the type of the content data, so is a data pre-processing module 308 applied to process the training data 306 consistent with the type of the content data. For example, an image pre-processing module 308A is configured to process image training data 306 to a predefined image format, e.g., extract a region of interest (ROI) in each training image, and crop each training image to a predefined image size. Alternatively, an audio pre-processing module 308B is configured to process audio training data 306 to a predefined audio format, e.g., converting each training sequence to a frequency domain using a Fourier transform. The model training engine 310 receives pre-processed training data provided by the data pre-processing modules 308, further processes the pre-processed training data using an existing data processing model 240, and generates an output from each training data item. During this course, the loss control module 312 can monitor a loss function comparing the output associated with the respective training data item and a ground truth of the respective training data item. The model training engine 310 modifies the data processing model 240 to reduce the loss function, until the loss function satisfies a loss criteria (e.g., a comparison result of the loss function is minimized or reduced below a loss threshold). The modified data processing model 240 is provided to the data processing module 228 to process the content data.

[0033] In some embodiments, the model training module 226 offers supervised learning in which the training data is entirely labelled and includes a desired output for each training data item (also called the ground truth in some situations). Conversely, in some embodiments, the model training module 226 offers unsupervised learning in which the training data are not labelled. The model training module 226 is configured to identify previously undetected patterns in the training data without pre-existing labels and with no or little human supervision. Additionally, in some embodiments, the model training module 226 offers partially supervised learning in which the training data are partially labelled.

[0034] The data processing module 228 includes a data pre-processing modules 314, a model-based processing module 316, and a data post-processing module 318. The data pre- processing modules 314 pre-processes the content data based on the type of the content data. Functions of the data pre-processing modules 314 are consistent with those of the pre- processing modules 308 and covert the content data to a predefined content format that is acceptable by inputs of the model-based processing module 316. Examples of the content data include one or more of: video, image, audio, textual, and other types of data. For example, each image is pre-processed to extract an ROI or cropped to a predefined image size, and an audio clip is pre-processed to convert to a frequency domain using a Fourier transform. In some situations, the content data includes two or more types, e.g., video data and textual data. The model-based processing module 316 applies the trained data processing model 240 provided by the model training module 226 to process the pre-processed content data. The model -based processing module 316 can also monitor an error indicator to determine whether the content data has been properly processed in the data processing model 240. In some embodiments, the processed content data is further processed by the data post- processing module 318 to present the processed content data in a preferred format or to provide other related information that can be derived from the processed content data. [0035] Figure 4A is an example neural network (NN) 400 applied to process content data in an NN-based data processing model 240, in accordance with some embodiments, and Figure 4B is an example node 420 in the neural network (NN) 400, in accordance with some embodiments. The data processing model 240 is established based on the neural network 400. A corresponding model-based processing module 316 applies the data processing model 240 including the neural network 400 to process content data that has been converted to a predefined content format. The neural network 400 includes a collection of nodes 420 that are connected by links 412. Each node 420 receives one or more node inputs and applies a propagation function to generate a node output from the one or more node inputs. As the node output is provided via one or more links 412 to one or more other nodes 420, a weight w associated with each link 412 is applied to the node output. Likewise, the one or more node inputs are combined based on corresponding weights w₁, W₂, w₃, and w₄ according to the propagation function. In an example, the propagation function is a product of a non-linear activation function and a linear weighted combination of the one or more node inputs.

[0036] The collection of nodes 420 is organized into one or more layers in the neural network 400. Optionally, the one or more layers includes a single layer acting as both an input layer and an output layer. Optionally, the one or more layers includes an input layer 402 for receiving inputs, an output layer 406 for providing outputs, and zero or more hidden layers 404 (e.g., 404A and 404B) between the input and output layers 402 and 406. A deep neural network has more than one hidden layers 404 between the input and output layers 402 and 406. In the neural network 400, each layer is only connected with its immediately preceding and/or immediately following layer. In some embodiments, a layer 402 or 404B is a fully connected layer because each node 420 in the layer 402 or 404B is connected to every node 420 in its immediately following layer. In some embodiments, one of the one or more hidden layers 404 includes two or more nodes that are connected to the same node in its immediately following layer for down sampling or pooling the nodes 420 between these two layers. Particularly, max pooling uses a maximum value of the two or more nodes in the layer 404B for generating the node of the immediately following layer 406 connected to the two or more nodes.

[0037] In some embodiments, a convolutional neural network (CNN) is applied in a data processing model 240 to process content data (particularly, video and image data). The CNN employs convolution operations and belongs to a class of deep neural networks 400, i.e., a feedforward neural network that only moves data forward from the input layer 402 through the hidden layers to the output layer 406. The one or more hidden layers of the CNN are convolutional layers convolving with a multiplication or dot product. Each node in a convolutional layer receives inputs from a receptive area associated with a previous layer (e.g., five nodes), and the receptive area is smaller than the entire previous layer and may vary based on a location of the convolution layer in the convolutional neural network. Video or image data is pre-processed to a predefined video/image format corresponding to the inputs of the CNN. The pre-processed video or image data is abstracted by each layer of the CNN to a respective feature map. By these means, video and image data can be processed by the CNN for video and image recognition, classification, analysis, imprinting, or synthesis. [0038] Alternatively and additionally, in some embodiments, a recurrent neural network (RNN) is applied in the data processing model 240 to process content data (particularly, textual and audio data). Nodes in successive layers of the RNN follow a temporal sequence, such that the RNN exhibits a temporal dynamic behavior. In an example, each node 420 of the RNN has a time-vaiying real-valued activation. Examples of the RNN include, but are not limited to, a long short-term memory (LSTM) network, a fully recurrent network, an Elman network, a Jordan network, a Hopfield network, a bidirectional associative memory (BAM network), an echo state network, an independently RNN (IndRNN), a recursive neural network, and a neural history compressor. In some embodiments, the RNN can be used for handwriting or speech recognition. It is noted that in some embodiments, two or more types of content data are processed by the data processing module 228, and two or more types of neural networks (e.g., both CNN and RNN) are applied to process the content data jointly.

[0039] The training process is a process for calibrating all of the weights w, for each layer of the learning model using a training data set which is provided in the input layer 402. The training process typically includes two steps, forward propagation and backward propagation, which are repeated multiple times until a predefined convergence condition is satisfied. In the forward propagation, the set of weights for different layers are applied to the input data and intermediate results from the previous layers. In the backward propagation, a margin of error of the output (e.g., a loss function) is measured, and the weights are adjusted accordingly to decrease the error. The activation function is optionally linear, rectified linear unit, sigmoid, hyperbolic tangent, or of other types. In some embodiments, a network bias term b is added to the sum of the weighted outputs from the previous layer before the activation function is applied. The network bias b provides a perturbation that helps the NN 400 avoid over fitting the training data. The result of the training includes the network bias parameter b for each layer.

[0040] Figure 5 is a flow diagram of an image retrieval process 500 in which an image 502 is retrieved according to a textual query 504, in accordance with some embodiments. The image retrieval process 500 is implemented by an electronic system (e.g., system 200 in Figure 2) based on a transformer-based cross-modal virtual term expansion model (also called a comprehensive visual-semantic model). The image retrieval process 500 includes a textual branch 506, a virtual word branch 508, and a visual image branch 510. In the textual branch 506, the electronic system obtains a textual query 504 including a word, phrase, sentence, or paragraph (e.g., “a man is surfing the wave of the ocean”), and generates a text embedding 512 (T_emb) from the textual query 504. In the virtual word branch 508, the electronic system obtains the image 502, identifies a plurality of objects 514 (e.g., “person", “surfing board”, “wave”) from the image 502, and generates a plurality of virtual word embeddings 516 (V_att) associated with the plurality of objects 514 in the image 502. Each object 514 is associated with a respective virtual word embedding 516 (V_att). In some embodiments, each of the plurality of objects 514 is associated with a respective confidence level for detecting the respective object 514 in the image 502, and the respective confidence level is greater than a predefined threshold confidence level.

[0041] In the visual image branch 510, the electronic system obtains the image 502, divides the image 502 to a plurality of non-overlapping image patches 518, and generates a plurality of feature vectors 520 associated with the plurality of non-overlapping image patches 518. Each image patch 518 corresponds to a respective feature vector 520. The electronic system receives the plurality of virtual word embeddings 516 (V_att) from the virtual word branch 508, and fuses the plurality of feature vectors 520 and the plurality of virtual word embeddings 516 (V_att) to generate an augmented visual embedding 522 (I_emb) associated with the image 502.

[0042] The augmented visual embedding 522 (I_emb) of the visual image branch 510 and the text embedding 512 (T_emb) of the textual branch 506 are compared to one another to determine a similarity level 524 between the augmented visual embedding 522 (I_emb) and text embedding 512 (T_emb). In some embodiments, the text embedding 512 (T_emb) has a first dimension, and the augmented visual embedding 522 (I_emb) has a second dimension that is equal to the first dimension. The similarity level 524 represents a cosine similarity of the text embedding and the augmented visual embedding 522 (I_emb). The electronic system determines whether the image 502 is retrieved as an image search result 526 in response to the textual query 504 based on the similarity level 524. In some embodiments, the image 502 is included in an image database 560 (e.g., a photo album). The image retrieval process 500 scans images in the image database 560 and determines whether the image 502 is retrieved as the image search result 526 in response to the textual query based on the similarity level 524.

[0043] In some situations, in accordance with a determination that the similarity level 524 exceeds a threshold similarity level, the electronic system identifies the image 502 as the image search result 526. Conversely, in some situations, in accordance with a determination that the similarity level 524 is equal to or less than the threshold similarity level, the electronic system determines that the image 502 is not an image search result 526 of the textual query 504. In some embodiments, only one image search result 526 of the image 502 is returned for the textual query 504. In some embodiments, a plurality of image search results 526 including a plurality of images 502 are returned by the image database 560. The electronic system ranks similarity levels of the images in the image database 560 to the text query 504, and the plurality of image search results 526 have larger similarity levels than remaining images in the image database 560.

[0044] In some embodiments, the image 502 includes a first image 502 A, and the similarity level 524 includes a first similarity level 524A between the first image 502A and the textual query 504. In accordance with a determination that the first similarity level 524A is greater than a plurality of second similarity levels 5024B of a plurality of second images 502B, the electronic system identifies the first image 502A as an image search result 526 to the textual query 504. The electronic system obtains the plurality of second images 502B. For each respective second image 502B, a plurality of second feature vectors 520B and a plurality of second virtual word embeddings 516B of the respective second image 502B are fused to generate a second augmented visual embedding 522B associated with the respective second image 502B and generate the plurality of second similarity levels 524B from the text embedding 512 (T_emb) and the second augmented visual embedding 522B associated with each second image 502B.

[0045] In some embodiments, the electronic system generates the text embedding 512 (T_emb) from the textual query 504 using a text encoder 528. The text encoder 528 enables a contextual language representation of the textual query 504, and each text embedding 512 (T_emb) includes a CLS token generated by a Bidirectional Encoder Representation from Transformer (BERT) model that considers a context of sentences, sentence pairs, or paragraphs. The CLS token is a special classification token of the text embedding 512 (T_emb), and the last hidden state of the BERT model corresponding to the CLS token is used for classification tasks. The textual query 504 includes an input text sequence T enclosed with a start of sentence (SOS) token and an end of sentence (EOS) token. Activations of a highest layer of a transformer at the CLS token is treated as a feature representation of the textual query 504, and the feature representation is layer normalized and then linearly projected into a multi-modal embedding space to match a dimension from an image encoder, i.e., to match a dimension of the augmented visual embedding 522 (I_emb) outputted by the visual image branch 510. The resulted text embedding 512 is noted as T_emb ∈ R^Demb , and D_emb is the dimension of the text embedding 512 (T_emb).

[0046] The electronic system generates an augmented visual embedding 522 (I_emb) from the visual image 502 using a visual encoder 530. The visual encoder 530 employs a multi-layer self-attention transformer to learn cross-modal contextualized representations based on a singular embedding of each modality. The visual encoder 530 includes a visual image encoder 510 (also called the visual image branch 510) and a visual term generator 508 (also called the virtual word branch 508). The visual encoder 530 receives the image 502 as an input and generates the augmented visual embedding 522 (I_emb) based on cross-modal fusion of image patches 518 and virtual words of the objects 514. Specifically, in some embodiments, the visual term generator 508 includes an object detection module 532, a text encoder 534, and a virtual word transformer encoder 536. The object detection module 532 is configured to receive the image 502 and detect the plurality of objects 514 in the image. Each object 514 is associated with an object tag. The text encoder 534 is configured to generate a plurality of object tag embeddings 538 (V_emb) each corresponding to a respective object 514 or object tag. The virtual word transformer encoder 536 encodes the plurality of object tag embeddings 538 to the plurality of virtual word embeddings 516 (V_att). In an example, the objection detection module 532 includes a Faster R-CNN. The object tags V are represented as {v₁, V₂, .... v_m} (e.g., V={‘car’, ‘building’, ‘tree’}) for m detected objects of the image 502, and used to generate the virtual word embeddings (516) via the text encoder 534 and virtual words transformer encoder 536.

[0047] Additionally, in some embodiments, the visual image encoder 510 includes a linear projection module 540, a fusion transformer 542, and a linear layer 544. The linear projection module 540 is configured to generate the plurality of feature vectors 520 for the plurality of non-overlapping image patches 518. The fusion transformer 542 fuses the plurality of feature vectors 520 to an intermediate visual embedding 546 based on the virtual word embeddings 516 (V_att ). The linear layer is configured to convert the intermediate visual embedding 546 to the augmented visual embedding 522 (I_emb). In an example, the visual image encoder 510 follows a protocol in Vision Transformer (ViT), and the image 502 is divided into N non-overlapping patches and has N+1 feature vectors 520. Specifically, when the image 502 has a resolution of H*W pixels and each patch has a resolution of Pxp, N is equal to Each non-overlapping patch 518 corresponds to a respective single feature vector 520AF, and the plurality of feature vectors 520 further includes an additional feature vector 520AD corresponding to an extra CLS path. In an example, the additional feature vector 520AD is an average of the respective single feature vector 520AF of each of the plurality of non-overlapping patches 518.

[0048] In some embodiments, the image database 560 includes an Android photo album, and the image retrieval process 500 is implemented to search the image database 560 to identify the image 502 in response to the textual query 504. The image retrieval process 500 utilizes visual-semantic feature corresponding to keywords, phrases, or sentences of the textual query 504. For example, the textual query 504 is “a happy and laughing dog”, the image retrieval process 500 accurately locates one or more images 502 capturing a happy and laughing dog. In contrast, some Android photo albums return no results because no photos are precisely labelled as “happy and laughing dog” in those photo albums.

[0049] In some embodiments, the electronic system includes a client device 104 (e.g., a mobile phone 104C, AR glasses 104D), and the client device 104 further includes a camera 260 configured to capture the image 502 or an input device configured to receive a user input of the textual query 504. In some embodiments, the client device 104 implements the image retrieval process 500 to retrieve the image 502 based on the textual query 504. Alternatively, another electronic device of the electronic system (e.g., a distinct client device 104 or a server 102) obtains the image 502 or textual query 504 from the client device 104, and implements the image retrieval process 500 to retrieve the image 502 based on the textual query 504.

[0050] Figure 6 is a block diagram of a visual image encoder 510 configured to generate an augmented visual embedding 522 (I_emb). from an image 502 based on virtual word embeddings 516 (V_att), in accordance with some embodiments. The visual image encoder 510 includes a linear projection module 540, a fusion transformer 542, and a linear layer 544. The linear projection module 540 is configured to generate a plurality of feature vectors 520 (I_patch) for the plurality of non-overlapping image patches 518, and the fusion transformer 542 fuses the plurality of feature vectors 520 (I_patch) to an intermediate visual embedding 546 (I_fuse) based on the virtual word embeddings 516 (V_att). The linear layer 544 is configured to convert the intermediate visual embedding 546 (I_fuse) to the augmented visual embedding 522 (l_emb).

[0051] In some embodiments, the linear projection module 540 includes a two- dimensional (2D) convolutional layer, and the plurality of feature vectors 520 (I_patch) are flattened, forming a sequence of patch embeddings I_patch ∈ R^{NxDpa tch}, where D_patch depends on a number of kernels in the 2D convolutional layer. Optionally, the sequence of patch embeddings Ipatch includes N feature factors 520AF corresponding to N non-overlapping image patches 518, respectively. Optionally, the sequence of patch embeddings I_patch includes an additional feature vector 520AD and N feature factors 520AF that correspond to N non- overlapping image patches 518, respectively.

[0052] In some embodiments, the fusion transformer 542 includes an image patch encoder 602, a bi-directional attention layer 604, and a cross-attention layer 606. The image patch encoder 602 is configured to apply a positional encoding term ∈ R^{NxDpa tch} to each feature vector 520 (also called input token), i.e., each patch embedding I_patch:

I_pos I_patch + E (1) where I_pos is a positional patch embedding (also called a position-adjusted feature vector). As such, patches 518 in the same spatial location are given the same positional encoding term E. The position-adjusted feature vectors I_pos are fed to the bi-directional attention layer 604, which is configured to cross-link two vectors as follows:

where S is a source vector, M is a target vector, and

denote linear transform matrices for query, key, value vector transformations, respectively. (W_qS)(W_kM)^T model the bi-directional relationship between the source and target vectors S and M, and √d is a normalization factor.

[0053] Specifically, the bi-directional attention layer 604 of the fusion transformer 542 cross-links the positional patch embeddings I_pos of different image patches 518 to generate self-attended feature vectors I_att as follows:

[0054] In some embodiments, referring to Figure 5, the visual term generator 508 includes an object detection module 532, a text encoder 534, and a virtual word transformer encoder 536. The object detection module 532 is configured to receive the image 502 and detect the plurality of obj ects 514 in the image. Each object 514 is associated with an object tag. The text encoder 534 is configured to generate a plurality of object tag embeddings 538 each corresponding to a respective object 514 or object tag. The virtual word transformer encoder 536 converts the plurality of object tag embeddings 538 to the plurality of virtual word embeddings 516 (V_att). The object tag embeddings 538 (V_emb) are fed to the virtual words transformer encoder 536 (which is a self-attention module) to generate a self-attended virtual word embedding 516 (V_att) as follows:

[0055] Referring to Figure 6, the virtual word embeddings 516 (V_att) are provided to the cross-attention layer 606 of the fusion transformer 542 to bridge and align the image 502 and textual query 504. Specifically, the cross-attention layer 606 is configured to fuse virtual words and image patches modalities, i.e., to fuse the self-attended virtual word embeddingV_att and the self-attended feature vectors I_att, and generate the intermediate visual embedding 546 (I_fuse) (also called a fused visual vector I_fuse) as follows:

In some embodiments, the plurality of feature vectors 520 includes an additional feature vector 520 AD corresponding to the extra CLS path, and the corresponding intermediate visual embedding I_fuse 546 includes a fused CLS token embedding . The fused CLS

token embedding is fed into the linear layer 544 to generate the augmented visual

embedding 522 (I_emb) as follows:

where I_emb ∈ R^Demb, and the augmented visual embedding 522 ( I_emb) has the same embedding dimension D_emb with text embedding 512 (T_emb) generated from the textual query 504.

[0056] Figure 7 is a flow diagram of a training process 700 in which a comprehensive visual-semantic model applied in Figures 5 and 6 are trained, in accordance with some embodiments The comprehensive visual semantic model includes one or more of: an object detection module 532, a text encoder 534, a virtual word transformer encoder 536, a linear projection module 540, a fusion transformer 542, a linear layer 544, and a text encoder 528. The fusion transformer 542 optionally includes an image patch encoder 602, a bi-directional attention layer 604, and a cross-attention layer 606. In some embodiments, each module of a first subset of the comprehensive visual-semantic model is pre-trained separately. A second subset is complemental to the first subset of the comprehensive visual-semantic model and trained (700) jointly using a training dataset. Alternatively, in some embodiments, the comprehensive visual-semantic model is trained (700) jointly end-to-end using a training dataset. Optionally, this joint training process 700 is implemented remotely at a server 102, and after training, the comprehensive visual-semantic model is provided to a client device 104 and used in the image retrieval process 500 by the client device 104. Optionally, this joint training process 700 is implemented at a client device 104. The client device 104 receives from a server 102 a plurality of training images 702 and a plurality of training textual queries 704. After training, the comprehensive visual-semantic model is used in the image retrieval process 500 locally by the client device 104.

[0057] In some embodiments, the training dataset includes a mini-batch of H pairs of video clip and textual query. The H matching pairs of video clip and textual query of the mini-batch include H video clips or images 702 and //textual queries 704. If randomly organized, the //video clips or images 702 and //textual queries 704 correspond to H x H possible video-text pairs. The comprehensive visual-semantic model is trained to predict which of the H x H possible video-text pairs associated with this mini-batch actually occurred, i.e., corresponds to the //matching pairs. The comprehensive visual-semantic model learns a visual-semantic embedding space by jointly training a visual encoder 530 and text encoder 528 to maximize a similarity level 524 (e.g., a cosine similarity) of the visual and text embeddings 722 (/_emb) and 712 (T_emb) of the H matching pairs in the mini-batch. The other unmatched pairs in the H x H possible video-text pairs, i.e., H(H-1) video-text pairs, are treated as negative examples.

[0058] In the virtual word branch 508, a plurality of training objects 714 are identified from a training image 702, and a plurality of training tag embeddings 738 and a plurality of training word embeddings 716 are successively generated for the training objects 714 in the training image 702. Each training object 714 is associated with a respective training word embedding 716. In the visual image branch 510, the training image 702 is divided to a plurality of non-overlapping training patches 718, and a plurality of training feature vectors 720 are generated for the plurality of non-overlapping training patches 718. Each training patch 718 corresponds to a respective training feature vector 720. The plurality of training feature vectors 720 and the plurality of training word embeddings 716 to generate an intermediate training visual embedding 746 that is further converted to an augmented training visual embedding 722 associated with the training image 702.

[0059] Specifically, a dot product between a normalized training visual embedding z (722) and a training text embedding t (712) is represented as follow:

A contrastive loss 780 for a matching video-text pair having embeddings (z, t) is defined as:

where T denotes a temperature parameter. The contrastive loss 780 is computed across all positive pairs, both (m, ri) and (n, w), in the mini-batch of H pairs of video clip 702 and textual query 704.

[0060] The comprehensive visual-semantic model applied in the image retrieval process 500 provides a high text-to-image retrieval accuracy than existing methods when Flickr30k benchmark dataset is used. Flickr30k is an image-caption dataset containing 31,783 images, with each image annotated with five sentences. The Flickr30k dataset is split into 29,783 training, 1000 validation, and 1000 test images according to a search task protocol. Performance of image-text retrieval is evaluated based on 1000 test sets. A recall rate (e.g., Recall@t, t = [1, 5, 10]) is used as accuracy evaluation metrics. Referring to Table 1 as shown below, the comprehensive visual-semantic model significantly outperforms existing methods that enforce alignment of the training image 702 and textual query 704. Application of virtual words as a bridge is proven to be able to ease learning of visual-semantic embedding and alignments, thereby boosting the text-to-image retrieval accuracy.

Table 1. Detailed comparisons of text-to-image retrieval results in Flickr30K dataset

[0061] Figure 8 is a flowchart of an image retrieval method implemented by an electronic system, in accordance with some embodiments. For convenience, the method 800 is described as being implemented by an electronic system (e.g., a client device 104, a server 102, or a combination thereof). In some embodiments, the client device 104 is a mobile phone 104C. Method 800 is, optionally, governed by instructions that are stored in a non- transitory computer readable storage medium and that are executed by one or more processors of the computer system. Each of the operations shown in Figure 8 may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., memory 206 of the computer system 200 in Figure 2). The computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other non-volatile memory device or devices. The instructions stored on the computer readable storage medium may include one or more of: source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors. Some operations in method 800 may be combined and/or the order of some operations may be changed.

[0062] The electronic system obtains (802 A) a textual query 504 and obtains (802B) an image 502. A plurality of virtual word embeddings 516 (V_att) are determined (804) for a plurality of objects 514 in the image 502. The electronic system divides (806) the image 502 to a plurality of non-overlapping image patches 518 and generates (808) a plurality of feature vectors 520 associated with the plurality of non-overlapping image patches 518. The plurality of feature vectors 520 and the plurality of virtual word embeddings 516 (V_att) are fused (810) to generate an augmented visual embedding 522 (I_emb) associated with the image 502. The electronic system generates (812) a text embedding 512 (T_emb) from the textual query 504. A similarity level 524 is generated (814) between the text embedding 512 (T_emb) and the augmented visual embedding 522 (V_att ), and the image 502 is retrieved (816) in response to the textual query 504 based on the similarity level 524.

[0063] In some embodiments, the image 502 includes a first image 502 A, and the similarity level 524 includes a first similarity level 524A between the first image 502A and the textual query 504. In accordance with a determination that the first similarity level 524A is greater than a plurality of second similarity levels 524B of a plurality of second images 502B, the electronic device identifies (818) the first image 502 A as an image search result to the textual query 504. Further, in some embodiments, the electronic device obtains the plurality of second images 502B. For each respective second image 502B, the plurality of second feature vectors 520B and a plurality of second virtual word embeddings 516B of the respective second image 502B are fused to generate a second augmented visual embedding 522B associated with the respective second image 502B. The plurality of second similarity levels 524B are generated from the text embedding 512 (T_emb) and the second augmented visual embedding 522B associated with each second image 502B.

[0064] In some embodiments, the electronic system detects (820) the plurality of objects 514 in the image 502, encodes (822) the plurality of objects 514 to a plurality of object tag embeddings 538, and encodes (824) the plurality of object tag embeddings 538 to the plurality of virtual word embeddings 516 (V_att). Further, in some embodiments, each of the plurality of objects 514 is associated with a respective confidence level for detecting the respective object 514 in the image 502, and the respective confidence level is greater than a predefined threshold confidence level.

[0065] In some embodiments, the electronic system generates (824) a text feature vector from the text query 504 and projects (826) the text feature vector into a multi-modal embedding space to match a dimension of the augmented visual embedding 522 (I_emb ).

[0066] In some embodiments, the plurality of non-overlapping image patches 518 includes a first number of image patches. The plurality of feature vectors 520 includes (826) a subset of first feature vectors 520AF and an additional feature vector 520AD. The subset of first feature vectors 520AF includes a second number of feature vectors, and the second number is equal to the first number. Each first feature vector 520AF corresponds to a respective distinct one of the plurality of non-overlapping image patches 518. The additional feature vector 520AD is (828) a combination of the subset of first feature vectors 520AF.

[0067] Referring to Figure 6, in some embodiments, each of the plurality of feature vectors 520 (I_patch) corresponds to a respective image patch 518 having a respective positional encoding term E. The electronic system generates a plurality of position-adjusted feature vectors I_pos by combining each of the plurality of feature vectors 520 (I_patch) with a respective positional encoding term E. Further, in some embodiments, a plurality of self-attended feature vectors I_att are generated from the plurality of position-adjusted feature vectors I_pos using a bi-directional attention layer 604. The plurality of self-attended feature vectors I_att and the plurality of virtual word embeddings V_att are cross-fused to provide an intermediate visual embedding 546 (I_fuse) using a cross-attention layer 606. Additionally, each of the plurality of self-attended feature vectors I_att is represented as equation (3), and the intermediate visual embedding 546 (I_fuse) is represented as equation (5).

[0068] In some embodiments, the text embedding 512 has a first dimension, and the augmented visual embedding 522 (I_emb) has a second dimension that is equal to the first dimension. The similarity level 524 represents a cosine similarity of the text embedding 512 (T_emb) and the augmented visual embedding 522 (I_emb).

[0069] In some embodiments, the electronic system includes an electronic device (e.g., a mobile phone 104C) configured to determine the similarity level 524 between the textual query 504 and the image 502 based on a comprehensive visual-semantic model. The electronic device receives the comprehensive visual-semantic model from a server 102. The comprehensive visual-semantic model is trained remotely at the server 102. Alternatively, in some embodiments, the electronic system includes an electronic device (e.g., a mobile phone 104C) configured to determine the similarity level 524 between the textual query 504 and the image 502 based on a comprehensive visual-semantic model. The electronic device receives, from a server, a plurality of training images 702 and a plurality of training textual queries 704, and trains the comprehensive visual-semantic model locally at the electronic device using the plurality of training images 702 and the plurality of training textual queries 704.

[0070] It should be understood that the particular order in which the operations in Figure 8 have been described are merely exemplary and are not intended to indicate that the described order is the only order in which the operations could be performed. One of ordinary skill in the art would recognize various ways to recognize a hand gesture as described herein. Additionally, it should be noted that details of other processes described above with respect to Figure 5 are also applicable in an analogous manner to method 800 described above with respect to Figure 8. For brevity, these details are not repeated here.

[0071] The terminology used in the description of the various described implementations herein is for the purpose of describing particular implementations only and is not intended to be limiting. As used in the description of the various described implementations and the appended claims, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term “and/or” as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms “includes,” “including,” “comprises,” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Additionally, it will be understood that, although the terms “first,” “second,” etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another.

[0072] As used herein, the term “if’ is, optionally, construed to mean “when” or “upon” or “in response to determining” or “in response to detecting” or “in accordance with a determination that,” depending on the context. Similarly, the phrase “if it is determined” or “if [a stated condition or event] is detected” is, optionally, constmed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event]” or “in accordance with a determination that [a stated condition or event] is detected,” depending on the context.

[0073] The foregoing description, for purpose of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the claims to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain principles of operation and practical applications, to thereby enable others skilled in the art.

[0074] Although various drawings illustrate a number of logical stages in a particular order, stages that are not order dependent may be reordered and other stages may be combined or broken out. While some reordering or other groupings are specifically mentioned, others will be obvious to those of ordinary skill in the art, so the ordering and groupings presented herein are not an exhaustive list of alternatives. Moreover, it should be recognized that the stages can be implemented in hardware, firmware, software or any combination thereof.

Claims

What is claimed is:

1. An image retrieval method, implemented by an electronic system, comprising: obtaining a textual query and an image; determining a plurality of virtual word embeddings associated with a plurality of objects in the image; dividing the image to a plurality of non-overlapping image patches; generating a plurality of feature vectors associated with the plurality of non- overlapping image patches; fusing the plurality of feature vectors and the plurality of virtual word embeddings to generate an augmented visual embedding associated with the image; generating a text embedding from the textual query; generating a similarity level between the text embedding and the augmented visual embedding; and retrieving the image in response to the textual query based on the similarity level.

2. The method of claim 1 , wherein the image includes a first image, and the similarity level includes a first similarity level between the first image and the textual query, further comprising: in accordance with a determination that the first similarity level is greater than a plurality of second similarity levels of a plurality of second images, identifying the first image as an image search result to the textual query.

3. The method of claim 2, further comprising: obtaining the plurality of second images; for each respective second image, fusing a plurality of second feature vectors and a plurality of second virtual word embeddings of the respective second image to generate a second augmented visual embedding associated with the respective second image; and generating the plurality of second similarity levels from the text embedding and the second augmented visual embedding associated with each second image.

4. The method of any of the preceding claims, wherein determining the plurality of virtual word embeddings associated with the plurality of objects in the image further comprises: detecting the plurality of objects in the image; and encoding the plurality of objects to a plurality of object tag embeddings; and encoding the plurality of object tag embeddings to the plurality of virtual word embeddings.

5. The method of claim 4, wherein each of the plurality of objects is associated with a respective confidence level for detecting the respective object in the image, and the respective confidence level is greater than a predefined threshold confidence level.

6. The method of any of the preceding claims, wherein generating the text embedding from the textual query further comprising: generating a text feature vector from the text query; projecting the text feature vector into a multi-modal embedding space to match a dimension of the augmented visual embedding.

7. The method of any of the preceding claims, wherein: the plurality of non-overlapping image patches includes a first number of image patches, and the plurality of feature vectors includes a subset of first feature vectors and an additional feature vector; the subset of first feature vectors includes a second number of feature vectors, the second number equal to the first number, each first feature vector corresponding to a respective distinct one of the plurality of non-overlapping image patches; and the additional feature vector is a combination of the subset of first feature vectors.

8. The method of any of the preceding claims, wherein each of the plurality of feature vectors corresponds to a respective image patch having a respective positional encoding term, and fusing the plurality of feature vectors and the plurality of virtual word embeddings further comprises: generating a plurality of position-adjusted feature vectors, including combining each of the plurality of feature vectors with the respective positional encoding term.

9. The method of claim 8, wherein fusing the plurality of feature vectors and the plurality of virtual word embeddings further comprises: generating a plurality of self-attended feature vectors from the plurality of position- adjusted feature vectors using a bi-directional attention layer; and cross-fusing the plurality of self-attended feature vectors and the plurality of virtual word embeddings using a cross-attention layer.

10. The method of claim 9, wherein each of the plurality of self-attended feature vectors (I_att) is represented as:

where I_pos is each of a plurality of position-adjusted feature vector, and W_q, W_k, and W_v denote linear transform matrices for query, key, value vector transformations, (W_ql_pos)(W_kI_pos)^T model a bi-directional relationship of the respective position-adjusted feature vector, and √d is a normalization factor.

11. The method of any of the preceding claims, wherein the text embedding has a first dimension, and the augmented visual embedding has a second dimension that is equal to the first dimension, and the similarity level represents a cosine similarity of the text embedding and the augmented visual embedding.

12. The method of any of claims 1-11, wherein the electronic system includes an electronic device configured to determine the similarity level between the textual query and the image based on a comprehensive visual-semantic model, further comprising: receiving by the electronic device the comprehensive visual-semantic model from a server, wherein the comprehensive visual-semantic model is trained remotely at the server.

13. The method of any of claims 1-11, wherein the electronic system includes an electronic device configured to determine the similarity level between the textual query and the image based on a comprehensive visual-semantic model, further comprising: receiving, from a server, a plurality of training images and a plurality of training textual queries; and training the comprehensive visual-semantic model locally at the electronic device using the plurality of training images and the plurality of training textual queries.

14. An electronic system, comprising: one or more processors; and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform a method of any of claims 1-13.

15. A non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the one or more processors to perform a method of any of claims 1-13.