WO2022261570A1 - Cross-attention system and method for fast video-text retrieval task with image clip - Google Patents

Cross-attention system and method for fast video-text retrieval task with image clip Download PDF

Info

Publication number
WO2022261570A1
Authority
WO
WIPO (PCT)
Prior art keywords
cross
textual
attention
visual
branch
Prior art date
Application number
PCT/US2022/039442
Other languages
French (fr)
Inventor
Yikang Li
Jenhao Hsiao
Chiuman HO
Original Assignee
Innopeak Technology, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Innopeak Technology, Inc. filed Critical Innopeak Technology, Inc.
Publication of WO2022261570A1 publication Critical patent/WO2022261570A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content

Definitions

  • the present application generally relates to video search and retrieval, particularly to methods and a system for improving relevancy of videos retrieved given textual queries.
  • FIG. 1 depicts an example computing system, such as a mobile computing device, implementing a dual-encoder cross-attention system for improved video-text retrieval, in accordance with embodiments of the application.
  • FIG. 2 depicts an example architecture for the dual-encoder cross-attention system shown in FIG. 1 for improved video-text retrieval, in accordance with embodiments of the application.
  • FIG. 3 is an operational flow diagram illustrating an example method for implementing the building, training, and optimization of a machine learning/artificial intelligence (ML/AI) model using the disclosed cross-attention dual-encoder techniques, in accordance with embodiments of the application.
  • FIG. 4 is a block diagram of an example computing component or device for implementing the disclosed techniques, in accordance with embodiments of the application.
  • Video-text retrieval is an essential task in cross-modal information retrieval, i.e., retrieving relevant videos from a large and unlabeled dataset given textual queries.
  • the semantic space that supports video-text retrieval has been directly composed without any prior knowledge which can cause inefficiencies and limitations in the overall video-text retrieval process.
  • utilizing these existing approaches, which rely substantially on image features only, may result in sub-optimal video-text search accuracy, since the information among the different modalities is not fully exchanged and aligned.
  • a novel cross-attention dual-encoder method and system are provided to address the challenging video-text retrieval problems related to the aforementioned existing approaches, such as insufficient video-text datasets, inaccurate retrieval, and slower retrieval speeds.
  • the disclosed embodiments include a highly efficient cross-attention dual-encoder method and system that facilitates the information exchange between multiple modalities (i.e., video and text). Consequently, video-text retrieval using the disclosed cross-attention dual-encoder can outperform existing state-of-the-art methods, achieving retrieval speeds that are much faster, and accuracy that is greater, than traditional query-agnostic search models.
  • the cross-attention dual-encoder techniques and systems realize several advantages, including: generating a well-defined semantic space extended from the image-text domain to the video-text domain, in a manner that boosts the performance of video-text retrieval tasks; achieving linearity in both computation and memory that is able to better leverage features of different modalities; and implementing a query-agnostic search engine that is able to scale by pre-computing a video data index, which improves the efficiency of video-text retrieval tasks.
  • the mobile computing device 110 can be a user equipment (UE) device, being implemented as any type of wireless user device that is used directly by an end-user to communicate and having video-text retrieval capabilities, which are performed by the video text retriever 111 implemented on the mobile computing device 110.
  • the mobile computing device 110 is shown as a handheld mobile phone, and more specifically a smartphone.
  • the mobile computing device 110 may be implemented as various other wireless user devices that are used directly by an end-user to communicate and equipped with telecommunication functions, such as voice, video, and text.
  • the mobile computing device 110 may also be implemented as a cellular telephone, a laptop computer equipped with a mobile broadband adapter, or other computing device. Accordingly, as a smartphone, the mobile computing device 110 is capable of supporting enhanced data services, voice, video, and other telecommunication functions that are commonly employed by subscribers to broadband cellular networks.
  • the mobile computing device 110 is depicted to include a video text retriever 111 and a cross-attention dual-encoder 112 that implement the disclosed techniques for supporting video-text retrieval tasks.
  • the mobile computing device 110 can include other applications, computing sub-systems, and hardware.
  • the mobile computing device 110 can include an operating system that provides an interface between the mobile computing device's 110 hardware (e.g., the input/output mechanisms and a processor executing instructions retrieved from computer-readable medium) and software.
  • Example operating systems include ANDROID, CHROME, IOS, MAC OS X, WINDOWS 7, WINDOWS PHONE 7, SYMBIAN, BLACKBERRY, WEBOS, a variety of UNIX operating systems, or a proprietary operating system for computerized devices.
  • the operating system may provide a platform for the execution of application programs that facilitate interaction between the computing device and a user.
  • the video-text retriever 111 and cross-attention dual-encoder 112, as disclosed herein, can be implemented on the mobile computing device 110 as hardware, a stand-alone processor, firmware, a software application, or any combination thereof.
  • the video-text retriever 111 and the cross-attention dual encoder 112 operate in concert to implement various video-text retrieval features on the mobile computing device 110.
  • the cross-attention dual-encoder 112 is configured to perform a distinct multi-modal training of a machine learning/artificial intelligence (ML/AI) model, where the model is trained to determine relevant video, text pairs.
  • the video-text retriever 111 is configured to execute various video-text retrieval tasks on the mobile computing device 110.
  • a video-text retrieval task performed by the video-text retriever 111 can include automatically retrieving a list of videos (from a corpus/library including vast amounts of videos) that are deemed most relevant given a text query.
  • the ML/AI model that is generated and/or trained by the cross-attention dual-encoder 112 to accurately determine relevancy between text and video can be employed by the video-text retriever 111 to drive the relevancy-based search for video from a large corpus of text and videos.
  • the model trained by the cross-attention dual-encoder 112 to automatically recognize relevancy links between text and video can be used by the video-text retriever 111 in order to automatically select videos that are relevant to the query-specific text.
  • the cross-attention dual-encoder 112 is distinctly configured to perform a multi-modal training for a ML/AI model that is guided by both video and text modalities simultaneously for improved speed and accuracy, as described in greater detail in reference to FIG. 2, for example.
  • AI can be described as an automated computer process that can intelligently leverage data analysis to train itself and further optimize its processes.
  • ML can be generally considered an application of AI.
  • AI techniques can include various approaches that are used in the field to achieve automated data analysis, such as neural networks, automated reasoning analysis (e.g., satisfiability modulo theories), and so on.
  • AI-based techniques can be used to enhance computer-controlled features of a mobile computing device 110 in a manner that improves the overall user experience and optimizes performance of applications and/or the operating environment.
  • AI/ML techniques are specifically used to drive visual language-learning modeling and video-text retrieval tasks, as disclosed.
  • the mobile computing device 110 can have a library of videos stored in its memory that were captured by the device's 110 user/owner using its video recording functions (e.g., built-in video camera).
  • frames from a clip of video 113 are depicted as being stored on, or otherwise accessible by, the mobile computing device 110.
  • the four frames of video 113 illustrated in the example are related to the sport of basketball, including images that show basketball players, a court, a goal, and the like.
  • the text 114 includes words, keywords, phrases, etc. that are generally related to basketball and describe the imagery that is portrayed in the frames of video 113.
  • the text 114 includes captions, or descriptors, that correspond to the contents of the frames of video 113.
  • the text 114 that accompanies the frames of video 113 includes phrases, or captions, such as: "a player is putting the basketball into the post from a distance"; "the player makes a three pointer"; and "people are playing basketball."
  • the user/owner of the mobile computing device 110 may desire to search through and retrieve one or more of the stored videos, including video 113, using a searching function of the computing device 110, shown as video-text retriever 111.
  • the video 113 is not necessarily stored on the mobile computing device 110 itself but may instead be stored on distributed and large-scale remote databases/repositories of information that are accessible to the mobile computing device 110 via a communication network, such as the Internet.
  • the video-text retriever 111 functions similarly to a search engine, where text entered into a graphical user interface (GUI) of the video-text retriever 111 drives searches for content that is available on the Internet and sites on the World Wide Web.
  • the video-text retriever 111 is configured to utilize text as input which serves as the basis for a query to retrieve a selected one or more relevant videos from a larger group consisting of vast amounts of videos, including videos 113.
  • videos can correspond to descriptive text, such as keywords, phrases, descriptors, and the like, which describe the contents and/or context of the video.
  • the user/owner of the mobile computing device 110 can enter text (e.g., keywords, phrases, search string, etc.) into a GUI of the video-text retriever 111 in order to ultimately search through a plurality of videos, such as videos 113, in order to retrieve one or more videos that are deemed most relevant to the text input.
  • the video-text retriever 111 can be described as a multi-modal application (e.g., combining visual and text information). Furthermore, as previously described, the video-text retriever 111 employs an ML/AI model, such as a visual-language learning model, to execute its video-text retrieving tasks. Because the video-text retriever 111 executes cross-modal tasks in the image-text domain, the ML/AI model leveraged by the video-text retriever 111 is trained using a cross-modal approach, referred to herein as cross-attention, that is implemented by the disclosed cross-attention dual-encoder 112. Thus, the video-text retriever 111 is capable of performing enhanced video-text retrieval tasks, by directly leveraging the cross-attention training of the cross-attention dual-encoder 112.
  • the video-text retriever 111 leverages the capabilities of the cross-attention dual-encoder 112, which performs a cross-attention learning approach that is guided by both video and text modalities simultaneously, in order to address the aforementioned drawbacks. Consequently, the video-text retriever 111 and cross-attention dual-encoder 112 work together to achieve better retrieval accuracy and efficiency. For instance, with conventional approaches, a weakly constructed inference between text and video can cause a query using the text "people shooting a three pointer," which is a slight variation of the text 114, to fail to retrieve the corresponding video.
  • due to a finer fusion mechanism that allows multiple modalities (i.e., video and text) to better exchange information based on cross-attention, the cross-attention dual-encoder 112 realizes improved accuracy, while also maintaining efficiency in the inference stage because of the use of a query-agnostic search architecture.
  • by building a strong inference between text and video in the ML/AI model, the cross-attention dual-encoder 112 allows a query using the text "people shooting a three pointer" in the video-text retriever 111 to successfully retrieve a corresponding video, such as video 113.
  • the cross-attention dual-encoder 112 can extend the image-text semantic space created by existing image-text pre-training models, such as Contrastive Language- Image Pre-Training (CLIP), into a more complex video-text space.
  • the cross attention dual-encoder 112 is configured to perform a cross-attention technique, described in greater detail in reference to FIG. 2, that uses a non-patch token as an agent to interchange information between branches by attention. Consequently, the cross-attention dual-encoder 112 is linear in both computation and memory and is able to better leverage features at different modalities.
  • the cross-attention dual-encoder 112 implements a query-agnostic search model (e.g., as opposed to a query-dependent model that has to re-compute for every query) that is able to scale by pre-computing a video data index, which makes the system extremely efficient in real-world applications.
  • the cross-attention dual-encoder 112 can train an ML/AI model using a dataset of videos and text, such as a large library of video clips and related captions.
  • the dataset can include the videos 113 and text 114 depicted in FIG. 1, where the text 114 is captions that describe the contents and/or context of the corresponding clips of video 113.
  • the ML/AI model training that is performed by the cross-attention dual-encoder 112 is a distinct approach that involves, but is not limited to: 1) extracting textual features and visual features from the video and text in the dataset; 2) interchanging tokens from each feature to the opposing feature (e.g., fuse visual token to textual features, and fuse textual tokens to visual features) in a manner that considers both the textual and visual modalities mutually in the inference process; and 3) applying loss functions to the model to reduce the loss on the next evaluation. Details regarding the architecture and training process executed by the cross-attention dual-encoder 112 are described in greater detail below in reference to FIG. 2.
  • the resulting output from the cross-attention dual encoder 112 is a ML/AI model that is trained for visual-language learning using cross-attention techniques.
  • This ML/AI model accurately and efficiently learns a mapping between video and text to predict the closest relevant (video, text) pair outputs, which can be leveraged by the video-text retriever 111 for executing its video-text retrieving tasks.
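  • To make the query-agnostic retrieval described above concrete, the following is a minimal sketch (in Python/PyTorch, with hypothetical names; not the claimed implementation) of ranking pre-computed video embeddings against a single text-query embedding by cosine similarity, so only the query needs to be encoded at search time:

        import torch
        import torch.nn.functional as F

        def retrieve_top_k(video_index: torch.Tensor, query_embedding: torch.Tensor, k: int = 5):
            # video_index: (num_videos, dim) embeddings pre-computed once, offline
            # query_embedding: (dim,) embedding of the text query, computed at query time
            video_index = F.normalize(video_index, dim=-1)
            query_embedding = F.normalize(query_embedding, dim=0)
            scores = video_index @ query_embedding      # cosine similarity per video
            return torch.topk(scores, k).indices        # indices of the k most relevant videos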
  • FIG. 2 depicts an example configuration of the cross-attention dual-encoder 200 that is described above in reference to FIG. 1 for enhancing video-text retrieval tasks.
  • the cross-attention dual-encoder 200 executes an autonomous self-training process for ML/AI model(s), where the model(s) trains on a variety of text and image data in a manner that adapts a well-defined image-text semantic space to the 3D video space. Accordingly, an ML/AI model trained using the cross-attention dual-encoder 200 can learn (video, text) pair similarities and classifications in order to accurately predict the most relevant (video, text) pair from an image-text semantic space.
  • the framework of the cross-attention dual-encoder 200 can comprise three main elements, including: 1) a feature extraction module 210, which extracts visual and textual features respectively; 2) a dual-encoder transformer module 220, which utilizes respective textual branch, visual branch, and cross-attention branch algorithms for inference training; and 3) a loss functions module 230, which utilizes symmetric cross-entropy and cross-entropy loss functions to optimize the ML/AI model with respect to error/loss estimation.
  • the cross-attention dual-encoder 200 is implemented as a computer processor device, for example, a microcomputer that includes one or more processing units (e.g., microprocessors), memory storage (e.g., RAM, ROM, etc.), and I/O devices.
  • the processing units of the cross-attention dual-encoder 200 can execute instructions stored in memory to control one or more electrical systems or subsystems of the computer processor device.
  • the aforementioned feature extraction module 210, dual-encoder transformer module 220, loss functions module 230 and other elements comprised thereof can be implemented as hardware, firmware, software, or any combination thereof.
  • the feature extraction module 210, dual-encoder transformer module 220, and loss functions module 230 can be implemented as components integrated together on a single computer processor device of the cross-attention dual encoder 200, or as separate stand-alone computer processor devices functioning together.
  • the feature extraction module 210 is configured to extract frame features, for example features associated with frames of video and text, to serve as inputs into the dual-encoder transformer module 220, where the distinct cross-attention techniques are executed. Additionally, the feature extraction module 210 can comprise a pre-trained image-text encoder, shown in FIG. 2 as CLIP 214, which enables the module 210 to take advantage of projecting the images and the corresponding text (e.g., captions) into the same semantic space. Further, the cross-attention dual-encoder 200 is particularly designed to combine cross-attention techniques with a dual-encoder architecture, which is illustrated in FIG. 2 within the dual-encoder transformer module 220.
  • the cross-attention dual-encoder 200 includes loss functions module 230 which implements two types of loss functions, namely the symmetric cross-entropy and the cross-entropy loss, for maintaining the (video, text) pair similarities and learning the classification-like task respectively.
  • the cross-attention dual-encoder 200 architecture includes a feature extraction module 210.
  • FIG. 2 illustrates the feature extraction module 210 initially receiving input 211 in the form of video 212 and text 213.
  • the video 212 and text 213 can be a part of a larger dataset that includes a wide variety of video images and text to train the ML/AI model.
  • video 212 can be multiple different video clips, where each video clip comprises a plurality of frames of video.
  • the feature extraction module 210 receives the video 212, specifically frames of video, in a manner that allows features associated with the visual information conveyed in the video 212 to be extracted and further analyzed as training data.
  • the feature extraction module 210 also extracts features associated with the text 213.
  • the text 213 that is input into the feature extraction module 210 can include words, keywords, phrases, and the like, which correspond to one or more of the frames of the video 212.
  • FIG. 2 illustrates that the input 211, comprising video 212 and text 213, are specifically received by the CLIP model 214.
  • CLIP is a neural network that is pre-trained on a large set of (image, text) pairs. CLIP can be instructed in natural language to predict the most relevant text snippet, given an image, without directly optimizing for the task. Thus, because CLIP 214 has been pre-trained to learn visual concepts from natural language supervision, CLIP 214 is already equipped to extract features from the (image, text) pairs it has received as input 211, namely video 212 and the text 213 (e.g., frames of video and their corresponding captions).
  • the CLIP 214 is constructed with a dual-encoder architecture. That is, the CLIP 214 encoder extracts both visual features 215 and textual features 216 from the input 211, respectively.
  • CLIP 214 extracts visual features 215 and textual features 216 from each frame of video that it receives as input. These visual features 215 and textual features 216, respectively, are represented by all of the outputs of the highest layer of the transformer.
  • the visual features 215 and textual features 216 that are extracted for each frame of video can be represented mathematically as: I^n_{cls,1,...,49} = G_v(Frame_n)  (1), where G_v is the CLIP model visual encoder and I^n denotes the token features extracted from the n-th frame.
  • each frame feature is (50, 512)
  • 50 equals the number of image tokens, which is 49 plus the [CLS] token.
  • [CLS] is a special classification token that can be used to serve as a representation of the entire image.
  • the final visual features 215 and textual features 216 can be represented mathematically as: I_{cls,1,...,49} = Mean(G_v(I_1), ..., G_v(I_N))  (2), i.e., the mean of the per-frame features over the N sampled frames, with the textual features obtained analogously from the CLIP model textual encoder.
  • the output dimension of each text feature is (77, 512)
  • 77 is the fixed length of the input sentence
  • 512 is the embedding dimension of each textual token.
  • the dimension of the video and the text input for each batch will be (B, 50, 512) and (B, 77, 512) respectively, where B is the batch size.
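  • As an illustration of the feature shapes described above, the following minimal sketch uses placeholder tensors standing in for the CLIP encoder outputs (the tensor names, and the choice of N sampled frames per clip, are illustrative assumptions rather than part of the disclosure):

        import torch

        B, N = 8, 12                                   # batch of 8 clips, 12 sampled frames per clip (assumed)
        frame_tokens = torch.randn(B, N, 50, 512)      # per-frame CLIP visual tokens: [CLS] + 49 patch tokens
        text_tokens = torch.randn(B, 77, 512)          # per-caption CLIP textual tokens (fixed length 77)

        video_tokens = frame_tokens.mean(dim=1)        # mean over the N frames, as in Eq. 2 -> (B, 50, 512)
        flat_tokens = frame_tokens.reshape(B * N, 50, 512)  # the flattening of (B, N, 50, 512) described below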
  • FIG. 2 illustrates that the cross-attention dual-encoder 200 architecture includes the dual-encoder transformer module 220.
  • the dual-encoder transformer module 220 is distinctly configured to perform cross-attention techniques that build and/or train three branches of models within the architecture of the transformer module 220.
  • FIG. 2 depicts that separate modalities of information can be analyzed independently (e.g., without "fused" information) within the dual-encoder transformer module 220, where each modality (e.g., visual, textual) traverses a separate branch of inference training, respectively, for a dual-encoder model.
  • the textual features are analyzed along a textual branch 221 (indicated in FIG. 2 at the bottom of the dual-encoder transformer module 220), and the visual features are analyzed along a visual branch 222 (at the top).
  • the flow of data corresponding to the third cross-attention model branch 250 is represented in FIG. 2 with dashed lines.
  • the flow for cross-attention aspects of the dual-encoder transformer module 220 which are query-dependent, are illustrated along the cross attention model branch 250 in the center of the dual-encoder transformer module 220.
  • the dual-encoder flow which does not require any cross-attention and/or fusing of modalities, is represented in FIG. 2 with solid lines along the textual branch 221 and the visual branch 222 at the bottom and top of the dual-encoder transformer module 220, respectively.
  • the cross-attention [CLS] tokens 248, 249, which contain the cross-attention information output from the inference training of the cross-attention model, namely the cross-attention model branch 250 of the dual-encoder transformer module 220, will be fed into the Cross-Entropy (CE) loss function 232 of the loss functions module 230, while the [CLS] tokens 233, 234, which are output directly from the inference training of the dual-encoder model, namely the textual branch 221 and visual branch 222 of the dual-encoder transformer module 220, are sent to the Symmetric Cross-Entropy (Sym-CE) loss function 231 to be computed.
  • the cross-attention model branch 250 of the dual-encoder transformer module 220 illustrates that each of the two encoders (e.g., visual, textual) within the dual-encoder transformer module 220 are not entirely isolated from each other in the same manner as some traditional approaches.
  • the disclosed dual-encoder transformer module 220 is uniquely designed to implement a cross-attention functionality during inference that fuses a [CLS] token from each encoder with the patch tokens from the other branch (handling the opposite modality).
  • FIG. 2 shows that the visual features 215 and textual features 216 that are extracted and output from the feature extraction module 210 are fed into the dual-encoder transformer module 220.
  • the dual-encoder transformer module 220 can split these visual features 215 and textual features 216 into a sequence of fixed-size non-overlapping patches, which are then linearly embedded. Specifically, the visual features 215 are illustrated as being split into a sequence of visual features 226, and the textual features 216 are illustrated as being split into a sequence of textual features 225.
  • a [CLS] token, for each respective modality, is added to serve as a representation of the entire image/text. As illustrated in FIG. 2,
  • a visual [CLS] token 224 is added to the sequence of visual features 226, and a textual [CLS] token 223 is added to the sequence of textual features 225.
  • FIG. 2 also shows that the dual-encoder transformer module 220 maps, using a linear projection function, the visual [CLS] token 224 from the visual branch 222 (indicated in FIG. 2 by curved pattern filled arrow) to textual features 225, which creates newly formed textual features 227 having mutual information from both modalities.
  • FIG. 2 shows that the dual-encoder transformer module 220 maps, using a linear projection function, the textual [CLS] token 223 from the textual branch 221 (indicated in FIG. 2 by curved solid filled arrow) to visual features 226, which creates newly formed visual features 228 having mutual information from both modalities.
  • the [CLS] tokens 223,224 will be projected into the other branch space by a linear projection layer before the token fusion is performed.
  • the cross-attention techniques of the dual-encoder transformer module 220 uses the [CLS] tokens 223, 224 as agents to interchange information from the respective visual branch 222 and textual branch 221, and fuse the modalities together during inference, by attention.
  • the input video features are converted into 3 dimensions, since the video feature contains 4 dimensions after the feature extraction (e.g., performed by the feature extraction module 210).
  • a dimension flattening along the first and second axis of the video features represented in Eq.2 and Eq.3 is executed, which is equal to converting the (B, N, 50, 512) to (B * N, 50, 512).
  • the process for [CLS] token fusion can be represented mathematically as:
    X'_t = [f_{v→t}(x^v_cls), x^t_1, ..., x^t_76]  (4)
    X'_v = [f_{t→v}(x^t_cls), x^v_1, ..., x^v_49]  (5)
  • where X'_t is the newly formed textual features, X'_v is the newly formed visual features, f_{v→t} is the linear projection function that maps the [CLS] token from the visual branch to the textual branch, and f_{t→v} is the linear projection function that maps the [CLS] token from the textual branch to the visual branch.
  • the projection and fusion of mutual information (e.g., visual [CLS] token 224 from the visual branch to textual features 225, and textual [CLS] token 223 from the textual branch to the visual features 226) that is represented by eq. 4 and eq. 5 above can be executed K times, where the K equals the depth of the transformer's 220 architecture.
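  • A minimal sketch of this fusion step (PyTorch; the layer and function names are hypothetical), assuming the projected [CLS] token takes the place of the other branch's [CLS] slot and is concatenated with that branch's remaining tokens, per the description above:

        import torch
        import torch.nn as nn

        f_v2t = nn.Linear(512, 512)   # maps the visual [CLS] token into the textual branch space
        f_t2v = nn.Linear(512, 512)   # maps the textual [CLS] token into the visual branch space

        def fuse_cls(visual_tokens, textual_tokens):
            # visual_tokens: (B, 50, 512) = visual [CLS] + 49 patch tokens
            # textual_tokens: (B, 77, 512) = textual [CLS] + 76 word tokens
            v_cls, v_patches = visual_tokens[:, :1], visual_tokens[:, 1:]
            t_cls, t_words = textual_tokens[:, :1], textual_tokens[:, 1:]
            new_textual = torch.cat([f_v2t(v_cls), t_words], dim=1)    # Eq. 4: visual [CLS] fused into text
            new_visual = torch.cat([f_t2v(t_cls), v_patches], dim=1)   # Eq. 5: textual [CLS] fused into video
            return new_visual, new_textual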
  • the newly formed visual features 228 and the newly formed textual features 227 are then fed into dual (e.g., two) transformer encoders 241, 242 of the dual-encoder transformer module 220 separately in order to obtain mutual information.
  • a single encoder, within the dual-encoder transformer module 220, has a configuration that is based on a Multi-Head Self-Attention (MHSA) module.
  • the dual-encoder transformer module 220 is depicted as including an MHSA transformer encoder 241 in the textual branch 221 and an MHSA transformer encoder 242 in the visual branch 222.
  • the transformer encoders 241, 242 are configured within the dual-encoder transformer module 220 as a stack of several layers of MHSA modules with Layer Normalization (LN) and a residual shortcut. The output of each layer of the transformer encoders 241, 242, shown in FIG. 2 as the cross-encoded textual [CLS] token 245 and the cross-encoded visual [CLS] token 244 respectively, can be represented mathematically as:
    y^v_cls = Encoder_v(X'_v)  (6)
    y^t_cls = Encoder_t(X'_t)  (7)
  • where y^v_cls is the cross-encoded visual [CLS] token, and y^t_cls is the cross-encoded textual [CLS] token.
  • a cross-encoded visual [CLS] token 244 is shown as output from the MHSA transformer encoder 242 associated with the visual branch 222
  • a cross-encoded textual [CLS] token 245 is shown as output from the MHSA transformer encoder 241 associated with the textual branch 221.
  • the newly cross-encoded [CLS] tokens 244, 245 are then back-projected into the original space and concatenated with the original features to form new features using cross-attention (also referred to as features with cross-attention).
  • This is illustrated in FIG. 2 as the cross-encoded visual [CLS] token 244 being interchanged (indicated in FIG. 2 by a curved pattern-filled arrow) into the space with textual features 225, and the cross-encoded textual [CLS] token 245 being interchanged (indicated in FIG. 2 by a curved solid-filled arrow) into the space with visual features 226.
  • FIG. 2 particularly illustrates that projecting cross-encoded textual [CLS] token 245 onto visual features 226 forms new cross-attention visual features 246 and projecting cross-encoded visual [CLS] token 244 onto textual features 225 forms new cross-attention textual features 247.
  • performing cross-attention of the encoded [CLS] tokens 244, 245 in order to generate these newly formed cross-attention representations can repeat K times, where K is the depth of the transformer architecture. Restated, the steps represented by Eq. 4 onward, together with the transformer functions associated with the MHSA transformer encoders 241, 242, are performed K times to obtain the new cross-attention features 246, 247.
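  • One cycle of this fuse, encode, and back-project procedure might look like the following minimal sketch (PyTorch; nn.TransformerEncoderLayer is used as a stand-in for the MHSA/LN/residual stack, and the back-projection layers g_t2v and g_v2t are hypothetical names). It builds on the fuse_cls sketch above:

        import torch
        import torch.nn as nn

        encoder_v = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)  # visual-branch encoder stand-in
        encoder_t = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)  # textual-branch encoder stand-in
        g_t2v = nn.Linear(512, 512)  # back-projects the cross-encoded textual [CLS] into the visual space
        g_v2t = nn.Linear(512, 512)  # back-projects the cross-encoded visual [CLS] into the textual space

        def cross_attention_step(visual_tokens, textual_tokens, fused_visual, fused_textual):
            # fused_visual / fused_textual: the newly formed features from the [CLS] fusion step (Eq. 4/5)
            y_v_cls = encoder_v(fused_visual)[:, :1]   # cross-encoded visual [CLS] token (Eq. 6)
            y_t_cls = encoder_t(fused_textual)[:, :1]  # cross-encoded textual [CLS] token (Eq. 7)
            # back-project each cross-encoded [CLS] and concatenate with the other branch's original tokens
            xattn_visual = torch.cat([g_t2v(y_t_cls), visual_tokens[:, 1:]], dim=1)    # cross-attention visual features
            xattn_textual = torch.cat([g_v2t(y_v_cls), textual_tokens[:, 1:]], dim=1)  # cross-attention textual features
            return xattn_visual, xattn_textual         # this cycle repeats K times (K = transformer depth)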
  • FIG. 2 shows that [CLS] tokens corresponding to the new cross-attention features 246, 247 can be generated by the dual-encoder transformer module 220.
  • the cross-attention visual features 246 have a corresponding cross-attention visual [CLS] token 248, and the cross-attention textual features 247 have a corresponding cross-attention textual [CLS] token 249.
  • the cross-attention dual-encoder 200 is configured to utilize these cross-attention [CLS] tokens from the new cross-attention features 246, 247 for further training and testing of the ML/AI model.
  • the cross-attention visual [CLS] token 248 and the cross-attention textual [CLS] token 249 are depicted as the output from the dual-encoder transformer module 220 that are fed to the loss functions module 230.
  • ML/AI models that are built and/or trained by the disclosed cross-attention dual-encoder 200 can be query-agnostic during the inference process. Therefore, in order to achieve these query-agnostic aspects, in addition to leveraging the cross-attention techniques, the cross-attention model (associated with the cross-attention model branch 250) is also trained simultaneously by the cross-attention dual-encoder 200 without fusing or inter-changing the [CLS] tokens. For example, the computations associated with eq. 4 and eq. 5 are performed without mutual information.
  • this training stage can be represented by executing the aforementioned computations in Eq. 4 onward without fusing, i.e., without interchanging the [CLS] tokens between the branches.
  • the "no fusing" training along the textual branch 221 and the visual branch 222 of the dual-encoder transformer module 220 can be regarded as an independent dual-encoder model for the inference procedure. This process is equivalent to sharing the weights of the cross-attention model with the dual-encoder training strategy.
  • that is, the query-agnostic branches associated with dual-encoder training, namely the textual branch 221 and the visual branch 222, are trained simultaneously with the query-dependent branch, namely the cross-attention model branch 250.
  • This strategy of simultaneously utilizing the cross-attention model branch 250 (e.g., query-dependent) and the separate dual-encoder model branches 221, 222 (e.g., query-agnostic) can realize several advantages, such as improved overall performance and improved speed of video-text retrieval tasks.
  • FIG. 2 also illustrates that the cross-attention dual-encoder 200 architecture includes loss functions module 230.
  • the loss functions module 230 comprises two forms of loss functions to train and optimize the ML/AI model(s), which are the Sym-CE loss function 231 and the CE loss function 232.
  • the Sym-CE loss function 231 can be used for contrastive learning-based methods.
  • the output from the dual-encoder transformer module 220 that is associated with the dual-encoder model pipelines, namely the output from the textual branch 221 and the visual branch 222, is fed to the Sym-CE loss function 231.
  • the visual [CLS] token 234 that is output from the visual branch 222 of the dual-encoder transformer module 220 and the textual [CLS] token 233 that is output from the textual branch 221 of the dual-encoder transformer module 220 are sent to the Sym-CE loss function 231.
  • the (video, text) pair similarity matrix Sym-CE loss, which is implemented by the Sym-CE loss function 231 for example, can be represented mathematically as:
    L_v2t = -(1/M) Σ_i log( exp(S_ii / t) / Σ_j exp(S_ij / t) )
    L_t2v = -(1/M) Σ_i log( exp(S_ii / t) / Σ_j exp(S_ji / t) )
    L_Sym-CE = (L_v2t + L_t2v) / 2  (12)
  • where M is the batch size, t is the temperature hyper-parameter, and S is the similarity matrix of (video, text) pairs.
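  • The following is a minimal PyTorch sketch of such a symmetric cross-entropy (contrastive) loss over a batch similarity matrix; the function and argument names, and the default temperature value, are illustrative assumptions:

        import torch
        import torch.nn.functional as F

        def symmetric_ce_loss(video_cls, text_cls, temperature=0.05):
            # video_cls, text_cls: (M, dim) [CLS] embeddings from the visual and textual branches
            video_cls = F.normalize(video_cls, dim=-1)
            text_cls = F.normalize(text_cls, dim=-1)
            sim = video_cls @ text_cls.t() / temperature           # (M, M) similarity matrix S
            labels = torch.arange(sim.size(0), device=sim.device)  # matched pairs lie on the diagonal
            loss_v2t = F.cross_entropy(sim, labels)                # video-to-text direction
            loss_t2v = F.cross_entropy(sim.t(), labels)            # text-to-video direction
            return (loss_v2t + loss_t2v) / 2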
  • FIG. 2 also shows that the CE loss function 232 receives output from the dual encoder transformer module 220 that is associated with the cross-attention model pipeline, namely the output from cross-attention model branch 250.
  • the cross-attention [CLS] tokens 248, 249 that are particularly output by the cross-attention model branch 250 of the dual-encoder transformer module 220 are fed to the CE loss function 232.
  • the aim of a video text retrieval task is to find the closest (video, text) pair, where the (video, text) pair can be classified into binary categories as "best pair (1)" or "not (0)". Consequently, a video-text retrieval task can be converted into a classification task.
  • pseudo binary labels (0, 1) for (video, text) pairs in each batch can be generated to represent "paired" or "not".
  • the "not” paired (video, text) can be randomly selected from the dataset. Consequently, the CE loss, which is implemented by the CL loss function 232, can be represented mathematically as:
  • the total loss associated with the ML/AI model(s) that are built and/or trained by the cross-attention dual-encoder 200 architecture can be represented mathematically as the combination of the two losses: L_total = L_Sym-CE + L_CE.
  • the cross-attention visual [CLS] token 248 and the cross-attention textual [CLS] token 249 (e.g., z^v_cls, z^t_cls) that are output from the cross-attention model branch 250 of the dual-encoder transformer module 220 are concatenated together and fed into a Feed Forward Network (FFN) with a softmax activation function.
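  • A minimal sketch of this classification head and the CE loss (PyTorch; the layer sizes and names are illustrative assumptions, and F.cross_entropy folds the softmax into the loss computation):

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        ffn = nn.Sequential(nn.Linear(1024, 512), nn.ReLU(), nn.Linear(512, 2))  # assumed hidden size

        def pair_classification_loss(z_v_cls, z_t_cls, pair_labels):
            # z_v_cls, z_t_cls: (M, 512) cross-attention [CLS] tokens; pair_labels: (M,) with 1 = "paired", 0 = "not"
            logits = ffn(torch.cat([z_v_cls, z_t_cls], dim=-1))   # concatenate, then feed-forward network
            return F.cross_entropy(logits, pair_labels)           # softmax + cross-entropy over the binary classes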
  • the error for the current state of the model must be estimated repeatedly.
  • the loss functions module 230 can be used to estimate the loss during training the model(s) so that the weights can be updated to reduce the loss on the next evaluation. Consequently, ML/AI model(s) that are built, trained, and optimized using the cross-attention dual-encoder 200 learn (video, text) pair similarities and classifications in order to accurately predict the most relevant (video, text) pairs from a video-text semantic space in a manner that is optimal for video-text retrieval tasks.
  • A flowchart is shown in FIG. 3, illustrating an example of a process 300 that is performed for building and/or training an ML/AI model using cross-attention, according to an embodiment of the systems and methods described herein.
  • process 300 is illustrated as a series of executable operations stored in machine-readable storage media 306 and performed by a hardware processor 304.
  • the computing component 302 can be a computer device used for telecommunication functions, such as voice, video, and text, and video-text retrieval tasks.
  • the computing component 302 may be the mobile computing device (e.g., smartphone) described above in reference to FIG. 1.
  • process 300 implements building and/or training of a ML/AI model, such as visual language learning model, which is guided by both video and text modalities simultaneously using cross attention techniques, according to some embodiments.
  • the process 300 can begin at operation 305, extracting visual features and textual features from input.
  • Operation 305 can involve extracting visual features and textual features from input comprising a dataset of various videos (e.g., frames of video clips) and text data corresponding to the videos (e.g., captions). In some cases, there is text that corresponds to each frame of video. Restated, visual features and textual features can be extracted from (video, text) pairs that are received as input for building and/or training the ML/AI model.
  • the visual features and textual features are extracted using a CLIP encoder.
  • Visual features and textual features can be extracted from each frame of video that is received as input.
  • Operation 305 can include the steps and calculations performed by a feature extraction module (described in detail in reference to FIG. 2).
  • Operation 310 can involve a dual-encoder transformer module (described in detail in reference to FIG. 2) receiving visual features and textual features that were extracted in previous operation 305.
  • separate modalities of information can be analyzed independently (e.g., without "fused" information) within the dual-encoder transformer module. That is, each modality (e.g., visual, textual) can traverse a separate branch of inference training, respectively.
  • the visual features and textual features are split into a sequence of fixed-size non-overlapping patches.
  • a [CLS] token for each respective modality is generated and added to the features to serve as a representation of the entire image/text.
  • a visual [CLS] token is added to the sequence of visual features
  • a textual [CLS] token is added to the sequence of textual features.
  • textual features and the textual [CLS] token are associated with being analyzed along a textual branch of the dual encoder transformer module and the visual features and visual [CLS] token are associated with being analyzed along a visual branch of the dual-encoder transformer module.
  • Cross-attention aspects of the process 300 involve using a non-patch token, namely [CLS] token, as an agent to interchange information between branches by attention.
  • the [CLS] tokens are projected and fused to an opposing branch.
  • the visual [CLS] token from the visual branch is interchanged, or projected, onto the textual features from the textual branch
  • the textual [CLS] token from the textual branch is interchanged, or projected, onto the visual features from the visual branch.
  • projecting includes using a linear projection function to map the [CLS] token to features from its opposing branch.
  • the visual [CLS] token is then fused to textual features and the textual [CLS] token is then fused to the visual features.
  • the projection and [CLS] token fusion executed in operation 315 can involve computations shown in eq. 4 and eq. 5.
  • new visual features and new textual features which include mutual information from both modalities, are formed.
  • Operation 315 can include performing inference training of the ML/AI model using the cross attention model branch algorithm, by applying the interchanged mutual modalities formed by the fused [CLS] tokens.
  • the dual-encoder transformer module can include dual-encoder transformers, where one encoder transformer is associated with the textual branch and the second encoder transformer is associated with the visual branch.
  • operation 320 can include feeding the newly formed visual features to the encoder transformer that is associated with the visual branch, and feeding the newly formed textual features to the encoder transformer that is associated with the textual branch.
  • the dual-encoder transformers are implemented as several layers of MHSA modules with LN and residual shortcut.
  • Operation 320 can involve encoding that is executed on the newly formed features by the dual-encoder transformers, which results in cross- encoded visual [CLS] token associated with visual features and a cross-encoded textual [CLS] token associated with text features. Operation 320 can involve computations shown in eq. 6 and eq. 7 in order to derive the cross-encoded [CLS] tokens.
  • the cross-encoded [CLS] tokens that are output from previous operation 320 are projected and fused to an opposing branch.
  • the cross-encoded visual [CLS] token is back-projected into the original space and concatenated with the original textual features to form new cross-attention textual features
  • the cross- encoded textual [CLS] token is back-projected into the original space and concatenated with the original visual features to form new cross-attention visual features.
  • Operation 325 can involve computations shown in eq. 8 and eq. 9 in order to derive the new cross-attention features.
  • operations 320-325 can be performed iteratively, continuing K times, where K is the depth of the transformer architecture, in order to obtain the new cross-attention features.
  • Operation 325 can include performing inference training of the ML/AI model using the cross-attention model branch algorithm, applying the interchanged mutual modalities formed by the cross-encoded [CLS] tokens.
  • [CLS] tokens associated with the new cross-attention features formed in previous operation 325 are output to a loss function.
  • the [CLS] tokens associated with cross-attention features are referred to herein as cross-attention [CLS] token.
  • a cross-attention visual [CLS] token associated with the visual cross-attention features and a cross-attention textual [CLS] associated with the textual cross-attention features are fed to a CE loss function.
  • the CE loss function further trains and/or optimizes the ML/AI model.
  • Operation 330 can involve the CE loss function performing computations shown in eq. 13 in order to train and/or optimize the ML/AI model with respect to error/loss estimation. Also, operation 330 can include performing inference training of the ML/AI model using the cross-attention model branch algorithm, applying the interchanged mutual modalities formed by the cross-attention [CLS] tokens.
  • the process 300 continues to operation 335 where the model associated with the cross-attention branch is trained using the algorithms associated with the respective textual branch and visual branch.
  • the ML/AI model that is built and/or trained using the cross-attention model branch can have query-dependent characteristics. Therefore, in order to achieve query-agnostic aspects during the inference process, the ML/AI model associated with the cross-attention model branch is also trained simultaneously without fusing or inter-changing the [CLS] tokens.
  • inference training for the ML/AI model is also conducted along the visual branch (e.g., visual modality without mutual information) and along the textual branch (e.g., textual modality without mutual information), which is also referred to herein as the dual-encoder training strategy/modeling.
  • Operation 335 can involve computations associated with eq. 4 and eq. 5 being performed without mutual information.
  • operation 335 can involve outputting the visual token from inference training along the visual branch and the textual token from inference training along the textual branch to a Sym-CE loss function.
  • operation 335 can also involve the Sym-CE loss function performing computations shown in eq. 12 in order to train and/or optimize the ML/AI model with respect to error/loss estimation.
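  • As a high-level illustration of how process 300 combines the two training strategies, the following sketch stitches together the earlier sketches in this description (fuse_cls, cross_attention_step, symmetric_ce_loss, pair_classification_loss); the helpers extract_clip_features and encode_without_fusion, and the depth K, are hypothetical placeholders rather than part of the disclosure:

        K = 4  # assumed transformer depth

        def training_step(video_frames, captions, pair_labels):
            # operation 305: per-frame visual tokens and caption tokens via CLIP (hypothetical helper)
            visual_tokens, textual_tokens = extract_clip_features(video_frames, captions)
            # operation 335: query-agnostic dual-encoder pass, no [CLS] fusion (hypothetical helper)
            v_cls, t_cls = encode_without_fusion(visual_tokens, textual_tokens)
            loss_sym = symmetric_ce_loss(v_cls, t_cls)
            # operations 310-330: query-dependent cross-attention pass, repeated K times
            xattn_v, xattn_t = visual_tokens, textual_tokens
            for _ in range(K):
                fused_v, fused_t = fuse_cls(xattn_v, xattn_t)
                xattn_v, xattn_t = cross_attention_step(xattn_v, xattn_t, fused_v, fused_t)
            loss_ce = pair_classification_loss(xattn_v[:, 0], xattn_t[:, 0], pair_labels)
            return loss_sym + loss_ce  # total loss, assuming equal weighting of the two terms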
  • process 300 builds, trains, and optimizes ML/AI model(s) by leveraging cross-attention, in a manner wherein inference is guided by the interchanging and/or fusing of both video and text modalities.
  • the ML/AI models generated from process 300 can learn (video, text) pair similarities and classifications in order to accurately predict the most relevant (video, text) pairs, which is leveraged for video-text retrieval tasks.
  • process 300 implementing the cross-attention techniques disclosed herein, realizes several advantages, such as improved retrieval accuracy for video-text retrieval tasks, and improved speed and efficiency for video-text retrieval tasks.
  • FIG. 4 depicts a block diagram of an example computer system 400 in which various features described herein may be implemented.
  • the computer system 400 can be a device, such as the mobile computing device 110 shown in FIG. 1, implementing the disclosed cross-attention dual-encoder and video-text retrieval system and methods.
  • the computer system 400 includes a bus 402 or other communication mechanism for communicating information, and one or more hardware processors 404 coupled with bus 402 for processing information.
  • Hardware processor(s) 404 may be, for example, one or more general purpose microprocessors.
  • the computer system 400 also includes a main memory 406, such as a random-access memory (RAM), cache and/or other dynamic storage devices, coupled to bus 402 for storing information and instructions to be executed by processor 404.
  • Main memory 406 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 404.
  • Such instructions when stored in storage media accessible to processor 404, render computer system 400 into a special-purpose machine that is customized to perform the operations specified in the instructions.
  • the computer system 400 further includes a read only memory (ROM) 408 or other static storage device coupled to bus 402 for storing static information and instructions for processor 404.
  • a storage device 410 such as a magnetic disk, optical disk, or USB thumb drive (Flash drive), etc., is provided and coupled to bus 402 for storing information and instructions.
  • the computer system 400 may be coupled via bus 402 to a display 412, such as a liquid crystal display (LCD) (or touch screen), for displaying information to a computer user.
  • An input device 414 is coupled to bus 402 for communicating information and command selections to processor 404.
  • Another type of user input device is cursor control 416, such as a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to processor 404 and for controlling cursor movement on display 412.
  • the same direction information and command selections as cursor control may be implemented via receiving touches on a touch screen without a cursor.
  • the computing system 400 may include a user interface module to implement a GUI that may be stored in a mass storage device as executable software codes that are executed by the computing device(s).
  • This and other modules may include, by way of example, components, such as software components, object-oriented software components, class components and task components, processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables.
  • the words "component," "engine," "system," "database," "data store," and the like, as used herein, can refer to logic embodied in hardware or firmware, or to a collection of software instructions, possibly having entry and exit points, written in a programming language, such as, for example, Java, C or C++.
  • a software component may be compiled and linked into an executable program, installed in a dynamic link library, or may be written in an interpreted programming language such as, for example, BASIC, Perl, or Python. It will be appreciated that software components may be callable from other components or from themselves, and/or may be invoked in response to detected events or interrupts.
  • Software components configured for execution on computing devices may be provided on a computer readable medium, such as a compact disc, digital video disc, flash drive, magnetic disc, or any other tangible medium, or as a digital download (and may be originally stored in a compressed or installable format that requires installation, decompression or decryption prior to execution).
  • Such software code may be stored, partially or fully, on a memory device of the executing computing device, for execution by the computing device.
  • Software instructions may be embedded in firmware, such as an EPROM.
  • hardware components may be comprised of connected logic units, such as gates and flip-flops, and/or may be comprised of programmable units, such as programmable gate arrays or processors.
  • the computer system 400 may implement the techniques described herein using customized hard-wired logic, one or more ASICs or FPGAs, firmware and/or program logic which in combination with the computer system causes or programs computer system 400 to be a special-purpose machine. According to one embodiment, the techniques herein are performed by computer system 400 in response to processor(s) 404 executing one or more sequences of one or more instructions contained in main memory 406. Such instructions may be read into main memory 406 from another storage medium, such as storage device 410. Execution of the sequences of instructions contained in main memory 406 causes processor(s) 404 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions.
  • non-transitory media refers to any media that store data and/or instructions that cause a machine to operate in a specific fashion. Such non-transitory media may comprise non-volatile media and/or volatile media.
  • Non-volatile media includes, for example, optical or magnetic disks, such as storage device 410.
  • Volatile media includes dynamic memory, such as main memory 406.
  • non-transitory media include, for example, a floppy disk, a flexible disk, hard disk, solid state drive, magnetic tape, or any other magnetic data storage medium, a CD-ROM, any other optical data storage medium, any physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, NVRAM, any other memory chip or cartridge, and networked versions of the same.
  • Non-transitory media is distinct from but may be used in conjunction with transmission media.
  • Transmission media participates in transferring information between non-transitory media.
  • transmission media includes coaxial cables, copper wire and fiber optics, including the wires that comprise bus 402.
  • Transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infra-red data communications.
  • the computer system 400 also includes a communication interface 418 coupled to bus 402.
  • Communication interface 418 provides a two-way data communication coupling to one or more network links that are connected to one or more local networks.
  • communication interface 418 may be an integrated services digital network (ISDN) card, cable modem, satellite modem, or a modem to provide a data communication connection to a corresponding type of telephone line.
  • communication interface 418 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN (or a WAN component to communicate with a WAN).
  • Wireless links may also be implemented.
  • communication interface 418 sends and receives electrical, electromagnetic or optical signals that carry digital data streams representing various types of information.
  • a network link typically provides data communication through one or more networks to other data devices.
  • a network link may provide a connection through local network to a host computer or to data equipment operated by an Internet Service Provider (ISP).
  • the ISP in turn provides data communication services through the world wide packet data communication network now commonly referred to as the "Internet.”
  • Local network and Internet both use electrical, electromagnetic or optical signals that carry digital data streams.
  • the signals through the various networks and the signals on network link and through communication interface 418, which carry the digital data to and from computer system 400, are example forms of transmission media.
  • the computer system 400 can send messages and receive data, including program code, through the network(s), network link and communication interface 418.
  • a server might transmit a requested code for an application program through the Internet, the ISP, the local network and the communication interface 418.
  • the received code may be executed by processor 404 as it is received, and/or stored in storage device 410, or other non-volatile storage for later execution.
  • Each of the processes, methods, and algorithms described in the preceding sections may be embodied in, and fully or partially automated by, code components executed by one or more computer systems or computer processors comprising computer hardware.
  • the one or more computer systems or computer processors may also operate to support performance of the relevant operations in a "cloud computing" environment or as a "software as a service” (SaaS).
  • The processes and algorithms may be implemented partially or wholly in application-specific circuitry.
  • the various features and processes described above may be used independently of one another or may be combined in various ways. Different combinations and sub-combinations are intended to fall within the scope of this disclosure, and certain method or process blocks may be omitted in some implementations.
  • A circuit might be implemented utilizing any form of hardware, software, or a combination thereof.
  • Processors, controllers, ASICs, PLAs, PALs, CPLDs, FPGAs, logical components, software routines or other mechanisms might be implemented to make up a circuit.
  • The various circuits described herein might be implemented as discrete circuits or the functions and features described can be shared in part or in total among one or more circuits. Even though various features or elements of functionality may be individually described or claimed as separate circuits, these features and functionality can be shared among one or more common circuits, and such description shall not require or imply that separate circuits are required to implement such features or functionality.
  • Where a circuit is implemented in whole or in part using software, such software can be implemented to operate with a computing or processing system capable of carrying out the functionality described with respect thereto, such as computer system 400.

Landscapes

  • Engineering & Computer Science (AREA)
  • Library & Information Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Image Analysis (AREA)

Abstract

Systems and methods are provided for improving video-text retrieval tasks by employing a cross-attention dual-encoder. The cross-attention dual-encoder performs inference training for a machine learning/artificial intelligence (ML/AI) model by using [CLS] tokens to interchange and fuse a visual modality and a textual modality together. Cross-attention achieves an inference that is guided by both video and text modalities simultaneously, is linear in computation and memory, and leverages features at different modalities. Further, cross-attention improves the accuracy and speed of video-text retrieval tasks. For example, a mobile computing device can include a cross-attention dual-encoder training the ML/AI model using cross-attention to learn (video, text) pair similarities and classifications and predict the most relevant (video, text) pairs. The mobile computing device can also include a video-text retriever performing video-text retrieval tasks guided by the ML/AI model to select one or more most relevant videos from a plurality of videos based on a received text query.

Description

CROSS-ATTENTION SYSTEM AND METHOD FOR FAST VIDEO-TEXT RETRIEVAL TASK WITH IMAGE CLIP
Reference to Related Application
[0001] The present application claims priority to U.S. Patent Application No. 63/229,368, filed August 4, 2021 and titled "VIDEOCLIP: A CROSS-ATTENTION MODEL FOR FAST VIDEO-TEXT RETRIEVAL TASK WITH IMAGE CLIP," which is incorporated herein by reference in its entirety.
Description of Related Art
[0002] The present application generally relates to video search and retrieval, particularly to methods and a system for improving relevancy of videos retrieved given textual queries.
Brief Description of the Drawings
[0003] The present disclosure, in accordance with one or more various embodiments, is described in detail with reference to the following figures. The figures are provided for purposes of illustration only and merely depict typical or example embodiments.
[0004] FIG. 1 depicts an example computing system, such as a mobile computing device, implementing a dual-encoder cross-attention system for improved video-text retrieval, in accordance with embodiments of the application.
[0005] FIG. 2 depicts an example architecture for the dual-encoder cross-attention system shown in FIG. 1 for improved video-text retrieval, in accordance with embodiments of the application.
[0006] FIG. 3 is an operational flow diagram illustrating an example method for implementing the building, training, and optimization of a machine learning/artificial intelligence (ML/AI) model using the disclosed cross-attention dual-encoder techniques, in accordance with embodiments of the application.
[0007] FIG. 4 is a block diagram of an example computing component or device for implementing the disclosed techniques, in accordance with embodiments of the application.
[0008] These illustrative embodiments are mentioned not to limit or define the disclosure, but to provide examples to aid understanding thereof. Additional embodiments are discussed in the Detailed Description, and further description is provided there.
Detailed Description
[0009] Video-text retrieval is an essential task in cross-modal information retrieval, i.e., retrieving relevant videos from a large and unlabeled dataset given textual queries. There are some existing methods related to video-text retrieval that involve simply pooling the image features (e.g., based on the CLIP encoder) from frames to build the video descriptor. By employing such methods, the semantic space that supports video-text retrieval has been directly composed without any prior knowledge, which can cause inefficiencies and limitations in the overall video-text retrieval process. Thus, utilizing these existing approaches that rely substantively on image features only may result in sub-optimal video-text search accuracy, since the information among different modalities is not fully exchanged and aligned.
[0010] As disclosed herein, a novel cross-attention dual-encoder method and system are provided to address the challenging video-text retrieval problems related to the aforementioned existing approaches, such as insufficient video-text datasets, inaccurate retrieval, and slower retrieval speeds. The disclosed embodiments include a highly efficient cross-attention dual-encoder method and system that facilitates the information exchange between multiple modalities (i.e., video and text). Consequently, video-text retrieval using the disclosed cross-attention dual-encoder can outperform existing state-of-the-art methods, achieving improved retrieval speeds that are much faster, and with greater accuracy, than traditional query-agnostic search models. As will be described in detail herein, the cross-attention dual-encoder techniques and systems realize several advantages, including: generating a well-defined semantic space from the image-text domain to the video-text domain, in a manner that boosts the performance of video-text retrieval tasks; achieving a linearity in both computation and memory that is able to better leverage features at different modalities; and implementing a query-agnostic search engine that is able to scale by pre-computing a video data index, which improves the efficiency of video-text retrieval tasks.
[0011] Referring now to FIG. 1, an example of a mobile computing device 110 employing the disclosed cross-attention dual-encoder system for improved video-text retrieval is depicted. The mobile computing device 110 can be a user equipment (UE) device, being implemented as any type of wireless user device that is used directly by an end-user to communicate and having video-text retrieval capabilities, which are performed by the video-text retriever 111 implemented on the mobile computing device 110. In the example of FIG. 1, the mobile computing device 110 is shown as a handheld mobile phone, and more specifically a smartphone. However, the mobile computing device 110 may be implemented as various other wireless user devices that are used directly by an end-user to communicate and equipped with telecommunication functions, such as voice, video, and text. For example, the mobile computing device 110 may also be implemented as a cellular telephone, a laptop computer equipped with a mobile broadband adapter, or other computing device. Accordingly, as a smartphone, the mobile computing device 110 is capable of supporting enhanced data services, voice, video, and other telecommunication functions that are commonly employed by subscribers to broadband cellular networks.
[0012] Furthermore, the mobile computing device 110 is depicted to include a video-text retriever 111 and a cross-attention dual-encoder 112 that implement the disclosed techniques for supporting video-text retrieval tasks. Although not shown in FIG. 1, the mobile computing device 110 can include other applications, computing sub-systems, and hardware. For example, the mobile computing device 110 can include an operating system that provides an interface between the mobile computing device's 110 hardware (e.g., the input/output mechanisms and a processor executing instructions retrieved from computer-readable medium) and software. Example operating systems include ANDROID, CHROME, IOS, MAC OS X, WINDOWS 7, WINDOWS PHONE 7, SYMBIAN, BLACKBERRY, WEBOS, a variety of UNIX operating systems, or a proprietary operating system for computerized devices. The operating system may provide a platform for the execution of application programs that facilitate interaction between the computing device and a user. The video-text retriever 111 and cross-attention dual-encoder 112, as disclosed herein, can be implemented on the mobile computing device 110 as hardware, a stand-alone processor, firmware, a software application, or any combination thereof.
[0013] In an example, the video-text retriever 111 and the cross-attention dual-encoder 112 operate in concert to implement various video-text retrieval features on the mobile computing device 110. As a general description, the cross-attention dual-encoder 112 is configured to perform a distinct multi-modal training of a machine learning/artificial intelligence (ML/AI) model, where the model is trained to determine relevant (video, text) pairs. The video-text retriever 111 is configured to execute various video-text retrieval tasks on the mobile computing device 110. For example, a video-text retrieval task performed by the video-text retriever 111 can include automatically retrieving a list of videos (from a corpus/library including vast amounts of videos) that are deemed most relevant given a text query. There is a critical interoperability between the functions of the cross-attention dual-encoder 112 and the video-text retriever 111. In detail, the ML/AI model that is generated and/or trained by the cross-attention dual-encoder 112 to accurately determine relevancy between text and video can be employed by the video-text retriever 111 to drive the relevancy-based search for video from a large corpus of text and videos. In other words, as specific text is entered into the video-text retriever 111 for a query, the model trained by the cross-attention dual-encoder 112 to automatically recognize relevancy links between text and video can be used by the video-text retriever in order to automatically select videos that are relevant to the query-specific text. [0014] According to the embodiments, the cross-attention dual-encoder 112 is distinctly configured to perform a multi-modal training for an ML/AI model that is guided by both video and text modalities simultaneously for improved speed and accuracy, as described in greater detail in reference to FIG. 2, for example. As referred to herein, AI can be described as an automated computer process that can intelligently leverage data analysis for training itself for further optimizing the processes. ML can be generally considered an application of AI. AI techniques can include various approaches that are used in the area to achieve automated data analysis, such as neural networks, automated reasoning analysis (e.g., satisfiability modulo theories), and so on. AI-based techniques can be used to enhance computer-controlled features of a mobile computing device 110 in a manner that improves the overall user experience and optimizes performance of applications and/or the operating environment. In the example of FIG. 1, AI/ML techniques are specifically used to drive visual language-learning modeling and video-text retrieval tasks, as disclosed.
[0015] As an example of operation, the mobile computing device 110 can have a library of videos stored in its memory that were captured by the device's 110 user/owner using its video recording functions (e.g., built-in video camera). In the illustrated example of FIG. 1, frames from a clip of video 113 are depicted as being stored on, or otherwise accessible by, the mobile computing device 110. The four frames of video 113 illustrated in the example are related to the sport of basketball, including images that show basketball players, a court, a goal, and the like. Accordingly, the text 114 includes words, keywords, phrases, etc. that are generally related to basketball and describe the imagery that is portrayed in the frames of video 113. For instance, the text 114 includes captions, or descriptors, that correspond to the contents of the frames of video 113. As seen in FIG. 1, the text 114 that accompanies the frames of video 113 comprises phrases, or captions, including: "a player is putting the basketball into the post from a distance"; "the player makes a three pointer"; and "people are playing basketball." At a later time, the user/owner of the mobile computing device 110 may desire to search through and retrieve one or more of the stored videos, including video 113, using a searching function of the computing device 110, shown as video-text retriever 111. In another operational example, the video 113 is not necessarily stored on the mobile computing device 110 itself but is stored on distributed and large-scale remote databases/repositories of information that are accessible to the mobile computing device 110 via communication networks, such as the Internet. In this example, the video-text retriever 111 functions similarly to a search engine, where text entered into a graphical user interface (GUI) of the video-text retriever 111 drives searches for content that is available on the Internet and sites on the World Wide Web.
[0016] As alluded to above, the video-text retriever 111 is configured to utilize text as input which serves as the basis for a query to retrieve a selected one or more relevant videos from a larger group consisting of vast amounts of videos, including videos 113. Generally, videos can correspond to descriptive text, such as keywords, phrases, descriptors, and the like, which describe the contents and/or context of the video. Thus, for example, the user/owner of the mobile computing device 110 can enter text (e.g., keywords, phrases, search string, etc.) into a GUI of the video-text retriever 111 in order to ultimately search through a plurality of videos, such as videos 113, in order to retrieve one or more videos that are deemed most relevant to the text input. Accordingly, the video-text retriever 111 can be described as a multi-modal application (e.g., combining visual and text information). Furthermore, as previously described, the video-text retriever 111 employs an ML/AI model, such as a visual-language learning model, to execute its video-text retrieving tasks. Because the video-text retriever 111 executes cross-modal tasks in the image-text domain, the ML/AI model leveraged by the video-text retriever 111 is trained using a cross-modal approach, referred to herein as cross-attention, that is implemented by the disclosed cross-attention dual-encoder 112. Thus, the video-text retriever 111 is capable of performing enhanced video-text retrieval tasks, by directly leveraging the cross-attention training of the cross-attention dual-encoder 112.
[0017] As alluded to above, there are several challenges that are associated with conventional video-text retrieval approaches that apply image-text pre-training in the original image-text domain. One such drawback is related to the video feature representation. Different from images, the generation of an appropriate video feature representation is not trivial and should consider both the spatial and temporal dimensions. Thus, simply pooling the frame features to build a video descriptor would often result in sub-optimal video-text search accuracy. Another challenge pertains to the multi-modal interaction between video and language. Generally speaking, video-text retrieval is naturally a weakly-supervised learning problem because there are no explicit alignments between the video and text modalities. Moreover, traditional embedding approaches, where video and text embeddings were learned independently and aligned in a brute force manner, often lead to an unsatisfactory result. Yet another challenge involves traditional query-dependent models, where a pair (video, text) is encoded by concatenating and passing into one single network. These types of models can be prohibitively slow to apply to the entire video corpus since the network needs to re-compute for every separate query that is conducted. It is thus impractical to apply such a query-dependent model to a real-world video-text retrieval system.
[0018] Nonetheless, the video-text retriever 111 leverages the capabilities of the cross-attention dual-encoder 112, which performs a cross-attention learning approach that is guided by both video and text modalities simultaneously, in order to address the aforementioned drawbacks. Consequently, the video-text retriever 111 and cross-attention dual-encoder 112 work together to achieve better retrieval accuracy and efficiency. For instance, with conventional approaches, a weakly constructed inference between text and video can cause a query using the text "people shooting a three pointer," which is a slight variation of the text 114, to fail to retrieve a corresponding video. Due to a finer fusion mechanism that allows multiple modalities (i.e., video and text) to better exchange information based on cross-attention, the cross-attention dual-encoder 112 realizes improved accuracy, while also keeping the efficiency in the inference stage because of the use of a query-agnostic search architecture. Referring back to the previous example, the cross-attention dual-encoder 112 building a strong inference between text and video in the ML/AI model allows a query using the text "people shooting a three pointer" in the video-text retriever to successfully retrieve a corresponding video, such as video 113. [0019] The cross-attention dual-encoder 112 can extend the image-text semantic space created by existing image-text pre-training models, such as Contrastive Language-Image Pre-Training (CLIP), into a more complex video-text space. Moreover, the cross-attention dual-encoder 112 is configured to perform a cross-attention technique, described in greater detail in reference to FIG. 2, that uses a non-patch token as an agent to interchange information between branches by attention. Consequently, the cross-attention dual-encoder 112 is linear in both computation and memory and is able to better leverage features at different modalities. Additionally, the cross-attention dual-encoder 112 implements a query-agnostic search model (e.g., as opposed to a query-dependent model having to re-compute for every query) that is able to scale by pre-computing a video data index, which makes the system extremely efficient in real-world applications. Thus, referring to the example of FIG. 1, the cross-attention dual-encoder 112 can train an ML/AI model using a dataset of videos and text, such as a large library of video clips and related captions. For example, the dataset can include the videos 113 and text 114 depicted in FIG. 1, where the text 114 comprises captions that describe the contents and/or context of the corresponding clips of video 113.
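To illustrate the query-agnostic search behavior described above, the following listing is a minimal sketch (in PyTorch, which is an assumption; the disclosure does not prescribe a framework) of retrieval against a pre-computed video data index. The helper names encode_video, encode_text, build_video_index, and retrieve are hypothetical stand-ins for the trained visual and textual branches; they are not named in the disclosure.

import torch
import torch.nn.functional as F

@torch.no_grad()
def build_video_index(videos, encode_video):
    # Pre-compute one embedding per video offline; this is what makes the search query-agnostic.
    index = torch.stack([encode_video(v) for v in videos])   # (num_videos, dim)
    return F.normalize(index, dim=-1)

@torch.no_grad()
def retrieve(query_text, video_index, encode_text, top_k=5):
    # A single text query requires only one text-encoder forward pass plus a similarity ranking.
    q = F.normalize(encode_text(query_text), dim=-1)          # (dim,)
    scores = video_index @ q                                  # cosine similarity against every indexed video
    return scores.topk(min(top_k, video_index.size(0)))      # most relevant (score, video index) candidates

Because the index is built once, adding new queries does not require re-encoding the video corpus, which is the scaling property the paragraph above attributes to a query-agnostic design.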
[0020] In general, the ML/AI model training that is performed by the cross-attention dual-encoder 112 is a distinct approach that involves, but is not limited to: 1) extracting textual features and visual features from the video and text in the dataset; 2) interchanging tokens from each feature to the opposing feature (e.g., fusing visual tokens to textual features, and fusing textual tokens to visual features) in a manner that considers both the textual and visual modalities mutually in the inference process; and 3) applying loss functions to the model to reduce the loss on the next evaluation. Details regarding the architecture and training process executed by the cross-attention dual-encoder 112 are described in greater detail below in reference to FIG. 2. The resulting output from the cross-attention dual-encoder 112 is an ML/AI model that is trained for visual-language learning using cross-attention techniques. This ML/AI model accurately and efficiently learns a mapping between video and text to predict the closest relevant (video, text) pair outputs that can be leveraged by the video-text retriever 111 for executing its video-text retrieving tasks. [0021] FIG. 2 depicts an example configuration of the cross-attention dual-encoder 200 that is described above in reference to FIG. 1 for enhancing video-text retrieval tasks. The cross-attention dual-encoder 200 executes an autonomous self-training process for ML/AI model(s), where the model(s) trains on a variety of text and image data in a manner that adapts a well-defined image-text semantic space to the 3D video space. Accordingly, an ML/AI model trained using the cross-attention dual-encoder 200 can learn (video, text) pair similarities and classifications in order to accurately predict the most relevant (video, text) pair from an image-text semantic space. FIG. 2 illustrates that the framework of the cross-attention dual-encoder 200 can comprise three main elements, including: 1) a feature extraction module 210 which extracts visual and textual features respectively; 2) a dual-encoder transformer module 220 which utilizes respective textual branch, visual branch, and cross-attention branch algorithms for inference training; and 3) a loss functions module 230 which utilizes symmetric cross-entropy loss and cross-entropy loss functions to optimize the ML/AI model with respect to error/loss estimation. In an embodiment, the cross-attention dual-encoder 200 is implemented as a computer processor device, for example, a microcomputer that includes one or more processing units (e.g., microprocessors), memory storage (e.g., RAM, ROM, etc.), and I/O devices. The processing units of the cross-attention dual-encoder 200 can execute instructions stored in memory to control one or more electrical systems or subsystems of the computer processor device. Furthermore, the aforementioned feature extraction module 210, dual-encoder transformer module 220, loss functions module 230 and other elements comprised thereof, can be implemented as hardware, firmware, software, or any combination thereof. The feature extraction module 210, dual-encoder transformer module 220, and loss functions module 230 can be implemented as components integrated together on a single computer processor device of the cross-attention dual-encoder 200, or as separate stand-alone computer processor devices functioning together.
[0022] The feature extraction module 210 is configured to extract frame features, for example features associated with frames of video and text, to serve as inputs into the dual-encoder transformer module 220, where the distinct cross-attention techniques are executed. Additionally, the feature extraction module 210 can comprise a pre-trained image-text encoder, shown in FIG. 2 as CLIP 214, which enables the module 210 to take advantage of projecting the images and the corresponding text (e.g., captions) into the same semantic space. Further, the cross-attention dual-encoder 200 is particularly designed to combine cross-attention techniques with a dual-encoder architecture, which is illustrated in FIG. 2 within the dual-encoder transformer module 220. FIG. 2 also shows that the cross-attention dual-encoder 200 includes the loss functions module 230 which implements two types of loss functions, namely the symmetric cross-entropy and the cross-entropy loss, for maintaining the (video, text) pair similarities and learning the classification-like task, respectively.
[0023] As seen in FIG. 2, the cross-attention dual-encoder 200 architecture includes a feature extraction module 210. Particularly, FIG. 2 illustrates the feature extraction module 210 initially receiving input 211 in the form of video 212 and text 213. For example, the video 212 and text 213 can be a part of a larger dataset that includes a wide variety of video images and text to train the ML/AI model. For instance, video 212 can be multiple different video clips, where each video clip comprises a plurality of frames of video. The feature extraction module 210 receives the video 212, specifically frames of video, in a manner that allows features associated with the visual information conveyed in the video 212 to be extracted and further analyzed as training data. Also, because the cross-attention dual-encoder 200 facilitates analysis in both video and text modalities, the feature extraction module 210 also extracts features associated with the text 213. As an example, the text 213 that is input into the feature extraction module 210 can include words, keywords, phrases, and the like, which correspond to one or more of the frames of the video 212. FIG. 2 illustrates that the input 211, comprising video 212 and text 213, is specifically received by the CLIP model 214.
[0024] As referred to herein, CLIP is a neural network that is pre-trained on a large set of (image, text) pairs. CLIP can be instructed in natural language to predict the most relevant text snippet, given an image, without directly optimizing for the task. Thus, because CLIP 214 has been pre-trained to learn visual concepts from natural language supervision, the
CLIP 214 is already equipped to extract features from (image, text) pairs it has received as input 211, namely video 212 and the text 213 (e.g., frames of video and their corresponding captions). In support of the dual-modality capabilities of the cross-attention dual-encoder 200, the CLIP 214 is constructed with a dual-encoder architecture. That is, the CLIP 214 encoder extracts both visual features 215 and textual features 216 from the input 211, respectively. In an embodiment, CLIP 214 extracts visual features 215 and textual features 216 from each frame of video that it receives as input. These visual features 215 and textual features 216, respectively, are represented by all of the outputs of the highest layer of the transformer. The visual features 215 and textual features 216 that are extracted for each frame of video can be represented mathematically as follows:
[Equation (1), rendered as an image in the original document: per-frame feature extraction with the CLIP encoder]
where $I_v$ is the CLIP model visual encoder.
[0025] In an example where the dimension of each frame feature is (50, 512), then 50 equals the number of image tokens, which is 49 plus the [CLS] token. As referred to herein, [CLS] is a special classification token that can be used to serve as a representation of the entire image. The 512 is the embedding dimension for each visual token $x^n_i$, where $n \in [1, N]$ such that $N$ is the number of frames and $i \in [1, 49] + [\text{CLS}]$.
[0026] Subsequently, a mean pooling along the number of frames can be taken after the feature extraction. Therefore, the final visual features 215 and textual features 216 can be represented mathematically as:
$x_{\mathrm{cls},1,\dots,49} = \mathrm{Mean}\{G_n(I_1), \dots, G_n(I_N)\}$ (2)
[Equation (3), rendered as an image in the original document: textual feature extraction with the CLIP textual encoder]
where $T_v$ is the CLIP model textual encoder. [0027] In the example where the output dimension of each text feature is (77, 512), then 77 is the fixed length of the input sentence and 512 is the embedding dimension of each textual token. Hence, the dimension of the video and the text input for each batch will be (B, N, 50, 512) and (B, 77, 512), respectively, where B is the batch size.
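As an illustration of the shapes discussed in paragraphs [0024]-[0027], the following is a sketch of per-frame feature extraction followed by mean pooling over frames. It assumes token-level CLIP encoders exposed as callables clip_visual (returning 50 tokens of dimension 512 per frame, i.e., the [CLS] token plus 49 patch tokens) and clip_text (returning 77 tokens of dimension 512 per caption); these callables, their signatures, and the PyTorch framing are assumptions for illustration, not the patented implementation.

import torch

def extract_features(frames, captions, clip_visual, clip_text):
    # frames: (B, N, 3, H, W) batches of N video frames; captions: (B, 77) token ids.
    B, N = frames.shape[:2]
    per_frame = clip_visual(frames.flatten(0, 1))   # (B*N, 50, 512): [CLS] + 49 patch tokens per frame
    visual = per_frame.view(B, N, 50, 512)          # (B, N, 50, 512) video features
    pooled_visual = visual.mean(dim=1)              # Eq. (2): mean pooling along the N frames -> (B, 50, 512)
    textual = clip_text(captions)                   # (B, 77, 512) textual features
    return visual, pooled_visual, textual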
[0028] Additionally, FIG. 2 illustrates that the cross-attention dual-encoder 200 architecture includes the dual-encoder transformer module 220. As disclosed herein, the dual-encoder transformer module 220 is distinctly configured to perform cross-attention techniques that build and/or train three branches of models within the transformer's 220 architecture. FIG. 2 depicts that separate modalities of information can be analyzed independently (e.g., without "fused" information) within the dual-encoder transformer module 220, where each modality (e.g., visual, textual) traverses a separate branch of inference training, respectively, for a dual-encoder model. Particularly, the textual features are analyzed along a textual branch 221 (indicated in FIG. 2 by solid filled elements/arrows) of the dual-encoder transformer module 220 and the visual features are analyzed along a visual branch 222 (indicated in FIG. 2 by pattern filled elements/arrows) of the dual-encoder transformer module 220. Thus, there are two branches of an inference algorithm for the dual-encoder model, namely textual branch 221 and visual branch 222, that involves training the model without "fusing" or containing any mutual information. However, due to leveraging cross-attention, there is a third branch, which involves inference training for a cross-attention model, referred to herein as the cross-attention model branch 250. The algorithm for inference training along the cross-attention model branch 250 is executed with mutual information, which involves a fusing of visual and textual information together. The flow of data corresponding to the third cross-attention model branch 250 is represented in FIG. 2 with dashed lines. In general, the flow for cross-attention aspects of the dual-encoder transformer module 220, which are query-dependent, is illustrated along the cross-attention model branch 250 in the center of the dual-encoder transformer module 220. In contrast, the dual-encoder flow, which does not require any cross-attention and/or fusing of modalities, is represented in FIG. 2 with solid lines along the textual branch 221 and the visual branch 222 at the bottom and top of the dual-encoder transformer module 220, respectively.
[0029] The process and computations involved in inference training along the cross-attention model branch 250 of the dual-encoder transformer module 220 are described in detail below in reference to Eq. 4 to Eq. 9. Ultimately, the [CLS] tokens 248, 249 which contain the cross-attention information (also referred to herein as cross-attention [CLS] tokens) output from the inference training of the cross-attention model, namely the cross-attention model branch 250 of the dual-encoder transformer module 220, will be fed into the Cross-Entropy (CE) loss function 232 of the loss functions module 230, while the [CLS] tokens 233, 234 which are output directly from the inference training of the dual-encoder model, namely textual branch 221 and visual branch 222 of the dual-encoder transformer module 220, are sent to the Symmetric Cross-Entropy (Sym-CE) loss function 231 to be computed.
[0030] Specifically, the cross-attention model branch 250 of the dual-encoder transformer module 220 illustrates that each of the two encoders (e.g., visual, textual) within the dual-encoder transformer module 220 is not entirely isolated from the other in the same manner as some traditional approaches. In contrast, the disclosed dual-encoder transformer module 220 is uniquely designed to implement a cross-attention functionality during inference that fuses a [CLS] token from each encoder with the patch tokens from the other branch (handling the opposite modality).
[0031] FIG. 2 shows that the visual features 215 and textual features 216 that are extracted and output from the feature extraction module 210 are fed into the dual-encoder transformer module 220. The dual-encoder transformer module 220 can split these visual features 215 and textual features 216 into a sequence of fixed-size non-overlapping patches, which are then linearly embedded. Specifically, the visual features 215 are illustrated as being split into a sequence of visual features 226, and the textual features 216 are illustrated as being split into a sequence of textual features 225. A [CLS] token, for each respective modality, is added to serve as representation of an entire image/text. As illustrated in FIG. 2,
a visual [CLS] token 224 is added to the sequence of visual features 226, and a textual [CLS] token 223 is added to the sequence of textual features 225.
[0032] FIG. 2 also shows that the dual-encoder transformer module 220 maps, using a linear projection function, the visual [CLS] token 224 from the visual branch 222 (indicated in FIG. 2 by curved pattern filled arrow) to textual features 225, which creates newly formed textual features 227 having mutual information from both modalities. Similarly, FIG. 2 shows that the dual-encoder transformer module 220 maps, using a linear projection function, the textual [CLS] token 223 from the textual branch 221 (indicated in FIG. 2 by curved solid filled arrow) to visual features 226, which creates newly formed visual features 228 having mutual information from both modalities. In detail, the [CLS] tokens 223, 224 will be projected into the other branch's space by a linear projection layer before the token fusion is performed. Thus, the cross-attention techniques of the dual-encoder transformer module 220 use the [CLS] tokens 223, 224 as agents to interchange information from the respective visual branch 222 and textual branch 221, and fuse the modalities together during inference, by attention.
[0033] For the computation of the [CLS] token fusion, the input video features are converted into 3 channels since the video feature contains 4 channels after the feature extraction (e.g., performed by the feature extraction module 210). A dimension flattening along the first and second axes of the video features represented in Eq. 2 and Eq. 3 is executed, which is equal to converting the (B, N, 50, 512) to (B * N, 50, 512). The process for [CLS] token fusion can be represented mathematically as:
$X^v = [f^{t \to v}(x^t_{\mathrm{cls}}) \,\|\, x^v_{1,\dots,49}]$ (4)
$X^t = [f^{v \to t}(x^v_{\mathrm{cls}}) \,\|\, x^t_{1,\dots,76}]$ (5)
where $X^v$ is the newly formed visual features,
$X^t$ is the newly formed textual features,
$f^{v \to t}$ is the linear projection function that maps the [CLS] token from the visual branch to the textual branch, and
$f^{t \to v}$ is the linear projection function that maps the [CLS] token from the textual branch to the visual branch.
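The token fusion of Eq. (4) and Eq. (5) can be sketched as follows. The sketch assumes PyTorch and assumes the frame axis has already been flattened or pooled so that the visual and textual batch dimensions align; the module and parameter names are illustrative only and are not taken from the disclosure.

import torch
import torch.nn as nn

class ClsTokenFusion(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.f_v2t = nn.Linear(dim, dim)   # linear projection of the visual [CLS] into the textual space
        self.f_t2v = nn.Linear(dim, dim)   # linear projection of the textual [CLS] into the visual space

    def forward(self, visual, textual):
        # visual: (B, 50, 512) with the visual [CLS] at index 0; textual: (B, 77, 512) with the textual [CLS] at index 0
        new_visual = torch.cat([self.f_t2v(textual[:, :1]), visual[:, 1:]], dim=1)    # Eq. (4)
        new_textual = torch.cat([self.f_v2t(visual[:, :1]), textual[:, 1:]], dim=1)   # Eq. (5)
        return new_visual, new_textual   # features that now carry mutual information from both modalities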
[0034] According to the embodiments, the projection and fusion of mutual information (e.g., visual [CLS] token 224 from the visual branch to textual features 225, and textual [CLS] token 223 from the textual branch to the visual features 226) that is represented by eq. 4 and eq. 5 above can be executed K times, where K equals the depth of the transformer's 220 architecture. Subsequently, the newly formed visual features 228 and the newly formed textual features 227 are then fed into dual (e.g., two) transformer encoders 241, 242 of the dual-encoder transformer module 220 separately in order to obtain mutual information. A single encoder, within the dual-encoder transformer module 220, has a configuration that is based on a Multi-Head Self-Attention (MHSA) module. In the example of FIG. 2, the dual-encoder transformer module 220 is depicted as including an MHSA transformer encoder 241 in the textual branch 221 and an MHSA transformer encoder 242 in the visual branch 222. In an embodiment, the transformer encoders 241, 242 are configured within the dual-encoder transformer module 220 as a stack of several layers of MHSA modules with Layer Normalization (LN) and residual shortcut. The output of each layer of the transformer encoders 241, 242, shown in FIG. 2 as cross-encoded textual [CLS] token 245 and cross-encoded visual [CLS] token 244, respectively, can be represented mathematically as:
[Equations (6) and (7), rendered as images in the original document: the outputs of the MHSA transformer encoders 242 and 241]
where $y^v_{\mathrm{cls}}$ is the cross-encoded visual [CLS] token, and where $y^t_{\mathrm{cls}}$ is the cross-encoded textual [CLS] token. [0035] In FIG. 2, a cross-encoded visual [CLS] token 244 is shown as output from the MHSA transformer encoder 242 associated with the visual branch 222, and a cross-encoded textual [CLS] token 245 is shown as output from the MHSA transformer encoder 241 associated with the textual branch 221. The newly cross-encoded [CLS] tokens 244, 245 are then back-projected into the original space and concatenated with the original features to form new features using cross-attention (also referred to as features with cross-attention). This is illustrated in FIG. 2 as the cross-encoded visual [CLS] token 244 being interchanged (indicated in FIG. 2 by curved pattern filled arrow) into the space with textual features 225 and the cross-encoded textual [CLS] token 245 being interchanged (indicated in FIG. 2 by curved solid filled arrow) into the space with visual features 226. By utilizing the cross-attention techniques, where the cross-encoded [CLS] tokens 244, 245 are added to the opposing features 225, 226, new cross-attention features 246, 247 are formed, respectively. FIG. 2 particularly illustrates that projecting cross-encoded textual [CLS] token 245 onto visual features 226 forms new cross-attention visual features 246 and projecting cross-encoded visual [CLS] token 244 onto textual features 225 forms new cross-attention textual features 247. Furthermore, according to the embodiments, performing cross-attention of encoded [CLS] tokens 244, 245 in order to generate these newly formed cross-attention representations can repeat K times, where K is the depth of the transformer architecture. Restated, the steps represented by Eq. 4 to Eq. 7 can be repeated K times. Further, transformer functions associated with the MHSA transformer encoders 241, 242 are performed to obtain the new cross-attention features 246, 247. The new cross-attention features 246, 247, resulting from transformation and interchanging/projecting, can be represented mathematically as:
$Z^v = \mathrm{Trans}[g^{t \to v}(y^t_{\mathrm{cls}}) \,\|\, x^v_{1,\dots,49}]$ (8)
$Z^t = \mathrm{Trans}[g^{v \to t}(y^v_{\mathrm{cls}}) \,\|\, x^t_{1,\dots,76}]$ (9)
where $\mathrm{Trans}$ is the transformer architecture,
where $g^{v \to t}$ is the back-project function for the cross-encoded visual [CLS] token to the textual branch, and
where $g^{t \to v}$ is the back-project function for the cross-encoded textual [CLS] token to the visual branch.
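A sketch of Eq. (6) through Eq. (9) is given below, again assuming PyTorch; nn.TransformerEncoderLayer is used here only as a stand-in for the MHSA stack with layer normalization and residual shortcut, and the exact depth K, head count, and layer layout are assumptions rather than details taken from the disclosure.

import torch
import torch.nn as nn

class CrossEncodeAndBackProject(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.enc_t = nn.TransformerEncoderLayer(dim, heads, batch_first=True)   # textual-branch MHSA encoder 241
        self.enc_v = nn.TransformerEncoderLayer(dim, heads, batch_first=True)   # visual-branch MHSA encoder 242
        self.g_t2v = nn.Linear(dim, dim)   # back-projects the cross-encoded textual [CLS] to the visual branch
        self.g_v2t = nn.Linear(dim, dim)   # back-projects the cross-encoded visual [CLS] to the textual branch
        self.trans_v = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.trans_t = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, fused_visual, fused_textual, visual, textual):
        # fused_visual/fused_textual are the outputs of Eq. (4)/(5); visual/textual are the original token sequences.
        y_v_cls = self.enc_v(fused_visual)[:, :1]    # cross-encoded visual [CLS] token 244
        y_t_cls = self.enc_t(fused_textual)[:, :1]   # cross-encoded textual [CLS] token 245
        z_v = self.trans_v(torch.cat([self.g_t2v(y_t_cls), visual[:, 1:]], dim=1))    # Eq. (8): cross-attention visual features 246
        z_t = self.trans_t(torch.cat([self.g_v2t(y_v_cls), textual[:, 1:]], dim=1))   # Eq. (9): cross-attention textual features 247
        return z_v, z_t   # index 0 of each output holds the cross-attention [CLS] tokens 248 and 249

In a full model this round would be repeated K times to match the depth of the transformer architecture, as the paragraph above describes.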
[0036] Also, FIG. 2 shows that [CLS] tokens corresponding to the new cross-attention features 246, 247 can be generated by the dual-encoder transformer module 220. In FIG. 2, the cross-attention visual features 246 have a corresponding cross-attention visual [CLS] token 248 and the cross-attention textual features 247 have a corresponding cross-attention textual [CLS] token 249. The cross-attention dual-encoder 200 is configured to utilize these cross-attention [CLS] tokens from the new cross-attention features 246, 247 for further training and testing of the ML/AI model. Accordingly, the cross-attention visual [CLS] token 248 and the cross-attention textual [CLS] token 249 are depicted as the output from the dual-encoder transformer module 220 that are fed to the loss functions module 230.
[0037] As alluded to above, ML/AI models that are built and/or trained by the disclosed cross-attention dual-encoder 200 can be query-agnostic during the inference process. Therefore, in order to achieve these query-agnostic aspects, in addition to leveraging the cross-attention techniques, the cross-attention model (associated with the cross-attention model branch 250) is also trained simultaneously by the cross-attention dual-encoder 200 without fusing or interchanging the [CLS] tokens. For example, the computations associated with eq. 4 and eq. 5 are performed without mutual information. This training stage can be represented mathematically as:
[Equations for the no-fusing training stage, rendered as an image in the original document: Eq. (4) and Eq. (5) computed without exchanging the [CLS] tokens]
[0038] In training without fusing, the aforementioned computations in Eq. 4 to Eq. 9 can be performed on the separate modalities (e.g., textual, visual) respectively, without containing any mutual or fused information (e.g., no fusing of the visual [CLS] token to textual features, and no fusing of the textual [CLS] token to visual features). The "no fusing" training along the textual branch 221 and the visual branch 222 of the dual-encoder transformer module 220 can be regarded as an independent dual-encoder model for the inference procedure. This process is equivalent to sharing the weights of the cross-attention model with the dual-encoder training strategy. Consequently, the query-agnostic branches associated with dual-encoder training, namely the textual branch 221 and the visual branch 222, will be guided by the information of the query-dependent branch, namely the cross-attention model branch 250. This strategy of simultaneously utilizing the cross-attention model branch 250 (e.g., query-dependent) and the separate dual-encoder model branches 221, 222 (e.g., query-agnostic) can realize several advantages, such as improved overall performance and improved speed of video-text retrieval tasks.
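For comparison, the query-agnostic dual-encoder pass described in this paragraph can be sketched as below; enc_v and enc_t are assumed to be the same encoder modules used in the cross-attention pass, so that the two training strategies share weights (the exact weight-sharing scheme is an assumption for illustration).

def dual_encoder_pass(visual, textual, enc_v, enc_t):
    # visual: (B, 50, 512), textual: (B, 77, 512); no [CLS] projection or fusion, so no mutual information is exchanged.
    z_v = enc_v(visual)     # visual branch 222 (query-agnostic)
    z_t = enc_t(textual)    # textual branch 221 (query-agnostic)
    return z_v[:, 0], z_t[:, 0]   # visual [CLS] token 234 and textual [CLS] token 233, sent to the Sym-CE loss 231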
[0039] FIG. 2 also illustrates that the cross-attention dual-encoder 200 architecture includes loss functions module 230. In the example, the loss functions module 230 comprises two forms of loss functions to train and optimize the ML/AI model(s), which are the Sym-CE loss function 231 and the CE loss function 232. Generally, the Sym-CE loss function 231 can be used for contrastive learning-based methods. In the cross-attention dual-encoder 200 architecture, the output from the dual-encoder transformer module 220 that is associated with dual-encoder model pipelines, namely output from the textual branch 221 and the visual branch 222, are fed to the Sym-CE loss function 231. In detail, the visual [CLS] token 234 that is output from the visual branch 222 of the dual-encoder transformer module 220 and the textual [CLS] token 233 that is output from the textual branch 221 of the dual-encoder transformer module 220 are sent to the Sym-CE loss function 231. The (video, text) pair similarity matrix Sym-CE loss, which is implemented by the Sym-CE loss function 231 for example, can be represented mathematically as:
[Equation (12), rendered as an image in the original document: the symmetric cross-entropy loss over the (video, text) pair similarity matrix]
where M is the batch size, t is the temperature hyper-parameter, and S is the similarity matrix of (video, text) pairs.
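Because Eq. (12) is rendered as an image in the source, the sketch below uses a standard CLIP-style symmetric cross-entropy over the similarity matrix, which is consistent with the definitions of the batch size M, the temperature, and the similarity matrix S given above; the exact form should be treated as an assumption.

import torch
import torch.nn.functional as F

def sym_ce_loss(v_cls, t_cls, temperature=0.07):
    # v_cls, t_cls: (M, dim) [CLS] embeddings from the visual and textual branches of the dual-encoder pass.
    v = F.normalize(v_cls, dim=-1)
    t = F.normalize(t_cls, dim=-1)
    S = v @ t.t() / temperature                        # (M, M) similarity matrix of (video, text) pairs
    labels = torch.arange(S.size(0), device=S.device)  # matched pairs sit on the diagonal
    return 0.5 * (F.cross_entropy(S, labels) + F.cross_entropy(S.t(), labels))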
[0040] FIG. 2 also shows that the CE loss function 232 receives output from the dual-encoder transformer module 220 that is associated with the cross-attention model pipeline, namely the output from the cross-attention model branch 250. As seen, the cross-attention [CLS] tokens 248, 249 that are particularly output by the cross-attention model branch 250 of the dual-encoder transformer module 220 are fed to the CE loss function 232. The aim of a video-text retrieval task is to find the closest (video, text) pair, where the (video, text) pair can be classified into binary categories as "best pair (1)" or "not (0)". Consequently, a video-text retrieval task can be converted into a classification task. For purposes of training, pseudo binary labels (0, 1) for (video, text) pairs in each batch can be generated to represent "paired" or "not". The "not" paired (video, text) can be randomly selected from the dataset. Consequently, the CE loss, which is implemented by the CE loss function 232, can be represented mathematically as:
$\mathcal{L}_c = -\frac{1}{M}\sum_{i=1}^{M}\left[l_i \log(p(v,t)_i) + (1 - l_i)\log(1 - p(v,t)_i)\right]$ (13)
where $l_i \in \{0,1\}$ is the binary label for (video, text) pairs, and $p(v,t)$ is the probability that the (video, text) pair is the closest pair.
[0041] The total loss associated with the ML/AI model(s) that are built and/or trained by the cross-attention dual-encoder 200 architecture can be represented mathematically as:
$\mathcal{L} = \mathcal{L}_s + \mathcal{L}_c$
[0042] Due to the classification-like task with cross-entropy loss, the cross-attention visual [CLS] token 248 and the cross-attention textual [CLS] token 249 (e.g., $Z^v_{\mathrm{cls}}$, $Z^t_{\mathrm{cls}}$) that are output from the cross-attention model branch 250 of the dual-encoder transformer module 220 are concatenated together and fed into a Feed Forward Network (FFN) with a softmax activation function. The output probability of the classification-like task can be represented mathematically as:
$p(v,t) = \mathrm{softmax}(\mathrm{FFN}([Z^v_{\mathrm{cls}} \,\|\, Z^t_{\mathrm{cls}}]))$ (14)
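A sketch of the classification-like head of Eq. (14), together with the cross-entropy objective of Eq. (13) and the total loss, is shown below; the FFN depth, hidden size, and the use of PyTorch are assumptions for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PairClassifier(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        # FFN over the concatenated cross-attention [CLS] tokens 248 and 249 (2 * dim inputs -> 2 classes).
        self.ffn = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 2))

    def forward(self, z_v_cls, z_t_cls):
        logits = self.ffn(torch.cat([z_v_cls, z_t_cls], dim=-1))
        return F.softmax(logits, dim=-1)   # Eq. (14): p(v, t), the probability that the pair is the closest pair

def ce_loss(probs, labels):
    # probs: (M, 2) softmax output; labels: (M,) pseudo binary labels, 1 = "best pair", 0 = "not".
    return F.nll_loss(torch.log(probs.clamp_min(1e-8)), labels)   # cross-entropy over the labels of Eq. (13)

def total_loss(sym_ce, ce):
    return sym_ce + ce   # the combined objective L = L_s + L_c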
[0043] Thus, as part of optimizing the ML/AI model(s) of the cross-attention dual-encoder 200, the error for the current state of the model must be estimated repeatedly. The loss functions module 230 can be used to estimate the loss during training of the model(s) so that the weights can be updated to reduce the loss on the next evaluation. Consequently, ML/AI model(s) that are built, trained, and optimized using the cross-attention dual-encoder 200 learn (video, text) pair similarities and classifications in order to accurately predict the most relevant (video, text) pairs from a video-text semantic space in a manner that is optimal for video-text retrieval tasks.
[0044] A flowchart is shown in FIG. 3, illustrating an example of a process 300 that is performed for building and/or training an ML/AI model using cross-attention, according to an embodiment of the systems and methods described herein. As seen in FIG. 3, process 300 is illustrated as a series of executable operations in a machine-readable storage media 306 performed by a hardware processor 304. The computing component 302 can be a computer device used for telecommunication functions, such as voice, video, and text, and video-text retrieval tasks. For example, the computing component 302 may be the mobile computing device (e.g., smartphone) described above in reference to FIG. 1. Generally, process 300 implements building and/or training of an ML/AI model, such as a visual language learning model, which is guided by both video and text modalities simultaneously using cross-attention techniques, according to some embodiments. [0045] The process 300 can begin at operation 305, extracting visual features and textual features from input. Operation 305 can involve extracting visual features and textual features from input comprising a dataset of various videos (e.g., frames of video clips) and text data corresponding to the videos (e.g., captions). In some cases, there is text that corresponds to each frame of video. Restated, visual features and textual features can be extracted from (video, text) pairs that are received as input for building and/or training the ML/AI model. In an embodiment, the visual features and textual features are extracted using a CLIP encoder. Visual features and textual features can be extracted from each frame of video that is received as input. Operation 305 can include the steps and calculations performed by a feature extraction module (described in detail in reference to FIG. 2).
[0046] Next, the process 300 continues to operation 310 where [CLS] tokens and features are generated respectively for a textual branch and a visual branch. Operation 310 can involve a dual-encoder transformer module (described in detail in reference to FIG. 2) receiving visual features and textual features that were extracted in previous operation 305. As previously described, separate modalities of information can be analyzed independently (e.g., without "fused" information) within the dual-encoder transformer module. That is, each modality (e.g., visual, textual) can traverse a separate branch of inference training, respectively. Initially, at each branch, the visual features and textual features are split into a sequence of fixed-size non-overlapping patches. A [CLS] token for each respective modality is generated and added to the features to serve as representation of an entire image/text. For example, a visual [CLS] token is added to the sequence of visual features, and a textual [CLS] token is added to the sequence of textual features. Accordingly, textual features and the textual [CLS] token are associated with being analyzed along a textual branch of the dual-encoder transformer module and the visual features and visual [CLS] token are associated with being analyzed along a visual branch of the dual-encoder transformer module.
[0047] As alluded to above, there are two branches of inference training in the dual-encoder transformer module, namely the textual branch and the visual branch, that can involve training an ML/AI model using each modality respectively. That is, the visual features and visual [CLS] token are analyzed along the visual branch and the textual features and textual [CLS] token are analyzed along the textual branch without "fusing" or using any mutual information from both modalities. Nonetheless, a key aspect of the disclosed process 300 is an inference training for an ML/AI model that involves mutual information from both modalities, visual and textual, that is achieved by leveraging cross-attention. Thus, the dual-encoder transformer module can be considered to include a third branch, which involves inference training for a cross-attention model, referred to herein as the cross-attention model branch.
[0048] Cross-attention aspects of the process 300 involve using a non-patch token, namely the [CLS] token, as an agent to interchange information between branches by attention. Thus, at operation 315, the [CLS] tokens are projected and fused to an opposing branch. In detail, the visual [CLS] token from the visual branch is interchanged, or projected, onto the textual features from the textual branch, and the textual [CLS] token from the textual branch is interchanged, or projected, onto the visual features from the visual branch. In an embodiment, projecting includes using a linear projection function to map the [CLS] token to features from its opposing branch. The visual [CLS] token is then fused to the textual features and the textual [CLS] token is then fused to the visual features. The projection and [CLS] token fusion executed in operation 315 can involve computations shown in eq. 4 and eq. 5. As a result of the projection and [CLS] token fusion of operation 315, new visual features and new textual features, which include mutual information from both modalities, are formed. Operation 315 can include performing inference training of the ML/AI model using the cross-attention model branch algorithm by applying the interchanged mutual modalities formed by the fused [CLS] tokens.
[0049] Subsequently, the newly formed visual features and textual features that are output from previous operation 315 can be input to dual-encoder transformers in operation 320. As described in detail in reference to FIG. 2, the dual-encoder transformer module can include dual-encoder transformers, where one encoder transformer is associated with the textual branch and the second encoder transformer is associated with the visual branch. Accordingly, operation 320 can include feeding the newly formed visual features to the encoder transformer that is associated with the visual branch, and feeding the newly formed textual features to the encoder transformer that is associated with the textual branch. In an embodiment, the dual-encoder transformers are implemented as several layers of MHSA modules with LN and residual shortcut. Operation 320 can involve encoding that is executed on the newly formed features by the dual-encoder transformers, which results in a cross-encoded visual [CLS] token associated with visual features and a cross-encoded textual [CLS] token associated with text features. Operation 320 can involve computations shown in eq. 6 and eq. 7 in order to derive the cross-encoded [CLS] tokens.
[0050] At operation 325, the cross-encoded [CLS] tokens that are output from previous operation 320 are projected and fused to an opposing branch. For example, the cross-encoded visual [CLS] token is back-projected into the original space and concatenated with the original textual features to form new cross-attention textual features, and the cross-encoded textual [CLS] token is back-projected into the original space and concatenated with the original visual features to form new cross-attention visual features. Operation 325 can involve computations shown in eq. 8 and eq. 9 in order to derive the new cross-attention features. In an embodiment, operations 320-325 can be performed iteratively, continuing K times, where K is the depth of the transformer architecture, in order to obtain the new cross-attention features. Operation 325 can include performing inference training of the ML/AI model using the cross-attention model branch algorithm, applying the interchanged mutual modalities formed by the cross-encoded [CLS] tokens.
[0051] Next, at operation 330, [CLS] tokens associated with the new cross-attention features formed in previous operation 325 are output to a loss function. The [CLS] tokens associated with cross-attention features are referred to herein as cross-attention [CLS] tokens. Accordingly, a cross-attention visual [CLS] token associated with the visual cross-attention features and a cross-attention textual [CLS] token associated with the textual cross-attention features are fed to a CE loss function. The CE loss function further trains and/or optimizes the
ML/AI model from the cross-attention model branch with respect to error/loss estimation. In an embodiment, the CE loss function is implemented by the loss functions module described in detail in reference to FIG. 2. Operation 330 can involve the CE loss function performing computations shown in eq. 13 in order to train and/or optimize the ML/AI model with respect to error/loss estimation. Also, operation 330 can include performing inference training of the ML/AI model using the cross-attention model branch algorithm, applying the interchanged mutual modalities formed by the cross-attention [CLS] tokens.
[0052] Further, the process 300 continues to operation 335 where the model associated with the cross-attention branch is trained using the algorithms associated with the respective textual branch and visual branch. As alluded to above, the ML/AI model that is built and/or trained using the cross-attention model branch can have query-dependent characteristics. Therefore, in order to achieve query-agnostic aspects during the inference process, the ML/AI model associated with the cross-attention model branch is also trained simultaneously without fusing or interchanging the [CLS] tokens. That is, inference training for the ML/AI model is also conducted along the visual branch (e.g., visual modality without mutual information) and along the textual branch (e.g., textual modality without mutual information), which is also referred to herein as dual-encoder training strategy/modeling. Operation 335 can involve computations associated with eq. 4 and eq. 5 being performed without mutual information. Furthermore, operation 335 can involve outputting the visual [CLS] token from inference training along the visual branch and the textual [CLS] token from inference training along the textual branch to a Sym-CE loss function. Accordingly, operation 335 can also involve the Sym-CE loss function performing computations shown in eq. 12 in order to train and/or optimize the ML/AI model with respect to error/loss estimation.
[0053] Consequently, process 300 builds, trains, and optimizes ML/AI model(s) by leveraging cross-attention, in a manner wherein inference is guided by the interchanging and/or fusing of both video and text modalities. The ML/AI models generated from process 300 can learn (video, text) pair similarities and classifications in order to accurately predict the most relevant (video, text) pairs, which is leveraged for video-text retrieval tasks.
Therefore, process 300, implementing the cross-attention techniques disclosed herein, realizes several advantages, such as improved retrieval accuracy for video-text retrieval tasks, and improved speed and efficiency for video-text retrieval tasks.
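Tying the operations of process 300 together, one training step might look like the following sketch, which reuses the hypothetical helpers sketched earlier (extract_features, the cross-attention modules, sym_ce_loss, and ce_loss); the model wrapper and its interface are assumptions for illustration rather than part of the disclosure.

def train_step(model, frames, captions, pair_labels, optimizer):
    # Operations 305-330: extract features, fuse [CLS] tokens, cross-encode, and read out both sets of [CLS] tokens,
    # plus the pair probabilities from the classification-like head of Eq. (14).
    v_cls, t_cls, pair_probs = model(frames, captions)
    # Operations 330 and 335: cross-entropy on the cross-attention branch plus Sym-CE on the dual-encoder branch.
    loss = sym_ce_loss(v_cls, t_cls) + ce_loss(pair_probs, pair_labels)   # L = L_s + L_c
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()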
[0054] FIG. 4 depicts a block diagram of an example computer system 400 in which various features described herein may be implemented. For example, the computer system 400 can be a device (shown in FIG. 1) implementing the disclosed cross-attention dual-encoder and video-text retrieval system and methods. The computer system 400 includes a bus 402 or other communication mechanism for communicating information, and one or more hardware processors 404 coupled with bus 402 for processing information. Hardware processor(s) 404 may be, for example, one or more general purpose microprocessors.
[0055] The computer system 400 also includes a main memory 406, such as a random-access memory (RAM), cache and/or other dynamic storage devices, coupled to bus 402 for storing information and instructions to be executed by processor 404. Main memory 406 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 404. Such instructions, when stored in storage media accessible to processor 404, render computer system 400 into a special-purpose machine that is customized to perform the operations specified in the instructions.
[0056] The computer system 400 further includes a read only memory (ROM) 408 or other static storage device coupled to bus 402 for storing static information and instructions for processor 404. A storage device 410, such as a magnetic disk, optical disk, or USB thumb drive (Flash drive), etc., is provided and coupled to bus 402 for storing information and instructions.
[0057] The computer system 400 may be coupled via bus 402 to a display 412, such as a liquid crystal display (LCD) (or touch screen), for displaying information to a computer user. An input device 414, including alphanumeric and other keys, is coupled to bus 402 for communicating information and command selections to processor 404. Another type of user input device is cursor control 416, such as a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to processor 404 and for controlling cursor movement on display 412. In some embodiments, the same direction information and command selections as cursor control may be implemented via receiving touches on a touch screen without a cursor.
[0058] The computing system 400 may include a user interface module to implement a GUI that may be stored in a mass storage device as executable software codes that are executed by the computing device(s). This and other modules may include, by way of example, components, such as software components, object-oriented software components, class components and task components, processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables.
[0059] In general, the word "component," "engine," "system," "database," "data store," and the like, as used herein, can refer to logic embodied in hardware or firmware, or to a collection of software instructions, possibly having entry and exit points, written in a programming language, such as, for example, Java, C or C++. A software component may be compiled and linked into an executable program, installed in a dynamic link library, or may be written in an interpreted programming language such as, for example, BASIC, Perl, or Python. It will be appreciated that software components may be callable from other components or from themselves, and/or may be invoked in response to detected events or interrupts. Software components configured for execution on computing devices may be provided on a computer readable medium, such as a compact disc, digital video disc, flash drive, magnetic disc, or any other tangible medium, or as a digital download (and may be originally stored in a compressed or installable format that requires installation, decompression or decryption prior to execution). Such software code may be stored, partially or fully, on a memory device of the executing computing device, for execution by the computing device. Software instructions may be embedded in firmware, such as an EPROM. It will be further appreciated that hardware components may be comprised of connected logic units, such as gates and flip-flops, and/or may be comprised of programmable units, such as programmable gate arrays or processors.

[0060] The computer system 400 may implement the techniques described herein using customized hard-wired logic, one or more ASICs or FPGAs, firmware and/or program logic which in combination with the computer system causes or programs computer system 400 to be a special-purpose machine. According to one embodiment, the techniques herein are performed by computer system 400 in response to processor(s) 404 executing one or more sequences of one or more instructions contained in main memory 406. Such instructions may be read into main memory 406 from another storage medium, such as storage device 410. Execution of the sequences of instructions contained in main memory 406 causes processor(s) 404 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions.
[0061] The term "non-transitory media," and similar terms, as used herein refers to any media that store data and/or instructions that cause a machine to operate in a specific fashion. Such non-transitory media may comprise non-volatile media and/or volatile media. Non-volatile media includes, for example, optical or magnetic disks, such as storage device 410. Volatile media includes dynamic memory, such as main memory 406. Common forms of non-transitory media include, for example, a floppy disk, a flexible disk, hard disk, solid state drive, magnetic tape, or any other magnetic data storage medium, a CD-ROM, any other optical data storage medium, any physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, NVRAM, any other memory chip or cartridge, and networked versions of the same.
[0062] Non-transitory media is distinct from but may be used in conjunction with transmission media. Transmission media participates in transferring information between non-transitory media. For example, transmission media includes coaxial cables, copper wire and fiber optics, including the wires that comprise bus 402. Transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infra-red data communications.

[0063] The computer system 400 also includes a communication interface 418 coupled to bus 402. Communication interface 418 provides a two-way data communication coupling to one or more network links that are connected to one or more local networks. For example, communication interface 418 may be an integrated services digital network (ISDN) card, cable modem, satellite modem, or a modem to provide a data communication connection to a corresponding type of telephone line. As another example, communication interface 418 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN (or a WAN component to communicate with a WAN). Wireless links may also be implemented. In any such implementation, communication interface 418 sends and receives electrical, electromagnetic or optical signals that carry digital data streams representing various types of information.
[0064] A network link typically provides data communication through one or more networks to other data devices. For example, a network link may provide a connection through local network to a host computer or to data equipment operated by an Internet Service Provider (ISP). The ISP in turn provides data communication services through the world wide packet data communication network now commonly referred to as the "Internet." Local network and Internet both use electrical, electromagnetic or optical signals that carry digital data streams. The signals through the various networks and the signals on network link and through communication interface 418, which carry the digital data to and from computer system 400, are example forms of transmission media.
[0065] The computer system 400 can send messages and receive data, including program code, through the network(s), network link and communication interface 418. In the Internet example, a server might transmit a requested code for an application program through the Internet, the ISP, the local network and the communication interface 418.
[0066] The received code may be executed by processor 404 as it is received, and/or stored in storage device 410, or other non-volatile storage for later execution.
[0067] Each of the processes, methods, and algorithms described in the preceding sections may be embodied in, and fully or partially automated by, code components executed by one or more computer systems or computer processors comprising computer hardware. The one or more computer systems or computer processors may also operate to support performance of the relevant operations in a "cloud computing" environment or as a "software as a service" (SaaS). The processes and algorithms may be implemented partially or wholly in application-specific circuitry. The various features and processes described above may be used independently of one another or may be combined in various ways. Different combinations and sub-combinations are intended to fall within the scope of this disclosure, and certain method or process blocks may be omitted in some implementations. The methods and processes described herein are also not limited to any particular sequence, and the blocks or states relating thereto can be performed in other sequences that are appropriate, or may be performed in parallel, or in some other manner. Blocks or states may be added to or removed from the disclosed example embodiments. The performance of certain of the operations or processes may be distributed among computer systems or computer processors, not only residing within a single machine, but deployed across a number of machines.
[0068] As used herein, a circuit might be implemented utilizing any form of hardware, software, or a combination thereof. For example, one or more processors, controllers, ASICs, PLAs, PALs, CPLDs, FPGAs, logical components, software routines or other mechanisms might be implemented to make up a circuit. In implementation, the various circuits described herein might be implemented as discrete circuits or the functions and features described can be shared in part or in total among one or more circuits. Even though various features or elements of functionality may be individually described or claimed as separate circuits, these features and functionality can be shared among one or more common circuits, and such description shall not require or imply that separate circuits are required to implement such features or functionality. Where a circuit is implemented in whole or in part using software, such software can be implemented to operate with a computing or processing system capable of carrying out the functionality described with respect thereto, such as computer system 400.

[0069] As used herein, the term "or" may be construed in either an inclusive or exclusive sense. Moreover, the description of resources, operations, or structures in the singular shall not be read to exclude the plural. Conditional language, such as, among others, "can," "could," "might," or "may," unless specifically stated otherwise, or otherwise understood within the context as used, is generally intended to convey that certain embodiments include, while other embodiments do not include, certain features, elements and/or steps.
[0070] Terms and phrases used in this document, and variations thereof, unless otherwise expressly stated, should be construed as open ended as opposed to limiting. Adjectives such as "conventional," "traditional," "normal," "standard," "known," and terms of similar meaning should not be construed as limiting the item described to a given time period or to an item available as of a given time, but instead should be read to encompass conventional, traditional, normal, or standard technologies that may be available or known now or at any time in the future. The presence of broadening words and phrases such as "one or more," "at least," "but not limited to" or other like phrases in some instances shall not be read to mean that the narrower case is intended or required in instances where such broadening phrases may be absent.

Claims

What is claimed is:
1. A computer-implemented method comprising: extracting visual features and textual features from an input; generating a visual token for a visual branch of inference training for a model associated with the visual features and a textual token for a textual branch of inference training for the model associated with the textual features; projecting the visual token onto the textual features associated with the textual branch and the textual token onto the visual features associated with the visual branch, wherein the projecting generates newly formed visual features and newly formed textual features; and performing inference training for the model associated with a cross-attention branch, wherein the inference training associated with the cross-attention branch is performed by applying mutual modalities of the projected tokens from the visual branch and the textual branch.
2. The computer-implemented method of claim 1, further comprising: inputting the newly formed visual features and the newly formed textual features into dual-encoder transformers, wherein the dual-encoder transformers generate a cross-encoded visual token associated with the newly formed visual features and a cross-encoded textual token associated with the newly formed textual features; projecting the cross-encoded visual token onto the textual features associated with the textual branch and the cross-encoded textual token onto the visual features associated with the visual branch, wherein the projecting generates cross-attention visual features and cross-attention textual features; and performing inference training for the model associated with the cross-attention branch, wherein the inference training associated with the cross-attention branch is performed by applying mutual modalities of the cross-encoded tokens.
3. The computer-implemented method of claim 2, further comprising: generating a cross-attention visual token associated with the cross-attention visual features and a cross-attention textual token associated with the cross-attention textual features; and performing inference training for the model associated with the cross-attention branch, wherein the inference training associated with the cross-attention branch is performed by applying mutual modalities of the cross-attention tokens.
4. The computer-implemented method of claim 3, further comprising: outputting the cross-attention visual token and the cross-attention textual token to a loss function; and optimizing and training the model based on the loss function.
5. The computer-implemented method of claim 4, further comprising: performing inference training for the model associated with the visual branch and the textual branch without applying mutual modalities.
6. The computer-implemented method of claim 1, wherein the input comprises a dataset of video and text corresponding to the video.
7. The computer-implemented method of claim 1, wherein extracting the visual features and textual features comprises applying a Contrastive Language-Image Pre-Training (CLIP) encoder.
8. The computer-implemented method of claim 1, wherein projecting the visual token onto the textual features associated with the textual branch and the textual token onto the visual features associated with the visual branch comprises employing a linear projection function.
9. The computer-implemented method of claim 1, further comprising fusing the visual token onto the textual features associated with the textual branch and the textual token onto the visual features associated with the visual branch by employing a linear projection function.
10. The computer-implemented method of claim 2, wherein the dual-encoder transformers comprise Multi-Head Self-Attention (MHSA) modules.
11. The computer-implemented method of claim 2, further comprising: a first dual-encoder transformer associated with the visual branch receiving the newly formed visual features; and a second dual-encoder transformer associated with the textual branch receiving the newly formed textual features.
12. The computer-implemented method of claim 2, wherein projecting the cross-encoded tokens comprises back-projecting the cross-encoded visual token into an original space of the textual branch and concatenating with the original textual features to create the cross-attention textual features.
13. The computer-implemented method of claim 2, wherein projecting the cross-encoded tokens comprises back-projecting the cross-encoded textual token into an original space of the visual branch and concatenating with the original visual features to create the cross-attention visual features.
14. The computer-implemented method of claim 1, wherein the model comprises a machine learning/artificial intelligence (ML/AI) model, and further wherein the visual token and the textual token comprise [CLS] tokens.
15. A computer system, comprising: one or more processors; and a memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform: extracting visual features and textual features from an input; generating a visual token for a visual branch of inference training for a model associated with the visual features and a textual token for a textual branch of inference training for the model associated with the textual features; projecting the visual token onto the textual features associated with the textual branch and the textual token onto the visual features associated with the visual branch, wherein the projecting generates newly formed visual features and newly formed textual features; and performing inference training for the model associated with a cross-attention branch, wherein the inference training associated with the cross-attention branch is performed by applying mutual modalities from the projected tokens from the visual branch and the textual branch.
16. The computer system of claim 15, wherein the memory has further instructions stored thereon, which when executed by the one or more processors cause the processors to further perform: inputting the newly formed visual features and the newly formed textual features into dual-encoder transformers, wherein the dual-encoder transformers generate a cross-encoded visual token associated with the newly formed visual features and a cross-encoded textual token associated with the newly formed textual features; projecting the cross-encoded visual token onto the textual features associated with the textual branch and the cross-encoded textual token onto the visual features associated with the visual branch, wherein the projecting generates cross-attention visual features and cross-attention textual features; and performing inference training for the model associated with the cross-attention branch, wherein the inference training associated with the cross-attention branch is performed by applying mutual modalities from the cross-encoded tokens.
17. The computer system of claim 16, wherein the memory has further instructions stored thereon, which when executed by the one or more processors cause the processors to further perform: generating a cross-attention visual token associated with the cross-attention visual features and a cross-attention textual token associated with the cross-attention textual features; and performing inference training for the model associated with the cross-attention branch, wherein the inference training associated with the cross-attention branch is performed by applying mutual modalities from the cross-attention tokens.
18. The computer system of claim 16, wherein the memory has further instructions stored thereon, which when executed by the one or more processors cause the processors to further perform: outputting the cross-attention visual token and the cross-attention textual token to a loss function; optimizing and training the model based on the loss function; and performing inference training for the model associated with the visual branch and the textual branch without applying mutual modalities.
19. A mobile computing device comprising: a cross-attention dual-encoder training a machine learning/artificial intelligence (ML/AI) model using cross-attention to learn (video, text) pair similarities and classifications and predict the most relevant (video, text) pairs; and a video-text retriever performing video-text retrieval tasks to select one or more most relevant videos from a plurality of videos based on a received text query, wherein the video-text retrieval tasks are guided by the ML/AI model trained by the cross-attention dual-encoder.
20. The mobile computing device of claim 19, wherein cross-attention comprises using [CLS] tokens to interchange and fuse a visual modality and a textual modality together during training of the ML/AI model.
PCT/US2022/039442 2021-08-04 2022-08-04 Cross-attention system and method for fast video-text retrieval task with image clip WO2022261570A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202163229368P 2021-08-04 2021-08-04
US63/229,368 2021-08-04

Publications (1)

Publication Number Publication Date
WO2022261570A1 true WO2022261570A1 (en) 2022-12-15

Family

ID=84426313

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2022/039442 WO2022261570A1 (en) 2021-08-04 2022-08-04 Cross-attention system and method for fast video-text retrieval task with image clip

Country Status (1)

Country Link
WO (1) WO2022261570A1 (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210027022A1 (en) * 2019-07-22 2021-01-28 Capital One Services, Llc Multi-turn Dialogue Response Generation with Autoregressive Transformer Models
US20210081728A1 (en) * 2019-09-12 2021-03-18 Nec Laboratories America, Inc Contextual grounding of natural language phrases in images

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LUO HUAISHAO, JI LEI, ZHONG MING, CHEN YANG, LEI WEN, DUAN NAN, LI TIANRUI: "CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval", vol. 508, 8 May 2021 (2021-05-08), pages 293 - 304, XP081955234, Retrieved from the Internet <URL:https://arxiv.org/abs/2104.08860> [retrieved on 20221005] *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230154188A1 (en) * 2021-11-16 2023-05-18 Salesforce.Com, Inc. Systems and methods for video and language pre-training
US20230154146A1 (en) * 2021-11-16 2023-05-18 Salesforce.Com, Inc. Systems and methods for video and language pre-training
US11989941B2 (en) * 2021-11-16 2024-05-21 Salesforce, Inc. Systems and methods for video and language pre-training
US20230376687A1 (en) * 2022-05-17 2023-11-23 Adobe Inc. Multimodal extraction across multiple granularities
CN116363817A (en) * 2023-02-02 2023-06-30 淮阴工学院 Chemical plant dangerous area invasion early warning method and system
CN116363817B (en) * 2023-02-02 2024-01-02 淮阴工学院 Chemical plant dangerous area invasion early warning method and system
CN116383671A (en) * 2023-03-27 2023-07-04 武汉大学 Text image cross-mode pedestrian retrieval method and system with implicit relation reasoning alignment
CN116383671B (en) * 2023-03-27 2024-05-28 武汉大学 Text image cross-mode pedestrian retrieval method and system with implicit relation reasoning alignment
CN116452707A (en) * 2023-06-20 2023-07-18 城云科技(中国)有限公司 Text generation method and device based on table and application of text generation method and device
CN116452707B (en) * 2023-06-20 2023-09-12 城云科技(中国)有限公司 Text generation method and device based on table and application of text generation method and device
CN116680420B (en) * 2023-08-02 2023-10-13 昆明理工大学 Low-resource cross-language text retrieval method and device based on knowledge representation enhancement
CN116680420A (en) * 2023-08-02 2023-09-01 昆明理工大学 Low-resource cross-language text retrieval method and device based on knowledge representation enhancement
CN117835012A (en) * 2023-12-27 2024-04-05 北京智象未来科技有限公司 Controllable video generation method, device, equipment and storage medium
CN117612072A (en) * 2024-01-23 2024-02-27 中国科学技术大学 Video understanding method based on dynamic space-time diagram
CN117612072B (en) * 2024-01-23 2024-04-19 中国科学技术大学 Video understanding method based on dynamic space-time diagram
CN117746441A (en) * 2024-02-20 2024-03-22 浪潮电子信息产业股份有限公司 Visual language understanding method, device, equipment and readable storage medium
CN117746441B (en) * 2024-02-20 2024-05-10 浪潮电子信息产业股份有限公司 Visual language understanding method, device, equipment and readable storage medium
CN117876941A (en) * 2024-03-08 2024-04-12 杭州阿里云飞天信息技术有限公司 Target multi-mode model system, construction method, video processing model training method and video processing method
CN118537433A (en) * 2024-07-24 2024-08-23 江西啄木蜂科技有限公司 Natural protections and forestry remote sensing image generation method based on multi-mode large model

Similar Documents

Publication Publication Date Title
WO2022261570A1 (en) Cross-attention system and method for fast video-text retrieval task with image clip
Shi et al. Neural abstractive text summarization with sequence-to-sequence models
WO2023004206A1 (en) Unsupervised hashing method for cross-modal video-text retrieval with clip
US20210390700A1 (en) Referring image segmentation
US11113599B2 (en) Image captioning utilizing semantic text modeling and adversarial learning
CN106202010B (en) Method and apparatus based on deep neural network building Law Text syntax tree
CN111738003B (en) Named entity recognition model training method, named entity recognition method and medium
CN112800776B (en) Bidirectional GRU relation extraction data processing method, system, terminal and medium
CN111967242B (en) Text information extraction method, device and equipment
CN114896388A (en) Hierarchical multi-label text classification method based on mixed attention
US20230281400A1 (en) Systems and Methods for Pretraining Image Processing Models
CN111881292B (en) Text classification method and device
CN117529755A (en) Transfer learning in image recognition systems
CN113128431B (en) Video clip retrieval method, device, medium and electronic equipment
CN111914062A (en) Long text question-answer pair generation system based on keywords
CN115329766B (en) Named entity identification method based on dynamic word information fusion
CN116661852B (en) Code searching method based on program dependency graph
CN112732862B (en) Neural network-based bidirectional multi-section reading zero sample entity linking method and device
CN115543437B (en) Code annotation generation method and system
CN116049459A (en) Cross-modal mutual retrieval method, device, server and storage medium
CN111145914A (en) Method and device for determining lung cancer clinical disease library text entity
CN115512195A (en) Image description method based on multi-interaction information fusion
CN112836502A (en) Implicit causal relationship extraction method for events in financial field
CN116521857A (en) Method and device for abstracting multi-text answer abstract of question driven abstraction based on graphic enhancement
CN117493608B (en) Text video retrieval method, system and computer storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22821213

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22821213

Country of ref document: EP

Kind code of ref document: A1