WO2023004206A1

WO2023004206A1 - Unsupervised hashing method for cross-modal video-text retrieval with clip

Info

Publication number: WO2023004206A1
Application number: PCT/US2022/039445
Authority: WO
Inventors: Yaoxin ZHUO; Yikang Li; Jenhao Hsiao; Chiuman HO
Original assignee: Innopeak Technology, Inc.
Priority date: 2021-08-04
Filing date: 2022-08-04
Publication date: 2023-01-26

Abstract

Systems and methods are provided for improving video-text retrieval tasks by employing a cross-modal hashing module. The cross-modal hashing module implements an unsupervised cross-modal hash learning process to train a machine learning/artificial intelligence model (ML/AL) model to learn semantic relevant binary codes that is improved by an affinity matrix and a Hamming space. The affinity matrix refines cross-modal relationships between video and text features and guides the unsupervised hash learning process. Further, the cross-modal hashing module leverages a Contrastive Language-Image Pre-Training (CLIP) model that generated a well-defined cross-modal semantic space. The cross-modal hashing module improves accuracy and speed of video-text retrieval tasks. For example, a mobile computing device includes a cross-modal hashing module training the ML/AI model using unsupervised cross-modal hash learning. The mobile computing device can also include a video-text retriever performing video-text retrieval tasks that are enhanced by the ML/AI model.

Description

UNSUPERVISED HASHING METHOD FOR CROSS-MODAL VIDEO-TEXT

RETRIEVAL WITH CLIP

Reference to Related Application

[0001] The present application claims priority to U.S. Patent Application No. 63/229,429, filed August 4, 2021 and titled "CLIP4HASHING: AN UNSUPERVISED HASHING METHOD FOR CROSS-MODAL VIDEO-TEXT RETRIEVAL," which is incorporated herein by reference in its entirety.

Description of Related Art

[0002] The present application generally relates to cross-modal retrieval, particularly to methods and a system for improving speed and relevancy of videos retrieved given textual queries.

Brief Description of the Drawings

[0003] The present disclosure, in accordance with one or more various embodiments, is described in detail with reference to the following figures. The figures are provided for purposes of illustration only and merely depict typical or example embodiments.

[0004] FIG. 1 depicts an example computing system, such as a mobile computing device, implementing a cross-modal hashing system for improved video-text retrieval, in accordance with embodiments of the application.

[0005] FIG. 2 depicts an example architecture for the cross-modal hashing system shown in FIG. 1 for improved video-text retrieval, in accordance with embodiments of the application.

[0006] FIG. 3 is an operational flow diagram illustrating an example method for implementing the building, training, and optimization of a machine learning/artificial intelligence (ML/AI) model using the disclosed unsupervised cross-modal hash learning techniques, in accordance with embodiments of the application.

[0007] FIG. 4 is a block diagram of an example computing component or device for implementing the disclosed techniques, in accordance with embodiments of the application.

[0008] These illustrative embodiments are mentioned not to limit or define the disclosure, but to provide examples to aid understanding thereof. Additional embodiments are discussed in the Detailed Description, and further description is provided there.

Detailed Description

[0009] Video-text retrieval is an essential task in cross-modal information retrieval, i.e., retrieving relevant videos from a large and unlabeled dataset given textual queries. With the enormously increased amount of multimedia data that is permeating the use of communication technology and mobile devices, it is important to boost the speed of video text retrieval. For example, recently, there has been an explosively increasing amount of multimedia data being shared in social media (e.g., Twitter, Instagram) and video-sharing apps (e.g., Tik-Tok and Kuaishou), where vast amounts of images, video, and text are disseminated and shared on mobile devices, such as smartphones. With multimedia data being so ubiquitous in mobile computing, it is driving a high-demand for applications that interact with multimedia data (e.g., searching, retrieving, classifying) to have fast and accurate multimedia data retrieval (e.g., video-text retrieval capabilities). There are some existing retrieval functions that utilize hashing algorithms. However, these conventional hashing related retrieval tasks typically focus on a single modality. For instance, a retrieval task driven by a conventional hashing algorithm involves the query and the target data being restricted to the same modality, such as an image only query/retrieval or a video only query/retrieval. Recent research has begun to trend towards cross-modal hashing for retrieval, which would enable the integration of different modalities in a retrieval task (e.g., cross-modal retrieval), which is more optimal for multimedia data and practical for real-world applications, like social media. Nonetheless, there are difficulties and challenges related to implementing such cross-modal hashing. Cross-modal retrieval is inherently challenging since it is difficult to model semantic relationships between different modalities.

[0010] A novel unsupervised cross-modal hashing learning method and system based on CLIP, also referred to herein as cross-modal hashing, which can be utilized to improve the accuracy and speed of cross-modal retrieval tasks is disclosed. The deep cross-modal hashing algorithm is able to encode data from different modalities into one common Hamming space, which can be ultimately leveraged for fast cross-modal retrieval, such as video-text retrieval tasks. The disclosed cross-modal hashing method and system are capable of connecting the semantic visual information with textual information in a Hamming space by using the integration of the neighborhood correlation from a well-defined semantic space. Thus, the disclosed cross-modal hashing method and system learn a better Hamming space that is guided by a new proposed affinity matrix S, in which the well-defined semantic space is involved. In addition, the cross-modal hashing techniques do not require any hyper parameters for different datasets. As will be described in greater detail herein, the cross- modal hashing method and system realize several advantages, including: leveraging a cross- modal affinity matrix S, which is derived from cross-modal semantic relationships, in order to improve hash learning; utilizing a dynamic approach to diffuse the cross-modal affinity matrix S such that hyper-parameters are not required; and improving the performance of video-text retrieval tasks.

[0011] Referring now to FIG. 1, an example of a mobile computing device 110 employing the disclosed cross-modal hashing method and system for improved video-text retrieval is depicted. The mobile computing device 110 can be a user equipment (UE) device, being implemented as any type of wireless user device that is used directly by an end-user to communicate and having video-text retrieval capabilities, which are performed by the video text retriever 111 implemented on the mobile computing device 110. In the example of FIG. 1, the mobile computing device 110 is shown as a handheld mobile phone, and more specifically a smartphone. However, the mobile computing device 110 may be implemented as various other wireless user devices that are used directly by an end-user to communicate and equipped with telecommunication functions, such as voice, video, and text. For example, the mobile computing device 110 may also be implemented as a cellular telephone, a laptop computer equipped with a mobile broadband adapter, or other computing device. Accordingly, as a smartphone, the mobile computing device 110 is capable of supporting enhanced data services, voice, video, and other telecommunication functions that are commonly employed by subscribers to broadband cellular networks.

[0012] Furthermore, the mobile computing device 110 is depicted to include a video text retriever 111 and a cross-modal hashing module 112 that implement the disclosed techniques for supporting cross-model retrieval tasks. Although not shown in FIG. 1, the mobile computing device 110 can include other applications, computing sub-systems, and hardware. For example, the mobile computing device 110 can include an operating system that provides an interface between the mobile computing device's 110 hardware (e.g., the input/output mechanisms and a processor executing instructions retrieved from computer- readable medium) and software. Example operating systems include ANDROID, CHROME, IOS, MAC OS X, WINDOWS 7, WINDOWS PHONE 7, SYMBIAN, BLACKBERRY, WEBOS-a variety of UNIX operating systems, or a proprietary operating system for computerized devices. The operating system may provide a platform for the execution of application programs that facilitate interaction between the computing device and a user. The video-text retriever 111 and cross-modal hashing module 112, as disclosed herein, can be implemented on the mobile computing device 110 as hardware, a stand-alone processor, firmware, a software application, or any combination thereof.

[0013] In an example, the video-text retriever 111 and the cross-modal hashing module 112 operate in concert to implement various cross-modal retrieval features, such as video-text retrieval features on the mobile computing device 110. As a general description, the cross-modal hashing module 112 is configured to perform a distinct hash-based training of a machine learning/artificial intelligence (ML/AI) model, where the model is trained to learn semantic relevant binary codes and ultimately recognize semantic correlations between video and text data. The video-text retriever 111 is configured to execute various video-text retrieval tasks on the mobile computing device 111. For example, a video-text retrieval task performed by the video-text retriever 111 can include automatically retrieving a list of videos (from a corpus/library including a vast amounts of videos) that are deemed most relevant given a text query. There is a critical interoperability between the functions of the cross-modal hashing module 112 and the video-text retriever 111. In detail, the ML/AI model that is generated and/or trained by the cross-modal hashing module 112 to accurately determine relevancy between text and video can be employed by the video-text retriever 111 to drive the relevancy-based search for video from a large corpus of text and videos. In other words, as specific text is entered into the video-text retriever 111 for a query, the model trained by the cross-modal hashing module 112 to automatically recognize relevancy links between text and video can be used by the video-text retriever 111 in order to automatically select videos that are relevant to the query-specific text.

[0014] According to the embodiments, the cross-modal hashing module 112 is configured to perform a distinct unsupervised hash learning process for training for a ML/AI model where the learning the semantic relevant binary codes is guided by a cross-modal affinity matrix (e.g., defining cross-modal semantic relationships) and a Hamming space for improved speed and accuracy. The structure and function of the cross-modal hashing module 112, particularly the unsupervised hash learning process, are described in greater detail in reference to FIG. 2. As referred to herein, Al can be described as an automated computer process that can intelligently leverage data analysis for training itself for further optimizing the processes. ML can be generally considered an application of Al. Al techniques can include various approaches that are used in the area to achieve automated data analysis, such as neural networks, automated reasoning analysis (e.g., satisfiability modulo theories), and so on. Al-based techniques can be used to enhance computer-controlled features of a mobile computing device 110 in a manner that improves the over user experience and optimizes performance of applications and/or operation environment. In the example of FIG. 1, AI/ML techniques are specifically used to drive visual language-learning modeling, hash modeling, and video-text retrieval tasks, as disclosed. As described in detail, the ML/AI model(s) are built and/or trained by an unsupervised hashing learning process, and thus the resulting ML/AI can be considered as an unsupervised hash model. As referred to herein, an unsupervised hash model can be described as models that account for the distribution of the data in an unsupervised manner without the need manually acquired labels. Unsupervised hash models typically achieve this by using techniques that factorize the data covariance matrix or cluster related data-points into groups. These models generally exhibit a good retrieval effectiveness lying somewhere between the data independent and supervised models.

[0015] As an example of operation, the mobile computing device 110 can have a library of videos stored in its memory that were captured by the device's 110 user/owner using its video recording functions (e.g., built-in video camera). In the illustrated example of FIG. 1, frames from a clip of video 113 are depicted as being stored on, or otherwise accessible by, the mobile computing device 110. The four frames of video 113 illustrated in the example are related to sport of basketball, including images that show basketball players, a court, a goal, and the like. Accordingly, the text 114 includes words, keywords, phrases, etc. that are generally related to basketball and describe the imagery that is portrayed in the frames of video 113. For instance, the text 114 includes captions, or descriptors, that correspond to the contents of the frames of video 113. As seen in FIG. 1, the text 114 that accompanies the frames of video 113 are phrases, or captions, including: "a player is putting the basketball into the post from a distance"; "the player makes a three pointer"; and "people are playing basketball." At a later time, the user/owner of the mobile computing device 110 may desire to search through and retrieve one or more of the stored videos, including video 113, using a searching function of the computer device 110, shown as video-text retriever 111. In other operational example, the video 113 is not necessarily stored on the mobile computing device 110 itself but are stored on distributed and large-scale remote databases/repositories of information that are accessible to the mobile computing device 110 via a communication networks, such as the Internet. In this example, the video-text retriever 111 functions similarly to a search engine, where text entered into a graphical user interface (GUI) of the video-text retriever 111 drives searches for content that is available on the Internet and sites on the World Wide Web.

[0016] As alluded to above, the video-text retriever 111 is configured to utilize text as input which serves as the basis for a query to retrieve a selected one or more relevant videos from a larger group consisting of vast amounts of videos, including videos 113. Generally, videos can correspond to descriptive text, such as keywords, phrases, descriptors, and the like, which describe the contents and/or context of the video. Thus, for example, the user/owner of the mobile computing device 110 can enter text (e.g., keywords, phrases, search string, etc.) into a GUI of the video-text retriever 111 in order to ultimately search through a plurality of videos, such as videos 113, in order to retrieve one or more videos that are deemed most relevant to the text input. Accordingly, the video-text retriever 111 can be described as a multi-modal application (e.g., combining visual and text information). Furthermore, as previously described, the video-text retriever 111 employs a ML/AL model, such as a visual-language learning model, to execute its video-text retrieving tasks. Because the video-text retriever 111 executes cross-modal tasks in the image-text domain, the ML/AI model leveraged by the video-text retriever 111 is trained using a cross-modal hash learning approach, that is implemented by the disclosed cross-modal hashing module 112. Thus, the video-text retriever 111 is capable of performing enhanced video-text retrieval tasks, by directly leveraging the cross-modal hash learning of the cross-modal hashing module 112.

[0017] As alluded to above, there are several challenges that are associated with conventional retrieval approaches that utilize conventional hash learning approaches, such as requiring label information, being limited to image-text retrieval (e.g., cannot be readily adopted into the video space), and requiring supervised learning for hash codes. Nonetheless, the video-text retriever 111 leverages the capabilities of the cross-modal hashing module 112, which performs an unsupervised cross-modal hash learning process that is optimal for video retrieval tasks, in order to address the aforementioned drawbacks. Consequently, the video text retriever 111 and cross-modal hashing module 112 work together to achieve better retrieval accuracy and efficiency. For instance, with conventional approaches, a weakly constructed inference between text and video can cause a query using the text "people shooting a three pointer" which is a slight variation to the text 114 may lead to a failed retrieval of a corresponding video. In contrast, the cross-modal hashing module 112 enables the ML/AI model to learn semantic relevant binary codes in a manner that is improved by the cross-modal relationships that are represented in an affinity matrix. Thus, the deep cross- modal unsupervised hashing algorithm that is executed by the cross-modal hashing module 112 is able to encode data from different modalities into one common Hamming space, and further capable of connecting the semantic visual information with textual information in the Hamming space (utilizing the integration of the neighborhood correlation from the well- defined semantic spacer) in a manner that realizes improved accuracy and speed for video text retrieval tasks. Referring back to the previous example, the cross-modal hashing module 112 builds strong correlations between video and text data (based on the hash learning of semantic relevant binary codes) that are learned by the ML/AL model, which allows a query using the text "people shooting a three pointer" in the video-text retriever 111 to successfully retrieve of a corresponding video, such as video 113.

[0018] Referring to the example of the FIG. 1, the cross-modal hashing module 112 can train a ML/AI model using a dataset of videos and text, such as large library of video clips and related captions. For example, the dataset can include the videos 113 and text 114 depicted in FIG. 1, where the text 114 is captions that describe the contents and/or context of the corresponding clips of video 113. In general, the ML/AI model training that is performed by the cross-modal hashing module 112 is a distinct approach that involves, but is not limited to: 1) extracting textual features and visual features from the video and text in the dataset; 2) constructing a cross-modal affinity matrix; 3) constructing a Hamming space; and 4) performing an unsupervised hash learning process that trains the model to learn binary codes that is guided by the cross-modal semantic relationships from Hamming space and the cross- modal affinity matrix. Details regarding the architecture and unsupervised cross-modal hash learning process executed by the cross-modal hashing module 112 are described in greater detail below in reference to FIG. 2. The resulting output from the cross-modal hashing modile 112 is a ML/AI model that is trained to learn semantic relevant binary codes and recognize cross-modal semantic relationships between video and text in a manner that is leveraged by the video-text retriever 111 for executing its video-text retrieving tasks. It should be appreciated that the functionality of the cross-modal hashing module 112 is not limited to video-text retrieval, as described in the example of FIG. 1, and can be applicable to other forms of cross-modal retrieval such as text-video retrieval (e.g., retrieving text based on a video query).

[0019] FIG. 2 depicts an example configuration of the cross-modal hashing module 200 that is described above in reference to FIG. 1 for enhancing cross-modal retrieval tasks (e.g., video-text retrieval). The cross-modal hashing module 200, as disclosed herein, executes an autonomous self-training process, namely an unsupervised hash learning, for ML/AI model(s), where the model(s) trains on a variety of text and video data in order to learn semantic relevant binary codes. Thus, a ML/AI model that is built and/or trained by the cross- modal hashing module 200 can recognize cross-modal semantic correlations between video and text in a manner that can be employed for video-text retrieval tasks, for example. Additionally, the disclosed cross-modal hashing module 200 is configured to leverage the many capabilities of a pre-trained CLIP model. For example, the cross-modal hashing module 200 utilizes a well-defined semantic space that is generated by CLIP to construct a Hamming space (e.g., hashing with a single hashing model) which is used in unsupervised hash learning for the ML/AI model. Further, the cross-modal hashing module 200 leverages the well- constructed cross-modal semantic space that is provided by CLIP, illustrated in FIG. 2 as CLIP space 221, in order to derive a cross-modal affinity matrix S 224 which improves the performance of the hash learning process. The cross-modal hashing module 200 is also configured to utilize other techniques, such as binary code learning and dynamic weighting, which optimizes the unsupervised hash learning of the ML/AI model particularly for cross- modal retrieval tasks. Accordingly, an ML/AL model trained using the cross-modal hashing module 200, as disclosed herein, can implement cross-modal retrieval functions that are more efficient and accurate than traditional hashing methods. [0020] In the example of FIG. 2, the framework of the cross-modal hashing module 200 comprises: 1) a feature extraction module 210 for extracting textual features and video features from input; 2) an affinity matrix module 220 for constructing a cross-modal affinity matrix S 224; 3) a binary code module 230 for producing a Hamming space 241 and binary codes utilizing a single hashing function; and 4) a hashing learning module 240 for implementing the unsupervised hash learning for the ML/AI model that is guided by the Hamming space 241 and the cross-modal affinity matrix S 224. In an embodiment, the cross- modal hashing module 200 is implemented as a computer processor device, for example, a microcomputer that includes one or more processing units (e.g., microprocessors), memory storage (e.g., RAM, ROM, etc.), and I/O devices. The processing units of the cross-modal hashing module 200 can execute instructions stored in memory to control one or more electrical systems or subsystems of the computer processor device. Furthermore, the aforementioned feature extraction module 210, affinity matrix module 220, binary code module 230, hashing learning module 240 and other elements comprised thereof, can be implemented as hardware, firmware, software, or any combination thereof. Furthermore, the feature extraction module 210, affinity matrix module 220, binary code module 230, and hashing learning module 240 can be implemented as components integrated together on a single computer processor device of the cross-modal hashing module 200, or as separate stand-alone computer processor devices functioning together.

[0021] FIG. 2 shows that the cross-modal hashing module 200 comprises a feature extraction module 210. Particularly, FIG. 2 illustrates the feature extraction module 210 initially receiving input in the form of video 211 and text 212. For example, the video 211 and text 212 can be a part of a larger dataset that includes a wide variety of video images and text to train the ML/AI model. The video 211 can be multiple different video clips, where each video clip comprises a plurality of frames of video. The feature extraction module 210 receives the video 211, specifically frames of video, in a manner that allows features associated with the visual information conveyed in the video 211 to be extracted and further analyzed as training data. Also, because the cross-modal hashing module 200 facilitates analysis in both video and text modalities, the feature extraction module 210 also extracts features associated with the text 211. As an example, the text 212 that is input into the feature extraction module 210 can include words, keywords, phrases, and the like, which correspond to one or more of the frames of the video 211 that describes the contents and/or context of the video 211. In the example, FIG. 2 depicts the video 211 as a series of frames including images of women on a television show, and depicts text 212 corresponding to these frames, such as a caption including the phrase "two women from a comedic television show." FIG. 2 also illustrates that the input, comprising video 211 and text 212, are respectively received by a CLIP encoder 213, 214 that are implemented in the feature extraction module 210. Specifically, in the example of FIG. 2, video 211 is fed to CLIP encoder 213, which extracts multiple video features 215 and text 212 is fed to CLIP encoder 214, which extracts multiple textual features 216. Accordingly, an ML/AI model that is built and/or trained by the cross-modal hashing module 200 is trained using video features 215 and textual features 216.

[0022] As referred to herein, CLIP is a neural network that is pre-trained on a large set of (image, text) pairs. CLIP can be instructed in natural language to predict the most relevant text snippet, given an image, without directly optimizing for the task. Thus, because CLIP encoders 213, 214 have been pre-trained to learn visual concepts from natural language supervision, the CLIP encoders 213, 214 are already equipped to extract features from (image, text) pairs it has received as input, namely video 211 and the text 212 (e.g., frames of video and its corresponding caption). In support of the dual-modality capabilities of the cross-modal hashing module 200, the CLIP encoders 213, 214 are constructed with a dual-encoder architecture. That is, the dual CLIP encoders 213, 2143 respectively extract video features 215 and textual features 216 from the input. In an embodiment, CLIP encoders 213, 214 extract video features 213 and textual features 216 from each frame of video that it receives as input. In other words, a plurality of video and textual features 215, 216 F are extracted by CLIP dual encoders 213, 214. Furthermore, in an embodiment, the feature extraction module 210 utilizes a mean-pooling frame fusion to make the video features 215 f_v the same size as the text features 216 f_t prior to being output from the module 210. [0023] Also, FIG. 2 illustrates that the cross-modal hashing 200 includes an affinity matrix module 220 which implements the cross-modal affinity matrix S 224. In the example of FIG. 2, the affinity matrix module 220 receives the video features 215 and the textual features 216 from the feature extraction module 210. As alluded to above, the affinity matrix module 220 is configured to leverage the well-defined cross-modal semantic space, shown in FIG. 2 as CLIP space 221, that is generated as a function of the CLIP encoders 213, 214. As illustrated, the affinity matrix module 220 can include a CLIP space 221 where both video features 215 and textual features 216 are projected into one domain, thereby creating a cross-model semantic space. Furthermore, the affinity matrix module 220 is particularly configured to construct the cross-modal affinity matrix S 224. Because the cross-modal affinity matrix 5224 is derived from cross-modal semantic relationships that are defined by the CLIP space 221, the matrix 224 can be leveraged to strengthen cross-model relationships that are learned by the ML/AL model in binary codes, which improves the hash learning process. In order to achieve this, the affinity matrix module 220 first calculates cross-modal cosine similarity matrices 222, 223 from features in the CLIP space 221, where the cross- modal cosine similarity matrices 222, 223 are represented mathematically as:

S_TV= P_TPiE[-l,l]^mxm (2) where F represents the normalization of original features F

[0024] Again, it is the CLIP models (implemented by the CLIP dual-encoders 213, 214) that make the aforementioned calculations involving eq. 1 and eq. 2 possible, because CLIP provides the well-defined cross-modal semantic space, namely CLIP space 221. Secondly, the diagonal values of the cross-modal cosine similarity matrices S_VT 222 and S_TV 223 are set to 1 because the paired video and text should be closest to each other. Subsequently, the cross- modal affinity matrix S 224 can be formed by using the second order of the similarity matrix, which can be represented mathematically as:

where S = 0.5 x ( S_VT + S_TV), and a is the hyper-parameter to adjust the weight of second order neighborhood correlations.

[0025] Thus, as the cross-modal affinity matrix S 224 refines the defined cross-modal relationships, it can be used to guide the module's 200 hash learning algorithm in a manner that enables the model to learn semantic relevant binary codes and improve its learned correlations between video and text modalities.

[0026] Additionally, FIG. 2 shows that the architecture of the cross-modal hashing module 200 includes the binary code module 230. As a general description, the binary code module 230 is configured to generate a Hamming space 231, including both video and textual modalities, which can be used during unsupervised hash learningforthe ML/AI model to learn semantic relevant binary codes. A Hamming space, as referred to herein, can be described as a mathematical space in which words of some given length may be situated, where the separation of points in the space being can be measured by a Hamming distance. The dimensionality of the space is equal to the number of digits in the words, and the coordinate in each dimension is given by each successive digit in the words. As previously described, some conventional hashing approaches adopted a dual-encoder architecture, which analyzes visual and textual inputs separately because of the difficulty associated with fusing different modalities. Nonetheless, although retrieval in a single modality may be easier to implement, it often experiences a performance tradeoff with respect to decreased overall accuracy of cross-modal retrieval tasks. In contrast, the cross-modal hashing module 200 utilizes a "fused" modality approach which takes advantage of the well-defined continuous semantic space that projects both visual and textual features into one domain simultaneously. Since the pre- trained CLIP model provides such a well-constructed cross-modal semantic space (e.g., built upon over 10 billion image-text pairs), the CLIP capabilities are also employed in the hashing functionality of the cross-modal hashing 200 architecture, in order to support other cross- modal aspects of the framework. As seen in FIG. 2, the video features 215 and the textual features 216 that are output from the dual CLIP encoders 213, 214 respectively, are also fed to the binary code module 230. Consequently, the binary code module 230 (in addition to the affinity matrix module 220) can also utilize the cross-modal semantic space that is generated by CLIP. FIG. 2 illustrates that the video features 215 and the textual features 216, as a defined semantic space, are sent to a HashNet 232 in order to ultimately construct the Hamming space 231. In an embodiment, the binary code module 230 is configured to implement a three-layer Multi-Layer Perceptron (MLP) as the HashNet 232 to obtain that Hamming space 231, H_r and H_r, which encodes data from both video and textual modalities into one common Hamming space. Thus, the Hamming space 231 represents relationships between semantic textual and visual information. FIG. 2 also illustrates that the binary code module 230 can utilize another hashing layer 233 that follows the HashNet 232 in order to produce binarized codes B_v and B_r.

[0027] Some naive binarization methods like Sign function or Tanh function are usually utilized for obtaining the binary codes. However, unlike the disclosed embodiments, these conventional binarization methods do not consider the utilization of a Hamming space. In contrast, the binary code module 230 is configured to execute its analysis using the Hamming space 231 to obtain binary codes. According to an embodiment, the binary code module 230 applies a Bi-Half forward and backward learning strategy for the hashing layer 233 to maximize the bit entropy in the Hamming space 231. This learning strategy utilized by the binary code module 240 can be represented mathematically as:

Forward: B = 7T₀ (H) (4)

Backward:

where p₀ is a transport plan, and 1/y = learning rate.

[0028] The binary code module 240 can use p₀ transport plan from eq. 4 to firstly sort hash bits from m batch instances, and then assign a top half of the elements to +1 and others to -1. Therefore, the ML/AL model that is built and/or trained by the cross-modal hashing module 200 learns a Hamming space 231 that is guided by the cross-modal semantic relationships defined in the cross-modal affinity matrix S 224 in a manner that improves accuracy of the hash learning.

[0029] Also, FIG. 2 illustrates that the architecture of the cross-modal hashing module 200 includes a hashing learning module 240. The hashing learning module 240 is configured to perform an unsupervised hash learning for a ML/AI model, using a process that leverages both the cross-modal affinity matrix S 224 and the Hamming space 241 (e.g., binary codes obtained from the hashing space). In an embodiment, the hashing learning module 240 also performs additional functions that can optimize the unsupervised hash learning process, such as dynamic weighting of the cross-modal affinity matrix S 224 and optimizing the ML/AI model with respect to error/loss estimation (e.g., loss function). As described above in eq. 3, the cross-modal affinity matrix S_c 224 requires the hyper-parameter a. To avoid setting up the hyper-parameter heuristically, the hashing learning module 240 is configured to execute a dynamic weighting strategy to diffuse the cross-modal affinity matrix S_c 224 in each training batch. Different from the eq .3, there are three steps used to obtain the new dynamically weighted affinity matrix S 221. Firstly, a balanced weighting is performed to derive S_c, which can be represented mathematically as:

[0030] Secondly, the mean, min, and max values of S_c (denoted as s_mean, 5mm and S_max) are acquired. Then sy in S_c is determined as a "dissimilar pair" or a "similar pair" by comparing the distance between itself and two borderlines, s_min and s_max. Also, sy will be reweighted to sy which can be represented mathematically as:

[0031] The weights W and W⁺ be represented mathematically as:

[0032] The resulting weighted affinity matrix S 224 is formed, which is represented mathematically as:

5 = {¾;}£_W=1 (10)

[0033] Accordingly, the hashing learning module 240 adopts the new weighted affinity matrix S 224 into the unsupervised hash learning of the ML/AI model to guide all the relationships in the Hamming space 241. In order to achieve a hash learning process that involves both the weighted affinity matrix S 224 and the Hamming space 231, the hashing learning module 240 defines all of the relationships in the Hamming space 231 with cosine similarity, where cosine similarity allows intra-modal and inter-modal semantic relationships in the Hamming space to be defined. Thus, the disclosed hash learning process is capable of connecting the semantic visual information with textual information in the Hamming space 231 by utilizing the integration of the neighborhood correlation from the well-defined semantic space. The intra-modal similarity for video features 241 is calculated as cos(B , B ) and the intra-modal similarity for textual features 242 is calculated as cos(B_r, B_r). The inter- modal similarity 243 is calculated as cos(B , B_r). FIG. 2 illustrates that the hashing learning module 240 includes the weighted affinity matrix S 224, the intra-model similarities 241, 242 derived from the Hamming space 231, and the inter-modal similarities 243 derived from the Hamming space 231 into the hashing learning process for the ML/AI model. Furthermore, the hashing learning module 240 utilizes intra-model similarities 241, 242 and the inter-modal similarities 243 to optimize the ML/AI model. Particularly, the hashing learning module 240 trains the ML/AI model to minimize a loss function, where the loss function is represented mathematically as:

[0034] In eq. 11 the variables l₁ , l₂ A3 control the tradeoff balance the intra-modal and inter-modal weights. For example, as part of optimizing the ML/AI model(s) of the cross- modal hashing module 200, the error for the current state of the model must be estimated repeatedly. The loss function in eq. 11 can be used to estimate the loss during training the model(s) so that the weights can be updated to reduce the loss on the next evaluation. Consequently, the resulting ML/AI model from the cross-modal hashing module 200 is a hashing model that has learned the semantic relevant binary codes in a distinct hashing learning process that is guided by the cross-modal affinity matrix S 224 and has dynamic weighting. Restated, the cross-modal hashing module 200 allows the model to learn semantic relevant binary codes, which is further improved by the cross-modal semantic relationships of the cross-modal affinity matrix S 224. Thus, the ML/AI model that is built and/or trained by the cross-modal is trained to recognize cross-modal semantic relationships between video and text data in a manner that can support fast and accurate video-text retrieval tasks.

[0035] A flowchart is shown in FIG. 3, illustrating an example of a process 300 that is performed for building and/or training a ML/AI model using an unsupervised cross-modal hash learning process, in accordance with an embodiment of the systems and methods described herein. As seen in FIG. 3, process 300 is illustrated as a series of executable operations in a machine-readable storage media 306 performed by a hardware processor 304. The computing component 302 can be a computer device used for telecommunication functions, such as voice, video, and text, and video-text retrieval tasks. For example, the computing component 302 may be the mobile computing device (e.g., smartphone) described above in reference to FIG. 1. Generally, process 300 implements building and/or training of a ML/AI model, such as a hash model, using the unsupervised hash learning process which trains the model to learn semantic relevant binary codes, according to some embodiments.

[0036] The process 300 can begin at operation 305, extracting visual features and textual features from input. Operation 305 can involve extracting visual features and textual features from input comprising a dataset of various videos (e.g., frames of video clips) and text data corresponding to the videos (e.g., captions). In some cases, there is text that corresponds to each frame of video. Restated, video features and textual features can be extracted from (video, text) pairs that are received as input for building and/or training the ML/AI model. In an embodiment, the video features and textual features are extracted using a CLIP dual-encoders. Video features and textual features can be extracted from each frame of video that is received as input. Operation 305 can include and steps and calculations performed by a feature extraction module (described in detail in reference to FIG. 2). In an embodiment, operation 305 includes a mean-pooling frame fusion that is used to make the video features the same size as the text features. Furthermore, by leveraging the capabilities of a pre-trained CLIP model, operation 305 outputs a well-defined cross-modal space, also referred to herein as a CLIP Space. The CLIP space defines semantic relationships between the extracted video features and textual features in a manner that can be utilized throughout the process 300.

[0037] Next, the process 300 continues to operation 310 where a cross-modal affinity matrix is constructed. The CLIP space, which a well-defined cross-modal semantic space representing the relationship between video features and textual features, can be used to construct the cross-modal affinity matrix in operation 310. The cross-modal affinity matrix is a key aspect of the unsupervised hash learning process, and improves the overall binary code learning of the process since the matrix is derived from cross-modal semantic relationships (e.g., CLIP space). In an embodiment, operation BIO can involve computations shown in eq. 1 and eq. 2 in order to calculate the cross-modal cosine similarity matrices. Then, the cross- modal affinity matrix can be formed by using the second order of the similarity matrix, which can involve computations shown in eq. 3. In an embodiment, operation 310 includes the functions implemented by the affinity matrix module (described in detail in reference to FIG. 2). As will be described, the cross-modal affinity matrix guides learning relationships in the Hamming space, which improves the binary codes learned in the unsupervised hash learning process.

[0038] Subsequently, at operation 315, a Hamming space is constructed. As a general description, operation 315 involves encoding data from different modalities into one common Hamming space that is ultimately utilized for enhanced video-text retrieval. The well-defined cross-modal semantic space, which is produced based on leveraging the CLIP capabilities in previous operation 305, can be used to construct the Hamming space in operation 315. For example, this well-defined continuous semantic space output from previous operation 305, which projects both video features and textual features into one domain, is fed through a HashNet to construct the Hamming space. In an embodiment, the HashNet is implemented as a three-layer MLP. Additionally, operation 315 can involve applying an additional hash layer, following the HashNet, in order to derive binary codes from the Hamming Space. In an embodiment, utilizing the hashing layer to construct the Hamming space includes applying a Bi-Half forward and backward learning strategy, which involves computations shown in eq. 4 and eq. 5. As will be described, the process 300 applies hash learning to allow the ML/AI model to learn a better Hamming space that is guided by the cross-modal affinity matrix constructed in previous operation 310. In an embodiment, operation 315 includes the functions implemented by the binary code module (described in detail in reference to FIG. 2).

[0039] Thereafter, at operation 320, where unsupervised cross-modal hash learning is conducted to build and/or train the ML/AI model to learn the semantic relevant binary codes. As previously described, the disclosed unsupervised cross-modal hash learning process trains the model to learn the Hamming space/binary codes constructed in previous operation 315, which is improved by being guided by the cross-modal affinity matrix (e.g., defining cross- modal semantic relationships) constructed in previous operation 310. In an embodiment, operation 320 involves dynamically weighting the cross-modal affinity matrix, which circumvents the requirement for hyper-parameters for different datasets. Accordingly, operation 320 can include computations shown in eq. 6 - eq. 10 in order to derive the weighted affinity matrix. Consequently, the unsupervised hash learning process can adopt the weighted affinity matrix in the process in order to guide all of the relationships learned from the Hamming space. In an embodiment, the relationships in the Hamming space are defined with cosine similarity. Thus, operation 320 can include calculating intra-modal similarity and inter-modal similarity of binary codes associated with the Hamming space. The unsupervised hash learning process can include training the ML/AI model using the cosine similarities and the weight affinity matrix, where the affinity matrix refines the cross-modal relationships and guides the model to learn semantic relevant binary codes. Restated, the unsupervised cross- modal learning process connects the semantic video information with textual information in the Hamming space by utilizing the integration of the neighborhood correlation from the well- defined semantic space. Furthermore, operation 320 can also involve applying loss function performing computations shown in eq. 11 in order to train and/or optimize the ML/AI model with respect to error/loss estimation. In an embodiment, operation 320 includes the functions implemented by the hashing learning module (described in detail in reference to FIG. 2).

[0040] Consequently, process 300 builds, trains, and optimizes ML/AI model(s) using the disclosed unsupervised cross-modal hash learning techniques. The ML/AI models generated from process 300 learns semantic relevant binary codes and recognizes cross- modal semantic correlations between video and text, which is leveraged for video-text retrieval tasks. Therefore, process 300 implementing the unsupervised cross-modal hash learning disclosed herein, realizes several advantages, such as improved retrieval accuracy for video-text retrieval tasks, and improved speed and efficiency for video-text retrieval tasks. [0041] FIG. 4 depicts a block diagram of an example computer system 400 in which various features described herein may be implemented. For example, the computer system 400 can be a device (shown in FIG. 1) implementing the disclosed cross-modal hashing module and video-text retrieval system and methods. The computer system 400 includes a bus 402 or other communication mechanism for communicating information, one or more hardware processors 404 coupled with bus 402 for processing information. Hardware processor(s) 404 may be, for example, one or more general purpose microprocessors.

[0042] The computer system 400 also includes a main memory 406, such as a random-access memory (RAM), cache and/or other dynamic storage devices, coupled to bus 402 for storing information and instructions to be executed by processor 404. Main memory 406 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 404. Such instructions, when stored in storage media accessible to processor 404, render computer system 400 into a special-purpose machine that is customized to perform the operations specified in the instructions.

[0043] The computer system 400 further includes a read only memory (ROM) 408 or other static storage device coupled to bus 402 for storing static information and instructions for processor 404. A storage device 410, such as a magnetic disk, optical disk, or USB thumb drive (Flash drive), etc., is provided and coupled to bus 402 for storing information and instructions.

[0044] The computer system 400 may be coupled via bus 402 to a display 412, such as a liquid crystal display (LCD) (or touch screen), for displaying information to a computer user. An input device 414, including alphanumeric and other keys, is coupled to bus 402 for communicating information and command selections to processor 404. Another type of user input device is cursor control 416, such as a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to processor 404 and for controlling cursor movement on display 412. In some embodiments, the same direction information and command selections as cursor control may be implemented via receiving touches on a touch screen without a cursor.

[0045] The computing system 400 may include a user interface module to implement a GUI that may be stored in a mass storage device as executable software codes that are executed by the computing device(s). This and other modules may include, by way of example, components, such as software components, object-oriented software components, class components and task components, processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables.

[0046] In general, the word "component," "engine," "system," "database," data store," and the like, as used herein, can refer to logic embodied in hardware or firmware, or to a collection of software instructions, possibly having entry and exit points, written in a programming language, such as, for example, Java, C or C++. A software component may be compiled and linked into an executable program, installed in a dynamic link library, or may be written in an interpreted programming language such as, for example, BASIC, Perl, or Python. It will be appreciated that software components may be callable from other components or from themselves, and/or may be invoked in response to detected events or interrupts. Software components configured for execution on computing devices may be provided on a computer readable medium, such as a compact disc, digital video disc, flash drive, magnetic disc, or any other tangible medium, or as a digital download (and may be originally stored in a compressed or installable format that requires installation, decompression or decryption prior to execution). Such software code may be stored, partially or fully, on a memory device of the executing computing device, for execution by the computing device. Software instructions may be embedded in firmware, such as an EPROM. It will be further appreciated that hardware components may be comprised of connected logic units, such as gates and flip- flops, and/or may be comprised of programmable units, such as programmable gate arrays or processors. [0047] The computer system 400 may implement the techniques described herein using customized hard-wired logic, one or more ASICs or FPGAs, firmware and/or program logic which in combination with the computer system causes or programs computer system 400 to be a special-purpose machine. According to one embodiment, the techniques herein are performed by computer system 400 in response to processor(s) 404 executing one or more sequences of one or more instructions contained in main memory 406. Such instructions may be read into main memory 406 from another storage medium, such as storage device 410. Execution of the sequences of instructions contained in main memory 406 causes processor(s) 404 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions.

[0048] The term "non-transitory media," and similar terms, as used herein refers to any media that store data and/or instructions that cause a machine to operate in a specific fashion. Such non-transitory media may comprise non-volatile media and/or volatile media. Non-volatile media includes, for example, optical or magnetic disks, such as storage device 410. Volatile media includes dynamic memory, such as main memory 406. Common forms of non-transitory media include, for example, a floppy disk, a flexible disk, hard disk, solid state drive, magnetic tape, or any other magnetic data storage medium, a CD-ROM, any other optical data storage medium, any physical medium with patterns of holes, a RAM, a PROM, and EPROM, a FLASH-EPROM, NVRAM, any other memory chip or cartridge, and networked versions of the same.

[0049] Non-transitory media is distinct from but may be used in conjunction with transmission media. Transmission media participates in transferring information between non-transitory media. For example, transmission media includes coaxial cables, copper wire and fiber optics, including the wires that comprise bus 402. Transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infra-red data communications. [0050] The computer system 400 also includes a communication interface 418 coupled to bus 402. Communication interface 418 provides a two-way data communication coupling to one or more network links that are connected to one or more local networks. For example, communication interface 418 may be an integrated services digital network (ISDN) card, cable modem, satellite modem, or a modem to provide a data communication connection to a corresponding type of telephone line. As another example, communication interface 418 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN (or WAN component to communicated with a WAN). Wireless links may also be implemented. In any such implementation, communication interface 418 sends and receives electrical, electromagnetic or optical signals that carry digital data streams representing various types of information.

[0051] A network link typically provides data communication through one or more networks to other data devices. For example, a network link may provide a connection through local network to a host computer or to data equipment operated by an Internet Service Provider (ISP). The ISP in turn provides data communication services through the world wide packet data communication network now commonly referred to as the "Internet." Local network and Internet both use electrical, electromagnetic or optical signals that carry digital data streams. The signals through the various networks and the signals on network link and through communication interface 418, which carry the digital data to and from computer system 400, are example forms of transmission media.

[0052] The computer system 400 can send messages and receive data, including program code, through the network(s), network link and communication interface 418. In the Internet example, a server might transmit a requested code for an application program through the Internet, the ISP, the local network and the communication interface 418.

[0053] The received code may be executed by processor 404 as it is received, and/or stored in storage device 410, or other non-volatile storage for later execution.

[0054] Each of the processes, methods, and algorithms described in the preceding sections may be embodied in, and fully or partially automated by, code components executed by one or more computer systems or computer processors comprising computer hardware. The one or more computer systems or computer processors may also operate to support performance of the relevant operations in a "cloud computing" environment or as a "software as a service" (SaaS). The processes and algorithms may be implemented partially or wholly in application-specific circuitry. The various features and processes described above may be used independently of one another or may be combined in various ways. Different combinations and sub-combinations are intended to fall within the scope of this disclosure, and certain method or process blocks may be omitted in some implementations. The methods and processes described herein are also not limited to any particular sequence, and the blocks or states relating thereto can be performed in other sequences that are appropriate, or may be performed in parallel, or in some other manner. Blocks or states may be added to or removed from the disclosed example embodiments. The performance of certain of the operations or processes may be distributed among computer systems or computers processors, not only residing within a single machine, but deployed across a number of machines.

[0055] As used herein, a circuit might be implemented utilizing any form of hardware, software, or a combination thereof. For example, one or more processors, controllers, ASICs, PLAs, PALs, CPLDs, FPGAs, logical components, software routines or other mechanisms might be implemented to make up a circuit. In implementation, the various circuits described herein might be implemented as discrete circuits or the functions and features described can be shared in part or in total among one or more circuits. Even though various features or elements of functionality may be individually described or claimed as separate circuits, these features and functionality can be shared among one or more common circuits, and such description shall not require or imply that separate circuits are required to implement such features or functionality. Where a circuit is implemented in whole or in part using software, such software can be implemented to operate with a computing or processing system capable of carrying out the functionality described with respect thereto, such as computer system

400. [0056] As used herein, the term "or" may be construed in either an inclusive or exclusive sense. Moreover, the description of resources, operations, or structures in the singular shall not be read to exclude the plural. Conditional language, such as, among others, "can," "could," "might," or "may," unless specifically stated otherwise, or otherwise understood within the context as used, is generally intended to convey that certain embodiments include, while other embodiments do not include, certain features, elements and/or steps.

[0057] Terms and phrases used in this document, and variations thereof, unless otherwise expressly stated, should be construed as open ended as opposed to limiting. Adjectives such as "conventional," "traditional," "normal," "standard," "known," and terms of similar meaning should not be construed as limiting the item described to a given time period or to an item available as of a given time, but instead should be read to encompass conventional, traditional, normal, or standard technologies that may be available or known now or at any time in the future. The presence of broadening words and phrases such as "one or more," "at least," "but not limited to" or other like phrases in some instances shall not be read to mean that the narrower case is intended or required in instances where such broadening phrases may be absent.

Claims

Claims What is claimed is:

1. A computer-implemented method, comprising: extracting video features and textual features from an input; and conducting a hash learning for a model to learn semantic relevant binary codes, wherein the hash learning comprises applying a cross-modal affinity matrix that defines cross-modal relationships between the extracted video features and the extracted textual features in learning the semantic relevant binary codes.

2. The computer-implemented method of claim 1, wherein the input comprises a dataset of video and text corresponding to the video.

3. The computer-implemented method of claim 1, wherein extracting the video features and textual features comprises applying dual Contrastive Language-Image Pre- Training (CLIP) encoders.

4. The computer-implemented method of claim 3, wherein the dual CLIP encoders generate a cross-modal semantic space that projects the extracted video features and the extracted textual features in one domain.

5. The computer-implemented method of claim 4, further comprising constructing the cross-modal affinity matrix based on the cross-modal semantic space generated by the dual CLIP encoders.

6. The computer-implemented method of claim 5, wherein constructing the cross- modal affinity matrix comprises calculating cross-modal cosine similarity matrices from the extracted video features and the extracted textual features.

7. The computer-implemented method of claim 6, wherein constructing the cross- modal affinity matrix comprises calculating the second order of the cross-modal cosine similarity matrices.

8. The computer-implemented method of claim 4, further comprising constructing a Hamming space based on the cross-modal semantic space generated by the dual CLIP encoders.

9. The computer-implemented method of claim 8, wherein constructing the Hamming space comprises applying a hashing layer to the cross-modal semantic space generated by the dual CLIP encoders.

10. The computer-implemented method of claim 9, wherein constructing the Hamming space comprises applying a Bi-Half forward and backward learning strategy to the hashing layer.

11. The computer-implemented method of claim 9, wherein constructing the Hamming space comprises applying a Bi-Half forward and backward learning strategy to the hashing layer.

12. The computer-implemented method of claim 11, wherein constructing the cross- modal affinity matrix comprises dynamically weighting the cross-modal affinity matrix to generate a weighted affinity matrix.

13. The computer-implemented method of claim 12, wherein conducting the hash learning for the model comprises calculating intermodal similarities and intramodal similarities from the Hamming space.

14. The computer-implemented method of claim IB, wherein conducting the hash learning for the model comprises training the model to learn semantic relevant binary codes based on the intermodal similarities and intramodal similarities from the Hamming space and applying the weighted affinity matrix.

15. The computer-implemented method of claim 1, wherein the model comprises a machine learning/artificial intelligence (ML/AI) model, and further wherein the hash learning comprises an unsupervised cross-modal hash learning.

16. A computer system, comprising: one or more processors; and a memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform: extracting video features and textual features from an input; and conducting a hash learning for a model to learn semantic relevant binary codes, wherein the hash learning comprises applying a cross-modal affinity matrix that defines cross-modal relationships between the extracted video features and the extracted textual features in learning the semantic relevant binary codes.

17. The computer system of claim 16, wherein extracting the video features and textual features comprises applying a Contrastive Language-Image Pre-Training (CLIP) model to generate a cross-modal semantic space that projects the extracted video features and the extracted textual features in one domain.

18. The computer system of claim 17, wherein the memory has further instructions stored thereon, which when executed by the one or more processors cause the processors to further perform: constructing the cross-modal affinity matrix based on the cross-modal semantic space generated by the CLIP model.

19. The computer system of claim 17, wherein the memory has further instructions stored thereon, which when executed by the one or more processors cause the processors to further perform: constructing a Hamming space based on the cross-modal semantic space generated by the dual CLIP encoders; and applying the cross-modal affinity matrix in conducting the hash learning based on cross-modal relationships in the Hamming space.

20. A mobile computing device, comprising: a cross-modal hashing module training a machine learning/artificial intelligence (ML/AI) model using unsupervised cross-modal hash learning to learn semantic relevant binary codes; and a video-text retriever performing video-text retrieval tasks to select one or more most relevant videos from a plurality of video based on a received text query, wherein the video-text retrieval tasks is guided by the ML/AI model trained by the cross-modal hashing module.