WO2023091507A1 - Methods and systems for correlating video and text - Google Patents

Methods and systems for correlating video and text

Info

Publication number
WO2023091507A1
Authority
WO
WIPO (PCT)
Prior art keywords
media
text
data
affinity matrix
matrix
Prior art date
Application number
PCT/US2022/050136
Other languages
French (fr)
Inventor
Yikang Li
Yaoxin ZHUO
Jenhao Hsiao
Chiu Man HO
Original Assignee
Innopeak Technology, Inc.
Application filed by Innopeak Technology, Inc.
Publication of WO2023091507A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/088: Non-supervised learning, e.g. competitive learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks

Definitions

  • a photo search engine (e.g., based on cross-modal video-text retrieval algorithms) allows users to efficiently locate the most relevant photos and/or video segments based on a text description.
  • the present invention is directed to image/video retrieval methods and techniques.
  • a text query is received from a user.
  • the most relevant images and/or video segments are identified and retrieved using a hashing model, which is trained using a machine learning process.
  • a cross-modal affinity matrix may be used for training purposes.
  • Embodiments of the present invention can be implemented in conjunction with existing systems and processes.
  • the image/video retrieval system according to the present invention can be used in a wide variety of systems, including mobile devices, communication systems, and the like.
  • various techniques according to the present invention can be adopted into existing systems via the training of one or more neural network models, which are compatible with most image/video retrieval applications. There are other benefits as well.
  • a system of one or more computers can be configured to perform particular operations or actions by virtue of having software, firmware, hardware, or a combination of them installed on the system that in operation causes or cause the system to perform the actions.
  • One or more computer programs can be configured to perform particular operations or actions by virtue of including instructions that, when executed by the data processing apparatus, cause the apparatus to perform the actions.
  • One general aspect includes a method for correlating media and text. The method includes obtaining a plurality of media data from a first data storage. The method also includes obtaining a plurality of text data. The method also includes calculating a first matrix based on media-to-text similarity values. The method also includes calculating a second matrix based on text-to-media similarity values.
  • the method also includes calculating an affinity matrix based on the first matrix and the second matrix, the affinity matrix may include average similarity values.
  • the method also includes obtaining a minimum value and a maximum value from the affinity matrix.
  • the method also includes calculating weight values based on distances between average similarity values and the minimum value or the maximum value.
  • the method also includes updating the affinity matrix using the weight values.
  • the method also includes generating a binary hash using at least the updated affinity matrix.
  • the method also includes storing the binary hash.
  • Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.
  • Implementations may include one or more of the following features.
  • the method may include: obtaining a reference hash, obtaining a plurality of reference similarity values using the reference hash and the plurality of media data and the plurality of text data, comparing the reference hash against the binary hash, and updating the reference hash.
  • the plurality of media data may include video files.
  • the plurality of media data may include image files.
  • the method may include performing image recognition for the plurality of image data using a graphical processing unit.
  • the method may include calculating a loss based at least on a media-to-media similarity and a media-to-text similarity.
  • the method may include setting a plurality of media-to-text similarity values to one, the plurality of media-to-text similarity values being positioned on a diagonal of the first matrix.
  • the binary hash is stored in a second data storage.
  • One general aspect includes a system for correlating media and text.
  • the system also includes a communication interface configured to obtain a plurality of media data and a plurality of text data.
  • the system also includes a first storage coupled to the communication interface.
  • the first storage is configured to store the plurality of media data and the plurality of text data.
  • the system also includes a processor coupled to the first storage.
  • the processor is configured to extract a plurality of visual features from the plurality of media data, extract a plurality of textual features from the plurality of text data, and calculate an affinity matrix based on the plurality of visual features and the plurality of textual features using at least media-to-text similarity values and text-to-media similarity values, obtain a minimum value and a maximum value from the affinity matrix, calculate weight values based on distances between average similarity values and the minimum value or the maximum value, update the affinity matrix using the weight values, and generate a binary hash using at least the updated affinity matrix.
  • the system also includes a second data storage coupled to the processor. The second data storage is configured to store the binary hash.
  • the system also includes a memory coupled to the processor. The memory is configured to store the plurality of visual features and the plurality of textual features. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.
  • Implementations may include one or more of the following features.
  • the system where the plurality of media data may include video files and/or image files.
  • the processor may include a graphical processing unit configured to perform image recognition for the plurality of media data.
  • One general aspect includes a method for correlating media and text.
  • the method includes obtaining a plurality of media data from a first data storage.
  • the method also includes obtaining a plurality of text data.
  • the method also includes extracting a plurality of visual features from the plurality of media data.
  • the method also includes extracting a plurality of textual features from the plurality of text data.
  • the method also includes calculating a first matrix based on media-to-text similarity values using at least the plurality of visual features and the plurality of textual features.
  • the method also includes calculating a second matrix based on text-to-media similarity values using at least the plurality of visual features and the plurality of textual features.
  • the method also includes calculating an affinity matrix based on the first matrix and the second matrix, the affinity matrix may include average similarity values.
  • the method also includes obtaining a minimum value and a maximum value from the affinity matrix.
  • the method also includes calculating weight values based on distances between average similarity values and the minimum value or the maximum value.
  • the method also includes updating the affinity matrix using the weight values.
  • the method also includes generating a binary hash using at least the updated affinity matrix.
  • the method also includes storing the binary hash.
  • Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.
  • Implementations may include one or more of the following features.
  • the plurality of media data may include video files characterized by a frame rate greater than or equal to 30fps.
  • the media-to-text similarity values and the text-to-media similarity values may be cosine similarity values.
  • the method may include calculating a consistency preserving loss based at least on the affinity matrix. The consistency preserving loss is associated with a modality gap between the plurality of media data and the plurality of text data.
  • the method may include: generating a plurality of latent features using at least the plurality of visual features and the plurality of textual features, obtaining a minimum value and a maximum value for each of the plurality of latent features, and calculating a first distance between each of the latent features and the minimum value, calculating a second distance between each of the latent features and the maximum value, and comparing the first distance and the second distance.
  • the method may include quantizing the updated affinity matrix to binary values based at least on the first distance and the second distance. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.
  • embodiments of the present invention provide many advantages over conventional techniques.
  • the present systems and methods for video-text retrieval employ a single hashing network that provides a well-defined joint semantic space by mitigating the modality gap, resulting in fast and accurate retrieval.
  • the present invention is also well-suited for large-scale video retrieval, which provides more competitive search results with significantly reduced search time.
  • the hashing model is further refined through dynamic weight adjustment and novel quantization methods.
  • Embodiments of the present invention can be implemented in conjunction with existing systems and processes.
  • the video/image retrieval methods according to the present invention can be used in a wide variety of systems, including video streaming, video hosting, client-side media players, online media platforms, and/or the like. There are other benefits as well.
  • Figure 1 is a simplified block diagram illustrating a system for correlating media and text according to embodiments of the present invention.
  • Figure 2 is a simplified block diagram illustrating a process for correlating media and text according to embodiments of the present invention.
  • Figure 3 is a simplified flow diagram illustrating a method for correlating media and text according to embodiments of the present invention.
  • Figure 4 is a simplified flow diagram illustrating a method for correlating media and text according to embodiments of the present invention.
  • Figure 5 is a simplified block diagram illustrating a process for correlating media and text according to embodiments of the present invention.
  • the present invention is directed to image/video retrieval methods and techniques.
  • a text query is received from a user.
  • the most relevant images and/or video segments are identified and retrieved using a hashing model, which is trained using a machine learning process.
  • a cross-modal affinity matrix may be used for training purposes.
  • hashing-based methods achieve a faster and more efficient retrieval.
  • Conventional hashing-based techniques usually focus on a single modality (e.g., the visual modality) that requires both the query data and retrieval data to be in the same format (e.g., both are images/videos), which significantly restricts the generality and usability of image/video retrieval methods.
  • Cross-modal retrieval that accounts for two or more types of data offers more accurate search results and a wider range of applications.
  • Embodiments of the present invention provide a complete image/video retrieval system that incorporates both visual and textual features into a single hashing network, which advantageously reduces the modality gap and captures the semantic relationships across modalities for accurate retrieval.
  • the training of the hashing network operates in a fully unsupervised manner without requiring category or annotation information.
  • the complete system achieves a more efficient yet flexible hashing scheme through the dynamic weighting and binarization modules that do not require dataset-specific hyper-parameters for overall efficiency and effectiveness.
  • FIG. 1 is a simplified block diagram illustrating a system 100 for correlating media and text according to embodiments of the present invention.
  • the system 100 may be a client device, and/or a server, and/or the like.
  • This diagram is merely an example, which should not unduly limit the scope of the claims.
  • One of ordinary skill in the art would recognize many variations, alternatives, and modifications.
  • system 100 includes first data storage 110 and second data storage 120 coupled via data bus 130 to processor 140.
  • Processor 140 is coupled to communication interface 170, which is configured to obtain a plurality of media data and a plurality of text data as inputs.
  • the plurality of media data comprises video files.
  • the plurality of media data comprises image files.
  • Elements of system 100 can be configured together to perform a media-text correlating process, as further described below.
  • first data storage 110 coupled to communication interface 170, is configured to store media and text data received from communication interface 170 for processing.
  • the media data may be received from, without limitation, content sharing platforms, content delivery services, social networking platforms, live streaming platforms, mobile applications and services, and/or the like.
  • the video files of the plurality of media data may be characterized by a frame rate greater than or equal to 30fps.
  • first data storage 110 may include, without limitation, local and/or network accessible storage, a disk drive, a drive array, an optical storage device, and solid-state storage device, which can be programmable, flash-updateable, and/or the like.
  • the processor 140 includes central processing unit (CPU) 150, graphics processing unit (GPU) 160, and/or the like.
  • CPU 150 may be configured to handle various types of system functions, such as retrieving the plurality of media data and text data from first data storage 110, and executing executable instructions (e.g., visual feature extraction, textual feature extraction, affinity matrix calculation, feature mapping, etc.).
  • GPU 160 may be specially designed to provide image recognition and processing.
  • GPU 160 is configured to provide image recognition for the plurality of media data.
  • processor 140 is configured to retrieve media data and text data, and respectively extract visual features and textual features.
  • the visual features and textual features may be temporarily stored in memory 180 and can form a plurality of training pairs for the following training process.
  • Memory 180 may include a random-access memory (RAM) device, a data buffer device, and/or the like.
  • processor 140 is further configured to calculate an affinity matrix based on the plurality of visual features and the plurality of textual features.
  • the affinity matrix is associated with the semantic correlation between the visual and textual modalities.
  • processor 140 is further configured to map the affinity matrix to a binary hash, which is configured for media-text correlation.
  • second data storage 120 is coupled to processor 140 and configured to store the binary hash for further processing.
  • Second data storage 120 may include, without limitation, local and/or network accessible storage, a disk drive, a drive array, an optical storage device, and a solid-state storage device, which can be programmable, flash-updateable, and/or the like.
  • FIG 2 is a simplified block diagram illustrating a process for correlating media and text according to embodiments of the present invention.
  • This diagram is merely an example, which should not unduly limit the scope of the claims.
  • image/video retrieval may be implemented with system 100 in Figure 1, as shown.
  • media data 205 and text data 210 stored in first data storage 110 may be retrieved by processor 140 to perform feature extraction process at module 215.
  • feature extraction module 215 extracts visual features 220 from the media data and extracts textual features 225 from the text data 210.
  • GPU 160 in Figure 1 may perform an image recognition process for the plurality of media data to facilitate the visual feature extraction.
  • the plurality of visual features 220 and the plurality of textual features 225 form a plurality of training pairs 230 that are configured as inputs of separate pipelines for further parallel processing.
  • feature extraction module 215 includes a pre-trained feature extraction model that is configured to predict the most relevant pairs of a batch of training data (e.g., media and text).
  • the pretrained feature extraction model provides a common semantic space for both visual and textual features, where visual features 220 and textual features 225 can be represented as semantic vectors in the same domain based on the semantic relations between the visual and textual vectors. It is to be appreciated that this common semantic space allows for the construction of a single hashing network, where the cross-modal information can be properly preserved.
  • visual features 220 and textual features 225 may be used for making an affinity matrix at an affinity matrix construction module 235.
  • a first matrix based on media-to-text similarity values is calculated using at least the plurality of visual features 220 and the plurality of textual features 225 (e.g., implemented with CPU 150).
  • a second matrix based on text-to-media similarity values is calculated using at least the plurality of visual features 220 and the plurality of textual features 225 (e.g., implemented with CPU 150).
  • the first matrix is represented as S^VT.
  • the second matrix is represented as S^TV.
  • F^V and F^T denote visual features 220 and textual features 225 extracted by the feature extraction module 215.
  • the paired media and text (corresponding to the diagonal elements of the first and second matrices) should be the closest to each other.
  • all diagonal elements of the first and second matrix are set to one to strengthen the correlation between the paired examples (e.g., media and text).
  • media-to-text similarity values positioned on a diagonal of the first matrix are set to one.
  • text-to-media similarity values positioned on a diagonal of the second matrix are set to one.
  • a cross-modal affinity matrix is calculated based on the first matrix and the second matrix, where the affinity matrix comprises average similarity values, e.g., S^C = (S^VT + S^TV) / 2, where S^C denotes the affinity matrix. It is to be appreciated that the affinity matrix is configured to be symmetric (similar to the case in the single modality) by calculating the average similarity values of the first matrix and the second matrix.
  • the unpaired video-text relations are also taken into account to condition the constructed affinity matrix at a weight calculation module 240, where a dynamic weighting strategy may be adopted.
  • a mean value, a minimum value, and a maximum value of the affinity matrix (denoted as s_mean, s_min, and s_max) of each batch are obtained (e.g., implemented with CPU 150 as shown in Figure 1).
  • the distances between the average similarity value and the minimum value or the maximum value are calculated to determine whether the corresponding training pair is a “dissimilar pair” or “similar pair.” For example, each element may be compared against the mean value to determine whether it has a “dissimilar pair” or a “similar pair.” For example, the element may be determined as a “dissimilar pair” if it is less than or equal to the mean value; the element may be determined as a “similar pair” if it is greater than or equal to the mean value. Further, weight values can then be calculated based on the distances between the average similarity values and the minimum value or the maximum value.
  • the re-weighted affinity matrix (the updated S^C) is calculated by applying status weights W^- and W^+, where W^- and W^+ represent the weights for diffusing the “dissimilar pair” or “similar pair,” respectively, and magnification factors control the scaling of the distances (a hedged code sketch of this dynamic weighting appears after this list).
  • a reference hash 245 may be obtained.
  • the reference hash 245 is a multi-layer perceptron (MLP).
  • reference hash 245 may be a three-layer MLP.
  • Reference hash 245 may be configured to generate a plurality of latent features 250 using at least the plurality of visual features 220 and the plurality of textual features 225 (e.g., implemented with CPU 150 in Figure 1).
  • the plurality of latent features 250 generated by reference hash 245 may be denoted as H^V and H^T for the visual and textual inputs, respectively.
  • a plurality of reference similarity values is obtained using the reference hash, the plurality of media data, and the plurality of text data.
  • a loss associated with the plurality of reference similarity values is calculated at a loss control module 255.
  • the loss may be associated with an intra-modal similarity and/or an inter-modal similarity.
  • the intra-modal similarity and the inter-modal similarity may be cosine similarity values.
  • a loss associated with the intra-modal similarity is calculated based at least on a media-to-media similarity and a text-to-text similarity that are respectively denoted as cos(H^V, H^V) and cos(H^T, H^T).
  • a loss associated with the inter-modal similarity is calculated based at least on a media-to-text similarity, denoted as cos(H^V, H^T).
  • the loss associated with the intra-modal similarity and the loss associated with the inter-modal similarity may be calculated from these similarity values.
  • a consistency preserving loss associated with a modality gap between the plurality of media data 205 and the plurality of text data 210 may be calculated based at least on the affinity matrix at loss control module 255.
  • all the aforementioned losses may be considered to calculate a total loss to guide the hash learning process.
  • the total loss may be calculated as a weighted combination of the aforementioned losses (e.g., L_total = λ_1·L_intra + λ_2·L_inter + λ_3·L_cp), where λ_1, λ_2, and λ_3 may be adaptively adjusted to control the comparable contributions of their corresponding losses for refining the hash model training (see the hedged loss sketch following this list).
  • a quantization module 260 may be used after the loss control module 255 to generate a hashing code.
  • quantization module 260 may be configured to quantize the updated affinity matrix (e.g., after the re-weighting step at weight calculation module 240) to binary values to obtain a binary hash 265.
  • quantization module 260 includes a hashing layer that maps the updated affinity matrix to a binary hash 265.
  • the hashing layer may be denoted as B, whose forward mapping projects the latent features into binary codes according to a mapping plan.
  • the hash layer may map the latent features 250 into a Hamming space, where the dimensionality of the space is associated with the number of digits in words of a certain length.
  • quantization module 260 calculates a first distance between each of the latent features and the minimum value and a second distance between each of the latent features and the maximum value. Lastly, the binary value may then be determined by comparing the first distance and the second distance. For example, quantization module 260 may quantize the updated affinity matrix to binary values based at least on the first distance and the second distance. In some cases, the elements of each vector are assigned to +1 if they are close to the maximum value, or assigned to -1 if they are close to the minimum value, where H_z is the z-th dimension of the latent feature elements in the batch (see the quantization and retrieval sketch following this list).
  • the above quantization methods based on the calculation of minimum and maximum values can group the features with a dynamic threshold based on local statistics, which advantageously preserves the relative relationships among the latent features. It is to be appreciated that projecting data from different modalities into a common Hamming space to learn binary codes results in a more efficient and fast image/video retrieval process, where the similarity is measured by a Hamming distance among elements in the Hamming space.
  • the hashing-based network also effectively reduces the size of the original continuous semantic space (e.g., by 8 times), allowing for a fast retrieval process with less computational cost.
  • Figure 3 is a simplified flow diagram illustrating a method for correlating media and text according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims.
  • One of ordinary skill in the art would recognize many variations, alternatives, and modifications. For example, one or more steps may be added, removed, repeated, rearranged, replaced, modified, and/or overlapped, and they should not limit the scope of the claims.
  • method 300 includes step 302 of obtaining a plurality of media data from a first storage, and step 304 of obtaining a plurality of text data.
  • processor 140 can be configured to obtain the plurality of media data and/or the plurality of text data from communication interface 170.
  • the plurality of media data and/or the plurality of text data may be stored at first data storage 110.
  • the plurality of media data comprises video files and/or image files.
  • method 300 includes calculating a first matrix based on media- to-text similarity values and a second matrix based on text-to-media similarity values.
  • processor 140 is configured to retrieve the plurality of media data and/or the plurality of text data stored in first data storage 110 to obtain a plurality of visual features and/or a plurality of textual features via a feature extraction process. The extracted visual features and textual features can then be used to calculate the first matrix and the second matrix based on a plurality of media-to-text similarity values and a plurality of text-to-media similarity values, respectively.
  • the media-to-text similarity values and text-to-media similarity values may be cosine similarity values, which indicate the semantic relationships between the media and text data.
  • method 300 includes calculating an affinity matrix.
  • the affinity matrix may be calculated based at least on the first matrix and the second matrix.
  • a plurality of media-to-text similarity values positioned on a diagonal of the first matrix (corresponding to the most related media-text pairs) are set to one to strengthen the correlation between the paired media and text data.
  • the affinity matrix includes average similarity values.
  • the affinity matrix is calculated as an average of the first matrix and the second matrix, where the unpaired media and text data may also be taken into account for affinity matrix construction.
  • method 300 includes obtaining a minimum value and a maximum value from the affinity matrix.
  • processor 140 can be configured to obtain the mean, minimum, and maximum values of the affinity matrix, which may later be used for weight adjustment of the affinity matrix to preserve the relations between the unpaired media and text data by strengthening the off-diagonal values in the affinity matrix.
  • method 300 includes calculating weight values based on distances between average similarity values and the minimum value or the maximum value. For each element in the affinity matrix, distances between itself and the maximum and minimum value may be calculated and compared. In some cases, each element may be compared against the mean value to determine whether it has a “dissimilar pair” or a “similar pair.” For example, the element may be determined as a “dissimilar pair” if it is less than or equal to the mean value; the element may be determined as a “similar pair” if it is greater than or equal to the mean value.
  • weights (denoted as W^- and W^+) can then be calculated from these distances, where W^- and W^+ represent the weights for diffusing the “dissimilar pair” or “similar pair,” respectively, and magnification factors control the scaling of the distances.
  • at step 316, method 300 includes updating the affinity matrix using the weight values. It is to be appreciated that the aforementioned weight calculation and adjustment process allows for better discriminative learning by equalizing the distribution of the distances in the affinity matrix.
  • method 300 includes mapping the updated affinity matrix to a binary hash.
  • a multi-layer perceptron (e.g., a three-layer MLP) may serve as the reference hash.
  • the plurality of visual features and the plurality of textual features may be fed to the three-layer MLP as inputs to generate a plurality of latent features that are later associated with the affinity matrix in loss computation, which in turn guides the hash network training (as shown in Figure 2).
  • method 300 further includes quantizing the updated affinity matrix to binary values, where a binary code may be obtained.
  • method 300 may include mapping the plurality of latent features into a Hamming space, where a minimum value and a maximum value for each vector along the batch axis are acquired.
  • the binary code may be obtained by comparing distances between each element and the maximum and minimum values in the Hamming space. For example, the elements of each vector are assigned to +1 if they are close to the maximum value, or assigned to -1 if they are close to the minimum value.
  • step 320 method 300 includes storing the binary hash.
  • the binary hash is stored in a second data storage 120 of Figure 1.
  • Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.
  • Figure 4 is a simplified flow diagram illustrating a method for correlating media and text according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims.
  • One of ordinary skill in the art would recognize many variations, alternatives, and modifications. For example, one or more steps may be added, removed, repeated, rearranged, replaced, modified, and/or overlapped, and they should not limit the scope of the claims.
  • method 400 includes step 402 of obtaining a reference hash.
  • the reference hash is a three-layer MLP.
  • the reference hash is configured to generate a plurality of latent features, which may later be used in conjunction with the affinity matrix in generating a loss for guiding the training of hashing model.
  • method 400 includes obtaining a plurality of reference similarity values using the reference hash, the plurality of media data, and the plurality of text data.
  • Method 400 may also include calculating a loss based at least on a media-to-media similarity and a media- to-text similarity.
  • a loss associated with the plurality of reference similarity values is calculated to preserve the relationship between different modalities (e.g., visual and textual).
  • the loss may be associated with an intra-modal similarity and/or an inter-modal similarity.
  • a loss associated with the intra-modal similarity may be calculated based at least on a media-to-media similarity and a text-to-text similarity.
  • a loss associated with the inter-modal similarity is calculated based at least on a media-to-text similarity.
  • method 400 may further include calculating a consistency preserving loss based at least on the affinity matrix. The consistency preserving loss is associated with a modality gap between the plurality of media data and the plurality of text data.
  • method 400 includes comparing the reference hash against the binary hash and updating the reference hash. Depending on the implementations, one or more of the aforementioned losses may be included to calculate a total loss to guide the training process. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.
  • FIG. 5 is a simplified block diagram illustrating a process 500 for correlating media and text according to embodiments of the present invention.
  • process 500 may be implemented with system 100 of Figure 1.
  • the system 100 may be a client device, a server, and/or the like.
  • system 100 obtains a media file 505 and a text query 510 from communication interface 170.
  • media file 505 is a video file that comprises a plurality of continuous image frames.
  • the video file may be characterized by a frame rate greater than or equal to 30fps.
  • the media file 505 can be divided into one or more video segments illustrating various scenes, including a video segment 540 that is the most relevant to the text query 510.
  • an example of the media file 505 includes a first video segment depicting a person running (e.g., the first frame of 505) and a second video segment depicting a person playing basketball (e.g., the second to fourth frames of 505).
  • the text query 510 may be received from a user through a user interface (e.g., a touch screen).
  • the text query 510 may include one or more words or sentences.
  • An example of text query 510 is “One person is playing basketball.”
  • the system 100 may then correlate the media file 505 and the text query 510 and retrieve a video segment 540 that is the most relevant to the text query 510 (e.g., the second video segment of 505).
  • a visual feature 515 is extracted for the media file 505 and a textual feature 525 is extracted for the text query 510 (e.g., implemented with CPU 150 and/or GPU 160 of Figure 1).
  • Textual feature 525 is fed into a model-based process module 530, which may include a pre-trained hashing model to generate hash code 535.
  • the hash code 535 may then be fed to a binary hash 520 as input to perform a hash-based search.
  • hash code 535 may be characterized by a relatively large bit size (e.g., 256, 512, 1024, 2048, or bigger), which allows for more flexible semantic representation and is more suitable for maintaining the semantic information in video retrieval.
  • the binary hash 520 may be previously trained using a plurality of local media files and then stored at local storage (e.g., second data storage 120 of Figure 1). It is to be appreciated that such a hashing-based search process, where the similarity is measured in Hamming space, provides more efficient and faster video/image retrieval compared with the conventional non-hashing method based on matrix multiplication (see the quantization and retrieval sketch following this list).
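
The dynamic weighting performed at weight calculation module 240 (described in the bullets above) can be illustrated with a short sketch. The exact weighting equations are not reproduced in the text above, so the formula used here is only an assumption that follows the described structure: batch mean/minimum/maximum statistics, a similar/dissimilar split at the mean, distance-based weights, and magnification factors (the names alpha_neg and alpha_pos are invented for illustration).

```python
import torch

def dynamic_reweight(s_c, alpha_neg=1.0, alpha_pos=1.0):
    """Hedged sketch of the dynamic weighting at weight calculation module 240.

    Follows the described structure (batch mean/min/max, similar vs.
    dissimilar split at the mean, distance-based weights, magnification
    factors); the exact formulas below are assumptions for illustration.
    """
    s_mean, s_min, s_max = s_c.mean(), s_c.min(), s_c.max()
    span = (s_max - s_min).clamp_min(1e-8)
    dissimilar = s_c <= s_mean                        # at or below the batch mean
    # Push dissimilar pairs toward the batch minimum and similar pairs
    # toward the batch maximum, equalizing the distance distribution.
    w_neg = 1.0 - alpha_neg * (s_c - s_min) / span    # W^- for dissimilar pairs
    w_pos = 1.0 + alpha_pos * (s_max - s_c) / span    # W^+ for similar pairs
    weights = torch.where(dissimilar, w_neg, w_pos)
    return weights * s_c                              # re-weighted affinity matrix
```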
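
The intra-modal, inter-modal, and consistency preserving losses computed at loss control module 255 are described above only in terms of the similarities they depend on. The sketch below is one plausible instantiation (mean-squared alignment of cosine-similarity structures with the re-weighted affinity matrix, plus direct alignment of paired codes) offered purely for illustration; these specific forms should not be attributed to the patent.

```python
import torch.nn.functional as F

def cosine_matrix(a, b):
    """Pairwise cosine similarities between the rows of a and b."""
    return F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T

def hashing_losses(h_v, h_t, s_weighted, lambdas=(1.0, 1.0, 1.0)):
    """Hedged sketch of loss control module 255.

    h_v, h_t   : latent features H^V and H^T from the reference hash (MLP).
    s_weighted : re-weighted affinity matrix used as the learning target.
    The mean-squared alignment forms below are assumptions; the text only
    states which similarities each loss is based on.
    """
    # Intra-modal losses based on cos(H^V, H^V) and cos(H^T, H^T).
    l_intra = (F.mse_loss(cosine_matrix(h_v, h_v), s_weighted)
               + F.mse_loss(cosine_matrix(h_t, h_t), s_weighted))
    # Inter-modal loss based on cos(H^V, H^T).
    l_inter = F.mse_loss(cosine_matrix(h_v, h_t), s_weighted)
    # Consistency preserving loss for the modality gap; one possible form
    # is to align the paired visual and textual codes directly.
    l_cp = F.mse_loss(h_v, h_t)
    l1, l2, l3 = lambdas
    return l1 * l_intra + l2 * l_inter + l3 * l_cp
```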
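
Quantization module 260 and the hash-based search of Figure 5 can be sketched as follows. The +1/-1 assignment by proximity to the per-dimension batch maximum or minimum follows the description above; the Hamming-distance ranking via a dot product is standard practice for ±1 codes rather than something stated in the text.

```python
import torch

def min_max_quantize(h):
    """Hedged sketch of quantization module 260 (min-max binarization).

    h : (m, k) latent features for a batch of m items and k hash bits.
    Each dimension is compared against its per-dimension batch minimum and
    maximum; values closer to the maximum become +1, values closer to the
    minimum become -1 (a dynamic, statistics-based threshold).
    """
    h_min = h.min(dim=0, keepdim=True).values
    h_max = h.max(dim=0, keepdim=True).values
    dist_to_min = (h - h_min).abs()   # first distance
    dist_to_max = (h_max - h).abs()   # second distance
    return torch.where(dist_to_max <= dist_to_min,
                       torch.ones_like(h), -torch.ones_like(h))

def hamming_search(query_code, db_codes, top_k=5):
    """Rank database codes by Hamming distance to the query code (±1 codes)."""
    k = query_code.numel()
    # For ±1 codes, Hamming distance = (k - dot product) / 2.
    dists = (k - db_codes @ query_code) / 2
    return torch.topk(-dists, k=top_k).indices   # indices of the closest items
```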

Abstract

The present invention is directed to image/video retrieval methods and techniques. According to a specific embodiment, a text query is received from a user. The most relevant images and/or video segments are identified and retrieved using a hashing model, which is trained using a machine learning process. A cross-modal affinity matrix may be used for training purposes. There are other embodiments as well.

Description

METHODS AND SYSTEMS FOR CORRELATING VIDEO AND TEXT
CROSS-REFERENCES TO RELATED APPLICATIONS
[0001] The present application claims priority to U.S. Provisional Application No.
63/280,894, entitled “UNSUPERVISED MIN-MAX DEEP HASHING FOR CROSS-MODAL VIDEO-TEXT RETRIEVAL”, filed on November 18, 2021, which is commonly owned and incorporated by reference herein for all purposes.
BACKGROUND OF THE INVENTION
[0002] As more and more multimedia data are stored electronically, recognizing and retrieving images and/or video moments from media files have become ubiquitous. For example, a photo search engine (e.g., based on cross-modal video-text retrieval algorithms) allows users to efficiently locate the most relevant photos and/or video segments based on a text description.
[0003] There have been various conventional techniques, but unfortunately, they are inadequate, for the reasons provided below. New and improved methods and systems are desired.
BRIEF SUMMARY OF THE INVENTION
[0004] The present invention is directed to image/video retrieval methods and techniques. According to a specific embodiment, a text query is received from a user. The most relevant images and/or video segments are identified and retrieved using a hashing model, which is trained using a machine learning process. A cross-modal affinity matrix may be used for training purposes. There are other embodiments as well.
[0005] Embodiments of the present invention can be implemented in conjunction with existing systems and processes. For example, the image/video retrieval system according to the present invention can be used in a wide variety of systems, including mobile devices, communication systems, and the like. Additionally, various techniques according to the present invention can be adopted into existing systems via the training of one or more neural network models, which are compatible with most image/video retrieval applications. There are other benefits as well.
[0006] A system of one or more computers can be configured to perform particular operations or actions by virtue of having software, firmware, hardware, or a combination of them installed on the system that in operation causes or cause the system to perform the actions. One or more computer programs can be configured to perform particular operations or actions by virtue of including instructions that, when executed by the data processing apparatus, cause the apparatus to perform the actions. One general aspect includes a method for correlating media and text. The method includes obtaining a plurality of media data from a first data storage. The method also includes obtaining a plurality of text data. The method also includes calculating a first matrix based on media-to-text similarity values. The method also includes calculating a second matrix based on text-to-media similarity values. The method also includes calculating an affinity matrix based on the first matrix and the second matrix, the affinity matrix may include average similarity values. The method also includes obtaining a minimum value and a maximum value from the affinity matrix. The method also includes calculating weight values based on distances between average similarity values and the minimum value or the maximum value. The method also includes updating the affinity matrix using the weight values. The method also includes generating a binary hash using at least the updated affinity matrix. The method also includes storing the binary hash. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.
[0007] Implementations may include one or more of the following features. The method may include: obtaining a reference hash, obtaining a plurality of reference similarity values using the reference hash and the plurality of media data and the plurality of text data, comparing the reference hash against the binary hash, and updating the reference hash. The plurality of media data may include video files. In some embodiments, the plurality of media data may include image files. The method may include performing image recognition for the plurality of image data using a graphical processing unit. The method may include calculating a loss based at least on a media-to-media similarity and a media-to-text similarity. The method may include setting a plurality of media-to-text similarity values to one, the plurality of media-to-text similarity values being positioned on a diagonal of the first matrix. The binary hash is stored in a second data storage. The method may include quantizing the updated affinity matrix to binary values. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.
[0008] One general aspect includes a system for correlating media and text. The system also includes a communication interface configured to obtain a plurality of media data and a plurality of text data. The system also includes a first storage coupled to the communication interface. The first storage is configured to store the plurality of media data and the plurality of text data. The system also includes a processor coupled to the first storage. The processor is configured to extract a plurality of visual features from the plurality of media data, extract a plurality of textual features from the plurality of text data, and calculate an affinity matrix based on the plurality of visual features and the plurality of textual features using at least media-to-text similarity values and text-to-media similarity values, obtain a minimum value and a maximum value from the affinity matrix, calculate weight values based on distances between average similarity values and the minimum value or the maximum value, update the affinity matrix using the weight values, and generate a binary hash using at least the updated affinity matrix. The system also includes a second data storage coupled to the processor. The second data storage is configured to store the binary hash. The system also includes a memory coupled to the processor. The memory is configured to store the plurality of visual features and the plurality of textual features. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.
[0009] Implementations may include one or more of the following features. The system where the plurality of media data may include video files and/or image files. The processor may include a graphical processing unit configured to perform image recognition for the plurality of media data. The processor may include a central processing unit. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.
[0010] One general aspect includes a method for correlating media and text. The method includes obtaining a plurality of media data from a first data storage. The method also includes obtaining a plurality of text data. The method also includes extracting a plurality of visual features from the plurality of media data. The method also includes extracting a plurality of textual features from the plurality of text data. The method also includes calculating a first matrix based on media-to-text similarity values using at least the plurality of visual features and the plurality of textual features. The method also includes calculating a second matrix based on text-to-media similarity values using at least the plurality of visual features and the plurality of textual features. The method also includes calculating an affinity matrix based on the first matrix and the second matrix, the affinity matrix may include average similarity values. The method also includes obtaining a minimum value and a maximum value from the affinity matrix. The method also includes calculating weight values based on distances between average similarity values and the minimum value or the maximum value. The method also includes updating the affinity matrix using the weight values. The method also includes generating a binary hash using at least the updated affinity matrix. The method also includes storing the binary hash. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.
[0011] Implementations may include one or more of the following features. In some embodiments, the plurality of media data may include video files characterized by a frame rate greater than or equal to 30fps. The media-to-text similarity values and the text-to-media similarity values may be cosine similarity values. The method may include calculating a consistency preserving loss based at least on the affinity matrix. The consistency preserving loss is associated with a modality gap between the plurality of media data and the plurality of text data. The method may include: generating a plurality of latent features using at least the plurality of visual features and the plurality of textual features, obtaining a minimum value and a maximum value for each of the plurality of latent features, and calculating a first distance between each of the latent features and the minimum value, calculating a second distance between each of the latent features and the maximum value, and comparing the first distance and the second distance. The method may include quantizing the updated affinity matrix to binary values based at least on the first distance and the second distance. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.
[0012] It is to be appreciated that embodiments of the present invention provide many advantages over conventional techniques. Among other things, the present systems and methods for video-text retrieval employ a single hashing network that provides a well-defined joint semantic space by mitigating the modality gap, resulting in fast and accurate retrieval. With a larger bit size (e.g., 2048 bits), the present invention is also well-suited for large-scale video retrieval, which provides more competitive search results with significantly reduced search time. Additionally, the hashing model is further refined through dynamic weight adjustment and novel quantization methods.
[0013] Embodiments of the present invention can be implemented in conjunction with existing systems and processes. For example, the video/image retrieval methods according to the present invention can be used in a wide variety of systems, including video streaming, video hosting, client-side media players, online media platforms, and/or the like. There are other benefits as well.
[0014] The present invention achieves these benefits and others in the context of known technology. However, a further understanding of the nature and advantages of the present invention may be realized by reference to the latter portions of the specification and attached drawings.

BRIEF DESCRIPTION OF THE DRAWINGS
[0015] Figure 1 is a simplified block diagram illustrating a system for correlating media and text according to embodiments of the present invention.
[0016] Figure 2 is a simplified block diagram illustrating a process for correlating media and text according to embodiments of the present invention.
[0017] Figure 3 is a simplified flow diagram illustrating a method for correlating media and text according to embodiments of the present invention.
[0018] Figure 4 is a simplified flow diagram illustrating a method for correlating media and text according to embodiments of the present invention.
[0019] Figure 5 is a simplified block diagram illustrating a process for correlating media and text according to embodiments of the present invention.
DETAILED DESCRIPTION OF THE INVENTION
[0020] The present invention is directed to image/video retrieval methods and techniques. According to a specific embodiment, a text query is received from a user. The most relevant images and/or video segments are identified and retrieved using a hashing model, which is trained using a machine learning process. A cross-modal affinity matrix may be used for training purposes. There are other embodiments as well.
[0021] As mentioned above, conventional techniques are inadequate. For example, most conventional techniques use pre-defined categories as indexed tags that entail exact keyword matching with the user’s query, which becomes extremely challenging and time-consuming as the scale of datasets grows.
[0022] Over the years, many techniques for cross-modal retrieval have been developed, including traditional non-hashing methods and hashing-based methods. Compared with existing non-hashing methods that operate in the continuous feature space, hashing-based methods achieve a faster and more efficient retrieval. Conventional hashing-based techniques usually focus on a single modality (e.g., the visual modality) that requires both the query data and retrieval data to be in the same format (e.g., both are images/videos), which significantly restricts the generality and usability of image/video retrieval methods. Cross-modal retrieval that accounts for two or more types of data offers more accurate search results and a wider range of applications. However, it remains a challenging task to effectively capture and preserve cross-modal correlations for high-performance retrieval. New and improved methods and systems for media-text retrieval are desired.

[0023] Embodiments of the present invention provide a complete image/video retrieval system that incorporates both visual and textual features into a single hashing network, which advantageously reduces the modality gap and captures the semantic relationships across modalities for accurate retrieval. The training of the hashing network operates in a fully unsupervised manner without requiring category or annotation information. In addition, the complete system achieves a more efficient yet flexible hashing scheme through the dynamic weighting and binarization modules that do not require dataset-specific hyper-parameters for overall efficiency and effectiveness.
[0024] The following description is presented to enable one of ordinary skill in the art to make and use the invention and to incorporate it into the context of particular applications. Various modifications, as well as a variety of uses in different applications, will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to a wide range of embodiments. Thus, the present invention is not intended to be limited to the embodiments presented but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
[0025] In the following detailed description, numerous specific details are set forth in order to provide a more thorough understanding of the present invention. However, it will be apparent to one skilled in the art that the present invention may be practiced without necessarily being limited to these specific details. In other instances, well-known structures and devices are shown in block diagram form, rather than in detail, in order to avoid obscuring the present invention.
[0026] The reader's attention is directed to all papers and documents which are filed concurrently with this specification and which are open to public inspection with this specification, and the contents of all such papers and documents are incorporated herein by reference. All the features disclosed in this specification (including any accompanying claims, abstract, and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise. Thus, unless expressly stated otherwise, each feature disclosed is one example only of a generic series of equivalent or similar features.

[0027] Furthermore, any element in a claim that does not explicitly state “means for” performing a specified function, or “step for” performing a specific function, is not to be interpreted as a “means” or “step” clause as specified in 35 U.S.C. Section 112, Paragraph 6.
In particular, the use of “step of” or “act of” in the Claims herein is not intended to invoke the provisions of 35 U.S.C. 112, Paragraph 6.
[0028] Please note, if used, the labels left, right, front, back, top, bottom, forward, reverse, clockwise and counterclockwise have been used for convenience purposes only and are not intended to imply any particular fixed direction. Instead, they are used to reflect relative locations and/or directions between various portions of an object.
[0029] Figure 1 is a simplified block diagram illustrating a system 100 for correlating media and text according to embodiments of the present invention. The system 100 may be a client device, and/or a server, and/or the like. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications.
[0030] As shown, system 100 includes first data storage 110 and second data storage 120 coupled via data bus 130 to processor 140. Processor 140 is coupled to communication interface 170, which is configured to obtain a plurality of media data and a plurality of text data as inputs. For example, the plurality of media data comprises video files. In some cases, the plurality of media data comprises image files. Elements of system 100 can be configured together to perform a media-text correlating process, as further described below.
[0031] In some embodiments, first data storage 110, coupled to communication interface 170, is configured to store media and text data received from communication interface 170 for processing. Depending on the implementations, the media data may be received from, without limitation, content sharing platforms, content delivery services, social networking platforms, live streaming platforms, mobile applications and services, and/or the like. In some cases, the video files of the plurality of media data may be characterized by a frame rate greater than or equal to 30fps. In various implementations, first data storage 110 may include, without limitation, local and/or network accessible storage, a disk drive, a drive array, an optical storage device, and solid-state storage device, which can be programmable, flash-updateable, and/or the like.
[0032] In some embodiments, the processor 140 includes central processing unit (CPU) 150, graphics processing unit (GPU) 160, and/or the like. CPU 150 may be configured to handle various types of system functions, such as retrieving the plurality of media data and text data from first data storage 110, and executing executable instructions (e.g., visual feature extraction, textual feature extraction, affinity matrix calculation, feature mapping, etc.). In some embodiments, GPU 160 may be specially designed to provide image recognition and processing. In some implementations, GPU 160 is configured to provide image recognition for the plurality of media data. In various implementations, processor 140 is configured to retrieve media data and text data, and respectively extract visual features and textual features. The visual features and textual features may be temporarily stored in memory 180 and can form a plurality of training pairs for the following training process. Memory 180 may include a random-access memory (RAM) device, a data buffer device, and/or the like.

[0033] In certain embodiments, processor 140 is further configured to calculate an affinity matrix based on the plurality of visual features and the plurality of textual features. For example, the affinity matrix is associated with the semantic correlation between the visual and textual modalities. In a specific example, processor 140 is further configured to map the affinity matrix to a binary hash, which is configured for media-text correlation. In various implementations, second data storage 120 is coupled to processor 140 and configured to store the binary hash for further processing. Processor 140 can be coupled to each of the above mentioned components and be configured to communicate between these components. Second data storage 120 may include, without limitation, local and/or network accessible storage, a disk drive, a drive array, an optical storage device, and a solid-state storage device, which can be programmable, flash-updateable, and/or the like.
[0034] Other embodiments of the system include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods. Further details of methods for image/video retrieval, model training for media-text correlating, and related techniques are discussed with reference to the following figures.
[0035] Figure 2 is a simplified block diagram illustrating a process for correlating media and text according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. According to various embodiments, image/video retrieval may be implemented with system 100 in Figure 1, as shown. For example, media data 205 and text data 210 stored in first data storage 110 may be retrieved by processor 140 to perform a feature extraction process at module 215.
[0036] In various implementations, feature extraction module 215 extracts visual features 220 from the media data 205 and extracts textual features 225 from the text data 210. For example, GPU 160 in Figure 1 may perform an image recognition process for the plurality of media data to facilitate the visual feature extraction. In some cases, the plurality of visual features 220 and the plurality of textual features 225 form a plurality of training pairs 230 that are configured as inputs of separate pipelines for further parallel processing. In a specific example, feature extraction module 215 includes a pre-trained feature extraction model that is configured to predict the most relevant pairs of a batch of training data (e.g., media and text). The pre-trained feature extraction model provides a common semantic space for both visual and textual features, where visual features 220 and textual features 225 can be represented as semantic vectors in the same domain based on the semantic relations between the visual and textual vectors. It is to be appreciated that this common semantic space allows for the construction of a single hashing network, where the cross-modal information can be properly preserved.
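For illustration only, the following sketch shows one way the feature extraction of module 215 could be realized in Python/PyTorch; the visual_encoder and text_encoder objects stand in for a pre-trained encoder pair and are assumptions of this sketch, not part of the disclosure.

```python
# Minimal sketch of the feature-extraction step (module 215), assuming a
# frozen pre-trained encoder pair that projects both modalities into a
# shared semantic space; the encoder objects are illustrative placeholders.
import torch
import torch.nn.functional as F

def extract_features(video_frames, captions, visual_encoder, text_encoder):
    """Return L2-normalized visual and textual features for one batch."""
    with torch.no_grad():                    # the extractor is pre-trained and frozen
        f_v = visual_encoder(video_frames)   # (m, d) visual features F_V
        f_t = text_encoder(captions)         # (m, d) textual features F_T
    f_v = F.normalize(f_v, dim=-1)           # normalization makes dot products
    f_t = F.normalize(f_t, dim=-1)           # equal to cosine similarities
    return f_v, f_t
```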
[0037] In one aspect, visual features 220 and textual features 225 may be used for constructing an affinity matrix at an affinity matrix construction module 235. For example, a first matrix based on media-to-text similarity values is calculated using at least the plurality of visual features 220 and the plurality of textual features 225 (e.g., implemented with CPU 150). A second matrix based on text-to-media similarity values is calculated using at least the plurality of visual features 220 and the plurality of textual features 225 (e.g., implemented with CPU 150). In some cases, the first matrix is represented as $S_{VT}$, and the second matrix is represented as $S_{TV}$. $F_V$ and $F_T$ denote visual features 220 and textual features 225 extracted by the feature extraction module 215. The first and second matrices may be calculated as follows:

$S_{VT} = \cos(\hat{F}_V, \hat{F}_T) = \hat{F}_V \hat{F}_T^{\top} \in \mathbb{R}^{m \times m}$, and $S_{TV} = \cos(\hat{F}_T, \hat{F}_V) = \hat{F}_T \hat{F}_V^{\top} \in \mathbb{R}^{m \times m}$,

where m denotes the batch size (e.g., m = 32), $\cos(\cdot,\cdot)$ represents the cosine similarity, and $\hat{F}_V$ and $\hat{F}_T$ are the normalized ones from the extracted visual and textual features, respectively.
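As a hedged illustration of the calculation above, with row-normalized feature batches the two matrices reduce to plain matrix products; the function below is a sketch under that assumption, not the exact published implementation.

```python
# First and second matrices of paragraph [0037]: all pairwise cosine
# similarities for a batch of size m (e.g., m = 32).
import torch

def similarity_matrices(f_v_hat: torch.Tensor, f_t_hat: torch.Tensor):
    """f_v_hat, f_t_hat: (m, d) L2-normalized visual / textual features."""
    s_vt = f_v_hat @ f_t_hat.T   # media-to-text similarities S_VT, shape (m, m)
    s_tv = f_t_hat @ f_v_hat.T   # text-to-media similarities S_TV, shape (m, m)
    return s_vt, s_tv
```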
[0038] In the pre-defined semantic space, the paired media and text, corresponding to the diagonal elements of the first and second matrices, should be the closest to each other. However, these diagonal values can hardly reach one due to the modality gap between different modalities. Therefore, in various implementations, all diagonal elements of the first and second matrices are set to one to strengthen the correlation between the paired examples (e.g., media and text). For example, media-to-text similarity values positioned on a diagonal of the first matrix are set to one. In some cases, text-to-media similarity values positioned on a diagonal of the second matrix are set to one.
[0039] According to various embodiments, a cross-modal affinity matrix is calculated based on the first matrix and the second matrix, where the affinity matrix comprises average similarity values as shown below:

$S_C = \frac{1}{2}\left(S_{VT} + S_{TV}\right)$,

where $S_C$ denotes the affinity matrix. It is to be appreciated that the affinity matrix is made symmetric, similar to the case in the single modality, by calculating the average similarity values of the first matrix and the second matrix.
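A minimal sketch of the diagonal adjustment of paragraph [0038] and the averaging step above, assuming the matrices produced by the previous sketch:

```python
# Affinity-matrix construction (module 235): force the paired (diagonal)
# similarities to one, then average the two directional matrices so the
# resulting cross-modal affinity matrix S_C is symmetric.
import torch

def affinity_matrix(s_vt: torch.Tensor, s_tv: torch.Tensor) -> torch.Tensor:
    m = s_vt.shape[0]
    eye = torch.eye(m, device=s_vt.device).bool()
    s_vt = s_vt.masked_fill(eye, 1.0)   # paired media-to-text similarities set to one
    s_tv = s_tv.masked_fill(eye, 1.0)   # paired text-to-media similarities set to one
    return 0.5 * (s_vt + s_tv)          # S_C, symmetric by construction
```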
[0040] In various implementations, in addition to the paired video-text relations, the unpaired video-text relations are also taken into account to condition the constructed affinity matrix at a weight calculation module 240, where a dynamic weighting strategy may be adopted. In a specific example, a mean value, a minimum value, and a maximum value of the affinity matrix (denoted as smean, §mint and smax) of each batch is obtained (e.g., implemented with CPU 150 as shown in Figure 1). For each element in the affinity matrix (denoted as s£ i-), the distances between the average similarity value and the minimum value or the maximum value are calculated to determine whether the corresponding training pair is a “dissimilar pair” or “similar pair.” For example, each element may be compared against the mean value to determine whether it has a “dissimilar pair” or a “similar pair.” For example, the element may be determined as a “dissimilar pair” if it i s less than or equal to the mean value: the element may be determined as a “similar pair” if it is greater than or equal to the mean value. Further, weight values can then be calculated based on the distances between the average similarity values and the minimum value or the maximum value. The re-weighted S,- .- represented as
Figure imgf000012_0001
is calculated as follows: where the statu
Figure imgf000012_0005
s weights of W and r are computed by:
Figure imgf000012_0006
where W and W+ represent the weights of diffusing the “dissimilar pair” or “similar pair” respectively. are magnification factors for distance.
Figure imgf000012_0004
Accordingly, the unpaired video-text relations are made more distinctive by strengthening the off-diagonal values in the affinity matrix for better discriminative learning, which leads to a more refined network training. The re-weighted affinity matrix may be formed as S =
Figure imgf000012_0002
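Because the exact weighting expressions are not reproduced here, the sketch below is only one plausible instantiation of the described strategy: elements at or below the batch mean ("dissimilar pairs") are diffused toward the minimum, elements above the mean ("similar pairs") toward the maximum, with the assumed magnification factors gamma_neg and gamma_pos scaling the normalized distances.

```python
# Illustrative dynamic re-weighting (module 240); the weight formulas are an
# assumption of this sketch, chosen to spread off-diagonal values apart.
import torch

def reweight_affinity(s_c: torch.Tensor,
                      gamma_neg: float = 0.5,
                      gamma_pos: float = 0.5) -> torch.Tensor:
    s_mean, s_min, s_max = s_c.mean(), s_c.min(), s_c.max()
    eps = 1e-8
    # normalized distances from the batch mean, used as diffusion weights
    w_neg = (s_mean - s_c) / (s_mean - s_min + eps)   # large for very dissimilar pairs
    w_pos = (s_c - s_mean) / (s_max - s_mean + eps)   # large for very similar pairs
    dissimilar = s_c <= s_mean
    # push dissimilar pairs toward s_min and similar pairs toward s_max
    return torch.where(dissimilar,
                       s_c - gamma_neg * w_neg * (s_c - s_min),
                       s_c + gamma_pos * w_pos * (s_max - s_c))
```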
[0041] In a parallel pipeline, a reference hash 245 may be obtained. For example, the reference hash 245 is a multi-layer perceptron (MLP). In some cases, reference hash 245 may be a three-layer MLP. Reference hash 245 may be configured to generate a plurality of latent features 250 using at least the plurality of visual features 220 and the plurality of textual features 225 (e.g., implemented with CPU 150 in Figure 1). For example, the plurality of latent features 250 generated by reference hash 245 may be denoted as $H_V = [h_{v,1}, \ldots, h_{v,m}] \in [0,1]^{m \times Z}$ and $H_T = [h_{t,1}, \ldots, h_{t,m}] \in [0,1]^{m \times Z}$, corresponding to the inputs $F_V$ and $F_T$, where Z is the target encoding bit size (e.g., 256 or bigger).

[0042] In various implementations, a plurality of reference similarity values is obtained using the reference hash, the plurality of media data, and the plurality of text data. To preserve the relationship between different modalities (e.g., visual and textual), a loss associated with the plurality of reference similarity values is calculated at a loss control module 255. For example, the loss may be associated with an intra-modal similarity and/or an inter-modal similarity. The intra-modal similarity and the inter-modal similarity may be cosine similarity values. In an example, a loss associated with the intra-modal similarity is calculated based at least on a media-to-media similarity and a text-to-text similarity, which are respectively denoted as $\cos(H_V, H_V)$ and $\cos(H_T, H_T)$. In some cases, a loss associated with the inter-modal similarity is calculated based at least on a media-to-text similarity, denoted as $\cos(H_V, H_T)$. The loss associated with the intra-modal similarity and the loss associated with the inter-modal similarity may be calculated from these similarity values.
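The following sketch shows a plausible reference hash (a three-layer MLP producing codes in [0, 1]) and one assumed form of the similarity-preserving losses, namely mean-squared gaps between the re-weighted affinity matrix and the cosine similarities of the latent codes; the loss form is an illustration, not the exact published formulation.

```python
# Reference hash (245) and intra-/inter-modal losses of paragraphs [0041]-[0042].
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReferenceHash(nn.Module):
    """Three-layer MLP mapping d-dim features to Z-bit latent codes in [0, 1]."""
    def __init__(self, d: int, z: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d, d), nn.ReLU(),
            nn.Linear(d, d), nn.ReLU(),
            nn.Linear(d, z), nn.Sigmoid(),   # keeps H_V, H_T in [0, 1]^(m x Z)
        )

    def forward(self, x):
        return self.net(x)

def cos_sim(a, b):
    return F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T

def similarity_losses(h_v, h_t, s_tilde):
    """Assumed intra-/inter-modal losses against the re-weighted affinity matrix."""
    intra = F.mse_loss(cos_sim(h_v, h_v), s_tilde) + \
            F.mse_loss(cos_sim(h_t, h_t), s_tilde)
    inter = F.mse_loss(cos_sim(h_v, h_t), s_tilde)
    return intra, inter
```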
[0043] In some embodiments, a consistency preserving loss associated with a modality gap between the plurality of media data 205 and the plurality of text data 210 may be calculated based at least on the affinity matrix at loss control module 255.
[0044] According to some embodiments, all of the aforementioned losses may be considered to calculate a total loss to guide the hash learning process. The total loss may be calculated as follows:

$\mathcal{L}_{total} = \lambda_1 \mathcal{L}_{intra} + \lambda_2 \mathcal{L}_{inter} + \lambda_3 \mathcal{L}_{con}$,

where $\mathcal{L}_{intra}$, $\mathcal{L}_{inter}$, and $\mathcal{L}_{con}$ denote the intra-modal loss, the inter-modal loss, and the consistency preserving loss, and $\lambda_1$, $\lambda_2$, and $\lambda_3$ may be adaptively adjusted to control the comparable contributions of their corresponding losses for refining the hash model training.
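For example, with the sketched losses above, the total loss could be combined as follows; fixed scalar weights are assumed here even though the disclosure allows the contributions to be adjusted adaptively during training.

```python
# Illustrative combination of the losses in paragraph [0044].
def total_loss(l_intra, l_inter, l_consistency,
               lambda_1: float = 1.0, lambda_2: float = 1.0, lambda_3: float = 1.0):
    return lambda_1 * l_intra + lambda_2 * l_inter + lambda_3 * l_consistency
```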
[0045] In various implementations, a quantization module 260 may be used after the loss control module 255 to generate a hashing code. For example, quantization module 260 may be configured to quantize the updated affinity matrix (e.g., after the re-weighting step at weight calculation module 240) to binary values to obtain a binary hash 265. In some cases, quantization module 260 includes a hashing layer that maps the updated affinity matrix to a binary hash 265. The hashing layer may be denoted as B, and its forward mapping projects the latent features 250 into a Hamming space, where the dimensionality of the space is associated with the number of digits in words of a certain length. To obtain the binary code, quantization module 260 may first obtain a minimum value and a maximum value for each vector along the batch axis (denoted as $h_z^{min}$ and $h_z^{max}$), where the feature $H_z$ (z = 1, 2, ..., Z) represents the feature batch and Z is the dimension of the latent features (e.g., Z = 512). Quantization module 260 may then calculate distances between each element and the maximum and minimum values in the Hamming space. For example, quantization module 260 calculates a first distance between each of the latent features and the minimum value and a second distance between each of the latent features and the maximum value. Lastly, the binary value may be determined by comparing the first distance and the second distance. For example, quantization module 260 may quantize the updated affinity matrix to binary values based at least on the first distance and the second distance. In some cases, the elements of each vector are assigned to +1 if they are closer to $h_z^{max}$, or assigned to -1 if they are closer to $h_z^{min}$, as follows:

$b_{i,z} = +1$ if $|h_{i,z} - h_z^{max}| \le |h_{i,z} - h_z^{min}|$, and $b_{i,z} = -1$ otherwise,

where $H_z$ denotes the z-th dimension of the latent feature elements in the batch. The above quantization method, based on the calculation of minimum and maximum values, can group the features with a dynamic threshold based on local statistics, which advantageously preserves the relative relationships among the latent features. It is to be appreciated that projecting data from different modalities into a common Hamming space to learn binary codes results in a more efficient and faster image/video retrieval process, where the similarity is measured by a Hamming distance among elements in the Hamming space. The hashing-based network also effectively reduces the size of the original continuous semantic space (e.g., by 8 times), allowing for a fast retrieval process with less computational cost.
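A minimal sketch of this quantization rule, assuming latent features of shape (m, Z) and per-dimension batch statistics as described above:

```python
# Quantization step (module 260): each latent dimension is binarized against
# a dynamic threshold derived from its own batch minimum and maximum, so an
# element becomes +1 when closer to the maximum and -1 when closer to the minimum.
import torch

def quantize(h: torch.Tensor) -> torch.Tensor:
    """h: (m, Z) latent features; returns (m, Z) codes in {-1, +1}."""
    h_min = h.min(dim=0, keepdim=True).values   # per-dimension minimum over the batch
    h_max = h.max(dim=0, keepdim=True).values   # per-dimension maximum over the batch
    d_min = (h - h_min).abs()                   # first distance
    d_max = (h - h_max).abs()                   # second distance
    return torch.where(d_max <= d_min,
                       torch.ones_like(h), -torch.ones_like(h))
```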
[0046] Figure 3 is a simplified flow diagram illustrating a method for correlating media and text according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. For example, one or more steps may be added, removed, repeated, rearranged, replaced, modified, and/or overlapped, and they should not limit the scope of the claims.
[0047] As shown, method 300 includes step 302 of obtaining a plurality of media data from a first storage, and step 304 of obtaining a plurality of text data. Referring to system 100 of Figure 1, processor 140 can be configured to obtain the plurality of media data and/or the plurality of text data from communication interface 170. In some cases, the plurality of media data and/or the plurality of text data may be stored at first data storage 110. For example, the plurality of media data comprises video files and/or image files.
[0048] In steps 306 and 308, method 300 includes calculating a first matrix based on media-to-text similarity values and a second matrix based on text-to-media similarity values. In a specific example, processor 140 is configured to retrieve the plurality of media data and/or the plurality of text data stored in first data storage 110 to obtain a plurality of visual features and/or a plurality of textual features via a feature extraction process. The extracted visual features and textual features can then be used to calculate the first matrix and the second matrix based on a plurality of media-to-text similarity values and a plurality of text-to-media similarity values, respectively. For example, the media-to-text similarity values and text-to-media similarity values may be cosine similarity values, which indicate the semantic relationships between the media and text data.
[0049] In step 310, method 300 includes calculating an affinity matrix. The affinity matrix may be calculated based at least on the first matrix and the second matrix. In a specific example, before calculating the affinity matrix, a plurality of media-to-text similarity values positioned on a diagonal of the first matrix, corresponding to the most related media-text pairs, are set to one to strengthen the correlation between the paired media and text data. In some embodiments, the affinity matrix includes average similarity values. For example, the affinity matrix is calculated as an average of the first matrix and the second matrix, where the unpaired media and text data may also be taken into account for affinity matrix construction.

[0050] In step 312, method 300 includes obtaining a minimum value and a maximum value from the affinity matrix. Referring to system 100 of Figure 1, processor 140 can be configured to obtain the mean, minimum, and maximum values of the affinity matrix, which may later be used for weight adjustment of the affinity matrix to preserve the relations between the unpaired media and text data by strengthening the off-diagonal values in the affinity matrix.
[0051] In step 314, method 300 includes calculating weight values based on distances between average similarity values and the minimum value or the maximum value. For each element in the affinity matrix, the distances between the element and the maximum and minimum values may be calculated and compared. In some cases, each element may be compared against the mean value to determine whether it corresponds to a "dissimilar pair" or a "similar pair." For example, the element may be determined as a "dissimilar pair" if it is less than or equal to the mean value, and as a "similar pair" if it is greater than or equal to the mean value. Accordingly, the weights (denoted as $W^{-}$ and $W^{+}$) can be calculated, where $W^{-}$ and $W^{+}$ represent the weights of diffusing the "dissimilar pair" and the "similar pair," respectively, scaled by magnification factors for distance.
[0052] In step 316, method 300 includes updating the affinity matrix using the weight values. It is to be appreciated that the aforementioned weight calculation and adjustment process allows for better discriminative learning by equalizing the distribution of the distances in the affinity matrix.
[0053] In step 318, method 300 includes mapping the updated affinity matrix to a binary hash. For example, a multi-layer perceptron (e.g., a three-layer MLP) may be used to map the updated affinity matrix to a binary hash. In some cases, the plurality of visual features and the plurality of textual features may be fed to the three-layer MLP as inputs to generate a plurality of latent features that are later associated with the affinity matrix in loss computation, which in turn guides the hash network training (as shown in Figure 2). According to some embodiments, method 300 further includes quantizing the updated affinity matrix to binary values, where a binary code may be obtained. For example, method 300 may include mapping the plurality of latent features into a Hamming space, where a minimum value and a maximum value for each vector along the batch axis are acquired. The binary code may be obtained by comparing distances between each element and the maximum and minimum values in the Hamming space. For example, the elements of each vector are assigned to +1 if they are close to the maximum value, or assigned to -1 if they are close to the minimum value.
[0054] In step 320, method 300 includes storing the binary hash. For example, the binary hash is stored in second data storage 120 of Figure 1. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.
[0055] Figure 4 is a simplified flow diagram illustrating a method for correlating media and text according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. For example, one or more steps may be added, removed, repeated, rearranged, replaced, modified, and/or overlapped, and they should not limit the scope of the claims.
[0056] As shown, method 400 includes step 402 of obtaining a reference hash. For example, the reference hash is a three-layer MLP. The reference hash is configured to generate a plurality of latent features, which may later be used in conjunction with the affinity matrix in generating a loss for guiding the training of the hashing model.
[0057] In step 404, method 400 includes obtaining a plurality of reference similarity values using the reference hash, the plurality of media data, and the plurality of text data. Method 400 may also include calculating a loss based at least on a media-to-media similarity and a media-to-text similarity. In various implementations, a loss associated with the plurality of reference similarity values is calculated to preserve the relationship between different modalities (e.g., visual and textual). For example, the loss may be associated with an intra-modal similarity and/or an inter-modal similarity. In some embodiments, a loss associated with the intra-modal similarity may be calculated based at least on a media-to-media similarity and a text-to-text similarity. In other examples, a loss associated with the inter-modal similarity is calculated based at least on a media-to-text similarity. According to some embodiments, method 400 may further include calculating a consistency preserving loss based at least on the affinity matrix. The consistency preserving loss is associated with a modality gap between the plurality of media data and the plurality of text data.
[0058] In steps 406 and 408, method 400 includes comparing the reference hash against the binary hash and updating the reference hash. Depending on the implementation, one or more of the aforementioned losses may be included to calculate a total loss to guide the training process. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.
[0059] Figure 5 is a simplified block diagram illustrating a process 500 for correlating media and text according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications.

[0060] In various implementations, process 500 may be implemented with system 100 of Figure 1. The system 100 may be a client device, a server, and/or the like. For example, system 100 obtains a media file 505 and a text query 510 from communication interface 170. In some cases, media file 505 is a video file that comprises a plurality of continuous image frames. The video file may be characterized by a frame rate greater than or equal to 30fps. Depending on the implementation, the media file 505 can be divided into one or more video segments illustrating various scenes, including a video segment 540 that is the most relevant to the text query 510.
[0061] As shown in Figure 5, an example of the media file 505 includes a first video segment depicting a person running (e.g., the first frame of 505) and a second video segment depicting a person playing basketball (e.g., the second to fourth frames of 505). In some cases, the text query 510 may be received from a user through a user interface (e.g., a touch screen). The text query 510 may include one or more words or sentences. An example of text query 510 is "One person is playing basketball." The system 100 may then correlate the media file 505 and the text query 510 and retrieve a video segment 540 that is the most relevant to the text query 510 (e.g., the second video segment of 505).
[0062] In various implementations, a visual feature 515 is extracted for the media file 505 and a textual feature 525 is extracted for the text query 510 (e.g., implemented with CPU 150 and/or GPU 160 of Figure 1). Textual feature 525 is fed into a model-based process module 530, which may include a pre-trained hashing model to generate hash code 535. The hash code 535 may then be fed to a binary hash 520 as input to perform a hash-based search. In certain embodiments, hash code 535 may be characterized by a relatively large bit size (e.g., 256, 512, 1024, 2048, or bigger), which allows for more flexible semantic representation and is more suitable for maintaining the semantic information in video retrieval. In some cases, the binary hash 520 may be previously trained using a plurality of local media files and then stored at local storage (e.g., second data storage 120 of Figure 1). It is to be appreciated that such a hashing-based search process, where the similarity is measured in Hamming space, provides faster and more efficient video/image retrieval compared with the conventional non-hashing method based on matrix multiplication.
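As a hedged illustration of this retrieval step, the snippet below compares a query hash code against pre-computed segment codes by Hamming distance and returns the closest segment; codes in {-1, +1} are assumed, and the function name is a placeholder.

```python
# Hash-based retrieval: for +/-1 codes, Hamming distance = (Z - dot product) / 2.
import torch

def retrieve(query_code: torch.Tensor, segment_codes: torch.Tensor) -> int:
    """query_code: (Z,), segment_codes: (n, Z); returns index of the best segment."""
    z = query_code.numel()
    hamming = (z - segment_codes @ query_code) / 2
    return int(torch.argmin(hamming).item())
```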
[0063] While the above is a full description of the specific embodiments, various modifications, alternative constructions and equivalents may be used. Therefore, the above description and illustrations should not be taken as limiting the scope of the present invention which is defined by the appended claims.

Claims

WHAT IS CLAIMED IS:
1. A method for correlating media and text, the method comprising: obtaining a plurality of media data from a first data storage; obtaining a plurality of text data; calculating a first matrix based on media-to-text similarity values; calculating a second matrix based on text-to-media similarity values; calculating an affinity matrix based on the first matrix and the second matrix, the affinity matrix comprising average similarity values; obtaining a minimum value and a maximum value from the affinity matrix; calculating weight values based on distances between average similarity values and the minimum value or the maximum value; updating the affinity matrix using the weight values; generating a binary hash using at least the updated affinity matrix; and storing the binary hash.
2. The method of claim 1 further comprising: obtaining a reference hash; obtaining a plurality of reference similarity values using the reference hash and the plurality of media data and the plurality of text data; comparing the reference hash against the binary hash; and updating the reference hash.
3. The method of claim 1 where the plurality of media data comprises video files.
4. The method of claim 1 where the plurality of media data comprises image files.
5. The method of claim 1 further comprising performing image recognition for the plurality of media data using a graphical processing unit.
6. The method of claim 1 further comprising calculating a loss based at least on a media-to-media similarity and a media-to-text similarity.
7. The method of claim 1 further comprising setting a plurality of media-to-text similarity values to one, the plurality of media-to-text similarity values being positioned on a diagonal of the first matrix.
8. The method of claim 1 wherein the binary hash is stored in a second data storage.
9. The method of claim 1 further comprising quantizing the updated affinity matrix to binary values.
10. A system for correlating media and text, the system comprising: a communication interface configured to obtain a plurality of media data and a plurality of text data; a first storage coupled to the communication interface, the first storage being configured to store the plurality of media data and the plurality of text data; a processor coupled to the first storage, the processor being configured to: extracting a plurality of visual features from the plurality of media data; extracting a plurality of textual features from the plurality of text data; calculating an affinity matrix based on the plurality of visual features and the plurality of textual features using at least media-to-text similarity values and text-to-media similarity values; obtaining a minimum value and a maximum value from the affinity matrix; calculating weight values based on distances between average similarity values and the minimum value or the maximum value; updating the affinity matrix using the weight values; and generating a binary hash using at least the updated affinity matrix; a second data storage coupled to the processor, the second data storage being configured to store the binary hash; and a memory coupled to the processor, the memory being configured to store the plurality of visual features and the plurality of textual features.
11. The system of claim 10, wherein the plurality of media data comprises video files and/or image files.
12. The system of claim 10, wherein the processor comprises a graphical processing unit configured to perform image recognition for the plurality of media data.
13. The system of claim 10, wherein the processor comprises a central processing unit.
14. A method for correlating media and text, the method comprising: obtaining a plurality of media data from a first data storage; obtaining a plurality of text data; extracting a plurality of visual features from the plurality of media data; extracting a plurality of textual features from the plurality of text data; calculating a first matrix based on media-to-text similarity values using at least the plurality of visual features and the plurality of textual features; calculating a second matrix based on text-to-media similarity values using at least the plurality of visual features and the plurality of textual features; calculating an affinity matrix based on the first matrix and the second matrix, the affinity matrix comprising average similarity values; obtaining a minimum value and a maximum value from the affinity matrix; calculating weight values based on distances between average similarity values and the minimum value or the maximum value; updating the affinity matrix using the weight values; generating a binary hash using at least the updated affinity matrix; and storing the binary hash.
15. The method of claim 14, wherein the plurality of media data comprises video files characterized by a frame rate greater than or equal to 30fps.
16. The method of claim 14, wherein the media-to-text similarity values and the text-to-media similarity values are cosine similarity values.
17. The method of claim 14 further comprising calculating a consistency preserving loss based at least on the affinity matrix.
18. The method of claim 14 wherein the consistency preserving loss is associated with a modality gap between the plurality of media data and the plurality of text data.
19. The method of claim 14 further comprising: generating a plurality of latent features using at least the plurality of visual features and the plurality of textual features; obtaining a minimum value and a maximum value for each of the plurality of latent features; calculating a first distance between each of the latent features and the minimum value; calculating a second distance between each of the latent features and the maximum value; and comparing the first distance and the second distance.

20. The method of claim 19 further comprising quantizing the updated affinity matrix to binary values based at least on the first distance and the second distance.
PCT/US2022/050136 2021-11-18 2022-11-16 Methods and systems for correlating video and text WO2023091507A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202163280894P 2021-11-18 2021-11-18
US63/280,894 2021-11-18

Publications (1)

Publication Number Publication Date
WO2023091507A1 true WO2023091507A1 (en) 2023-05-25

Family

ID=86397741

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2022/050136 WO2023091507A1 (en) 2021-11-18 2022-11-16 Methods and systems for correlating video and text

Country Status (1)

Country Link
WO (1) WO2023091507A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080235216A1 (en) * 2007-03-23 2008-09-25 Ruttenberg Steven E Method of predicitng affinity between entities
US20110125763A1 (en) * 2009-11-24 2011-05-26 Nokia Corporation Method and apparatus for determining similarity of media interest
US20130159110A1 (en) * 2011-12-14 2013-06-20 Giridhar Rajaram Targeting users of a social networking system based on interest intensity
US20140108386A1 (en) * 2012-04-06 2014-04-17 Myspace, Llc Method and system for providing an affinity between entities on a social network
US20150169635A1 (en) * 2009-09-03 2015-06-18 Google Inc. Grouping of image search results



Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22896438

Country of ref document: EP

Kind code of ref document: A1