CN116958868A - Method and device for determining similarity between text and video


Publication number
CN116958868A
Authority
CN
China
Prior art keywords
feature
similarity
text
image
video
Prior art date
Legal status
Pending
Application number
CN202310906058.1A
Other languages
Chinese (zh)
Inventor
蒋晨
刘洪�
俞旭铮
郭清沛
Current Assignee
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202310906058.1A
Publication of CN116958868A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/761 Proximity, similarity or dissimilarity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/60 Type of objects
    • G06V 20/62 Text, e.g. of license plates, overlay texts or captions on TV images

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present specification provide a method and apparatus for determining the similarity between a text and a video. In the method, the text and the video of an acquired text-video pair are provided to a text feature extraction model and a video feature extraction model, respectively, to obtain a corresponding word feature sequence and image feature sequence. Related word feature-image feature pairs are determined according to the similarity between each word feature and each image feature. For each related word feature-image feature pair, the similarity between the word feature and the image feature and the similarity between the similar image feature corresponding to that image feature and the word feature sequence are aggregated to generate a similar image constraint similarity. The similarity between the text and the video of the text-video pair is then determined based on the obtained similar image constraint similarities.

Description

Method and device for determining similarity between text and video
Technical Field
Embodiments of the present specification relate generally to the field of computer technology, and more particularly, to a method for determining similarity between text and video, a text-to-video retrieval method, and a method and apparatus for training a feature extraction model.
Background
With the rapid development of Internet technology, the scale of online video keeps growing, and tasks such as text-video retrieval place increasing demands on accurately computing the semantic similarity between a text and a video. A common approach is to map the text and the video into a feature space and then use a distance metric such as cosine similarity to measure their semantic similarity. The accuracy of the similarity computed in this way depends heavily on the representation capability of the extracted semantic features, so a scheme that can accurately compute the semantic similarity between text and video is needed.
Disclosure of Invention
In view of the foregoing, embodiments of the present specification provide a method for determining similarity between text and video, a text-to-video retrieval method, a method and apparatus for training a feature extraction model. By using the method and the device, the semantic similarity between the text and the video can be accurately calculated.
According to an aspect of embodiments of the present specification, there is provided a method for determining similarity between text and video, comprising: providing the text and the video included in an acquired text-video pair to a text feature extraction model and a video feature extraction model, respectively, to obtain corresponding text features and video features, wherein the text features comprise word features corresponding to each word (token) included in the text, and the video features comprise an image feature sequence extracted based on the images included in the video; determining, according to the similarity between the image features in the image feature sequence, the similar image feature corresponding to each image feature; determining related word feature-image feature pairs according to the similarity between the text features and each image feature in the image feature sequence; for each related word feature-image feature pair, aggregating the similarity between the word feature and the image feature indicated by the related word feature-image feature pair and the similarity between the similar image feature corresponding to that image feature and the text features, to generate a similar image constraint similarity; and determining the similarity between the text and the video based on the obtained similar image constraint similarities.
According to another aspect of embodiments of the present specification, there is provided a text-video retrieval method, comprising: receiving a query text provided by a user; determining, according to the method for determining similarity between text and video described above, the similarity between the query text and the candidate video included in each query text-video pair, wherein each query text-video pair is formed from the query text and one candidate video in a candidate video set; determining a matching video from the candidate video set as a video search result based on the determined similarities; and providing the video search result to the user.
According to another aspect of embodiments of the present specification, there is provided a method for training a feature extraction model, the feature extraction model comprising a text feature extraction model and a video feature extraction model, the method comprising: cyclically performing the following model training process using a training sample set until a training end condition is met, wherein each training sample in the training sample set comprises a positive text-video pair composed of matched text data and video data or a negative text-video pair composed of unmatched text data and video data: providing the text data of each current training sample in the current training sample set to the current text feature extraction model to obtain the text features of each current training sample; providing the video data of each current training sample to the current video feature extraction model to obtain the image feature sequence of each current training sample; for each current training sample, determining the similar image feature corresponding to each image feature according to the similarity between the image features in the image feature sequence of the current training sample; for each current training sample, determining related word feature-image feature pairs according to the similarity between the text features of the current training sample and each image feature in the corresponding image feature sequence; for each related word feature-image feature pair, aggregating the similarity between the word feature and the image feature indicated by the related word feature-image feature pair and the similarity between the similar image feature corresponding to that image feature and the text features, to generate a similar image constraint similarity; determining the similarity between the text data of the current training sample and the corresponding video data based on the obtained similar image constraint similarities; determining a contrast loss value corresponding to the current training sample set based on the determined similarity between the text data and the corresponding video data of each current training sample; and, in response to determining that the training end condition is not met, adjusting the model parameters of the current feature extraction model according to the contrast loss value, the feature extraction model with adjusted model parameters serving as the current feature extraction model for the next model training process.
According to still another aspect of embodiments of the present specification, there is provided an apparatus for determining similarity between text and video, comprising: a feature extraction unit configured to provide the text and the video included in an acquired text-video pair to a text feature extraction model and a video feature extraction model, respectively, to obtain corresponding text features and video features, wherein the text features comprise word features corresponding to each word included in the text, and the video features comprise an image feature sequence extracted based on the images included in the video; a similar feature determining unit configured to determine the similar image feature corresponding to each image feature according to the similarity between the image features in the image feature sequence; a related feature pair determining unit configured to determine related word feature-image feature pairs according to the similarity between the text features and each image feature in the image feature sequence; a similarity aggregation unit configured to aggregate, for each related word feature-image feature pair, the similarity between the word feature and the image feature indicated by the related word feature-image feature pair and the similarity between the similar image feature corresponding to that image feature and the text features, to generate a similar image constraint similarity; and a similarity determination unit configured to determine the similarity between the text and the video based on the obtained similar image constraint similarities.
According to still another aspect of embodiments of the present specification, there is provided a text-video retrieval apparatus, comprising: a text receiving unit configured to receive a query text provided by a user; a similarity calculation unit configured to determine, by means of the apparatus for determining similarity between text and video described above, the similarity between the query text and the candidate video included in each query text-video pair, wherein each query text-video pair is formed from the query text and one candidate video in a candidate video set; and a video result providing unit configured to determine a matching video from the candidate video set as a video search result based on the determined similarities, and to provide the video search result to the user.
According to another aspect of embodiments of the present specification, there is provided an apparatus for training a feature extraction model, the feature extraction model comprising a text feature extraction model and a video feature extraction model, the apparatus being configured to cyclically perform a model training process with a training sample set via a training unit until a training end condition is met, each training sample in the training sample set comprising a positive text-video pair composed of matched text data and video data or a negative text-video pair composed of unmatched text data and video data, the training unit comprising: a feature extraction module configured to provide the text data of each current training sample in the current training sample set to the current text feature extraction model to obtain the text features of each current training sample, and to provide the video data of each current training sample to the current video feature extraction model to obtain the image feature sequence of each current training sample; a similar feature determining module configured to determine, for each current training sample, the similar image feature corresponding to each image feature according to the similarity between the image features in the image feature sequence of the current training sample; a similarity determination module configured to determine, for each current training sample, related word feature-image feature pairs according to the similarity between the text features of the current training sample and each image feature in the corresponding image feature sequence, to aggregate, for each related word feature-image feature pair, the similarity between the word feature and the image feature indicated by the related word feature-image feature pair and the similarity between the similar image feature corresponding to that image feature and the text features, to generate a similar image constraint similarity, and to determine the similarity between the text data of the current training sample and the corresponding video data based on the obtained similar image constraint similarities; and a contrast loss calculation module configured to determine a contrast loss value corresponding to the current training sample set based on the determined similarity between the text data and the corresponding video data of each current training sample. The apparatus further comprises a parameter adjustment unit configured to adjust, in response to determining that the training end condition is not met, the model parameters of the current feature extraction model according to the contrast loss value, the feature extraction model with adjusted model parameters serving as the current feature extraction model for the next model training process.
According to another aspect of embodiments of the present specification, there is provided an apparatus for determining similarity between text and video, comprising: at least one processor, and a memory coupled to the at least one processor, the memory storing instructions that, when executed by the at least one processor, cause the at least one processor to perform the method for determining similarity between text and video as described above.
According to another aspect of the embodiments of the present specification, there is provided a text video retrieval apparatus including: at least one processor, and a memory coupled to the at least one processor, the memory storing instructions that, when executed by the at least one processor, cause the at least one processor to perform the text-to-video retrieval method as described above.
According to another aspect of embodiments of the present specification, there is provided an apparatus for training a feature extraction model, comprising: at least one processor, and a memory coupled to the at least one processor, the memory storing instructions that, when executed by the at least one processor, cause the at least one processor to perform the method for training a feature extraction model as described above.
According to another aspect of embodiments of the present specification, there is provided a computer readable storage medium storing a computer program which when executed by a processor implements a method for determining similarity between text and video, a text-to-video retrieval method and/or a method for training a feature extraction model as described above.
According to another aspect of embodiments of the present specification, there is provided a computer program product comprising a computer program that is executed by a processor to implement a method for determining similarity between text and video, a text-to-video retrieval method, and/or a method for training a feature extraction model as described above.
Drawings
A further understanding of the nature and advantages of the present description may be realized by reference to the following drawings. In the drawings, similar components or features may have the same reference numerals.
Fig. 1 illustrates an exemplary architecture of a method for determining similarity between text and video, a text-to-video retrieval method, a method and apparatus for training a feature extraction model according to an embodiment of the present description.
Fig. 2 shows a flowchart of one example of a method for determining similarity between text and video according to an embodiment of the present description.
Fig. 3 shows a flowchart of one example of a determination process of a close image feature according to an embodiment of the present specification.
Fig. 4 is a schematic diagram showing an example of a determination process of the close image constraint similarity according to the embodiment of the present specification.
Fig. 5 shows a flowchart of yet another example of a method for determining similarity between text and video according to an embodiment of the present description.
Fig. 6 is a flowchart showing an example of a determination process of the weights corresponding to the respective word features according to an embodiment of the present specification.
Fig. 7 shows a flowchart of one example of a text video retrieval method according to an embodiment of the present specification.
FIG. 8 illustrates a flowchart of one example of a method for training a feature extraction model according to an embodiment of the present description.
Fig. 9 shows a flowchart of one example of a determination process of a contrast loss value according to an embodiment of the present specification.
Fig. 10 shows a schematic diagram of one example of an application scenario of a method for training a feature extraction model according to an embodiment of the present specification.
Fig. 11 shows a block diagram of one example of an apparatus for determining similarity between text and video according to an embodiment of the present specification.
Fig. 12 shows a block diagram of one example of a text-video retrieving device according to an embodiment of the present specification.
Fig. 13 shows a block diagram of one example of an apparatus for training a feature extraction model according to an embodiment of the present description.
Fig. 14 shows a block diagram of one example of a training unit in an apparatus for training a feature extraction model according to an embodiment of the present specification.
Fig. 15 shows a block diagram of one example of an apparatus for determining similarity between text and video according to an embodiment of the present specification.
Fig. 16 shows a block diagram of one example of a text-video retrieving device according to an embodiment of the present specification.
Fig. 17 shows a block diagram of one example of an apparatus for training a feature extraction model according to an embodiment of the present description.
Detailed Description
The subject matter described herein will be discussed below with reference to example embodiments. It should be appreciated that these embodiments are discussed only to enable a person skilled in the art to better understand and thereby practice the subject matter described herein, and are not limiting of the scope, applicability, or examples set forth in the claims. Changes may be made in the function and arrangement of elements discussed without departing from the scope of the embodiments herein. Various examples may omit, replace, or add various procedures or components as desired. In addition, features described with respect to some examples may be combined in other examples as well.
As used herein, the term "comprising" and variations thereof are open-ended terms meaning "including, but not limited to". The term "based on" means "based at least in part on". The terms "one embodiment" and "an embodiment" mean "at least one embodiment". The term "another embodiment" means "at least one other embodiment". The terms "first", "second", and the like may refer to different or the same objects. Other definitions, whether explicit or implicit, may be included below. Unless the context clearly indicates otherwise, the definition of a term is consistent throughout this specification.
A method for determining similarity between text and video, a text video retrieval method, a method and an apparatus for training a feature extraction model according to embodiments of the present specification will be described in detail with reference to the accompanying drawings.
Fig. 1 illustrates an exemplary architecture 100 of a method for determining similarity between text and video, a text-to-video retrieval method, a method and apparatus for training a feature extraction model, according to an embodiment of the present description.
In fig. 1, a network 110 is employed to interconnect between a terminal device 120 and an application server 130.
Network 110 may be any type of network capable of interconnecting network entities. The network 110 may be a single network or a combination of networks. In terms of coverage, network 110 may be a Local Area Network (LAN), a Wide Area Network (WAN), or the like. In terms of carrier medium, the network 110 may be a wired network, a wireless network, or the like. In terms of data switching technology, the network 110 may be a circuit-switched network, a packet-switched network, or the like.
Terminal device 120 may be any type of electronic computing device capable of connecting to network 110, accessing servers or websites on network 110, processing data or signals, and the like. For example, the terminal device 120 may be a desktop computer, a notebook computer, a tablet computer, a smart phone, or the like. Although only one terminal device is shown in fig. 1, it should be understood that there may be a different number of terminal devices connected to the network 110.
In one embodiment, the terminal device 120 may be used by a user. Terminal device 120 may include an application client (e.g., application client 121) that may provide various services to a user. In some cases, application client 121 may interact with application server 130. For example, the application client 121 may transmit a message input by a user to the application server 130 and receive a response associated with the message from the application server 130. Herein, a "message" may refer to any input information, such as query text from user input, and the like.
The application server 130 may store a text feature extraction model and a video feature extraction model to determine the similarity between text and video. The application server 130 may be connected to a video database 140, in which the candidate videos may be stored. The application server 130 may also be connected to a model training server 150, which may be used to train the text feature extraction model and the video feature extraction model. However, it should be appreciated that, in other cases, instead of interacting with the video database 140 and the model training server 150, the application server 130 may also store the candidate videos locally and train the text feature extraction model and the video feature extraction model itself.
It should be appreciated that all network entities shown in fig. 1 are exemplary and that any other network entity may be involved in architecture 100, depending on the particular application requirements.
Fig. 2 shows a flow chart of a method 200 for determining similarity between text and video according to an embodiment of the present description.
As shown in fig. 2, at 210, the text and video included in the acquired text-video pair are provided to a text feature extraction model and a video feature extraction model, respectively, to obtain corresponding text features and video features.
In this embodiment, the text feature extraction model and the video feature extraction model may be models that are trained based on a feature extraction backbone model to generate high-dimensional features of text and video. In one example, the feature extraction backbone model may include, but is not limited to, at least one of: a convolutional neural network (CNN), a recurrent neural network (RNN), a CLIP model, a Transformer model, a Generative Pre-Training (GPT) model, and the like.
In this embodiment, the text feature extraction model may obtain, from the text, a word feature corresponding to each word (token) included in the text. In one example, if the text contains M words, the text features may be represented, for example, as {t_1, t_2, …, t_M} ∈ R^(M×D), where D may refer to the length of each word feature (e.g., t_1, t_2, …).
The video feature extraction model may obtain an image feature sequence corresponding to the video. In one example, the video may be sampled at a certain sampling rate (e.g., 1 frame per second) to obtain video frames, and the frame feature corresponding to each sampled video frame may then be obtained with the video feature extraction model. For example, the frame feature sequence may be represented as {f_1, f_2, …, f_N} ∈ R^(N×D), where N may refer to the number of video frames and D may refer to the length of each frame feature (e.g., f_1, f_2, …). In one example, each video frame may first be divided into image blocks (patches), a target number of image blocks may then be selected from the divided image blocks, and the image block features corresponding to the selected image blocks may be obtained with the video feature extraction model. For example, the image block feature sequence may be represented as an element of R^(N×K×D), where N may refer to the number of video frames, K may refer to the number of image blocks selected from each video frame, and D may refer to the length of each image block feature. In one example, a pre-trained image block importance assessment model may be used to score the importance of the image blocks of the same video frame, and the target number of image blocks may be selected according to the scores. The image block importance assessment model may be constructed, for example, based on a multilayer perceptron (MLP) and a softmax layer. In one example, the image feature sequence (which may be denoted, for example, as v_h) may be derived from at least one of the obtained frame feature sequence and image block feature sequence.
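To make the image block scoring and selection described above concrete, the following Python sketch scores the blocks of each frame with an MLP-plus-softmax model and keeps the K highest-scoring blocks. It is only an illustrative outline; the class name PatchImportance, the MLP width, and the tensor shapes are assumptions, not the patent's implementation.

import torch
import torch.nn as nn

class PatchImportance(nn.Module):
    """Hypothetical image block importance model: MLP followed by softmax over patches."""
    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim // 2), nn.ReLU(), nn.Linear(dim // 2, 1))

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (N, P, D) -> per-frame importance scores (N, P)
        scores = self.mlp(patch_feats).squeeze(-1)
        return scores.softmax(dim=-1)

def select_patches(patch_feats: torch.Tensor, scorer: PatchImportance, k: int) -> torch.Tensor:
    """Keep the K highest-scoring image blocks of each frame: (N, P, D) -> (N, K, D)."""
    scores = scorer(patch_feats)                          # (N, P)
    top_idx = scores.topk(k, dim=-1).indices              # (N, K)
    gather_idx = top_idx.unsqueeze(-1).expand(-1, -1, patch_feats.size(-1))
    return patch_feats.gather(1, gather_idx)              # (N, K, D)

The selected block features, together with the frame features, could then be assembled into the image feature sequence mentioned above.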
At 220, a near image feature corresponding to each image feature is determined based on the similarity between each image feature in the sequence of image features.
In one example, for each image feature in the image feature sequence, the similarity between that image feature and the other image features in the sequence is determined. The image features whose similarity is greater than a predetermined threshold may be determined as the similar image features corresponding to that image feature.
It should be noted that the similarity may be measured with a distance metric such as cosine similarity.
Optionally, with continued reference to fig. 3, fig. 3 shows a flowchart of one example of a process 300 of determining similar image features according to an embodiment of the present description.
At 310, the similarity between each image feature in the sequence of image features is determined, resulting in an inter-image self-similarity matrix.
In one example, if the image feature sequence includes N image features, the dimension of the inter-image self-similarity matrix may be represented as n×n. Wherein, the element of the ith row and the jth column in the inter-image self-similarity matrix can be used to represent the similarity between the ith image feature and the jth image feature.
At 320, for each image feature in the image feature sequence, the similar image feature corresponding to that image feature is determined according to the maximum of the similarities between that image feature and the other image features indicated by the inter-image self-similarity matrix.
In one example, for the i-th image feature, the image feature indicated by the column (or row) index of the maximum value among all elements of the i-th row (or column) other than the diagonal element may be determined as the similar image feature corresponding to the i-th image feature.
Optionally, in one example, it may further be determined whether the determined maximum similarity is greater than a predetermined threshold. If so, the image feature corresponding to that maximum similarity is determined as the corresponding similar image feature. If not, there is no other image feature in the image feature sequence that is sufficiently similar to this image feature.
In this way, the image feature most similar to each image feature can be determined as its similar image feature, which provides a basis for subsequently combining the similar image feature with the corresponding word feature to obtain a more accurate nearest-frame-constrained similarity.
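As a minimal illustration of steps 310-320, the sketch below builds the inter-image self-similarity matrix with cosine similarity, masks the diagonal, and takes the most similar other image feature for each row, with the optional threshold check. It is an assumption-laden outline, not the patent's implementation.

import torch

def nearest_image_features(img_feats: torch.Tensor, threshold: float = 0.0):
    """For each image feature (row of an (N, D) tensor), return the index of its most
    similar other image feature, or -1 when no similarity exceeds `threshold`."""
    f = torch.nn.functional.normalize(img_feats, dim=-1)
    sim = f @ f.t()                                    # (N, N) cosine self-similarity matrix
    sim.fill_diagonal_(float("-inf"))                  # ignore self-matches
    best_sim, best_idx = sim.max(dim=-1)               # most similar other feature per row
    best_idx[best_sim <= threshold] = -1               # optional threshold check
    return best_idx, best_sim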
Returning to Fig. 2, at 230, related word feature-image feature pairs are determined based on the similarity between the text features and each image feature in the image feature sequence.
In this embodiment, the similarity between the text features and each image feature in the image feature sequence may be determined to obtain a text-image similarity matrix. In one example, if the text features include M word features and the image feature sequence includes N image features, the dimension of the text-image similarity matrix may be expressed as M×N, where the element in the i-th row (i = 1, 2, …, M) and the j-th column (j = 1, 2, …, N) represents the similarity between the i-th word feature and the j-th image feature. In one example, a word feature and an image feature whose similarity in the text-image similarity matrix is greater than a predetermined value may be determined as a related word feature-image feature pair.
Optionally, in one example, for each word feature, a corresponding related image feature is selected from the image feature sequence according to the similarity between the word feature and each image feature in the image feature sequence, to generate a related word feature-image feature pair. For example, an image feature whose similarity to the word feature is greater than a predetermined value may be selected from the image feature sequence as the related image feature. As another example, the image feature with the maximum similarity to the word feature may be selected from the image feature sequence as the related image feature. The related word feature-image feature pairs can thus be obtained from the text-to-image perspective.
Optionally, in one example, for each image feature in the image feature sequence, a corresponding related word feature is selected from the word features according to the similarity between the image feature and each word feature, to generate a related word feature-image feature pair. For example, a word feature whose similarity to the image feature is greater than a predetermined value may be selected from the word features as the related word feature. As another example, the word feature with the maximum similarity to the image feature may be selected from the word features as the related word feature. The related word feature-image feature pairs can thus be obtained from the image-to-text perspective.
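The text-image similarity matrix and the text-to-image selection of related pairs can be sketched as follows. The sketch is illustrative only; it assumes cosine similarity and picks, for each word feature, the image feature with the maximum similarity.

import torch

def related_pairs(word_feats: torch.Tensor, img_feats: torch.Tensor):
    """Return the (M, N) text-image similarity matrix and, for each word feature,
    the index of and similarity with its most related image feature."""
    t = torch.nn.functional.normalize(word_feats, dim=-1)     # (M, D)
    v = torch.nn.functional.normalize(img_feats, dim=-1)      # (N, D)
    sim_tv = t @ v.t()                                        # (M, N) text-image similarity matrix
    best_sim, best_img = sim_tv.max(dim=-1)                   # argmax over images per word
    return sim_tv, best_img, best_sim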
At 240, for each related word feature-image feature pair, the similarity between the word feature and the image feature indicated by the related word feature-image feature pair and the similarity between the similar image feature corresponding to that image feature and the text features are aggregated to generate a similar image constraint similarity.
In this embodiment, the similarity between the word feature and the image feature indicated by a related word feature-image feature pair and the similarity between the similar image feature corresponding to that image feature and the text features may be combined in various ways (e.g., a weighted sum), so that the obtained similar image constraint similarity contains the similarity constraint information between the similar image and the text features.
In one example, the similar image constraint similarity may include at least one of: a text-to-image similar image constraint similarity between each word feature and the image feature sequence, and an image-to-text similar image constraint similarity between each image feature and the text features.
In one example, the similarity between the word feature and the image feature indicated by the related word feature-image feature pair (which may be denoted, for example, as sim_1st), the similarity between that image feature and its corresponding similar image feature (which may be denoted as s_f), and the similarity between that similar image feature and the word feature (which may be denoted as sim_2nd) may be aggregated in various ways to generate the text-to-image similar image constraint similarity. For example, the text-to-image similar image constraint similarity corresponding to the i-th word feature may be expressed as sim_i = α·sim_1st + β·s_f·sim_2nd, where α and β may be set, for example, to 1.0 and 0.5, respectively.
In one example, the similarity between the word feature and the image feature indicated by the related word feature-image feature pair (sim_1st), the similarity between that image feature and its corresponding similar image feature (s_f), and the similarity between that similar image feature and each word feature may be aggregated to generate the image-to-text similar image constraint similarity. For example, the image-to-text similar image constraint similarity corresponding to the j-th image feature may be expressed as sim_j = α·sim_1st + β·s_f·sim_3rd, where α and β may again be set, for example, to 1.0 and 0.5. Here sim_3rd is a representative value of the similarities between the similar image feature corresponding to the image feature and the word features, and is used to characterize the similarity between that similar image feature and the word features. For example, the representative value may be the maximum of these similarities; as another example, it may be the similarity between the similar image feature corresponding to the image feature and the word feature indicated by the related word feature-image feature pair.
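Under the example weights α = 1.0 and β = 0.5 given above, the text-to-image aggregation sim_i = α·sim_1st + β·s_f·sim_2nd might be sketched as follows. The tensor shapes and helper name are assumptions, not the patent's code, and the similar-image indices are assumed to have been computed beforehand.

import torch

def text_to_image_constraint_sim(sim_tv: torch.Tensor, sim_vv: torch.Tensor,
                                 near_idx: torch.Tensor, alpha: float = 1.0, beta: float = 0.5):
    """Per-word constrained similarity sim_i = alpha*sim_1st + beta*s_f*sim_2nd.

    sim_tv:   (M, N) word-image similarity matrix
    sim_vv:   (N, N) inter-image self-similarity matrix
    near_idx: (N,)   index of the similar image feature of each image feature
    """
    sim_1st, best_img = sim_tv.max(dim=-1)               # (M,) most related image per word
    near_of_best = near_idx[best_img]                    # (M,) its similar image feature
    s_f = sim_vv[best_img, near_of_best]                 # (M,) image <-> similar image similarity
    sim_2nd = sim_tv[torch.arange(sim_tv.size(0)), near_of_best]  # word <-> similar image
    return alpha * sim_1st + beta * s_f * sim_2nd        # (M,) constrained similarities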
Referring now to fig. 4, fig. 4 is a schematic diagram illustrating one example of a determination process 400 of close image constraint similarity according to an embodiment of the present disclosure.
In one example, as shown in Fig. 4, for the word feature corresponding to "lady", the text-image similarity matrix shown at 410 indicates that the similarity between this word feature and the image feature f_3 of frame 3 in the image feature sequence is the maximum, so the word feature corresponding to "lady" and the image feature f_3 form a related word feature-image feature pair. According to the inter-image self-similarity matrix shown at 420, the similar image feature corresponding to the image feature f_3 is the image feature f_6 corresponding to frame 6. The text-to-image similar image constraint similarity of the word feature corresponding to "lady" can therefore be generated with the above formula from the similarity 0.57 between that word feature and the image feature f_3, the similarity 0.96 between the image feature f_3 and the image feature f_6, and the similarity 0.5 between the image feature f_6 and the word feature corresponding to "lady". Similarly, as shown at 430 in Fig. 4, the text-to-image similar image constraint similarities of the word features corresponding to "purple", "shirt", and "speak" can be obtained.
In one example, as shown in Fig. 4, for the image feature f_4 corresponding to frame 4, the text-image similarity matrix shown at 410 indicates that the similarity between the image feature f_4 and the word feature corresponding to "lady" among the text features is the maximum, so the image feature f_4 and the word feature corresponding to "lady" form a related word feature-image feature pair. According to the inter-image self-similarity matrix shown at 420, the similar image feature corresponding to the image feature f_4 is the image feature f_5 corresponding to frame 5. The image-to-text similar image constraint similarity of the image feature f_4 can therefore be generated with the above formula from the similarity 0.54 between the image feature f_4 and the word feature corresponding to "lady", the similarity 0.97 between the image feature f_4 and the image feature f_5, and the representative value 0.48 of the similarities between the similar image feature f_5 and the word features (the similarity with the word feature corresponding to "lady"). Similarly, the image-to-text similar image constraint similarities of the other image features (e.g., f_1 to f_6) can be obtained.
At 250, a similarity between text and video is determined based on the resulting close image constraint similarity.
In this embodiment, the obtained close image constraint similarity may be fused by various methods, so as to obtain the similarity between the text and the video. In one example, similarity between text and video may be determined using a Token-wise Interaction (TI) method, a Weighted Token-wise Interaction (WTI) method, a single vector dot product Interaction (Single Vector Dot-product Interaction) method, a hierarchical Interaction (Hierarchical Interaction) method, a multi-layer perceptron calculation of global vectors, and the like.
In one example, the maximum or average of the obtained text-to-image similar image constraint similarities between the word features and the image feature sequence may be determined as a text-to-image sub-similarity. In one example, the maximum or average of the obtained image-to-text similar image constraint similarities between the image features and the text features may be determined as an image-to-text sub-similarity. In one example, the similarity between the text and the video may be determined from at least one of the text-to-image sub-similarity and the image-to-text sub-similarity.
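As one possible fusion among the interaction methods listed above, the small sketch below averages (or takes the maximum of) the per-word and per-image constrained similarities and combines the two directions. The equal 0.5 weighting is an illustrative assumption rather than something specified by the patent.

import torch

def fuse_similarity(sim_words: torch.Tensor, sim_images: torch.Tensor, reduce: str = "mean"):
    """Fuse per-word (text-to-image) and per-image (image-to-text) constrained
    similarities into one text-video score, as one possible variant."""
    red = torch.mean if reduce == "mean" else (lambda x: x.max())
    t2v = red(sim_words)            # text-to-image sub-similarity
    v2t = red(sim_images)           # image-to-text sub-similarity
    return 0.5 * (t2v + v2t)        # symmetric combination (assumed weighting)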
Referring now to fig. 5, fig. 5 illustrates a flowchart of yet another example of a method 500 for determining similarity between text and video according to an embodiment of the present disclosure.
As shown in fig. 5, at 510, the text and video included in the acquired text-video pair are provided to a text feature extraction model and a video feature extraction model, respectively, to obtain corresponding text features and video features.
In this embodiment, the text features may include a word feature corresponding to each word contained in the text and a global text feature corresponding to the word features. In one example, the global text feature may be the feature of the [EOS] token corresponding to the text output by various text encoding models, and may be denoted, for example, as t_cls.
At 520, a near image feature corresponding to each image feature is determined based on the similarity between each image feature in the sequence of image features.
At 530, for each word feature, a corresponding related image feature is selected from the image feature sequence according to the similarity between the word feature and each image feature in the image feature sequence, to generate a related word feature-image feature pair.
At 540, for each related word feature-image feature pair, a text-to-image similar image constraint similarity is generated by aggregating the similarity between the word feature and the image feature indicated by the related word feature-image feature pair, the similarity between that image feature and its corresponding similar image feature, and the similarity between the similar image feature corresponding to that image feature and the word feature.
It should be noted that, the steps 510-540 may refer to the descriptions of the steps 210-240 in the foregoing embodiments, and are not repeated here.
At 550, relevant image features are selected from the sequence of image features based on the similarity between the global text feature and each image feature in the sequence of image features.
At 560, the similarity between the global text feature and the selected related image feature, the similarity between the selected related image feature and its corresponding similar image feature, and the similarity between that similar image feature and the global text feature are aggregated to generate a global similar image constraint similarity.
It should be noted that steps 550-560 may refer to the related descriptions in the optional examples of steps 230-240 in the foregoing embodiments, with the "word feature" simply replaced by the "global text feature", and are not described again here.
At 570, a weighted similar image constraint similarity between the word features and the image feature sequence is generated based on the weight corresponding to each word feature and the corresponding text-to-image similar image constraint similarities.
In one example, the weighted similar image constraint similarity may be expressed as the weighted sum Σ_i w_i^t·sim_i, where w_i^t may be used to represent the weight corresponding to the i-th word feature. The meanings of the other symbols are as described above.
In one example, the respective weights may be determined according to whether the words corresponding to the word features belong to a preset keyword dictionary. In one example, the weight of each word feature may be determined based on a TF-IDF (Term Frequency-Inverse Document Frequency) value calculated over a corpus.
Optionally, with continued reference to Fig. 6, Fig. 6 shows a flowchart of one example of a determination process 600 of the weights corresponding to the respective word features according to an embodiment of the present specification.
At 610, the part of speech of the token corresponding to each word feature is determined.
In this embodiment, a part-of-speech analysis tool (e.g., spaCy) may be used to determine the part of speech (PoS) of the token corresponding to each word feature. The parts of speech may include, for example, verbs, nouns, adjectives, pronouns, adverbs, and the like.
At 620, weight assignments corresponding to each of the tokens are determined according to a predetermined part-of-speech-weight assignment relationship.
In one example, part-of-speech-weight assignment relationships may be used to indicate that the weights corresponding to nouns, verbs, adjectives are assigned a value of η (η > 1) and the weights corresponding to other parts of speech are assigned a value of 1.
At 630, the weight assignments are normalized to obtain weights corresponding to the respective token features.
In one example, the weight assignments may be normalized using softmax. In one example, the weights corresponding to the respective word features may form a weight vector (which may be denoted, for example, as w_t).
With this scheme, by explicitly introducing a text pre-weighting process, the words carrying important information can be mined as far as possible, thereby effectively enhancing the attention paid to the text.
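A possible sketch of this explicit text pre-weighting, using spaCy for part-of-speech tagging as mentioned above, is shown below. The value of η and the model name are assumptions for illustration.

import numpy as np
import spacy

nlp = spacy.load("en_core_web_sm")   # part-of-speech analysis tool (assumed model)

def word_weights(text: str, eta: float = 2.0) -> np.ndarray:
    """Nouns, verbs and adjectives get a raw weight eta (> 1, value assumed),
    all other parts of speech get 1; the raw weights are then softmax-normalized."""
    raw = np.array([eta if tok.pos_ in {"NOUN", "VERB", "ADJ"} else 1.0 for tok in nlp(text)])
    exp = np.exp(raw - raw.max())    # numerically stable softmax normalization
    return exp / exp.sum()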
Returning to fig. 5, at 580, a similarity between the text and the video is determined based on the global close-image constraint similarity and the weighted close-image constraint similarity.
In one example, the similarity between the text and the video may be expressed, for example, as the sum of the global similar image constraint similarity and the weighted similar image constraint similarity, i.e., sim(t, v) = s_glb + Σ_i w_i^t·sim_i, where s_glb may be used to represent the global similar image constraint similarity. The meanings of the other symbols are as described above.
In one example, the symmetrically weighted similar image constraint similarity on the image side can similarly be expressed as Σ_j w_j^v·sim_j, where w_j^v may be used to represent the weight corresponding to the j-th image feature in the image feature sequence. The meanings of the other symbols are as described above. In one example, the similarity between the text and the video may then be expressed as sim(t, v) = s_glb + Σ_i w_i^t·sim_i + Σ_j w_j^v·sim_j.
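Putting the pieces together, the similarity at 580 might be computed as in the sketch below; the purely additive combination mirrors the formulas above but remains an assumption of this illustration.

import torch

def overall_similarity(s_glb: torch.Tensor, w_t: torch.Tensor, sim_t2v: torch.Tensor,
                       w_v: torch.Tensor, sim_v2t: torch.Tensor) -> torch.Tensor:
    """Global constraint similarity plus the weighted text-to-image and
    image-to-text terms (additive form assumed)."""
    return s_glb + (w_t * sim_t2v).sum() + (w_v * sim_v2t).sum()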
By using the method for determining the similarity between text and video disclosed in Figs. 1-6, related word feature-image feature pairs can be mined from the text features and the corresponding image features, and the intrinsic similarity between video frames or image blocks of the video is captured by determining the similar image feature corresponding to each image feature in the image feature sequence. The similarity between the word feature and the image feature in each related word feature-image feature pair and the similarity between the corresponding similar image feature and the text features are then aggregated, so that the obtained similar image constraint similarity contains the similarity constraint information between the similar image and the text features, which makes it possible to maximally distinguish, from the visual perspective, the cross-modal (text and video) similarity between matched and unmatched pairs.
Referring now to fig. 7, fig. 7 shows a flowchart of one example of a text video retrieval method 700 according to an embodiment of the present description.
As shown in fig. 7, at 710, query text provided by a user is received.
In this embodiment, the query text provided by the user may be received in various ways. For example, the query text may be text directly input by the user terminal, or may be text converted by performing optical character recognition (Optical Character Recognition, OCR) or automatic speech recognition (Automatic Speech Recognition, ASR) on an image, video, speech, or the like input by the user terminal, and the present invention is not limited thereto.
At 720, the similarity between the query text and candidate videos included in each query text video pair is determined according to a method for determining the similarity between text and videos.
In this embodiment, the query text may be combined with each candidate video in the candidate video set, so as to obtain each query text video pair. The candidate videos included in the candidate video set may be set according to actual needs. For example, the video may be all candidate videos or a part of candidate videos recalled according to various coarse screening methods. The above method for determining the similarity between text and video may be specifically described with reference to the embodiments of fig. 1-6.
At 730, a matching video is determined from the candidate video set as a video search result based on the determined similarity.
In this embodiment, the matching video may be determined from the candidate video set in various ways. For example, a number of candidate videos with the greatest similarity may be determined as matching videos. For another example, candidate videos with similarity greater than a preset threshold may be used as candidate matching videos, and then a plurality of matching videos may be determined from the candidate matching videos by a manner such as random selection, selection according to user preference, and the like, as video search results.
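A toy ranking loop for steps 720-730 could look as follows, treating the similarity computation described above as an opaque callable. The function names and the top-k cut-off are illustrative assumptions.

def retrieve(query_text: str, candidate_videos, similarity_fn, top_k: int = 10):
    """Score every (query, candidate) pair with the similarity method described
    above and return the highest-scoring candidates as video search results."""
    scored = [(similarity_fn(query_text, video), video) for video in candidate_videos]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [video for _, video in scored[:top_k]]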
At 740, video search results are provided to the user.
In this embodiment, the above video search results may be provided to the user in various forms. For example, the video search results may be arranged in a list in descending order of similarity. Optionally, the corresponding similarity may also be displayed next to each video search result.
It should be noted that, the user provided with the video search result may be the same user as the user described in step 710, or may be the user using the same user terminal as the user described in step 710, which is not limited herein.
With the text-video retrieval method disclosed in Fig. 7, the method for determining the similarity between text and video can be applied to the field of text-video retrieval, so that retrieval results are returned more efficiently and accurately.
Referring now to fig. 8, fig. 8 illustrates a flowchart of one example of a method 800 for training a feature extraction model according to an embodiment of the present disclosure.
As shown in FIG. 8, at 810, the model training processes 820-870 described below are performed in a loop using the training sample set until the training end condition is met.
In this embodiment, each training sample in the training sample set may include a positive text-video pair composed of matched text data and video data or a negative text-video pair composed of unmatched text data and video data. In one example, a piece of video data v_i and the text data t_i describing that video data form a positive text-video pair (t_i, v_i); a piece of video data v_j and text data t_i describing other video data (i ≠ j) form a negative text-video pair (t_i, v_j).
At 820, text data for each current training sample in the current training sample set is provided to the current text feature extraction model to obtain text features for each current training sample.
Optionally, for each piece of text data, an importance score corresponding to each word contained in the text data may be determined. In one example, the importance score may be a TF-IDF value. Unimportant words (e.g., words whose importance score is below a predetermined value) may then be determined from the text data based on the determined importance scores. Next, k words may be randomly selected from the determined unimportant words for dropping or masking, so as to obtain the text features corresponding to the text data after the random dropping or masking.
On this basis, filtering out relatively unimportant words helps to eliminate irrelevant or noisy words (e.g., words with grammatical errors or misspellings) and to preserve the words actually relevant to a particular video, thereby effectively enhancing the attention paid to the text and improving the robustness of learning.
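The random dropping or masking of unimportant words might be sketched as below; the TF-IDF threshold, the number k, and the mask token are illustrative assumptions.

import random

def mask_or_drop(words, tfidf, k: int = 2, threshold: float = 0.1,
                 mask_token: str = "[MASK]", drop: bool = False):
    """Randomly pick up to k words whose TF-IDF score (looked up in the `tfidf` dict)
    falls below `threshold` and either drop them or replace them with a mask token."""
    unimportant = [i for i, w in enumerate(words) if tfidf.get(w, 0.0) < threshold]
    chosen = set(random.sample(unimportant, min(k, len(unimportant))))
    if drop:
        return [w for i, w in enumerate(words) if i not in chosen]
    return [mask_token if i in chosen else w for i, w in enumerate(words)]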
At 830, video data for each current training sample is provided to the current video feature extraction model to obtain an image feature sequence for each current training sample.
In this embodiment, the model parameters of the current text feature extraction model and the current video feature extraction model may be adjusted along with the model training process. In one example, initial model parameters for the text feature extraction model and the video feature extraction model may be derived using random initialization or based on a pre-trained CLIP model.
At 840, for each current training sample, a near image feature corresponding to each image feature is determined based on the similarity between each image feature in the image feature sequence of the current training sample.
It should be noted that the specific operation procedures of steps 820-830 and 840 may refer to the descriptions related to step 210 and step 220 in the embodiment of fig. 2, and are not repeated here.
At 850, a similarity between the text data and the corresponding video data for each current training sample is determined.
In this embodiment, the similarity between the text data and the corresponding video data of each current training sample may be determined as follows in steps 851-855.
At 851, related word feature-image feature pairs are determined based on the similarity between the text features of the current training sample and each image feature in the corresponding image feature sequence.
At 853, for each related word feature-image feature pair, the similarity between the word feature and the image feature indicated by the related word feature-image feature pair and the similarity between the similar image feature corresponding to that image feature and the text features are aggregated to generate a similar image constraint similarity.
At 855, the similarity between the text data and the corresponding video data of the current training sample is determined based on the obtained similar image constraint similarities.
It should be noted that, the specific operation of the steps 851-855 may refer to the descriptions related to the steps 230-250 in the embodiment of fig. 2, and are not repeated here.
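A minimal sketch of steps 851-853 for one training sample is given below, assuming PyTorch tensors. Selecting the related image feature for each logographic feature by argmax and averaging the three similarity terms are illustrative assumptions; the exact selection and aggregation follow the embodiment of fig. 2, which is not reproduced here.

```python
import torch
import torch.nn.functional as F

def similar_image_constraint_similarity(token_feats, image_feats, similar_image_feats):
    """Text-to-graph direction of the similar image constraint similarity.

    token_feats         : (T, D) logographic (token) features of the text
    image_feats         : (F, D) image feature sequence of the video
    similar_image_feats : (F, D) similar image feature of each image feature (see sketch above)
    Returns one constraint similarity per logographic feature, shape (T,).
    """
    t = F.normalize(token_feats, dim=-1)
    v = F.normalize(image_feats, dim=-1)
    v_near = F.normalize(similar_image_feats, dim=-1)

    sim_tv = t @ v.t()                                 # (T, F) token-to-image similarities
    sim_token_image, rel_idx = sim_tv.max(dim=-1)      # related image feature per token (argmax)
    sim_image_near = (v * v_near).sum(-1)[rel_idx]     # sim(related image, its similar image)
    sim_near_token = (t * v_near[rel_idx]).sum(-1)     # sim(similar image, token)

    # Aggregate the three terms; simple averaging is an assumed aggregation choice.
    return (sim_token_image + sim_image_near + sim_near_token) / 3.0
```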
At 860, a contrast loss value corresponding to the current training sample set is determined based on the determined similarity between the text data and the corresponding video data for each current training sample.
In one example, at least one of the text-to-graph contrast loss value and the graph-to-text contrast loss value may be determined in accordance with a symmetric contrast learning loss function, based on the determined similarity between the text data and the corresponding video data of each current training sample. The text-to-graph contrast loss value can be expressed, for example, as \mathcal{L}_{t2v} = -\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp(\mathrm{sim}(t_i,v_i)/\tau)}{\sum_{j=1}^{B}\exp(\mathrm{sim}(t_i,v_j)/\tau)}. Correspondingly, the graph-to-text contrast loss value can be expressed, for example, as \mathcal{L}_{v2t} = -\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp(\mathrm{sim}(t_i,v_i)/\tau)}{\sum_{j=1}^{B}\exp(\mathrm{sim}(t_j,v_i)/\tau)}. Here, B may be used to represent the number of samples in a batch, such as the number of positive example text video pairs; τ may be used to represent the temperature coefficient of the contrast loss function; sim(t_i, v_i) may be used to represent the similarity between the text data of the i-th positive example text video pair and the corresponding video data; and, accordingly, sim(t_i, v_j) and sim(t_j, v_i) may be used to represent the similarity between the text data and the corresponding video data of a negative example text video pair.
In one example, the contrast loss value corresponding to the current training sample set may be determined from at least one of the text-to-graph contrast loss value and the graph-to-text contrast loss value described above, for example as the mean of the two.
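A minimal sketch of this symmetric contrast loss is given below, assuming a precomputed B×B similarity matrix whose diagonal holds the positive text video pairs; the temperature value is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(sim_matrix, tau=0.05):
    """Symmetric contrastive (InfoNCE-style) loss over a batch of B text-video pairs.

    sim_matrix : (B, B) tensor with sim_matrix[i, j] = sim(t_i, v_j); the diagonal holds the
                 positive pairs, off-diagonal entries are negative pairs.
    tau        : temperature coefficient of the contrast loss function.
    """
    logits = sim_matrix / tau
    labels = torch.arange(sim_matrix.size(0), device=sim_matrix.device)
    loss_t2v = F.cross_entropy(logits, labels)        # text-to-graph direction
    loss_v2t = F.cross_entropy(logits.t(), labels)    # graph-to-text direction
    return 0.5 * (loss_t2v + loss_v2t)                # e.g. the mean of the two directions
```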
Optionally, with continued reference to fig. 9, fig. 9 shows a flowchart of one example of a process 900 of determining a contrast loss value according to an embodiment of the present disclosure.
As shown in fig. 9, at 910, a current difficult example negative text video pair set is determined from the current negative example text video pair set based on the magnitude relationship between the similarity between the text data and the corresponding video data of each current negative example text video pair and the similarity between the text data and the corresponding video data of the corresponding positive example text video pair.
In one example, each current negative example text video pair whose similarity between text data and corresponding video data is greater than the similarity between the text data and the corresponding video data of the corresponding positive example text video pair may be determined to be a current difficult example negative text video pair, resulting in the current difficult example negative text video pair set.
In one example, a marginal similarity score for each negative example text video pair in the current negative example text video pair set may be determined. Thereafter, the negative example text video pairs whose determined marginal similarity scores are greater than a predetermined value (e.g., 0) may be grouped into the current difficult example negative text video pair set. In one example, the marginal similarity score of a negative example text video pair (t_i, v_j) may be represented, for example, as m_{i,j} = \mathrm{sim}(t_i, v_j) - \mathrm{sim}(t_i, v_i) + \xi, where ξ may be used to represent a preset margin (e.g., 0). The meaning of the remaining symbols may be referred to in the foregoing.
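The difficult example mining of step 910 can be sketched as follows over the batch similarity matrix; the sign convention of the margin is an illustrative assumption consistent with the criterion described above.

```python
import torch

def mine_hard_negatives(sim_matrix, margin=0.0):
    """Select difficult example (hard) negative text video pairs from the current batch.

    A negative pair (t_i, v_j), i != j, is kept when its marginal similarity score
    sim(t_i, v_j) - sim(t_i, v_i) + margin exceeds 0, i.e. the negative pair is more
    similar than the corresponding positive pair (for margin = 0).
    Returns a boolean mask of shape (B, B) over the off-diagonal entries.
    """
    pos = sim_matrix.diag().unsqueeze(1)                      # (B, 1) positive similarities
    marginal = sim_matrix - pos + margin                      # marginal similarity scores
    off_diag = ~torch.eye(sim_matrix.size(0), dtype=torch.bool,
                          device=sim_matrix.device)
    return (marginal > 0) & off_diag
```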
At 920, a symmetry probability for each current refractory negative text video pair in the set of refractory negative text video pairs is determined based on the symmetric contrast learning loss function.
In one example, at least one of a text-to-graph symmetry probability and a graph-to-text symmetry probability of each current refractory negative text video pair may be determined with reference to the symmetric contrast learning loss function. In one example, for a refractory negative text video pair (t_i, v_j), the text-to-graph symmetry probability may be expressed, for example, as p^{t2v}_{i,j} = \frac{\exp(\mathrm{sim}(t_i,v_j)/\tau)}{\sum_{k=1}^{B}\exp(\mathrm{sim}(t_i,v_k)/\tau)}. Accordingly, the graph-to-text symmetry probability may be expressed, for example, as p^{v2t}_{i,j} = \frac{\exp(\mathrm{sim}(t_i,v_j)/\tau)}{\sum_{k=1}^{B}\exp(\mathrm{sim}(t_k,v_j)/\tau)}. The meaning of the remaining symbols may be referred to in the foregoing.
In one example, the symmetry probability of each current refractory negative text video pair may be determined from at least one of the text-to-graph symmetry probability and the graph-to-text symmetry probability described above, for example as the mean of the two.
At 930, a refractory loss value for the current refractory negative text video pair set is determined based on the determined symmetry probabilities.
In one example, at least one of a text-to-graph refractory loss value and a graph-to-text refractory loss value for the current difficult example negative text video pair set is determined based on the determined symmetry probabilities. In one example, the text-to-graph refractory loss value may be expressed, for example, as \mathcal{L}^{hard}_{t2v} = -\frac{1}{N}\sum_{(i,j)\in\mathcal{H}}\log\left(1 - p^{t2v}_{i,j}\right). Correspondingly, the graph-to-text refractory loss value may be expressed, for example, as \mathcal{L}^{hard}_{v2t} = -\frac{1}{N}\sum_{(i,j)\in\mathcal{H}}\log\left(1 - p^{v2t}_{i,j}\right). Here, \mathcal{H} may be used to represent the current difficult example negative text video pair set, and N may be used to represent the number of refractory negative text video pairs contained in the current difficult example negative text video pair set. The meaning of the remaining symbols may be referred to in the foregoing.
In one example, the refractory loss value of the current difficult example negative text video pair set may be determined from at least one of the text-to-graph refractory loss value and the graph-to-text refractory loss value described above, for example as the mean of the two.
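A sketch of steps 920-930 is given below, using the mask produced by the mining sketch above; because the exact published loss form is not reproduced in the text, the penalty -log(1 - p) on the symmetry probabilities of the mined pairs is an illustrative assumption.

```python
import torch

def hard_negative_loss(sim_matrix, hard_mask, tau=0.05):
    """Refractory (hard negative) loss over the mined difficult example pairs.

    sim_matrix : (B, B) batch similarity matrix
    hard_mask  : (B, B) boolean mask of mined difficult example negative pairs
    High symmetry probabilities of the mined pairs are penalised; this form is illustrative.
    """
    p_t2v = torch.softmax(sim_matrix / tau, dim=1)            # text-to-graph symmetry probabilities
    p_v2t = torch.softmax(sim_matrix / tau, dim=0)            # graph-to-text symmetry probabilities
    n = hard_mask.sum().clamp(min=1)
    loss_t2v = -torch.log(1.0 - p_t2v[hard_mask] + 1e-8).sum() / n
    loss_v2t = -torch.log(1.0 - p_v2t[hard_mask] + 1e-8).sum() / n
    return 0.5 * (loss_t2v + loss_v2t)                        # e.g. the mean of the two directions
```

A weighted sum of such directional terms with the base symmetric loss, using the preset coefficients γ1 and γ2, then yields the contrast loss value of step 940.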
At 940, a contrast loss value corresponding to the current training sample set is determined from the refractory loss value.
In one example, the contrast loss value may be expressed, for example, as \mathcal{L} = \mathcal{L}_{con} + \gamma_1\mathcal{L}^{hard}_{t2v}. In another example, the contrast loss value may be expressed, for example, as \mathcal{L} = \mathcal{L}_{con} + \gamma_1\mathcal{L}^{hard}_{t2v} + \gamma_2\mathcal{L}^{hard}_{v2t}, where \mathcal{L}_{con} may be used to represent the contrast loss value determined as described above at 860, and γ1 and γ2 may be used to represent preset weighting coefficients, respectively. The meaning of the remaining symbols may be referred to in the foregoing.
Based on this, difficult example negative sample pairs can be effectively identified and their influence introduced into the calculation of the contrast loss value, so that the difficult example negative sample pairs are fully utilized to promote more robust and more discriminative learning.
Returning to fig. 8, at 870, a determination is made as to whether the training end condition is satisfied.
In one example, whether the training end condition is satisfied may be determined by determining whether the number of iterations reaches a preset number of times, whether the training duration reaches a preset duration, whether the contrast loss value converges, and the like.
At 880, in response to not satisfying the training end condition, model parameters of the current feature extraction model are adjusted according to the contrast loss value.
In this embodiment, the feature extraction model after model parameter adjustment may serve as the current feature extraction model of the next model training process. Thereafter, the current training sample set may be re-determined using the training sample set described above, and the model training process 820-870 may continue until the training end condition is met.
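The outer loop of process 800 can be sketched as follows; the encoders, the similarity function (steps 840-855) and the loss function (step 860) are supplied as callables, for example the sketches above, and the end condition shown here is a preset number of epochs, which is an illustrative assumption.

```python
def train_feature_extractors(text_encoder, video_encoder, similarity_fn, loss_fn,
                             loader, optimizer, max_epochs=10):
    """Outer loop of training process 800 (a sketch; all callables are supplied by the
    caller, e.g. loss_fn = symmetric_contrastive_loss from the earlier sketch)."""
    for epoch in range(max_epochs):                   # training end condition: preset epoch count
        for texts, videos in loader:                  # re-determined current training sample set
            t = text_encoder(texts)                   # step 820: text features
            v = video_encoder(videos)                 # step 830: image feature sequences
            sim = similarity_fn(t, v)                 # steps 840-855: (B, B) similarity matrix
            loss = loss_fn(sim)                       # step 860: contrast loss value
            optimizer.zero_grad()
            loss.backward()                           # step 880: adjust model parameters
            optimizer.step()
    return text_encoder, video_encoder                # trained feature extraction model
```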
In response to the training end condition being met, the current feature extraction model is determined as the trained feature extraction model. The text feature extraction model and the video feature extraction model included in the trained feature extraction model can then be used to obtain corresponding text features and video features, from which the similarity between the corresponding text and video can be determined.
Referring now to fig. 10, fig. 10 illustrates a schematic diagram of one example of an application scenario 1000 of a method for training a feature extraction model according to an embodiment of the present disclosure.
As shown in fig. 10, the text data "the lady wearing a purple shirt is speaking" and the corresponding video data form a positive text video pair. The text data of the current training sample, "the lady wearing a purple shirt is speaking", can be provided to the current text feature extraction model to obtain the corresponding text features. In one example, the text features may include a logographic feature t_tokens corresponding to each word in the text data. In one example, the text features may also include a global feature t_cls corresponding to the text data. The corresponding video data can be provided to the current video feature extraction model to obtain a corresponding image feature sequence v_h. In one example, the text-to-graph similar image constraint similarities corresponding to the logographic features of "lady", "purple", "shirt" and "speaking" can be obtained with reference to the corresponding description of the embodiment of fig. 4 above. Optionally, a corresponding weight vector w_t may also be derived from the text data "the lady wearing a purple shirt is speaking", and the similarity between the text data and the corresponding video data may then be obtained with reference to the corresponding description of the embodiment of fig. 5 above. Next, a contrast loss value corresponding to the current training sample set may be calculated according to the similarity corresponding to each current training sample. When the training end condition is not met, the model parameters of the current text feature extraction model and the current video feature extraction model are adjusted according to the calculated contrast loss value, until the training end condition is met.
With the method for training the feature extraction model described with reference to fig. 8-10, the similarities between the logographic features and the image features in the determined related logographic feature-image feature pairs are aggregated with the similarities between the corresponding similar image features and the text features, so that semantic cues of different granularity contained in each text segment and image segment are fully utilized, and the similarity constraint information between similar images carried by the similar image constraint similarity helps to maximally distinguish, from a visual perspective, the cross-modal similarity between matched and unmatched pairs. In addition, further introducing a contrast learning loss calculation approach that accounts for the influence of difficult example negative sample pairs facilitates more robust and more discriminative learning.
Referring now to fig. 11, fig. 11 illustrates a block diagram of one example of an apparatus 1100 for determining similarity between text and video according to an embodiment of the present disclosure. The apparatus embodiment may correspond to the method embodiments shown in fig. 2-6, and the apparatus may be specifically applied to various electronic devices.
As shown in fig. 11, the apparatus 1100 for determining the similarity between text and video may include a feature extraction unit 1110, a similar feature determination unit 1120, a related feature pair determination unit 1130, a similarity aggregation unit 1140, and a similarity determination unit 1150.
The feature extraction unit 1110 is configured to provide the text and video included in the acquired text-video pair to the text feature extraction model and the video feature extraction model, respectively, to obtain corresponding text features and video features. Wherein the text features include logographic features corresponding to each word (token) contained in the text, and the video features include a sequence of image features extracted based on images contained in the video. The operation of the feature extraction unit 1110 may refer to the operation of 210 described above with respect to fig. 2.
In one example, the text features further include global text features corresponding to respective logographic features.
The similar feature determining unit 1120 is configured to determine similar image features corresponding to the image features according to the similarity between the image features in the image feature sequence. The operation of the similar feature determination unit 1120 may refer to the operation of 220 described above with respect to fig. 2.
In one example, the similar feature determining unit 1120 may be further configured to: determining the similarity between each image feature in the image feature sequence to obtain an inter-image self-similarity matrix; and determining, for each image feature in the image feature sequence, the similar image feature corresponding to the image feature according to the maximum value of the similarity between the image feature and other image features indicated by the inter-image self-similarity matrix. The operation of the similar feature determining unit 1120 may refer to the related operations described above in the embodiment of fig. 3.
A related feature pair determining unit 1130 configured to determine related logographic feature-image feature pairs according to a degree of similarity between the text feature and each image feature in the image feature sequence. The operation of the relevant feature pair determining unit 1130 may refer to the operation of 230 described above with respect to fig. 2.
In one example, the relevant feature pair determination unit 1130 may be further configured to: for each logographic feature, selecting the corresponding related image feature from the image feature sequence according to the similarity between the logographic feature and each image feature in the image feature sequence, to generate a related logographic feature-image feature pair. The operation of the relevant feature pair determination unit 1130 may refer to the operation of the alternative implementation in 230 described above in fig. 2.
In one example, the relevant feature pair determination unit 1130 may be further configured to: for each image feature in the image feature sequence, selecting the corresponding related logographic feature from the logographic features according to the similarity between the image feature and each logographic feature, to generate a related logographic feature-image feature pair. The operation of the relevant feature pair determination unit 1130 may refer to the operation of the alternative implementation of 230 described above in fig. 2.
A similarity aggregation unit 1140, configured to aggregate, for each related logographic feature-image feature pair, the similarity between the logographic feature and the image feature indicated by the related logographic feature-image feature pair and the similarity between the similar image feature corresponding to the image feature and the text feature, to generate a similar image constraint similarity. The operation of the similarity aggregation unit 1140 may refer to the operation of 240 described above with respect to fig. 2.
In one example, the close-image constraint similarity may include a text-to-graph close-image constraint similarity. The similarity aggregation unit 1140 may be further configured to: and aggregating the similarity between the indicated character feature and the image feature, the similarity between the image feature and the corresponding similar image feature, and the similarity between the similar image feature corresponding to the image feature and the character feature, so as to generate the text-image similar image constraint similarity. The operation of the similarity aggregation unit 1140 may refer to the operation of one example of 240 described above in fig. 2.
In one example, the close-image constraint similarity may include a graph-text close-image constraint similarity. The similarity aggregation unit 1140 may be further configured to: and aggregating the similarity between the indicated character feature and the image feature, the similarity between the image feature and the corresponding similar image feature, and the similarity between the similar image feature corresponding to the image feature and each character feature to generate the graph-text similar image constraint similarity. The operation of the similarity aggregation unit 1140 may refer to the operation of one example of 240 described above in fig. 2.
In one example, the similarity aggregation unit 1140 may be further configured to: selecting related image features from the image feature sequence according to the similarity between the global text feature and each image feature in the image feature sequence; and aggregating the similarity between the global text feature and the corresponding selected related image feature, the similarity between the corresponding selected related image feature and the corresponding similar image feature, and the similarity between the corresponding similar image feature of the corresponding selected related image feature and the global text feature to generate global similar image constraint similarity. The operation of the similarity aggregation unit 1140 may refer to the operations 550-560 described above with respect to fig. 5.
The similarity determination unit 1150 is configured to determine a similarity between the text and the video based on the obtained close image constraint similarity. The operation of the similarity determination unit 1150 may refer to the operation of 250 described above with respect to fig. 2.
In one example, the similarity determination unit 1150 may be further configured to: generating weighted similar image constraint similarity between the character features and the image feature sequence according to the weight corresponding to each character feature and the corresponding text-graph similar image constraint similarity; and determining the similarity between the text and the video according to the global close image constraint similarity and the weighted close image constraint similarity. The operation of the similarity determination unit 1150 may refer to the operations 570-580 described above with respect to fig. 5.
Optionally, in one example, the apparatus further comprises: a weight generating unit 1160 configured to determine the part of speech of the token corresponding to each token feature; determining weight assignment corresponding to each word symbol according to a preset part-of-speech-weight assignment relation; and normalizing each weight assignment to obtain the weight corresponding to each character feature. The operation of the weight generation unit 1160 may refer to the operation described above for the embodiment of fig. 6.
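A sketch of the weight generating unit 1160 is given below; the part-of-speech tags and the preset part-of-speech-weight assignment relation used here are illustrative assumptions, not values taken from this application.

```python
def token_weights(pos_tags, pos_weight_map=None):
    """Per-token weights derived from parts of speech (sketch of weight generating unit 1160).

    pos_tags       : part of speech of the token corresponding to each token feature
    pos_weight_map : preset part-of-speech -> weight assignment relation (assumed values)
    """
    if pos_weight_map is None:
        pos_weight_map = {"NOUN": 3.0, "VERB": 2.0, "ADJ": 2.0, "DET": 0.5}
    raw = [pos_weight_map.get(tag, 1.0) for tag in pos_tags]
    total = sum(raw)
    return [w / total for w in raw]        # normalise each weight assignment

# Example for the tokens "lady", "wearing", "purple", "shirt", "speaking":
print(token_weights(["NOUN", "VERB", "ADJ", "NOUN", "VERB"]))
```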
With continued reference to fig. 12, fig. 12 shows a block diagram of one example of a text-to-video retrieval device 1200 according to an embodiment of the present description. The embodiment of the apparatus may correspond to the embodiment of the method shown in fig. 7, and the apparatus may be specifically applied to various electronic devices.
As shown in fig. 12, the text video retrieving apparatus 1200 may include a text receiving unit 1210, a similarity calculating unit 1220, and a video result providing unit 1230.
The text receiving unit 1210 is configured to receive a query text provided by a user. The operation of the text receiving unit 1210 may refer to the operation of 710 described above in fig. 7.
The similarity calculation unit 1220 is configured to determine the similarity between the query text and the candidate video included in each query text video pair according to the means for determining the similarity between text and video as described above. Each query text video pair is derived from the query text and each candidate video in the candidate video set. The operation of the similarity calculation unit 1220 may refer to the operation of 720 described above in fig. 7.
A video result providing unit 1230 configured to determine a matching video from the candidate video set as a video search result based on the determined similarity; and providing the video search results to the user. The operation of the video result providing unit 1230 may refer to the operations of 730-740 described above in fig. 7.
With continued reference to fig. 13, fig. 13 shows a block diagram of one example of an apparatus 1300 for training a feature extraction model according to an embodiment of the disclosure. The apparatus embodiment may correspond to the method embodiments shown in fig. 8-9, and the apparatus may be specifically applied to various electronic devices.
As shown in fig. 13, an apparatus 1300 for training a feature extraction model may include a training unit 1310 and a parameter adjustment unit 1320. The feature extraction model comprises a text feature extraction model and a video feature extraction model. The apparatus 1300 is configured to cyclically perform a model training process by the training unit 1310 using a training sample set until a training end condition is met. Each training sample in the training sample set includes a positive example text video pair composed of matched text data and video data or a negative example text video pair composed of non-matched text data and video data. The training unit 1310 includes: a feature extraction module 1311, a similar feature determination module 1312, a similarity determination module 1313, and a contrast loss calculation module 1314.
The feature extraction module 1311 is configured to provide text data of each current training sample in the current training sample set to the current text feature extraction model, so as to obtain text features of each current training sample; and providing the video data of each current training sample for the current video feature extraction model to obtain an image feature sequence of each current training sample. The operation of the feature extraction module 1311 may refer to the operation of 820-830 described above with respect to fig. 8.
The similar feature determining module 1312 is configured to determine, for each current training sample, similar image features corresponding to each image feature according to the similarity between each image feature in the image feature sequence of the current training sample. The operation of the close feature determination module 1312 may refer to the operation of 840 described above in fig. 8.
A similarity determination module 1313 configured to determine, for each current training sample, a related logographic feature-image feature pair from the similarity between the text feature of the current training sample and each image feature in the corresponding image feature sequence; aiming at each related character feature-image feature pair, the similarity between the character feature and the image feature indicated by the related character feature-image feature pair and the similarity between the similar image feature corresponding to the image feature and the text feature are aggregated, and similar image constraint similarity is generated; and determining the similarity between the text data of the current training sample and the corresponding video data based on the obtained similar image constraint similarity. The operation of the similarity determination module 1313 may refer to the operations 851-855 described above in fig. 8.
The contrast loss calculation module 1314 is configured to determine a contrast loss value corresponding to the current training sample set based on the determined similarity between the text data and the corresponding video data of each current training sample. The operation of the contrast loss calculation module 1314 may refer to the operation of 860 described above with respect to fig. 8.
Alternatively, referring now to fig. 14, fig. 14 shows a block diagram of one example of a training unit 1400 in an apparatus for training a feature extraction model according to an embodiment of the present disclosure.
As shown in fig. 14, the training unit 1400 includes: a feature extraction module 1410, a similar feature determination module 1420, a similarity determination module 1430, a difficult-to-case mining module 1440, a difficult-to-case loss determination module 1450, and a contrast loss calculation module 1460.
The refractory case mining module 1440 is configured to determine a current refractory case negative case text video pair set from the current negative case text video pair set according to a magnitude relation between a similarity between text data and corresponding video data of each current negative case text video pair and a similarity between text data and corresponding video data of the corresponding positive case text video pair. The operation of the refractory mining module 1440 may refer to the operation of 910 described above with respect to fig. 9.
A refractory case loss determination module 1450 configured to determine a symmetry probability for each current refractory case negative case text video pair in the set of current refractory case negative case text video pairs based on a symmetric contrast learning loss function; and determining the refractory loss value of the current refractory negative text video pair set according to the determined symmetry probability. The operation of the hard loss determination module 1450 may refer to the operation of 920-930 described above with respect to FIG. 9.
The contrast loss calculation module 1460 is configured to determine a contrast loss value corresponding to the current training sample set according to the refractory loss value. The operation of the contrast loss calculation module 1460 may refer to the operation of 940 described above with respect to fig. 9.
It should be noted that the operations of the above-mentioned feature extraction module 1410, the similar feature determination module 1420, and the similarity determination module 1430 may refer to the relevant descriptions of the above-mentioned feature extraction module 1311, the similar feature determination module 1312, and the similarity determination module 1313.
Returning to fig. 13, the parameter adjustment unit 1320 is configured to adjust the model parameters of the current feature extraction model according to the contrast loss value in response to the training end condition not being satisfied, wherein the feature extraction model after the model parameter adjustment serves as the current feature extraction model of the next model training process. The operation of the parameter adjustment unit 1320 may refer to the operation of 880 described above in fig. 8.
Embodiments of a method and apparatus for determining similarity between text and video, a text-video retrieval method and apparatus, a method and apparatus for training a feature extraction model according to embodiments of the present specification are described above with reference to fig. 1 to 14.
The means for determining the similarity between text and video, the text-video retrieving means, and the means for training the feature extraction model in the embodiments of the present disclosure may be implemented in hardware, in software, or in a combination of hardware and software. Taking a software implementation as an example, the apparatus in a logical sense is formed by a processor of the device where it is located reading corresponding computer program instructions from non-volatile storage into memory and running them. In the embodiments of the present specification, the means for determining the similarity between text and video, the text-video retrieving means, and the means for training the feature extraction model may be implemented, for example, by an electronic device.
Fig. 15 shows a schematic diagram of an apparatus 1500 for determining similarity between text and video according to an embodiment of the present description.
As shown in fig. 15, an apparatus 1500 for determining similarity between text and video may include at least one processor 1510, a storage (e.g., a non-volatile memory) 1520, a memory 1530, and a communication interface 1540, and the at least one processor 1510, the storage 1520, the memory 1530, and the communication interface 1540 are connected together via a bus 1550. The at least one processor 1510 executes at least one computer-readable instruction (i.e., the elements described above as being implemented in software) stored or encoded in the memory.
In one embodiment, computer-executable instructions are stored in memory that, when executed, cause the at least one processor 1510 to: providing the text and video included in the obtained text video pair to a text feature extraction model and a video feature extraction model respectively to obtain corresponding text features and video features, wherein the text features comprise word features corresponding to each word (token) included in the text, and the video features comprise image feature sequences extracted based on images included in the video; according to the similarity between the image features in the image feature sequence, determining the similar image features corresponding to the image features; determining related logographic feature-image feature pairs according to the similarity between the text feature and each image feature in the image feature sequence; aiming at each related character feature-image feature pair, the similarity between the character feature and the image feature indicated by the related character feature-image feature pair and the similarity between the similar image feature corresponding to the image feature and the text feature are aggregated, and similar image constraint similarity is generated; and determining the similarity between the text and the video based on the obtained similarity image constraint similarity.
It should be appreciated that the computer-executable instructions stored in the memory, when executed, cause the at least one processor 1510 to perform the various operations and functions described above in connection with fig. 1-6 in various embodiments of the present specification.
Fig. 16 shows a schematic diagram of a text-to-video retrieval device 1600 of an embodiment of the present description.
As shown in fig. 16, the text-video retrieval device 1600 can include at least one processor 1610, a storage (e.g., a non-volatile memory) 1620, a memory 1630, and a communication interface 1640, and the at least one processor 1610, the storage 1620, the memory 1630, and the communication interface 1640 are connected together via a bus 1650. The at least one processor 1610 executes at least one computer-readable instruction (i.e., the elements described above that are implemented in software) stored or encoded in the memory.
In one embodiment, computer-executable instructions are stored in memory that, when executed, cause the at least one processor 1610 to: receiving query text provided by a user; according to the method for determining the similarity between the text and the video, the similarity between the query text and the candidate video included in each query text video pair is determined, wherein each query text video pair is obtained according to the query text and each candidate video in the candidate video set; determining matching videos from the candidate video set as video search results based on the determined similarity; and providing the video search results to the user.
It should be appreciated that the computer-executable instructions stored in the memory, when executed, cause the at least one processor 1610 to perform the various operations and functions described above in connection with fig. 7 in various embodiments of the present description.
Fig. 17 shows a schematic diagram of an apparatus 1700 for training a feature extraction model according to an embodiment of the present description.
As shown in fig. 17, an apparatus 1700 for training a feature extraction model may include at least one processor 1710, a storage (e.g., a non-volatile memory) 1720, a memory 1730, and a communication interface 1740, and the at least one processor 1710, the storage 1720, the memory 1730, and the communication interface 1740 are connected together via a bus 1750. The at least one processor 1710 executes at least one computer-readable instruction (i.e., the elements described above as being implemented in software) stored or encoded in the memory.
In one embodiment, computer-executable instructions are stored in memory that, when executed, cause the at least one processor 1710 to: the following model training process is circularly executed by using a training sample set until a training ending condition is met, wherein each training sample in the training sample set comprises a positive text video pair consisting of matched text data and video data or a negative text video pair consisting of unmatched text data and video data: providing the text data of each current training sample in the current training sample set for a current text feature extraction model to obtain the text features of each current training sample; providing the video data of each current training sample to a current video feature extraction model to obtain an image feature sequence of each current training sample; aiming at each current training sample, determining the similar image features corresponding to each image feature according to the similarity between each image feature in the image feature sequence of the current training sample; for each current training sample, determining related character feature-image feature pairs according to the similarity between the text features of the current training sample and each image feature in the corresponding image feature sequence; aiming at each related character feature-image feature pair, the similarity between the character feature and the image feature indicated by the related character feature-image feature pair and the similarity between the similar image feature corresponding to the image feature and the text feature are aggregated, and similar image constraint similarity is generated; based on the obtained similar image constraint similarity, determining the similarity between the text data of the current training sample and the corresponding video data; determining a contrast loss value corresponding to the current training sample set based on the determined similarity between the text data and the corresponding video data of each current training sample; and responding to the condition that the training ending condition is not met, and adjusting model parameters of the current feature extraction model according to the comparison loss value, wherein the feature extraction model subjected to model parameter adjustment serves as the current feature extraction model of the next model training process, and the feature extraction model comprises a text feature extraction model and a video feature extraction model.
It should be appreciated that the computer-executable instructions stored in the memory, when executed, cause the at least one processor 1710 to perform the various operations and functions described above in connection with fig. 8-9 in various embodiments of the present specification.
According to one embodiment, a program product, such as a computer readable medium, is provided. The computer-readable medium may have instructions (i.e., the elements described above implemented in software) that, when executed by a computer, cause the computer to perform the various operations and functions described above in connection with fig. 1-9 in various embodiments of the present specification.
In particular, a system or apparatus provided with a readable storage medium having stored thereon software program code implementing the functions of any of the above embodiments may be provided, and a computer or processor of the system or apparatus may be caused to read out and execute instructions stored in the readable storage medium.
In this case, the program code itself read from the readable medium may implement the functions of any of the above-described embodiments, and thus the machine-readable code and the readable storage medium storing the machine-readable code form part of the present invention.
Computer program code required for operation of portions of the present description may be written in any one or more programming languages, including an object oriented programming language such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, Python and the like, a conventional programming language such as C, Visual Basic 2003, Perl, COBOL 2002, PHP and ABAP, a dynamic programming language such as Python, Ruby and Groovy, or other programming languages. The program code may execute entirely on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any form of network, such as a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet), or to a cloud computing environment, or offered as a service, such as software as a service (SaaS).
Examples of readable storage media include floppy disks, hard disks, magneto-optical disks, optical disks (e.g., CD-ROMs, CD-R, CD-RWs, DVD-ROMs, DVD-RAMs, DVD-RWs), magnetic tapes, nonvolatile memory cards, and ROMs. Alternatively, the program code may be downloaded from a server computer or cloud by a communications network.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
Not all steps or units in the above-mentioned flowcharts and system configuration diagrams are necessary, and some steps or units may be omitted according to actual needs. The order of execution of the steps is not fixed and may be determined as desired. The apparatus structures described in the above embodiments may be physical structures or logical structures, that is, some units may be implemented by the same physical entity, or some units may be implemented by multiple physical entities, or may be implemented jointly by some components in multiple independent devices.
The term "exemplary" used throughout this specification means "serving as an example, instance, or illustration," and does not mean "preferred" or "advantageous over other embodiments. The detailed description includes specific details for the purpose of providing an understanding of the described technology. However, the techniques may be practiced without these specific details. In some instances, well-known structures and devices are shown in block diagram form in order to avoid obscuring the concepts of the described embodiments.
The alternative implementation manner of the embodiment of the present disclosure has been described in detail above with reference to the accompanying drawings, but the embodiment of the present disclosure is not limited to the specific details of the foregoing implementation manner, and various simple modifications may be made to the technical solution of the embodiment of the present disclosure within the scope of the technical concept of the embodiment of the present disclosure, and all the simple modifications belong to the protection scope of the embodiment of the present disclosure.
The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (15)

1. A method for determining similarity between text and video, comprising:
providing the text and the video included in the obtained text video pair to a text feature extraction model and a video feature extraction model respectively to obtain corresponding text features and video features, wherein the text features comprise word features corresponding to each word included in the text, and the video features comprise image feature sequences extracted based on images included in the video;
According to the similarity between the image features in the image feature sequence, determining the similar image features corresponding to the image features;
determining related logographic feature-image feature pairs according to the similarity between the text feature and each image feature in the image feature sequence;
aiming at each related character feature-image feature pair, the similarity between the character feature and the image feature indicated by the related character feature-image feature pair and the similarity between the similar image feature corresponding to the image feature and the text feature are aggregated, and similar image constraint similarity is generated; and
and determining the similarity between the text and the video based on the obtained similar image constraint similarity.
2. The method of claim 1, wherein the determining, based on the similarity between the image features in the sequence of image features, the near image features corresponding to the image features comprises:
determining the similarity between each image feature in the image feature sequence to obtain an inter-image self-similarity matrix; and
and aiming at each image feature in the image feature sequence, determining the similar image feature corresponding to the image feature according to the maximum value of the similarity between the image feature and other image features indicated by the self-similarity matrix between the images.
3. The method of claim 1 or 2, wherein the close-image constraint similarity comprises the text-to-graph close-image constraint similarity,
said determining related logographic feature-image feature pairs from similarities between said text feature and individual image features in said sequence of image features comprising:
for each logographic feature, selecting the corresponding related image feature from the image feature sequence according to the similarity between the logographic feature and each image feature in the image feature sequence to generate the related logographic feature-image feature pair,
and the aggregating of the similarity between the indicated logographic feature and the image feature and the similarity between the similar image feature corresponding to the image feature and the text feature to generate the similar image constraint similarity comprises:
and aggregating the similarity between the indicated character feature and the image feature, the similarity between the image feature and the corresponding similar image feature, and the similarity between the similar image feature corresponding to the image feature and the character feature, so as to generate the text-image similar image constraint similarity.
4. The method of claim 1 or 2, wherein the close-image constraint similarity comprises the graph-text close-image constraint similarity,
said determining related logographic feature-image feature pairs from similarities between said text feature and individual image features in said sequence of image features comprising:
for each image feature in the image feature sequence, selecting corresponding related character features from each character feature according to the similarity between the image feature and each character feature to generate related character feature-image feature pairs,
and the aggregating of the similarity between the indicated logographic feature and the image feature and the similarity between the similar image feature corresponding to the image feature and the text feature to generate the similar image constraint similarity comprises:
and aggregating the similarity between the indicated character feature and the image feature, the similarity between the image feature and the corresponding similar image feature, and the similarity between the similar image feature corresponding to the image feature and each character feature to generate the graph-text similar image constraint similarity.
5. The method of claim 3, wherein the text features further comprise global text features corresponding to respective logographic features,
before the determining of the similarity between the text and the video based on the obtained similarity image constraint similarity, the method further comprises:
selecting related image features from the image feature sequence according to the similarity between the global text feature and each image feature in the image feature sequence; and
aggregating the similarity between the global text feature and the corresponding selected related image feature, the similarity between the corresponding selected related image feature and the corresponding similar image feature, and the similarity between the corresponding similar image feature and the global text feature to generate a global similar image constraint similarity,
the determining the similarity between the text and the video based on the obtained similarity image constraint similarity comprises:
generating weighted similar image constraint similarity between the character features and the image feature sequence according to the weight corresponding to each character feature and the corresponding text-graph similar image constraint similarity; and
And determining the similarity between the text and the video according to the global close image constraint similarity and the weighted close image constraint similarity.
6. The method of claim 5, wherein the weights for each of the respective logographic features are derived by:
determining the part of speech of each character corresponding to each character feature;
determining weight assignment corresponding to each word symbol according to a preset part-of-speech-weight assignment relation; and
and normalizing each weight assignment to obtain the weight corresponding to each character feature.
7. A text video retrieval method comprising:
receiving query text provided by a user;
determining, according to the method for determining similarity between text and video of any one of claims 1 to 6, the similarity between the query text and the candidate video comprised by each query text video pair, wherein each query text video pair is derived from the query text and each candidate video in the candidate video set;
determining matching videos from the candidate video set as video search results based on the determined similarity; and
and providing the video search results to the user.
8. A method for training a feature extraction model, wherein the feature extraction model comprises a text feature extraction model and a video feature extraction model, the method comprising:
the following model training process is circularly executed by using a training sample set until a training ending condition is met, wherein each training sample in the training sample set comprises a positive text video pair consisting of matched text data and video data or a negative text video pair consisting of unmatched text data and video data:
providing the text data of each current training sample in the current training sample set for a current text feature extraction model to obtain the text features of each current training sample;
providing the video data of each current training sample to a current video feature extraction model to obtain an image feature sequence of each current training sample;
aiming at each current training sample, determining the similar image features corresponding to each image feature according to the similarity between each image feature in the image feature sequence of the current training sample;
for each of the current training samples,
determining related character feature-image feature pairs according to the similarity between the text features of the current training sample and each image feature in the corresponding image feature sequence;
Aiming at each related character feature-image feature pair, the similarity between the character feature and the image feature indicated by the related character feature-image feature pair and the similarity between the similar image feature corresponding to the image feature and the text feature are aggregated, and similar image constraint similarity is generated;
based on the obtained similar image constraint similarity, determining the similarity between the text data of the current training sample and the corresponding video data;
determining a contrast loss value corresponding to the current training sample set based on the determined similarity between the text data and the corresponding video data of each current training sample; and
and adjusting model parameters of the current feature extraction model according to the comparison loss value in response to the condition that the training ending condition is not met, wherein the feature extraction model subjected to model parameter adjustment serves as the current feature extraction model of the next model training process.
9. The method of claim 8, wherein prior to said determining a corresponding contrast loss value for the current set of training samples based on the determined similarity between text data and corresponding video data for each current training sample, the method further comprises:
According to the magnitude relation between the similarity between the text data of each current negative example text video pair and the corresponding video data and the similarity between the text data of the corresponding positive example text video pair and the corresponding video data, determining a current difficult-case negative example text video pair set from the current negative example text video pair set;
determining the symmetry probability of each current difficult-case negative-example text video pair in the current difficult-case negative-example text video pair set based on a symmetrical contrast learning loss function; and
determining a refractory loss value of the current refractory negative text video pair set according to the determined symmetry probability,
the determining the contrast loss value corresponding to the current training sample set based on the determined similarity between the text data and the corresponding video data of each current training sample includes:
and determining a contrast loss value corresponding to the current training sample set according to the difficult loss value.
10. An apparatus for determining similarity between text and video, comprising:
a feature extraction unit configured to provide the text and video included in the acquired text-video pair to a text feature extraction model and a video feature extraction model, respectively, to obtain corresponding text features and video features, wherein the text features include word features corresponding to each word included in the text, and the video features include an image feature sequence extracted based on images included in the video;
A similar feature determining unit configured to determine similar image features corresponding to the image features according to the similarity between the image features in the image feature sequence;
a related feature pair determining unit configured to determine related logographic feature-image feature pairs according to a degree of similarity between the text feature and each image feature in the image feature sequence;
a similarity aggregation unit configured to aggregate, for each related logographic feature-image feature pair, a similarity between the logographic feature and the image feature indicated by the related logographic feature-image feature pair and a similarity between a similar image feature corresponding to the image feature and the text feature, to generate a similar image constraint similarity; and
and a similarity determination unit configured to determine a similarity between the text and the video based on the obtained close image constraint similarity.
11. A text video retrieval apparatus comprising:
a text receiving unit configured to receive a query text provided by a user;
a similarity calculation unit configured to determine a similarity between the query text and the candidate video included in each query text video pair according to the apparatus for determining a similarity between text and video according to claim 10, wherein each query text video pair is derived from the query text and each candidate video in the candidate video set;
A video result providing unit configured to determine a matching video from the candidate video set as a video search result based on the determined similarity; and providing the video search results to the user.
12. An apparatus for training a feature extraction model, wherein the feature extraction model comprises a text feature extraction model and a video feature extraction model, the apparatus being configured to perform a model training process by a training unit with a training sample set, each training sample in the training sample set comprising either a positive example text video pair consisting of matched text data and video data or a negative example text video pair consisting of non-matched text data and video data, until a training end condition is met, the training unit comprising:
the feature extraction module is configured to provide text data of each current training sample in the current training sample set for the current text feature extraction model to obtain text features of each current training sample; providing the video data of each current training sample to a current video feature extraction model to obtain an image feature sequence of each current training sample;
The similar feature determining module is configured to determine similar image features corresponding to the image features of each current training sample according to the similarity between the image features in the image feature sequence of the current training sample;
a similarity determination module configured to determine, for each current training sample, a related logographic feature-image feature pair according to a similarity between a text feature of the current training sample and each image feature in a corresponding image feature sequence; aiming at each related character feature-image feature pair, the similarity between the character feature and the image feature indicated by the related character feature-image feature pair and the similarity between the similar image feature corresponding to the image feature and the text feature are aggregated, and similar image constraint similarity is generated; based on the obtained similar image constraint similarity, determining the similarity between the text data of the current training sample and the corresponding video data;
a contrast loss calculation module configured to determine a contrast loss value corresponding to the current training sample set based on the determined similarity between the text data and the corresponding video data of each current training sample; and
The apparatus further comprises:
and the parameter adjustment unit is configured to adjust the model parameters of the current feature extraction model according to the contrast loss value in response to the condition that the training end condition is not met, wherein the feature extraction model subjected to model parameter adjustment serves as the current feature extraction model of the next model training process.
13. An apparatus for determining similarity between text and video, comprising: at least one processor, a memory coupled with the at least one processor, and a computer program stored on the memory, the at least one processor executing the computer program to implement the method of any one of claims 1 to 6.
14. A text video retrieval apparatus comprising: at least one processor, a memory coupled with the at least one processor, and a computer program stored on the memory, the at least one processor executing the computer program to implement the method of claim 7.
15. An apparatus for training a feature extraction model, comprising: at least one processor, a memory coupled with the at least one processor, and a computer program stored on the memory, the at least one processor executing the computer program to implement the method of claim 8 or 9.
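By way of illustration, the following is a minimal PyTorch-style sketch of the similarity computation and loss recited in claim 12, assuming cosine similarity between features, top-k selection of the related text feature-image feature pairs, averaging as the aggregation, and a symmetric InfoNCE-style contrastive loss; the function names, hyperparameters, and these concrete choices are assumptions made for the sketch rather than details taken from the application.

```python
import torch
import torch.nn.functional as F


def text_video_similarity(text_feat: torch.Tensor,
                          frame_feats: torch.Tensor,
                          top_k: int = 4) -> torch.Tensor:
    """text_feat: (D,), frame_feats: (K, D) -> scalar text-video similarity."""
    t = F.normalize(text_feat, dim=-1)        # (D,)
    v = F.normalize(frame_feats, dim=-1)      # (K, D)

    # Similar image feature for each frame: the most similar *other* frame
    # in the same image feature sequence.
    frame_sim = v @ v.t()                     # (K, K)
    frame_sim.fill_diagonal_(float("-inf"))
    similar_idx = frame_sim.argmax(dim=-1)    # (K,)

    # Related text feature-image feature pairs: here, the frames whose
    # similarity to the text feature is highest.
    tv_sim = v @ t                            # (K,)
    k = min(top_k, tv_sim.numel())
    related = tv_sim.topk(k).indices

    # Similar-image-constrained similarity: aggregate (here, average) the
    # text-frame similarity with the similarity between the text feature
    # and the similar image feature corresponding to that frame.
    constrained = 0.5 * (tv_sim[related] + tv_sim[similar_idx[related]])

    # Aggregate over the related pairs to obtain the text-video similarity.
    return constrained.mean()


def contrastive_loss(text_feats: torch.Tensor,
                     frame_feats: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """text_feats: (B, D), frame_feats: (B, K, D).
    Matched text/video pairs share a batch index (positive examples);
    every other text-video combination serves as a negative example."""
    B = text_feats.size(0)
    sims = torch.stack([
        torch.stack([text_video_similarity(text_feats[i], frame_feats[j])
                     for j in range(B)])
        for i in range(B)
    ])                                        # (B, B) similarity matrix
    logits = sims / temperature
    labels = torch.arange(B, device=logits.device)
    # Symmetric InfoNCE-style loss over text->video and video->text directions.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```

In a retrieval setting such as that of claim 11, text_video_similarity would be evaluated between the query text feature and the frame features of each candidate video, and the highest-scoring candidate returned as the video search result.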
CN202310906058.1A 2023-07-21 2023-07-21 Method and device for determining similarity between text and video Pending CN116958868A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310906058.1A CN116958868A (en) 2023-07-21 2023-07-21 Method and device for determining similarity between text and video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310906058.1A CN116958868A (en) 2023-07-21 2023-07-21 Method and device for determining similarity between text and video

Publications (1)

Publication Number Publication Date
CN116958868A (en) 2023-10-27

Family

ID=88442158

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310906058.1A Pending CN116958868A (en) 2023-07-21 2023-07-21 Method and device for determining similarity between text and video

Country Status (1)

Country Link
CN (1) CN116958868A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117556276A (en) * 2024-01-11 2024-02-13 Alipay (Hangzhou) Information Technology Co., Ltd. Method and device for determining similarity between text and video

Similar Documents

Publication Publication Date Title
US20200380366A1 (en) Enhanced generative adversarial network and target sample recognition method
WO2019084867A1 (en) Automatic answering method and apparatus, storage medium, and electronic device
CN110472675B (en) Image classification method, image classification device, storage medium and electronic equipment
WO2019119505A1 (en) Face recognition method and device, computer device and storage medium
CN110069709B (en) Intention recognition method, device, computer readable medium and electronic equipment
CN110929515A (en) Reading understanding method and system based on cooperative attention and adaptive adjustment
CN113313022B (en) Training method of character recognition model and method for recognizing characters in image
WO2020224405A1 (en) Image processing method and apparatus, computer-readable medium and electronic device
EP3620994A1 (en) Methods, apparatuses, devices, and computer-readable storage media for determining category of entity
CN111522908A (en) Multi-label text classification method based on BiGRU and attention mechanism
CN110968725B (en) Image content description information generation method, electronic device and storage medium
CN112632226B (en) Semantic search method and device based on legal knowledge graph and electronic equipment
WO2020151175A1 (en) Method and device for text generation, computer device, and storage medium
CN110929640B (en) Wide remote sensing description generation method based on target detection
US10824808B2 (en) Robust key value extraction
CN116958868A (en) Method and device for determining similarity between text and video
Zhang et al. A generative adversarial network–based method for generating negative financial samples
CN111651660B (en) Method for cross-media retrieval of difficult samples
CN112765330A (en) Text data processing method and device, electronic equipment and storage medium
CN107533672A (en) Pattern recognition device, mode identification method and program
CN113723111B (en) Small sample intention recognition method, device, equipment and storage medium
CN113555005B (en) Model training method, model training device, confidence determining method, confidence determining device, electronic equipment and storage medium
CN115098722A (en) Text and image matching method and device, electronic equipment and storage medium
CN114676237A (en) Sentence similarity determining method and device, computer equipment and storage medium
CN112507912A (en) Method and device for identifying illegal picture

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination