CN116166843A - Text video cross-modal retrieval method and device based on fine granularity perception - Google Patents

Text video cross-modal retrieval method and device based on fine granularity perception

Info

Publication number
CN116166843A
CN116166843A
Authority
CN
China
Prior art keywords
text
video
sample
feature vector
matched
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310200445.3A
Other languages
Chinese (zh)
Other versions
CN116166843B (en)
Inventor
罗引
郝艳妮
马先钦
郝保
方省
曹家
王磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongke Wenge Technology Co ltd
Original Assignee
Beijing Zhongke Wenge Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongke Wenge Technology Co ltd filed Critical Beijing Zhongke Wenge Technology Co ltd
Priority to CN202310200445.3A priority Critical patent/CN116166843B/en
Publication of CN116166843A publication Critical patent/CN116166843A/en
Application granted granted Critical
Publication of CN116166843B publication Critical patent/CN116166843B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Library & Information Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure relates to a text video cross-modal retrieval method and device based on fine granularity perception. The method comprises the following steps: extracting features of a text to be matched through a text feature coding model to obtain a text feature vector set of a plurality of words of the text to be matched; extracting features of a video to be matched through a video feature coding model to obtain a target feature vector set of a plurality of target objects; and determining a relevance score between the target feature vector set and the text feature vector set through a cross-modal matching model. According to the text video cross-modal retrieval method based on fine granularity perception, finer-grained semantic features can be introduced into the retrieval task, and the ability to recognize and contrast these finer-grained semantic features is trained during model training, so that the model can retrieve based on finer-grained semantic features and the accuracy of cross-modal retrieval can be improved.

Description

Text video cross-modal retrieval method and device based on fine granularity perception
Technical Field
The disclosure relates to the technical field of computers, in particular to a text video cross-modal retrieval method and device based on fine granularity perception.
Background
Text-to-video retrieval is a fundamental research task in multimodal video and language understanding, which aims to return the most relevant video or video segments for a given query text, and vice versa. With the rapid growth in the number of internet videos, text-to-video retrieval has become a new requirement and has achieved significant results in many video-text tasks. Nevertheless, text-to-video retrieval remains a challenging problem because of the large semantic gap between video and text and the complex matching patterns involved.
To bridge the semantic gap between video and text, many methods in the related art decompose the problem into two parts, namely visual feature representation in the video domain and textual feature representation in the text domain, and then calculate the similarity between the two.
Although the mechanisms for obtaining feature representations and the alignment strategies are continually being improved in the related art, these approaches overlook that video-text matching is not only a cross-modal matching task but also a complex and subjective cross-modal cognitive process. Current video-text matching techniques mainly focus on extracting stronger multi-modal global features, designing more accurate alignment strategies, and training correlation calculation networks, without considering fine-grained features across modalities, so they cannot handle more abstract or finer-grained retrieval tasks (such as the number of objects in a picture, different car types, specific actions of a person, or types of flowers).
The information disclosed in the background section of this application is only for enhancement of understanding of the general background of this application and should not be taken as an acknowledgement or any form of suggestion that this information forms the prior art already known to a person skilled in the art.
Disclosure of Invention
The disclosure provides a text video cross-modal retrieval method and device based on fine granularity perception.
According to an aspect of the present disclosure, there is provided a text video cross-modal retrieval method based on fine granularity perception, including:
extracting features of a text to be matched through a text feature coding model to obtain a text feature vector set of a plurality of words of the text to be matched, wherein the words comprise verbs and nouns, and the feature vector set comprises text feature vectors corresponding to the verbs and text feature vectors corresponding to the nouns;
extracting features of a video to be matched through a video feature coding model to obtain a target feature vector set of a plurality of target objects in the video to be matched, wherein the target feature vector set comprises target feature vectors respectively corresponding to the plurality of target objects;
and determining a relevance score between the target feature vector set and the text feature vector set through a cross-modal matching model, wherein the relevance score is used for searching videos corresponding to the texts to be matched in a plurality of videos to be matched or searching texts corresponding to the videos to be matched in a plurality of texts to be matched, and the cross-modal matching model is obtained through training the training texts after verbs or nouns in text samples are randomly removed.
In one possible implementation manner, feature extraction is performed on a text to be matched through a text feature coding model, and a text feature vector set of a plurality of words of the text to be matched is obtained, including:
word segmentation is carried out on the text to be matched to obtain a plurality of words of the text to be matched;
extracting features of the words through the text feature coding model to obtain text feature vectors corresponding to the words;
and obtaining the text feature vector set according to a plurality of text feature vectors.
In one possible implementation manner, feature extraction is performed on a video to be matched through a video feature coding model, and a target feature vector set of a plurality of target objects in the video to be matched is obtained, including:
sampling the video to be matched to obtain a plurality of sampling frames;
detecting target objects in the plurality of sampling frames to obtain areas where the target objects in the sampling frames are located;
reserving the area of the target object in each sampling frame, covering the non-target area, and obtaining a grid image corresponding to each sampling frame;
extracting the characteristics of the region where the target object is located in each grid image through a video characteristic coding model to obtain a target characteristic vector of each target object;
And obtaining the target feature vector set according to a plurality of target feature vectors.
In one possible implementation, determining a relevance score between the set of target feature vectors and the set of text feature vectors by a cross-modality matching model includes:
determining the similarity between each target feature vector in the target feature vector set and each text feature vector in the text feature vector set through a cross-modal matching model;
and inputting the similarity into a fully-connected network to obtain the relevance score.
In one possible implementation, the method further includes:
extracting characteristics of a text sample and a plurality of words of the text sample through a text characteristic coding model to obtain a text global characteristic vector of the text sample and a sample text characteristic vector set of the plurality of words;
extracting features of a video sample and grid images of the video sample through a video feature coding model to obtain a video global feature vector of the video sample and a sample target feature vector set of a plurality of target objects in the video sample;
determining target perception contrast loss according to the sample text feature vector set, the sample target feature vector set, the text global feature vector and the video global feature vector;
Determining feature fusion contrast loss according to the cross-modal matching model, training texts after random removal of verbs or nouns in text samples, video samples matched with the text samples, the text global feature vector and the video global feature vector;
determining video text matching contrast loss according to the sample text feature vector set and the sample target feature vector set;
determining the comprehensive loss of the text feature coding model, the video feature coding model and the cross-modal matching model according to the target perception contrast loss, the feature fusion contrast loss and the video text matching contrast loss;
and training the text feature coding model, the video feature coding model and the cross-modal matching model according to the comprehensive loss to obtain a trained text feature coding model, a trained video feature coding model and a trained cross-modal matching model.
In one possible implementation, determining the target perceptual contrast loss from the set of sample text feature vectors, the set of sample target feature vectors, the text global feature vector, and the video global feature vector includes:
Determining text alignment loss according to the sample text feature vector set, the sample target feature vector set and the video global feature vector;
determining video alignment loss according to the sample text feature vector set, the sample target feature vector set and the text global feature vector;
and determining the target perception contrast loss according to the video alignment loss and the text alignment loss.
In one possible implementation manner, determining a feature fusion contrast loss according to the cross-modal matching model, training text after randomly removing verbs or nouns in text samples, video samples matched with the text samples, and the text global feature vector and the video global feature vector includes:
determining a first noise contrast estimation loss between the text global feature vector and the video global feature vector;
determining a second noise contrast estimation loss according to the cross-modal matching model, a first training text obtained by randomly removing nouns in a text sample and a video sample matched with the text sample;
determining a third noise contrast estimation loss according to the cross-modal matching model, a second training text after the verbs in the text samples are randomly removed and a video sample matched with the text samples;
and determining the feature fusion contrast loss according to the first noise contrast estimation loss, the second noise contrast estimation loss and the third noise contrast estimation loss.
In one possible implementation, determining the second noise contrast estimation loss according to the cross-modal matching model, the first training text after the nouns in the text sample are randomly removed, and the video sample matched with the text sample includes:
extracting features of the first training text through the text feature coding model to obtain noun question word features;
extracting features of the video samples matched with the text samples through the video feature coding model to obtain query reference features;
acquiring noun answer characteristics for the noun question word characteristics and the query reference characteristics through the cross-modal matching model;
inputting the removed nouns into the text feature coding model to obtain noun features;
and determining the second noise contrast estimation loss according to the noun characteristics and the noun answer characteristics.
In one possible implementation manner, determining a third noise contrast estimation loss according to the cross-modal matching model, the second training text after the verbs in the text samples are randomly removed, and the video samples matched with the text samples, includes:
extracting features of the second training text through the text feature coding model to obtain verb question word features;
extracting features of the video samples matched with the text samples through the video feature coding model to obtain query reference features;
obtaining verb answer features for the verb question word features and the query reference features through the cross-modal matching model;
inputting the removed verbs into the text feature coding model to obtain verb features;
and determining the third noise contrast estimation loss according to the verb feature and the verb answer feature.
In one possible implementation, determining a video text matching contrast loss from the set of sample text feature vectors and the set of sample target feature vectors includes:
determining a first contrast loss according to a sample text feature vector set of any text sample in a text sample set, a sample target feature vector set of a video sample matched with the text sample in a video sample set, and a sample target feature vector set of a video sample not matched with the text sample in the video sample set;
determining a second contrast loss according to a sample target feature vector set of any video sample in the video sample set, a sample text feature vector set of a text sample in the text sample set that is matched with the video sample, and a sample text feature vector set of a text sample in the text sample set that is not matched with the video sample;
And determining the video text matching contrast loss according to the first contrast loss and the second contrast loss.
According to another aspect of the present disclosure, there is provided a text video cross-modal retrieval apparatus based on fine granularity perception, the apparatus comprising:
the text coding module is used for extracting features of a text to be matched through a text feature coding model to obtain a text feature vector set of a plurality of words of the text to be matched, wherein the words comprise verbs and nouns, and the feature vector set comprises text feature vectors corresponding to the verbs and text feature vectors corresponding to the nouns;
the video coding module is used for extracting the characteristics of the video to be matched through a video characteristic coding model to obtain a target characteristic vector set of a plurality of target objects in the video to be matched, wherein the target characteristic vector set comprises target characteristic vectors respectively corresponding to the plurality of target objects;
the matching module is used for determining a correlation score between the target feature vector set and the text feature vector set through a cross-modal matching model, wherein the correlation score is used for searching videos corresponding to the texts to be matched in a plurality of videos to be matched or searching texts corresponding to the videos to be matched in a plurality of texts to be matched, and the cross-modal matching model is obtained through training texts after verbs or nouns in text samples are randomly removed.
In one possible implementation, the text encoding module is further configured to:
word segmentation is carried out on the text to be matched to obtain a plurality of words of the text to be matched;
extracting features of the words through the text feature coding model to obtain text feature vectors corresponding to the words;
and obtaining the text feature vector set according to a plurality of text feature vectors.
In one possible implementation, the video encoding module is further configured to:
sampling the video to be matched to obtain a plurality of sampling frames;
detecting target objects in the plurality of sampling frames to obtain areas where the target objects in the sampling frames are located;
reserving the area of the target object in each sampling frame, covering the non-target area, and obtaining a grid image corresponding to each sampling frame;
extracting the characteristics of the region where the target object is located in each grid image through a video characteristic coding model to obtain a target characteristic vector of each target object;
and obtaining the target feature vector set according to a plurality of target feature vectors.
In one possible implementation, the matching module is further configured to:
Determining the similarity between each target feature vector in the target feature vector set and each text feature vector in the text feature vector set through a cross-modal matching model;
and inputting the similarity into a fully-connected network to obtain the relevance score.
In a possible implementation manner, the apparatus further includes a training module, configured to:
extracting characteristics of a text sample and a plurality of words of the text sample through a text characteristic coding model to obtain a text global characteristic vector of the text sample and a sample text characteristic vector set of the plurality of words;
extracting features of a video sample and grid images of the video sample through a video feature coding model to obtain a video global feature vector of the video sample and a sample target feature vector set of a plurality of target objects in the video sample;
determining target perception contrast loss according to the sample text feature vector set, the sample target feature vector set, the text global feature vector and the video global feature vector;
determining feature fusion contrast loss according to the cross-modal matching model, training texts after random removal of verbs or nouns in text samples, video samples matched with the text samples, the text global feature vector and the video global feature vector;
Determining video text matching contrast loss according to the sample text feature vector set and the sample target feature vector set;
determining the comprehensive loss of the text feature coding model, the video feature coding model and the cross-modal matching model according to the target perception contrast loss, the feature fusion contrast loss and the video text matching contrast loss;
and training the text feature coding model, the video feature coding model and the cross-modal matching model according to the comprehensive loss to obtain a trained text feature coding model, a trained video feature coding model and a trained cross-modal matching model.
In one possible implementation, the training module is further configured to:
determining text alignment loss according to the sample text feature vector set, the sample target feature vector set and the video global feature vector;
determining video alignment loss according to the sample text feature vector set, the sample target feature vector set and the text global feature vector;
and determining the target perception contrast loss according to the video alignment loss and the text alignment loss.
In one possible implementation, the training module is further configured to:
determining a first noise contrast estimation loss between the text global feature vector and the video global feature vector;
determining a second noise contrast estimation loss according to the cross-modal matching model, a first training text obtained by randomly removing nouns in a text sample and a video sample matched with the text sample;
determining a third noise contrast estimation loss according to the cross-modal matching model, a second training text after the verbs in the text samples are randomly removed and a video sample matched with the text samples;
and determining the feature fusion contrast loss according to the first noise contrast estimation loss, the second noise contrast estimation loss and the third noise contrast estimation loss.
In one possible implementation, the training module is further configured to:
extracting features of the first training text through the text feature coding model to obtain noun question word features;
extracting features of the video samples matched with the text samples through the video feature coding model to obtain query reference features;
Acquiring noun answer characteristics for the noun question word characteristics and the query reference characteristics through the cross-modal matching model;
inputting the removed nouns into the text feature coding model to obtain noun features;
and determining the second noise contrast estimation loss according to the noun characteristics and the noun answer characteristics.
In one possible implementation, the training module is further configured to:
extracting features of the second training text through the text feature coding model to obtain verb question word features;
extracting features of the video samples matched with the text samples through the video feature coding model to obtain query reference features;
obtaining verb answer features for the verb question word features and the query reference features through the cross-modal matching model;
inputting the removed verbs into the text feature coding model to obtain verb features;
and determining the third noise contrast estimation loss according to the verb feature and the verb answer feature.
In one possible implementation, the training module is further configured to:
determining a first contrast loss according to a sample text feature vector set of any text sample in a text sample set, a sample target feature vector set of a video sample matched with the text sample in a video sample set, and a sample target feature vector set of a video sample not matched with the text sample in the video sample set;
Determining a second contrast loss according to a sample target feature vector set of any video sample in the video sample set, a sample text feature vector set of a text sample in the text sample set that is matched with the video sample, and a sample text feature vector set of a text sample in the text sample set that is not matched with the video sample;
and determining the video text matching contrast loss according to the first contrast loss and the second contrast loss.
According to an aspect of the present disclosure, there is provided a text video cross-modal retrieval device based on fine granularity perception, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke the instructions stored in the memory to perform the above method.
According to an aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method.
According to the text video cross-modal retrieval method based on fine granularity perception, the text feature coding model can obtain the text feature vectors of the verbs and nouns of the text to be matched, and the video feature coding model can obtain the target feature vectors of a plurality of target objects in a grid image in which the non-target areas are covered, so that finer-grained feature information is obtained; the cross-modal matching model can then align the target feature vectors with the text feature vectors and determine their similarity. Finer-grained features can therefore be identified, and the accuracy of cross-modal retrieval is improved. Further, in the training process, the sample text feature vector set, the sample target feature vector set, the text global feature vector and the video global feature vector can be aligned, and the perception and alignment capabilities of the cross-modal matching model with respect to fine-grained features are improved, thereby improving model performance and retrieval accuracy.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure. Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
In order to more clearly illustrate the embodiments of the invention or the solutions in the prior art, the drawings required for describing the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the invention, and that other drawings may be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 illustrates a flow chart of a fine granularity awareness based text video cross-modality retrieval method in accordance with an embodiment of the present disclosure;
FIG. 2 shows a schematic diagram of a grid image in accordance with an embodiment of the present disclosure;
FIG. 3 illustrates a schematic diagram of model training according to an embodiment of the present disclosure;
FIG. 4 illustrates an application schematic of a fine granularity awareness based text video cross-modality retrieval method in accordance with an embodiment of the present disclosure;
FIG. 5 illustrates a block diagram of a fine granularity awareness based text video cross-modality retrieval arrangement in accordance with an embodiment of the present disclosure;
FIG. 6 illustrates a block diagram of a fine granularity perception based text video cross-modality retrieval device in accordance with an embodiment of the present disclosure;
fig. 7 shows a block diagram of an electronic device, according to an embodiment of the disclosure.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are only some embodiments of the present disclosure, not all embodiments. Based on the embodiments in this disclosure, all other embodiments that a person of ordinary skill in the art would obtain without making any inventive effort are within the scope of protection of this disclosure.
The terms "first," "second," "third," "fourth" and the like in the description and in the claims and in the above-described figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein.
It should be understood that, in the various embodiments of the present disclosure, the sequence numbers of the processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and the sequence numbers should not constitute any limitation on the implementation of the embodiments of the present disclosure.
It should be understood that in this disclosure, "comprising" and "having" and any variations thereof are intended to cover non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements that are expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in this disclosure, "plurality" means two or more. "and/or" is merely an association relationship describing an association object, and means that three relationships may exist, for example, and/or B may mean: a exists alone, A and B exist together, and B exists alone. The character "/" generally indicates that the context-dependent object is an "or" relationship. "comprising A, B and C", "comprising A, B, C" means that all three of A, B, C comprise, "comprising A, B or C" means that one of the three comprises A, B, C, and "comprising A, B and/or C" means that any 1 or any 2 or 3 of the three comprises A, B, C.
It should be understood that in this disclosure, "B corresponding to a", "a corresponding to B", or "B corresponding to a" means that B is associated with a from which B may be determined. Determining B from a does not mean determining B from a alone, but may also determine B from a and/or other information. The matching of A and B is that the similarity of A and B is larger than or equal to a preset threshold value.
As used herein, "if" may be interpreted as "at … …" or "at … …" or "in response to a determination" or "in response to detection" depending on the context.
The technical scheme of the present disclosure is described in detail below with specific examples. The following embodiments may be combined with each other, and some embodiments may not be repeated for the same or similar concepts or processes.
In one possible implementation, to address the problem that finer-grained retrieval tasks cannot be handled in the related art, the present disclosure introduces finer-grained semantic features into the retrieval task and trains the model's ability to recognize and contrast these finer-grained semantic features during model training, so that the model can retrieve based on finer-grained semantic features, thereby improving the accuracy of cross-modal retrieval.
Fig. 1 illustrates a flow chart of a text video cross-modal retrieval method based on fine granularity perception according to an embodiment of the present disclosure. As shown in Fig. 1, the method may include:
step S11, extracting features of a text to be matched through a text feature coding model to obtain a text feature vector set of a plurality of words of the text to be matched, wherein the words comprise verbs and nouns, and the feature vector set comprises text feature vectors corresponding to the verbs and text feature vectors corresponding to the nouns;
step S12, extracting features of a video to be matched through a video feature coding model to obtain a target feature vector set of a plurality of target objects in the video to be matched, wherein the target feature vector set comprises target feature vectors respectively corresponding to the plurality of target objects;
step S13, determining a relevance score between the target feature vector set and the text feature vector set through a cross-modal matching model, wherein the relevance score is used for searching videos corresponding to the texts to be matched in a plurality of videos to be matched or searching texts corresponding to the videos to be matched in a plurality of texts to be matched, and the cross-modal matching model is obtained through training texts after verbs or nouns in text samples are randomly removed.
In one possible implementation manner, the text feature coding model may extract semantic features of a text, for example, semantic features of a sentence or semantic features of a word may be extracted, and features obtained by the text feature coding model may be information in a vector form or may be information in other forms, and the specific form of the features obtained by the text feature coding model is not limited in this disclosure. The text feature encoding model may be a deep learning neural network model, and the present disclosure is not limited to a particular type of text feature encoding model.
In one possible implementation manner, the video feature encoding model may extract semantic features of a video or a video frame, or semantic features of each target object in the video or the video frame, for example, shape features, color features, and/or action features of the target object, etc., where the features acquired by the video feature encoding model may be information in a vector form, or may be information in other forms, and the specific form of the features acquired by the video feature encoding model is not limited in this disclosure. The video feature encoding model may be a deep learning neural network model, and the specific type of video feature encoding model is not limited by the present disclosure.
In one possible implementation, the cross-modal matching model may be used to fuse and compare the feature information of the text with the feature information of the video frames. For example, the features of the text and the video may be aligned (for example, by mapping) so that they are converted into features that can be compared in the same feature space (for example, features in vector form), and the feature similarity between the text and the video may then be obtained for comparison through an activation function, a fully connected network, or the like, so as to determine whether the text and the video match.
In one possible implementation manner, a specific text to be matched can be selected, and through the above model, a corresponding video can be searched in a plurality of videos to be matched, or a certain video to be matched can be selected, and in the plurality of texts to be matched, a corresponding text is searched, and the searching process is as follows.
In one possible implementation, in step S11, fine-grained features of the text to be matched may be obtained first, so that the accuracy of retrieval is improved through these fine-grained features during the retrieval process. For example, the text to be matched may include a plurality of words; the verbs and nouns among them may be determined, and feature extraction may be performed through the text feature coding model to obtain feature information of the verbs and of the nouns, for example, feature information in vector form. In this way, more and finer-grained features can be obtained than when only global features of the text to be matched (e.g., the whole sentence) are extracted.
In one possible implementation, step S11 may include: word segmentation is carried out on the text to be matched to obtain a plurality of words of the text to be matched; extracting features of the words through the text feature coding model to obtain text feature vectors corresponding to the words; and obtaining the text feature vector set according to a plurality of text feature vectors.
In one possible implementation, the text to be matched may be text containing a plurality of words, for example, a sentence or a phrase, and the text to be matched may be segmented into words in various manners, for example, through a tokenizer model or the jieba word segmentation tool; the specific manner of word segmentation is not limited in the present disclosure.
In one possible implementation, after word segmentation, a plurality of words of the text to be matched may be obtained, where the plurality of words may include verbs and nouns, and the feature of each word, that is, a text feature vector, may be obtained through the text feature coding model, where the text feature vectors include the text feature vectors of the verbs and the text feature vectors of the nouns. The set of the text feature vectors of the words is the text feature vector set. In an example, if the text to be matched includes n words, n text feature vectors may be obtained through the text feature coding model to form the text feature vector set T = {e_1, e_2, ..., e_i, ..., e_n}, where 1 ≤ i ≤ n, i and n are positive integers, and e_i is the text feature vector of the i-th word in the text to be matched.
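To make the word-level encoding concrete, the following is a minimal sketch rather than the patent's actual models: it assumes word ids and a verb/noun mask are already available from word segmentation and part-of-speech tagging, and uses a small stand-in PyTorch encoder; the names SimpleTextEncoder and encode_words are hypothetical.

```python
import torch
import torch.nn as nn

class SimpleTextEncoder(nn.Module):
    """Stand-in for the text feature coding model: embeds word ids and
    contextualizes them with a small Transformer encoder."""
    def __init__(self, vocab_size: int = 30000, dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, word_ids: torch.Tensor) -> torch.Tensor:
        # word_ids: (batch, n_words) -> (batch, n_words, dim)
        return self.encoder(self.embed(word_ids))

def encode_words(encoder: SimpleTextEncoder, word_ids: torch.Tensor,
                 verb_noun_mask: torch.Tensor) -> torch.Tensor:
    """Return the text feature vector set T = {e_1, ..., e_n}, keeping only
    the feature vectors of words marked as verbs or nouns."""
    features = encoder(word_ids)               # (1, n, dim)
    return features[0][verb_noun_mask.bool()]  # (n_kept, dim)

# toy example: 6 word ids, of which positions 1 and 4 are a verb and a noun
encoder = SimpleTextEncoder()
word_ids = torch.tensor([[12, 87, 5, 990, 43, 7]])
verb_noun_mask = torch.tensor([0, 1, 0, 0, 1, 0])
T = encode_words(encoder, word_ids, verb_noun_mask)
print(T.shape)  # torch.Size([2, 256])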
In one possible implementation manner, in step S12, the features of finer granularity of the video to be matched may be obtained, so that the accuracy of the search is improved through the features of fine granularity in the search process. For example, the video to be matched can comprise a plurality of target objects, and feature information such as outlines, colors, actions and the like of the target objects can be acquired. Thus, when the features are acquired, more features with finer granularity can be acquired compared with the global features of the video to be matched.
In one possible implementation, step S12 may include: sampling the video to be matched to obtain a plurality of sampling frames; detecting target objects in the plurality of sampling frames to obtain areas where the target objects in the sampling frames are located; reserving the area of the target object in each sampling frame, covering the non-target area, and obtaining a grid image corresponding to each sampling frame; extracting the characteristics of the region where the target object is located in each grid image through a video characteristic coding model to obtain a target characteristic vector of each target object; and obtaining the target feature vector set according to a plurality of target feature vectors.
In one possible implementation, in the video to be matched, the content of multiple adjacent video frames in a certain scene may be similar, so, in order to save operation resources, feature extraction is not required for all video frames, and the video to be matched may be sampled. For example, video clips of a video to be matched may be acquired, e.g., divided into two or more video clips by scene, and video frames of each video clip sampled. In an example, a predetermined number of video frames may be acquired in each video clip as the sampling frames.
In one possible implementation, the target object in the sampling frame may be detected to obtain the region where the target object is located, and in an example, the sampling frame may be subjected to target detection by using an image detection model, so as to determine the region where each target object in each sampling frame is located. The image detection model may be a convolutional neural network model, and the present disclosure is not limited to a specific type of image detection model.
In one possible implementation manner, the area where the target object is located in each sampling frame may be reserved, and the non-target area is covered, so as to obtain a grid image corresponding to each sampling frame. For example, non-target areas, i.e., areas that do not contain target objects, may be masked by a mask. For example, multiple masks, which are regularly shaped masks and do not overlap each other, may be utilized to cover different non-target areas.
Fig. 2 illustrates a schematic diagram of a grid image according to an embodiment of the present disclosure, as shown in fig. 2, a mask may be added to non-target areas in each sampling frame, thereby masking the non-target areas to obtain the grid image. In the grid image, the region where the target object is located can be reserved, that is, if feature extraction is performed on the grid image, feature information of the target object can be obtained instead of feature information of the whole image.
In one possible implementation, feature extraction may be performed on each grid image through the video feature coding model; since the non-target areas are covered by the masks, the feature information of the region where each target object is located in each grid image can be extracted, so as to obtain the target feature vector of each target object. The target feature vectors in each grid image can then be collected to obtain the target feature vector set. In an example, each sampling frame may include a plurality of target objects, where the target feature vector of the j-th target object v_j is l_j, and the target feature vector set is V = {l_1, l_2, ..., l_j, ..., l_m}, where 1 ≤ j ≤ m, and j and m are positive integers.
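The masking and per-object encoding described above can be sketched as follows, assuming bounding boxes of the target objects are already available from a detector; make_grid_image, SimpleRegionEncoder and the tiny CNN backbone are illustrative placeholders, not the patent's video feature coding model.

```python
import torch
import torch.nn as nn

def make_grid_image(frame: torch.Tensor, boxes: list[tuple[int, int, int, int]]) -> torch.Tensor:
    """Keep only the regions where target objects are located and mask
    (zero out) the non-target areas, producing the 'grid image'."""
    grid = torch.zeros_like(frame)
    for x1, y1, x2, y2 in boxes:            # boxes from an object detector
        grid[:, y1:y2, x1:x2] = frame[:, y1:y2, x1:x2]
    return grid

class SimpleRegionEncoder(nn.Module):
    """Stand-in video feature coding model: encodes each retained object
    region into one target feature vector l_j."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim))

    def forward(self, region: torch.Tensor) -> torch.Tensor:
        return self.backbone(region.unsqueeze(0)).squeeze(0)

# toy example: one sampled frame (3 x 224 x 224) with two detected objects
frame = torch.rand(3, 224, 224)
boxes = [(10, 20, 80, 120), (130, 40, 200, 180)]
grid = make_grid_image(frame, boxes)

encoder = SimpleRegionEncoder()
V = torch.stack([encoder(grid[:, y1:y2, x1:x2]) for x1, y1, x2, y2 in boxes])
print(V.shape)  # torch.Size([2, 256]) -- target feature vector set {l_1, ..., l_m}
```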
In one possible implementation manner, in step S13, matching is performed on the video to be matched and the text to be matched according to the target feature vector set of the video to be matched and the text feature vector set of the text to be matched by using the cross-mode matching model, so that a video corresponding to the selected text to be matched is retrieved from multiple videos to be matched, or a text corresponding to the selected video to be matched is retrieved from multiple texts to be matched. In an example, a similarity between each text feature vector in the text feature vector set and each target feature vector in the target feature vector set may be determined, and whether the video to be matched and the text to be matched are similar may be determined based on each similarity, for example, each similarity may be weighted and summed, or a maximum value of each similarity may be determined, and whether the processing result meets a similarity criterion, for example, whether the processing result is greater than a similarity threshold, or the like, so as to determine whether the video to be matched and the text to be matched are similar.
In one possible implementation, step S13 may include: determining the similarity between each target feature vector in the target feature vector set and each text feature vector in the text feature vector set through a cross-modal matching model; and inputting the similarity into a fully-connected network to obtain the relevance score.
In one possible implementation, the cross-modal matching model may align the target feature vectors and the text feature vectors by mapping, for example, by mapping the target feature vectors and the text feature vectors into a common feature space, and may compute the similarity between the corresponding feature vectors in that feature space, for example, using the Euclidean distance or the cosine similarity; the present disclosure does not limit the specific manner of computing the similarity.
In one possible implementation, the above-described process may be performed iteratively a plurality of times to determine the similarity between each target feature vector and each text feature vector, respectively. Further, the similarity may be input into a fully connected network to obtain a correlation score of the video to be matched and the text to be matched, or may be input into the fully connected network after an activation process is performed by an activation function (e.g., a sigmoid activation function) to obtain the correlation score. The score may be a score in the form of a probability, e.g., a score in the (0, 1) interval. The present disclosure is not limited to the specific form or scope of the score.
In an example, the relevance score of the video to be matched and the text to be matched may be determined by the following equation (1):
Sim(T,V)=F(f(T,V)) (1)
where Sim(T, V) is the relevance score, T is the text feature vector set, V is the target feature vector set, f is the processing function of the cross-modal matching model, and F is the processing function of the fully connected network.
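A minimal sketch of equation (1): the pairwise cosine-similarity computation stands in for f, and a small fully connected network with a sigmoid stands in for F. Pooling the similarity matrix by its max and mean is one simple choice and is an assumption, not something prescribed above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelevanceHead(nn.Module):
    """Minimal stand-in for Sim(T, V) = F(f(T, V)): f computes pairwise
    cosine similarities between text and target feature vectors, F maps a
    pooled summary of those similarities to a score in (0, 1)."""
    def __init__(self, pooled: int = 2):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(pooled, 16), nn.ReLU(),
                                nn.Linear(16, 1), nn.Sigmoid())

    def forward(self, T: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
        # T: (n, d) text feature vectors, V: (m, d) target feature vectors
        sim = F.normalize(T, dim=-1) @ F.normalize(V, dim=-1).t()   # (n, m)
        # summarize the similarity matrix: max and mean over all pairs
        summary = torch.stack([sim.max(), sim.mean()])
        return self.fc(summary).squeeze(-1)

head = RelevanceHead()
T = torch.randn(5, 256)   # e.g. 5 verb/noun feature vectors
V = torch.randn(8, 256)   # e.g. 8 target object feature vectors
print(float(head(T, V)))  # relevance score in (0, 1)
```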
In one possible implementation manner, if a text to be matched is selected, and corresponding videos are retrieved from the plurality of videos to be matched, then the relevance scores of the text to be matched and the plurality of videos to be matched can be respectively solved, the plurality of videos to be matched can be arranged according to the relevance scores, and one or more videos to be matched with the highest relevance scores in the arrangement are selected as the retrieval result. Otherwise, if the video to be matched is selected, searching the corresponding text in the plurality of texts to be matched, respectively solving the relevance scores of the video to be matched and the plurality of texts to be matched, arranging the plurality of texts to be matched according to the relevance scores, and selecting one or more texts to be matched with the highest relevance scores in the arrangement as a search result.
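Retrieval then reduces to scoring and sorting the candidates. Below is a sketch with a hypothetical scoring callable; any function that maps a text feature vector set and a target feature vector set to a score, such as the head sketched above, could be plugged in.

```python
from typing import Callable, Sequence
import torch

def rank_videos(score: Callable[[torch.Tensor, torch.Tensor], torch.Tensor],
                T_query: torch.Tensor, candidate_Vs: Sequence[torch.Tensor],
                top_k: int = 5):
    """Score every candidate video's target feature vector set against the
    query text's feature vector set and return the indices and scores of the
    top_k highest-scoring candidates."""
    scores = [float(score(T_query, V)) for V in candidate_Vs]
    order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    return order[:top_k], [scores[i] for i in order[:top_k]]

# toy usage with a stand-in scorer (mean cosine similarity over all pairs)
scorer = lambda T, V: (torch.nn.functional.normalize(T, dim=-1)
                       @ torch.nn.functional.normalize(V, dim=-1).t()).mean()
T_query = torch.randn(5, 256)
candidates = [torch.randn(8, 256) for _ in range(10)]
print(rank_videos(scorer, T_query, candidates, top_k=3))
```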
In this way, the text feature coding model can obtain the text feature vectors of the verbs and nouns of the text to be matched, and the video feature coding model can obtain the target feature vectors of a plurality of target objects in the grid image covering the non-target area, so that feature information with finer granularity can be obtained, and the target feature vectors and the text feature vectors can be aligned through the cross-mode matching model, and the similarity of the target feature vectors and the text feature vectors can be determined. Therefore, finer granularity features can be identified, and accuracy of cross-modal retrieval is improved.
In one possible implementation, the above text feature encoding model, video feature encoding model, and cross-modality matching model may be trained prior to use, thereby enhancing the recognition and extraction capabilities of the above models for fine-grained features, as well as the alignment capabilities for cross-modality feature information.
In one possible implementation, the method further includes: extracting characteristics of a text sample and a plurality of words of the text sample through a text characteristic coding model to obtain a text global characteristic vector of the text sample and a sample text characteristic vector set of the plurality of words; extracting features of a video sample and grid images of the video sample through a video feature coding model to obtain a video global feature vector of the video sample and a sample target feature vector set of a plurality of target objects in the video sample; determining target perception contrast loss according to the sample text feature vector set, the sample target feature vector set, the text global feature vector and the video global feature vector; determining feature fusion contrast loss according to the cross-modal matching model, training texts after random removal of verbs or nouns in text samples, video samples matched with the text samples, the text global feature vector and the video global feature vector; determining video text matching contrast loss according to the sample text feature vector set and the sample target feature vector set; determining the comprehensive loss of the text feature coding model, the video feature coding model and the cross-modal matching model according to the target perception contrast loss, the feature fusion contrast loss and the video text matching contrast loss; and training the text feature coding model, the video feature coding model and the cross-modal matching model according to the comprehensive loss to obtain a trained text feature coding model, a trained video feature coding model and a trained cross-modal matching model.
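At a high level, the training objective combines the three losses. The sketch below assumes a simple unweighted sum and an AdamW optimizer, since the description does not specify weighting coefficients or an optimizer; the loss values and model names in the commented training step are hypothetical.

```python
import torch

def comprehensive_loss(l_target_perception: torch.Tensor,
                       l_feature_fusion: torch.Tensor,
                       l_video_text_matching: torch.Tensor) -> torch.Tensor:
    """Combine the three contrastive losses into the comprehensive loss used to
    train the text feature coding model, the video feature coding model and the
    cross-modal matching model (an unweighted sum is assumed here)."""
    return l_target_perception + l_feature_fusion + l_video_text_matching

# hypothetical single training step over the three models
# optimizer = torch.optim.AdamW(
#     list(text_encoder.parameters()) + list(video_encoder.parameters())
#     + list(matcher.parameters()), lr=1e-4)
# loss = comprehensive_loss(l_tpc, l_ffc, l_vtm)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```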
FIG. 3 shows a schematic diagram of model training according to an embodiment of the present disclosure. As shown in fig. 3, the text sample may be subjected to feature extraction by the text feature coding model to obtain a text global feature vector, and a sample text feature vector set of the words (verbs and nouns) in the text sample may also be extracted in a similar manner to the feature extraction of the words (verbs and nouns) in the text to be matched. Feature extraction can be performed on a video sample (for example, a plurality of video frames of the video sample, such as CLS in fig. 3) through a video feature coding model to obtain a video global feature vector, and a non-target area of a sampling frame in the video sample can be covered through masking in a similar manner to the feature extraction performed on each target object in the video to be matched, so as to obtain a grid image, and feature information of the target object in the grid image is extracted, so as to obtain a sample target feature vector set.
In one possible implementation, since finer granularity of feature information is available and cross-modal fine granularity of text feature information and fine granularity of video feature information are aligned during feature extraction, if the text sample and video sample being feature extracted are matched, the video global feature vector may be aligned not only with the text global feature vector but also with each sample text feature vector in the sample text feature vector set during training. Similarly, the text global feature vector may be aligned not only with the video global feature vector, but also with each sample target feature vector of the set of sample target feature vectors.
In one possible implementation, determining the target perception contrast loss from the set of sample text feature vectors, the set of sample target feature vectors, the text global feature vector, and the video global feature vector includes: determining text alignment loss according to the sample text feature vector set, the sample target feature vector set and the video global feature vector; determining video alignment loss according to the sample text feature vector set, the sample target feature vector set and the text global feature vector; and determining the target perception contrast loss according to the video alignment loss and the text alignment loss.
In one possible implementation, the text alignment loss L_ta may be determined by equation (2), a noise contrast estimation style loss computed over the video global feature vector and the fine-grained sample feature vectors, where v is the video global feature vector, w_i is the i-th sample text feature vector, v_i is the corresponding i-th sample target feature vector, I is the number of vectors in the sample text feature vector set, τ is a temperature hyper-parameter, and sim is a similarity calculation function (for example, dot product or cosine similarity). In the training process, if the sample text and the sample video are matched, the similarity terms in the numerator and denominator of equation (2) increase: the video global feature vector is aligned with each sample text feature vector, and each sample text feature vector is aligned with its corresponding sample target feature vector, so that the argument of the logarithm approaches 1 and the text alignment loss L_ta approaches 0. The text alignment loss can therefore be reduced during training.
In one possible implementation, the video alignment loss L_va may be determined by equation (3), a noise contrast estimation style loss of the same form, where v_j is the j-th sample target feature vector, X is the text global feature vector, w_j is the corresponding sample text feature vector, and J is the number of vectors in the sample target feature vector set. In the training process, if the sample text and the sample video are matched, the similarity terms in the numerator and denominator of equation (3) increase: the text global feature vector is aligned with each sample target feature vector, and each sample target feature vector is aligned with its corresponding sample text feature vector, so that the argument of the logarithm approaches 1 and the video alignment loss L_va approaches 0. The video alignment loss can therefore be reduced during training.
In one possible implementation, after the video alignment loss L_va and the text alignment loss L_ta are obtained, the target perception contrast loss L_tpc may be determined by formula (4):
L_tpc = L_va + L_ta    (4)
that is, the video alignment loss L_va and the text alignment loss L_ta are summed, so that the target perception contrast loss L_tpc is reduced during training, thereby aligning the text global feature vector, the sample text feature vectors, the video global feature vector and the sample target feature vectors.
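Since the inputs and intended behaviour of the text alignment loss and the video alignment loss are described above but their exact formulas are not reproduced here, the following is an assumed InfoNCE-style stand-in rather than the patent's equations (2) and (3): each fine-grained feature vector is contrasted against the matched sample's global feature of the other modality, with the other samples in the batch as negatives, and the two directions are summed as in formula (4). All function names are illustrative.

```python
import torch
import torch.nn.functional as F

def token_global_infonce(tokens: torch.Tensor, globals_: torch.Tensor,
                         tau: float = 0.07) -> torch.Tensor:
    """Contrast each fine-grained feature vector (tokens: B x K x d) against the
    global feature vectors of the other modality (globals_: B x d); the matched
    sample's global vector is the positive, the rest of the batch are negatives."""
    tokens = F.normalize(tokens, dim=-1)
    globals_ = F.normalize(globals_, dim=-1)
    logits = torch.einsum("bkd,cd->bkc", tokens, globals_) / tau   # (B, K, B)
    labels = torch.arange(tokens.size(0), device=tokens.device)
    labels = labels.unsqueeze(1).expand(-1, tokens.size(1))        # (B, K)
    return F.cross_entropy(logits.reshape(-1, tokens.size(0)), labels.reshape(-1))

def target_perception_contrast_loss(W: torch.Tensor, v_global: torch.Tensor,
                                    V: torch.Tensor, t_global: torch.Tensor) -> torch.Tensor:
    """Formula (4) as the sum of an assumed text alignment term (word features
    vs. video global features) and video alignment term (object features vs.
    text global features)."""
    return token_global_infonce(W, v_global) + token_global_infonce(V, t_global)

B, d = 4, 256
W = torch.randn(B, 6, d)         # sample text feature vectors w_i (verbs/nouns)
V = torch.randn(B, 9, d)         # sample target feature vectors v_j (objects)
t_global = torch.randn(B, d)     # text global feature vectors
v_global = torch.randn(B, d)     # video global feature vectors
print(float(target_perception_contrast_loss(W, v_global, V, t_global)))
```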
In one possible implementation, since the cross-modal matching model can align cross-modal fine-grained feature information, the training process needs to improve not only the alignment capability of the cross-modal matching model but also its ability to perceive and identify fine-grained features, and this perception capability can be improved through contrastive learning. As described above, the fine-grained features of the text include the text features of verbs and nouns, and thus the perceptibility of the cross-modal matching model to verbs and nouns, and its ability to align verbs and nouns with the corresponding videos, may be enhanced.
In one possible implementation, determining a feature fusion contrast loss according to the cross-modal matching model, the training text after randomly removing verbs or nouns in the text samples, the video samples matched with the text samples, and the text global feature vector and the video global feature vector includes: determining a first noise contrast estimation loss between the text global feature vector and the video global feature vector; determining a second noise contrast estimation loss according to the cross-modal matching model, a first training text obtained by randomly removing a noun in a text sample, and a video sample matched with the text sample; determining a third noise contrast estimation loss according to the cross-modal matching model, a second training text obtained by randomly removing a verb in the text sample, and a video sample matched with the text sample; and determining the feature fusion contrast loss according to the first noise contrast estimation loss, the second noise contrast estimation loss and the third noise contrast estimation loss.
In one possible implementation, a first noise contrast estimation loss between the text global feature vector and the video global feature vector may be determined. In an example, the text global feature vector and the video global feature vector may each be mapped into the above feature space through the cross-modal matching model, yielding their vector representations in that space, i.e., the video vector representation f_v and the text vector representation f_t. The first noise contrast estimation loss can then be determined from the video vector representation f_v and the text vector representation f_t.
In one possible implementation, the noise contrast estimation loss may be determined by the following equation (5):

L_nce(x_i, y_i) = -log [ exp(sim(x_i, y_i)/τ) / Σ_{j=1}^{B} exp(sim(x_i, y_j)/τ) ]    (5)

wherein L_nce(x_i, y_i) is the noise contrast estimation loss between the two vectors x_i and y_i, and B is the batch size, i.e., the number of vectors y_j.
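As a concrete illustration, the noise contrast estimation loss of equation (5) matches the widely used InfoNCE objective. The following is a minimal PyTorch-style sketch under that reading; the function name nce_loss, the cosine-similarity choice for sim and the temperature value 0.07 are illustrative assumptions rather than details taken from this publication.

```python
import torch
import torch.nn.functional as F

def nce_loss(x, y, tau=0.07):
    """Noise contrast estimation (InfoNCE-style) loss between two batches of
    vectors: x[i] and y[i] form a matched pair, and every y[j] with j != i in
    the same batch acts as a negative (B = batch size)."""
    x = F.normalize(x, dim=-1)                      # (B, d)
    y = F.normalize(y, dim=-1)                      # (B, d)
    logits = x @ y.t() / tau                        # (B, B): sim(x_i, y_j) / tau
    labels = torch.arange(x.size(0), device=x.device)
    # cross-entropy over each row is -log softmax at the matched column i
    return F.cross_entropy(logits, labels)
```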
In an example, the text vector representation f_t of the text sample may be substituted for x_i, the video vector representation f_v of the matched video sample for y_i, and the video vector representations of the other video samples in the video sample set for y_j; the first noise contrast estimation loss can then be determined using equation (5). Of course, the video vector representation f_v of the video sample may instead be substituted for x_i, the text vector representation f_t of the matched text sample for y_i, and the text vector representations of the other text samples in the text sample set for y_j, and the first noise contrast estimation loss determined using equation (5).
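Under the same assumptions, the first noise contrast estimation loss can be sketched by applying nce_loss in both retrieval directions; averaging the two directions is an assumption, since the description above allows either direction to be used.

```python
def first_nce_loss(f_t, f_v, tau=0.07):
    # f_t: (B, d) text vector representations, f_v: (B, d) video vector representations
    # text-to-video and video-to-text terms, averaged (assumed symmetric form)
    return 0.5 * (nce_loss(f_t, f_v, tau) + nce_loss(f_v, f_t, tau))
```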
In one possible implementation, the perceptibility and alignment capabilities of the cross-modality matching model to fine-grained features may be trained. The fine-grained feature may include a text feature of a verb and a text feature of a noun.
In one possible implementation, for text features of a noun therein, a second noise contrast estimate penalty may be determined by the following steps, thereby enhancing perceptibility of the text features of the noun during training. Determining a second noise contrast estimation loss according to the cross-modal matching model, a first training text after random removal of nouns in the text sample, and a video sample matched with the text sample, including: extracting features of the first training text through the text feature coding model to obtain noun problem word features; extracting features of the video samples matched with the text samples through the video feature coding model to obtain query reference features; acquiring noun answer characteristics for the noun question word characteristics and the query reference characteristics through the cross-modal matching model; inputting the removed nouns into the text feature coding model to obtain noun features; and determining the second noise contrast estimation loss according to the noun characteristics and the noun answer characteristics.
In one possible implementation, the cross-modal matching model can learn to identify the missing noun through contrastive learning, so that the cross-modal matching model improves its perception of noun features and its ability to align them with video features. The first training text is a text obtained by randomly removing a noun from a text sample. For example, for the text sample "tourists enjoy beauty in a beautiful park" in fig. 3, word segmentation yields words such as "tourists", "appreciation", "beauty" and "park"; one of the three nouns, for example "park", can be randomly removed, giving the first training text "tourists enjoy beauty in a beautiful [?]". The first training text can be input into the text feature coding model as a noun question, and the noun question word features can be obtained.
In one possible implementation, the video sample matched with the text sample may be input into a video feature encoding model to obtain query reference features as reference information, i.e., the cross-modality matching model may be made to reference the features of the video to determine the nouns missing in the first training text.
In one possible implementation, the cross-modality matching model may determine the characteristics of the missing nouns, i.e., obtain noun answer characteristics, based on the noun question word characteristics and the query reference characteristics. The noun answer features are output features of the cross-modal matching model, and errors may exist.
In one possible implementation, the removed noun, for example "park", may be input into the text feature coding model to obtain the noun feature, i.e., the error-free noun answer. Further, the noun feature may be compared with the noun answer feature; for example, the noun feature and the noun answer feature may be mapped into the feature space by the cross-modal matching model to obtain the vector representation f_noun_a of the noun answer feature and the vector representation f_noun of the noun feature, and the second noise contrast estimation loss may be determined based on f_noun_a and f_noun. During training, the second noise contrast estimation loss may be reduced so that, if the noun answer feature is correct, the similarity between f_noun_a and f_noun is maximized, and if the noun answer feature is wrong, the similarity between f_noun_a and f_noun is minimized.
In an example, the second noise contrast estimation loss may be determined by equation (5): f_noun may replace x_i in equation (5), the correct f_noun_a may replace y_i in equation (5), the wrong f_noun_a may replace y_j in equation (5), and B is the number of output noun answer features. Through this substitution, the second noise contrast estimation loss can be obtained, so that in the process of reducing the second noise contrast estimation loss, the similarity between the vector representation of the correct noun answer feature and the vector representation of the noun feature is maximized, while the similarity between the vector representation of the wrong noun answer feature and the vector representation of the noun feature is minimized. For example, the similarity between the vector representation of the noun answer feature "park" output by the model and the vector representation of the noun feature of the correct answer is maximized, while the similarity between the vector representation of a noun answer feature such as "lawn" or "beach" and the vector representation of the noun feature of the correct answer is minimized.
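A minimal sketch of how the second noise contrast estimation loss could be computed with the components described above, reusing nce_loss from the earlier sketch. The names text_encoder, video_encoder and cross_modal_matcher stand in for the text feature coding model, the video feature coding model and the cross-modal matching model, and the batch layout (one removed noun per sample) is an assumption.

```python
def masked_word_nce(question_texts, removed_words, matched_videos,
                    text_encoder, video_encoder, cross_modal_matcher, tau=0.07):
    """Contrast the answers predicted for masked words against the features of
    the words that were actually removed (used here for nouns)."""
    q_feat = text_encoder(question_texts)                  # noun question word features
    ref_feat = video_encoder(matched_videos)               # query reference features
    answer_feat = cross_modal_matcher(q_feat, ref_feat)    # f_noun_a, shape (B, d)
    word_feat = text_encoder(removed_words)                # f_noun, shape (B, d)
    # the correct answer is the positive; answers of other samples act as negatives
    return nce_loss(word_feat, answer_feat, tau)
```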
In one possible implementation, on the other hand, for the text feature of the verb in the text sample, a third noise contrast estimate penalty may be determined by the following steps, thereby enhancing perceptibility of the text feature of the verb during training. Determining a third noise contrast estimation loss according to the cross-modal matching model, a second training text after random removal of verbs in the text sample, and a video sample matched with the text sample, including: extracting features of the second training text through the text feature coding model to obtain verb problem word features; extracting features of the video samples matched with the text samples through the video feature coding model to obtain query reference features; obtaining verb answer features for the verb question word features and the query reference features through the cross-modal matching model; inputting the removed verbs into the text feature coding model to obtain verb features; and determining the third noise contrast estimation loss according to the verb feature and the verb answer feature.
In one possible implementation, the cross-modal matching model can learn to identify the missing verb through contrastive learning, so that the cross-modal matching model improves its perception of verb features and its ability to align them with video features. The second training text is a text obtained by randomly removing a verb from a text sample. For example, for the text sample "tourists enjoy beauty in a beautiful park" in fig. 3, word segmentation yields the words "tourists", "appreciation", "beauty" and "park", among which the verb is "appreciation"; this verb can be removed, giving the second training text "tourists [?] beauty in a beautiful park". The second training text can be input into the text feature coding model as a verb question, and the verb question word features can be obtained.
In one possible implementation, the video sample matched with the text sample may be input into a video feature encoding model to obtain query reference features as reference information, i.e., the cross-modality matching model may be made to determine the verb missing in the second training text with the features of the video as reference.
In one possible implementation, the cross-modality matching model may determine the features of the missing verbs, i.e., obtain verb answer features, based on the verb question word features and query reference features. The verb answer features are output features of the cross-modality matching model, and errors may exist.
In one possible implementation, the removed verb, for example "appreciation", may be input into the text feature coding model to obtain the verb feature, i.e., the error-free verb answer. Further, the verb feature may be compared with the verb answer feature; for example, the verb feature and the verb answer feature may be mapped into the feature space by the cross-modal matching model to obtain the vector representation f_verb_a of the verb answer feature and the vector representation f_verb of the verb feature, and the third noise contrast estimation loss may be determined based on f_verb_a and f_verb. During training, the third noise contrast estimation loss may be reduced so that, if the verb answer feature is correct, the similarity between f_verb_a and f_verb is maximized, and if the verb answer feature is wrong, the similarity between f_verb_a and f_verb is minimized.
In an example, the third noise contrast estimation loss may be determined by equation (5): f_verb may replace x_i in equation (5), the correct f_verb_a may replace y_i in equation (5), the wrong f_verb_a may replace y_j in equation (5), and B is the number of output verb answer features. Through this substitution, the third noise contrast estimation loss can be obtained, so that in the process of reducing the third noise contrast estimation loss, the similarity between the vector representation of the correct verb answer feature and the vector representation of the verb feature is maximized, while the similarity between the vector representation of the wrong verb answer feature and the vector representation of the verb feature is minimized. For example, the similarity between the vector representation of the verb answer feature "appreciation" output by the model and the vector representation of the verb feature of the correct answer is maximized, while the similarity between the vector representation of a verb answer feature such as "saturation" or "play" and the vector representation of the verb feature of the correct answer is minimized.
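The third noise contrast estimation loss can be sketched with the same helper applied to the verb-masked training text; the variable names below are assumed inputs, not identifiers from the publication.

```python
# second_texts: texts with the verb replaced by a mask token
# removed_verbs: the verbs that were removed; matched_videos: the paired video samples
loss_nce3 = masked_word_nce(second_texts, removed_verbs, matched_videos,
                            text_encoder, video_encoder, cross_modal_matcher)
```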
In one possible implementation, after the above first, second and third noise contrast estimation losses are obtained, the feature fusion contrast loss may be determined by equation (6) from the first noise contrast estimation loss, the second noise contrast estimation loss and the third noise contrast estimation loss.
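Since equation (6) only states that the feature fusion contrast loss is determined from the three noise contrast estimation losses, the sketch below assumes an unweighted sum; a weighted combination would be an equally valid reading.

```python
def feature_fusion_contrast_loss(loss_nce1, loss_nce2, loss_nce3):
    # assumed unweighted combination for equation (6)
    return loss_nce1 + loss_nce2 + loss_nce3
```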
In one possible implementation manner, in order to improve accuracy of cross-modal retrieval, similarity of feature information of matched text samples and video samples can be improved in a training process, and similarity of feature information of unmatched text samples and video samples can be reduced. Determining a video text matching contrast loss according to the sample text feature vector set and the sample target feature vector set, comprising: determining a first contrast loss according to a sample text feature vector set of any text sample in a text sample set, a sample target feature vector set of a video sample matched with the text sample in a video sample set, and a sample target feature vector set of a video sample not matched with the text sample in the video sample set; determining a second contrast loss according to a sample target feature vector set of any video sample in the video sample set, a sample text feature vector set of a text sample in the text sample set that is matched with the video sample, and a sample text feature vector set of a text sample in the text sample set that is not matched with the video sample; and determining the video text matching contrast loss according to the first contrast loss and the second contrast loss.
In one possible implementation, in retrieving video over text, the above fully-connected network may be utilized to calculate a relevance score between a sample text feature vector set of text samples and a sample target feature vector set of matched video samples, and calculate a relevance score between a sample text feature vector set and a sample target feature vector set of non-matched video samples. Further, the above process may be performed for each text sample in the set of text samples, a plurality of relevance scores may be obtained, and based on these relevance scores, a first loss of contrast may be obtained.
In one possible implementation, in retrieving text over video, the above fully-connected network may be utilized to calculate a relevance score between a sample target feature vector set of video samples and a sample text feature vector set of matched text samples, and to calculate a relevance score between a sample target feature vector set and a sample text feature vector set of unmatched text samples. Further, the above processing may be performed for each video sample in the set of video samples, a plurality of correlation scores may be obtained, and based on these correlation scores, a second contrast loss may be obtained.
In one possible implementation, the video text matching contrast loss may be determined from the first contrast loss and the second contrast loss; in an example, it may be determined according to equation (7), wherein Sim(T, V) is the correlation score between the matched sample text feature vector set and sample target feature vector set, the other two score terms in equation (7) are the correlation score between a sample text feature vector set and a non-matching sample target feature vector set and the correlation score between a sample target feature vector set and a non-matching sample text feature vector set, and [·]+ denotes the maximum of the value in brackets and 0. During training, the video text matching contrast loss can be reduced so that the correlation scores between non-matching sample target feature vector sets and sample text feature vector sets are minimized and the correlation scores between matched sample text feature vector sets and sample target feature vector sets are maximized.
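A sketch of a bidirectional hinge-style matching loss consistent with the description of equation (7); the margin value, the use of all in-batch negatives and the mean reduction are assumptions, and sim_matrix stands for the relevance scores produced by the fully-connected network.

```python
import torch

def video_text_matching_loss(sim_matrix, margin=0.2):
    """sim_matrix[i, j]: relevance score between text sample i and video sample j,
    with matched pairs on the diagonal; [.]_+ is implemented with clamp(min=0)."""
    B = sim_matrix.size(0)
    pos = sim_matrix.diag().unsqueeze(1)                          # Sim(T, V) of matched pairs
    mask = torch.eye(B, dtype=torch.bool, device=sim_matrix.device)
    # text-to-video direction: unmatched videos act as negatives
    cost_v = (margin - pos + sim_matrix).clamp(min=0).masked_fill(mask, 0.0)
    # video-to-text direction: unmatched texts act as negatives
    cost_t = (margin - pos.t() + sim_matrix).clamp(min=0).masked_fill(mask, 0.0)
    return cost_v.mean() + cost_t.mean()
```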
In one possible implementation, after the target perception contrast loss, the feature fusion contrast loss and the video text matching contrast loss are obtained, the comprehensive loss can be determined according to equation (8).
In one possible implementation, the text feature coding model, the video feature coding model and the cross-modal matching model may be trained using the comprehensive loss, i.e., the comprehensive loss is back-propagated and the parameters of the models are adjusted so that the comprehensive loss is minimized. The training process can be executed iteratively, and training stops once a training condition is reached; the training condition can include convergence of the comprehensive loss, the test result of the model on the test set meeting a requirement, and the like, and is not limited by the present disclosure.
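A sketch of one training iteration using the comprehensive loss; the unweighted sum assumed for equation (8) and the optimizer handling are illustrative, and the three loss tensors are taken as already computed by the pieces sketched above.

```python
def comprehensive_loss(l_target, l_fusion, l_matching):
    # assumed unweighted form of equation (8); per-term weights could also be used
    return l_target + l_fusion + l_matching

def train_step(l_target, l_fusion, l_matching, optimizer):
    loss = comprehensive_loss(l_target, l_fusion, l_matching)
    optimizer.zero_grad()
    loss.backward()              # back-propagate the comprehensive loss
    optimizer.step()             # adjust model parameters to reduce the loss
    return loss.item()
```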
According to the text video cross-modal retrieval method based on fine granularity perception, the text feature coding model can obtain text feature vectors of verbs and nouns of texts to be matched, the video feature coding model can obtain target feature vectors of a plurality of target objects in a grid image covering a non-target area, therefore, feature information with finer granularity is obtained, the target feature vectors and the text feature vectors can be aligned through the cross-modal matching model, and the similarity of the two can be determined. Therefore, finer granularity features can be identified, and accuracy of cross-modal retrieval is improved. Further, in the training process, the sample text feature vector set, the sample target feature vector set, the text global feature vector and the video global feature vector can be aligned, and the perceptibility and the alignment capability of the cross-mode matching model to fine granularity features are improved, so that the model performance and the retrieval accuracy are improved.
Fig. 4 illustrates an application diagram of a fine granularity awareness based text video cross-modality retrieval method according to an embodiment of the present disclosure.
In an example, a tokenizer model may be used for word segmentation, CLIP TextFormer may be used as the text feature coding model, Fast RCNN as the image detection model, and CLIP VideoFormer as the video feature coding model; a BridgeNet model may be used as the cross-modal matching model.
In an example, training of the above models can be performed on the WebVid-2.5M and Google Conceptual Captions (CC3M) datasets, where WebVid-2.5M contains 2.5M video-text pairs and CC3M contains 3.3M image-text pairs. Validation and testing can be performed on the MSR-VTT, MSVD, LSMDC, DiDeMo and HowTo100M datasets. MSR-VTT contains 10,000 videos and 200,000 text descriptions, of which 9000 videos form the validation set and 1000 videos form the test set. MSVD contains 1970 videos and 80,000 text descriptions, with 1300 videos as the validation set and 670 videos as the test set. LSMDC is derived from 118081 video clips of 202 movies, with 7408 videos as the validation set and 1000 videos as the test set. DiDeMo contains 10,000 videos and 40,000 text descriptions, with 6000 videos as the validation set and 4000 videos as the test set; all description texts of one video can be spliced together as a single text description of that video. HowTo100M contains 1.22M videos and 136M text descriptions, with 730,000 videos as the validation set and 490,000 videos as the test set.
In an example, after the training described above, a text feature vector set {T_1, T_2, …, T_n} of the headline text may be obtained using the text feature coding model, and target feature vector sets {V_1, V_2, …, V_n} of the N videos may be obtained respectively using the video feature coding model.
In an example, a set of text feature vectors of the headline text may be aligned with a set of target feature vectors of video 1 by a cross-modality matching model and a relevance score for both determined; this process may be iteratively performed by aligning the text feature vector set of the headline text with the target feature vector set of video 2 across the modality matching model and determining a relevance score … … for both until a relevance score for the headline text with the respective video is obtained. Further, the videos may be ranked according to the relevance score, so that a search result for performing video search based on the title text may be obtained.
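A sketch of the retrieval flow of fig. 4, scoring the headline text against every candidate video and ranking by relevance; all names below (retrieve_videos, text_encoder, video_encoder, cross_modal_matcher) are placeholders for the trained models rather than identifiers from the publication.

```python
def retrieve_videos(headline_text, videos, text_encoder, video_encoder,
                    cross_modal_matcher, top_k=10):
    text_feats = text_encoder(headline_text)           # text feature vector set {T_1 ... T_n}
    scores = []
    for video in videos:
        video_feats = video_encoder(video)             # target feature vector set of this video
        scores.append(cross_modal_matcher(text_feats, video_feats))  # relevance score
    ranked = sorted(range(len(videos)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]                              # indices of the highest-scoring videos
```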
Fig. 5 illustrates a block diagram of a text video cross-modal retrieval device based on fine granularity perception according to an embodiment of the present disclosure; as shown in fig. 5, the device includes:
the text encoding module 11 is configured to perform feature extraction on a text to be matched through a text feature encoding model, and obtain a text feature vector set of a plurality of words of the text to be matched, where the plurality of words include verbs and nouns, and the feature vector set includes text feature vectors corresponding to the verbs and text feature vectors corresponding to the nouns;
The video coding module 12 is configured to perform feature extraction on a video to be matched through a video feature coding model, and obtain a target feature vector set of a plurality of target objects in the video to be matched, where the target feature vector set includes target feature vectors corresponding to the plurality of target objects respectively;
and the matching module 13 is configured to determine a relevance score between the target feature vector set and the text feature vector set through a cross-modal matching model, where the relevance score is used to retrieve a video corresponding to the text to be matched from a plurality of videos to be matched or retrieve a text corresponding to the video to be matched from a plurality of texts to be matched, and the cross-modal matching model is obtained by training a training text after a verb or noun in a text sample is randomly removed.
In one possible implementation, the text encoding module is further configured to:
word segmentation is carried out on the text to be matched to obtain a plurality of words of the text to be matched;
extracting features of the words through the text feature coding model to obtain text feature vectors corresponding to the words;
And obtaining the text feature vector set according to a plurality of text feature vectors.
In one possible implementation, the video encoding module is further configured to:
sampling the video to be matched to obtain a plurality of sampling frames;
detecting target objects in the plurality of sampling frames to obtain areas where the target objects in the sampling frames are located;
reserving the area of the target object in each sampling frame, covering the non-target area, and obtaining a grid image corresponding to each sampling frame;
extracting the characteristics of the region where the target object is located in each grid image through a video characteristic coding model to obtain a target characteristic vector of each target object;
and obtaining the target feature vector set according to a plurality of target feature vectors.
In one possible implementation, the matching module is further configured to:
determining the similarity between each target feature vector in the target feature vector set and each text feature vector in the text feature vector set through a cross-modal matching model;
and inputting the similarity into a fully-connected network to obtain the relevance score.
In a possible implementation manner, the apparatus further includes a training module, configured to:
Extracting characteristics of a text sample and a plurality of words of the text sample through a text characteristic coding model to obtain a text global characteristic vector of the text sample and a sample text characteristic vector set of the plurality of words;
extracting features of a video sample and grid images of the video sample through a video feature coding model to obtain a video global feature vector of the video sample and a sample target feature vector set of a plurality of target objects in the video sample;
determining target perception contrast loss according to the sample text feature vector set, the sample target feature vector set, the text global feature vector and the video global feature vector;
determining feature fusion contrast loss according to the cross-modal matching model, training texts after random removal of verbs or nouns in text samples, video samples matched with the text samples, the text global feature vector and the video global feature vector;
determining video text matching contrast loss according to the sample text feature vector set and the sample target feature vector set;
determining the comprehensive loss of the text feature coding model, the video feature coding model and the cross-modal matching model according to the target perception contrast loss, the feature fusion contrast loss and the video text matching contrast loss;
And training the text feature coding model, the video feature coding model and the cross-modal matching model according to the comprehensive loss to obtain a trained text feature coding model, a trained video feature coding model and a trained cross-modal matching model.
In one possible implementation, the training module is further configured to:
determining text alignment loss according to the sample text feature vector set, the sample target feature vector set and the video global feature vector;
determining video alignment loss according to the sample text feature vector set, the sample target feature vector set and the text global feature vector;
and determining the target perception contrast loss according to the video alignment loss and the text alignment loss.
In one possible implementation, the training module is further configured to:
determining a first noise contrast estimate penalty between the text global feature vector and the video global feature vector;
determining a second noise contrast estimation loss according to the cross-modal matching model, a first training text obtained by randomly removing nouns in a text sample and a video sample matched with the text sample;
Determining a third noise contrast estimation loss according to the cross-modal matching model, a second training text after the verbs in the text samples are randomly removed and a video sample matched with the text samples;
and determining the characteristic fusion contrast loss according to the first noise contrast estimated loss, the second noise contrast estimated loss and the third noise contrast estimated loss.
In one possible implementation, the training module is further configured to:
extracting features of the first training text through the text feature coding model to obtain noun problem word features;
extracting features of the video samples matched with the text samples through the video feature coding model to obtain query reference features;
acquiring noun answer characteristics for the noun question word characteristics and the query reference characteristics through the cross-modal matching model;
inputting the removed nouns into the text feature coding model to obtain noun features;
and determining the second noise contrast estimation loss according to the noun characteristics and the noun answer characteristics.
In one possible implementation, the training module is further configured to:
Extracting features of the second training text through the text feature coding model to obtain verb problem word features;
extracting features of the video samples matched with the text samples through the video feature coding model to obtain query reference features;
obtaining verb answer features for the verb question word features and the query reference features through the cross-modal matching model;
inputting the removed verbs into the text feature coding model to obtain verb features;
and determining the third noise contrast estimation loss according to the verb feature and the verb answer feature.
In one possible implementation, the training module is further configured to:
determining a first contrast loss according to a sample text feature vector set of any text sample in a text sample set, a sample target feature vector set of a video sample matched with the text sample in a video sample set, and a sample target feature vector set of a video sample not matched with the text sample in the video sample set;
determining a second contrast loss according to a sample target feature vector set of any video sample in the video sample set, a sample text feature vector set of a text sample in the text sample set that is matched with the video sample, and a sample text feature vector set of a text sample in the text sample set that is not matched with the video sample;
And determining the video text matching contrast loss according to the first contrast loss and the second contrast loss.
In some embodiments, functions or modules included in an apparatus provided by the embodiments of the present disclosure may be used to perform a method described in the foregoing method embodiments, and specific implementations thereof may refer to descriptions of the foregoing method embodiments, which are not repeated herein for brevity.
The disclosed embodiments also provide a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method. The computer readable storage medium may be a non-volatile computer readable storage medium.
The embodiment of the disclosure also provides a text video cross-mode retrieval device based on fine granularity perception, which comprises: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke the instructions stored in the memory to perform the above method.
Embodiments of the present disclosure also provide a computer program product comprising computer readable code which, when run on a device, causes a processor in the device to execute instructions for implementing the text video cross-modal retrieval method based on fine granularity perception as provided in any of the above embodiments.
The present disclosure also provides another computer program product for storing computer readable instructions that, when executed, cause a computer to perform the operations of the text video cross-modal retrieval method based on fine granularity perception provided in any of the above embodiments.
The fine grain-aware based text video cross-modality retrieval device may be provided as a terminal, server or other modality device.
Fig. 6 illustrates a block diagram of a fine granularity perception based text video cross-modality retrieval device 800 in accordance with an embodiment of the present disclosure. For example, device 800 may be a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, exercise device, personal digital assistant, or like terminal device.
Referring to fig. 6, device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operation of the device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interactions between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the device 800. Examples of such data include instructions for any application or method operating on device 800, contact data, phonebook data, messages, pictures, videos, and the like. The memory 804 may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The power supply component 806 provides power to the various components of the device 800. The power components 806 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the device 800.
The multimedia component 808 includes a screen between the device 800 and the user that provides an output interface. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only an edge of a touch or slide action, but also a duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. The front camera and/or the rear camera may receive external multimedia data when the device 800 is in an operational mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capabilities.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may be further stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 further includes a speaker for outputting audio signals.
Input/output interface 812 provides an interface between processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: homepage button, volume button, start button, and lock button.
The sensor assembly 814 includes one or more sensors for providing status assessment of various aspects of the device 800. For example, the sensor assembly 814 may detect an on/off state of the device 800, a relative positioning of the components, such as a display and keypad of the device 800, the sensor assembly 814 may also detect a change in position of the device 800 or a component of the device 800, the presence or absence of user contact with the device 800, an orientation or acceleration/deceleration of the device 800, and a change in temperature of the device 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscopic sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate communication between the device 800 and other devices, either wired or wireless. The device 800 may access a wireless network based on a communication standard, such as WiFi,2G or 3G, or a combination thereof. In one exemplary embodiment, the communication component 816 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, ultra Wideband (UWB) technology, bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), digital Signal Processors (DSPs), digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for executing the methods described above.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as memory 804 including computer program instructions executable by processor 820 of device 800 to perform the above-described method.
Fig. 7 illustrates a block diagram of an electronic device 1900 according to an embodiment of the disclosure. For example, electronic device 1900 may be provided as a terminal or server. Referring to FIG. 7, electronic device 1900 includes a processing unit 1922 that further includes one or more processors and memory resources represented by a storage unit 1932 for storing instructions, such as application programs, that can be executed by processing unit 1922. The application programs stored in storage unit 1932 may include one or more modules each corresponding to a set of instructions. Further, the processing unit 1922 is configured to execute instructions to perform the methods described above.
The electronic device 1900 may also include a power unit 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output interface 1958. The electronic device 1900 may operate based on an operating system stored in the storage unit 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as a storage unit 1932, including computer program instructions executable by the processing unit 1922 of the electronic device 1900 to perform the methods described above.
The present disclosure may be a system, method, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for causing a processor to implement aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: portable computer disks, hard disks, random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static Random Access Memory (SRAM), portable compact disk read-only memory (CD-ROM), digital Versatile Disks (DVD), memory sticks, floppy disks, mechanical coding devices, punch cards or in-groove structures such as punch cards or grooves having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media, as used herein, are not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., optical pulses through fiber optic cables), or electrical signals transmitted through wires.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.
Computer program instructions for performing the operations of the present disclosure can be assembly instructions, instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, c++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present disclosure are implemented by personalizing electronic circuitry, such as programmable logic circuitry, field Programmable Gate Arrays (FPGAs), or Programmable Logic Arrays (PLAs), with state information of computer readable program instructions, which can execute the computer readable program instructions.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The computer program product may be realized in particular by means of hardware, software or a combination thereof. In an alternative embodiment, the computer program product is embodied as a computer storage medium, and in another alternative embodiment, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK), or the like.
It will be appreciated that the above embodiments mentioned in the present disclosure may be combined with each other to form a combined embodiment without departing from the principle logic, and are limited in space, and the disclosure is not repeated. It will be appreciated by those skilled in the art that in the above-described methods of the embodiments, the particular order of execution of the steps should be determined by their function and possible inherent logic.
Note that all features disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise. Thus, unless expressly stated otherwise, each feature disclosed is one example only of a generic set of equivalent or similar features. Where the terms "further", "preferably", "still further" or "more preferably" are used, the description that follows is given on the basis of the foregoing embodiment, and the content after such a term combined with the foregoing embodiment constitutes a complete further embodiment. A further embodiment may also be formed by several such "further", "preferably", "still further" or "more preferably" passages following the same embodiment, which may be combined arbitrarily.
It will be appreciated by persons skilled in the art that the embodiments of the invention described above and shown in the drawings are by way of example only and are not limiting. The objects of the present invention have been fully and effectively achieved. The functional and structural principles of the present invention have been shown and described in the examples and embodiments of the invention may be modified or practiced without departing from the principles described.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present disclosure, and not for limiting the same; although the present disclosure has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions from the scope of the technical solutions of the embodiments of the present disclosure.

Claims (13)

1. A text video cross-modal retrieval method based on fine granularity perception is characterized by comprising the following steps:
extracting features of a text to be matched through a text feature coding model to obtain a text feature vector set of a plurality of words of the text to be matched, wherein the words comprise verbs and nouns, and the feature vector set comprises text feature vectors corresponding to the verbs and text feature vectors corresponding to the nouns;
Extracting features of a video to be matched through a video feature coding model to obtain a target feature vector set of a plurality of target objects in the video to be matched, wherein the target feature vector set comprises target feature vectors respectively corresponding to the plurality of target objects;
and determining a relevance score between the target feature vector set and the text feature vector set through a cross-modal matching model, wherein the relevance score is used for searching videos corresponding to the texts to be matched in a plurality of videos to be matched or searching texts corresponding to the videos to be matched in a plurality of texts to be matched, and the cross-modal matching model is obtained through training the training texts after verbs or nouns in text samples are randomly removed.
2. The fine granularity perception-based text video cross-modal retrieval method according to claim 1, wherein the feature extraction is performed on the text to be matched through a text feature coding model to obtain a text feature vector set of a plurality of words of the text to be matched, and the method comprises the following steps:
word segmentation is carried out on the text to be matched to obtain a plurality of words of the text to be matched;
Extracting features of the words through the text feature coding model to obtain text feature vectors corresponding to the words;
and obtaining the text feature vector set according to a plurality of text feature vectors.
3. The fine granularity perception-based text video cross-modal retrieval method according to claim 1, wherein the feature extraction is performed on a video to be matched through a video feature coding model to obtain a target feature vector set of a plurality of target objects in the video to be matched, and the method comprises the following steps:
sampling the video to be matched to obtain a plurality of sampling frames;
detecting target objects in the plurality of sampling frames to obtain areas where the target objects in the sampling frames are located;
reserving the area of the target object in each sampling frame, covering the non-target area, and obtaining a grid image corresponding to each sampling frame;
extracting the characteristics of the region where the target object is located in each grid image through a video characteristic coding model to obtain a target characteristic vector of each target object;
and obtaining the target feature vector set according to a plurality of target feature vectors.
4. The fine granularity awareness based text-video cross-modal retrieval method of claim 1, wherein determining a relevance score between the set of target feature vectors and the set of text feature vectors by a cross-modal matching model comprises:
Determining the similarity between each target feature vector in the target feature vector set and each text feature vector in the text feature vector set through a cross-modal matching model;
and inputting the similarity into a fully-connected network to obtain the relevance score.
5. The fine grain perception based text video cross-modality retrieval method as recited in any one of claims 1 to 4, wherein the method further includes:
extracting characteristics of a text sample and a plurality of words of the text sample through a text characteristic coding model to obtain a text global characteristic vector of the text sample and a sample text characteristic vector set of the plurality of words;
extracting features of a video sample and grid images of the video sample through a video feature coding model to obtain a video global feature vector of the video sample and a sample target feature vector set of a plurality of target objects in the video sample;
determining target perception contrast loss according to the sample text feature vector set, the sample target feature vector set, the text global feature vector and the video global feature vector;
determining feature fusion contrast loss according to the cross-modal matching model, training texts after random removal of verbs or nouns in text samples, video samples matched with the text samples, the text global feature vector and the video global feature vector;
Determining video text matching contrast loss according to the sample text feature vector set and the sample target feature vector set;
determining the comprehensive loss of the text feature coding model, the video feature coding model and the cross-modal matching model according to the target perception contrast loss, the feature fusion contrast loss and the video text matching contrast loss;
and training the text feature coding model, the video feature coding model and the cross-modal matching model according to the comprehensive loss to obtain a trained text feature coding model, a trained video feature coding model and a trained cross-modal matching model.
6. The fine granularity perception based text video cross-modal retrieval method of claim 5, wherein determining a target perception contrast penalty from the set of sample text feature vectors, the set of sample target feature vectors, the text global feature vector, and the video global feature vector comprises:
determining text alignment loss according to the sample text feature vector set, the sample target feature vector set and the video global feature vector;
Determining video alignment loss according to the sample text feature vector set, the sample target feature vector set and the text global feature vector;
and determining the target perception contrast loss according to the video alignment loss and the text alignment loss.
7. The fine granularity perception based text video cross-modal retrieval method according to claim 5, wherein determining feature fusion contrast loss according to the cross-modal matching model, training text after random removal of verbs or nouns in text samples, video samples matched with the text samples, and the text global feature vector and the video global feature vector comprises:
determining a first noise contrast estimate penalty between the text global feature vector and the video global feature vector;
determining a second noise contrast estimation loss according to the cross-modal matching model, a first training text obtained by randomly removing nouns in a text sample and a video sample matched with the text sample;
determining a third noise contrast estimation loss according to the cross-modal matching model, a second training text after the verbs in the text samples are randomly removed and a video sample matched with the text samples;
And determining the characteristic fusion contrast loss according to the first noise contrast estimated loss, the second noise contrast estimated loss and the third noise contrast estimated loss.
8. The fine granularity perception based text video cross-modal retrieval method according to claim 7, wherein determining a second noise contrast estimation penalty from the cross-modal matching model, the first training text after random removal of nouns in text samples, and video samples matched to the text samples, comprises:
extracting features of the first training text through the text feature coding model to obtain noun problem word features;
extracting features of the video samples matched with the text samples through the video feature coding model to obtain query reference features;
acquiring noun answer characteristics for the noun question word characteristics and the query reference characteristics through the cross-modal matching model;
inputting the removed nouns into the text feature coding model to obtain noun features;
and determining the second noise contrast estimation loss according to the noun characteristics and the noun answer characteristics.
9. The fine granularity perception based text video cross-modal retrieval method according to claim 7, wherein determining the third noise contrast estimation loss according to the cross-modal matching model, the second training text obtained after randomly removing the verbs in the text samples, and the video samples matched with the text samples, comprises:
extracting features of the second training text through the text feature coding model to obtain verb question word features;
extracting features of the video samples matched with the text samples through the video feature coding model to obtain query reference features;
obtaining verb answer features from the verb question word features and the query reference features through the cross-modal matching model;
inputting the removed verbs into the text feature coding model to obtain verb features;
and determining the third noise contrast estimation loss according to the verb features and the verb answer features.
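Since claim 9 mirrors claim 8 with verbs in place of nouns, the same sketch applies; a possible invocation for the verb branch, with every variable name hypothetical, would be:

```python
# Hypothetical call of the masked_word_nce sketch above for the verb branch.
third_nce_loss = masked_word_nce(text_encoder, video_encoder, matcher,
                                 verb_masked_tokens, video_frames,
                                 removed_verb_tokens)
```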
10. The fine granularity perception based text video cross-modal retrieval method of claim 5, wherein determining the video text matching contrast loss according to the sample text feature vector set and the sample target feature vector set comprises:
determining a first contrast loss according to a sample text feature vector set of any text sample in a text sample set, a sample target feature vector set of a video sample in a video sample set that is matched with the text sample, and a sample target feature vector set of a video sample in the video sample set that is not matched with the text sample;
determining a second contrast loss according to a sample target feature vector set of any video sample in the video sample set, a sample text feature vector set of a text sample in the text sample set that is matched with the video sample, and a sample text feature vector set of a text sample in the text sample set that is not matched with the video sample;
and determining the video text matching contrast loss according to the first contrast loss and the second contrast loss.
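Claim 10 describes a bidirectional contrastive loss over matched and unmatched text/video pairs in the batch. The sketch below assumes a particular set-level relevance score (mean of per-word maximum cosine similarity) and a softmax over batch candidates; neither choice is fixed by the claim.

```python
import torch
import torch.nn.functional as F

def set_similarity(text_vec_set, target_vec_set):
    """Assumed relevance between one text's word vectors (N_w, D) and one
    video's target vectors (N_t, D)."""
    t = F.normalize(text_vec_set, dim=-1)
    v = F.normalize(target_vec_set, dim=-1)
    return (t @ v.t()).max(dim=1).values.mean()

def video_text_matching_contrast_loss(text_sets, video_sets, temperature=0.07):
    """Matched pairs share the same batch index; all other pairings in the
    batch act as the unmatched samples."""
    n = len(text_sets)
    scores = torch.stack([
        torch.stack([set_similarity(text_sets[i], video_sets[j]) for j in range(n)])
        for i in range(n)
    ])                                                   # (n, n) score matrix
    labels = torch.arange(n, device=scores.device)
    first_loss = F.cross_entropy(scores / temperature, labels)       # text -> video
    second_loss = F.cross_entropy(scores.t() / temperature, labels)  # video -> text
    return first_loss + second_loss
```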
11. A text video cross-modal retrieval device based on fine granularity perception, comprising:
the text coding module is used for extracting features of a text to be matched through a text feature coding model to obtain a text feature vector set of a plurality of words of the text to be matched, wherein the words comprise verbs and nouns, and the text feature vector set comprises text feature vectors corresponding to the verbs and text feature vectors corresponding to the nouns;
the video coding module is used for extracting features of a video to be matched through a video feature coding model to obtain a target feature vector set of a plurality of target objects in the video to be matched, wherein the target feature vector set comprises target feature vectors respectively corresponding to the plurality of target objects;
the matching module is used for determining a correlation score between the target feature vector set and the text feature vector set through a cross-modal matching model, wherein the correlation score is used for retrieving, from a plurality of videos to be matched, the video corresponding to the text to be matched, or retrieving, from a plurality of texts to be matched, the text corresponding to the video to be matched, and the cross-modal matching model is obtained by training with training texts from which the verbs or nouns in the text samples have been randomly removed.
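To show how the three modules of claim 11 could be chained at retrieval time, here is a minimal sketch; the module call signatures, including `matcher.correlation_score`, are assumptions introduced only for illustration.

```python
def retrieve_videos(text_to_match, candidate_videos,
                    text_encoder, video_encoder, matcher, top_k=5):
    """Rank candidate videos for one query text by correlation score."""
    text_feats = text_encoder(text_to_match)          # text feature vector set
    scores = []
    for video in candidate_videos:
        target_feats = video_encoder(video)           # target feature vector set
        scores.append(float(matcher.correlation_score(text_feats, target_feats)))
    # Indices of the top_k highest-scoring candidate videos.
    ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    return ranked[:top_k]
```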
12. A fine granularity perception-based text video cross-modal retrieval device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to invoke the instructions stored in the memory to perform the method of any of claims 1 to 10.
13. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon computer program instructions which, when executed by a processor, implement the method of any of claims 1 to 10.
CN202310200445.3A 2023-03-02 2023-03-02 Text video cross-modal retrieval method and device based on fine granularity perception Active CN116166843B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310200445.3A CN116166843B (en) 2023-03-02 2023-03-02 Text video cross-modal retrieval method and device based on fine granularity perception

Publications (2)

Publication Number Publication Date
CN116166843A true CN116166843A (en) 2023-05-26
CN116166843B CN116166843B (en) 2023-11-07

Family

ID=86420011

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310200445.3A Active CN116166843B (en) 2023-03-02 2023-03-02 Text video cross-modal retrieval method and device based on fine granularity perception

Country Status (1)

Country Link
CN (1) CN116166843B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111191075A (en) * 2019-12-31 2020-05-22 华南师范大学 Cross-modal retrieval method, system and storage medium based on dual coding and association
WO2022147692A1 (en) * 2021-01-06 2022-07-14 京东方科技集团股份有限公司 Voice command recognition method, electronic device and non-transitory computer-readable storage medium
CN114090823A (en) * 2021-09-09 2022-02-25 秒针信息技术有限公司 Video retrieval method, video retrieval device, electronic equipment and computer-readable storage medium
CN113806482A (en) * 2021-09-17 2021-12-17 中国电信集团系统集成有限责任公司 Cross-modal retrieval method and device for video text, storage medium and equipment
US20230005284A1 (en) * 2021-09-18 2023-01-05 Beijing Baidu Netcom Science Technology Co., Ltd. Method for training image-text matching model, computing device, and storage medium
CN113918767A (en) * 2021-09-29 2022-01-11 北京三快在线科技有限公司 Video clip positioning method, device, equipment and storage medium
CN114048350A (en) * 2021-11-08 2022-02-15 湖南大学 Text-video retrieval method based on fine-grained cross-modal alignment model
CN114048351A (en) * 2021-11-08 2022-02-15 湖南大学 Cross-modal text-video retrieval method based on space-time relationship enhancement
CN114329034A (en) * 2021-12-31 2022-04-12 武汉大学 Image text matching discrimination method and system based on fine-grained semantic feature difference
CN114996511A (en) * 2022-04-22 2022-09-02 北京爱奇艺科技有限公司 Training method and device for cross-modal video retrieval model
CN115168638A (en) * 2022-06-22 2022-10-11 网易(杭州)网络有限公司 Training method, device, equipment and storage medium of cross-modal retrieval model

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117493608A (en) * 2023-12-26 2024-02-02 西安邮电大学 Text video retrieval method, system and computer storage medium
CN117493608B (en) * 2023-12-26 2024-04-12 西安邮电大学 Text video retrieval method, system and computer storage medium
CN117851640A (en) * 2024-03-04 2024-04-09 广东智媒云图科技股份有限公司 Video data processing method, device, equipment and medium based on composite characteristics
CN117851640B (en) * 2024-03-04 2024-05-31 广东智媒云图科技股份有限公司 Video data processing method, device, equipment and medium based on composite characteristics

Also Published As

Publication number Publication date
CN116166843B (en) 2023-11-07

Similar Documents

Publication Publication Date Title
US11120078B2 (en) Method and device for video processing, electronic device, and storage medium
CN109145213B (en) Historical information based query recommendation method and device
CN113792207B (en) Cross-modal retrieval method based on multi-level feature representation alignment
CN110781305A (en) Text classification method and device based on classification model and model training method
CN111524521A (en) Voiceprint extraction model training method, voiceprint recognition method, voiceprint extraction model training device, voiceprint recognition device and voiceprint recognition medium
CN111931844B (en) Image processing method and device, electronic equipment and storage medium
CN109471919B (en) Zero pronoun resolution method and device
CN109558599B (en) Conversion method and device and electronic equipment
CN107564526B (en) Processing method, apparatus and machine-readable medium
CN110019675B (en) Keyword extraction method and device
JP7116088B2 (en) Speech information processing method, device, program and recording medium
CN111368541A (en) Named entity identification method and device
JP2022533065A (en) Character recognition methods and devices, electronic devices and storage media
CN110781813A (en) Image recognition method and device, electronic equipment and storage medium
CN116166843B (en) Text video cross-modal retrieval method and device based on fine granularity perception
CN114880480A (en) Question-answering method and device based on knowledge graph
CN113987128A (en) Related article searching method and device, electronic equipment and storage medium
CN111739535A (en) Voice recognition method and device and electronic equipment
CN113033163A (en) Data processing method and device and electronic equipment
CN111538998B (en) Text encryption method and device, electronic equipment and computer readable storage medium
CN111984765B (en) Knowledge base question-answering process relation detection method and device
CN111324214B (en) Statement error correction method and device
CN113936697A (en) Voice processing method and device for voice processing
CN111831132A (en) Information recommendation method and device and electronic equipment
CN116127062A (en) Training method of pre-training language model, text emotion classification method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant