CN116881520A - Content retrieval model training method based on partial order, content retrieval method and device - Google Patents

Content retrieval model training method based on partial order, content retrieval method and device

Info

Publication number
CN116881520A
CN116881520A
Authority
CN
China
Prior art keywords
content
feature
segment
partial order
characterization
Prior art date
Legal status
Pending
Application number
CN202310896764.2A
Other languages
Chinese (zh)
Inventor
刘洪�
蒋晨
俞旭铮
徐家
Current Assignee
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd
Priority to CN202310896764.2A
Publication of CN116881520A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/9035Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/90335Query processing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames


Abstract

Embodiments of this specification provide a partial-order-based content retrieval model training method, a content retrieval method, and corresponding devices. During model training, global feature representations and local feature representations of first content and second content are extracted, where the local feature representations include content segment feature representations of the content segments obtained by segmenting the content. Semantic partial order representations of the first content and the second content are generated from their local feature representations through cross-content feature interaction. The content retrieval model is then trained based on partial order contrast learning, using the global feature representations and the semantic partial order representations of the first content and the second content.

Description

Content retrieval model training method based on partial order, content retrieval method and device
Technical Field
Embodiments of the present disclosure relate generally to the field of artificial intelligence, and more particularly, to a method and apparatus for training a content retrieval model based on partial order.
Background
With the development of artificial intelligence technology, content retrieval models are increasingly applied to content retrieval over the internet. To ensure the accuracy of the retrieved content, the model training accuracy of the content retrieval model needs to be improved.
Disclosure of Invention
Embodiments of this specification provide a partial-order-based content retrieval model training method, a content retrieval method, and corresponding devices. With this training method, a partial order relation over content matching relevance is introduced during training, so that the different contributions of different content segments to semantic matching are taken into account, which can improve the model training accuracy of the content retrieval model.
According to an aspect of the embodiments of the present specification, there is provided a method for training a content retrieval model based on partial order, including: extracting global feature characterization and local feature characterization of first content and second content, wherein the local feature characterization comprises content segment feature characterization of content segments obtained by segmenting the content, the first content is retrieval reference content, and the second content is retrieval candidate content; generating semantic partial order representations of the first content and the second content from local feature representations of the first content and the second content by cross-content feature interactions; and performing model training based on partial order contrast learning on a content retrieval model by using global feature characterization and semantic partial order characterization of the first content and the second content.
Optionally, in one example of the above aspect, generating semantic partial order representations of the first content and the second content from local feature representations of the first content and the second content by cross-content feature interactions may include: determining segment weights for respective content segment feature characterizations of the first content and the second content by cross-content feature interactions; and generating semantic partial order representations of the first content and the second content based on the local feature representations of the first content and the second content according to the segment weights of the content segment feature representations.
Optionally, in one example of the above aspect, generating the semantic partial order representations of the first content and the second content based on the local feature representations of the first content and the second content according to segment weights of content segment feature representations may include: performing content segment transformation on the local feature representations of the first content and the second content according to the segment weights of the content segment feature representations, to generate the semantic partial order representations of the first content and the second content.
Optionally, in one example of the above aspect, performing content segment transformation on the local feature representations of the first content and the second content according to segment weights of content segment feature representations to generate the semantic partial order representations of the first content and the second content includes: performing content segment masking on the local feature representations of the first content and the second content according to the segment weights of the content segment feature representations, to generate the semantic partial order representations of the first content and the second content.
Optionally, in one example of the above aspect, performing content segment masking on the local feature representations of the first content and the second content according to segment weights of content segment feature representations to generate the semantic partial order representations may include: masking, according to the segment weights of the content segment feature representations, content segments of the local feature representations of the first content and the second content based on cumulative weight proportion, to generate the semantic partial order representations of the first content and the second content.
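The cumulative-weight masking step described above can be sketched as follows. This is a minimal illustration under assumptions: the masking direction (dropping the top-weighted segments whose cumulative weight proportion reaches a threshold, so that the remaining representation carries only the weakly matching segments) is one plausible reading of the scheme, and `mask_by_cumulative_weight` and its `ratio` parameter are hypothetical names.

```python
import numpy as np

def mask_by_cumulative_weight(local_feats, weights, ratio=0.5):
    """Hypothetical sketch: zero out (mask) the highest-weight content segments
    whose cumulative weight proportion reaches `ratio`, yielding a degraded
    representation usable as a semantic partial order representation."""
    order = np.argsort(weights)[::-1]                  # segments, strongest first
    cum = np.cumsum(weights[order]) / weights.sum()    # cumulative weight proportion
    to_mask = order[: int(np.searchsorted(cum, ratio) + 1)]
    masked = local_feats.copy()
    masked[to_mask] = 0.0                              # content segment masking
    return masked, to_mask

feats = np.ones((4, 2))                     # 4 segments, feature dim 2
weights = np.array([0.1, 0.5, 0.3, 0.1])    # segment weights from interaction
masked, dropped = mask_by_cumulative_weight(feats, weights, ratio=0.5)
```

Here segment 1 alone accounts for 50% of the total weight, so it is the only segment masked; the remaining segments carry the "partially relevant" signal contrasted against during training.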
Optionally, in one example of the above aspect, determining the segment weights of the respective content segment feature representations of the first content and the second content through cross-content feature interaction may include: aligning the feature lengths of the global feature representations of the first content and the second content with the local feature representation of the counterpart content, to obtain feature-length-aligned global feature representations; stitching the feature-length-aligned global feature representations of the first content and the second content with the local feature representation of the counterpart content, to obtain feature-stitched local feature representations; and determining the segment weights of the content segment feature representations of the first content and the second content from the feature-stitched local feature representations of the first content and the second content.
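The three steps above (feature-length alignment, feature stitching, weight determination) can be sketched as below. The tiling-based alignment, the linear scoring layer, and the softmax normalization are illustrative assumptions; the function and parameter names are hypothetical.

```python
import numpy as np

def segment_weights(counterpart_global, local_feats, w, b):
    """Hypothetical sketch: segment weights via cross-content feature interaction.

    counterpart_global: (d,) global feature representation of the *other* content
    local_feats:        (n, d) content segment feature representations
    w, b:               parameters of an assumed linear scoring layer
    """
    n = local_feats.shape[0]
    # 1) Feature-length alignment: repeat the counterpart's global feature
    #    so its length matches the number of local segments.
    aligned = np.tile(counterpart_global, (n, 1))              # (n, d)
    # 2) Feature stitching: concatenate along the feature dimension.
    stitched = np.concatenate([local_feats, aligned], axis=1)  # (n, 2d)
    # 3) Score each stitched segment feature, then softmax-normalize so the
    #    weights form a distribution over the segments.
    scores = stitched @ w + b                                  # (n,)
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

rng = np.random.default_rng(0)
d, n = 4, 3
weights = segment_weights(rng.normal(size=d), rng.normal(size=(n, d)),
                          rng.normal(size=2 * d), 0.0)
```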
Optionally, in one example of the above aspect, the loss function used for the partial order contrast learning based model training includes a content matching loss term and a partial order triplet contrast loss term.
Optionally, in one example of the above aspect, the partial order triplet contrast loss term includes a partial order triplet contrast loss term based on the global feature representation of the first content, the global feature representation of the second content, and a semantic partial order representation.
Optionally, in one example of the above aspect, the partial order triplet contrast loss term includes: a partial order triplet contrast loss term based on the global feature representation of the first content, the global feature representation of the second content, and a semantic partial order representation; a partial order triplet contrast loss term based on the weighted-fused global feature representation of the first content, the global feature representation of the second content, and the semantic partial order representation of the second content; and a partial order triplet contrast loss term based on a weighted-fused global feature representation and the semantic partial order representation of the first content.
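A single partial order triplet contrast loss term of the kind listed above might be realized as a margin-based hinge, requiring a full-content match to score higher than the corresponding semantic partial order (masked) representation. This is a sketch under assumptions (cosine similarity, a fixed margin); the actual loss composition is defined by the embodiment.

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def partial_order_triplet_loss(anchor, full, partial, margin=0.2):
    """Hypothetical sketch of one partial order triplet contrast loss term:
    the anchor's similarity to the full-content representation should exceed
    its similarity to the semantic partial order representation by `margin`."""
    return max(0.0, margin - (cos(anchor, full) - cos(anchor, partial)))

text_g  = np.array([1.0, 0.0, 1.0])   # global feature of the first content
video_g = np.array([1.0, 0.1, 1.0])   # well-aligned global feature of the second content
video_p = np.array([0.1, 1.0, 0.0])   # semantic partial order representation (masked)
loss = partial_order_triplet_loss(text_g, video_g, video_p)
```

When the partial order representation is already much farther from the anchor than the full representation, the hinge is inactive and the term contributes zero loss.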
Optionally, in one example of the above aspect, the global feature representations of the first content and the second content include an original global feature representation and a global feature representation obtained through segment-weight weighted fusion, and the semantic partial order representations of the first content and the second content include semantic partial order representations obtained through segment-weight weighted fusion.
Optionally, in one example of the above aspect, the global feature representation of the second content includes a global feature representation obtained by temporally aggregating the local feature representations of the second content.
Optionally, in one example of the above aspect, the first content and the second content include one of: text content, picture content, audio content, and video content.
According to another aspect of the embodiments of the present specification, there is provided a content retrieval method including: extracting global feature characterizations of a first content and a second content via a feature extraction layer of a content retrieval model, the first content being a retrieval reference content and the second content being a retrieval candidate content, the content retrieval model being trained as described above; and determining content similarity of the first content and the second content according to global feature characterization of the first content and the second content through a content similarity matching layer of the content retrieval model to conduct content retrieval.
Optionally, in one example of the above aspect, the content retrieval model further includes a partial order learning layer. Extracting the global feature representations of the first content and the second content via the feature extraction layer of the content retrieval model may include: extracting global feature representations and local feature representations of the first content and the second content via the feature extraction layer, where the local feature representations include content segment feature representations of content segments obtained by segmenting the content. The content retrieval method may further include: determining, via the partial order learning layer, segment weights of the content segment feature representations of the first content and the second content through cross-content feature interaction, and performing weighted fusion on the local feature representations of the first content and the second content using the determined segment weights, to obtain weighted-fused global feature representations of the first content and the second content. Determining, via the content similarity matching layer of the content retrieval model, the content similarity of the first content and the second content from their global feature representations may include: determining the content similarity of the first content and the second content from the global feature representations and the weighted-fused global feature representations of the first content and the second content, via the content similarity matching layer, to perform content retrieval.
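At retrieval time, the flow described above — weighted fusion of local representations into a global representation, then similarity matching — can be sketched as follows. The equal-weight combination of the two similarity scores and all function names are illustrative assumptions, not the embodiment's prescribed formula.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_global, candidates):
    """Hypothetical sketch: rank candidate contents against a query.

    Each candidate is (original global feature, per-segment local features,
    segment weights). The weighted-fused global representation is the
    segment-weight-weighted sum of the local representations."""
    scored = []
    for cand_global, local_feats, seg_weights in candidates:
        fused = seg_weights @ local_feats              # weighted fusion -> (d,)
        # Illustrative equal-weight average of the two similarities.
        score = 0.5 * cosine(query_global, cand_global) + \
                0.5 * cosine(query_global, fused)
        scored.append(score)
    return int(np.argmax(scored)), scored

q = np.array([1.0, 0.0])   # global feature of the retrieval reference content
cands = [
    (np.array([0.0, 1.0]), np.array([[0.0, 1.0], [0.1, 0.9]]), np.array([0.5, 0.5])),
    (np.array([1.0, 0.1]), np.array([[1.0, 0.0], [0.9, 0.2]]), np.array([0.7, 0.3])),
]
best, scores = retrieve(q, cands)
```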
Optionally, in one example of the above aspect, the global feature representation of the second content includes a global feature representation obtained by temporally aggregating the local feature representations of the second content.
According to another aspect of embodiments of the present specification, there is provided a partial order based content retrieval model training device, including: a feature extraction unit that extracts global feature representations and local feature representations of first content and second content, the local feature representations including content segment feature representations of content segments obtained by segmenting the content, the first content being retrieval reference content and the second content being retrieval candidate content; a semantic partial order learning unit that generates semantic partial order representations of the first content and the second content from the local feature representations of the first content and the second content through cross-content feature interaction; and a model training unit that performs model training based on partial order contrast learning on a content retrieval model using the global feature representations and semantic partial order representations of the first content and the second content.
Optionally, in one example of the above aspect, the semantic partial order learning unit includes: a segment weight determination module that determines segment weights of the respective content segment feature representations of the first content and the second content through cross-content feature interaction; and a semantic partial order representation generation module that generates the semantic partial order representations of the first content and the second content based on the local feature representations of the first content and the second content according to the segment weights of the content segment feature representations.
Optionally, in one example of the above aspect, the semantic partial order representation generation module performs content segment transformation on local feature representations of the first content and the second content according to segment weights of content segment feature representations, and generates semantic partial order representations of the first content and the second content.
Optionally, in one example of the above aspect, the segment weight determination module includes: a feature alignment sub-module that aligns the feature lengths of the global feature representations of the first content and the second content with the local feature representation of the counterpart content, to obtain feature-length-aligned global feature representations; a feature stitching sub-module that stitches the feature-length-aligned global feature representations of the first content and the second content with the local feature representation of the counterpart content, to obtain feature-stitched local feature representations; and a segment weight determination sub-module that determines the segment weight of each content segment feature representation of the first content and the second content from the feature-stitched local feature representations of the first content and the second content.
According to another aspect of the embodiments of the present specification, there is provided a content retrieval device including: a feature extraction unit that extracts global feature characterizations of a first content and a second content via a feature extraction layer of a content retrieval model, the first content being a retrieval reference content and the second content being a retrieval candidate content, the content retrieval model being trained as described above; and a content retrieval unit that performs content retrieval by determining content similarity of the first content and the second content from global feature characterization of the first content and the second content via a content similarity matching layer of the content retrieval model.
Optionally, in one example of the above aspect, the content retrieval model further includes a partial order learning layer. The feature extraction unit extracts global feature characterization and local feature characterization of the first content and the second content through a feature extraction layer of a content retrieval model, wherein the local feature characterization comprises content segment feature characterization of a content segment obtained by segmenting the content. The content retrieval device may further include: and the weighted fusion unit is used for determining the segment weights of the content segment feature representations of the first content and the second content through cross-content feature interaction through the partial order learning layer, and carrying out weighted fusion on the local feature representations of the first content and the second content by using the determined segment weights to obtain the weighted fused global feature representations of the first content and the second content. The content retrieval unit determines the content similarity of the first content and the second content to retrieve the content according to the global feature representation of the first content and the second content and the weighted and fused global feature representation through a content similarity matching layer of the content retrieval model.
According to another aspect of embodiments of the present specification, there is provided a partial order based content retrieval model training device, including: at least one processor; a memory coupled to the at least one processor; and a computer program stored in the memory, the at least one processor executing the computer program to implement the partial order based content retrieval model training method as described above.
According to another aspect of the embodiments of the present specification, there is provided a content retrieval device including: at least one processor; a memory coupled to the at least one processor; and a computer program stored in the memory, the at least one processor executing the computer program to implement the content retrieval method as described above.
Drawings
A further understanding of the nature and advantages of the present description may be realized by reference to the following drawings. In the drawings, similar components or features may have the same reference numerals.
Fig. 1 shows an exemplary architecture diagram of a content retrieval system according to an embodiment of the present description.
Fig. 2 shows an example schematic diagram of a model structure of a content retrieval model according to an embodiment of the present specification.
FIG. 3 illustrates an example flow chart of a partial order based content retrieval model training method according to an embodiment of this specification.
FIG. 4 illustrates an example flow diagram of a semantic partial order characterization generation process according to embodiments of the present description.
Fig. 5 shows an example flowchart of a fragment weight determination process according to an embodiment of the present specification.
Fig. 6 shows an example schematic diagram of content segment masking based on cumulative weight proportion according to an embodiment of the present description.
Fig. 7 shows an example schematic diagram of a model structure of a text-video retrieval model according to an embodiment of the present specification.
FIG. 8 shows an example schematic diagram of a model training process for a text-to-video retrieval model according to an embodiment of the present description.
Fig. 9 shows an example schematic diagram of a semantic partial order feature generation process according to an embodiment of the present description.
Fig. 10 shows an example flowchart of a content retrieval method according to an embodiment of the present specification.
Fig. 11 shows an example schematic diagram of a content retrieval process according to an embodiment of the present specification.
Fig. 12 shows an example block diagram of a content retrieval model training device according to an embodiment of the present specification.
Fig. 13 shows an example block diagram of a partial order learning unit according to an embodiment of the present specification.
Fig. 14 shows an example block diagram of a fragment weight determination module according to an embodiment of the present specification.
Fig. 15 shows an example block diagram of a content retrieval device according to an embodiment of the present specification.
FIG. 16 illustrates an example schematic diagram of a computer system-implemented content retrieval model training device in accordance with an embodiment of the present description.
Fig. 17 shows an example schematic diagram of a computer system implemented content retrieval device according to an embodiment of the present description.
Detailed Description
The subject matter described herein will now be discussed with reference to example embodiments. It should be appreciated that these embodiments are discussed only to enable a person skilled in the art to better understand and thereby practice the subject matter described herein, and are not limiting of the scope, applicability, or examples set forth in the claims. Changes may be made in the function and arrangement of elements discussed without departing from the scope of the disclosure as set forth in the specification. Various examples may omit, replace, or add various procedures or components as desired. For example, the described methods may be performed in a different order than described, and various steps may be added, omitted, or combined. In addition, features described with respect to some examples may be combined in other examples as well.
As used herein, the term "comprising" and variations thereof mean open-ended terms, meaning "including, but not limited to". The term "based on" means "based at least in part on". The terms "one embodiment" and "an embodiment" mean "at least one embodiment". The term "another embodiment" means "at least one other embodiment". The terms "first," "second," and the like may refer to different or the same object. Other definitions, whether explicit or implicit, may be included below. Unless the context clearly indicates otherwise, the definition of a term is consistent throughout this specification.
With the popularity of internet applications, the amount and variety of content data on the internet has grown explosively. To retrieve the content a user is interested in from massive content data effectively, content retrieval schemes based on content retrieval models have been proposed. Benefiting from the development of large-scale model pre-training techniques, existing content retrieval model training schemes are mainly built on pre-trained backbone networks, such as ViT, BLIP, and CLIP networks; they introduce hard-sample-mining contrastive learning strategies to learn content feature representations of reference content and candidate content, and establish multi-granularity/multi-level semantic alignment mechanisms for content features. However, during the semantic alignment phase, these schemes typically assign the same weight to every segment in the reference/candidate content segment sequence, thereby ignoring the different contributions of different segments to the semantic matching task, which easily introduces noise and loses important cues. From a modeling perspective, existing content retrieval model training schemes do not take into account the partial order relation between content matching relevances, and thus lack fine-grained modeling capability.
Based on the above analysis, a partial-order-based content retrieval model training scheme is provided. With this scheme, a partial order relation over content matching relevance is introduced during training, so that the different contributions of different content segments to semantic matching are taken into account, which can improve the model training accuracy of the content retrieval model.
A partial order-based content retrieval model training method and apparatus and a content retrieval model-based content retrieval method and apparatus according to embodiments of the present specification will be described below with reference to the accompanying drawings.
Fig. 1 shows an example architectural schematic diagram of a content retrieval system 100 according to an embodiment of the present description.
As shown in fig. 1, the content retrieval system 100 includes a content retrieval model training means 110, a content retrieval model storage means 120, and a content retrieval means 130. The content retrieval model training means 110, the content retrieval model storage means 120, and the content retrieval means 130 may communicate with each other via a network 140. In some embodiments, network 140 may be any one or more of a wired network or a wireless network. Examples of network 140 may include, but are not limited to, a cable network, a fiber optic network, a telecommunications network, an enterprise intranet, the internet, a Local Area Network (LAN), a Wide Area Network (WAN), a Wireless Local Area Network (WLAN), a Metropolitan Area Network (MAN), a Public Switched Telephone Network (PSTN), a bluetooth network, a ZigBee network, a Near Field Communication (NFC) network, an in-device bus, an in-device line, and the like, or any combination thereof. In some embodiments, some or all of the content retrieval model training device 110, the content retrieval model storage device 120, and the content retrieval device 130 may communicate directly without the network 140.
The content retrieval model training means 110 may train the content retrieval model based on historical training sample data. The model training process of the content retrieval model training means 110 will be described in detail below with reference to the accompanying drawings. The content retrieval device 130 may perform content retrieval based on the trained content retrieval model. In some embodiments, the content retrieval model trained by the content retrieval model training means 110 may be stored in the content retrieval model storage means 120. In this case, when performing content retrieval, the content retrieval device 130 may acquire the content retrieval model from the content retrieval model storage device 120, or may communicate with the content retrieval model storage device 120 to provide the information required by the content retrieval model stored there, perform content retrieval, and receive the content retrieval result. In some embodiments, the content retrieval model training device 110 may deploy the trained content retrieval model into the content retrieval device 130. In case the content retrieval model training means 110 has local storage capability, the trained content retrieval model may also be stored locally, so that the content retrieval model storage means 120 is not needed.
In some embodiments, the content retrieval model may be implemented as a neural network model having various model structures. Examples of neural network models may include, but are not limited to, convolutional neural networks (Convolutional Neural Network, CNN), feedforward neural networks (Feedforward Neural Network, FNN), recurrent neural networks (Recurrent Neural Network, RNN), Transformer networks, and generative adversarial networks (Generative Adversarial Network, GAN). Examples of recurrent neural networks may include, for example, Long Short-Term Memory (LSTM) networks, attention networks (Attention Networks), and the like.
Fig. 2 shows an example block diagram of a content retrieval model 200 according to an embodiment of the present specification. As shown in fig. 2, the content retrieval model 200 may include a content input layer 210, a feature extraction layer 220, and a content similarity matching layer 230.
The content input layer 210 may be used to receive content data or content feature data, such as the content data or content feature data of the reference content and the candidate content. The content feature data may be original feature data of the content data, such as attribute data. Here, the reference content serves as the reference for content retrieval, and the candidate content is the content to be retrieved. For example, if it is desired to retrieve a matching second content given a first content, the first content is the reference content and the second content is the candidate content. In this specification, the term "content" may include content of various modalities, such as text content, picture content, audio content, video content, and the like. The first content and the second content may have the same modality or different modalities. When they have different modalities, the content retrieval may also be referred to as cross-modal content retrieval.
The feature extraction layer 220 may also be referred to as an intermediate layer or hidden layer, and is configured to process the input content data or content feature data to obtain feature vector characterizations (representations) of the content data (or content feature data). The feature extraction layer 220 may include a plurality of intermediate layers. In the case where the feature extraction layer 220 includes multiple intermediate layers, each intermediate layer corresponds to a feature vector characterization, which may be referred to as the intermediate feature of that intermediate layer. In some embodiments, the feature extraction layer may be a deep neural network, such as a CNN, an RNN, or the like. The feature extraction layer may process (e.g., convolve, pool, etc.) the content data or the content feature data to obtain a more abstract feature vector characterization. The resulting feature vector characterization typically has a specified dimension.
The content similarity matching layer 230 may calculate the content similarity of the feature vector characterizations of the reference content and the candidate content and determine whether the candidate content matches the reference content based on the calculated content similarity, thereby performing content retrieval. In some embodiments, the content similarity matching layer 230 may be implemented using a multi-layer perceptron, a fully connected layer, or the like, which is not limited in this embodiment.
FIG. 3 illustrates an example flow diagram of a partial order based content retrieval model training method 300 according to an embodiment of this specification.
As shown in fig. 3, at 310, global feature characterizations and local feature characterizations of the first content and the second content are extracted. Here, a characterization may also be referred to as an embedding (Embedding). The local feature characterization may include content segment feature characterizations of the content segments resulting from content segmentation of the content. Here, the first content is the retrieval reference content, and the second content is the retrieval candidate content. Each of the first content and the second content may include a plurality of pieces of content, and one piece of the first content and one piece of the second content may constitute one model training sample. Each model training sample may include content features of the first content and the second content and tag data indicating whether the second content is the target retrieval content of the first content.
In some embodiments, the first content and the second content may be provided to the feature extraction layer of the content retrieval model to extract the global feature characterizations and local feature characterizations of the first content and the second content. For example, the feature extraction layer may be implemented using a pre-trained network model structure, such as using CLIP as a backbone network. In the case where the first content and the second content are content of the same modality, the feature extraction layer may be implemented using the same network model structure. In the case where the first content and the second content are content of different modalities, the feature extraction layer may be implemented using different network model structures.
For example, in the case where the first content is text content and the second content is video content, i.e., the content retrieval model is a text-video retrieval model, the feature extraction layer may contain a text encoder (denoted g) and a video encoder (denoted h). Both the text encoder and the video encoder are implemented using a pre-trained CLIP network.
The text encoder may be, for example, a Transformer encoder, and may include multi-head self-attention (MHSA) and a feed-forward network (FFN). For example, the text encoder may be a Transformer encoder with 12 layers and 8 attention heads, and the query, key, and value features for both the reference content and the candidate content are 512-dimensional. The video encoder may be, for example, a 12-layer Vision Transformer (ViT) having the same structure as the Transformer used in natural language processing.
Given a sentence-video pair (t_i, v_i) composed of a text sentence t_i and a video v_i, the text sentence t_i is provided to the text encoder to obtain a sentence-granularity text characterization (global feature characterization) t_cls ∈ R^{1×D} and word-level text features (local feature characterization) t_tokens = [t_1, t_2, t_3, ..., t_M] ∈ R^{M×D}, where M is the length of the sentence (i.e., the number of words contained) and D is the vector dimension of the feature characterization.
The video v_i is sampled (e.g., at a rate of 1 frame per second) to obtain a frame sequence. The frame sequence may be provided to the video encoder to obtain frame-level video features (global feature characterization) f_cls = [f_0, f_1, f_2, ..., f_N] ∈ R^{N×D}, where N is the number of frames and D is the vector dimension of the feature characterization. At the same time, each video frame i is sliced into k video blocks (patches) and then encoded into fine-grained video block features p_tokens = [p_cls, p_{i,0}, p_{i,1}, ..., p_{i,k}] ∈ R^{k×D}, where k is the number of video blocks into which the video frame is segmented. The video block features p_tokens of the individual video frames are combined to obtain the video block characterization (local feature characterization) v_tokens of the video v_i.
Due to the redundancy of continuously varying video frames, the fine-grained video block features p_tokens contain many nearly identical video blocks (tokens). In some embodiments, a video block selection process (video token selection) may be performed on the video block features p_tokens of each video frame to select the top K most informative video blocks per frame, thereby aggregating the essential video blocks and reducing redundant neighboring blocks, so as to obtain the video block characterization (local feature characterization) v_tokens of the video v_i.
In some embodiments, the video token selection module may be implemented as a 2-layer MLP followed by a softmax layer for predicting the importance score of each video block, on the basis of which the video block selection is made. The operation of the video token selection module may be expressed as:
v_tokens = TokenSelection(p_tokens) ∈ R^{N×K×D}    (1)
where N is the number of frames, K is the number of most informative video blocks selected per frame, and D is the vector dimension of the feature characterization.
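As an illustration, the top-K selection step of such a token selection module can be sketched in plain Python. The 2-layer MLP that produces the importance scores is stubbed out here as a given score list, so the function name and toy shapes are assumptions, not the patent's implementation:

```python
def select_top_k_blocks(frame_blocks, importance_scores, k):
    """Keep the k most informative video blocks of one frame.

    frame_blocks: list of D-dimensional feature vectors (one per video block)
    importance_scores: per-block scores, e.g. the softmax output of a 2-layer MLP
    Returns the selected blocks in their original order.
    """
    assert len(frame_blocks) == len(importance_scores)
    # Rank block indices by descending importance and keep the first k.
    top_idx = sorted(range(len(importance_scores)),
                     key=lambda i: importance_scores[i],
                     reverse=True)[:k]
    # Preserve the original block order for the downstream encoder.
    return [frame_blocks[i] for i in sorted(top_idx)]

# Toy frame with 4 blocks of dimension D = 2.
blocks = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]
scores = [0.1, 0.4, 0.2, 0.3]
selected = select_top_k_blocks(blocks, scores, k=2)  # blocks 1 and 3
```

In a full model this would run per frame, stacking the results into the N×K×D tensor of equation (1).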
In some embodiments, a temporal encoder may be utilized to temporally aggregate the frame-level video features f_cls and/or the video block characterization v_tokens of each video frame according to timing information, thereby obtaining the temporally aggregated global video feature characterization v_h. For example, the frame-level video features f_cls and/or the video block characterization v_tokens may be provided to the temporal encoder for temporal encoding. The temporal encoder may be implemented using a 3-layer Transformer with a temporal position embedding P. After the frame-level video features f_cls and/or the video block characterization v_tokens are provided to the temporal encoder for temporal aggregation, the aggregated global video feature characterization v_h is obtained:
v_h = TransEnc(f_cls + P) ∈ R^{N×D}    (2), or
v_h = TransEnc(v_tokens + P) ∈ R^{N×D}    (3)
where N is the number of frames and D is the vector dimension of the feature characterization.
In some embodiments, the video token selection module may be inserted before the temporal encoder to select the top K most informative video blocks per frame before temporal encoding.
After the global feature characterizations and local feature characterizations of the first content and the second content are obtained as above, semantic partial order characterizations of the first content and the second content are generated from the local feature characterizations of the first content and the second content by cross-content feature interactions at 320. In this specification, a semantic partial order characterization of content is, for example, a partial order feature characterization obtained by performing a content segment transformation operation, such as masking, on some of the content segment feature characterizations.
In this specification, the term "cross-content feature interactions" refers to the use of feature information of one content as supervisory information to guide the generation of semantic partial order tokens of another content. For example, the feature information of one content may be used as supervision information to determine segment weights for individual content segment feature characterizations in another content, and generate corresponding semantic partial order characterizations based on the segment weights.
FIG. 4 illustrates an example flow diagram of a semantic partial order token generation process 400 according to embodiments of the present description.
As shown in fig. 4, at 410, segment weights for respective content segment characteristics of the first content and the second content are determined by cross-content-characteristic interactions.
Fig. 5 shows an example flowchart of a fragment weight determination process 500 according to an embodiment of the present description.
As shown in fig. 5, at 510, the global feature representations of the first content and the second content are aligned with the local feature representations of the counterpart content, respectively, to obtain the global feature representations after the feature length alignment.
At 520, the global feature representations of the first content and the second content after the feature length alignment are respectively feature-spliced with the local feature representations of the other content, so as to obtain the local feature representations after feature splicing.
At 530, segment weights for each content segment feature representation of the first content and the second content are determined based on the feature-stitched local feature representations of the first content and the second content.
The segment weight determination process for characterizing content segments is described below using text-to-video retrieval as an example.
To calculate the segment weights of the words in the word-level text features t_tokens = [t_1, t_2, t_3, ..., t_M], the frame-level video features f_cls = [f_0, f_1, f_2, ..., f_N] may first be passed through a fully connected network (fully connected layers, FC) and aligned in feature length with t_tokens, i.e., aligned to the length M of t_tokens, obtaining f'_cls = [f'_0, f'_1, f'_2, ..., f'_M]. Then, vector characterization splicing is performed word by word with t_tokens to obtain t'_tokens = [t'_0, t'_1, t'_2, ..., t'_M], where t'_i = [t_i, f'_i]. Finally, t'_tokens is provided to a weight predictor consisting of a fully connected layer and a softmax layer to obtain the segment weight of each word in the word-level text features t_tokens = [t_1, t_2, t_3, ..., t_M].
Similarly, using the sentence-granularity text characterization t_cls, the segment weight of each video block in the video block characterization of the video content may be determined, where v'_tokens = [v'_1, v'_2, ..., v'_{NK}] is the feature-spliced local feature characterization obtained by splicing, element by element, each video block feature characterization on the video side with the corresponding element vector of the feature-length-aligned sentence-granularity text characterization t_cls. The manner of determining the segment weight of each video block in the video block characterization is the same as the manner of determining the segment weight of each word in the word-level text features t_tokens = [t_1, t_2, t_3, ..., t_M], and is not described again here.
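The align-splice-predict pipeline above can be sketched in plain Python. The learned FC layers are replaced by fixed toy functions (here a simple sum standing in for the scoring layer), so every name below is an illustrative assumption rather than the patent's implementation:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def segment_weights(local_tokens, aligned_global, score_fn):
    """Splice each local token with the length-aligned global feature of the
    other content, score each spliced vector, and normalize via softmax."""
    assert len(local_tokens) == len(aligned_global)
    spliced = [t + g for t, g in zip(local_tokens, aligned_global)]  # concat
    scores = [score_fn(x) for x in spliced]  # stands in for the FC predictor
    return softmax(scores)

# Toy word tokens and an already length-aligned global video feature f'_cls.
t_tokens = [[0.2, 0.1], [0.9, 0.8], [0.1, 0.0]]
f_aligned = [[0.1, 0.1]] * 3
weights = segment_weights(t_tokens, f_aligned, score_fn=sum)
```

The token with the largest spliced score receives the largest weight, and the weights sum to 1, as expected of a softmax output.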
Returning to fig. 4, after determining the segment weights for the content segment characterization, semantic partial order characterizations of the first content and the second content are generated based on the local characterization of the first content and the second content at 420 according to the segment weights for the content segment characterization.
In some embodiments, the local feature characterizations of the first and second content may be transformed into semantic partial order characterizations of the first and second content according to the segment weights of the content segment feature characterizations. For example, in some embodiments, the masking of content segments may be performed in descending order of segment weight, that is, the local feature characterizations corresponding to content segments with large segment weights are masked first, and those corresponding to content segments with small segment weights are masked later, until a predetermined condition is met, for example, a predetermined number of masked content segments is reached. In this specification, the term "mask" refers to replacing a corresponding content segment feature characterization in the local feature characterization with content having no representational meaning, or with other representational meaning, such that the replaced local feature characterization deviates from the original local feature characterization, thereby yielding a partial order. The larger the segment weight of the replaced content segment, the larger the deviation between the generated semantic partial order characterization and the original local feature characterization, and the better the partial order effect.
In some embodiments, the local feature characterizations of the first and second content may be masked using content segment masking based on a cumulative weight ratio according to the segment weights of the content segment feature characterizations, generating the semantic partial order characterizations of the first and second content. In the content segment masking mode based on the cumulative weight ratio, a cumulative weight ratio threshold is set, and content masking is then performed on the content segments in descending order of segment weight until the sum of the segment weights of the masked content segments reaches or exceeds the cumulative weight ratio threshold.
Fig. 6 shows an example schematic diagram of content segment masking based on the cumulative weight ratio according to an embodiment of the present description. As shown in fig. 6, assuming that there are 7 content segments whose segment weights for segments 1 to 7 are 0.1, 0.15, 0.09, 0.27, 0.05, 0.3, and 0.04, respectively, and the cumulative weight ratio threshold is 0.7, segments 6, 4, and 2 need to be masked (0.3 + 0.27 + 0.15 = 0.72 ≥ 0.7).
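The cumulative-weight-ratio masking rule illustrated in fig. 6 can be reproduced with a few lines of Python (a sketch; the function name is ours):

```python
def segments_to_mask(weights, threshold):
    """Return the 1-based indices of the segments to mask: take segments in
    descending weight order until the cumulative weight of the masked
    segments reaches or exceeds the cumulative weight ratio threshold."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    masked, cumulative = [], 0.0
    for i in order:
        if cumulative >= threshold:
            break
        masked.append(i + 1)  # 1-based, as in the fig. 6 example
        cumulative += weights[i]
    return masked

# The fig. 6 example: 7 segments, threshold 0.7.
w = [0.1, 0.15, 0.09, 0.27, 0.05, 0.3, 0.04]
masked = segments_to_mask(w, 0.7)  # segments 6, 4, 2 (0.3 + 0.27 + 0.15 = 0.72)
```

Running this on the fig. 6 weights masks segments 6, 4, and 2, matching the example in the text.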
In some embodiments, the adaptive token selector may be utilized to adaptively mask the local feature representations of the first content and the second content according to the determined segment weights, thereby generating semantic partial order representations of the first content and the second content. For example, in the case of a text-to-video retrieval model, an adaptive token selector may be utilized to adaptively mask word/video blocks of local feature representations of text content and video content based on determined segment weights, thereby producing semantic partial order representations of the text content and video content.
For example, in the text-video retrieval scenario, based on the word weights of a given text, a mask matrix b^t ∈ {0,1}^M can be calculated, where a masking threshold τ (default 0.6) is given: b^t_i takes the value 1 if and only if the cumulative segment weight of the word t_i, accumulated in descending order of word weight, is smaller than the given threshold τ; otherwise, b^t_i takes the value 0.
Using the mask matrix b^t, the semantic partial order characterization of the text content can be obtained by masking the word-level text features t_tokens accordingly. Similarly, the semantic partial order characterization of the video content can be derived by masking the video block characterization v_tokens with the corresponding video-side mask matrix.
In some embodiments, the local feature characterizations and the semantic partial order characterizations of the first content and the second content may further be subjected to segment-weight weighted fusion using the respective segment weights, so as to obtain weighted-fused global feature characterizations and semantic partial order characterizations.
For example, the word weights may be used to weight and fuse the text sequence t_tokens, thereby obtaining, through cross-content feature interaction, the text global feature characterization t_g, as well as the fused semantic partial order characterization of the text content obtained by the masking operation.
Similarly, the video block weights may be used to weight and fuse the video sequence v_tokens, thereby obtaining, through cross-content feature interaction, the video global feature characterization v_g, as well as the fused semantic partial order characterization of the video content obtained by the masking operation.
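The segment-weight weighted fusion producing t_g (and likewise v_g) is, in essence, a weighted sum of the token vectors; a minimal sketch under that assumption:

```python
def weighted_fusion(tokens, weights):
    """Fuse a token sequence into one global vector: sum_i w_i * token_i."""
    assert len(tokens) == len(weights)
    dim = len(tokens[0])
    fused = [0.0] * dim
    for token, w in zip(tokens, weights):
        for d in range(dim):
            fused[d] += w * token[d]
    return fused

# Two word tokens with equal weights fuse into their average.
t_g = weighted_fusion([[1.0, 2.0], [3.0, 4.0]], [0.5, 0.5])  # [2.0, 3.0]
```

The same routine, applied to the masked token sequence, would yield the fused semantic partial order characterization.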
Returning to fig. 3, after the global feature characterization and the semantic partial order characterization of the first content and the second content are obtained as described above, at 330, the global feature characterization and the semantic partial order characterization of the first content and the second content are used to model train the content retrieval model based on partial order contrast learning.
In some embodiments, the loss function used for model training based on partial order contrast learning may include a content matching loss term and partial order triplet contrast loss terms.
The content matching loss term is the loss term corresponding to the content similarity matching task. The content similarity matching task is the main task of the content retrieval model, i.e., the task used for content retrieval.
In some embodiments, the content similarity score may be determined from the global feature characterizations of the first content and the second content. For example, for a text-video retrieval task, the sentence-video semantic similarity score sim(t_i, v_i) of a sentence-video pair (t_i, v_i) can be defined as:
sim(t_i, v_i) = S(t_cls, v_h)    (12)
where S(·, ·) is a similarity calculation function; for example, the Token Interaction (TI) or Weighted Token Interaction (WTI) method may be employed, or other similarity calculation functions may be adopted.
In some embodiments, the content similarity score may be determined from the global feature characterizations of the first content and the second content together with the weighted-fused global feature characterizations. For example, for a text-video retrieval task, the sentence-video semantic similarity score sim(t_i, v_i) of a sentence-video pair (t_i, v_i) can be defined as:
sim(t_i, v_i) = S(t_cls, v_h) + S(t_g, v_g)    (13).
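With cosine similarity standing in for S(·, ·) (the text allows TI, WTI, or other similarity functions, so this choice is an assumption), equation (13) can be sketched as:

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def sim_score(t_cls, v_h, t_g, v_g):
    """Equation (13): sim(t_i, v_i) = S(t_cls, v_h) + S(t_g, v_g)."""
    return cosine(t_cls, v_h) + cosine(t_g, v_g)

# Perfectly aligned pairs in both terms give a score of 1 + 1 = 2.
score = sim_score([1.0, 0.0], [1.0, 0.0], [0.0, 2.0], [0.0, 1.0])
```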
In some embodiments, the content matching loss term may employ a symmetric InfoNCE loss L_NegNCE.
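As an illustrative sketch, the plain symmetric InfoNCE form over a batch similarity matrix is shown below; the patent's L_NegNCE variant may differ (e.g., in negative weighting), so the function and its details are our assumptions:

```python
import math

def symmetric_info_nce(sim, temperature=1.0):
    """sim[i][j] is the similarity of text i and video j; matched pairs lie
    on the diagonal. Averages the text-to-video and video-to-text
    InfoNCE losses over the batch."""
    n = len(sim)

    def nce(rows):
        total = 0.0
        for i in range(n):
            logits = [rows[i][j] / temperature for j in range(n)]
            log_denom = math.log(sum(math.exp(l) for l in logits))
            total += -(logits[i] - log_denom)  # -log softmax at the match
        return total / n

    transposed = [[sim[j][i] for j in range(n)] for i in range(n)]
    return 0.5 * (nce(sim) + nce(transposed))

# A well-separated 2x2 batch: diagonal similarity 1, off-diagonal 0.
loss = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]])
```

For this toy batch the loss equals log(1 + e^{-1}) ≈ 0.3133 in both directions.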
In some embodiments, the partial order triplet contrast loss term comprises a partial order triplet contrast loss term based on the global feature characterization of the first content, the global feature characterization of the second content, and the semantic partial order characterization; for example, in the text-video retrieval scenario described above, a triplet contrast term defined over the sentence-granularity text characterization t_cls, the video global feature characterization v_h, and the semantic partial order characterization of the video.
In some embodiments, the partial order triplet contrast loss terms may include: a partial order triplet contrast loss term based on the global feature characterization of the first content, the global feature characterization of the second content, and the semantic partial order characterization; a partial order triplet contrast loss term based on the weighted-fused global feature characterization of the first content, the global feature characterization of the second content, and the semantic partial order characterization; and a partial order triplet contrast loss term based on the weighted-fused global feature characterization and semantic partial order characterization of the first content and the global feature characterization of the second content.
The partial order triplet contrast loss term based on the weighted-fused global feature characterization of the first content, the global feature characterization of the second content, and the semantic partial order characterization can be expressed in a corresponding triplet contrast form, as can the partial order triplet contrast loss term based on the weighted-fused global feature characterization and semantic partial order characterization of the first content and the global feature characterization of the second content.
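The exact definitions of these triplet terms are not reproducible from this text, so the following is purely an illustrative assumption: a common margin-based triplet formulation requiring the matched pair to score higher than the pair involving the masked (partial order) characterization by at least a margin:

```python
def partial_order_triplet_loss(sim_pos, sim_partial, margin=0.2):
    """Hinge loss: the similarity of the full pair (e.g. S(t_cls, v_h))
    should exceed the similarity against the semantic partial order
    characterization (e.g. against the masked video) by at least `margin`."""
    return max(0.0, margin - sim_pos + sim_partial)

loss_ok = partial_order_triplet_loss(0.9, 0.5)         # order satisfied -> 0
loss_violation = partial_order_triplet_loss(0.5, 0.6)  # violated -> 0.3
```

Intuitively, this pushes the full characterization to match better than any of its degraded partial order variants, which is the stated goal of the partial order contrast learning.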
the content retrieval model training process according to embodiments of the present specification is briefly described below in connection with a text-to-video retrieval model.
Fig. 7 shows an example schematic diagram of a model structure of a text-video retrieval model according to an embodiment of the present specification.
In the example shown in fig. 7, the text-video retrieval model includes an input layer (content input layer), a feature extraction layer, and a content similarity matching layer. The feature extraction layer is implemented as a first content feature extraction layer for text content and a second content feature extraction layer for video content. The first content feature extraction layer includes a text encoder, and the second content feature extraction layer includes a video encoder, a Token selector, and a timing encoder connected in sequence. The content similarity matching layer is implemented as a similarity calculation layer.
FIG. 8 shows an example schematic diagram of a model training process for a text-to-video retrieval model according to an embodiment of the present description.
As shown in fig. 8, the text t_i is provided to the text encoder, which outputs the sentence-granularity text characterization (global feature characterization) t_cls and the word-level text features (local feature characterization) t_tokens. The video v_i is provided to the video encoder, which outputs the frame-level video features f_cls and the fine-grained video block features p_tokens. The fine-grained video block features p_tokens pass through the token selector to obtain the local video block characterization v_tokens. After the frame-level video features f_cls and/or the local video block characterization v_tokens are temporally encoded by the temporal encoder, the video global feature characterization v_h is obtained.
The sentence-granularity text characterization t_cls and the local video block characterization v_tokens are provided to the video weight predictor to predict the video block weight of each video block in the local video block characterization v_tokens, and the word-level text features t_tokens and the frame-level video features f_cls are provided to the text weight predictor to predict the word weight of each word in the word-level text features t_tokens. The word-level text features t_tokens and the corresponding word weights are then provided to the adaptive token selector to obtain the semantic partial order characterization of the sentence, and the local video block characterization v_tokens and the corresponding video block weights are provided to the adaptive token selector to obtain the semantic partial order characterization of the video. The semantic partial order characterization of the video, after passing through the temporal encoder, yields the temporally encoded semantic partial order characterization of the video, as shown in fig. 9.
Furthermore, the local video block characterization v_tokens is weighted and fused to obtain the video global feature characterization v_g, and the word-level text features t_tokens are weighted and fused to obtain the text global feature characterization t_g. The obtained sentence-granularity text characterization t_cls, local video block characterization v_tokens, frame-level video features f_cls, weighted-fused text global feature characterization t_g, weighted-fused video global feature characterization v_g, semantic partial order characterization of the sentence, and semantic partial order characterization of the video are provided to the similarity calculation layer for similarity calculation, thereby obtaining the content matching loss term L_NegNCE and the partial order triplet contrast loss terms. Then, the total loss function L_total is determined from the content matching loss term L_NegNCE and the partial order triplet contrast loss terms. Subsequently, model parameter adjustment is performed based on the total loss function L_total.
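The final assembly of the total loss can be sketched as a weighted sum; the weighting coefficient lam is our assumption, since the text only states that L_total is determined from the content matching loss term and the partial order triplet contrast loss terms:

```python
def total_loss(l_negnce, partial_order_terms, lam=1.0):
    """Illustrative assembly: L_total = L_NegNCE + lam * sum of the
    partial order triplet contrast loss terms."""
    return l_negnce + lam * sum(partial_order_terms)

# A content matching loss of 0.31 plus three triplet terms.
l_total = total_loss(0.31, [0.05, 0.0, 0.1])  # 0.46
```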
The content retrieval model training process according to the embodiment of the present specification is described above with reference to fig. 1 to 9. The trained content retrieval model may be applied for content retrieval.
Fig. 10 shows an example flowchart of a content retrieval method 1000 according to an embodiment of the present description.
As shown in fig. 10, at 1010, global and local feature representations of the first and second content are extracted via a feature extraction layer of the content retrieval model. The local feature characterization includes a content segment feature characterization of a content segment obtained by segmenting the content.
At 1020, determining segment weights of content segment feature characterizations of the first content and the second content by cross-content feature interactions via a partial order learning layer of the content retrieval model, and weighting and fusing local feature characterizations of the first content and the second content using the determined segment weights to obtain weighted and fused global feature characterizations of the first content and the second content.
At 1030, content similarity of the first content and the second content is determined for content retrieval based on the global feature characterization of the first content and the second content and the weighted and fused global feature characterization via a content similarity matching layer of the content retrieval model.
Alternatively, in other embodiments, the operations of 1020 may not be included and the local feature representations of the first content and the second content are not extracted. In this case, content retrieval is performed by determining content similarity of the first content and the second content from the global feature characterization of the first content and the second content via a content similarity matching layer of the content retrieval model.
Fig. 11 shows an example schematic diagram of a content retrieval process according to an embodiment of the present specification.
As shown in fig. 11, the text t_i is provided to the text encoder, which outputs the sentence-granularity text characterization (global feature characterization) t_cls and the word-level text features (local feature characterization) t_tokens. The video v_i is provided to the video encoder, which outputs the frame-level video features f_cls and the fine-grained video block features p_tokens. The fine-grained video block features p_tokens pass through the token selector to obtain the local video block characterization v_tokens. After the frame-level video features f_cls and/or the local video block characterization v_tokens are temporally encoded by the temporal encoder, the video global feature characterization v_h is obtained.
The sentence-granularity text characterization t_cls and the local video block characterization v_tokens are provided to the video weight predictor to predict the video block weight of each video block in the local video block characterization v_tokens, and the word-level text features t_tokens and the frame-level video features f_cls are provided to the text weight predictor to predict the word weight of each word in the word-level text features t_tokens. The local video block characterization v_tokens is weighted and fused to obtain the video global feature characterization v_g, and the word-level text features t_tokens are weighted and fused to obtain the text global feature characterization t_g.
The obtained sentence-granularity text characterization t_cls, video global feature characterization v_h, weighted-fused text global feature characterization t_g, and weighted-fused video global feature characterization v_g are provided to the similarity calculation layer for similarity calculation, and the content retrieval decision is made according to the calculated similarity.
Fig. 12 shows an example block diagram of a content retrieval model training device 1200 according to an embodiment of the present description. As shown in fig. 12, the content retrieval model training device 1200 includes a feature extraction unit 1210, a partial order learning unit 1220, and a model training unit 1230.
The feature extraction unit 1210 is configured to extract a global feature representation and a local feature representation of the first content and the second content, the extracted local feature representation including a content segment feature representation of a content segment obtained by content segmentation of the content. The operation of the feature extraction unit 1210 may refer to the operation described above with reference to 310 of fig. 3.
The partial order learning unit 1220 is configured to generate the semantic partial order characterizations of the first content and the second content from the local feature characterizations of the first content and the second content by cross-content feature interactions. The operation of the partial order learning unit 1220 may refer to the operation described above with reference to 320 of fig. 3.
The model training unit 1230 is configured to perform model training based on partial order contrast learning on the content retrieval model using the global feature characterizations and the semantic partial order characterizations of the first content and the second content. The operation of the model training unit 1230 may refer to the operation described above with reference to 330 of fig. 3.
Fig. 13 shows an example block diagram of a partial order learning unit 1300 according to an embodiment of the present specification. As shown in fig. 13, the partial order learning unit 1300 includes a segment weight determination module 1310 and a semantic partial order characterization generation module 1320.
The segment weight determination module 1310 is configured to determine the segment weights of the respective content segment feature characterizations of the first content and the second content by cross-content feature interactions.
The semantic partial order characterization generating module 1320 is configured to generate semantic partial order characterizations of the first content and the second content based on local feature characterizations of the first content and the second content according to segment weights of the content segment characterizations.
In some embodiments, the semantic partial order characterization generation module 1320 may perform content segment transformations on local feature characterizations of the first content and the second content according to segment weights of the content segment feature characterizations to generate semantic partial order characterizations of the first content and the second content. For example, the semantic partial order token generation module 1320 may mask the content segments from the local feature tokens of the first content and the second content according to the segment weights of the content segment feature tokens to generate semantic partial order tokens of the first content and the second content.
In some embodiments, the semantic partial order characterization generation module 1320 may perform content segment masking based on cumulative weight proportion on the local feature characterizations of the first content and the second content according to the segment weights of the content segment feature characterizations, generating the semantic partial order characterizations of the first content and the second content.
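By way of illustration only, the cumulative-weight-proportion masking described above might be sketched as follows in pure Python (the function name, the list-of-lists data layout, and the choice of masking by zeroing are assumptions of this sketch; the specification does not fix an implementation):

```python
def mask_by_cumulative_weight(local_feats, seg_weights, ratio=0.5):
    """Zero out the highest-weight content segments whose cumulative
    weight share reaches `ratio`; the surviving segments form a
    semantic partial order characterization of the content."""
    total = sum(seg_weights)
    shares = [w / total for w in seg_weights]
    # segment indices ordered by descending weight share
    order = sorted(range(len(shares)), key=lambda i: -shares[i])
    keep = [True] * len(shares)
    cum = 0.0
    for i in order:
        if cum >= ratio:          # stop once the masked share reaches ratio
            break
        keep[i] = False
        cum += shares[i]
    masked = [seg if k else [0.0] * len(seg)
              for seg, k in zip(local_feats, keep)]
    return masked, keep
```

In this sketch, masking the most informative segments yields a "weaker" characterization that is semantically subordinate to the full characterization, which is what the partial order contrast objective compares against.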
Fig. 14 shows an example block diagram of a segment weight determination module 1400 according to an embodiment of the present specification. As shown in fig. 14, the segment weight determination module 1400 may include a feature alignment sub-module 1410, a feature stitching sub-module 1420, and a segment weight determination sub-module 1430.
The feature alignment sub-module 1410 is configured to perform feature length alignment on the global feature characterizations of the first content and the second content with the local feature characterizations of the counterpart content, respectively, to obtain global feature characterizations after feature length alignment. The operation of the feature alignment sub-module 1410 may refer to the operation described above with reference to 510 of fig. 5.
The feature stitching sub-module 1420 is configured to perform feature stitching on the feature-length-aligned global feature characterizations of the first content and the second content with the local feature characterizations of the counterpart content, respectively, to obtain local feature characterizations after feature stitching. The operation of the feature stitching sub-module 1420 may refer to the operation described above with reference to 520 of fig. 5.
The segment weight determination submodule 1430 is configured to determine segment weights for respective content segment feature characterizations of the first content and the second content based on the feature-stitched local feature characterizations of the first content and the second content. The operation of the segment weight determination submodule 1430 may refer to the operation described above with reference to 530 of fig. 5.
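The alignment, stitching, and weight determination performed by sub-modules 1410 to 1430 can be illustrated with a minimal pure-Python sketch. The linear scoring projection `w_proj` and the softmax normalization are plausible assumptions of this sketch; the specification does not fix a concrete scoring function:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def segment_weights(global_feat, local_feats, w_proj):
    """Cross-content segment weights: tile the counterpart's global
    feature to the local sequence length (feature length alignment),
    concatenate it with each segment feature (feature stitching),
    project each stitched vector to a scalar score, and normalize
    the scores into segment weights with a softmax."""
    scores = []
    for seg in local_feats:
        stitched = list(global_feat) + list(seg)   # (2d,) stitched vector
        scores.append(sum(a * b for a, b in zip(stitched, w_proj)))
    return softmax(scores)
```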
Fig. 15 shows an example block diagram of a content retrieval device 1500 according to an embodiment of the present specification. As shown in fig. 15, the content retrieval device 1500 includes a feature extraction unit 1510, a weighted fusion unit 1520, and a content retrieval unit 1530.
The feature extraction unit 1510 is configured to extract global feature characterizations and local feature characterizations of the first content and the second content via a feature extraction layer of the content retrieval model. The local feature characterization includes a content segment feature characterization of a content segment obtained by segmenting the content.
The weighted fusion unit 1520 is configured to determine segment weights of the content segment feature characterizations of the first content and the second content by cross-content feature interactions via a partial order learning layer of the content retrieval model, and to use the determined segment weights to perform weighted fusion on the local feature characterizations of the first content and the second content, resulting in a weighted fused global feature characterization of the first content and the second content.
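For illustration, the weighted fusion of local feature characterizations into a weighted-fused global feature characterization might look like the following sketch (the function name and list-based layout are assumptions):

```python
def weighted_fuse(local_feats, seg_weights):
    """Weighted-fused global characterization: scale each content
    segment feature by its segment weight and sum over segments."""
    dim = len(local_feats[0])
    fused = [0.0] * dim
    for seg, w in zip(local_feats, seg_weights):
        for j in range(dim):
            fused[j] += w * seg[j]
    return fused
```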
The content retrieval unit 1530 is configured to determine, via the content similarity matching layer of the content retrieval model, the content similarity of the first content and the second content from the global feature characterizations and the weighted-fused global feature characterizations of the first content and the second content, for content retrieval.
In some embodiments, the content retrieval device may not include a weighted fusion unit. In this case, the feature extraction unit extracts global feature characterizations of the first content and the second content via a feature extraction layer of the content retrieval model. Then, the content retrieval unit performs content retrieval by determining content similarity of the first content and the second content from the global feature characterization of the first content and the second content via a content similarity matching layer of the content retrieval model.
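A sketch of the similarity-based retrieval step, assuming cosine similarity over global feature characterizations (the specification does not prescribe a particular similarity measure, and the ranking helper is hypothetical):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-12)

def retrieve(query_feat, candidate_feats, top_k=3):
    """Rank retrieval candidates by similarity of global
    characterizations and return the indices of the top_k matches."""
    sims = [cosine(query_feat, c) for c in candidate_feats]
    return sorted(range(len(sims)), key=lambda i: -sims[i])[:top_k]
```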
As described above with reference to fig. 1 to 15, the content retrieval model training method and the content retrieval model training apparatus and the content retrieval method and the content retrieval apparatus according to the embodiments of the present specification are described. The above content retrieval model training device and the content retrieval device may be implemented in hardware, or may be implemented in software, or a combination of hardware and software.
FIG. 16 shows a schematic diagram of a computer system implemented content retrieval model training device 1600 according to an embodiment of the present specification. As shown in fig. 16, the content retrieval model training device 1600 may include at least one processor 1610, a storage 1620 (e.g., a non-volatile storage), a memory 1630, and a communication interface 1640, and the at least one processor 1610, the storage 1620, the memory 1630, and the communication interface 1640 are connected together via a bus 1660. The at least one processor 1610 executes at least one computer-readable instruction (i.e., the elements described above implemented in software) stored or encoded in the storage.
In one embodiment, computer-executable instructions are stored in memory that, when executed, cause the at least one processor 1610 to: extracting global feature characterization and local feature characterization of first content and second content, wherein the local feature characterization comprises content segment feature characterization of content segments obtained by content segmentation of the content, the first content is retrieval reference content, and the second content is retrieval candidate content; generating semantic partial order representations of the first content and the second content from the local feature representations of the first content and the second content by cross-content feature interactions; and performing model training based on partial order contrast learning on the content retrieval model by using the global feature characterization and the semantic partial order characterization of the first content and the second content.
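As a non-authoritative illustration, one partial order contrast term of the kind suggested above could take the form of a hinge loss that enforces an ordering between the full-pair similarity and the masked (semantic partial order) pair similarity; the exact loss used by the specification may differ:

```python
def partial_order_triplet_loss(sim_full, sim_partial, margin=0.2):
    """Hinge-style partial order contrast term: the similarity of the
    full (unmasked) pair should exceed that of the masked (semantic
    partial order) pair by at least `margin`; zero loss otherwise."""
    return max(0.0, margin - (sim_full - sim_partial))
```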
It should be appreciated that the computer-executable instructions stored in the memory, when executed, cause the at least one processor 1610 to perform the various operations and functions described above in connection with fig. 1-9 and 12-14 in various embodiments of the present description.
Fig. 17 shows a schematic diagram of a computer system implemented content retrieval device 1700 according to an embodiment of the present specification. As shown in fig. 17, the content retrieval device 1700 may include at least one processor 1710, a storage 1720 (e.g., a non-volatile storage), a memory 1730, and a communication interface 1740, and the at least one processor 1710, the storage 1720, the memory 1730, and the communication interface 1740 are connected together via a bus 1760. The at least one processor 1710 executes at least one computer-readable instruction (i.e., the elements described above implemented in software) stored or encoded in the storage.
In one embodiment, computer-executable instructions are stored in memory that, when executed, cause the at least one processor 1710 to: extracting global feature characterizations of first content and second content via a feature extraction layer of a content retrieval model, the first content being a retrieval reference content and the second content being a retrieval candidate content, the content retrieval model being trained as described above; and determining the content similarity of the first content and the second content according to the global feature characterization of the first content and the second content via a content similarity matching layer of the content retrieval model to perform content retrieval.
It should be appreciated that the computer-executable instructions stored in the memory, when executed, cause the at least one processor 1710 to perform the various operations and functions described above in connection with fig. 10-11 and 15 in various embodiments of the present specification.
According to one embodiment, a program product such as a machine-readable medium (e.g., a non-transitory machine-readable medium) is provided. The machine-readable medium may have instructions (i.e., elements described above implemented in software) that, when executed by a machine, cause the machine to perform the various operations and functions described above in connection with fig. 1-15 in various embodiments of the specification. In particular, a system or apparatus equipped with a readable storage medium storing software program code that implements the functions of any of the above embodiments may be provided, and a computer or processor of the system or apparatus may be caused to read out and execute the instructions stored in the readable storage medium.
In this case, the program code itself read from the readable medium may implement the functions of any of the above-described embodiments, and thus the machine-readable code and the readable storage medium storing the machine-readable code form part of the present invention.
Examples of readable storage media include floppy disks, hard disks, magneto-optical disks, optical disks (e.g., CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW), magnetic tapes, non-volatile memory cards, and ROMs. Alternatively, the program code may be downloaded from a server computer or a cloud via a communication network.
According to one embodiment, a computer program product is provided that includes a computer program that, when executed by a processor, causes the processor to perform the various operations and functions described above in connection with fig. 1-15 in various embodiments of the present description.
It will be appreciated by those skilled in the art that various changes and modifications can be made to the embodiments disclosed above without departing from the spirit of the invention. Accordingly, the scope of the invention should be limited only by the attached claims.
It should be noted that not all the steps and units in the above flowcharts and the system configuration diagrams are necessary, and some steps or units may be omitted according to actual needs. The order of execution of the steps is not fixed and may be determined as desired. The apparatus structures described in the above embodiments may be physical structures or logical structures, that is, some units may be implemented by the same physical entity, or some units may be implemented by multiple physical entities, or may be implemented jointly by some components in multiple independent devices.
In the above embodiments, the hardware units or modules may be implemented mechanically or electrically. For example, a hardware unit, module or processor may include permanently dedicated circuitry or logic (e.g., a dedicated processor, FPGA or ASIC) to perform the corresponding operations. The hardware unit or processor may also include programmable logic or circuitry (e.g., a general purpose processor or other programmable processor) that may be temporarily configured by software to perform the corresponding operations. The particular implementation (mechanical, or dedicated permanent, or temporarily set) may be determined based on cost and time considerations.
The detailed description set forth above in connection with the appended drawings describes exemplary embodiments, but does not represent all embodiments that may be implemented or fall within the scope of the claims. The term "exemplary" used throughout this specification means "serving as an example, instance, or illustration," and does not mean "preferred" or "advantageous over other embodiments. The detailed description includes specific details for the purpose of providing an understanding of the described technology. However, the techniques may be practiced without these specific details. In some instances, well-known structures and devices are shown in block diagram form in order to avoid obscuring the concepts of the described embodiments.
The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the scope of the disclosure. Thus, the disclosure is not limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (23)

1. A content retrieval model training method based on partial order comprises the following steps:
extracting global feature characterization and local feature characterization of first content and second content, wherein the local feature characterization comprises content segment feature characterization of content segments obtained by segmenting the content, the first content is retrieval reference content, and the second content is retrieval candidate content;
generating semantic partial order representations of the first content and the second content from local feature representations of the first content and the second content by cross-content feature interactions; and
performing partial order contrast learning based model training on a content retrieval model using the global feature characterizations and the semantic partial order characterizations of the first content and the second content.
2. The content retrieval model training method of claim 1, wherein generating semantic partial order representations of the first content and the second content from local feature representations of the first content and the second content by cross-content feature interactions comprises:
determining segment weights for respective content segment feature characterizations of the first content and the second content by cross-content feature interactions; and
generating semantic partial order characterizations of the first content and the second content based on the local feature characterizations of the first content and the second content according to the segment weights of the content segment feature characterizations.
3. The content retrieval model training method of claim 2, wherein generating semantic partial order representations of the first content and the second content based on local feature representations of the first content and the second content according to segment weights of content segment feature representations comprises:
performing content segment transformation on the local feature characterizations of the first content and the second content according to the segment weights of the content segment feature characterizations, and generating semantic partial order characterizations of the first content and the second content.
4. The content retrieval model training method as recited in claim 3, wherein performing content segment transformation on local feature representations of the first content and the second content according to segment weights of content segment feature representations, generating semantic partial order representations of the first content and the second content comprises:
performing content segment masking on the local feature characterizations of the first content and the second content according to the segment weights of the content segment feature characterizations, and generating semantic partial order characterizations of the first content and the second content.
5. The content retrieval model training method as recited in claim 4, wherein performing content segment masking on local feature representations of the first and second content according to segment weights of content segment feature representations, generating semantic partial order representations of the first and second content comprises:
performing content segment masking based on cumulative weight proportion on the local feature characterizations of the first content and the second content according to the segment weights of the content segment feature characterizations, and generating semantic partial order characterizations of the first content and the second content.
6. The content retrieval model training method as recited in claim 2, wherein determining segment weights for respective content segment feature characterizations of the first content and the second content by cross-content feature interactions comprises:
performing feature length alignment on the global feature characterization of the first content and the global feature characterization of the second content with the local feature characterization of the counterpart content, respectively, to obtain global feature characterizations after feature length alignment;
the global feature characterization of the first content and the global feature characterization of the second content, which are subjected to feature length alignment, are respectively subjected to feature stitching with the local feature characterization of the counterpart content, so that the local feature characterization after feature stitching is obtained; and
and determining the segment weights of the characteristic representations of the content segments of the first content and the second content according to the characteristic-spliced local characteristic representations of the first content and the second content.
7. The content retrieval model training method of claim 1, wherein the loss function used for the partial order contrast learning based model training comprises a content matching loss term and a partial order triplet contrast loss term.
8. The content retrieval model training method of claim 7, wherein the partial order triplet contrast loss term comprises a partial order triplet contrast loss term based on the global feature characterization of the first content, the global feature characterization of the second content, and the semantic partial order characterization.
9. The content retrieval model training method of claim 8, wherein the partial order triplet contrast loss term comprises: a partial order triplet contrast loss term based on the global feature characterization of the first content, the global feature characterization of the second content, and the semantic partial order characterization; a partial order triplet contrast loss term based on the weighted-fused global feature characterization of the first content, the global feature characterization of the second content, and the semantic partial order characterization of the second content; and a partial order triplet contrast loss term based on the weighted-fused global feature characterization and the semantic partial order characterization of the first content.
10. The content retrieval model training method of claim 2, wherein the global feature representation of the first content and the second content comprises an original global feature representation and a global feature representation obtained by a segment weight weighted fusion, and the semantic partial order representation of the first content and the second content comprises a semantic partial order representation obtained by a segment weight weighted fusion.
11. The content retrieval model training method of claim 1, wherein the global feature characterization of the second content comprises a global feature characterization obtained by time-sequential aggregation of the local feature characterizations of the second content.
12. The content retrieval model training method as recited in claim 1, wherein the first content and the second content comprise one of:
text content, picture content, audio content, and video content.
13. A content retrieval method, comprising:
extracting global feature characterizations of a first content and a second content via a feature extraction layer of a content retrieval model, the first content being a retrieval reference content and the second content being a retrieval candidate content, the content retrieval model being trained according to the method of any one of claims 1 to 11; and
determining, via a content similarity matching layer of the content retrieval model, the content similarity of the first content and the second content from the global feature characterizations of the first content and the second content, to perform content retrieval.
14. The content retrieval method of claim 13, wherein the content retrieval model further comprises a partial order learning layer, extracting global feature representations of the first content and the second content via a feature extraction layer of the content retrieval model comprises:
extracting global and local feature characterizations of the first and second content via a feature extraction layer of a content retrieval model, the local feature characterizations including content segment feature characterizations of content segments resulting from content segmentation,
The content retrieval method further includes:
determining segment weights of content segment feature representations of the first content and the second content through cross-content feature interactions via the partial order learning layer, performing weighted fusion on local feature representations of the first content and the second content by using the determined segment weights to obtain weighted fused global feature representations of the first content and the second content,
determining, via a content similarity matching layer of the content retrieval model, content similarity of the first content and the second content from global feature characterizations of the first content and the second content comprises:
determining, via the content similarity matching layer of the content retrieval model, the content similarity of the first content and the second content from the global feature characterizations of the first content and the second content and the weighted-fused global feature characterizations, to perform content retrieval.
15. The content retrieval method of claim 14, wherein the global feature characterization of the second content comprises a global feature characterization obtained by time-sequential aggregation of the local feature characterizations of the second content.
16. A partial order based content retrieval model training device, comprising:
a feature extraction unit that extracts global feature characterizations and local feature characterizations of a first content and a second content, the local feature characterizations including content segment feature characterizations of content segments obtained by content segmentation of the content, the first content being a retrieval reference content, and the second content being a retrieval candidate content;
a semantic partial order learning unit that generates semantic partial order representations of the first content and the second content from local feature representations of the first content and the second content through cross-content feature interactions; and
and the model training unit is used for carrying out model training based on partial order contrast learning on the content retrieval model by using the global characteristic characterization and the semantic partial order characterization of the first content and the second content.
17. The content retrieval model training device of claim 16, wherein the semantic partial order learning unit comprises:
a segment weight determination module that determines segment weights for respective content segment feature characterizations of the first content and the second content by cross-content feature interactions; and
the semantic partial order representation generation module is used for generating semantic partial order representations of the first content and the second content based on the local feature representations of the first content and the second content according to the segment weights of the content segment feature representations.
18. The content retrieval model training device of claim 17, wherein the semantic partial order characterization generation module performs content segment transformation on the local feature characterizations of the first content and the second content according to the segment weights of the content segment feature characterizations, to generate semantic partial order characterizations of the first content and the second content.
19. The content retrieval model training device of claim 17, wherein the segment weight determination module comprises:
a feature alignment sub-module that performs feature length alignment on the global feature characterizations of the first content and the second content with the local feature characterizations of the counterpart content, respectively, to obtain global feature characterizations after feature length alignment;
a feature stitching sub-module that performs feature stitching on the feature-length-aligned global feature characterizations of the first content and the second content with the local feature characterizations of the counterpart content, respectively, to obtain local feature characterizations after feature stitching; and
and the segment weight determining sub-module is used for determining the segment weights of the characteristic representations of the content segments of the first content and the second content according to the characteristic-spliced local characteristic representations of the first content and the second content.
20. A content retrieval device, comprising:
a feature extraction unit that extracts global feature characterizations of a first content and a second content via a feature extraction layer of a content retrieval model, the first content being a retrieval reference content and the second content being a retrieval candidate content, the content retrieval model being trained according to the method of any one of claims 1 to 11; and
a content retrieval unit that performs content retrieval by determining, via the content similarity matching layer of the content retrieval model, the content similarity of the first content and the second content from the global feature characterizations of the first content and the second content.
21. The content retrieval device according to claim 20, wherein the content retrieval model further includes a partial order learning layer, the feature extraction unit extracts global feature characterizations and local feature characterizations of the first content and the second content via a feature extraction layer of the content retrieval model, the local feature characterizations including content segment feature characterizations of content segments resulting from content segmentation,
the content retrieval device further includes:
a weighted fusion unit, via the partial order learning layer, determining segment weights of content segment feature characterizations of the first content and the second content by cross-content feature interactions, and performing weighted fusion on local feature characterizations of the first content and the second content using the determined segment weights to obtain weighted fused global feature characterizations of the first content and the second content,
The content retrieval unit determines the content similarity of the first content and the second content to retrieve the content according to the global feature representation of the first content and the second content and the weighted and fused global feature representation through a content similarity matching layer of the content retrieval model.
22. A partial order based content retrieval model training device, comprising:
at least one processor;
a memory coupled to the at least one processor; and
a computer program stored in the memory, wherein the at least one processor executes the computer program to implement the partial order based content retrieval model training method of any one of claims 1 to 12.
23. A content retrieval device, comprising:
at least one processor;
a memory coupled to the at least one processor; and
a computer program stored in the memory, wherein the at least one processor executes the computer program to implement the content retrieval method of any one of claims 13 to 15.
CN202310896764.2A 2023-07-20 2023-07-20 Content retrieval model training method based on partial order, content retrieval method and device Pending CN116881520A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310896764.2A CN116881520A (en) 2023-07-20 2023-07-20 Content retrieval model training method based on partial order, content retrieval method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310896764.2A CN116881520A (en) 2023-07-20 2023-07-20 Content retrieval model training method based on partial order, content retrieval method and device

Publications (1)

Publication Number Publication Date
CN116881520A true CN116881520A (en) 2023-10-13

Family

ID=88260131

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310896764.2A Pending CN116881520A (en) 2023-07-20 2023-07-20 Content retrieval model training method based on partial order, content retrieval method and device

Country Status (1)

Country Link
CN (1) CN116881520A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117556276A (en) * 2024-01-11 2024-02-13 支付宝(杭州)信息技术有限公司 Method and device for determining similarity between text and video
CN117556276B (en) * 2024-01-11 2024-05-10 支付宝(杭州)信息技术有限公司 Method and device for determining similarity between text and video


Similar Documents

Publication Publication Date Title
CN108959396B (en) Machine reading model training method and device and question and answer method and device
CN111309971B (en) Multi-level coding-based text-to-video cross-modal retrieval method
CN115329127A (en) Multi-mode short video tag recommendation method integrating emotional information
CN112464656B (en) Keyword extraction method, keyword extraction device, electronic equipment and storage medium
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN112948708A (en) Short video recommendation method
CN116955699B (en) Video cross-mode search model training method, searching method and device
CN113159023A (en) Scene text recognition method based on explicit supervision mechanism
CN114495129B (en) Character detection model pre-training method and device
CN112183655A (en) Document multi-label classification method and device
CN115130591A (en) Cross supervision-based multi-mode data classification method and device
CN114491258A (en) Keyword recommendation system and method based on multi-modal content
Abdar et al. A review of deep learning for video captioning
Gu et al. Generalized zero-shot learning via VAE-conditioned generative flow
CN116933051A (en) Multi-mode emotion recognition method and system for modal missing scene
CN117313740A (en) Language model training method
CN116881520A (en) Content retrieval model training method based on partial order, content retrieval method and device
CN115408494A (en) Text matching method integrating multi-head attention alignment
CN117011737A (en) Video classification method and device, electronic equipment and storage medium
CN110969187B (en) Semantic analysis method for map migration
CN117556276B (en) Method and device for determining similarity between text and video
CN115329755B (en) Entity link model processing method and device and entity link processing method and device
CN114996424B (en) Weak supervision cross-domain question-answer pair generation method based on deep learning
CN113837910B (en) Test question recommending method and device, electronic equipment and storage medium
CN115187893A (en) Video participation degree prediction method, system, device and medium based on graph learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination