CN116610831A - Semantic subdivision and modal alignment reasoning learning cross-modal retrieval method and retrieval system - Google Patents

Semantic subdivision and modal alignment reasoning learning cross-modal retrieval method and retrieval system Download PDF

Info

Publication number
CN116610831A
CN116610831A (application CN202310684445.5A)
Authority
CN
China
Prior art keywords
text
image
modal
formula
loss function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310684445.5A
Other languages
Chinese (zh)
Inventor
李宝莲
李培瑶
孙苹苹
朱良彬
韩博
谢海瑶
强保华
李忠涛
赵建
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guilin University of Electronic Technology
CETC 54 Research Institute
Original Assignee
Guilin University of Electronic Technology
CETC 54 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guilin University of Electronic Technology, CETC 54 Research Institute filed Critical Guilin University of Electronic Technology
Priority to CN202310684445.5A priority Critical patent/CN116610831A/en
Publication of CN116610831A publication Critical patent/CN116610831A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/09Supervised learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Library & Information Science (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a cross-modal retrieval method and a retrieval system based on semantic subdivision and modal alignment reasoning learning. The cross-modal retrieval method comprises the following steps: performing modal alignment, based on scaled dot-product attention, on the original modal features obtained after pre-training, so as to re-aggregate projected modal-alignment features for the original features; after passing the modal-alignment data formed in the preceding step through a weight-shared multi-layer perceptron, adopting a semantic approximate-matching and correct-matching method to mine correct semantic matches and approximate matches within same-class label clusters; and constraining the model with an Arc4cmr loss function, a mutual-supervision contrastive loss function, and a contrastive loss function between the image-text feature similarity matrix and the similarity label matrix. The cross-modal retrieval method further improves the accuracy of cross-modal retrieval.

Description

Semantic subdivision and modal alignment reasoning learning cross-modal retrieval method and retrieval system
Technical Field
The invention relates to the field of cross-modal retrieval that maximizes semantic correlation and modal alignment, and in particular to a cross-modal retrieval method and a retrieval system based on semantic subdivision and modal alignment reasoning learning.
Background
With the rapid development and gradual maturation of multimedia technology, information carriers have evolved from simple image-text forms to presentations that combine multiple kinds of media data. These data differ in their forms of existence, data types, distributions, and modes of expression, and show things from different angles, dimensions, and levels; collectively they are referred to as multi-modal data. With the rapid growth of multimedia social platforms, forms of data expression have become richer, and a new pattern of content symbiosis and diverse, fused forms has gradually taken shape. The diversification of information propagation modes has brought a continuous expansion of retrieval dimensions. For example, when we retrieve an event or concept, we want to see related information in various forms such as pictures, videos, and charts for better understanding and memorization; cross-modal retrieval tasks have thus emerged.
Cross-modal retrieval aims to address the fact that the low-level features of different modal data are heterogeneous while their high-level semantics are correlated. Depending on whether label information is used, cross-modal retrieval can be divided into supervised and unsupervised approaches; in terms of historical development, it can be divided into traditional methods based on statistical analysis and modern methods based on deep learning.
Traditional cross-modal retrieval method based on statistical analysis:
1. Unsupervised methods: the Cross-modal Factor Analysis (CFA) method proposed by Li et al. is the earliest traditional unsupervised cross-modal method. Using the F-norm as a measure, it learns projection subspaces for the different modalities by minimizing the distance between cross-modal sample pairs in the transformed domain, thereby analyzing, in a common subspace, the latent matching relation behind the two-modality data. Canonical Correlation Analysis (CCA), proposed by Hotelling et al., is an unsupervised common-space learning method and a milestone of image-text content correlation retrieval.
2. Supervised methods: Rasiwasia et al. proposed a Semantic Correlation Matching (SCM) cross-modal retrieval model, which abstracts the semantics of images and texts and jointly models the cross-correlated two-modality data in a shared space to improve the retrieval precision of the model. To exploit the fact that inter-modal relations in real life are not strictly one-to-one, Ranjan et al. built a multi-label Canonical Correlation Analysis (ml-CCA) model on the basis of CCA, using the one-to-many, many-to-one, and many-to-many relations produced by multiple labels; the model better fits real scenes and performs better.
Traditional cross-modal retrieval methods based on statistical analysis are relatively simple to implement, but the models mostly learn shallow mappings or limited nonlinear relations of the multi-modal data, leaving considerable room for improvement in modeling high-level semantics. In addition, as the data size increases, the computational complexity of traditional methods grows and their ability to handle high-dimensional data drops sharply.
Prior-art cross-modal retrieval methods based on deep learning mainly include the following:
1. Unsupervised methods: Deep Canonical Correlation Analysis (DCCA), proposed by Andrew et al., uses a neural network to learn a nonlinearly transformed common space and accurately capture data correlations, solving the problem that CCA is only suited to linear common-space learning. The Deep Canonically Correlated Autoencoders (DCCAE) proposed by Wang et al. improve DCCA by adding an autoencoder-based reconstruction error.
2. Supervised methods: Zhai et al. proposed a Joint Representation Learning (JRL) method that combines multiple modality data in a unified framework with sparse and semi-supervised regularization to explore their pairwise correlation and semantic correlation information. Zhen et al. proposed an end-to-end Deep Supervised Cross-Modal Retrieval (DSCMR) method, which maintains semantic discriminability by linearly classifying samples in the common space and learns correlations between the modalities through a weight-sharing strategy to preserve modality invariance.
Thanks to the large-scale data processing capacity, nonlinear structural design, and deep semantic mining ability of deep learning network models, deep-learning-based methods have opened new ideas and techniques for cross-modal retrieval research compared with traditional statistical-analysis methods; the method of the present invention is therefore proposed on the basis of deep learning.
Disclosure of Invention
In view of the above, the invention discloses a novel cross-modal retrieval method in which a modal alignment module based on scaled dot-product attention strengthens the correlation of semantically related modal features and learns the modal alignment between the two modalities of data; a semantic approximate-matching and correct-matching module is designed to enhance the aggregation of image-text features within a class while drawing fine distinctions between intra-class image-text pairs whose semantic information differs. A contrastive loss function is used for mutual supervision to enhance fine-grained feature alignment, and a contrastive loss function between the image-text feature similarity matrix and the similarity label matrix makes the loss of intra-class semantic approximate matches smaller than the loss of inter-class mismatches.
Specifically, the invention is realized by the following technical scheme:
In a first aspect, the invention discloses a novel cross-modal retrieval method, which comprises the following steps:
performing modal alignment, based on scaled dot-product attention, on the original modal features obtained after pre-training, so as to re-aggregate projected modal-alignment features for the original features;
after passing the modal-alignment data formed in the preceding step through a weight-shared multi-layer perceptron, adopting a semantic approximate-matching and correct-matching method to mine correct semantic matches and approximate matches within same-class label clusters;
constraining the model with an Arc4cmr loss function, a mutual-supervision contrastive loss function, and a contrastive loss function between the image-text feature similarity matrix and the similarity label matrix.
In a second aspect, the invention discloses a cross-modal retrieval system comprising:
a modal alignment module: used for performing modal alignment, based on scaled dot-product attention, on the original modal features obtained after pre-training, so as to re-aggregate projected modal-alignment features for the original features;
a matching module: used for passing the modal-alignment data formed above through a weight-shared multi-layer perceptron and then adopting a semantic approximate-matching and correct-matching method to mine correct semantic matches and approximate matches within same-class label clusters;
a constraint module: used for constraining the model with an Arc4cmr loss function, a mutual-supervision contrastive loss function, and a contrastive loss function between the image-text feature similarity matrix and the similarity label matrix.
In a third aspect, the present invention discloses a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the steps of the cross-modal retrieval method of the first aspect.
In a fourth aspect, the invention discloses a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the cross-modality retrieval method as in the first aspect when the program is executed.
Currently, most loss-function constraints of supervised models use class-label information as the measure. In a multi-class model, the number of output dimensions is generally set to the number of classes (say N) so as to output a probability score for each class, and to facilitate computing and evaluating the model's predictions, the true class label of each sample is converted into an N-dimensional one-hot vector: the N preset categories are represented as a vector of length N in which only the position of the corresponding category is 1 and all other positions are 0. This, however, hides the problem that samples with distinct specific semantic information are forcibly collapsed into a single class.
In addition, the original aim of the pre-trained CLIP model is to judge the image category by using text sentences containing finer-grained information as image labels, which can effectively alleviate the problem of similar images being forced into one class. Existing models, however, lack consideration of fine-grained alignment of image-text features. Based on this consideration, the invention takes the correctly matched image-text pair of one modality's feature data as supervision information for the other modality's features, and performs fine-grained feature alignment.
To improve the reasoning capability of the modal alignment module for feature reconstruction, to make the intra-class distance of different modal data as small as possible and the inter-class distance as large as possible, to effectively subdivide classes according to semantic information, and to enhance fine-grained alignment of modal features, a Semantic Refinement Discrimination and Modal Alignment inference learning (SRD-MA+) cross-modal retrieval model is proposed.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 is an overall frame diagram of a cross-modal retrieval method provided by an embodiment of the invention;
FIG. 2 is a flow chart of a modality alignment workflow based on scaled dot product attention provided by an embodiment of the present invention;
FIG. 3 is a data processing flow chart of a semantic approximate matching and correct matching module provided by an embodiment of the present invention;
FIG. 4 is a visualization of the image-text similarity matrix processed with the similarity label matrix label_sim according to an embodiment of the present invention;
FIG. 5 is a visualization comparing the SRD-MA+ model and the SMR-MA model according to Experimental Example 1 of the present invention;
FIG. 6 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in this disclosure to describe various information, these information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present disclosure. The word "if" as used herein may be interpreted as "at … …" or "at … …" or "responsive to a determination", depending on the context.
The invention discloses a cross-modal retrieval method, which comprises the following steps:
performing modal alignment, based on scaled dot-product attention, on the original modal features obtained after pre-training, so as to re-aggregate projected modal-alignment features for the original features;
after passing the modal-alignment data formed in the preceding step through a weight-shared multi-layer perceptron, adopting a semantic approximate-matching and correct-matching method to mine correct semantic matches and approximate matches within same-class label clusters;
constraining the model with an Arc4cmr loss function, a mutual-supervision contrastive loss function, and a contrastive loss function between the image-text feature similarity matrix and the similarity label matrix.
The overall framework of the cross-modal retrieval model is shown in Fig. 1. First, the pre-trained CLIP model encodes the image and text samples to obtain the original image features and original text features. Second, to optimize the modal alignment module and strengthen the correlation of the two semantically related modal features, projected modal-alignment features are re-aggregated for the original features through the modal alignment module based on scaled dot-product attention. Next, the common representation space is learned with weight-shared MLPs to preserve modality invariance. Then, the semantic approximate-matching and correct-matching module mines correct semantic matches and approximate matches within same-class label clusters, improving the accuracy of cross-modal retrieval. Finally, three loss functions are combined to constrain and improve inter-class separation and intra-class aggregation, enhance fine-grained feature alignment, make the loss of intra-class semantic approximate matches smaller than the loss of inter-class mismatches, and semantically subdivide the intra-class image-text pairs.
SRD-MA+ selects the modified ResNet-50 network in the CLIP pre-trained model as the image encoder and the Transformer network as the text encoder, obtaining the corresponding original image feature representation and original text feature representation, where D is the feature dimension (D = 1024) and M is the batch size (M = 100).
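The feature-encoding step can be sketched as follows. This is a minimal illustration, assuming the open-source OpenAI `clip` package; the helper name `encode_batch` and the use of the `RN50` weights are illustrative assumptions rather than the patent's exact implementation.

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
# "RN50" selects the modified ResNet-50 image encoder and the Transformer text
# encoder; both produce D = 1024-dimensional embeddings.
model, preprocess = clip.load("RN50", device=device)

def encode_batch(pil_images, captions):
    """Return original image features and text features, each of shape (M, D)."""
    images = torch.stack([preprocess(im) for im in pil_images]).to(device)
    tokens = clip.tokenize(captions).to(device)
    with torch.no_grad():
        F_v = model.encode_image(images).float()   # original image features (M, 1024)
        F_t = model.encode_text(tokens).float()    # original text features (M, 1024)
    return F_v, F_t
```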
The workflow of the modal alignment module based on scaled dot-product attention is shown in Fig. 2 and comprises two data-processing procedures: image feature alignment and text feature alignment. The module takes as input the original image features and original text features produced by feature encoding, and outputs fused image features and fused text features. In this process the model learns a new joint latent space that highlights the images (texts) better matching the query text (image), effectively suppressing images (texts) irrelevant to the query. To strengthen inter-modal information interaction and eliminate modal differences, the two data-processing procedures share the same network structure and parameters.
In the modal alignment module based on scaled dot-product attention, for the text-retrieves-image task a single original text feature is converted into a query Q_t, and all original image features within the batch are converted into keys K_v and values V_v; the specific transformation is shown in Equation 1. By analogy, the transformation for the image-retrieves-text task is shown in Equation 2.
In Equations 1 and 2, D_p is the projection dimension, with a value of 1024; LN denotes layer normalization; and the W matrices are projection matrices of the same dimension.
The correlation weights between the single original text feature's query Q_t and each original image feature's key K_v in the batch are used to re-aggregate the projected image features V_v in proportion to the correlation weight coefficients; the text-conditioned output obtained after scaled dot-product attention is Attention(Q_t, K_v, V_v), as shown in Equation 3:
Attention(Q_t, K_v, V_v) = Softmax(Q_t K_v^T / sqrt(D_p)) V_v (3)
Similarly, the image-conditioned output obtained after scaled dot-product attention is Attention(Q_v, K_t, V_t), as shown in Equation 4:
Attention(Q_v, K_t, V_t) = Softmax(Q_v K_t^T / sqrt(D_p)) V_t (4)
In Equations 3 and 4, sqrt(D_p) is the scaling factor. Introducing the scaling factor normalizes the attention scores and effectively avoids the gradient vanishing or gradient explosion problems caused by attention scores QK^T that are too large or too small, improving the performance and stability of the model.
To embed images into a shared space with text, the aggregated image representation from the attention module is projected back to dimension D with the weight W^O, as shown in Equation 5:
r_{v|t} = LN(Attention(Q_t, K_v, V_v) W^O) (5)
where r_{v|t} denotes the aggregated image feature conditioned on text t.
To embed text into a shared space with images, the aggregated text representation from the attention module is projected back to dimension D with the weight W^O, as shown in Equation 6:
r_{t|v} = LN(Attention(Q_v, K_t, V_t) W^O) (6)
where r_{t|v} is the aggregated text feature conditioned on image v.
The image-modality features that finally result, via the modal interaction based on scaled dot-product attention, with text as the retrieval condition are denoted C_{v|t}, the final fused output of the text feature alignment process, as shown in Equation 7; the text-modality features with the image as the retrieval condition are denoted C_{t|v}, the final fused output of the image feature alignment process, as shown in Equation 8.
Here a single element of C_{t|v} is the re-representation of a single original text feature within the batch, and a single element of C_{v|t} is the re-representation of a single original image feature within the batch.
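A minimal PyTorch sketch of the alignment step described above (Equations 1 to 8) follows. The class name, layer names, and the residual fusion with the original features are assumptions where the text does not fix the details; this is a sketch of scaled dot-product attention between the two modalities, not a verbatim reproduction of the patent's module.

```python
import math
import torch
import torch.nn as nn

class ModalAlignment(nn.Module):
    """Scaled dot-product attention alignment between two modalities (a sketch)."""
    def __init__(self, d: int = 1024, d_p: int = 1024):
        super().__init__()
        self.ln_q = nn.LayerNorm(d)
        self.ln_kv = nn.LayerNorm(d)
        self.w_q = nn.Linear(d, d_p, bias=False)   # query projection (Eq. 1/2)
        self.w_k = nn.Linear(d, d_p, bias=False)   # key projection
        self.w_v = nn.Linear(d, d_p, bias=False)   # value projection
        self.w_o = nn.Linear(d_p, d, bias=False)   # output projection W^O (Eq. 5/6)
        self.ln_out = nn.LayerNorm(d)
        self.d_p = d_p

    def align(self, query_feats: torch.Tensor, ctx_feats: torch.Tensor) -> torch.Tensor:
        q = self.w_q(self.ln_q(query_feats))                             # (M, D_p)
        k = self.w_k(self.ln_kv(ctx_feats))                              # (M, D_p)
        v = self.w_v(self.ln_kv(ctx_feats))                              # (M, D_p)
        attn = torch.softmax(q @ k.t() / math.sqrt(self.d_p), dim=-1)    # Eq. 3/4
        agg = self.ln_out(self.w_o(attn @ v))                            # Eq. 5/6
        # Eqs. 7/8 fuse the aligned features with the original representation;
        # the exact fusion is not fully specified here, a residual add is assumed.
        return query_feats + agg

    def forward(self, F_v: torch.Tensor, F_t: torch.Tensor):
        C_v_given_t = self.align(F_t, F_v)   # text queries attend over image keys/values
        C_t_given_v = self.align(F_v, F_t)   # image queries attend over text keys/values
        return C_v_given_t, C_t_given_v
```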
The image features and text features obtained by the above processing are fed into the weight-shared multi-layer perceptron (MLP). The MLP structure is y(x) = W_2(ε(W_1 x + b_1)) + b_2, where ε denotes the GELU activation function, W_1 and W_2 are trainable weight parameters, and b_1 and b_2 are bias terms. After the weight-shared MLP, a single processed image feature is denoted u_i and a single processed text feature is denoted v_j; all image features and text features within the batch are then denoted C'_{v|t} = [u_1, u_2, …, u_M]^T and C'_{t|v} = [v_1, v_2, …, v_M]^T, respectively.
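A sketch of the weight-shared MLP described above follows; the hidden width is an assumed value, and applying the identical module instance to both modalities is what realizes the weight sharing.

```python
import torch
import torch.nn as nn

class SharedMLP(nn.Module):
    """y(x) = W2(GELU(W1 x + b1)) + b2, shared across the two modalities."""
    def __init__(self, d: int = 1024, hidden: int = 1024):
        super().__init__()
        self.fc1 = nn.Linear(d, hidden)   # W1, b1
        self.fc2 = nn.Linear(hidden, d)   # W2, b2
        self.act = nn.GELU()              # epsilon

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(self.act(self.fc1(x)))

# The same instance (same parameters) maps both aligned modalities into the
# common representation space, e.g.:
#   shared_mlp = SharedMLP()
#   U = shared_mlp(C_v_given_t)   # image features u_i
#   V = shared_mlp(C_t_given_v)   # text features v_j
```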
The data-processing flow of the semantic approximate-matching and correct-matching module is shown in Fig. 3. It simultaneously considers the correct-match, intra-class semantic approximate-match, and completely unmatched inter-class relations of image-text pairs, so that the positive and negative effects of image-text pairs on cross-modal retrieval are accurately distinguished along different matching dimensions. Image-text pairs that are semantically approximately matched within a class and correctly matched image-text pairs are processed with two different attention-mask mechanisms. First, the semantic relevance score s_ij between every image feature u_i and text feature v_j within the batch is calculated, as shown in Equation 9.
According to Equation 9, the text-to-image (T2I) similarity matrix attn_t can be obtained, as given in Equation 10.
Similarly, the image-to-text (I2T) similarity matrix can be expressed as attn_v, as shown in Equation 11.
The class labels corresponding to all images and texts within one batch may repeat; an XNOR operation is applied to the label values at corresponding positions, i.e., the entry is 1 when the two labels are the same and 0 when they differ, so the similarity label matrix of approximately matched image-text pairs can be expressed by Equation 12.
Based on the similarity label matrix label_sim, the similarity weights of the text-to-image similarity matrix attn_t can be re-expressed, and similarly the similarity weights of the image-to-text similarity matrix attn_v can be re-expressed. The inherent property of the Softmax function is exploited to increase the similarity of correctly matched image-text pairs while making the similarity of intra-class semantically approximate matches far greater than that of unrelated inter-class pairs, effectively suppressing the negative influence of inter-class mismatches; this strengthens inter-class separation and, while enhancing the weight of the correct-match similarity, effectively enhances intra-class cohesion.
The visualization of the image-text similarity matrix processed with the similarity label matrix label_sim is shown in Fig. 4. Diagonal elements in the figure represent image-text pairs that are correctly matched according to the original semantic information; darker colors indicate greater similarity. Semantically approximately matched image-text pairs sharing the same category label have lower similarity than the correctly matched pairs, and their colors are correspondingly lighter. Unrelated image-text pairs belonging to different categories have the lightest colors and act as negative samples.
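The similarity matrices and the XNOR-based similarity label matrix can be sketched as follows. The dot-product score for Equation 9 and the masked-Softmax re-weighting are assumptions where the text leaves the exact operations implicit.

```python
import torch

def similarity_matrices(U: torch.Tensor, V: torch.Tensor, labels: torch.Tensor):
    """U: image features (M, D); V: text features (M, D); labels: (M,) class ids."""
    s = U @ V.t()                            # s_ij between image u_i and text v_j (Eq. 9, dot product assumed)
    attn_v = torch.softmax(s, dim=-1)        # image-to-text (I2T) similarity matrix, Eq. 11
    attn_t = torch.softmax(s.t(), dim=-1)    # text-to-image (T2I) similarity matrix, Eq. 10
    # XNOR of class labels: 1 where labels agree, 0 otherwise (Eq. 12)
    label_sim = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    # keep only same-class (correct + approximate) matches before re-normalizing
    neg_inf = torch.finfo(s.dtype).min
    attn_t_sim = torch.softmax(s.t().masked_fill(label_sim == 0, neg_inf), dim=-1)
    attn_v_sim = torch.softmax(s.masked_fill(label_sim == 0, neg_inf), dim=-1)
    return attn_t, attn_v, label_sim, attn_t_sim, attn_v_sim
```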
When measuring the similarity of image-text pairs, attention is first paid to the shared semantics of the two modality data. For the i-th text, the related image features in the batch-sized image library can be aggregated as shown in Equation 13:
where the attention weight, expanded as shown in Equation 14, is the near-semantic association between the text and the images.
For the i-th image, the related text features in the batch-sized text library can be aggregated as shown in Equation 15:
where the attention weight, expanded as shown in Equation 16, is the near-semantic association between the image and the texts.
In Equations 14 and 16, λ is the penalty factor and mask_sim(·) denotes a masking function: when the input is positive the output equals the input, otherwise the output is -∞. After processing by the Softmax function, the attention weights of irrelevant samples are reduced to 0, thereby attending effectively to the relevant samples.
According to the semantic information of the image-text pairs, the label discrimination matrix corresponding to the correct-match relation is an identity matrix, as given in Equation 17.
Based on the identity label matrix label_eql, the similarity weights attn_eql of correctly matched image-text pairs are computed, i.e., the similarity weights between the text-to-image similarity matrix attn_t and label_eql, and between the image-to-text similarity matrix attn_v and label_eql, where mask_eql(·) is a mask function whose output is 1 when the input is positive and 0 otherwise. According to the soft-attention calculation rule, correctly matched image features and correctly matched text features can then be re-expressed as weighted aggregations.
Combining the above analysis, the image features jointly represented by the approximately semantically associated features and the correctly matched features are given in Equation 18, and the corresponding text features in Equation 19.
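A sketch of the two mask mechanisms and the resulting soft-attention aggregation follows. The value of the penalty factor λ and the exact way the approximate-match and correct-match branches are combined (Equations 13 to 19) are assumptions; the text-side aggregation is shown and the image-side case is symmetric.

```python
import torch

_NEG = -1e9  # stands in for -infinity so Softmax stays numerically safe

def mask_sim(x: torch.Tensor) -> torch.Tensor:
    """Output equals the input for positive entries, a large negative value otherwise."""
    return torch.where(x > 0, x, torch.full_like(x, _NEG))

def mask_eql(x: torch.Tensor) -> torch.Tensor:
    """Output 1 for positive entries, 0 otherwise."""
    return (x > 0).float()

def aggregate_for_text(U, s_t2i, label_sim, lam: float = 0.1):
    """Aggregate, for every text, the related image features in the batch.
    U: image features (M, D); s_t2i: text-to-image scores (M, M)."""
    label_eql = torch.eye(s_t2i.size(0), device=s_t2i.device)          # Eq. 17
    # near-semantic (same-class) attention, scaled by the penalty factor lambda
    w_approx = torch.softmax(lam * mask_sim(label_sim * s_t2i), dim=-1)
    # correct-match attention: keep only the paired (diagonal) similarity weights
    w_exact = mask_eql(label_eql) * torch.softmax(s_t2i, dim=-1)
    # combined representation in the spirit of Eq. 18
    return w_approx @ U + w_exact @ U
```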
The objective function of the SRD-MA+ model mainly comprises three parts: the Arc4cmr (ArcFace loss for Cross-Modal Retrieval) loss function, the mutual-supervision contrastive loss function, and the contrastive loss function between the image-text feature similarity matrix and the similarity label matrix.
The ArcFace loss function is specified in Equation 20, where x_i denotes the feature input, y_i is the corresponding class label, θ_k is the angle between feature x_i and the corresponding weight W_k in the angular space, s is the hypersphere radius, n is the number of classes, and M is the batch size. The constraint can be expressed by Equation 21.
For the text-retrieves-image task, the ArcFace loss function L_simT2I takes the corresponding text-side features as input, with the corresponding regularization, as shown in Equation 22.
For the image-retrieves-text task, the ArcFace loss function L_simI2T takes the corresponding image-side features as input, with the corresponding regularization, as shown in Equation 23.
Combining Equations 22 and 23, the Arc4cmr loss function L_Arc4cmr of the SRD-MA+ model can be expressed as Equation 24:
L_Arc4cmr = L_simT2I + L_simI2T (24)
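An ArcFace-style sketch of the Arc4cmr constraint (Equations 20 to 24) follows; the margin m, the scale s, and the number of classes are assumed values, and the regularization terms of Equations 22/23 are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceHead(nn.Module):
    """ArcFace-style classification head applied to either modality's features."""
    def __init__(self, d: int = 1024, n_classes: int = 10, s: float = 30.0, m: float = 0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_classes, d))
        self.s, self.m = s, m   # hypersphere radius and angular margin

    def forward(self, x: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # cos(theta_k) between each feature x_i and each class weight W_k
        cos = F.linear(F.normalize(x), F.normalize(self.weight)).clamp(-1 + 1e-7, 1 - 1e-7)
        theta = torch.acos(cos)
        # add the angular margin m only to the target-class angle
        target = F.one_hot(labels, cos.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.m), cos) * self.s
        return F.cross_entropy(logits, labels)

# L_Arc4cmr = L_simT2I + L_simI2T (Eq. 24): the same head constrains both modalities, e.g.
#   arc_head = ArcFaceHead(n_classes=10)
#   loss_arc4cmr = arc_head(text_feats, labels) + arc_head(image_feats, labels)
```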
The purpose of training the network is to minimize the distance between positive pairs, i.e., pairs in which the content presented by the image agrees with the content described by the text, and to maximize the distance between negative pairs, i.e., pairs in which the image presentation contradicts the text description. A contrastive loss function is therefore included in the objective. The similarity L_contr between a feature x_i and another feature x_j is measured with a normalized temperature-scaled Softmax similarity, as shown in Equation 25:
where τ denotes the temperature parameter and sim(·) denotes the dot product after L2 normalization of the inputs. Unlike a direct cosine-similarity measurement, the normalized Softmax amplifies the similarity of positive pairs and weakens the influence of negative pairs. The contrastive loss is the arithmetic mean of the cross-entropy of all positive-pair normalized similarities within the batch; assuming a sample whose content matches x_i, the contrastive loss L_contr is given by Equation 26, where B is the size of the sample set.
The features of matched image-text pairs are taken as mutual supervision signals to achieve fine-grained feature alignment; the mutual-supervision contrastive loss L_contr-MutlSup can thus be represented by Equation 27.
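A sketch of the normalized temperature-scaled contrastive loss and its mutual-supervision form (Equations 25 to 27) follows; the temperature value and the symmetric averaging of the two directions are assumptions.

```python
import torch
import torch.nn.functional as F

def nt_xent(a: torch.Tensor, b: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """a, b: (M, D) features where (a_i, b_i) are matched image-text pairs."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / tau                       # temperature-scaled similarities (Eq. 25)
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)        # mean cross-entropy over positives (Eq. 26)

def mutual_supervision_loss(image_feats, text_feats, tau: float = 0.07):
    # each modality serves as the supervision signal for the other (Eq. 27)
    return 0.5 * (nt_xent(image_feats, text_feats, tau) + nt_xent(text_feats, image_feats, tau))
```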
Traditional classification loss functions do not refine or distinguish image-text pairs that belong to the same coarse class label yet differ in specific semantic information; they simply merge such pairs into one class. To achieve inter-class separation, intra-class aggregation, and effective subdivision of intra-class semantic differences, the contrastive loss function L_contr-sim between the image-text feature similarity matrices sim_t, sim_v and the similarity label matrix label_sim makes the loss of intra-class semantic approximate matches smaller than the loss of inter-class mismatches, as given in Equation 28:
where M is the batch size; sim_t is the text-image feature similarity matrix jointly represented with the text as the retrieval condition, as shown in Equation 29; sim_v is the image-text feature similarity matrix jointly represented with the image as the retrieval condition, as shown in Equation 30; and ρ is a learnable parameter.
sim_t = ρ Σ F_t F_v (29)
sim_v = ρ Σ F_v F_t (30)
Combining Equations 24, 27, and 28, the overall objective loss function of the SRD-MA+ model is shown in Equation 31, where θ and μ are hyper-parameters that reflect the contribution of the different loss functions to the objective.
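Finally, a sketch of the similarity-matrix contrastive term and the overall objective (Equations 28 to 31) follows; the binary cross-entropy form used for Equation 28 and the values of θ and μ are assumptions, while ρ is the learnable scalar of Equations 29/30.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimMatrixLoss(nn.Module):
    """Contrast the image-text feature similarity matrices against label_sim (Eq. 28)."""
    def __init__(self):
        super().__init__()
        self.rho = nn.Parameter(torch.tensor(1.0))   # learnable scale (Eqs. 29/30)

    def forward(self, F_t, F_v, label_sim):
        sim_t = self.rho * (F_t @ F_v.t())   # Eq. 29
        sim_v = self.rho * (F_v @ F_t.t())   # Eq. 30
        # one plausible instantiation of Eq. 28: pull same-class entries toward 1 and
        # inter-class entries toward 0, so intra-class approximate matches cost less
        return 0.5 * (F.binary_cross_entropy_with_logits(sim_t, label_sim)
                      + F.binary_cross_entropy_with_logits(sim_v, label_sim))

# Overall objective (Eq. 31), with theta and mu weighting the loss terms:
#   total_loss = loss_arc4cmr + theta * loss_contr_mutlsup + mu * sim_matrix_loss(F_t, F_v, label_sim)
```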
In addition, the invention provides a cross-modal retrieval system, which specifically comprises:
a modal alignment module: used for performing modal alignment, based on scaled dot-product attention, on the original modal features obtained after pre-training, so as to re-aggregate projected modal-alignment features for the original features;
a matching module: used for passing the modal-alignment data formed above through a weight-shared multi-layer perceptron and then adopting a semantic approximate-matching and correct-matching method to mine correct semantic matches and approximate matches within same-class label clusters;
a constraint module: used for constraining the model with an Arc4cmr loss function, a mutual-supervision contrastive loss function, and a contrastive loss function between the image-text feature similarity matrix and the similarity label matrix.
In implementation, the modules may each be implemented as an independent entity, or be combined arbitrarily and implemented as one or several entities; for the implementation of each module, reference may be made to the foregoing method embodiments, which are not repeated here.
In summary, the invention is an optimization of the earlier cross-modal retrieval method and retrieval system patent (CN115563316A), and differs from the SMR-MA model of that patent as follows. First, the two models differ in the network structure of the modal alignment module: the modal alignment module of the SMR-MA model is based on decomposable attention, whereas that of the SRD-MA+ model is based on scaled dot-product attention. The specific workflow of the SRD-MA+ modal alignment module also differs: the original feature data obtained from encoding are linearly projected before being fed into the core operation module; the correlation matrices computed for the two modalities are scaled and normalized; and the aligned modal features are added to the original feature representation after a fully connected layer. Second, the SRD-MA+ model additionally designs a semantic approximate-matching and correct-matching module. The objective functions of the two models partly overlap: the SMR-MA model relies only on the Arc4cmr loss function L_Arc4cmr for model constraint, whereas the SRD-MA+ model additionally superimposes the mutual-supervision contrastive loss L_contr-MutlSup and the contrastive loss L_contr-sim between the image-text feature similarity matrix and the similarity label matrix. By further optimizing the earlier cross-modal retrieval method, working efficiency is improved and model training time is shortened.
Experimental example 1
To verify model performance, experimental analysis was performed on three benchmark datasets: Wikipedia, Pascal-Sentence, and NUS-WIDE. As can be seen from Table 1, the SRD-MA+ model shows a clear performance improvement on the image-retrieves-text (I2T) task, the text-retrieves-image (T2I) task, and their average (Avg); the convergence speed of the model is also greatly improved, saving computational cost and shortening model training time. Overall, the SRD-MA+ model outperforms the SMR-MA model.
TABLE 1. Comparison of the running performance of the SMR-MA and SRD-MA+ models
To observe more clearly the inter-class separation and intra-class aggregation effects of the SRD-MA+ model and the SMR-MA model on the two modal features, the t-SNE nonlinear dimensionality-reduction algorithm is adopted for visualization. Fig. 5 shows the visualization results on the Wikipedia dataset, where circles represent image features and triangles represent text features.
Comparing the image-feature and text-feature distributions of the two models in the common representation space, Figs. 5(a) and (b), both models separate the feature distributions of different categories well, and the two modalities with the same semantics overlap well, showing that both models effectively eliminate modal heterogeneity. The image features of a given class largely coincide with the text features in shape, indicating that the SRD-MA+ model can effectively eliminate the modal difference, so that the two modal features of the same category overlap effectively while different categories are effectively separated. By comparison, the SRD-MA+ model performs better in the common representation space, with fewer discrete points within each category and better intra-class aggregation.
Analyzing the image-feature or text-feature distributions alone, the SRD-MA+ model can effectively distinguish samples of different semantic categories and divide them into the corresponding semantic clusters. For the image features, as in Figs. 5(c) and (d), the SRD-MA+ model separates the different categories well, whereas the SMR-MA model shows different categories adhering to one another, which runs counter to the purpose of cross-modal retrieval. As shown in Figs. 5(e) and (f), the SRD-MA+ model also handles the text features better than the SMR-MA model: it exhibits a higher degree of intra-class aggregation, fewer discrete points, and a better intra-class subdivision effect, which is consistent with the original purpose of designing the semantic approximate-matching and correct-matching module. In addition, comparing the feature distributions of the two models across the modalities, the image features of the SRD-MA+ model for a given class lie closer to the shape of the text features, whereas in the SMR-MA model the text features are more concentrated and the image features are more dispersed. A comprehensive analysis of the visualization thus explains intuitively why the SRD-MA+ model outperforms the SMR-MA model.
Fig. 6 is a schematic structural diagram of a computer device according to the present disclosure. Referring to FIG. 6, the computer device 400 includes at least a memory 402 and a processor 401; the memory 402 is connected to the processor through a communication bus 403, and is configured to store computer instructions executable by the processor 401, and the processor 401 is configured to read the computer instructions from the memory 402 to implement the steps of the cross-modal searching method according to any of the foregoing embodiments.
For the above-described device embodiments, reference is made to the description of the method embodiments for the relevant points, since they essentially correspond to the method embodiments. The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the objectives of the disclosed solution. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices including, for example, semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., internal magnetic disks or removable disks), magneto-optical disks, and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
Finally, it should be noted that: while this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features of specific embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. On the other hand, the various features described in the individual embodiments may also be implemented separately in the various embodiments or in any suitable subcombination. Furthermore, although features may be acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. Furthermore, the processes depicted in the accompanying drawings are not necessarily required to be in the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may be advantageous.
The foregoing description of the preferred embodiments of the present disclosure is not intended to limit the disclosure, but rather to cover all modifications, equivalents, improvements and alternatives falling within the spirit and principles of the present disclosure.

Claims (9)

1. A cross-modal retrieval method for semantic subdivision and modal alignment reasoning learning, characterized by comprising the following steps:
performing modal alignment, based on scaled dot-product attention, on the original modal features obtained after pre-training, so as to re-aggregate projected modal-alignment features for the original features;
after passing the modal-alignment data formed in the preceding step through a weight-shared multi-layer perceptron, adopting a semantic approximate-matching and correct-matching method to mine correct semantic matches and approximate matches within same-class label clusters;
constraining the model with an Arc4cmr loss function, a mutual-supervision contrastive loss function, and a contrastive loss function between the image-text feature similarity matrix and the similarity label matrix.
2. The cross-modality retrieval method of claim 1, wherein the method of modality alignment based on scaled dot product attention includes:
converting an original text feature into a single query Q_t and all original image features within the batch into keys K_v and values V_v, the conversion method being specifically shown in Equation 1; similarly, the conversion method for the image-retrieves-text task is specifically shown in Equation 2;
in the above formulas, D_p is the projection dimension, with a value of 1024; LN denotes layer normalization; and the W matrices are projection matrices of the same dimension;
using the correlation weights between the single original text feature's query Q_t and each original image feature's key K_v in the batch to re-aggregate the projected image features V_v in proportion to the correlation weight coefficients, the text-conditioned output obtained after scaled dot-product attention being Attention(Q_t, K_v, V_v), specifically as shown in Equation 3;
similarly, the image-conditioned output obtained after scaled dot-product attention being Attention(Q_v, K_t, V_t), specifically as shown in Equation 4;
in Equations 3 and 4, sqrt(D_p) is the scaling factor;
then projecting the aggregated image representation of the attention module back with the weight W^O, as in Equation 5:
r_{v|t} = LN(Attention(Q_t, K_v, V_v) W^O) (5)
where r_{v|t} represents the aggregated image feature conditioned on text t;
projecting the aggregated text representation of the attention module back with the weight W^O, as in Equation 6:
r_{t|v} = LN(Attention(Q_v, K_t, V_t) W^O) (6)
where r_{t|v} is the aggregated text feature conditioned on image v;
the image-modality features that finally result, via the modal interaction based on scaled dot-product attention, with text as the retrieval condition being denoted C_{v|t}, the final fused output of the text feature alignment process, as in Equation 7, and the text-modality features with the image as the retrieval condition being denoted C_{t|v}, the final fused output of the image feature alignment process, as shown in Equation 8;
in the above formulas, a single element of C_{t|v} is the re-representation of a single original text feature within the batch, and a single element of C_{v|t} is the re-representation of a single original image feature within the batch.
3. The cross-modal retrieval method according to claim 2, wherein the method of semantic approximate matching and correct matching includes the steps of:
computing the semantic relevance score s_ij between every image feature u_i and text feature v_j within the batch, specifically as shown in Equation 9;
according to Equation 9, obtaining the text-to-image (T2I) similarity matrix attn_t, as given in Equation 10;
the image-to-text (I2T) similarity matrix being expressed as attn_v, as in Equation 11;
applying an XNOR operation to the class labels corresponding to all images and texts within the batch, i.e., the entry is 1 when the labels are the same and 0 when they differ, so that the similarity label matrix of approximately matched image-text pairs can be expressed by Equation 12;
based on the similarity label matrix label_sim, re-expressing the similarity weights of the text-to-image similarity matrix attn_t, and similarly re-expressing the similarity weights of the image-to-text similarity matrix attn_v;
for a text, aggregating the related image features in the batch-sized image library, as in Equation 13:
where the attention weight, expanded as shown in Equation 14, is the near-semantic association between the text and the images;
for an image, aggregating the related text features in the batch-sized text library, as in Equation 15:
where the attention weight, expanded as shown in Equation 16, is the near-semantic association between the image and the texts;
in Equations 14 and 16, λ is the penalty factor, and mask_sim(·) denotes a masking function: when the input is positive the output equals the input, otherwise the output is -∞;
the label discrimination matrix corresponding to the correct-match relation being an identity matrix, as expressed in Equation 17;
based on the identity label matrix label_eql, computing the similarity weights attn_eql of correctly matched image-text pairs, i.e., the similarity weights between the text-to-image similarity matrix attn_t and label_eql, and between the image-to-text similarity matrix attn_v and label_eql, where mask_eql(·) is a mask function whose output is 1 when the input is positive and 0 otherwise;
according to the soft-attention calculation rule, re-expressing the correctly matched image features and the correctly matched text features as weighted aggregations;
the image features jointly represented by the approximately semantically associated features and the correctly matched features being given in Equation 18, and the corresponding text features in Equation 19.
4. A cross-modal retrieval method as claimed in claim 3, wherein the method of model constraint with the Arc4cmr loss function is represented by Equation 21:
in the above formula, x_i denotes the feature input, y_i is the corresponding class label, θ_k is the angle between feature x_i and the corresponding weight W_k in the angular space, s is the hypersphere radius, n is the number of classes, and M is the batch size;
for the text-retrieves-image task, the ArcFace loss function L_simT2I takes the corresponding text-side features as input, with the corresponding regularization, as in Equation 22;
for the image-retrieves-text task, the ArcFace loss function L_simI2T takes the corresponding image-side features as input, with the corresponding regularization, as in Equation 23;
combining Equation 22 and Equation 23, the Arc4cmr loss function L_Arc4cmr is expressed as Equation 24;
L_Arc4cmr = L_simT2I + L_simI2T (24).
5. A cross-modal retrieval method as claimed in claim 3, wherein the method of model constraint with the mutual-supervision contrastive loss function comprises:
measuring the similarity L_contr between a feature x_i and another feature x_j with a normalized temperature-scaled Softmax similarity, as in Equation 25:
in the formula, τ represents the temperature parameter and sim(·) represents the dot product after L2 normalization of the inputs;
the contrastive loss being the arithmetic mean of the cross-entropy of all positive-pair normalized similarities within the batch; assuming a sample whose content matches x_i, the contrastive loss L_contr is expressed by Equation 26, where B is the size of the sample set;
taking the features of matched image-text pairs as supervision signals for each other to achieve fine-grained feature alignment, the mutual-supervision contrastive loss L_contr-MutlSup being expressed by Equation 27;
6. A cross-modal retrieval method as claimed in claim 3, wherein the method of model constraint with the contrastive loss function between the image-text feature similarity matrix and the similarity label matrix comprises:
making, through the contrastive loss function L_contr-sim between the image-text feature similarity matrices sim_t, sim_v and the similarity label matrix label_sim, the loss of intra-class semantic approximate matches smaller than the loss of inter-class mismatches, specifically as in Equation 28:
where M is the batch size; sim_t represents the text-image feature similarity matrix jointly represented with the text as the retrieval condition, as shown in Equation 29;
sim_v represents the image-text feature similarity matrix jointly represented with the image as the retrieval condition, and ρ is a learnable parameter, as shown in Equation 30;
sim_t = ρ Σ F_t F_v (29)
sim_v = ρ Σ F_v F_t (30)
combining Equations 24, 27, and 28, the total target loss function is shown in Equation 31, where θ and μ are hyper-parameters;
7. a retrieval system employing the cross-modality retrieval method as claimed in any one of claims 1 to 6, comprising:
and a mode alignment module: the method comprises the steps of performing modal alignment on original modal characteristics obtained after pre-training based on zoom dot product attention so as to realize the mode alignment characteristics of projection mode re-aggregation for the original characteristics;
and a matching module: after passing the modal alignment data formed in the steps through the weight sharing multi-layer perceptron, adopting a semantic approximate matching and correct matching method to realize semantic correct matching and approximate matching mining on the same type label clusters;
constraint module: the method is used for carrying out model constraint by adopting an Arc4cmr loss function, a mutual supervision contrast loss function, a graph-text feature similarity matrix and a contrast loss function between similar label matrixes.
8. A computer readable storage medium having stored thereon a computer program, characterized in that the program when executed implements the steps of the cross-modality retrieval method of any of claims 1 to 6.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the cross-modal retrieval method according to any one of claims 1 to 6 when the program is executed by the processor.
CN202310684445.5A 2023-06-09 2023-06-09 Semanteme subdivision and modal alignment reasoning learning cross-modal retrieval method and retrieval system Pending CN116610831A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310684445.5A CN116610831A (en) 2023-06-09 2023-06-09 Semanteme subdivision and modal alignment reasoning learning cross-modal retrieval method and retrieval system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310684445.5A CN116610831A (en) 2023-06-09 2023-06-09 Semanteme subdivision and modal alignment reasoning learning cross-modal retrieval method and retrieval system

Publications (1)

Publication Number Publication Date
CN116610831A true CN116610831A (en) 2023-08-18

Family

ID=87678104

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310684445.5A Pending CN116610831A (en) 2023-06-09 2023-06-09 Semanteme subdivision and modal alignment reasoning learning cross-modal retrieval method and retrieval system

Country Status (1)

Country Link
CN (1) CN116610831A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116775918A (en) * 2023-08-22 2023-09-19 四川鹏旭斯特科技有限公司 Cross-modal retrieval method, system, equipment and medium based on complementary entropy contrast learning
CN116775918B (en) * 2023-08-22 2023-11-24 四川鹏旭斯特科技有限公司 Cross-modal retrieval method, system, equipment and medium based on complementary entropy contrast learning
CN117112829A (en) * 2023-10-24 2023-11-24 吉林大学 Medical data cross-modal retrieval method and device and related equipment
CN117112829B (en) * 2023-10-24 2024-02-02 吉林大学 Medical data cross-modal retrieval method and device and related equipment

Similar Documents

Publication Publication Date Title
Cao et al. Generalized multi-view embedding for visual recognition and cross-modal retrieval
Xu et al. Deep adversarial metric learning for cross-modal retrieval
US11093560B2 (en) Stacked cross-modal matching
Katsurai et al. Image sentiment analysis using latent correlations among visual, textual, and sentiment views
CN116610831A (en) Semanteme subdivision and modal alignment reasoning learning cross-modal retrieval method and retrieval system
Wang et al. Facilitating image search with a scalable and compact semantic mapping
Vo et al. Transductive kernel map learning and its application to image annotation
Wu et al. Online asymmetric metric learning with multi-layer similarity aggregation for cross-modal retrieval
Ma et al. A weighted KNN-based automatic image annotation method
CN113537304A (en) Cross-modal semantic clustering method based on bidirectional CNN
Ou et al. Semantic consistent adversarial cross-modal retrieval exploiting semantic similarity
Ni et al. Scene classification from remote sensing images using mid-level deep feature learning
Shen et al. Clustering-driven deep adversarial hashing for scalable unsupervised cross-modal retrieval
Li et al. Discriminative-region attention and orthogonal-view generation model for vehicle re-identification
Guadarrama et al. Understanding object descriptions in robotics by open-vocabulary object retrieval and detection
Arulmozhi et al. DSHPoolF: deep supervised hashing based on selective pool feature map for image retrieval
Hu et al. Deep supervised multi-view learning with graph priors
Li et al. Automatic image annotation with continuous PLSA
Malik et al. Multimodal semantic analysis with regularized semantic autoencoder
Zhang et al. Research on hierarchical pedestrian detection based on SVM classifier with improved kernel function
CN115563316A (en) Cross-modal retrieval method and retrieval system
Dai et al. Cross-modal deep discriminant analysis
Xu et al. Learning multi-task local metrics for image annotation
Su et al. Deep supervised hashing with hard example pairs optimization for image retrieval
XingJia et al. Calligraphy and Painting Identification 3D‐CNN Model Based on Hyperspectral Image MNF Dimensionality Reduction

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination