CN116610831A - Semantic subdivision and modal alignment reasoning learning cross-modal retrieval method and retrieval system - Google Patents

Semantic subdivision and modal alignment reasoning learning cross-modal retrieval method and retrieval system Download PDF

Info

Publication number
CN116610831A
CN116610831A (application CN202310684445.5A)
Authority
CN
China
Prior art keywords
text
image
modal
formula
loss function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310684445.5A
Other languages
Chinese (zh)
Inventor
李宝莲
李培瑶
孙苹苹
朱良彬
韩博
谢海瑶
强保华
李忠涛
赵建
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guilin University of Electronic Technology
CETC 54 Research Institute
Original Assignee
Guilin University of Electronic Technology
CETC 54 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guilin University of Electronic Technology, CETC 54 Research Institute filed Critical Guilin University of Electronic Technology
Priority to CN202310684445.5A priority Critical patent/CN116610831A/en
Publication of CN116610831A publication Critical patent/CN116610831A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/09Supervised learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Library & Information Science (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a cross-modal retrieval method and a retrieval system based on semantic subdivision and modal alignment reasoning learning. The cross-modal retrieval method comprises the following steps: performing modal alignment, based on scaled dot-product attention, on the original modal features obtained after pre-training, so as to re-aggregate projected modal-alignment features for the original features; after passing the modal-alignment data formed in the preceding step through a weight-shared multi-layer perceptron, adopting a semantic approximate-matching and correct-matching method to mine correct semantic matches and approximate matches within same-class label clusters; and constraining the model with an Arc4cmr loss function, a mutual-supervision contrastive loss function, and a contrastive loss function between the image-text feature similarity matrix and the similarity label matrix. The cross-modal retrieval method further improves the accuracy of cross-modal retrieval.

Description

Semantic subdivision and modal alignment reasoning learning cross-modal retrieval method and retrieval system
Technical Field
The invention relates to the field of cross-modal retrieval that maximizes semantic correlation and modal alignment, and in particular to a cross-modal retrieval method and a retrieval system based on semantic subdivision and modal alignment reasoning learning.
Background
With the rapid development and gradual maturation of multimedia technology, information carriers have evolved from simple image-text forms to presentations that combine multiple kinds of media data. These data differ in their forms of existence, data types, distributions, and modes of expression, and show things from different angles, dimensions, and levels; collectively they are referred to as multi-modal data. With the rapid growth of multimedia social platforms, forms of data expression have become richer, and a new pattern of content symbiosis and diverse, fused forms has gradually taken shape. The diversification of information propagation modes has brought a continuous expansion of retrieval dimensions. For example, when we retrieve an event or concept, we want to see related information in various forms such as pictures, videos, and charts for better understanding and memorization; cross-modal retrieval tasks have thus emerged.
Cross-modal retrieval aims to address the fact that the low-level features of different modal data are heterogeneous while their high-level semantics are correlated. Depending on whether label information is used, cross-modal retrieval can be divided into supervised and unsupervised approaches; in terms of historical development, it can be divided into traditional methods based on statistical analysis and modern methods based on deep learning.
Traditional cross-modal retrieval method based on statistical analysis:
1. Unsupervised methods: the Cross-modal Factor Analysis (CFA) method proposed by Li et al. is the earliest traditional unsupervised cross-modal method. Using the F-norm as a measure, it learns projection subspaces for the different modalities by minimizing the distance between cross-modal sample pairs in the transformed domain, thereby analyzing, in a common subspace, the latent matching relation behind the two-modality data. Canonical Correlation Analysis (CCA), proposed by Hotelling et al., is an unsupervised common-space learning method and a milestone of image-text content correlation retrieval.
2. Supervised methods: Rasiwasia et al. proposed a Semantic Correlation Matching (SCM) cross-modal retrieval model, which abstracts the semantics of images and texts and jointly models the cross-correlated two-modality data in a shared space to improve the retrieval precision of the model. To exploit the fact that inter-modal relations in real life are not strictly one-to-one, Ranjan et al. built a multi-label Canonical Correlation Analysis (ml-CCA) model on the basis of CCA, using the one-to-many, many-to-one, and many-to-many relations produced by multiple labels; the model better fits real scenes and performs better.
Traditional cross-modal retrieval methods based on statistical analysis are relatively simple to implement, but the models mostly learn shallow mappings or limited nonlinear relations of the multi-modal data, leaving considerable room for improvement in modeling high-level semantics. In addition, as the data size increases, the computational complexity of traditional methods grows and their ability to handle high-dimensional data drops sharply.
Prior-art cross-modal retrieval methods based on deep learning mainly include the following:
1. Unsupervised methods: Deep Canonical Correlation Analysis (DCCA), proposed by Andrew et al., uses a neural network to learn a nonlinearly transformed common space and accurately capture data correlations, solving the problem that CCA is only suited to linear common-space learning. The Deep Canonically Correlated Autoencoders (DCCAE) proposed by Wang et al. improve DCCA by adding an autoencoder-based reconstruction error.
2. Supervised methods: Zhai et al. proposed a Joint Representation Learning (JRL) method that combines multiple modality data in a unified framework with sparse and semi-supervised regularization to explore their pairwise correlation and semantic correlation information. Zhen et al. proposed an end-to-end Deep Supervised Cross-Modal Retrieval (DSCMR) method, which maintains semantic discriminability by linearly classifying samples in the common space and learns correlations between the modalities through a weight-sharing strategy to preserve modality invariance.
Thanks to the large-scale data processing capacity, nonlinear structural design, and deep semantic mining ability of deep learning network models, deep-learning-based methods have opened new ideas and techniques for cross-modal retrieval research compared with traditional statistical-analysis methods; the method of the present invention is therefore proposed on the basis of deep learning.
Disclosure of Invention
In view of the above, the invention discloses a novel cross-modal retrieval method in which a modal alignment module based on scaled dot-product attention strengthens the correlation of semantically related modal features and learns the modal alignment between the two modalities of data; a semantic approximate-matching and correct-matching module is designed to enhance the aggregation of image-text features within a class while drawing fine distinctions between intra-class image-text pairs whose semantic information differs. A contrastive loss function is used for mutual supervision to enhance fine-grained feature alignment, and a contrastive loss function between the image-text feature similarity matrix and the similarity label matrix makes the loss of intra-class semantic approximate matches smaller than the loss of inter-class mismatches.
Specifically, the invention is realized by the following technical scheme:
In a first aspect, the invention discloses a novel cross-modal retrieval method, which comprises the following steps:
performing modal alignment, based on scaled dot-product attention, on the original modal features obtained after pre-training, so as to re-aggregate projected modal-alignment features for the original features;
after passing the modal-alignment data formed in the preceding step through a weight-shared multi-layer perceptron, adopting a semantic approximate-matching and correct-matching method to mine correct semantic matches and approximate matches within same-class label clusters;
constraining the model with an Arc4cmr loss function, a mutual-supervision contrastive loss function, and a contrastive loss function between the image-text feature similarity matrix and the similarity label matrix.
In a second aspect, the invention discloses a cross-modal retrieval system comprising:
a modal alignment module: used for performing modal alignment, based on scaled dot-product attention, on the original modal features obtained after pre-training, so as to re-aggregate projected modal-alignment features for the original features;
a matching module: used for passing the modal-alignment data formed above through a weight-shared multi-layer perceptron and then adopting a semantic approximate-matching and correct-matching method to mine correct semantic matches and approximate matches within same-class label clusters;
a constraint module: used for constraining the model with an Arc4cmr loss function, a mutual-supervision contrastive loss function, and a contrastive loss function between the image-text feature similarity matrix and the similarity label matrix.
In a third aspect, the present invention discloses a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the steps of the cross-modal retrieval method of the first aspect.
In a fourth aspect, the invention discloses a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the cross-modality retrieval method as in the first aspect when the program is executed.
Currently, most loss-function constraints of supervised models use class-label information as the measure. In a multi-class model, the number of output dimensions is generally set to the number of classes (say N) so as to output a probability score for each class, and to facilitate computing and evaluating the model's predictions, the true class label of each sample is converted into an N-dimensional one-hot vector: the N preset categories are represented as a vector of length N in which only the position of the corresponding category is 1 and all other positions are 0. This, however, hides the problem that samples with distinct specific semantic information are forcibly collapsed into a single class.
In addition, the original aim of the pre-trained CLIP model is to judge the image category by using text sentences containing finer-grained information as image labels, which can effectively alleviate the problem of similar images being forced into one class. Existing models, however, lack consideration of fine-grained alignment of image-text features. Based on this consideration, the invention takes the correctly matched image-text pair of one modality's feature data as supervision information for the other modality's features, and performs fine-grained feature alignment.
To improve the reasoning capability of the modal alignment module for feature reconstruction, to make the intra-class distance of different modal data as small as possible and the inter-class distance as large as possible, to effectively subdivide classes according to semantic information, and to enhance fine-grained alignment of modal features, a Semantic Refinement Discrimination and Modal Alignment inference learning (SRD-MA+) cross-modal retrieval model is proposed.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 is an overall frame diagram of a cross-modal retrieval method provided by an embodiment of the invention;
FIG. 2 is a flow chart of a modality alignment workflow based on scaled dot product attention provided by an embodiment of the present invention;
FIG. 3 is a data processing flow chart of a semantic approximate matching and correct matching module provided by an embodiment of the present invention;
FIG. 4 is a visualization of the image-text similarity matrix processed with the similarity label matrix label_sim according to an embodiment of the present invention;
FIG. 5 is a visualization comparing the SRD-MA+ model and the SMR-MA model according to Experimental Example 1 of the present invention;
FIG. 6 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in this disclosure to describe various information, these information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present disclosure. The word "if" as used herein may be interpreted as "at … …" or "at … …" or "responsive to a determination", depending on the context.
The invention discloses a cross-modal retrieval method, which comprises the following steps:
performing modal alignment, based on scaled dot-product attention, on the original modal features obtained after pre-training, so as to re-aggregate projected modal-alignment features for the original features;
after passing the modal-alignment data formed in the preceding step through a weight-shared multi-layer perceptron, adopting a semantic approximate-matching and correct-matching method to mine correct semantic matches and approximate matches within same-class label clusters;
constraining the model with an Arc4cmr loss function, a mutual-supervision contrastive loss function, and a contrastive loss function between the image-text feature similarity matrix and the similarity label matrix.
The overall framework of the cross-modal retrieval model is shown in Fig. 1. First, the pre-trained CLIP model encodes the image and text samples to obtain the original image features and original text features. Second, to optimize the modal alignment module and strengthen the correlation of the two semantically related modal features, projected modal-alignment features are re-aggregated for the original features through the modal alignment module based on scaled dot-product attention. Next, the common representation space is learned with weight-shared MLPs to preserve modality invariance. Then, the semantic approximate-matching and correct-matching module mines correct semantic matches and approximate matches within same-class label clusters, improving the accuracy of cross-modal retrieval. Finally, three loss functions are combined to constrain and improve inter-class separation and intra-class aggregation, enhance fine-grained feature alignment, make the loss of intra-class semantic approximate matches smaller than the loss of inter-class mismatches, and semantically subdivide the intra-class image-text pairs.
SRD-MA+ selects the modified ResNet-50 network in the CLIP pre-trained model as the image encoder and the Transformer network as the text encoder, obtaining the corresponding original image feature representation and original text feature representation, where D is the feature dimension (D = 1024) and M is the batch size (M = 100).
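The feature-encoding step can be sketched as follows. This is a minimal illustration, assuming the open-source OpenAI `clip` package; the helper name `encode_batch` and the use of the `RN50` weights are illustrative assumptions rather than the patent's exact implementation.

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
# "RN50" selects the modified ResNet-50 image encoder and the Transformer text
# encoder; both produce D = 1024-dimensional embeddings.
model, preprocess = clip.load("RN50", device=device)

def encode_batch(pil_images, captions):
    """Return original image features and text features, each of shape (M, D)."""
    images = torch.stack([preprocess(im) for im in pil_images]).to(device)
    tokens = clip.tokenize(captions).to(device)
    with torch.no_grad():
        F_v = model.encode_image(images).float()   # original image features (M, 1024)
        F_t = model.encode_text(tokens).float()    # original text features (M, 1024)
    return F_v, F_t
```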
The workflow of the modal alignment module based on scaled dot-product attention is shown in Fig. 2 and comprises two data-processing procedures: image feature alignment and text feature alignment. The module takes as input the original image features and original text features produced by feature encoding, and outputs fused image features and fused text features. In this process the model learns a new joint latent space that highlights the images (texts) better matching the query text (image), effectively suppressing images (texts) irrelevant to the query. To strengthen inter-modal information interaction and eliminate modal differences, the two data-processing procedures share the same network structure and parameters.
In the modal alignment module based on scaled dot-product attention, for the text-retrieves-image task a single original text feature is converted into a query Q_t, and all original image features within the batch are converted into keys K_v and values V_v; the specific transformation is shown in Equation 1. By analogy, the transformation for the image-retrieves-text task is shown in Equation 2.
In Equations 1 and 2, D_p is the projection dimension, with a value of 1024; LN denotes layer normalization; and the W matrices are projection matrices of the same dimension.
The correlation weights between the single original text feature's query Q_t and each original image feature's key K_v in the batch are used to re-aggregate the projected image features V_v in proportion to the correlation weight coefficients; the text-conditioned output obtained after scaled dot-product attention is Attention(Q_t, K_v, V_v), as shown in Equation 3:
Attention(Q_t, K_v, V_v) = Softmax(Q_t K_v^T / sqrt(D_p)) V_v (3)
Similarly, the image-conditioned output obtained after scaled dot-product attention is Attention(Q_v, K_t, V_t), as shown in Equation 4:
Attention(Q_v, K_t, V_t) = Softmax(Q_v K_t^T / sqrt(D_p)) V_t (4)
In Equations 3 and 4, sqrt(D_p) is the scaling factor. Introducing the scaling factor normalizes the attention scores and effectively avoids the gradient vanishing or gradient explosion problems caused by attention scores QK^T that are too large or too small, improving the performance and stability of the model.
To embed images into a shared space with text, the aggregated image representation from the attention module is projected back to dimension D with the weight W^O, as shown in Equation 5:
r_{v|t} = LN(Attention(Q_t, K_v, V_v) W^O) (5)
where r_{v|t} denotes the aggregated image feature conditioned on text t.
To embed text into a shared space with images, the aggregated text representation from the attention module is projected back to dimension D with the weight W^O, as shown in Equation 6:
r_{t|v} = LN(Attention(Q_v, K_t, V_t) W^O) (6)
where r_{t|v} is the aggregated text feature conditioned on image v.
The image-modality features that finally result, via the modal interaction based on scaled dot-product attention, with text as the retrieval condition are denoted C_{v|t}, the final fused output of the text feature alignment process, as shown in Equation 7; the text-modality features with the image as the retrieval condition are denoted C_{t|v}, the final fused output of the image feature alignment process, as shown in Equation 8.
Here a single element of C_{t|v} is the re-representation of a single original text feature within the batch, and a single element of C_{v|t} is the re-representation of a single original image feature within the batch.
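A minimal PyTorch sketch of the alignment step described above (Equations 1 to 8) follows. The class name, layer names, and the residual fusion with the original features are assumptions where the text does not fix the details; this is a sketch of scaled dot-product attention between the two modalities, not a verbatim reproduction of the patent's module.

```python
import math
import torch
import torch.nn as nn

class ModalAlignment(nn.Module):
    """Scaled dot-product attention alignment between two modalities (a sketch)."""
    def __init__(self, d: int = 1024, d_p: int = 1024):
        super().__init__()
        self.ln_q = nn.LayerNorm(d)
        self.ln_kv = nn.LayerNorm(d)
        self.w_q = nn.Linear(d, d_p, bias=False)   # query projection (Eq. 1/2)
        self.w_k = nn.Linear(d, d_p, bias=False)   # key projection
        self.w_v = nn.Linear(d, d_p, bias=False)   # value projection
        self.w_o = nn.Linear(d_p, d, bias=False)   # output projection W^O (Eq. 5/6)
        self.ln_out = nn.LayerNorm(d)
        self.d_p = d_p

    def align(self, query_feats: torch.Tensor, ctx_feats: torch.Tensor) -> torch.Tensor:
        q = self.w_q(self.ln_q(query_feats))                             # (M, D_p)
        k = self.w_k(self.ln_kv(ctx_feats))                              # (M, D_p)
        v = self.w_v(self.ln_kv(ctx_feats))                              # (M, D_p)
        attn = torch.softmax(q @ k.t() / math.sqrt(self.d_p), dim=-1)    # Eq. 3/4
        agg = self.ln_out(self.w_o(attn @ v))                            # Eq. 5/6
        # Eqs. 7/8 fuse the aligned features with the original representation;
        # the exact fusion is not fully specified here, a residual add is assumed.
        return query_feats + agg

    def forward(self, F_v: torch.Tensor, F_t: torch.Tensor):
        C_v_given_t = self.align(F_t, F_v)   # text queries attend over image keys/values
        C_t_given_v = self.align(F_v, F_t)   # image queries attend over text keys/values
        return C_v_given_t, C_t_given_v
```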
The image features and text features obtained by the above processing are fed into the weight-shared multi-layer perceptron (MLP). The MLP structure is y(x) = W_2(ε(W_1 x + b_1)) + b_2, where ε denotes the GELU activation function, W_1 and W_2 are trainable weight parameters, and b_1 and b_2 are bias terms. After the weight-shared MLP, a single processed image feature is denoted u_i and a single processed text feature is denoted v_j; all image features and text features within the batch are then denoted C'_{v|t} = [u_1, u_2, …, u_M]^T and C'_{t|v} = [v_1, v_2, …, v_M]^T, respectively.
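A sketch of the weight-shared MLP described above follows; the hidden width is an assumed value, and applying the identical module instance to both modalities is what realizes the weight sharing.

```python
import torch
import torch.nn as nn

class SharedMLP(nn.Module):
    """y(x) = W2(GELU(W1 x + b1)) + b2, shared across the two modalities."""
    def __init__(self, d: int = 1024, hidden: int = 1024):
        super().__init__()
        self.fc1 = nn.Linear(d, hidden)   # W1, b1
        self.fc2 = nn.Linear(hidden, d)   # W2, b2
        self.act = nn.GELU()              # epsilon

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(self.act(self.fc1(x)))

# The same instance (same parameters) maps both aligned modalities into the
# common representation space, e.g.:
#   shared_mlp = SharedMLP()
#   U = shared_mlp(C_v_given_t)   # image features u_i
#   V = shared_mlp(C_t_given_v)   # text features v_j
```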
The data-processing flow of the semantic approximate-matching and correct-matching module is shown in Fig. 3. It simultaneously considers the correct-match, intra-class semantic approximate-match, and completely unmatched inter-class relations of image-text pairs, so that the positive and negative effects of image-text pairs on cross-modal retrieval are accurately distinguished along different matching dimensions. Image-text pairs that are semantically approximately matched within a class and correctly matched image-text pairs are processed with two different attention-mask mechanisms. First, the semantic relevance score s_ij between every image feature u_i and text feature v_j within the batch is calculated, as shown in Equation 9.
According to Equation 9, the text-to-image (T2I) similarity matrix attn_t can be obtained, as given in Equation 10.
Similarly, the image-to-text (I2T) similarity matrix can be expressed as attn_v, as shown in Equation 11.
The class labels corresponding to all images and texts within one batch may repeat; an XNOR operation is applied to the label values at corresponding positions, i.e., the entry is 1 when the two labels are the same and 0 when they differ, so the similarity label matrix of approximately matched image-text pairs can be expressed by Equation 12.
Based on the similarity label matrix label_sim, the similarity weights of the text-to-image similarity matrix attn_t can be re-expressed, and similarly the similarity weights of the image-to-text similarity matrix attn_v can be re-expressed. The inherent property of the Softmax function is exploited to increase the similarity of correctly matched image-text pairs while making the similarity of intra-class semantically approximate matches far greater than that of unrelated inter-class pairs, effectively suppressing the negative influence of inter-class mismatches; this strengthens inter-class separation and, while enhancing the weight of the correct-match similarity, effectively enhances intra-class cohesion.
The visualization of the image-text similarity matrix processed with the similarity label matrix label_sim is shown in Fig. 4. Diagonal elements in the figure represent image-text pairs that are correctly matched according to the original semantic information; darker colors indicate greater similarity. Semantically approximately matched image-text pairs sharing the same category label have lower similarity than the correctly matched pairs, and their colors are correspondingly lighter. Unrelated image-text pairs belonging to different categories have the lightest colors and act as negative samples.
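The similarity matrices and the XNOR-based similarity label matrix can be sketched as follows. The dot-product score for Equation 9 and the masked-Softmax re-weighting are assumptions where the text leaves the exact operations implicit.

```python
import torch

def similarity_matrices(U: torch.Tensor, V: torch.Tensor, labels: torch.Tensor):
    """U: image features (M, D); V: text features (M, D); labels: (M,) class ids."""
    s = U @ V.t()                            # s_ij between image u_i and text v_j (Eq. 9, dot product assumed)
    attn_v = torch.softmax(s, dim=-1)        # image-to-text (I2T) similarity matrix, Eq. 11
    attn_t = torch.softmax(s.t(), dim=-1)    # text-to-image (T2I) similarity matrix, Eq. 10
    # XNOR of class labels: 1 where labels agree, 0 otherwise (Eq. 12)
    label_sim = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    # keep only same-class (correct + approximate) matches before re-normalizing
    neg_inf = torch.finfo(s.dtype).min
    attn_t_sim = torch.softmax(s.t().masked_fill(label_sim == 0, neg_inf), dim=-1)
    attn_v_sim = torch.softmax(s.masked_fill(label_sim == 0, neg_inf), dim=-1)
    return attn_t, attn_v, label_sim, attn_t_sim, attn_v_sim
```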
When measuring the similarity of image-text pairs, attention is first paid to the shared semantics of the two modality data. For the i-th text, the related image features in the batch-sized image library can be aggregated as shown in Equation 13:
where the attention weight, expanded as shown in Equation 14, is the near-semantic association between the text and the images.
For the i-th image, the related text features in the batch-sized text library can be aggregated as shown in Equation 15:
where the attention weight, expanded as shown in Equation 16, is the near-semantic association between the image and the texts.
In Equations 14 and 16, λ is the penalty factor and mask_sim(·) denotes a masking function: when the input is positive the output equals the input, otherwise the output is -∞. After processing by the Softmax function, the attention weights of irrelevant samples are reduced to 0, thereby attending effectively to the relevant samples.
According to the semantic information of the image-text pairs, the label discrimination matrix corresponding to the correct-match relation is an identity matrix, as given in Equation 17.
Based on the identity label matrix label_eql, the similarity weights attn_eql of correctly matched image-text pairs are computed, i.e., the similarity weights between the text-to-image similarity matrix attn_t and label_eql, and between the image-to-text similarity matrix attn_v and label_eql, where mask_eql(·) is a mask function whose output is 1 when the input is positive and 0 otherwise. According to the soft-attention calculation rule, correctly matched image features and correctly matched text features can then be re-expressed as weighted aggregations.
Combining the above analysis, the image features jointly represented by the approximately semantically associated features and the correctly matched features are given in Equation 18, and the corresponding text features in Equation 19.
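A sketch of the two mask mechanisms and the resulting soft-attention aggregation follows. The value of the penalty factor λ and the exact way the approximate-match and correct-match branches are combined (Equations 13 to 19) are assumptions; the text-side aggregation is shown and the image-side case is symmetric.

```python
import torch

_NEG = -1e9  # stands in for -infinity so Softmax stays numerically safe

def mask_sim(x: torch.Tensor) -> torch.Tensor:
    """Output equals the input for positive entries, a large negative value otherwise."""
    return torch.where(x > 0, x, torch.full_like(x, _NEG))

def mask_eql(x: torch.Tensor) -> torch.Tensor:
    """Output 1 for positive entries, 0 otherwise."""
    return (x > 0).float()

def aggregate_for_text(U, s_t2i, label_sim, lam: float = 0.1):
    """Aggregate, for every text, the related image features in the batch.
    U: image features (M, D); s_t2i: text-to-image scores (M, M)."""
    label_eql = torch.eye(s_t2i.size(0), device=s_t2i.device)          # Eq. 17
    # near-semantic (same-class) attention, scaled by the penalty factor lambda
    w_approx = torch.softmax(lam * mask_sim(label_sim * s_t2i), dim=-1)
    # correct-match attention: keep only the paired (diagonal) similarity weights
    w_exact = mask_eql(label_eql) * torch.softmax(s_t2i, dim=-1)
    # combined representation in the spirit of Eq. 18
    return w_approx @ U + w_exact @ U
```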
The objective function of the SRD-MA+ model mainly comprises three parts: the Arc4cmr (ArcFace loss for Cross-Modal Retrieval) loss function, the mutual-supervision contrastive loss function, and the contrastive loss function between the image-text feature similarity matrix and the similarity label matrix.
The ArcFace loss function is specified in Equation 20, where x_i denotes the feature input, y_i is the corresponding class label, θ_k is the angle between feature x_i and the corresponding weight W_k in the angular space, s is the hypersphere radius, n is the number of classes, and M is the batch size. The constraint can be expressed by Equation 21.
For the text-retrieves-image task, the ArcFace loss function L_simT2I takes the corresponding text-side features as input, with the corresponding regularization, as shown in Equation 22.
For the image-retrieves-text task, the ArcFace loss function L_simI2T takes the corresponding image-side features as input, with the corresponding regularization, as shown in Equation 23.
Combining Equations 22 and 23, the Arc4cmr loss function L_Arc4cmr of the SRD-MA+ model can be expressed as Equation 24:
L_Arc4cmr = L_simT2I + L_simI2T (24)
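An ArcFace-style sketch of the Arc4cmr constraint (Equations 20 to 24) follows; the margin m, the scale s, and the number of classes are assumed values, and the regularization terms of Equations 22/23 are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceHead(nn.Module):
    """ArcFace-style classification head applied to either modality's features."""
    def __init__(self, d: int = 1024, n_classes: int = 10, s: float = 30.0, m: float = 0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_classes, d))
        self.s, self.m = s, m   # hypersphere radius and angular margin

    def forward(self, x: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # cos(theta_k) between each feature x_i and each class weight W_k
        cos = F.linear(F.normalize(x), F.normalize(self.weight)).clamp(-1 + 1e-7, 1 - 1e-7)
        theta = torch.acos(cos)
        # add the angular margin m only to the target-class angle
        target = F.one_hot(labels, cos.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.m), cos) * self.s
        return F.cross_entropy(logits, labels)

# L_Arc4cmr = L_simT2I + L_simI2T (Eq. 24): the same head constrains both modalities, e.g.
#   arc_head = ArcFaceHead(n_classes=10)
#   loss_arc4cmr = arc_head(text_feats, labels) + arc_head(image_feats, labels)
```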
The purpose of training the network is to minimize the distance between positive pairs, i.e., pairs in which the content presented by the image agrees with the content described by the text, and to maximize the distance between negative pairs, i.e., pairs in which the image presentation contradicts the text description. A contrastive loss function is therefore included in the objective. The similarity L_contr between a feature x_i and another feature x_j is measured with a normalized temperature-scaled Softmax similarity, as shown in Equation 25:
where τ denotes the temperature parameter and sim(·) denotes the dot product after L2 normalization of the inputs. Unlike a direct cosine-similarity measurement, the normalized Softmax amplifies the similarity of positive pairs and weakens the influence of negative pairs. The contrastive loss is the arithmetic mean of the cross-entropy of all positive-pair normalized similarities within the batch; assuming a sample whose content matches x_i, the contrastive loss L_contr is given by Equation 26, where B is the size of the sample set.
The features of matched image-text pairs are taken as mutual supervision signals to achieve fine-grained feature alignment; the mutual-supervision contrastive loss L_contr-MutlSup can thus be represented by Equation 27.
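A sketch of the normalized temperature-scaled contrastive loss and its mutual-supervision form (Equations 25 to 27) follows; the temperature value and the symmetric averaging of the two directions are assumptions.

```python
import torch
import torch.nn.functional as F

def nt_xent(a: torch.Tensor, b: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """a, b: (M, D) features where (a_i, b_i) are matched image-text pairs."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / tau                       # temperature-scaled similarities (Eq. 25)
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)        # mean cross-entropy over positives (Eq. 26)

def mutual_supervision_loss(image_feats, text_feats, tau: float = 0.07):
    # each modality serves as the supervision signal for the other (Eq. 27)
    return 0.5 * (nt_xent(image_feats, text_feats, tau) + nt_xent(text_feats, image_feats, tau))
```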
Traditional classification loss functions do not refine or distinguish image-text pairs that belong to the same coarse class label yet differ in specific semantic information; they simply merge such pairs into one class. To achieve inter-class separation, intra-class aggregation, and effective subdivision of intra-class semantic differences, the contrastive loss function L_contr-sim between the image-text feature similarity matrices sim_t, sim_v and the similarity label matrix label_sim makes the loss of intra-class semantic approximate matches smaller than the loss of inter-class mismatches, as given in Equation 28:
where M is the batch size; sim_t is the text-image feature similarity matrix jointly represented with the text as the retrieval condition, as shown in Equation 29; sim_v is the image-text feature similarity matrix jointly represented with the image as the retrieval condition, as shown in Equation 30; and ρ is a learnable parameter.
sim_t = ρ Σ F_t F_v (29)
sim_v = ρ Σ F_v F_t (30)
Combining Equations 24, 27, and 28, the overall objective loss function of the SRD-MA+ model is shown in Equation 31, where θ and μ are hyper-parameters that reflect the contribution of the different loss functions to the objective.
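Finally, a sketch of the similarity-matrix contrastive term and the overall objective (Equations 28 to 31) follows; the binary cross-entropy form used for Equation 28 and the values of θ and μ are assumptions, while ρ is the learnable scalar of Equations 29/30.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimMatrixLoss(nn.Module):
    """Contrast the image-text feature similarity matrices against label_sim (Eq. 28)."""
    def __init__(self):
        super().__init__()
        self.rho = nn.Parameter(torch.tensor(1.0))   # learnable scale (Eqs. 29/30)

    def forward(self, F_t, F_v, label_sim):
        sim_t = self.rho * (F_t @ F_v.t())   # Eq. 29
        sim_v = self.rho * (F_v @ F_t.t())   # Eq. 30
        # one plausible instantiation of Eq. 28: pull same-class entries toward 1 and
        # inter-class entries toward 0, so intra-class approximate matches cost less
        return 0.5 * (F.binary_cross_entropy_with_logits(sim_t, label_sim)
                      + F.binary_cross_entropy_with_logits(sim_v, label_sim))

# Overall objective (Eq. 31), with theta and mu weighting the loss terms:
#   total_loss = loss_arc4cmr + theta * loss_contr_mutlsup + mu * sim_matrix_loss(F_t, F_v, label_sim)
```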
In addition, the invention provides a cross-modal retrieval system, which specifically comprises:
a modal alignment module: used for performing modal alignment, based on scaled dot-product attention, on the original modal features obtained after pre-training, so as to re-aggregate projected modal-alignment features for the original features;
a matching module: used for passing the modal-alignment data formed above through a weight-shared multi-layer perceptron and then adopting a semantic approximate-matching and correct-matching method to mine correct semantic matches and approximate matches within same-class label clusters;
a constraint module: used for constraining the model with an Arc4cmr loss function, a mutual-supervision contrastive loss function, and a contrastive loss function between the image-text feature similarity matrix and the similarity label matrix.
In implementation, the modules may each be implemented as an independent entity, or be combined arbitrarily and implemented as one or several entities; for the implementation of each module, reference may be made to the foregoing method embodiments, which are not repeated here.
In summary, the invention is an optimization of the earlier cross-modal retrieval method and retrieval system patent (CN115563316A), and differs from the SMR-MA model of that patent as follows. First, the two models differ in the network structure of the modal alignment module: the modal alignment module of the SMR-MA model is based on decomposable attention, whereas that of the SRD-MA+ model is based on scaled dot-product attention. The specific workflow of the SRD-MA+ modal alignment module also differs: the original feature data obtained from encoding are linearly projected before being fed into the core operation module; the correlation matrices computed for the two modalities are scaled and normalized; and the aligned modal features are added to the original feature representation after a fully connected layer. Second, the SRD-MA+ model additionally designs a semantic approximate-matching and correct-matching module. The objective functions of the two models partly overlap: the SMR-MA model relies only on the Arc4cmr loss function L_Arc4cmr for model constraint, whereas the SRD-MA+ model additionally superimposes the mutual-supervision contrastive loss L_contr-MutlSup and the contrastive loss L_contr-sim between the image-text feature similarity matrix and the similarity label matrix. By further optimizing the earlier cross-modal retrieval method, working efficiency is improved and model training time is shortened.
Experimental example 1
To verify model performance, experimental analysis was performed on three benchmark datasets: Wikipedia, Pascal-Sentence, and NUS-WIDE. As can be seen from Table 1, the SRD-MA+ model shows a clear performance improvement on the image-retrieves-text (I2T) task, the text-retrieves-image (T2I) task, and their average (Avg); the convergence speed of the model is also greatly improved, saving computational cost and shortening model training time. Overall, the SRD-MA+ model outperforms the SMR-MA model.
TABLE 1. Comparison of the running performance of the SMR-MA and SRD-MA+ models
To observe more clearly the inter-class separation and intra-class aggregation effects of the SRD-MA+ model and the SMR-MA model on the two modal features, the t-SNE nonlinear dimensionality-reduction algorithm is adopted for visualization. Fig. 5 shows the visualization results on the Wikipedia dataset, where circles represent image features and triangles represent text features.
Comparing the image-feature and text-feature distributions of the two models in the common representation space, Figs. 5(a) and (b), both models separate the feature distributions of different categories well, and the two modalities with the same semantics overlap well, showing that both models effectively eliminate modal heterogeneity. The image features of a given class largely coincide with the text features in shape, indicating that the SRD-MA+ model can effectively eliminate the modal difference, so that the two modal features of the same category overlap effectively while different categories are effectively separated. By comparison, the SRD-MA+ model performs better in the common representation space, with fewer discrete points within each category and better intra-class aggregation.
Analyzing the image-feature or text-feature distributions alone, the SRD-MA+ model can effectively distinguish samples of different semantic categories and divide them into the corresponding semantic clusters. For the image features, as in Figs. 5(c) and (d), the SRD-MA+ model separates the different categories well, whereas the SMR-MA model shows different categories adhering to one another, which runs counter to the purpose of cross-modal retrieval. As shown in Figs. 5(e) and (f), the SRD-MA+ model also handles the text features better than the SMR-MA model: it exhibits a higher degree of intra-class aggregation, fewer discrete points, and a better intra-class subdivision effect, which is consistent with the original purpose of designing the semantic approximate-matching and correct-matching module. In addition, comparing the feature distributions of the two models across the modalities, the image features of the SRD-MA+ model for a given class lie closer to the shape of the text features, whereas in the SMR-MA model the text features are more concentrated and the image features are more dispersed. A comprehensive analysis of the visualization thus explains intuitively why the SRD-MA+ model outperforms the SMR-MA model.
Fig. 6 is a schematic structural diagram of a computer device according to the present disclosure. Referring to FIG. 6, the computer device 400 includes at least a memory 402 and a processor 401; the memory 402 is connected to the processor through a communication bus 403, and is configured to store computer instructions executable by the processor 401, and the processor 401 is configured to read the computer instructions from the memory 402 to implement the steps of the cross-modal searching method according to any of the foregoing embodiments.
For the above-described device embodiments, reference is made to the description of the method embodiments for the relevant points, since they essentially correspond to the method embodiments. The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the objectives of the disclosed solution. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices including, for example, semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., internal magnetic disks or removable disks), magneto-optical disks, and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
Finally, it should be noted that: while this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features of specific embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. On the other hand, the various features described in the individual embodiments may also be implemented separately in the various embodiments or in any suitable subcombination. Furthermore, although features may be acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. Furthermore, the processes depicted in the accompanying drawings are not necessarily required to be in the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may be advantageous.
The foregoing description of the preferred embodiments of the present disclosure is not intended to limit the disclosure, but rather to cover all modifications, equivalents, improvements and alternatives falling within the spirit and principles of the present disclosure.

Claims (9)

1. A cross-modal retrieval method for semantic subdivision and modal alignment reasoning learning, characterized by comprising the following steps:
performing modal alignment, based on scaled dot-product attention, on the original modal features obtained after pre-training, so as to re-aggregate projected modal-alignment features for the original features;
after passing the modal-alignment data formed in the preceding step through a weight-shared multi-layer perceptron, adopting a semantic approximate-matching and correct-matching method to mine correct semantic matches and approximate matches within same-class label clusters;
constraining the model with an Arc4cmr loss function, a mutual-supervision contrastive loss function, and a contrastive loss function between the image-text feature similarity matrix and the similarity label matrix.
2. The cross-modality retrieval method of claim 1, wherein the method of modality alignment based on scaled dot product attention includes:
converting an original text feature into a single query Q_t and all original image features within the batch into keys K_v and values V_v, the conversion method being specifically shown in Equation 1; similarly, the conversion method for the image-retrieves-text task is specifically shown in Equation 2;
in the above formulas, D_p is the projection dimension, with a value of 1024; LN denotes layer normalization; and the W matrices are projection matrices of the same dimension;
using the correlation weights between the single original text feature's query Q_t and each original image feature's key K_v in the batch to re-aggregate the projected image features V_v in proportion to the correlation weight coefficients, the text-conditioned output obtained after scaled dot-product attention being Attention(Q_t, K_v, V_v), specifically as shown in Equation 3;
similarly, the image-conditioned output obtained after scaled dot-product attention being Attention(Q_v, K_t, V_t), specifically as shown in Equation 4;
in Equations 3 and 4, sqrt(D_p) is the scaling factor;
then projecting the aggregated image representation of the attention module back with the weight W^O, as in Equation 5:
r_{v|t} = LN(Attention(Q_t, K_v, V_v) W^O) (5)
where r_{v|t} represents the aggregated image feature conditioned on text t;
projecting the aggregated text representation of the attention module back with the weight W^O, as in Equation 6:
r_{t|v} = LN(Attention(Q_v, K_t, V_t) W^O) (6)
where r_{t|v} is the aggregated text feature conditioned on image v;
the image-modality features that finally result, via the modal interaction based on scaled dot-product attention, with text as the retrieval condition being denoted C_{v|t}, the final fused output of the text feature alignment process, as in Equation 7, and the text-modality features with the image as the retrieval condition being denoted C_{t|v}, the final fused output of the image feature alignment process, as shown in Equation 8;
in the above formulas, a single element of C_{t|v} is the re-representation of a single original text feature within the batch, and a single element of C_{v|t} is the re-representation of a single original image feature within the batch.
3. The cross-modal retrieval method according to claim 2, wherein the method of semantic approximate matching and correct matching includes the steps of:
computing the semantic relevance score s_ij between every image feature u_i and text feature v_j within the batch, specifically as shown in Equation 9;
according to Equation 9, obtaining the text-to-image (T2I) similarity matrix attn_t, as given in Equation 10;
the image-to-text (I2T) similarity matrix being expressed as attn_v, as in Equation 11;
applying an XNOR operation to the class labels corresponding to all images and texts within the batch, i.e., the entry is 1 when the labels are the same and 0 when they differ, so that the similarity label matrix of approximately matched image-text pairs can be expressed by Equation 12;
based on the similarity label matrix label_sim, re-expressing the similarity weights of the text-to-image similarity matrix attn_t, and similarly re-expressing the similarity weights of the image-to-text similarity matrix attn_v;
for a text, aggregating the related image features in the batch-sized image library, as in Equation 13:
where the attention weight, expanded as shown in Equation 14, is the near-semantic association between the text and the images;
for an image, aggregating the related text features in the batch-sized text library, as in Equation 15:
where the attention weight, expanded as shown in Equation 16, is the near-semantic association between the image and the texts;
in Equations 14 and 16, λ is the penalty factor, and mask_sim(·) denotes a masking function: when the input is positive the output equals the input, otherwise the output is -∞;
the label discrimination matrix corresponding to the correct-match relation being an identity matrix, as expressed in Equation 17;
based on the identity label matrix label_eql, computing the similarity weights attn_eql of correctly matched image-text pairs, i.e., the similarity weights between the text-to-image similarity matrix attn_t and label_eql, and between the image-to-text similarity matrix attn_v and label_eql, where mask_eql(·) is a mask function whose output is 1 when the input is positive and 0 otherwise;
according to the soft-attention calculation rule, re-expressing the correctly matched image features and the correctly matched text features as weighted aggregations;
the image features jointly represented by the approximately semantically associated features and the correctly matched features being given in Equation 18, and the corresponding text features in Equation 19.
4. A cross-modal retrieval method as claimed in claim 3, wherein the method of model constraint with the Arc4cmr loss function is represented by Equation 21:
in the above formula, x_i denotes the feature input, y_i is the corresponding class label, θ_k is the angle between feature x_i and the corresponding weight W_k in the angular space, s is the hypersphere radius, n is the number of classes, and M is the batch size;
for the text-retrieves-image task, the ArcFace loss function L_simT2I takes the corresponding text-side features as input, with the corresponding regularization, as in Equation 22;
for the image-retrieves-text task, the ArcFace loss function L_simI2T takes the corresponding image-side features as input, with the corresponding regularization, as in Equation 23;
combining Equation 22 and Equation 23, the Arc4cmr loss function L_Arc4cmr is expressed as Equation 24;
L_Arc4cmr = L_simT2I + L_simI2T (24).
5. A cross-modal retrieval method as claimed in claim 3, wherein the method of model constraint with the mutual-supervision contrastive loss function comprises:
measuring the similarity L_contr between a feature x_i and another feature x_j with a normalized temperature-scaled Softmax similarity, as in Equation 25:
in the formula, τ represents the temperature parameter and sim(·) represents the dot product after L2 normalization of the inputs;
the contrastive loss being the arithmetic mean of the cross-entropy of all positive-pair normalized similarities within the batch; assuming a sample whose content matches x_i, the contrastive loss L_contr is expressed by Equation 26, where B is the size of the sample set;
taking the features of matched image-text pairs as supervision signals for each other to achieve fine-grained feature alignment, the mutual-supervision contrastive loss L_contr-MutlSup being expressed by Equation 27;
6. A cross-modal retrieval method as claimed in claim 3, wherein the method of model constraint with the contrastive loss function between the image-text feature similarity matrix and the similarity label matrix comprises:
making, through the contrastive loss function L_contr-sim between the image-text feature similarity matrices sim_t, sim_v and the similarity label matrix label_sim, the loss of intra-class semantic approximate matches smaller than the loss of inter-class mismatches, specifically as in Equation 28:
where M is the batch size; sim_t represents the text-image feature similarity matrix jointly represented with the text as the retrieval condition, as shown in Equation 29;
sim_v represents the image-text feature similarity matrix jointly represented with the image as the retrieval condition, and ρ is a learnable parameter, as shown in Equation 30;
sim_t = ρ Σ F_t F_v (29)
sim_v = ρ Σ F_v F_t (30)
combining Equations 24, 27, and 28, the total target loss function is shown in Equation 31, where θ and μ are hyper-parameters;
7. a retrieval system employing the cross-modality retrieval method as claimed in any one of claims 1 to 6, comprising:
and a mode alignment module: the method comprises the steps of performing modal alignment on original modal characteristics obtained after pre-training based on zoom dot product attention so as to realize the mode alignment characteristics of projection mode re-aggregation for the original characteristics;
and a matching module: after passing the modal alignment data formed in the steps through the weight sharing multi-layer perceptron, adopting a semantic approximate matching and correct matching method to realize semantic correct matching and approximate matching mining on the same type label clusters;
constraint module: the method is used for carrying out model constraint by adopting an Arc4cmr loss function, a mutual supervision contrast loss function, a graph-text feature similarity matrix and a contrast loss function between similar label matrixes.
8. A computer readable storage medium having stored thereon a computer program, characterized in that the program when executed implements the steps of the cross-modality retrieval method of any of claims 1 to 6.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the cross-modal retrieval method according to any one of claims 1 to 6 when the program is executed by the processor.
CN202310684445.5A 2023-06-09 2023-06-09 Semanteme subdivision and modal alignment reasoning learning cross-modal retrieval method and retrieval system Pending CN116610831A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310684445.5A CN116610831A (en) 2023-06-09 2023-06-09 Semanteme subdivision and modal alignment reasoning learning cross-modal retrieval method and retrieval system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310684445.5A CN116610831A (en) 2023-06-09 2023-06-09 Semanteme subdivision and modal alignment reasoning learning cross-modal retrieval method and retrieval system

Publications (1)

Publication Number Publication Date
CN116610831A true CN116610831A (en) 2023-08-18

Family

ID=87678104

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310684445.5A Pending CN116610831A (en) 2023-06-09 2023-06-09 Semanteme subdivision and modal alignment reasoning learning cross-modal retrieval method and retrieval system

Country Status (1)

Country Link
CN (1) CN116610831A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116775918A (en) * 2023-08-22 2023-09-19 四川鹏旭斯特科技有限公司 Cross-modal retrieval method, system, equipment and medium based on complementary entropy contrast learning
CN116775918B (en) * 2023-08-22 2023-11-24 四川鹏旭斯特科技有限公司 Cross-modal retrieval method, system, equipment and medium based on complementary entropy contrast learning
CN117112829A (en) * 2023-10-24 2023-11-24 吉林大学 Medical data cross-modal retrieval method and device and related equipment
CN117112829B (en) * 2023-10-24 2024-02-02 吉林大学 Medical data cross-modal retrieval method and device and related equipment

Similar Documents

Publication Publication Date Title
Cao et al. Generalized multi-view embedding for visual recognition and cross-modal retrieval
Xu et al. Deep adversarial metric learning for cross-modal retrieval
US11093560B2 (en) Stacked cross-modal matching
Katsurai et al. Image sentiment analysis using latent correlations among visual, textual, and sentiment views
CN116610831A (en) Semanteme subdivision and modal alignment reasoning learning cross-modal retrieval method and retrieval system
Wang et al. Facilitating image search with a scalable and compact semantic mapping
Vo et al. Transductive kernel map learning and its application to image annotation
Wu et al. Online asymmetric metric learning with multi-layer similarity aggregation for cross-modal retrieval
Ma et al. A weighted KNN-based automatic image annotation method
CN113537304A (en) Cross-modal semantic clustering method based on bidirectional CNN
Ou et al. Semantic consistent adversarial cross-modal retrieval exploiting semantic similarity
Ni et al. Scene classification from remote sensing images using mid-level deep feature learning
Shen et al. Clustering-driven deep adversarial hashing for scalable unsupervised cross-modal retrieval
Li et al. Discriminative-region attention and orthogonal-view generation model for vehicle re-identification
Guadarrama et al. Understanding object descriptions in robotics by open-vocabulary object retrieval and detection
Arulmozhi et al. DSHPoolF: deep supervised hashing based on selective pool feature map for image retrieval
Hu et al. Deep supervised multi-view learning with graph priors
Li et al. Automatic image annotation with continuous PLSA
Malik et al. Multimodal semantic analysis with regularized semantic autoencoder
Zhang et al. Research on hierarchical pedestrian detection based on SVM classifier with improved kernel function
CN115563316A (en) Cross-modal retrieval method and retrieval system
Dai et al. Cross-modal deep discriminant analysis
Xu et al. Learning multi-task local metrics for image annotation
Su et al. Deep supervised hashing with hard example pairs optimization for image retrieval
XingJia et al. Calligraphy and Painting Identification 3D‐CNN Model Based on Hyperspectral Image MNF Dimensionality Reduction

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination