CN114298159B - Image similarity detection method based on text fusion under label-free sample - Google Patents

Image similarity detection method based on text fusion under label-free sample Download PDF

Info

Publication number
CN114298159B
CN114298159B (application CN202111482531.5A)
Authority
CN
China
Prior art keywords
image
text
model
images
branch
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111482531.5A
Other languages
Chinese (zh)
Other versions
CN114298159A (en)
Inventor
袁鑫攀
毛鑫鑫
谢少军
夏威
李长云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University of Technology
Original Assignee
Hunan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University of Technology filed Critical Hunan University of Technology
Priority to CN202111482531.5A priority Critical patent/CN114298159B/en
Publication of CN114298159A publication Critical patent/CN114298159A/en
Application granted granted Critical
Publication of CN114298159B publication Critical patent/CN114298159B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Image Analysis (AREA)

Abstract

An image similarity detection method based on text fusion under label-free samples belongs to the technical field of image similarity measurement and comprises the following steps: S1: constructing a TFCSiam model comprising upper, middle and lower branches; S2: training the model constructed in step S1; S3: carrying out similarity measurement between images using the model trained in S2. The method introduces contrastive learning from unsupervised learning, and its text fusion structure takes only the text information related to an image as a supplement to the image features: the text information is embedded into a subspace shared by image and text semantics through an image-text cross-modal module, and the embedded features of that subspace are then fused into the image. This reduces the complexity of feature fusion and of the subsequent multi-modal projection, improves model prediction accuracy, and improves the computational efficiency of model training.

Description

Image similarity detection method based on text fusion under label-free sample
Technical Field
The invention belongs to the technical field of image similarity measurement, and particularly relates to an image similarity detection method based on text fusion under a label-free sample.
Background
At present, image similarity measurement algorithms are almost exclusively supervised learning models. Their training and performance depend heavily on manual labeling of the training data and on hard-sample mining, so in scenarios with huge amounts of unlabeled data their training is limited and relatively inefficient. In addition, current similarity measurement models usually attend only to the single image modality, whereas in most scenarios an image is accompanied by auxiliary information of other modalities, such as titles and keywords. These are key scientific problems that need to be solved.
Image similarity is a term describing the similarity and correlation between image data: the similarity of abstract features extracted from images can be compared to represent the similarity between the images as a whole. For local regions of an image, local descriptors such as the Local Binary Pattern (LBP), which describes the relationship between a pixel and its neighboring pixels, allow the gray-scale or gradient features of these regions to be used for similarity comparison. Image feature extraction and the calculation of similarity between images form the basis of work in computer vision and image-related fields; extracting features of interest from images plays an important guiding role for subsequent tasks such as image retrieval, image generation, image classification, object detection and image registration. Therefore, extracting features of interest from images and measuring the similarity between images is a fundamental problem worthy of study and is very important in the field of computer vision.
Traditional machine learning methods rely on manually designed features and low-level representations of the visual content of images; they cannot abstract the objects or events contained in an image at a high level, and their image similarity measurement algorithms suffer from low accuracy, poor scalability and difficulty in crossing the semantic gap. In recent years, with the rapid development of artificial intelligence and deep learning, deep artificial neural networks have been widely applied to image similarity measurement and large-scale image similarity retrieval and have achieved very good results, replacing the earlier traditional machine learning methods. The current approaches to image similarity measurement are mainly the Siamese Network, the Triplet Network and variants of these two types of model.
Self-supervised learning is a form of unsupervised learning: its main characteristic is that knowledge is learned directly from unlabeled data, without relying on annotated data during model training. Self-supervised learning mainly uses auxiliary (pretext) tasks to mine supervisory information from large-scale unsupervised data, and trains the network with this constructed supervision so that representations valuable for downstream tasks can be learned. Contrastive learning is one implementation of self-supervision; its auxiliary task is based on a contrastive constraint, and representations are built by learning to encode whether two items are similar or dissimilar. The concrete implementation resembles a Siamese Network: self-supervised learning is realized by constructing positive samples and negative samples and then measuring the distances between them, the core idea being that the distance between a sample and its positive examples should be far smaller than the distance between the sample and its negative examples:
dist(f(x), f(x⁺)) ≪ dist(f(x), f(x⁻))
This idea is very similar to the Siamese Network; the main difference is that the Siamese Network is supervised learning, so the sample pairs used for training are labeled and can be constructed manually, whereas contrastive learning is unsupervised learning, the input samples carry no labels, and the sample pairs therefore need to be constructed in a special way.
A good contrastive learning system should possess two properties: Alignment and Uniformity. Alignment means that similar samples, i.e., positive examples, should have close features after mapping to the target space, that is, a relatively small distance in that space. Uniformity means that the system should tend to retain as much information in the features as possible, which is equivalent to the mapped features being distributed as evenly as possible in the target space; the more even the distribution, the more the samples differ pairwise and the more each sample retains its own unique information, indicating that information retention is sufficient.
An extreme violation of the Uniformity property is that all data map to the same point, which means all data information is lost: after the feature mapping, all data converge to the same constant solution. This is commonly referred to as model collapse (Collapse). If the structure or the loss function of the contrastive learning scheme is not well designed, model collapse occurs very easily.
For example, the literature entitled "FaceNet: A unified embedding for face recognition and clustering" applies the Triplet Network structure to the similarity detection and recognition of face images, and the literature entitled "Multi-Similarity Loss with General Pair Weighting for Deep Metric Learning" further improves the loss function and the way sample pairs are constructed. Although these approaches achieve good results for image similarity measurement in their respective fields, several problems remain: 1. all of them are supervised learning, that is, they depend on manually labeled data during model training and require sample pairs to be constructed; 2. only the image itself is considered when computing image similarity, and information from other modalities related to the image, such as text, is ignored.
Disclosure of Invention
In view of the above technical problems, the invention provides an image similarity detection method based on text fusion under label-free samples, TFCSiam (Textual Fusion Contrastive Siamese). By introducing contrastive learning from unsupervised learning, taking only the text information related to an image as a supplement to the image features, embedding the text information into a subspace shared by image and text semantics through an image-text cross-modal module, and then fusing the embedded features of that subspace into the image, the method reduces the complexity of feature fusion and of the subsequent multi-modal projection, improves model prediction accuracy, and improves the computational efficiency of model training.
The invention adopts the following specific technical scheme:
an image similarity detection method based on text fusion under a label-free sample comprises the following steps:
S1: constructing a TFCSiam model comprising upper, middle and lower branches, wherein the upper branch and the lower branch form an asymmetric structure, and a cross-modal module is selected in the middle branch to process text modality information;
S2: training the model constructed in step S1;
S2.1: randomly performing two data enhancements on an input image;
S2.2: inputting the two images produced by S2.1 into the upper branch and the lower branch respectively to extract deep features of the images;
S2.3: inputting the text information corresponding to the image into the middle branch and projecting the text into a subspace shared by image and text semantics to obtain multimodal text embedding features, thereby bringing the text and the image semantically closer;
S2.4: fusing the deep image features of S2.2 with the multimodal text embedding features of S2.3;
S2.5: mapping the multimodal information fused in S2.4 into a metric space to obtain a text-fused image embedding;
S2.6: feeding the text-fused image embedding of the upper branch into the next module to obtain an output result, and calculating the cosine distance between this output and the text-fused image embedding of the lower branch;
S2.7: taking the cosine distance of S2.6 as the loss value of the TFCSiam model and updating the model parameters with the aim of reducing the loss value, so as to optimize the model;
S3: performing similarity measurement between images by using the model trained in S2;
S3.1: transforming the trained TFCSiam model into a TFCSiam characterization model;
S3.2: inputting the image-text pairs A' and B' whose similarity is to be measured into the TFCSiam characterization model to obtain the text-fused image embedded representations A and B in the metric space;
S3.3: calculating the cosine distance d between A and B to determine whether the images A' and B' are similar.
Preferably, the upper branch and the lower branch of the TFCSiam model each comprise, in sequence, an Encoder module and a Multimodal Projector module, used respectively for extracting deep image features and for mapping the fused multimodal information into the metric space; the upper branch has one more Predictor module at its tail than the lower branch, and the lower branch adopts a stop-gradient strategy so that its parameters are not updated during training, the overall structure of the upper and lower branches therefore being asymmetric.
Preferably, the Encoder module extracts the deep features of the image using a ResNet-50 convolutional neural network.
Preferably, the cross-modal module selected for the middle branch of the TFCSiam model is a Cross-Modal Encoder module.
Preferably, the data enhancement modes of S2.1 include random cropping, flipping, color jittering, grayscale conversion, Gaussian blur and overexposure.
Preferably, the fusion in S2.4 adopts a simple splicing (concatenation) operation.
Preferably, step S2.7 updates the parameters by gradient descent.
Preferably, the metric space is an ordered pair (M, d), where M is an arbitrary set and d is a metric on M, i.e., a function d: M × M → ℝ such that for any x, y, z in M the following conditions hold: d(x, y) ≥ 0; d(x, y) = 0 if and only if x = y; d(x, y) = d(y, x); d(x, z) ≤ d(x, y) + d(y, z), where x, y and z are elements of the set M and ℝ is the set of real numbers.
Preferably, the TFCSiam characterization model of S3.1 is obtained by removing the upper branch and the Similarity module from the TFCSiam model.
Preferably, the judging method of S3.3 is to set a threshold α; when the cosine distance d > α, the images A' and B' are judged dissimilar, and when d < α, the images A' and B' are judged similar.
The beneficial effects of the invention are as follows:
(1) For the problem of measuring the similarity of images, the invention introduces contrastive learning into similarity detection and uses it directly for image similarity measurement, without any downstream task, so the model can be trained directly on massive unlabeled data without manual labeling; the training data can thus be fully utilized and training efficiency is improved.
(2) In addition, unlike cross-modal retrieval and similar approaches, when the similarity of images is calculated only the text information related to the images is used as a supplement to the image features, so text-modality information associated with the images, such as titles and keywords, is further exploited and the prediction accuracy of the model is improved.
(3) When the text modality is processed, the text content is mapped into a subspace shared by image and text semantics instead of a text model being used directly, and the distance between image and text is shortened or lengthened at the semantic level, which improves the efficiency of the subsequent feature fusion and space mapping and reduces the amount of computation.
Drawings
FIG. 1 is a schematic diagram of a model structure of a Siamese Network;
FIG. 2 is a schematic diagram of a model structure of a Triplet Network;
FIG. 3 is a schematic diagram of a model structure of SimSiam;
FIG. 4 is a schematic diagram of the model structure of TFCSiam constructed according to the present invention;
FIG. 5 is a schematic diagram of the ResNet-50 convolutional neural network used by the Encoder module of the present invention to extract deep image features;
FIG. 6 is a schematic diagram of the semantic shared-subspace learning process of the Cross-Modal Encoder module according to the present invention;
FIG. 7 is a schematic diagram of the image similarity measurement model based on TFCSiam according to the present invention;
FIG. 8 is a schematic diagram of the TFCSiam characterization model according to the present invention;
FIG. 9 is a schematic view of A' in a pair of images in a preferred embodiment;
FIG. 10 is a schematic view of B' in a similar pair as in FIG. 9;
FIG. 11 is a schematic view of B' in a dissimilar image pair to that of FIG. 9.
Detailed Description
The invention will be further illustrated with reference to specific examples. Unless otherwise indicated, the materials and methods employed in the examples of the present invention are those conventionally available and conventionally used in the art. Several model structures mentioned in the background art are first briefly described:
(1) Siamese Network
The most basic twin (Siamese) network model is composed of two identical convolutional neural networks and requires two pictures to be input simultaneously, as shown in FIG. 1. The goal of twin-network training is to bring similar images closer together and to push dissimilar images farther apart.
The twin network model uses distance metric learning (Distance Metric Learning) to design a loss function under a pairwise constraint (Pairwise Constraint), of the form:
L(W, Y, X₁, X₂) = (1/2)·[ Y·D(X₁, X₂) + (1 − Y)·max(m − D(X₁, X₂), 0) ]
where Y indicates whether the image pair X₁, X₂ is similar or of the same class:
Y = 1 if X₁ and X₂ are similar (or of the same class), and Y = 0 otherwise.
D(X₁, X₂) is the square of the Euclidean distance between the feature vectors F(X₁) and F(X₂):
D(X₁, X₂) = ‖F(X₁) − F(X₂)‖₂²
and F(·) is a nonlinear mapping determined by the network parameters W and b. F(X₁) and F(X₂) are the feature vectors obtained by extracting the deep features of images X₁ and X₂ with a convolutional neural network whose SoftMax layer has been removed, and m is the margin (boundary value) on the distance D(X₁, X₂).
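For illustration only, a minimal PyTorch-style sketch of such a pairwise contrastive loss is given below; the function name and the default margin are assumptions and only follow the description above, not a disclosed implementation.

    import torch
    import torch.nn.functional as F

    def pairwise_contrastive_loss(f1, f2, y, margin=1.0):
        """f1, f2: feature vectors F(X1), F(X2); y = 1 for similar pairs, 0 for dissimilar pairs."""
        d = torch.sum((f1 - f2) ** 2, dim=1)        # D(X1, X2): squared Euclidean distance
        pull = y * d                                # similar pairs are pulled together
        push = (1 - y) * F.relu(margin - d)         # dissimilar pairs are pushed beyond the margin m
        return 0.5 * (pull + push).mean()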
(2) Triplet Network
The Triplet Network model, also known as a triplet convolutional neural network (Triplet Convolutional Network), is developed from the twin network model and is composed of three identical convolutional neural networks that share weights; its structure is shown in FIG. 2.
The Triplet Network model takes three images as input simultaneously: an anchor sample (Anchor), a positive sample (Positive) and a negative sample (Negative). The anchor sample and the positive sample form an image pair of the same class or with similar content, while the anchor sample and the negative sample form an image pair of different classes or with dissimilar content. From the triplet constraint (Triplet Constraint) on the distances between samples, a triplet loss function (Triplet Loss) can be constructed as shown in the following equation:
L = (1/N) Σᵢ max( D(x_a, x_p) − D(x_a, x_n) + α, 0 )
where x_a is the anchor sample, x_p the positive sample and x_n the negative sample, α is the margin (boundary value) between the distances D(x_a, x_p) and D(x_a, x_n), and N is the number of triplet samples. The loss function is minimized by the training and optimization algorithm of the network model. In terms of the distance relationships, the distance between x_a and x_p should be as small as possible and the distance between x_a and x_n as large as possible. From the loss function L it can be seen that:
when D(x_a, x_p) − D(x_a, x_n) + α > 0, L > 0 and a loss is incurred;
when D(x_a, x_p) − D(x_a, x_n) + α ≤ 0, L = 0 and no loss is incurred.
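A comparable sketch of the triplet loss is given below under the same caveats (function name and margin value are assumptions):

    import torch
    import torch.nn.functional as F

    def triplet_loss(f_a, f_p, f_n, alpha=0.2):
        """f_a, f_p, f_n: embeddings of the anchor, positive and negative samples; alpha is the margin."""
        d_ap = torch.sum((f_a - f_p) ** 2, dim=1)    # D(x_a, x_p)
        d_an = torch.sum((f_a - f_n) ** 2, dim=1)    # D(x_a, x_n)
        # the loss is positive only while D(x_a, x_p) - D(x_a, x_n) + alpha > 0
        return F.relu(d_ap - d_an + alpha).mean()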
(3) SimSiam model
SimSiam is an implementation of contrastive learning which, unlike general contrastive learning, does not require negative examples. Negative examples, however, are a very important factor in preventing model collapse, so in order that the model does not collapse after discarding them, SimSiam adopts an asymmetric structure, as shown in FIG. 3.
The left side of the model has one more predictor module than the right side, and during training the right branch adopts a stop-gradient strategy, i.e., the right branch does not update its parameters and only the parameters of the left branch are updated. The purpose of this asymmetric design is to avoid "collapse" of the model during training. The auxiliary task of SimSiam is to obtain two outputs x₁ and x₂ by applying data augmentations such as flipping, rotation and scaling to an input image x; the self-derived supervisory information is then the fact that x₁ and x₂ come from the same picture x, i.e., x₁ and x₂ are positive examples of each other (hence there are no negative examples in SimSiam), and the training objective of the model only needs to increase their similarity. The encoder module is a convolutional neural network, ResNet-50 in the original paper; the predictor module is a fully connected network.
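The asymmetric SimSiam structure described above can be sketched roughly as follows (a simplified version: the projection MLP of the original paper is folded into the encoder, and all layer sizes are assumptions):

    import torch
    import torch.nn as nn
    import torchvision

    class SimSiamSketch(nn.Module):
        def __init__(self, dim=2048, pred_dim=512):
            super().__init__()
            # encoder f: ResNet-50 backbone whose final layer outputs a dim-dimensional feature
            self.encoder = torchvision.models.resnet50(num_classes=dim)
            # predictor h: the extra module that makes the two branches asymmetric
            self.predictor = nn.Sequential(
                nn.Linear(dim, pred_dim), nn.BatchNorm1d(pred_dim), nn.ReLU(inplace=True),
                nn.Linear(pred_dim, dim))

        def forward(self, x1, x2):
            z1, z2 = self.encoder(x1), self.encoder(x2)
            p1, p2 = self.predictor(z1), self.predictor(z2)
            # stop-gradient on z1, z2: the "right" branch does not receive parameter updates
            return p1, p2, z1.detach(), z2.detach()

    def negative_cosine(p, z):
        # similarity between the two positive views; maximizing it is the training objective
        return -nn.functional.cosine_similarity(p, z, dim=1).mean()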
Example 1
The embodiment discloses an image similarity detection method based on text fusion under a label-free sample, which comprises the following steps:
S1: as shown in FIG. 4, a TFCSiam model comprising three branches (upper, middle, lower) is constructed. The upper branch and the lower branch each comprise, in sequence, an Encoder module and a Multimodal Projector module; the upper branch has one additional Predictor module at its tail, the lower branch does not update its parameters, and the upper and lower branches together form an asymmetric structure. A Cross-Modal Encoder module is selected as the cross-modal module of the middle branch to process the text modality information.
The input of the model is a picture/text pair. Let the picture be x and the text information corresponding to the picture be c. f_θ(·) denotes the Encoder module, a backbone network used for image feature extraction; the backbone specifically used in this embodiment is a ResNet-50 model, and θ denotes the parameters of the model. g_θ(·) and q_θ(·) denote the Multimodal Projector module and the Predictor module respectively, and the middle branch consists of the Cross-Modal Encoder module.
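Purely as an illustrative sketch of the components just named (f_θ, the Cross-Modal Encoder, the fusion, g_θ and q_θ): the class name, all layer widths, the 768-dimensional raw text feature and the stand-in text encoder below are assumptions, not the patented design.

    import torch
    import torch.nn as nn
    import torchvision

    class TFCSiamSketch(nn.Module):
        def __init__(self, img_dim=2048, txt_in=768, txt_dim=1024, embed_dim=128):
            super().__init__()
            backbone = torchvision.models.resnet50(weights=None)
            backbone.fc = nn.Identity()                     # f_theta: deep image features (2048-d)
            self.encoder = backbone
            self.cross_modal_encoder = nn.Sequential(       # stand-in for the middle-branch Cross-Modal Encoder
                nn.Linear(txt_in, txt_dim), nn.ReLU(inplace=True), nn.Linear(txt_dim, txt_dim))
            self.multimodal_projector = nn.Sequential(      # g_theta: fused features -> metric space
                nn.Linear(img_dim + txt_dim, 512), nn.BatchNorm1d(512), nn.ReLU(inplace=True),
                nn.Linear(512, embed_dim))
            self.predictor = nn.Sequential(                 # q_theta: extra head of the upper branch
                nn.Linear(embed_dim, 64), nn.ReLU(inplace=True), nn.Linear(64, embed_dim))

        def embed(self, image, text_feature):
            y = self.encoder(image)                         # deep image feature y
            t = self.cross_modal_encoder(text_feature)      # multimodal text embedding
            return self.multimodal_projector(torch.cat([y, t], dim=1))  # text-fused image embedding z

        def forward(self, x1, x2, text_feature):
            z1 = self.embed(x1, text_feature)               # upper branch
            z2 = self.embed(x2, text_feature).detach()      # lower branch: stop-gradient
            p1 = self.predictor(z1)
            return p1, z2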
S2: training the model constructed in the step S1;
S2.1: two data enhancements are randomly performed on the input image. t(·) and t′(·) are two different data enhancement operations, such as flipping, cropping, scaling or color change of the picture; in the actual training process the data enhancement modes are selected at random from the following (a sketch of such an augmentation pipeline follows this list):
Random cropping: randomly cropping a region covering between 8% and 100% of the original image size and resizing the cropped image block to 224×224 by bicubic interpolation;
Flipping: flipping the image in the horizontal or vertical direction;
Color jittering: randomly adjusting the saturation, brightness, contrast and hue of the original picture;
Grayscale conversion: converting the input picture into a grayscale picture, computed as 0.2989R + 0.5870G + 0.1140B;
Gaussian blur: randomly applying Gaussian blur to the 224×224 input image with a standard deviation in the range [0.1, 2.0] and a 23×23 Gaussian kernel;
Overexposure (solarization): letting an image pixel value be x, each pixel above a given threshold is inverted to 255 − x, while pixels below the threshold are left unchanged.
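A possible torchvision realization of this augmentation list is sketched below; the application probabilities and jitter strengths are assumptions, while the crop scale, output size, blur parameters and the grayscale/overexposure operations follow the items above.

    import torchvision.transforms as T

    augment = T.Compose([
        T.RandomResizedCrop(224, scale=(0.08, 1.0),
                            interpolation=T.InterpolationMode.BICUBIC),        # random crop, 8%-100%, bicubic resize
        T.RandomHorizontalFlip(p=0.5),                                         # flipping
        T.RandomVerticalFlip(p=0.5),
        T.RandomApply([T.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),             # brightness/contrast/saturation/hue jitter
        T.RandomGrayscale(p=0.2),                                              # 0.2989R + 0.5870G + 0.1140B
        T.RandomApply([T.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0))], p=0.5),
        T.RandomSolarize(threshold=128, p=0.2),                                # "overexposure"
        T.ToTensor(),
    ])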
S2.2: the image processed by S2.1 is respectively input to an Encoder module in an upper branch and a lower branch to extract deep features f of the image θ (t (x)) and f θ (t' (x)) let the output of the Encoder be y, then:
the output of the Encoder in the upper branch is y 1 :=f θ (t(x));
The output of the Encoder in the lower branch is y 2 :=f θ (t′(x))。
The structure of ResNet-50 model in convolutional neural network is shown in FIG. 5.
S2.3: inputting text information corresponding to the image into a Cross-Modal encoding module of a middle branch, projecting the text into a subspace shared by image-text semantics, and obtaining multi-mode text embeddingThereby further semantically bringing together text and images, the process of projecting graphics into the shared subspace is shown in fig. 6.
In FIG. 6, one type of marker denotes an image in the image space, with different colors representing different semantics; the other type of marker denotes a text in the text space, with different colors representing texts of different semantics. In the original text subspace only text-to-text semantic relationships exist; after the text passes through the Common Space Learning of the Cross-Modal Encoder, it has learned the image information of the corresponding semantics and lies closer, in the shared subspace, to images with similar semantics.
For processing the text modality, the scheme adopts a cross-modal module rather than a pure text module: the aim is that, after embedding, the text information has already learned the semantic information of the image, so that the subsequent feature fusion and Multimodal Projector stages become easier, computation is reduced and model training efficiency is improved.
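The patent does not disclose the internals of the Cross-Modal Encoder; the following sketch therefore only illustrates the general idea of common-space learning with a hypothetical CLIP-style alignment objective (projection sizes, temperature and the loss are all assumptions):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CommonSpaceSketch(nn.Module):
        def __init__(self, txt_in=768, img_in=2048, shared_dim=1024, temperature=0.07):
            super().__init__()
            self.text_proj = nn.Linear(txt_in, shared_dim)    # project text features into the shared subspace
            self.image_proj = nn.Linear(img_in, shared_dim)   # project image features into the same subspace
            self.temperature = temperature

        def forward(self, text_feat, image_feat):
            t = F.normalize(self.text_proj(text_feat), dim=1)
            v = F.normalize(self.image_proj(image_feat), dim=1)
            # pulling matched image/text pairs together lets the text embedding absorb image semantics
            logits = t @ v.t() / self.temperature
            labels = torch.arange(t.size(0), device=t.device)
            align_loss = F.cross_entropy(logits, labels)
            return t, align_loss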
S2.4: fusing the deep features of the image of S2.2 and the multi-mode text embedding features of S2.3, and outputting by the upper branch network and the lower branch network respectively:
upper branch
lower branch
Because a cross-modal module was used to process the text information in the previous step, the resulting embedded features have already learned a certain amount of image semantic information, so the fusion mode in TFCSiam uses a simple splicing (concatenation) operation:
assume v₁ ∈ ℝ^a and v₂ ∈ ℝ^b; then their concatenation [v₁; v₂] ∈ ℝ^(a+b),
where ℝ^x denotes an x-dimensional real vector space and v denotes a vector in the corresponding real space.
It should be noted that this simple splicing operation does not establish any explicit interaction between the different features, and therefore relies on the subsequent network layers to adapt to it.
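A trivial illustration of the splicing operation with the dimensions quoted in the examples below (2048-dimensional image features, 1024-dimensional text embeddings); the batch size of 8 is arbitrary:

    import torch

    y = torch.randn(8, 2048)              # deep image features from the Encoder
    t = torch.randn(8, 1024)              # multimodal text embeddings from the Cross-Modal Encoder
    fused = torch.cat([y, t], dim=1)      # shape (8, 3072): no explicit interaction is modelled here,
                                          # the later layers must adapt to the concatenated feature
    print(fused.shape)                    # torch.Size([8, 3072])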
S2.5: mapping the multimodal information after S2.4 fusion to a measurement space through a Multimodal Projector module to obtain a text fused image embedding (or called multimodal image embedding)Let z be, the upper and lower branches are respectively obtained
Wherein Multimodal Proiector is a multi-layer perceptron network (MLP). Among the mapped metric spaces, image embedding can be used directly to calculate distance. The similarity between images is represented in terms of the distance between different image embeddings: the similar image distance value is small and the dissimilar image distance value is large.
Definition of metric space:
A metric space is an ordered pair (M, d), where M is a set and d is the metric on M, i.e., a function
d: M × M → ℝ
such that for any x, y, z in M the following conditions 1-4 hold:
1. d(x, y) ≥ 0
2. d(x, y) = 0 if and only if x = y
3. d(x, y) = d(y, x)
4. d(x, z) ≤ d(x, y) + d(y, z).
S2.6-S2.7: after mapping to the metric space, the on-model branches re-embed the multi-modal image into z 1 Input to the Predictor module, and output p to obtain result p 1 :=q θ (z 1 ). Output result and fused image embedding z of lower branch 2 The cosine distance is calculated as a loss value of the model. The network is then trained through a large number of image text pairs, reducing the penalty value. P is p 1 And z 2 Is defined as:
wherein I II 2 Representing the euclidean distance.
In actual training, the same image/text pair is fed in twice. In the first pass, the two augmented results x₁ and x₂ are input to the upper branch and the lower branch respectively; in the second pass the branches are swapped, x₁ being input to the lower branch and x₂ to the upper branch. The first pass thus yields p₁ and z₂, the second pass yields p₂ and z₁, and the loss function is defined as the symmetric form:
L = (1/2)·D(p₁, z₂) + (1/2)·D(p₂, z₁).
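Under the assumption that the cosine distance above is the negative cosine similarity, as the definition suggests, the symmetric loss can be sketched as:

    import torch.nn.functional as F

    def cosine_distance(p, z):
        # D(p, z): negative cosine similarity; z is detached so only the predictor branch receives gradients
        return -F.cosine_similarity(p, z.detach(), dim=1).mean()

    def tfcsiam_loss(p1, z2, p2, z1):
        # symmetric form: the two augmented views swap branches between the two forward passes
        return 0.5 * cosine_distance(p1, z2) + 0.5 * cosine_distance(p2, z1)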
the model is trained through a large amount of training data, and parameters are updated:
wherein θ represents a model parameter; n represents N samples in one iteration update; l (L) i Representing the loss function of the ith sample;representation pair l i Gradient with respect to parameter θ; delta theta represents the total average gradient value obtained by N samples in one iteration; η is the learning rate of parameter updating; opt (·) represents a parameter updating method, and in this embodiment, a gradient descent method is used to update parameters.
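A hypothetical training step tying the pieces together; `model.upper_branch`/`model.lower_branch`, the batched `augment` call and the SGD hyperparameters are assumptions of this sketch, not the patented implementation:

    import torch

    def training_step(model, optimizer, images, text_features, augment, loss_fn):
        x1, x2 = augment(images), augment(images)                     # two independent random augmentations
        p1, z2 = model.upper_branch(x1, text_features), model.lower_branch(x2, text_features)
        p2, z1 = model.upper_branch(x2, text_features), model.lower_branch(x1, text_features)
        loss = loss_fn(p1, z2, p2, z1)                                # symmetric cosine-distance loss
        optimizer.zero_grad()
        loss.backward()                                               # averages the gradient over the N batch samples
        optimizer.step()                                              # theta <- opt(theta, delta_theta, eta)
        return loss.item()

    # a typical choice for opt(.) would be stochastic gradient descent, e.g.
    # optimizer = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9, weight_decay=1e-4)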
S3: similarity measurement between images is carried out using the model trained in S2; the overall similarity measurement structure is shown in FIG. 7:
S3.1: the trained TFCSiam model is transformed into a TFCSiam characterization model, as shown in FIG. 8; the model used for similarity measurement is obtained by removing the upper branch and the Similarity module from the TFCSiam model and can be called the TFCSiam characterization model;
S3.2: the image-text pairs (Image-Text Pair) A' and B' whose similarity is to be measured are input into the TFCSiam characterization model to obtain the text-fused image embedded representations A and B in the metric space;
S3.3: the cosine distance d between A and B is calculated and a threshold α is given; when d > α, the images A' and B' are judged dissimilar, and when d < α, the images A' and B' are judged similar.
Example 2
The embodiment discloses an image similarity detection method based on text fusion under a label-free sample, which comprises the following steps of:
S1: as in S1 of Example 1, a TFCSiam model containing upper, middle and lower branches is constructed.
S2: as in S2 of Example 1, the model constructed in S1 is trained on a large number of images and texts.
S3: similarity measurement between images is carried out using the model trained in S2; the overall similarity measurement structure is shown in FIG. 7:
S3.1: the trained TFCSiam model is transformed into a TFCSiam characterization model, as shown in FIG. 8; the model used for similarity measurement is obtained by removing the upper branch and the Similarity module from the TFCSiam model and is called the TFCSiam characterization model;
s3.2: the Pair of graphics (Image-Text Pair) a 'and B' shown in fig. 9 and 10 are input into the tfcsim characterization model. Let the image and text of the graphic pair A' be respectively: x is x A′ And c A′ The method comprises the steps of carrying out a first treatment on the surface of the The image and text of the graphic pair B' are respectively: x is x B′ And c B′ . Deep features of the image are obtained:
y A′ =f θ (x A′ )=[0.87,0.032,…,0.913]
y B′ =f θ (x B′ )=[0.907,0.392,…,-0.510]
obtaining multi-mode text embedding:
obtaining an image-embedded representation of a fused text under a metric space
Wherein f θ (·)、The method comprises the following steps of respectively integrating an Encoder module, a Cross-Modal Encoder module and fusion operation; y is A ,、y B′ A vector of 2048 dimensions; />Is a 1024-dimensional vector; A. b is a 128-dimensional vector.
S3.3: given a threshold α=0.8, the cosine distance d=0.84 > α of A, B is calculated and images a ', B' are similar.
Example 3
The embodiment discloses an image similarity detection method based on text fusion under a label-free sample, which comprises the following steps of:
S1: as in S1 of Example 1, a TFCSiam model containing upper, middle and lower branches is constructed.
S2: as in S2 of Example 1, the model constructed in S1 is trained on a large number of images and texts.
S3: similarity measurement between images is carried out using the model trained in S2; the overall similarity measurement structure is shown in FIG. 7:
S3.1: the trained TFCSiam model is transformed into a TFCSiam characterization model, as shown in FIG. 8; the model used for similarity measurement is obtained by removing the upper branch and the Similarity module from the TFCSiam model and is called the TFCSiam characterization model;
s3.2: the Pair of graphics (Image-Text Pair) a 'and B' shown in fig. 9 and 11 are input into the tfcsim characterization model. Let the image and text of the graphic pair A' be respectively: x is x A′ And c A′ The method comprises the steps of carrying out a first treatment on the surface of the The image and text of the graphic pair B' are respectively:x B’ and c B’ . Deep features of the image are obtained:
y A′ =f θ (x A′ )=[0.87,0.032,…,0.913]
y B′ =f θ (x B′ )=[0.077,-0.502,…,-0.010]
obtaining multi-mode text embedding:
obtaining an image embedded representation of the fused text in the metric space:
wherein f θ (·)、The method comprises the following steps of respectively integrating an Encoder module, a Cross-Modal Encoder module and fusion operation; y is A′ 、y B′ A vector of 2048 dimensions; />Is a 1024-dimensional vector; A. b is a 128-dimensional vector.
S3.3: given a threshold α=0.8, the cosine distance d=0.17 < α of A, B is calculated, and images a ', B' are dissimilar.
The above description covers only the preferred embodiments of the present invention and is not intended to limit the present invention; various modifications and variations can be made by those skilled in the art. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention is included in the protection scope of the present invention.

Claims (10)

1. The image similarity detection method based on text fusion under the label-free sample is characterized by comprising the following steps of:
S1: constructing a TFCSiam model comprising upper, middle and lower branches, wherein the upper branch and the lower branch form an asymmetric structure, and a cross-modal module is selected in the middle branch to process text modality information;
S2: training the model constructed in step S1;
S2.1: randomly performing two data enhancements on an input image;
S2.2: inputting the two images produced by S2.1 into the upper branch and the lower branch respectively to extract deep features of the images;
S2.3: inputting the text information corresponding to the image into the middle branch and projecting the text into a subspace shared by image and text semantics to obtain multimodal text embedding features, thereby bringing the text and the image semantically closer;
S2.4: fusing the deep image features of S2.2 with the multimodal text embedding features of S2.3;
S2.5: mapping the multimodal information fused in S2.4 into a metric space to obtain a text-fused image embedding;
S2.6: feeding the text-fused image embedding of the upper branch into the next module to obtain an output result, and calculating the cosine distance between this output and the text-fused image embedding of the lower branch;
S2.7: taking the cosine distance of S2.6 as the loss value of the TFCSiam model and updating the model parameters with the aim of reducing the loss value, so as to optimize the model;
S3: performing similarity measurement between images by using the model trained in S2;
S3.1: transforming the trained TFCSiam model into a TFCSiam characterization model;
S3.2: inputting the image-text pairs A' and B' whose similarity is to be measured into the TFCSiam characterization model to obtain the text-fused image embedded representations A and B in the metric space;
S3.3: calculating the cosine distance d between A and B to determine whether the images A' and B' are similar.
2. The method for detecting image similarity based on text fusion under label-free samples as set forth in claim 1, wherein the upper branch and the lower branch of the TFCSiam model each comprise, in sequence, an Encoder module and a Multimodal Projector module, used respectively for extracting deep image features and for mapping the fused multimodal information into the metric space; the upper branch has one more Predictor module than the lower branch, and the lower branch adopts a stop-gradient strategy so that its parameters are not updated during training, the overall structure of the upper and lower branches being asymmetric.
3. The method for detecting image similarity based on text fusion under label-free samples according to claim 2, wherein the Encoder module extracts the deep features of the image using a ResNet-50 convolutional neural network.
4. The method for detecting image similarity based on text fusion under label-free samples according to claim 2, wherein the cross-modal module selected for the middle branch of the TFCSiam model is a Cross-Modal Encoder module.
5. The method for detecting image similarity based on text fusion under label-free samples according to claim 1, wherein the data enhancement modes of S2.1 include random cropping, flipping, color jittering, grayscale conversion, Gaussian blur and overexposure.
6. The method for detecting image similarity based on text fusion under label-free samples according to claim 1, wherein the fusion in S2.4 adopts a simple splicing (concatenation) operation.
7. The method for detecting image similarity based on text fusion under label-free samples according to claim 1, wherein step S2.7 updates the parameters by gradient descent.
8. The method for detecting image similarity based on text fusion under label-free samples according to claim 1, wherein the metric space is an ordered pair (M, d), where M is an arbitrary set and d is a metric on M, i.e., a function d: M × M → ℝ such that for any x, y, z in M the following conditions hold: d(x, y) ≥ 0; d(x, y) = 0 if and only if x = y; d(x, y) = d(y, x); d(x, z) ≤ d(x, y) + d(y, z), where x, y and z are elements of the set M and ℝ is the set of real numbers.
9. The method for detecting image similarity based on text fusion under label-free samples according to claim 1, wherein the TFCSiam characterization model of S3.1 is obtained by removing the upper branch and the Similarity module from the TFCSiam model.
10. The method for detecting image similarity based on text fusion under label-free samples according to claim 1, wherein the judging method of S3.3 is to set a threshold α; when the cosine distance d > α, the images A' and B' are judged dissimilar, and when the cosine distance d < α, the images A' and B' are judged similar.
CN202111482531.5A 2021-12-06 2021-12-06 Image similarity detection method based on text fusion under label-free sample Active CN114298159B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111482531.5A CN114298159B (en) 2021-12-06 2021-12-06 Image similarity detection method based on text fusion under label-free sample

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111482531.5A CN114298159B (en) 2021-12-06 2021-12-06 Image similarity detection method based on text fusion under label-free sample

Publications (2)

Publication Number Publication Date
CN114298159A CN114298159A (en) 2022-04-08
CN114298159B (en) 2024-04-09

Family

ID=80965165

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111482531.5A Active CN114298159B (en) 2021-12-06 2021-12-06 Image similarity detection method based on text fusion under label-free sample

Country Status (1)

Country Link
CN (1) CN114298159B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019148898A1 (en) * 2018-02-01 2019-08-08 北京大学深圳研究生院 Adversarial cross-media retrieving method based on restricted text space
CN111241338A (en) * 2020-01-08 2020-06-05 成都三零凯天通信实业有限公司 Depth feature fusion video copy detection method based on attention mechanism
WO2020143137A1 (en) * 2019-01-07 2020-07-16 北京大学深圳研究生院 Multi-step self-attention cross-media retrieval method based on restricted text space and system
CN112784092A (en) * 2021-01-28 2021-05-11 电子科技大学 Cross-modal image text retrieval method of hybrid fusion model
CN113177132A (en) * 2021-06-30 2021-07-27 中国海洋大学 Image retrieval method based on depth cross-modal hash of joint semantic matrix

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108694200B (en) * 2017-04-10 2019-12-20 北京大学深圳研究生院 Cross-media retrieval method based on deep semantic space

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019148898A1 (en) * 2018-02-01 2019-08-08 北京大学深圳研究生院 Adversarial cross-media retrieving method based on restricted text space
WO2020143137A1 (en) * 2019-01-07 2020-07-16 北京大学深圳研究生院 Multi-step self-attention cross-media retrieval method based on restricted text space and system
CN111241338A (en) * 2020-01-08 2020-06-05 成都三零凯天通信实业有限公司 Depth feature fusion video copy detection method based on attention mechanism
CN112784092A (en) * 2021-01-28 2021-05-11 电子科技大学 Cross-modal image text retrieval method of hybrid fusion model
CN113177132A (en) * 2021-06-30 2021-07-27 中国海洋大学 Image retrieval method based on depth cross-modal hash of joint semantic matrix

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Xu Ge; Xiao Yongqiang; Wang Tao; Chen Kaizhi; Liao Xiangwen; Wu Yunbing. Zero-shot image classification based on visual error and semantic attributes. Journal of Computer Applications, 2020, (04), full text. *
Liu Shangzheng; Liu Bin. Design of a cross-modal recognition system for image category labels based on generative adversarial networks. Modern Electronics Technique, 2020-04-15, (08), full text. *

Also Published As

Publication number Publication date
CN114298159A (en) 2022-04-08

Similar Documents

Publication Publication Date Title
CN109299274B (en) Natural scene text detection method based on full convolution neural network
CN109949317B (en) Semi-supervised image example segmentation method based on gradual confrontation learning
Chen et al. Pavement crack detection and recognition using the architecture of segNet
CN106547880B (en) Multi-dimensional geographic scene identification method fusing geographic area knowledge
CN111950453A (en) Optional-shape text recognition method based on selective attention mechanism
CN113158862B (en) Multitasking-based lightweight real-time face detection method
CN113065577A (en) Multi-modal emotion classification method for targets
CN109829414B (en) Pedestrian re-identification method based on label uncertainty and human body component model
CN114240955B (en) Semi-supervised cross-domain self-adaptive image segmentation method
CN111882620A (en) Road drivable area segmentation method based on multi-scale information
CN112541491A (en) End-to-end text detection and identification method based on image character region perception
Li et al. A review of deep learning methods for pixel-level crack detection
US20240161531A1 (en) Transformer-based multi-scale pedestrian re-identification method
CN112070174A (en) Text detection method in natural scene based on deep learning
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
CN116258990A (en) Cross-modal affinity-based small sample reference video target segmentation method
Hong et al. USOD10K: a new benchmark dataset for underwater salient object detection
CN114898472A (en) Signature identification method and system based on twin vision Transformer network
CN111242114B (en) Character recognition method and device
CN114298159B (en) Image similarity detection method based on text fusion under label-free sample
Andriyanov et al. Neural N etworks C ombinations for D etecting and H ighlighting D efects in S teel and R einforced C oncrete Products
CN110942463A (en) Video target segmentation method based on generation countermeasure network
CN116662924A (en) Aspect-level multi-mode emotion analysis method based on dual-channel and attention mechanism
CN113516118B (en) Multi-mode cultural resource processing method for joint embedding of images and texts
CN115115966A (en) Video scene segmentation method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant