CN114241273A - Multi-modal image processing method and system based on Transformer network and hypersphere space learning - Google Patents

Multi-modal image processing method and system based on Transformer network and hypersphere space learning

Info

Publication number
CN114241273A
Authority
CN
China
Prior art keywords
model
distillation
modal
loss
modality
Prior art date
Legal status
Granted
Application number
CN202111451939.6A
Other languages
Chinese (zh)
Other versions
CN114241273B (en)
Inventor
徐行
田加林
沈复民
申恒涛
Current Assignee
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN202111451939.6A
Publication of CN114241273A
Application granted
Publication of CN114241273B
Active legal status (current)
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/25 - Fusion techniques
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/217 - Validation; Performance evaluation; Active pattern learning techniques
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate

Abstract

The invention discloses a multi-modal image processing method and system based on a Transformer network and hypersphere space learning, which comprises: obtaining a pre-trained Transformer network model and deriving a teacher model from it; constructing a multi-branch model consisting of the teacher models and a multi-modal fusion model; extracting teacher distillation vectors, student distillation vectors, and the features and classification probabilities of each modal image in the unit hypersphere space; calculating the distillation loss, the inter-modal center alignment loss, the intra-modal uniformity loss and the classification loss of each modality, and updating the multi-modal fusion model; and generating a zero-sample cross-modal retrieval result with the updated multi-modal fusion model based on the images of the modality to be retrieved and the images of the modality to be queried. The method effectively improves the ability of the multi-modal fusion model to model and align multi-modal distributions and alleviates the modal discrepancies between different modalities, thereby realizing zero-sample cross-modal retrieval.

Description

Multi-modal image processing method and system based on Transformer network and hypersphere space learning
Technical Field
The invention relates to the field of deep learning, in particular to a multi-modal image processing method and system based on a Transformer network and hypersphere space learning.
Background
With the rapid development of science and technology, image data has become easier and easier to acquire. These image data come from various sources, viewpoints, styles, etc., forming multi-modal image datasets. For example, a sketch and a photo are images of two modalities with different styles: a sketch depicts an object with high abstraction and structural detail, while a photo depicts it with rich visual features and complex background information. Data processing and retrieval of multi-modal images are a research focus in the technical field of deep learning.
However, most existing multi-modal image processing methods assume that the categories contained in the images of the modality to be retrieved and of the modality to be queried in actual application are exactly the same as the data categories used during model training, and do not consider categories that never appear in the training data, which leads to poor retrieval results.
In addition, existing multi-modal image processing methods all adopt a deep convolutional neural network as the basic network architecture to extract features for downstream tasks. However, deep convolutional networks are limited by the locality of the convolution operation and cannot model the global structural information of objects. The recently proposed Transformer network has a multi-head self-attention mechanism, can effectively model the global structural information of objects, and performs well in image recognition tasks.
In summary, the existing multi-modal image processing method has the problems of unreasonable application settings and limited performance of the infrastructure.
Disclosure of Invention
In view of the above, the invention provides a multi-modal image processing method and system based on a Transformer network and hypersphere space learning, which solves the problems of unreasonable application settings and limited performance of the basic network structure in existing multi-modal image processing methods.
In order to solve the above problems, the technical solution of the present invention is a multi-modal image processing method based on a Transformer network and hypersphere space learning, comprising: acquiring a pre-trained Transformer network model, and fine-tuning the pre-trained Transformer network model in a self-supervised manner based on the image data of each modality to obtain a teacher model; constructing a multi-branch model capable of hypersphere space learning on multi-modal images, the multi-branch model consisting of the teacher model corresponding to each modality and a multi-modal fusion model; extracting the teacher distillation vector of each modal image based on the teacher models; extracting the student distillation vector of each modal image based on the multi-modal fusion model, and extracting the features and classification probability of each modal image in the unit hypersphere space based on the multi-modal fusion model; calculating the distillation loss, the inter-modal center alignment loss, the intra-modal uniformity loss and the classification loss of each modality according to the teacher distillation vectors, the student distillation vectors, the features and the classification probabilities; updating the multi-modal fusion model based on the distillation loss, the inter-modal center alignment loss, the intra-modal uniformity loss and the classification loss; and generating a zero-sample cross-modal retrieval result with the updated multi-modal fusion model based on the images of the modality to be retrieved and the images of the modality to be queried.
Further, constructing the multi-branch model capable of hypersphere space learning on multi-modal images, which consists of the teacher model corresponding to each modality and a multi-modal fusion model, comprises: the network structure of the teacher model is a Transformer network, the pre-trained Transformer network model is fine-tuned in a self-supervised training manner based on the image data of each modality, and a distillation token is added adaptively based on knowledge distillation; the multi-modal fusion model is proposed for the purpose of eliminating modal differences, its basic network structure is a Transformer network, a distillation token is added adaptively based on knowledge distillation, and a fusion token is added adaptively based on hypersphere learning.
Further, the distillation token and the fusion token are input embedding vectors of the Transformer network model and are trained through the multi-head self-attention layers and fully-connected layers of the Transformer network model.
Further, the output of the distillation token of the teacher model is used to compute the teacher distillation vector, the output of the distillation token of the multi-modal fusion model is used to compute the student distillation vector, and the output of the fusion token of the multi-modal fusion model is used to compute the features and classification probability in the unit hypersphere space.
Further, calculating a distillation loss, an inter-modality center alignment loss, an intra-modality uniformity loss, and a classification loss based on the teacher distillation vector, the student distillation vector, the features, and the classification probability, comprising: calculating distillation losses for each modality based on the teacher distillation vector and the student distillation vectors; calculating inter-modal center alignment loss and intra-modal uniformity loss based on the characteristics of each modality; the classification loss is calculated based on the classification probability of each modality.
Further, calculating the inter-modal center alignment loss and the intra-modal uniformity loss based on the features of each modality comprises: the features are computed from the output of the fusion token of the multi-modal fusion model, lie in the unit hypersphere space, and therefore have a vector norm of one; class centers are calculated for each category in each modality based on the features and normalized so that their vector norm is one, the class centers of the same category in different modalities are aligned, and the inter-modal center alignment loss is calculated; the intra-modal uniformity loss is calculated for the features of each modality based on the features and a radial basis function, and is defined as the logarithm of the average pairwise Gaussian potential of the features.
Further, calculating the classification loss based on the classification probability of each modality comprises: the classification probability is obtained by feeding the features of the corresponding modality through a linear classifier, wherein the weights of the linear classifier are shared by all modalities.
Accordingly, the multi-modal image processing method obtains the image data of each modality in the following manner: acquiring image samples of different modalities, including but not limited to manually drawn sketch samples and photo samples collected by an imaging device, to form a data set for training the parameters of the pre-trained Transformer network model.
Correspondingly, the multi-modal fusion model generates a zero-sample cross-modal retrieval result based on the image of the modality to be retrieved and the image of the modality to be queried, comprising: the multi-modal fusion model extracts the fusion token of the image to be retrieved based on the image of the modality to be retrieved; the multi-modal fusion model extracts the fusion token of the image to be queried based on the image of the modality to be queried; and the cosine similarity between the image to be retrieved and the image to be queried is calculated and sorted in descending order to generate the zero-sample cross-modal retrieval result.
Correspondingly, the invention provides a multi-modal image processing system based on a Transformer network and hypersphere space learning, which comprises: the imaging unit is used for acquiring multi-modal image samples; the data storage unit is used for storing multi-modal image samples; the neural network unit comprises a pre-trained Transformer network model, a teacher model and a multi-mode fusion model, wherein the teacher model is finely adjusted in a self-supervision mode; the data processing unit is used for extracting teacher distillation vectors of all the modal images based on the teacher model, extracting student distillation vectors of all the modal images based on the multi-modal fusion model, and extracting the characteristics and classification probability of all the modal images in a unit hypersphere space based on the multi-modal fusion model; meanwhile, calculating distillation loss, center alignment loss between modes, uniformity loss in modes and classification loss of each mode based on the teacher distillation vector, the student distillation vector, the features and the classification probability, and updating a multi-mode fusion model based on the distillation loss, the center alignment loss between modes, the uniformity loss in modes and the classification loss.
Further, the data processing unit calculates distillation loss based on the teacher distillation vector and the student distillation vector, calculates center alignment loss between modalities and uniformity loss within modalities based on the features, calculates classification loss based on the classification probability of each modality, weights the distillation loss, the center alignment loss between modalities, the uniformity loss within modalities and the classification loss based on a linear weighting mode, calculates a final loss value, and updates a multi-modality fusion model.
The invention discloses a multi-modal image processing method based on a Transformer network and hypersphere space learning, which adaptively adds a distillation token and a fusion token to the Transformer network, constructs a multi-branch model capable of hypersphere space learning on multi-modal images, and uses the global structure modeling capability to extract the teacher distillation vector, the student distillation vector, the feature on the unit hypersphere and the classification probability of each modal image sample. By calculating the distillation loss, the inter-modal center alignment loss, the intra-modal uniformity loss and the classification loss, the method effectively improves the ability of the multi-modal fusion model to model and align multi-modal distributions, alleviates the modal discrepancies between different modalities, and thereby realizes zero-sample cross-modal retrieval.
Drawings
FIG. 1 is a simplified flow chart of the multi-modal image processing method based on the Transformer network and hypersphere spatial learning according to the present invention;
FIG. 2 is a simplified unit connection diagram of the multi-modal image processing system based on the Transformer network and hypersphere spatial learning according to the present invention;
FIG. 3 is a simplified flow diagram of a multimodal fusion model in accordance with a preferred embodiment of the present invention;
FIG. 4 is a simplified flow diagram of a multi-branch model of a preferred embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood by those skilled in the art, the present invention will be further described in detail with reference to the accompanying drawings and specific examples.
Example 1
As shown in fig. 1, the present invention is a multimodal image processing method based on a Transformer network and hypersphere space learning, including steps S1 to S5.
S1: and acquiring a pre-trained Transformer network model, and fine-tuning the pre-trained Transformer network model in a self-supervision mode based on image data of each modality to obtain a teacher model.
The Transformer network was first proposed in the field of natural language processing, taking serialized text data as input. Recently, the Transformer network structure has been adapted to process image data and performs excellently in the field of computer vision. As shown in fig. 3 (ignoring the fusion and distillation tokens), the Transformer network structure consists of L layers of multi-head self-attention modules alternating with feed-forward neural network modules, each of which contains pre-layer normalization and residual connections. Each image is cut on a grid into a series of fixed-resolution patches; the patches are linearly projected, and trainable position encodings are added to form a sequence of patch tokens.
The patch tokens are the feature representations of the image patches; their information can be aggregated either by taking their average or by additionally training one token that serves as the feature representation of the whole image.
The present embodiment contains image data of two modalities: photos (typically acquired by an imaging device) and sketches (typically drawn by a human). As shown in fig. 3 (ignoring the fusion token), a distillation token is adaptively added to the pre-trained Transformer network to aggregate the information of the other patch tokens. The pre-trained Transformer network model is fine-tuned in a self-supervised manner with the photo data and the sketch data respectively, yielding teacher models for the photo modality and the sketch modality. The self-supervised approach, i.e. excluding label information, avoids degradation of model generalization during fine-tuning. More specifically, a "multi-crop" strategy is used to generate a set of different views V for each image (photo images when acquiring the photo-modality teacher model; sketch images when acquiring the sketch-modality teacher model), including two global views $x^{g,1}$ and $x^{g,2}$ with a resolution of 224 × 224 and 10 local views with a resolution of 96 × 96. Then, two models to be fine-tuned are initialized from the pre-trained Transformer network model and denoted model P and model T. The fine-tuning process follows a local-to-global strategy: all views in V are fed into model P, while only the global views are fed into model T. The optimization objective is defined as follows:
$$\min_{\theta_p}\ \mathrm{KL}\Big(\psi\big(Z_t(x)/\tau_t\big)\,\Big\|\,\psi\big(Z_p(x')/\tau_p\big)\Big)$$
wherein $Z_t, \tau_t, \theta_t$ and $Z_p, \tau_p, \theta_p$ respectively denote the output, temperature hyperparameter and parameters of model T and of model P, $\psi$ denotes the Softmax normalization operation, and KL denotes the Kullback-Leibler divergence. $x$ may be any global view, but not a local view; $x'$ may be any view, but must not be the same as $x$.
Minimizing the Kullback-Leibler divergence over $\theta_p$ makes the outputs of models P and T similar while updating only the parameters of model P. The above formula thus aligns the outputs of the local views with the global views, as well as the outputs of different global views. Finally, the parameters of model T are updated as an exponential moving average, $\theta_t \leftarrow \zeta\theta_t + (1-\zeta)\theta_p$, where $\zeta$ is a preset parameter in $(0, 1)$. The trained models T of the photo modality and of the sketch modality are taken as the teacher models of the corresponding modalities.
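For clarity, the following PyTorch-style sketch illustrates one possible implementation of such a self-supervised fine-tuning step (softened-output matching between models P and T plus an exponential-moving-average teacher update). It is an illustrative sketch under stated assumptions, not the patent's reference code; the function and parameter names (`model_p`, `model_t`, `zeta`, the temperature values) are assumptions.

```python
# Illustrative sketch: model P sees all views, model T only the global views,
# their softened outputs are matched, and T is an EMA copy of P.
import torch
import torch.nn.functional as F

def finetune_step(model_p, model_t, global_views, local_views,
                  tau_p=0.1, tau_t=0.04, zeta=0.996):
    with torch.no_grad():
        t_out = [model_t(v) for v in global_views]              # Z_t(x), x global only
    p_out = [model_p(v) for v in global_views + local_views]    # Z_p(x'), x' any view

    loss, n_terms = 0.0, 0
    for i, zt in enumerate(t_out):
        target = F.softmax(zt / tau_t, dim=-1)                  # psi(Z_t / tau_t)
        for j, zp in enumerate(p_out):
            if i == j:                                          # skip x' == x
                continue
            log_pred = F.log_softmax(zp / tau_p, dim=-1)        # psi(Z_p / tau_p)
            # cross-entropy with soft targets = KL divergence up to a constant
            loss = loss + torch.sum(-target * log_pred, dim=-1).mean()
            n_terms += 1
    loss = loss / n_terms
    loss.backward()            # theta_p is then updated by an external optimizer

    with torch.no_grad():      # theta_t <- zeta * theta_t + (1 - zeta) * theta_p
        for pt, pp in zip(model_t.parameters(), model_p.parameters()):
            pt.mul_(zeta).add_((1.0 - zeta) * pp)
    return loss
```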
S2: and constructing a multi-branch model capable of performing hypersphere space learning based on the multi-modal images, wherein the multi-branch model is composed of a teacher model corresponding to each modality and a multi-modal fusion model.
In step S1, Transformer teacher models with distillation tokens are obtained for the photo and sketch modalities, denoted $g^I$ and $g^S$ respectively, where I and S denote the photo modality and the sketch modality. As shown in FIG. 4, the teacher model of the photo modality consists of a backbone network $f^I$ and a projection network $h^I$, and the teacher model of the sketch modality consists of a backbone network $f^S$ and a projection network $h^S$.
Further, a multi-branch model capable of hypersphere space learning on multi-modal images is constructed, consisting of the teacher models of the two modalities and a multi-modal fusion model. Unlike a modality-specific teacher model, the multi-modal fusion model aims to eliminate the distribution differences between modalities: it processes the image data of all modalities simultaneously and outputs similarly distributed features for all modalities. As shown in fig. 3, its basic network structure is also a Transformer network; a distillation token is added adaptively based on knowledge distillation, and a fusion token is added adaptively based on hypersphere space learning. Thus, as shown in FIG. 4, it consists of a backbone network $f^F$ and two projection networks $h^D$ and $h^F$ (for convenience of illustration, $h^D$ is drawn twice, symmetrically, in FIG. 4). To simplify the formulas, $g^D$ denotes the model composed of $f^F$ and $h^D$, and $g^F$ denotes the model composed of $f^F$ and $h^F$. Similar to the distillation token, the fusion token is also a trainable input embedding vector, but their functions differ: the distillation token of a teacher model is used to compute the teacher distillation vector, the distillation token of the multi-modal fusion model is used to compute the student distillation vectors of all modalities, and the output of the fusion token of the multi-modal fusion model is used to compute the features on the unit hypersphere and the classification probabilities.
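As an illustration of how the distillation and fusion tokens described above could be attached to a Transformer backbone, the following PyTorch-style sketch prepends two extra learnable tokens to the patch-token sequence. The class and argument names (`FusionViT`, `embed_dim`, `vit_backbone`) are illustrative assumptions, not taken from the patent.

```python
# Illustrative sketch: prepend a distillation token and a fusion token to the
# patch tokens before running the Transformer blocks (cf. Fig. 3).
import torch
import torch.nn as nn

class FusionViT(nn.Module):
    def __init__(self, vit_backbone, embed_dim=768):
        super().__init__()
        self.backbone = vit_backbone              # pre-trained Transformer blocks
        self.distill_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.fusion_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        nn.init.trunc_normal_(self.distill_token, std=0.02)
        nn.init.trunc_normal_(self.fusion_token, std=0.02)

    def forward(self, patch_tokens):
        # patch_tokens: (B, N, D) linearly projected patches + position encodings
        B = patch_tokens.size(0)
        extra = torch.cat([self.distill_token.expand(B, -1, -1),
                           self.fusion_token.expand(B, -1, -1)], dim=1)
        x = torch.cat([extra, patch_tokens], dim=1)   # (B, N + 2, D)
        x = self.backbone(x)                          # multi-head self-attention layers
        distill_out, fusion_out = x[:, 0], x[:, 1]    # outputs of the two extra tokens
        return distill_out, fusion_out
```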
S3: and respectively extracting teacher distillation vectors of all the modal images based on all the teacher models, extracting student distillation vectors of all the modal images based on the multi-modal fusion model, and extracting the characteristics and classification probability of all the modal images in a unit hypersphere space based on the multi-modal fusion model.
As shown in FIG. 4, in both the teacher models and the multi-modal fusion model, the distillation token is learned by the backbone network ($f^I$, $f^S$ or $f^F$) and projected by the projection network ($h^I$, $h^S$ or $h^D$) into a K-dimensional space (K fixed to 65536) to obtain a distillation vector. Taking any i-th photo $x_i^I$ or any i-th sketch $x_i^S$ as an example, $z_i^I$ and $z_i^S$ denote the teacher distillation vectors of the photo and the sketch, and $\tilde{z}_i^I$ and $\tilde{z}_i^S$ denote the student distillation vectors of the photo and the sketch, respectively. In addition, the fusion token is learned by $f^F$ and projected by $h^F$ onto the unit hypersphere to obtain the feature of the image in the unit hypersphere space; similarly, $u_i^I$ and $u_i^S$ denote the features of a photo and a sketch, respectively. These features are classified by a linear classifier shared by all modalities to obtain the classification probabilities.
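The following sketch shows, under the same assumptions as above, how the outputs of the two extra tokens could be projected into the K-dimensional distillation space and onto the unit hypersphere; `proj_d`, `proj_f` and the classifier correspond to $h^D$, $h^F$ and the shared linear classifier, while the concrete layer types and dimensions are assumptions.

```python
# Illustrative sketch: project the distillation token into a K-dimensional space
# and the fusion token onto the unit hypersphere, then classify the feature.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionHeads(nn.Module):
    def __init__(self, fusion_vit, embed_dim=768, k_dim=65536, num_classes=100):
        super().__init__()
        self.fusion_vit = fusion_vit                   # backbone f^F with extra tokens
        self.proj_d = nn.Linear(embed_dim, k_dim)      # h^D: distillation projection
        self.proj_f = nn.Linear(embed_dim, embed_dim)  # h^F: hypersphere projection
        self.classifier = nn.Linear(embed_dim, num_classes)  # shared linear classifier

    def forward(self, patch_tokens):
        distill_out, fusion_out = self.fusion_vit(patch_tokens)
        z_student = self.proj_d(distill_out)                 # student distillation vector
        u = F.normalize(self.proj_f(fusion_out), dim=-1)     # feature on the unit hypersphere
        logits = self.classifier(u)                          # pre-softmax classification scores
        return z_student, u, logits
```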
S4: calculating distillation loss, center alignment loss between modes, uniformity loss in modes and classification loss of each mode based on the teacher distillation vector, the student distillation vector, the features and the classification probability, and updating a multi-mode fusion model based on the distillation loss, the center alignment loss between modes, the uniformity loss in modes and the classification loss.
The training process of the multi-modal fusion model is carried out under the supervision of the teacher models of the respective modalities. Because the teacher models are pre-trained in a self-supervised manner, they are optimized to find the global structural information specific to each image. However, the purpose of the multi-modal fusion model is to eliminate the modal differences between the distributions of the same class but different modalities, which inevitably requires the multi-modal fusion model to pay more attention to the more discriminative local structures shared by the whole class and to gradually forget the structural information specific to each image. This phenomenon, known as "catastrophic forgetting", is therefore avoided by knowledge distillation.
This example contains photos and sketches: the distillation loss of the photos $\mathcal{L}_{kd}^I$ is calculated based on the teacher distillation vectors and student distillation vectors of the photos, and the distillation loss of the sketches $\mathcal{L}_{kd}^S$ is calculated based on the teacher distillation vectors and student distillation vectors of the sketches. Taking the photos as an example, given a batch of data consisting of N images, knowledge distillation matches the probability distributions of the teacher distillation vectors and the student distillation vectors, and the distillation loss of the photos is calculated as follows:

$$\mathcal{L}_{kd}^I = \frac{1}{N}\sum_{i=1}^{N} \mathrm{KL}\Big(\psi\big(z_i^I/\tau_I\big)\,\Big\|\,\psi\big(\tilde{z}_i^I/\tau_D\big)\Big)$$

wherein $\tau_I$ and $\tau_D$ respectively denote the temperature hyperparameters of the photo-modality teacher model and of the multi-modal fusion model, and the remaining symbols are as defined above. Similarly, $\mathcal{L}_{kd}^S$ can be calculated from a batch of sketches. Thus, the overall distillation loss of the multi-branch model is defined as follows:

$$\mathcal{L}_{kd} = \mathcal{L}_{kd}^I + \mathcal{L}_{kd}^S$$
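A minimal sketch of this distillation loss, assuming the softened teacher and student distributions are matched with a KL-style term per modality and then summed; the temperature values are placeholders.

```python
# Illustrative sketch: per-modality distillation loss between teacher and
# student distillation vectors, summed over the photo and sketch modalities.
import torch
import torch.nn.functional as F

def distill_loss(z_teacher, z_student, tau_teacher=0.04, tau_student=0.1):
    target = F.softmax(z_teacher.detach() / tau_teacher, dim=-1)
    log_pred = F.log_softmax(z_student / tau_student, dim=-1)
    # KL(target || prediction) averaged over the batch (constant entropy term dropped)
    return torch.sum(-target * log_pred, dim=-1).mean()

def total_distill_loss(z_t_photo, z_s_photo, z_t_sketch, z_s_sketch):
    return distill_loss(z_t_photo, z_s_photo) + distill_loss(z_t_sketch, z_s_sketch)
```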
as shown in fig. 4, both the photographs and the sketch map are projected into the unit hypersphere space, and it is desirable that the photographs and the sketch map can be grouped by category. When the images of all classes are aggregated individually, their distribution is linearly separable in hypersphere space. Thus, a linear classifier can be used to classify features, calculating the classification penalty as follows:
$$\mathcal{L}_{cls} = -\,\mathbb{E}_{(x_i,\,y_i)}\big[\log P\big(y_i \mid g^F(x_i);\,\theta_c\big)\big]$$

wherein $\mathbb{E}$ denotes the mathematical expectation, $x_i$ denotes an arbitrary i-th photo or sketch, $y_i$ denotes the class label of $x_i$, $\theta_c$ denotes the parameters of the linear classifier, and $P(y_i \mid g^F(x_i); \theta_c)$ denotes the probability with which the linear classifier with parameters $\theta_c$ classifies $x_i$ as $y_i$. Thus, the classification loss aligns the intra-modal and inter-modal distributions through a linear classifier shared by all modalities.
In addition, the inter-modal center alignment loss $\mathcal{L}_{ca}$ is calculated based on the features of each modality; it explicitly requires the feature distributions of the respective modalities to overlap on the hypersphere:

$$c_y^{*} \leftarrow \lambda\,c_y^{*} + (1-\lambda)\,\frac{1}{N_{y}}\sum_{i:\,y_i=y} u_i^{*}$$

$$c_y^{*} \leftarrow c_y^{*}\,/\,\big\|c_y^{*}\big\|_2$$

$$\mathcal{L}_{ca} = \frac{1}{|Y|}\sum_{y\in Y}\big\|\,c_y^{I}-c_y^{S}\,\big\|_2^2$$

wherein, to simplify the notation, $*$ denotes the photo modality I or the sketch modality S, $\lambda$ is the weight of the exponential moving average, $N_{y}$ denotes the number of samples $u_i^{*}$ in a batch of image data whose class label is $y$, $c_y^{*}$ denotes the class center, and $Y$ is the set of classes formed by the labels $y_i$. The second equation normalizes the class center by its L2 norm so that its norm is 1, i.e. maps the class center back onto the unit hypersphere. Combined with the classification loss $\mathcal{L}_{cls}$, the center alignment loss $\mathcal{L}_{ca}$ enables the multi-modal fusion model to align the feature distributions both across modalities and within each modality.
However, alignment and uniformity are both key properties of features in the hypersphere space, since uniformity means that the representation capacity of the hypersphere is fully exploited. Specifically, the intra-modal uniformity loss is calculated for each batch of features of each modality based on the features of the respective modality and a radial basis function, and is defined as the logarithm of the average pairwise Gaussian potential of the features. Finally, the overall intra-modal uniformity loss $\mathcal{L}_{u}$ is the sum of the intra-modal uniformity losses $\mathcal{L}_{u}^{I}$ and $\mathcal{L}_{u}^{S}$ of the respective modalities:

$$\mathcal{L}_{u} = \mathcal{L}_{u}^{I} + \mathcal{L}_{u}^{S}$$

$$\mathcal{L}_{u}^{*} = \log\,\mathbb{E}_{(u_i^{*},\,u_j^{*})}\big[\,G_t\big(u_i^{*},u_j^{*}\big)\big]$$

$$G_t\big(u_i^{*},u_j^{*}\big) = \exp\big(-t\,\big\|u_i^{*}-u_j^{*}\big\|_2^2\big)$$

wherein, to simplify the notation, $*$ denotes the photo modality I or the sketch modality S, $u_i^{*}$ and $u_j^{*}$ denote the features of any two images of the same modality, $t$ is a parameter fixed to 2, and $G_t(u_i^{*},u_j^{*})$ is the Gaussian potential calculated for an arbitrary pair of images $u_i^{*}$ and $u_j^{*}$. It is worth noting that the overall intra-modal uniformity loss is the sum of the intra-modal uniformity losses of the individual modalities, rather than a constraint on all features regardless of modality. This loss design is reasonable because, ideally, the distribution of every modality on the hypersphere tends towards uniformity, while the distributions of same-class features of different modalities overlap (which is enforced by the classification loss and the inter-modal center alignment loss).
Finally, the overall objective function of the multi-modal fusion model is a linear weighting of the four losses described above, defined as follows:

$$\mathcal{L} = \mathcal{L}_{kd} + \mathcal{L}_{cls} + \lambda_1\,\mathcal{L}_{ca} + \lambda_2\,\mathcal{L}_{u}$$

wherein $\lambda_1$ and $\lambda_2$ are respectively the hyperparameters weighting the inter-modal center alignment loss and the intra-modal uniformity loss. After the value of the overall objective function of the multi-modal fusion model is calculated, the parameters of the multi-modal fusion model are updated according to a stochastic gradient descent algorithm to obtain the trained multi-modal fusion model.
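Putting the pieces together, the following sketch shows one training step that combines the four losses with the linear weighting above and updates the fusion model by stochastic gradient descent. It reuses the helper sketches from the previous steps and assumes the fusion model returns (student distillation vector, hypersphere feature, logits); the lambda values are placeholders.

```python
# Illustrative sketch: one training step of the multi-modal fusion model.
import torch

def train_step(photos, sketches, labels_p, labels_s,
               teacher_p, teacher_s, fusion_model, classifier,
               centers_p, centers_s, optimizer, lambda_1=1.0, lambda_2=0.5):
    with torch.no_grad():
        z_t_p = teacher_p(photos)            # teacher distillation vectors
        z_t_s = teacher_s(sketches)
    z_s_p, u_p, _ = fusion_model(photos)     # student vectors + hypersphere features
    z_s_s, u_s, _ = fusion_model(sketches)

    l_kd  = total_distill_loss(z_t_p, z_s_p, z_t_s, z_s_s)
    l_cls = classification_loss(classifier, u_p, labels_p, u_s, labels_s)
    c_p = ema_class_centers(centers_p, u_p, labels_p)
    c_s = ema_class_centers(centers_s, u_s, labels_s)
    classes = torch.cat([labels_p, labels_s]).unique()
    l_ca  = center_alignment_loss(c_p, c_s, classes)
    l_u   = total_uniformity_loss(u_p, u_s)

    loss = l_kd + l_cls + lambda_1 * l_ca + lambda_2 * l_u   # overall objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), c_p.detach(), c_s.detach()           # persist updated centers
```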
S5: the multi-mode fusion model generates a zero-sample cross-mode retrieval result based on the image of the to-be-detected mode and the image of the to-be-queried mode.
After the training of the multi-modal fusion model is completed, the backbone network $f^F$ of the trained multi-modal fusion model is used to extract the fusion token vector of the images of each modality: the fusion token vector of the image to be retrieved is extracted based on the image of the modality to be retrieved, and the fusion token vector of the image to be queried is extracted based on the image of the modality to be queried. Finally, the cosine similarity between the image to be retrieved and the image to be queried is calculated from the two fusion token vectors, and the zero-sample cross-modal retrieval result is generated by sorting the similarities in descending order.
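A minimal sketch of this retrieval step: the query and gallery fusion-token features are compared by cosine similarity and the gallery is ranked in descending order. The names and the top-K cut-off are illustrative.

```python
# Illustrative sketch: rank gallery (photo) images for each sketch query by
# cosine similarity of their fusion-token features.
import torch
import torch.nn.functional as F

def retrieve(query_feats, gallery_feats, top_k=100):
    # query_feats: (Q, D), gallery_feats: (G, D); both from the fusion token.
    q = F.normalize(query_feats, dim=-1)
    g = F.normalize(gallery_feats, dim=-1)
    sims = q @ g.t()                                   # cosine similarity matrix (Q, G)
    ranked = sims.argsort(dim=-1, descending=True)     # descending order per query
    return ranked[:, :top_k]                           # indices of the top-K gallery images
```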
Example 2
This example experimentally verifies Example 1. Three mainstream data sets in the zero-sample cross-modal retrieval field are used as training and testing data sets, namely Sketchy, TU-Berlin and QuickDraw. All three contain data and labels of the photo modality and the sketch modality for the zero-sample photo-sketch retrieval task. Specifically, Sketchy initially consists of 75471 sketches and 12500 photos in 125 classes, with a pairing relationship between sketches and photos; the photo collection of Sketchy was later expanded to 73002. TU-Berlin consists of 20000 sketches and 204489 photos in 250 classes, so the numbers of sketches and photos are severely unbalanced and the sketches are highly abstract. QuickDraw is the largest of the three data sets, consisting of 330000 sketches and 204000 photos in 110 classes, with the most abstract sketches.
For Sketchy, there are two kinds of training/test class splits: one randomly selects 25 classes as test classes, and the other selects 21 classes that do not overlap with the ImageNet classes as test classes. For simplicity, the former is referred to as Sketchy and the latter as Sketchy-NO. TU-Berlin is similar to Sketchy and randomly selects 20 classes as test classes. QuickDraw is similar to Sketchy-NO and selects 30 classes that do not overlap with the ImageNet classes as test classes. Furthermore, the real-valued features are binarized by iterative quantization (ITQ) for comparison with hashing methods. The cosine distance and the Hamming distance are used to compute the similarity of the real-valued and binary representations, respectively. The evaluation criteria are precision (Prec) and mean average precision (mAP); Prec@K and mAP@K denote the precision and mean average precision computed over the top K retrieval results.
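For reference, the following sketch shows one common way to compute Prec@K and mAP@K from a ranked result list with binary (same-class) relevance; the exact mAP@K variant used in the experiments is not specified in the text, so this formulation is an assumption.

```python
# Illustrative sketch: Prec@K and mAP@K over ranked gallery labels.
import numpy as np

def prec_at_k(ranked_labels, query_label, k):
    rel = (np.asarray(ranked_labels)[:k] == query_label)
    return float(rel.mean())

def ap_at_k(ranked_labels, query_label, k):
    rel = (np.asarray(ranked_labels)[:k] == query_label)
    if rel.sum() == 0:
        return 0.0
    precisions = np.cumsum(rel) / (np.arange(len(rel)) + 1.0)
    return float((precisions * rel).sum() / rel.sum())

def map_at_k(all_ranked_labels, query_labels, k):
    return float(np.mean([ap_at_k(r, q, k) for r, q in zip(all_ranked_labels, query_labels)]))
```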
Further, the claimed system is denoted TVT in this embodiment, and the other retrieval methods are widely used sketch retrieval methods. As shown in Tables 1 and 2, TVT achieves consistent and significant improvements over all existing methods. Most zero-sample sketch retrieval methods report results only on Sketchy and TU-Berlin, both of which randomly select test classes. Specifically, on these two data sets TVT consistently outperforms the best existing method (DSN), improving the mAP@all score by 11.1% and 0.5%, respectively. On the more realistic and challenging data sets (Sketchy-NO and QuickDraw), TVT achieves even larger improvements: on Sketchy-NO, TVT increases the mAP@200 score from 0.501 to 0.531 compared with DSN, and on the large-scale QuickDraw, TVT improves the mAP@all score by nearly 100%. Given the large scale of these data sets and the fixed class splits, these results show that the large improvement of TVT is neither incidental nor caused by split bias. TVT also gives the best results compared with the hashing methods. The improvement achieved by TVT is even more pronounced for metrics that consider only the top 100 or 200 candidate samples. On Sketchy and TU-Berlin, TVT far exceeds DSN, with Prec@100 scores increased by 13.1% and 13.0%, respectively. On QuickDraw, the mAP@200 and Prec@200 scores of TVT increase by 112.2% and 330.9%, respectively, compared with Dey et al. These results mean that correct results appear with higher probability among the top 100 or 200 retrieval results, which is well suited to the retrieval task. All of these comparisons demonstrate that TVT can effectively align intra-modal and inter-modal distributions while maintaining uniformity, and thus achieves satisfactory generalization to unseen classes.
Table 1: comparison of TVT and other 10 existing zero sample sketch retrieval methods on Sketchy and TU-Berlin. The subscript "b" indicates the result obtained from the binary representation and "-" indicates that the method reported no relevant result. The best and second best results are shown in bold and underlined, respectively.
Table 2: comparison of TVT and the other two methods on QuickDraw. The best results are shown in bold.
Accordingly, as shown in fig. 2, the present invention provides a multimodal image processing system based on a Transformer network and hypersphere spatial learning, comprising: the imaging unit is used for acquiring multi-modal image samples; the data storage unit is used for storing multi-modal image samples; the neural network unit comprises a pre-trained Transformer network model, a teacher model and a multi-mode fusion model, wherein the teacher model is finely adjusted in a self-supervision mode; the data processing unit is used for extracting teacher distillation vectors of all the modal images based on the teacher model, extracting student distillation vectors of all the modal images based on the multi-modal fusion model, and extracting the characteristics and classification probability of all the modal images in a unit hypersphere space based on the multi-modal fusion model; meanwhile, calculating distillation loss, center alignment loss between modes, uniformity loss in modes and classification loss of each mode based on the teacher distillation vector, the student distillation vector, the features and the classification probability, and updating a multi-mode fusion model based on the distillation loss, the center alignment loss between modes, the uniformity loss in modes and the classification loss.
Further, the data processing unit calculates distillation loss based on the teacher distillation vector and the student distillation vector, calculates center alignment loss between modalities and uniformity loss within modalities based on the features, calculates classification loss based on the classification probability of each modality, weights the distillation loss, the center alignment loss between modalities, the uniformity loss within modalities and the classification loss based on a linear weighting mode, calculates a final loss value, and updates a multi-modality fusion model.
The multi-modal image processing method and system based on the transform network and hypersphere spatial learning provided by the embodiment of the invention are described in detail above. The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description. It should be noted that, for those skilled in the art, it is possible to make various improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the scope of the claims of the present invention.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

Claims (9)

1. A multi-modal image processing method based on a Transformer network and hypersphere space learning is characterized by specifically comprising the following steps of:
step S1: acquiring a pre-trained Transformer network model, and fine-tuning the pre-trained Transformer network model in a self-supervision manner based on image data of each modality to obtain a teacher model corresponding to each modality;
step S2: constructing a multi-branch model capable of performing hypersphere space learning based on multi-modal images, wherein the multi-branch model is composed of a teacher model corresponding to each modality and a multi-modal fusion model;
step S3: respectively extracting teacher distillation vectors of the images of each mode based on a teacher model corresponding to each mode, extracting student distillation vectors of the images of each mode based on the multi-mode fusion model, and extracting the characteristics and classification probability of the images of each mode in a unit hypersphere space based on the multi-mode fusion model;
step S4: calculating distillation loss, inter-modality center alignment loss, intra-modality uniformity loss and classification loss of each modality based on teacher distillation vectors of each modality image, student distillation vectors of each modality image, features of each modality image in unit hypersphere space and classification probability of each modality image in unit hypersphere space, and updating the multi-modality fusion model based on the distillation loss, the inter-modality center alignment loss, the intra-modality uniformity loss and the classification loss;
step S5: generating a zero-sample cross-modal retrieval result with the updated multi-modal fusion model based on the image of the modality to be retrieved and the image of the modality to be queried.
2. The method of claim 1, wherein the image data of each modality in step S1 comprises photos and sketches, and fine-tuning the pre-trained Transformer network model in a self-supervised manner based on the image data of each modality specifically comprises:
fine-tuning the pre-trained Transformer network model with the photo data and the sketch data respectively in a self-supervised manner, i.e. excluding label information, to avoid degradation of model generalization during fine-tuning, so as to obtain the teacher models of the photo modality and of the sketch modality; that is, a "multi-crop" strategy is used to generate a set of different views V for each photo image or sketch image, including two global views $x^{g,1}$ and $x^{g,2}$ with a resolution of 224 × 224 and 10 local views with a resolution of 96 × 96; then, two models to be fine-tuned are initialized from the pre-trained Transformer network model and denoted model P and model T; the fine-tuning process follows a local-to-global strategy, feeding all views in V into model P and only the global views into model T, and the optimization objective is defined as follows:

$$\min_{\theta_p}\ \mathrm{KL}\Big(\psi\big(Z_t(x)/\tau_t\big)\,\Big\|\,\psi\big(Z_p(x')/\tau_p\big)\Big)$$

wherein $Z_t, \tau_t, \theta_t$ and $Z_p, \tau_p, \theta_p$ respectively denote the output, temperature hyperparameter and parameters of model T and of model P, $\psi$ denotes the Softmax normalization operation, and KL denotes the Kullback-Leibler divergence; $x$ may be any global view but not a local view, and $x'$ may be any view but must not be the same as $x$; minimizing the Kullback-Leibler divergence over $\theta_p$ therefore aligns the outputs of the local views with the global views as well as the outputs of different global views; finally, the parameters of model T are updated as an exponential moving average $\theta_t \leftarrow \zeta\theta_t + (1-\zeta)\theta_p$, where $\zeta$ is a preset parameter in $(0, 1)$; and the trained models T of the photo modality and of the sketch modality are respectively taken as the teacher models of the corresponding modalities.
3. The multi-modal image processing method based on the Transformer network and hypersphere space learning of claim 2, wherein step S2 is specifically implemented as follows:
the teacher model of the photo modality and the teacher model of the sketch modality acquired in step S1 are denoted $g^I$ and $g^S$ respectively, wherein I and S respectively denote the photo modality and the sketch modality; the teacher model of the photo modality consists of a backbone network $f^I$ and a projection network $h^I$, and the teacher model of the sketch modality consists of a backbone network $f^S$ and a projection network $h^S$;
a multi-branch model capable of hypersphere learning on multi-modal images is further constructed, consisting of the teacher models of the two modalities and a multi-modal fusion model; the basic network structure of the multi-modal fusion model is also a Transformer network, a distillation token is added adaptively based on knowledge distillation, and a fusion token is added adaptively based on hypersphere space learning; the multi-modal fusion model structurally consists of a backbone network $f^F$ and two projection networks $h^D$ and $h^F$, where $g^D$ denotes the model composed of $f^F$ and $h^D$ and $g^F$ denotes the model composed of $f^F$ and $h^F$; the distillation tokens of the teacher models are used to compute the teacher distillation vectors, the distillation token of the multi-modal fusion model is used to compute the student distillation vectors of all modalities, and the output of the fusion token of the multi-modal fusion model is used to compute the features on the unit hypersphere and the classification probabilities.
4. The multi-modal image processing method based on the Transformer network and hypersphere space learning of claim 3, wherein step S3 is specifically implemented as follows:
in both the teacher models and the multi-modal fusion model, the distillation token is learned by the backbone network $f^I$, $f^S$ or $f^F$ and projected by the projection network $h^I$, $h^S$ or $h^D$ into a K-dimensional space to obtain a distillation vector; taking any i-th photo $x_i^I$ or any i-th sketch $x_i^S$ as an example, $z_i^I$ and $z_i^S$ respectively denote the teacher distillation vectors of the photo and the sketch, and $\tilde{z}_i^I$ and $\tilde{z}_i^S$ respectively denote the student distillation vectors of the photo and the sketch; in addition, the fusion token is learned by $f^F$ and projected by $h^F$ onto the unit hypersphere to obtain the feature of the image in the unit hypersphere space, with $u_i^I$ and $u_i^S$ respectively denoting the features of a photo and a sketch; and the features are classified by a linear classifier shared by all modalities to obtain the classification probabilities.
5. The multi-modal image processing method based on the Transformer network and hypersphere space learning of claim 4, wherein step S4 is specifically implemented as follows:
the training process of the multi-modal fusion model is carried out under the supervision of the teacher model corresponding to each modality; because the teacher models are pre-trained in a self-supervised manner, they are optimized to find the global structural information specific to each image, whereas the purpose of the multi-modal fusion model is to eliminate the modal differences between the distributions of the same category but different modalities, which inevitably requires the multi-modal fusion model to pay more attention to the more discriminative local structures shared by the whole category and to gradually forget the structural information specific to each image; this phenomenon, known as "catastrophic forgetting", is avoided by knowledge distillation;
the distillation loss of the photos $\mathcal{L}_{kd}^I$ is calculated based on the teacher distillation vectors and student distillation vectors of the photos, and the distillation loss of the sketches $\mathcal{L}_{kd}^S$ is calculated based on the teacher distillation vectors and student distillation vectors of the sketches; taking the photos as an example, given a batch of data consisting of N images, knowledge distillation matches the probability distributions of the teacher distillation vectors and the student distillation vectors, and the distillation loss of the photos is calculated as follows:

$$\mathcal{L}_{kd}^I = \frac{1}{N}\sum_{i=1}^{N} \mathrm{KL}\Big(\psi\big(z_i^I/\tau_I\big)\,\Big\|\,\psi\big(\tilde{z}_i^I/\tau_D\big)\Big)$$

wherein $\tau_I$ and $\tau_D$ respectively denote the temperature hyperparameters of the photo-modality teacher model and of the multi-modal fusion model, and the remaining symbols are as defined above; similarly, $\mathcal{L}_{kd}^S$ can also be calculated from a batch of sketches; thus, the overall distillation loss of the multi-modal fusion model is defined as:

$$\mathcal{L}_{kd} = \mathcal{L}_{kd}^I + \mathcal{L}_{kd}^S$$

both the photos and the sketches are projected onto the unit hypersphere and are expected to cluster by category; when the images of each category cluster separately, their distribution is linearly separable in the hypersphere space, so a linear classifier is used to classify the features and the classification loss is calculated as follows:

$$\mathcal{L}_{cls} = -\,\mathbb{E}_{(x_i,\,y_i)}\big[\log P\big(y_i \mid g^F(x_i);\,\theta_c\big)\big]$$

wherein $\mathbb{E}$ denotes the mathematical expectation, $x_i$ denotes an arbitrary i-th photo or sketch, $y_i$ denotes the class label of $x_i$, $\theta_c$ denotes the parameters of the linear classifier, and $P(y_i \mid g^F(x_i); \theta_c)$ denotes the probability with which the linear classifier with parameters $\theta_c$ classifies $x_i$ as $y_i$;
in addition, the inter-modal center alignment loss $\mathcal{L}_{ca}$ is calculated based on the features of the images of each modality in the unit hypersphere space; it explicitly requires the feature distributions of the respective modalities to overlap on the hypersphere:

$$c_y^{*} \leftarrow \lambda\,c_y^{*} + (1-\lambda)\,\frac{1}{N_{y}}\sum_{i:\,y_i=y} u_i^{*}$$

$$c_y^{*} \leftarrow c_y^{*}\,/\,\big\|c_y^{*}\big\|_2$$

$$\mathcal{L}_{ca} = \frac{1}{|Y|}\sum_{y\in Y}\big\|\,c_y^{I}-c_y^{S}\,\big\|_2^2$$

wherein, to simplify the notation, $*$ denotes the photo modality I or the sketch modality S, $\lambda$ is the weight of the exponential moving average, $N_{y}$ denotes the number of samples $u_i^{*}$ in a batch of image data whose class label is $y$, $c_y^{*}$ denotes the class center, and $Y$ is the set of classes formed by the labels $y_i$; the second equation normalizes the class center by its L2 norm, i.e. maps the class center back onto the unit hypersphere;
in addition, the intra-modal uniformity loss is calculated for the features of each modality based on the features of the images of the respective modality in the unit hypersphere space and a radial basis function, wherein the intra-modal uniformity loss is defined as the logarithm of the average pairwise Gaussian potential of the features; finally, the overall intra-modal uniformity loss $\mathcal{L}_{u}$ is the sum of the intra-modal uniformity losses $\mathcal{L}_{u}^{I}$ and $\mathcal{L}_{u}^{S}$ of the respective modalities:

$$\mathcal{L}_{u} = \mathcal{L}_{u}^{I} + \mathcal{L}_{u}^{S}$$

$$\mathcal{L}_{u}^{*} = \log\,\mathbb{E}_{(u_i^{*},\,u_j^{*})}\big[\,G_t\big(u_i^{*},u_j^{*}\big)\big]$$

$$G_t\big(u_i^{*},u_j^{*}\big) = \exp\big(-t\,\big\|u_i^{*}-u_j^{*}\big\|_2^2\big)$$

wherein $u_i^{*}$ and $u_j^{*}$ denote the features of any two images of the same modality, $t$ is a parameter fixed to 2, and $G_t(u_i^{*},u_j^{*})$ is the Gaussian potential calculated for an arbitrary pair of images $u_i^{*}$ and $u_j^{*}$;
finally, the overall objective function $\mathcal{L}$ of the multi-modal fusion model is a linear weighting of the four losses above, defined as follows:

$$\mathcal{L} = \mathcal{L}_{kd} + \mathcal{L}_{cls} + \lambda_1\,\mathcal{L}_{ca} + \lambda_2\,\mathcal{L}_{u}$$

wherein $\lambda_1$ and $\lambda_2$ are respectively the hyperparameters weighting the inter-modal center alignment loss and the intra-modal uniformity loss; after the value of the overall objective function of the multi-modal fusion model is calculated, the parameters of the multi-modal fusion model are updated according to a stochastic gradient descent algorithm to obtain the updated multi-modal fusion model.
6. The multi-modal image processing method based on the Transformer network and hypersphere space learning of claim 5, wherein step S5 is specifically implemented as follows:
after the multi-modal fusion model is updated, the backbone network $f^F$ of the updated multi-modal fusion model is used to extract the fusion token vectors of the images of each modality: the fusion token vector of the image to be retrieved is extracted based on the image of the modality to be retrieved, and the fusion token vector of the image to be queried is extracted based on the image of the modality to be queried; finally, the cosine similarity between the image to be retrieved and the image to be queried is calculated from the two fusion token vectors, and the zero-sample cross-modal retrieval result is generated by sorting the similarities in descending order.
7. The multimodal image processing method based on Transformer network and hypersphere space learning of claim 6, wherein K is fixed to 65536.
8. A multi-modal image processing system based on a Transformer network and hypersphere spatial learning, for implementing the multi-modal image processing method based on the Transformer network and hypersphere spatial learning according to any one of claims 1-7, wherein the system comprises:
the imaging unit is used for acquiring multi-modal image samples;
the data storage unit is used for storing the multi-modal image samples;
the neural network unit comprises a pre-trained Transformer network model, a teacher model corresponding to each modality obtained by fine-tuning in a self-supervised manner, and a multi-modal fusion model;
the data processing unit is used for extracting teacher distillation vectors of all the modal images based on the teacher model corresponding to all the modalities, extracting student distillation vectors of all the modal images based on the multi-modality fusion model, and extracting the characteristics and classification probability of all the modal images in a unit hypersphere space based on the multi-modality fusion model; meanwhile, calculating distillation loss, center alignment loss among modes, uniformity loss in modes and classification loss based on teacher distillation vectors, student distillation vectors of all mode images, characteristics of all mode images in unit hypersphere space and classification probability thereof, and updating a multi-mode fusion model based on the distillation loss, the center alignment loss among the modes, the uniformity loss in the modes and the classification loss.
9. The multimodal image processing system based on Transformer network and hypersphere space learning of claim 8, wherein the data processing unit calculates distillation loss based on teacher distillation vector and student distillation vector of each modality image, calculates center alignment loss among modalities and uniformity loss in modalities based on features of each modality image in unit hypersphere space, calculates classification loss based on classification probability of each modality image in unit hypersphere space, and weights distillation loss, center alignment loss among modalities, uniformity loss in modalities and classification loss based on linear weighting, calculates final loss value for updating multimodal fusion model.
CN202111451939.6A 2021-12-01 2021-12-01 Multi-modal image processing method and system based on Transformer network and hypersphere space learning Active CN114241273B (en)

Publications (2)

CN114241273A, published 2022-03-25
CN114241273B (granted), published 2022-11-04


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021191908A1 (en) * 2020-03-25 2021-09-30 Yissum Research Development Company Of The Hebrew University Of Jerusalem Ltd. Deep learning-based anomaly detection in images
CN111428071A (en) * 2020-03-26 2020-07-17 电子科技大学 Zero-sample cross-modal retrieval method based on multi-modal feature synthesis
WO2021214935A1 (en) * 2020-04-23 2021-10-28 日本電信電話株式会社 Learning device, search device, learning method, search method, and program
CN111581405A (en) * 2020-04-26 2020-08-25 电子科技大学 Cross-modal generalization zero sample retrieval method for generating confrontation network based on dual learning
CN113032601A (en) * 2021-04-15 2021-06-25 金陵科技学院 Zero sample sketch retrieval method based on discriminant improvement
CN113360701A (en) * 2021-08-09 2021-09-07 成都考拉悠然科技有限公司 Sketch processing method and system based on knowledge distillation

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
H WANG et al.: "Stacked Semantic-Guided Network for Zero-Shot Sketch-Based Image Retrieval", arXiv preprint, http://arxiv.org/pdf/1904.01971v1.pdf *
HAIXUAN GUO et al.: "LogBERT: Log Anomaly Detection via BERT", 2021 International Joint Conference on Neural Networks (IJCNN) *
JIALIN TIAN et al.: "TVT: Three-Way Vision Transformer through Multi-Modal Hypersphere Learning for Zero-Shot Sketch-Based Image Retrieval", AAAI-22 Technical Track on Computer Vision II *
XU XING et al.: "Deep adversarial metric learning for cross-modal retrieval", World Wide Web - Internet and Web Information Systems *
LIU WEIMIN et al.: "Anomaly detection method for multimodal batch process measurement data based on DHSC", CIESC Journal (in Chinese) *
ZHANG YANYONG et al.: "Perception and computing for autonomous driving based on multi-modal fusion", Journal of Computer Research and Development (in Chinese) *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114937178A (en) * 2022-06-30 2022-08-23 抖音视界(北京)有限公司 Multi-modality-based image classification method and device, readable medium and electronic equipment
CN114937178B (en) * 2022-06-30 2023-04-18 抖音视界有限公司 Multi-modality-based image classification method and device, readable medium and electronic equipment
CN114999637A (en) * 2022-07-18 2022-09-02 华东交通大学 Pathological image diagnosis method and system based on multi-angle coding and embedded mutual learning
CN114999637B (en) * 2022-07-18 2022-10-25 华东交通大学 Pathological image diagnosis method and system based on multi-angle coding and embedded mutual learning
CN115272881A (en) * 2022-08-02 2022-11-01 大连理工大学 Long-tail remote sensing image target identification method based on dynamic relation distillation
CN115294407A (en) * 2022-09-30 2022-11-04 山东大学 Model compression method and system based on preview mechanism knowledge distillation
CN115496928A (en) * 2022-09-30 2022-12-20 云南大学 Multi-modal image feature matching method based on multi-feature matching
CN115294407B (en) * 2022-09-30 2023-01-03 山东大学 Model compression method and system based on preview mechanism knowledge distillation
CN115953586A (en) * 2022-10-11 2023-04-11 香港中文大学(深圳)未来智联网络研究院 Method, system, electronic device and storage medium for cross-modal knowledge distillation
CN116628507A (en) * 2023-07-20 2023-08-22 腾讯科技(深圳)有限公司 Data processing method, device, equipment and readable storage medium
CN116628507B (en) * 2023-07-20 2023-10-27 腾讯科技(深圳)有限公司 Data processing method, device, equipment and readable storage medium
CN117636074A (en) * 2024-01-25 2024-03-01 山东建筑大学 Multi-mode image classification method and system based on feature interaction fusion

Also Published As

Publication number Publication date
CN114241273B (en) 2022-11-04

Similar Documents

Publication Publication Date Title
CN114241273B (en) Multi-modal image processing method and system based on Transformer network and hypersphere space learning
CN108960330B (en) Remote sensing image semantic generation method based on fast regional convolutional neural network
CN111127385B (en) Medical information cross-modal Hash coding learning method based on generative countermeasure network
CN108132968A (en) Weakly supervised learning method for associating network text with image semantic units
WO2019015246A1 (en) Image feature acquisition
JP2018513491A (en) Fine-grained image classification by investigation of bipartite graph labels
CN110942091B (en) Semi-supervised few-sample image classification method for searching reliable abnormal data center
CN109241377A (en) Text document representation method and device based on deep learning topic information enhancement
CN105975573A (en) KNN-based text classification method
CN113360701B (en) Sketch processing method and system based on knowledge distillation
CN112949740B (en) Small sample image classification method based on multilevel measurement
CN114444600A (en) Small sample image classification method based on memory enhanced prototype network
CN108595546A (en) Based on semi-supervised across media characteristic study search method
CN109960732A (en) Deep discrete hashing cross-modal retrieval method and system based on robust supervision
CN116129141A (en) Medical data processing method, apparatus, device, medium and computer program product
CN110598022B (en) Image retrieval system and method based on robust deep hash network
CN106021402A (en) Multi-modal multi-class Boosting frame construction method and device for cross-modal retrieval
CN114510594A (en) Traditional pattern subgraph retrieval method based on self-attention mechanism
CN105117735A (en) Image detection method in big data environment
Kokilambal Intelligent content based image retrieval model using adadelta optimized residual network
Liu et al. Deep convolutional neural networks for regular texture recognition
Rice Convolutional neural networks for detection and classification of maritime vessels in electro-optical satellite imagery
CN110941994B (en) Pedestrian re-identification integration method based on meta-class-based learner
CN113627522A (en) Image classification method, device and equipment based on relational network and storage medium
CN113095229B (en) Self-adaptive pedestrian re-identification system and method for unsupervised domain

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant