CN114241273A

CN114241273A - Multi-modal image processing method and system based on Transformer network and hypersphere space learning

Info

Publication number: CN114241273A
Application number: CN202111451939.6A
Authority: CN
Inventors: 徐行; 田加林; 沈复民; 申恒涛
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2021-12-01
Filing date: 2021-12-01
Publication date: 2022-03-25
Anticipated expiration: 2041-12-01
Also published as: CN114241273B

Abstract

The invention discloses a multi-modal image processing method and system based on a Transformer network and hypersphere space learning, which comprises the steps of obtaining a pre-trained Transformer network model to obtain a teacher model; constructing a multi-branch model consisting of a teacher model and a multi-mode fusion model; extracting teacher's distillation vector, student's distillation vector, and the features and classification probability of each modal image in unit hypersphere space; calculating distillation loss, center alignment loss among the modes, uniformity loss in the modes and classification loss of each mode, and updating the multi-mode fusion model; and generating a zero-sample cross-modal retrieval result by adopting the updated multi-modal fusion model based on the image of the to-be-detected modality and the image of the to-be-queried modality. The method can effectively improve the capability of modeling and aligning multi-modal distribution of the multi-modal fusion model, and eliminate the problem of modal difference among different modes, thereby realizing zero-sample cross-modal retrieval.

Description

Multi-modal image processing method and system based on Transformer network and hypersphere space learning

Technical Field

The invention relates to the field of deep learning, in particular to a multi-modal image processing method and system based on a Transformer network and hypersphere space learning.

Background

With the rapid development of scientific technology, image data becomes more and more easily acquired. These image data have various sources, perspectives, styles, etc., forming a multi-modal image dataset. For example, a sketch and a photo are two modality images with different styles, the sketch has high abstraction and structural details depicting objects, and the photo has rich visual features and complex background information depicting objects. Data processing and retrieval of multi-modal images are a research focus in the technical field of deep learning.

However, most of the existing multi-modal image processing methods assume that the categories included in the image of the modality to be queried and the image of the modality to be queried during actual application are completely consistent with the data categories used during model training, and do not consider the situation that the categories included in the training data during actual application are not included, which results in poor retrieval results.

In addition, the existing multi-modal image processing methods all adopt a deep convolutional neural network as a basic network architecture to extract features for downstream tasks. However, the performance of deep convolutional networks is limited by the locality of the convolution operation and cannot model the global structural information of the object. The recently proposed Transformer network has a multi-head self-attention mechanism, can effectively model the global structure information of an object, and has good performance in an image recognition task.

In summary, the existing multi-modal image processing method has the problems of unreasonable application settings and limited performance of the infrastructure.

Disclosure of Invention

In view of the above, the invention provides a multimodal image processing method and a multimodal image processing system based on a Transformer network and hypersphere spatial learning, and solves the problems of unreasonable application setting and performance limitation of an infrastructure network structure in the existing multimodal image processing method.

In order to solve the above problems, the technical method of the present invention is a multi-modal image processing method based on a Transformer network and hypersphere spatial learning, comprising: acquiring a pre-trained Transformer network model, and fine-tuning the pre-trained Transformer network model in a self-supervision mode based on image data of each modality to obtain a teacher model; constructing a multi-branch model capable of performing hypersphere space learning based on multi-modal images, wherein the multi-branch model is composed of a teacher model corresponding to each modality and a multi-modal fusion model; extracting teacher distillation vectors of the modal images based on the teacher model; extracting a student distillation vector of each modal image based on the multi-modal fusion model, and extracting the characteristics and classification probability of each modal image in a unit hypersphere space based on the multi-modal fusion model; calculating distillation loss, center alignment loss among the modes, uniformity loss in the modes and classification loss of each mode according to the teacher distillation vector, the student distillation vector, the characteristics and the classification probability; updating the multi-modal fusion model based on distillation loss, inter-modal center alignment loss, intra-modal uniformity loss, and classification loss; and the updated multi-mode fusion model generates a zero-sample cross-mode retrieval result based on the image of the to-be-detected mode and the image of the to-be-queried mode.

Further, a multi-branch model capable of hypersphere space learning based on multi-modal images is constructed, which is composed of the teacher model corresponding to each modality and a multi-modal fusion model, and comprises: the network structure of the teacher model is a Transf ormer network, the pre-trained Transformer network model is finely adjusted in a self-supervision training mode based on image data of each mode, and distillation marks are adaptively added based on knowledge distillation; the multi-modal fusion model is a model which is proposed based on the purpose of eliminating modal differences, the basic network structure of the multi-modal fusion model is a Transformer network, a distillation mark is added adaptively based on knowledge distillation, and a fusion mark is added adaptively based on hypersphere learning.

Further, the distillation marker and the fusion marker are input embedded vectors of a Transformer network model and are obtained by training a multi-head self-attention layer and a full-junction layer based on the Transformer network model.

Further, the output of the distillation mark of the teacher model is used for calculating the teacher distillation vector, the output of the distillation mark of the multi-mode fusion model is used for calculating the student distillation vector, and the output of the fusion mark of the multi-mode model is used for calculating the characteristics and classification probability of the unit hypersphere space.

Further, calculating a distillation loss, an inter-modality center alignment loss, an intra-modality uniformity loss, and a classification loss based on the teacher distillation vector, the student distillation vector, the features, and the classification probability, comprising: calculating distillation losses for each modality based on the teacher distillation vector and the student distillation vectors; calculating inter-modal center alignment loss and intra-modal uniformity loss based on the characteristics of each modality; the classification loss is calculated based on the classification probability of each modality.

Further, calculating inter-modality center alignment loss and intra-modality uniformity loss based on features of each modality, comprising: the features are calculated from the output of the fusion tag of the multi-modal fusion model, are located in a unit hypersphere space, and have the property of a vector mode of one; calculating class centers for each category in each modality based on the characteristics, normalizing the class centers to enable the vector mode of the class centers to be one, aligning the class centers of the same category in different modalities, and calculating the center alignment loss between the modalities; the intra-modal uniformity loss comprising: intra-modal uniformity loss is calculated for the features of each modality based on the features and the radial basis functions, wherein intra-modal uniformity loss is defined as an average of the logarithms of the gaussian potentials of the paired features.

Further, calculating a classification loss based on the classification probability of each modality, comprising: the classification probability is obtained by outputting through a linear classifier based on the features of the corresponding modes, wherein weights of the linear classifier are shared by all the modes.

Accordingly, the multi-modal image processing method obtains image data of each modality in a manner that: acquiring image samples of different modalities, including but not limited to manually drawn sketch samples and photo samples collected by an imaging device, and forming a data set for training parameters of the pre-trained transform network model.

Correspondingly, the multi-modal fusion model generates a zero-sample cross-modal retrieval result based on the image of the to-be-detected modality and the image of the to-be-queried modality, and the zero-sample cross-modal retrieval result comprises the following steps: the multi-mode fusion model extracts a fusion mark of an image to be detected based on the image of the mode to be detected; the multi-mode fusion model extracts a fusion mark of an image to be queried based on the image of the modality to be queried; and calculating the cosine similarity of the image to be detected and the image to be inquired, and generating the zero sample cross-modal retrieval result after sequencing from big to small.

Correspondingly, the invention provides a multi-modal image processing system based on a Transformer network and hypersphere space learning, which comprises: the imaging unit is used for acquiring multi-modal image samples; the data storage unit is used for storing multi-modal image samples; the neural network unit comprises a pre-trained Transformer network model, a teacher model and a multi-mode fusion model, wherein the teacher model is finely adjusted in a self-supervision mode; the data processing unit is used for extracting teacher distillation vectors of all the modal images based on the teacher model, extracting student distillation vectors of all the modal images based on the multi-modal fusion model, and extracting the characteristics and classification probability of all the modal images in a unit hypersphere space based on the multi-modal fusion model; meanwhile, calculating distillation loss, center alignment loss between modes, uniformity loss in modes and classification loss of each mode based on the teacher distillation vector, the student distillation vector, the features and the classification probability, and updating a multi-mode fusion model based on the distillation loss, the center alignment loss between modes, the uniformity loss in modes and the classification loss.

Further, the data processing unit calculates distillation loss based on the teacher distillation vector and the student distillation vector, calculates center alignment loss between modalities and uniformity loss within modalities based on the features, calculates classification loss based on the classification probability of each modality, weights the distillation loss, the center alignment loss between modalities, the uniformity loss within modalities and the classification loss based on a linear weighting mode, calculates a final loss value, and updates a multi-modality fusion model.

The invention discloses a multi-modal image processing method based on a Transformer network and hypersphere space learning, which adaptively adds a distillation mark and a fusion mark to the Transformer network, constructs a multi-branch model capable of hypersphere space learning based on multi-modal images, extracts teacher distillation vectors, student distillation vectors, characteristics on unit hypersphere space and classification probability of each modal image sample by utilizing global structure modeling capacity, effectively improves the modeling and multi-modal alignment distribution capacity of a multi-modal fusion model by calculating distillation loss, center alignment loss among modals, intra-modal uniformity loss and classification loss, eliminates the modal difference problem among different modals, and further realizes zero-sample cross-modal retrieval.

Drawings

FIG. 1 is a simplified flow chart of the multi-modal image processing method based on the Transformer network and hypersphere spatial learning according to the present invention;

FIG. 2 is a simplified unit connection diagram of the multi-modal image processing system based on the Transformer network and hypersphere spatial learning according to the present invention;

FIG. 3 is a simplified flow diagram of a multimodal fusion model in accordance with a preferred embodiment of the present invention;

FIG. 4 is a simplified flow diagram of a multi-branch model of a preferred embodiment of the present invention.

Detailed Description

In order to make the technical solutions of the present invention better understood by those skilled in the art, the present invention will be further described in detail with reference to the accompanying drawings and specific examples.

Example 1

As shown in fig. 1, the present invention is a multimodal image processing method based on a Transformer network and hypersphere space learning, including steps S1 to S5.

S1: and acquiring a pre-trained Transformer network model, and fine-tuning the pre-trained Transformer network model in a self-supervision mode based on image data of each modality to obtain a teacher model.

The Transformer network is first proposed in the field of natural language processing, taking serialized text data as input. Recently, the Transformer network structure has been improved to process image data, and is excellent in the field of computer vision. As shown in fig. 3 (ignoring fusion and distillation markers), the Transformer network structure consists of L layers of multi-headed self-attention modules alternating with feed-forward neural network modules, each of which contains pre-layer normalization and residual concatenation. Each image is cut into a series of image blocks with fixed resolution according to grids, the image blocks are linearly projected, and trainable position coding is added to form a serialized block marker vector.

The serialized block labels are the feature representations of the image blocks, and can be used for aggregating the information of the block labels by taking the average value, or can be used for aggregating the information of the block labels by additionally training one label to serve as the feature representation of the whole image.

The present embodiment contains image data for two modalities, a photograph (typically acquired by an imaging device) and a sketch (typically rendered by a human). As shown in fig. 3 (ignoring fusion tags), distillation tags were adaptively added in a pre-trained transform network to aggregate information of other block tags. And fine-tuning the pre-training Transformer network model in a self-supervision mode by respectively adopting photo and sketch data to obtain teacher models in a photo mode and a sketch mode. The self-supervision approach, i.e., excludes tag information, avoids degradation of model generalization during the fine-tuning process. More specifically, a "multi-cropping" strategy is used to generate a set of different views (V) for each image (photo image when a photo modality teacher model is acquired; sketch image when a sketch modality teacher model is acquired), including two global views x with a resolution of 224 x 224_g,1And x_g,2And 10 partial views with a resolution of 96 × 96. Then, two models to be fine-tuned are initialized from the pre-trained Transformer network model and are marked as a model P and a model T. The fine-tuning process follows a strategy from local to global, entering all views in V into the model P, and only the wholeAnd inputting the model T by the local view, and defining an optimization formula as follows:

wherein Z is_t，τ_t，θ_tAnd Z_p，τ_p，θ_pRespectively representing the output and temperature over-parameter of the models T and P, psi represents Softmax normalization operation, and KL represents Kullback-Leibler divergence. x may be any global view, but not a local view. xS can be any view, but may not be the same as x.

This shows that the Kullback-Leibler divergence is minimized, i.e. the outputs of the models P and T are made similar and the parameters of the model P are updated. Thus, the above formula is used to align the output of the local view and the global view, as well as to align the different global views. Finally, the parameters of the model T are updated in an exponential moving average manner: theta_t＝ζθ_t+(1-ζ)θ_p. Where ζ is a preset parameter between (0, 1). And respectively obtaining the trained model T of the photo mode and the sketch mode as a teacher model of the corresponding mode.

S2: and constructing a multi-branch model capable of performing hypersphere space learning based on the multi-modal images, wherein the multi-branch model is composed of a teacher model corresponding to each modality and a multi-modal fusion model.

In step S1, a transform web teacher model with distillation labels in photo and sketch modalities is obtained, and is respectively marked as g_IAnd g_S. Wherein I and S represent the photo modality and the sketch modality, respectively. As shown in FIG. 4, the teacher model of the photo modality is represented by a skeleton network f_IAnd a projection network h_IThe teacher model in the sketch mode is composed of a skeleton network f_SAnd a projection network h_SAnd (4) forming.

Further, a multi-branch model capable of hypersphere learning based on multi-modal images is constructed, which is taught by the two modalitiesThe teacher model and a multi-modal fusion model. Unlike the teacher model of a specific modality, the multi-modality fusion model aims to eliminate the distribution difference between modalities, process image data of all modalities simultaneously, and output features with similar distribution for all modalities. As shown in fig. 3, its basic network structure is also a Transformer network, and distillation markers are adaptively added based on knowledge distillation, and fusion markers are adaptively added based on hypersphere space learning. Thus, as shown in FIG. 4, it consists of a skeleton network f_FAnd two projection networks h_DAnd h_FConstitution (h in FIG. 4 for convenience of illustration)_DTwo are drawn symmetrically). To simplify the formula, use g_DDenotes f_FAnd h_DConstructed model, g_FDenotes f_FAnd h_FAnd (4) forming a model. Similar to the distillation marker, the fusion marker is also a trainable input embedding vector. But the functions of the teacher model and the multi-modal fusion model are different, the distillation mark of the teacher model is used for calculating a teacher distillation vector, the distillation mark of the multi-modal fusion model is used for calculating student distillation vectors of all the modalities, and the output of the fusion mark of the multi-modal fusion model is used for calculating the characteristics and classification probability of the hypersphere space of a unit.

S3: and respectively extracting teacher distillation vectors of all the modal images based on all the teacher models, extracting student distillation vectors of all the modal images based on the multi-modal fusion model, and extracting the characteristics and classification probability of all the modal images in a unit hypersphere space based on the multi-modal fusion model.

As shown in FIG. 4, in either the teacher model or the multi-modal fusion model, the distillation mark is represented by a skeleton network (f)_I，f_SOr f_F) Learned and obtained by the projection network (h)_I，h_SOr h_D) Projected into a space of K dimensions (K fixed to 65536) to obtain a distillation vector. In any ith photograph

Or any ith sketch

For the purpose of example only,

and

the teacher's distillation vectors of the photo and sketch are shown separately, and

and

student distillation vectors representing photographs and sketch, respectively. In addition, the fusion tag is represented by f_FIs learned and is obtained from_FAnd projecting the image to a unit hypersphere space to obtain the characteristics of the image in the unit hypersphere space. In the same way as in the example,

and

respectively, representing the characteristics of a photograph and a sketch. These features are classified by a linear classifier shared by all modes to obtain classification probabilities.

S4: calculating distillation loss, center alignment loss between modes, uniformity loss in modes and classification loss of each mode based on the teacher distillation vector, the student distillation vector, the features and the classification probability, and updating a multi-mode fusion model based on the distillation loss, the center alignment loss between modes, the uniformity loss in modes and the classification loss.

The training process of the multi-modal fusion model is carried out under the supervision of teacher models of all the modalities. Because the teacher models of the respective modalities are pre-trained in an unsupervised manner, they are optimized to find global structural information specific to each image. However, the purpose of the multi-modal fusion model is to eliminate modal differences between the distributions of the same class but different modalities, which will inevitably require the multi-modal fusion model to pay more attention to the more discriminative local structures shared by the whole class, gradually forgetting the structural information specific to each image. We therefore avoided this phenomenon, known as "catastrophic forgetting", by knowledge distillation.

This example contains photographs and sketch maps, and distillation loss of photographs was calculated based on teacher's distillation vector and student's distillation vector of photographs

Calculating distillation loss of sketch map based on distillation vector of teacher and distillation vector of students

Taking a photograph as an example, given a batch of data consisting of N images, knowledge distillation matches the probability distribution of teacher distillation vectors and student distillation vectors, and distillation loss of the photograph is calculated as follows:

wherein, tau_IAnd τ_DRespectively representing the temperature super parameters of the photo mode teacher model and the multi-mode fusion model, and the rest symbols are consistent with the definition of the content. In a similar manner to that described above,

or from a batch of sketch calculations. Thus, the overall distillation loss of the multi-branch model can be defined as follows:

as shown in fig. 4, both the photographs and the sketch map are projected into the unit hypersphere space, and it is desirable that the photographs and the sketch map can be grouped by category. When the images of all classes are aggregated individually, their distribution is linearly separable in hypersphere space. Thus, a linear classifier can be used to classify features, calculating the classification penalty as follows:

wherein the content of the first and second substances,

representing a mathematical expectation, x_iRepresenting an arbitrary ith photograph or sketch, y_iDenotes x_iClass label of theta_cA parameter, P (y), representing the linear classifier_i|g_F(x_i)；θ_c) With the expression parameter theta_cLinear classifier of (2) will x_iClassification as y_iThe probability of (c). Thus, the classification penalty can be aligned intra-and inter-modal distribution by a linear classifier shared by all the modalities.

In addition, inter-modality center alignment loss is calculated based on characteristics of each modality

It is explicitly required that the feature distributions of the individual modes overlap on the hypersphere.

Wherein, for simplifying representation, the image mode I or the sketch mode S is represented, lambda is the weight of the exponential moving average, N_yiIndicating that the category label in a batch of image data is y_jOf (2) a sample

The number of the (c) component(s),

representing class centers, Y being represented by respective classes Y_iA set of classes. The second line of equations above represents the L2 norm normalized class center, making the norm of the class center 1, i.e., mapping the class center back to the unit hypersphere space.

And

in combination, the multimodal fusion model is able to align feature distributions both inter-modality and intra-modality.

However, both alignment and uniformity are key properties of features in hypersphere space, since uniformity means that the representation capability of hypersphere space is fully exploited. In particular, intra-modal uniformity loss is calculated for each batch of features for each modality based on the features and radial basis functions of the respective modality, wherein intra-modal uniformity loss is defined as a desired logarithm of the gaussian potential of the paired features. Finally, the overall intra-modal uniformity loss

Is intra-modal uniformity loss for each mode

And

the sum of (1):

wherein, for the sake of simplified representation, a photo modality I or a sketch modality S,

and

any two images representing the same modality, t is a parameter fixed to 2,

for arbitrary pairs of images

And

the gaussian potential is calculated. It is worth noting that the total intra-modal uniformity loss is the sum of intra-modal uniformity losses for the individual modalities, rather than constraining all features non-modally. Such a loss function design is reasonable because ideally, the distribution of all modes in hypersphere space tends to be uniform, but the distribution of homogeneous features of different modes has overlapping positions (determined by the classification loss and the inter-mode center alignment loss).

Finally, the overall objective function of the multi-modal fusion model is a linear weighting of the four losses described above, defined as follows:

wherein λ is₁And λ₂Respectively, are hyper-parameters for inter-modal center alignment loss and intra-modal uniformity loss. After the value of the overall objective function of the multi-modal fusion model is calculated, the parameters of the multi-modal fusion model are updated according to a stochastic gradient descent algorithm,and obtaining the trained multi-modal fusion model.

S5: the multi-mode fusion model generates a zero-sample cross-mode retrieval result based on the image of the to-be-detected mode and the image of the to-be-queried mode.

When the training of the multi-modal fusion model is completed, the skeleton network f of the trained multi-modal fusion model can be used_FExtracting the fusion mark vector of the image of each mode, extracting the fusion mark vector of the image to be detected based on the image of the mode to be detected, and extracting the fusion mark vector of the image to be inquired based on the image of the mode to be inquired. And finally, calculating the cosine similarity of the image to be detected and the image to be inquired according to the two fusion mark vectors, and generating the zero sample cross-modal retrieval result after sequencing from big to small.

Example 2

This example was experimentally verified on the basis of example 1. In the embodiment, three mainstream data sets in the zero-sample cross-modal search field are used as training and testing data sets, namely Sketchy, TU-Berlin, and QuickDraw. They both contain data and tags for the photo modality and the sketch modality for the zero sample photo-sketch retrieval task. Specifically, Sketchy initially consists of 125 classes of 75471 sketch maps and 12500 photographs, with a pairing relationship between the sketch maps and the images. The collection of photos of Sketchy was then expanded to 73002. TU-Berlin is composed of 20000 sketch maps and 204489 photos of 250 categories, so the quantity of sketch maps and photos is seriously unbalanced, and the abstraction degree of sketch maps is high; QuickDraw is the largest of the three data sets, consisting of 330000 sketch maps of 110 classes and 204000 photographs, with the sketch maps being the most abstract.

For Sketchy, there are two types of training class and test class division: one randomly selects 25 classes as test classes, and the other selects 21 classes that do not overlap with the ImageNet class as test classes. For simplicity, we refer to the former as Sketchy and the latter as Sketchy-NO. TU-Berlin is similar to Sketchy, and randomly selects 20 classes as test classes. QuickDraw is similar to Sketchy-NO, and 30 classes that do not overlap with the ImageNet class are selected as test classes. Furthermore, we perform binarization processing on real-valued features by iterative quantization (ITQ) for comparison with a hash method. The cosine and hamming distances are used to calculate the similarity of the real and binary representations, respectively. The evaluation criteria used recall (Prec) and mean recall (mAP), Prec @ K and mAP @ K representing recall and mean recall calculated from the top K results of the search.

Further, the claimed system is defined as TVT in this embodiment, and the rest of the searching methods are the high-usage sketch searching methods. Experimental results as shown in tables 1 and 2, the TVT process showed continuous and significant improvement over all existing processes. Most zero sample sketch retrieval methods were only experimented with Sketchy and TU-Berlin, both datasets randomly selecting test classes. Specifically, on both datasets, TVT consistently defeated the best existing method (DSN) with an improvement of the mAP @ all score of 11.1% and 0.5%. However, on more realistic and challenging data sets (Sketchy-NO and QuickDraw), the TVT achieved more significant improvements. On Sketchy-NO, TVT increases the mAP @200 score from 0.501 to 0.531 compared to DSN. Furthermore, on a large scale of QuickDraw, TVT achieves a dramatic improvement of the mAP @ all score of nearly 100%. Given the large-scale nature of these datasets and the limitations of fixed class segmentation, these results effectively demonstrate that the vast improvement in TVT is not incidental, nor caused by segmentation bias. TVT also gives the best results compared to the hash method. The improvement achieved by TVT is even more pronounced when comparing results using an index that considers only the top 100 or 200 candidate samples. On Sketchy and TU-Berlin, the fraction of TVT far above DSN, Prec @100 increased by 13.1% and 13.0%, respectively. On QuickDraw, the mAP @200 and Prec @200 scores of TVT increased by 112.2% and 330.9%, respectively, compared to Dey et al. These results mean that the correct result has a higher probability of appearing in the top 100 or 200 search results, which is well suited for the search task. All of these comparisons can demonstrate that TVTs can effectively align intra-and inter-modal distributions without loss of uniformity, and then achieve satisfactory generalization over unseen classes.

Table 1: comparison of TVT and other 10 existing zero sample sketch retrieval methods on Sketchy and TU-Berlin. The subscript "b" indicates the result obtained from the binary representation and "-" indicates that the method reported no relevant result. The best and second best results are shown in bold and underlined, respectively.

Table 2: comparison of TVT and the other two methods on QuickDraw. The best results are shown in bold.

Accordingly, as shown in fig. 2, the present invention provides a multimodal image processing system based on a Transformer network and hypersphere spatial learning, comprising: the imaging unit is used for acquiring multi-modal image samples; the data storage unit is used for storing multi-modal image samples; the neural network unit comprises a pre-trained Transformer network model, a teacher model and a multi-mode fusion model, wherein the teacher model is finely adjusted in a self-supervision mode; the data processing unit is used for extracting teacher distillation vectors of all the modal images based on the teacher model, extracting student distillation vectors of all the modal images based on the multi-modal fusion model, and extracting the characteristics and classification probability of all the modal images in a unit hypersphere space based on the multi-modal fusion model; meanwhile, calculating distillation loss, center alignment loss between modes, uniformity loss in modes and classification loss of each mode based on the teacher distillation vector, the student distillation vector, the features and the classification probability, and updating a multi-mode fusion model based on the distillation loss, the center alignment loss between modes, the uniformity loss in modes and the classification loss.

The multi-modal image processing method and system based on the transform network and hypersphere spatial learning provided by the embodiment of the invention are described in detail above. The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description. It should be noted that, for those skilled in the art, it is possible to make various improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the scope of the claims of the present invention.

Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

Claims

1. A multi-modal image processing method based on a Transformer network and hypersphere space learning is characterized by specifically comprising the following steps of:

step S1: acquiring a pre-trained Transformer network model, and fine-tuning the pre-trained Transformer network model in a self-supervision manner based on image data of each modality to obtain a teacher model corresponding to each modality;

step S2: constructing a multi-branch model capable of performing hypersphere space learning based on multi-modal images, wherein the multi-branch model is composed of a teacher model corresponding to each modality and a multi-modal fusion model;

step S3: respectively extracting teacher distillation vectors of the images of each mode based on a teacher model corresponding to each mode, extracting student distillation vectors of the images of each mode based on the multi-mode fusion model, and extracting the characteristics and classification probability of the images of each mode in a unit hypersphere space based on the multi-mode fusion model;

step S4: calculating distillation loss, inter-modality center alignment loss, intra-modality uniformity loss and classification loss of each modality based on teacher distillation vectors of each modality image, student distillation vectors of each modality image, features of each modality image in unit hypersphere space and classification probability of each modality image in unit hypersphere space, and updating the multi-modality fusion model based on the distillation loss, the inter-modality center alignment loss, the intra-modality uniformity loss and the classification loss;

step S5: and generating a zero-sample cross-modal retrieval result by adopting the updated multi-modal fusion model based on the image of the to-be-detected modality and the image of the to-be-queried modality.

2. The method of claim 1, wherein the image data of each modality in step S1 includes a photograph and a sketch, and the fine-tuning the pre-trained Transformer network model in an autonomous manner based on the image data of each modality specifically includes:

fine-tuning the pre-trained transform network model using the photo and sketch data respectively in an auto-supervised manner, i.e. excluding label information, to avoid degradation of model generalization during fine-tuning, i.e. generating a set of different views V for each photo image or sketch image using a "multi-cropping" strategy, including two global views x with a resolution of 224 x 224, to obtain teacher models in the photo mode and the sketch mode_g,1And x_g,2And 10 partial views with a resolution of 96 × 96; then, initializing two models to be fine-tuned from a pre-trained Transformer network model, and marking as a model P and a model T; the fine tuning process follows a strategy from local to global, inputting all views in V into model P, and only global view into model T, then the optimization formula is defined as follows:

wherein Z is_t，τ_t，θ_tAnd Z_p，τ_p，θ_pRespectively representing the output and temperature over-parameter parameters of the models T and P, psi represents Softmax normalization operation, and KL represents Kullback-Leibler divergence; x may be any global view, but not a local view; x' may be any view, but may not be the same as x;

the expression minimizes the Kullback-Leibler divergence, so the above formula is used to align the output of the local view and the global view, and to align the different global views; finally, the parameters of the model T are updated in an exponential moving average manner: theta_t＝ζθ_t+(1-ζ)θ_pWhere ζ is a preset parameter between (0, 1);and respectively obtaining the trained model T of the image mode and the sketch mode as a teacher model of the corresponding mode.

3. The multimodal image processing method based on Transformer network and hypersphere spatial learning of claim 2, wherein the step S2 is implemented in a specific manner:

the teacher model in the photo modality and the teacher model in the sketch modality acquired in step S1 are each denoted as g_IAnd g_SWherein, I and S respectively represent a photo mode and a sketch mode, and the structure of the teacher model of the photo mode is represented by a skeleton network f_IAnd a projection network h_IThe structure of the teacher model in the sketch mode is composed of a skeleton network f_SAnd a projection network h_SForming;

further constructing a multi-branch model capable of performing hypersphere learning based on multi-modal images, wherein the multi-branch model is composed of a teacher model with two modalities and a multi-modal fusion model, the basic network structure of the multi-modal fusion model is also a Transformer network, distillation marks are added adaptively based on knowledge distillation, meanwhile, fusion marks are added adaptively based on hypersphere space learning, and the multi-modal fusion model is structurally composed of a skeleton network f_FAnd two projection networks h_DAnd h_FComposition of g_DDenotes f_FAnd h_DConstructed model, g_FDenotes f_FAnd h_FAnd the distillation marks of the teacher model are used for calculating teacher distillation vectors, the distillation marks of the multi-mode fusion model are used for calculating student distillation vectors of all the modes, and the output of the fusion marks of the multi-mode fusion model is used for calculating the characteristics and classification probability of the hypersphere space of a unit.

4. The multimodal image processing method based on the Transformer network and hypersphere spatial learning of claim 3, wherein the step S3 is implemented in a specific manner:

the distillation mark is formed by a skeleton network f no matter the teacher model or the multi-mode fusion model_I，f_SOr f_FLearned and obtained by the projection network h_I，h_SOr h_DProjecting the distillation vector into a K-dimensional space to obtain a distillation vector; in any ith photograph

Or any ith sketch

For the purpose of example only,

and

and

student distillation vectors representing photographs and sketch, respectively; in addition, the fusion tag is represented by f_FIs learned and is obtained from_FProjecting the image to a unit hypersphere space to obtain the characteristics of the image in the unit hypersphere space,

and

respectively representing the features of the photo and the sketch, and classifying the features through a linear classifier shared by all the modalities to obtain the classification probability.

5. The multimodal image processing method based on Transformer network and hypersphere spatial learning of claim 4, wherein the step S4 is implemented in a specific manner:

the training process of the multi-modal fusion model is carried out under the supervision of the teacher model corresponding to each modality, because the teacher models corresponding to each modality are pre-trained in a self-supervision mode, and are optimized to find the specific global structure information of each image; however, the purpose of the multimodal fusion model is to eliminate modal differences between the distributions of the same category but different modalities, which will inevitably require that the multimodal fusion model pay more attention to the more discriminative local structures shared by the whole category, gradually forgetting the structural information specific to each image, thus avoiding the phenomenon known as "catastrophic forgetting" by knowledge distillation;

calculating distillation loss of photo based on distillation vector of teacher and distillation vector of student

wherein, tau_IAnd τ_DRespectively representing the temperature super parameters of the photo mode teacher model and the multi-mode fusion model, and the other symbols are consistent with the definition of the content; in the same way, the method for preparing the composite material,

can also be calculated from a batch of sketch images; thus, the overall distillation loss of the multimodal fusion model is defined as:

both the photographs and the sketch map are projected into the unit hypersphere space and it is expected that the photographs and the sketch map can be aggregated according to categories, and when the images of all the categories are aggregated individually, their distribution is linearly separable in hypersphere space, so a linear classifier is used to classify the features and the classification loss is calculated as follows:

wherein the content of the first and second substances,

representing a mathematical expectation, x_iRepresenting an arbitrary ith photograph or sketch, y_iDenotes x_iClass label of theta_cA parameter, P (y), representing the linear classifier_i|g_F(x_i)；θ_c) With the expression parameter theta_cLinear classifier of (2) will x_iClassification as y_iThe probability of (d);

in addition, inter-modality center alignment loss is calculated based on characteristics of each modality image in a unit hypersphere space

It is explicitly required that the feature distributions of the respective modality images in the unit hypersphere space are superimposed on the hypersphere:

wherein, for the sake of simplified representation, represents a photo mode I or a sketch mode S, λ is a weight of an exponential moving average,

indicating that the category label in a batch of image data is y_jOf (2) a sample

The number of the (c) component(s),

representing class centers, Y being represented by respective classes Y_iThe second line formula represents the normalized class center of the L2 norm, i.e. the class center is mapped back to the unit hypersphere space;

in addition, intra-modality uniformity loss is calculated for the features of each modality based on the features of the respective modality images in the unit hypersphere space and the radial basis functions, wherein the intra-modality uniformity loss is defined as an expected logarithm of the gaussian potential of the paired features; finally, the overall intra-modal uniformity loss

Is intra-modal uniformity loss for each mode

And

the sum of (1):

wherein the content of the first and second substances,

and

any two images representing the same modality, t is a parameter fixed to 2,

for arbitrary pairs of images

And

calculating the Gaussian potential;

finally, the overall objective function of the multi-modal fusion model

Is a linear weighting of the four losses described above, defined as follows:

wherein λ is₁And λ₂And after calculating the value of the overall objective function of the multi-modal fusion model, updating the parameters of the multi-modal fusion model according to a stochastic gradient descent algorithm to obtain the updated multi-modal fusion model.

6. The multimodal image processing method based on Transformer network and hypersphere spatial learning of claim 5, wherein the step S5 is implemented in a specific manner:

after the multi-modal fusion model is updated, using the updated skeleton network f of the multi-modal fusion model_FExtracting fusion mark vectors of all modal images, extracting the fusion mark vectors of the images to be detected based on the images of the to-be-detected modalities, and extracting the fusion mark vectors of the images to be inquired based on the images of the to-be-inquired modalities; and finally, calculating the cosine similarity of the image to be detected and the image to be inquired according to the two fusion mark vectors, and generating the zero sample cross-modal retrieval result after sequencing from big to small.

7. The multimodal image processing method based on Transformer network and hypersphere space learning of claim 6, wherein K is fixed to 65536.

8. A multi-modal image processing system based on a Transformer network and hypersphere spatial learning, for implementing the multi-modal image processing method based on the Transformer network and hypersphere spatial learning according to any one of claims 1-7, wherein the system comprises:

the imaging unit is used for acquiring multi-modal image samples;

the data storage unit is used for storing the multi-modal image samples;

the neural network unit comprises a pre-trained Transformer network model, a teacher model and a multi-mode fusion model, wherein the teacher model corresponds to each mode which is finely adjusted in a self-supervision mode;

the data processing unit is used for extracting teacher distillation vectors of all the modal images based on the teacher model corresponding to all the modalities, extracting student distillation vectors of all the modal images based on the multi-modality fusion model, and extracting the characteristics and classification probability of all the modal images in a unit hypersphere space based on the multi-modality fusion model; meanwhile, calculating distillation loss, center alignment loss among modes, uniformity loss in modes and classification loss based on teacher distillation vectors, student distillation vectors of all mode images, characteristics of all mode images in unit hypersphere space and classification probability thereof, and updating a multi-mode fusion model based on the distillation loss, the center alignment loss among the modes, the uniformity loss in the modes and the classification loss.

9. The multimodal image processing system based on Transformer network and hypersphere space learning of claim 8, wherein the data processing unit calculates distillation loss based on teacher distillation vector and student distillation vector of each modality image, calculates center alignment loss among modalities and uniformity loss in modalities based on features of each modality image in unit hypersphere space, calculates classification loss based on classification probability of each modality image in unit hypersphere space, and weights distillation loss, center alignment loss among modalities, uniformity loss in modalities and classification loss based on linear weighting, calculates final loss value for updating multimodal fusion model.