CN116778140A - Visual positioning method, apparatus, device and storage medium based on dual knowledge distillation


Info

Publication number
CN116778140A
Authority
CN
China
Prior art keywords
visual
positioning
semantic
module
features
Prior art date
Legal status
Pending
Application number
CN202310790208.7A
Other languages
Chinese (zh)
Inventor
胡越
武万森
秦龙
许凯
张淼
祝建成
尹全军
刘婷
王有凯
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202310790208.7A
Publication of CN116778140A


Landscapes

  • Image Analysis (AREA)

Abstract

The application relates to a visual positioning method, apparatus, device and storage medium based on dual knowledge distillation. The method comprises the following steps: taking an acquired original image and the corresponding language query as a training sample; constructing a visual positioning model based on dual knowledge distillation, the model comprising a student network, a semantic knowledge distillation module and a positioning knowledge distillation module; the semantic knowledge distillation module is used for encoding training samples into visual features and semantic features with a teacher network and distilling the visual features and semantic features to the student network; the positioning knowledge distillation module is used for learning positioning knowledge by means of contrastive learning; and training the visual positioning model according to the training samples and a total loss function, and inputting the image to be tested and the corresponding language query into the student network of the trained visual positioning model to obtain a positioning bounding box. The method improves the cross-modal representation of the basic framework and makes the correlation between the two modalities tighter.

Description

Visual positioning method, apparatus, device and storage medium based on dual knowledge distillation
Technical Field
The application relates to the technical field of visual positioning, and in particular to a visual positioning method, apparatus, device and storage medium based on dual knowledge distillation.
Background
The visual localization task aims at locating, in a picture, the object referred to by a given referring expression. Unlike the general object detection task, which locates all objects of a class in a picture, visual localization requires the model to understand a complex referring expression (Referring Expression) and then distinguish, among objects of multiple classes, the object in the picture that uniquely corresponds to that expression. This localization ability is a prerequisite of cross-media intelligence and is essential in many downstream vision-and-language tasks. For example, in visual question answering the model must locate "the girl wearing green clothing" in order to answer a question about her; in vision-and-language navigation the model needs to accurately locate objects in the environment, such as "the red table", before performing subsequent actions; and in autonomous driving, an instruction such as "stop the vehicle at parking space No. 1" requires the vehicle to locate the referenced spot in order to be controlled.
Knowledge distillation follows a Teacher-Student training structure: a trained teacher model provides knowledge, and a student model acquires the teacher's knowledge through distillation training. Model-enhancement approaches emphasize using additional resources (e.g., unlabeled or cross-modal data) or optimization strategies (e.g., mutual learning and self-learning) in knowledge distillation to improve the performance of the student model. For example, an unlabeled sample can be fed to both the teacher and student networks; a powerful teacher network can usually predict the label of the sample, and that label is then used to guide the training of the student network.
Self-Distillation uses a single network as both the teacher and the student model, allowing that network to improve its performance through knowledge distillation during self-learning. Self-distillation methods largely fall into two categories. The first is mutual distillation using information from different samples: soft labels from other samples can prevent over-confident predictions and can even reduce intra-class distances by minimizing the prediction distributions between different samples; other work exploits augmented samples, for example using the consistency of features under different distortions to promote robust intra-class learning. The second is self-distillation between the layers of a single network. The most common approach uses the features of deeper layers, including the soft targets of the network output, to guide the learning of shallower layers. For sequential features, knowledge from earlier frames is transferred to later frames. The learning of the blocks of a single network can also be bidirectional, with the blocks learning collaboratively and guiding each other throughout training.
Existing visual positioning methods can be divided into two main categories: single-stage methods and two-stage methods. Many recent studies use a Transformer as the encoder of the model to extract visual and language features, but these methods still suffer from two problems.
Problem one: the visual and linguistic encoders used by most methods are independent pre-trained models. As shown in fig. 1 (a), the trans vg model uses the parameters of DETR to initialize the visual encoder and the parameters of BERT to initialize the speech encoder. Since the BERT and DETR models are pre-trained based on different tasks, the correlation between features obtained after encoding by the encoder for semantically close visual and linguistic information is poor. Recently, the visual language pre-training model CLIP has demonstrated its ability to connect visual and language modalities to a unified embedded space and achieves significant effects in multiple multi-modal tasks. Intuitively, we can directly use CLIP visual encoder and language encoder to replace original encoder in the infrastructure, thereby ensuring that the features of the two modalities are closely related. As shown in fig. 1 (b), it is shown from the experimental results that such simple replacement does not bring about improvement in performance, but rather, loses the accuracy of the model. Therefore, we consider that semantic knowledge contained in the CLIP model needs to be extracted in an indirect way.
Problem two: in the VG task, an image typically contains multiple objects or regions, and a powerful model is required to distinguish the referred object (foreground) from the other objects (background). However, existing approaches tend to ignore this problem and easily confuse the two kinds of objects. This may make the model insensitive to the salient foreground object, resulting in ambiguous localization boundaries. As shown in fig. 1 (d) and (e), given the query "sky right of middle tree", TransVG cannot identify the foreground sky region being referred to and instead erroneously focuses on the background building. These results indicate that processing challenging images that contain multiple objects and complex location information is very difficult.
Disclosure of Invention
Based on the above, it is necessary to provide a visual positioning method, apparatus, device and storage medium based on dual knowledge distillation.
A visual localization method based on dual knowledge distillation, the method comprising:
Taking the acquired original image and the corresponding language query as training samples.
Constructing a visual positioning model based on dual knowledge distillation; the visual positioning model comprises a student network, a semantic knowledge distillation module and a positioning knowledge distillation module; the semantic knowledge distillation module is used for encoding the training samples into visual features and semantic features with a teacher network and distilling the visual features and the semantic features from the teacher network to the student network; the student network is used for encoding the training samples, fusing the encoding results with the distilled visual and semantic features, and performing prediction on the obtained fusion features to obtain a predicted positioning bounding box; the positioning knowledge distillation module is used for generating high-quality positive and negative samples with a semantic-location-aware sampling mechanism, according to the original image or the feature vectors within the predicted positioning bounding box, and learning positioning knowledge by means of contrastive learning.
Constructing a total loss function of the visual positioning model.
Training the visual positioning model according to the training samples and the total loss function to obtain a trained visual positioning model.
Inputting the image to be detected and the corresponding language query into the student network of the trained visual positioning model to obtain a positioning bounding box.
A visual positioning device based on dual knowledge distillation, the device comprising:
and the training sample acquisition module is used for taking the acquired original image and the corresponding language query as training samples.
The visual positioning model construction module is used for constructing a visual positioning model based on double knowledge distillation; the visual positioning model comprises a student network, a semantic knowledge distillation module and a positioning knowledge distillation module; the semantic knowledge distillation module is used for coding the training samples into visual features and semantic features by adopting a teacher network, and distilling the visual features and the semantic features from the teacher network to the student network; the student network is used for coding the training samples, fusing the visual characteristics and the semantic characteristics according to coding results and distillation, and predicting according to the obtained fusion characteristics to obtain a prediction positioning boundary frame; the positioning knowledge distillation module is used for generating high-quality positive and negative samples by adopting a semantic positioning sensing sampling mechanism according to the original image or the feature vector of the prediction positioning boundary box, and learning positioning knowledge by adopting a contrast learning mode.
The visual positioning model training module is used for constructing a total loss function of the visual positioning model based on double knowledge distillation; and training the visual positioning model according to the training sample and the total loss function to obtain a trained visual positioning model.
And the visual positioning module based on double knowledge distillation is used for inputting the image to be detected and the corresponding language query into a student network of the trained visual positioning model to obtain a positioning boundary frame.
A computer device comprising a memory storing a computer program and a processor implementing the steps of the method described above when the processor executes the computer program.
A computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the above method.
In the visual positioning method, apparatus, device and storage medium based on dual knowledge distillation, the method comprises the following steps: taking the acquired original image and the corresponding language query as training samples; constructing a visual positioning model based on dual knowledge distillation, the visual positioning model comprising a student network, a semantic knowledge distillation module and a positioning knowledge distillation module; the semantic knowledge distillation module encodes the training samples into visual features and semantic features with a teacher network and distils the visual features and the semantic features from the teacher network to the student network; the student network encodes the training samples, fuses the encoding results with the distilled visual and semantic features, and performs prediction on the obtained fusion features to obtain a predicted positioning bounding box; the positioning knowledge distillation module generates high-quality positive and negative samples with a semantic-location-aware sampling mechanism according to the predicted positioning bounding box and learns positioning knowledge by means of contrastive learning; a total loss function of the visual positioning model is constructed; the visual positioning model is trained according to the training samples and the total loss function to obtain a trained visual positioning model; and the image to be detected and the corresponding language query are input into the student network of the trained visual positioning model to obtain a positioning bounding box. The model of the method significantly improves the cross-modal representation of the basic framework and makes the correlation between the two modalities tighter.
Drawings
FIG. 1 is a diagram showing the introduction of the problems of the present application, wherein (a) is a TransVG structure, (b) is a TransVG w/CLIP structure, (c) is a visual positioning model structure diagram based on dual knowledge distillation, (d) is a query of "sky right of middle tree", (e) is a visual result of a TransVG attention score, and (f) is a visual result of a visual positioning model attention score based on dual knowledge distillation;
FIG. 2 is a flow diagram of a visual localization method based on dual knowledge distillation in one embodiment;
FIG. 3 is a block diagram of a visual positioning model based on dual knowledge distillation in one embodiment;
FIG. 4 shows an alignment block in another embodiment, wherein (a) is a feature level alignment block and (b) is a pixel level alignment block;
FIG. 5 is a block diagram of a visual positioning device based on dual knowledge distillation in one embodiment;
fig. 6 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
In one embodiment, as shown in fig. 2, a visual localization method based on dual knowledge distillation is provided, the method comprising the steps of:
step 200: and taking the acquired original image and the corresponding language query as training samples.
Step 202: constructing a visual positioning model based on double knowledge distillation; the visual positioning model comprises a student network, a semantic knowledge distillation module and a positioning knowledge distillation module; the semantic knowledge distillation module is used for coding training samples into visual features and semantic features by adopting a teacher network, and distilling the visual features and the semantic features from the teacher network to a student network; the student network is used for coding the training samples, fusing the coding results with the distillation visual characteristics and the semantic characteristics, and predicting according to the obtained fusion characteristics to obtain a prediction positioning boundary frame; the positioning knowledge distillation module is used for generating high-quality positive and negative samples by adopting a semantic positioning sensing sampling mechanism according to the original image or the feature vector of the prediction positioning boundary box, and learning positioning knowledge by adopting a contrast learning mode.
In particular, a visual localization model (DUET) based on dual knowledge distillation introduces a Knowledge Distillation (KD) mechanism into the visual localization task. DUET involves two distillation methods for semantic knowledge and positional knowledge.
First, in order to have a stronger semantic correlation between feature embedding of visual and language modalities, semantic knowledge is extracted from CLIP by trying to match global visual/language features of the teacher network CLIP model and the target student network model, as shown in fig. 1 (c). This is a simple and efficient way to significantly improve the cross-modal representation of the infrastructure and to make the correlation between the two modalities tighter.
Second, semantic knowledge extracted from CLIP requires further refinement because CLIP is adept at solving the classification task, but it is difficult to distinguish between foreground and background when solving the visual localization task. For this purpose, contrast learning is used as a self-distilling method to refine the positioning knowledge. In particular, an alignment penalty is established that forces features extracted from the bounding box generated by the model to resemble positive samples, as opposed to negative samples. However, when a large number of simple samples dominate the gradient, the performance of the model quickly goes into plateau. To solve this problem, the present method also designs a Semantic Location Aware (SLA) sampling mechanism to distinguish positive and negative samples from randomly cropped image regions. From the visualization results of fig. 1 (f), it can be observed that the DUET can help the baseline model generate a unique location-aware representation for the visual localization task. The structure of the visual localization model based on dual knowledge distillation is shown in fig. 3, and the framework of the DUET can be divided into two modules: the semantic knowledge distillation module and the positioning knowledge distillation module.
The student network may be, but is not limited to, a TransVG network or a FAOA network.
Step 204: and constructing a total loss function of the visual positioning model.
Step 206: and training the visual positioning model according to the training sample and the total loss function to obtain a trained visual positioning model.
Step 208: and inputting the image to be detected and the corresponding language query into a student network of the trained visual positioning model to obtain a positioning boundary frame.
In the above visual localization method based on dual knowledge distillation, the method comprises: taking the acquired original image and the corresponding language query as training samples; constructing a visual positioning model based on dual knowledge distillation, the visual positioning model comprising a student network, a semantic knowledge distillation module and a positioning knowledge distillation module; the semantic knowledge distillation module encodes the training samples into visual features and semantic features with a teacher network and distils the visual features and the semantic features from the teacher network to the student network; the student network encodes the training samples, fuses the encoding results with the distilled visual and semantic features, and performs prediction on the obtained fusion features to obtain a predicted positioning bounding box; the positioning knowledge distillation module generates high-quality positive and negative samples with a semantic-location-aware sampling mechanism according to the predicted positioning bounding box and learns positioning knowledge by means of contrastive learning; a total loss function of the visual positioning model is constructed; the visual positioning model is trained according to the training samples and the total loss function to obtain a trained visual positioning model; and the image to be detected and the corresponding language query are input into the student network of the trained visual positioning model to obtain a positioning bounding box. The model of the method significantly improves the cross-modal representation of the basic framework and makes the correlation between the two modalities tighter.
In one embodiment, step 204 includes: constructing the loss functions of the semantic knowledge distillation module, of the positioning knowledge distillation module and of the student network; setting the hyper-parameters that control the influence of each loss; and performing a weighted summation over these loss functions according to the hyper-parameters to obtain the total loss function of the visual positioning model:
L_total = L_L1 + λ_giou·L_giou + λ_sem·L_sem + λ_loc·L_loc    (1)
where λ_giou, λ_sem and λ_loc are the hyper-parameters controlling the influence of the corresponding losses, L_L1 and L_giou are the L1 loss and the GIoU loss of the student network respectively, L_loc is the loss function of the positioning knowledge distillation module, and L_sem is the loss function of the semantic knowledge distillation module.
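By way of illustration only, the following Python sketch shows how the total training objective of equation (1) could be assembled; the assumption that the L1 term carries no weight of its own, the default weight values and the function name are illustrative and are not the reference implementation of the application.

def total_loss(l_l1, l_giou, l_sem, l_loc,
               lambda_giou=1.0, lambda_sem=1.0, lambda_loc=1.0):
    # L_total = L_L1 + lambda_giou * L_GIoU + lambda_sem * L_sem + lambda_loc * L_loc
    return l_l1 + lambda_giou * l_giou + lambda_sem * l_sem + lambda_loc * l_loc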
Specifically, the GIoU penalty plus the L1 penalty is used to address the visual localization task. To perform knowledge distillation, we need to minimize two distillation losses, so the training goal of the visual localization model based on double knowledge distillation is shown in equation (1).
In one embodiment, the loss function of the semantic knowledge distillation module is:
L_sem = L_sem,v + L_sem,l    (2)
where L_sem is the loss function of the semantic knowledge distillation module, and L_sem,v and L_sem,l denote the distillation losses of the visual and language modalities, respectively. V_s is the overall visual feature of the student model and V_t is the overall visual feature of the semantic knowledge distillation module (the teacher network); η(·) is the adaptive layer, w(·) is the feature whitening function, and β is a preset parameter. Likewise, L_s is the overall semantic feature of the student model and L_t is the overall semantic feature of the semantic knowledge distillation module.
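As an illustrative sketch under stated assumptions (mean-pooling of the student's visual tokens to a single global vector, a linear adaptive layer, and β acting as the smooth-L1 threshold are assumptions; the symbols follow the description above), the semantic distillation loss could be computed as:

import torch
import torch.nn as nn
import torch.nn.functional as F

def whiten(x):
    # Feature whitening w(.): parameter-free layer normalization without scale or bias.
    return F.layer_norm(x, x.shape[-1:])

class SemanticKD(nn.Module):
    def __init__(self, student_dim, teacher_dim, beta=1.0):
        super().__init__()
        self.adapt_v = nn.Linear(student_dim, teacher_dim)   # adaptive layer eta(.) for visual features
        self.adapt_l = nn.Linear(student_dim, teacher_dim)   # adaptive layer eta(.) for language features
        self.beta = beta                                     # assumed to act as the smooth-L1 threshold

    def forward(self, v_s, l_s, v_t, l_t):
        # v_s: (n, D_s) student visual tokens; l_s: (D_s,) student language token
        # v_t: (D_t,) CLIP visual [CLS] feature; l_t: (D_t,) CLIP text [EOS] feature
        v_s_global = self.adapt_v(v_s).mean(dim=0)           # pool student tokens to one global feature (assumption)
        loss_v = F.smooth_l1_loss(whiten(v_s_global), whiten(v_t), beta=self.beta)          # L_sem,v
        loss_l = F.smooth_l1_loss(whiten(self.adapt_l(l_s)), whiten(l_t), beta=self.beta)   # L_sem,l
        return loss_v + loss_l                               # L_sem = L_sem,v + L_sem,l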
The loss function of the positioning knowledge distillation module is defined in terms of two self-distillation losses,
where the two losses are the self-distillation loss computed with the pixel-level alignment module and the self-distillation loss computed with the feature-level alignment module, evaluated on the corresponding image region and feature-map region respectively, and N_pos and N_neg are the numbers of positive and negative samples.
In one embodiment, step 206 specifically includes the steps of:
step 300: and inputting the training sample into a semantic knowledge distillation module for visual feature coding and semantic feature coding, performing whitening treatment on the obtained code, and distilling the obtained whitened visual features and semantic features into a student network after self-adaptive treatment.
Step 302: and inputting the training samples into a student network to obtain a prediction positioning boundary box.
Step 304: and inputting the predicted positioning boundary frame and the original image or the feature vector in the corresponding region into a positioning knowledge distillation module, and obtaining positioning knowledge by adopting a self-distillation method.
Step 306: training the visual positioning model according to the whitened visual features and semantic features, the distilled visual features and semantic features, the prediction positioning boundary frame, positioning knowledge obtained by self-distillation and the total loss function to obtain a trained visual positioning model.
In one embodiment, the semantic knowledge distillation module comprises a teacher network and a feature whitening module; the teacher network comprises two large-scale pre-trained CLIP models. Step 300 includes: inputting the training samples into the teacher network of the semantic knowledge distillation module, performing visual feature encoding with the first large-scale pre-trained CLIP model and semantic feature encoding with the second large-scale pre-trained CLIP model to obtain visual features and semantic features; inputting the visual features and semantic features into the feature whitening module of the semantic knowledge distillation module, performing whitening with a parameter-free layer normalization function without scaling or bias, and processing the whitened visual and semantic features with an adaptive layer; and distilling the adaptively processed visual and semantic features into the student network.
In particular, since the visual and language encoders of the student network use DETR and BERT pre-training parameters, respectively, there is a significant semantic gap between their embedding spaces. To bridge this gap, the large-scale pre-trained model CLIP, which encodes visual and language features in one and the same semantic space, is used. However, due to the differences between classification and regression tasks, using CLIP directly may not be optimal for the localization task. The first distillation method therefore focuses on distilling semantic knowledge from the teacher network (CLIP) to the student network. Notably, this efficient distillation method requires no additional pre-training and leaves the inference complexity unchanged.
Specifically, taking image I and language query L as inputs, the CLIP model encodes them into visual features and language features. The visual [CLS] token of CLIP and the [EOS] token of the CLIP text encoder represent the teacher network's overall visual feature V_t ∈ R^D and language feature L_t ∈ R^D. Meanwhile, the output h_s of the student's multimodal Transformer encoder consists of the output states of the [REG] token, the language features f_l' and the visual features f_v'. All visual tokens from f_v' and the first language token from f_l' are selected as the student model's overall visual feature V_s ∈ R^{n×D} and language feature L_s ∈ R^D, where n is the number of visual tokens. Introducing an adaptive layer and feature whitening before distillation makes the process more general.
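A minimal sketch of how these global features could be gathered, assuming the fused student sequence is ordered [REG] token, language tokens, visual tokens (as in equation (6)) and that the CLIP teacher embeddings have already been extracted; the slicing indices are assumptions:

def gather_global_features(h_s, num_lang_tokens, clip_image_emb, clip_text_emb):
    # h_s: (B, 1 + N_l + N_v, D) output of the student's multimodal Transformer encoder
    l_s = h_s[:, 1]                           # first language token -> student language feature L_s
    v_s = h_s[:, 1 + num_lang_tokens:]        # all visual tokens    -> student visual feature V_s
    v_t, l_t = clip_image_emb, clip_text_emb  # CLIP [CLS] / [EOS] embeddings -> teacher V_t, L_t
    return v_s, l_s, v_t, l_t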
(1) Adaptive layer. Because the overall features of the teacher and student models have different dimensions, they must be adjusted to the same dimension so that they are comparable. We tested common methods such as average pooling and 1×1 convolution. The adaptive layer tolerates different output feature dimensions between the teacher and student models, which makes the distillation method more general.
(2) Feature whitening. Adjusting hyper-parameters for each teacher network model can be challenging because different pre-trained models can have features of different magnitudes. To address this problem, a parameter-free layer normalization function without scaling or bias is used to whiten the output feature map of the teacher model.
Semantic knowledge distillation is expressed as a loss function between V_s ∈ R^{n×D} and V_t ∈ R^D, and between L_s ∈ R^D and L_t ∈ R^D. Specifically, L_sem is the sum of the distillation losses of the two modalities, as given in equation (2). For example, the visual-modality loss can be computed as the smooth L1 loss between the overall features of the student and the teacher, as shown in equation (3). Likewise, the language-modality distillation loss L_sem,l can be computed, giving the total distillation loss L_sem.
In one embodiment, the student network is a TransVG network; the TransVG network comprises a visual encoder, a language encoder, a multi-modal fusion module and a prediction layer. Step 302 includes: inputting the original image in the training sample into the visual encoder of the student network to obtain visual features; inputting the language query corresponding to the original image into the language encoder of the student network to obtain semantic features; inputting the semantic features, the visual features, the learnable token [REG], and the distilled semantic and visual features output by the teacher network into the multi-modal fusion module of the student network to obtain the fusion features as follows:
h_s = Transformer([f_REG, φ(f_l), φ(f_v)])    (6)
where h_s is the fusion feature, f_l are the semantic features, f_v are the visual features, φ(·) is a linear layer, and f_REG is the learnable token [REG].
The fusion features are input into the prediction layer of the student network to obtain the predicted positioning bounding box.
Specifically, the dual knowledge distillation method proposed in this application can be applied to different models. Without loss of generality, the classical model TransVG is chosen here as the student network. The student network comprises four main parts: a visual encoder, a language encoder, a multi-modal fusion module and a prediction layer. In general, given an image I of size W×H and a language query L, the student network outputs 4-point bounding-box coordinates bbox_pred = (x_c, y_c, ω, h) end to end, where (x_c, y_c) are the coordinates of the centre of the bounding box, and ω and h represent half the width and the height, respectively.
(1) Visual encoder and language encoder. The visual encoder first encodes the input image I through a convolutional network and then through Transformer encoder layers to obtain the visual feature map f_v. Similarly, each token in the language query L is embedded as the sum of its word embedding and position embedding; then, following the structure of the standard BERT model, a 6-layer Transformer is used to obtain the language features f_l.
(2) Multi-modal fusion module. The multi-modal fusion module includes a linear layer for each modality and a 6-layer Transformer encoder. To jointly learn from the token [REG], the language features f_l and the visual features f_v, the visual and language features are concatenated and fused using the Transformer encoder, as shown in equation (6).
(3) Prediction layer. For bounding-box coordinate prediction, a regression network consisting of an MLP with activation functions and a linear output layer is used. Given the [REG] token, the prediction layer generates the 4-dimensional bounding-box coordinates bbox_pred = (x_c, y_c, ω, h).
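The student forward pass described in (1)-(3) can be sketched as follows; the layer sizes, the number of attention heads and the sigmoid on the box output are illustrative assumptions rather than the reference implementation:

import torch
import torch.nn as nn

class StudentHead(nn.Module):
    def __init__(self, d_model=256, vis_dim=2048, lang_dim=768, num_layers=6):
        super().__init__()
        self.proj_v = nn.Linear(vis_dim, d_model)             # phi(.) for visual features
        self.proj_l = nn.Linear(lang_dim, d_model)            # phi(.) for language features
        self.reg_token = nn.Parameter(torch.zeros(1, 1, d_model))  # learnable [REG] token
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=num_layers)  # multi-modal fusion
        self.bbox_head = nn.Sequential(                       # MLP regression network + linear output
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 4))                            # (x_c, y_c, w, h)

    def forward(self, f_v, f_l):
        # f_v: (B, N_v, vis_dim) visual tokens; f_l: (B, N_l, lang_dim) language tokens
        reg = self.reg_token.expand(f_v.size(0), -1, -1)
        h_s = self.fusion(torch.cat([reg, self.proj_l(f_l), self.proj_v(f_v)], dim=1))  # equation (6)
        bbox = self.bbox_head(h_s[:, 0]).sigmoid()            # predict the box from the [REG] output state
        return h_s, bbox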
In one embodiment, the positioning knowledge distillation module comprises an alignment module and a semantic-location-aware sampling mechanism; the alignment module comprises a feature-level alignment module or a pixel-level alignment module; step 304 includes:
When the alignment module is a pixel-level alignment module: inputting the predicted positioning bounding box and the original image within the corresponding region into the semantic-location-aware sampling mechanism of the positioning knowledge distillation module to obtain a number of high-quality pixel-level positive and negative samples; and performing pixel-level alignment on all pixel-level positive and negative samples with the pixel-level alignment module to obtain the relevant image region of each sample:
where the relevant image region is sampled on a grid of size S, (x_1, y_1, x_2, y_2) are the coordinates of the upper-left and lower-right corners of the predicted positioning bounding box, W×H is the size to which the input is resized, I denotes the pixels of the original image, and STN is the spatial transformer network;
When the alignment module is a feature-level alignment module: inputting the predicted positioning bounding box and the feature vectors within the corresponding region into the semantic-location-aware sampling mechanism of the positioning knowledge distillation module to obtain a number of high-quality feature-level positive and negative samples; and aligning all feature-level positive and negative samples with the feature-level alignment module to obtain the relevant feature-map region of each sample:
where the relevant feature-map region is extracted from f_v', the visual features f_v after processing by the multi-modal fusion module.
In particular, for visual localization tasks, simply transferring semantic knowledge to the student model is not sufficient for it to output accurate localization boxes from the referring language. We believe the student model also needs to learn how to distinguish foreground from background. Positioning knowledge is therefore critical to the success of the model in visual localization tasks, yet it is typically ignored by existing methods. To solve this problem, a region-level self-distillation method is designed to learn positioning knowledge. The region features contained in the bounding box output by the student model should match those of the ground-truth bounding box. In addition, a semantic-location-aware sampling mechanism is further used to generate more high-quality positive and negative samples, and a contrastive learning method helps the model learn positioning knowledge.
(1) Alignment at feature level and alignment at pixel level
Given image I and query L, the student network generates a bounding box bbox_pred = (x_c, y_c, ω, h). There are two levels at which positioning knowledge can be extracted, namely alignment at the feature level and alignment at the pixel level. In fig. 4, (a) is the pixel-level alignment module and (b) is the feature-level alignment module. Pixel-level alignment aligns the image pixels contained in the predicted bounding box with the image pixels contained in the ground-truth bounding box, while feature-level alignment aligns the feature vectors corresponding to the two regions. To transfer the focus from the whole image or feature map to the region corresponding to the predicted bounding box bbox_pred, a Spatial Transformer Network (STN) is used to obtain the relevant region of the pixels I or of the features f_v' while preserving gradient information. Besides modelling the transformation, the STN is a differentiable image sampling method that allows the loss gradient to flow back to the input feature map as well as to the sampling grid coordinates, and hence to the transformation parameters and the student network. First, because the pixels I and the features f_v' differ in size, the latter are resized to W×H to keep the same size as the former. Then, based on the predicted bounding box (x_1, y_1, x_2, y_2), a transformation T_θ is defined as shown in equation (8).
Based on this, the relevant image region or the relevant feature-map region can be extracted, as shown in equations (7) and (8).
(2) Semantic Location Aware (SLA) sampling mechanism
We consider that introducing the alignment described above helps the model absorb more positioning knowledge, and that contrastive learning benefits from more samples. However, random sampling degrades the performance of the model because the quality of the samples it collects varies. An SLA sampling mechanism is therefore proposed to generate high-quality positive and negative samples that help the model learn positioning knowledge. Specifically, a fixed number N of random crops is first taken from the pixels I or the features f_v' to obtain a set of bounding boxes. IoU values between all bounding boxes in this set and the ground-truth box are computed; bounding boxes with IoU less than a certain threshold are treated as negative samples. For samples with IoU greater than the threshold, the cosine similarity between the language feature f_l' and the region features corresponding to the boxes is computed. Samples whose IoU and cosine similarity are both greater than the threshold are taken as positive samples, whose number is N_pos; the remaining samples are negative samples, whose number is N_neg. The thresholds on IoU and cosine similarity are both set to 0.6, and the samples are normalized.
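An illustrative sketch of this sampling step, assuming boxes in absolute pixel coordinates and a hypothetical helper region_feature(box) that returns the pooled feature of a cropped region; the interface is an assumption, and only the 0.6 thresholds come from the text:

import torch
import torch.nn.functional as F
import torchvision.ops as ops

def sla_sampling(gt_box, lang_feat, region_feature, img_size, n=32, thr=0.6):
    # gt_box: (4,) ground-truth box; lang_feat: (D,) language feature f_l'
    W, H = img_size
    pts = torch.rand(n, 4) * torch.tensor([W, H, W, H], dtype=torch.float)   # N random crops
    boxes = torch.cat([torch.minimum(pts[:, :2], pts[:, 2:]),
                       torch.maximum(pts[:, :2], pts[:, 2:])], dim=1)        # (x1, y1, x2, y2)
    iou = ops.box_iou(boxes, gt_box.unsqueeze(0)).squeeze(1)                 # IoU with the ground-truth box
    feats = torch.stack([region_feature(b) for b in boxes])                  # pooled feature of each crop
    sim = F.cosine_similarity(feats, lang_feat.unsqueeze(0), dim=1)          # similarity to the language feature
    pos_mask = (iou > thr) & (sim > thr)                                     # positives: both above the 0.6 threshold
    return boxes[pos_mask], boxes[~pos_mask]                                 # all remaining crops are negatives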
(3) Calculation of loss function
After SLA sampling, the self-distillation loss between the prediction region R_pred and the other samples is computed,
where R_pred may be either the image pixels at the pixel level, yielding the pixel-level self-distillation loss, or the feature map at the feature level, yielding the feature-level self-distillation loss; N_pos and N_neg are the numbers of positive and negative samples, respectively.
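Since the exact expression of this loss is given by a formula that is not reproduced here, the following InfoNCE-style contrastive objective is shown only as one standard way to realize "similar to the positives, dissimilar to the negatives"; the temperature and the multi-positive form are assumptions:

import torch
import torch.nn.functional as F

def region_contrastive_loss(pred_region, pos_regions, neg_regions, tau=0.07):
    # pred_region: (D,) flattened pixels or pooled features of R_pred
    # pos_regions: (N_pos, D), neg_regions: (N_neg, D) regions produced by SLA sampling
    q = F.normalize(pred_region, dim=0)
    pos = F.normalize(pos_regions, dim=1)
    neg = F.normalize(neg_regions, dim=1)
    logits = torch.cat([pos @ q, neg @ q]) / tau        # similarity of R_pred to every sample
    targets = torch.zeros_like(logits)
    targets[: pos.size(0)] = 1.0 / max(pos.size(0), 1)  # positives share the target probability mass
    return -(targets * F.log_softmax(logits, dim=0)).sum()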
Furthermore, without the SLA sampling mechanism and contrastive learning, a naïve localization knowledge distillation can be performed, i.e., aligning the predicted region and the ground-truth region at the two levels with an L1 loss, without any additional samples,
where ‖·‖_1 is the L1 norm, and the reference regions are the generated ground-truth pixel region and feature region, respectively.
It should be understood that, although the steps in the flowchart of fig. 2 are shown in sequence as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the order of execution of these steps is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in fig. 2 may include multiple sub-steps or stages that are not necessarily performed at the same moment but may be performed at different moments; the order of execution of these sub-steps or stages is likewise not necessarily sequential, and they may be performed in turn or alternately with at least a portion of other steps or of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 5, a visual positioning device based on dual knowledge distillation is provided, comprising: a training sample acquisition module, a visual positioning model construction module based on dual knowledge distillation, a visual positioning model training module based on dual knowledge distillation, and a visual positioning module based on dual knowledge distillation, wherein:
and the training sample acquisition module is used for taking the acquired original image and the corresponding language query as training samples.
The visual positioning model construction module is used for constructing a visual positioning model based on double knowledge distillation; the visual positioning model comprises a student network, a semantic knowledge distillation module and a positioning knowledge distillation module; the semantic knowledge distillation module is used for coding training samples into visual features and semantic features by adopting a teacher network, and distilling the visual features and the semantic features from the teacher network to a student network; the student network is used for coding the training samples, fusing the coding results with the distillation visual characteristics and the semantic characteristics, and predicting according to the obtained fusion characteristics to obtain a prediction positioning boundary frame; the positioning knowledge distillation module is used for generating high-quality positive and negative samples by adopting a semantic positioning sensing sampling mechanism according to the original image or the feature vector of the prediction positioning boundary box, and learning positioning knowledge by adopting a contrast learning mode.
The visual positioning model training module is used for constructing a total loss function of the visual positioning model based on double knowledge distillation; and training the visual positioning model according to the training sample and the total loss function to obtain a trained visual positioning model.
And the visual positioning module based on double knowledge distillation is used for inputting the image to be detected and the corresponding language query into a student network of the trained visual positioning model to obtain a positioning boundary frame.
In one embodiment, the visual positioning model training module based on dual knowledge distillation is further configured to construct the loss functions of the semantic knowledge distillation module, of the positioning knowledge distillation module and of the student network; to set the hyper-parameters controlling the influence of each loss; and to obtain the total loss function of the visual positioning model from the hyper-parameters and these loss functions, as shown in formula (1).
In one embodiment, the loss function of the semantic knowledge distillation module is shown in equations (2) through (4). The loss function of the positioning knowledge distillation module is shown in the formula (2).
In one embodiment, the visual positioning model training module based on dual knowledge distillation is further configured to input the training samples into the semantic knowledge distillation module for visual feature encoding and semantic feature encoding, whiten the obtained encodings, adaptively process the whitened visual and semantic features and distil them into the student network; input the training samples into the student network to obtain a predicted positioning bounding box; input the predicted positioning bounding box and the original image or the feature vectors within the corresponding region into the positioning knowledge distillation module to obtain positioning knowledge by self-distillation; and train the visual positioning model according to the whitened visual and semantic features, the distilled visual and semantic features, the predicted positioning bounding box, the positioning knowledge obtained by self-distillation and the total loss function, to obtain a trained visual positioning model.
In one embodiment, the semantic knowledge distillation module comprises a teacher network and a feature whitening module; the teacher network comprises two large-scale pre-trained CLIP models. The visual positioning model training module based on dual knowledge distillation is further configured to input the training samples into the teacher network of the semantic knowledge distillation module, perform visual feature encoding with the first large-scale pre-trained CLIP model and semantic feature encoding with the second large-scale pre-trained CLIP model to obtain visual features and semantic features; input the visual features and semantic features into the feature whitening module of the semantic knowledge distillation module, perform whitening with a parameter-free layer normalization function without scaling or bias, and process the whitened visual and semantic features with an adaptive layer; and distil the adaptively processed visual and semantic features into the student network.
In one embodiment, the student network is a TransVG network; the TransVG network comprises a visual encoder, a language encoder, a multi-modal fusion module and a prediction layer. The visual positioning model training module based on dual knowledge distillation is further configured to input the original image in the training sample into the visual encoder of the student network to obtain visual features; input the language query corresponding to the original image into the language encoder of the student network to obtain semantic features; input the semantic features, the visual features, the learnable token [REG], and the distilled semantic and visual features output by the teacher network into the multi-modal fusion module of the student network to obtain the fusion features, as shown in formula (6); and input the fusion features into the prediction layer of the student network to obtain a predicted positioning bounding box.
In one embodiment, the positioning knowledge distillation module comprises an alignment module and a semantic-location-aware sampling mechanism; the alignment module comprises a feature-level alignment module or a pixel-level alignment module. The visual positioning model training module based on dual knowledge distillation is further configured, when the alignment module is a pixel-level alignment module, to input the predicted positioning bounding box and the original image within the corresponding region into the semantic-location-aware sampling mechanism of the positioning knowledge distillation module to obtain a number of high-quality pixel-level positive and negative samples, and to perform pixel-level alignment on all pixel-level positive and negative samples with the pixel-level alignment module to obtain the relevant image region of each sample, as shown in formulas (7) and (8); and, when the alignment module is a feature-level alignment module, to input the predicted positioning bounding box and the feature vectors within the corresponding region into the semantic-location-aware sampling mechanism of the positioning knowledge distillation module to obtain a number of high-quality feature-level positive and negative samples, and to align all feature-level positive and negative samples with the feature-level alignment module to obtain the relevant feature-map region of each sample, as shown in formulas (9) and (10).
For specific limitations on visual positioning devices based on dual knowledge distillation, reference may be made to the above limitations on visual positioning methods based on dual knowledge distillation, and no further description is given here. The various modules in the dual knowledge distillation based visual positioning device described above may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a terminal, and the internal structure of which may be as shown in fig. 6. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a visual localization method based on dual knowledge distillation. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
It will be appreciated by those skilled in the art that the structure shown in FIG. 6 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In an embodiment a computer device is provided comprising a memory storing a computer program and a processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer readable storage medium is provided, on which a computer program is stored which, when executed by a processor, implements the steps of the method embodiments described above.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in embodiments provided herein may include non-volatile and/or volatile memory. The nonvolatile memory can include Read Only Memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), memory bus direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), among others.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The above examples illustrate only a few embodiments of the application, which are described in detail and are not to be construed as limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims (10)

1. A visual localization method based on dual knowledge distillation, the method comprising:
taking the acquired original image and the corresponding language query as training samples;
constructing a visual positioning model based on double knowledge distillation; the visual positioning model comprises a student network, a semantic knowledge distillation module and a positioning knowledge distillation module; the semantic knowledge distillation module is used for coding the training samples into visual features and semantic features by adopting a teacher network, and distilling the visual features and the semantic features from the teacher network to the student network; the student network is used for coding the training samples, fusing the visual characteristics and the semantic characteristics according to coding results and distillation, and predicting according to the obtained fusion characteristics to obtain a prediction positioning boundary frame; the positioning knowledge distillation module is used for generating high-quality positive and negative samples by adopting a semantic positioning sensing sampling mechanism according to the original image or the feature vector of the prediction positioning boundary box, and learning positioning knowledge by adopting a contrast learning mode;
Constructing a total loss function of the visual positioning model;
training the visual positioning model according to the training sample and the total loss function to obtain a trained visual positioning model;
and inputting the image to be detected and the corresponding language query into the student network of the trained visual positioning model to obtain a positioning bounding box.
2. The method of claim 1, wherein constructing the overall loss function of the visual localization model comprises:
constructing the loss functions of the semantic knowledge distillation module, the positioning knowledge distillation module and the student network;
setting the hyper-parameters controlling the influence of each loss;
obtaining the total loss function of the visual positioning model from the hyper-parameters and the loss functions of the semantic knowledge distillation module, the positioning knowledge distillation module and the student network as:
L_total = L_L1 + λ_giou·L_giou + λ_sem·L_sem + λ_loc·L_loc
where λ_giou, λ_sem and λ_loc are the hyper-parameters controlling the influence of the corresponding losses, L_L1 and L_giou are the L1 loss and the GIoU loss of the student network respectively, L_loc is the loss function of the positioning knowledge distillation module, and L_sem is the loss function of the semantic knowledge distillation module.
3. The method of claim 2, wherein the loss function of the semantic knowledge distillation module is:
L_sem = L_sem,v + L_sem,l
where L_sem is the loss function of the semantic knowledge distillation module, L_sem,v and L_sem,l denote the distillation losses of the visual and language modalities respectively, V_s is the overall visual feature of the student model, V_t is the overall visual feature of the semantic knowledge distillation module, η(·) is the adaptive layer, w(·) is the feature whitening function, and β is a preset parameter; L_s is the overall semantic feature of the student model and L_t is the overall semantic feature of the semantic knowledge distillation module.
The loss function of the positioning knowledge distillation module is defined in terms of the self-distillation losses,
where the two losses are the self-distillation loss computed with the pixel-level alignment module and the self-distillation loss computed with the feature-level alignment module, evaluated on the corresponding image region and feature-map region respectively, and N_pos and N_neg are the numbers of positive and negative samples.
4. The method of claim 1, wherein training the visual positioning model based on the training samples and the total loss function results in a trained visual positioning model, comprising:
inputting the training samples into the semantic knowledge distillation module for visual feature encoding and semantic feature encoding, whitening the obtained encodings, and distilling the whitened visual and semantic features, after adaptive processing, into the student network;
inputting the training samples into the student network to obtain a predicted positioning bounding box;
inputting the predicted positioning bounding box and the original image or the feature vectors within the corresponding region into the positioning knowledge distillation module, and obtaining positioning knowledge by a self-distillation method;
and training the visual positioning model according to the whitened visual and semantic features, the distilled visual and semantic features, the predicted positioning bounding box, the positioning knowledge obtained by self-distillation and the total loss function, to obtain a trained visual positioning model.
5. The method of claim 4, wherein the semantic knowledge distillation module comprises a teacher network and a feature whitening module; the teacher network comprises two large-scale pre-trained CLIP models;
inputting the training sample into the semantic knowledge distillation module for visual feature encoding and semantic feature encoding, whitening the obtained encoded features, and distilling the whitened visual features and semantic features into the student network after adaptive processing comprises the following steps:
inputting the training sample into the teacher network of the semantic knowledge distillation module, performing visual feature encoding with a first large-scale pre-trained CLIP model and semantic feature encoding with a second large-scale pre-trained CLIP model to obtain visual features and semantic features;
inputting the visual features and the semantic features into the feature whitening module of the semantic knowledge distillation module, whitening them with a parameter-free layer normalization function without scaling or bias, and processing the whitened visual features and semantic features with an adaptive layer;
distilling the adaptively processed visual features and the semantic features into the student network.
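A minimal sketch of the teacher side described in claim 5, using the public OpenAI CLIP package; the specific backbone ("ViT-B/32"), the 512-dimensional feature size and the use of a torch LayerNorm without affine parameters as the whitening function are assumptions:

```python
import clip   # OpenAI CLIP package
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_visual, preprocess = clip.load("ViT-B/32", device=device)   # first CLIP: visual feature encoding
clip_text, _ = clip.load("ViT-B/32", device=device)              # second CLIP: semantic feature encoding
whiten = nn.LayerNorm(512, elementwise_affine=False).to(device)  # parameter-free whitening

@torch.no_grad()
def encode_teacher(pil_image, query):
    image = preprocess(pil_image).unsqueeze(0).to(device)
    tokens = clip.tokenize([query]).to(device)
    f_v = clip_visual.encode_image(image).float()    # visual features
    f_l = clip_text.encode_text(tokens).float()      # semantic features
    return whiten(f_v), whiten(f_l)
```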
6. The method of claim 4, wherein the student network is a TransVG network; the TransVG network comprises a visual encoder, a language encoder, a multi-modal fusion module and a prediction layer;
inputting the training sample into the student network to obtain a predicted positioning bounding box comprises the following steps:
inputting the original image in the training sample into the visual encoder of the student network to obtain visual features;
inputting the language query corresponding to the original image in the training sample into the language encoder of the student network to obtain semantic features;
inputting the semantic features, the visual features, a learnable token [REG], and the distilled semantic features and visual features output by the teacher network into the multi-modal fusion module of the student network, obtaining the fusion features as follows:

h_s = Transformer([f_REG, φ(f_l), φ(f_v)])

wherein h_s is the fusion feature, f_l is the semantic feature, f_v is the visual feature, φ(·) is a linear layer, and f_REG is the learnable token [REG];
and inputting the fusion features into the prediction layer of the student network to obtain a predicted positioning bounding box.
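A sketch of a TransVG-style fusion step matching the formula in claim 6; the layer sizes, number of transformer layers, and the 4-dimensional sigmoid box head follow the common TransVG setup and are assumptions rather than details taken from the claim:

```python
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    def __init__(self, d_model=256, v_dim=256, l_dim=768, num_layers=6, nhead=8):
        super().__init__()
        self.reg_token = nn.Parameter(torch.zeros(1, 1, d_model))   # learnable [REG] token
        self.proj_v = nn.Linear(v_dim, d_model)                      # phi(.) for visual tokens
        self.proj_l = nn.Linear(l_dim, d_model)                      # phi(.) for language tokens
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.box_head = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                      nn.Linear(d_model, 4), nn.Sigmoid())

    def forward(self, f_l, f_v):
        # f_l: (B, N_l, l_dim) language tokens, f_v: (B, N_v, v_dim) visual tokens
        b = f_v.size(0)
        tokens = torch.cat([self.reg_token.expand(b, -1, -1),
                            self.proj_l(f_l), self.proj_v(f_v)], dim=1)
        h = self.encoder(tokens)              # fused features h_s
        return self.box_head(h[:, 0])         # predicted box regressed from the [REG] position
```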
7. The method of claim 4, wherein the positioning knowledge distillation module comprises an alignment module and a semantic positioning aware sampling mechanism; the alignment module comprises a feature level alignment module or a pixel level alignment module;
inputting the predicted positioning bounding box together with the original image or the feature vectors in the corresponding region into the positioning knowledge distillation module and obtaining positioning knowledge by a self-distillation method comprises the following steps:
when the alignment module is the pixel level alignment module: inputting the predicted positioning bounding box and the original image in the corresponding region into the semantic positioning aware sampling mechanism of the positioning knowledge distillation module to obtain a plurality of high-quality pixel level positive samples and negative samples; and performing pixel level alignment on all pixel level positive samples and negative samples with the pixel level alignment module to obtain the related image region of each sample:
wherein the related image region is cropped from the pixels I of the original image by the spatial transformer network STN, S is the grid sample size, (x_1, y_1, x_2, y_2) are the upper-left and lower-right corner coordinates of the predicted positioning bounding box, and W×H is the size to which the region is adjusted;
when the alignment module is the feature level alignment module: inputting the predicted positioning bounding box and the feature vectors in the corresponding region into the semantic positioning aware sampling mechanism of the positioning knowledge distillation module to obtain a plurality of high-quality feature level positive samples and negative samples; and performing feature level alignment on all feature level positive samples and negative samples with the feature level alignment module to obtain the related feature mapping region of each sample:

wherein the related feature mapping region is sampled from f'_v, and f'_v is the visual feature f_v after processing by the multi-modal fusion module.
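For the pixel level branch, a spatial-transformer style crop of the predicted box can be written with affine_grid and grid_sample; the sketch below assumes box coordinates normalized to [0, 1] and a square S×S output grid, which are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def crop_region(image, box, s=64):
    """image: (1, C, H, W) tensor; box: tuple of floats (x1, y1, x2, y2) normalized to [0, 1]."""
    x1, y1, x2, y2 = box
    # affine matrix mapping the S x S output grid onto the box region in [-1, 1] coordinates
    theta = torch.tensor([[[x2 - x1, 0.0, x1 + x2 - 1.0],
                           [0.0, y2 - y1, y1 + y2 - 1.0]]], dtype=image.dtype)
    grid = F.affine_grid(theta, size=(1, image.size(1), s, s), align_corners=False)
    return F.grid_sample(image, grid, align_corners=False)   # (1, C, S, S) cropped region
```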
8. A visual positioning device based on double knowledge distillation, the device comprising:
the training sample acquisition module is used for taking the acquired original image and the corresponding language query as training samples;
the visual positioning model construction module is used for constructing a visual positioning model based on double knowledge distillation; the visual positioning model comprises a student network, a semantic knowledge distillation module and a positioning knowledge distillation module; the semantic knowledge distillation module is used for encoding the training samples into visual features and semantic features with a teacher network and distilling the visual features and the semantic features from the teacher network to the student network; the student network is used for encoding the training samples, fusing its own encoded features with the distilled visual and semantic features, and predicting a positioning bounding box from the obtained fusion features; the positioning knowledge distillation module is used for generating high-quality positive and negative samples with a semantic positioning aware sampling mechanism according to the original image or the feature vectors within the predicted positioning bounding box, and learning positioning knowledge by contrastive learning;
The visual positioning model training module is used for constructing a total loss function of the visual positioning model based on double knowledge distillation; training the visual positioning model according to the training sample and the total loss function to obtain a trained visual positioning model;
and the visual positioning module based on double knowledge distillation is used for inputting the image to be tested and the corresponding language query into the student network of the trained visual positioning model to obtain a positioning bounding box.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 7 when the computer program is executed.
10. A computer readable memory, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
CN202310790208.7A 2023-06-29 2023-06-29 Visual positioning method, device, equipment and memory based on double knowledge distillation Pending CN116778140A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310790208.7A CN116778140A (en) 2023-06-29 2023-06-29 Visual positioning method, device, equipment and memory based on double knowledge distillation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310790208.7A CN116778140A (en) 2023-06-29 2023-06-29 Visual positioning method, device, equipment and memory based on double knowledge distillation

Publications (1)

Publication Number Publication Date
CN116778140A true CN116778140A (en) 2023-09-19

Family

ID=87987734

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310790208.7A Pending CN116778140A (en) 2023-06-29 2023-06-29 Visual positioning method, device, equipment and memory based on double knowledge distillation

Country Status (1)

Country Link
CN (1) CN116778140A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117253611A (en) * 2023-09-25 2023-12-19 四川大学 Intelligent early cancer screening method and system based on multi-modal knowledge distillation
CN117253611B (en) * 2023-09-25 2024-04-30 四川大学 Intelligent early cancer screening method and system based on multi-modal knowledge distillation
CN117035052A (en) * 2023-10-10 2023-11-10 杭州海康威视数字技术股份有限公司 Method, device and storage medium for distillation without data knowledge
CN117035052B (en) * 2023-10-10 2024-01-26 杭州海康威视数字技术股份有限公司 Method, device and storage medium for distillation without data knowledge
CN117216225A (en) * 2023-10-19 2023-12-12 四川大学 Three-mode knowledge distillation-based 3D visual question-answering method
CN117216225B (en) * 2023-10-19 2024-06-04 四川大学 Three-mode knowledge distillation-based 3D visual question-answering method
CN117671426A (en) * 2023-12-07 2024-03-08 北京智源人工智能研究院 Concept distillation and CLIP-based hintable segmentation model pre-training method and system
CN117671426B (en) * 2023-12-07 2024-05-28 北京智源人工智能研究院 Concept distillation and CLIP-based hintable segmentation model pre-training method and system
CN117909535A (en) * 2024-03-15 2024-04-19 中国科学技术大学 Combined understanding method, system, equipment and medium based on visual language model
CN117909535B (en) * 2024-03-15 2024-05-31 中国科学技术大学 Combined understanding method, system, equipment and medium based on visual language model

Similar Documents

Publication Publication Date Title
AU2019200270B2 (en) Concept mask: large-scale segmentation from semantic concepts
Wu et al. EDN: Salient object detection via extremely-downsampled network
JP7206309B2 (en) Image question answering method, device, computer device, medium and program
Ren et al. A coarse-to-fine indoor layout estimation (cfile) method
Gollapudi Learn computer vision using OpenCV
JP7167216B2 (en) Image Question Answering Method, Apparatus, Computer Apparatus, Medium and Program
CN111563502B (en) Image text recognition method and device, electronic equipment and computer storage medium
CN116778140A (en) Visual positioning method, device, equipment and memory based on double knowledge distillation
CN115393854B (en) Visual alignment processing method, terminal and storage medium
US20230161952A1 (en) Automatic semantic labeling of form fields with limited annotations
Su et al. Hierarchical deep neural network for image captioning
CN111742345A (en) Visual tracking by coloring
CN115146100A (en) Cross-modal retrieval model and method based on counterfactual reasoning and computer equipment
Xu et al. RGB-T salient object detection via CNN feature and result saliency map fusion
Zhang et al. CorrFormer: Context-aware tracking with cross-correlation and transformer
Pramanick et al. Talk-to-Resolve: Combining scene understanding and spatial dialogue to resolve granular task ambiguity for a collocated robot
US20230027813A1 (en) Object detecting method, electronic device and storage medium
Tyagi et al. Sign language recognition using hand mark analysis for vision-based system (HMASL)
Huang et al. Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey
Rajagopal et al. A hybrid Cycle GAN-based lightweight road perception pipeline for road dataset generation for Urban mobility
CN113435206A (en) Image-text retrieval method and device and electronic equipment
CN113283248B (en) Automatic natural language generation method and device for scatter diagram description
Storey et al. Atypical facial landmark localisation with stacked hourglass networks: a study on 3D facial modelling for medical diagnosis
Golande et al. Optimal Software Based Sign Language Recognition System
Zhu et al. Multi-modal Large Language Model Enhanced Pseudo 3D Perception Framework for Visual Commonsense Reasoning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination