CN115147644A - Method, system, device and storage medium for training and describing image description model

Method, system, device and storage medium for training and describing image description model

Info

Publication number
CN115147644A
CN115147644A (application number CN202210658065.XA)
Authority
CN
China
Prior art keywords
domain image
target
cross
style
training set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210658065.XA
Other languages
Chinese (zh)
Inventor
陆阳
赵明
杨帆
白婷
闻斌
张立
卫星
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei University of Technology filed Critical Hefei University of Technology
Priority to CN202210658065.XA
Publication of CN115147644A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761 Proximity, similarity or dissimilarity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Abstract

The invention provides a training and description method, a system, a device and a storage medium for an image description model, and belongs to the technical field of image recognition. The training method of the image description model comprises the following steps: acquiring a training set, wherein the training set comprises a source domain image with a label and a target domain image without a label; acquiring a cross-domain image description model, wherein the cross-domain image description model comprises a style migration module, a contrastive learning module and a target detection module; and jointly training the style migration module, the contrastive learning module and the target detection module based on the training set to obtain the trained cross-domain image description model, wherein the cross-domain image description model performs target recognition on images with the target domain image style based on the label classification of the source domain image. The method minimizes both the contrastive loss and the difference between the source domain and the target domain, effectively improves detection capability across different domains, and enables description of target domain images without manually annotating the collected target domain images.

Description

Training and description method, system, equipment and storage medium of image description model
Technical Field
The invention relates to the technical field of image recognition, and in particular to a training and description method, a system, a device and a storage medium for an image description model.
Background
In many real-world scenes, general visual tasks such as image recognition, object detection and image translation face serious challenges arising from viewing angle, lighting, background, occlusion and scene change. Because these factors are unavoidable, tasks in such domain-shift environments have become a challenging and emerging research direction in recent years. In real tasks such as video surveillance and automatic driving, domain change is a widely recognized problem that urgently needs to be overcome, so large-scale cross-domain benchmarks are urgently needed to promote the development of the field.
Cross-domain adaptive object detection (Cross-Domain Object Detection), as commonly used in the prior art, aims to learn transferable feature representations under domain shift, where the training data (source domain) is richly annotated with bounding-box labels while the test data (target domain) has few or no labels. Because the feature distributions of the source domain and the target domain differ, a model trained only on the source domain generalizes poorly; by aligning the distributions of the two domains during training, the label supervision of the source domain becomes better shareable with the target domain, yielding a detector with enhanced generalization. However, domain adaptation faces two major difficulties. First, the domain difference cannot be eliminated, so performance drops sharply and training may fail to converge, and a classifier trained on the source domain cannot be applied directly to the target domain. Second, it is unknown which part of the source domain class space shares features with the target domain class space, because the target domain class space is not accessible during training.
Therefore, there is a need to provide a method, system, device and storage medium for training and describing an image description model to solve the above problems.
Disclosure of Invention
In view of the foregoing shortcomings of the prior art, an object of the present invention is to provide a training and description method, a system, a device and a storage medium for an image description model, so as to solve the technical problem that a classifier trained by existing cross-domain adaptive target detection methods cannot eliminate the domain difference and cannot align the common features of the source domain and the target domain during training, so that the trained classifier cannot effectively recognize target domain images.
To achieve the above and other related objects, the present invention provides a training method for a cross-domain image description model, comprising the following steps:
acquiring a training set, wherein the training set comprises a source domain image with a label and a target domain image without a label;
acquiring a cross-domain image description model, wherein the cross-domain image description model comprises a style migration module, a contrastive learning module and a target detection module;
and jointly training the style migration module, the contrastive learning module and the target detection module based on the training set to obtain the trained cross-domain image description model, wherein the cross-domain image description model performs target recognition on images with the target domain image style based on the label classification of the source domain image.
In an embodiment of the present invention, there is further provided a cross-domain image description method, which uses a cross-domain image description model obtained by training with the training method of the cross-domain image description model according to any one of the above embodiments, and the cross-domain image description method comprises:
acquiring image data;
and inputting the image data into the cross-domain image description model, and acquiring a target identification result of the image data.
In an embodiment of the present invention, there is further provided a training system for a cross-domain image description model, the system comprising:
a data acquisition unit, configured to acquire a training set, wherein the training set comprises a source domain image with a label and a target domain image without a label;
a model calling unit, configured to acquire a cross-domain image description model, wherein the cross-domain image description model comprises a style migration module, a contrastive learning module and a target detection module;
and a joint training unit, configured to jointly train the style migration module, the contrastive learning module and the target detection module based on the training set to obtain the trained cross-domain image description model, wherein the cross-domain image description model performs target recognition on images with the target domain image style based on the label classification of the source domain image.
In an embodiment of the present invention, there is also provided a computer device comprising a processor coupled with a memory, the memory storing program instructions which, when executed by the processor, implement any of the methods described above.
In an embodiment of the invention, there is also provided a computer-readable storage medium comprising a program which, when run on a computer, causes the computer to perform the method of any of the above.
The invention discloses a training and description method, a system, a device and a storage medium for an image description model, in which stylization is embedded into contrastive learning and target recognition. The source domain images are stylized with the target domain images so that the source domain images take on the target domain image style; the domain difference is thereby eliminated while the content structure of the source domain images is kept. As a result, contrastive learning can maximize the similarity of features in the source domain and target domain images without interference from the domain difference, and during target recognition the label information of the source domain images can be used to accurately locate and classify the recognition targets in the target domain images, yielding a better description result for the target-scene images.
In summary, the training and description method, system, device and storage medium of the image description model minimize the contrastive loss and the difference between the source domain and the target domain, effectively improve detection capability across different domains, and enable description of target domain images without labeling the manually collected target domain images. The technical solution of the invention effectively addresses the problems that the current data volume is large and label information cannot be annotated effectively in time, reduces the cost of manual labeling by technical means, completes the cross-domain target detection task, and improves the accuracy of target recognition.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is an overall framework diagram of a cross-domain image description model according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating the process of style migration of source domain images by the style migration module in accordance with an embodiment of the present invention;
FIG. 3 is a diagram illustrating the process by which the contrastive learning module extracts similar features of the images in the training set through contrastive learning according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating the process of the target detection module in recognizing images in the training set according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an SSD framework for a target detection module in accordance with an embodiment of the present invention;
FIG. 6 is a flowchart illustrating a training method of a cross-domain image description model according to an embodiment of the present invention;
FIG. 7 is a flowchart illustrating step S3 according to an embodiment of the present invention;
FIG. 8 is a flowchart illustrating step S32 according to an embodiment of the present invention;
FIG. 9 is a flowchart illustrating step S33 according to an embodiment of the present invention;
FIG. 10 is a flowchart illustrating step S34 according to an embodiment of the present invention;
FIG. 11 is a block diagram illustrating an exemplary training system for cross-domain image description models, according to the present invention.
Element number description:
10. a training system for cross-domain image description models; 11. a data acquisition unit; 12. a model calling unit; 13. a joint training unit.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict. It is also to be understood that the terminology used in the examples is for the purpose of describing particular embodiments only, and is not intended to limit the scope of the present invention. Test methods in which specific conditions are not noted in the following examples are generally performed under conventional conditions or conditions recommended by the respective manufacturers.
Please refer to fig. 1 to 10. It should be understood that the structures, ratios and sizes shown in the drawings of this specification are only used to help those skilled in the art understand and read the disclosure, and are not intended to limit the conditions under which the invention can be implemented; any modification of structure, change of ratio or adjustment of size, without affecting the functions and effects that the invention can achieve, shall still fall within the scope covered by the technical content disclosed by the invention. In addition, terms such as "upper", "lower", "left", "right", "middle" and "a" used in this specification are only for clarity of description and are not intended to limit the implementable scope of the invention; changes or adjustments of their relative relationships, without substantial change of the technical content, shall also be regarded as within the implementable scope of the invention.
When numerical ranges are given in the examples, it is understood that both endpoints of each of the numerical ranges and any value therebetween can be selected unless the invention otherwise indicated. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs and the description of the present invention, and any methods, apparatuses, and materials similar or equivalent to those described in the examples of the present invention may be used to practice the present invention.
Referring to fig. 1 to 6, an object of the present invention is to provide a training and description method, a system, a device and a storage medium for an image description model, so as to solve the technical problem that a classifier trained by existing cross-domain adaptive target detection methods cannot eliminate the domain difference and cannot align the common features of the source domain and the target domain during training, so that the trained classifier cannot effectively recognize target domain images.
Referring to fig. 1 to 4, in the image description model trained by the training method of the present invention, a stylization network is embedded in the model and stylization is embedded into contrastive learning, so that the contrastive loss and the difference between the source domain and the target domain are reduced to the greatest extent, the detection capability across different domains is effectively improved, and the cost and time of manually labeling data are reduced.
Referring to fig. 1 to 6, fig. 6 is a flowchart illustrating a training method of a cross-domain image description model according to an embodiment of the invention. In an embodiment of the present invention, a training method for a cross-domain image description model is provided, which includes the following steps:
s1, acquiring a training set, and preprocessing images in the training set; wherein the training set comprises labeled source domain images and unlabeled target domain images, and the pre-processing comprises cropping all images in the training set to a size of 224 x 224; specifically, the target domain image is obtained by shooting and collecting a target scene through a camera, the source domain image is selected from an existing data set, and the selected source domain image is labeled through a corresponding tool pack to obtain a source domain image with a label. For example, in an embodiment, the target domain image is a daily state image captured by a camera in a real traffic scene, specifically, the daily state video stream data in the traffic scene is collected by installing a static industrial camera on a road surface of a traffic environment for capturing, and then key frames in the video stream data are selected according to a preset time interval and stored to obtain the target domain image; the source domain image is an image obtained from a current popular data set citrescaps, and the obtained image is labeled by adopting a pycocools packet, so that a source domain image with a label is obtained; among them, the data set cityscaps employed in the present embodiment has 5000 images of driving scenes in an urban environment (2975train, 500val,1525 test), with 19 classes of dense pixel labeling (97% coverage), 8 of which have example level segmentation. The large dataset contains a variety of stereoscopic video sequences recorded in street scenes from 50 different cities, focusing on a semantically understood picture dataset of the city street scenes.
S2, acquiring a cross-domain image description model, wherein the cross-domain image description model comprises a style migration module, a contrastive learning module and a target detection module.
the style migration module may be configured to perform style migration on an input source domain image and an input target domain image, and is configured to stylize the source domain image into an image with a style of the target domain image, specifically, to retain content features of the source domain image, and migrate style features, such as texture, color, and the like, of the source domain image into style features of the target domain image, so as to obtain the source domain image with the style of the target domain, thereby effectively reducing a domain difference between the source domain image and the target domain image.
The contrastive learning module performs self-supervised contrastive learning on the target domain images and the source domain images that carry the target domain image style, aligns the features of the source domain images and the target domain images, classifies the source domain images and the target domain images, and assigns pseudo labels to target domain images of the same class based on the labels carried by the source domain images. The contrastive learning module trains a VGG19 (Visual Geometry Group Network) model using the ImageNet data set.
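As a sketch, the ImageNet-trained VGG19 mentioned above can be exposed as a feature extractor in the following way; the torchvision ≥ 0.13 weights API, the choice of layers and the decision to freeze the weights are assumptions of this sketch, not requirements of the embodiment.

import torchvision

def build_vgg19_feature_extractor(layer_ids=(3, 8, 17, 26, 35)):
    """Return the convolutional part of an ImageNet-trained VGG19 and the indices of the
    layers whose activations are taken as feature maps (layer choice is an assumption)."""
    vgg = torchvision.models.vgg19(weights=torchvision.models.VGG19_Weights.IMAGENET1K_V1)
    features = vgg.features.eval()
    for p in features.parameters():
        p.requires_grad_(False)  # used here as a fixed feature extractor
    return features, layer_ids

def extract_feature_maps(features, layer_ids, image):
    """Collect the activations Phi_l(image) at the selected layers of the network."""
    maps, x = [], image
    for idx, layer in enumerate(features):
        x = layer(x)
        if idx in layer_ids:
            maps.append(x)
    return maps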
The target detection module is used to perform target recognition on the target domain images and the source domain images with the target domain image style, and to obtain the localization and classification information of the targets in these images based on the labels of the source domain images and the pseudo labels of the target domain images, thereby completing target recognition of images in the target-domain-style scene. Compared with the conventional Fast R-CNN (Fast Region-based Convolutional Neural Network), the target detection module adopts a target recognition framework based on the SSD (Single Shot MultiBox Detector); because this algorithm has no region-proposal generation stage, the detection speed is greatly improved.
As shown in fig. 4 and 5, the default base network of the SSD is VGG16 (Visual Geometry Group Network). The VGG16 backbone consists of 2 Conv1_x convolutional layers, 2 Conv2_x, 3 Conv3_x, 3 Conv4_x and 3 Conv5_x convolutional layers together with 5 pooling layers, and the last 3 layers are fully connected layers with 4096 channels, implemented with 1 × 1 convolution kernels. Conv1_x, Conv2_x, Conv3_x, Conv4_x and Conv5_x are convolution blocks of different sizes, with the following detailed structure:
Conv1_x has two convolutional layers with 3 × 3 kernels and 64 channels; input image: 224 × 224 × 3, size after convolution: 224 × 224 × 64;
Pool1 has a 3 × 3 kernel and 64 channels; input: 224 × 224 × 64, size after pooling: 112 × 112 × 64;
Conv2_x has two convolutional layers with 3 × 3 kernels and 128 channels; input: 112 × 112 × 64, size after convolution: 112 × 112 × 128;
Pool2 has a 2 × 2 kernel and 128 channels; input: 112 × 112 × 128, size after pooling: 56 × 56 × 128;
Conv3_x has three convolutional layers with 3 × 3 kernels and 256 channels; input: 56 × 56 × 128, size after convolution: 56 × 56 × 256;
Pool3 has a 2 × 2 kernel and 256 channels; input: 56 × 56 × 256, size after pooling: 28 × 28 × 256;
Conv4_x has three convolutional layers with 3 × 3 kernels and 512 channels; input: 28 × 28 × 256, size after convolution: 28 × 28 × 512;
Pool4 has a 2 × 2 kernel and 512 channels; input: 28 × 28 × 512, size after pooling: 14 × 14 × 512;
Conv5_x has three convolutional layers with 3 × 3 kernels and 512 channels; input: 14 × 14 × 512, size after convolution: 14 × 14 × 512;
Pool5 has a 2 × 2 kernel and 512 channels; input: 14 × 14 × 512, size after pooling: 7 × 7 × 512.
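The layer listing above corresponds to the convolutional part of a standard VGG16. A minimal PyTorch sketch of that backbone is given below; the framework, the use of max pooling and the fixed 2 × 2 pooling kernels are implementation assumptions (substitute average pooling or a 3 × 3 first pooling kernel if that reading of the listing is intended).

import torch.nn as nn

def vgg16_conv_backbone():
    """Convolutional blocks Conv1_x..Conv5_x with Pool1..Pool5, mapping 224x224x3 -> 7x7x512."""
    cfg = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]  # (number of conv layers, output channels)
    layers, in_ch = [], 3
    for num_convs, out_ch in cfg:
        for _ in range(num_convs):
            layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True)]
            in_ch = out_ch
        layers.append(nn.MaxPool2d(kernel_size=2, stride=2))  # one pooling layer per block
    return nn.Sequential(*layers)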
In the cross-domain image description model, the style migration module transfers the style of the source domain images into the style of the target domain images, thereby eliminating the domain difference between the source domain images and the target domain images; this allows the contrastive learning module to obtain the similar features of the source domain and target domain images without domain-difference interference during self-supervised contrastive learning, so that accurate pseudo labels can be assigned to the target domain images using the labels of the source domain images based on feature alignment. Finally, the target detection module completes accurate recognition of the targets in the target domain images based on the pseudo-label classification, so that the existing label classification of the source domain images is effectively used to describe the target-domain-style images. The cross-domain image description model alleviates the problems of traditional vehicle and pedestrian detection in traffic scenes, such as the heavy reliance on manual inspection, misjudgment by naked-eye observation caused by complex environments (viewing angle, illumination, background, occlusion, scene change and the like), and the inability of traditional monitoring equipment to provide effective state information, and improves the accuracy of the system in detecting objects in common scenes.
S3, jointly training the style migration module, the contrastive learning module and the target detection module based on the training set to obtain the trained cross-domain image description model, wherein the cross-domain image description model performs target recognition on images with the target domain image style based on the label classification of the source domain images.
Specifically, the training set is input into the cross-domain image description model, the loss functions of the style migration module, the contrastive learning module and the target detection module are obtained, and the total loss function of the cross-domain image description model is computed from them.
The training set is then fed batch by batch into the style migration module, the contrastive learning module and the target detection module for iterative training using mini-batch stochastic gradient descent, and the total loss function is minimized through the iterative training. In this way the cross-domain image description model reduces the domain difference between the stylized source domain images, which carry the target domain style, and the target domain images, and the model is ensured to effectively use the labels of the source domain images to classify the target domain images, so that the cross-domain image description model can accurately locate and classify the recognition targets in images of different scenes, yielding the trained cross-domain image description model.
In an embodiment of the invention, the cross-domain image description model is trained with mini-batch stochastic gradient descent using a weight decay of 0.0005 and a momentum of 0.9. The model is trained iteratively on successive mini-batches, and the parameters of the cross-domain image description model are fine-tuned until convergence, so that the total loss function of the cross-domain image description model is minimized. The iterations follow the same learning-rate schedule, and each batch of samples is iterated 50 times in the model, so that the model learning rate η_p is adjusted according to the formula
η_p = η_0 + (1 − η_0) · p,
where p denotes the training progress; that is, the learning rate is adjusted linearly, increasing from the initial learning rate η_0 to 1.
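A sketch of this optimization set-up is given below. The weight decay of 0.0005 and the momentum of 0.9 come from the text; the PyTorch framework, the initial learning rate, the module interfaces and the exact form of the linear schedule are assumptions of the sketch.

import torch

def make_optimizer(model, eta0=1e-3):
    # Mini-batch SGD with the stated momentum 0.9 and weight decay 0.0005; eta0 is an assumed initial rate.
    return torch.optim.SGD(model.parameters(), lr=eta0, momentum=0.9, weight_decay=5e-4)

def adjust_learning_rate(optimizer, progress, eta0=1e-3):
    # Linear schedule reconstructed from the text: eta_p rises from eta_0 towards 1 as progress p goes 0 -> 1.
    eta_p = eta0 + (1.0 - eta0) * progress
    for group in optimizer.param_groups:
        group["lr"] = eta_p
    return eta_p

def joint_training_step(batch, style_module, contrast_module, detector, optimizer):
    """One joint iteration: stylize the source images, contrast, detect, back-propagate the summed loss."""
    src_imgs, src_labels, tgt_imgs = batch  # labelled source images, unlabelled target images
    stylized_src, loss_neural = style_module(src_imgs, tgt_imgs)                    # style migration loss
    loss_nce, pseudo_labels = contrast_module(stylized_src, tgt_imgs, src_labels)   # contrastive loss + pseudo labels
    loss_ssd = detector(stylized_src, src_labels, tgt_imgs, pseudo_labels)          # detection loss
    loss_total = loss_neural + loss_nce + loss_ssd                                  # total loss of the model
    optimizer.zero_grad()
    loss_total.backward()
    optimizer.step()
    return float(loss_total.detach())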
Further, referring to fig. 1 and fig. 7, in step S3, inputting the training set into the cross-domain image description model and obtaining the total loss function of the cross-domain image description model includes the following processes:
S31, inputting the training set into the cross-domain image description model;
S32, performing style migration on the source domain images through the style migration module, migrating the style of the source domain images in the training set into the style of the target domain images, and obtaining a first loss function L_neural of the style migration module by comparing the source domain images having the target domain image style with the original source domain images and the target domain images;
S33, obtaining the similar features of the images in the training set through the self-supervised contrastive learning of the contrastive learning module, obtaining a second loss function L_NCE of the contrastive learning module from the compared features, and assigning pseudo labels to the target domain images having the same similar features based on the labels of the source domain images;
S34, performing target recognition on the images of the training set through the target detection module to obtain the target recognition results for the source domain images with the target domain image style and the target domain images, and determining a third loss function L_SSD of the target detection module according to the target recognition results;
S35, obtaining the total loss function L_total of the cross-domain image description model, where the total loss function L_total is the sum of the first loss function L_neural, the second loss function L_NCE and the third loss function L_SSD, i.e., L_total = L_neural + L_NCE + L_SSD.
Referring to fig. 2 and 8, in an embodiment of the present invention, the step S32 includes the following processes:
S321, grouping the training set so that all images in the training set are evenly divided into a plurality of migration groups, each migration group comprising a source domain image and a target domain image of similar type;
S322, extracting style feature images from the target domain image of the migration group through the style migration module. Specifically, the plural convolutional layers of the convolutional neural network architecture in the style migration module extract style features from the target domain image I_P of the migration group, obtaining a plurality of style feature images Φ_l(I_P), where Φ_l(I_P) denotes the style feature image extracted from the target domain image I_P at the l-th convolutional layer;
S323, extracting content feature images from the source domain image of the same migration group through the style migration module. Specifically, the plural convolutional layers of the convolutional neural network architecture in the style migration module extract content features from the source domain image I_S of the migration group, obtaining a plurality of content feature images Φ_l(I_S), where Φ_l(I_S) denotes the content feature image extracted from the source domain image I_S at the l-th convolutional layer;
S324, generating a synthesized feature image Φ_l(I_C) from the obtained style feature image Φ_l(I_P) and content feature image Φ_l(I_S), and restoring a synthesized image from the synthesized feature image Φ_l(I_C), wherein the synthesized image is a source domain image with the target domain image style; the corresponding source domain image in the training set is replaced with the synthesized image and the label of the source domain image is retained, so that the style of the source domain images in the training set is migrated to the style of the target domain images. The synthesized image retains the content features, such as structural features, of the source domain image and fuses in the style features, such as texture and color, of the target domain image;
S325, comparing the synthesized image with the source domain image and the target domain image of the same migration group, and computing the first loss function L_neural of the style migration module. The first loss function L_neural of the style migration module is a linear superposition of the content loss function L_content and the style loss function L_style of the style migration process, expressed as L_neural = λ_content · L_content + λ_style · L_style, where λ_content and λ_style are the weight factors of the content loss function and the style loss function, respectively.
The style loss function L_style is defined as the difference between the style features of the target domain image and those of the source domain image having the target domain image style; specifically, it is the sum, over the convolutional layers of the style migration module, of the differences between the Gram matrices of the synthesized feature image and of the target-domain style image, expressed as
L_style = Σ_l γ_l · ‖G_l(I_C) − G_l(I_P)‖²,
where γ_l is a hyper-parameter controlling the layer coefficient factor of the l-th convolutional layer. G_l(I_C) is the Gram matrix of the synthesized feature image Φ_l(I_C) obtained at the l-th convolutional layer, computed from Φ_l(I_C) according to
G_l(I_C)_{ij} = Σ_k Φ_l(I_C)_{ik} · Φ_l(I_C)_{jk},
and G_l(I_P) is the Gram matrix of the style feature image Φ_l(I_P) obtained from the target domain image I_P at the l-th convolutional layer, computed from Φ_l(I_P) according to
G_l(I_P)_{ij} = Σ_k Φ_l(I_P)_{ik} · Φ_l(I_P)_{jk},
where i and j index the feature channels and k runs over the spatial positions of the feature map.
The content loss function L_content is obtained by comparing the synthesized image with the initial source domain image and is defined as the difference in content features between the source domain image and the source domain image having the target domain image style; specifically, it is the difference between the synthesized feature image and the content feature image in the style migration module, expressed as
L_content = Σ_l ‖Φ_l(I_C) − Φ_l(I_S)‖²,
where Φ_l(I_C) is the synthesized feature image at the l-th convolutional layer of the style migration module, and Φ_l(I_S) is the content feature image extracted from the initial source domain image I_S at the l-th convolutional layer of the style migration module.
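A sketch of these two loss terms on lists of feature maps, such as those produced by the VGG feature extractor sketched earlier, is given below. The mean-squared-error form, the default weights λ_content and λ_style and the per-layer factors γ_l passed as "gammas" are illustrative assumptions; the patent only fixes the Gram-matrix style term, the feature-difference content term and their weighted sum.

import torch
import torch.nn.functional as F

def gram_matrix(feat):
    """Gram matrix G_l of a feature map Phi_l with shape (B, C, H, W)."""
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

def style_migration_loss(feats_c, feats_s, feats_p, gammas, lam_content=1.0, lam_style=1e3):
    """L_neural = lambda_content * L_content + lambda_style * L_style.

    feats_c: list of Phi_l(I_C) for the synthesized image
    feats_s: list of Phi_l(I_S) for the source domain image
    feats_p: list of Phi_l(I_P) for the target-domain style image
    """
    loss_content = sum(F.mse_loss(fc, fs) for fc, fs in zip(feats_c, feats_s))
    loss_style = sum(g * F.mse_loss(gram_matrix(fc), gram_matrix(fp))
                     for g, fc, fp in zip(gammas, feats_c, feats_p))
    return lam_content * loss_content + lam_style * loss_style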
Referring to fig. 3 and 9, in an embodiment of the present invention, the step S33 includes the following processes:
S331, performing anchor-point augmentation twice on the images in the training set;
S332, selecting the two anchor images obtained from the same image of the training set as positive samples, and taking the other anchor images as negative samples;
S333, comparing the positive and negative samples through the self-supervised contrastive learning of the contrastive learning module, and computing the second loss function L_NCE of the contrastive learning module. Specifically, a representation vector h = f(x) of each augmented anchor image is obtained by a base encoder f(·) of the neural network, the representation vectors of the anchor images in the positive and negative samples are then mapped into the same space of the contrastive loss by a neural-network projection head g(·), and the second loss function L_NCE of the contrastive learning module is obtained from the similarity of the mapped representations of the anchor images in the positive and negative samples. Finally, by minimizing the second loss function L_NCE, the similarity of the two anchor images in the positive pair is strengthened and the difference between the two anchor images in the positive pair and the other anchor images in the negative samples is enlarged, so that the similar features of the images in the training set corresponding to the positive samples are obtained; the base encoder f(·) adopts a ResNet structure.
And S334, classifying the images in the training set according to the similar features of the images in the training set, and thus marking pseudo labels on the target domain images with the same similar features based on the labels of the source domain images in the training set.
The second loss function L_NCE of the contrastive learning module is obtained by comparing the feature differences of the positive and negative samples in self-supervised learning, and is expressed as
L_NCE = −log( exp(sim(q, k⁺)/τ) / ( exp(sim(q, k⁺)/τ) + Σ_{k⁻} exp(sim(q, k⁻)/τ) ) ),
where sim(·) denotes the cosine similarity function, q denotes a source domain (query) sample, k⁺ denotes a positive sample, k⁻ denotes a negative sample, and τ is a hyper-parameter also known as the temperature coefficient.
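A sketch of this contrastive term in batched form follows; the tensor shapes, the default temperature and the assumption that the encoder f(·) and projection head g(·) have already produced the embeddings are all illustrative choices of the sketch.

import torch
import torch.nn.functional as F

def info_nce_loss(z_q, z_pos, z_neg, tau=0.07):
    """L_NCE = -log( exp(sim(q,k+)/tau) / (exp(sim(q,k+)/tau) + sum_k- exp(sim(q,k-)/tau)) ).

    z_q:   (B, D)    projected anchor embeddings g(f(x))
    z_pos: (B, D)    projections of the positive view of the same image
    z_neg: (B, K, D) projections of the K negative views
    """
    z_q, z_pos, z_neg = (F.normalize(t, dim=-1) for t in (z_q, z_pos, z_neg))
    pos = torch.sum(z_q * z_pos, dim=-1, keepdim=True) / tau        # (B, 1), cosine similarity to k+
    neg = torch.einsum("bd,bkd->bk", z_q, z_neg) / tau              # (B, K), cosine similarity to each k-
    logits = torch.cat([pos, neg], dim=1)                           # the positive occupies index 0
    targets = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, targets)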
Referring to fig. 4 and 10, in an embodiment of the present invention, the step S34 includes the following processes:
S341, extracting feature maps of the images in the training set through the convolutional neural network of the target detection module to obtain the feature maps extracted from the images in the training set at different convolutional layers; in this embodiment, the target detection module uses the VGG16 neural network of the SSD algorithm as the encoder and obtains the feature maps of the images in the training set from a plurality of convolutional layers of the VGG16 neural network;
S342, detecting the feature maps with convolution kernels to acquire localization information and feature information of the recognition targets in the images of the training set, wherein the localization information is the coordinate information of the bounding box locating a recognition target in an image, and the feature information is the feature information of the recognition target in the images of the training set. The target detection module uses a series of small convolution kernels, such as 3 × 3 or 1 × 1 kernels, to detect the feature maps produced by the different convolutional layers so as to predict the coordinates and categories of the recognition targets; since the feature maps produced by different convolutional layers have different receptive fields, the detection process can be regarded as regression and classification on feature maps of different sizes.
It should be noted that, in a convolutional neural network, the receptive field is defined as the size of the region of the input image onto which a pixel of the feature map output by each layer is mapped; a large receptive field can extract features over a larger range of the image. Therefore, convolution kernels of multiple scales are used at the same level of the network, which adapts to various image characteristics and yields a better image representation in the deep layers of the network. This improves the adaptability of the network and saves the researcher some tuning work.
S343, aligning the feature information of the images in the training set, and obtaining the labels or pseudo labels corresponding to the feature information based on the labels of the source domain images and the pseudo labels of the target domain images in the training set, so as to obtain the classification information of the images in the training set;
S344, based on the localization information and classification information of the images in the training set, respectively computing the localization loss function L_loc and the confidence loss function L_conf of the target detection module in the target recognition process, and obtaining the third loss function L_SSD of the target detection module from the localization loss function L_loc and the confidence loss function L_conf; meanwhile, non-maximum suppression (NMS) is used to screen out the localization information and classification information with the highest confidence while the third loss function L_SSD is reduced, and these are combined to obtain the target recognition results of the images in the training set.
The third loss function L_SSD of the target detection module is the sum of the localization loss function L_loc and the confidence loss function L_conf, given by
L_SSD(x, c, l, g) = (1/N) · ( L_conf(x, c) + α · L_loc(x, l, g) ),
where N is the number of positive prior boxes (a positive sample is a prior box matched to a ground-truth bounding box, and a negative sample is a prior box not matched to any ground-truth bounding box), α is a weight term set to 1 by cross-validation, x^p_{ij} ∈ {0, 1} is an indicator parameter such that x^p_{ij} = 1 indicates that the i-th prior box is matched to the j-th ground-truth bounding box of category p, c is the predicted category confidence, l is the predicted location of the bounding box corresponding to the prior box, and g is the location parameter of the ground-truth bounding box.
The localization loss function L_loc is the Smooth L1 loss between the prior boxes obtained by the target detection module in the target recognition process and the corresponding ground-truth bounding boxes, expressed as follows:
L_loc(x, l, g) = Σ_{i ∈ Pos} Σ_{m ∈ {cx, cy, w, h}} x^k_{ij} · smooth_L1( l^m_i − ĝ^m_j ),
with the encoded ground-truth offsets
ĝ^{cx}_j = (g^{cx}_j − d^{cx}_i) / d^w_i,
ĝ^{cy}_j = (g^{cy}_j − d^{cy}_i) / d^h_i,
ĝ^w_j = log( g^w_j / d^w_i ),
ĝ^h_j = log( g^h_j / d^h_i ),
where d denotes a prior (default) box. Owing to the indicator x^p_{ij}, the localization error is computed only for the positive samples, and the ground-truth box g must first be encoded to obtain ĝ, since the predicted value l is also an encoded value.
The confidence loss function L_conf is a softmax loss, defined as follows:
L_conf(x, c) = − Σ_{i ∈ Pos} x^p_{ij} · log( ĉ^p_i ) − Σ_{i ∈ Neg} log( ĉ^0_i ),
where
ĉ^p_i = exp( c^p_i ) / Σ_p exp( c^p_i ).
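The multibox objective above may be sketched as follows. The prior-box matching, the encoding of the ground-truth offsets and the hard-negative mining that a full SSD implementation performs are assumed to have been done outside this function; including every negative prior in the confidence term is a simplification of this sketch.

import torch
import torch.nn.functional as F

def ssd_multibox_loss(loc_pred, conf_pred, loc_target, labels, alpha=1.0):
    """L_SSD = (1/N) * (L_conf + alpha * L_loc), with alpha = 1 as stated in the text.

    loc_pred:   (P, 4) predicted offsets l for every prior box
    conf_pred:  (P, num_classes) class scores c, with class 0 as background
    loc_target: (P, 4) encoded ground-truth offsets g_hat for the matched priors
    labels:     (P,)   matched class index per prior (torch.long), 0 for negatives
    """
    pos = labels > 0
    num_pos = max(int(pos.sum().item()), 1)  # N, number of positive priors

    # Localization loss: Smooth L1 on the positive priors only.
    loss_loc = F.smooth_l1_loss(loc_pred[pos], loc_target[pos], reduction="sum")

    # Confidence loss: softmax cross-entropy (a full SSD would keep only mined hard negatives).
    loss_conf = F.cross_entropy(conf_pred, labels, reduction="sum")

    return (loss_conf + alpha * loss_loc) / num_pos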
referring to fig. 1 and fig. 4, the present invention further provides a cross-domain image description method, which is a cross-domain image description model obtained by training the training method of the cross-domain image description, and the cross-domain image description method includes:
acquiring image data, preprocessing the image data, and adjusting the image data into a size 224 multiplied by 224 input by a network in a cross-domain image description model;
and inputting the image data into the cross-domain image description model, and acquiring a target identification result of the image data.
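A minimal inference sketch of this description method is given below; the use of OpenCV for loading, the normalization, the device choice and the assumed (boxes, scores, classes) output format of the model are illustrative assumptions.

import cv2
import torch

def describe_image(model, image_path, device="cuda"):
    """Resize an input image to 224 x 224 and run the trained cross-domain model on it."""
    img = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2RGB)
    img = cv2.resize(img, (224, 224))
    x = torch.from_numpy(img).permute(2, 0, 1).float().unsqueeze(0).to(device) / 255.0
    model.eval()
    with torch.no_grad():
        boxes, scores, classes = model(x)  # assumed output: locations, confidences, labels
    return boxes, scores, classes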
Referring to fig. 11, fig. 11 is a block diagram of a training system 10 for a cross-domain image description model according to an embodiment of the invention. The training system 10 for the cross-domain image description model comprises a data acquisition unit 11, a model calling unit 12 and a joint training unit 13. The data acquisition unit 11 is configured to acquire a data set comprising labeled source domain images and unlabeled target domain images; the model calling unit 12 is configured to acquire a cross-domain image description model comprising a style migration module, a contrastive learning module and a target detection module; and the joint training unit 13 is configured to jointly train the style migration module, the contrastive learning module and the target detection module based on the data set to obtain the trained cross-domain image description model, which performs target recognition on images with the target domain image style based on the label classification of the source domain images.
It should be noted that, in order to highlight the innovative part of the present invention, a module which is not so closely related to solve the technical problem proposed by the present invention is not introduced in the present embodiment, but this does not indicate that no other module exists in the present embodiment.
In addition, it is clear to those skilled in the art that, for convenience and brevity of description, the specific working process of the system described above may refer to the corresponding process in the foregoing method embodiment, and is not described herein again. In the embodiments provided in the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical division, and in actual implementation, there may be other divisions, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
Modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, each functional module in the embodiments of the present invention may be integrated into one processing module, or each module may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a form of hardware or a form of a software functional unit.
The present embodiment also proposes a computer device comprising a processor and a memory, the processor being coupled with the memory, the memory storing program instructions which, when executed by the processor, implement the method described above. The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The memory may include a Random Access Memory (RAM) and may also include a non-volatile memory, such as at least one disk memory. The memory may be an internal memory of the Random Access Memory (RAM) type, and the processor and the memory may be integrated into one or more independent circuits or hardware, such as an Application Specific Integrated Circuit (ASIC). It should be noted that the computer program in the above-mentioned memory may be implemented in the form of software functional units and stored in a computer-readable storage medium when it is sold or used as a stand-alone product. Based on such understanding, the technical solution of the present invention, or the part thereof that contributes to the prior art, may in essence be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, an electronic device, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present invention.
The present embodiment also provides a computer-readable storage medium, which stores computer instructions for causing a computer to execute the method described above. The storage medium may be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, or a propagation medium. The storage medium may also include a semiconductor or solid-state memory, magnetic tape, a removable computer diskette, a Random Access Memory (RAM), a Read-Only Memory (ROM), a rigid magnetic disk and an optical disk. Optical disks may include compact disk read-only memory (CD-ROM), compact disk read/write (CD-RW), and DVD.
The invention discloses a training and description method, a system, a device and a storage medium for an image description model, in which stylization is embedded into contrastive learning and target recognition. The source domain images are stylized with the target domain images so that the source domain images take on the target domain image style; the domain difference is thereby eliminated while the content structure of the source domain images is kept. As a result, contrastive learning maximizes the similarity of features in the source domain and target domain images without interference from the domain difference, and during target recognition the label information of the source domain images is used to accurately locate and classify the recognition targets in the target domain images, yielding a better description result for the target-scene images.
In summary, the training and description method, system, device and storage medium of the image description model minimize the contrastive loss and the difference between the source domain and the target domain, effectively improve detection capability across different domains, and enable description of target domain images without labeling the manually collected target domain images. The technical solution of the invention effectively addresses the problems that the current data volume is large and label information cannot be annotated effectively in time, reduces the cost of manual labeling by technical means, completes the cross-domain target detection task, and improves the accuracy of target recognition.
The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Any person skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes made by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the present invention shall still be covered by the claims of the present invention.

Claims (10)

1. A training method of a cross-domain image description model is characterized by comprising the following steps:
acquiring a training set, wherein the training set comprises a source domain image with a label and a target domain image without the label;
acquiring a cross-domain image description model, wherein the cross-domain image description model comprises a style migration module, a contrastive learning module and a target detection module;
and jointly training the style migration module, the contrastive learning module and the target detection module based on the training set to obtain the trained cross-domain image description model, wherein the cross-domain image description model performs target recognition on the image with the target domain image style based on the label classification of the source domain image.
2. The training method of the cross-domain image description model according to claim 1, wherein jointly training the style migration module, the contrastive learning module and the target detection module based on the training set to obtain the trained cross-domain image description model comprises:
inputting the training set into the cross-domain image description model, obtaining the loss functions of the style migration module, the contrastive learning module and the target detection module, and calculating the total loss function of the cross-domain image description model;
inputting the training set batch by batch into the style migration module, the contrastive learning module and the target detection module for iterative training using mini-batch stochastic gradient descent, and minimizing the total loss function to obtain the trained cross-domain image description model.
3. The method for training the cross-domain image description model according to claim 2, wherein inputting the training set into the cross-domain image description model, obtaining the loss functions of the style migration module, the contrastive learning module and the target detection module, and calculating the total loss function of the cross-domain image description model comprises:
inputting the training set into the cross-domain image description model;
performing style migration on the source domain images through the style migration module, migrating the style of the source domain images in the training set into the target domain image style, and obtaining a first loss function L_neural of the style migration module by comparing the source domain images having the target domain image style with the source domain images and the target domain images;
obtaining similar features of the images in the training set through the self-supervised contrastive learning of the contrastive learning module, obtaining a second loss function L_NCE of the contrastive learning module from the compared features, and assigning pseudo labels to the target domain images having the same similar features based on the labels of the source domain images;
performing target recognition on the images of the training set through the target detection module to obtain target recognition results for the source domain images with the target domain image style and the target domain images, and determining a third loss function L_SSD of the target detection module according to the target recognition results;
obtaining a total loss function L_total of the cross-domain image description model, wherein the total loss function L_total is the sum of the first loss function L_neural, the second loss function L_NCE and the third loss function L_SSD.
4. The method for training the cross-domain image description model according to claim 3, wherein the first loss function L_neural of the style migration module is a linear superposition of the content loss function L_content and the style loss function L_style of the style migration; the style loss function L_style is the difference in style features between the target domain image and the source domain image having the target domain image style, and the content loss function L_content is the difference in content features between the source domain image and the source domain image having the target domain image style.
5. The method for training the cross-domain image description model according to claim 3, wherein obtaining the similar features of the images in the training set through the self-supervised contrastive learning of the contrastive learning module, obtaining the second loss function L_NCE of the contrastive learning module from the compared features, and assigning pseudo labels to the target domain images having the same similar features based on the labels of the source domain images comprises:
performing anchor-point augmentation twice on the images in the training set;
selecting the two anchor images of the same image in the training set as positive samples, and taking the other anchor images as negative samples;
comparing the positive samples with the negative samples, computing the second loss function L_NCE of the contrastive learning module, and obtaining the similar features of the images corresponding to the positive samples;
and, according to the similar features of the images in the training set, marking the target domain images having the same similar features with pseudo labels based on the labels of the source domain images.
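A compact, hypothetical sketch of the self-supervised contrastive loss of claim 5 in the common InfoNCE form follows; the temperature value and the use of feature normalization are assumptions not specified by the claim. The two inputs are the features of the two anchor-augmented views of each training image: the two views of the same image act as the positive pair, and all other views as negatives.

# Hypothetical InfoNCE-style L_NCE for claim 5 (SimCLR-like, one direction).
import torch
import torch.nn.functional as F

def info_nce_loss(view_a, view_b, temperature=0.07):
    """view_a, view_b: (N, D) features of the two anchor augmentations of N images."""
    a = F.normalize(view_a, dim=1)
    b = F.normalize(view_b, dim=1)
    logits = a @ b.t() / temperature                    # pairwise similarities between views
    targets = torch.arange(a.size(0), device=a.device)  # positives lie on the diagonal
    return F.cross_entropy(logits, targets)             # pull positives together, push negatives apart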
6. The method according to claim 3, wherein performing target recognition on the images of the training set through the target detection module to obtain the target recognition results for the source domain images with the target domain image style and the target domain images, and determining the third loss function L_SSD of the target detection module according to the target recognition results comprises the following steps:
extracting feature maps of the images in the training set through a convolutional neural network of the target detection module to obtain the feature maps of the training set images at different convolutional layers;
detecting the feature maps with convolution kernels to obtain localization information and feature information of the recognized targets in the images of the training set;
aligning the labels of the source domain images and the pseudo labels of the target domain images in the training set with the feature information of the images in the training set to obtain classification information of the images in the training set;
calculating the third loss function L_SSD of the target detection module based on the localization information and the classification information of the images in the training set, and obtaining the target recognition results of the images in the training set.
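A hypothetical sketch of the detection loss of claim 6 in the usual SSD form, combining a smooth-L1 localization term on matched default boxes with a cross-entropy classification term against the source labels or target pseudo labels; the box matching scheme and normalization are assumptions.

# Hypothetical SSD-style L_SSD for claim 6: localization + classification.
import torch
import torch.nn.functional as F

def ssd_loss(pred_boxes, pred_logits, gt_boxes, gt_labels, pos_mask):
    """pred_boxes: (N, A, 4), pred_logits: (N, A, C); pos_mask marks default boxes matched to a target."""
    loc_loss = F.smooth_l1_loss(pred_boxes[pos_mask], gt_boxes[pos_mask], reduction='sum')
    cls_loss = F.cross_entropy(pred_logits.flatten(0, 1), gt_labels.flatten(), reduction='sum')
    num_pos = pos_mask.sum().clamp(min=1)               # normalize by matched (positive) boxes
    return (loc_loss + cls_loss) / num_pos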
7. A cross-domain image description method, characterized in that it uses the cross-domain image description model obtained by the training method of the cross-domain image description model according to any one of claims 1 to 6, the cross-domain image description method comprising:
acquiring image data;
and inputting the image data into the cross-domain image description model to obtain a target recognition result for the image data.
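A minimal hypothetical usage sketch for the description method of claim 7: acquire image data, pass it through the trained cross-domain image description model, and read out the target recognition result. The preprocessing (resize, tensor conversion) is an assumption for illustration.

# Hypothetical inference with a trained cross-domain image description model (claim 7).
import torch
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([transforms.Resize((300, 300)), transforms.ToTensor()])

def describe_image(model, image_path):
    image = preprocess(Image.open(image_path).convert('RGB')).unsqueeze(0)  # acquire image data
    model.eval()
    with torch.no_grad():
        return model(image)  # target recognition result for the image data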
8. A training system for cross-domain image description models, comprising:
the data acquisition unit is used for acquiring a training set, wherein the training set comprises source domain images with labels and target domain images without labels;
the model calling unit is used for acquiring a cross-domain image description model, wherein the cross-domain image description model comprises a style migration module, a contrastive learning module and a target detection module;
and the joint training unit is used for performing joint training on the style migration module, the contrastive learning module and the target detection module based on the training set to obtain the trained cross-domain image description model, wherein the cross-domain image description model is used for performing target recognition on images with the target domain image style based on the label classification of the source domain images.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 7 are implemented when the computer program is executed by the processor.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN202210658065.XA 2022-06-10 2022-06-10 Method, system, device and storage medium for training and describing image description model Pending CN115147644A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210658065.XA CN115147644A (en) 2022-06-10 2022-06-10 Method, system, device and storage medium for training and describing image description model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210658065.XA CN115147644A (en) 2022-06-10 2022-06-10 Method, system, device and storage medium for training and describing image description model

Publications (1)

Publication Number Publication Date
CN115147644A true CN115147644A (en) 2022-10-04

Family

ID=83408207

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210658065.XA Pending CN115147644A (en) 2022-06-10 2022-06-10 Method, system, device and storage medium for training and describing image description model

Country Status (1)

Country Link
CN (1) CN115147644A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116152901A (en) * 2023-04-24 2023-05-23 广州趣丸网络科技有限公司 Training method of image generation model and stylized image generation method
CN117690164A (en) * 2024-01-30 2024-03-12 成都欣纳科技有限公司 Airport bird identification and driving method and system based on edge calculation
CN117690164B (en) * 2024-01-30 2024-04-30 成都欣纳科技有限公司 Airport bird identification and driving method and system based on edge calculation


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination