CN116206175A - Pre-training method, determining method, device and product of scene analysis model - Google Patents

Pre-training method, determining method, device and product of scene analysis model

Info

Publication number
CN116206175A
CN116206175A
Authority
CN
China
Prior art keywords
image
encoder
scene analysis
analysis model
modality
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310136678.1A
Other languages
Chinese (zh)
Inventor
郭胜
韩冰
杨剑阁
王利民
武港山
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Zhejiang eCommerce Bank Co Ltd
Original Assignee
Nanjing University
Zhejiang eCommerce Bank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University, Zhejiang eCommerce Bank Co Ltd filed Critical Nanjing University
Priority to CN202310136678.1A
Publication of CN116206175A
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V10/7753Incorporation of unlabelled data, e.g. multiple instance learning [MIL]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

Embodiments of the present disclosure provide a pre-training method for a scene analysis model, a method for determining a scene analysis model, a pre-training apparatus for a scene analysis model, an apparatus for determining a scene analysis model, a computer-readable storage medium, an electronic device, and a computer program product. The scene analysis model includes an encoder, and the pre-training method includes: performing contrast learning on the encoder through a plurality of sample groups, wherein each sample group comprises a first modality image and a second modality image of the same content; performing masking processing on each of the plurality of sample groups, and performing image reconstruction learning on the encoder and a decoder through the masked sample groups; and thereby optimizing the parameters of the encoder to realize the pre-training of the scene analysis model.

Description

Pre-training method, determining method, device and product of scene analysis model
Technical Field
Embodiments of the present disclosure relate to the field of scene recognition technologies, and in particular, to a method for pre-training a scene analysis model, a method for determining a scene analysis model, a device for pre-training a scene analysis model, a device for determining a scene analysis model, a computer-readable storage medium, an electronic device, and a computer program product.
Background
A typical scene analysis task is scene recognition. During scene recognition, images of multiple modalities may need to be recognized, for example when an RGB modality image and a depth image of the same scene are present at the same time.
In the related art, a scene recognition task over multi-modality images usually requires supervised pre-training of a model on an existing image data set; the pre-trained models are then loaded to initialize a backbone network for the RGB modality and a backbone network for the depth modality respectively, and finally the two backbone networks are combined through a fusion module or fusion design for the multi-modality scene recognition task. However, the related art requires multiple backbone networks, and this dual-stream fusion paradigm is inefficient when performing scene recognition tasks.
It should be noted that the information disclosed in the foregoing background section is only intended to enhance understanding of the background of the present specification, and thus may include information that does not constitute prior art already known to those of ordinary skill in the art.
Disclosure of Invention
The embodiment of the specification provides a pre-training method of a scene analysis model, a pre-training device of the scene analysis model, a computer readable storage medium, electronic equipment and a computer program product, which can improve scene recognition efficiency.
Additional features and advantages of embodiments of the present description will be set forth in the detailed description which follows, or in part will be apparent from the practice of the present description.
According to an aspect of embodiments of the present specification, there is provided a pre-training method of a scene analysis model, wherein the scene analysis model includes an encoder, the method comprising: performing contrast learning on the encoder through a plurality of sample groups, wherein each sample group comprises a first modality image and a second modality image of the same content; performing masking processing on each of the plurality of sample groups, and performing image reconstruction learning on the encoder and a decoder through the masked sample groups; and optimizing parameters of the encoder through the contrast learning and the image reconstruction learning to pre-train the scene analysis model.
According to another aspect of the embodiments of the present disclosure, there is provided a method for determining a scene analysis model, wherein the scene analysis model includes an encoder, the encoder being pre-trained by the method provided in the above aspect, the method comprising: determining full connection layer parameters according to a scene analysis task, and connecting the determined full connection layer to the head of the encoder after pre-training; and training the pre-trained encoder and the full-connection layer through the samples related to the scene analysis task to obtain the scene analysis model.
According to still another aspect of embodiments of the present specification, there is provided a pre-training apparatus of a scene analysis model, wherein the scene analysis model includes an encoder, the apparatus comprising: the device comprises a contrast learning module, an image reconstruction learning module and a pre-training module.
The contrast learning module is used for performing contrast learning on the encoder through a plurality of sample groups, wherein each sample group comprises a first modal image and a second modal image which are related to the same content; the image reconstruction learning module is used for carrying out mask processing on the plurality of sample groups respectively and carrying out image reconstruction learning on the encoder and the decoder through the sample groups after the mask processing; and the pre-training module is used for optimizing the parameters of the encoder through the contrast learning and the image reconstruction learning so as to pre-train the scene analysis model.
According to a further aspect of embodiments of the present specification, there is provided a determination apparatus of a scene analysis model, wherein the scene analysis model comprises an encoder, the encoder being pre-trained by the method provided in the first aspect, the apparatus comprising: a determining module and a training module.
The determining module is used for determining parameters of the full-connection layer according to a scene analysis task and connecting the determined full-connection layer to the head of the encoder after pre-training; and the training module is used for training the pre-trained encoder and the full-connection layer through the samples related to the scene analysis task to obtain the scene analysis model.
According to an aspect of the embodiments of the present specification, there is provided an electronic device including a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the pre-training method of the scene analysis model in the embodiments as described above when executing the computer program.
According to another aspect of the embodiments of the present specification, there is provided a computer-readable storage medium having instructions stored therein, which when run on a computer or a processor, cause the computer or the processor to perform the method of pre-training a scene analysis model as in the embodiments described above.
According to a further aspect of embodiments of the present specification, there is provided a computer program product comprising instructions which, when run on a computer or processor, cause the computer or processor to perform the method of pre-training a scene analysis model as in the embodiments described above.
The pre-training method of the scene analysis model, the determining method of the scene analysis model, the pre-training device of the scene analysis model, the determining device of the scene analysis model, the computer readable storage medium, the electronic device and the computer program product provided by the embodiment of the specification have the following technical effects:
The scene analysis model provided in the embodiments of the present disclosure is implemented with an encoder, and the pre-training process for the encoder includes: performing contrast learning on the encoder through a plurality of sample groups, wherein each sample group comprises a first modality image and a second modality image of the same content; performing masking processing on each of the plurality of sample groups, and performing image reconstruction learning on the encoder and a decoder through the masked sample groups; and thereby optimizing the parameters of the encoder to realize the pre-training of the scene analysis model. The scene recognition model pre-trained with this scheme recognizes multi-modality images efficiently, is also suitable for recognizing single-modality images, and improves the flexibility and robustness of the model.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the specification and together with the description, serve to explain the principles of the specification. It is obvious that the drawings in the following description are only some embodiments of the present specification, and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art.
Fig. 1 is a flow chart of a pre-training method of a scene analysis model according to an embodiment of the present disclosure.
Fig. 2 is a flow chart of a contrast learning pre-training method according to an embodiment of the present disclosure.
Fig. 3 is a schematic structural diagram of a model in a contrast learning process according to an embodiment of the present disclosure.
Fig. 4 is a flowchart of an image reconstruction learning pre-training method according to an embodiment of the present disclosure.
Fig. 5 is a schematic structural diagram of a model in an image reconstruction learning process according to an embodiment of the present disclosure.
Fig. 6 is a flowchart of an image reconstruction learning pre-training method according to an embodiment of the present disclosure.
Fig. 7 is a schematic structural diagram of a model in an image reconstruction learning process according to an embodiment of the present disclosure.
Fig. 8 is a flowchart of a method for determining a scene analysis model according to an embodiment of the present disclosure.
Fig. 9 is a schematic structural diagram of a scene analysis model according to an embodiment of the present disclosure.
Fig. 10 is a schematic structural diagram of a pre-training device of a scene analysis model according to an embodiment of the present disclosure.
Fig. 11 is a schematic structural diagram of a determining device for a scene analysis model according to an embodiment of the present disclosure.
Fig. 12 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present specification more apparent, the following detailed description of the embodiments of the present specification will be given with reference to the accompanying drawings.
When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the present specification. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present description as detailed in the accompanying claims.
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the present specification. One skilled in the relevant art will recognize, however, that the aspects of the specification may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known aspects have not been shown or described in detail to avoid obscuring aspects of the description.
Furthermore, the drawings are only schematic illustrations of the present specification and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software or in one or more hardware modules or integrated circuits or in different networks and/or processor devices and/or microcontroller devices.
The mainstream RGB-D scene recognition models provided by the related art generally require supervised pre-training of convolutional neural network (Convolutional Neural Network, CNN) weights on existing large-scale image datasets such as Places or ImageNet; the pre-trained CNN weights are then loaded to initialize two backbone networks for the RGB and depth modalities respectively, and finally the two backbone networks are combined through a fusion module or fusion design for the scene recognition task on multi-modality images. However, the scheme provided by the related art has the following disadvantages:
1. Using parameters obtained from large-scale pre-training in the image domain to initialize the depth network makes the depth network susceptible to image-model bias when extracting depth features, which may lead to suboptimal results; meanwhile, no large-scale depth dataset is currently available for pre-training the depth network.
2. Recognition of RGB modality images and recognition of depth modality images each require their own CNN backbone network. This dual-stream fusion approach is inefficient when recognizing images of multiple modalities, and it is also unsuited to handling modality absence (i.e., single-modality input) in practical application scenarios, so it is not flexible or robust enough for single-modality data input.
3. Large-scale pre-training requires a large amount of manually labeled data, which is enormously costly.
The pre-training method of the scene analysis model, the determining method of the scene analysis model, the pre-training device of the scene analysis model, the determining device of the scene analysis model, the computer readable storage medium, the electronic device and the computer program product provided by the embodiments of the present specification can solve the problems existing in the related art.
The model structure of the scene analysis model provided in the embodiments of the present specification employs an encoder, illustratively a Vision Transformer (ViT), as the backbone network. Further, with the pre-training scheme provided by the embodiments of the present description, a unified single model covering multiple modalities can be trained on a small-scale data set.
Fig. 1 is a schematic flow chart of a pre-training method of a scene analysis model according to an embodiment of the present disclosure. Referring to FIG. 1, the embodiment shown in this figure includes S110-S130.
In S110, contrast learning is performed on the encoder through a plurality of sample groups, wherein each sample group includes a first modality image and a second modality image of the same content.
Illustratively, as described above, the model structure of the scene analysis model provided in the embodiments of the present disclosure employs an encoder, for example ViT, as the backbone network. Illustratively, each sample group contains an image of the RGB modality and an image of the depth modality. Performing contrast learning on the encoder through a plurality of sample groups thus provides a cross-modal contrast learning framework that captures the alignment relationship between the two modalities; the alignment task drives the Transformer encoder to capture structural information between the modalities. After cross-modal contrast learning, the Transformer encoder can capture structural information that distinguishes aligned RGB-depth image pairs.
In S120, masking processing is performed on each of the plurality of sample groups, and image reconstruction learning is performed on the encoder and decoder by the masked sample groups.
To enhance the representation capability of the Transformer encoder, the embodiments of the present specification also pre-train the Transformer encoder in another round from a generative perspective, specifically by designing a multi-modal masked autoencoder (Masked Autoencoder, MAE) on the RGB-D data. Since the RGB modality and the depth modality are homogeneous, the multi-modal masked autoencoder can process them with a single shared encoder and decoder, while a projection (patch) encoding layer and a prediction head are set separately for each modality so that different modalities and reconstruction tasks can be distinguished. The parameter sharing scheme provided in this embodiment greatly reduces the number of pre-training parameters and also acts as a regularization technique that improves pre-training performance.
In S130, the parameters of the encoder are optimized by contrast learning and image reconstruction learning to pre-train the scene analysis model.
Note that, the execution order of the contrast learning pre-training corresponding to S110 and the image reconstruction learning pre-training corresponding to S120 may be any of the following modes:
Mode one
The contrast learning pre-training and the image reconstruction learning pre-training are executed jointly. Here, joint execution means that contrast learning and mask reconstruction are optimized simultaneously rather than in two stages, so the pre-training of the scene analysis model is a form of multi-task (self-supervised) learning.
Mode two
The contrast learning provided in S110 is performed first to obtain an initially parameter-optimized encoder. Then, the image reconstruction learning provided in S120 is performed on the initially parameter-optimized encoder to optimize the parameters of the encoder again.
Mode three
The image reconstruction learning provided in S120 is performed first to obtain an initially parameter-optimized encoder. Then, the contrast learning provided in S110 is performed on the initially parameter-optimized encoder to optimize the parameters of the encoder again.
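For illustration only, the three execution orders can be sketched as a PyTorch-style training loop as below. The arguments `contrast_loss_fn` and `recon_loss_fn`, the equal split of epochs between stages, and the optimizer hyperparameters are assumptions of the sketch and are not specified by the embodiments.

```python
import torch

def pretrain(encoder, decoder, loader, contrast_loss_fn, recon_loss_fn,
             mode="contrast_then_reconstruct", epochs=100, lr=1.5e-4):
    """Sketch of the three execution orders of the two pre-training objectives.
    contrast_loss_fn(encoder, rgb, depth) and recon_loss_fn(encoder, decoder, rgb, depth)
    are assumed to return the contrast loss (S110) and the reconstruction loss (S120)."""
    params = list(encoder.parameters()) + list(decoder.parameters())
    optimizer = torch.optim.AdamW(params, lr=lr, weight_decay=0.05)  # hyperparameters are assumptions

    for epoch in range(epochs):
        first_stage = epoch < epochs // 2                 # stage boundary is an assumption
        for rgb, depth in loader:                         # each batch element is one sample group
            if mode == "joint":                           # Mode one: joint multi-task self-supervision
                loss = (contrast_loss_fn(encoder, rgb, depth)
                        + recon_loss_fn(encoder, decoder, rgb, depth))
            elif mode == "contrast_then_reconstruct":     # Mode two: contrast first, then reconstruction
                loss = (contrast_loss_fn(encoder, rgb, depth) if first_stage
                        else recon_loss_fn(encoder, decoder, rgb, depth))
            else:                                         # Mode three: reconstruction first, then contrast
                loss = (recon_loss_fn(encoder, decoder, rgb, depth) if first_stage
                        else contrast_loss_fn(encoder, rgb, depth))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```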
The scene analysis model pre-training scheme of fig. 1 provides a CoMAE framework consisting of inter-modality contrast pre-training and multi-modal mask reconstruction pre-training. The CoMAE framework adopts a hybrid self-supervised pre-training strategy that gradually learns the parameters of the encoder in a curriculum-learning manner, and can address representation learning from a limited number of RGB-D samples. The scene recognition model pre-trained with this scheme recognizes multi-modality images efficiently, is also suitable for recognizing single-modality images, and improves the flexibility and robustness of the model.
In an exemplary embodiment, fig. 2 is a schematic flow chart of a contrast learning pre-training method provided in the embodiment of the present disclosure, which may be used as a specific implementation manner of S110. In the embodiment of the present specification, a target sample group (i.e., any one of a plurality of sample groups) is described as an example. Referring to FIG. 2, the embodiment shown includes S1102-S11010.
In S1102, for a first modality image in a target sample group, image blocking processing is performed, and feature extraction processing is performed on an obtained image block, so as to obtain a first embedded feature. And in S1104, performing image blocking processing on the second modality image in the target sample group and performing feature extraction processing on the obtained image block to obtain a second embedded feature.
In this embodiment, cross-modal patch-level contrast pre-training is performed. Separate patch encoding layers are set for the different modalities, and the images of the two modalities in the target sample group are each input into the patch encoding layer corresponding to their modality. Referring to fig. 3, the first modality image r (RGB modality) in the target sample group is input into the patch encoding layer 32, and the patch encoding layer 32 performs image blocking processing on the first modality image to obtain a plurality of image blocks of the first modality image r. For example, with the patch size set to 16×16, a 224×224×3 first modality image yields 196 image blocks of 16×16×3. Further, the patch encoding layer 32 performs feature extraction processing on the obtained image blocks to obtain the first embedded feature corresponding to the first modality image r.
Similarly, the second modality image d (depth modality) in the target sample group is input into the patch encoding layer 34, and the patch encoding layer 34 performs image blocking processing on the second modality image to obtain a plurality of image blocks of the second modality image d. The patch size in the two patch encoding layers is set to be the same, so a 224×224×3 second modality image likewise yields 196 image blocks of 16×16×3. Further, the patch encoding layer 34 performs feature extraction processing on the obtained image blocks to obtain the second embedded feature corresponding to the second modality image.
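A minimal sketch of a modality-specific patch encoding layer is given below, assuming a ViT-style strided-convolution patch embedding with an assumed embedding dimension of 768; only the 16×16 patch size and the 224×224×3 inputs come from the description above.

```python
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Performs image blocking (16x16 patches) and feature extraction in one strided convolution."""
    def __init__(self, in_chans=3, embed_dim=768, patch_size=16):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, 224, 224)
        x = self.proj(x)                       # (B, embed_dim, 14, 14)
        return x.flatten(2).transpose(1, 2)    # (B, 196, embed_dim): one embedded feature per image block

# Separate (non-shared) patch encoding layers for the two modalities, as in the text above.
patch_embed_rgb = PatchEmbed()     # patch encoding layer 32 (RGB modality)
patch_embed_depth = PatchEmbed()   # patch encoding layer 34 (depth modality, also a 3-channel image here)
```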
In S1106, the first embedded feature and the second embedded feature are input to an encoder for encoding processing, so as to obtain a first encoded feature corresponding to the first mode image and a second encoded feature corresponding to the second mode image.
Illustratively, referring to FIG. 3, the first embedded feature and the second embedded feature are input into the shared encoder 36 for encoding processing, which generates token representations for each of the two modalities, i.e., a first encoded feature corresponding to the first modality image (e.g., denoted $\{f(r_i)\}_{i=1}^{N}$) and a second encoded feature corresponding to the second modality image (e.g., denoted $\{f(d_i)\}_{i=1}^{N}$), where $i$ is the token index and $N$ is the total number of tokens in the RGB or depth map (e.g., $N = 196$ for 224×224 images with a patch size of 16×16).
In the case of ViT, the first embedded feature and the second embedded feature are input into the encoder without position encoding. The reason is that, in the contrast learning pre-training phase, the pre-training objective is for the encoder to pull together (increase the feature similarity of) RGB tokens and depth tokens whose positions are aligned (positive samples), and to push apart (decrease the feature similarity of) RGB tokens and depth tokens whose positions are not aligned (negative samples); the position information can therefore be regarded as the supervisory signal of this self-supervised upstream task. Here the RGB tokens and depth tokens are outputs of the encoder. If position encoding were added to the embedded features input into the encoder, this would amount to giving the supervisory signal directly in the input: the contrast loss would quickly converge towards 0, but the model would not learn the ability to extract data representations. Thus, in the case of ViT, the first embedded feature and the second embedded feature input into the encoder are features without position encoding.
In addition, it should be noted that, unlike conventional contrast learning methods, the queries and dictionary keys in the contrast learning provided in the embodiment of the present specification are cross-modal tokens within paired RGB-depth images, not complete instances.
For example, to obtain a better pre-training effect, instead of introducing a class token as in ViT, the embodiments of the present disclosure apply global average pooling to the first embedded feature and the second embedded feature, respectively. The globally average-pooled features of the two modalities are then input into the encoder.
With continued reference to fig. 2, in S1108, a noise contrast estimate for each image block in the target set of samples is determined based on the first encoding feature and the second encoding feature, wherein the noise contrast estimate for each image block in the target set of samples is used to determine the first loss function. And in S11010, optimizing parameters of the encoder by the first loss function to achieve contrast learning of the encoder.
Specifically, the first loss function described above in the embodiment of the present specification employs the InfoNCE loss. This loss maximizes the similarity of each RGB token to the depth token aligned at the same position, while minimizing its similarity to all other tokens in the corresponding depth map. Specifically, the first loss function $Loss_{cpc}$ representing the cross-modal contrast loss can be expressed as:
$$Loss_{cpc} = \frac{1}{2N}\sum_{i=1}^{N}\big[\,l_{rgb}(i) + l_{depth}(i)\,\big]$$
wherein $l_{rgb}(i)$ and $l_{depth}(i)$ are expressed as:
$$l_{rgb}(i) = -\log\frac{\exp\big(s(f(r_i), f(d_i))/\tau\big)}{\sum_{k=1}^{N}\exp\big(s(f(r_i), f(d_k))/\tau\big)}$$
$$l_{depth}(i) = -\log\frac{\exp\big(s(f(d_i), f(r_i))/\tau\big)}{\sum_{k=1}^{N}\exp\big(s(f(d_i), f(r_k))/\tau\big)}$$
where $s(\cdot,\cdot)$ denotes the similarity computation between token-level features, the fixed temperature $\tau$ is set, for example, to 0.07, $f(r_i)$ denotes the encoder output feature of the $i$-th image block in the RGB modality image, $f(d_i)$ denotes the encoder output feature of the $i$-th image block in the depth modality image, and $f(d_k)$ (respectively $f(r_k)$) denotes the encoder output features of the blocks other than the $i$-th block in the depth (respectively RGB) modality image.
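A minimal sketch of the patch-level cross-modal InfoNCE loss above is shown below, assuming $s(\cdot,\cdot)$ is implemented as cosine similarity over L2-normalized token features (an assumption; the embodiments only specify a token-level similarity with temperature τ = 0.07):

```python
import torch
import torch.nn.functional as F

def cross_modal_patch_infonce(f_r, f_d, tau=0.07):
    """f_r, f_d: (B, N, D) encoder output tokens for the RGB and depth images of each sample group.
    Tokens at the same patch index form positive pairs; all other tokens of the other modality
    within the same image pair act as negatives."""
    f_r = F.normalize(f_r, dim=-1)
    f_d = F.normalize(f_d, dim=-1)
    sim = torch.einsum("bnd,bmd->bnm", f_r, f_d) / tau            # s(f(r_i), f(d_k)) / tau
    targets = torch.arange(sim.size(1), device=sim.device).expand(sim.size(0), -1)
    l_rgb = F.cross_entropy(sim.reshape(-1, sim.size(-1)), targets.reshape(-1))                    # l_rgb(i)
    l_depth = F.cross_entropy(sim.transpose(1, 2).reshape(-1, sim.size(-1)), targets.reshape(-1))  # l_depth(i)
    return 0.5 * (l_rgb + l_depth)                                # averaged over tokens and both modalities
```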
It can be understood that, after the contrast learning, for the i-th image block (i.e., the target image block) of the images in the target sample group, the similarity between the i-th image block of the first modality image and the i-th image block of the second modality image in the same sample group is greater than a first preset value, while the similarity between the i-th image block of the first modality image and the image blocks other than the i-th image block of the second modality image in the same sample group is less than a second preset value; the first preset value (for example, 0.95) is greater than the second preset value (for example, 0.03).
In the embodiment shown in FIG. 2, pre-training by patch-level contrast learning explicitly uses the inter-modality alignment relationship as the self-supervision signal, with the expectation that the model can implicitly learn cross-modal alignment from a paired small-scale RGB-D dataset. With the above optimization objective, the encoder network is encouraged to better capture distinct local context representations of the RGB and depth modalities, and it can capture structural information that distinguishes locally aligned RGB and depth patch pairs.
In an exemplary embodiment, fig. 4 is a schematic flow chart of the image reconstruction learning pre-training method provided in the embodiment of the present disclosure, which may be used as a specific implementation manner of S120. The present embodiment will be described with reference to a target sample set (i.e., any one of a plurality of sample sets). Referring to FIG. 4, the illustrated embodiment includes S1202-S12010.
In S1202, performing image blocking processing, feature extraction processing and position encoding on the first mode image in the target sample group to obtain a third embedded feature; and in S1204, performing image blocking processing, feature extraction processing, and position encoding on the second modality image in the target sample group, to obtain a fourth embedded feature.
For example, referring to the model shown in fig. 5, in a scheme in which contrast learning pre-training is performed before image reconstruction learning pre-training is performed, the parameters of the patch coding layer for performing image reconstruction learning and the parameters of the encoder may be optimized parameters after the contrast learning.
Referring to fig. 5, the first modality image r and the second modality image d in the sample group are input into the patch encoding layers 52 and 54, respectively. The patch encoding layer 52 performs image blocking processing and feature extraction processing on the first modality image r; similarly, the patch encoding layer 54 performs image blocking processing and feature extraction processing on the second modality image d. Fixed 2D sine-cosine position encodings, shared between the two modalities, are then added to obtain the third embedded feature and the fourth embedded feature.
In S1206, the third embedded feature and the fourth embedded feature with respect to the target sample group are spliced into a total embedded feature.
Illustratively, referring to fig. 5, the third embedded feature and the fourth embedded feature are spliced by the concat layer to obtain the total embedded feature. Illustratively, after splicing, token indexes 0-50 correspond to features of the RGB modality and token indexes 51-101 correspond to features of the depth modality. It should be noted that the encoder and the decoder both maintain the token index order during this process, so that the modality to which a token belongs can be distinguished according to its token index.
In S1208, masking is performed on the third embedded feature in the total embedded features, and masking is performed on the fourth embedded feature in the total embedded features, so as to obtain total masked features.
Illustratively, the RGB tokens (third embedded feature) and the depth tokens (fourth embedded feature) are randomly masked at a 75% masking rate for each of the two modalities.
It is to be understood that the masking process may be performed on the third embedded feature and the fourth embedded feature respectively and the splicing process of S1206 performed afterwards; alternatively, the splicing process of S1206 may be performed first, and masking then applied separately to the different modalities within the total embedded feature according to the token indexes. The embodiments of the present specification are not limited in this respect.
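As an illustrative sketch (not the embodiments' implementation) of the per-modality 75% random masking followed by splicing, assuming MAE-style index shuffling:

```python
import torch

def mask_and_concat(tokens_rgb, tokens_depth, mask_ratio=0.75):
    """tokens_rgb, tokens_depth: (B, N, D) position-encoded embedded features of the two modalities.
    Returns the visible (unmasked) tokens of both modalities spliced along the token dimension,
    plus the kept/shuffled indices needed to restore token order and locate the masked positions."""
    def random_mask(tokens):
        B, N, D = tokens.shape
        num_keep = int(N * (1.0 - mask_ratio))
        noise = torch.rand(B, N, device=tokens.device)
        ids_shuffle = noise.argsort(dim=1)           # independent random permutation per sample
        ids_keep = ids_shuffle[:, :num_keep]         # token indexes of the visible portion
        visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
        return visible, ids_keep, ids_shuffle

    vis_r, keep_r, shuffle_r = random_mask(tokens_rgb)
    vis_d, keep_d, shuffle_d = random_mask(tokens_depth)
    # Splice along the token dimension; keeping track of token indexes lets the encoder and
    # decoder distinguish which tokens belong to which modality.
    total_visible = torch.cat([vis_r, vis_d], dim=1)
    return total_visible, (keep_r, shuffle_r), (keep_d, shuffle_d)
```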
In S12010, the encoder and decoder are subjected to image reconstruction learning by the total mask features respectively corresponding to the plurality of samples.
Fig. 6 is a schematic flow chart of the image reconstruction learning pre-training method according to the embodiment of the present disclosure, which may be used as a specific implementation of S12010. Referring to fig. 6, the illustrated embodiment includes S610-S650.
In S610, the unmasked portion of the total mask features is input to an encoder, which processes to obtain corresponding potential features for the target sample set. In S620, the mask portion of the latent feature and the total mask feature is input to the decoder to be decoded by the decoder.
It should be noted that, to obtain a better pre-training effect, instead of introducing a class token as in ViT, the embodiments of the present disclosure apply global average pooling to the unmasked portion of the total mask feature. Referring to fig. 7, the globally average-pooled feature is input into the encoder 510 described above, which processes it to obtain the potential features corresponding to the target sample group. Further, the potential features and the mask portion of the total mask feature are input into the decoder 512 for decoding processing. Illustratively, the decoder 512 may be a lightweight decoder.
Illustratively, the potential features, together with a learnable mask token to which 2D position encoding is added, are taken as the inputs of the lightweight decoder 512. More specifically, to enhance the pre-training effect of the model, the mask token input into the decoder is shared between the two modalities. Specifically, the mask portions in the total mask feature include a first mask portion for the first modality image and a second mask portion for the second modality image, and in this embodiment a shared mask portion for the same image blocks is obtained from the first mask portion and the second mask portion. For example, if the image blocks patch 1, patch 6 and patch 10 are masked in both modalities, these blocks form the shared mask portion.
With continued reference to fig. 6, in S630, a first partial feature regarding the first modality image is determined among the output features of the decoder, and the image is reconstructed using the first partial feature, resulting in pixel values of the first modality reconstructed image; and in S640, determining a second partial feature related to the second mode image from the output features of the decoder, and reconstructing the image using the second partial feature to obtain a pixel value of the second mode reconstructed image.
Illustratively, the output features of the decoder comprise a first partial feature for the RGB modality and a second partial feature for the depth modality, which can be distinguished according to the token indexes. Referring to fig. 7, the first partial feature is input into the prediction head 72 to reconstruct an image using the first partial feature, giving the pixel values of the first modality reconstructed image. Similarly, the second partial feature is input into the prediction head 74 to reconstruct an image using the second partial feature, giving the pixel values of the second modality reconstructed image.
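A rough sketch of the shared lightweight decoder with a shared learnable mask token and modality-specific prediction heads is given below. The decoder width and depth, the use of plain self-attention blocks, and the equal split of the decoder output by token index are assumptions; restoring the original token order and adding the 2D position encodings are omitted for brevity.

```python
import torch
import torch.nn as nn

class SharedMaskedDecoder(nn.Module):
    """Single decoder shared by the RGB and depth modalities; the learnable mask token is also
    shared, and only the two prediction heads (cf. 72 and 74) are modality specific."""
    def __init__(self, dim=512, depth=4, num_heads=8, patch_pixels=16 * 16 * 3):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))        # shared between modalities
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)  # self-attention-only blocks
        self.head_rgb = nn.Linear(dim, patch_pixels)                  # predicts RGB patch pixel values
        self.head_depth = nn.Linear(dim, patch_pixels)                # predicts depth patch pixel values

    def forward(self, latent, num_masked):
        # latent: (B, L_vis, dim) potential features of the visible tokens of both modalities.
        # num_masked: total number of masked positions across the two modalities.
        B = latent.size(0)
        mask_tokens = self.mask_token.expand(B, num_masked, -1)
        x = self.blocks(torch.cat([latent, mask_tokens], dim=1))
        half = x.size(1) // 2                    # assumes equal token counts per modality
        f_r, f_d = x[:, :half], x[:, half:]      # first partial feature F_r, second partial feature F_d
        return self.head_rgb(f_r), self.head_depth(f_d)
```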
In S650, determining a second loss function from the pixel values of the first modality reconstructed image, the pixel values of the first modality image, the pixel values of the second modality reconstructed image, and the pixel values of the second modality image; and, in S660, optimizing parameters of the encoder by the second loss function to achieve image reconstruction learning of the encoder.
Illustratively, the image reconstruction target above is the patch-normalized pixels of each image block. That is, although at a macroscopic level the two modality images are input into the model, each image is divided into a plurality of image blocks, and the loss is actually computed per patch, normalized by the mean and standard deviation of the original pixels of that patch. For a target image block (such as the i-th image block) of the images in the target sample group, a first set of pixel values of the target image block in the first modality reconstructed image is acquired, a second set of pixel values of the target image block in the first modality image is acquired, a third set of pixel values of the target image block in the second modality reconstructed image is acquired, and a fourth set of pixel values of the target image block in the second modality image is acquired; the second loss function is then determined from the first, second, third and fourth sets of pixel values. Specifically, the second loss function $Loss_{mm\text{-}mae}$, representing the multi-modal mask reconstruction loss, can be expressed as:
$$Loss_{mm\text{-}mae} = MSE\big(P_r,\ I_r\big) + MSE\big(P_d,\ I_d\big)$$
where $MSE(\cdot)$ represents the mean square error function, $I_r$ and $I_d$ denote the (patch-normalized) pixel values of the first modality image and of the second modality image, and $P_r$ and $P_d$, the pixel values of the first modality reconstructed image and of the second modality reconstructed image, can be expressed by the following formulas, respectively:
$$P_r = PredHead_r(F_r)$$
$$P_d = PredHead_d(F_d)$$
wherein $F_r$ and $F_d$ denote the first partial feature and the second partial feature of the output features $F$ of the decoder, respectively, which can be specifically determined by the following formula:
$$F = Decoder\big(Encoder(T_{vis\_r},\ T_{vis\_d}),\ T_{mask}\big)$$
wherein $T_{vis\_r}$ is the unmasked portion of the total mask feature for the RGB modality, $T_{vis\_d}$ is the unmasked portion of the total mask feature for the depth modality, and $T_{mask}$ is the mask portion of the total mask feature. Illustratively, $T_{mask}$ is the shared mask portion for the same image blocks in the embodiment above.
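An illustrative sketch of the per-patch-normalized mean-square-error term, computed only over masked image blocks; the variable names and the epsilon for numerical stability are assumptions:

```python
import torch

def masked_patch_mse(pred, target_patches, mask, eps=1e-6):
    """pred: (B, N, P) pixel values predicted by a prediction head for each image block.
    target_patches: (B, N, P) original pixel values of each image block, flattened.
    mask: (B, N), 1 for masked image blocks (the loss is computed only on these)."""
    mean = target_patches.mean(dim=-1, keepdim=True)
    var = target_patches.var(dim=-1, keepdim=True)
    target = (target_patches - mean) / (var + eps).sqrt()     # normalize by per-patch mean and std
    loss = (pred - target).pow(2).mean(dim=-1)                # MSE per image block
    return (loss * mask).sum() / mask.sum().clamp(min=1)      # average over masked blocks only

# Total multi-modal reconstruction loss, as in the formula above:
# loss_mm_mae = masked_patch_mse(pred_r, patches_r, mask_r) + masked_patch_mse(pred_d, patches_d, mask_d)
```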
In the embodiment of the present specification, by reconstructing all masked RGB tokens and depth tokens simultaneously, the multi-modal masked autoencoder forces the encoder to capture the complementary features between the modalities in a unified joint feature space while preserving its ability to capture modality-specific features.
In the embodiments of figs. 4-7, to further enhance the representation capability of the Transformer encoder, another round of pre-training of the Transformer encoder is performed from a generative perspective; specifically, a multi-modal masked autoencoder whose parameters are shared between the two modalities is designed on the RGB-D data, and the masking and reconstruction tasks encourage the Transformer encoder to extract finer-grained features in order to regress the exact pixel values of the missing tokens. Meanwhile, because RGB and depth are homogeneous modalities, the multi-modal masked autoencoder can easily process both with a single shared encoder and decoder; only modality-specific patch projection layers and prediction heads are needed to distinguish the different modalities and reconstruction tasks. The parameter sharing scheme greatly reduces the number of pre-training parameters and can serve as a regularization technique that improves pre-training performance.
The CoMAE provided in this embodiment follows the curriculum-learning principle of the hybrid training strategy: it first learns from a simple task and then performs the more difficult task, and experimental results show that this improves the generalization ability of the pre-trained model on downstream recognition tasks. Owing to the single-model architecture design, the CoMAE pre-trained encoder can be flexibly deployed in multi-modality or single-modality input scenarios, so that modality-missing scenarios can be handled effectively in practice, improving the robustness, flexibility and application range of the model.
It should be noted that the embodiments of the present specification demonstrate for the first time that a multi-modal Transformer can be successfully trained using a limited amount of RGB-D data, without using a large amount of additional labeled data or pre-trained models. Compared with methods that rely on large-scale supervised pre-training, the model pre-trained according to the embodiments of the present description achieves very competitive recognition results on two challenging RGB-D scene recognition datasets, NYUDv2 and SUN RGB-D.
In an exemplary embodiment, fig. 8 is a flowchart of a method for determining a scene analysis model according to an embodiment of the present disclosure. Referring to fig. 8, the illustrated embodiment includes S810 and S820.
In S810, full connection layer parameters are determined according to a scene analysis task, and the determined full connection layer is connected to a head of an encoder after pre-training.
Illustratively, the scene analysis task includes a scene recognition task or a scene segmentation task; of course, it may also be another task in the scene analysis process, which is not limited by the embodiments of the present specification.
The encoder is obtained by pre-training the scene analysis model with the pre-training method provided in the above embodiments. Referring to the scene analysis model shown in fig. 9, it includes the pre-trained encoder 90 and a fully connected layer 92 connected to the head of the encoder. The fully connected layer 92 serves as a classifier whose parameters can be determined from the scene analysis task. For example, if the scene analysis task classifies into 19 categories, the output dimension of the classifier is 19.
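A minimal sketch of attaching the task-specific fully connected layer to the head of the pre-trained encoder is given below; the embedding dimension and the global-average-pooling readout are assumptions of the sketch.

```python
import torch.nn as nn

class SceneAnalysisModel(nn.Module):
    """Pre-trained encoder (cf. 90) with a fully connected classification head (cf. 92)."""
    def __init__(self, pretrained_encoder, embed_dim=768, num_classes=19):
        super().__init__()
        self.encoder = pretrained_encoder
        self.head = nn.Linear(embed_dim, num_classes)   # e.g. output dimension 19 for a 19-class task

    def forward(self, tokens):
        features = self.encoder(tokens).mean(dim=1)     # assumes (B, N, D) token output; pool over tokens
        return self.head(features)
```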
In S820, the pre-trained encoder and the fully connected layer are trained with the samples related to the scene analysis task to obtain the scene analysis model.
Because the encoder shown in fig. 9 has been pre-trained by CoMAE, it does not need to be trained from scratch for the downstream scene recognition task; the pre-trained encoder and the fully connected layer are only fine-tuned on the samples related to the scene analysis task. During fine-tuning, either modality is randomly dropped with a probability of 0.5 as data augmentation, which yields better downstream recognition results.
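A fine-tuning step with the random modality dropout described above could be sketched as follows; the reading that one randomly chosen modality is dropped with probability 0.5, and the way the model consumes a variable set of modalities, are assumptions of the sketch.

```python
import random
import torch.nn as nn

def finetune_step(model, rgb, depth, labels, optimizer, criterion=nn.CrossEntropyLoss()):
    """One fine-tuning step; `model` is assumed to accept a list of modality images
    (single- or multi-modal) and return class logits."""
    if random.random() < 0.5:
        inputs = [rgb] if random.random() < 0.5 else [depth]   # randomly drop one modality
    else:
        inputs = [rgb, depth]                                  # keep both modalities
    logits = model(inputs)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```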
The model pre-trained in the embodiments of the present disclosure can be transferred to the downstream scene recognition task with excellent results. Specifically, with the CoMAE model pre-trained only on the SUN RGB-D training set, the mean class accuracy on the SUN RGB-D test set reaches 55.2% and the mean sample accuracy reaches 64.3%. Loading the model trained on the SUN RGB-D training set and further training it on the NYUDv2 training set yields a mean class accuracy of 70.7% and a mean sample accuracy of 76.3% on the NYUDv2 test set.
It should be noted that the above-described figures are only schematic illustrations of processes involved in the method according to the exemplary embodiments of the present specification, and are not intended to be limiting. It will be readily appreciated that the processes shown in the above figures do not indicate or limit the temporal order of these processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, for example, among a plurality of modules.
The following are device embodiments of the present specification that may be used to perform method embodiments of the present specification. For details not disclosed in the device embodiments of the present specification, please refer to the method embodiments of the present specification.
Fig. 10 is a schematic structural diagram of a pre-training device of a scene analysis model according to an embodiment of the present disclosure. Referring to fig. 10, the pre-training device of the scene analysis model shown in the figure may be implemented as all or a part of the electronic device by software, hardware or a combination of the two, and may be integrated on a server as an independent module, or may be integrated in the electronic device as an independent module.
In the embodiment of the present disclosure, the scene analysis model includes an encoder, and the pre-training device 1000 of the scene analysis model includes: a contrast learning module 1010, an image reconstruction learning module 1020, and a pre-training module 1030.
Wherein the contrast learning module 1010 is configured to perform contrast learning on the encoder by using a plurality of sample groups, where each of the sample groups includes a first modality image and a second modality image related to the same content; the image reconstruction learning module 1020 is configured to mask the plurality of sample groups, and perform image reconstruction learning on the encoder and the decoder through the mask-processed sample groups; and the pre-training module 1030 is configured to optimize parameters of the encoder through the contrast learning and the image reconstruction learning to pre-train the scene analysis model.
In an exemplary embodiment, based on the foregoing scheme, the comparison learning module 1010 is specifically configured to: performing image blocking processing on a first mode image in a target sample group and performing feature extraction processing on the obtained image block to obtain a first embedded feature, wherein the target sample group is any one of the plurality of sample groups; performing image blocking processing on a second mode image in the target sample group and performing feature extraction processing on the obtained image block to obtain a second embedded feature; inputting the first embedded feature and the second embedded feature into the encoder for encoding processing to obtain a first encoding feature corresponding to the first mode image and a second encoding feature corresponding to the second mode image; determining a noise contrast estimate for each image block in the target set of samples based on the first encoding feature and the second encoding feature, wherein the noise contrast estimate for each image block in the target set of samples is used to determine a first loss function; and optimizing parameters of the encoder through the first loss function to realize contrast learning of the encoder.
In an exemplary embodiment, based on the foregoing, the encoder is a visual transducer model, wherein the first embedded feature and the second embedded feature input to the encoder are not position encoded.
In an exemplary embodiment, based on the foregoing aspect, after the comparison learning, for a target image block of an image in the target sample group, a similarity between the target image block in a first mode image in the target sample group and the target image block in a second mode image in the target sample group is greater than a first preset value, and a similarity between the target image block in the first mode image in the target sample group and other image blocks in the second mode image in the target sample group is less than a second preset value; the first preset value is greater than the second preset value, and the target sample group is any one of the plurality of sample groups.
In an exemplary embodiment, based on the foregoing scheme, the image reconstruction learning module 1020 is specifically configured to: performing image blocking processing, feature extraction processing and position coding on the first mode image in the target sample group to obtain a third embedded feature; performing image blocking processing, feature extraction processing and position coding on the second mode image in the target sample group to obtain a fourth embedded feature; wherein the third embedded feature and the fourth embedded feature are used to perform the masking process.
In an exemplary embodiment, based on the foregoing scheme, the image reconstruction learning module 1020 is further specifically configured to: splicing the third embedded feature and the fourth embedded feature of the target sample group into a total embedded feature; and masking the third embedded feature in the total embedded features, and masking the fourth embedded feature in the total embedded features to obtain total masking features.
In an exemplary embodiment, based on the foregoing scheme, the image reconstruction learning module 1020 is further specifically configured to: inputting the unmasked part in the total mask characteristics into the encoder, and processing the unmasked part by the encoder to obtain potential characteristics corresponding to the target sample group; inputting a mask portion of the latent feature and the total mask feature into the decoder for decoding by the decoder; determining a first partial characteristic about the first modal image from the output characteristics of the decoder, and reconstructing an image by using the first partial characteristic to obtain a pixel value of the first modal reconstructed image; determining a second partial characteristic about the second mode image from the output characteristics of the decoder, and reconstructing an image by using the second partial characteristic to obtain a pixel value of the second mode reconstructed image; determining a second loss function by the pixel values of the first modality reconstructed image, the pixel values of the first modality image, the pixel values of the second modality reconstructed image, and the pixel values of the second modality image; and optimizing parameters of the encoder through the second loss function to realize image reconstruction learning of the encoder.
In an exemplary embodiment, based on the foregoing scheme, the image reconstruction learning module 1020 is further specifically configured to: perform global average pooling processing on the unmasked portion of the total mask features, and then input it into the encoder.
In an exemplary embodiment, based on the foregoing scheme, the masking part in the total masking feature includes: a first mask portion for the first modality image and a second mask portion for the second modality image; the image reconstruction learning module 1020 is further specifically configured to: acquiring a shared mask portion with respect to the same image block in the first mask portion and the second mask portion; and inputting the latent feature and the shared mask feature into the decoder.
In an exemplary embodiment, based on the foregoing scheme, the image reconstruction learning module 1020 is further specifically configured to: for a target image block of an image in the target sample set, acquiring a first set of pixel values of the target image block in the first modality reconstructed image, acquiring a second set of pixel values of the target image block in the first modality image, acquiring a third set of pixel values of the target image block in the second modality reconstructed image, and acquiring a fourth set of pixel values of the target image block in the second modality image; and determining a second loss function from the first set of pixel values, the second set of pixel values, the third set of pixel values, and the fourth set of pixel values.
In an exemplary embodiment, based on the foregoing scheme, the pre-training module 1030 is specifically configured to: performing the contrast learning and the image reconstruction learning in combination to optimize parameters of the encoder; or, executing the contrast learning to obtain an encoder after preliminary parameter optimization, and executing the image reconstruction learning on the encoder after preliminary parameter optimization to optimize the parameters of the encoder again; or, executing the image reconstruction learning to obtain an encoder after preliminary parameter optimization, and executing the contrast learning on the encoder after preliminary parameter optimization to optimize the parameters of the encoder again.
It should be noted that, when the pre-training device for a scene analysis model provided in the above embodiment performs the pre-training method for a scene analysis model, only the division of the above functional modules is used for illustration, and in practical application, the above functional allocation may be completed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to complete all or part of the functions described above.
In addition, the pre-training device of the scene analysis model provided in the above embodiment and the pre-training method embodiment of the scene analysis model belong to the same concept, so for details not disclosed in the embodiments of the device in the present specification, please refer to the embodiment of the pre-training method of the scene analysis model in the present specification, and the details are not repeated here.
Fig. 11 is a schematic structural diagram of a determining device for a scene analysis model according to an embodiment of the present disclosure. Referring to fig. 11, the determining device of the scene analysis model shown in the figure may be implemented as all or a part of the electronic device by software, hardware or a combination of both, and may be integrated on a server as an independent module, or may be integrated in the electronic device as an independent module.
In this embodiment of the present disclosure, the scene analysis model includes an encoder, where the encoder is pre-trained by a pre-training method of the scene analysis model, and the determining device 1100 of the scene analysis model includes: a determination module 1110 and a training module 1120.
Wherein, the determining module 1110 is configured to determine full-connection layer parameters according to a scene analysis task, and connect the determined full-connection layer to the head of the pre-trained encoder; and the training module 1120 is configured to train the pre-trained encoder and the full-connection layer through samples related to the scene analysis task to obtain the scene analysis model.
It should be noted that, when the determining device for a scene analysis model provided in the above embodiment performs the determining method for a scene analysis model, the division into the above functional modules is merely illustrative; in practical applications, the above functions may be allocated to different functional modules as required, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above.
In addition, the apparatus for determining a scene analysis model provided in the above embodiment and the method for determining a scene analysis model belong to the same concept, so for details not disclosed in the embodiments of the apparatus of the present specification, please refer to the embodiments of the method for determining a scene analysis model described in the present specification, and the details are not repeated herein.
The present application also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of any of the previous embodiments. The computer readable storage medium may include any type of disk, including floppy disks, optical disks, DVDs (Digital Versatile Discs), CD-ROMs (Compact Disc Read-Only Memory), microdrives, and magneto-optical disks, as well as ROMs (Read-Only Memory), RAMs (Random Access Memory), EPROMs (Erasable Programmable Read-Only Memory), EEPROMs (Electrically Erasable Programmable Read-Only Memory), DRAMs (Dynamic Random Access Memory), VRAMs (Video Random Access Memory), flash memory devices, magnetic or optical cards, nanosystems (including molecular memory ICs), or any type of media or device suitable for storing instructions and/or data.
Fig. 12 schematically shows a structural diagram of an electronic device in an exemplary embodiment according to the present specification. Referring to fig. 12, an electronic device 1200 includes: a processor 1201 and a memory 1202.
In the embodiment of the present disclosure, the processor 1201 is a control center of a computer system, and may be a processor of a physical machine or a processor of a virtual machine. The processor 1201 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 1201 may be implemented in at least one hardware form of a digital signal processor (Digital Signal Processing, DSP), a field programmable gate array (Field-Programmable Gate Array, FPGA), and a programmable logic array (Programmable Logic Array, PLA). The processor 1201 may also include a main processor and a coprocessor: the main processor is a processor for processing data in an awake state, and the coprocessor is a low-power processor for processing data in a standby state.
In the present embodiment, the scene analysis model includes an encoder, and the processor 1201 is specifically configured to pre-train the scene analysis model. Specifically: performing contrast learning on the encoder through a plurality of sample groups, wherein each sample group includes a first modality image and a second modality image of the same content; performing mask processing on the plurality of sample groups respectively, and performing image reconstruction learning on the encoder and a decoder through the sample groups after the mask processing; and optimizing parameters of the encoder through the contrast learning and the image reconstruction learning to pre-train the scene analysis model.
Further, the performing contrast learning on the encoder through a plurality of sample groups includes: performing image blocking processing on a first modality image in a target sample group and performing feature extraction processing on the obtained image blocks to obtain a first embedded feature, wherein the target sample group is any one of the plurality of sample groups; performing image blocking processing on a second modality image in the target sample group and performing feature extraction processing on the obtained image blocks to obtain a second embedded feature; inputting the first embedded feature and the second embedded feature into the encoder for encoding processing to obtain a first encoding feature corresponding to the first modality image and a second encoding feature corresponding to the second modality image; determining a noise contrast estimate for each image block in the target sample group based on the first encoding feature and the second encoding feature, wherein the noise contrast estimate for each image block in the target sample group is used to determine a first loss function; and optimizing parameters of the encoder through the first loss function so as to realize contrast learning of the encoder.
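As a non-limiting illustration of the contrast learning step described above, the following PyTorch-style sketch shows patch-level embedding of the two modality images and a symmetric noise-contrastive (InfoNCE-style) loss over image blocks. It is not the claimed implementation; the module names, the temperature value, and the symmetric formulation are assumptions introduced for illustration only.

```python
# Illustrative sketch only (assumed names and hyper-parameters), not the patented implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchEmbed(nn.Module):
    """Splits an image into patches and projects each patch to an embedding."""
    def __init__(self, patch_size=16, in_chans=3, dim=256):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, C, H, W)
        x = self.proj(x)                       # (B, dim, H/ps, W/ps)
        return x.flatten(2).transpose(1, 2)    # (B, N, dim) patch embeddings

def patch_info_nce(feat_a, feat_b, temperature=0.07):
    """Noise-contrastive estimate over image blocks.

    feat_a / feat_b: (B, N, D) encoder outputs for the two modalities.
    The same patch index in the other modality is the positive; all other
    patches of that image in the other modality are negatives.
    """
    a = F.normalize(feat_a, dim=-1)
    b = F.normalize(feat_b, dim=-1)
    logits = torch.einsum("bnd,bmd->bnm", a, b) / temperature   # (B, N, N)
    target = torch.arange(logits.size(1), device=logits.device)
    target = target.unsqueeze(0).expand(logits.size(0), -1)     # (B, N)
    # Symmetric InfoNCE: modality a -> b and modality b -> a
    loss = 0.5 * (F.cross_entropy(logits.flatten(0, 1), target.flatten())
                  + F.cross_entropy(logits.transpose(1, 2).flatten(0, 1), target.flatten()))
    return loss
```

In such a sketch, both modality images of a sample group would be passed through PatchEmbed and the shared encoder, and the first loss function would be minimized over the resulting patch features.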
Further, the encoder is a vision Transformer model, wherein the first embedded feature and the second embedded feature input to the encoder are not position encoded.
Further, after the contrast learning, for a target image block of an image in the target sample group, the similarity between the target image block in the first modality image of the target sample group and the target image block in the second modality image of the target sample group is greater than a first preset value, and the similarity between the target image block in the first modality image of the target sample group and other image blocks in the second modality image of the target sample group is less than a second preset value; the first preset value is greater than the second preset value, and the target sample group is any one of the plurality of sample groups.
Further, the processor 1201 is further configured to: before the mask processing is performed on the plurality of sample groups, perform image blocking processing, feature extraction processing and position coding on the first modality image in the target sample group to obtain a third embedded feature; and perform image blocking processing, feature extraction processing and position coding on the second modality image in the target sample group to obtain a fourth embedded feature; wherein the third embedded feature and the fourth embedded feature are used to perform the mask processing.
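Purely as an illustrative aside, the position coding step could be sketched as a learned positional table added to the patch embeddings of the previous sketch; the table-based encoding and its initialization are assumptions and are not fixed by this embodiment.

```python
# Illustrative sketch only: learned positional encoding producing the
# third/fourth embedded features (encoding type is an assumption).
import torch
import torch.nn as nn

class PositionEncodedEmbed(nn.Module):
    def __init__(self, num_patches, dim=256):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))
        nn.init.trunc_normal_(self.pos, std=0.02)

    def forward(self, patch_emb):            # (B, N, D) patch embeddings
        return patch_emb + self.pos          # position-encoded embedded feature
```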
Further, the performing mask processing on each of the plurality of sample groups includes: splicing the third embedded feature and the fourth embedded feature of the target sample group into a total embedded feature; and masking the third embedded feature in the total embedded feature and masking the fourth embedded feature in the total embedded feature to obtain a total mask feature.
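By way of non-limiting illustration, the splicing and masking step could be sketched as follows; the 75% mask ratio, the random-noise sampling strategy, and the function name are assumptions and are not part of the embodiment.

```python
# Illustrative sketch (assumed mask ratio and shapes): splice the two
# position-encoded modality embeddings and randomly mask tokens from both halves.
import torch

def build_total_mask_feature(emb_a, emb_b, mask_ratio=0.75):
    """emb_a, emb_b: (B, N, D) third/fourth embedded features.
    Returns the total embedded feature, a boolean mask (True = masked),
    and the visible (unmasked) tokens, kept in ascending position order."""
    total = torch.cat([emb_a, emb_b], dim=1)                    # (B, 2N, D)
    B, L, D = total.shape
    noise = torch.rand(B, L, device=total.device)
    n_keep = int(L * (1 - mask_ratio))
    keep_idx = noise.argsort(dim=1)[:, :n_keep].sort(dim=1).values
    mask = torch.ones(B, L, dtype=torch.bool, device=total.device)
    mask.scatter_(1, keep_idx, False)                           # False = visible token
    visible = torch.gather(total, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    return total, mask, visible
```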
Further, the performing image reconstruction learning on the encoder and the decoder through the sample groups after the mask processing includes: inputting the unmasked part of the total mask feature into the encoder, and processing the unmasked part by the encoder to obtain a latent feature corresponding to the target sample group; inputting the latent feature and the mask portion of the total mask feature into the decoder for decoding processing by the decoder; determining a first partial feature related to the first modality image from the output features of the decoder, and reconstructing an image by using the first partial feature to obtain pixel values of a first modality reconstructed image; determining a second partial feature related to the second modality image from the output features of the decoder, and reconstructing an image by using the second partial feature to obtain pixel values of a second modality reconstructed image; determining a second loss function through the pixel values of the first modality reconstructed image, the pixel values of the first modality image, the pixel values of the second modality reconstructed image, and the pixel values of the second modality image; and optimizing parameters of the encoder through the second loss function to realize image reconstruction learning of the encoder.
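Purely as an illustrative sketch, the reconstruction pass and the second loss function could look like the following. The mask-token mechanism, the two-layer Transformer decoder, and the use of mean-squared error over masked patches are assumptions; the global average pooling of the unmasked part described below is omitted for brevity; and `latent` is assumed to come from running the encoder on the `visible` tokens of the previous sketch.

```python
# Illustrative sketch only: decode latent + mask tokens, split per modality,
# and compute an assumed MSE-based second loss function on masked patches.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedReconstructor(nn.Module):
    def __init__(self, dim=256, patch_pixels=16 * 16 * 3):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)
        self.to_pixels = nn.Linear(dim, patch_pixels)

    def forward(self, latent, mask, pixels_a, pixels_b):
        # latent: (B, n_keep, D) encoder output for the unmasked tokens
        # mask:   (B, 2N) bool, True = masked; first N tokens = first modality
        # pixels_a / pixels_b: (B, N, P) ground-truth patch pixels per modality
        B, L = mask.shape
        N = L // 2
        tokens = self.mask_token.expand(B, L, -1).clone()
        tokens[~mask] = latent.reshape(-1, latent.size(-1))     # place visible latents
        out = self.decoder(tokens)                              # (B, 2N, D)
        rec_a = self.to_pixels(out[:, :N])                      # first-modality reconstruction
        rec_b = self.to_pixels(out[:, N:])                      # second-modality reconstruction
        loss_a = F.mse_loss(rec_a[mask[:, :N]], pixels_a[mask[:, :N]])
        loss_b = F.mse_loss(rec_b[mask[:, N:]], pixels_b[mask[:, N:]])
        return loss_a + loss_b                                  # assumed second loss function
```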
Further, the inputting the unmasked part of the total mask feature into the encoder includes: performing global average pooling processing on the unmasked part of the total mask feature and then inputting the unmasked part into the encoder.
Further, the mask portion in the total mask feature includes: a first mask portion for the first modality image and a second mask portion for the second modality image; the inputting the latent feature and the mask portion of the total mask feature into the decoder includes: acquiring a shared mask portion regarding the same image block from the first mask portion and the second mask portion; and inputting the latent feature and the shared mask portion into the decoder.
Further, the determining the second loss function through the pixel values of the first modality reconstructed image, the pixel values of the first modality image, the pixel values of the second modality reconstructed image, and the pixel values of the second modality image includes: for a target image block of an image in the target sample group, acquiring a first set of pixel values of the target image block in the first modality reconstructed image, a second set of pixel values of the target image block in the first modality image, a third set of pixel values of the target image block in the second modality reconstructed image, and a fourth set of pixel values of the target image block in the second modality image; and determining the second loss function from the first, second, third, and fourth sets of pixel values.
Further, the optimizing the parameters of the encoder by the contrast learning and the image reconstruction learning includes: performing the contrast learning and the image reconstruction learning in combination to optimize parameters of the encoder; or, executing the contrast learning to obtain an encoder after preliminary parameter optimization, and executing the image reconstruction learning on the encoder after preliminary parameter optimization to optimize the parameters of the encoder again; or, executing the image reconstruction learning to obtain an encoder after preliminary parameter optimization, and executing the contrast learning on the encoder after preliminary parameter optimization to optimize the parameters of the encoder again.
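As a non-limiting illustration of the three optimization schedules above, the following sketch shows a joint schedule and the two sequential schedules. `contrast_loss` and `recon_loss` are assumed to wrap the losses from the earlier sketches, and the optimizer, learning rate, and epoch counts are assumptions.

```python
# Illustrative sketch of the three pre-training schedules (all names assumed).
import torch

def run_stage(loss_fn, params, loader, epochs=10, lr=1e-4):
    opt = torch.optim.AdamW(params, lr=lr)
    for _ in range(epochs):
        for batch in loader:
            loss = loss_fn(batch)
            opt.zero_grad()
            loss.backward()
            opt.step()

def pretrain(encoder, decoder, loader, contrast_loss, recon_loss, schedule="joint"):
    enc_dec = list(encoder.parameters()) + list(decoder.parameters())
    if schedule == "joint":
        # contrast learning and image reconstruction learning in combination
        run_stage(lambda b: contrast_loss(encoder, b) + recon_loss(encoder, decoder, b),
                  enc_dec, loader)
    elif schedule == "contrast_then_reconstruct":
        run_stage(lambda b: contrast_loss(encoder, b), list(encoder.parameters()), loader)
        run_stage(lambda b: recon_loss(encoder, decoder, b), enc_dec, loader)
    else:  # "reconstruct_then_contrast"
        run_stage(lambda b: recon_loss(encoder, decoder, b), enc_dec, loader)
        run_stage(lambda b: contrast_loss(encoder, b), list(encoder.parameters()), loader)
```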
In the present embodiment, the scene analysis model includes an encoder, and the processor 1201 is specifically configured to determine the scene analysis model. Specifically: determining full-connection layer parameters according to a scene analysis task, and connecting the determined full-connection layer to the head of the pre-trained encoder; and training the pre-trained encoder and the full-connection layer through samples related to the scene analysis task to obtain the scene analysis model.
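For concreteness only, the determination step could be sketched as attaching a task-sized full-connection layer to the head of the pre-trained encoder and fine-tuning on task samples; the feature dimension, class count, pooling, and training loop below are assumptions, not the claimed implementation.

```python
# Illustrative sketch only: attach a task-specific full-connection layer and fine-tune.
import torch
import torch.nn as nn

class SceneAnalysisModel(nn.Module):
    def __init__(self, pretrained_encoder, feat_dim=256, num_classes=10):
        super().__init__()
        self.encoder = pretrained_encoder                 # pre-trained encoder
        self.head = nn.Linear(feat_dim, num_classes)      # full-connection layer sized by the task

    def forward(self, patch_embeddings):                  # (B, N, D)
        feats = self.encoder(patch_embeddings)            # (B, N, D)
        return self.head(feats.mean(dim=1))               # pooled feature -> task logits

def finetune(model, loader, epochs=5, lr=1e-4):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:                               # samples related to the scene analysis task
            loss = ce(model(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
```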
Memory 1202 may include one or more computer-readable storage media, which may be non-transitory. Memory 1202 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments of the present description, a non-transitory computer readable storage medium in memory 1202 is used to store at least one instruction for execution by processor 1201 to implement the methods in embodiments of the present description.
In some embodiments, the electronic device 1200 further includes: a peripheral interface 1203, and at least one peripheral. The processor 1201, the memory 1202, and the peripheral interface 1203 may be connected by a bus or signal lines. The individual peripheral devices may be connected to the peripheral device interface 1203 via buses, signal lines, or a circuit board. Specifically, the peripheral device includes: at least one of a display 1204, a camera 1205, and an audio circuit 1206.
The peripheral interface 1203 may be used to connect at least one Input/Output (I/O) related peripheral to the processor 1201 and the memory 1202. In some embodiments of the present description, the processor 1201, the memory 1202, and the peripheral interface 1203 are integrated on the same chip or circuit board; in some other embodiments of the present description, any one or two of the processor 1201, the memory 1202, and the peripheral interface 1203 may be implemented on a separate chip or circuit board. The embodiment of the present specification is not particularly limited thereto.
The display 1204 is used to display a User Interface (UI). The UI may include graphics, text, icons, video, and any combination thereof. When the display 1204 is a touch display, the display 1204 also has the ability to collect touch signals on or above its surface. The touch signal may be input to the processor 1201 as a control signal for processing. At this time, the display 1204 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments of the present description, there may be one display 1204, providing the front panel of the electronic device 1200; in other embodiments of the present disclosure, there may be at least two displays 1204, respectively disposed on different surfaces of the electronic device 1200 or in a folded design; in still other embodiments of the present description, the display 1204 may be a flexible display disposed on a curved surface or a folded surface of the electronic device 1200. The display 1204 may even be arranged in an irregular, non-rectangular pattern, i.e., an irregularly shaped screen. The display 1204 may be made of a liquid crystal display (Liquid Crystal Display, LCD), an organic light-emitting diode (Organic Light-Emitting Diode, OLED), or other materials.
The camera 1205 is used to capture images or video. Optionally, the camera 1205 includes a front camera and a rear camera. In general, the front camera is disposed on the front panel of the electronic device, and the rear camera is disposed on the rear surface of the electronic device. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth camera may be fused to realize a background blurring function, and the main camera and the wide-angle camera may be fused to realize panoramic shooting and Virtual Reality (VR) shooting functions or other fusion shooting functions. In some embodiments of the present description, the camera 1205 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash refers to a combination of a warm-light flash and a cold-light flash, and can be used for light compensation under different color temperatures.
The audio circuit 1206 may include a microphone and a speaker. The microphone is used to collect sound waves of the user and the environment and convert them into electrical signals to be input to the processor 1201 for processing. For the purpose of stereo acquisition or noise reduction, there may be multiple microphones, respectively disposed at different locations of the electronic device 1200. The microphone may also be an array microphone or an omnidirectional pickup microphone. The power supply 1207 is used to power the various components in the electronic device 1200. The power supply 1207 may be an alternating current supply, a direct current supply, a disposable battery, or a rechargeable battery. When the power supply 1207 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. A wired rechargeable battery is charged through a wired line, and a wireless rechargeable battery is charged through a wireless coil. The rechargeable battery may also support fast-charge technology. The block diagram of the electronic device structure shown in the embodiments of the present description does not constitute a limitation of the electronic device 1200, and the electronic device 1200 may include more or fewer components than illustrated, may combine some components, or may employ a different arrangement of components.
In the description of the present specification, it should be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. The specific meaning of these terms in this specification will be understood by those of ordinary skill in the art in light of the specific circumstances. In addition, in the description of the present specification, unless otherwise indicated, "a plurality" means two or more. "And/or" describes an association relationship of associated objects and indicates that three relationships may exist; for example, A and/or B may indicate: A exists alone, A and B exist together, or B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship.
The present description also provides a computer-readable storage medium having instructions stored therein, which, when executed on a computer or processor, cause the computer or processor to perform one or more steps of the above embodiments. The respective constituent modules of the above-described apparatus, if implemented in the form of software functional units and sold or used as independent products, may be stored in the above-described computer-readable storage medium.
In the above embodiments, the implementation may be in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the present specification are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in or transmitted through a computer-readable storage medium. The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line (Digital Subscriber Line, DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that contains an integration of one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a digital versatile disc (Digital Versatile Disc, DVD)), or a semiconductor medium (e.g., a solid state disk (Solid State Disk, SSD)), or the like. It should be noted that the foregoing describes specific embodiments of this specification. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible and may be advantageous.
The foregoing is merely specific embodiments of the present disclosure, but the scope of protection is not limited thereto; any person skilled in the art can readily conceive of variations or substitutions within the technical scope disclosed in the present specification, and such variations or substitutions are intended to be covered by the scope of protection of the present specification. Accordingly, equivalent variations made in accordance with the claims of the present specification remain within the scope of the present specification.

Claims (18)

1. A method of pre-training a scene analysis model, wherein the scene analysis model includes an encoder, the method comprising:
performing contrast learning on the encoder by a plurality of sample groups, wherein each sample group comprises a first modality image and a second modality image of the same content;
performing mask processing on the plurality of sample groups respectively, and performing image reconstruction learning on the encoder and a decoder through the sample groups subjected to the mask processing;
optimizing parameters of the encoder through the contrast learning and the image reconstruction learning to pretrain the scene analysis model.
2. The method of claim 1, wherein the contrast learning of the encoder by a plurality of sample groups comprises:
performing image blocking processing on a first modality image in a target sample group and performing feature extraction processing on the obtained image blocks to obtain a first embedded feature, wherein the target sample group is any one of the plurality of sample groups;
performing image blocking processing on a second modality image in the target sample group and performing feature extraction processing on the obtained image blocks to obtain a second embedded feature;
inputting the first embedded feature and the second embedded feature into the encoder for encoding processing to obtain a first encoding feature corresponding to the first modality image and a second encoding feature corresponding to the second modality image;
determining a noise contrast estimate for each image block in the target sample group according to the first encoding feature and the second encoding feature, wherein the noise contrast estimate for each image block in the target sample group is used to determine a first loss function;
and optimizing parameters of the encoder through the first loss function to realize contrast learning of the encoder.
3. The method of claim 2, wherein the encoder is a vision Transformer model, wherein the first embedded feature and the second embedded feature input to the encoder are not position encoded.
4. A method according to any one of claims 1 to 3, wherein, after the contrast learning, for a target image block of an image in the target sample group, a similarity between the target image block in the first modality image of the target sample group and the target image block in the second modality image of the target sample group is greater than a first preset value, and a similarity between the target image block in the first modality image of the target sample group and other image blocks in the second modality image of the target sample group is less than a second preset value;
the first preset value is larger than the second preset value, and the target sample group is any one of the plurality of sample groups.
5. The method of claim 1, wherein prior to masking the plurality of sample groups, respectively, the method further comprises:
performing image blocking processing, feature extraction processing and position coding on the first modality image in the target sample group to obtain a third embedded feature;
performing image blocking processing, feature extraction processing and position coding on the second modality image in the target sample group to obtain a fourth embedded feature;
wherein the third embedded feature and the fourth embedded feature are used to perform the mask processing.
6. The method of claim 5, wherein the mask processing for each of the plurality of sample groups comprises:
splicing the third embedded feature and the fourth embedded feature with respect to the target sample group into a total embedded feature;
and carrying out mask processing on a third embedded feature in the total embedded features, and carrying out mask processing on a fourth embedded feature in the total embedded features to obtain total mask features.
7. The method of claim 6, wherein the performing image reconstruction learning on the encoder and the decoder through the sample groups subjected to the mask processing comprises:
inputting the unmasked part of the total mask feature into the encoder, and processing the unmasked part by the encoder to obtain a latent feature corresponding to the target sample group;
inputting the latent feature and the mask portion of the total mask feature into the decoder for decoding processing by the decoder;
determining a first partial feature related to the first modality image from the output features of the decoder, and reconstructing an image by utilizing the first partial feature to obtain pixel values of the first modality reconstructed image;
determining a second partial feature related to the second modality image from the output features of the decoder, and reconstructing an image by utilizing the second partial feature to obtain pixel values of the second modality reconstructed image;
determining a second loss function through the pixel values of the first modality reconstructed image, the pixel values of the first modality image, the pixel values of the second modality reconstructed image, and the pixel values of the second modality image;
and optimizing parameters of the encoder through the second loss function to realize image reconstruction learning of the encoder.
8. The method of claim 7, wherein the inputting the unmasked part of the total mask feature into the encoder comprises:
performing global average pooling processing on the unmasked part of the total mask feature, and then inputting the unmasked part into the encoder.
9. The method of claim 7, wherein the mask portion of the total mask feature comprises: a first mask portion for the first modality image and a second mask portion for the second modality image;
the inputting the latent feature and the mask portion of the total mask feature into the decoder comprises:
acquiring a shared mask portion regarding the same image block from the first mask portion and the second mask portion;
and inputting the latent feature and the shared mask portion into the decoder.
10. The method of claim 7, wherein the determining the second loss function from the pixel values of the first modality reconstructed image, the pixel values of the first modality image, the pixel values of the second modality reconstructed image, and the pixel values of the second modality image comprises:
for a target image block of an image in the target sample group, acquiring a first set of pixel values of the target image block in the first modality reconstructed image, acquiring a second set of pixel values of the target image block in the first modality image, acquiring a third set of pixel values of the target image block in the second modality reconstructed image, and acquiring a fourth set of pixel values of the target image block in the second modality image;
and determining the second loss function from the first set of pixel values, the second set of pixel values, the third set of pixel values, and the fourth set of pixel values.
11. The method of any of claims 1 to 10, wherein the optimizing parameters of the encoder by the contrast learning and the image reconstruction learning comprises:
performing the contrast learning and the image reconstruction learning in combination to optimize parameters of the encoder; or,
executing the contrast learning to obtain an encoder after preliminary parameter optimization, and executing the image reconstruction learning on the encoder after preliminary parameter optimization to optimize the parameters of the encoder again; or,
executing the image reconstruction learning to obtain an encoder after preliminary parameter optimization, and executing the contrast learning on the encoder after preliminary parameter optimization to optimize the parameters of the encoder again.
12. A method of determining a scene analysis model, wherein the scene analysis model comprises an encoder, the encoder being pre-trained by the method of any of claims 1 to 11, the method comprising:
determining full-connection layer parameters according to a scene analysis task, and connecting the determined full-connection layer to the head of the pre-trained encoder;
and training the pre-trained encoder and the full-connection layer through the samples related to the scene analysis task to obtain the scene analysis model.
13. The method of claim 12, wherein the scene analysis task comprises: scene recognition tasks or scene segmentation tasks.
14. A pre-training apparatus for a scene analysis model, wherein the scene analysis model comprises an encoder, the apparatus comprising:
a contrast learning module for performing contrast learning on the encoder by a plurality of sample groups, wherein each of the sample groups includes a first modality image and a second modality image with respect to the same content;
the image reconstruction learning module is used for carrying out mask processing on the plurality of sample groups respectively and carrying out image reconstruction learning on the encoder and the decoder through the sample groups subjected to the mask processing;
and the pre-training module is used for optimizing the parameters of the encoder through the contrast learning and the image reconstruction learning so as to pre-train the scene analysis model.
15. A device for determining a scene analysis model, wherein the scene analysis model comprises an encoder, the encoder being pre-trained by the method of any of claims 1 to 9, the device comprising:
the determining module is used for determining parameters of the full-connection layer according to the scene analysis task and connecting the determined full-connection layer to the head of the encoder after pre-training;
and the training module is used for training the pre-trained encoder and the full-connection layer through the samples related to the scene analysis task to obtain the scene analysis model.
16. A computer readable storage medium having instructions stored therein, wherein the instructions, when run on a computer or processor, cause the computer or processor to perform the pre-training method of a scene analysis model according to any one of claims 1 to 11, or to perform the determining method of a scene analysis model according to claim 12 or 13.
17. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the pre-training method of the scene analysis model of any one of claims 1 to 11 or the determination method of the scene analysis model of claim 12 or 13.
18. A computer program product comprising instructions which, when run on a computer or processor, cause the computer or processor to perform the pre-training method of a scene analysis model according to any of claims 1 to 11 or to perform the determination method of a scene analysis model according to claim 12 or 13.
CN202310136678.1A 2023-02-07 2023-02-07 Pre-training method, determining method, device and product of scene analysis model Pending CN116206175A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310136678.1A CN116206175A (en) 2023-02-07 2023-02-07 Pre-training method, determining method, device and product of scene analysis model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310136678.1A CN116206175A (en) 2023-02-07 2023-02-07 Pre-training method, determining method, device and product of scene analysis model

Publications (1)

Publication Number Publication Date
CN116206175A true CN116206175A (en) 2023-06-02

Family

ID=86515526

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310136678.1A Pending CN116206175A (en) 2023-02-07 2023-02-07 Pre-training method, determining method, device and product of scene analysis model

Country Status (1)

Country Link
CN (1) CN116206175A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116778006A (en) * 2023-06-25 2023-09-19 北京百度网讯科技有限公司 Modeling method and device for picture encoder, electronic equipment and storage medium
CN116778006B (en) * 2023-06-25 2024-04-02 北京百度网讯科技有限公司 Modeling method and device for picture encoder, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
US20210064931A1 (en) Self-supervised hierarchical motion learning for video action recognition
US11755889B2 (en) Method, system and apparatus for pattern recognition
CN111476309A (en) Image processing method, model training method, device, equipment and readable medium
CN111091166B (en) Image processing model training method, image processing device, and storage medium
CN111476783B (en) Image processing method, device and equipment based on artificial intelligence and storage medium
US11776263B2 (en) Bidirectional pairing architecture for object detection in video
CN111860485B (en) Training method of image recognition model, image recognition method, device and equipment
CN115061770B (en) Method and electronic device for displaying dynamic wallpaper
CN115131281A (en) Method, device and equipment for training change detection model and detecting image change
CN116206175A (en) Pre-training method, determining method, device and product of scene analysis model
WO2021218725A1 (en) Image data processing method and related device
CN114283299A (en) Image clustering method and device, computer equipment and storage medium
CN113689527B (en) Training method of face conversion model and face image conversion method
US11276249B2 (en) Method and system for video action classification by mixing 2D and 3D features
CN112488054A (en) Face recognition method, face recognition device, terminal equipment and storage medium
CN110232417B (en) Image recognition method and device, computer equipment and computer readable storage medium
EP4303815A1 (en) Image processing method, electronic device, storage medium, and program product
CN112528760B (en) Image processing method, device, computer equipment and medium
CN116863042A (en) Motion generation method of virtual object and training method of motion generation model
CN114972146A (en) Image fusion method and device based on generation countermeasure type double-channel weight distribution
CN114677620A (en) Focusing method, electronic device and computer readable medium
CN115147434A (en) Image processing method, device, terminal equipment and computer readable storage medium
CN114419517A (en) Video frame processing method and device, computer equipment and storage medium
CN113196279B (en) Facial attribute identification method and electronic equipment
CN113822084A (en) Statement translation method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination