CN115115918A - Visual learning method based on multi-knowledge fusion - Google Patents

Visual learning method based on multi-knowledge fusion

Info

Publication number
CN115115918A
CN115115918A
Authority
CN
China
Prior art keywords
knowledge
module
features
convolution
different
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210682147.8A
Other languages
Chinese (zh)
Inventor
高鹏
张仁瑞
莫申童
马特立
李鸿升
乔宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai AI Innovation Center
Original Assignee
Shanghai AI Innovation Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai AI Innovation Center filed Critical Shanghai AI Innovation Center
Priority to CN202210682147.8A
Publication of CN115115918A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/42 Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/52 Scale-space analysis, e.g. wavelet analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a visual learning method based on multi-knowledge fusion. The method comprises the following steps: constructing a visual learning device that comprises a plurality of convolution modules, a Transformer module, a decoder and a multi-knowledge fusion module, wherein the input images of the convolution modules have different resolutions, each input image corresponds to one of a plurality of kinds of knowledge and carries a complementary masked region, and the convolution modules perform mutually independent feature extraction on the different unmasked regions corresponding to the plurality of kinds of knowledge; the Transformer module extracts mutually independent global features from the unmasked features; the decoder performs image reconstruction based on the unmasked features and the masks; and pre-training the visual learning device with a set loss criterion as the target, wherein during pre-training the plurality of kinds of knowledge learned by the multi-knowledge fusion module are fed to the decoder as supervision signals to guide the training process. The present invention improves pre-training efficiency and can be adapted to a wider range of downstream tasks.

Description

Visual learning method based on multi-knowledge fusion
Technical Field
The invention relates to the technical field of computer vision, and in particular to a visual learning method based on multi-knowledge fusion.
Background
The masked autoencoder masks part of an image with a random mask during pre-training of a computer-vision backbone network, learns features of the unmasked part with an encoder, and reconstructs the masked image content from those features. Pre-training visual features with a Masked Autoencoder (MAE) has achieved good performance on a variety of visual tasks.
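For readers less familiar with MAE-style masking, the following is a minimal sketch of the random-masking step described above, written in PyTorch; the patch-grid size, mask ratio and all names are illustrative assumptions rather than the parameters of any particular model.

```python
import torch

def random_patch_mask(batch: int, num_patches: int, mask_ratio: float = 0.75) -> torch.Tensor:
    """Return a boolean mask (True = masked), drawn independently per image:
    a fixed fraction of the patch grid is hidden, as in MAE-style pre-training."""
    num_masked = int(num_patches * mask_ratio)
    noise = torch.rand(batch, num_patches)          # one random score per patch
    ids = noise.argsort(dim=1)                      # random permutation of patch indices
    mask = torch.zeros(batch, num_patches, dtype=torch.bool)
    rows = torch.arange(batch).unsqueeze(1)         # (batch, 1), broadcast over columns
    mask[rows, ids[:, :num_masked]] = True          # first num_masked indices become masked
    return mask

# 224x224 image with 16x16 patches -> 14x14 = 196 patches, 75% of them masked
mask = random_patch_mask(batch=2, num_patches=196)
print(mask.sum(dim=1))                              # tensor([147, 147])
```

The encoder then processes only the unmasked patches, and a lightweight decoder predicts the content of the masked positions.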
Self-supervised pre-training has become a new paradigm for visual feature learning and improves the performance of various visual tasks through its powerful visual representation capability. Beyond contrastive self-supervised approaches such as DINO and MoCo-v3, the masked autoencoder (MAE) also shows very promising performance and has inspired a series of follow-up works on improving it, such as ConvMAE, HiViT and MixMIM. MAE self-supervised learning is inspired by the BERT model in natural language processing: part of the picture is occluded with a random mask, and the pixel values of the masked region are then reconstructed from the unmasked part, so that the network learns low-level semantic information of the picture. However, in the prior art the slow pre-training convergence and the huge computing-resource overhead greatly restrict the further development and application of MAE. Specifically, pre-training an MAE based on a visual Transformer network requires 800 epochs and about two thousand GPU hours, while the subsequent ConvMAE requires 1600 epochs and four thousand GPU hours.
Therefore, it is necessary to provide a new technical solution that accelerates pre-training and reduces the overhead of computing resources.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a visual learning method based on multi-knowledge fusion.
According to a first aspect of the present invention, a method for visual learning based on multi-knowledge fusion is provided. The method comprises the following steps:
constructing a visual learning device, wherein the visual learning device comprises a plurality of convolution modules, a Transformer module, a decoder and a multi-knowledge fusion module; the input images of the convolution modules have different resolutions, each input image corresponds to one of a plurality of kinds of knowledge and carries a complementary masked region, and the convolution modules perform mutually independent feature extraction on the different unmasked regions corresponding to the plurality of kinds of knowledge; the Transformer module extracts mutually independent global features from the unmasked features; the decoder performs image reconstruction based on the unmasked features and the masks;
and pre-training the visual learning device with a set loss criterion as the target, wherein during pre-training the plurality of kinds of knowledge learned by the multi-knowledge fusion module are fed to the decoder as supervision signals to guide the training process.
According to a second aspect of the present invention, a method for applying a visual learner is provided. The method comprises the following steps:
extracting features of different scales from an input target image by using the plurality of convolution modules and the trained Transformer module, wherein the Transformer module enhances the features with a global or local attention mechanism;
and down-sampling the features output by the Transformer module and sending them, together with the extracted features of different scales, to a detection network or a segmentation network to obtain the corresponding detection result or segmentation result.
Compared with the prior art, the invention adds multiple reconstruction tasks and injects semantic knowledge generated by a variety of different models into the network learning process, so that the network learns more diverse information, and it uses complementary masks so that different knowledge-reconstruction targets correspond to complementary unoccluded image regions. The present invention improves pre-training efficiency and can be adapted to a wider range of downstream tasks.
Other features of the present invention and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a flow diagram of a method of visual learning based on multi-knowledge fusion, according to one embodiment of the present invention;
FIG. 2 is an architecture diagram of a multi-knowledge fusion based visual learner, according to one embodiment of the present invention;
FIG. 3 is a schematic diagram of a process of applying a visual learner to a downstream task, according to one embodiment of the present invention.
Detailed Description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless specifically stated otherwise.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
The invention provides an efficient masked visual learner based on multi-knowledge fusion, referred to as MoR-ConvMAE. First, the reconstruction target of MoR-ConvMAE is not just pixel values but a fusion of multiple kinds of knowledge, such as language-vision multi-modal knowledge, discriminative knowledge between different pictures, and historical momentum knowledge from network learning. With the stronger self-supervision signal formed by multi-knowledge fusion, the network achieves an extremely fast pre-training convergence speed. In addition, to further reduce the computing-resource overhead, a complementary mask is proposed so that different knowledge-reconstruction targets correspond to complementary unoccluded image regions, and the designed complementary masked convolution encodes the different regions independently, so that the different regions can be encoded independently within a single forward pass of the network and information leakage is prevented. The MoR-ConvMAE provided by the invention can complete the reconstruction targets of multiple kinds of knowledge simultaneously with the computational cost of a single picture, thereby greatly improving the efficiency of pre-training.
Specifically, referring to fig. 1, the provided multi-knowledge fusion-based visual learning method includes the following steps.
Step S110, a visual learner based on multi-knowledge fusion is constructed, and the visual learner uses complementary masks so that different knowledge-reconstruction targets correspond to complementary unoccluded image regions.
Referring to fig. 2, the visual learner generally includes a plurality of convolution modules, a Transformer module, a decoder, and a multi-knowledge fusion module (MoR). The convolution modules mask the input image with the generated complementary random masks and perform mutually independent feature extraction on the different unmasked regions corresponding to the various kinds of knowledge; the input image of each convolution module has a different resolution. The Transformer module performs mutually independent global feature extraction on the unmasked features. The decoder performs image reconstruction based on the unmasked features and the masks. During pre-training of the visual learner, the multiple kinds of knowledge output by the multi-knowledge fusion module are fed to the decoder as supervision signals to guide the training process.
In one embodiment, the shallow Transformer layers of the decoder are shared across the features of the multiple kinds of knowledge, while the deep Transformer layers are independent for different knowledge; the decoder is therefore also referred to as a partially shared decoder (PS-Decoder).
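The partially shared decoder idea can be sketched as follows; this is a minimal PyTorch illustration in which plain Transformer encoder layers stand in for the decoder blocks, and the depths, widths and names are assumptions rather than the patent's actual configuration.

```python
import torch
import torch.nn as nn

class PartiallySharedDecoder(nn.Module):
    """Shallow blocks are shared by all knowledge branches; deep blocks and
    prediction heads are kept independent per branch (PS-Decoder idea)."""
    def __init__(self, dim=512, num_knowledge=4, shared_depth=2, private_depth=2, out_dim=256):
        super().__init__()
        make_block = lambda: nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.shared = nn.ModuleList([make_block() for _ in range(shared_depth)])
        self.private = nn.ModuleList([
            nn.ModuleList([make_block() for _ in range(private_depth)])
            for _ in range(num_knowledge)
        ])
        self.heads = nn.ModuleList([nn.Linear(dim, out_dim) for _ in range(num_knowledge)])

    def forward(self, tokens):                       # tokens: (B, N, dim)
        for block in self.shared:                    # shared shallow layers
            tokens = block(tokens)
        outputs = []
        for branch, head in zip(self.private, self.heads):
            x = tokens                               # each branch refines the shared features
            for block in branch:
                x = block(x)
            outputs.append(head(x))                  # per-knowledge reconstruction
        return outputs

decoder = PartiallySharedDecoder()
predictions = decoder(torch.randn(2, 196, 512))
print(len(predictions), predictions[0].shape)        # 4 branches, each (2, 196, 256)
```

Keeping the deep blocks separate lets each reconstruction target update its own parameters, which is the gradient-conflict argument made later in the text.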
It should be noted that, during pre-training of the visual learner, the convolution modules perform their convolution operations on complementarily masked images, so they are also labeled as complementary masked-convolution modules in fig. 2. The following description uses two complementary masked-convolution modules as an example.
Step S120, the visual learner is pre-trained, and supervision signals are formed from multiple different kinds of knowledge so that more diverse and general visual features are learned.
Still referring to fig. 2, in the pre-training phase, the complementary masking mechanism in the lower-left corner first generates four random masks at 1/16 of the original image size; each mask corresponds to the reconstruction target of one kind of knowledge, and the unoccluded regions outside the masks are mutually independent and complementary. The complementary random masks at 1/16 of the image size are then expanded by upsampling to complementary random masks at 1/8 and 1/4 of the image size, respectively.
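A minimal sketch of how such complementary masks could be generated and upsampled is given below; it assumes a balanced random four-way partition of the 1/16-resolution patch grid and nearest-neighbour expansion to the 1/8 and 1/4 grids, and all helper names are illustrative.

```python
import torch

def complementary_masks(h16: int, w16: int, num_knowledge: int = 4) -> torch.Tensor:
    """Partition the 1/16-resolution grid into num_knowledge disjoint visible sets.
    masks[k] is True where knowledge k is MASKED; the visible (False) regions of the
    k masks are mutually exclusive and together cover the whole grid."""
    cells = h16 * w16
    assign = torch.empty(cells, dtype=torch.long)
    assign[torch.randperm(cells)] = torch.arange(cells) % num_knowledge   # balanced random split
    assign = assign.view(h16, w16)
    return torch.stack([assign != k for k in range(num_knowledge)])       # (K, h16, w16)

def upsample_mask(mask16: torch.Tensor, factor: int) -> torch.Tensor:
    """Expand a 1/16-grid mask to the 1/8 (factor=2) or 1/4 (factor=4) grid."""
    return mask16.repeat_interleave(factor, dim=-1).repeat_interleave(factor, dim=-2)

masks16 = complementary_masks(14, 14)         # 224x224 image, 16x16 patch grid
masks8 = upsample_mask(masks16, 2)            # 28x28 grid (1/8 resolution)
masks4 = upsample_mask(masks16, 4)            # 56x56 grid (1/4 resolution)
print(masks16.shape, (~masks16).sum(dim=(1, 2)))   # 4 masks, ~49 visible cells each
```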
In one embodiment, the image feature extraction of the visual learner consists of three stages. In the first stage, the original image is downsampled to 1/4 of the original resolution and masked with the 1/4-resolution random masks generated by the complementary mask generator; mutually independent feature extraction is then performed on the different unmasked regions corresponding to the four kinds of knowledge by two complementary masked convolutions with 5 x 5 kernels. In the second stage, the 1/4-resolution features are downsampled to 1/8-resolution features, masked with the masks corresponding to 1/8 of the image size, and again processed by two complementary masked convolutions with 5 x 5 kernels. The feature resolution of the third stage is 1/16: the features corresponding to the different kinds of knowledge are separated and flattened, masked with the 1/16-resolution masks, and the unmasked features are then sent to an 11-layer Transformer module to extract mutually independent global features. After this multi-scale encoding, the unmasked features and the masks are sent together to a partially shared decoder (PS-Decoder) for image reconstruction; the shallow Transformer layers of the decoder are shared by the features of the multiple kinds of knowledge, while the deep Transformer layers are independent for different knowledge, so that different layers concentrate on the reconstruction of different knowledge and the problem of gradient conflict is alleviated. Finally, the features corresponding to the different kinds of knowledge each predict and reconstruct their own masked part, the loss is computed against the features of the actually masked part, and the gradients are back-propagated according to this loss to update the model parameters.
Preferably, the number of complementary masked-convolution layers in each of the first two stages is 2 and the convolution kernel size is 5 x 5. The number of Transformer layers in the third stage is 11, and the depth can be increased according to the required network size.
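The complementary masked convolution can be approximated by zeroing the masked positions both before and after an ordinary convolution, so that each knowledge branch only ever sees its own visible region; the sketch below illustrates this idea under assumed channel counts and normalization, and is not the patent's exact operator.

```python
import torch
import torch.nn as nn

class MaskedConvBlock(nn.Module):
    """One 5x5 convolution that only sees, and only updates, unmasked positions."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=5, padding=2)
        self.norm = nn.BatchNorm2d(channels)
        self.act = nn.GELU()

    def forward(self, x, visible):                   # x: (B, C, H, W); visible: (H, W) bool
        v = visible.to(x.dtype)[None, None]          # broadcast to (1, 1, H, W)
        x = self.conv(x * v) * v                     # masked positions neither contribute nor update
        return self.act(self.norm(x))

# Each knowledge branch runs the block on its own complementary visible region,
# so the four branches can be encoded in the same forward pass without leakage.
block = MaskedConvBlock(channels=64)
features = torch.randn(2, 64, 56, 56)                # stage-1 features at 1/4 resolution
visible_k = torch.rand(56, 56) > 0.75                # visible region of one knowledge branch
out = block(features, visible_k)
print(out.shape)                                     # torch.Size([2, 64, 56, 56])
```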
When the visual learner is pre-trained, multiple kinds of knowledge are fused into the learning process; for example, the multiple kinds of knowledge include language-vision multi-modal knowledge, discriminative knowledge between different pictures, historical momentum knowledge, RGB pixel knowledge, and the like. In one embodiment, the historical momentum knowledge is learned with a historical momentum encoder, the discriminative knowledge between different pictures is learned with a DINO model, the language-vision multi-modal knowledge is learned with CLIP, and the RGB pixel knowledge is learned with an existing ConvMAE or MixMIM, which belongs to the prior art and is only illustrated in fig. 2. Each kind of knowledge helps guide the visual features learned by the network from one aspect, and forming the supervision signal from a variety of different knowledge lets the network learn more diverse and general visual features, which helps it achieve good performance on different visual tasks.
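One way the per-knowledge reconstruction losses might be combined is sketched below; the targets (CLIP features, DINO features, momentum-encoder features, raw pixels) are stubbed with random tensors, the mean-squared error over masked positions is an assumption, and all names are illustrative.

```python
import torch

def multi_knowledge_loss(preds, targets, masks):
    """preds, targets: lists of (B, N, D) tensors, one per knowledge branch;
    masks: list of (B, N) bool tensors, True where that branch's target is masked.
    Each branch is penalized only on its own masked positions; the losses are summed."""
    total = 0.0
    for pred, target, mask in zip(preds, targets, masks):
        err = (pred - target).pow(2).mean(dim=-1)        # per-token reconstruction error
        total = total + (err * mask).sum() / mask.sum().clamp(min=1)
    return total

# Toy example with four branches (e.g. CLIP, DINO, momentum features, RGB pixels as targets)
B, N, D = 2, 196, 256
preds = [torch.randn(B, N, D, requires_grad=True) for _ in range(4)]
targets = [torch.randn(B, N, D) for _ in range(4)]
masks = [torch.rand(B, N) > 0.25 for _ in range(4)]      # complementary in the real model
loss = multi_knowledge_loss(preds, targets, masks)
loss.backward()
print(loss.item())
```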
Through this pre-training process, the network can learn a certain visual-signal understanding capability in advance, and such pre-training is efficient in terms of both time and computing-resource overhead. In addition, the designed complementary masks make the visible regions corresponding to different knowledge complementary on the image, so that the reconstruction of multiple kinds of knowledge can be carried out simultaneously in a single forward pass of the network, instead of requiring multiple forward passes for the different kinds of knowledge. No information leakage occurs between different knowledge during that single pass, because the complementary masks make the unoccluded regions corresponding to different knowledge mutually independent and complementary, and the corresponding complementary masked convolution encodes the different regions independently, so that parallel multi-knowledge reconstruction without information leakage is achieved.
Step S130, apply the pre-trained visual learner to the downstream task.
Furthermore, the pre-trained visual learner can be applied to downstream tasks, whose processing is improved by the representations it has learned.
Specifically, referring to fig. 3, when applying the learner to a downstream task, the PS-Decoder and the MoR module of the pre-trained visual learner are discarded, the complementary masked convolutions of the first and second stages are converted into normal convolutions, and the self-attention of the third stage is switched to a local or global mechanism according to the downstream task. After the third stage, an additional downsampling step reduces the 1/16-resolution features to 1/32-resolution features. These are then sent, together with the 1/4, 1/8 and 1/16 features produced by the three stages, to a detection network or a segmentation network for training on the downstream task.
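A minimal sketch of such a downstream backbone is given below; the stage modules are illustrative stand-ins with the right strides for the pre-trained stages, and the extra stride-2 convolution for the 1/32 features is an assumed implementation detail.

```python
import torch
import torch.nn as nn

class DownstreamBackbone(nn.Module):
    """Wraps the pre-trained stages as a 4-level feature pyramid (1/4, 1/8, 1/16, 1/32)."""
    def __init__(self, stage1: nn.Module, stage2: nn.Module, stage3: nn.Module, dim: int = 512):
        super().__init__()
        self.stage1, self.stage2, self.stage3 = stage1, stage2, stage3
        self.down32 = nn.Conv2d(dim, dim, kernel_size=2, stride=2)   # extra 1/16 -> 1/32 step

    def forward(self, image):
        c4 = self.stage1(image)      # 1/4 resolution (masked convs now run as plain convs)
        c8 = self.stage2(c4)         # 1/8 resolution
        c16 = self.stage3(c8)        # 1/16 resolution (Transformer stage, global/local attention)
        c32 = self.down32(c16)       # 1/32 resolution
        return [c4, c8, c16, c32]    # fed to the detection or segmentation head

# Illustrative stand-ins with the right strides; the real stages come from pre-training.
stage1 = nn.Sequential(nn.Conv2d(3, 256, 4, stride=4), nn.GELU())
stage2 = nn.Sequential(nn.Conv2d(256, 384, 2, stride=2), nn.GELU())
stage3 = nn.Sequential(nn.Conv2d(384, 512, 2, stride=2), nn.GELU())
feats = DownstreamBackbone(stage1, stage2, stage3)(torch.randn(1, 3, 224, 224))
print([f.shape[-1] for f in feats])   # [56, 28, 14, 7]
```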
It should be noted that the above embodiments may be appropriately changed or modified by those skilled in the art without departing from the spirit and scope of the present invention. For example, in the pre-training stage, the various kinds of knowledge can also be incorporated through direct supervision rather than mask reconstruction. The complementary masked convolution can process multiple kinds of knowledge at once, or it can be decomposed, i.e. the different kinds of knowledge can be learned separately over multiple forward passes. The decoder may be fully shared or fully independent between different knowledge, in addition to being partially shared. The scales at which the different stages process the image may also differ from the above embodiment. Furthermore, the knowledge is not limited to the above four kinds; more kinds, or different kinds, can be selected according to actual needs.
Furthermore, experiments show that the method addresses the slow convergence and high computing-resource overhead of pre-training, and has the following effects compared with the prior art:
1) The pre-training goal of existing masked autoencoders (MAE) is to reconstruct the pixel values of the regions occluded by the random mask, so the network can only learn low-level pixel information. By adding multiple reconstruction tasks, the invention injects semantic knowledge generated by a variety of different models into the network learning process, so that the network learns more diverse information. Referring to tables 1-3, the Fast MoR-ConvMAE of the present invention achieves the best metrics and the fastest convergence on each task through the fusion of multiple kinds of knowledge.
2) Existing masking techniques simply divide an image into an occluded part and an unoccluded part, recover the pixels of the occluded part from the information of the unoccluded part, and use masked convolution to encode only the unoccluded information. In the invention, the complementary masks map different knowledge-reconstruction targets to complementary unoccluded image regions, and the complementary masked convolution independently encodes the unoccluded regions corresponding to different knowledge within a single forward pass of the network. Referring to tables 1-3, the Fast MoR-ConvMAE of the present invention achieves the best metrics and the fastest convergence on each task through the efficient complementary masking mechanism and complementary masked convolution used in pre-training.
TABLE 1 Pre-training Performance
(Table 1 is provided as an image in the original publication.)
As can be seen from table 1, compared with other mask-based training methods, the Fast MoR-ConvMAE of the present invention, by adopting multi-knowledge fusion, the complementary masking mechanism and complementary masked convolution, achieves better model fine-tuning accuracy than ConvMAE, which is 20 times slower, with only 200 pre-training epochs and 200 GPU hours.
TABLE 2 Performance on detection and instance segmentation tasks
(Table 2 is provided as an image in the original publication.)
Table 2 shows the performance of the pre-trained backbone network on detection and instance segmentation using the Mask R-CNN method. AP box and AP mask denote the detection and instance-segmentation accuracy, respectively. It can be seen that Fast MoR-ConvMAE achieves the best results on downstream object detection and instance segmentation.
TABLE 3 Performance applied to semantic segmentation task
Method               Reconstruction target     Pre-training epochs    mIoU
GreenMIM             RGB pixels                800                    -
HiViT                RGB pixels                800                    51.2
MixMIM               RGB pixels                600                    50.3
ConvMAE              RGB pixels                1600                   51.7
Fast MoR-ConvMAE     Multi-knowledge fusion    200
Table 3 compares the pre-trained backbone networks on semantic segmentation using the UperNet method. Compared with other methods that use mask-based training, Fast MoR-ConvMAE obtains better performance.
In summary, the technical effects of the present invention are mainly reflected in the following aspects:
1) Efficiency of pre-training. Conventional training uses only pixel knowledge and consumes a large amount of pre-training time and computing resources; the multi-knowledge fusion, complementary masks and complementary masked convolution proposed by the invention greatly improve both the efficiency and the final performance of pre-training.
2) Adaptability to a wide range of downstream tasks. Since downstream tasks arise in a variety of scenarios, the pre-training of the invention fuses diverse knowledge into the network, thereby obtaining better results over a wider range of downstream tasks.
3) Flexibility of deployment. The invention significantly reduces pre-training time, so the model can be made larger; under the current trend in which pre-training large models has become an artificial-intelligence paradigm, a larger model means better representations and better downstream performance, which is of great significance for the deployment of intelligent systems.
The present invention may be a system, method and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present invention may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk, C++ or Python, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present invention are implemented by personalizing an electronic circuit, such as a programmable logic circuit, a field-programmable gate array (FPGA), or a programmable logic array (PLA), with state information of computer-readable program instructions, which can execute the computer-readable program instructions.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, by software, and by a combination of software and hardware are equivalent.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or technical improvements to the market, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the invention is defined by the appended claims.

Claims (10)

1. A visual learning method based on multi-knowledge fusion comprises the following steps:
constructing a visual learning device, wherein the visual learning device comprises a plurality of convolution modules, a Transformer module, a decoder and a multi-knowledge fusion module; the input images of the convolution modules have different resolutions, each input image corresponds to one of a plurality of kinds of knowledge and carries a complementary masked region, and the convolution modules perform mutually independent feature extraction on the different unmasked regions corresponding to the plurality of kinds of knowledge; the Transformer module extracts mutually independent global features from the unmasked features; the decoder performs image reconstruction based on the unmasked features and the masks;
and pre-training the visual learning device with a set loss criterion as the target, wherein during pre-training the plurality of kinds of knowledge learned by the multi-knowledge fusion module are fed to the decoder as supervision signals to guide the training process.
2. The method of claim 1, wherein, for the input images of the plurality of convolution modules, the masked region for each kind of knowledge is determined according to a set complementary mask, which makes the visible regions corresponding to different knowledge complementary on the image.
3. The method of claim 2, wherein the plurality of kinds of knowledge comprises four kinds, the plurality of convolution modules comprises a first convolution module and a second convolution module, the resolution of the input image of the first convolution module is 1/4 of the original image, and the input image of the second convolution module is 1/8 of the original image; and the output of the second convolution module is down-sampled to 1/16 of the resolution of the original image, the different features corresponding to the four kinds of knowledge are separated and flattened, masked with the corresponding complementary masks, and the unmasked features are sent to the Transformer module to extract mutually independent global features for the four kinds of knowledge, wherein the original image is the image input to the visual learner.
4. The method of claim 3, wherein the number of layers of the first convolution module and the second convolution module is set to 2, the convolution kernel size is set to 5 x 5, and the number of layers of the Transformer module is set to 11.
5. The method of claim 1, wherein the shallow Transformer layers of the decoder are shared across the features of the plurality of kinds of knowledge, and the deep Transformer layers of the decoder are mutually independent for different knowledge.
6. The method of claim 1, wherein the plurality of kinds of knowledge comprises language-vision multi-modal knowledge, discriminative knowledge between different pictures, historical momentum knowledge, and RGB pixel knowledge.
7. The method of claim 6, wherein the historical momentum knowledge is learned using a historical momentum encoder, the discriminative knowledge between different pictures is learned using a DINO model, and the language-vision multi-modal knowledge is learned using CLIP.
8. A method of applying a visual learner, comprising:
extracting features of different scales for an input target image by using a plurality of convolution modules and a trained Transformer module obtained according to the method of any one of claims 1 to 6, wherein the Transformer module adopts a global or local attention mechanism to enhance the features;
and after down-sampling the features output by the Transformer module, sending them, together with the extracted features of different scales, to a detection network or a segmentation network to obtain a corresponding detection result or segmentation result.
9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 8.
10. A computer device comprising a memory and a processor, on which memory a computer program is stored which is executable on the processor, characterized in that the processor realizes the steps of the method according to any one of claims 1 to 8 when executing the computer program.
CN202210682147.8A 2022-06-16 2022-06-16 Visual learning method based on multi-knowledge fusion Pending CN115115918A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210682147.8A CN115115918A (en) 2022-06-16 2022-06-16 Visual learning method based on multi-knowledge fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210682147.8A CN115115918A (en) 2022-06-16 2022-06-16 Visual learning method based on multi-knowledge fusion

Publications (1)

Publication Number Publication Date
CN115115918A (en) 2022-09-27

Family

ID=83328836

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210682147.8A Pending CN115115918A (en) 2022-06-16 2022-06-16 Visual learning method based on multi-knowledge fusion

Country Status (1)

Country Link
CN (1) CN115115918A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115661535A (en) * 2022-10-31 2023-01-31 中国矿业大学 Target removal background recovery method and device and electronic equipment
CN115661535B (en) * 2022-10-31 2023-11-03 中国矿业大学 Target background removal recovery method and device and electronic equipment
CN116092577A (en) * 2023-01-09 2023-05-09 中国海洋大学 Protein function prediction method based on multisource heterogeneous information aggregation
CN116092577B (en) * 2023-01-09 2024-01-05 中国海洋大学 Protein function prediction method based on multisource heterogeneous information aggregation
CN116310667A (en) * 2023-05-15 2023-06-23 鹏城实验室 Self-supervision visual characterization learning method combining contrast loss and reconstruction loss
CN116310667B (en) * 2023-05-15 2023-08-22 鹏城实验室 Self-supervision visual characterization learning method combining contrast loss and reconstruction loss

Similar Documents

Publication Publication Date Title
US10671855B2 (en) Video object segmentation by reference-guided mask propagation
CN115115918A (en) Visual learning method based on multi-knowledge fusion
US11270158B2 (en) Instance segmentation methods and apparatuses, electronic devices, programs, and media
CN110675329B (en) Image deblurring method based on visual semantic guidance
Saxena et al. Monocular depth estimation using diffusion models
Makarov et al. Semi-dense depth interpolation using deep convolutional neural networks
CN113592913B (en) Method for eliminating uncertainty of self-supervision three-dimensional reconstruction
US11641446B2 (en) Method for video frame interpolation, and electronic device
KR102628115B1 (en) Image processing method, device, storage medium, and electronic device
KR20220153667A (en) Feature extraction methods, devices, electronic devices, storage media and computer programs
CN116363459A (en) Target detection method, model training method, device, electronic equipment and medium
Lu et al. Pyramid frequency network with spatial attention residual refinement module for monocular depth estimation
CN116980541A (en) Video editing method, device, electronic equipment and storage medium
CN116363429A (en) Training method of image recognition model, image recognition method, device and equipment
Liang et al. An Interpretable Image Denoising Framework Via Dual Disentangled Representation Learning
US11928855B2 (en) Method, device, and computer program product for video processing
CN114549322B (en) Image super-resolution method and device based on self-adaption in unsupervised field
CN114841870A (en) Image processing method, related device and system
CN115115972A (en) Video processing method, video processing apparatus, computer device, medium, and program product
CN114781499A (en) Method for constructing ViT model-based intensive prediction task adapter
Lao et al. Divided attention: Unsupervised multi-object discovery with contextually separated slots
Qiu et al. Learning mean progressive scattering using binomial truncated loss for image dehazing
CN117576264B (en) Image generation method, device, equipment and medium
CN116228895B (en) Video generation method, deep learning model training method, device and equipment
CN115661238B (en) Method and device for generating travelable region, electronic equipment and computer readable medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination