CN116109966B

CN116109966B - Remote sensing scene-oriented video large model construction method

Info

Publication number: CN116109966B
Application number: CN202211635612.9A
Authority: CN
Inventors: 孙显; 付琨; 于泓峰; 姚方龙; 卢宛萱; 邓楚博; 杨和明
Original assignee: Aerospace Information Research Institute of CAS
Current assignee: Aerospace Information Research Institute of CAS
Priority date: 2022-12-19
Filing date: 2022-12-19
Publication date: 2023-06-27
Anticipated expiration: 2042-12-19
Also published as: CN116109966A

Abstract

The application relates to the technical field of computer model construction, in particular to a remote sensing scene-oriented video large model construction method. The method comprises the following steps: acquiring a remote sensing image set A and a target video set B, wherein A= { a ₁ ,a ₂ ,…,a _N }，a _n The method comprises the steps that N is the number of the remote sensing images in A, wherein the value range of N is 1 to N, and N is the number of the remote sensing images in A; b= { B ₁ ,b ₂ ,…,b _M }，b _m For the mth target video in B, the value range of M is 1 to M, M is the number of the target videos in B, B _m ＝(b _m,1 ,b _m,2 ,…,b _m,Q )，b _m,q B is _m A q-th frame target image; training a neural network model using a and B, the neural network model comprising a first neural network sub-model and a second neural network sub-model. The invention constructs the remote sensing scene-oriented video large model with strong feature extraction capability and feature rule discovery capability.

Description

Remote sensing scene-oriented video large model construction method

Technical Field

The invention relates to the technical field of computer model construction, in particular to a remote sensing scene-oriented video large model construction method.

Background

Because the remote sensing video has double characteristics in time and space, and the remote sensing scene itself has a complex texture background, a model required by a video interpretation task in the remote sensing scene needs to have stronger characteristic extraction capability, and meanwhile, the spatial characteristic rule and the temporal characteristic rule of the video need to be explored. How to construct a large video model with strong feature extraction capability and feature rule discovery capability for remote sensing scenes is a problem to be solved urgently.

Disclosure of Invention

The invention aims to provide a method for constructing a large video model oriented to a remote sensing scene, which constructs the large video model oriented to the remote sensing scene with strong feature extraction capability and feature rule discovery capability.

According to the invention, a method for constructing a video large model for a remote sensing scene is provided, which comprises the following steps:

acquiring a remote sensing image set A and a target video set B, wherein A= { a ₁ ,a ₂ ,…,a _N }，a _n The method comprises the steps that N is the number of the remote sensing images in A, wherein the value range of N is 1 to N, and N is the number of the remote sensing images in A; b= { B ₁ ,b ₂ ,…,b _M }，b _m For the mth target video in B, the value range of M is 1 to M, M is the number of the target videos in B, B _m ＝(b _m,1 ,b _m,2 ,…,b _m,Q )，b _m,q B is _m In the Q-th frame of target image, the value range of Q is 1 to Q, Q is the number of target images in the target video, b _m,1 、b _m,2 、…、b _m,Q Q frames of target images are continuously shot; the target video in B is a guardThe method comprises the steps that a video shot by satellite-mounted remote sensing equipment or a video shot by unmanned aerial vehicle-mounted remote sensing equipment is used, and a remote sensing image is an image shot by satellite-mounted remote sensing equipment.

Training a neural network model using a and B, the neural network model comprising a first neural network sub-model and a second neural network sub-model, the training comprising:

traversing A, pair a _n Performing block processing, and randomly performing block processing on the a _n The k x C blocks in (a) are subjected to mask processing; c is a pair a _n The number of blocks obtained by partitioning is k, which is a preset mask proportion; a processed by mask _n A first neural network sub-model is trained, the first neural network sub-model being a 2D swin-transformer structure, the first neural network sub-model comprising a first encoder and a first decoder.

Traversal B, pair B _m I of [ i ] _m ,i _m +L]Masking the frame image, i _m +L≤Q，i _m Not less than 1, L is the number of preset mask frames, i _m B is _m A start mask frame of (a); b processed by mask _m Training a second neural network sub-model, the second sub-model being a 3D swin-transformer structure, the second neural network sub-model comprising a second encoder and a second decoder; the training of the first neural network sub-model is performed simultaneously with the training of the second neural network sub-model, and the second encoder and the first encoder have weight sharing in the training process.

Compared with the prior art, the method provided by the invention has obvious beneficial effects, can achieve quite technical progress and practicality by virtue of the technical scheme, has wide industrial utilization value, and has at least the following beneficial effects:

the video large model facing the remote sensing scene comprises two branches, wherein the first branch corresponds to a first neural network sub-model, and a training sample corresponding to the branch is a remote sensing image set; the second branch corresponds to a second neural network sub-model, a training sample corresponding to the branch is a target video set, and the target video set comprises remote sensing videos (namely videos shot by satellite carried remote sensing equipment) and unmanned aerial vehicle videos (videos shot by unmanned aerial vehicle carried remote sensing equipment), and the number of remote sensing videos which can be used as training samples is small because the remote sensing videos are not easy to acquire; according to the invention, the number of video samples is expanded by introducing unmanned aerial vehicle video, and the expanded video samples are utilized to train the second neural network sub-model, so that the capability of feature extraction and rule mining of the second neural network sub-model is improved, the generalization capability of the trained second neural network sub-model is also improved, and the method can be applied to downstream tasks of different partial space-time predictions.

In addition, a masking strategy adopted by the invention for the remote sensing image sample corresponding to the first neural network sub-model is a random masking of a part of pixel points, and the capability of the first neural network model for extracting the space information of the remote sensing image is improved through the random masking strategy; masking a certain frame in the target video as a starting frame by using a masking strategy adopted by a target video sample corresponding to the second neural network sub-model, masking frames with fixed length after the starting frame, increasing the difficulty of video prediction by using the masking strategy, and improving the capability of the second neural network sub-model for extracting the space-time continuous information of objects in the video; according to the invention, the training process of the first neural network sub-model and the training process of the second neural network sub-model are performed simultaneously, so that the training process of the video large model is accelerated, and weight sharing exists between the first encoder in the first neural network sub-model and the second encoder in the second neural network sub-model in the training process, so that the second neural network sub-model can acquire the capability of the first neural network sub-model to extract the spatial information of the remote sensing image, and further the capability of the second neural network sub-model to extract the spatial information of the remote sensing image is improved, thereby being beneficial to accelerating the training process of the second neural network sub-model.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a flowchart of a method for constructing a large video model for a remote sensing scene according to an embodiment of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.

According to the invention, a method for constructing a large video model for a remote sensing scene is provided, as shown in fig. 1, and comprises the following steps:

s100, acquiring a remote sensing image set A and a target video set B, wherein A= { a ₁ ,a ₂ ,…,a _N }，a _n The method comprises the steps that N is the number of the remote sensing images in A, wherein the value range of N is 1 to N, and N is the number of the remote sensing images in A; b= { B ₁ ,b ₂ ,…,b _M }，b _m For the mth target video in B, the value range of M is 1 to M, M is the number of the target videos in B, B _m ＝(b _m,1 ,b _m,2 ,…,b _m,Q )，b _m,q B is _m In the Q-th frame of target image, the value range of Q is 1 to Q, Q is the number of target images in the target video, b _m,1 、b _m,2 、…、b _m,Q Q frames of target images are continuously shot; and B, the target video is a video shot by a satellite-mounted remote sensing device or a video shot by an unmanned aerial vehicle-mounted remote sensing device, and the remote sensing image is an image shot by the satellite-mounted remote sensing device.

The video large model facing the remote sensing scene comprises two branches, wherein the first branch corresponds to a first neural network sub-model, and a training sample corresponding to the branch is a remote sensing image set; the second branch corresponds to a second neural network sub-model, a training sample corresponding to the branch is a target video set, and the target video set comprises remote sensing videos (namely videos shot by satellite-mounted remote sensing equipment) and unmanned aerial vehicle videos (videos shot by unmanned aerial vehicle-mounted remote sensing equipment).

Preferably, the number of videos shot by the remote sensing equipment carried by the unmanned aerial vehicle in the B is larger than the number of videos shot by the remote sensing equipment carried by the satellite in the B. According to the invention, the video shot by the remote sensing equipment carried by the unmanned aerial vehicle is taken as one of target videos, so that the number of the target videos can be expanded, and the problem that the number of the target videos is insufficient to meet the subsequent training requirement on the neural network model due to the fact that the remote sensing videos are not easy to acquire is solved; the video shot by the unmanned aerial vehicle carried remote sensing equipment and the video shot by the satellite carried remote sensing equipment are shot at the angle of the aerial carried remote sensing equipment like a nodding, so that the effect of training the neural network model can be achieved by taking the video shot by the unmanned aerial vehicle carried remote sensing equipment as a target video for subsequent training of the neural network model.

Preferably, both N and M are on the order of millions. The number set of the training samples is millions, the trained video large model facing the remote sensing scene has strong feature extraction capability, rule mining capability and generalization capability, and the model parameters of the trained video large model facing the remote sensing scene are used as initial model parameters of models corresponding to different downstream tasks, so that the training process of the models corresponding to the downstream tasks can be accelerated, and the accuracy of the models corresponding to the downstream tasks can be improved; the downstream tasks may be a video prediction task, a target detection task, a single target tracking task, a video segmentation task, and the like.

S200, training a neural network model by using A and B, wherein the neural network model comprises a first neural network sub-model and a second neural network sub-model, and the training process comprises the following steps:

s210, traversing A, for a _n Performing block processing, and randomly performing block processing on the a _n The k x C blocks in (a) are subjected to mask processing; c is a pair a _n The number of blocks obtained by partitioning is k, which is a preset mask proportion; a processed by mask _n A first neural network sub-model is trained, the first neural network sub-model being a 2D swin-transformer structure, the first neural network sub-model comprising a first encoder and a first decoder.

The structure of the 2D switch-transducer in the present invention is the prior art, and will not be described here. The first encoder in the invention is used for extracting the a after mask processing _n Is operative to predict the corresponding original pixel values of the mask block based on the output of the first encoder.

According to the invention, the masking strategy adopted for the remote sensing image sample corresponding to the first neural network sub-model is a random masking of a part of pixel points, and the capability of the first neural network model for extracting the space information of the remote sensing image is improved through the random masking strategy. Preferably, k is more than or equal to 40% and less than or equal to 60%. A small-scale experiment shows that when the value of k is set within the range of 40% -60%, the first neural network sub-model can not only better extract the spatial information of the remote sensing image, but also give consideration to the training time of the first neural network sub-model. Alternatively, k=50%.

As an example, a _n For an image with a resolution of 224 x 224, for a _n Performing block processing to obtain 56×56 blocks, wherein each block has 4*4 =16 pixels; randomly extracting half of 56 x 56 blocks, masking the extracted blocks to obtain masked a _n 。

S220 traversing B, pair B _m I of [ i ] _m ,i _m +L]Masking the frame image, i _m +L≤Q，i _m Not less than 1, L is the number of preset mask frames, i _m B is _m A start mask frame of (a); b processed by mask _m Training a second neural network sub-model, the second sub-model being a 3D swin-transformer structure, the second neural network sub-model comprising a second encoder and a second decoder; the training of the first neural network sub-model is performed simultaneously with the training of the second neural network sub-model, and the second encoder and the first encoder have weight sharing in the training process.

The invention is thatThe greatest difference between the 3D swin-transformer and the 2D swin-transformer is that the 3D is changed from 2D to 3D, and the structure of the 3D swin-transformer is also the prior art and will not be described herein. The second encoder in the invention is used for extracting b after mask processing _m Is characterized in that the second decoder is operative to predict the masked target image based on an output of the second encoder.

The training process of the first neural network sub-model and the training process of the second neural network sub-model are performed simultaneously, the training process of the video large model is quickened, and weight sharing exists between the first encoder in the first neural network sub-model and the second encoder in the second neural network sub-model in the training process, so that weights corresponding to modules with the same structure in the second encoder and the first encoder are the same, for example, the weights corresponding to the attention (attention) module in the second encoder and the attention (attention) module in the first encoder are the same. Therefore, the second neural network sub-model can acquire the capacity of the first neural network sub-model to extract the spatial information of the remote sensing image, further the capacity of the second neural network sub-model to extract the spatial information of the remote sensing image is improved, and the training process of the second neural network sub-model is facilitated to be accelerated.

The masking strategy adopted by the target video sample corresponding to the second neural network sub-model is to mask a certain frame in the target video as a starting frame, and frames with fixed length after the starting frame are masked, so that the difficulty of video prediction is increased through the masking strategy, and the capability of the second neural network sub-model for extracting the space-time continuous information of objects in the video is improved.

Preferably, Q=16, 5.ltoreq.L.ltoreq.9. A small-scale experiment shows that when Q=16, the value of L is set within the range of 5-9, the second neural network submodel can better extract the space-time continuous information of objects in the video, and the training time of the second neural network submodel can be considered. Alternatively, l=7.

The invention is directed to b _m The random continuous frame mask strategy is adopted, that is, the initial mask frames corresponding to different target videosWhich may be different or the same, but the number of frames masked is equal. As an example, b _m The method comprises the steps of continuously shooting 16 frames of target images, wherein each frame is 224 x 224, the number of mask frames is preset to be 7, randomly taking a starting point from the 16 frames of target images, masking off the starting point and 7 frames of subsequent images completely, and obtaining b after masking _m . It should be appreciated that the starting point is chosen to ensure that 7 frames or more than 7 frames of images follow the starting point.

According to the invention, the trained neural network model is the remote sensing scene-oriented video big model, and the remote sensing scene-oriented video big model has strong feature extraction capability and feature rule mining capability.

As a specific implementation manner, the remote sensing image set a comprises more than 109 ten thousand remote sensing images, the target video set B comprises more than 101 ten thousand target videos, and more than half of the target videos in the set B are videos shot by unmanned aerial vehicle carried remote sensing equipment; performing blocking processing on the remote sensing image, and performing mask processing on half of blocks in the remote sensing image at random; setting each target video to comprise continuous 16-frame target images, randomly selecting a starting mask frame in the target video, and masking the starting mask frame and 7 subsequent frame target images; training a first neural network sub-model in the neural network model by using the remote sensing image after mask processing, training a second neural network sub-model in the neural network model by using the target video after mask processing, and carrying out weight sharing on an encoder in the first neural network sub-model and an encoder in the second neural network sub-model in the training process until the training is finished.

Experiments show that compared with random initialization model parameters, the model parameters of the trained neural network model are used as initial model parameters of models corresponding to different downstream tasks, and the model corresponding to the downstream tasks with the same training duration achieves higher accuracy: when the downstream task is a target detection task, the corresponding average precision average (mAP) index rises from 0.3629 to 0.3718; when the downstream task is a video prediction task, the corresponding Structural Similarity (SSIM) index rises from 0.7018 to 0.7152. Therefore, the video large model for the remote sensing scene constructed by the method is suitable for different downstream tasks, has strong generalization capability, and has strong corresponding feature extraction capability and feature rule mining capability, and can improve the precision of the model corresponding to different downstream tasks.

While certain specific embodiments of the invention have been described in detail by way of example, it will be appreciated by those skilled in the art that the above examples are for illustration only and are not intended to limit the scope of the invention. Those skilled in the art will also appreciate that many modifications may be made to the embodiments without departing from the scope and spirit of the invention. The scope of the invention is defined by the appended claims.

Claims

1. The method for constructing the large video model for the remote sensing scene is characterized by comprising the following steps of:

acquiring a remote sensing image set A and a target video set B, wherein A= { a ₁ ,a ₂ ,…,a _N }，a _n The method comprises the steps that N is the number of the remote sensing images in A, wherein the value range of N is 1 to N, and N is the number of the remote sensing images in A; b= { B ₁ ,b ₂ ,…,b _M }，b _m For the mth target video in B, the value range of M is 1 to M, M is the number of the target videos in B, B _m ＝(b _m,1 ,b _m,2 ,…,b _m,Q )，b _m,q B is _m In the Q-th frame of target image, the value range of Q is 1 to Q, Q is the number of target images in the target video, b _m,1 、b _m,2 、…、b _m,Q Q frames of target images are continuously shot; the target video in the step B is a video shot by a satellite-mounted remote sensing device or a video shot by an unmanned aerial vehicle-mounted remote sensing device, and the remote sensing image is an image shot by the satellite-mounted remote sensing device;

traversing A, pair a _n Performing block processing, and randomly performing block processing on the a _n The k x C blocks in (a) are subjected to mask processing; c is a pair a _n The number of blocks obtained by partitioning is k, which is a preset mask proportion; a processed by mask _n Training a first neural network sub-model, the first neural network sub-model being of a 2D swin-transformer structure, the first neural network sub-model comprising a first encoder and a first decoder;

traversal B, pair B _m I of [ i ] _m ,i _m +L]Masking the frame image, i _m +L≤Q，i _m Not less than 1, L is the number of preset mask frames, i _m B is _m A start mask frame of (a); b processed by mask _m Training a second neural network sub-model, the second neural network sub-model being a 3D swin-transformer structure, the second neural network sub-model comprising a second encoder and a second decoder; the training of the first neural network sub-model is performed simultaneously with the training of the second neural network sub-model, and the second encoder and the first encoder have weight sharing in the training process.

2. The method for constructing a large video model for a remote sensing scene according to claim 1, wherein k is more than or equal to 40% and less than or equal to 60%.

3. The method for constructing a large video model for a remote sensing scene according to claim 2, wherein k=50%.

4. The method for constructing a large video model for a remote sensing scene according to claim 1, wherein q=16, and 5.ltoreq.l.ltoreq.9.

5. The method for constructing a large video model for a remote sensing scene as claimed in claim 4, wherein l=7.

6. The method for constructing the large video model for the remote sensing scene according to claim 1, wherein the number of videos shot by the remote sensing equipment carried by the unmanned aerial vehicle in the B is larger than the number of videos shot by the remote sensing equipment carried by the satellite in the B.

7. The method for constructing a large video model for a remote sensing scene as claimed in claim 1, wherein N and M are each in the order of millions.