CN116109966A - A video large-scale model construction method for remote sensing scenes - Google Patents

Info

Publication number
CN116109966A
Authority
CN
China
Prior art keywords: model, remote sensing, neural network, video, network sub-model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211635612.9A
Other languages: Chinese (zh)
Other versions: CN116109966B (en)
Inventor
孙显
付琨
于泓峰
姚方龙
卢宛萱
邓楚博
杨和明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Aerospace Information Research Institute of CAS
Original Assignee
Aerospace Information Research Institute of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Aerospace Information Research Institute of CAS
Priority to CN202211635612.9A
Publication of CN116109966A
Application granted
Publication of CN116109966B
Legal status: Active
Anticipated expiration

Classifications

    • G06V20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06T17/00: Three-dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T7/11: Region-based segmentation
    • G06V10/82: Image or video recognition or understanding using neural networks
    • G06T2207/10032: Satellite or aerial image; remote sensing
    • Y02T10/40: Engine management systems


Abstract

The application relates to the technical field of computer model construction, and in particular to a method for constructing a large video model for remote sensing scenes. The method comprises the following steps: acquiring a remote sensing image set A and a target video set B, where A = {a_1, a_2, …, a_N}, a_n is the n-th remote sensing image in A, n ranges from 1 to N, and N is the number of remote sensing images in A; B = {b_1, b_2, …, b_M}, b_m is the m-th target video in B, m ranges from 1 to M, M is the number of target videos in B, b_m = (b_{m,1}, b_{m,2}, …, b_{m,Q}), and b_{m,q} is the q-th frame of target image in b_m; and training a neural network model using A and B, the neural network model comprising a first neural network sub-model and a second neural network sub-model. The invention constructs a large video model for remote sensing scenes with strong feature extraction capability and feature-pattern discovery capability.

Description

A Method for Constructing a Large Video Model for Remote Sensing Scenes

Technical Field

The present invention relates to the technical field of computer model construction, and in particular to a method for constructing a large video model for remote sensing scenes.

Background

Because remote sensing video has dual characteristics in both time and space, and remote sensing scenes themselves have complex textured backgrounds, a model for video interpretation tasks in remote sensing scenes must have strong feature extraction capability and must also discover the spatial and temporal feature patterns of the video. How to construct a large video model for remote sensing scenes with strong feature extraction capability and feature-pattern discovery capability is a problem urgently awaiting a solution.

Summary of the Invention

The object of the present invention is to provide a method for constructing a large video model for remote sensing scenes, which constructs a large video model for remote sensing scenes with strong feature extraction capability and feature-pattern discovery capability.

According to the present invention, a method for constructing a large video model for remote sensing scenes is provided, comprising the following steps:

Acquire a remote sensing image set A and a target video set B, where A = {a_1, a_2, …, a_N}, a_n is the n-th remote sensing image in A, n ranges from 1 to N, and N is the number of remote sensing images in A; B = {b_1, b_2, …, b_M}, b_m is the m-th target video in B, m ranges from 1 to M, and M is the number of target videos in B; b_m = (b_{m,1}, b_{m,2}, …, b_{m,Q}), where b_{m,q} is the q-th frame of target image in b_m, q ranges from 1 to Q, Q is the number of target images in a target video, and b_{m,1}, b_{m,2}, …, b_{m,Q} are Q consecutively captured frames of target images. The target videos in B are videos captured by satellite-borne remote sensing equipment or by remote sensing equipment carried on unmanned aerial vehicles (UAVs); the remote sensing images are images captured by satellite-borne remote sensing equipment.

Train a neural network model using A and B, the neural network model comprising a first neural network sub-model and a second neural network sub-model. The training process comprises:

Traverse A, partition a_n into blocks, and randomly mask k*C of the blocks, where C is the number of blocks obtained by partitioning a_n and k is a preset mask ratio; train the first neural network sub-model with the masked a_n. The first neural network sub-model is a 2D swin-transformer structure and comprises a first encoder and a first decoder.

Traverse B and mask frames [i_m, i_m+L] of b_m, where i_m+L ≤ Q, i_m ≥ 1, L is a preset number of mask frames, and i_m is the starting mask frame of b_m; train the second neural network sub-model with the masked b_m. The second neural network sub-model is a 3D swin-transformer structure and comprises a second encoder and a second decoder. The first neural network sub-model and the second neural network sub-model are trained simultaneously, and the second encoder shares weights with the first encoder during training.

Compared with the prior art, the present invention has clear beneficial effects. With the above technical solution, the method provided by the present invention achieves considerable technical progress and practicality, has broad industrial application value, and offers at least the following beneficial effects:

The large video model for remote sensing scenes of the present invention comprises two branches. The first branch corresponds to the first neural network sub-model, and its training samples are the remote sensing image set; the second branch corresponds to the second neural network sub-model, and its training samples are the target video set. The target video set of the present invention includes not only remote sensing videos (i.e., videos captured by satellite-borne remote sensing equipment) but also UAV videos (videos captured by remote sensing equipment carried on UAVs). Because remote sensing videos are hard to obtain, the number of remote sensing videos available as training samples is small; the present invention expands the number of video samples by introducing UAV videos. Training the second neural network sub-model with the expanded video samples improves its feature extraction and pattern discovery capabilities and raises the generalization ability of the trained second neural network sub-model, so it can be applied to different spatiotemporally oriented prediction downstream tasks.

Moreover, the masking strategy the present invention applies to the remote sensing image samples corresponding to the first neural network sub-model is to randomly mask a portion of the pixels; this random masking strategy improves the first neural network sub-model's ability to extract the spatial information of remote sensing images. The masking strategy applied to the target video samples corresponding to the second neural network sub-model is to take a certain frame of the target video as a starting frame and mask a fixed-length run of frames from that starting frame onward; this strategy increases the difficulty of video prediction and improves the second neural network sub-model's ability to extract the spatiotemporally continuous information of objects in the video. The present invention trains the first and second neural network sub-models simultaneously, which speeds up the training of the large video model, and during training weights are shared between the first encoder of the first sub-model and the second encoder of the second sub-model. The second neural network sub-model thereby acquires the first sub-model's ability to extract the spatial information of remote sensing images, which improves its own spatial feature extraction and helps accelerate its training.

Brief Description of the Drawings

To explain the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them without creative effort.

FIG. 1 is a flowchart of the method for constructing a large video model for remote sensing scenes provided by an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort fall within the protection scope of the present invention.

According to the present invention, a method for constructing a large video model for remote sensing scenes is provided; as shown in FIG. 1, it comprises the following steps:

S100: acquire a remote sensing image set A and a target video set B, where A = {a_1, a_2, …, a_N}, a_n is the n-th remote sensing image in A, n ranges from 1 to N, and N is the number of remote sensing images in A; B = {b_1, b_2, …, b_M}, b_m is the m-th target video in B, m ranges from 1 to M, and M is the number of target videos in B; b_m = (b_{m,1}, b_{m,2}, …, b_{m,Q}), where b_{m,q} is the q-th frame of target image in b_m, q ranges from 1 to Q, Q is the number of target images in a target video, and b_{m,1}, b_{m,2}, …, b_{m,Q} are Q consecutively captured frames of target images. The target videos in B are videos captured by satellite-borne remote sensing equipment or by remote sensing equipment carried on UAVs; the remote sensing images are images captured by satellite-borne remote sensing equipment.

The large video model for remote sensing scenes of the present invention comprises two branches. The first branch corresponds to the first neural network sub-model, and its training samples are the remote sensing image set; the second branch corresponds to the second neural network sub-model, and its training samples are the target video set. The target video set of the present invention includes not only remote sensing videos (i.e., videos captured by satellite-borne remote sensing equipment) but also UAV videos (videos captured by remote sensing equipment carried on UAVs).

Preferably, the number of videos in B captured by UAV-borne remote sensing equipment is greater than the number of videos in B captured by satellite-borne remote sensing equipment. The present invention treats videos captured by UAV-borne remote sensing equipment as one kind of target video, which expands the number of target videos and solves the problem that, because remote sensing videos are hard to obtain, the number of target videos would otherwise be insufficient for the subsequent training of the neural network model. Moreover, both UAV videos and satellite videos are captured by airborne remote sensing equipment from a roughly overhead viewpoint, so using UAV videos as target videos for the subsequent training also preserves the training effect of the neural network model.

Preferably, N and M are both on the order of millions. With training sample sets at the million scale, the trained large video model for remote sensing scenes has strong feature extraction, pattern discovery, and generalization capabilities. Using the trained model's parameters as the initial model parameters of the models for different downstream tasks can speed up the training of those models and improve their accuracy; such downstream tasks include video prediction, object detection, single-object tracking, and video segmentation.
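The patent gives no code for this parameter reuse, but the idea can be sketched in plain Python (the parameter names and scalar weights below are hypothetical stand-ins; real models would hold tensors):

```python
def init_from_pretrained(downstream, pretrained):
    """Copy pretrained weights into a downstream model's parameter dict
    wherever the parameter names match; other parameters keep their
    freshly initialized values (e.g. a new task-specific head)."""
    loaded = []
    for name in downstream:
        if name in pretrained:
            downstream[name] = pretrained[name]
            loaded.append(name)
    return loaded

# hypothetical parameter dicts standing in for real model state
pretrained = {"encoder.attn.w": 0.42, "decoder.out.w": 0.07}
detector = {"encoder.attn.w": 0.0, "head.cls.w": 0.0}  # fresh detection head
loaded = init_from_pretrained(detector, pretrained)
```

This mirrors the common practice of initializing a downstream model's backbone from pretrained weights while leaving task-specific layers randomly initialized.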

S200: train a neural network model using A and B, the neural network model comprising a first neural network sub-model and a second neural network sub-model. The training process comprises:

S210: traverse A, partition a_n into blocks, and randomly mask k*C of the blocks, where C is the number of blocks obtained by partitioning a_n and k is a preset mask ratio; train the first neural network sub-model with the masked a_n. The first neural network sub-model is a 2D swin-transformer structure and comprises a first encoder and a first decoder.

The structure of the 2D swin-transformer is prior art and is not repeated here. In the present invention, the first encoder extracts the features of the masked a_n, and the first decoder predicts the original pixel values of the masked blocks from the first encoder's output.

The masking strategy the present invention applies to the remote sensing image samples corresponding to the first neural network sub-model is to randomly mask a portion of the pixels; this random masking strategy improves the first neural network sub-model's ability to extract the spatial information of remote sensing images. Preferably, 40% ≤ k ≤ 60%. Small-scale experiments show that when k is set within 40% ≤ k ≤ 60%, the first neural network sub-model both extracts the spatial information of remote sensing images well and keeps its training time acceptable. Optionally, k = 50%.

As an embodiment, a_n is an image with a resolution of 224*224; partitioning a_n into blocks yields 56*56 blocks of 4*4 = 16 pixels each. Half of the 56*56 blocks are selected at random and masked out, yielding the masked a_n.
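The block partition and random masking described above can be illustrated with a minimal pure-Python sketch (a toy 8*8 image with 2*2 blocks stands in for the 224*224 image with 4*4 blocks; function and variable names are ours, not the patent's):

```python
import random

def mask_random_blocks(image, block, k, seed=None):
    """Partition a square image (a list of rows) into block*block tiles
    and zero out a random fraction k of them, as in the patent's k*C mask."""
    n = len(image) // block                        # tiles per side
    tiles = [(r, c) for r in range(n) for c in range(n)]
    rng = random.Random(seed)
    masked = set(rng.sample(tiles, int(k * len(tiles))))
    out = [row[:] for row in image]                # leave the input intact
    for r, c in masked:
        for y in range(r * block, (r + 1) * block):
            for x in range(c * block, (c + 1) * block):
                out[y][x] = 0                      # masked pixel
    return out, masked

# toy 8*8 image of ones, 2*2 blocks -> C = 16 tiles, k = 50% masks 8 of them
img = [[1] * 8 for _ in range(8)]
masked_img, masked_tiles = mask_random_blocks(img, block=2, k=0.5, seed=0)
```

In practice the masked positions would be replaced by a learned mask token rather than zeros, but the tile selection logic is the same.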

S220: traverse B and mask frames [i_m, i_m+L] of b_m, where i_m+L ≤ Q, i_m ≥ 1, L is a preset number of mask frames, and i_m is the starting mask frame of b_m; train the second neural network sub-model with the masked b_m. The second neural network sub-model is a 3D swin-transformer structure and comprises a second encoder and a second decoder. The first neural network sub-model and the second neural network sub-model are trained simultaneously, and the second encoder shares weights with the first encoder during training.

The main difference between the 3D swin-transformer and the 2D swin-transformer is the added dimension (from 2D to 3D); the structure of the 3D swin-transformer is also prior art and is not repeated here. In the present invention, the second encoder extracts the features of the masked b_m, and the second decoder predicts the masked target images from the second encoder's output.

The present invention trains the first and second neural network sub-models simultaneously, which speeds up the training of the large video model. During training, weights are shared between the first encoder of the first neural network sub-model and the second encoder of the second neural network sub-model, so that structurally identical modules in the two encoders carry the same weights; for example, the attention module in the second encoder has the same weights as the attention module in the first encoder. The second neural network sub-model thereby acquires the first sub-model's ability to extract the spatial information of remote sensing images, which improves its own spatial feature extraction and helps accelerate its training.
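As a rough illustration of the weight sharing (not the patent's implementation; in a real framework both encoders would reference the same parameter tensors), two toy encoders can hold the same parameter dictionary, so an update made through either branch is seen by both:

```python
class TiedEncoder:
    """Toy encoder whose parameters live in an externally owned dict,
    so two encoders constructed with the same dict share weights."""
    def __init__(self, params):
        self.params = params           # shared reference, not a copy

    def sgd_step(self, grads, lr=0.1):
        # an in-place gradient step updates the shared weights
        for name, g in grads.items():
            self.params[name] -= lr * g

shared = {"attention.w": 1.0}          # hypothetical attention weight
enc2d = TiedEncoder(shared)            # first sub-model's encoder
enc3d = TiedEncoder(shared)            # second sub-model's encoder

enc2d.sgd_step({"attention.w": 1.0})   # an update through the image branch...
# ...is immediately visible through the video branch as well
```

The design choice here is that sharing means aliasing, not copying: the gradient step issued by the 2D branch changes the very weights the 3D branch reads on its next forward pass.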

The masking strategy the present invention applies to the target video samples corresponding to the second neural network sub-model is to take a certain frame of the target video as a starting frame and mask a fixed-length run of frames from that starting frame onward. This masking strategy increases the difficulty of video prediction and improves the second neural network sub-model's ability to extract the spatiotemporally continuous information of objects in the video.

Preferably, Q = 16 and 5 ≤ L ≤ 9. Small-scale experiments show that when Q = 16 and L is set within 5 ≤ L ≤ 9, the second neural network sub-model both extracts the spatiotemporally continuous information of objects in the video well and keeps its training time acceptable. Optionally, L = 7.

The present invention applies a random consecutive-frame masking strategy to b_m; that is, the starting mask frames of different target videos may differ or coincide, but the number of masked frames is the same. As an embodiment, b_m comprises 16 consecutively captured frames of target images, each a 224*224 image, and the preset number of mask frames is 7; a starting point is chosen at random among the 16 frames, and that starting frame together with the 7 frames that follow it is masked, yielding the masked b_m. It should be understood that the starting point must be chosen so that at least 7 frames follow it.
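The random consecutive-frame strategy can be sketched as follows (using 0-based frame indices instead of the patent's 1-based ones; names are ours):

```python
import random

def mask_consecutive_frames(num_frames, L, seed=None):
    """Pick a random start frame i and mask frames i..i+L inclusive,
    i.e. the start frame plus the L frames that follow it."""
    rng = random.Random(seed)
    i = rng.randint(0, num_frames - (L + 1))   # keep i+L inside the video
    mask = [i <= q <= i + L for q in range(num_frames)]
    return i, mask

# Q = 16 frames, L = 7 preset mask frames -> 8 consecutive frames masked
start, mask = mask_consecutive_frames(num_frames=16, L=7, seed=0)
```

Note that masking the start frame plus the following L frames hides L+1 frames in total, which matches the patent's [i_m, i_m+L] inclusive range.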

According to the present invention, the trained neural network model is the large video model for remote sensing scenes of the present invention; this large video model has strong feature extraction capability and feature-pattern discovery capability.

As a specific embodiment, the remote sensing image set A includes more than 1.09 million remote sensing images, and the target video set B includes more than 1.01 million target videos, more than half of which are videos captured by UAV-borne remote sensing equipment. The remote sensing images are partitioned into blocks, and half of the blocks in each remote sensing image are masked at random. Each target video is set to include 16 consecutive frames of target images; a starting mask frame is selected at random in each target video, and the starting mask frame together with the following 7 frames of target images is masked. The first neural network sub-model of the neural network model is trained with the masked remote sensing images while the second neural network sub-model is trained with the masked target videos, with weights shared between the encoder of the first neural network sub-model and the encoder of the second neural network sub-model throughout training, until training ends.

Experiments show that, compared with randomly initializing model parameters, using the trained neural network model's parameters as the initial model parameters of the models for different downstream tasks yields higher accuracy for the same training time: when the downstream task is object detection, the mean average precision (mAP) rises from 0.3629 to 0.3718; when the downstream task is video prediction, the structural similarity (SSIM) rises from 0.7018 to 0.7152. The large video model for remote sensing scenes constructed by the present invention is thus applicable to different downstream tasks, generalizes well, and has strong feature extraction and feature-pattern discovery capabilities, improving the accuracy of the models for different downstream tasks.

Although some specific embodiments of the present invention have been described in detail by way of example, those skilled in the art should understand that the above examples are for illustration only and are not intended to limit the scope of the present invention. Those skilled in the art should also understand that various modifications can be made to the embodiments without departing from the scope and spirit of the present invention. The scope of the present invention is defined by the appended claims.

Claims (7)

1. A method for constructing a large video model for remote sensing scenes, characterized by comprising the following steps:
acquiring a remote sensing image set A and a target video set B, wherein A = {a_1, a_2, …, a_N}, a_n is the n-th remote sensing image in A, n ranges from 1 to N, and N is the number of remote sensing images in A; B = {b_1, b_2, …, b_M}, b_m is the m-th target video in B, m ranges from 1 to M, and M is the number of target videos in B; b_m = (b_{m,1}, b_{m,2}, …, b_{m,Q}), b_{m,q} is the q-th frame of target image in b_m, q ranges from 1 to Q, Q is the number of target images in a target video, and b_{m,1}, b_{m,2}, …, b_{m,Q} are Q consecutively captured frames of target images; the target videos in B are videos captured by satellite-borne remote sensing equipment or by remote sensing equipment carried on an unmanned aerial vehicle, and the remote sensing images are images captured by the satellite-borne remote sensing equipment;
training a neural network model using A and B, the neural network model comprising a first neural network sub-model and a second neural network sub-model, the training comprising:
traversing A, partitioning a_n into blocks, and randomly masking k*C of the blocks, wherein C is the number of blocks obtained by partitioning a_n and k is a preset mask ratio; training the first neural network sub-model with the masked a_n, the first neural network sub-model being a 2D swin-transformer structure and comprising a first encoder and a first decoder;
traversing B and masking frames [i_m, i_m+L] of b_m, wherein i_m+L ≤ Q, i_m ≥ 1, L is a preset number of mask frames, and i_m is the starting mask frame of b_m; training the second neural network sub-model with the masked b_m, the second neural network sub-model being a 3D swin-transformer structure and comprising a second encoder and a second decoder; the training of the first neural network sub-model being performed simultaneously with the training of the second neural network sub-model, and the second encoder sharing weights with the first encoder during training.
2. The method for constructing a large video model for a remote sensing scene according to claim 1, wherein 40% ≤ k ≤ 60%.
3. The method for constructing a large video model for a remote sensing scene according to claim 2, wherein k=50%.
4. The method for constructing a large video model for a remote sensing scene according to claim 1, wherein Q = 16 and 5 ≤ L ≤ 9.
5. The method for constructing a large video model for a remote sensing scene as claimed in claim 4, wherein L = 7.
6. The method for constructing a large video model for a remote sensing scene according to claim 1, wherein the number of videos in B shot by the remote sensing equipment carried by the unmanned aerial vehicle is larger than the number of videos in B shot by the satellite-borne remote sensing equipment.
7. The method for constructing a large video model for a remote sensing scene as claimed in claim 1, wherein N and M are each in the order of millions.
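The masking scheme recited in claims 1-5 can be sketched in a few lines. This is an illustrative reading of the claims rather than an implementation from the patent: in particular, treating [i_m, i_m + L] as an inclusive frame interval is an assumption.

```python
import random

def mask_image_blocks(num_blocks: int, k: float) -> set:
    """Pick k * C of the C image blocks to mask, per claim 1
    (claims 2-3 bound the mask proportion: 40% <= k <= 60%, e.g. k = 50%)."""
    n_masked = round(k * num_blocks)
    return set(random.sample(range(num_blocks), n_masked))

def mask_video_frames(q: int, l: int) -> range:
    """Pick a consecutive masked interval [i_m, i_m + L] of frames,
    with i_m >= 1 and i_m + L <= Q, per claim 1 (claims 4-5 suggest
    Q = 16 and L = 7). Frames are 1-indexed as in the claims."""
    i_m = random.randint(1, q - l)
    return range(i_m, i_m + l + 1)
```

The image masking drives the 2D sub-model (masked-image reconstruction over spatial blocks), while the frame masking drives the 3D sub-model (reconstruction of a contiguous temporal span), with the two encoders sharing weights during joint training.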
CN202211635612.9A 2022-12-19 2022-12-19 Remote sensing scene-oriented video large model construction method Active CN116109966B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211635612.9A CN116109966B (en) 2022-12-19 2022-12-19 Remote sensing scene-oriented video large model construction method


Publications (2)

Publication Number Publication Date
CN116109966A (en) 2023-05-12
CN116109966B (en) 2023-06-27

Family

ID=86266649

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211635612.9A Active CN116109966B (en) 2022-12-19 2022-12-19 Remote sensing scene-oriented video large model construction method

Country Status (1)

Country Link
CN (1) CN116109966B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019056845A1 (en) * 2017-09-19 2019-03-28 北京市商汤科技开发有限公司 Road map generating method and apparatus, electronic device, and computer storage medium
WO2020232905A1 (en) * 2019-05-20 2020-11-26 平安科技(深圳)有限公司 Superobject information-based remote sensing image target extraction method, device, electronic apparatus, and medium
CN113706388A (en) * 2021-09-24 2021-11-26 上海壁仞智能科技有限公司 Image super-resolution reconstruction method and device
CN114220015A (en) * 2021-12-21 2022-03-22 一拓通信集团股份有限公司 Improved YOLOv 5-based satellite image small target detection method
CN114842351A (en) * 2022-04-11 2022-08-02 中国人民解放军战略支援部队航天工程大学 Remote sensing image semantic change detection method based on twin transforms
CN114937202A (en) * 2022-04-11 2022-08-23 青岛理工大学 Double-current Swin transform remote sensing scene classification method
CN115049921A (en) * 2022-04-27 2022-09-13 安徽大学 Method for detecting salient target of optical remote sensing image based on Transformer boundary sensing
WO2022247711A1 (en) * 2021-05-24 2022-12-01 广州智慧城市发展研究院 Target associated video tracking processing method and device
WO2022252557A1 (en) * 2021-05-31 2022-12-08 上海商汤智能科技有限公司 Neural network training method and apparatus, image processing method and apparatus, device, and storage medium


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Fanglong Yao et al.: "Gated hierarchical multi-task learning network for judicial decision prediction", Neurocomputing, vol. 411, pages 313-326 *
Lou Lin, Huang Weigen: "Research on red tide satellite remote sensing methods based on artificial neural networks", Journal of Remote Sensing, no. 02, pages 125-130 *
Jiao Yunqing; Wang Shixin; Zhou Yi; Fu Qinghua: "Ultra-high-resolution target recognition in remote sensing images based on neural networks", Journal of System Simulation, no. 14, pages 3223-3225 *

Also Published As

Publication number Publication date
CN116109966B (en) 2023-06-27

Similar Documents

Publication Publication Date Title
Liu et al. Mobile video object detection with temporally-aware feature maps
CN109753903B (en) Unmanned aerial vehicle detection method based on deep learning
CN109740465B (en) A Lane Line Detection Algorithm Based on Instance Segmentation Neural Network Framework
CN109886225B (en) Image gesture action online detection and recognition method based on deep learning
Lyu et al. Road segmentation using cnn and distributed lstm
US20200051250A1 (en) Target tracking method and device oriented to airborne-based monitoring scenarios
CN111292366B (en) Visual driving ranging algorithm based on deep learning and edge calculation
CN111860175B (en) A method and device for detecting vehicles in unmanned aerial vehicle images based on lightweight network
CN113538457B (en) Video semantic segmentation method utilizing multi-frequency dynamic hole convolution
CN113255616B (en) Video behavior identification method based on deep learning
Song et al. Vulnerable pedestrian detection and tracking using deep learning
CN110942037A (en) Action recognition method for video analysis
CN117496384A (en) A method for object detection in drone images
CN114067225A (en) Method, system and storage medium for detecting small target of unmanned aerial vehicle
CN111881794B (en) Video behavior recognition method and system
CN116109966B (en) Remote sensing scene-oriented video large model construction method
CN115690916A (en) A Lightweight Video Behavior Recognition Method Based on Hybrid Model
Lin et al. EAGAN: Event‐based attention generative adversarial networks for optical flow and depth estimation
CN119600602A (en) A lightweight semantic segmentation network and method for power grid inspection
Li et al. CDMY: A lightweight object detection model based on coordinate attention
Liu et al. A real-time smoke and fire warning detection method based on an improved YOLOv5 model
Ding et al. Key frame extraction based on frame difference and cluster for person re-identification
Wei et al. Improved MobileNetV3 for weather scene classification in drone aerial video imagery
Hu et al. A Lightweight Network for Small Object in UAV Images
Zhang et al. Spatio-Temporal Fusion Based Low-Loss Video Compression Algorithm for UAVs with Limited Processing Capability

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant