CN118070227A - Multi-mode data processing method enhanced by large language model - Google Patents
Multi-mode data processing method enhanced by large language model Download PDFInfo
- Publication number
- CN118070227A CN118070227A CN202410282739.XA CN202410282739A CN118070227A CN 118070227 A CN118070227 A CN 118070227A CN 202410282739 A CN202410282739 A CN 202410282739A CN 118070227 A CN118070227 A CN 118070227A
- Authority
- CN
- China
- Prior art keywords
- data
- mode
- model
- modal
- layer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000003672 processing method Methods 0.000 title claims abstract description 8
- 230000004927 fusion Effects 0.000 claims abstract description 20
- 238000004458 analytical method Methods 0.000 claims abstract description 19
- 230000011218 segmentation Effects 0.000 claims abstract description 7
- 238000013507 mapping Methods 0.000 claims abstract description 4
- 239000010410 layer Substances 0.000 claims description 100
- 238000000034 method Methods 0.000 claims description 78
- 230000008569 process Effects 0.000 claims description 31
- 238000012549 training Methods 0.000 claims description 25
- 230000000007 visual effect Effects 0.000 claims description 24
- 241000721701 Lynx Species 0.000 claims description 23
- 239000013598 vector Substances 0.000 claims description 15
- 230000006978 adaptation Effects 0.000 claims description 11
- 238000012545 processing Methods 0.000 claims description 11
- 241000282414 Homo sapiens Species 0.000 claims description 10
- 239000011159 matrix material Substances 0.000 claims description 9
- 238000013528 artificial neural network Methods 0.000 claims description 8
- 230000006870 function Effects 0.000 claims description 6
- 238000005070 sampling Methods 0.000 claims description 6
- 230000008859 change Effects 0.000 claims description 5
- 230000008451 emotion Effects 0.000 claims description 5
- 239000000284 extract Substances 0.000 claims description 4
- 238000005457 optimization Methods 0.000 claims description 4
- 241001417495 Serranidae Species 0.000 claims description 3
- 239000002356 single layer Substances 0.000 claims description 3
- 230000009471 action Effects 0.000 claims description 2
- 230000004913 activation Effects 0.000 claims description 2
- 238000002372 labelling Methods 0.000 claims description 2
- 238000013473 artificial intelligence Methods 0.000 abstract description 3
- 238000013461 design Methods 0.000 description 23
- 238000002474 experimental method Methods 0.000 description 15
- 238000012360 testing method Methods 0.000 description 13
- 238000013527 convolutional neural network Methods 0.000 description 7
- 230000019771 cognition Effects 0.000 description 4
- 230000007246 mechanism Effects 0.000 description 4
- 238000012986 modification Methods 0.000 description 4
- 230000004048 modification Effects 0.000 description 4
- 238000011160 research Methods 0.000 description 4
- 238000006467 substitution reaction Methods 0.000 description 4
- 230000004075 alteration Effects 0.000 description 3
- 238000012512 characterization method Methods 0.000 description 3
- 230000002860 competitive effect Effects 0.000 description 3
- 238000013136 deep learning model Methods 0.000 description 3
- 238000011156 evaluation Methods 0.000 description 3
- 238000000605 extraction Methods 0.000 description 3
- 230000002452 interceptive effect Effects 0.000 description 3
- 230000003287 optical effect Effects 0.000 description 3
- 101150041570 TOP1 gene Proteins 0.000 description 2
- 230000003044 adaptive effect Effects 0.000 description 2
- 230000002776 aggregation Effects 0.000 description 2
- 238000004220 aggregation Methods 0.000 description 2
- 238000004364 calculation method Methods 0.000 description 2
- 238000013135 deep learning Methods 0.000 description 2
- 238000001514 detection method Methods 0.000 description 2
- 238000010586 diagram Methods 0.000 description 2
- 238000006073 displacement reaction Methods 0.000 description 2
- 238000005516 engineering process Methods 0.000 description 2
- 238000003384 imaging method Methods 0.000 description 2
- 238000012544 monitoring process Methods 0.000 description 2
- 238000003058 natural language processing Methods 0.000 description 2
- 230000036961 partial effect Effects 0.000 description 2
- 238000007781 pre-processing Methods 0.000 description 2
- 230000003595 spectral effect Effects 0.000 description 2
- 101100153586 Caenorhabditis elegans top-1 gene Proteins 0.000 description 1
- 102100040954 Ephrin-A1 Human genes 0.000 description 1
- 101000965523 Homo sapiens Ephrin-A1 Proteins 0.000 description 1
- 101100370075 Mus musculus Top1 gene Proteins 0.000 description 1
- 244000290333 Vanilla fragrans Species 0.000 description 1
- 235000009499 Vanilla fragrans Nutrition 0.000 description 1
- 235000012036 Vanilla tahitensis Nutrition 0.000 description 1
- 230000002159 abnormal effect Effects 0.000 description 1
- 238000000137 annealing Methods 0.000 description 1
- 238000013459 approach Methods 0.000 description 1
- 239000010426 asphalt Substances 0.000 description 1
- 239000005427 atmospheric aerosol Substances 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000008901 benefit Effects 0.000 description 1
- 230000015572 biosynthetic process Effects 0.000 description 1
- 210000004556 brain Anatomy 0.000 description 1
- 238000000701 chemical imaging Methods 0.000 description 1
- 230000001149 cognitive effect Effects 0.000 description 1
- 238000004891 communication Methods 0.000 description 1
- 238000010276 construction Methods 0.000 description 1
- 238000007405 data analysis Methods 0.000 description 1
- 238000003066 decision tree Methods 0.000 description 1
- 229910003460 diamond Inorganic materials 0.000 description 1
- 239000010432 diamond Substances 0.000 description 1
- 230000009977 dual effect Effects 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 230000001747 exhibiting effect Effects 0.000 description 1
- 239000012467 final product Substances 0.000 description 1
- ZZUFCTLCJUWOSV-UHFFFAOYSA-N furosemide Chemical compound C1=C(Cl)C(S(=O)(=O)N)=CC(C(O)=O)=C1NCC1=CC=CO1 ZZUFCTLCJUWOSV-UHFFFAOYSA-N 0.000 description 1
- 238000007499 fusion processing Methods 0.000 description 1
- 238000013178 mathematical model Methods 0.000 description 1
- 238000007500 overflow downdraw method Methods 0.000 description 1
- 230000008447 perception Effects 0.000 description 1
- 230000010287 polarization Effects 0.000 description 1
- 239000000047 product Substances 0.000 description 1
- 230000001737 promoting effect Effects 0.000 description 1
- 210000000697 sensory organ Anatomy 0.000 description 1
- 238000000638 solvent extraction Methods 0.000 description 1
- 230000007704 transition Effects 0.000 description 1
- XLYOFNOQVPJJNP-UHFFFAOYSA-N water Substances O XLYOFNOQVPJJNP-UHFFFAOYSA-N 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/254—Fusion techniques of classification results, e.g. of results related to same input data
- G06F18/256—Fusion techniques of classification results, e.g. of results related to same input data of results relating to different input data, e.g. multimodal recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Molecular Biology (AREA)
- Biomedical Technology (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Multimedia (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a large language model enhanced multi-mode data processing method, which comprises the following steps: the fusion model comprises a modal encoder, a modal bridge, a text word segmentation device, a multi-modal large language model and a task head; the modality encoder encodes each modality image data into a token sequence; the mode bridge is used for finishing dimension mapping from each mode to the language mode token; the multi-modal large language model completes analysis of each modal data; designing a specific task head for each task to promote generalization of the model on the task; one to four text prompt words are provided for each mode image data to guide the fusion model to correctly analyze each mode image data. The invention can explain image data of multiple modes by using one model; the large language model is used as a core to construct the universal artificial intelligence, and fusion of image data is promoted to be converted from a model-specific and task-specific paradigm to a universal paradigm.
Description
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a large language model enhanced multi-mode data processing method.
Background
Different modal data such as RGB, infrared, multispectral and the like are essentially different observation means for the same ground object, so that the utilization of the multi-modal data is an intrinsic requirement for the cognition of the ground object. However, for a long time, due to the high heterogeneity of the image data of each mode in terms of structure and semantics, joint interpretation of multi-mode data has been a very challenging problem, and the key difficulty is the balance between uniformity and independence among different modes, and the balance is in nonlinear growth with the increase of the number of modes.
Thanks to the increasing abundance of observation means for space-time scenes, the observation means can be described by multiple modes such as RGB, infrared, SAR, multispectral, graph, space-time track and the like for the same ground object, wherein each mode describes information of a certain aspect of the ground object respectively. Similar to the process that human beings perceive and understand the world through a visual mode, the joint interpretation by utilizing multi-mode data is an intelligent model to realize the internal requirement of the cognition of the ground object, and the fusion processing can be better carried out on images in the fields of satellite remote sensing images, agricultural monitoring and the like.
However, due to the inherent mechanism differences of the modes, the modes are often highly heterogeneous in structure and semantics, such as in structure, the point cloud exists in the form of three-dimensional coordinates and several eigenvalues; semantically, the RGB image reflects the electromagnetic characteristics of the visible light wave band of the ground object reflection and the active emission, and the SAR image reflects the electromagnetic characteristics of the microwaves of the ground object reflection microwave radar active emission.
For a long time, due to the high heterogeneity of the above-mentioned modes, researchers in the field often design specific methods based on a priori assumptions of a certain mode or task, or design multi-mode joint interpretation methods based on a few modes with low heterogeneity. For example, in a single-mode study, c.r.qi et al propose classical networks PointNet for point cloud modes based on invariance of point cloud data ordering, importance of global and local features, classical networks convertors for language modes based on long Cheng Yilai of word sequences in language for language modes, vaswani et al propose classical networks GCN for graph modes based on adjacency dependency of nodes in the graph for graph modes, kipf and Welling; in multi-modal research, fusion of optical and SAR images is widely explored under both traditional remote sensing and deep learning remote sensing, and text visual models are rapidly developed in recent years. The foregoing prior differences in each modality often lead to a huge gap in the approach between different modalities, which are difficult to perceive and understand with a unified model.
The multi-modal data is used for realizing the intrinsic requirement of the cognition of the ground object, and an ideal multi-modal model has the capability of combining all modes for joint interpretation. Thus, an important trend in the study of intelligent methods in the space-time domain is the ever-increasing number of modalities.
First, early researchers often built a single-modality expert model based on a priori assumptions about a modality, and achieved excellent success in the respective modalities. In recent years, based on the deep understanding of the single-mode interpretation method, a large number of researchers have tried to propose a multi-mode interpretation method by integrating several modes with low heterogeneity. However, as the number of modalities increases, the difficulty in balancing the uniformity and independence among the modalities increases.
For the text visual mode, the text visual mode has the working of Transformer, BERT, GPT series of equivalent milestone meanings, and the generation of a subsequent series of working is inspired. For the RGB modality, resNet proposed by k.he et al based on visual global and local information importance assumptions and ViT proposed by a.dosovitskiy et al based on global attention mechanisms are two milestone-wise works.
The increased number of channels of the image results in multispectral and hyperspectral modes. The Huang et al put forward STDCNN model to model the global space and spectral characteristics of the multispectral image at the same time based on the characteristic that the multispectral image has a large number of wave bands; the number of hyperspectral image bands is further increased compared to multispectral images, typically up to hundreds of bands, based on which x.yang et al propose R-3D-CNN to further enhance the extraction of spectral features.
In addition to the mode in which the number of bands increases, a specific band of images is also receiving a great deal of attention. For example, in the fields of agriculture, ocean and the like, an infrared image with wide application range is provided by H.Chen et al in combination with a decision tree to provide a shallow convolutional neural network so as to meet the requirement of real-time intelligent interpretation of a large number of infrared images in agricultural water resource monitoring. S. Chen et al proposes a full convolution neural network AConvNet for intelligent interpretation of SAR images formed by active microwave radars.
The three-dimensional reconstruction task under the oblique photography mode aims at reconstructing a three-dimensional model of a target by utilizing a single or a plurality of two-dimensional images with different angles, and Y.Yao et al propose MVSNet aiming at the task, and construct the association between images with different views by utilizing technologies such as three-dimensional convolution and the like, so as to restore the three-dimensional model of the target from the characteristics of the two-dimensional images.
The space-time track mode reflects the change of the target space position in time, and A.Gupta et al propose a Social GAN based on the characteristics of track multiple rationality, and simultaneously combine historical track information and Social information to predict a plurality of reasonable future results.
For Graph modes, kipf and Welling propose a classical network GCN based on the adjacency dependency relationship of nodes in the Graph; GAT proposed by velic ˇ kovic et al then introduces a attention mechanism into the Graph modality.
The Video modality is formed by stacking images in time. The model proposed by Karpath et al expands the connectivity of the CNN model in time, and simultaneously accelerates the training process by utilizing a multi-resolution technology, thereby effectively reducing the influence of the space-time redundancy of the Video mode on the model efficiency.
The point cloud modality reflects information such as position, shape, color, texture, etc. of the three-dimensional target, and c.r.qi et al propose a classical network PointNet for the point cloud modality based on invariance of the ordering of the point cloud data, the importance of global and local features; wuu et al propose PointConv to generalize the convolution operation to three-dimensional point clouds.
Although single-mode methods achieve good performance in the respective modes, they are difficult to generalize among multiple modes due to their large differences from one another based on their respective prior assumptions of construction. Based on the intelligent model, the intrinsic requirements of feature object cognition are completed by utilizing multi-modal data, and a plurality of researchers try to balance the uniformity and independence among modalities to construct a multi-modal model.
Firstly, for the joint interpretation of two modes, H.Li and X. -J.Wu recognize that the fusion of RGB images and Infrared images can enhance the color, texture, temperature and other information of the images, and the method has wide requirements and application prospects in the fields of military, medical treatment and the like, so that a dense fusion method of the RGB and Infrared images is provided: denseFuse, which connects the outputs of all convolution layers through a dense module, successfully realizes the pixel level fusion of RGB and Inforred modes; the Sadeghian et al expands the Social GAN for predicting a plurality of reasonable tracks by using the historical tracks, proposes that the Sophie of the scene data is added by introducing RGB images, and achieves better effect on the space-time track prediction task; the RGB image and the SAR image have high complementarity, l.h. hughes et al propose a three-step deep neural network framework, a general prediction matching region, a thermodynamic diagram generation and abnormal value rejection method are used for matching the RGB image and the SAR image, and x.li et al propose a model for migrating the RGB image and the SAR image to the same feature space by using a GAN network to complete a target detection task: DTCDN; j. yang et al propose a dual stream convolutional network that uses high resolution multispectral images to enhance the spatial resolution of the hyperspectral images.
Notably, the CLIP model proposed by a.radford et al uses contrast learning to correlate RGB modalities with text modalities, and under pre-training of text weak supervisory signal guidance, the CLIP model exhibits excellent capabilities both on visual unimodal tasks and on visual text multimodal tasks, inspiring a subsequent series of work. The Meta-transformer proposed by zhang et al pre-trains a general backbone network in a visual language mode by utilizing a contrast learning mode proposed by CLIP, and the general backbone network shows multi-mode generalization capability in a plurality of modes such as point cloud, infrared, hyperspectrum and the like, and shows the potential of language as a central mode. J.Han et al directly uses a multi-modal large language model as a general backbone network, proposes an One-LLM model, and successfully unifies Image, audio, video, point and other 8 modalities.
The key difficulty in solving the problem of fusion of multimodal images is how to balance the uniformity and independence between modalities. Uniformity refers to the existence of interrelated shared information between the modalities, whereas independence refers to the existence of specific information for each modality compared to the other modalities. If the data of a plurality of modes are simply projected to a characterization space for fusion, namely, more uniformity among the modes is emphasized, the specific information of each mode is lost, and the meaning of multi-mode combination is lost; conversely, if the independence between the modalities is emphasized too much, the modalities cannot be linked, and the ability of the unified modality to simultaneously sense a plurality of modalities is limited. Moreover, as the number of modalities increases, the difficulty in balancing the uniformity and independence of the modalities increases non-linearly.
The computational effort and data size required for large model training is currently expanding dramatically, and even the cost of fine tuning large models is becoming unacceptable, so training large models for each domain is almost impossible in the future, and languages are expected to achieve efficient generalization of large models, expanding large models from the native domain to more domains with minimal cost.
The deep learning model is always called as a 'black box', which means that the reasoning process of the deep learning model is invisible and difficult to explain, and some work for researching the interpretability of the deep learning model is often based on a complex mathematical model and a large number of assumptions, which greatly limits the landing of the deep learning method in the low fault tolerance fields such as homeland resources and the like.
End-to-end to interactive paradigm shift. The end-to-end paradigm refers to a learning mode in which the model accepts input and directly outputs results. In recent years, the end-to-end paradigm has become more popular due to its simple and clear architecture and excellent performance, but its drawbacks are also apparent, namely uncontrollable internal operations, partial problems requiring optimization of the whole, difficult positioning of the problem causes, etc. The language-centric architecture has the potential to implement an interactive paradigm, i.e., a user inputs raw data and corresponding text prompts, the model automatically performs related operations according to the prompts, and even the user can adjust the text prompts according to the results, iterating repeatedly. The interactive paradigm thus has an incomparable advantage of the end-to-end paradigm, both in terms of performance and in terms of controllability.
Disclosure of Invention
Therefore, the invention provides a large language model enhanced multi-mode data processing method, which fuses a text (the text can also be regarded as a one-dimensional image visual mode), RGB, infrared, SAR, multispectral, hyperspectral, graph, space-time track, oblique photography, point cloud, video and other different image modes into one model.
In order to achieve uniformity among a plurality of modes, the invention uses a principle of using a language as a uniform reference system of a feature space, maps data of different modes to the language modes, and designs a specific text template for each mode so as to guide a uniform multi-mode large language model to complete analysis of data of each mode.
The invention provides a large language model enhanced multi-mode data processing method, which unifies a plurality of different modes such as texts, RGB, infrared, SAR, multispectral, hyperspectral, images, space-time tracks, oblique photography, point clouds, videos and the like into one model. In order to realize mode uniformity, the invention is inspired by a human cognitive system and a language philosophy: the perception signals of the five sense organs are finally converged in the language, unified and thought is carried out through concepts, a unified reference system which adopts the language as a feature space is provided, features of different modes are mapped to language modes in a unified way, and a mode-specific promt is designed to guide a multi-mode large language model to correctly perceive data of each mode; in order to maintain the mode independence, the invention introduces a mode-specific mode encoder to encode the basic unit of each space-time mode, and a mode bridge to complete the dimension projection from each mode to language mode. Because a gap exists between the analysis result of the model on the specific modal data and the specific downstream task, the task head is designed to improve the generalization of the model on the specific downstream task. Experiments show that under the condition that expert knowledge of most space-time modes is not learned and a unified model structure is used, the invention obtains the accuracy competitive with SOTA on the modes such as RGB, space-time track and the like, and has excellent adaptability on the modes such as MSI, HSI, pointCloud (point cloud), graph and the like. The invention shows the possibility and potential of constructing general artificial intelligence by taking a large language model as a core, and is helpful for promoting the research of the space-time domain intelligent method to be converted from a model-specific and task-specific paradigm to a general paradigm.
In order to maintain independence between modalities, the present invention designs a specific modality encoder for each modality to extract independent characterizations in each modality. Due to the high degree of heterogeneity between the modal data and the modal encoder, the modal representations differ greatly in dimension from the language modal representations. To solve this problem, the present invention introduces a modality bridge to accomplish the dimension mapping of each modality to language modality token.
Because the multi-mode large language model has a gap between the analysis result of each mode data and a specific downstream task. To this end, the invention designs a task head for each downstream task to promote generalization of the model on downstream tasks. It is worth noting that the present invention follows the principle of light weight in the design of both the modal encoder and the task head due to the strong parsing capability of the multi-modal large language model.
Experiments show that under the condition that the model structure is not changed and most of mode expert knowledge is not available, the method obtains the Accuracy competitive with SOTA on the modes such as RGB, space-time track and the like, wherein the difference between the RGB mode and the optimal model on an Accumay index is only 0.84, the difference between the space-time track mode and the optimal model on an ADE index is only 0.07, and in addition, the method also shows excellent adaptability on a plurality of modes such as point cloud, multispectral, hyperspectral and Graph.
Specifically, the invention discloses a universal analysis method of multi-mode image data, which comprises the following steps:
Acquiring multi-modal data and inputting the multi-modal data into a fusion model; the fusion model comprises a modal encoder, a modal bridge, a text word segmentation device, a multi-modal large language model and a task head; the multi-modal data includes Text data, RGB data, MSI data, HSI data, trajectory data, SAR data, infrared data, graph data, obliquePhotography data, video data, and PointCloud data;
The modal encoder encodes each modal data into a token sequence; the modal bridge is formed by stacking a cross attention layer and a feedforward neural network, and the dimension mapping from each mode to the language mode token is completed; the multi-modal large language model completes analysis of each modal data; for the gap between the analysis result of each modal data and the final task, a specific task head is used for each task to promote generalization of the model on the task;
Providing one to four text prompt words for each mode data, and guiding the fusion model to analyze each mode data; performing visible light remote sensing image scene classification tasks on the RGB data; labeling data for Text data to perform emotion classification tasks; performing a multispectral modal classification task on MSI data; performing a ground object category classification task on the HSI data; carrying out pedestrian track prediction tasks on Trajectory data; performing military stationary target recognition tasks on SAR data; performing a human re-recognition task on the Infrared data; carrying out traffic flow prediction tasks on Graph data; performing a three-dimensional reconstruction task on ObliquePhotography data; performing a human action recognition task on the Video data; performing a point cloud object classification task on PointCloud data;
and outputting an analysis result.
Further, the text word segmentation device and the multi-modal large language model are visual text models Lynx; and during training, a lightweight multi-mode adaptation layer integrated by Lynx in the model is used for adapting to multi-mode input, and parameters of the adaptation layer are not frozen during training, so that the adaptation capability of the model to various mode images is improved.
Furthermore, in order to achieve the purpose of enabling the token sequence of each mode to be fed into the unified multi-mode large language model, a mode bridge in the Lynx model is used to project the token of each mode to the dimension of the multi-mode large language model, and the above process is formed as follows:
si=Φ(fi(mi),q)
Wherein { m RGB,mMSI,mHSI,…,mi,…,mVideo } represents the input of each mode, m RGB represents an RGB mode, m MSI represents an MSI mode, m Video represents a video mode, f i represents a mode encoder for the m i mode, Φ represents a mode bridge, q ε R N*D represents defining N D-dimensional learnable vectors in the mode bridge, D is set to 4096, namely the dimension of a multi-mode large language model, and input data m i of each mode is mapped into a token sequence s i∈RN*D in the same dimension as the language model;
The whole fusion model M is formed as:
M(mi,pi)=Htask(F(si⊕T(pi)))
Where p i represents a text prompt for the m i modality, T represents a text segmenter, the @ represents a concatenation operation of the text token and the modality token on the sequence, F represents a multi-modality large language model, and H task represents a task header.
Further, the task in the fusion model is a supervision task, y represents a label, L represents a loss function, θ represents a learnable parameter in the model, and the optimization target of the fusion model is formed as follows:
still further, a different modality encoder is used for each modality data to maintain independence between modalities:
For text modality data, a text segmenter of the Lynx model is used, i.e. f (m Text/Code)=T(mText/Code), where m Text/Code={w1,w2,w3, … } represents a word sequence in text or code;
For RGB modal data, a visual encoder EVA in a Lynx model is adopted, wherein the visual encoder EVA is a visual basic model and is composed of TransformerBlock layers of stacked 40 layers with the width 1408;
For MSI mode data, the Patch Embled of ViT model is extended, and the channel number is changed into the band number of the inputted multispectral image;
for HIS modality data, the feature dimensions of each pixel are first expanded with a linear projection layer, a process that is formalized as Wherein W epsilon R 1*d is the weight matrix of the linear layer projection layer; the features are then extracted with a 12-layer standard transducer encoder, the whole process being denoted
For a space-time track mode, the space-time track mode reflects the change information of a target in time and space and consists of a series of two-dimensional coordinate points, namely m Trajectory∈Rl*2, wherein l represents the sequence length of track points; the encoder of the space-time track mode firstly uses a linear layer to expand the dimension of the two-dimensional track characteristic, and the step is formed as W.mu.m Trajectory, wherein W.mu.R d*(l*2) is a weight matrix of a projection layer of the linear layer; the features are then extracted with a 2-layer transducer encoder, and the whole process is formally expressed as f (m Trajectory)=Enc2(W*mTrajectory);
using a three-layer convolution network as a modal encoder for SAR modal data;
For near-infrared modal data, including processing infrared images and visible light images, namely m Infrared={Ir∈RH*W*3,Ii∈RH*W*3, inheriting a modal encoder for identifying a cross-modal pedestrian again, but discarding a method for sharing parameters of later layers of ResNet-50, respectively extracting characteristics of the infrared images and the visible light images by using ResNet-50 with two independent parameters, wherein the process is formalized as f (m Infrared)=Enc1resnet50(Ir)⊕Enc2resnet50(Ii);
For graph mode data, m Graph∈RK*d, wherein K is the number of nodes, d is the characteristic dimension of the nodes, the mode encoder firstly uses a linear layer to expand the characteristic dimension of the nodes, and then uses a plurality of embedded layers to respectively encode the self characteristic, the spatial characteristic and the time sequence characteristic of points in the graph, and the whole process is formalized and expressed as Wherein W ε R hidden*d is the weight matrix of the linear layer; emb node is a point feature, emb spatial is a spatial feature, and Emb time is a timing feature;
For oblique photography modality mvideo={V1∈RC*H*W,V2∈RC*H*W,V3∈RC*H*W,…,Vn∈RC*H*W},, where V i represents the image of each view, i=1, 2, …, n, n is the number of views, its modality encoder uses ViT model to extract features of multiple view images and concatenates them, formalized representation f(mvideo)=Enc12(V1)⊕Enc12(V2)⊕Enc12(V3)⊕…⊕Enc12(Vn),Enc12 represents a 12-layer encoder;
For the video mode m video∈RT*C*H*W, the mode encoder performs sparse sampling on the data with high redundancy in time and space based on TubeViT to reduce redundancy, and then the features obtained by sparse sampling are sent to a transform encoder of 6 layers to further extract the features, and the features are expressed as f (m video)=Enc6(Sparse(mvideo); sparse denotes Sparse sampling, enc 6 denotes a 6-layer encoder;
For point cloud modal data m pointCloud∈RK*(d+3), the position, shape, color and texture information of the ground feature in space are reflected, wherein K represents the number of three-dimensional target points, d represents the dimension of a point cloud characteristic value, and a modal encoder inherits PointBERT model: the method comprises the steps of firstly performing block coding on point cloud data to unify the number of points input simultaneously, wherein the step is expressed as PointGroup = Grouper (m pointCloud)∈RG*N*3, G represents the group number of groups, N represents the number of points in each group, then inputting the grouping result into a one-dimensional convolution layer, extracting each group of points as a feature vector f Group=Conv1d(PointGroup)∈RG*d, conv1d represents a 1-dimensional convolution layer, finally inputting the feature vector of each group into a standard 12-layer converter to extract the global features, and formalizing the whole process as f (m pointCloud)=Enc12(Conv1d(PointGroup(mpointCloud))),Enc12 represents a 12-layer encoder.
Furthermore, in the cross attention layer, a learnable query vector Q epsilon R N*D is predefined, wherein D is the internal dimension of the language model, and N is used as a super parameter; key and Value of the cross attention layer are all mode characteristics output by the mode encoder;
The feedforward neural network adopts a transducer network and consists of two linear layers, wherein an activation layer is inserted into the two linear layers;
the operation of the whole mode bridge is formed as
Where W q∈RD*hidden、Wk∈Rd*hidden、Wv∈Rd*hidden is the linear projection layer weight for Q, K, V, respectively, defined inside the cross-attention layer.
Furthermore, RGB, MSI, HSI, SAR, near infrared, video and point cloud modal data are adopted, and a single-layer linear classification layer is used as a task head;
For the regression task head on the space-time track and graph mode, a linear layer is used for completing regression prediction, and the inverse scaling operation without learning parameters is added;
for the task header of the text modality, a text decoder of Lynx is used;
For the task head of the oblique photography modality, a three-dimensional reconstruction architecture in the Ada-MVS model is used.
Compared with the prior art, the invention has the following beneficial effects:
A unified multi-mode image data analysis method is provided, and image data of multiple modes are unified into one model, so that the image data of multiple modes can be interpreted by using one model.
The method provides a new solution to the problem that uniformity and independence among multiple modes are difficult to balance from the viewpoint of taking language as a characteristic space reference system, and provides a new thought for the uniformity of the multiple modes.
The possibility and potential of building generic artificial intelligence with large language models as cores is presented to facilitate the transition of fusion of image data from modality-specific, task-specific paradigms to generic paradigms.
Experiments show that under the condition that expert knowledge of most space-time modes is not learned and a unified model structure is used, the invention obtains the accuracy competitive with SOTA on the modes such as RGB, space-time track and the like, and has excellent adaptability on the modes such as MSI, HSI, point cloud and the like.
Drawings
FIG. 1 is a basic roadmap unified with multi-modal image data.
Figure 2 is an overall frame diagram of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings, without limiting the invention in any way, and any alterations or substitutions based on the teachings of the invention are intended to fall within the scope of the invention.
In order to maintain uniformity and independence among modalities, the invention proposes: in the process of sensing and understanding the world, a human needs to receive information of multiple modes such as vision, hearing, touch and the like, a series of concepts (concepts) can be formed in the brain through analysis processing, and an abstract Concept space is formed by the concepts and relations among the concepts, and the Concept space is the basis for understanding the objective world. Concepts themselves are defined by connotations, meaning the attributes or properties that a concept has, and extensions, meaning all examples to which the concept applies. The language plays a key role in the definition process of the concept, on one hand, the language is used as an expression way of the concept, and a specific symbolized expression way is provided for the concept; on the other hand, languages also guide the formation of concepts, and new vocabulary and language structures are actually exploration of new concepts.
Therefore, language is used as a unified reference system of each modal concept, and analysis and reasoning are carried out under the characteristic space of the language. Thus, as shown in fig. 1, in building a multimodal model specifically, each modality data should be encoded as a basic unit (token) with context information, each modality basic unit should be aligned with a basic unit of language, and the parsing process is performed by using a transducer model (the prior art in this field, the transducer model uses Self-attribute mechanism, does not use RNN sequence structure, so that the model can be trained in parallel, and can possess global information) as a backbone network, which is a space-time general intelligent model, and finally, a multimodal characterization is generated.
Under the guidance of the principle and the thought, the invention consists of five modules: a modality encoder (modalencoder), a modality bridge (modal bridge), a text segmenter (text tokenizer), a multimodal big language model (MLLM), a task header (taskhead). The overall architecture is shown in fig. 2.
In order to maintain independence between modalities, the present invention designs a modality-specific encoder for each modality to encode highly heterogeneous respective modality data into a token sequence. However, the token dimensions output by the modal encoder are still inconsistent, and in order to achieve the purpose of enabling token sequences of all modes to be fed into a unified multi-modal large language model, the invention also introduces a modal bridge in a Lynx model (proposed by ByteDance team, reference 'WHAT MATTERS IN TRAINING A GPT-Style Language Model with Multimodal Inputs') which aims at projecting the token of each mode to the dimensions of the multi-modal large language model. The above process may be formalized as:
si=Φ(fi(mi),q) (1)
Where { m RGB,mMSI,mHIS,…,mi,…,mVideo } represents the input of each modality, e.g., m RGB represents an RGB modality, m MSI represents an MSI modality, etc., f i is defined as a modality encoder for the m i modality, Φ represents a modality bridge, q e R N*D represents N D-dimensional learnable vectors defined in the modality bridge, D is set to 4096, i.e., the dimensions of a multi-modal large language model, and input data m i of each modality is mapped to a token sequence s i∈RN*D of the same dimension as the language model.
In order to achieve uniformity among modes, the invention uses a uniform multi-mode large language model to complete analysis of data of each mode. The text word segmentation device and the multi-modal large language model are visual text models Lynx, and specific text prompts are designed for each mode in order to expand the visual text models Lynx to a plurality of image modes, so that the models are guided to accurately analyze information of the modes. In addition, lynx integrates a plurality of lightweight multi-modal adaptation layers inside the model to adapt to multi-modal input, the invention continues the design and does not freeze adaptation layer parameters during training to improve the adaptation capability of the model to other space-time modalities. Finally, the invention notes that a larger gap exists between the analysis result of the model for each mode data and the final task, so that the invention designs a specific task head aiming at each task to improve the generalization of the model on the task.
The entire model M may be formalized as:
M(mi,pi)=Htask(F(si⊕T(pi))) (2)
Where p i represents a text prompt for the m i modality, T represents a text segmenter, the @ represents a concatenation operation of the text token and the modality token on the sequence, F represents a multi-modality large language model, and H task represents a task header.
All tasks in the experiment are supervision tasks, y represents a label, L represents a loss function, θ represents a learnable parameter in a model, and an optimization target of the model can be formed as follows:
the modality encoder aims to encode the raw data of each modality into a token sequence, formalized as: t i=fi(mi), wherein t i∈Rn*d. Different modality encoders are designed for each modality to maintain independence between modalities, as described below:
Text: the Lynx model itself is a language model, so the invention does not additionally design a modal encoder for the Text mode, but directly uses a Text word segmentation device of Lynx, i.e. f (m Text/Code)=T(mText/Code), where m Text/Code={w1,w2,w3, … represents word sequences in Text or code.
RGB: the RGB image is a visible light image, reflects electromagnetic characteristics of ground objects reflecting or emitting electromagnetic waves in a visible light wave band, and is a mode most commonly used in visual field research. The RGB image is a standard three-band image, i.e., m RGB∈RH *W*C. The mode encoder of the mode adopts a visual encoder in a Lynx model: EVA. EVA is a large visual base model consisting of a 40-layer stack of TransformerBlock layers with a width 1408. The invention loads the official weight of the EVA model in the experimental process, so that the EVA model is frozen and does not participate in training.
MSI: multispectral imaging is a mode widely studied in the remote sensing field, and usually, besides three visible light bands including RGB modes, there are multiple non-visible light bands including near infrared, red end, short wave infrared, coastal atmospheric aerosol, and cloud band, so that the number of channels of the imaging is usually more than three bands of RGB imaging, namely m MSI∈RH*W*C, and usually C >3. The invention expands the Patch Embed of the standard VisionTransformer (namely ViT model proposed by Google) model, changes the channel number into the band number of the input multispectral image, and uses the band number as the characteristic encoder of the multispectral mode.
HSI: the number of the wave bands of the hyperspectral image is further increased on the basis of the multispectral image, hundreds of wave bands can be achieved generally, and each wave band contains information rich in ground objects. In contrast to RGB images and hyperspectral images, in which all bands of each pixel are considered as a single sample, i.e. m HSI∈R1*1*C, pictures are taken as a single sample. In a modal encoder of hyperspectral imagery, the invention first extends the feature dimension of each pixel with a linear projection layer, a process that is formalized asWherein W ε R 1*d is the weight matrix of the linear layer projection layer. The present invention then uses a 12-layer standard transducer encoder to extract its features. The whole process can be expressed asSpace-time trajectory: the spatiotemporal trajectory modality reflects the time and space varying information of the object, consisting of a series of two-dimensional coordinate points, i.e., m Trajectory∈Rl*2, where l represents the sequence length of the trajectory points. The encoder of the spatio-temporal trajectory modality inherits the design of TrajFormer (reference "Trajectory Prediction with Local Self-ATTENTIVE CONTEXTS FOR AUTONOMOUS DRIVING"): firstly, expanding the dimension of a two-dimensional track characteristic by using a linear layer, wherein the step can be formed into W.mu.m Trajectory, and W.mu.R d*(l*2) is a weight matrix of a projection layer of the linear layer; the invention then extracts its features using a 2-layer transform encoder. The whole process is formally expressed as f (m Trajectory)=Enc2(W*mTrajectory).
SAR: SAR mode is one of active remote sensing, and reflects electromagnetic characteristics of ground features on backward microwave scattering. Because of the different polarization modes, the final product of the SAR image is actually a 2-band or single-band image, namely m SAR∈RH *W*2, so that the three-layer convolution network is used as a mode encoder of the SAR image.
Near infrared: the near infrared data set selected by the invention is SYSU-MM01, which is actually a multi-mode re-identification data set of visible light and near infrared, and the model needs to process infrared images and visible light images, namely m Infrared={Ir∈RH *W*3,Ii∈RH*W*3. The invention inherits but slightly modifies the design of DDAG (the prior art, namely, the cross-modal pedestrian re-recognition: dynamic Dual-ATTENTIVE AGGREGATION LEARNINGFOR VISIBLE-Infrared Person Re-Identification), discards the design of the DDAG which shares the parameters of the next several layers of ResNet-50, uses two independent parameters ResNet-50 to extract the characteristics of infrared images and visible light images respectively, and the process can be formalized as f (m Infrared)=Enc1resnet50(Ir)⊕Enc2resnet50(Ii).
The figure: the graph is constructed by a series of nodes and edges, and the properties of the points themselves and the adjacent properties of the points reflect most of the characteristics of the graph, m Graph∈RK*d, wherein K is the number of nodes and d is the node characteristic dimension. The modal encoder of the graph in the invention is based on STAEformer(H.Liu et al.,"STAEformer:Spatio-Temporal Adaptive Embedding Makes Vanilla Transformer SOTA for Traffic Forecasting."arXiv,Oct.07,2023.Accessed:Dec.09,2023.[Online].Available:http://arxiv.org/abs/2308.10425), main design ideas that the characteristic dimension is expanded by using a linear layer, and then the characteristics of the points in the graph, the spatial characteristics, the time sequence characteristics and the like are respectively encoded by using a plurality of EmbeddingLayer. EmbeddingLayer is an embedded layer, which is commonly used to map information into vectors; in the NLP field, words are mapped to vectors for processing, and in the space-time prediction field, time information (year, month, day, hour, etc.) is mapped to vectors. The whole process can be formalized as Where W εR hidden*d is the weight matrix of the linear layer.
Oblique photography: the oblique photography modality consists of images of multiple views, namely mvideo={V1∈RC*H*W,V3∈RC*H*W,V3∈RC*H*W,…,Vn∈RC*H*W}, where V i represents the image of each view and n is the number of views. In an oblique photography mode encoder, the present invention uses a standard ViT model (Vision Transformer model by Google) to extract and stitch features of multiple view images, formalized as f(mvideo)=Enc12(V1)⊕Enc12(V2)⊕Enc12(V3)⊕…⊕Enc12(Vn).
Video: the video modality can be seen as a stack of images in the time dimension, i.e. m video∈RT*C*H*W. A significant feature of the video modality is that it is highly redundant in time and space, and to solve this problem, the encoder of the video modality in the present invention sparsely samples data that is highly redundant in time and space based on TubeViT (refer to RETHINKING VIDEO VITS: sparse Video Tubes for Joint lmage and Video Learning) to reduce redundancy. The sparsely sampled features are then fed into a 6-layer transform encoder for further feature extraction, formalized as f (m video)=Enc6(Sparse(mvideo)).
And (3) point cloud: the point cloud is generally composed of three-dimensional coordinate points and characteristic values thereof, m pointCloud∈RK*(d+3), wherein K represents the number of three-dimensional target points, d represents the dimension of the characteristic values of the point cloud, and the dimension reflects the information of the position, the shape, the color, the texture and the like of the ground object in space. The encoder of the Point cloud modality inherits PointBERT (Point-BERT proposed by the university of bloom: point cloud self-attention model pre-training based on mask modeling) design: the point cloud data is first block coded to unify the number of points that are input simultaneously, this step can be expressed as PointGroup = Grouper (m pointCloud)∈RG*N*3 where G represents the number of groups of packets and N represents the number of points in each group; then the invention inputs the grouping result into a one-dimensional convolutional layer, extracts each group of points as a feature vector f Group=Conv1d(PointGroup)∈RG*d, and finally the feature vector of each group is input into Transformer Encoder of a standard 12-layer to extract its global feature.
The modal bridge is based on the Lynx model and aims at carrying out dimension projection from each modal feature to language modal feature. In implementation, the modal bridge consists of a stack of cross-attention layers and feed-forward neural networks.
In the cross attention layer, the invention predefines a learnable query vector Q epsilon R N*D, wherein D is the internal dimension of the language model, N is used as a super parameter, and the invention can be flexibly adjusted to adapt to the input of different modes. And Key and Value of the cross attention layer are the characteristics of each mode output by the mode encoder. The feedforward neural network inherits the classical design in the original transducer and consists of two linear layers with an active layer inserted.
The whole process can be formalized intoWhere W q∈RD*hidden、Wk∈Rd*hidden、Wv∈Rd*hidden is the linear projection layer weight for Q, K, V, respectively, defined inside the cross-attention layer.
In order to extend the visual language multi-modal model to 13 modalities without the intervention of expert knowledge of the modalities, the invention designs one to four specific text templates for each modality to guide the model to correctly analyze the data of each modality. Analysis results: and outputting the classification result of the image for the classification task, outputting the target identification result of the image for the identification task, and outputting the regression prediction result for the regression task.
In the training process, in order to improve the model performance, the invention adopts a strategy of diversifying the promts, namely randomly selecting one promt in each forward process; while at the time of testing, the promt is fixed as the first of all promts for test result stability and reproducibility. Table 2 lists all text prompts for the design of the present invention.
TABLE 2 prompt list
The invention designs specific task heads for different tasks of each mode in order to promote generalization of the model on downstream tasks, and simultaneously, the design principle of the task heads is as simple and light as possible in order to ensure mobility of the modes on different tasks. The detailed summary of the task heads of each modality is shown in table 4.
For the classified task or the downstream task which can be formed into the classified task, the invention uniformly adopts a simple single-layer linear classified layer as a task head, and as the classified task which is implemented on RGB, MSI, SAR, video and point cloud PointCloud modes is a standard classified task, the classified task which is implemented on an HSI mode is a split task, but can be formed into a pixel-by-pixel classified task, and similarly, the classified task can be formed into a re-identified task on an Infinized mode, so that the linear classified layer is used as the task head for all modes.
For the regression tasks on the Traj and Graph modes, the invention also uses a linear layer to complete regression prediction, and the difference between the regression prediction and the classification task is only that one more inverse scaling operation without learning parameters is adopted. Since the Lynx modality itself is a language model, the Text modality directly uses its native Text decoder. The task head of the oblique photography modality is based on a three-dimensional reconstruction architecture in the Ada-MVS model (Ada-MVS model reference article Adaptive region aggregation for MVS MATCHING using deformable convolutional network (2023)).
In order to simplify training and ensure reproducibility of results, the invention adopts similar experimental settings across all modality experiments: the optimizer is AdamW and the learning-rate schedule is cosine annealing; the hyper-parameter settings of different experiments differ only slightly in the number of training epochs and the learning rate. Table 2 summarizes the hyper-parameter settings for each modality experiment.
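A minimal sketch of this shared optimization setup is shown below (assuming PyTorch; the stand-in model, learning rate, weight decay and epoch count are placeholder values, not the per-modality settings of Table 2):

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Linear(4096, 45)        # stand-in for the fusion model's trainable parts
epochs, base_lr = 20, 1e-4               # placeholder values; per-modality settings differ slightly

optimizer = AdamW(model.parameters(), lr=base_lr, weight_decay=0.05)
scheduler = CosineAnnealingLR(optimizer, T_max=epochs)

for epoch in range(epochs):
    # ... run one training epoch over the modality's dataset here ...
    optimizer.step()                     # stand-in for the per-batch updates of a real epoch
    scheduler.step()                     # cosine-annealed learning-rate decay, stepped once per epoch
```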
TABLE 2 Experimental settings
Table 3: multi-modal experimental setup
Table 3 summarizes the data and tasks used for each modality experiment; they are described in detail below, in order:
Text: The IMDB dataset is a binary sentiment analysis dataset containing 50,000 reviews from the Internet Movie Database (IMDB), each labeled positive or negative; the dataset additionally provides partially unlabeled data. The experiments of the invention use only the labeled portion of IMDB for the supervised sentiment classification task.
RGB: NWPU-RESISC45 is a large-scale public dataset for visible-light remote sensing image scene classification. It covers 45 scene categories such as airplane, baseball diamond, beach and commercial area; each category contains 700 remote sensing images, for a total of 31,500 images of size 256 × 256. The official split with 20% of the data as the training set is adopted.
MSI: The EuroSAT dataset is a multispectral land use and land cover (LULC) classification dataset whose samples come from the Sentinel-2 optical satellite and include all 13 bands. The data are divided into 10 categories; the 27,000 images are split into training and test sets at a random 9:1 ratio.
HSI: The Pavia University dataset is a hyperspectral dataset acquired by the ROSIS sensor; it contains 103 bands, has a size of 610 × 340, and covers nine ground-object categories such as Asphalt, Meadows and Gravel. The invention adopts a 4:6 train/test split.
Trajectory: ETH-UCY is one of the most widely used pedestrian trajectory prediction benchmarks and is divided into five subsets: ETH, HOTEL, UNIV, ZARA1 and ZARA2; the experiments of the invention use the ETH subset.
SAR: The MSTAR dataset is a synthetic aperture radar dataset for stationary military target recognition, containing ten classes of military targets in total. The invention uses the Standard Operating Conditions (SOC) dataset preprocessing proposed by S. Chen et al., i.e. the serial numbers and target configurations of the samples are the same in the test and training sets, while the azimuth and depression angles differ.
Infrared: SYSU-MM01 is a visible-infrared cross-modality person re-identification dataset containing images of 491 different people captured by 4 RGB cameras and 2 infrared cameras. The invention adopts the official dataset split: the training set comprises 20,284 RGB images and 9,929 infrared images of 296 people; the test set uses 3,803 infrared images of 96 people as queries and 301 randomly selected RGB images as re-identification targets.
Graph: METR-LA is a traffic dataset collected by loop detectors on Los Angeles highways, covering the period from March 1, 2012 to June 30, 2012; the task is traffic flow prediction.
ObliquePhotography: WHU-OMVS is an oblique photography dataset for three-dimensional reconstruction tasks. The dataset provides images from five viewing angles along with camera parameters and other metadata, and covers six areas; in the experiments of the invention, area 1 is used as the training set and area 2 as the test set.
Video: The UCF101 dataset is a human action recognition dataset comprising 13,320 video clips in 101 classes sourced from YouTube, with a total duration of about 27 hours and a resolution of 320 × 240.
PointCloud: ModelNet40 is a synthetic point cloud dataset containing 12,311 point cloud objects in 40 target classes. The invention uses the official split: 9,843 objects for training and 2,468 for testing.
The performance of the scene classification task is tested on the NWPU-RESISC45 dataset; the test metric is Top-1 accuracy. The invention loads the weights provided by Lynx and therefore possesses expert knowledge of the RGB modality. In Table 4, the invention is compared with the best-performing models currently known on the NWPU-RESISC45 dataset; the results show that the invention outperforms most baseline models and differs from the best result (95.69) by only 0.84, demonstrating its strong perception and interpretation capability for the RGB modality.
Table 4: RGB land cover classification (80% as training set).
Method | Publication | Acc(%) |
---|---|---|
CNN-CapsNet | RS2019 | 89.03 |
DFAGCN | TNNLS2021 | 89.29 |
D-CNN with GoogleNet | TGRS2018 | 90.49 |
D-CNN with VGGNet | TGRS2018 | 91.89 |
SCCov | TNNLS2019 | 92.10 |
SeCo-ResNet-50 | ICCV2021 | 92.91 |
MG-CAP | TIP2020 | 92.95 |
LSENet | TIP2021 | 93.34 |
MSANet | JSTARS2021 | 93.52 |
IDCCP | TGRS2021 | 93.76 |
MBLANet | TIP2021 | 94.66 |
GRMANet-ResNet-50 | TGRS2021 | 94.72 |
EMSNet | TGRS2023 | 95.37 |
ViTAE-B+RVSA | TGRS2022 | 95.69 |
The present invention | ours | 94.85 |
The performance of the scene classification task was tested on the EuroSAT dataset: all 13 bands of an image are input into the model simultaneously, and the model must classify the image into the correct one of 10 categories; the metric is again Top-1 accuracy. The invention has no expert knowledge of the multispectral modality, so the baseline models in Table 5 are divided into two groups: with expert knowledge and without expert knowledge. With expert knowledge means the baseline model is pre-trained on large pre-training datasets such as BigEarthNet and then fine-tuned on the EuroSAT dataset; without expert knowledge means the baseline model is trained from scratch directly on the EuroSAT dataset. The results show that the model of the invention outperforms most models in the group without expert knowledge, trails the best model in that group (ResNet-152) by 2.60, and trails the best result of the group with expert knowledge by 4.75, demonstrating good adaptability of the invention to the multispectral modality.
Table 5: MSI land cover classification.
For the hyperspectral modality, the performance of the pixel classification task is tested on the Pavia University data: the model takes all band values of a single pixel as a sample and predicts the ground-object category of that pixel; the reported metrics are OA, AA and Kappa. Since there is no expert knowledge of the hyperspectral modality, the invention is compared with the semi-supervised baselines summarized by D. Uchaev and D. Uchaev. The results in Table 6 show that the invention outperforms most HSI classification methods such as IFRF and S-DMM, and differs from the best result by 6.42 on the OA metric, indicating good adaptability to the hyperspectral modality.
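For reference, the three reported indices can be computed from a confusion matrix as in the following minimal NumPy sketch (the function name oa_aa_kappa and the example matrix are illustrative assumptions, not values from the experiments):

```python
import numpy as np

def oa_aa_kappa(conf: np.ndarray):
    """Overall accuracy, average (per-class) accuracy and Cohen's kappa from a confusion matrix.

    conf[i, j] counts samples of true class i predicted as class j.
    """
    total = conf.sum()
    oa = np.trace(conf) / total                                     # overall accuracy
    aa = np.mean(np.diag(conf) / conf.sum(axis=1))                  # mean per-class recall
    pe = np.sum(conf.sum(axis=0) * conf.sum(axis=1)) / total ** 2   # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

conf = np.array([[50, 2, 1], [3, 40, 5], [0, 4, 45]])  # toy 3-class example
print(oa_aa_kappa(conf))
```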
Table 6: HSI pixel classification
Method | OA | AA | Kappa |
---|---|---|---|
3D-CNN | 75.24 | 80.26 | 68.34 |
CA-GAN | 76.81 | 76.94 | 71.02 |
3D VS-CNN | 81.63 | 83.86 | 76.46 |
RPNet | 84.92 | 83.26 | 80.52 |
S-DMM | 88.30 | 93.76 | 84.90 |
IFRF | 88.38 | 85.99 | 84.97 |
The present invention | 89.18 | 86.65 | 85.32 |
DCFSL | 90.71 | 90.20 | 87.73 |
TC-GAN | 93.20 | 91.60 | 91.00 |
PRNet-RF | 95.60 | 94.96 | 94.27 |
The invention tests the performance of the classification task on the ModelNet40 dataset; the metric is Top-1 accuracy. In single-modality research on point clouds, the invention observes that, owing to the unique three-dimensional structure of point cloud data, most work focuses on designing specific structures that preserve properties such as permutation invariance and symmetry of the 3D point cloud so as to improve the model's ability to extract three-dimensional features; such structures, designed around modality-specific priors, are difficult to transfer across modalities. In addition, some methods pre-train on large point cloud datasets to obtain general modality expert knowledge and then generalize to specific downstream tasks to improve performance.
The design of the invention is based on a general sequence-to-sequence architecture. As shown in Table 7, without any modality expert architecture or modality expert knowledge, the invention still surpasses classical networks built on point-cloud-specific structures (PointNet, Kd-net), and remains comparable to the best model, PointGPT, which has both a point cloud expert structure and expert knowledge, indicating large application potential in the point cloud modality.
Table 7: PointCloud classification.
Method | Expert Architecture | Expert Knowledge | Acc(%) |
---|---|---|---|
PointNet | Yes | No | 89.2 |
Kd-net | Yes | No | 90.6 |
SPH3D-GCN | Yes | No | 91.4 |
PointNet++ | Yes | No | 91.9 |
SO-Net | Yes | No | 92.5 |
PointVGG | Yes | No | 93.6 |
PointBERT | No | Yes | 93.8 |
PointGPT | Yes | Yes | 94.9 |
The present invention | No | No | 91.2 |
For the spatio-temporal trajectory modality, the invention tests the performance of the trajectory prediction task on the ETH dataset, i.e. using two-dimensional coordinate points over a period of time to predict the next possible two-dimensional trajectory. The invention reports Average Displacement Error (ADE) and Final Displacement Error (FDE). Given the future ground-truth trajectory Ŷ = {ŷ_1, …, ŷ_T} and the predicted trajectory Y = {y_1, …, y_T}, ADE and FDE measure their L2 distances and are calculated as:

ADE = (1/T) Σ_{t=1}^{T} ||y_t − ŷ_t||_2,  FDE = ||y_T − ŷ_T||_2
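A minimal sketch of how ADE and FDE can be computed for a single predicted trajectory is given below (NumPy; the function name ade_fde and the example coordinates are illustrative assumptions):

```python
import numpy as np

def ade_fde(pred: np.ndarray, gt: np.ndarray):
    """ADE: mean L2 error over all predicted steps; FDE: L2 error at the final step.

    pred, gt: arrays of shape (T_pred, 2) holding 2-D coordinates.
    """
    dist = np.linalg.norm(pred - gt, axis=-1)   # per-step L2 distance
    return dist.mean(), dist[-1]

pred = np.array([[1.0, 1.0], [2.0, 2.1], [3.1, 3.0]])
gt = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
print(ade_fde(pred, gt))
```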
In Table 8, the invention is compared with the best trajectory prediction models known at present. The results show that, using a general architecture and without spatio-temporal trajectory expert knowledge, the prediction accuracy is superior to that of most expert models: the gap to the best model on the ADE metric (STAR) is only 0.07, and the gap to the best model on the FDE metric (SocialVAE+FPC) is only 0.11, showing excellent adaptability to the spatio-temporal trajectory modality.
Table 8 Trajectory predictions
Method | ADE/FDE |
---|---|
Social GAN | 0.87/1.62 |
SoPhie | 0.70/1.43 |
STAR | 0.36/0.64 |
SGCN | 0.63/1.03 |
CAGN | 0.41/0.65 |
SIT | 0.39/0.62 |
SocialVAE | 0.47/0.76 |
PCENet | 0.54/0.87 |
AgentFormer | 0.45/0.75 |
MemoNet | 0.40/0.61 |
SocialVAE+FPC | 0.41/0.58 |
TUTR | 0.40/0.61 |
The present invention | 0.43/0.69 |
The adaptability of the invention to the SAR modality is tested on the MSTAR dataset: the model must recognize SAR images of ten military targets, and the metric is Top-1 accuracy. In the experiment, preprocessing of the MSTAR dataset follows the SOC setting in AConvNets, i.e. the serial numbers and target configurations of the samples are the same in the test and training sets, but the azimuth and depression angles differ. Table 9 compares the invention with the best models under this setting. Lacking SAR modality expert knowledge and a spatial-invariance prior under angle changes, the invention performs relatively poorly on the MSTAR dataset, with an accuracy of 88%, close to that of the EMACH method; improving the adaptation to SAR images and the extraction of spatial information is one of the directions for future study.
Table 9 SAR Classification
Method | Accuracy(%) |
---|---|
EMACH | 88 |
SVM | 90 |
AdaBoost | 92 |
MSRC | 93.6 |
IGT | 95 |
MSS | 96.6 |
Cond Gauss | 97 |
M-PMC | 98.8 |
AConvNets | 99.13 |
The present invention | 88 |
The adaptability of the invention to the infrared modality is tested on the SYSU-MM01 dataset. In the experiment the model performs an RGB-infrared cross-modality person re-identification task, i.e. it must accept an infrared image and an RGB image simultaneously and recognize and match the same person across the two modalities. Table 10 compares the invention with the best re-identification methods known at present; the evaluation metric is Rank-20 accuracy. The results show that the invention (76.31%) achieves accuracy similar to HSME (77.95%). Although the invention does not reach the best accuracy, it already surpasses some RGB-Infrared re-identification expert models (TONE, HCML, etc.) while using a general architecture and no modality expert knowledge, showing its potential for processing the infrared modality.
Table 10: visible-Infrared re-identification
Method | Rank-20(All-search) |
---|---|
Two-stream | 65.50 |
One-stream | 66.74 |
TONE | 68.60 |
HCML | 69.17 |
Zero-Pad | 71.33 |
The present invention | 76.31 |
HSME | 77.95 |
BDTR | 81.07 |
DDAG | 95.81 |
For the Graph modality, the invention tests the performance of the traffic flow prediction task on the METR-LA dataset; the evaluation metrics are RMSE, MAE and R2. Table 11 compares the invention with the best results currently known on the METR-LA dataset; the results show that, without any modality expert knowledge, the invention differs from the best result by only 0.47 on the RMSE metric, exhibiting good adaptation capability on the Graph modality.
Table 11 Traffic predictions
Method | RMSE | MAE | R2 |
---|---|---|---|
HI | 6.80 | 14.20 | 10.15 |
GWNet | 3.51 | 7.28 | 9.96 |
DCRNN | 3.54 | 7.47 | 10.32 |
AGCRN | 3.59 | 7.45 | 10.47 |
STGCN | 3.60 | 7.43 | 10.35 |
GTS | 3.59 | 7.44 | 10.25 |
MTGNN | 3.47 | 7.21 | 9.70 |
STNorm | 3.57 | 7.51 | 10.24 |
GMAN | 3.44 | 7.35 | 10.07 |
PDFormer | 3.62 | 7.47 | 10.91 |
STID | 3.55 | 7.55 | 10.95 |
STAEformer | 3.34 | 7.02 | 9.70 |
The present invention | 3.81 | 7.52 | 11.24 |
The natural language processing capability of the invention was tested on the IMDB dataset; the task is to judge whether the sentiment of a text is positive or negative. The invention loads the weights of the multi-modal large language model Lynx, so the model can be regarded as having expert knowledge of the natural language modality. Comparing the invention with the best models known on the IMDB dataset, the results in Table 12 show that the invention surpasses most language models and differs from the best result by only 0.32, demonstrating strong understanding and analysis capability for natural language.
TABLE 12 text understanding
For the Video modality, the performance of the action recognition task is tested on the UCF101 dataset: the model must understand the video and correctly classify it into one of 101 classes; the evaluation metric is Top-1 accuracy. Table 13 compares the invention with the current best baseline models. The invention adapts poorly to the Video modality and shows a large gap from the baseline results; the analysis attributes this mainly to two points: 1. the high information redundancy of video raises the training cost, and the invention trains only 3 epochs on the UCF101 dataset; 2. the proposed model architecture is not flexible enough for three-dimensional data and cannot effectively enhance the capture of temporal information.
TABLE 13 video classification
Method | Accuracy(%) |
---|---|
OPN | 59.6 |
VCOP | 72.4 |
SpeedNet | 81.1 |
VTHCL | 82.1 |
CVRL | 94.4 |
VideoMAE v1 | 96.1 |
VideoMAE v2 | 99.6 |
The present invention | 27.5 |
For the oblique photography modality, the performance of the three-dimensional reconstruction task is tested on the WHU-OMVS dataset: the model receives images of five views as input and outputs a depth map from which a three-dimensional model is reconstructed. The evaluation metric is the Percentage of Accurate Grids in Total (PAG), calculated as:

PAG = N_accurate / N_total × 100%

where N_accurate is the number of grid cells whose reconstruction error is within the given threshold and N_total is the total number of grid cells.
the suffix of the PAG represents different accuracy criteria, e.g. PAG-6 represents an error within 0.6m and PAG-10 represents an error within 1 m.
Table 14 compares the results of the invention with common multi-view three-dimensional reconstruction models on the WHU-OMVS dataset. The accuracy of the invention is far below that of the modality expert models; the analysis suggests two reasons: 1. dense spatial information is gradually lost in the deep structure of the large model, a point also observed in heuristic experiments on segmentation and detection; 2. the model architecture is not flexible enough and, unlike three-dimensional reconstruction expert models such as Ada-MVS, cannot connect intermediate layer features to a task-specific processing structure, which severely limits performance.
Table 14: 3D reconstruction
Method | PAG-6 | PAG-10 |
---|---|---|
MVSNet | 81.15 | 91.44 |
CasMVSNet | 95.45 | 98.02 |
Ada-MVS | 96.14 | 98.10 |
UCSNet | 96.25 | 98.45 |
The present invention | 6.4 | 10.4 |
The word "preferred" is used herein to mean serving as an example, instance, or illustration. Any aspect or design described herein as "preferred" is not necessarily to be construed as advantageous over other aspects or designs. Rather, use of the word "preferred" is intended to present concepts in a concrete fashion. The term "or" as used in this disclosure is intended to mean an inclusive "or" rather than an exclusive "or". That is, unless specified otherwise or clear from the context, "X uses A or B" is intended to include any of the natural inclusive permutations. That is, if X uses A; X uses B; or X uses both A and B, then "X uses A or B" is satisfied in any of the foregoing examples.
Moreover, although the disclosure has been shown and described with respect to one or more implementations, equivalent alterations and modifications will occur to others skilled in the art based upon a reading and understanding of this specification and the annexed drawings. The present disclosure includes all such modifications and alterations and is limited only by the scope of the following claims. In particular regard to the various functions performed by the above described components (e.g., elements, etc.), the terms used to describe such components are intended to correspond, unless otherwise indicated, to any component which performs the specified function of the described component (e.g., that is functionally equivalent), even though not structurally equivalent to the disclosed structure which performs the function in the herein illustrated exemplary implementations of the disclosure. Furthermore, while a particular feature of the disclosure may have been disclosed with respect to only one of several implementations, such feature may be combined with one or more other features of the other implementations as may be desired and advantageous for a given or particular application. Moreover, to the extent that the terms "includes," "has," "contains," or variants thereof are used in either the detailed description or the claims, such terms are intended to be inclusive in a manner similar to the term "comprising."
The functional units in the embodiments of the invention may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated modules may be implemented in hardware or as software functional modules. If implemented as software functional modules and sold or used as a stand-alone product, the integrated modules may also be stored in a computer readable storage medium. The storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like. The above-mentioned devices or systems may perform the methods in the corresponding method embodiments.
In summary, the foregoing embodiments are implementations of the present invention, but the implementation of the present invention is not limited to them; any other changes, modifications, substitutions, combinations and simplifications that do not depart from the spirit and principles of the present invention shall be regarded as equivalent substitutions and are included in the protection scope of the present invention.
Claims (7)
1. A method for processing multi-modal data enhanced by a large language model, comprising the steps of:
Acquiring multi-modal data and inputting the multi-modal data into a fusion model; the fusion model comprises a modal encoder, a modal bridge, a text word segmentation device, a multi-modal large language model and a task head; the multi-modal data includes Text data, RGB data, MSI data, HSI data, trajectory data, SAR data, infrared data, graph data, obliquePhotography data, video data, and PointCloud data;
The modal encoder encodes each modal data into a token sequence; the modal bridge is formed by stacking a cross attention layer and a feedforward neural network, and the dimension mapping from each mode to the language mode token is completed; the multi-modal large language model completes analysis of each modal data; for the gap between the analysis result of each modal data and the final task, a specific task head is used for each task to promote generalization of the model on the task;
Performing visible light remote sensing image scene classification tasks on the RGB data; labeling data for Text data to perform emotion classification tasks; performing a multispectral modal classification task on MSI data; performing a ground object category classification task on the HSI data; carrying out pedestrian track prediction tasks on Trajectory data; performing military stationary target recognition tasks on SAR data; performing a human re-recognition task on the Infrared data; carrying out traffic flow prediction tasks on Graph data; performing a three-dimensional reconstruction task on ObliquePhotography data; performing a human action recognition task on the Video data; performing a point cloud object classification task on PointCloud data;
and outputting an analysis result.
2. The large language model enhanced multi-modal data processing method of claim 1, wherein the text segmenter and the multi-modal large language model are the visual-text model Lynx; during training, the lightweight multi-modal adaptation layer integrated in Lynx is used to adapt to multi-modal inputs, and the parameters of the adaptation layer are not frozen during training, so as to improve the adaptation capability of the model to images of the various modalities.
3. The method for processing multi-modal data enhanced by a large language model according to claim 2, wherein, for the purpose of enabling a token sequence of each modality to be fed into a unified multi-modal large language model, a modality bridge in the Lynx model is used to project a token of each modality to a dimension of the multi-modal large language model, and the above procedure is formed as:
s_i = Φ(f_i(m_i), q)
Wherein {m_RGB, m_MSI, m_HSI, …, m_i, …, m_Video} represents the input of each modality, m_RGB represents the RGB modality, m_MSI the MSI modality, and m_Video the video modality; f_i represents the modality encoder for modality m_i; Φ represents the modality bridge; q ∈ R^{N×D} denotes N learnable D-dimensional vectors defined in the modality bridge, with D set to 4096, i.e. the dimension of the multi-modal large language model; the input data m_i of each modality is mapped into a token sequence s_i ∈ R^{N×D} with the same dimension as the language model;
The whole fusion model M is formed as:
M(m_i) = H_task(F(T(p_i) ⊕ s_i))
where p_i represents the text prompt for the m_i modality, T represents the text segmenter, ⊕ represents the concatenation of the text tokens and the modal tokens along the sequence, F represents the multi-modal large language model, and H_task represents the task head.
4. A method of processing multi-modal data enhanced by a large language model according to claim 3, wherein the tasks in the fusion model are supervised tasks; y represents the label, L represents the loss function, and θ represents the learnable parameters in the model; the optimization objective of the fusion model is formed as:
θ* = argmin_θ L(M(m_i; θ), y)
5. the large language model enhanced multi-modal data processing method according to claim 4, wherein a different modality encoder is used for each modality data to maintain independence between modalities:
For Text modality data, the Text segmenter of the Lynx model is used, i.e. f(m_Text/Code) = T(m_Text/Code), where m_Text/Code = {w_1, w_2, w_3, …} represents the word sequence in the text or code;
For RGB modality data, the visual encoder EVA in the Lynx model is adopted; EVA is a visual foundation model composed of 40 stacked Transformer blocks with a width of 1408;
For MSI modality data, the Patch Embedding of the ViT model is extended so that the number of input channels equals the number of bands of the input multispectral image;
For HSI modality data, the feature dimension of each pixel is first expanded with a linear projection layer, a process formalized as W·m_HSI, where W ∈ R^{1×d} is the weight matrix of the linear projection layer; the features are then extracted with a 12-layer standard Transformer encoder, the whole process being denoted f(m_HSI) = Enc_12(W·m_HSI);
For the Trajectory modality, which reflects the change of a target in time and space and consists of a series of two-dimensional coordinate points, i.e. m_Trajectory ∈ R^{l×2}, where l denotes the length of the trajectory point sequence, the encoder of the spatio-temporal trajectory modality first uses a linear layer to expand the dimension of the two-dimensional trajectory features, a step formed as W·m_Trajectory, where W ∈ R^{d×(l×2)} is the weight matrix of the linear projection layer; the features are then extracted with a 2-layer Transformer encoder, and the whole process is formally expressed as f(m_Trajectory) = Enc_2(W·m_Trajectory);
For SAR modality data, a three-layer convolutional network is used as the modality encoder;
For Infrared data, the modality encoder processes both infrared and visible images, i.e. m_Infrared = {I_r ∈ R^{H×W×3}, I_i ∈ R^{H×W×3}}; it inherits the design of cross-modality pedestrian re-identification but discards the scheme in which the last few layers of a ResNet-50 are shared, and instead uses two ResNet-50 networks with independent parameters to extract the features of the infrared and visible images respectively, a process formally represented as f(m_Infrared) = {ResNet50_1(I_r), ResNet50_2(I_i)};
For Graph modality data, m_Graph ∈ R^{K×d}, where K is the number of nodes and d is the node feature dimension; the modality encoder first uses a linear layer to expand the feature dimension, where W ∈ R^{hidden×d} is the weight matrix of the linear layer, and then uses several embedding layers to encode, respectively, the point features, the spatial features and the temporal features of the graph, where Emb_node is the point-feature embedding, Emb_spatial is the spatial embedding, and Emb_time is the temporal embedding;
For the ObliquePhotography modality, m_ObliquePhotography = {V_1 ∈ R^{C×H×W}, V_2 ∈ R^{C×H×W}, V_3 ∈ R^{C×H×W}, …, V_n ∈ R^{C×H×W}}, where V_i represents the image of each view, i = 1, 2, …, n, and n is the number of views; the modality encoder uses a ViT model to extract the features of the multiple view images and concatenates them, formally represented as f(m_ObliquePhotography) = Concat(Enc_12(V_1), Enc_12(V_2), …, Enc_12(V_n)), where Enc_12 denotes a 12-layer encoder;
For the Video modality, m_Video ∈ R^{T×C×H×W}; based on TubeViT, the modality encoder performs sparse sampling on this temporally and spatially redundant data to reduce redundancy, and the sparsely sampled features are then fed into a 6-layer Transformer encoder for further feature extraction, expressed as f(m_Video) = Enc_6(Sparse(m_Video)), where Sparse denotes sparse sampling and Enc_6 denotes a 6-layer encoder;
For PointCloud data, m_PointCloud ∈ R^{K×(d+3)} reflects the position, shape, color and texture information of objects in space, where K represents the number of three-dimensional target points and d represents the dimension of the point cloud feature values; the modality encoder inherits the PointBERT model: the point cloud data are first grouped to unify the number of points input at a time, a step expressed as PointGroup = Grouper(m_PointCloud) ∈ R^{G×N×3}, where G denotes the number of groups and N the number of points in each group; the grouping result is then fed into a one-dimensional convolution layer, which extracts each group of points as a feature vector f_Group = Conv1d(PointGroup) ∈ R^{G×d}, where Conv1d denotes a 1-D convolution layer; finally, the feature vectors of the groups are input into a standard 12-layer Transformer to extract global features, the whole process being formalized as f(m_PointCloud) = Enc_12(Conv1d(Grouper(m_PointCloud))), where Enc_12 denotes a 12-layer encoder.
6. The method for multi-modal data processing with large language model enhancement as claimed in claim 5, wherein in the cross-attention layer, a learnable query vector Q ∈ R^{N×D} is predefined, where D is the internal dimension of the language model and N is a hyper-parameter; the Key and Value of the cross-attention layer are the modality features output by the modality encoder;
The feed-forward neural network follows the Transformer design and consists of two linear layers with an activation layer inserted between them;
the operation of the whole modality bridge is formed as Φ(K, Q) = FFN(softmax(QW_q·(KW_k)^T/√hidden)·KW_v), where K denotes the modal features output by the modality encoder (serving as both Key and Value), and W_q ∈ R^{D×hidden}, W_k ∈ R^{d×hidden}, W_v ∈ R^{d×hidden} are the linear projection layer weights for Q, K and V, respectively, defined inside the cross-attention layer.
7. The method for processing multi-modal data enhanced by a large language model according to claim 6, wherein for RGB, MSI, HSI, SAR, Infrared, Video and PointCloud data, a single-layer linear classification layer is used as the task head;
For the regression task heads on the Trajectory and Graph modalities, a linear layer is used to complete the regression prediction, with an additional inverse scaling operation that has no learnable parameters;
for the task head of the Text mode, a Text decoder of Lynx is used;
For the mission-head of ObliquePhotography modalities, a three-dimensional reconstruction architecture in the Ada-MVS model is used.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410282739.XA CN118070227A (en) | 2024-03-13 | 2024-03-13 | Multi-mode data processing method enhanced by large language model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410282739.XA CN118070227A (en) | 2024-03-13 | 2024-03-13 | Multi-mode data processing method enhanced by large language model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118070227A true CN118070227A (en) | 2024-05-24 |
Family
ID=91098886
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410282739.XA Pending CN118070227A (en) | 2024-03-13 | 2024-03-13 | Multi-mode data processing method enhanced by large language model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118070227A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118469025A (en) * | 2024-07-15 | 2024-08-09 | 大连理工大学 | Modality expansion method of pre-training multi-modality language reasoning model based on continuous learning and transfer learning |
CN118469025B (en) * | 2024-07-15 | 2024-09-10 | 大连理工大学 | Modality expansion method of pre-training multi-modality language reasoning model based on continuous learning and transfer learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |