CN114863132A - Method, system, equipment and storage medium for modeling and capturing image spatial domain information - Google Patents


Info

Publication number
CN114863132A
Authority
CN
China
Prior art keywords
layer
information
tensor
modeling
dimensional tensor
Prior art date
Legal status
Pending
Application number
CN202210609728.9A
Other languages
Chinese (zh)
Inventor
郝艳宾
王志才
王硕
何向南
谢发权
Current Assignee
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a method, system, equipment and storage medium for modeling and capturing image spatial domain information. Aiming at the problem that multi-layer perceptron network models at the present stage are inefficient at processing image spatial domain information, a grouped spatial domain information fusion mode realized based on position coding is innovatively designed, which effectively improves the performance of the baseline model while remarkably reducing the model parameters and introducing only a little computational overhead. Moreover, the application of generalized quadratic position encoding in perceptron networks is proposed for the first time, improving model performance from the perspective of simultaneously realizing global/local feature modeling. Finally, a hierarchically connected network framework based on feature windowing is designed, further improving the performance of the perceptron network and making it comparable to, or even better than, networks based on convolution and self-attention models.

Description

Method, system, equipment and storage medium for modeling and capturing image spatial domain information
Technical Field
The invention relates to the technical field of computer vision, in particular to a method, a system, equipment and a storage medium for modeling and capturing image spatial domain information.
Background
Picture feature extraction, as important low-level work in the multimedia era, is an important component of real application scenarios (such as target detection and semantic segmentation). For processing picture data, after Convolutional Neural Networks (CNNs) based on the convolution operation and Transformer networks based on self-attention, research on multi-layer perceptron networks (MLPs) based on linear fully-connected layers has also begun. Compared with the former two, a perceptron network has the advantages of a simple network structure and fast training speed at the same complexity; however, because the model lacks prior knowledge, perceptron networks often suffer from insufficient data and poor performance at the same model complexity. This problem is largely due to the overly simple way MLPs model spatial information.
The problems that exist for a multi-layer perceptron network and its variants can be summarized as follows:
1) Most networks based on linear-layer modeling have a large number of randomly initialized parameters, and spatial domain information is captured through training and learning on massive data; when the amount of data is insufficient, the heavy parameter count easily causes the model to overfit and thus degrades its performance.
2) Models based on spatial rotation operations introduce no extra space complexity and bring a certain degree of spatial prior to the model, but such prior information is often insufficient, which shows in the problems that model performance improves only to a limited extent and the overall parameter count remains too heavy.
3) The original fully-connected layer structure can effectively capture the global information of the picture input, but lacks the ability to model local and global information simultaneously: the capture of local information depends entirely on what is acquired through model training, and the aggregation of local and global information cannot be realized within the same layer. Shallow networks that require local feature extraction often lack local modeling capability when the amount of data is limited. Operations based on spatial rotation can achieve efficient capture of local information at a specific layer by adjusting hyper-parameters, but global information is obtained only by stacking multiple local layers.
In general, how to balance visual-task performance and complexity overhead for perceptron networks has yet to be optimized.
Disclosure of Invention
The invention aims to provide a method, a system, equipment and a storage medium for modeling and capturing image spatial domain information, which can optimize the modeling mode of the spatial domain information, reduce the computational complexity, and obviously improve the performance of subsequent applications of the extracted image features.
The purpose of the invention is realized by the following technical scheme:
a modeling and capturing method of image spatial domain information comprises the following steps:
the method comprises the steps of performing down-sampling on an input original image to obtain an original three-dimensional tensor and performing windowing operation;
inputting the three-dimensional tensor in the form of a window to a network platform designed based on characteristic windowing, and modeling and capturing spatial information by the network platform in a grouped spatial information fusion mode realized based on position coding to obtain a data characteristic tensor of an original image;
the network platform is of a pyramid-type hierarchically connected frame structure, each level comprising a plurality of sequentially connected single-layer networks; each single-layer network is provided with a basic model layer designed based on position coding and a gating function; the basic model layer groups the input information, takes one group as spatial domain information and performs spatial domain information aggregation modeling using a quadratic position encoding method to obtain a fused feature, which then achieves feature strengthening with the other group through the gating function; each level outputs a strengthened three-dimensional tensor, which after down-sampling serves as the input of the next level, and the output of the last level is the data feature tensor of the original image.
A system for modeling and capturing spatial information of an image, the system comprising:
the original image downsampling unit is used for downsampling the input original image to obtain an original three-dimensional tensor and perform windowing operation;
the network platform based on characteristic windowing design inputs a three-dimensional tensor in a window form output by a down-sampling unit of an original image, and adopts a grouped spatial domain information fusion mode based on position coding to model and capture spatial domain information so as to obtain a data characteristic tensor of the original image;
the network platform is of a pyramid-type hierarchically connected frame structure, each level comprising a plurality of sequentially connected single-layer networks; each single-layer network is provided with a basic model layer designed based on position coding and a gating function; the basic model layer groups the input information, takes one group as spatial domain information and performs spatial domain information aggregation modeling using a quadratic position encoding method to obtain a fused feature, which then achieves feature strengthening with the other group through the gating function; each level outputs a strengthened three-dimensional tensor, which after down-sampling serves as the input of the next level, and the output of the last level is the data feature tensor of the original image.
A processing device, comprising: one or more processors; a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the aforementioned methods.
A readable storage medium, storing a computer program which, when executed by a processor, implements the aforementioned method.
It can be seen from the technical scheme provided by the invention that: 1) aiming at the problem that multi-layer perceptron network models at the present stage are inefficient at processing picture spatial domain information, a grouped spatial domain information fusion mode realized based on position coding is innovatively designed, which effectively improves the performance of the baseline model, obviously reduces the model parameters, and introduces only a little computational overhead; 2) the application of generalized quadratic position encoding in perceptron networks is proposed for the first time, improving model performance from the perspective of simultaneously realizing global/local feature modeling; and finally, a hierarchically connected network framework based on feature windowing is adopted in the design, further improving the performance of the perceptron network and making it comparable to, or even better than, networks based on convolution and self-attention models.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on the drawings without creative efforts.
Fig. 1 is a flowchart of a method for modeling and capturing spatial domain information of an image according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a perceptron network visual model according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a first downsampling unit according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a second downsampling unit according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a PosMLP layer according to an embodiment of the present invention;
FIG. 6 is a flowchart of a quadratic position encoding method according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of various structures for verifying the utility of PEG unit structures provided by embodiments of the present invention;
FIG. 8 is a schematic diagram of a system for modeling and capturing spatial domain information of an image according to an embodiment of the present invention;
fig. 9 is a schematic diagram of a processing apparatus according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
The terms that may be used herein are first described as follows:
the terms "comprising," "including," "containing," "having," or other similar terms of meaning should be construed as non-exclusive inclusions. For example: including a feature (e.g., material, component, ingredient, carrier, formulation, material, dimension, part, component, mechanism, device, process, procedure, method, reaction condition, processing condition, parameter, algorithm, signal, data, product, or article of manufacture), is to be construed as including not only the particular feature explicitly listed but also other features not explicitly listed as such which are known in the art.
The following describes a method, a system, a device and a storage medium for modeling and capturing spatial domain information of an image provided by the present invention in detail. Details which are not described in detail in the embodiments of the invention belong to the prior art which is known to the person skilled in the art. Those not specifically mentioned in the examples of the present invention were carried out according to the conventional conditions in the art or conditions suggested by the manufacturer. The reagents or instruments used in the examples of the present invention are not specified by manufacturers, and are all conventional products available by commercial purchase.
Example one
The embodiment of the invention provides a modeling and capturing method of image spatial domain information, as shown in figure 1, which mainly comprises the following steps:
step 1, down-sampling an input original image to obtain an original three-dimensional tensor and performing windowing operation.
And 2, inputting the three-dimensional tensor in the window form to a network platform based on characteristic windowing design, and modeling and capturing spatial information by the network platform in a grouped spatial information fusion mode based on position coding to obtain a data characteristic tensor of an original image.
In the embodiment of the invention, the network platform is a pyramid-type hierarchically connected frame structure; each level comprises a plurality of sequentially connected single-layer networks, and each single-layer network is provided with a basic model layer designed based on position coding and a gating function. In order to extract spatial feature information, the basic model layer groups the input information, takes one group as spatial domain information, and performs spatial domain information aggregation modeling with a quadratic position encoding method to obtain a fused feature, which undergoes feature enhancement with the other group through the gating function. The input of the first level is the window-form three-dimensional tensor obtained after the windowing operation of step 1; within each level, the processed window-form three-dimensional tensors are merged as the strengthened three-dimensional tensor output by that level, which after down-sampling serves as the input of the next level, and the output of the last level is the data feature tensor of the original image.
In the embodiment of the invention, the single-layer network is called a PosMLP layer, and the basic model layer is called a PEG unit.
In the embodiment of the invention, the PEG unit is a brand-new spatial-domain modeling module designed based on a position coding mode and a fully-connected layer, with channel-grouped computation added; it is verified to have good local and global feature modeling capability while introducing essentially no extra parameters or computational complexity. In the internal computation of the PEG unit, a spatial action layer realized based on two-dimensional spatial position coding acts on the spatial dimension of the window-input three-dimensional tensor, realizing efficient spatial domain information aggregation modeling, while the gating function in the PEG realizes the capture of high-order signals and feature enhancement.
In the embodiment of the invention, the natural down-sampling advantage of the convolution operation is utilized, so that down-sampling units connect the different levels of the network platform; compared with the prior-art down-sampling realized with fully-connected layers, the invention achieves superior performance.
The scheme of the embodiment of the invention is a new perceptron network visual model (short for perceptron network visual model), which is a visual model with low model complexity and excellent performance and can effectively extract image characteristics. The perceptron network visual model is a general model, and the extracted image features can be widely applied to tasks such as target detection and scene segmentation, and the performance of an image classification model in the corresponding task can be remarkably improved at a low cost.
When the scheme of the embodiment of the invention is applied to tasks such as target detection, scene segmentation and the like, the perceptron network visual model can be used as a universal visual backbone network of an image classification model in a corresponding task to perform feature extraction, the extracted features are the data feature tensor of the original image obtained in the step 2, the data feature tensor is flattened into a one-dimensional vector and then is sent to a linear classifier of the image classification model, and an image classification result can be obtained. In summary, the perceptron network visual model provided by the present invention can be used as a universal visual backbone network, and has both the excellent performance and the possibility of expanding the downstream task, and the processes involved in the subsequent applications can be realized by the conventional techniques, so that the details are not repeated.
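A minimal sketch of this final classification step (flatten the data feature tensor into a one-dimensional vector, then apply the linear classifier); the feature shape and class count below are illustrative assumptions, not values from the patent:

```python
import numpy as np

rng = np.random.default_rng(0)

# Final-level feature tensor: assumed 7x7 spatial grid with 384 channels.
features = rng.standard_normal((7, 7, 384))

num_classes = 10                                   # illustrative class count
W = rng.standard_normal((features.size, num_classes)) * 0.01
b = np.zeros(num_classes)

x = features.reshape(-1)      # flatten the data feature tensor
logits = x @ W + b            # linear classifier head
pred = int(np.argmax(logits)) # image classification result
print(logits.shape)
```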
For ease of understanding, further description of various portions of the invention and possible subsequent applications are provided below.
One, integral model structure.
Fig. 2 shows a perceptron network visual model, with information flow direction from left to right. The left Image represents an input original Image, and 3 × H × W represents the size and the number of channels of the original Image; ConvPE denotes a down-sampling unit (including a multilayer convolution layer and a normalization layer) that performs down-sampling processing on an input original image. The middle dashed box shows four levels inside the network platform, each level comprises a plurality of PosMLP layers, the lower right corner of the dashed box represents the number of PosMLP layers in the corresponding level, and ConvPM between levels represents a down-sampling unit (comprising a single-layer packet convolution and normalization layer) for connecting different levels; head on the far right represents the linear Head layer of the linear classifier; the upper left corner of each dotted line frame shows the size (resolution) of the input three-dimensional tensor, and the top inside the dotted line frame shows the relevant parameters of the window-form three-dimensional tensor in the corresponding level. The number of levels shown in fig. 2, the size of the input three-dimensional tensor of each level, and the parameters related to the window-form three-dimensional tensor are examples and are not intended to be limiting.
And II, working principle of each part of the model.
1. Down-sampling units.
The embodiment of the present invention relates to two kinds of down-sampling units. The first is the down-sampling unit at the front end of the model, i.e., ConvPE in fig. 2. The second is the down-sampling unit that connects different levels inside the network platform, i.e., the three ConvPMs in fig. 2.
1) A first downsampling unit (original image downsampling unit).
The input original image is a C-channel image of size H × W, denoted H × W × C, where H and W respectively represent the height and width of the original image. Illustratively, if the original image is an RGB image, the number of channels C is 3. During training, data enhancement is performed in advance for richer visual scenes by common means such as image flipping, random cropping, Mixup, CutMix, and RandAugment.
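As an illustration of one of these techniques, the following is a minimal Mixup sketch (the Beta-distribution mixing coefficient follows common practice; the α value is an assumption, not taken from the patent):

```python
import numpy as np

def mixup(img_a, img_b, label_a, label_b, alpha=0.8, rng=None):
    """Blend two training samples; labels are one-hot vectors.

    Minimal sketch of the Mixup augmentation mentioned above; alpha=0.8
    and the Beta-distribution sampling are common-practice assumptions."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)                 # mixing coefficient in [0, 1]
    img = lam * img_a + (1.0 - lam) * img_b      # blended image
    label = lam * label_a + (1.0 - lam) * label_b  # soft label
    return img, label, lam

rng = np.random.default_rng(0)
a, b = rng.random((224, 224, 3)), rng.random((224, 224, 3))
ya, yb = np.eye(10)[3], np.eye(10)[7]
img, label, lam = mixup(a, b, ya, yb, rng=rng)
```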
Before entering each stage of the model, down-sampling and channel expansion are performed. For example, on entry to the first stage, the multi-layer convolutional layers reduce the input to a resolution of (H/K) × (W/K), where K denotes the down-sampling scale. For example, K may be set to 4, giving an original three-dimensional tensor resolution of (H/4) × (W/4).
As shown in fig. 3, the internal structure of the first downsampling unit includes three convolutional layers, two BN (batch normalization) layers and two GELU activation layers. Conv2D denotes a two-dimensional convolution; (3,3) and (1,1) denote the convolution kernel sizes of the corresponding convolutional layers, and the accompanying number denotes the convolution stride.
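As a rough illustration of this structure, the sketch below stacks three naive convolutions with BN and GELU in NumPy. The channel widths, the placement of the stride-2 steps (chosen so the overall scale is K = 4), and the small input size are illustrative assumptions, since the figure itself is not reproduced here:

```python
import numpy as np

def conv2d(x, w, stride=1, pad=1):
    """Naive 2D convolution: x is (H, W, Cin), w is (kh, kw, Cin, Cout)."""
    kh, kw, cin, cout = w.shape
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    H = (xp.shape[0] - kh) // stride + 1
    W = (xp.shape[1] - kw) // stride + 1
    out = np.empty((H, W, cout))
    for i in range(H):
        for j in range(W):
            patch = xp[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.tensordot(patch, w, axes=([0, 1, 2], [0, 1, 2]))
    return out

def bn(x):    # batch-norm stand-in: per-channel standardization
    return (x - x.mean((0, 1))) / (x.std((0, 1)) + 1e-5)

def gelu(x):  # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2/np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(0)
img = rng.standard_normal((16, 16, 3))          # small stand-in for 224x224
w1 = rng.standard_normal((3, 3, 3, 8)) * 0.1    # (3,3) conv, stride 2: H -> H/2
w2 = rng.standard_normal((3, 3, 8, 8)) * 0.1    # (3,3) conv, stride 2: H/2 -> H/4
w3 = rng.standard_normal((1, 1, 8, 8)) * 0.1    # (1,1) conv, stride 1
y = gelu(bn(conv2d(img, w1, stride=2)))
y = gelu(bn(conv2d(y, w2, stride=2)))
y = conv2d(y, w3, stride=1, pad=0)
print(y.shape)   # spatial size reduced by the overall scale K = 4
```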
2) A second downsampling unit.
In order to strengthen the retention of local information while linking the levels of the network platform, a second down-sampling unit is designed. Fig. 4 shows the structure of the second down-sampling unit: compared with a simple fully-connected-layer down-sampling mode, the present invention uses a single-layer grouped convolution and a normalization (layer normalization) layer for down-sampling, which has fewer parameters and lower computational complexity while strengthening the modeling of local information during down-sampling.
In fig. 4, Depconv2D represents a two-dimensional grouped (depthwise) convolution, (3,3) represents the convolution kernel size, 2 represents the convolution stride, and the dimension of LayerNorm may be set to 2C.
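A minimal NumPy sketch of this unit: a stride-2 depthwise (grouped) convolution followed by layer normalization. The channel count is illustrative, and the channel expansion to 2C performed between levels is omitted for brevity:

```python
import numpy as np

def depthwise_conv2d(x, w, stride=2, pad=1):
    """Grouped (depthwise) conv: x is (H, W, C), w is (kh, kw, C) -- one
    filter per channel, so parameters scale with C rather than C*C."""
    kh, kw, c = w.shape
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    H = (xp.shape[0] - kh) // stride + 1
    W = (xp.shape[1] - kw) // stride + 1
    out = np.empty((H, W, c))
    for i in range(H):
        for j in range(W):
            patch = xp[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = (patch * w).sum(axis=(0, 1))
    return out

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
feat = rng.standard_normal((28, 28, 64))        # illustrative level feature map
w = rng.standard_normal((3, 3, 64)) * 0.1       # (3,3) kernel, one per channel
out = layer_norm(depthwise_conv2d(feat, w))     # stride 2 halves the resolution
print(out.shape)
```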
After down-sampling, the two down-sampling units perform the windowing operation, converting the input three-dimensional tensor into an M × N × C form, where M denotes the number of windows, N the number of pixels inside a window, and C the number of channels. For example, assuming the network platform includes four levels, the input original image is 224 × 224, and the window size of the first level is 14 × 14, the number of windows of the first level is M = (224 ÷ 4 ÷ 14)² = 4² = 16. The introduction of windows significantly reduces the number of parameters and the computational complexity. To ensure that as large a window size as possible is used (for modeling global information), 14 is adopted as the window size, except in the last level, where the feature resolution itself is reduced to 7 and the window size used is likewise 7; the numbers of windows in the four levels are therefore 16, 4, 1 and 1. In fig. 2, the number of pixels inside a window at different levels is denoted by the symbol N with different corner marks.
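The windowing operation itself can be sketched as follows, assuming the first-level feature map of 56 × 56 (224 ÷ 4) with an assumed channel count of 96 and window size 14:

```python
import numpy as np

def window_partition(x, win):
    """Split an (H, W, C) feature map into (M, N, C): M windows of N pixels."""
    H, W, C = x.shape
    assert H % win == 0 and W % win == 0
    x = x.reshape(H // win, win, W // win, win, C)
    x = x.transpose(0, 2, 1, 3, 4)      # bring the two window-grid axes together
    return x.reshape(-1, win * win, C)  # (M, N, C)

feat = np.zeros((56, 56, 96))   # 224/4 = 56 after the first downsampling
wins = window_partition(feat, 14)
print(wins.shape)               # M = (56/14)**2 = 16 windows, N = 14*14 = 196
```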
2. The internal working principle of each layer.
1) Processing the window-form three-dimensional tensor.
In the embodiment of the invention, a hierarchically connected framework designed based on feature windowing is utilized: at the front end of each level, with its different feature resolution, the corresponding down-sampling unit performs the windowing operation on the data to convert it into window-form three-dimensional tensors. As described above, before input to each level, the down-sampling unit at the front end of that level performs the windowing operation: the first level of the network platform windows through the first down-sampling unit (i.e., ConvPE), and the remaining levels window through the second down-sampling unit (i.e., ConvPM). Inside each level, every PosMLP layer processes the window-form three-dimensional tensors separately; the PosMLP processing principle described later is introduced for the three-dimensional tensor data of a single window. After the last PosMLP layer has finished, the processed window-form three-dimensional tensors are merged as the strengthened three-dimensional tensor output by the level, which is down-sampled and windowed by the ConvPM at the rear end and then input to the next level; or, if it is the last level, the output strengthened three-dimensional tensor is the data feature tensor of the original image. Different windows within each level share parameters.
2) Spatial domain information modeling realized based on grouped position coding.
In the embodiment of the invention, inspired by the spatial position coding technique of the Transformer model, the core spatial modeling unit of the model (i.e., the PEG unit) is obtained by adapting that technique to the perceptron network scenario.
The PEG unit is located in the middle of the PosMLP layer and acts between the two Channel fully-connected layers (Channel FC): the PEG unit is embedded after channel expansion to realize spatial information interaction and fusion, and the second Channel fully-connected layer then compresses the features back to the original input channel size. As shown in fig. 5, the PosMLP layer includes, arranged in sequence: a normalization layer (LayerNorm), a first Channel fully-connected layer (Channel FC), an activation function (GELU) layer, a PEG Unit, and a second Channel fully-connected layer;
the principle of the main process is as follows:
1) Feature expansion.
The input of each PosMLP layer in each level is a window-form three-dimensional tensor X ∈ ℝ^{M×N×C}, where ℝ denotes the real number set, M the number of windows, N the number of pixels inside a window, and C the number of channels.

In the embodiment of the invention, the window-form three-dimensional tensor X is processed by the normalization layer and then expanded from C channels to γC channels by the first channel fully-connected layer, yielding a feature tensor with rich feature expression, denoted X′ ∈ ℝ^{M×N×γC}, where γ denotes the expansion coefficient (for example, set to 6 in the embodiment of the invention). The activation function layer then applies nonlinear activation to the feature tensor, and the activated tensor is input to the PEG unit at the rear end.
As shown in fig. 5, the present invention further introduces a residual mechanism: a copy of the three-dimensional tensor X is pre-stored in the cache and used as a residual connection, enabling a deeper network design.
2) The working principle of the PEG unit.
The PEG unit groups the input information: the input, denoted X′, is divided evenly into two groups along the channel dimension, denoted X′₁ and X′₂, with X′₁, X′₂ ∈ ℝ^(M×N×γC/2).

X′₁ is taken as the spatial-domain information: spatial-domain information aggregation modeling with the quadratic position encoding method (GGQPE) yields a fused feature, and feature enhancement is then realized with X′₂ through a gating function, expressed as:

PEG(X′) = GGQPE(X′₁) ⊙ X′₂

wherein GGQPE denotes the operation function of the quadratic position encoding method, and the symbol ⊙ denotes the gating operation implemented by the gating function.
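The channel split and gating of the PEG unit can be written compactly as follows (an assumed NumPy form; `spatial_agg` is a stand-in for the GGQPE operation, which is detailed later):

```python
import numpy as np

def peg(X_prime, spatial_agg):
    # Split evenly into two groups along the channel dimension.
    X1, X2 = np.split(X_prime, 2, axis=-1)
    # Aggregate spatial information on X1, then gate with X2 element-wise.
    return spatial_agg(X1) * X2

X_prime = np.random.randn(2, 16, 12)           # (M, N, gamma*C) with gamma*C = 12
out = peg(X_prime, spatial_agg=lambda x: x)    # identity stand-in for GGQPE
```

The output has half the channels of the input, which is why a compression step is still needed afterwards.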
As shown in fig. 6, which illustrates the principle of the GGQPE operation function (with M = 1 and d = γC in fig. 6), the GGQPE method imposes a very strong positional prior, which is precomputed and stored in the cache rather than kept as a learnable parametric form. First, the GQPE linear mapping matrix is defined as W_gqpe, whose elements are given by the two vectors v and r_δ as follows:

(W_gqpe)_{i,j} = Softmax_j(vᵀ r_δ),  δ = p_j − p_i
wherein i and j respectively denote pixel i and pixel j, ᵀ is the transpose symbol, and Softmax_j represents the Softmax calculation performed over the dimension in which pixel j is located. The offset projection vector v ∈ ℝ⁵ is derived from a learnable attention center Δ ∈ ℝ² and a covariance matrix Σ ∈ ℝ^(2×2); p_i and p_j respectively represent the spatial positions of pixel i and pixel j, and δ = p_j − p_i is their relative position. From a predefined positional prior tensor r, the vector r_δ is obtained by indexing with δ. The specific expressions of the two vectors are as follows:

v = ((Σ⁻¹Δ)₁, (Σ⁻¹Δ)₂, −½(Σ⁻¹)₁,₁, −½(Σ⁻¹)₂,₂, −(Σ⁻¹)₁,₂)ᵀ

r_δ = (δ₁, δ₂, δ₁², δ₂², δ₁δ₂)ᵀ
wherein Σ⁻¹ denotes the inverse of the covariance matrix Σ, Σ⁻¹Δ denotes the vector obtained from that product, and (Σ⁻¹Δ)₁ and (Σ⁻¹Δ)₂ respectively denote the 1st and 2nd elements of that vector; (Σ⁻¹)₁,₁, (Σ⁻¹)₂,₂ and (Σ⁻¹)₁,₂ respectively denote the row-1 column-1, row-2 column-2 and row-1 column-2 elements of the inverse matrix of the covariance matrix Σ; δ₁ and δ₂ respectively denote the 1st and 2nd elements of δ, and the superscript 2 denotes squaring.
Those skilled in the art will appreciate that the attention center and the positional prior tensor are both established terms. In short, the attention center indicates which specific neighboring pixel of the current pixel should be assigned the largest weight (i.e., the largest coefficient of the linear mapping); the positional prior tensor encodes the distance relationship between the current pixel and each neighboring pixel (positional information that a simple linear mapping cannot reflect).
The above expression actually expands into a quadratic form with respect to δ and Δ:

vᵀ r_δ = −½ (δ − Δ)ᵀ Σ⁻¹ (δ − Δ) + ½ Δᵀ Σ⁻¹ Δ

where the second term is constant with respect to δ and is canceled by the Softmax normalization.
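The equivalence between the inner product of v and r_δ and the Gaussian quadratic form (up to the constant term that Softmax cancels) can be checked numerically. This is a verification sketch, not part of the patent:

```python
import numpy as np

rng = np.random.default_rng(0)
Gamma = rng.standard_normal((2, 2))
Sigma = Gamma @ Gamma.T + 1e-3 * np.eye(2)   # symmetric positive definite covariance
Si = np.linalg.inv(Sigma)
Delta = rng.standard_normal(2)               # attention center
delta = rng.standard_normal(2)               # relative position p_j - p_i

# Offset projection vector v and positional prior vector r_delta
v = np.array([(Si @ Delta)[0], (Si @ Delta)[1],
              -0.5 * Si[0, 0], -0.5 * Si[1, 1], -Si[0, 1]])
r = np.array([delta[0], delta[1], delta[0]**2, delta[1]**2, delta[0]*delta[1]])

lhs = v @ r
rhs = -0.5 * (delta - Delta) @ Si @ (delta - Delta) + 0.5 * Delta @ Si @ Delta
```

Because Σ⁻¹ is symmetric, expanding the right-hand side reproduces the left-hand side term by term.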
The specific calculation principle is explained below. In the experiments the covariance matrix is obtained as Σ = ΓΓᵀ (where Γ ∈ ℝ^(2×2) is a parameter matrix), so Σ is symmetric and positive semi-definite. This guarantees that, in the normalized row vector (W_gqpe)_{i,:} of weights assigned by pixel i to all other pixels, the pixel at the continuous spatial position p_i + Δ receives the largest weight, and that on the two-dimensional map unfolded for pixel i the weights form Gaussian distributions of varying shapes around the extreme point p_i + Δ. Although all pixels on a feature map (in feature-tensor form) share a common set of parameters, i.e., the same Gaussian distribution, the normalization function Softmax makes the specific distribution differ from pixel to pixel while ensuring that the feature magnitudes on the pixels remain aligned. On this basis, a grouping operation is further carried out on the channel dimension, namely X′₁ = {x₁, x₂, ..., x_s} with x_g ∈ ℝ^(M×N×γC/(2s)), where the total number of groups s is determined by a hyperparameter. The purpose of the grouping is to assign each group of features its own mapping matrix W_{g-gqpe}, so that multiple Gaussian distribution forms are obtained; finally, after the s groups have been acted on by their respective mapping matrices, they are spliced along the original channel dimension, yielding rich spatial feature aggregation patterns of different granularities within a single-layer network.
Based on the above principle, the working process of a single PEG unit can be described as follows. When spatial-domain information aggregation modeling is performed, the group of information X′₁ serving as spatial-domain information is divided into s groups in a grouped mapping manner, i.e., X′₁ = {x₁, x₂, ..., x_s}, and for each group of information x_g in X′₁ a mapping matrix W_{g-gqpe} is learned, g = 1, 2, ..., s. The mapping matrix W_{g-gqpe} is determined from a learnable attention center Δ^g ∈ ℝ², a covariance matrix Σ^g ∈ ℝ^(2×2), and the relative position information (positional prior tensor) r registered in the memory, expressed as:

(W_{g-gqpe})_{i,j} = Softmax_j((v^g)ᵀ r_δ),  δ = p_j − p_i

v^g = (((Σ^g)⁻¹Δ^g)₁, ((Σ^g)⁻¹Δ^g)₂, −½((Σ^g)⁻¹)₁,₁, −½((Σ^g)⁻¹)₂,₂, −((Σ^g)⁻¹)₁,₂)ᵀ

r_δ = (δ₁, δ₂, δ₁², δ₂², δ₁δ₂)ᵀ
wherein (W_{g-gqpe})_{i,j} represents the element of the mapping matrix W_{g-gqpe} at the (i, j) position. The elements of the offset projection vector v^g have the same meaning as those of the offset projection vector v described above, but refer to a single group; in particular, ((Σ^g)⁻¹Δ^g)₁ and ((Σ^g)⁻¹Δ^g)₂ are respectively the 1st and 2nd elements of the vector (Σ^g)⁻¹Δ^g, and ((Σ^g)⁻¹)₁,₁, ((Σ^g)⁻¹)₂,₂ and ((Σ^g)⁻¹)₁,₂ respectively represent the row-1 column-1, row-2 column-2 and row-1 column-2 elements of the inverse matrix of the covariance matrix Σ^g.
Using a mapping matrix W g-gqpe For corresponding information x g Mapping, information X' 1 After mapping of all the group information is completed, obtaining fusion characteristics through splicing operation (Concat), wherein the fusion characteristics are expressed as follows:
GGQPE(X′ 1 )=Concat{W 1-gqpe x 1 ,...,W s-gqpe x s }。
in the above formula, W g-gqpe The mapping matrix acts on the space dimension N, namely, the spatial domain information aggregation is realized. Thus far, the core of the PEG unit is described.
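The grouped mapping above can be sketched compactly in NumPy. This is an illustrative sketch for a single window under assumed shapes; the Softmax scores are computed directly from the Gaussian quadratic form, which differs from the tabulated vᵀr_δ only by a constant that Softmax cancels:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def gqpe_matrix(positions, Delta, Sigma):
    # Row i holds Softmax-normalized Gaussian weights peaking near p_i + Delta.
    Si = np.linalg.inv(Sigma)
    diff = positions[None, :, :] - positions[:, None, :] - Delta  # (N, N, 2): p_j - p_i - Delta
    scores = -0.5 * np.einsum('ijk,kl,ijl->ij', diff, Si, diff)
    return softmax(scores, axis=1)                                # Softmax over pixel j

def ggqpe(X1, positions, Deltas, Sigmas):
    # Split X1 into s channel groups, map each with its own W_{g-gqpe}, then concat.
    groups = np.split(X1, len(Deltas), axis=-1)
    mapped = [gqpe_matrix(positions, D, S) @ g
              for g, D, S in zip(groups, Deltas, Sigmas)]
    return np.concatenate(mapped, axis=-1)

side = 4                                   # 4x4 window, N = 16 pixels
pos = np.stack(np.meshgrid(np.arange(side), np.arange(side),
                           indexing='ij'), axis=-1).reshape(-1, 2).astype(float)
X1 = np.random.randn(side * side, 8)       # (N, gamma*C/2) for one window
Deltas = [np.zeros(2), np.array([1.0, 0.0])]     # s = 2 groups, illustrative parameters
Sigmas = [np.eye(2), 2.0 * np.eye(2)]
fused = ggqpe(X1, pos, Deltas, Sigmas)
```

Note that each W_{g-gqpe} multiplies along the spatial dimension N, matching the spatial-domain aggregation described above, and that with Δ = 0 each row's largest weight falls on the pixel itself.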
3) Feature compression.
After the PEG unit, although the channel dimension has been halved by the initial grouping operation (the PEG unit's output dimension is ℝ^(M×N×γC/2)), it is still expanded relative to the input, so the features output by the PEG unit are compressed back to C channels through the second channel fully-connected layer. Since the input and output dimensions are thus unchanged, the network depth can be increased by stacking multiple layers, enhancing the feature capture capability of the model.
As described above, the PosMLP layer introduces a residual mechanism: the output of the second channel fully-connected layer is combined with the window-form three-dimensional tensor X at the input through the residual connection, giving the enhanced three-dimensional tensor output by the first PosMLP layer. Each subsequent PosMLP layer takes the enhanced three-dimensional tensor output by its predecessor as input, and the enhanced three-dimensional tensor output by the last PosMLP layer is the enhanced three-dimensional tensor output by the level. The three-dimensional tensors processed within each level remain in window form; the last PosMLP layer merges the processed window-form three-dimensional tensors and outputs the result.
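Putting the three steps together, a shape-preserving PosMLP layer can be sketched as follows (NumPy, with an identity stand-in for the spatial aggregation inside the PEG unit; all weights are random for illustration):

```python
import numpy as np

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def posmlp_layer(X, W1, W2):
    # 1) Feature expansion: LayerNorm -> first channel FC -> GELU.
    h = (X - X.mean(-1, keepdims=True)) / np.sqrt(X.var(-1, keepdims=True) + 1e-5)
    h = gelu(h @ W1)
    # 2) PEG unit: split along channels, spatial aggregation (identity stand-in), gating.
    h1, h2 = np.split(h, 2, axis=-1)
    h = h1 * h2
    # 3) Feature compression back to C channels, plus the residual connection.
    return X + h @ W2

M, N, C, gamma = 2, 16, 8, 6
X = np.random.randn(M, N, C)
W1 = np.random.randn(C, gamma * C) * 0.02
W2 = np.random.randn(gamma * C // 2, C) * 0.02
Y = posmlp_layer(X, W1, W2)
```

Because input and output dimensions match, layers of this form can be stacked to deepen the network, as the text describes.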
Each level of the network platform is processed with the above flow, and the data feature tensor of the original image is finally extracted.
Thirdly, subsequent application.
The data feature tensor of the original image can be obtained based on the above method; for example, it can be input to the linear head layer and, after Softmax, an image classification scoring result is obtained to complete image classification.
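The subsequent classification step can be sketched as follows (a hypothetical NumPy form; the pooling choice and head dimensions are illustrative assumptions, not specified by the patent):

```python
import numpy as np

def classify(feature, W_head):
    # Global-average-pool the data feature tensor, apply the linear head, then Softmax.
    pooled = feature.reshape(-1, feature.shape[-1]).mean(axis=0)
    logits = pooled @ W_head
    e = np.exp(logits - logits.max())
    return e / e.sum()

feat = np.random.randn(4, 49, 32)                      # data feature tensor of one image
scores = classify(feat, np.random.randn(32, 1000) * 0.02)  # 1000-class scoring result
```

The result is a normalized score vector over the classes; the predicted class is its argmax.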
To demonstrate the effectiveness of the present invention, the following experiments were performed for verification.
Experiments were performed on the real ImageNet1K dataset, with picture classification accuracy (Acc) as the evaluation index. On this task, three models of different sizes are used for horizontal comparison with other models, denoted from small to large as Tiny (T), Small (S) and Base (B); the perceptron network vision models provided by the invention are correspondingly denoted PosMLP-T, PosMLP-S and PosMLP-B. Ablation experiments on the model structure design and on the GGQPE grouping setting inside the PEG were also designed to verify the effectiveness of the designed units. The experiments are divided into three parts:
1. The image classification effect on ImageNet1K is shown in Table 1.
Table 1: comparison of parameters among models of the same scale with equivalent computational complexity
The convolution-based design is RegNetY, whose 4G, 8G and 16G variants increase in size in that order; Swin and NesT are based on attention network designs, and S2-MLP, gMLP, ViP, AS-MLP and Hire-MLP are based on MLP network designs. Since every model in Table 1 is an existing model, they are not described in detail. #P denotes the total number of model parameters, and FLOPs denotes the number of floating-point operations, measuring the computation required to classify one picture. As can be seen from Table 1, the method of the present invention has fewer parameters and excellent classification performance. It is worth mentioning that, at the two magnitudes "T" and "S", the model keeps up with and even exceeds the performance of convolution-based and attention-based models while typically requiring smaller parameter magnitudes; for example, at the "S" size, compared with the Swin Transformer, comparable performance (83.0% vs 83.0%) and comparable computational complexity (8.7G vs 8.7G) are obtained, but with fewer parameters (37M vs 50M). Moreover, across the three magnitudes "T, S, B", the invention obtains essentially optimal performance among MLP-based networks (the models containing "MLP" in Table 1) while tending to have fewer parameters.
It should be noted that some of the existing models referred to in Table 1 do not use unified size designations, so they are aligned in Table 1 by reference number; for example, the S/B/L sizes of Hire-MLP are compared against the T/S/B sizes of the present invention by reference number, and likewise for the remaining existing models.
2. The results of model structure ablation experiments on ImageNet1K with half the number of samples (number of classes unchanged) are shown in Table 2.
Table 2: performance improvements from inserting different structural modules, compared with the reference model
In Table 2, the classification accuracy of the reference model on this dataset is low. When the spatial fully-connected layer that fuses spatial-domain information is replaced by the GGQPE module provided by the present invention (gMLP + GGQPE), accuracy improves by 1.84% with only a 0.7% increase in computational complexity, while the parameter count changes by −6.2%. Thus, at the cost of a small increase in complexity, the model parameters are significantly reduced and the model performance is effectively improved. Finally, combined with the convolutional-downsampling-link-based hierarchical structure (ConvLHS) presented in fig. 2, an accuracy improvement of 5.22% is obtained on the dataset. This also illustrates the rationality of the structural design of the present invention.
3. Ablation experiments based on PEG unit design on ImageNet1K with half the number of samples (number of classes unchanged).
PEG Unit               Acc(%)   #P      FLOPs
Standard               77.61    20.9M   5.21G
Element-wise Addition  76.97    20.9M   5.21G
Concat                 76.98    27.6M   6.54G
LayerNorm              76.95    21.0M   5.23G
NonSplit               75.49    27.6M   7.42G
Table 3: effect on performance of modifying the internal structure of the PEG unit
Table 3 tests the effect of changing the internal structural form of the PEG unit on performance; the five parts (a) to (e), from left to right in fig. 7, correspond in order to the five structures in Table 3. Standard is the PEG unit introduced in the present invention (the standard PEG unit for short). Element-wise Addition replaces the gating operation of the standard PEG unit with element-by-element addition, and Concat replaces it with a Concat operation; both changes noticeably reduce model performance (77.61% vs 76.97% and 76.98%), illustrating the rationality of the gating function. LayerNorm adds an extra normalization layer to the standard PEG unit, but performance is also somewhat affected, because the extra normalization conflicts with the Softmax normalization inside GGQPE and thereby reduces the expressive strength of the features. NonSplit removes the grouped gating operation of the standard PEG unit; as the results in Table 3 show, removing the grouped gating causes performance degradation.
Example two
The invention also provides a system for modeling and capturing image spatial domain information, which is implemented mainly based on the method provided by the foregoing embodiment, as shown in fig. 8, the system mainly includes:
the original image downsampling unit is used for downsampling the input original image to obtain an original three-dimensional tensor and perform windowing operation;
the network platform based on characteristic windowing design inputs a three-dimensional tensor in a window form output by a down-sampling unit of an original image, and adopts a grouped spatial domain information fusion mode based on position coding to model and capture spatial domain information so as to obtain a data characteristic tensor of the original image;
the network platform is of a pyramid type hierarchical connection frame structure, each hierarchy comprises a plurality of sequentially connected single-layer networks, a basic model layer designed based on position coding and a gating function is arranged in each single-layer network, the basic model layers group input information, one group of the basic model layers is used as spatial domain information, a quadratic position coding method is used for spatial domain information aggregation modeling, fusion characteristics are obtained, and then characteristic strengthening is achieved through the gating function with the other group of the basic model layers; the input of the first level is a three-dimensional tensor of a window form obtained after the windowing operation of a down-sampling unit of an original image, the three-dimensional tensor of the processed window form is combined by each level to serve as an enhanced three-dimensional tensor output by the level, the three-dimensional tensor is used as the input of the next level after the down-sampling and windowing operation, and the output of the last level is the data characteristic tensor of the original image.
It will be clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be performed by different functional modules according to needs, that is, the internal structure of the system is divided into different functional modules to perform all or part of the above described functions.
It should be noted that the main principle of each part of the above system has been described in detail in the first embodiment, and therefore, the detailed description is omitted.
EXAMPLE III
The present invention also provides a processing apparatus, as shown in fig. 9, which mainly includes: one or more processors; a memory for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the methods provided by the foregoing embodiments.
Further, the processing device further comprises at least one input device and at least one output device; in the processing device, a processor, a memory, an input device and an output device are connected through a bus.
In the embodiment of the present invention, the specific types of the memory, the input device, and the output device are not limited; for example:
the input device can be a touch screen, an image acquisition device, a physical button or a mouse and the like;
the output device may be a display terminal;
the Memory may be a Random Access Memory (RAM) or a non-volatile Memory (non-volatile Memory), such as a disk Memory.
Example four
The present invention also provides a readable storage medium storing a computer program which, when executed by a processor, implements the method provided by the foregoing embodiments.
The readable storage medium in the embodiment of the present invention may be provided in the foregoing processing device as a computer readable storage medium, for example, as a memory in the processing device. The readable storage medium may be various media that can store program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, or an optical disk.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method for modeling and capturing image spatial domain information is characterized by comprising the following steps:
the method comprises the steps of performing down-sampling on an input original image to obtain an original three-dimensional tensor and performing windowing operation;
inputting the three-dimensional tensor in the form of a window to a network platform designed based on characteristic windowing, and modeling and capturing spatial information by the network platform in a grouped spatial information fusion mode realized based on position coding to obtain a data characteristic tensor of an original image;
the network platform is of a pyramid type hierarchical connection frame structure, each hierarchy comprises a plurality of sequentially connected single-layer networks, a basic model layer designed based on position coding and a gating function is arranged in each single-layer network, the basic model layers group input information, one group of the basic model layers is used as spatial domain information, a quadratic position coding method is used for spatial domain information aggregation modeling, fusion characteristics are obtained, and then characteristic strengthening is achieved through the gating function with the other group of the basic model layers; and outputting the reinforced three-dimensional tensor at each level, taking the three-dimensional tensor as the input of the next level after down-sampling, wherein the output of the last level is the data characteristic tensor of the original image.
2. The method of claim 1, wherein the spatial domain information of the image is modeled and captured,
down-sampling the input original image using the multi-convolution and normalization layers to obtain a three-dimensional tensor with resolution ℝ^((H/K)×(W/K)×C); H, W respectively represent the height and width of the original image, C represents the number of channels of the original image, and K represents the down-sampling proportion;
and a single-layer grouped convolution and normalization layer is adopted for down-sampling between different levels in the network platform.
3. The modeling and capturing method of image spatial domain information according to claim 1, wherein the input of the first level is a three-dimensional tensor in a window form obtained by performing windowing on an original three-dimensional tensor, each level combines the three-dimensional tensors in the processed window form as an enhanced three-dimensional tensor output by the level, and the three-dimensional tensors are used as the input of the next level after down-sampling and windowing; wherein different windows within each level share parameters.
4. The method of claim 1, wherein the single-layer network is referred to as PosMLP layer, and the base model layer is referred to as PEG unit; each of the PosMLP layers includes: the device comprises a normalization layer, a first channel full-connection layer, an activation function layer, a PEG unit and a second channel full-connection layer which are arranged in sequence.
5. The method of claim 4, wherein outputting the enhanced three-dimensional tensor at each level comprises:
processing the input three-dimensional tensor in the window form through a plurality of PosMLP layers which are sequentially connected at each level to obtain an enhanced three-dimensional tensor;
the input of each PosMLP layer in each level is a window-form three-dimensional tensor X ∈ ℝ^(M×N×C), wherein ℝ is the real number set, M represents the number of windows, N represents the number of pixels in the windows, and C represents the number of channels of the original image; processing the window-form three-dimensional tensor X through the normalization layer; expanding it from C channels to γC channels through the first channel full-connection layer to obtain the feature tensor X′ ∈ ℝ^(M×N×γC), wherein γ represents an expansion coefficient; performing nonlinear activation on the feature tensor X′ using the activation function layer; inputting the nonlinearly activated feature tensor into the PEG unit, and compressing the features output by the PEG unit to C channels through the second channel full-connection layer; combining the output of the second channel full-connection layer and the input window-form three-dimensional tensor X through residual connection to obtain the enhanced three-dimensional tensor output by the first PosMLP layer; the enhanced three-dimensional tensor output by the last PosMLP layer is the enhanced three-dimensional tensor output by the level.
6. The method for modeling and capturing image spatial information according to claim 1, wherein the basic model layer groups input information, one group of the input information is used as spatial information, a quadratic position coding method is used for spatial information aggregation modeling to obtain fusion features, and feature enhancement is realized by a gating function with the other group of the fusion features, wherein the method comprises the following steps:
recording the input information of the basic model layer as X′, and dividing it evenly into two groups along the channel dimension, denoted X′₁ and X′₂;
taking X′₁ as the spatial-domain information, performing spatial-domain information aggregation modeling using the quadratic position encoding method to obtain the fusion feature, and then realizing feature enhancement with X′₂ through a gating function, expressed as:

PEG(X′) = GGQPE(X′₁) ⊙ X′₂

wherein GGQPE denotes the operation function of the quadratic position encoding method, and the symbol ⊙ denotes the gating operation implemented by the gating function.
7. The method of modeling and capturing spatial information of images according to claim 1 or 6, wherein the quadratic position encoding method comprises:
when the airspace information aggregation modeling is carried out, a group of information X 'as the airspace information is subjected to a grouping mapping mode' 1 Divided into groups s, i.e. X' 1 ={x 1 ,x 2 ,...,x s Is X' 1 Each set of information x in (1) g Learning a mapping matrix W g-gqpe ,g=1,2,…,s;
the mapping matrix W_{g-gqpe} is determined from a learnable attention center Δ^g ∈ ℝ², a covariance matrix Σ^g ∈ ℝ^(2×2), and a positional prior tensor r registered in the memory, expressed as:

(W_{g-gqpe})_{i,j} = Softmax_j((v^g)ᵀ r_δ)

v^g = (((Σ^g)⁻¹Δ^g)₁, ((Σ^g)⁻¹Δ^g)₂, −½((Σ^g)⁻¹)₁,₁, −½((Σ^g)⁻¹)₂,₂, −((Σ^g)⁻¹)₁,₂)ᵀ

r_δ = (δ₁, δ₂, δ₁², δ₂², δ₁δ₂)ᵀ
wherein (W_{g-gqpe})_{i,j} represents the element of the mapping matrix W_{g-gqpe} at the (i, j) position, and ᵀ is the transpose symbol; δ = p_j − p_i is the relative position of pixel j with respect to pixel i, p_i and p_j respectively representing the spatial positions of pixel i and pixel j; Softmax_j represents the Softmax calculation over the dimension in which pixel j is located; in the offset projection vector v^g, ((Σ^g)⁻¹Δ^g)₁ and ((Σ^g)⁻¹Δ^g)₂ respectively represent the 1st and 2nd elements of the vector (Σ^g)⁻¹Δ^g, and ((Σ^g)⁻¹)₁,₁, ((Σ^g)⁻¹)₂,₂ and ((Σ^g)⁻¹)₁,₂ respectively represent the row-1 column-1, row-2 column-2 and row-1 column-2 elements of the inverse matrix of the covariance matrix Σ^g;
using the mapping matrix W_{g-gqpe} to map the corresponding information x_g; after the mapping of all grouped information in X′₁ is completed, obtaining the fusion feature through a splicing operation, expressed as:

GGQPE(X′₁) = Concat{W_{1-gqpe} x₁, ..., W_{s-gqpe} x_s}

wherein Concat represents the splicing operation.
8. A modeling and capturing system of image spatial domain information, which is realized based on the method of any one of claims 1 to 7, and comprises:
the original image downsampling unit is used for downsampling the input original image to obtain an original three-dimensional tensor and perform windowing operation;
the network platform based on characteristic windowing design inputs a three-dimensional tensor in a window form output by a down-sampling unit of an original image, and adopts a grouped spatial domain information fusion mode based on position coding to model and capture spatial domain information so as to obtain a data characteristic tensor of the original image;
the network platform is of a pyramid type hierarchical connection frame structure, each hierarchy comprises a plurality of sequentially connected single-layer networks, a basic model layer designed based on position coding and a gating function is arranged in each single-layer network, the basic model layers group input information, one group of the basic model layers is used as spatial domain information, a quadratic position coding method is used for spatial domain information aggregation modeling, fusion characteristics are obtained, and then characteristic strengthening is achieved through the gating function with the other group of the basic model layers; and outputting the reinforced three-dimensional tensor at each level, taking the three-dimensional tensor as the input of the next level after down-sampling, wherein the output of the last level is the data characteristic tensor of the original image.
9. A processing device, comprising: one or more processors; a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-7.
10. A readable storage medium, storing a computer program, wherein the computer program, when executed by a processor, performs the method of any one of claims 1 to 7.
CN202210609728.9A 2022-05-31 2022-05-31 Method, system, equipment and storage medium for modeling and capturing image spatial domain information Pending CN114863132A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210609728.9A CN114863132A (en) 2022-05-31 2022-05-31 Method, system, equipment and storage medium for modeling and capturing image spatial domain information

Publications (1)

Publication Number Publication Date
CN114863132A true CN114863132A (en) 2022-08-05

Family

ID=82640362

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210609728.9A Pending CN114863132A (en) 2022-05-31 2022-05-31 Method, system, equipment and storage medium for modeling and capturing image spatial domain information

Country Status (1)

Country Link
CN (1) CN114863132A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116563649A (en) * 2023-07-10 2023-08-08 西南交通大学 Tensor mapping network-based hyperspectral image lightweight classification method and device
CN116563649B (en) * 2023-07-10 2023-09-08 西南交通大学 Tensor mapping network-based hyperspectral image lightweight classification method and device

Similar Documents

Publication Publication Date Title
CN110428428B (en) Image semantic segmentation method, electronic equipment and readable storage medium
Wang et al. Detect globally, refine locally: A novel approach to saliency detection
CN109886121B (en) Human face key point positioning method for shielding robustness
CN111612008B (en) Image segmentation method based on convolution network
CN111461232A (en) Nuclear magnetic resonance image classification method based on multi-strategy batch type active learning
CN110097115B (en) Video salient object detection method based on attention transfer mechanism
CN107784288A (en) A kind of iteration positioning formula method for detecting human face based on deep neural network
CN113870335A (en) Monocular depth estimation method based on multi-scale feature fusion
CN113112416B (en) Semantic-guided face image restoration method
CN111768415A (en) Image instance segmentation method without quantization pooling
CN114299383A (en) Remote sensing image target detection method based on integration of density map and attention mechanism
Xu et al. AutoSegNet: An automated neural network for image segmentation
CN114444565A (en) Image tampering detection method, terminal device and storage medium
CN114519819B (en) Remote sensing image target detection method based on global context awareness
CN113554656B (en) Optical remote sensing image example segmentation method and device based on graph neural network
Yang et al. Increaco: incrementally learned automatic check-out with photorealistic exemplar augmentation
Gao A method for face image inpainting based on generative adversarial networks
CN114863132A (en) Method, system, equipment and storage medium for modeling and capturing image spatial domain information
Shit et al. An encoder‐decoder based CNN architecture using end to end dehaze and detection network for proper image visualization and detection
Wang et al. Perception-guided multi-channel visual feature fusion for image retargeting
CN112668662A (en) Outdoor mountain forest environment target detection method based on improved YOLOv3 network
CN112149528A (en) Panorama target detection method, system, medium and equipment
CN112001479B (en) Processing method and system based on deep learning model and electronic equipment
CN113344110B (en) Fuzzy image classification method based on super-resolution reconstruction
Pei et al. FGO-Net: Feature and Gaussian Optimization Network for visual saliency prediction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination