WO2024037330A1 - Image feature processing method and apparatus, and storage medium - Google Patents

Image feature processing method and apparatus, and storage medium

Info

Publication number
WO2024037330A1
WO2024037330A1 (PCT/CN2023/110526)
Authority
WO
WIPO (PCT)
Prior art keywords
blocks
gradient
block
feature map
value
Prior art date
Application number
PCT/CN2023/110526
Other languages
French (fr)
Chinese (zh)
Inventor
韩韬
张园
杨明川
王慧芬
薛俊达
Original Assignee
中国电信股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国电信股份有限公司
Publication of WO2024037330A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/771 Feature selection, e.g. selecting representative features from a multi-dimensional feature space
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Definitions

  • the present disclosure relates to the field of image processing, and in particular to an image feature processing method and device, and a storage medium.
  • the 5G era has spawned massive machine-oriented applications, such as the Internet of Vehicles, autonomous driving, the industrial Internet, smart and safe cities, wearables, and video surveillance. Compared with the increasingly saturated video for human vision tasks, the application scenarios of machine vision are far broader. Video coding for machine vision will become one of the main sources of incremental traffic in the 5G and post-5G era.
  • an encoder encodes images or videos oriented to human vision tasks to generate a bitstream.
  • the decoder decodes the bitstream to obtain the images or videos. Because the encoder and decoder are mainly based on convolutional neural networks, they cannot selectively compress images or videos and cannot discard the features of non-important areas in an image or video.
  • the present disclosure provides an image feature processing solution that can effectively discard features of non-important areas in images or videos in order to better complete machine vision intelligent analysis tasks.
  • an image feature processing method is provided, executed by an image feature processing device and including: extracting features of an original image to obtain a source feature map; dividing the source feature map into a predetermined number D of mutually non-overlapping first blocks; calculating the gradient value of each of the first blocks; sorting all the first blocks by their gradient averages so as to delete the first p first blocks with the smallest gradient averages; using a preset encoder to encode the block embedding features and position embedding features of the D-p first blocks that were not deleted, to obtain D second blocks, where the D second blocks include D-p coded visible blocks and p mask tokens located at predetermined positions, and the D-p coded visible blocks correspond one-to-one to the D-p first blocks; and using a preset decoder to decode all the second blocks and the position embedding features of the source feature map, to obtain a reconstructed feature map.
  • calculating the gradient value of each first block includes: in the first block d(i,j), calculating the first gradient g_x of each feature point f(x,y) in a first direction and the second gradient g_y in a second direction, where the first direction and the second direction are mutually perpendicular, 1 ≤ i ≤ d_1, 1 ≤ j ≤ d_2, and D = d_1 × d_2; determining the gradient value of the feature point f(x,y) according to the first gradient g_x and the second gradient g_y; and determining the gradient value of the first block d(i,j) according to the gradient values of all feature points in the first block d(i,j).
  • the gradient value of the feature point f(x,y) is the mean square value of the first gradient g_x and the second gradient g_y.
  • the gradient value of the first block d(i,j) is the average of the gradient values of all feature points in the first block d(i,j).
  • the parameter p is determined according to a preset compression ratio α.
  • the compression ratio α is: α = (p × m_1 × m_2) / (n_1 × n_2), where n_1 × n_2 is the size of the source feature map and m_1 × m_2 is the size of each first block.
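  • as a worked example under this reconstruction of formula (5): dropping p = 11 blocks of size 7 × 7 from a 28 × 28 source feature map gives α = (11 × 7 × 7) / (28 × 28) = 11/16 ≈ 0.69, consistent with the example below in which 5 of 16 blocks are retained.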
  • using a preset decoder to decode all the second blocks and the position embedding features of the source feature map includes: for the t-th second block among all the second blocks, determining the corresponding first vector matrix Q_t, second vector matrix K_t, and third vector matrix V_t according to the first attention weight matrix W_t^Q, the second attention weight matrix W_t^K, and the third attention weight matrix W_t^V of each single head, where 1 ≤ t ≤ D; determining the attention value of each single head according to Q_t, K_t, and V_t; determining the multi-head attention value of the t-th second block according to the attention values of all single heads of the t-th second block; and performing multi-layer perception processing on the multi-head attention value of the t-th second block and the t-th second block, to obtain the reconstructed feature map.
  • the encoder is a Transformer encoder; the decoder is a Transformer decoder.
  • an image feature processing device is provided, including: a first processing module configured to extract features of an original image to obtain a source feature map; a second processing module configured to divide the source feature map into a predetermined number D of mutually non-overlapping first blocks, calculate the gradient value of each of the first blocks, and sort all the first blocks by their gradient averages so as to delete the first p first blocks with the smallest gradient averages; a third processing module configured to use a preset encoder to encode the block embedding features and position embedding features of the D-p first blocks that were not deleted, to obtain D second blocks, where the D second blocks include D-p coded visible blocks and p mask tokens located at predetermined positions, and the D-p coded visible blocks correspond one-to-one to the D-p first blocks; and a fourth processing module configured to use a preset decoder to decode all the second blocks and the position embedding features of the source feature map, to obtain a reconstructed feature map.
  • the gradient value of the feature point f(x,y) is the mean square value of the first gradient g_x and the second gradient g_y.
  • the gradient value of the first block d(i,j) is the average of the gradient values of all feature points in the first block d(i,j).
  • the second processing module is configured to determine the parameter p according to the preset compression ratio α.
  • the compression ratio α is: α = (p × m_1 × m_2) / (n_1 × n_2), where n_1 × n_2 is the size of the source feature map and m_1 × m_2 is the size of each first block.
  • the fourth processing module is configured to: for the t-th second block among all the second blocks, determine the corresponding first vector matrix Q_t, second vector matrix K_t, and third vector matrix V_t according to the first attention weight matrix W_t^Q, the second attention weight matrix W_t^K, and the third attention weight matrix W_t^V of each single head, where 1 ≤ t ≤ D; determine the attention value of each single head according to Q_t, K_t, and V_t; determine the multi-head attention value of the t-th second block according to the attention values of all single heads of the t-th second block; and perform multi-layer perception processing on the multi-head attention value of the t-th second block and the t-th second block, to obtain the reconstructed feature map.
  • the encoder is a Transformer encoder; the decoder is a Transformer decoder.
  • an image feature processing device including: a memory configured to store instructions; and a processor coupled to the memory, the processor being configured to execute, based on the instructions stored in the memory, the method described in any of the above embodiments.
  • a non-transitory computer-readable storage medium stores computer instructions that, when executed by a processor, implement the method described in any of the above embodiments.
  • a computer program product including computer instructions, wherein when the computer instructions are executed by a processor, the method as described in any of the above embodiments is implemented.
  • Figure 1 is a schematic flowchart of an image feature processing method according to an embodiment of the present disclosure
  • Figures 2A-2C are schematic diagrams of feature maps of some embodiments of the present disclosure.
  • Figures 3A-3B are schematic diagrams of feature maps of other embodiments of the present disclosure.
  • Figure 4 is a schematic diagram of an encoder output according to an embodiment of the present disclosure.
  • Figure 5 is a schematic diagram of a decoder according to an embodiment of the present disclosure.
  • Figure 6 is a schematic structural diagram of an image feature processing device according to an embodiment of the present disclosure.
  • FIG. 7 is a schematic structural diagram of an image feature processing device according to another embodiment of the present disclosure.
  • any specific values are to be construed as illustrative only and not as limiting. Accordingly, other examples of the exemplary embodiments may have different values.
  • Figure 1 is a schematic flowchart of an image feature processing method according to an embodiment of the present disclosure.
  • the following image feature processing method is performed by an image feature processing device.
  • in step 101, the features of the original image are extracted to obtain the source feature map.
  • the original image is input into a CNN (Convolutional Neural Network) to obtain the source feature map.
  • in step 102, the source feature map is divided into a predetermined number D of first blocks that do not overlap with each other.
  • the size of the source feature map is 28 × 28.
  • the size of each block is 7 × 7; that is, the source feature map is divided into 4 × 4 blocks, d(0,0) to d(3,3), as shown in Figure 2C.
  • in step 103, the gradient value of each of the first blocks is calculated.
  • in the first block d(i,j), the first gradient g_x in the first direction and the second gradient g_y in the second direction of each feature point f(x,y) are calculated, where the first direction and the second direction are mutually perpendicular.
  • the first direction is the x-axis direction in the preset plane
  • the second direction is the y-axis direction in the preset plane.
  • the first gradient g_x of the feature point f(x,y) in the first direction is: g_x = f(x+1, y) - f(x, y)    (1)
  • the second gradient g_y of the feature point f(x,y) in the second direction is: g_y = f(x, y+1) - f(x, y)    (2)
  • the gradient value of the feature point f(x,y) is determined according to the first gradient g_x and the second gradient g_y.
  • the gradient value of the feature point f(x,y) is the mean square value of the first gradient g_x and the second gradient g_y, i.e., g(x, y) = (g_x^2 + g_y^2) / 2    (3)
  • the gradient value of the first block d(i,j) is the average of the gradient values of all feature points in the first block d(i,j), i.e., G(i, j) = (1 / (m_1 × m_2)) Σ g(x, y)    (4)
  • in step 104, all the first blocks are sorted by gradient average, so as to delete the top p first blocks with the smallest gradient averages.
  • the parameter p is determined according to a preset compression ratio α, as α = (p × m_1 × m_2) / (n_1 × n_2), where n_1 × n_2 is the size of the source feature map. The number of discarded blocks can therefore be chosen as needed, and the compression ratio of the features can be adjusted flexibly.
  • for example, the feature map shown in Figure 3A includes 16 blocks, i.e., D = 16; all the blocks are sorted by gradient average and the 11 blocks with the smallest gradient averages are deleted, i.e., p = 11, so that only D-p = 5 blocks are retained.
  • the five retained blocks are: the 2nd block, the 7th block, the 9th block, the 14th block, and the 16th block.
  • in step 105, a preset encoder is used to encode the block embedding features and position embedding features of the D-p first blocks that have not been deleted, to obtain D second blocks, where the D second blocks include D-p coded visible blocks and p mask tokens located at predetermined positions, and the D-p coded visible blocks correspond one-to-one to the D-p first blocks.
  • the preset encoder is a Transformer encoder.
  • the 5 blocks shown in Figure 3B, namely the 2nd, 7th, 9th, 14th, and 16th blocks, are input into the trained encoder so that the encoder outputs the encoding result.
  • the encoding result includes 16 blocks.
  • the positions of the five coded visible blocks in the encoding result correspond to the positions of the five blocks in Figure 3B.
  • the five coded visible blocks are the 2nd, 7th, 9th, 14th, and 16th blocks of the encoding result.
  • the encoder adds a corresponding mask token (Mask Token) at the position of each discarded block, as shown by the 11 dark boxes 42 in Figure 4. That is, the encoding result includes 11 mask tokens, at the 1st, 3rd, 4th, 5th, 6th, 8th, 10th, 11th, 12th, 13th, and 15th positions of the encoding result.
  • in step 106, a preset decoder is used to decode all the second blocks and the position embedding features of the source feature map to obtain the reconstructed feature map.
  • the preset decoder is a Transformer decoder.
  • the structure of the decoder is as shown in Figure 5.
  • the input features are processed by the multi-head self-attention (Multi-head Self Attention) layer after normalization.
  • the calculation is shown in formula (6), where F_t is the feature of the t-th second block.
  • the attention value s t of each single head is determined according to the first vector matrix Q t , the second vector matrix K t and the third vector matrix V t , as shown in formula (7).
  • the multi-head attention value of the t-th second block is determined based on the attention values of all single-heads of the t-th second block, as shown in formula (8).
  • the multi-head attention value of the t-th second block and the t-th second block are input into the MLP (Multi-Layer Perceptron) layer for corresponding processing, to obtain the reconstructed feature map.
  • by discarding the first p blocks with the smallest gradient averages in the feature map, the features of non-important areas in an image or video can be effectively discarded, so as to better complete machine vision intelligent analysis tasks.
  • FIG. 6 is a schematic structural diagram of an image feature processing device according to an embodiment of the present disclosure. As shown in FIG. 6 , the image feature processing device includes a first processing module 61 , a second processing module 62 , a third processing module 63 and a fourth processing module 64 .
  • the first processing module 61 is configured to extract features of the original image to obtain a source feature map.
  • the original image is input into the CNN to obtain the source feature map.
  • the second processing module 62 is configured to divide the source feature map into a predetermined number D of mutually non-overlapping first blocks, calculate the gradient value of each of the first blocks, and sort all the first blocks by their gradient averages so as to remove the top p first blocks with the smallest gradient averages.
  • the second processing module 62 calculates, in the first block d(i,j), the first gradient g_x of each feature point f(x,y) in the first direction and the second gradient g_y in the second direction, where the first direction and the second direction are mutually perpendicular, 1 ≤ i ≤ d_1, 1 ≤ j ≤ d_2, and D = d_1 × d_2.
  • the first direction is the x-axis direction in the preset plane
  • the second direction is the y-axis direction in the preset plane.
  • the second processing module 62 determines the gradient value of the feature point f(x,y) according to the first gradient g_x and the second gradient g_y.
  • the gradient value of the first block d(i,j) is determined according to the gradient values of all feature points in the first block d(i,j).
  • the gradient value of the feature point f(x,y) is the mean square value of the first gradient g_x and the second gradient g_y, as shown in formula (3) above.
  • the gradient value of the first block d(i,j) is the average of the gradient values of all feature points in the first block d(i,j), as shown in formula (4) above.
  • the parameter p is determined according to a preset compression ratio α.
  • the parameter p is determined according to the above formula (5).
  • the third processing module 63 is configured to use a preset encoder to encode the block embedding features and position embedding features of the D-p first blocks that have not been deleted, to obtain D second blocks, where the D second blocks include D-p coded visible blocks and p mask tokens located at predetermined positions, and the D-p coded visible blocks correspond one-to-one to the D-p first blocks.
  • the preset encoder is a Transformer encoder.
  • the fourth processing module 64 is configured to use a preset decoder to decode all the second blocks and the position embedded features of the source feature map to obtain the reconstructed feature map.
  • the preset decoder is a Transformer decoder.
  • the fourth processing module 64, for the t-th second block among all the second blocks, determines the corresponding first vector matrix Q_t, second vector matrix K_t, and third vector matrix V_t according to the first attention weight matrix W_t^Q, the second attention weight matrix W_t^K, and the third attention weight matrix W_t^V of each single head, where 1 ≤ t ≤ D.
  • the above formula (6) is used for calculation.
  • the fourth processing module 64 determines the attention value of each single head according to the first vector matrix Q_t, the second vector matrix K_t, and the third vector matrix V_t.
  • the above formula (7) is used for calculation.
  • the fourth processing module 64 determines the multi-head attention value of the t-th second block based on the attention values of all single heads of the t-th second block. For example, the above formula (8) is used for calculation.
  • multi-layer perception processing is performed on the multi-head attention value of the t-th second block and the t-th second block to obtain the reconstructed feature map.
  • FIG. 7 is a schematic structural diagram of an image feature processing device according to another embodiment of the present disclosure. As shown in FIG. 7 , the image feature processing device includes a memory 71 and a processor 72 .
  • the memory 71 is used to store instructions, and the processor 72 is coupled to the memory 71 .
  • the processor 72 is configured to execute the method involved in any embodiment in FIG. 1 based on the instructions stored in the memory.
  • the image feature processing device also includes a communication interface 73 for information interaction with other devices.
  • the image feature processing device also includes a bus 74, through which the processor 72, the communication interface 73, and the memory 71 complete communication with each other.
  • the memory 71 may include high-speed RAM memory, and may also include non-volatile memory, such as at least one disk memory.
  • the memory 71 may also be a memory array.
  • the memory 71 may also be divided into blocks, and the blocks may be combined into virtual volumes according to certain rules.
  • the processor 72 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present disclosure.
  • the present disclosure also relates to a computer-readable storage medium, where the computer-readable storage medium stores computer instructions that, when executed by the processor, implement the method involved in any embodiment of Figure 1.
  • by calculating the gradient values of the blocks of the feature map, the present disclosure can filter out, according to the sorted gradient values, the important areas on which the feature information focuses, and then discard the non-important blocks of the features to achieve feature compression;
  • the present disclosure can flexibly control the compression rate by controlling the ratio of discarding feature blocks, and flexibly match various compression rate requirements;
  • the encoding part of the present disclosure can be added to the image device of the machine vision system as a machine vision encoding module, and the decoding part of the present disclosure can be added to the edge function set of the machine vision system as a machine vision decoding module, thereby improving compression efficiency.
  • the functional units described above can be implemented as a general-purpose processor, a programmable logic controller (PLC), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic device for performing the functions described in this disclosure.
  • the program can be stored in a computer-readable storage medium.
  • the storage medium can be a read-only memory, a magnetic disk or an optical disk, etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Editing Of Facsimile Originals (AREA)
  • Image Processing (AREA)

Abstract

The present disclosure relates to the field of image processing, and provides an image feature processing method and apparatus, and a storage medium. The image feature processing method comprises: extracting features of an original image so as to obtain a source feature map; dividing the source feature map into a predetermined number D of mutually non-overlapping first blocks; calculating a gradient value of each of the first blocks; sorting all of the first blocks according to their gradient average values so as to delete the first p first blocks having the smallest gradient average values; using a preset encoder to encode block embedding features and position embedding features of the D-p first blocks which are not deleted, so as to obtain D second blocks, wherein the D second blocks comprise D-p coded visible blocks and p mask tokens located at predetermined positions, and the D-p coded visible blocks are in one-to-one correspondence with the D-p first blocks; and using a preset decoder to decode all of the second blocks and position embedding features of the source feature map so as to obtain a reconstructed feature map.

Description

Image feature processing method and device, and storage medium
Cross-Reference to Related Applications
This application is based on, and claims priority to, the CN application No. 202210998237.8 filed on August 19, 2022, the disclosure of which is hereby incorporated into this application in its entirety.
Technical Field
The present disclosure relates to the field of image processing, and in particular to an image feature processing method and device, and a storage medium.
Background
With the rapid development of 5G, big data, and artificial intelligence, and against the background of image and video big data applications, media content such as images and videos is widely used in intelligent vision tasks such as object detection, object tracking, image classification, image segmentation, and pedestrian re-identification.
The 5G era has spawned massive machine-oriented applications, such as the Internet of Vehicles, autonomous driving, the industrial Internet, smart and safe cities, wearables, and video surveillance; compared with the increasingly saturated video for human vision tasks, the application scenarios of machine vision are far broader. Video coding for machine vision will become one of the main sources of incremental traffic in the 5G and post-5G era.
Summary
The inventors have noted that, in the related art, an encoder encodes images or videos oriented to human vision tasks to generate a bitstream, and a decoder decodes the bitstream to obtain the images or videos. Because such encoders and decoders are mainly based on convolutional neural networks, they cannot selectively compress images or videos and cannot discard the features of non-important areas in an image or video.
Accordingly, the present disclosure provides an image feature processing solution that can effectively discard the features of non-important areas in images or videos, so as to better complete machine vision intelligent analysis tasks.
According to a first aspect of the embodiments of the present disclosure, an image feature processing method is provided, executed by an image feature processing device and including: extracting features of an original image to obtain a source feature map; dividing the source feature map into a predetermined number D of mutually non-overlapping first blocks; calculating the gradient value of each of the first blocks; sorting all the first blocks by their gradient averages so as to delete the first p first blocks with the smallest gradient averages; using a preset encoder to encode the block embedding features and position embedding features of the D-p first blocks that were not deleted, to obtain D second blocks, where the D second blocks include D-p coded visible blocks and p mask tokens located at predetermined positions, and the D-p coded visible blocks correspond one-to-one to the D-p first blocks; and using a preset decoder to decode all the second blocks and the position embedding features of the source feature map, to obtain a reconstructed feature map.
In some embodiments, calculating the gradient value of each first block includes: in the first block d(i,j), calculating the first gradient g_x of each feature point f(x,y) in a first direction and the second gradient g_y in a second direction, where the first direction and the second direction are mutually perpendicular, 1 ≤ i ≤ d_1, 1 ≤ j ≤ d_2, and D = d_1 × d_2; determining the gradient value of the feature point f(x,y) according to the first gradient g_x and the second gradient g_y; and determining the gradient value of the first block d(i,j) according to the gradient values of all feature points in the first block d(i,j).
In some embodiments, the gradient value of the feature point f(x,y) is the mean square value of the first gradient g_x and the second gradient g_y.
In some embodiments, the gradient value of the first block d(i,j) is the average of the gradient values of all feature points in the first block d(i,j).
In some embodiments, the parameter p is determined according to a preset compression ratio α.
In some embodiments, the compression ratio α is:
α = (p × m_1 × m_2) / (n_1 × n_2)
where n_1 × n_2 is the size of the source feature map and m_1 × m_2 is the size of each first block.
In some embodiments, using a preset decoder to decode all the second blocks and the position embedding features of the source feature map includes: for the t-th second block among all the second blocks, determining the corresponding first vector matrix Q_t, second vector matrix K_t, and third vector matrix V_t according to the first attention weight matrix W_t^Q, the second attention weight matrix W_t^K, and the third attention weight matrix W_t^V of each single head, where 1 ≤ t ≤ D; determining the attention value of each single head according to the first vector matrix Q_t, the second vector matrix K_t, and the third vector matrix V_t; determining the multi-head attention value of the t-th second block according to the attention values of all single heads of the t-th second block; and performing multi-layer perception processing on the multi-head attention value of the t-th second block and the t-th second block, to obtain the reconstructed feature map.
In some embodiments, the encoder is a Transformer encoder, and the decoder is a Transformer decoder.
According to a second aspect of the embodiments of the present disclosure, an image feature processing device is provided, including: a first processing module configured to extract features of an original image to obtain a source feature map; a second processing module configured to divide the source feature map into a predetermined number D of mutually non-overlapping first blocks, calculate the gradient value of each of the first blocks, and sort all the first blocks by their gradient averages so as to delete the first p first blocks with the smallest gradient averages; a third processing module configured to use a preset encoder to encode the block embedding features and position embedding features of the D-p first blocks that were not deleted, to obtain D second blocks, where the D second blocks include D-p coded visible blocks and p mask tokens located at predetermined positions, and the D-p coded visible blocks correspond one-to-one to the D-p first blocks; and a fourth processing module configured to use a preset decoder to decode all the second blocks and the position embedding features of the source feature map, to obtain a reconstructed feature map.
In some embodiments, the second processing module is configured to: in the first block d(i,j), calculate the first gradient g_x of each feature point f(x,y) in a first direction and the second gradient g_y in a second direction, where the first direction and the second direction are mutually perpendicular, 1 ≤ i ≤ d_1, 1 ≤ j ≤ d_2, and D = d_1 × d_2; determine the gradient value of the feature point f(x,y) according to the first gradient g_x and the second gradient g_y; and determine the gradient value of the first block d(i,j) according to the gradient values of all feature points in the first block d(i,j).
In some embodiments, the gradient value of the feature point f(x,y) is the mean square value of the first gradient g_x and the second gradient g_y.
In some embodiments, the gradient value of the first block d(i,j) is the average of the gradient values of all feature points in the first block d(i,j).
In some embodiments, the second processing module is configured to determine the parameter p according to a preset compression ratio α.
In some embodiments, the compression ratio α is:
α = (p × m_1 × m_2) / (n_1 × n_2)
where n_1 × n_2 is the size of the source feature map and m_1 × m_2 is the size of each first block.
In some embodiments, the fourth processing module is configured to: for the t-th second block among all the second blocks, determine the corresponding first vector matrix Q_t, second vector matrix K_t, and third vector matrix V_t according to the first attention weight matrix W_t^Q, the second attention weight matrix W_t^K, and the third attention weight matrix W_t^V of each single head, where 1 ≤ t ≤ D; determine the attention value of each single head according to the first vector matrix Q_t, the second vector matrix K_t, and the third vector matrix V_t; determine the multi-head attention value of the t-th second block according to the attention values of all single heads of the t-th second block; and perform multi-layer perception processing on the multi-head attention value of the t-th second block and the t-th second block, to obtain the reconstructed feature map.
In some embodiments, the encoder is a Transformer encoder, and the decoder is a Transformer decoder.
According to a third aspect of the embodiments of the present disclosure, an image feature processing device is provided, including: a memory configured to store instructions; and a processor coupled to the memory, the processor being configured to execute, based on the instructions stored in the memory, the method described in any of the above embodiments.
According to a fourth aspect of the embodiments of the present disclosure, a non-transitory computer-readable storage medium is provided, where the computer-readable storage medium stores computer instructions that, when executed by a processor, implement the method described in any of the above embodiments.
According to a fifth aspect of the embodiments of the present disclosure, a computer program product is provided, including computer instructions that, when executed by a processor, implement the method described in any of the above embodiments.
Other features and advantages of the present disclosure will become apparent from the following detailed description of exemplary embodiments of the present disclosure with reference to the accompanying drawings.
Brief Description of the Drawings
To explain the embodiments of the present disclosure or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present disclosure; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Figure 1 is a schematic flowchart of an image feature processing method according to an embodiment of the present disclosure;
Figures 2A-2C are schematic diagrams of feature maps of some embodiments of the present disclosure;
Figures 3A-3B are schematic diagrams of feature maps of other embodiments of the present disclosure;
Figure 4 is a schematic diagram of an encoder output according to an embodiment of the present disclosure;
Figure 5 is a schematic diagram of a decoder according to an embodiment of the present disclosure;
Figure 6 is a schematic structural diagram of an image feature processing device according to an embodiment of the present disclosure;
Figure 7 is a schematic structural diagram of an image feature processing device according to another embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present disclosure, rather than all of them. The following description of at least one exemplary embodiment is merely illustrative and is in no way intended to limit the disclosure or its application or uses. Based on the embodiments in this disclosure, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of this disclosure.
Unless otherwise specifically stated, the relative arrangement of components and steps, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the disclosure.
At the same time, it should be understood that, for convenience of description, the dimensions of the parts shown in the drawings are not drawn to actual scale.
Techniques, methods, and devices known to those of ordinary skill in the relevant art may not be discussed in detail, but where appropriate, such techniques, methods, and devices should be considered part of the specification.
In all examples shown and discussed here, any specific value should be interpreted as merely exemplary rather than limiting; other examples of the exemplary embodiments may therefore have different values.
It should be noted that similar reference numerals and letters denote similar items in the following figures; therefore, once an item is defined in one figure, it does not need to be discussed further in subsequent figures.
Figure 1 is a schematic flowchart of an image feature processing method according to an embodiment of the present disclosure. In some embodiments, the following image feature processing method is performed by an image feature processing device.
In step 101, features of the original image are extracted to obtain a source feature map.
In some embodiments, the original image is input into a CNN (Convolutional Neural Network) to obtain the source feature map.
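To make step 101 concrete, the sketch below extracts a source feature map with a pretrained convolutional backbone. This is a minimal sketch, not the implementation mandated by the disclosure: the choice of torchvision's ResNet-50, the truncation point after its first residual stage, and the input size are assumptions for illustration only.

```python
# Minimal sketch of step 101: obtain a source feature map from a CNN.
# Assumption: torchvision's ResNet-50 truncated after its first residual
# stage; the disclosure only requires "a CNN".
import torch
from torchvision.models import resnet50

backbone = resnet50(weights="IMAGENET1K_V1")
feature_extractor = torch.nn.Sequential(
    backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool,
    backbone.layer1,
)
feature_extractor.eval()

image = torch.randn(1, 3, 224, 224)  # placeholder for the original image
with torch.no_grad():
    source_feature_map = feature_extractor(image)
print(source_feature_map.shape)  # torch.Size([1, 256, 56, 56])
```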
In step 102, the source feature map is divided into a predetermined number D of mutually non-overlapping first blocks.
In some embodiments, if the size of the source feature map is n_1 × n_2 and the size of each first block is m_1 × m_2, then D = d_1 × d_2, where d_1 = n_1/m_1 and d_2 = n_2/m_2.
For example, as shown in Figure 2A, the size of the source feature map is 28 × 28. As shown in Figure 2B, the size of each block is 7 × 7; that is, the source feature map is divided into 4 × 4 blocks, d(0,0) to d(3,3), as shown in Figure 2C.
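The division of step 102 amounts to a reshape when m_1 divides n_1 and m_2 divides n_2. The sketch below assumes a single-channel n_1 × n_2 map for clarity; a real feature map would carry a channel dimension as well.

```python
# Minimal sketch of step 102: divide an n1 x n2 feature map into
# D = d1 x d2 non-overlapping m1 x m2 first blocks.
import numpy as np

def split_into_blocks(feature_map: np.ndarray, m1: int, m2: int) -> np.ndarray:
    n1, n2 = feature_map.shape
    assert n1 % m1 == 0 and n2 % m2 == 0, "block size must divide map size"
    d1, d2 = n1 // m1, n2 // m2
    # blocks[i, j] is the first block d(i, j)
    return feature_map.reshape(d1, m1, d2, m2).swapaxes(1, 2)

blocks = split_into_blocks(np.random.rand(28, 28), 7, 7)
print(blocks.shape)  # (4, 4, 7, 7): D = 16 blocks, as in Figures 2A-2C
```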
In step 103, the gradient value of each of the first blocks is calculated.
In some embodiments, in the first block d(i,j), the first gradient g_x of each feature point f(x,y) in a first direction and the second gradient g_y in a second direction are calculated, where the first direction and the second direction are mutually perpendicular, 1 ≤ i ≤ d_1, 1 ≤ j ≤ d_2, and D = d_1 × d_2.
For example, the first direction is the x-axis direction in a preset plane, and the second direction is the y-axis direction in the preset plane.
The first gradient g_x of the feature point f(x,y) in the first direction is:
g_x = f(x+1, y) - f(x, y)    (1)
The second gradient g_y of the feature point f(x,y) in the second direction is:
g_y = f(x, y+1) - f(x, y)    (2)
Next, the gradient value of the feature point f(x,y) is determined according to the first gradient g_x and the second gradient g_y.
In some embodiments, the gradient value of the feature point f(x,y) is the mean square value of the first gradient g_x and the second gradient g_y, that is:
g(x, y) = (g_x^2 + g_y^2) / 2    (3)
Next, the gradient value of the first block d(i,j) is determined according to the gradient values of all feature points in the first block d(i,j).
In some embodiments, the gradient value of the first block d(i,j) is the average of the gradient values of all feature points in the first block d(i,j), that is:
G(i, j) = (1 / (m_1 × m_2)) Σ g(x, y)    (4)
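The block scores of step 103 then follow from formulas (1)-(4). In this sketch the forward differences of formulas (1)-(2) and the clamping at block borders are assumptions; only the mean-square combination (3) and the block average (4) are fixed by the text above.

```python
# Minimal sketch of step 103: score each first block d(i, j) by the average
# of per-point mean-square gradients.
import numpy as np

def block_gradient_scores(blocks: np.ndarray) -> np.ndarray:
    d1, d2, _, _ = blocks.shape
    scores = np.zeros((d1, d2))
    for i in range(d1):
        for j in range(d2):
            f = blocks[i, j]
            gx = np.diff(f, axis=0, append=f[-1:, :])  # formula (1), clamped edge
            gy = np.diff(f, axis=1, append=f[:, -1:])  # formula (2), clamped edge
            point_grad = (gx ** 2 + gy ** 2) / 2.0     # formula (3)
            scores[i, j] = point_grad.mean()           # formula (4)
    return scores
```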
In step 104, all the first blocks are sorted by their gradient averages, so as to delete the first p first blocks with the smallest gradient averages.
In some embodiments, the parameter p is determined according to a preset compression ratio α.
For example, the relationship between the compression ratio α and the parameter p is shown in formula (5):
α = (p × m_1 × m_2) / (n_1 × n_2)    (5)
where n_1 × n_2 is the size of the source feature map. The number of discarded blocks can therefore be chosen as needed; that is, the compression ratio of the features can be adjusted flexibly.
For example, as shown in Figure 3A, the feature map includes 16 blocks, i.e., D = 16. All the blocks are sorted by gradient average, and the 11 blocks with the smallest gradient averages are deleted, i.e., p = 11, so that only 5 blocks remain in the feature map, i.e., D-p = 5. As shown in Figure 3B, the five retained blocks are the 2nd, 7th, 9th, 14th, and 16th blocks.
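Step 104 is then an argsort over the block scores. In the sketch below, the expression used to derive p inverts the reconstruction of formula (5) given above, and the rounding to the nearest integer is an assumption.

```python
# Minimal sketch of step 104: derive p from the preset compression ratio and
# keep only the D-p blocks with the largest gradient averages.
import numpy as np

def select_visible_blocks(scores: np.ndarray, alpha: float,
                          m1: int, m2: int, n1: int, n2: int) -> np.ndarray:
    p = int(round(alpha * n1 * n2 / (m1 * m2)))  # inverted formula (5)
    order = np.argsort(scores.ravel())           # ascending gradient average
    dropped = order[:p]                          # the p least informative blocks
    return np.setdiff1d(np.arange(scores.size), dropped)  # indices kept

# With alpha = 11/16 on the 4 x 4 example, p = 11 and 5 blocks survive,
# matching Figures 3A-3B.
```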
In step 105, a preset encoder is used to encode the block embedding features and position embedding features of the D-p first blocks that were not deleted, to obtain D second blocks, where the D second blocks include D-p coded visible blocks and p mask tokens located at predetermined positions, and the D-p coded visible blocks correspond one-to-one to the D-p first blocks.
In some embodiments, the preset encoder is a Transformer encoder.
For example, the 5 blocks shown in Figure 3B, namely the 2nd, 7th, 9th, 14th, and 16th blocks, are input into the trained encoder so that the encoder outputs an encoding result. The encoding result includes 16 blocks: 5 encoded visible patches corresponding to the 5 blocks in Figure 3B, shown as the 5 white boxes 41 in Figure 4, and 11 mask tokens, shown as the 11 dark boxes 42 in Figure 4.
It should be noted that the positions of the 5 coded visible blocks in the encoding result correspond to the positions of the 5 blocks in Figure 3B. For example, as shown in Figure 4, the 5 coded visible blocks are the 2nd, 7th, 9th, 14th, and 16th blocks of the encoding result.
It should also be noted that, because 11 blocks were discarded from the feature map shown in Figure 3B (the 1st, 3rd, 4th, 5th, 6th, 8th, 10th, 11th, 12th, 13th, and 15th blocks), the encoder adds a corresponding mask token (Mask Token) at the position of each discarded block, as shown by the 11 dark boxes 42 in Figure 4. That is, the encoding result includes 11 mask tokens, at the 1st, 3rd, 4th, 5th, 6th, 8th, 10th, 11th, 12th, 13th, and 15th positions of the encoding result.
It should be noted that, since how the encoder is trained is not the inventive point of the present disclosure, it is not described in detail here.
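A sketch of step 105 follows: only the visible blocks pass through a Transformer encoder, and a shared learnable mask token fills the p discarded positions so that the output again has D entries. The embedding width, layer count, and the use of PyTorch's nn.TransformerEncoder are assumptions for illustration; the disclosure only requires a preset (trained) Transformer encoder.

```python
# Minimal sketch of step 105: encode the D-p visible blocks, then scatter the
# coded visible blocks back to their positions and place mask tokens elsewhere.
import torch
import torch.nn as nn

D, dim = 16, 128
pos_embed = nn.Parameter(torch.zeros(D, dim))   # position embedding features
mask_token = nn.Parameter(torch.zeros(dim))     # shared mask token
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
    num_layers=2,
)

def encode_with_mask_tokens(patch_embed: torch.Tensor,
                            visible: torch.Tensor) -> torch.Tensor:
    # patch_embed: (D, dim) block embedding features for all D positions
    tokens = patch_embed[visible] + pos_embed[visible]  # D-p visible tokens
    encoded = encoder(tokens.unsqueeze(0)).squeeze(0)   # coded visible blocks
    out = mask_token.expand(D, dim).clone()             # mask tokens by default
    out[visible] = encoded                              # one-to-one scatter
    return out                                          # the D second blocks
```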
In step 106, a preset decoder is used to decode all the second blocks and the position embedding features of the source feature map, to obtain the reconstructed feature map.
In some embodiments, the preset decoder is a Transformer decoder.
In some embodiments, the structure of the decoder is shown in Figure 5.
As shown in Figure 5, the input features are normalized and then processed by a multi-head self-attention (Multi-head Self Attention) layer.
For example, for the t-th second block, the corresponding first vector matrix Q_t, second vector matrix K_t, and third vector matrix V_t are determined according to the first attention weight matrix W_t^Q, the second attention weight matrix W_t^K, and the third attention weight matrix W_t^V of each single head, where 1 ≤ t ≤ D.
For example, the calculation is shown in formula (6), where F_t is the feature of the t-th second block:
Q_t = F_t W_t^Q,  K_t = F_t W_t^K,  V_t = F_t W_t^V    (6)
Next, the attention value s_t of each single head is determined according to the first vector matrix Q_t, the second vector matrix K_t, and the third vector matrix V_t, as shown in formula (7):
s_t = τ(Q_t K_t^T / √d_K) V_t    (7)
where d_K is the dimension of the matrix K_t, δ denotes the attention calculation function (here the scaled dot product), and τ is the Softmax logistic regression function.
Next, the multi-head attention value of the t-th second block is determined according to the attention values of all single heads of the t-th second block, as shown in formula (8):
S_t = Concat(s_t^1, ..., s_t^h) W^O    (8)
where Concat(·) is the concatenate function and W^O is a parameter matrix.
Next, the multi-head attention value of the t-th second block and the t-th second block are input into an MLP (Multi-Layer Perceptron) layer for corresponding processing, to obtain the reconstructed feature map.
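Putting formulas (6)-(8) and the MLP layer together, one decoder block in the style of Figure 5 might look as follows. This is a sketch under assumptions: the pre-norm residual wiring, head count, and MLP width are illustrative choices; nn.MultiheadAttention internally realizes the per-head projections of formula (6), the scaled dot-product attention of formula (7), and the concatenation with output projection of formula (8).

```python
# Minimal sketch of step 106: one pre-norm decoder block combining multi-head
# self-attention (formulas (6)-(8)) with an MLP, as in Figure 5.
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, dim: int = 128, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)                                  # Normalize
        x = x + self.attn(h, h, h, need_weights=False)[0]  # multi-head attention
        x = x + self.mlp(self.norm2(x))                    # MLP processing
        return x

tokens = torch.randn(1, 16, 128)        # D second blocks plus position embeds
reconstructed = DecoderBlock()(tokens)  # (1, 16, 128) reconstructed features
```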
In the image feature processing method provided by the above embodiments of the present disclosure, by discarding the first p blocks with the smallest gradient averages in the feature map, the features of non-important areas in an image or video can be effectively discarded, so as to better complete machine vision intelligent analysis tasks.
图6为本公开一个实施例的图像特征处理装置的结构示意图。如图6所示,图像特征处理装置包括第一处理模块61、第二处理模块62、第三处理模块63和第四处理模块64。FIG. 6 is a schematic structural diagram of an image feature processing device according to an embodiment of the present disclosure. As shown in FIG. 6 , the image feature processing device includes a first processing module 61 , a second processing module 62 , a third processing module 63 and a fourth processing module 64 .
第一处理模块61被配置为提取原始图像的特征,以得到源特征图。The first processing module 61 is configured to extract features of the original image to obtain a source feature map.
在一些实施例中,将原始图像输入CNN,以得到源特征图。In some embodiments, the original image is input into the CNN to obtain the source feature map.
第二处理模块62被配置为将源特征图分割为预定数量D个相互不重叠的第一分块,计算全部第一分块中的每个第一分块的梯度值,将全部第一分块按照梯度平均值进行排序,以删除梯度平均值最小的前p个第一分块。The second processing module 62 is configured to divide the source feature map into a predetermined number D of mutually non-overlapping first blocks, calculate the gradient value of each first block in all first blocks, and divide all first blocks into Blocks are sorted by gradient mean to remove the top p first blocks with the smallest gradient mean.
在一些实施例中,若源特征图的大小为n1×n2,每个第一分块的大小为m1×m2,则D=d1×d2,其中d1=n1/m1,d2=n2/m2In some embodiments, if the size of the source feature map is n 1 ×n 2 and the size of each first block is m 1 ×m 2 , then D=d 1 ×d 2 , where d 1 =n 1 / m 1 , d 2 =n 2 /m 2 .
在一些实施例中,第二处理模块62在第一分块d(i,j)中,计算每个特征点f(x,y)在第一方向上的第一梯度gx和在第二方向上的第二梯度gy,其中第一方向和第二方向相互垂直,其中1≤i≤d1,1≤j≤d2,D=d1×d2In some embodiments, the second processing module 62 calculates the first gradient g x of each feature point f (x, y) in the first direction and the second gradient g x in the first block d (i, j). The second gradient g y in the direction, where the first direction and the second direction are perpendicular to each other, where 1≤i≤d 1 , 1≤j≤d 2 , D=d 1 ×d 2 .
例如,第一方向为预设平面中的x轴方向,第二方向为预设平面中的y轴方向。For example, the first direction is the x-axis direction in the preset plane, and the second direction is the y-axis direction in the preset plane.
接下来,第二处理模块62根据第一梯度gx和第二梯度gy确定特征点f(x,y)的梯度值。根据第一分块d(i,j)中的全部特征点的梯度值确定第一分块d(i,j)的梯度值。Next, the second processing module 62 determines the gradient value of the feature point f(x, y) according to the first gradient g x and the second gradient gy . The gradient value of the first block d(i,j) is determined based on the gradient values of all feature points in the first block d(i,j).
在一些实施例中,特征点f(x,y)的梯度值为第一梯度gx和第二梯度gy的均方值,如上述公式(3)所示。In some embodiments, the gradient value of the feature point f(x, y) is the mean square value of the first gradient g x and the second gradient g y , as shown in the above formula (3).
在一些实施例中,第一分块d(i,j)的梯度值为第一分块d(i,j)中的全部特征点的梯度值的平均值,如上述公式(4)所示。In some embodiments, the gradient value of the first block d(i,j) is the average of the gradient values of all feature points in the first block d(i,j), as shown in the above formula (4) .
在一些实施例中,根据预设的压缩比α确定参数p。例如根据上述公式(5)确定参数p。In some embodiments, the parameter p is determined according to a preset compression ratio α. For example, the parameter p is determined according to the above formula (5).
第三处理模块63被配置为利用预设的编码器对未被删除的D-p个第一分块的分 块嵌入特征和位置嵌入特征进行编码处理,以得到D个第二分块,其中D个第二分块中包括位于预定位置上的D-p个编码可视分块和p个掩膜令牌,其中D-p个编码可视分块与D-p个第一分块一一对应。The third processing module 63 is configured to use a preset encoder to segment the Dp first blocks that have not been deleted. Block embedding features and position embedding features are encoded to obtain D second blocks, where the D second blocks include Dp encoded visual blocks and p mask tokens located at predetermined positions, where Dp coded visible blocks correspond to Dp first blocks one-to-one.
在一些实施例中,预设的编码器为Transformer编码器。In some embodiments, the default encoder is a Transformer encoder.
The fourth processing module 64 is configured to use a preset decoder to decode all the second blocks together with the position embedding features of the source feature map, to obtain a reconstructed feature map.
In some embodiments, the preset decoder is a Transformer decoder.
In some embodiments, for the t-th second block among all the second blocks, the fourth processing module 64 determines the corresponding first vector matrix Qt, second vector matrix Kt and third vector matrix Vt according to the first attention weight matrix, the second attention weight matrix and the third attention weight matrix of each single head, where 1≤t≤D. For example, the calculation follows formula (6) above.
Next, the fourth processing module 64 determines the attention value of each single head according to the first vector matrix Qt, the second vector matrix Kt and the third vector matrix Vt, for example according to formula (7) above.
Next, the fourth processing module 64 determines the multi-head attention value of the t-th second block according to the attention values of all the single heads of the t-th second block, for example according to formula (8) above.
Next, multi-layer perceptron processing is performed on the multi-head attention value of the t-th second block and the t-th second block, to obtain the reconstructed feature map.
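For the attention arithmetic of this decoding step, a self-contained sketch follows; we assume the usual scaled dot-product form for formula (7) and concatenation plus an output projection for formula (8), and the sketch processes all D second blocks at once rather than one t at a time:

    import math
    import torch

    def decode_attention(X, Wq, Wk, Wv, Wo):
        # X: (D, dim) matrix of second blocks (position embeddings already added).
        # Wq, Wk, Wv: lists of per-head (dim, d_head) weight matrices; Wo: (heads*d_head, dim).
        heads = []
        for h in range(len(Wq)):
            Q, K, V = X @ Wq[h], X @ Wk[h], X @ Wv[h]        # cf. formula (6)
            scores = Q @ K.T / math.sqrt(Q.shape[-1])        # scaled dot products
            heads.append(torch.softmax(scores, dim=-1) @ V)  # cf. formula (7): single-head value
        return torch.cat(heads, dim=-1) @ Wo                 # cf. formula (8): multi-head value

In a standard Transformer block, this multi-head attention value would then pass, together with the second blocks through a residual connection, into the multi-layer perceptron that yields the reconstructed feature map.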
FIG. 7 is a schematic structural diagram of an image feature processing apparatus according to another embodiment of the present disclosure. As shown in FIG. 7, the image feature processing apparatus includes a memory 71 and a processor 72.
The memory 71 is used to store instructions. The processor 72 is coupled to the memory 71 and is configured to execute, based on the instructions stored in the memory, the method involved in any of the embodiments in FIG. 1.
As shown in FIG. 7, the image feature processing apparatus further includes a communication interface 73 for exchanging information with other devices. The apparatus also includes a bus 74, through which the processor 72, the communication interface 73 and the memory 71 communicate with one another.
The memory 71 may include high-speed RAM, and may also include non-volatile memory, such as at least one disk memory. The memory 71 may also be a memory array. The memory 71 may further be divided into blocks, and the blocks may be combined into virtual volumes according to certain rules.
In addition, the processor 72 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present disclosure.
The present disclosure also relates to a computer-readable storage medium, where the computer-readable storage medium stores computer instructions which, when executed by a processor, implement the method involved in any of the embodiments in FIG. 1.
By implementing the above embodiments of the present disclosure, the following beneficial effects can be obtained:
1) By calculating gradient values within the blocks of a feature map, the present disclosure can, based on the ranking of those gradient values, identify the important regions on which the feature information is focused, and can then discard the blocks of non-important regions to achieve feature compression;
2) The present disclosure can flexibly control the compression rate by controlling the ratio of discarded feature blocks, so that various compression-rate requirements can be met;
3) The encoding part of the present disclosure can be added to an imaging device of a machine vision system as a machine vision encoding module, and the decoding part of the present disclosure can be added to the edge function set of the machine vision system as a machine vision decoding module, thereby improving compression efficiency.
In some embodiments, the functional units described above may be implemented as a general-purpose processor, a programmable logic controller (PLC), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any appropriate combination thereof, for performing the functions described in the present disclosure.
Those of ordinary skill in the art can understand that all or part of the steps for implementing the above embodiments may be completed by hardware, or by hardware controlled by program instructions; the program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk, or the like.
The description of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or to limit the disclosure to the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiments were chosen and described in order to best explain the principles of the disclosure and its practical application, and to enable others of ordinary skill in the art to understand the disclosure and design various embodiments with various modifications suited to the particular use contemplated.

Claims (19)

  1. An image feature processing method, executed by an image feature processing apparatus, comprising:
    extracting features of an original image to obtain a source feature map;
    dividing the source feature map into a predetermined number D of mutually non-overlapping first blocks;
    calculating a gradient value of each first block among all the first blocks;
    sorting all the first blocks according to their gradient average values, to delete the first p first blocks having the smallest gradient average values;
    using a preset encoder to encode block embedding features and position embedding features of the D-p first blocks that have not been deleted, to obtain D second blocks, wherein the D second blocks comprise D-p encoded visible blocks located at predetermined positions and p mask tokens, the D-p encoded visible blocks corresponding one-to-one to the D-p first blocks;
    using a preset decoder to decode all the second blocks and the position embedding features of the source feature map, to obtain a reconstructed feature map.
  2. The method according to claim 1, wherein calculating the gradient value of each first block comprises:
    in a first block d(i,j), calculating a first gradient gx of each feature point f(x,y) in a first direction and a second gradient gy in a second direction, wherein the first direction and the second direction are perpendicular to each other, where 1≤i≤d1, 1≤j≤d2, D=d1×d2;
    determining a gradient value of the feature point f(x,y) according to the first gradient gx and the second gradient gy;
    determining the gradient value of the first block d(i,j) according to the gradient values of all the feature points in the first block d(i,j).
  3. The method according to claim 2, wherein the gradient value of the feature point f(x,y) is the mean square value of the first gradient gx and the second gradient gy.
  4. The method according to claim 2, wherein the gradient value of the first block d(i,j) is the average of the gradient values of all the feature points in the first block d(i,j).
  5. The method according to claim 1, further comprising:
    determining the parameter p according to a preset compression ratio α.
  6. The method according to claim 5, wherein the compression ratio α is:

    where n1×n2 is the size of the source feature map.
  7. The method according to claim 1, wherein using a preset decoder to decode all the second blocks and the position embedding features of the source feature map comprises:
    for the t-th second block among all the second blocks, determining a corresponding first vector matrix Qt, second vector matrix Kt and third vector matrix Vt according to the first attention weight matrix, the second attention weight matrix and the third attention weight matrix of each single head, where 1≤t≤D;
    determining an attention value of each single head according to the first vector matrix Qt, the second vector matrix Kt and the third vector matrix Vt;
    determining a multi-head attention value of the t-th second block according to the attention values of all the single heads of the t-th second block;
    performing multi-layer perceptron processing on the multi-head attention value of the t-th second block and the t-th second block, to obtain the reconstructed feature map.
  8. The method according to any one of claims 1-7, wherein:
    the encoder is a Transformer encoder;
    the decoder is a Transformer decoder.
  9. An image feature processing apparatus, comprising:
    a first processing module configured to extract features of an original image to obtain a source feature map;
    a second processing module configured to divide the source feature map into a predetermined number D of mutually non-overlapping first blocks, calculate a gradient value of each first block among all the first blocks, and sort all the first blocks according to their gradient average values to delete the first p first blocks having the smallest gradient average values;
    a third processing module configured to use a preset encoder to encode block embedding features and position embedding features of the D-p first blocks that have not been deleted, to obtain D second blocks, wherein the D second blocks comprise D-p encoded visible blocks located at predetermined positions and p mask tokens, the D-p encoded visible blocks corresponding one-to-one to the D-p first blocks;
    a fourth processing module configured to use a preset decoder to decode all the second blocks and the position embedding features of the source feature map, to obtain a reconstructed feature map.
  10. The apparatus according to claim 9, wherein:
    the second processing module is configured to calculate, in a first block d(i,j), a first gradient gx of each feature point f(x,y) in a first direction and a second gradient gy in a second direction, wherein the first direction and the second direction are perpendicular to each other, where 1≤i≤d1, 1≤j≤d2, D=d1×d2; determine a gradient value of the feature point f(x,y) according to the first gradient gx and the second gradient gy; and determine the gradient value of the first block d(i,j) according to the gradient values of all the feature points in the first block d(i,j).
  11. The apparatus according to claim 10, wherein:
    the gradient value of the feature point f(x,y) is the mean square value of the first gradient gx and the second gradient gy.
  12. The apparatus according to claim 10, wherein:
    the gradient value of the first block d(i,j) is the average of the gradient values of all the feature points in the first block d(i,j).
  13. The apparatus according to claim 9, wherein:
    the second processing module is configured to determine the parameter p according to a preset compression ratio α.
  14. The apparatus according to claim 13, wherein the compression ratio α is:

    where n1×n2 is the size of the source feature map.
  15. The apparatus according to claim 9, wherein:
    the fourth processing module is configured to, for the t-th second block among all the second blocks, determine a corresponding first vector matrix Qt, second vector matrix Kt and third vector matrix Vt according to the first attention weight matrix, the second attention weight matrix and the third attention weight matrix of each single head, where 1≤t≤D; determine an attention value of each single head according to the first vector matrix Qt, the second vector matrix Kt and the third vector matrix Vt; determine a multi-head attention value of the t-th second block according to the attention values of all the single heads of the t-th second block; and perform multi-layer perceptron processing on the multi-head attention value of the t-th second block and the t-th second block, to obtain the reconstructed feature map.
  16. The apparatus according to any one of claims 9-15, wherein:
    the encoder is a Transformer encoder;
    the decoder is a Transformer decoder.
  17. An image feature processing apparatus, comprising:
    a memory configured to store instructions;
    a processor coupled to the memory, the processor being configured to execute, based on the instructions stored in the memory, the method according to any one of claims 1-8.
  18. A non-transitory computer-readable storage medium, wherein the computer-readable storage medium stores computer instructions which, when executed by a processor, implement the method according to any one of claims 1-8.
  19. A computer program product comprising computer instructions, wherein the computer instructions, when executed by a processor, implement the method according to any one of claims 1-8.
PCT/CN2023/110526 2022-08-19 2023-08-01 Image feature processing method and apparatus, and storage medium WO2024037330A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210998237.8 2022-08-19
CN202210998237.8A CN117649569A (en) 2022-08-19 2022-08-19 Image feature processing method and device and storage medium

Publications (1)

Publication Number Publication Date
WO2024037330A1 (en)

Family

ID=89940676

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/110526 WO2024037330A1 (en) 2022-08-19 2023-08-01 Image feature processing method and apparatus, and storage medium

Country Status (2)

Country Link
CN (1) CN117649569A (en)
WO (1) WO2024037330A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090067491A1 (en) * 2007-09-07 2009-03-12 Microsoft Corporation Learning-Based Image Compression
CN107154061A (en) * 2017-05-09 2017-09-12 北京航宇天穹科技有限公司 The regularization coding/decoding method that a kind of splits' positions are perceived
US20200105022A1 (en) * 2018-09-27 2020-04-02 Ateme Method for image processing and apparatus for implementing the same
CN115514976A (en) * 2022-07-15 2022-12-23 中国电信股份有限公司 Image encoding method, decoding method, device, readable medium and electronic equipment
CN115661276A (en) * 2022-10-21 2023-01-31 中国电信股份有限公司 Image data encoding method, device, apparatus, medium, and program

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SHI RONG; FU ZHIZHONG; SONG YAHUI; LI ZAIMING: "The New Technique for the Digital Image Compression Based on the Information Redundancy among Image Blocks", Chinese Journal of Scientific Instrument, vol. 24, no. 4, 1 December 2003, pages 375-376, XP093140315, DOI: 10.19650/j.cnki.cjsi.2003.s2.167 *
XU HAOHANG; DING SHUANGRUI; ZHANG XIAOPENG; XIONG HONGKAI; TIAN QI: "Masked Autoencoders are Robust Data Augmentors", arXiv, 9 June 2022, XP093140308, [retrieved on 2024-03-12], DOI: 10.48550/arxiv.2206.04846 *

Also Published As

Publication number Publication date
CN117649569A (en) 2024-03-05


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23854234

Country of ref document: EP

Kind code of ref document: A1