CN117649569A - Image feature processing method and device and storage medium - Google Patents

Image feature processing method and device and storage medium

Info

Publication number
CN117649569A
CN117649569A
Authority
CN
China
Prior art keywords
gradient
blocks
value
block
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210998237.8A
Other languages
Chinese (zh)
Inventor
韩韬
张园
杨明川
王慧芬
薛俊达
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Corp Ltd
Original Assignee
China Telecom Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Telecom Corp Ltd filed Critical China Telecom Corp Ltd
Priority to CN202210998237.8A priority Critical patent/CN117649569A/en
Priority to PCT/CN2023/110526 priority patent/WO2024037330A1/en
Publication of CN117649569A publication Critical patent/CN117649569A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/771 Feature selection, e.g. selecting representative features from a multi-dimensional feature space
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Editing Of Facsimile Originals (AREA)
  • Image Processing (AREA)

Abstract

The disclosure provides an image feature processing method and device and a storage medium, and relates to the field of image processing. The image feature processing method comprises the following steps: extracting features of an original image to obtain a source feature map; dividing the source feature map into a predetermined number D of first blocks which do not overlap each other; calculating a gradient value for each of the first blocks; sorting all the first blocks by gradient value so as to delete the p first blocks with the smallest gradient values; encoding, by a preset encoder, block embedding features and position embedding features of the D-p first blocks that were not deleted to obtain D second blocks, where the D second blocks include D-p encoded visual blocks and p mask tokens located at predetermined positions, and the D-p encoded visual blocks correspond one-to-one with the D-p first blocks; and decoding, by a preset decoder, all the second blocks and the position embedding features of the source feature map to obtain a reconstructed feature map.

Description

Image feature processing method and device and storage medium
Technical Field
The present disclosure relates to the field of image processing, and in particular, to an image feature processing method and apparatus, and a storage medium.
Background
With the rapid development of 5G, big data and artificial intelligence, media content such as images and videos is widely used in intelligent visual tasks such as object detection, object tracking, image classification, image segmentation and pedestrian re-identification in the context of image and video big data applications.
The 5G era is driving large-scale machine-oriented applications, such as machine vision for the Internet of Vehicles, autonomous driving, the industrial Internet, smart and safe cities, wearables and video surveillance, which have broader application scenarios than the increasingly saturated human-vision-oriented video applications. Machine-vision-oriented video coding will become one of the main sources of incremental traffic in the 5G era and beyond.
Disclosure of Invention
The inventors noted that, in the related art, an encoder encodes an image or video for human visual tasks to generate a bitstream, and a decoder decodes the bitstream to recover the image or video. Since such encoders and decoders are based primarily on convolutional neural networks, they cannot selectively compress the image or video, i.e., they cannot discard the features of non-important regions in the image or video.
Accordingly, the present disclosure provides an image feature processing scheme that can effectively discard the features of non-important regions in an image or video, so that machine-vision intelligent analysis tasks can be completed better.
According to a first aspect of embodiments of the present disclosure, there is provided an image feature processing method, performed by an image feature processing apparatus, comprising: extracting features of an original image to obtain a source feature map; dividing the source feature map into a predetermined number D of first blocks which do not overlap each other; calculating a gradient value for each of the first blocks; sorting all the first blocks by gradient value so as to delete the p first blocks with the smallest gradient values; encoding, by a preset encoder, block embedding features and position embedding features of the D-p first blocks that were not deleted to obtain D second blocks, wherein the D second blocks comprise D-p encoded visual blocks and p mask tokens located at predetermined positions, and the D-p encoded visual blocks correspond one-to-one with the D-p first blocks; and decoding, by a preset decoder, all the second blocks and the position embedding features of the source feature map to obtain a reconstructed feature map.
In some embodiments, calculating the gradient value of each first block includes: calculating, in a first block d(i, j), a first gradient g_x of each feature point f(x, y) in a first direction and a second gradient g_y in a second direction, wherein the first direction and the second direction are perpendicular to each other, 1 ≤ i ≤ d_1, 1 ≤ j ≤ d_2, and D = d_1 × d_2; determining the gradient value of the feature point f(x, y) according to the first gradient g_x and the second gradient g_y; and determining the gradient value of the first block d(i, j) according to the gradient values of all the feature points in the first block d(i, j).
In some embodiments, the gradient value of the feature point f(x, y) is the mean square value of the first gradient g_x and the second gradient g_y.
In some embodiments, the gradient value of the first block d(i, j) is the average of the gradient values of all the feature points in the first block d(i, j).
In some embodiments, the parameter p is determined according to a preset compression ratio α.
In some embodiments, the compression ratio α is:
α = (p × m_1 × m_2) / (n_1 × n_2)
wherein n_1 × n_2 is the size of the source feature map and m_1 × m_2 is the size of each first block.
In some embodiments, decoding all the second blocks and the position embedding features of the source feature map by using a preset decoder includes: for the t-th second block among all the second blocks, determining a corresponding first vector matrix Q_t, second vector matrix K_t and third vector matrix V_t according to a first attention weight matrix W_t^Q, a second attention weight matrix W_t^K and a third attention weight matrix W_t^V of each single head, respectively, wherein 1 ≤ t ≤ D; determining an attention value of each single head according to the first vector matrix Q_t, the second vector matrix K_t and the third vector matrix V_t; determining a multi-head attention value of the t-th second block according to all the single-head attention values of the t-th second block; and performing multi-layer perception processing on the multi-head attention value of the t-th second block and the t-th second block to obtain the reconstructed feature map.
In some embodiments, the encoder is a Transformer encoder and the decoder is a Transformer decoder.
According to a second aspect of embodiments of the present disclosure, there is provided an image feature processing apparatus comprising: a first processing module configured to extract features of an original image to obtain a source feature map; a second processing module configured to divide the source feature map into a predetermined number D of first blocks which do not overlap each other, calculate a gradient value for each of the first blocks, and sort all the first blocks by gradient value so as to delete the p first blocks with the smallest gradient values; a third processing module configured to encode, by a preset encoder, block embedding features and position embedding features of the D-p first blocks that were not deleted to obtain D second blocks, wherein the D second blocks comprise D-p encoded visual blocks and p mask tokens located at predetermined positions, and the D-p encoded visual blocks correspond one-to-one with the D-p first blocks; and a fourth processing module configured to decode, by a preset decoder, all the second blocks and the position embedding features of the source feature map to obtain a reconstructed feature map.
In some embodiments, the second processing module is configured to calculate, in a first block d(i, j), a first gradient g_x of each feature point f(x, y) in a first direction and a second gradient g_y in a second direction, wherein the first direction and the second direction are perpendicular to each other, 1 ≤ i ≤ d_1, 1 ≤ j ≤ d_2, and D = d_1 × d_2; determine the gradient value of the feature point f(x, y) according to the first gradient g_x and the second gradient g_y; and determine the gradient value of the first block d(i, j) according to the gradient values of all the feature points in the first block d(i, j).
In some embodiments, the gradient value of the feature point f(x, y) is the mean square value of the first gradient g_x and the second gradient g_y.
In some embodiments, the gradient value of the first block d(i, j) is the average of the gradient values of all the feature points in the first block d(i, j).
In some embodiments, the second processing module is configured to determine the parameter p according to a preset compression ratio α.
In some embodiments, the compression ratio α is:
α = (p × m_1 × m_2) / (n_1 × n_2)
wherein n_1 × n_2 is the size of the source feature map and m_1 × m_2 is the size of each first block.
In some embodiments, the fourth processing module is configured to, for the t-th second block among all the second blocks, determine a corresponding first vector matrix Q_t, second vector matrix K_t and third vector matrix V_t according to the first attention weight matrix W_t^Q, the second attention weight matrix W_t^K and the third attention weight matrix W_t^V of each single head, respectively, wherein 1 ≤ t ≤ D; determine the attention value of each single head according to the first vector matrix Q_t, the second vector matrix K_t and the third vector matrix V_t; determine the multi-head attention value of the t-th second block according to all the single-head attention values of the t-th second block; and perform multi-layer perception processing on the multi-head attention value of the t-th second block and the t-th second block to obtain the reconstructed feature map.
In some embodiments, the encoder is a Transformer encoder and the decoder is a Transformer decoder.
According to a third aspect of the embodiments of the present disclosure, there is provided an image feature processing apparatus including: a memory configured to store instructions; a processor coupled to the memory, the processor configured to perform a method according to any of the embodiments described above based on instructions stored in the memory.
According to a fourth aspect of embodiments of the present disclosure, there is provided a non-transitory computer readable storage medium, wherein the computer readable storage medium stores computer instructions that, when executed by a processor, implement a method as in any of the embodiments described above.
Other features of the present disclosure and its advantages will become apparent from the following detailed description of exemplary embodiments of the disclosure, which proceeds with reference to the accompanying drawings.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the solutions in the prior art, the drawings required for describing the embodiments or the prior art are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present disclosure, and a person of ordinary skill in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of an image feature processing method according to an embodiment of the present disclosure;
FIGS. 2A-2C are schematic diagrams of feature maps according to some embodiments of the present disclosure;
FIGS. 3A-3B are schematic diagrams of feature maps according to other embodiments of the present disclosure;
FIG. 4 is a schematic diagram of encoder output according to one embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a decoder according to one embodiment of the present disclosure;
FIG. 6 is a schematic diagram of an image feature processing apparatus according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of an image feature processing apparatus according to another embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will now be described clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present disclosure. The following description of at least one exemplary embodiment is merely illustrative and is in no way intended to limit the disclosure, its application, or uses. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present disclosure without inventive effort shall fall within the scope of the present disclosure.
The relative arrangement of the components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless it is specifically stated otherwise.
Meanwhile, it should be understood that, for convenience of description, the sizes of the parts shown in the drawings are not drawn to actual scale.
Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but should be considered part of the specification where appropriate.
In all examples shown and discussed herein, any specific values should be construed as merely illustrative, and not a limitation. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further discussion thereof is necessary in subsequent figures.
Fig. 1 is a flowchart of an image feature processing method according to an embodiment of the present disclosure. In some embodiments, the following image feature processing method is performed by the image feature processing apparatus.
In step 101, features of an original image are extracted to obtain a source feature map.
In some embodiments, the original image is input into a CNN (Convolutional Neural Network) to obtain the source feature map.
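For illustration, a minimal PyTorch sketch of this step is given below. The single-convolution backbone, channel count and image size are assumptions made only to keep the sketch self-contained; the disclosure does not fix a specific CNN.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the feature extractor: any CNN that maps an image
# to a feature map would do; one strided convolution keeps the sketch minimal.
backbone = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=8, padding=1),
    nn.ReLU(),
)

image = torch.randn(1, 3, 224, 224)    # original image: (batch, channels, H, W)
source_feature_map = backbone(image)   # source feature map, here (1, 64, 28, 28)
```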
In step 102, the source feature map is divided into a predetermined number D of first blocks which do not overlap each other.
In some embodiments, if the source feature map has a size of n_1 × n_2 and each first block has a size of m_1 × m_2, then D = d_1 × d_2, where d_1 = n_1/m_1 and d_2 = n_2/m_2.
For example, as shown in fig. 2A, the size of the source feature map is 28 × 28. As shown in fig. 2B, each block has a size of 7 × 7, i.e., the source feature map is divided into 4 × 4 blocks, d(0, 0) to d(3, 3), as shown in fig. 2C.
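A sketch of the partitioning step, assuming the source feature map is a PyTorch tensor; the reshape/permute decomposition below is one standard way to obtain the D = d_1 × d_2 non-overlapping blocks:

```python
import torch

def partition_blocks(fmap: torch.Tensor, m1: int, m2: int) -> torch.Tensor:
    """Split a (C, n1, n2) feature map into D = d1 * d2 non-overlapping
    blocks of size m1 x m2, where d1 = n1 / m1 and d2 = n2 / m2."""
    c, n1, n2 = fmap.shape
    d1, d2 = n1 // m1, n2 // m2
    blocks = fmap.reshape(c, d1, m1, d2, m2).permute(1, 3, 0, 2, 4)
    return blocks.reshape(d1 * d2, c, m1, m2)

fmap = torch.randn(64, 28, 28)          # the 28 x 28 source feature map of fig. 2A
blocks = partition_blocks(fmap, 7, 7)   # 16 blocks of 7 x 7, as in figs. 2B-2C
```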
In step 103, a gradient value is calculated for each of the first blocks.
In some embodiments, in a first block d(i, j), a first gradient g_x of each feature point f(x, y) in a first direction and a second gradient g_y in a second direction are calculated, wherein the first direction and the second direction are perpendicular to each other, 1 ≤ i ≤ d_1, 1 ≤ j ≤ d_2, and D = d_1 × d_2.
For example, the first direction is an x-axis direction in a predetermined plane, and the second direction is a y-axis direction in the predetermined plane.
The first gradient g_x of the feature point f(x, y) in the first direction is:
g_x = ∂f(x, y)/∂x (1)
The second gradient g_y of the feature point f(x, y) in the second direction is:
g_y = ∂f(x, y)/∂y (2)
Next, the gradient value of the feature point f(x, y) is determined according to the first gradient g_x and the second gradient g_y.
In some embodiments, the gradient value of the feature point f(x, y) is the mean square value of the first gradient g_x and the second gradient g_y, namely:
g(x, y) = (g_x^2 + g_y^2) / 2 (3)
Next, the gradient value of the first block d(i, j) is determined according to the gradient values of all the feature points in the first block d(i, j).
In some embodiments, the gradient value of the first block d(i, j) is the average of the gradient values of all the feature points in the first block d(i, j), namely:
g(i, j) = (1 / (m_1 × m_2)) · Σ_{(x, y) ∈ d(i, j)} g(x, y) (4)
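The per-block computation might be sketched as follows. The forward-difference form of g_x and g_y is an assumption, since formulas (1) and (2) only fix two perpendicular directions; the mean square of formula (3) and the block average of formula (4) follow the text directly.

```python
import torch

def block_gradient_value(block: torch.Tensor) -> float:
    """Gradient value of one block: mean square of g_x and g_y per feature
    point (formula (3)), averaged over the block (formula (4))."""
    g_x = torch.zeros_like(block)
    g_y = torch.zeros_like(block)
    g_x[..., :-1, :] = block[..., 1:, :] - block[..., :-1, :]  # first direction
    g_y[..., :, :-1] = block[..., :, 1:] - block[..., :, :-1]  # second direction
    return ((g_x ** 2 + g_y ** 2) / 2).mean().item()

blocks = torch.randn(16, 64, 7, 7)  # D = 16 blocks from the partitioning step
grad_values = torch.tensor([block_gradient_value(b) for b in blocks])
```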
at step 104, all the first partitions are sorted according to the gradient average value to delete the first p first partitions with the smallest gradient average value.
In some embodiments, the parameter p is determined according to a preset compression ratio α.
For example, the relationship between the compression ratio α and the parameter p is shown in formula (5):
α = (p × m_1 × m_2) / (n_1 × n_2) (5)
wherein n_1 × n_2 is the size of the source feature map and m_1 × m_2 is the size of each first block. The number of discarded blocks can thus be selected as desired, allowing flexible adjustment of the compression ratio of the features.
For example, as shown in fig. 3A, the feature map includes 16 blocks, i.e., D = 16. All the blocks are sorted by gradient value and the 11 blocks with the smallest gradient values are deleted, i.e., p = 11, so that only 5 blocks remain in the feature map, i.e., D - p = 5. As shown in fig. 3B, the 5 retained blocks are the 2nd, 7th, 9th, 14th and 16th blocks.
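Block selection might then look as follows; deriving p as α × D is an assumption, consistent with formula (5) because the D blocks together cover the whole n_1 × n_2 map:

```python
import torch

D = 16
alpha = 11 / 16                  # preset compression ratio
p = round(alpha * D)             # p = 11, assuming formula (5) reduces to p = alpha * D

grad_values = torch.rand(D)                # per-block gradient values from step 103
order = torch.argsort(grad_values)         # ascending: smallest gradient values first
kept_idx = torch.sort(order[p:]).values    # the D - p surviving blocks, in original order
```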
In step 105, the block embedding features and the position embedding features of the D-p first blocks which are not deleted are encoded by using a preset encoder to obtain D second blocks, where the D second blocks include D-p encoded visual blocks and p mask tokens located at predetermined positions, and the D-p encoded visual blocks are in one-to-one correspondence with the D-p first blocks.
In some embodiments, the preset encoder is a Transformer encoder.
For example, the 5 blocks shown in fig. 3B, i.e., the 2nd, 7th, 9th, 14th and 16th blocks, are input to the trained encoder, and the encoder outputs the encoding result. The encoding result includes 16 blocks: 5 encoded visual blocks (Encoded Visible Patches) corresponding to the 5 blocks in fig. 3B, shown as the 5 white boxes 41 in fig. 4, and 11 mask tokens (Mask Tokens), shown as the 11 dark boxes 42 in fig. 4.
It should be noted that the positions of the 5 encoded visual blocks in the encoding result correspond to the positions of the 5 blocks in fig. 3B. For example, as shown in fig. 4, the 5 encoded visual blocks are the 2nd, 7th, 9th, 14th and 16th blocks of the encoding result, respectively.
In the feature map shown in fig. 3B, 11 blocks are discarded, namely the 1st, 3rd, 4th, 5th, 6th, 8th, 10th, 11th, 12th, 13th and 15th blocks. At the encoder, a corresponding mask token is added at the position of each discarded block, as indicated by the 11 dark boxes 42 in fig. 4. That is, the encoding result includes 11 mask tokens, located at the 1st, 3rd, 4th, 5th, 6th, 8th, 10th, 11th, 12th, 13th and 15th positions of the encoding result.
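A sketch of step 105, assuming PyTorch's built-in Transformer encoder; the embedding size, depth, head count and the single shared learnable mask token are assumptions rather than details fixed by the disclosure:

```python
import torch
import torch.nn as nn

D, embed_dim = 16, 256
kept_idx = torch.tensor([1, 6, 8, 13, 15])          # blocks 2, 7, 9, 14, 16 (0-based)
kept_blocks = torch.randn(len(kept_idx), 64, 7, 7)  # the D - p surviving first blocks

block_embed = nn.Linear(64 * 7 * 7, embed_dim)      # block embedding feature
pos_embed = nn.Parameter(torch.zeros(D, embed_dim)) # position embedding feature
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True),
    num_layers=4,
)

tokens = block_embed(kept_blocks.flatten(1)) + pos_embed[kept_idx]
encoded_visible = encoder(tokens.unsqueeze(0)).squeeze(0)  # D - p encoded visual blocks

# Assemble the D second blocks: encoded visual blocks at their original
# positions, a shared learnable mask token at each of the p discarded positions.
mask_token = nn.Parameter(torch.zeros(embed_dim))
second_blocks = mask_token.expand(D, embed_dim).clone()
second_blocks[kept_idx] = encoded_visible
```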
It should be noted that, since how the encoder is trained is not the inventive contribution of the present disclosure, it is not described here.
In step 106, all the second blocks and the position embedding features of the source feature map are decoded by a preset decoder to obtain a reconstructed feature map.
In some embodiments, the preset decoder is a Transformer decoder.
In some embodiments, the decoder is structured as shown in fig. 5.
As shown in FIG. 5, the input features are first normalized and then processed by a multi-head self-attention (Multi-head Self Attention) layer.
For example, for the t-th second block, the corresponding first vector matrix Q_t, second vector matrix K_t and third vector matrix V_t are determined according to the first attention weight matrix W_t^Q, the second attention weight matrix W_t^K and the third attention weight matrix W_t^V of each single head, respectively, wherein 1 ≤ t ≤ D.
For example, the calculation is shown in formula (6), where F_t is the feature of the t-th second block:
Q_t = F_t · W_t^Q, K_t = F_t · W_t^K, V_t = F_t · W_t^V (6)
Next, the attention value s_t of each single head is determined according to the first vector matrix Q_t, the second vector matrix K_t and the third vector matrix V_t, as shown in formula (7):
s_t = τ(δ(Q_t, K_t^T)) · V_t (7)
wherein K_t^T is the transpose of the matrix K_t, δ is the attention calculation function, and τ is the Softmax logistic regression function.
Next, the multi-head attention value S_t of the t-th second block is determined from the attention values of all the single heads of the t-th second block, as shown in formula (8):
S_t = Concat(s_t^1, s_t^2, ..., s_t^h) · W_t^O (8)
wherein Concat(·) is the concatenation function, s_t^1, ..., s_t^h are the attention values of the h single heads, and W_t^O is a parameter matrix.
Next, the multi-head attention value of the t-th second block and the t-th second block are input to an MLP (Multi-Layer Perceptron) layer for processing to obtain the reconstructed feature map.
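A single decoder block in the spirit of fig. 5 might be sketched as follows. nn.MultiheadAttention internally performs the Q_t/K_t/V_t projections of formula (6), the softmax attention of formula (7) and the multi-head concatenation of formula (8); the pre-norm layout and the MLP expansion ratio are assumptions.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Normalization, multi-head self-attention over all D second blocks,
    then an MLP, each with a residual connection."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # multi-head attention value
        x = x + self.mlp(self.norm2(x))                    # multi-layer perception step
        return x

second_blocks = torch.randn(1, 16, 256)  # D second blocks plus position embedding
reconstructed = DecoderBlock()(second_blocks)
```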
In the image feature processing method provided by the embodiments of the present disclosure, the p blocks with the smallest gradient values are discarded from the feature map, so that the features of non-important regions in the image or video are effectively discarded and the machine-vision intelligent analysis task can be completed better.
Fig. 6 is a schematic structural diagram of an image feature processing apparatus according to an embodiment of the present disclosure. As shown in fig. 6, the image feature processing apparatus includes a first processing module 61, a second processing module 62, a third processing module 63, and a fourth processing module 64.
The first processing module 61 is configured to extract features of the original image to obtain a source feature map.
In some embodiments, the original image is input into a CNN to obtain the source feature map.
The second processing module 62 is configured to divide the source feature map into a predetermined number D of first blocks which do not overlap each other, calculate a gradient value for each of the first blocks, and sort all the first blocks by gradient value so as to delete the p first blocks with the smallest gradient values.
In some embodiments, if the source feature map has a size of n_1 × n_2 and each first block has a size of m_1 × m_2, then D = d_1 × d_2, where d_1 = n_1/m_1 and d_2 = n_2/m_2.
In some embodiments, the second processing module 62 calculates, in a first block d(i, j), a first gradient g_x of each feature point f(x, y) in a first direction and a second gradient g_y in a second direction, wherein the first direction and the second direction are perpendicular to each other, 1 ≤ i ≤ d_1, 1 ≤ j ≤ d_2, and D = d_1 × d_2.
For example, the first direction is an x-axis direction in a predetermined plane, and the second direction is a y-axis direction in the predetermined plane.
Next, the second processing module 62 determines the gradient value of the feature point f(x, y) according to the first gradient g_x and the second gradient g_y, and determines the gradient value of the first block d(i, j) according to the gradient values of all the feature points in the first block d(i, j).
In some embodiments, the gradient value of the feature point f(x, y) is the mean square value of the first gradient g_x and the second gradient g_y, as shown in formula (3) above.
In some embodiments, the gradient value of the first block d(i, j) is the average of the gradient values of all the feature points in the first block d(i, j), as shown in formula (4) above.
In some embodiments, the parameter p is determined according to a preset compression ratio α, for example using formula (5) above.
The third processing module 63 is configured to encode, by a preset encoder, the block embedding features and position embedding features of the D-p first blocks that were not deleted to obtain D second blocks, where the D second blocks include D-p encoded visual blocks and p mask tokens located at predetermined positions, and the D-p encoded visual blocks correspond one-to-one with the D-p first blocks.
In some embodiments, the preset encoder is a Transformer encoder.
The fourth processing module 64 is configured to decode, by a preset decoder, all the second blocks and the position embedding features of the source feature map to obtain a reconstructed feature map.
In some embodiments, the preset decoder is a Transformer decoder.
In some embodiments, the fourth processing module 64 is configured to, for the t-th second block among all the second blocks, determine a corresponding first vector matrix Q_t, second vector matrix K_t and third vector matrix V_t according to the first attention weight matrix W_t^Q, the second attention weight matrix W_t^K and the third attention weight matrix W_t^V of each single head, respectively, wherein 1 ≤ t ≤ D. For example, the calculation is performed using formula (6) above.
Next, the fourth processing module 64 determines the attention value of each single head according to the first vector matrix Q_t, the second vector matrix K_t and the third vector matrix V_t, for example using formula (7) above.
Next, the fourth processing module 64 determines the multi-head attention value of the t-th second block according to all the single-head attention values of the t-th second block, for example using formula (8) above.
Then, multi-layer perception processing is performed on the multi-head attention value of the t-th second block and the t-th second block to obtain the reconstructed feature map.
Fig. 7 is a schematic structural diagram of an image feature processing apparatus according to another embodiment of the present disclosure. As shown in fig. 7, the image feature processing apparatus includes a memory 71 and a processor 72.
The memory 71 is for storing instructions and the processor 72 is coupled to the memory 71, the processor 72 being configured to perform a method as referred to in any of the embodiments of fig. 1 based on the instructions stored by the memory.
As shown in fig. 7, the image feature processing apparatus further includes a communication interface 73 for information interaction with other devices. Meanwhile, the image feature processing apparatus further includes a bus 74, and the processor 72, the communication interface 73, and the memory 71 perform communication with each other through the bus 74.
The memory 71 may comprise a high-speed RAM, and may further comprise a non-volatile memory, such as at least one disk memory. The memory 71 may also be a memory array. The memory 71 may also be partitioned into blocks, and the blocks may be combined into virtual volumes according to certain rules.
Further, the processor 72 may be a central processing unit (CPU), or an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present disclosure.
The present disclosure also relates to a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement a method as referred to in any of the embodiments of fig. 1.
By implementing the above embodiments of the present disclosure, the following advantageous effects can be obtained:
1. By calculating gradient values for the blocks of the feature map and sorting them, the important regions on which the feature information focuses can be screened, so that the blocks of non-important regions in the features can be discarded, thereby compressing the features.
2. The compression rate can be flexibly controlled by controlling the proportion of feature blocks that are discarded, so as to meet various compression-rate requirements.
3. The encoding part of the present disclosure can be added as a machine vision encoding module in an image device of a machine vision system, and the decoding part of the present disclosure can be added as a machine vision decoding module in an edge function set of the machine vision system, so that compression efficiency can be improved.
In some embodiments, the functional units described above may be implemented as general-purpose processors, programmable logic controllers (Programmable Logic Controller, abbreviated as PLCs), digital signal processors (Digital Signal Processor, abbreviated as DSPs), application specific integrated circuits (Application Specific Integrated Circuit, abbreviated as ASICs), field programmable gate arrays (Field-Programmable Gate Array, abbreviated as FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or any suitable combination thereof for performing the functions described in the present disclosure.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk or an optical disk.
The description of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the disclosure in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiments were chosen and described in order to best explain the principles of the disclosure and the practical application, and to enable others of ordinary skill in the art to understand the disclosure for various embodiments with various modifications as are suited to the particular use contemplated.

Claims (18)

1. An image feature processing method, performed by an image feature processing apparatus, comprising:
extracting features of an original image to obtain a source feature map;
dividing the source feature map into a predetermined number D of first blocks which do not overlap each other;
calculating a gradient value for each of the first blocks;
sorting all the first blocks by gradient value so as to delete the p first blocks with the smallest gradient values;
encoding, by a preset encoder, block embedding features and position embedding features of the D-p first blocks that were not deleted to obtain D second blocks, wherein the D second blocks comprise D-p encoded visual blocks and p mask tokens located at predetermined positions, and the D-p encoded visual blocks correspond one-to-one with the D-p first blocks;
and decoding, by a preset decoder, all the second blocks and the position embedding features of the source feature map to obtain a reconstructed feature map.
2. The method of claim 1, wherein calculating the gradient value of each first block comprises:
calculating, in a first block d(i, j), a first gradient g_x of each feature point f(x, y) in a first direction and a second gradient g_y in a second direction, wherein the first direction and the second direction are perpendicular to each other, 1 ≤ i ≤ d_1, 1 ≤ j ≤ d_2, and D = d_1 × d_2;
determining the gradient value of the feature point f(x, y) according to the first gradient g_x and the second gradient g_y;
and determining the gradient value of the first block d(i, j) according to the gradient values of all the feature points in the first block d(i, j).
3. The method of claim 2, wherein,
the gradient value of the feature point f(x, y) is the mean square value of the first gradient g_x and the second gradient g_y.
4. The method of claim 2, wherein,
the gradient value of the first block d(i, j) is the average of the gradient values of all the feature points in the first block d(i, j).
5. The method of claim 1, further comprising:
the parameter p is determined according to a preset compression ratio α.
6. The method of claim 5, wherein,
the compression ratio α is:
α = (p × m_1 × m_2) / (n_1 × n_2)
wherein n_1 × n_2 is the size of the source feature map and m_1 × m_2 is the size of each first block.
7. The method of claim 1, wherein decoding all the second blocks and the position embedding features of the source feature map by using a preset decoder comprises:
for the t-th second block among the second blocks, determining a corresponding first vector matrix Q_t, second vector matrix K_t and third vector matrix V_t according to a first attention weight matrix W_t^Q, a second attention weight matrix W_t^K and a third attention weight matrix W_t^V of each single head, respectively, wherein 1 ≤ t ≤ D;
determining an attention value of each single head according to the first vector matrix Q_t, the second vector matrix K_t and the third vector matrix V_t;
determining a multi-head attention value of the t-th second block according to all the single-head attention values of the t-th second block;
and performing multi-layer perception processing on the multi-head attention value of the t-th second block and the t-th second block to obtain the reconstructed feature map.
8. The method according to any one of claims 1-7, wherein,
the encoder is a Transformer encoder;
the decoder is a Transformer decoder.
9. An image feature processing apparatus comprising:
a first processing module configured to extract features of an original image to obtain a source feature map;
a second processing module configured to divide the source feature map into a predetermined number D of first blocks which do not overlap each other, calculate a gradient value for each of the first blocks, and sort all the first blocks by gradient value so as to delete the p first blocks with the smallest gradient values;
a third processing module configured to encode, by a preset encoder, block embedding features and position embedding features of the D-p first blocks that were not deleted to obtain D second blocks, wherein the D second blocks comprise D-p encoded visual blocks and p mask tokens located at predetermined positions, and the D-p encoded visual blocks correspond one-to-one with the D-p first blocks;
and a fourth processing module configured to decode, by a preset decoder, all the second blocks and the position embedding features of the source feature map to obtain a reconstructed feature map.
10. The apparatus of claim 9, wherein,
the second processing module is configured to calculate, in a first block d(i, j), a first gradient g_x of each feature point f(x, y) in a first direction and a second gradient g_y in a second direction, wherein the first direction and the second direction are perpendicular to each other, 1 ≤ i ≤ d_1, 1 ≤ j ≤ d_2, and D = d_1 × d_2; determine the gradient value of the feature point f(x, y) according to the first gradient g_x and the second gradient g_y; and determine the gradient value of the first block d(i, j) according to the gradient values of all the feature points in the first block d(i, j).
11. The apparatus of claim 10, wherein,
the gradient value of the feature point f(x, y) is the mean square value of the first gradient g_x and the second gradient g_y.
12. The apparatus of claim 10, wherein,
the gradient value of the first block d(i, j) is the average of the gradient values of all the feature points in the first block d(i, j).
13. The apparatus of claim 9, wherein,
the second processing module is configured to determine the parameter p according to a preset compression ratio α.
14. The apparatus of claim 13, wherein,
the compression ratio α is:
α = (p × m_1 × m_2) / (n_1 × n_2)
wherein n_1 × n_2 is the size of the source feature map and m_1 × m_2 is the size of each first block.
15. The apparatus of claim 9, wherein,
the fourth processing module is configured to, for the t-th second block among all the second blocks, determine a corresponding first vector matrix Q_t, second vector matrix K_t and third vector matrix V_t according to the first attention weight matrix W_t^Q, the second attention weight matrix W_t^K and the third attention weight matrix W_t^V of each single head, respectively, wherein 1 ≤ t ≤ D; determine the attention value of each single head according to the first vector matrix Q_t, the second vector matrix K_t and the third vector matrix V_t; determine the multi-head attention value of the t-th second block according to all the single-head attention values of the t-th second block; and perform multi-layer perception processing on the multi-head attention value of the t-th second block and the t-th second block to obtain the reconstructed feature map.
16. The device according to any one of claims 9-15, wherein,
the encoder is a Transformer encoder;
the decoder is a Transformer decoder.
17. An image feature processing apparatus comprising:
a memory configured to store instructions;
a processor coupled to the memory, the processor configured to perform the method of any of claims 1-8 based on instructions stored by the memory.
18. A non-transitory computer readable storage medium storing computer instructions which, when executed by a processor, implement the method of any one of claims 1-8.
CN202210998237.8A 2022-08-19 2022-08-19 Image feature processing method and device and storage medium Pending CN117649569A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210998237.8A CN117649569A (en) 2022-08-19 2022-08-19 Image feature processing method and device and storage medium
PCT/CN2023/110526 WO2024037330A1 (en) 2022-08-19 2023-08-01 Image feature processing method and apparatus, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210998237.8A CN117649569A (en) 2022-08-19 2022-08-19 Image feature processing method and device and storage medium

Publications (1)

Publication Number Publication Date
CN117649569A true CN117649569A (en) 2024-03-05

Family

ID=89940676

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210998237.8A Pending CN117649569A (en) 2022-08-19 2022-08-19 Image feature processing method and device and storage medium

Country Status (2)

Country Link
CN (1) CN117649569A (en)
WO (1) WO2024037330A1 (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8223837B2 (en) * 2007-09-07 2012-07-17 Microsoft Corporation Learning-based image compression
CN107154061B (en) * 2017-05-09 2020-09-22 北京航宇天穹科技有限公司 Regularized decoding method for block compressed sensing
EP3629579A1 (en) * 2018-09-27 2020-04-01 Ateme Method for image processing and apparatus for implementing the same
CN115514976A (en) * 2022-07-15 2022-12-23 中国电信股份有限公司 Image encoding method, decoding method, device, readable medium and electronic equipment
CN115661276A (en) * 2022-10-21 2023-01-31 中国电信股份有限公司 Image data encoding method, device, apparatus, medium, and program

Also Published As

Publication number Publication date
WO2024037330A1 (en) 2024-02-22

Similar Documents

Publication Publication Date Title
CN109871828B (en) Video recognition method, recognition device and storage medium
CN113128558B (en) Target detection method based on shallow space feature fusion and adaptive channel screening
CN111814753A (en) Target detection method and device under foggy weather condition
CN112907530B (en) Method and system for detecting disguised object based on grouped reverse attention
CA3137297C (en) Adaptive convolutions in neural networks
CN113807361B (en) Neural network, target detection method, neural network training method and related products
CN114863539A (en) Portrait key point detection method and system based on feature fusion
CN111160523A (en) Dynamic quantization method, system and medium based on characteristic value region
JP2023527489A (en) MODEL GENERATION METHOD, OBJECT DETECTION METHOD, APPARATUS, DEVICE, AND STORAGE MEDIUM
CN113920129A (en) Medical image segmentation method and device based on multi-scale and global context information
CN114241230A (en) Target detection model pruning method and target detection method
CN113205102B (en) Vehicle mark identification method based on memristor neural network
Wen et al. Fast LiDAR R-CNN: Residual relation-aware region proposal networks for multiclass 3-D object detection
CN117649569A (en) Image feature processing method and device and storage medium
CN113763412A (en) Image processing method and device, electronic equipment and computer readable storage medium
CN111401335B (en) Key point detection method and device and storage medium
CN117635735A (en) Image processing method and device, and storage medium
CN109767446B (en) Instance partitioning method and device, electronic equipment and storage medium
CN117746206A (en) Image feature processing method and device and storage medium
CN115810073A (en) Virtual image generation method and device
CN112541469B (en) Crowd counting method and system based on self-adaptive classification
CN116468947A (en) Cutter image recognition method, cutter image recognition device, computer equipment and storage medium
CN114155417B (en) Image target identification method and device, electronic equipment and computer storage medium
CN115661276A (en) Image data encoding method, device, apparatus, medium, and program
CN115131386A (en) Contour extraction and detection method and system for thoracic cavity focus image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination