CN115294342B - Image processing method and related device

Image processing method and related device

Info

Publication number: CN115294342B
Authority: CN (China)
Prior art keywords: target image, module, feature information, Hilbert, dimensional
Legal status: Active
Application number: CN202211171675.3A
Other languages: Chinese (zh)
Other versions: CN115294342A (en)
Inventors: 吴日辉, 杨永兴, 周茂森, 杨建权
Current Assignee: Honor Device Co Ltd
Original Assignee: Honor Device Co Ltd
Application filed by Honor Device Co Ltd
Priority to CN202211171675.3A
Publication of CN115294342A
Application granted
Publication of CN115294342B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/40: Extraction of image or video features
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715: Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Image Processing (AREA)

Abstract

The embodiment of the application provides an image processing method and a related device. In the image processing method, an electronic device performs feature extraction on a target image to obtain two-dimensional feature information of the target image; processes the two-dimensional feature information of the target image by using a Hilbert curve to obtain one-dimensional feature information of the target image; and processes the one-dimensional feature information of the target image by using a Hilbert Transformer module to obtain target features of the target image used for executing a downstream task, where the downstream task includes at least one of the following: segmentation, detection, or recognition. With the embodiment of the application, the one-dimensional feature information of the target image is obtained based on the Hilbert curve, so that pixels that are adjacent in the original image remain adjacent after the one-dimensional feature information is divided by the Hilbert Transformer module; the target features of the target image are then obtained from each divided segment of feature information, giving the method low time complexity.

Description

Image processing method and related device
Technical Field
The present application relates to the field of computer technologies, and in particular, to an image processing method and a related apparatus.
Background
Currently, images are mainly partitioned using the Vision Transformer (ViT) or the shifted-window Transformer (Swin Transformer) method. However, both partitioning methods divide the image into tiles (patches) that, whether regular tiles or tiles of adaptive length, are rectangular blocks. When adjacent pixels in an image are separated by the shared boundary of two adjacent rectangular blocks, they may no longer be located in adjacent tiles, so the adjacency relationship between pixels can be lost after the image is partitioned.
Disclosure of Invention
The application provides an image processing method and a related device, which obtain one-dimensional feature information of a target image based on a Hilbert curve, so that pixels that are adjacent in the original image remain adjacent after the one-dimensional feature information is divided by a Hilbert Transformer module. The target features of the target image are then obtained from each divided segment of feature information, giving the method low time complexity.
In a first aspect, the present application provides an image processing method, including:
extracting the features of the target image to obtain two-dimensional feature information of the target image;
processing the two-dimensional feature information of the target image by using a Hilbert curve to obtain one-dimensional feature information of the target image;
processing the one-dimensional feature information of the target image by using a Hilbert Transformer module to obtain a target feature used for executing a downstream task in the target image, wherein the downstream task comprises at least one of the following: segmentation, detection, or recognition.
After the method provided by the first aspect is implemented, the electronic device can obtain the one-dimensional feature information of the target image based on the Hilbert curve, so that pixels that are adjacent in the original image remain adjacent after the one-dimensional feature information is divided by the Hilbert Transformer module. The target feature of the target image is then obtained from each divided segment of feature information, giving the method low time complexity.
With reference to the method provided by the first aspect, the processing, by the electronic device, of the one-dimensional feature information of the target image by using the Hilbert Transformer module to obtain the target feature used for executing the downstream task in the target image includes:
processing the one-dimensional feature information of the target image by using the Hilbert Transformer module to obtain a feature of the target image;
performing feature recombination on the feature of the target image to obtain two-dimensional feature information of the target image;
processing the two-dimensional feature information of the target image by using the Hilbert curve to obtain one-dimensional feature information of the target image, and performing again the step of processing the one-dimensional feature information of the target image by using the Hilbert Transformer module to obtain the feature of the target image, until a preset number of repetitions is reached, so as to determine the target feature of the target image; and performing feature recombination on the target feature of the target image to obtain a target two-dimensional feature used for executing the downstream task in the target image.
With reference to the method provided by the first aspect, the Hilbert Transformer module includes a first transformation module and a second transformation module, and the processing, by the electronic device, of the one-dimensional feature information of the target image by using the Hilbert Transformer module to obtain the feature of the target image includes:
inputting the one-dimensional feature information of the target image into the first transformation module to obtain first feature information of the target image;
inputting the first feature information of the target image into the second transformation module to obtain second feature information of the target image;
and determining the second feature information as the feature of the target image.
With reference to the method provided by the first aspect, the first transformation module includes a first normalization module, a Hilbert-based multi-head self-attention mechanism module, a second normalization module, and a first multi-layer perceptron module, and the inputting, by the electronic device, of the one-dimensional feature information of the target image into the first transformation module to obtain the first feature information of the target image includes:
inputting the one-dimensional feature information of the target image into the first normalization module to obtain normalized one-dimensional feature information;
processing the normalized one-dimensional feature information by using the Hilbert-based multi-head self-attention mechanism module to obtain first weighted one-dimensional feature information;
adding the first weighted one-dimensional feature information and the one-dimensional feature information of the target image to obtain third feature information;
inputting the third feature information into the second normalization module to obtain normalized third feature information;
inputting the normalized third feature information into the first multi-layer perceptron module to obtain a first feature of the target image;
and adding the third feature information and the first feature of the target image to obtain the first feature information of the target image.
With reference to the method provided by the first aspect, the Hilbert-based multi-head self-attention mechanism module includes a Hilbert division sub-module, a Hilbert self-attention sub-module, and a Hilbert flip sub-module, and the processing, by the electronic device, of the normalized one-dimensional feature information by using the Hilbert-based multi-head self-attention mechanism module to obtain the first weighted one-dimensional feature information includes:
equally dividing the normalized one-dimensional feature information by using the Hilbert division sub-module to obtain multiple equally divided segments of feature information;
calculating, by using the Hilbert self-attention sub-module, the self-attention of each feature included in each segment of the equally divided feature information;
weighting each feature based on the self-attention of each feature included in each segment of the equally divided feature information to obtain first weighted segments of feature information;
and obtaining the first weighted one-dimensional feature information from the first weighted segments of feature information by using the Hilbert flip sub-module.
With reference to the method provided by the first aspect, the second transformation module includes a third normalization module, a shifted-Hilbert-based multi-head self-attention mechanism module, a fourth normalization module, and a second multi-layer perceptron module, and the inputting, by the electronic device, of the first feature information of the target image into the second transformation module to obtain the second feature information of the target image includes:
inputting the first feature information of the target image into the third normalization module to obtain normalized first feature information;
processing the normalized first feature information by using the shifted-Hilbert-based multi-head self-attention mechanism module to obtain second weighted one-dimensional feature information;
adding the second weighted one-dimensional feature information and the first feature information of the target image to obtain fourth feature information;
inputting the fourth feature information into the fourth normalization module to obtain normalized fourth feature information;
inputting the normalized fourth feature information into the second multi-layer perceptron module to obtain a second feature of the target image;
and adding the fourth feature information and the second feature of the target image to obtain the second feature information of the target image.
In combination with the method provided by the first aspect, the shifted-Hilbert-based multi-head self-attention mechanism module includes a shifted-Hilbert division sub-module, a shifted-Hilbert self-attention sub-module, and a shifted-Hilbert flip sub-module, and the processing, by the electronic device, of the normalized first feature information by using the shifted-Hilbert-based multi-head self-attention mechanism module to obtain the second weighted one-dimensional feature information includes the following steps:
unequally dividing the normalized first feature information according to a preset division ratio by using the shifted-Hilbert division sub-module to obtain multiple unequally divided segments of feature information;
calculating, by using the shifted-Hilbert self-attention sub-module, the self-attention of each feature included in each segment of the unequally divided feature information;
weighting each feature based on the self-attention of each feature included in each segment of the unequally divided feature information to obtain second weighted segments of feature information;
and obtaining the second weighted one-dimensional feature information from the second weighted segments of feature information by using the shifted-Hilbert flip sub-module.
In a second aspect, the present application provides an image processing apparatus comprising:
the acquisition unit is used for performing feature extraction on the target image to obtain two-dimensional feature information of the target image;
the processing unit is used for processing the two-dimensional feature information of the target image by using a Hilbert curve to obtain one-dimensional feature information of the target image;
the processing unit is further used for processing the one-dimensional feature information of the target image by using a Hilbert Transformer module to obtain a target feature used for executing a downstream task in the target image, where the downstream task includes at least one of the following: segmentation, detection, or recognition.
For implementation details of the image processing apparatus in this aspect, reference may be made to the relevant description of the first aspect above; the details are not repeated here.
In a third aspect, the present application provides an electronic device comprising: one or more processors, one or more memories, a display screen, and one or more transceivers; the one or more memories are coupled with the one or more processors for storing computer program code comprising computer instructions which, when executed by the one or more processors, cause the electronic device to perform the method as described in any of the first aspects.
In a fourth aspect, the present application provides a computer-readable storage medium comprising computer instructions which, when run on an electronic device, cause the electronic device to perform the method as described in any of the first aspects.
In a fifth aspect, the present application provides a chip or chip system comprising processing circuitry and interface circuitry for receiving code instructions and transmitting the code instructions to the processing circuitry, the processing circuitry being configured to execute the code instructions to perform a method as described in any one of the first aspects.
Drawings
FIG. 1a is a schematic structural diagram of the Vision Transformer model;
FIG. 1b is a schematic structural diagram of a Transformer Encoder;
FIG. 2a is a schematic structural diagram of the Swin Transformer model;
FIG. 2b is a schematic structural diagram of a Transformer block based on a shifted window;
FIG. 2c is a schematic illustration of a window division;
FIG. 2d is a schematic illustration of another window division;
FIG. 2e is a schematic diagram of a cyclic shift;
FIG. 2f is a schematic diagram of a hierarchical window division;
fig. 3 is a schematic flowchart of an image processing method according to an embodiment of the present application;
fig. 4 is a schematic diagram of traversing an image space by using a hilbert curve according to an embodiment of the present application;
fig. 5 is a schematic flowchart of another image processing method provided in an embodiment of the present application;
fig. 6 is a schematic diagram of obtaining second feature information by using a hilbert transform module according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a Hilbert-based multi-head self-attention mechanism module according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a shifted-Hilbert-based multi-head self-attention mechanism module according to an embodiment of the present application;
FIG. 9 is a schematic diagram of an image processing model provided by an embodiment of the present application;
FIG. 10 is a schematic flow chart of a method for training an image processing model according to an embodiment of the present disclosure;
fig. 11 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application;
fig. 12 is a schematic hardware architecture diagram of an electronic device 100 according to an embodiment of the present application;
fig. 13 is a schematic software architecture diagram of an electronic device 100 according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and in detail with reference to the accompanying drawings. In the description of the embodiments of the present application, unless otherwise specified, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the embodiments of the application, unless stated otherwise, "plurality" means two or more.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
Image processing models currently include the Vision Transformer model. Referring to FIG. 1a, FIG. 1a is a schematic structural diagram of the Vision Transformer model. As shown in FIG. 1a, the model includes a Linear Projection of Flattened Patches module, a Transformer Encoder, and a multi-layer perceptron head (MLP Head).
The modules included in the Vision Transformer model are described below.
Linear Projection of Flattened Patches module: this module may also be referred to as an Embedding module, and is configured to convert the tiles corresponding to an input image into a sequence of vectors (also called tokens) that a Transformer can process. That is to say, the electronic device may input the tiles corresponding to the input image into the Embedding module and obtain the embedded token corresponding to each tile.
It should be noted that the reason the electronic device inputs the tiles corresponding to the input image into the Embedding module to obtain the embedded token corresponding to each tile is as follows: the standard Transformer model requires a token sequence as input, i.e., a two-dimensional matrix. However, the data format of image data is [H, W, C], i.e., a three-dimensional matrix, where H denotes the height of the image, W denotes the width of the image, and C denotes the number of channels of the image. Therefore, the electronic device needs to transform the input image data through the Embedding module to obtain the token sequence. As shown in FIG. 1a, 0-9 are the tokens input into the Transformer model, and each token is a vector.
Optionally, the electronic device may first obtain the tiles corresponding to the input image. As shown in FIG. 1a, 9 tiles are input into the Linear Projection of Flattened Patches module. Optionally, to obtain the tiles corresponding to the input image, the electronic device may divide the input image into non-overlapping tiles of size P × P, where P represents both the length and the width of a tile. For example, assuming that the size of the input image is 224 × 224 × 3, where 224 represents the length and width of the input image and 3 represents the number of channels, the electronic device may divide the input image into 16 × 16 tiles, resulting in (224 × 224)/(16 × 16) = 196 tiles, where 16 represents both the length and the width of a tile.
Optionally, when the electronic device inputs the tiles corresponding to the input image into the Embedding module to obtain the embedded token corresponding to each tile, it may map each tile into a one-dimensional vector through linear projection. For example, assuming that each tile has size [16, 16, 3], the electronic device may map each tile into a one-dimensional vector of length 16 × 16 × 3 = 768. Assuming that 196 tiles are obtained after the input image is divided into tiles, the electronic device may determine that the token sequence obtained after the input image is converted by the Embedding module is 196 × 768.
It should be noted that the electronic device may add a token for classification (i.e., the extra learnable class embedding in FIG. 1a) to the embedded vectors corresponding to the input image; this token is a vector with the same data format as the embedded token of a tile. That is, if the electronic device determines that the token sequence obtained by converting the input image through the Embedding module is 196 × 768, then after adding a token for classification, the token sequence corresponding to the input image is 197 × 768.
Optionally, the electronic device further applies position embedding to the embedded tokens. Position embedding is used to preserve the spatial position information between the input tiles. Optionally, the Vision Transformer model may encode the position of each tile using standard learnable 1-D position embeddings. Optionally, the electronic device may add the tile embedding and the position embedding element-wise before inputting the Transformer Encoder, resulting in the input token sequence of the Transformer Encoder. It will be appreciated that the dimensions of the token sequence after adding the position embedding are the same as before the addition. For example, if the dimension of the token sequence before adding the position embedding is 197 × 768, the dimension after adding the position embedding is also 197 × 768, that is, the dimension of the token sequence input into the Transformer Encoder is 197 × 768.
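To make the dimension bookkeeping above concrete, the following PyTorch sketch (an illustration under assumed names, not code from the patent) turns a 224 × 224 × 3 image into the 197 × 768 token sequence described here:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Minimal ViT-style embedding sketch: 16x16 tiles -> 768-d tokens."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2  # (224/16)^2 = 196
        # A stride-16 convolution is equivalent to flattening each 16x16x3 tile
        # and applying a shared linear projection to length 768.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))                # token for classification
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))  # learnable 1-D position embedding

    def forward(self, x):                    # x: (B, 3, 224, 224)
        x = self.proj(x)                     # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)     # (B, 196, 768)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)       # (B, 197, 768)
        return x + self.pos_embed            # element-wise addition of position embedding

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)                          # torch.Size([1, 197, 768])
```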
Transformer Encoder: the Transformer Encoder is used for extracting learnable class embedding vectors, namely, features corresponding to class tokens. Optionally, the features may be used in an image classification task.
Referring to FIG. 1b, FIG. 1b is a schematic structural diagram of a Transformer Encoder. As shown in FIG. 1b, the Transformer Encoder is composed of a Normalization (Norm) module, a Multi-Head Attention (MHA) module, and a Multi-Layer Perceptron (MLP) module. The Norm module is applied before each module, and a residual connection is applied after each module.
The Norm module is used for normalizing the input features, converting the input feature information into data with a mean of 0 and a variance of 1, so that the data can be prevented from falling into the saturation region of the activation function, reducing the problem of vanishing gradients.
The MHA module is used for calculating the self-attention of each embedded tile to obtain richer feature information.
The MLP module is used for ensuring that the dimension of the output feature vector is consistent with the dimension of the input feature vector.
Optionally, the electronic device may input the embedded tiles corresponding to the input image into the Transformer Encoder to extract the class embedding vector of the input image. Optionally, the electronic device may normalize the embedded tokens corresponding to the input image through the Norm module to obtain normalized embedded tokens; calculate the self-attention of each embedded token by using the MHA module; normalize the embedded tokens again after the self-attention is calculated; and output the class embedding vector of the input image through the MLP module.
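The encoder flow just described (pre-norm, MHA with a residual connection, pre-norm, MLP with a residual connection) can be sketched as follows; this is a minimal illustration with assumed names, not the patent's implementation:

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Pre-norm Transformer encoder block: Norm -> MHA -> residual -> Norm -> MLP -> residual."""
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                 nn.Linear(dim * mlp_ratio, dim))

    def forward(self, x):                    # x: (B, 197, 768)
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]        # residual connection after self-attention
        x = x + self.mlp(self.norm2(x))      # residual connection after MLP
        return x                             # output dimension equals input dimension
```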
MLP Head module: this module is used for classifying the input image according to the class embedding vector extracted by the Transformer Encoder. For example, as shown in FIG. 1a, the classes into which the input image may be classified include Bird, Ball, Car, and so on.
The processing of the input image by the Vision Transformer model may include, but is not limited to, the following steps:
Step 1: the electronic device divides the input image into tiles of fixed size and inputs the obtained tiles into the Embedding module to obtain the embedding vector (token) corresponding to each tile.
Step 2: the embedding vector corresponding to each tile is input into the Transformer Encoder to obtain the class embedding token of the image.
Step 3: the class embedding token of the image is input into the MLP Head to obtain the class of the image.
It can be seen that, with this method, the electronic device divides the image into multiple tiles, and when performing MHA, every tile must be computed against all other tiles to obtain its attention; that is, the model in FIG. 1a computes attention globally. When the tile size is fixed, the amount of computation grows quadratically with the size of the input image, so the time complexity of this method is quadratic.
It should be noted that the electronic device mentioned in this application may be a server or a terminal device.
The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud computing service center. The embodiment of the present application does not specifically limit the specific type of the server.
The terminal device may be a terminal device running iOS, Android, HarmonyOS, Windows, or another operating system, such as a mobile phone, a tablet computer, a desktop computer, a laptop computer, a handheld computer, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a cellular phone, a personal digital assistant (PDA), an augmented reality (AR) device, a virtual reality (VR) device, an artificial intelligence (AI) device, a wearable device, a vehicle-mounted device, a smart home device, and/or a city smart device. The embodiments of the present application do not specifically limit the type of the terminal device.
Referring to FIG. 2a, FIG. 2a is a schematic structural diagram of the Swin Transformer model. As shown in FIG. 2a, the model includes a tile partitioning (Patch Partition) module, a Linear Embedding module, a Shifted Window Transformer Block (Swin Transformer Block), and a tile merging (Patch Merging) module. The Swin Transformer model shown in FIG. 2a is the same as the Vision Transformer model shown in FIG. 1a in that the input image is divided into tiles. The difference is that the Swin Transformer model shown in FIG. 2a is a hierarchical Transformer, which can obtain features at multiple scales and improve efficiency by limiting the self-attention computation to non-overlapping local windows while still allowing cross-window connections.
The following describes each module in the Swin Transformer model.
1. Patch Partition module
The Patch Partition module is configured to partition an input image of size H × W × 3 into patches of size P × P × 3, obtaining a sequence of dimension (H/P) × (W/P) × (P × P × 3), where H denotes the height of the input image, W denotes the width of the input image, 3 denotes the number of channels of the input image, P × P is the size of each patch, P × P × 3 is the feature dimension of each flattened patch, and (H/P) × (W/P) is the number of patches corresponding to the image.
For example, assuming that the size of the input image is 224 × 224 × 3 and that P is 4, the electronic device may divide the input image into patches with feature dimension 4 × 4 × 3 = 48 through the Patch Partition module, obtaining an image of dimension (224/4) × (224/4) × 48 = 56 × 56 × 48. The number of patches corresponding to the input image is 56 × 56 = 3136.
2. Linear Embedding module
The Linear Embedding module is used for projecting the image obtained by the Patch Partition module to an arbitrary dimension C, where C is a preset value, obtaining a linear embedding of dimension (H/P) × (W/P) × C.
For example, assuming that C is 96 and the image obtained by the Patch Partition module has dimension 56 × 56 × 48, after passing through the Linear Embedding module the image dimension becomes 56 × 56 × 96. The electronic device may also flatten the 56 × 56 × 96 image into a 3136 × 96 dimensional sequence.
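A rough illustration of the two modules above, under the 224 × 224 × 3 example (the function and parameter names are assumptions, not the patent's code):

```python
import torch
import torch.nn as nn

def patch_partition_and_embed(images, P=4, C=96):
    """Sketch of Patch Partition + Linear Embedding for a (B, H, W, 3) input."""
    B, H, W, _ = images.shape
    x = images.reshape(B, H // P, P, W // P, P, 3)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, (H // P) * (W // P), P * P * 3)  # (B, 3136, 48)
    return nn.Linear(P * P * 3, C)(x)                                           # (B, 3136, 96)

out = patch_partition_and_embed(torch.randn(1, 224, 224, 3))
print(out.shape)  # torch.Size([1, 3136, 96])
```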
3. Swin Transformer Block
For example, please refer to FIG. 2b; FIG. 2b is a schematic structural diagram of a Transformer block based on a shifted window. As shown in FIG. 2b, a Swin Transformer block is composed of a Window-based Multi-head Self-Attention (W-MSA) module, a Shifted-Window-based Multi-head Self-Attention (SW-MSA) module, Layer Normalization (LN) modules, and MLP modules. The LN layer is applied before each MSA module and each MLP, and a residual connection is applied after each module.
The following describes the individual blocks in Swin Transformer Block.
An LN module: the LN module is used for normalizing the input features, converting the input feature information into data with a mean of 0 and a variance of 1, so that the data can be prevented from falling into the saturation region (nonlinear region) of the activation function, reducing the problem of vanishing gradients.
W-MSA module: the W-MSA module is used for dividing the input feature map into uniform windows and calculating self-attention within them. The W-MSA module mainly comprises the window partition, window attention, and window reverse operations.
The window partition operation is used to divide the input feature map (i.e., the output of the Linear Embedding module) into multiple non-overlapping, equal windows. There are m patches in each window, and the patch within a window is the smallest unit in the self-attention computation. As shown in FIG. 2c, the input feature map is divided into 4 non-overlapping windows, each window containing 4 × 4 = 16 patches. Since different windows contain the same number of patches and self-attention is computed within each window, the amount of computation grows only linearly as the image size increases.
The window attention operation is used to calculate the self-attention of each patch in each window. Optionally, the electronic device may calculate self-attention according to the following formula (1):
Attention(Q, K, V) = SoftMax(QKᵀ/√d + B)V    (1)
In formula (1), Q represents the query vector; K represents the key vector; V represents the value vector; B represents the position bias parameter; and √d is the scaling factor, where d is the dimension of the K vector.
Optionally, the electronic device may obtain the dimensions of Q, K, and V from the vector dimension after window division through a linear projection. For example, assuming that the vector dimension after window division is (64B, 49, 96), the electronic device may obtain a vector of dimension (64B, 49, 288) through the linear projection and trisect it to obtain Q, K, and V, each of dimension (64B, 49, 96). Assuming that the number of heads is 3 in the first stage (i.e., Stage 1 in FIG. 2a), the dimensions of Q, K, and V are all (64B, 3, 49, 32).
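A minimal PyTorch sketch of the per-window attention in formula (1), using the (64B, 3, 49, 32) shapes above with B = 1; the zero bias stands in for the learnable relative position bias, and all names are assumptions rather than the patent's code:

```python
import torch

def window_attention(q, k, v, bias):
    """Formula (1): SoftMax(QK^T / sqrt(d) + B) V, computed per window and per head."""
    d = k.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5 + bias    # (..., 49, 49)
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(64, 3, 49, 32)        # 64 windows, 3 heads, 49 patches, head dim 32
bias = torch.zeros(3, 49, 49)                 # learnable relative position bias in practice
print(window_attention(q, k, v, bias).shape)  # torch.Size([64, 3, 49, 32])
```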
The multi-head self-attention mechanism applies multiple linear mappings to the original Q, K, and V and feeds the result of each mapping into the attention computation; the result obtained each time is called a head. Optionally, the electronic device may calculate the multi-head self-attention according to the following formulas (2) and (3):
MultiHead(Q, K, V) = Concat(head_1, ..., head_n)    (2)
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)    (3)
In formulas (2) and (3), MultiHead denotes the multi-head self-attention; Q represents the query vector; K represents the key vector; V represents the value vector; head_i denotes the attention of the i-th head; n denotes the total number of heads; W_i^Q, W_i^K, and W_i^V denote the mapping matrices for Q, K, and V in the i-th head, respectively; Concat() denotes the concatenation function, which joins the outputs of all heads in order; and Attention denotes the self-attention computation given in formula (1).
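Formulas (2) and (3) can then be sketched as per-head linear maps followed by concatenation; this reuses window_attention from the previous sketch, and the function and parameter names are illustrative assumptions:

```python
def multi_head_self_attention(x, w_q, w_k, w_v, n_heads=3):
    """Formulas (2)-(3): head_i = Attention(QW_i^Q, KW_i^K, VW_i^V); concat the heads."""
    B, N, D = x.shape
    # One (D, D) matrix per projection; splitting it into n_heads chunks of size
    # D/n_heads realizes the per-head mapping matrices W_i^Q, W_i^K, W_i^V.
    def split_heads(t):
        return t.reshape(B, N, n_heads, D // n_heads).transpose(1, 2)
    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    heads = window_attention(q, k, v, bias=0.0)       # (B, n_heads, N, D/n_heads)
    return heads.transpose(1, 2).reshape(B, N, D)     # Concat(head_1, ..., head_n)
```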
The window reverse operation, the inverse of window partition, is used to restore the dimension of the image after the attention computation to the dimension before window partition. For example, an image of dimension (64B, 49, 96) is restored to dimension (B, 56, 56, 96).
The overall W-MSA process performed by the electronic device is illustrated below. The electronic device performs a window partition operation on the image of dimension (B, 3136, 96) with a window size of 7 × 7, i.e., changes (B, 56, 56, 96) into (64B, 49, 96). The electronic device then performs a window attention operation on the (64B, 49, 96)-dimensional image: it triples the dimension through a linear projection to obtain a vector of dimension (64B, 49, 96 × 3) = (64B, 49, 288); trisects the (64B, 49, 288)-dimensional vector to obtain Q, K, and V, each of dimension (64B, 49, 96); and calculates self-attention according to formula (1) above. After computing the self-attention, the electronic device restores (64B, 49, 96) to the original dimension, i.e., (B, 3136, 96), through the window reverse operation.
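The window partition and window reverse operations described above amount to a pair of inverse reshapes; a sketch under assumed names:

```python
import torch

def window_partition(x, ws=7):
    """(B, 56, 56, 96) -> (64B, 49, 96): cut the map into non-overlapping 7x7 windows."""
    B, H, W, C = x.shape
    x = x.reshape(B, H // ws, ws, W // ws, ws, C).permute(0, 1, 3, 2, 4, 5)
    return x.reshape(-1, ws * ws, C)

def window_reverse(windows, ws=7, H=56, W=56):
    """Inverse of window_partition: (64B, 49, 96) -> (B, 56, 56, 96)."""
    B = windows.shape[0] // ((H // ws) * (W // ws))
    x = windows.reshape(B, H // ws, W // ws, ws, ws, -1).permute(0, 1, 3, 2, 4, 5)
    return x.reshape(B, H, W, -1)

x = torch.randn(2, 56, 56, 96)
assert torch.equal(window_reverse(window_partition(x)), x)   # round trip restores the map
```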
SW-MSA module: the difference from the W-MSA module is that the SW-MSA module slides the windows; through window sliding, cross-window association (i.e., communication between windows) can be introduced while the computational efficiency of non-overlapping windows is maintained. For example, as shown in FIG. 2d, the image is divided into 9 windows of different sizes, with a different number of patches in each window. It can be understood that FIG. 2d is obtained by shifting the window partition of FIG. 2c by 2 × 2 patches toward the lower right corner; the original windows (FIG. 2c) and the shifted windows (FIG. 2d) have overlapping parts, so communication between windows can be realized.
Although communication between windows can be realized through the SW-MSA module, the original feature map has only 4 windows, whereas 9 windows are obtained after shifting; the number of windows increases, and the 9 windows are not all the same size, which increases the computational difficulty. To address this, the electronic device uses cyclic shift and mask operations, so that the number of windows after shifting remains unchanged and the number of patches in each window also remains unchanged.
As shown in FIG. 2e, the shifted window partition is rearranged by a cyclic shift. After the cyclic shift, the image is re-divided into 4 windows: there were 4 windows before the shift, and there are still 4 windows after the cyclic shift, so the number of windows stays fixed and the computational difficulty is reduced.
After the cyclic shift, one window may contain content from different original windows. Therefore, a masked MSA mechanism can be used: self-attention is computed as usual, and the mask then sets the unneeded attention values to 0, thereby limiting the self-attention computation to within each sub-window. Finally, the self-attention result of each window is returned to its original position by a reverse cyclic shift.
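A sketch of the cyclic-shift trick, reusing window_partition and window_reverse from the sketch above; the mask computation is elided, and the names are assumptions:

```python
import torch

def shifted_window_self_attention(x, ws=7, shift=3):
    """Cyclic shift -> windowed (masked) attention -> reverse cyclic shift."""
    # Cyclic shift: rolls the feature map so the shifted windows line up with a
    # regular 4-window partition again, keeping window count and size fixed.
    shifted = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
    windows = window_partition(shifted, ws)
    # ... masked self-attention per window goes here: positions that came from
    # different original windows get their attention logits masked out ...
    out = window_reverse(windows, ws, x.shape[1], x.shape[2])
    return torch.roll(out, shifts=(shift, shift), dims=(1, 2))   # reverse cyclic shift
```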
An MLP module: the MLP module consists of fully connected layers, a GELU activation function, and Dropout. The MLP module is used for outputting the features of the image, and the dimension of the image output by the MLP module is the same as that of the input image.
4. Patch Merging
This module is used for down-sampling before each Stage starts, reducing the resolution and adjusting the number of channels, thereby forming a hierarchical design and saving a certain amount of computation. For example, as shown in FIG. 2a, assuming that Stage 1 in FIG. 2a outputs an image of dimension 56 × 56 × 96, after the Patch Merging process in Stage 2 the spatial dimension of the image becomes 28 × 28, i.e., the image input into the Swin Transformer Block in Stage 2 has spatial dimension 28 × 28. As noted above, the input and output dimensions of a Swin Transformer Block are the same, so the output dimension of Stage 2 is 28 × 28 × 192. Stage 3 and Stage 4 proceed in the same way as Stage 2, and the image output by Stage 4 has dimension 7 × 7 × 768.
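A sketch of Patch Merging's down-sampling (concatenate each 2 × 2 neighborhood, then reduce 4C channels to 2C; names are assumptions, not the patent's code):

```python
import torch
import torch.nn as nn

def patch_merging(x):
    """(B, H, W, C) -> (B, H/2, W/2, 2C), e.g. (B, 56, 56, 96) -> (B, 28, 28, 192)."""
    B, H, W, C = x.shape
    x = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                   x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)   # (B, H/2, W/2, 4C)
    return nn.Linear(4 * C, 2 * C)(nn.LayerNorm(4 * C)(x))        # channel reduction to 2C

print(patch_merging(torch.randn(1, 56, 56, 96)).shape)  # torch.Size([1, 28, 28, 192])
```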
As shown in FIG. 2a, the model includes 4 stages, each stage being a similar repeating unit. The electronic device divides the input image of size H × W × 3 into a set of non-overlapping patches through the Patch Partition module, where the size of each patch is 4 × 4, the feature dimension of each patch is 4 × 4 × 3 = 48, and the number of patches is (H/4) × (W/4). In Stage 1, the electronic device changes the feature dimension of the divided patches to C by means of Linear Embedding and then inputs them into the Swin Transformer Block. Stages 2-4 operate in the same way, except that the electronic device first merges each group of adjacent 2 × 2 patches through Patch Merging and then inputs the result into the Swin Transformer Block.
As can be seen, the Swin Transformer uses a hierarchical construction method. For example, as shown in FIG. 2f, the feature map sizes include a 4x down-sampled image (i.e., dividing the image into 4 × 4 patches), an 8x down-sampled image (i.e., 8 × 8 patches), and a 16x down-sampled image (i.e., 16 × 16 patches). The Swin Transformer also uses the W-MSA concept: as shown in FIG. 2e, at 4x and 8x down-sampling the feature map is divided into non-overlapping windows, and attention is calculated only within each window. Since attention is computed only within each local window, the amount of computation grows only linearly as the image size increases, i.e., the time complexity of the model is linear.
As can be seen from the foregoing, the time complexity of the Vision Transformer model shown in FIG. 1a is quadratic; therefore, compared with the Vision Transformer model of FIG. 1a, the Swin Transformer model shown in FIG. 2a reduces the time complexity from quadratic to linear. The Swin Transformer model of FIG. 2a can therefore serve as a backbone network for tasks such as image classification, object detection, and instance segmentation.
Although the Swin Transformer model shown in FIG. 2a reduces the time complexity compared with the Vision Transformer model shown in FIG. 1a, the models in FIG. 2a and FIG. 1a both divide the image into tiles. Whether the tiles are regular or of adaptive length, they are in fact rectangular boxes. When adjacent pixels in an image are separated by the shared boundary of two adjacent rectangular boxes, they may no longer be located in adjacent tiles, and the adjacency relationship between pixels may be lost.
The embodiment of the application provides an image processing method and a related device. In the image processing method, the electronic device performs feature extraction on a target image to obtain two-dimensional feature information of the target image; processes the two-dimensional feature information of the target image by using a Hilbert curve to obtain one-dimensional feature information of the target image; and processes the one-dimensional feature information of the target image by using a Hilbert Transformer module to obtain target features used for executing a downstream task in the target image, where the downstream task includes at least one of the following: segmentation, detection, or recognition. It can be seen that, with the embodiment of the application, the one-dimensional feature information of the target image can be obtained based on the Hilbert curve, so that pixels that are adjacent in the original image remain adjacent after the one-dimensional feature information is divided by the Hilbert Transformer module; the target features of the target image are then obtained from each divided segment of feature information, giving the method low time complexity.
In an optional implementation, the processing, by the electronic device, of the one-dimensional feature information of the target image by using the Hilbert Transformer module to obtain the target feature used for executing the downstream task in the target image includes: processing the one-dimensional feature information of the target image by using the Hilbert Transformer module to obtain a feature of the target image; performing feature recombination on the feature of the target image to obtain two-dimensional feature information of the target image; processing the two-dimensional feature information of the target image by using the Hilbert curve to obtain one-dimensional feature information of the target image, and performing again the step of processing the one-dimensional feature information of the target image by using the Hilbert Transformer module to obtain the feature of the target image, until a preset number of repetitions is reached, so as to determine the target feature of the target image; and performing feature recombination on the target feature of the target image to obtain a target two-dimensional feature used for executing the downstream task in the target image.
In an optional implementation, the Hilbert Transformer module includes a first transformation module and a second transformation module, and the processing, by the electronic device, of the one-dimensional feature information of the target image by using the Hilbert Transformer module to obtain the feature of the target image includes: inputting the one-dimensional feature information of the target image into the first transformation module to obtain first feature information of the target image; inputting the first feature information of the target image into the second transformation module to obtain second feature information of the target image; and determining the second feature information as the feature of the target image.
In an optional implementation, the first transformation module includes a first normalization module, a Hilbert-based Multi-head Self-Attention (H-MSA) module, a second normalization module, and a first multi-layer perceptron module, and the inputting, by the electronic device, of the one-dimensional feature information of the target image into the first transformation module to obtain the first feature information of the target image includes: inputting the one-dimensional feature information of the target image into the first normalization module to obtain normalized one-dimensional feature information; processing the normalized one-dimensional feature information by using the Hilbert-based multi-head self-attention mechanism module to obtain first weighted one-dimensional feature information; adding the first weighted one-dimensional feature information and the one-dimensional feature information of the target image to obtain third feature information; inputting the third feature information into the second normalization module to obtain normalized third feature information; inputting the normalized third feature information into the first multi-layer perceptron module to obtain a first feature of the target image; and adding the third feature information and the first feature of the target image to obtain the first feature information of the target image.
In an optional implementation, the Hilbert-based multi-head self-attention mechanism module includes a Hilbert division sub-module, a Hilbert self-attention sub-module, and a Hilbert flip sub-module, and the processing, by the electronic device, of the normalized one-dimensional feature information by using the Hilbert-based multi-head self-attention mechanism module to obtain the first weighted one-dimensional feature information includes: equally dividing the normalized one-dimensional feature information by using the Hilbert division sub-module to obtain multiple equally divided segments of feature information; calculating, by using the Hilbert self-attention sub-module, the self-attention of each feature included in each segment of the equally divided feature information; weighting each feature based on the self-attention of each feature included in each segment of the equally divided feature information to obtain first weighted segments of feature information; and obtaining the first weighted one-dimensional feature information from the first weighted segments of feature information by using the Hilbert flip sub-module.
In an optional implementation, the second transformation module includes a third normalization module, a Shifted-Hilbert-based Multi-head Self-Attention (SH-MSA) module, a fourth normalization module, and a second multi-layer perceptron module, and the inputting, by the electronic device, of the first feature information of the target image into the second transformation module to obtain the second feature information of the target image includes: inputting the first feature information of the target image into the third normalization module to obtain normalized first feature information; processing the normalized first feature information by using the shifted-Hilbert-based multi-head self-attention mechanism module to obtain second weighted one-dimensional feature information; adding the second weighted one-dimensional feature information and the first feature information of the target image to obtain fourth feature information; inputting the fourth feature information into the fourth normalization module to obtain normalized fourth feature information; inputting the normalized fourth feature information into the second multi-layer perceptron module to obtain a second feature of the target image; and adding the fourth feature information and the second feature of the target image to obtain the second feature information of the target image.
In an optional implementation, the shifted-Hilbert-based multi-head self-attention mechanism module includes a shifted-Hilbert division sub-module, a shifted-Hilbert self-attention sub-module, and a shifted-Hilbert flip sub-module, and the processing, by the electronic device, of the normalized first feature information by using the shifted-Hilbert-based multi-head self-attention mechanism module to obtain the second weighted one-dimensional feature information includes: unequally dividing the normalized first feature information according to a preset division ratio by using the shifted-Hilbert division sub-module to obtain multiple unequally divided segments of feature information; calculating, by using the shifted-Hilbert self-attention sub-module, the self-attention of each feature included in each segment of the unequally divided feature information; weighting each feature based on the self-attention of each feature included in each segment of the unequally divided feature information to obtain second weighted segments of feature information; and obtaining the second weighted one-dimensional feature information from the second weighted segments of feature information by using the shifted-Hilbert flip sub-module.
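One plausible reading of the H-MSA and SH-MSA division of the Hilbert-ordered token sequence is sketched below; the patent only specifies equal division and unequal division by a preset ratio, so the roll-based shift here is an assumption for illustration, and all names are hypothetical:

```python
import torch

def hilbert_partition(seq, segment_len):
    """H-MSA sketch: equally divide a Hilbert-ordered 1-D token sequence."""
    B, N, C = seq.shape
    return seq.reshape(B, N // segment_len, segment_len, C)   # (B, segments, segment_len, C)

def shifted_hilbert_partition(seq, segment_len, shift):
    """SH-MSA sketch: shift the sequence before dividing, so segment boundaries
    fall elsewhere and information can mix across the original segments."""
    return hilbert_partition(torch.roll(seq, shifts=-shift, dims=1), segment_len)

segments = hilbert_partition(torch.randn(1, 3136, 96), segment_len=49)
print(segments.shape)   # torch.Size([1, 64, 49, 96]); self-attention runs within each segment
```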
Implementing the image processing method and the related device provided by the application yields the following beneficial effects:
(1) The electronic device obtains the one-dimensional feature information of the target image based on the Hilbert curve, so that pixels that are adjacent in the original image remain adjacent after the one-dimensional feature information is divided by the Hilbert Transformer module; the target features of the target image are then obtained from each divided segment of feature information, giving the method low time complexity.
(2) The electronic device can map the two-dimensional feature map to a one-dimensional space based on the Hilbert curve, so that spatially adjacent objects can be stored adjacently, thereby ensuring spatial continuity and making the features easier to measure and retrieve.
Referring to fig. 3, fig. 3 is a schematic flowchart of an image processing method according to an embodiment of the present disclosure. As shown in fig. 3, the image processing method may include, but is not limited to, the following steps:
S301, performing feature extraction on the target image to obtain two-dimensional feature information of the target image.
S302, processing the two-dimensional feature information of the target image by using a Hilbert curve to obtain one-dimensional feature information of the target image.
The Hilbert curve is a fractal curve that can fill an entire plane, i.e., a space-filling curve. Here, fractal refers to geometric forms that fill space with a non-integer dimension. A fractal is generally defined as "a rough or fragmented geometric shape that can be split into several parts, each of which is (at least approximately) a reduced-size copy of the whole", so a fractal curve has the property of self-similarity.
Owing to its space-filling property, the Hilbert curve can linearly traverse every discrete cell of a two- or higher-dimensional space exactly once, linearly ordering and encoding each discrete cell, with the code serving as the unique identifier of the corresponding cell. That is, the curve can map data that cannot be well ordered in a high-dimensional space to a one-dimensional space, so that spatially adjacent objects can be stored adjacently.
Referring to FIG. 4, FIG. 4 is a schematic diagram of traversing an image space with a Hilbert curve according to an embodiment of the present application. As shown in FIG. 4, the curve can traverse every pixel in the image space, ensuring that pixels that are adjacent in the original image remain adjacent after division.
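The traversal order itself can be computed with the classic distance-to-coordinate conversion for Hilbert curves. The sketch below (plain Python; the patent gives no algorithm, so the function names and grid size are assumptions) lists the (x, y) visiting order for an 8 × 8 grid, which can then be used to flatten a two-dimensional feature map into one dimension:

```python
def hilbert_d2xy(n, d):
    """Map distance d along an n x n Hilbert curve (n a power of two) to (x, y)."""
    x = y = 0
    s, t = 1, d
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                      # rotate the quadrant when needed
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x, y = x + s * rx, y + s * ry
        t //= 4
        s *= 2
    return x, y

order = [hilbert_d2xy(8, d) for d in range(8 * 8)]   # traversal order of an 8x8 grid
# Flattening a feature map in this order keeps spatially adjacent pixels adjacent in 1-D:
# one_dim = [feature_map[y][x] for (x, y) in order]
```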
S303, processing the one-dimensional feature information of the target image by using a Hilbert Transformer module to obtain a target feature used for executing a downstream task in the target image.
The downstream task includes at least one of the following: segmentation, detection, or recognition. Segmentation, i.e., image segmentation, refers to the technique and process of dividing an image into several specific regions with unique properties and extracting objects of interest; its objective is to classify each pixel in the image. Recognition, i.e., image recognition, refers to techniques for processing, analyzing, and understanding images with an electronic device to recognize various patterns of targets and objects. Current image recognition technology is generally divided into face recognition and commodity recognition: face recognition is mainly applied to security inspection, identity verification, and mobile payment, while commodity recognition is mainly applied to the commodity circulation process, in particular to unmanned retail scenarios such as unmanned shelves and intelligent retail cabinets. Detection, i.e., object detection, is image segmentation based on object geometry and statistical features; it combines object segmentation and recognition, with the goal of determining the location and size of the object.
In the embodiment of the application, the electronic device performs feature extraction on the target image to obtain two-dimensional feature information of the target image; processes the two-dimensional feature information of the target image by using a Hilbert curve to obtain one-dimensional feature information of the target image; and inputs the one-dimensional feature information of the target image into the Hilbert Transformer module to obtain the target feature of the target image for executing the downstream task. It can be seen that, with the embodiment of the application, processing the two-dimensional feature information of the target image with the Hilbert curve allows the adjacency relationship between pixels to be preserved after the image is divided.
Referring to fig. 5, fig. 5 is a schematic flow chart of another image processing method according to an embodiment of the present application, and as shown in fig. 5, the image processing method may include, but is not limited to, the following steps:
S501, extracting features of the target image to obtain two-dimensional feature information of the target image;
S502, processing the two-dimensional feature information of the target image by using a Hilbert curve to obtain one-dimensional feature information of the target image;
S503, processing the one-dimensional feature information of the target image by using a Hilbert transform module to obtain the features of the target image;
the Hilbert transform module is used for obtaining the features of the target image based on the one-dimensional feature information of the target image.
In an alternative embodiment, the Hilbert transform module comprises a first transform module and a second transform module. The electronic device inputting the one-dimensional feature information of the target image into the Hilbert transform module to obtain the feature of the target image may include the following steps: inputting the one-dimensional feature information of the target image into the first transform module to obtain first feature information of the target image; inputting the first feature information of the target image into the second transform module to obtain second feature information of the target image; and taking the second feature information as the feature of the target image. Details of the Hilbert transform module are described with reference to fig. 6 below.
S504, performing feature recombination on the features of the target image to obtain two-dimensional feature information of the target image;
S505, processing the two-dimensional feature information of the target image by using the Hilbert curve to obtain one-dimensional feature information of the target image, executing step S503 again until the preset repetition times are reached, and determining the target feature of the target image;
it is understood that steps S503-S505 may correspond to step S303 described above.
S506, performing feature recombination on the target features of the target image to obtain target two-dimensional features used for executing downstream tasks in the target image.
Therefore, the image processing method obtains the one-dimensional feature information of the target image based on the Hilbert curve, so that pixels that are originally adjacent in the image remain adjacent after the one-dimensional feature information is divided by the Hilbert transform module; the target feature of the target image is then obtained based on the one-dimensional feature information, giving the method low time complexity. In addition, deeper features (namely, target features) of the target image can be obtained in this way, improving the accuracy of downstream tasks.
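As an aid to understanding the flow of steps S501-S506, a minimal Python sketch is given below. It uses NumPy; xy_to_hilbert is the function sketched with fig. 4, and transform_block stands in for the Hilbert transform module. These names are illustrative and not taken from the patent.

    import numpy as np

    def hilbert_order(n):
        """Permutation mapping Hilbert rank -> raster position on an n x n grid."""
        ranked = sorted((xy_to_hilbert(n, x, y), y * n + x)
                        for y in range(n) for x in range(n))
        return np.array([pos for _, pos in ranked])

    def process_image_features(feat_2d, transform_block, repeats):
        h, w, c = feat_2d.shape                   # S501: 2-D feature information
        order = hilbert_order(h)                  # assumes h == w, a power of two
        inverse = np.argsort(order)               # undoes the Hilbert ordering
        feat = feat_2d
        for _ in range(repeats):                  # preset repetition times
            seq = feat.reshape(h * w, c)[order]   # S502/S505: 2-D -> 1-D (Hilbert)
            seq = transform_block(seq)            # S503: Hilbert transform module
            feat = seq[inverse].reshape(h, w, c)  # S504: feature recombination
        return feat                               # S506: target 2-D features

The inverse permutation realizes the feature recombination: applying it to the transformed sequence puts each token back at its original raster position before the next round of Hilbert flattening.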
The following explains, with reference to fig. 6, the operations by which the Hilbert transform module processes the one-dimensional feature information of the target image to obtain the features of the target image. As shown in fig. 6, the Hilbert transform module includes a first transform module and a second transform module. The first transform module comprises a first Norm module, an H-MSA module, a second Norm module, and a first MLP module; the second transform module includes a third Norm module, an SH-MSA module, a fourth Norm module, and a second MLP module. Wherein:
First Norm module: used for normalizing the one-dimensional feature information to obtain normalized one-dimensional feature information. Normalization maps the data in the one-dimensional feature information into a fixed interval (range); it can be understood as a process that changes the range of pixel intensity values. Normalization mitigates the vanishing-gradient problem and can improve the running speed of the electronic device.
H-MSA module: used for processing the normalized one-dimensional feature information to obtain the first weighted one-dimensional feature information.
Second Norm module: used for normalizing the third feature information to obtain normalized third feature information, where the third feature information is the sum of the first weighted one-dimensional feature information and the one-dimensional feature information of the target image.
First MLP module: used for processing the normalized third feature information to obtain the first feature of the target image.
Third Norm module: used for normalizing the first feature information to obtain normalized first feature information, where the first feature information is the sum of the first feature of the target image and the third feature information.
SH-MSA module: used for processing the normalized first feature information to obtain the second weighted one-dimensional feature information.
Fourth Norm module: used for normalizing the fourth feature information to obtain normalized fourth feature information, where the fourth feature information is the sum of the second weighted one-dimensional feature information and the first feature information.
Second MLP module: used for processing the normalized fourth feature information to obtain the second feature of the target image.
Based on the above-mentioned structure of the hilbert transform module, as shown in fig. 6, the electronic device inputs the one-dimensional feature information of the target image into the hilbert transform module to obtain the feature of the target image, which may include, but is not limited to, the following steps:
601. inputting the one-dimensional feature information of the target image into a first Norm module to obtain normalized one-dimensional feature information;
602. processing the one-dimensional characteristic information after normalization processing by using an H-MSA module to obtain first weighted one-dimensional characteristic information;
For details of obtaining the first weighted one-dimensional feature information, refer to the related description of fig. 7 below.
603. Adding the one-dimensional characteristic information after the first weighting and the one-dimensional characteristic information of the target image to obtain third characteristic information;
604. inputting the third characteristic information into a second Norm module to obtain normalized third characteristic information;
605. inputting the normalized third feature information into a first MLP module to obtain a first feature of the target image;
606. adding the third characteristic information and the first characteristic of the target image to obtain first characteristic information of the target image;
607. inputting the first characteristic information of the target image into a third Norm module to obtain the first characteristic information after normalization processing;
608. processing the first feature information after normalization processing by using an SH-MSA module to obtain one-dimensional feature information after second weighting;
For details of obtaining the second weighted one-dimensional feature information, refer to the related description of fig. 8 below.
609. Adding the one-dimensional characteristic information after the second weighting and the first characteristic information of the target image to obtain fourth characteristic information;
610. inputting the fourth characteristic information into a fourth Norm module to obtain normalized fourth characteristic information;
611. inputting the fourth feature information after normalization processing into a second MLP module to obtain a second feature of the target image;
612. adding the fourth feature information and the second feature of the target image to obtain second feature information of the target image, and taking the second feature information as the feature of the target image.
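A hedged PyTorch sketch of steps 601-612 follows. It assumes the Norm modules are layer normalization and the MLP modules are two-layer perceptrons with GELU activation and a 4x hidden expansion; these are plausible but unconfirmed readings of the patent. h_msa and sh_msa stand in for the modules of figs. 7 and 8.

    import torch.nn as nn

    class HilbertTransformBlock(nn.Module):
        def __init__(self, dim, h_msa, sh_msa, mlp_ratio=4):
            super().__init__()
            self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
            self.norm3, self.norm4 = nn.LayerNorm(dim), nn.LayerNorm(dim)
            self.h_msa, self.sh_msa = h_msa, sh_msa
            make_mlp = lambda: nn.Sequential(nn.Linear(dim, mlp_ratio * dim),
                                             nn.GELU(),
                                             nn.Linear(mlp_ratio * dim, dim))
            self.mlp1, self.mlp2 = make_mlp(), make_mlp()

        def forward(self, x):                      # x: (seq_len, dim), Hilbert order
            t3 = x + self.h_msa(self.norm1(x))     # 601-603: third feature info
            f1 = t3 + self.mlp1(self.norm2(t3))    # 604-606: first feature info
            t4 = f1 + self.sh_msa(self.norm3(f1))  # 607-609: fourth feature info
            f2 = t4 + self.mlp2(self.norm4(t4))    # 610-612: feature of the image
            return f2

Each of the four additions in forward corresponds to one of the residual sums in steps 603, 606, 609, and 612.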
Therefore, by adopting the embodiment of the application, the one-dimensional characteristic information of the target image can be obtained based on the Hilbert curve, so that the pixels originally adjacent to each other on the image are still adjacent to each other after the one-dimensional characteristic information is divided by utilizing the Hilbert transformation module. In addition, the embodiment of the application divides the one-dimensional feature information of the target image, and can limit the calculation of self-attention to each section of feature information, so that the time complexity is a linear order, and the embodiment of the application has the performance of low time complexity.
With reference to fig. 7, a description is given below of operations related to processing the normalized one-dimensional feature information by using the H-MSA module to obtain the first weighted one-dimensional feature information. As shown in FIG. 7, the H-MSA module includes three sub-modules: a hilbert partition sub-module, a hilbert self-attention sub-module, and a hilbert flip sub-module. Wherein:
Hilbert partitioning sub-module: used for equally dividing the normalized one-dimensional feature information to obtain multiple equally divided segments of feature information. For example, assuming the sequence length corresponding to the one-dimensional feature information is 256, if the one-dimensional feature information is to be divided equally into 16 segments, the sequence length of each segment is 256/16 = 16. Each segment of one-dimensional feature information comprises a plurality of tokens, and each token is the smallest unit when computing self-attention; each token corresponds to an image block.
Hilbert self-attention submodule: used for calculating, for each of the equally divided segments, the self-attention of each feature in the segment, and performing the first weighting on each feature based on its self-attention to obtain each segment of feature information after the first weighting. Self-attention measures the importance of each feature in the input image. It can be understood that, since self-attention is computed within each segment, the amount of computation grows only linearly as the image size increases; that is, the time complexity of computing self-attention in this way is of linear order.
In an alternative embodiment, the electronic device may calculate the self-attention of each feature according to the foregoing equations (1) - (3), which will not be described herein.
Hilbert flip submodule: used for converting the segments of feature information after the first weighting back into one-dimensional feature information, obtaining the first weighted one-dimensional feature information. It is understood that the first weighted one-dimensional feature information has the same sequence length as the normalized one-dimensional feature information. The Hilbert flip sub-module can be regarded as the inverse of the Hilbert partitioning sub-module.
Based on the structure of the H-MSA module, as shown in fig. 7, the electronic device may process the normalized one-dimensional feature information by using the H-MSA module to obtain the first weighted one-dimensional feature information, and may include the following steps:
701. equally dividing the one-dimensional characteristic information after the normalization processing by using a Hilbert division submodule in an H-MSA module to obtain a plurality of sections of equally divided characteristic information;
Wherein each segment of feature information corresponds to a group of image blocks in the target image.
702. Calculating the self-attention of each characteristic included in each section of characteristic information in the equally divided multiple sections of characteristic information by utilizing a Hilbert self-attention submodule in an H-MSA module;
703. based on the self-attention of each feature included in each section of feature information in the equally divided multiple sections of feature information, weighting each feature to obtain each section of feature information after first weighting;
704. and obtaining first weighted one-dimensional characteristic information based on each section of characteristic information after first weighting by utilizing a Hilbert flip sub-module in the H-MSA module.
Therefore, through the method, the pixels which are adjacent to each other on the image originally can still be adjacent to each other after the one-dimensional characteristic information is divided by the Hilbert transform module. In addition, the one-dimensional feature information of the target image is divided, and the calculation of self attention can be limited in each section of feature information, so that the time complexity is a linear order, and therefore the embodiment of the application has the performance of low time complexity.
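To make steps 701-704 concrete, a single-head Python sketch is given below; the patent's module is multi-head, and the per-segment attention is assumed here to be standard scaled dot-product attention. The projection matrices w_qkv and w_out are illustrative assumptions, not elements named in the patent.

    import torch
    import torch.nn.functional as F

    def h_msa(x, num_segments, w_qkv, w_out):
        """x: (L, C) features in Hilbert order; w_qkv: (C, 3C); w_out: (C, C)."""
        L, C = x.shape
        seg = L // num_segments                 # e.g. 256 / 16 = 16 tokens
        q, k, v = (x @ w_qkv).chunk(3, dim=-1)  # per-token projections
        outs = []
        for s in range(num_segments):           # 701: Hilbert division
            sl = slice(s * seg, (s + 1) * seg)
            # 702-703: self-attention restricted to one segment, then weighting
            attn = F.softmax(q[sl] @ k[sl].T / C ** 0.5, dim=-1)
            outs.append(attn @ v[sl])
        return torch.cat(outs, dim=0) @ w_out   # 704: Hilbert flip, length L again

Because each softmax is computed over a fixed-length segment rather than the whole sequence, the total cost grows linearly with L, which is the linear-order complexity claimed above.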
The following explains, with reference to fig. 8, the operations by which the SH-MSA module processes the normalized first feature information to obtain the second weighted one-dimensional feature information. As shown in fig. 8, the SH-MSA module includes a shift Hilbert division sub-module, a shift Hilbert self-attention sub-module, and a shift Hilbert flip sub-module. Wherein:
Shift Hilbert division submodule: used for unequally dividing the normalized first feature information according to a preset division ratio to obtain multiple unequal segments of feature information. For example, assuming the sequence length of the normalized first feature information is 256 and the preset division ratio is 1:2:1:2:4:2:1:2:1, the sequence lengths of the segments obtained after division are 16, 32, 16, 32, 64, 32, 16, 32, 16.
After the normalized first feature information is divided by the shift Hilbert division submodule, the self-attention of the features at the boundary of each segment can follow the self-attention computed for the equal division. The reason is that a boundary feature of a segment obtained by unequal division according to the preset ratio lies inside one of the equally divided segments, and its self-attention has already been calculated in the H-MSA module.
For example, assuming each equally divided segment has sequence length 16 and the first and second unequally divided segments each have sequence length 12, the boundary between the first and second unequal segments falls inside the first equal segment; the self-attention of the features at that boundary was therefore already calculated during the equal division and can be reused.
Shift Hilbert self-attention submodule: used for calculating, for each unequally divided segment, the self-attention of each feature in the segment, and performing the second weighting on each feature based on its self-attention to obtain each segment of feature information after the second weighting.
Shift Hilbert flip sub-module: used for converting each segment of feature information after the second weighting back into one-dimensional feature information, obtaining the second weighted one-dimensional feature information.
Based on the structure of the SH-MSA module, as shown in fig. 8, the electronic device processing the normalized first feature information by using the SH-MSA module to obtain the second weighted one-dimensional feature information may include the following steps:
801. carrying out non-equal division on the first feature information after the normalization processing according to a preset division ratio by utilizing a shift Hilbert division submodule in an SH-MSA module to obtain multiple sections of feature information of the non-equal division;
802. calculating the self-attention of each characteristic included in each section of characteristic information in the unevenly divided multi-section characteristic information by utilizing a shift Hilbert self-attention submodule in an SH-MSA module;
803. weighting each feature based on the self-attention of each feature included in each piece of feature information in the plurality of pieces of feature information divided unequally to obtain each piece of feature information after second weighting;
804. obtaining the second weighted one-dimensional feature information based on each segment of feature information after the second weighting by using the shift Hilbert flip submodule in the SH-MSA module.
By adopting the embodiment of the application, processing the normalized first feature information with the SH-MSA module introduces connections between segments of feature information, i.e., communication across segments, while retaining the efficiency of computing self-attention within non-overlapping segments.
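The only structural difference from the H-MSA sketch above is the unequal division. A short sketch of computing the segment boundaries from the preset ratio follows; the ratio default is the example value from the text, and the function name is illustrative.

    def shifted_segments(seq_len, ratio=(1, 2, 1, 2, 4, 2, 1, 2, 1)):
        """Split [0, seq_len) unequally according to the preset division ratio."""
        unit = seq_len // sum(ratio)            # sum(ratio) = 16 -> unit = 16
        bounds, start = [], 0
        for r in ratio:
            bounds.append((start, start + r * unit))
            start += r * unit
        return bounds

    # With seq_len = 256, the segment lengths are 16, 32, 16, 32, 64, 32, 16, 32, 16:
    # [(0, 16), (16, 48), (48, 64), (64, 96), (96, 160),
    #  (160, 192), (192, 208), (208, 240), (240, 256)]
    print(shifted_segments(256))

Self-attention computed within these unequal segments straddles the boundaries of the equal segments of the H-MSA module, which is what introduces the communication between segments.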
Referring to fig. 9, fig. 9 is a schematic structural diagram of an image processing model according to an embodiment of the present disclosure. As shown in fig. 9, the image processing model may include a feature extraction module, a Hilbert curve processing module, a Hilbert transform module, a feature recombination module, and a downstream task module. The operations associated with training the image processing model are described below in conjunction with fig. 10. As shown in fig. 10, training the image processing model may include the following steps:
S1001, acquiring a training sample set.
S1002, processing each training sample image in the training sample set by using a feature extraction module to obtain two-dimensional feature information of each training sample image;
S1003, processing the two-dimensional feature information of each training sample image by using the Hilbert curve processing module to obtain target one-dimensional feature information of each training sample image;
S1004, processing the target one-dimensional feature information of each training sample image by using the Hilbert transform module to obtain the features of each training sample image;
S1005, performing feature recombination on the features of each training sample image by using the feature recombination module to obtain two-dimensional feature information of each training sample image;
S1006, processing the two-dimensional feature information of each training sample image by using the Hilbert curve processing module to obtain target one-dimensional feature information of each training sample image, executing step S1004 again until the preset repetition times are reached, and determining the target features of each training sample image;
S1007, performing feature recombination on the target features of each training sample image by using the feature recombination module to obtain target two-dimensional features used for executing downstream tasks in each training sample image;
S1008, inputting the target two-dimensional features used for executing the downstream task in each training sample image into the downstream task module to obtain the accuracy of the downstream task;
S1009, adjusting parameters in the feature extraction module, the Hilbert transform module, and the downstream task module included in the image processing model according to the accuracy, and executing step S1003 again until the accuracy meets the condition for stopping training.
Optionally, the condition of stopping training may be that the accuracy rate obtained by executing the target task reaches a stable state, for example, for the same training sample set, the accuracy rate obtained by executing the target task for multiple times is continuously about 95%. Optionally, the training stopping condition may also be that the accuracy of executing the task is greater than or equal to a preset value.
That is, the process of training the image processing model is essentially a process of continuously updating the image processing model based on the accuracy obtained by performing the target task.
For example, if the training sample set is input into an initialized image processing model, the initialized image processing model includes an initialized feature extraction module, an initialized hilbert transform module, and an initialized target network, and if the accuracy obtained by executing the target task does not satisfy the training stopping condition, network parameters in the initialized feature extraction module, the initialized hilbert transform module, and the initialized target network included in the initialized image processing model may be adjusted to obtain the first image processing model. And inputting the training sample set into the first image processing model, determining whether the accuracy obtained by executing the target task meets the training stopping condition, and if not, adjusting the network parameters of each module in the first image processing model to obtain a second image processing model. And inputting the training sample set into a second image processing model, determining whether the accuracy obtained by executing the target task meets the condition of stopping training, if so, stopping training, and obtaining the trained image processing model.
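For concreteness, a hedged Python sketch of the training loop S1001-S1009 follows. The cross-entropy loss and the gradient-based parameter update are standard assumptions; the patent only states that parameters are adjusted according to the accuracy until the stop condition is met.

    import torch

    def train_image_processing_model(model, loader, optimizer, target_acc=0.95):
        loss_fn = torch.nn.CrossEntropyLoss()        # assumed downstream-task loss
        while True:                                  # repeat until stop condition
            correct, total = 0, 0
            for images, labels in loader:            # S1001: training sample set
                logits = model(images)               # S1002-S1008: full pipeline
                loss = loss_fn(logits, labels)
                optimizer.zero_grad()
                loss.backward()                      # S1009: adjust parameters
                optimizer.step()
                correct += (logits.argmax(dim=1) == labels).sum().item()
                total += labels.numel()
            if correct / total >= target_acc:        # condition to stop training
                return model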
Referring to fig. 11, fig. 11 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application. As shown in fig. 11, the image processing apparatus may include, but is not limited to:
an acquisition unit 1101 configured to perform feature extraction on a target image to obtain two-dimensional feature information of the target image;
a processing unit 1102, configured to process the two-dimensional feature information of the target image by using a hilbert curve, to obtain one-dimensional feature information of the target image;
the processing unit 1102 is further configured to process the one-dimensional feature information of the target image by using a hilbert transform module, and obtain a target feature used for executing a downstream task in the target image, where the downstream task includes at least one of: segmentation, detection, or identification.
In an optional implementation manner, when the processing unit 1102 is configured to process the one-dimensional feature information of the target image by using the hilbert transform module to obtain a target feature in the target image for executing a downstream task, specifically, the processing unit is configured to:
processing the one-dimensional characteristic information of the target image by using a Hilbert transform module to obtain the characteristics of the target image; performing feature recombination on the features of the target image to obtain two-dimensional feature information of the target image; processing the two-dimensional characteristic information of the target image by using a Hilbert curve to obtain one-dimensional characteristic information of the target image, and performing the step of processing the one-dimensional characteristic information of the target image by using a Hilbert transformation module again to obtain the characteristic of the target image until the preset repetition times is reached to determine the target characteristic of the target image; and performing characteristic recombination on the target characteristics of the target image to obtain target two-dimensional characteristics used for executing downstream tasks in the target image.
In an optional implementation manner, the hilbert transform module includes a first transform module and a second transform module, and when the processing unit 1102 is configured to process the one-dimensional feature information of the target image by using the hilbert transform module to obtain the feature of the target image, the processing unit is specifically configured to:
inputting one-dimensional characteristic information of a target image into a first transformation module to obtain first characteristic information of the target image; inputting the first characteristic information of the target image into a second transformation module to obtain second characteristic information of the target image; and determining second characteristic information as the characteristics of the target image.
In an optional implementation manner, the first transformation module includes a first normalization module, a hilbert-based multi-head self-attention mechanism module, a second normalization module, and a first multi-layered perceptron module, and when the processing unit 1102 is configured to input the one-dimensional feature information of the target image into the first transformation module to obtain the first feature information of the target image, the processing unit is specifically configured to:
inputting the one-dimensional characteristic information of the target image into a first normalization module to obtain the one-dimensional characteristic information after normalization processing; processing the one-dimensional feature information after normalization processing by using a Hilbert-based multi-head self-attention mechanism module to obtain first weighted one-dimensional feature information; adding the one-dimensional characteristic information after the first weighting and the one-dimensional characteristic information of the target image to obtain third characteristic information; inputting the third characteristic information into a second normalization module to obtain normalized third characteristic information; inputting the normalized third feature information into a first multilayer perceptron module to obtain a first feature of the target image; and adding the third characteristic information and the first characteristic of the target image to obtain the first characteristic information of the target image.
In an optional implementation manner, the hilbert-based multi-head self-attention mechanism module includes a hilbert partitioning sub-module, a hilbert self-attention sub-module, and a hilbert flipping sub-module, and when the processing unit 1102 is configured to process the normalized one-dimensional feature information by using the hilbert-based multi-head self-attention mechanism module to obtain the first weighted one-dimensional feature information, the processing unit is specifically configured to:
equally dividing the one-dimensional feature information after the normalization processing by using a Hilbert division submodule to obtain a plurality of pieces of equally divided feature information; calculating the self-attention of each feature included in each section of feature information in the equally divided multiple sections of feature information by using a Hilbert self-attention submodule; based on the self-attention of each feature included in each section of feature information in the equally divided multiple sections of feature information, weighting each feature to obtain each section of feature information after first weighting; and obtaining the one-dimensional characteristic information after the first weighting based on each section of characteristic information after the first weighting by utilizing a Hilbert flip sub-module.
In an optional implementation manner, the second transformation module includes a third normalization module, a multi-head self-attention mechanism module based on shift hilbert, a fourth normalization module, and a second multi-layer perceptron module, and when the processing unit 1102 is configured to input the first feature information of the target image into the second transformation module to obtain the second feature information of the target image, the processing unit is specifically configured to:
inputting the first feature information of the target image into a third normalization module to obtain normalized first feature information; processing the normalized first feature information by using a multi-head self-attention mechanism module based on shift Hilbert to obtain second weighted one-dimensional feature information; adding the second weighted one-dimensional feature information and the first feature information of the target image to obtain fourth feature information; inputting the fourth feature information into a fourth normalization module to obtain normalized fourth feature information; inputting the normalized fourth feature information into a second multi-layer perceptron module to obtain a second feature of the target image; and adding the fourth feature information and the second feature of the target image to obtain second feature information of the target image.
In an alternative embodiment, the shift-Hilbert-based multi-head self-attention mechanism module includes a shift Hilbert division sub-module, a shift Hilbert self-attention sub-module, and a shift Hilbert flip sub-module. When the processing unit 1102 is configured to process the normalized first feature information by using the shift-Hilbert-based multi-head self-attention mechanism module to obtain the second weighted one-dimensional feature information, it is specifically configured to:
utilizing a shift Hilbert division submodule to carry out non-equal division on the first feature information after the normalization processing according to a preset division ratio to obtain multiple pieces of feature information of non-equal division; calculating the self-attention of each feature included in each section of feature information in the unevenly divided sections of feature information by using a shift Hilbert self-attention submodule; weighting each feature based on the self-attention of each feature included in each piece of feature information in the plurality of pieces of feature information divided unequally to obtain each piece of feature information after second weighting; and obtaining second weighted one-dimensional characteristic information based on each section of characteristic information after second weighting by utilizing the shift Hilbert flip submodule.
Optionally, the image processing apparatus may also refer to the related contents in the above image processing method, and will not be described in detail here.
The software and hardware architecture of an electronic device to which the image processing method provided in this application applies is described below.
the electronic device provided in this embodiment of the present application may be a terminal device carrying iOS, android, nylon, microsoft, or other operating systems, such as a mobile phone, a tablet computer, a desktop computer, a laptop computer, a handheld computer, a notebook computer, a super-mobile personal computer (UMPC), a netbook, a cellular phone, a Personal Digital Assistant (PDA), an Augmented Reality (AR) device, a Virtual Reality (VR) device, an Artificial Intelligence (AI) device, a wearable device, a vehicle-mounted device, a smart home device, and/or a city smart device, and this embodiment of the present application does not specially limit the specific type of the electronic device.
Referring to fig. 12, fig. 12 is a schematic diagram of a hardware architecture of an electronic device 100 according to an embodiment of the present disclosure. As shown in fig. 12, the electronic device 100 may include, but is not limited to: a processor 110, an antenna 1, an antenna 2, a user module 120, a mobile communication module 130, a wireless communication module 140, an internal memory 121, an external memory interface 122, and the like.
It is to be understood that the illustrated structure of the embodiment of the present application does not specifically limit the electronic device 100. In other embodiments of the present application, electronic device 100 may include more or fewer components than shown in FIG. 12, or some components may be combined, some components may be split, or a different arrangement of components. The components shown in fig. 12 may be implemented in hardware, software, or a combination of software and hardware.
Processor 110 may include one or more processing units, such as: the processor 110 may include an Application Processor (AP), a modem processor, a graphics processor, an Image Signal Processor (ISP), a controller, a memory, a video codec, a Digital Signal Processor (DSP), a baseband processor, and/or a neural-Network Processing Unit (NPU), among others. Wherein, the different processing units may be independent devices or may be integrated in one or more processors.
The controller may be, among other things, a neural center and a command center of the electronic device 100. The controller can generate an operation control signal according to the instruction operation code and the timing signal to complete the control of instruction fetching and instruction execution.
A memory may also be provided in processor 110 for storing instructions and data. In an alternative embodiment, the memory in the processor 110 is a cache memory. The memory may hold instructions or data that have just been used or recycled by the processor 110. If the processor 110 needs to reuse the instruction or data, it can be called directly from the memory. Avoiding repeated accesses reduces the latency of the processor 110, thereby increasing the efficiency of the system.
In this embodiment of the application, the processor 110 may be configured to perform image processing, and specifically may be configured to perform feature extraction on a target image to obtain two-dimensional feature information of the target image; processing the two-dimensional characteristic information of the target image by using a Hilbert curve to obtain one-dimensional characteristic information of the target image; processing the one-dimensional characteristic information of the target image by using a Hilbert transform module to obtain target characteristics used for executing a downstream task in the target image, wherein the downstream task comprises at least one of the following items: segmentation, detection, or identification. For the specific steps executed by the processor 110, reference may be made to the description of the steps S301 to S303, which is not repeated herein.
In addition, when the processor 110 is configured to process the one-dimensional feature information of the target image by using the hilbert transform module to obtain a target feature in the target image for executing a downstream task, the processor is specifically configured to: processing the one-dimensional characteristic information of the target image by using a Hilbert transform module to obtain the characteristics of the target image; performing feature recombination on the features of the target image to obtain two-dimensional feature information of the target image; processing the two-dimensional characteristic information of the target image by using a Hilbert curve to obtain one-dimensional characteristic information of the target image, and performing the step of processing the one-dimensional characteristic information of the target image by using a Hilbert transformation module again to obtain the characteristic of the target image until the preset repetition times are reached, and determining the target characteristic of the target image; and performing characteristic recombination on the target characteristics of the target image to obtain target two-dimensional characteristics used for executing downstream tasks in the target image.
In addition, the hilbert transform module includes a first transform module and a second transform module, and when the processor 110 is configured to process the one-dimensional feature information of the target image by using the hilbert transform module to obtain the feature of the target image, the processor is specifically configured to:
inputting one-dimensional characteristic information of a target image into a first transformation module to obtain first characteristic information of the target image; inputting the first characteristic information of the target image into a second transformation module to obtain second characteristic information of the target image; and determining second characteristic information as the characteristics of the target image.
In addition, the first transformation module includes a first normalization module, a hilbert-based multi-head self-attention mechanism module, a second normalization module, and a first multi-layered perceptron module, and when the processor 110 is configured to input the one-dimensional feature information of the target image into the first transformation module to obtain the first feature information of the target image, the processor is specifically configured to: inputting the one-dimensional characteristic information of the target image into a first normalization module to obtain the one-dimensional characteristic information after normalization processing; processing the one-dimensional feature information after normalization processing by using a Hilbert-based multi-head self-attention mechanism module to obtain first weighted one-dimensional feature information; adding the one-dimensional characteristic information after the first weighting and the one-dimensional characteristic information of the target image to obtain third characteristic information; inputting the third characteristic information into a second normalization module to obtain normalized third characteristic information; inputting the third feature information after normalization processing into a first multilayer perceptron module to obtain a first feature of the target image; and adding the third characteristic information and the first characteristic of the target image to obtain the first characteristic information of the target image.
In addition, the hilbert-based multi-head self-attention mechanism module includes a hilbert partitioning sub-module, a hilbert self-attention sub-module, and a hilbert flipping sub-module, and when the processor 110 is configured to process the normalized one-dimensional feature information by using the hilbert-based multi-head self-attention mechanism module to obtain the first weighted one-dimensional feature information, the processor is specifically configured to:
equally dividing the one-dimensional feature information after the normalization processing by using a Hilbert division submodule to obtain a plurality of pieces of equally divided feature information; calculating the self-attention of each feature included in each section of feature information in the equally divided multiple sections of feature information by using a Hilbert self-attention submodule; based on the self-attention of each feature included in each piece of feature information in the plurality of pieces of feature information divided equally, weighting each feature to obtain each piece of feature information after first weighting; and obtaining the one-dimensional feature information after the first weighting based on each section of feature information after the first weighting by utilizing a Hilbert flip sub-module.
In addition, the second transformation module includes a third normalization module, a multi-head self-attention mechanism module based on shift Hilbert, a fourth normalization module, and a second multi-layer perceptron module, and when the processor 110 is configured to input the first feature information of the target image into the second transformation module to obtain the second feature information of the target image, the processor is specifically configured to: input the first feature information of the target image into the third normalization module to obtain normalized first feature information; process the normalized first feature information by using the multi-head self-attention mechanism module based on shift Hilbert to obtain second weighted one-dimensional feature information; add the second weighted one-dimensional feature information and the first feature information of the target image to obtain fourth feature information; input the fourth feature information into the fourth normalization module to obtain normalized fourth feature information; input the normalized fourth feature information into the second multi-layer perceptron module to obtain a second feature of the target image; and add the fourth feature information and the second feature of the target image to obtain second feature information of the target image.
Furthermore, the shift-Hilbert-based multi-head self-attention mechanism module includes a shift Hilbert division sub-module, a shift Hilbert self-attention sub-module, and a shift Hilbert flip sub-module. When the processor 110 is configured to process the normalized first feature information by using the shift-Hilbert-based multi-head self-attention mechanism module to obtain the second weighted one-dimensional feature information, it is specifically configured to: unequally divide the normalized first feature information according to a preset division ratio by using the shift Hilbert division submodule to obtain multiple unequal segments of feature information; calculate the self-attention of each feature included in each segment of the unequally divided feature information by using the shift Hilbert self-attention submodule; weight each feature based on the self-attention of each feature included in each segment of the unequally divided feature information to obtain each segment of feature information after the second weighting; and obtain the second weighted one-dimensional feature information based on each segment of feature information after the second weighting by using the shift Hilbert flip submodule.
In an alternative embodiment, processor 110 may include one or more interfaces. The interface may include an integrated circuit (I2C) interface, an integrated circuit built-in audio (I2S) interface, a Pulse Code Modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a Mobile Industry Processor Interface (MIPI), a general-purpose input/output (GPIO) interface, a Subscriber Identity Module (SIM) interface, and/or a Universal Serial Bus (USB) interface, etc.
It should be understood that the interface connection relationship between the modules illustrated in the embodiments of the present application is only an illustration, and does not limit the structure of the electronic device 100. In other embodiments of the present application, the electronic device 100 may also adopt different interface connection manners or a combination of multiple interface connection manners in the above embodiments.
The user module 120 is a medium for implementing interaction and information exchange between a user and an electronic device, and may be embodied by a Display screen (Display) for output, a Keyboard (Keyboard) for input, and the like, where the Keyboard may be a physical Keyboard, a touch screen virtual Keyboard, or a Keyboard that is a combination of a physical Keyboard and a touch screen virtual Keyboard.
The wireless communication function of the electronic device 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 130, the wireless communication module 140, and the like.
The antennas 1 and 2 are used for transmitting and receiving electromagnetic wave signals. Each antenna in the electronic device 100 may be used to cover a single or multiple communication bands. Different antennas can also be multiplexed to improve the utilization of the antennas. For example: the antenna 1 may be multiplexed as a diversity antenna of a wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.
The mobile communication module 130 may provide a solution including 2G/3G/4G/5G wireless communication applied to the electronic device 100. The mobile communication module 130 may include at least one filter, a switch, a power amplifier, a Low Noise Amplifier (LNA), and the like. The mobile communication module 130 can receive the electromagnetic wave from the antenna 1, and filter, amplify, etc. the received electromagnetic wave, and transmit the electromagnetic wave to the modem processor for demodulation. The mobile communication module 130 can also amplify the signal modulated by the modem processor, and convert the signal into electromagnetic wave to radiate the electromagnetic wave through the antenna 1. In an alternative embodiment, at least some of the functional modules of the mobile communication module 130 may be disposed in the processor 110. In an alternative embodiment, at least some of the functional modules of the mobile communication module 130 may be disposed in the same device as at least some of the modules of the processor 110.
The wireless communication module 140 may provide solutions for wireless communication applied to the electronic device 100, including Wireless Local Area Networks (WLANs) (e.g., wireless fidelity (Wi-Fi) networks), bluetooth (bluetooth, BT), global Navigation Satellite System (GNSS), frequency Modulation (FM), near Field Communication (NFC), infrared (IR), and the like. The wireless communication module 140 may be one or more devices integrating at least one communication processing module. The wireless communication module 140 receives electromagnetic waves via the antenna 2, demodulates and filters the electromagnetic wave signal, and transmits the processed signal to the processor 110. The wireless communication module 140 may also receive a signal to be transmitted from the processor 110, perform frequency modulation and amplification on the signal, and convert the signal into electromagnetic waves through the antenna 2 to radiate the electromagnetic waves.
In an alternative embodiment, the antenna 1 of the electronic device 100 is coupled to the mobile communication module 130 and the antenna 2 is coupled to the wireless communication module 140, so that the electronic device 100 can communicate with a network and other devices through wireless communication technology. The wireless communication technology may include global system for mobile communications (GSM), general Packet Radio Service (GPRS), code division multiple access (code division multiple access, CDMA), wideband Code Division Multiple Access (WCDMA), time-division code division multiple access (time-division code division multiple access, TD-SCDMA), long Term Evolution (LTE), BT, GNSS, WLAN, NFC, FM, and/or IR technologies, etc. The GNSS may include a Global Positioning System (GPS), a global navigation satellite system (GLONASS), a beidou navigation satellite system (BDS), a quasi-zenith satellite system (QZSS), and/or a Satellite Based Augmentation System (SBAS).
The internal memory 121 may include one or more Random Access Memories (RAMs) and one or more non-volatile memories (NVMs).
The random access memory may include static random-access memory (SRAM), dynamic random-access memory (DRAM), synchronous dynamic random-access memory (SDRAM), double data rate synchronous dynamic random-access memory (DDR SDRAM), such as fifth generation DDR SDRAM generally referred to as DDR5 SDRAM, and the like;
The nonvolatile memory may include a magnetic disk storage device and a flash memory.
The FLASH memory may include NOR FLASH, NAND FLASH, 3D NAND FLASH, and the like according to the operating principle; may include single-level cell (SLC), multi-level cell (MLC), triple-level cell (TLC), quad-level cell (QLC), and the like according to the potential order of the memory cell; and may include universal flash storage (UFS), embedded multimedia card (eMMC), and the like according to the storage specification.
The random access memory may be read and written directly by the processor 110, may be used to store executable programs (e.g., machine instructions) of an operating system or other programs in operation, and may also be used to store data of users and applications, etc.
The nonvolatile memory may also store executable programs, data of users and application programs, and the like, and may be loaded into the random access memory in advance for the processor 110 to directly read and write.
In the embodiment of the present application, the nonvolatile memory may be used to store a preset voiceprint model and a preset speech synthesis model, as well as relevant data of a registered user; the voice-related data of the registered user includes, but is not limited to, features of the registered user and the synthesized speech corresponding to the voice input by the registered user.
The external memory interface 122 may be used to connect an external nonvolatile memory, so as to expand the storage capability of the electronic device 100. The external non-volatile memory communicates with the processor 110 through the external memory interface 122 to perform data storage functions. For example, files such as music, video, etc. are saved in an external nonvolatile memory.
The software system of the electronic device 100 may employ a layered architecture, an event-driven architecture, a micro-core architecture, a micro-service architecture, or a cloud architecture. The embodiment of the present application takes an Android system with a layered architecture as an example, and exemplarily illustrates a software structure of the electronic device 100.
Fig. 13 is a schematic software architecture diagram of an electronic device 100 according to an embodiment of the present application.
The layered architecture divides the software into several layers, each layer having a clear role and division of labor. The layers communicate with each other through a software interface. In an optional embodiment, the Android system is divided into four layers, which are an application layer, an application framework layer, an Android runtime (Android runtime) and system library, and a kernel layer from top to bottom.
The application layer may include a series of application packages.
As shown in FIG. 13, the application package may include applications such as a smart assistant, gallery, call, map, navigation, WLAN, bluetooth, music, video, short message, etc. In other embodiments of the present application, the application program for providing the image processing method described in the present application may also be referred to by other names other than the image processing program, such as a processing method, a processing program, and the like, which are not limited in this application.
The application framework layer provides an Application Programming Interface (API) and a programming framework for the application program of the application layer. The application framework layer includes a number of predefined functions.
As shown in FIG. 13, the application framework layers may include a window manager, content provider, notification manager, view system, phone manager, resource manager, and the like.
The Hardware Abstraction Layer (HAL) is positioned between the kernel layer and the framework layer and serves as a bridge between them. Specifically, the HAL defines a set of standard interfaces, including an image HAL, sensor HALs, and the like.
The kernel layer is a layer between hardware and software. The kernel layer at least comprises a processing driver, a display driver, a camera driver, an audio driver and a sensor driver.
In this embodiment, the smart assistant application may issue an image processing command to the processing driver through the interface and the image HAL provided by the application framework layer, so that the processing driver controls the processor to process the image.
The workflow of the software and hardware of the electronic device is exemplarily described below with reference to the image processing scenario.
When the processor 110 receives an image processing operation, a corresponding hardware interrupt is issued to the corresponding driver of the kernel layer. The driver of the kernel layer processes the operation into an original input event, which is stored at the kernel layer. The application framework layer acquires the original input event from the kernel layer and identifies the operation instruction corresponding to the input event. The operation instruction is used to wake up the smart assistant of the electronic device, and the smart assistant application calls an interface of the application framework layer to start the image processing service, which provides the service for the smart assistant application.
It should be understood that the steps of the above method embodiments provided by the present application may be performed by integrated logic circuits of hardware in a processor or instructions in the form of software. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware processor, or may be implemented by a combination of hardware and software modules in a processor.
The embodiments of the present application can be combined arbitrarily to achieve different technical effects.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions described in accordance with the present application are generated, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, fiber optic, digital subscriber line) or wirelessly (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid State Disk), among others.
One of ordinary skill in the art will appreciate that all or part of the processes in the methods of the above embodiments may be implemented by hardware related to instructions of a computer program, which may be stored in a computer-readable storage medium, and when executed, may include the processes of the above method embodiments. And the aforementioned storage medium includes: various media capable of storing program codes, such as ROM or RAM, magnetic or optical disks, etc.
In short, the above description is only an example of the technical solution of the present invention, and is not intended to limit the scope of the present invention. Any modifications, equivalents, improvements and the like made in accordance with the disclosure of the present invention are intended to be included within the scope of the present invention.

Claims (10)

1. An image processing method, characterized in that the method comprises:
performing feature extraction on a target image to obtain two-dimensional feature information of the target image;
processing the two-dimensional feature information of the target image by using a Hilbert curve to obtain one-dimensional feature information of the target image;
processing the one-dimensional feature information of the target image by using a Hilbert transform module to obtain a target feature used for executing a downstream task in the target image, wherein the downstream task comprises at least one of the following items: segmentation, detection, or identification.
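To make the curve mapping concrete, the following is a minimal Python sketch (not the patented implementation; all names are illustrative) of flattening a square feature map whose side n is a power of two into a one-dimensional sequence along a Hilbert curve, using the standard distance-to-coordinate construction:

```python
import numpy as np

def hilbert_d2xy(n, d):
    """Map distance d along a Hilbert curve filling an n x n grid
    (n a power of two) to (x, y) grid coordinates."""
    x = y = 0
    s, t = 1, d
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                          # rotate the quadrant if needed
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x, y = x + s * rx, y + s * ry
        t //= 4
        s *= 2
    return x, y

def hilbert_flatten(feat):
    """(n, n, C) feature map -> (n*n, C) sequence in Hilbert order, so pixels
    adjacent in the image tend to stay adjacent in the sequence."""
    n = feat.shape[0]
    return np.stack([feat[hilbert_d2xy(n, d)] for d in range(n * n)])
```

Locality is the point of this ordering: unlike raster order, consecutive sequence positions are almost always spatial neighbors, which is what later lets segment-wise attention stay cheap without breaking adjacency.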
2. The method according to claim 1, wherein the processing the one-dimensional feature information of the target image by using the Hilbert transform module to obtain a target feature in the target image for performing a downstream task comprises:
processing the one-dimensional feature information of the target image by using the Hilbert transform module to obtain features of the target image;
performing feature recombination on the features of the target image to obtain two-dimensional feature information of the target image;
repeating the step of processing the two-dimensional feature information of the target image by using the Hilbert curve to obtain one-dimensional feature information of the target image and the step of processing the one-dimensional feature information of the target image by using the Hilbert transform module to obtain features of the target image, until a preset number of repetitions is reached, so as to determine the target features of the target image; and
performing feature recombination on the target features of the target image to obtain target two-dimensional features used for executing the downstream task in the target image.
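Under the same illustrative assumptions as the sketch above, the loop of claim 2 could look like this, with hilbert_flatten and hilbert_d2xy reused from that sketch and the Hilbert transform module passed in as a callable:

```python
def hilbert_unflatten(seq, n):
    """Inverse of hilbert_flatten: scatter an (n*n, C) sequence back to (n, n, C)."""
    feat = np.zeros((n, n, seq.shape[-1]), dtype=seq.dtype)
    for d in range(n * n):
        feat[hilbert_d2xy(n, d)] = seq[d]
    return feat

def iterate_features(feat_2d, hilbert_transform, n_repeats):
    """Claim 2 loop: flatten -> transform -> recombine, n_repeats times."""
    n = feat_2d.shape[0]
    for _ in range(n_repeats):
        seq = hilbert_flatten(feat_2d)       # 2-D -> 1-D along the curve
        seq = hilbert_transform(seq)         # Hilbert transform module (claims 3-7)
        feat_2d = hilbert_unflatten(seq, n)  # feature recombination back to 2-D
    return feat_2d                           # target two-dimensional features
```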
3. The method of claim 2, wherein the Hilbert transform module comprises a first transform module and a second transform module,
the processing the one-dimensional feature information of the target image by using the Hilbert transform module to obtain the features of the target image includes:
inputting the one-dimensional feature information of the target image into the first transformation module to obtain first feature information of the target image;
inputting the first feature information of the target image into the second transformation module to obtain second feature information of the target image; and
determining the second feature information as the feature of the target image.
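The composition in claim 3 is a plain chaining of two blocks; as a one-line sketch (illustrative names):

```python
def hilbert_transform_module(seq, first_module, second_module):
    """Claim 3: chain the two transformation modules; the output of the
    second is taken as the feature of the target image."""
    return second_module(first_module(seq))
```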
4. The method of claim 3, wherein the first transformation module comprises a first normalization module, a Hilbert-based multi-head self-attention mechanism module, a second normalization module, and a first multi-layer perceptron module,
the inputting the one-dimensional feature information of the target image into the first transformation module to obtain the first feature information of the target image includes:
inputting the one-dimensional feature information of the target image into the first normalization module to obtain one-dimensional feature information after normalization processing;
processing the normalized one-dimensional feature information by using the Hilbert-based multi-head self-attention mechanism module to obtain first weighted one-dimensional feature information;
adding the one-dimensional feature information after the first weighting and the one-dimensional feature information of the target image to obtain third feature information;
inputting the third feature information into the second normalization module to obtain normalized third feature information;
inputting the normalized third feature information into the first multi-layer perceptron module to obtain a first feature of the target image; and
adding the third feature information to the first feature of the target image to obtain the first feature information of the target image.
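Claim 4 describes a pre-norm transformer block built around Hilbert-based attention. A hedged PyTorch sketch under that reading (names are illustrative; an attention submodule in the shape of the claim 5 sketch below is passed in):

```python
import torch.nn as nn

class HilbertTransformerBlock(nn.Module):
    """Pre-norm block per claim 4: LN -> Hilbert MSA -> residual,
    then LN -> MLP -> residual. A sketch, not the patented code."""
    def __init__(self, dim, attn, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)   # first normalization module
        self.attn = attn                 # Hilbert-based multi-head self-attention
        self.norm2 = nn.LayerNorm(dim)   # second normalization module
        self.mlp = nn.Sequential(        # first multi-layer perceptron module
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x):                 # x: (B, L, C) one-dimensional feature info
        x = x + self.attn(self.norm1(x))  # first weighted features + input = third feature info
        x = x + self.mlp(self.norm2(x))   # third info + first feature = first feature info
        return x
```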
5. The method of claim 4, wherein the Hilbert-based multi-head self-attention mechanism module comprises a Hilbert partition submodule, a Hilbert self-attention submodule, and a Hilbert flip submodule,
the processing the normalized one-dimensional feature information by using the Hilbert-based multi-head self-attention mechanism module to obtain first weighted one-dimensional feature information includes:
equally dividing the normalized one-dimensional feature information by using the Hilbert partition submodule to obtain a plurality of equally divided pieces of feature information;
calculating, by using the Hilbert self-attention submodule, the self-attention of each feature included in each of the equally divided pieces of feature information;
weighting each feature based on its self-attention to obtain first weighted pieces of feature information; and
obtaining the first weighted one-dimensional feature information from the first weighted pieces of feature information by using the Hilbert flip submodule.
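A minimal PyTorch sketch of claim 5, with the patent's attention details stood in for by nn.MultiheadAttention (an assumption) and the segment length as a free parameter:

```python
import torch.nn as nn

class HilbertSelfAttention(nn.Module):
    """Claim 5 sketch: equally partition the Hilbert-ordered sequence into
    segments, run self-attention inside each segment, then 'flip' the
    segments back into one sequence."""
    def __init__(self, dim, num_heads, seg_len):
        super().__init__()
        self.seg_len = seg_len
        self.mha = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                    # x: (B, L, C), L divisible by seg_len
        b, l, c = x.shape
        segs = x.reshape(b * l // self.seg_len, self.seg_len, c)  # Hilbert partition
        out, _ = self.mha(segs, segs, segs)  # self-attention weights every feature
        return out.reshape(b, l, c)          # Hilbert flip: segments -> one sequence
```

Because the Hilbert ordering preserves adjacency, each fixed-length segment still covers a compact image region, so this segment-local attention costs on the order of L times the segment length rather than L squared.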
6. The method of claim 3, wherein the second transformation module comprises a third normalization module, a shifted Hilbert-based multi-head self-attention mechanism module, a fourth normalization module, and a second multi-layer perceptron module,
the inputting the first feature information of the target image into the second transformation module to obtain the second feature information of the target image includes:
inputting the first feature information of the target image into the third normalization module to obtain normalized first feature information;
processing the normalized first feature information by using the shifted Hilbert-based multi-head self-attention mechanism module to obtain second weighted one-dimensional feature information;
adding the second weighted one-dimensional feature information to the first feature information of the target image to obtain fourth feature information;
inputting the fourth feature information into the fourth normalization module to obtain normalized fourth feature information;
inputting the normalized fourth feature information into the second multi-layer perceptron module to obtain a second feature of the target image; and
adding the fourth feature information to the second feature of the target image to obtain the second feature information of the target image.
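Claim 6 mirrors claim 4, with only the attention submodule swapped for the shifted variant of claim 7 (sketched below). Assuming the sketches above are in scope, the claim 3 pair could be assembled as:

```python
import torch

dim, heads, seg = 96, 4, 16
first_block = HilbertTransformerBlock(dim, HilbertSelfAttention(dim, heads, seg))
second_block = HilbertTransformerBlock(dim, ShiftedHilbertSelfAttention(dim, heads, seg))

x = torch.randn(2, 64, dim)              # (B, L, C) Hilbert-ordered sequence
features = second_block(first_block(x))  # second feature information of claim 3
```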
7. The method of claim 6, wherein the shifted Hilbert-based multi-head self-attention mechanism module comprises a shifted Hilbert partition submodule, a shifted Hilbert self-attention submodule, and a shifted Hilbert flip submodule,
and the processing the normalized first feature information by using the shifted Hilbert-based multi-head self-attention mechanism module to obtain second weighted one-dimensional feature information comprises:
dividing, by using the shifted Hilbert partition submodule, the normalized first feature information unequally according to a preset division ratio to obtain a plurality of unequally divided pieces of feature information;
calculating, by using the shifted Hilbert self-attention submodule, the self-attention of each feature included in each of the unequally divided pieces of feature information;
weighting each feature based on its self-attention to obtain second weighted pieces of feature information; and
obtaining the second weighted one-dimensional feature information from the second weighted pieces of feature information by using the shifted Hilbert flip submodule.
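A hedged sketch of claim 7 in the same style: the preset division ratio offsets segment boundaries relative to the claim 5 partition, so information can mix across the earlier segment borders, analogous to shifted windows in window-based vision transformers. Names, ratio, and boundary handling are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ShiftedHilbertSelfAttention(nn.Module):
    """Claim 7 sketch: divide the sequence unequally by a preset ratio,
    attend within each segment, then concatenate ('flip') back."""
    def __init__(self, dim, num_heads, seg_len, shift_ratio=0.5):
        super().__init__()
        self.seg_len = seg_len
        self.shift = max(1, int(seg_len * shift_ratio))  # preset division ratio
        self.mha = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                                # x: (B, L, C)
        b, l, c = x.shape
        # unequal division: a short leading segment of `shift` tokens, then
        # full-length segments, so boundaries differ from the claim 5 split
        bounds = [0, self.shift] + list(range(self.shift + self.seg_len, l, self.seg_len))
        if bounds[-1] != l:
            bounds.append(l)
        outs = []
        for s, e in zip(bounds[:-1], bounds[1:]):
            seg = x[:, s:e]
            out, _ = self.mha(seg, seg, seg)             # per-segment self-attention
            outs.append(out)
        return torch.cat(outs, dim=1)                    # shifted Hilbert flip
```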
8. An electronic device, comprising: one or more processors, one or more memories, and a display screen; wherein the one or more memories are coupled to the one or more processors and store computer program code, the computer program code comprising computer instructions that, when executed by the one or more processors, cause the electronic device to perform the method of any one of claims 1-7.
9. A computer-readable storage medium comprising computer instructions which, when run on an electronic device, cause the electronic device to perform the method of any one of claims 1-7.
10. A chip or chip system, comprising processing circuitry and interface circuitry, wherein the interface circuitry is configured to receive code instructions and transmit them to the processing circuitry, and the processing circuitry is configured to execute the code instructions to perform the method of any one of claims 1 to 7.
CN202211171675.3A 2022-09-26 2022-09-26 Image processing method and related device Active CN115294342B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211171675.3A CN115294342B (en) 2022-09-26 2022-09-26 Image processing method and related device

Publications (2)

Publication Number Publication Date
CN115294342A (en) 2022-11-04
CN115294342B (en) 2023-02-28

Family

ID=83835054

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211171675.3A Active CN115294342B (en) 2022-09-26 2022-09-26 Image processing method and related device

Country Status (1)

Country Link
CN (1) CN115294342B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102833458A (en) * 2012-07-28 2012-12-19 辽宁大学 Image scrambling method based on Hankel matrix scanning
CN102915520A (en) * 2012-09-14 2013-02-06 辽宁大学 Image scrambling method based on solutions to Kirkman's schoolgirl problem
CN103778431A (en) * 2013-12-30 2014-05-07 温州医科大学 Medical image characteristic extracting and identifying system based on two-directional grid complexity measurement
WO2019026890A1 (en) * 2017-07-31 2019-02-07 株式会社エクォス・リサーチ Image data generation device, image recognition device, image data generation program, and image recognition program
CN110998597A (en) * 2017-07-31 2020-04-10 株式会社爱考斯研究 Image data generation device, image recognition device, image data generation program, and image recognition program
RU2728949C1 (en) * 2019-10-09 2020-08-03 федеральное государственное бюджетное образовательное учреждение высшего образования "Южно-Российский государственный политехнический университет (НПИ) имени М.И. Платова" Method of constructing and processing images and system for implementing thereof
CN113343958A (en) * 2021-08-06 2021-09-03 北京世纪好未来教育科技有限公司 Text recognition method, device, equipment and medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2573075B (en) * 2013-03-11 2020-05-20 Reeves Wireline Tech Ltd Methods of and apparatuses for identifying geological characteristics in boreholes
CN113598784B (en) * 2021-08-25 2024-04-09 济南汇医融工科技有限公司 Arrhythmia detection method and system

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
An efficient multi-scale CNN model with intrinsic feature integration for motor imagery EEG subject classification in brain-machine interfaces; Arunabha M. Roy et al.; Biomedical Signal Processing and Control; 2022-01-22; pp. 1-14 *
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale; Alexey Dosovitskiy et al.; arXiv; 2021-06-03; pp. 1-22 *
Classification of 4D fMRI Images Using ML, Focusing on Computational and Memory Utilization Efficiency; Nazanin Beheshti et al.; REMIA 2022: Resource-Efficient Medical Image Analysis; 2022-09-15; pp. 55-64 *
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows; Ze Liu et al.; 2021 IEEE/CVF International Conference on Computer Vision; 2022-02-28; pp. 9992-10002 *
Simulation Analysis of an Optimized Segmentation Algorithm Against Image Self-Intersection; Zhao Jie et al.; Computer Simulation; 2013-09-15; vol. 30, no. 9, pp. 377-380 *
Image Segmentation Based on the Hilbert Curve and Wavelet Transform; Zhao Jie; Journal of Jilin Engineering Normal University; 2013-01-26; vol. 29, no. 1, pp. 81-84 *

Similar Documents

Publication Publication Date Title
US20240127068A1 (en) Deep learning system
US20240118086A1 (en) Operations using sparse volumetric data
US11532117B2 (en) Density coordinate hashing for volumetric data
CN111754394B (en) Method and device for detecting object in fisheye image and storage medium
US11132575B2 (en) Combinatorial shape regression for face alignment in images
CN111539412A (en) Image analysis method, system, device and medium based on OCR
KR20230107805A (en) Consistency measures for image segmentation processes
US20220398747A1 (en) Volumetric sampling with correlative characterization for dense estimation
CN112990440A (en) Data quantization method for neural network model, readable medium, and electronic device
US20230273308A1 (en) Sensor based object detection
CN116048933A (en) Fluency detection method
CN115294342B (en) Image processing method and related device
CN115661941A (en) Gesture recognition method and electronic equipment
CN116883708A (en) Image classification method, device, electronic equipment and storage medium
CN114282664A (en) Self-feedback model training method and device, road side equipment and cloud control platform
US12100169B2 (en) Sparse optical flow estimation
US20220101539A1 (en) Sparse optical flow estimation
US20240161487A1 (en) Adaptive mixed-resolution processing
US20240282081A1 (en) Dynamic temporal fusion for video recognition
CN117131213B (en) Image processing method and related equipment
CN116205806B (en) Image enhancement method and electronic equipment
US20240320909A1 (en) Generating semantically-labelled three-dimensional models
CN115099393B (en) Neural network structure searching method and related device
CN116821399A (en) Photo processing method and related equipment
CN118251706A (en) Multi-level neural network process for keypoint detection in images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant