CN118052827A - Image processing method and related device

Info

Publication number: CN118052827A
Application number: CN202211409124.6A
Authority: CN (China)
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 刘宇, 赵娟萍, 赵亚西
Applicant/Assignee: Zeku Technology Shanghai Corp Ltd
Prior art keywords: module, convolution, image, feature, dimension
Classification: Image Analysis (AREA)

Abstract

The application provides an image processing method and a related device, applied to an image processing network, wherein the image processing network comprises a convolutional encoder module, a transformer module, a convolutional decoder module and a dynamic convolution module. An original image, whose number of channels is C, width is W and length is H, so that its dimension is C×W×H, is input into the convolutional encoder module to obtain image features; then, the image features are input into the transformer module to obtain a first feature; then, the image features are input into the convolutional decoder module to obtain a second feature; finally, the first feature and the second feature are input into the dynamic convolution module to obtain a target segmentation mask, wherein the dimension of the target segmentation mask is N×W×H, and N is the upper limit on the number of segmentation frames. The number of parameters and the amount of computation can be greatly reduced while ensuring the effect of instance segmentation of the image.

Description

Image processing method and related device
Technical Field
The application relates to the technical field of deep learning, in particular to an image processing method and a related device.
Background
The transformer-based DETR (End-to-End Object Detection with Transformers) framework can be used to perform instance segmentation of images. It effectively removes many manually designed components, such as the non-maximum suppression (Non-Maximum Suppression, NMS) post-processing step, and does not rely on prior knowledge and constraints such as anchors, which greatly simplifies the existing target detection pipeline while preserving the detection effect. However, the current DETR has a large number of parameters and a large amount of computation, and cannot be deployed to platforms with constraints on resources.
Disclosure of Invention
In view of this, the present application provides an image processing method and related apparatus, which redesign the structure of DETR and greatly reduce the number of parameters and the amount of computation of the neural network while ensuring the instance segmentation effect.
In a first aspect, an embodiment of the present application provides an image processing method applied to an image processing network, where the image processing network includes a convolutional encoder module, a transformer module, a convolutional decoder module, and a dynamic convolutional module, and the method includes:
Inputting an original image into the convolutional encoder module to obtain image features, wherein the number of channels of the original image is C, the width is W, the length is H, and the dimension of the original image is C×W×H;
inputting the image features into the transformer module to obtain a first feature;
inputting the image features into the convolutional decoder module to obtain a second feature;
and inputting the first feature and the second feature into the dynamic convolution module to obtain a target segmentation mask, wherein the dimension of the target segmentation mask is N×W×H, and N is the upper limit on the number of segmentation frames.
In a second aspect, an embodiment of the present application provides an image processing apparatus applied to an image processing network, where the image processing network includes a convolutional encoder module, a transformer module, a convolutional decoder module, and a dynamic convolutional module, and the apparatus includes a processing unit configured to:
Inputting an original image into the convolutional encoder module to obtain image features, wherein the number of channels of the original image is C, the width is W, the length is H, and the dimension of the original image is C×W×H;
inputting the image features into the transformer module to obtain a first feature;
inputting the image features into the convolutional decoder module to obtain a second feature;
and inputting the first feature and the second feature into the dynamic convolution module to obtain a target segmentation mask, wherein the dimension of the target segmentation mask is N×W×H, and N is the upper limit on the number of segmentation frames.
In a third aspect, an embodiment of the present application provides an electronic device, including a processor, a memory, a communication interface, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the processor, the programs including instructions for performing steps in any of the methods of the first aspect of the embodiments of the present application.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program for electronic data exchange, wherein the computer program causes a computer to perform part or all of the steps as described in any of the methods of the first aspect of the embodiments of the present application.
In a fifth aspect, embodiments of the present application provide a computer program product, wherein the computer program product comprises a non-transitory computer readable storage medium storing a computer program operable to cause a computer to perform some or all of the steps described in any of the methods of the first aspect of the embodiments of the present application. The computer program product may be a software installation package.
It can be seen that, through the above image processing method and related apparatus, the image processing network includes a convolutional encoder module, a transformer module, a convolutional decoder module and a dynamic convolution module. First, an original image is input into the convolutional encoder module to obtain image features, where the number of channels of the original image is C, the width is W and the length is H, so that the dimension of the original image is C×W×H; then, the image features are input into the transformer module to obtain a first feature; then, the image features are input into the convolutional decoder module to obtain a second feature; finally, the first feature and the second feature are input into the dynamic convolution module to obtain a target segmentation mask, where the dimension of the target segmentation mask is N×W×H and N is the upper limit on the number of segmentation frames. The number of parameters and the amount of computation can be greatly reduced while ensuring the effect of instance segmentation of the image.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is an application scenario diagram of an image processing method according to an embodiment of the present application;
Fig. 2 is a schematic structural diagram of an image processing network according to an embodiment of the present application;
Fig. 3 is an input/output schematic diagram of a transformer module according to an embodiment of the present application;
FIG. 4 is a schematic diagram of input/output of a convolutional decoder block according to an embodiment of the present application;
Fig. 5 is an input/output schematic diagram of an image processing network according to an embodiment of the present application;
fig. 6 is a schematic flow chart of an image processing method according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
fig. 8 is a functional block diagram of an image processing apparatus according to an embodiment of the present application;
Fig. 9 is a block diagram showing functional units of another image processing apparatus according to an embodiment of the present application.
Detailed Description
In order that those skilled in the art will better understand the present application, a technical solution in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The terms first, second and the like in the description and in the claims and in the above-described figures are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
It should be understood that the term "and/or" is merely an association relationship describing the associated object, and means that three relationships may exist, for example, a and/or B may mean: a exists alone, A and B exist together, and B exists alone. In this context, the character "/" indicates that the front and rear associated objects are an "or" relationship. The term "plurality" as used in the embodiments of the present application means two or more.
"At least one" or the like in the embodiments of the present application means any combination of these items, including any combination of single item(s) or plural items(s), meaning one or more, and plural means two or more. For example, at least one (one) of a, b or c may represent the following seven cases: a, b, c, a and b, a and c, b and c, a, b and c. Wherein each of a, b, c may be an element or a set comprising one or more elements.
The "connection" in the embodiment of the present application refers to various connection manners such as direct connection or indirect connection, so as to implement communication between devices, which is not limited in the embodiment of the present application.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
The following describes related content, concepts, meanings, technical problems, technical schemes, beneficial effects and the like related to the embodiment of the application.
1. Related terms
1. Instance segmentation: instance segmentation (Instance Segmentation) is relatively the most difficult of the four classical vision tasks. It has the characteristics of semantic segmentation (Semantic Segmentation), i.e. classification on the pixel level, and also has part of the characteristics of object detection (Object Detection), i.e. different instances need to be located even if they belong to the same class. Object detection or localization is a progressive, coarse-to-fine process on digital images: it provides not only the class of the image objects but also the location of the objects in the classified image, given in the form of a bounding box or a center point. Semantic segmentation gives finer reasoning by predicting a label for each pixel in the input image, each pixel being labeled according to the class of the object it belongs to. Going one step further, instance segmentation provides different labels for individual instances of objects belonging to the same class. Thus, instance segmentation may be defined as a technique that solves the object detection problem and the semantic segmentation problem at the same time. In the embodiments of the application, image processing can be understood as identifying and segmenting the objects to be detected in the original image to obtain the final target segmentation mask. For example, if there are seven balloons in the original image, instance segmentation can determine the locations of the seven balloons and the pixels belonging to each of the seven balloons.
2. Application scene and electronic equipment
1. Application scenario
The embodiment of the application can be applied to the following application scenarios, including but not limited to: a neural network-based processing system deployed on an electronic device, for large-scale computer vision applications (e.g., image classification, object detection, and video analysis functions), for speech signal processing, natural language processing, recommendation systems, or other situations where a neural network is required to be compressed to remove redundant operators due to limited resources and latency requirements. The embodiments of the present application may be modified and improved according to specific application environments, and are not particularly limited herein.
2. Electronic equipment
The electronic device in the embodiments of the present application may be a portable electronic device that also includes other functions such as personal digital assistant and/or music player functions, for example a cell phone, a tablet computer, or a wearable electronic device with wireless communication functions (e.g., a smart watch). Exemplary embodiments of portable electronic devices include, but are not limited to, portable electronic devices equipped with iOS, Android, Microsoft, or other operating systems. The portable electronic device may also be another portable electronic device such as a laptop computer (Laptop). It should also be appreciated that, in other embodiments, the electronic device may not be a portable electronic device, but a desktop computer, a server, or the like.
3. Description of the examples
Referring to fig. 1, fig. 1 is an application scenario diagram of an image processing method according to an embodiment of the present application. As shown in fig. 1, the electronic device 110 may first acquire the image processing network 120, which may be used to perform instance segmentation on an original image. The electronic device 110 may perform optimization processing on the image processing network 120, such as pruning and quantization. The minimum unit of pruning may be a single weight, or a more structured set of weights (such as a weight bar, a filter, or a channel); for example, unimportant or redundant weights may be removed, or unimportant fully connected layers, or part of the fully connected operations in them, may be removed, which is not specifically limited herein. Finally, the optimized image processing network may be deployed to the electronic device 130.
The electronic device 130 may be a mobile phone, a wearable device, or another device with limited local resources. It should be noted that the image processing network 120 is a lightweight network structure, and only targeted optimization for the electronic device 130 to be deployed is needed before deployment, which is beneficial to saving energy consumption and shortening latency. It is understood that the electronic device 130 may be the same device as the electronic device 110 or may be a different device.
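As an illustration of the kind of optimization processing mentioned above, the following is a minimal PyTorch-style sketch (not taken from the patent; the toy model, pruning ratio and module selection are assumptions) combining L1-norm weight pruning with dynamic quantization:

    import torch
    import torch.nn as nn
    import torch.nn.utils.prune as prune

    # Toy network standing in for the image processing network 120 (assumption).
    model = nn.Sequential(
        nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
        nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
        nn.Flatten(), nn.Linear(16 * 8 * 8, 10))   # assumes an 8×8 input for the Linear layer

    # Pruning: remove the 30% of weights with the smallest L1 magnitude in every conv layer.
    for module in model.modules():
        if isinstance(module, nn.Conv2d):
            prune.l1_unstructured(module, name="weight", amount=0.3)
            prune.remove(module, "weight")          # make the pruning permanent

    # Quantization: store the weights of Linear layers as int8 (dynamic quantization).
    quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)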
3. Image processing method
The existing image processing method can be implemented through a DETR-based framework. DETR regards target detection as a direct set prediction (set prediction) problem, which simplifies the detection pipeline and effectively removes the need for many manually designed components such as the non-maximum suppression process or anchor generation. The main components of DETR are a set-based global loss function, which enforces unique predictions through bipartite matching, and a transformer-based encoder-decoder structure. For instance segmentation, only a feature pyramid network (Feature Pyramid Networks, FPN) head needs to be added to the DETR framework. However, the current DETR architecture has a large number of parameters and a large amount of computation; for instance segmentation, a mask head (Mask Head) is added, whose cost is related not only to the number of detection frames but also to the resolution of the image, so that processing a high-resolution image requires a large amount of memory and computation, and DETR cannot be applied to platforms with constraints on resources, such as mobile terminals.
In order to solve the above problems, an embodiment of the present application provides an image processing method applied to an image processing network, where the image processing network includes a convolutional encoder module, a transformer module, a convolutional decoder module and a dynamic convolution module. Instance segmentation can be performed based on this image processing network, which reduces the amount of computation and the number of parameters while ensuring the instance segmentation effect; and because the image processing network is lighter, it is convenient to deploy to platforms with constraints on resources, such as mobile terminals.
The technical scheme, beneficial effects, concepts and the like related to the embodiment of the application are specifically described below.
1. Image processing network
As shown in fig. 2, the framework of the image processing network 200 includes a convolutional encoder module 210, a transformer module 220, a convolutional decoder module 230 and a dynamic convolution module 240. An original image may be input to the convolutional encoder module 210; the output of the convolutional encoder module 210 may be input to the transformer module 220 and to the convolutional decoder module 230, respectively; the output of the transformer module 220 and the output of the convolutional decoder module 230 may be input to the dynamic convolution module 240; and the dynamic convolution module 240 outputs the target segmentation mask. For ease of understanding, the modules of the image processing network 200 are described one by one below:
1) Convolutional encoder module
The convolutional encoder module (convolutional encoder) may be configured to extract the image features of an original image. The image features may be feature maps (feature maps) related to the number of channels of the original image: when the original image is a gray-scale image there is typically one feature map, and when the original image is a three-colour-channel image there are typically three. It is understood that the convolutional encoder module may use a backbone (backbone) such as a ResNet network to extract the image features, which is not specifically limited herein.
Specifically, the convolutional encoder may be a conventional convolutional neural network consisting of convolutional layers and pooling layers. It generally reduces the spatial dimensions (i.e., height and width) of the input original image while increasing the depth (i.e., the number of feature maps). For example, the original image may have dimensions of 3×384×512, where 3 represents the number of colour channels and 384×512 represents the size of the original image; the dimensions of the image features obtained after feature extraction by the convolutional encoder may then be 256×12×16, where 256 represents the increased number of channels (i.e., depth) and 12×16 represents the reduced size.
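A minimal PyTorch-style sketch of such an encoder follows (an illustration, not the patent's actual backbone): five stride-2 convolutions reduce the spatial size by a factor of 32 while the channel count grows from 3 to 256, reproducing the 3×384×512 to 256×12×16 example above. The intermediate channel widths are assumptions.

    import torch
    import torch.nn as nn

    class ConvEncoder(nn.Module):
        def __init__(self, channels=(3, 32, 64, 128, 256, 256)):
            super().__init__()
            layers = []
            for c_in, c_out in zip(channels[:-1], channels[1:]):
                # Each stage halves the spatial size while widening the channels.
                layers += [nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                           nn.BatchNorm2d(c_out),
                           nn.ReLU(inplace=True)]
            self.body = nn.Sequential(*layers)

        def forward(self, x):              # x: (B, 3, 384, 512)
            return self.body(x)            # -> (B, 256, 12, 16)

    features = ConvEncoder()(torch.randn(1, 3, 384, 512))
    print(features.shape)                  # torch.Size([1, 256, 12, 16])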
2) Transformer module
The transformer module (transformer) generally comprises an encoding component and a decoding component. The encoding component is made up of a stack of encoders (encoder), and the decoding component is made up of the same number of decoders (decoder) corresponding to the encoders; all of them have the same structure, but they do not share parameters. Each encoder can be decomposed into two sub-layers, namely a self-attention layer and a feed-forward neural network layer; the activation function of the first fully connected layer of the feed-forward neural network is a ReLU, and the second fully connected layer uses a linear activation function. A self-attention layer, an encoding-decoding attention layer and a feed-forward neural network layer are likewise arranged in each decoder.
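For reference, one such standard encoder layer can be expressed directly with PyTorch's built-in module; the sketch below is illustrative only, and the values used (256-channel tokens, 8 heads, 1024 hidden units) are assumptions rather than numbers from the patent:

    import torch
    import torch.nn as nn

    # One encoder layer: multi-head self-attention followed by a two-layer
    # feed-forward network whose first activation is ReLU.
    layer = nn.TransformerEncoderLayer(d_model=256, nhead=8,
                                       dim_feedforward=1024, activation="relu")
    tokens = torch.randn(192, 1, 256)      # (sequence length, batch, channels)
    out = layer(tokens)                    # same shape: (192, 1, 256)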
Briefly, the image features from the convolutional encoder module pass through the encoder component of the transformer module, with spatial position encodings added to the queries and keys. The decoder component then receives the queries (initially set to zero), the learned positional encodings (object queries) and the encoder memory, and generates the final set of predicted class labels and bounding boxes through multiple multi-head self-attention and decoder-encoder attention layers.
Specifically, regarding the encoding component of the transformer module, it creates a new feature map from the image features. Each encoder layer has a standard architecture and comprises a multi-head self-attention module (self-attention module) and a feed-forward network. Since the architecture of the transformer module is permutation-invariant, it is complemented with fixed position encodings, which are added to the input of each attention layer. Regarding the decoding component of the transformer module, the decoder also follows the standard architecture, transforming M embeddings using multi-head self-attention and encoder-decoder attention mechanisms and decoding the M objects in parallel at each decoder layer. Since the decoder is also permutation-invariant, the M input embeddings must be different to produce different results; these input embeddings are learned position encodings, referred to as object queries, and, as with the encoder, they are added to the input of each attention layer. The M object queries are converted by the decoder into output embeddings, which are then independently decoded into box coordinates and class labels through a feed-forward network, resulting in M final predictions. Using self-attention and encoder-decoder attention over these embeddings, the model globally reasons about all objects together using the pairwise relations between them, while being able to use the entire image as context.
It should be noted that the specific architecture of the transformer module and the way it processes the image features may refer to the architecture and processing of the transformer module in the existing DETR, which are not described in detail herein. The embodiment of the present application mainly changes the dimension of the output feature, and the following description mainly concerns the architecture and flow related to this dimension change.
For ease of understanding, the input/output of the transformer module in the embodiment of the present application is described with reference to fig. 3. The dimension of the image features is 256×12×16, and the dimension of the first feature output by the transformer module is N×16. First, according to the dimension of the image features, the product of their length and width is 192, so the image features can be divided into 192 tokens; since the number of channels of the image features is 256, the dimension of the first vector is 192×256. The dimension of the first preset parameter is 256×16, and the dimension of the second preset parameter is 256×16. Since matrix multiplication in the fully connected layers cancels the shared inner dimension, the dimension of the key vector obtained after the first vector passes through the first fully connected layer is 192×16, and similarly the dimension of the value vector obtained after the first vector passes through the second fully connected layer is 192×16. The dimension of the preset query vector is N×16. At this point, the first feature can be obtained through a softmax function, that is:
Softmax(q·kᵀ)·v = (N×16)·(16×192)·(192×16) = N×16
where q denotes the query vector, k denotes the key vector, v denotes the value vector, and kᵀ denotes the transpose of the key vector. It can be seen that the inner dimension 16 in q·kᵀ and the inner dimension 192 in the product with v cancel out, so that the dimension of the first feature is N×16.
In this way, the first feature can be used as the dynamic weight of the subsequent dynamic convolution, providing a basis for obtaining an accurate target segmentation mask later, while reducing the number of parameters and the amount of computation.
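The dimension bookkeeping above can be illustrated with a minimal PyTorch-style sketch (batch handling, the single-head formulation and the value N = 100 are assumptions for illustration; they are not specified in the text):

    import torch
    import torch.nn as nn

    C, H, W, N, X = 256, 12, 16, 100, 16        # N = upper limit on segmentation frames (assumed 100)
    feat = torch.randn(C, H, W)                 # image features from the convolutional encoder

    tokens = feat.flatten(1).transpose(0, 1)    # (192, 256): 12*16 = 192 tokens of width 256
    to_key = nn.Linear(C, X, bias=False)        # first fully connected layer, weight 256×16
    to_val = nn.Linear(C, X, bias=False)        # second fully connected layer, weight 256×16
    query = nn.Parameter(torch.randn(N, X))     # preset (learned) query vector, N×16

    k = to_key(tokens)                          # (192, 16)
    v = to_val(tokens)                          # (192, 16)
    attn = torch.softmax(query @ k.T, dim=-1)   # (N, 192)
    first_feature = attn @ v                    # (N, 16), i.e. Softmax(q·kᵀ)·v
    print(first_feature.shape)                  # torch.Size([100, 16])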
3) Convolutional decoder module
The convolutional decoder module (convolutional decoder) may include a plurality of convolutional layers, whose number corresponds to the number of convolutional layers of the convolutional encoder. The larger the convolution kernel in each convolutional layer, the better the final instance segmentation effect; however, directly replacing small convolution kernels with large ones leads to an increase in the number of parameters and the amount of computation.
The number of two-way convolution modules and the number of convolution modules in the convolutional decoder module can be set freely. A two-way convolution module can comprise a first preset number of two-way convolution blocks, and a convolution module can comprise a second preset number of convolution blocks; the position of each two-way convolution block and each convolution block can also be set freely. For example, if the convolutional decoder module comprises six convolutional layers, the two-way convolution module can comprise three two-way convolution blocks and the convolution module can comprise three convolution blocks, where the three two-way convolution blocks are located in the first, third and fifth layers, and the three convolution blocks are located in the second, fourth and sixth layers.
Specifically, each convolution block is a conventional convolution and comprises a conventional convolution kernel, a normalization function and an activation function. Each two-way convolution block comprises two half-split convolution kernels, namely a K×K convolution kernel split in half into a K×1 convolution kernel and a 1×K convolution kernel, forming two branches; each branch further comprises a group normalization layer and an activation function, and finally the two branches are connected to output a complete feature map.
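A quick back-of-envelope check (illustrative numbers only; K = 7 and a 128-to-64 channel mapping are assumptions) shows why the half-split kernels save parameters compared with a full K×K convolution over all channels:

    # Weight count of one full K×K convolution versus the two half-split branches,
    # where each branch processes half of the input channels and produces half of
    # the output channels (see the description above).
    K, C_in, C_out = 7, 128, 64
    full = C_in * C_out * K * K                 # 401,408 weights
    split = 2 * (C_in // 2) * (C_out // 2) * K  # 28,672 weights (K×1 branch + 1×K branch)
    print(full, split)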
For ease of understanding, an input/output schematic diagram of a convolutional decoder module in an embodiment of the present application is described below with reference to fig. 4. Fig. 4 is an input/output schematic diagram of the convolutional decoder module provided in the embodiment of the present application. It can be seen that the convolutional decoder module has four convolutional layers: the first and third layers are two-way convolution blocks, and the second and fourth layers are convolution blocks. Each two-way convolution block includes a K×1 convolution kernel, a 1×K convolution kernel, a group normalization (Group Normalization, GN) layer and a ReLU function, where K is not a fixed value, K may be any value greater than the size of the convolution kernel in the convolution blocks, and K may also differ between different two-way convolution blocks. Each convolution block includes a k×k convolution kernel, a group normalization (Group Normalization, GN) layer and a ReLU function, where k is less than K.
The image features with dimension 256×12×16 are input into the two-way convolution block D1 of the first layer, which halves the number of channels to obtain a feature map with dimension 128×12×16; one 4× up-sampling then gives a feature map with dimension 128×48×64. This feature map is passed to the convolution block C1 of the second layer to obtain a feature map with dimension 64×48×64, and one 4× up-sampling gives a feature map with dimension 64×192×256. The feature map with dimension 64×192×256 is input into the two-way convolution block D2 of the third layer to obtain a feature map with dimension 32×192×256, and one 2× up-sampling gives a feature map with dimension 32×384×512. Finally, the feature map with dimension 32×384×512 is input into the convolution block of the fourth layer to obtain the second feature with dimension 16×384×512.
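A minimal, self-contained PyTorch-style sketch of this four-layer decoder follows (layer hyper-parameters such as K = 7, k = 3, the group count and bilinear up-sampling are assumptions; the reading that each two-way block splits its input in half along the channel dimension follows the description of the two-way convolution block above):

    import torch
    import torch.nn as nn

    class TwoWayConvBlock(nn.Module):
        """K×1 / 1×K branches over the two channel halves, concatenated at the end."""
        def __init__(self, in_ch, out_ch, k=7, groups=8):
            super().__init__()
            hi, ho = in_ch // 2, out_ch // 2
            self.branch_v = nn.Sequential(nn.Conv2d(hi, ho, (k, 1), padding=(k // 2, 0)),
                                          nn.GroupNorm(groups, ho), nn.ReLU(inplace=True))
            self.branch_h = nn.Sequential(nn.Conv2d(hi, ho, (1, k), padding=(0, k // 2)),
                                          nn.GroupNorm(groups, ho), nn.ReLU(inplace=True))

        def forward(self, x):
            a, b = x.chunk(2, dim=1)                      # split along the channel dimension
            return torch.cat([self.branch_v(a), self.branch_h(b)], dim=1)

    def conv_block(in_ch, out_ch, k=3, groups=8):         # conventional convolution block
        return nn.Sequential(nn.Conv2d(in_ch, out_ch, k, padding=k // 2),
                             nn.GroupNorm(groups, out_ch), nn.ReLU(inplace=True))

    class ConvDecoder(nn.Module):
        def __init__(self):
            super().__init__()
            self.d1, self.c1 = TwoWayConvBlock(256, 128), conv_block(128, 64)
            self.d2, self.c2 = TwoWayConvBlock(64, 32), conv_block(32, 16)
            self.up4 = nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False)
            self.up2 = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)

        def forward(self, x):                             # x: (B, 256, 12, 16)
            x = self.up4(self.d1(x))                      # (B, 128, 48, 64)
            x = self.up4(self.c1(x))                      # (B, 64, 192, 256)
            x = self.up2(self.d2(x))                      # (B, 32, 384, 512)
            return self.c2(x)                             # (B, 16, 384, 512)

    second_feature = ConvDecoder()(torch.randn(1, 256, 12, 16))
    print(second_feature.shape)                           # torch.Size([1, 16, 384, 512])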
In this way, the number of parameters and the amount of computation can be reduced while the subsequent segmentation effect is still ensured, making it convenient to deploy to platforms with constraints on resources, such as mobile terminals.
4) Dynamic convolution module
The dynamic convolution (Dynamic Convolution) module can adaptively adjust its convolution parameters according to the input. It can dynamically convolve the first feature, used as dynamic weights, with the second feature to obtain the target segmentation mask with dimension N×W×H. Structurally, it comprises an attention module and convolution kernels: the attention module computes y convolution parameters π_y whose sum is 1, and these y parameters are then combined by a linear weighted sum so that the effective convolution kernel changes as the input changes. The specific structure and processing can refer to the structure and processing of existing dynamic convolution layers, and are not described in detail herein.
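In the simplest reading of the pipeline, each of the N rows of the first feature acts as one 1×1 convolution kernel applied to the second feature; the sketch below illustrates only this reduced case (the attention-weighted combination of kernels described above is omitted, the concrete sizes are taken from the running example, and N = 100 is assumed):

    import torch

    N, X = 100, 16                              # N = upper limit on segmentation frames (assumed)
    first_feature = torch.randn(N, X)           # dynamic weights from the transformer module
    second_feature = torch.randn(X, 384, 512)   # mask embedding from the convolutional decoder

    # Each query contributes one 1×1 kernel over the X channels -> one mask per query.
    masks = torch.einsum("nx,xab->nab", first_feature, second_feature)
    print(masks.shape)                          # torch.Size([100, 384, 512])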
Therefore, through the above image processing network, the effect of instance segmentation on the image can be ensured while the number of parameters and the amount of computation are greatly reduced, which is convenient for deployment on platforms with resource constraints such as mobile terminals.
2. Dimensional change of original image in image processing network
For ease of understanding, referring to fig. 5, the dimension of the original image is C×W×H, where C represents the number of channels of the original image, W represents the width of the original image and H represents the height of the original image. Image features with dimension c×w×h are obtained after passing through the convolutional encoder, where c is greater than C and w×h is less than W×H. The image features are respectively input into the transformer module and the convolutional decoder module: the transformer module outputs a first feature with dimension N×x, where N is the upper limit on the number of segmentation frames and the value of x is equal to the value of h, and the convolutional decoder module outputs a second feature with dimension x×W×H. The first feature and the second feature are input into the dynamic convolution module, and the dynamic convolution module outputs the target segmentation mask with dimension N×W×H.
For example, the dimension of the original image may be 3×384×512; image features with dimension 256×12×16 are obtained after passing through the convolutional encoder; the image features are respectively input into the transformer module and the convolutional decoder module, the transformer module outputs a first feature with dimension N×16, and the convolutional decoder module outputs a second feature with dimension 16×384×512; the first feature and the second feature are input into the dynamic convolution module, and the dynamic convolution module outputs the target segmentation mask with dimension N×384×512.
Therefore, by the above image processing method, the effect of instance segmentation on the image can be ensured while the number of parameters and the amount of computation are greatly reduced, which is convenient for deployment on platforms with resource constraints such as mobile terminals.
In one possible embodiment, the image processing network may be adaptively pruned, quantized, etc., to obtain a lighter image processing network, which is convenient to be deployed to a corresponding mobile terminal, which is not limited herein.
3. An illustration of an image processing method
In combination with the foregoing, an example of an image processing method according to an embodiment of the present application is described below.
Fig. 6 is a schematic flow chart of an image processing method according to an embodiment of the present application, which is applied to an image processing network, wherein the image processing network includes a convolutional encoder module, a transformer module, a convolutional decoder module and a dynamic convolution module, and specifically includes the following steps:
Step 601, inputting the original image into the convolutional encoder module to obtain image features.
The number of channels of the original image is C, the width is W, the length is H, and the dimension of the original image is C×W×H.
The convolutional encoder module may increase the number of channels of the original image from C to c and reduce the size of the original image from W×H to w×h to obtain the image features, where the dimension of the image features is c×w×h.
Step 602, inputting the image feature into a transformer module to obtain a first feature.
The transformer module comprises a first fully connected layer, a second fully connected layer and an output layer.
First, the image features can be converted into a first vector by the transformer module, where the dimension of the first vector is the identification number t × c, and the identification number t is equal to the product of w and h. Then, the first vector is input into the first fully connected layer for dimension reduction processing to determine a key vector with dimension t×x, and the first vector is input into the second fully connected layer for dimension reduction processing to determine a value vector with dimension t×x. Finally, the key vector, the value vector and a preset query vector are input into the output layer to determine the first feature, where the dimension of the preset query vector is N×x and the dimension of the first feature is N×x.
It should be noted that the first full-connection layer includes a first preset parameter with a dimension of c×x, and the second full-connection layer includes a second preset parameter with a dimension of c×x.
In this way, the first feature can be used as the dynamic weight of the subsequent dynamic convolution module, so that a more accurate target segmentation mask is obtained while the amount of computation and the number of parameters are reduced.
Step 603, inputting the image feature into a convolutional decoder module to obtain a second feature.
The convolutional decoder module comprises a two-way convolution module and a convolution module. The two-way convolution module comprises a first preset number of two-way convolution blocks arranged at first preset positions, and each two-way convolution block comprises two half-split convolution kernels; the convolution module comprises a second preset number of convolution blocks arranged at second preset positions, and each convolution block comprises one convolution kernel. The combined size of the two half-split convolution kernels of any two-way convolution block is larger than the size of the convolution kernel of any convolution block, and each two-way convolution block is used to split the input features in half along the dimension direction and to connect the results after carrying out convolution processing on the two split parts respectively.
The image features can be input into the two-way convolution module and the convolution module for convolution processing and up-sampling processing so as to obtain the second feature, where the dimension of the second feature is x×W×H. The specific processing is not described in detail herein.
In this way, a better segmentation effect can be ensured by using larger convolution kernels, while the amount of computation and the number of parameters do not increase greatly because the two-way convolution blocks split the convolution kernels.
Step 604, inputting the first feature and the second feature into the dynamic convolution module to perform convolution processing to obtain a target segmentation mask.
And the dynamic convolution module can dynamically convolve the first feature serving as a dynamic weight with the second feature to obtain the target segmentation mask.
In one possible embodiment, the pruning and/or quantization processing may be performed on the image processing network before the image processing, so as to obtain a lightweight image processing network, which is then deployed to a corresponding mobile terminal.
Therefore, through the above image processing method, the structure of DETR is redesigned, the number of parameters and the amount of computation of the neural network are greatly reduced while the instance segmentation effect is ensured, and the method is convenient to deploy to platforms with constraints on resources.
The above steps not described in detail may be referred to the descriptions of 1-2 above, and will not be described here again.
4. Exemplary description of an electronic device
An electronic device according to an embodiment of the present application is described below with reference to fig. 7, and fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application. Wherein the electronic device 700 comprises a processor 701, a memory 702 and a communication bus 703 for connecting the processor 701 and the memory 702.
In some possible implementations, the memory 702 includes, but is not limited to, random access memory (random access memory, RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), or portable read-only memory (compact disc read-only memory, CD-ROM), the memory 702 being used to store program code executed by the electronic device 700 and data transmitted.
In some possible implementations, the electronic device 700 also includes a communication interface for receiving and transmitting data.
In some possible implementations, the processor 701 may be one or more central processing units (CPUs); where the processor 701 is a single CPU, the CPU may be a single-core CPU or a multi-core CPU.
In some possible implementations, the processor 701 may be a baseband chip, a Central Processing Unit (CPU), a general purpose processor, DSP, ASIC, FPGA, or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof.
In a specific implementation, applied to an image processing network, where the image processing network includes a convolutional encoder module, a transformer module, a convolutional decoder module and a dynamic convolution module, the processor 701 in the electronic device 700 is configured to execute the program instructions 721 stored in the memory 702 to perform the following operations:
Inputting an original image into the convolutional encoder module to obtain image features, wherein the number of channels of the original image is C, the width is W, the length is H, and the dimension of the original image is C×W×H;
inputting the image features into the transformer module to obtain a first feature;
inputting the image features into the convolutional decoder module to obtain a second feature;
and inputting the first feature and the second feature into the dynamic convolution module to obtain a target segmentation mask, wherein the dimension of the target segmentation mask is N×W×H, and N is the upper limit on the number of segmentation frames.
It can be seen that, through the above image processing method and related apparatus, the image processing network includes a convolutional encoder module, a transformer module, a convolutional decoder module and a dynamic convolution module. First, an original image is input into the convolutional encoder module to obtain image features, where the number of channels of the original image is C, the width is W and the length is H, so that the dimension of the original image is C×W×H; then, the image features are input into the transformer module to obtain a first feature; then, the image features are input into the convolutional decoder module to obtain a second feature; finally, the first feature and the second feature are input into the dynamic convolution module to obtain a target segmentation mask, where the dimension of the target segmentation mask is N×W×H and N is the upper limit on the number of segmentation frames. The number of parameters and the amount of computation can be greatly reduced while ensuring the effect of instance segmentation of the image.
It should be noted that, the specific implementation of each operation may be described in the above-illustrated method embodiment, and the electronic device 700 may be used to execute the above-illustrated method embodiment of the present application, which is not described herein.
5. Exemplary description of an image processing apparatus
The foregoing description of the embodiments of the present application has been presented primarily in terms of a method-side implementation. It will be appreciated that the electronic device, in order to achieve the above-described functions, includes corresponding hardware structures and/or software modules that perform the respective functions. Those of skill in the art will readily appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware or combinations of hardware and computer software. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The embodiment of the application can divide the functional units of the electronic device according to the method example, for example, each functional unit can be divided corresponding to each function, and two or more functions can be integrated in one processing unit. The integrated units may be implemented in hardware or in software functional units. It should be noted that, in the embodiment of the present application, the division of the units is schematic, which is merely a logic function division, and other division manners may be implemented in actual practice.
In the case of dividing each functional module by adopting a corresponding function, fig. 8 is a block diagram of functional units of an image processing apparatus according to an embodiment of the present application, where the image processing apparatus 800 includes:
A feature extraction unit 810, configured to input an original image, whose number of channels is C, width is W and length is H, into the convolutional encoder module to obtain image features, the dimension of the original image being C×W×H;
a first feature unit 820, configured to input the image features into the transformer module to obtain a first feature;
a second feature unit 830, configured to input the image features into the convolutional decoder module to obtain a second feature;
an image segmentation unit 840, configured to input the first feature and the second feature into the dynamic convolution module to obtain a target segmentation mask, where the dimension of the target segmentation mask is N×W×H and N is the upper limit on the number of segmentation frames.
It can be seen that, through the above image processing method and related apparatus, the image processing network includes a convolutional encoder module, a transformer module, a convolutional decoder module and a dynamic convolution module. First, an original image is input into the convolutional encoder module to obtain image features, where the number of channels of the original image is C, the width is W and the length is H, so that the dimension of the original image is C×W×H; then, the image features are input into the transformer module to obtain a first feature; then, the image features are input into the convolutional decoder module to obtain a second feature; finally, the first feature and the second feature are input into the dynamic convolution module to obtain a target segmentation mask, where the dimension of the target segmentation mask is N×W×H and N is the upper limit on the number of segmentation frames. The number of parameters and the amount of computation can be greatly reduced while ensuring the effect of instance segmentation of the image.
It should be noted that, the specific implementation of each operation may be described in the above-illustrated method embodiment, and the image processing apparatus 800 may be used to execute the above-illustrated method embodiment of the present application, which is not described herein.
6. Exemplary description of another image processing apparatus
In the case of using integrated units, another image processing apparatus 900 in the embodiment of the present application is described in detail below with reference to fig. 9. The image processing apparatus 900 includes a processing unit 901 and a communication unit 902, where the processing unit 901 is configured to perform any step in the above method embodiments and, when performing data transmission such as sending, optionally invokes the communication unit 902 to complete the corresponding operation.
The image processing apparatus 900 may further comprise a storage unit 903 for storing program codes and data. The processing unit 901 may be a processor, the communication unit 902 may be a wireless communication module, and the storage unit 903 may be a memory, applied to an image processing network including a convolutional encoder module, a transformer module, a convolutional decoder module, and a dynamic convolutional module.
The processing unit 901 is specifically configured to:
Inputting an original image into the convolutional encoder module to obtain image features, wherein the number of channels of the original image is C, the width is W, the length is H, and the dimension of the original image is C×W×H;
inputting the image features into the transformer module to obtain a first feature;
inputting the image features into the convolutional decoder module to obtain a second feature;
and inputting the first feature and the second feature into the dynamic convolution module to obtain a target segmentation mask, wherein the dimension of the target segmentation mask is N×W×H, and N is the upper limit on the number of segmentation frames.
It can be seen that, through the above image processing method and related apparatus, the image processing network includes a convolutional encoder module, a transformer module, a convolutional decoder module and a dynamic convolution module. First, an original image is input into the convolutional encoder module to obtain image features, where the number of channels of the original image is C, the width is W and the length is H, so that the dimension of the original image is C×W×H; then, the image features are input into the transformer module to obtain a first feature; then, the image features are input into the convolutional decoder module to obtain a second feature; finally, the first feature and the second feature are input into the dynamic convolution module to obtain a target segmentation mask, where the dimension of the target segmentation mask is N×W×H and N is the upper limit on the number of segmentation frames. The number of parameters and the amount of computation can be greatly reduced while ensuring the effect of instance segmentation of the image.
It should be noted that, the specific implementation of each operation may be described in the above-illustrated method embodiment, and the image processing apparatus 900 may be used to execute the above-illustrated method embodiment of the present application, which is not described herein.
7. Other examples of the description
The embodiment of the application also provides a chip which comprises a processor, a memory and a computer program or instructions stored on the memory, wherein the processor executes the computer program or instructions to realize the steps described in the embodiment of the method.
The embodiment of the application also provides a chip module, which comprises a receiving and transmitting assembly and a chip, wherein the chip comprises a processor, a memory and a computer program or instructions stored on the memory, and the processor executes the computer program or instructions to realize the steps described in the embodiment of the method.
The embodiment of the application also provides a computer storage medium, wherein the computer storage medium stores a computer program for electronic data exchange, and the computer program makes a computer execute part or all of the steps of any one of the method embodiments, and the computer includes an electronic device.
Embodiments of the present application also provide a computer program product comprising a non-transitory computer readable storage medium storing a computer program operable to cause a computer to perform part or all of the steps of any one of the methods described in the method embodiments above. The computer program product may be a software installation package, said computer comprising an electronic device.
For simplicity of description, the foregoing method embodiments are each expressed as a series of action combinations. However, it will be appreciated by persons skilled in the art that the application is not limited by the order of the actions described, as some steps in embodiments of the application may be performed in other orders or concurrently. In addition, those skilled in the art will appreciate that the embodiments described in the specification are preferred embodiments, and that the actions, steps, modules or units involved are not necessarily required by the embodiments of the application.
In the foregoing embodiments, the descriptions of the embodiments of the present application are emphasized, and in part, not described in detail in one embodiment, reference may be made to related descriptions of other embodiments.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, or in software instructions executed by a processor. The software instructions may consist of corresponding software modules that may be stored in RAM, flash memory, ROM, EPROM, electrically erasable programmable ROM (EEPROM), registers, a hard disk, a removable disk, a compact disc read-only memory (CD-ROM), or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. In addition, the ASIC may be located in a terminal device or a management device. The processor and the storage medium may also reside as discrete components in a terminal device or management device.
Those skilled in the art will appreciate that in one or more of the examples described above, the functions described in the embodiments of the present application may be implemented, in whole or in part, in software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, produces a flow or function in accordance with embodiments of the present application, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by a wired (e.g., coaxial cable, fiber optic, digital subscriber line (digital subscriber line, DSL)), or wireless (e.g., infrared, wireless, microwave, etc.). The computer readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that contains an integration of one or more available media. The usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a digital video disc (digital video disc, DVD)), or a semiconductor medium (e.g., a Solid State Drive (SSD)), or the like.
The respective apparatuses and the respective modules/units included in the products described in the above embodiments may be software modules/units, may be hardware modules/units, or may be partly software modules/units, and partly hardware modules/units. For example, for each device or product applied to or integrated on a chip, each module/unit included in the device or product may be implemented in hardware such as a circuit, or at least some modules/units may be implemented in software program, where the software program runs on a processor integrated inside the chip, and the remaining (if any) part of modules/units may be implemented in hardware such as a circuit; for each device and product applied to or integrated in the chip module, each module/unit contained in the device and product can be realized in a hardware manner such as a circuit, different modules/units can be located in the same component (such as a chip, a circuit module and the like) or different components of the chip module, or at least part of the modules/units can be realized in a software program, the software program runs on a processor integrated in the chip module, and the rest (if any) of the modules/units can be realized in a hardware manner such as a circuit; for each device, product, or application to or integrated with the terminal device, each module/unit included in the device may be implemented in hardware such as a circuit, and different modules/units may be located in the same component (e.g., a chip, a circuit module, etc.) or different components in the terminal device, or at least some modules/units may be implemented in a software program, where the software program runs on a processor integrated within the terminal device, and the remaining (if any) some modules/units may be implemented in hardware such as a circuit.
The foregoing detailed description further illustrates the purposes, technical solutions, and advantageous effects of the embodiments of the present application. It should be understood that the foregoing is only a specific implementation of the embodiments of the present application and is not intended to limit their scope; any modification, equivalent substitution, or improvement made on the basis of the technical solutions of the embodiments of the present application shall fall within the scope of the embodiments of the present application.

Claims (10)

1. An image processing method, characterized by being applied to an image processing network, the image processing network comprising a convolutional encoder module, a transformer module, a convolutional decoder module, and a dynamic convolutional module, the method comprising:
Inputting an original image into the convolutional encoder module to obtain an image feature, wherein the number of channels of the original image is C, the width is W, the length is H, and the dimension of the original image is C × W × H;
inputting the image feature into the transformer module to obtain a first feature;
inputting the image feature into the convolutional decoder module to obtain a second feature;
and inputting the first feature and the second feature into the dynamic convolution module to obtain a target segmentation mask, wherein the dimension of the target segmentation mask is N × W × H, and N is the upper limit on the number of segmentation boxes.
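For readers who want a concrete picture of the four-module data flow in claim 1, a simple wiring function is sketched below. This is an illustrative sketch only, not the claimed implementation; the example tensor sizes (C=3, W=H=128, c=64, N=16) and the stand-in modules are assumptions rather than values from the patent, and sketches of the individual modules follow the later claims.

```python
import torch

def image_processing_pipeline(encoder, transformer, decoder, dynamic_conv, image):
    """Illustrative wiring of the four claimed modules."""
    image_feature = encoder(image)                        # C x W x H  ->  c x w x h
    first_feature = transformer(image_feature)            # per-instance dynamic weights
    second_feature = decoder(image_feature)               # dense feature restored to W x H
    return dynamic_conv(first_feature, second_feature)    # target segmentation mask, N x W x H

# Toy stand-ins used only to check the shape contract (assumed sizes, not from the patent):
enc = lambda x: torch.randn(1, 64, 32, 32)
trf = lambda f: torch.randn(1, 16, 64)
dec = lambda f: torch.randn(1, 64, 128, 128)
dyn = lambda a, b: torch.einsum('bnc,bchw->bnhw', a, b)
mask = image_processing_pipeline(enc, trf, dec, dyn, torch.randn(1, 3, 128, 128))
print(mask.shape)  # torch.Size([1, 16, 128, 128]) -> N x W x H
```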
2. The method of claim 1, wherein said inputting the original image into the convolutional encoder module to obtain the image feature comprises:
the convolutional encoder module increases the channel number of the original image from C to c, and reduces the spatial dimension of the original image from W × H to w × h to obtain the image feature, wherein the dimension of the image feature is c × w × h.
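Purely as an illustration of claim 2 (not the claimed encoder itself), channel expansion combined with spatial reduction is commonly realized with strided convolutions. In the sketch below, the channel counts and the overall stride of 4 are assumptions chosen for readability.

```python
import torch
import torch.nn as nn

class ConvEncoderSketch(nn.Module):
    """Illustrative encoder: raises the channel count C -> c while reducing W x H -> w x h."""
    def __init__(self, in_channels=3, mid_channels=32, out_channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, kernel_size=3, stride=2, padding=1),   # W x H -> W/2 x H/2
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, out_channels, kernel_size=3, stride=2, padding=1),  # -> W/4 x H/4
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.body(x)

# Example: a 3 x 128 x 128 image (C x W x H) becomes a 64 x 32 x 32 feature (c x w x h).
image_feature = ConvEncoderSketch()(torch.randn(1, 3, 128, 128))
print(image_feature.shape)  # torch.Size([1, 64, 32, 32])
```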
3. The method of claim 2, wherein the transformer module comprises a first fully connected layer, a second fully connected layer, and an output layer, and the inputting the image feature into the transformer module to obtain a first feature comprises:
converting, by the transformer module, the image feature into a first vector, wherein the dimension of the first vector is t × c, and the identification number t is equal to the product of w and h;
inputting the first vector into the first fully connected layer for dimensionality reduction to determine a key vector, wherein the dimension of the key vector is t x, and inputting the first vector into the second fully connected layer for dimensionality reduction to determine a value vector, wherein the dimension of the value vector is t x;
and inputting the key vector, the value vector, and a preset query vector into the output layer to determine the first feature, wherein the dimension of the query vector is Nxx, and the dimension of the first feature is Nxx.
4. A method according to claim 3, wherein the first fully connected layer comprises a first predetermined parameter having a dimension c x, and the second fully connected layer comprises a second predetermined parameter having a dimension c x.
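Claims 3 and 4 describe flattening the c × w × h image feature into t tokens (t = w·h), projecting the tokens through two fully connected layers to obtain key and value vectors, and combining them with a preset query vector in the output layer. That reads like a single cross-attention step, sketched below under that assumption; the reduced dimension d and the number of queries N are placeholders, since their exact values are not legible in the translated claims.

```python
import torch
import torch.nn as nn

class TransformerModuleSketch(nn.Module):
    """Illustrative sketch: image feature (c x h x w) -> first feature (N x d)."""
    def __init__(self, c=64, d=32, num_queries=16):
        super().__init__()
        self.key_proj = nn.Linear(c, d)     # first fully connected layer, c -> d
        self.value_proj = nn.Linear(c, d)   # second fully connected layer, c -> d
        self.query = nn.Parameter(torch.randn(num_queries, d))  # preset query vector, N x d
        self.scale = d ** -0.5

    def forward(self, feat):                              # feat: (B, c, h, w)
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)          # first vector, (B, t, c) with t = w * h
        keys = self.key_proj(tokens)                      # (B, t, d)
        values = self.value_proj(tokens)                  # (B, t, d)
        attn = torch.softmax(self.query @ keys.transpose(1, 2) * self.scale, dim=-1)  # (B, N, t)
        return attn @ values                              # first feature, (B, N, d)

first_feature = TransformerModuleSketch()(torch.randn(1, 64, 32, 32))
print(first_feature.shape)  # torch.Size([1, 16, 32])
```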
5. The method of claim 3, wherein the convolutional decoder module comprises a two-way convolution module and a convolution module, and the inputting the image feature into the convolutional decoder module to obtain a second feature comprises:
inputting the image feature into the two-way convolution module and the convolution module for convolution processing and up-sampling processing to obtain the second feature, wherein the dimension of the second feature is xW × H.
6. The method of claim 5, wherein the two-way convolution module comprises a first predetermined number of two-way convolution blocks disposed at a first predetermined location, each two-way convolution block comprising two convolution kernels split in half; the convolution module comprises a second predetermined number of convolution blocks disposed at a second predetermined location, each convolution block comprising one convolution kernel; and the combined size of the two split convolution kernels of any two-way convolution block is greater than the size of the convolution kernel of any convolution block;
each two-way convolution block is used to split the input feature in half along the dimension direction, perform convolution processing on each half separately, and then concatenate the convolved halves.
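Claim 6's two-way convolution block splits its input in half, runs a separate convolution kernel over each half, and joins the results. The sketch below assumes the split is along the channel dimension and pairs 5 × 5 kernels in the two-way block with a 3 × 3 kernel in the ordinary convolution block, so the combined split kernels are larger than the single kernel; these sizes, and the ×4 upsampling, are assumptions rather than the patent's values.

```python
import torch
import torch.nn as nn

class TwoWayConvBlock(nn.Module):
    """Illustrative two-way block: split channels in half, convolve each half, concatenate."""
    def __init__(self, channels, kernel_size=5):
        super().__init__()
        half = channels // 2
        pad = kernel_size // 2
        self.branch_a = nn.Conv2d(half, half, kernel_size, padding=pad)
        self.branch_b = nn.Conv2d(half, half, kernel_size, padding=pad)

    def forward(self, x):
        a, b = x.chunk(2, dim=1)                       # split in half along the channel dimension (assumed axis)
        return torch.cat([self.branch_a(a), self.branch_b(b)], dim=1)

class ConvDecoderSketch(nn.Module):
    """Illustrative decoder: a two-way block, a plain conv block, and upsampling back to W x H."""
    def __init__(self, channels=64, upsample=4):
        super().__init__()
        self.body = nn.Sequential(
            TwoWayConvBlock(channels),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),  # ordinary convolution block
            nn.Upsample(scale_factor=upsample, mode='bilinear', align_corners=False),
        )

    def forward(self, x):
        return self.body(x)

second_feature = ConvDecoderSketch()(torch.randn(1, 64, 32, 32))
print(second_feature.shape)  # torch.Size([1, 64, 128, 128])
```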
7. The method of claim 5, wherein inputting the first feature and the second feature into the dynamic convolution module for convolution processing to obtain a target segmentation mask comprises:
performing, by the dynamic convolution module, dynamic convolution on the second feature with the first feature serving as dynamic weights, to obtain the target segmentation mask.
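Claim 7 applies the first feature as dynamic weights in a convolution over the second feature. In the simplest reading, each of the N instance vectors acts as a 1 × 1 convolution kernel over the decoder output, which reduces to a per-instance inner product over channels; the sketch below illustrates that reading with assumed sizes, and is not the only form of dynamic convolution the claim could cover.

```python
import torch

def dynamic_conv_sketch(first_feature, second_feature):
    """Illustrative 1x1 dynamic convolution.

    first_feature:  (B, N, c)     -- per-instance dynamic weights from the transformer module
    second_feature: (B, c, H, W)  -- dense feature from the convolutional decoder
    returns:        (B, N, H, W)  -- one mask channel per instance (target segmentation mask)
    """
    return torch.einsum('bnc,bchw->bnhw', first_feature, second_feature)

mask = dynamic_conv_sketch(torch.randn(1, 16, 64), torch.randn(1, 64, 128, 128))
print(mask.shape)  # torch.Size([1, 16, 128, 128])
```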
8. An image processing apparatus for application to an image processing network comprising a convolutional encoder module, a transformer module, a convolutional decoder module and a dynamic convolutional module, the apparatus comprising a processing unit for:
Inputting an original image into the convolutional encoder module to obtain an image feature, wherein the number of channels of the original image is C, the width is W, the length is H, and the dimension of the original image is C × W × H;
inputting the image feature into the transformer module to obtain a first feature;
inputting the image feature into the convolutional decoder module to obtain a second feature;
and inputting the first feature and the second feature into the dynamic convolution module to obtain a target segmentation mask, wherein the dimension of the target segmentation mask is N × W × H, and N is the upper limit on the number of segmentation boxes.
9. An electronic device, comprising: a processor, a memory, and one or more programs; the one or more programs are stored in the memory and configured to be executed by the processor, the programs comprising instructions for performing the steps in the method of any of claims 1-7.
10. A computer storage medium storing a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the method of any of claims 1-7.
CN202211409124.6A 2022-11-10 2022-11-10 Image processing method and related device Pending CN118052827A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211409124.6A CN118052827A (en) 2022-11-10 2022-11-10 Image processing method and related device

Publications (1)

Publication Number Publication Date
CN118052827A true CN118052827A (en) 2024-05-17

Family

ID=91043606

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211409124.6A Pending CN118052827A (en) 2022-11-10 2022-11-10 Image processing method and related device

Country Status (1)

Country Link
CN (1) CN118052827A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination