CN117237394B - Multi-attention-based lightweight image segmentation method, device and storage medium - Google Patents

Multi-attention-based lightweight image segmentation method, device and storage medium

Info

Publication number
CN117237394B
CN117237394B CN202311466721.7A
Authority
CN
China
Prior art keywords
feature map
encoder
module
branch network
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311466721.7A
Other languages
Chinese (zh)
Other versions
CN117237394A (en)
Inventor
张榜泽
罗飞
任大伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wanlicloud Medical Information Technology Beijing Co ltd
Original Assignee
Wanlicloud Medical Information Technology Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wanlicloud Medical Information Technology Beijing Co ltd filed Critical Wanlicloud Medical Information Technology Beijing Co ltd
Priority to CN202311466721.7A priority Critical patent/CN117237394B/en
Publication of CN117237394A publication Critical patent/CN117237394A/en
Application granted granted Critical
Publication of CN117237394B publication Critical patent/CN117237394B/en

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application discloses a multi-attention-based lightweight image segmentation method, device and storage medium. The encoder is further configured to: input the feature map output by the previous-level encoder into a first branch network, a second branch network and a third branch network respectively, wherein the first branch network and the second branch network are one-dimensional convolution networks and the third branch network is a combined network comprising a 5×5 convolution kernel and a 1×1 convolution kernel; perform a one-dimensional convolution on the received feature map with the first branch network to generate a second feature map; perform a one-dimensional convolution on the received feature map with the second branch network to generate a third feature map; convolve the received feature map with the third branch network to generate a fourth feature map; and generate the feature map to be input to the next-level encoder based on the feature map output by the previous-level encoder, the second feature map, the third feature map and the fourth feature map.

Description

Multi-attention-based lightweight image segmentation method, device and storage medium
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular to a multi-attention-based lightweight image segmentation method, apparatus and storage medium.
Background
With the continuous development of deep learning, medical image processing has become a research area of great interest. Mature medical image processing technology not only provides great assistance to medical workers but also effectively improves the accuracy of medical diagnosis, so researchers place ever higher requirements on processing accuracy, aiming to help doctors make more accurate diagnoses through more accurate automatic algorithms. However, owing to the complexity and huge data volume of medical images, deep learning models often require more complex structures and deeper networks to achieve the desired accuracy. This inevitably increases the number of parameters and the computational cost of the deep learning model, and places higher demands on the configuration of experimental equipment.
Thus, researchers still need to further study and explore to develop more applicable and efficient deep learning models.
Existing lightweight models, such as the GhostNet and MobileNet models, can be combined with a decoder to form a lightweight segmentation model, but their effect on medical image segmentation is poor. While these segmentation models reduce the scale of the deep learning model, they also lose some feature extraction capability, resulting in a degree of information loss.
Medical images often suffer from lesions of varying scale, blurred boundaries, low contrast and similar problems, and existing lightweight segmentation models struggle to capture fine features and edge information, so they cannot assist doctors in making more accurate diagnoses.
For the problem in the prior art that, because medical images often contain lesions of varying scale, blurred boundaries and low contrast, existing lightweight segmentation models struggle to capture fine features and edge information and therefore cannot assist doctors in making more accurate diagnoses, no effective solution has yet been proposed.
Disclosure of Invention
The embodiments of the present disclosure provide a multi-attention-based lightweight image segmentation method, apparatus and storage medium, so as to at least solve the technical problem in the prior art that, because medical images often contain lesions of varying scale, blurred boundaries and low contrast, existing lightweight segmentation models struggle to capture fine features and edge information and therefore cannot assist doctors in making more accurate diagnoses.
According to one aspect of the embodiments of the present disclosure, there is provided a multi-attention-based lightweight image segmentation method, comprising: inputting an image to be processed into an encoder module and outputting a first feature map from a context-aware module connected to the encoder module, wherein the encoder module comprises a plurality of sequentially cascaded encoders and the output of the previous-level encoder is connected to the input of the current-level encoder; and inputting the first feature map to a decoder module connected to the context-aware module and outputting a segmented image corresponding to the image to be processed, wherein the encoder is further configured to: input the feature map output by the previous-level encoder into a first branch network, a second branch network and a third branch network respectively, wherein the first branch network and the second branch network are one-dimensional convolution networks and the third branch network is a combined network comprising a 5×5 convolution kernel and a 1×1 convolution kernel; perform a one-dimensional convolution on the received feature map with the first branch network to generate a second feature map; perform a one-dimensional convolution on the received feature map with the second branch network to generate a third feature map; convolve the received feature map with the third branch network to generate a fourth feature map; and generate the feature map to be input to the next-level encoder based on the feature map output by the previous-level encoder, the second feature map, the third feature map and the fourth feature map.
According to another aspect of the embodiments of the present disclosure, there is also provided a storage medium including a stored computer program, wherein the computer program, when executed by a processor, implements any one of the methods described above.
According to another aspect of the embodiments of the present disclosure, there is also provided an apparatus for multi-attention-based lightweight image segmentation, including: a first input module for inputting the image to be processed into the encoder module and outputting a first feature map from the context-aware module connected to the encoder module, wherein the encoder module comprises a plurality of sequentially cascaded encoders and the output of the previous-level encoder is connected to the input of the current-level encoder; and a second input module for inputting the first feature map to the decoder module connected to the context-aware module and outputting a segmented image corresponding to the image to be processed, wherein the encoder further comprises: a third input module for inputting the feature map output by the previous-level encoder into a first branch network, a second branch network and a third branch network respectively, wherein the first branch network and the second branch network are one-dimensional convolution networks and the third branch network is a combined network comprising a 5×5 convolution kernel and a 1×1 convolution kernel; a first generation module for performing a one-dimensional convolution on the received feature map with the first branch network and generating a second feature map; a second generation module for performing a one-dimensional convolution on the received feature map with the second branch network and generating a third feature map; a third generation module for convolving the received feature map with the third branch network and generating a fourth feature map; and a fourth generation module for generating the feature map to be input to the next-level encoder based on the feature map output by the previous-level encoder, the second feature map, the third feature map and the fourth feature map.
According to another aspect of the embodiments of the present disclosure, there is also provided an apparatus for multi-attention-based lightweight image segmentation, including: a processor; and a memory, coupled to the processor, for providing the processor with instructions for the following processing steps: inputting an image to be processed into an encoder module and outputting a first feature map from a context-aware module connected to the encoder module, wherein the encoder module comprises a plurality of sequentially cascaded encoders and the output of the previous-level encoder is connected to the input of the current-level encoder; inputting the first feature map to a decoder module connected to the context-aware module and outputting a segmented image corresponding to the image to be processed, wherein the encoder is further configured to: input the feature map output by the previous-level encoder into a first branch network, a second branch network and a third branch network respectively, wherein the first branch network and the second branch network are one-dimensional convolution networks and the third branch network is a combined network comprising a 5×5 convolution kernel and a 1×1 convolution kernel; perform a one-dimensional convolution on the received feature map with the first branch network to generate a second feature map; perform a one-dimensional convolution on the received feature map with the second branch network to generate a third feature map; convolve the received feature map with the third branch network to generate a fourth feature map; and generate the feature map to be input to the next-level encoder based on the feature map output by the previous-level encoder, the second feature map, the third feature map and the fourth feature map.
The present application provides a multi-attention-based lightweight image segmentation method. The method provides an encoder module comprising a plurality of sequentially cascaded encoders, in which the output of the previous-level encoder is connected to the input of the current-level encoder, and three branch networks are provided in each encoder. The first branch network and the second branch network are one-dimensional convolution networks, so processing the feature map with one-dimensional convolutions minimises the computational burden and improves computational efficiency; that is, the feature map can be generated with very few parameters and very little computation.
In addition, considering that a one-dimensional convolution network usually ignores surrounding pixel information, the encoder is further provided with a third branch network to make up for the weakness of one-dimensional convolution in handling context information. The third branch network comprises a 5×5 convolution kernel and a 1×1 convolution kernel. The combined network composed of the 5×5 convolution kernel and the 1×1 convolution kernel obtains a larger receptive field and helps capture more context information, thereby assisting doctors in making more accurate diagnoses.
The technical effect of capturing fine features and edge information while reducing the amount of computation and the number of parameters is therefore achieved, solving the prior-art problem that, because medical images often contain lesions of varying scale, blurred boundaries and low contrast, existing lightweight segmentation models struggle to capture fine features and edge information and therefore cannot assist doctors in making more accurate diagnoses.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this application, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure and do not constitute an undue limitation on the disclosure. In the drawings:
FIG. 1 is a block diagram of the hardware architecture of a computing device for implementing the method according to embodiment 1 of the present application;
FIG. 2 is a schematic diagram of a system for multi-attention-based lightweight image segmentation according to embodiment 1 of the present application;
FIG. 3 is a flow chart of a method for multi-attention-based lightweight image segmentation according to embodiment 1 of the present application;
FIG. 4 is a flowchart of operations performed by an encoder according to embodiment 1 of the present application;
FIG. 5 is a schematic diagram of the model structure for multi-attention-based lightweight image segmentation according to embodiment 1 of the present application;
FIG. 6 is a schematic diagram of an encoder according to embodiment 1 of the present application;
FIG. 7 is a schematic diagram of a multi-attention module according to embodiment 1 of the present application;
FIG. 8 is a schematic diagram of a context awareness module according to embodiment 1 of the present application;
FIG. 9 is a schematic structural diagram of a decoder module according to embodiment 1 of the present application;
FIG. 10 is a schematic diagram of an apparatus for multi-attention-based lightweight image segmentation according to embodiment 2 of the present application;
FIG. 11 is a schematic diagram of an apparatus for multi-attention-based lightweight image segmentation according to embodiment 3 of the present application.
Detailed Description
In order to better understand the technical solutions of the present disclosure, the technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the drawings in those embodiments. It is apparent that the described embodiments are only some, not all, of the embodiments of the present disclosure. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in this disclosure without inventive effort shall fall within the scope of protection of the present disclosure.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
According to the present embodiment, a method embodiment for multi-attention-based lightweight image segmentation is provided. It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer-executable instructions and, although a logical order is shown in the flowcharts, in some cases the steps illustrated or described may be performed in an order different from the one shown here.
The method embodiments provided by the present embodiment may be performed in a mobile terminal, a computer terminal, a server, or a similar computing device. FIG. 1 shows a block diagram of the hardware architecture of a computing device for implementing multi-attention-based lightweight image segmentation. As shown in fig. 1, the computing device may include one or more processors (which may include, but are not limited to, processing devices such as a microprocessor MCU or a programmable logic device FPGA), memory for storing data, a transmission device for communication functions, and an input/output interface. The memory, the transmission device and the input/output interface are connected to the processor through a bus. In addition, the computing device may further include a display connected to the input/output interface, a keyboard, and a cursor control device. Those of ordinary skill in the art will appreciate that the configuration shown in fig. 1 is merely illustrative and is not intended to limit the configuration of the electronic device described above. For example, the computing device may also include more or fewer components than shown in FIG. 1, or have a different configuration from that shown in FIG. 1.
It should be noted that the one or more processors and/or other data processing circuits described above may be referred to herein generally as "data processing circuits". The data processing circuit may be embodied in whole or in part in software, hardware, firmware, or any other combination. Furthermore, the data processing circuit may be a single stand-alone processing module, or incorporated in whole or in part into any of the other elements in the computing device. As referred to in the embodiments of the present disclosure, the data processing circuit serves as a kind of processor control (for example, selection of a variable-resistance termination path to interface with).
The memory may be used to store software programs and modules of application software, such as program instructions/data storage devices corresponding to the multi-attention based method for segmenting images in the embodiments of the present disclosure, and the processor executes the software programs and modules stored in the memory, thereby performing various functional applications and data processing, that is, implementing the multi-attention based method for segmenting images of the application program. The memory may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid state memory. In some examples, the memory may further include memory remotely located with respect to the processor, which may be connected to the computing device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission means is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communications provider of the computing device. In one example, the transmission means comprises a network adapter (Network Interface Controller, NIC) connectable to other network devices via the base station to communicate with the internet. In one example, the transmission device may be a Radio Frequency (RF) module, which is used to communicate with the internet wirelessly.
The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the computing device.
It should be noted here that, in some alternative embodiments, the computing device shown in FIG. 1 may include hardware elements (including circuitry), software elements (including computer code stored on a computer-readable medium), or a combination of both hardware and software elements. It should also be noted that FIG. 1 is only one specific example and is intended to illustrate the types of components that may be present in the computing device described above.
Fig. 2 is a schematic diagram of a system for multi-attention based lightweight segmented image according to this embodiment. Referring to fig. 2, the system includes: a terminal device 100 and a processor 200.
Wherein the terminal device 100 is configured to transmit the received image to be processed to the processor 200 through a network.
The processor 200 is configured to input the received image to be processed to the encoder module, and perform subsequent processing on the image to be processed by the encoder module.
The processor 200 is further configured to input the feature map generated after being processed by the encoder module to the context-aware module, so that the context-aware module performs subsequent processing on the received feature map.
The processor 200 is further configured to input the first feature map output by the context awareness module to the decoder module, and perform subsequent processing on the first feature map by the decoder module.
It should be noted that the above-described hardware configuration may be applied to both the terminal device 100 and the processor 200 in the system.
In the above-described operating environment, according to a first aspect of the present embodiment, there is provided a method of multi-attention based lightweight segmentation of images, the method being implemented by the processor 200 shown in fig. 2. Fig. 3 shows a flow diagram of a method for multi-attention based lightweight segmented image according to an embodiment of the present application. Fig. 4 shows a flowchart of operations performed by an encoder according to embodiments of the present application. Referring to fig. 3 and 4, the method includes:
S302: inputting an image to be processed into an encoder module and outputting a first feature map from a context-aware module connected to the encoder module, wherein the encoder module comprises a plurality of sequentially cascaded encoders and the output of the previous-level encoder is connected to the input of the current-level encoder;
S304: inputting the first feature map to a decoder module connected to the context-aware module and outputting a segmented image corresponding to the image to be processed, wherein the encoder is further configured to:
S402: input the feature map output by the previous-level encoder into a first branch network, a second branch network and a third branch network respectively, wherein the first branch network and the second branch network are one-dimensional convolution networks and the third branch network is a combined network comprising a 5×5 convolution kernel and a 1×1 convolution kernel;
S404: perform a one-dimensional convolution on the received feature map with the first branch network and generate a second feature map;
S406: perform a one-dimensional convolution on the received feature map with the second branch network and generate a third feature map;
S408: convolve the received feature map with the third branch network and generate a fourth feature map; and
S410: generate the feature map to be input to the next-level encoder based on the feature map output by the previous-level encoder, the second feature map, the third feature map and the fourth feature map.
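Before the detailed description, the overall data flow of steps S302 to S304 can be summarised in the following minimal PyTorch-style sketch. It is offered only for orientation under assumed module names (the encoder blocks, ContextAwareModule, DecoderBlock and a prediction head are placeholders for the structures described below) and is not the patented implementation itself.

import torch
import torch.nn as nn

class MultiAttentionSegNet(nn.Module):
    # Sketch: cascaded encoders -> context-aware module -> cascaded decoders -> prediction head.
    def __init__(self, encoders, context_module, decoders, head):
        super().__init__()
        self.encoders = nn.ModuleList(encoders)   # sequentially cascaded encoder levels (S302)
        self.context_module = context_module      # fuses all encoder outputs into the first feature map
        self.decoders = nn.ModuleList(decoders)   # sequentially cascaded decoder levels (S304)
        self.head = head                          # final segmentation prediction layer

    def forward(self, image):
        feats, x = [], image
        for enc in self.encoders:                 # each level feeds the next (S402-S410 happen inside enc)
            x = enc(x)
            feats.append(x)
        first_feature_map = self.context_module(feats)
        y = first_feature_map
        for dec in self.decoders:                 # from the deepest decoder towards the output resolution
            y = dec(y)
        return self.head(y)                       # segmented image corresponding to the input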
Specifically, fig. 5 is a schematic diagram of a model structure of a multi-attention-based lightweight segmented image according to embodiment 1 of the present application. Fig. 6 is a schematic structural diagram of an encoder according to embodiment 1 of the present application. Referring to fig. 5, the encoder module includes a plurality of encoders sequentially cascaded in sequence, and an output terminal of a previous-level encoder is connected to an input terminal of a current-level encoder. In addition, the output end of the encoder module is connected with the input end of the context sensing module. The output end of the context sensing module is connected with the input end of the decoder module.
Thus, first, the processor 200 inputs an image to be processed to the encoder module and outputs a first feature map from the context awareness module connected to the encoder module (S302). The first feature map is an image generated by splicing the feature maps output by the encoders and adjusting the spliced feature maps based on a channel attention mechanism.
Then, the processor 200 inputs the first feature map to the decoder module connected to the context-aware module and outputs the segmented image corresponding to the image to be processed (S304), i.e., the image generated by segmenting the image to be processed.
And wherein the encoder is further configured to:
First, the processor 200 inputs the feature map output by the previous-level encoder into the first branch network, the second branch network and the third branch network respectively (S402). The first branch network and the second branch network are one-dimensional convolution networks, and the third branch network is a combined network comprising a 5×5 convolution kernel and a 1×1 convolution kernel. Specifically, referring to fig. 6, in order to reduce the computational burden as much as possible, computationally fast one-dimensional convolutions are used in the first branch network and the second branch network to process the feature map. Meanwhile, since surrounding pixel information is generally ignored when a feature map is processed with a one-dimensional convolution network, a combined network comprising a 5×5 convolution kernel and a 1×1 convolution kernel is used in the third branch network to make up for the weakness of one-dimensional convolution in handling context information. This combined network is able to capture subtle features and edge information, which facilitates more accurate segmentation of images.
Then, the processor 200 performs one-dimensional convolution on the received feature map using the first branch network, thereby generating a second feature map (S404). The foregoing will be described in detail later, and thus will not be described in detail here.
Further, the processor 200 performs one-dimensional convolution on the received feature map using the second branch network, thereby generating a third feature map (S406). The foregoing will be described in detail later, and thus will not be described in detail here.
The processor 200 then convolves the received feature map with the third branch network to generate a fourth feature map (S408). The foregoing will be described in detail later, and thus will not be described in detail here.
Finally, the processor 200 generates the feature map to be input to the next-level encoder based on the feature map output by the previous-level encoder, the second feature map, the third feature map, and the fourth feature map (S410). Specifically, referring to fig. 6, after generating the second feature map with the first branch network, the third feature map with the second branch network, and the fourth feature map with the third branch network, the processor 200 inputs the second, third, and fourth feature maps into the multi-attention module. The multi-attention module then processes the second, third, and fourth feature maps based on a spatial attention mechanism and a self-attention mechanism, thereby generating the feature map to be input to the next-level encoder. This will be described in detail later and is therefore not elaborated here.
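A single encoder level can accordingly be sketched as below. This is a minimal sketch under the assumption that the three branch networks and the multi-attention module (MAM) are passed in as sub-modules; concrete sketches of those sub-modules follow later in this description.

import torch
import torch.nn as nn

class LightweightEncoderBlock(nn.Module):
    # One encoder level: three branches feed the multi-attention module, and the MAM output
    # is spliced (concatenated) with the block input to form the feature map for the next level.
    def __init__(self, branch1, branch2, branch3, mam):
        super().__init__()
        self.branch1, self.branch2, self.branch3 = branch1, branch2, branch3
        self.mam = mam

    def forward(self, x):                 # x: feature map from the previous-level encoder
        f2 = self.branch1(x)              # second feature map (one-dimensional convolutions, S404)
        f3 = self.branch2(x)              # third feature map (one-dimensional convolutions, S406)
        f4 = self.branch3(x)              # fourth feature map (5x5 + 1x1 combined network, S408)
        f5 = self.mam(f2, f3, f4)         # fifth feature map from the multi-attention module
        return torch.cat([x, f5], dim=1)  # feature map passed to the next-level encoder (S410)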
As described in the background, existing lightweight models, such as the GhostNet and MobileNet models, can be combined with a decoder to form a lightweight segmentation model, but their effect on medical image segmentation is poor. While these segmentation models reduce the scale of the deep learning model, they also lose some feature extraction capability, resulting in a degree of information loss.
Moreover, because medical images often contain lesions of varying scale, blurred boundaries, low contrast and the like, existing lightweight segmentation models struggle to capture fine features and edge information, and therefore cannot assist doctors in making more accurate diagnoses.
In view of this, the present application provides a multi-attention-based lightweight image segmentation method. The method provides an encoder module comprising a plurality of sequentially cascaded encoders, in which the output of the previous-level encoder is connected to the input of the current-level encoder, and three branch networks are provided in each encoder. The first branch network and the second branch network are one-dimensional convolution networks, so processing the feature map with one-dimensional convolutions minimises the computational burden and improves computational efficiency; that is, the feature map can be generated with very few parameters and very little computation.
In addition, considering that a one-dimensional convolution network usually ignores surrounding pixel information, the encoder is further provided with a third branch network to make up for the weakness of one-dimensional convolution in handling context information. The third branch network comprises a 5×5 convolution kernel and a 1×1 convolution kernel. The combined network composed of the 5×5 convolution kernel and the 1×1 convolution kernel obtains a larger receptive field and helps capture more context information, thereby assisting doctors in making more accurate diagnoses.
The technical effect of capturing fine features and edge information while reducing the amount of computation and the number of parameters is therefore achieved, solving the prior-art problem that, because medical images often contain lesions of varying scale, blurred boundaries and low contrast, existing lightweight segmentation models struggle to capture fine features and edge information and therefore cannot assist doctors in making more accurate diagnoses.
Further, in the case where the encoder of the current level is the first-level encoder, the image to be processed is received, so that the first-level encoder processes the image to be processed.
Optionally, the first branch network includes a first convolution unit and a second convolution unit, and the operation of one-dimensionally convolving the received feature map with the first branch network and generating the second feature map includes: performing a row convolution operation on the received feature map by using a first convolution unit, thereby generating a first sub-feature map; and performing column convolution operation on the first sub-feature map by using a second convolution unit so as to generate a second feature map.
Specifically, referring to fig. 6, suppose the feature map output by the previous-level encoder has dimensions (B, C, H, W), where B is the batch size during training, C is the number of channels, H is the height, and W is the width of the feature map.
The first branch network comprises a first convolution unit (i.e., Conv1d H in the first branch network shown in fig. 6) and a second convolution unit (i.e., Conv1d W in the first branch network shown in fig. 6). First, the first convolution unit performs a row convolution operation on the feature map output by the previous-level encoder (i.e., processes the feature map with a one-dimensional convolution along the H direction), thereby generating a first sub-feature map.
Then, the second convolution unit performs a column convolution operation on the first sub-feature map (i.e., processes the first sub-feature map with a one-dimensional convolution along the W direction), thereby generating a second feature map.
In addition, after the first sub-feature map is generated, a batch normalization operation and a ReLU (Rectified Linear Unit) activation function are applied to the first sub-feature map to increase nonlinearity.
Thus, by having the first convolution unit in the first branch network perform a row convolution operation on the received feature map and the second convolution unit perform a column convolution operation on the generated first sub-feature map, the technical effect of processing the received feature map with very few parameters and very little computation is achieved.
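As a concrete illustration, the first branch can be sketched with PyTorch as follows. The kernel length k is an assumption (the text does not state it), and the "Conv1d" operations are realised as two-dimensional convolutions with (k, 1) and (1, k) kernels; the batch normalization and ReLU after the first convolution follow the description above.

import torch.nn as nn

class OneDimConvBranch(nn.Module):
    # Asymmetric one-dimensional convolution branch.
    # order=("H", "W"): row convolution first, then column convolution (first branch network).
    # order=("W", "H"): column convolution first, then row convolution (second branch network).
    def __init__(self, channels, k=3, order=("H", "W")):
        super().__init__()
        def conv_1d(direction):
            # a "row" convolution slides along H, a "column" convolution slides along W
            kernel = (k, 1) if direction == "H" else (1, k)
            padding = (k // 2, 0) if direction == "H" else (0, k // 2)
            return nn.Conv2d(channels, channels, kernel_size=kernel, padding=padding)
        self.conv_a = conv_1d(order[0])
        self.bn = nn.BatchNorm2d(channels)    # batch normalization after the first convolution
        self.act = nn.ReLU(inplace=True)      # ReLU to increase nonlinearity
        self.conv_b = conv_1d(order[1])

    def forward(self, x):                         # x: (B, C, H, W)
        sub = self.act(self.bn(self.conv_a(x)))   # first (or second) sub-feature map
        return self.conv_b(sub)                   # second (or third) feature map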
Optionally, the second branch network includes a third convolution unit (i.e., Conv1d W in the second branch network shown in fig. 6) and a fourth convolution unit (i.e., Conv1d H in the second branch network shown in fig. 6), and the operation of one-dimensionally convolving the received feature map with the second branch network and generating a third feature map includes: performing a column convolution operation on the received feature map with the third convolution unit, thereby generating a second sub-feature map; and performing a row convolution operation on the second sub-feature map with the fourth convolution unit, thereby generating a third feature map.
Specifically, referring to fig. 6, the second branch network includes a third convolution unit and a fourth convolution unit. First, the third convolution unit performs a column convolution operation on the feature map output by the previous-level encoder (i.e., processes the received feature map with a one-dimensional convolution along the W direction), thereby generating a second sub-feature map.
Then, the fourth convolution unit performs a row convolution operation on the second sub-feature map (i.e., processes the second sub-feature map with a one-dimensional convolution along the H direction), thereby generating a third feature map.
Thus, by having the third convolution unit in the second branch network perform a column convolution operation on the received feature map and the fourth convolution unit perform a row convolution operation on the generated second sub-feature map, the technical effect of processing the received feature map with very few parameters and very little computation is achieved.
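Under the same assumptions, the second branch is simply the OneDimConvBranch sketched above with the order of the two one-dimensional convolutions swapped, for example:

import torch

branch1 = OneDimConvBranch(channels=64, order=("H", "W"))   # first branch: row then column
branch2 = OneDimConvBranch(channels=64, order=("W", "H"))   # second branch: column then row

x = torch.randn(2, 64, 128, 128)       # (B, C, H, W) feature map from the previous-level encoder
second_feature_map = branch1(x)        # shape preserved: (2, 64, 128, 128)
third_feature_map = branch2(x)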
Optionally, the third branch network includes a fifth convolution unit and a sixth convolution unit, and the operation of convolving the received feature map with the third branch network and generating a fourth feature map includes: performing a convolution operation on the received feature map with the 5×5 convolution kernel in the fifth convolution unit, thereby generating a third sub-feature map; and performing a convolution operation on the third sub-feature map with the 1×1 convolution kernel in the sixth convolution unit, thereby generating a fourth feature map.
Specifically, referring to fig. 5, in the third branch network the fifth convolution unit (i.e., the 5×5 convolution kernel) first extracts the rich per-channel spatial information from the received feature map, thereby generating a third sub-feature map. Then the sixth convolution unit (i.e., the 1×1 convolution kernel) adjusts the relations between the individual channels of the third sub-feature map, thereby generating a fourth feature map.
Thus, by having the fifth convolution unit in the third branch network extract the rich spatial information of each channel in the received feature map and the sixth convolution unit adjust the relations between the channels of the generated third sub-feature map, more context information can be captured, which helps doctors make more accurate diagnoses.
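A sketch of the third branch under this description is given below. Treating the 5×5 convolution as a depthwise (per-channel) convolution is inferred from the parameter count C×5×5×1 given in the next paragraph and is flagged here as an assumption.

import torch.nn as nn

class LargeKernelBranch(nn.Module):
    # Third branch: a 5x5 per-channel (depthwise) convolution gathers spatial context,
    # then a 1x1 pointwise convolution adjusts the relations between channels.
    def __init__(self, channels):
        super().__init__()
        self.spatial = nn.Conv2d(channels, channels, kernel_size=5, padding=2, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):              # x: (B, C, H, W)
        sub = self.spatial(x)          # third sub-feature map (per-channel 5x5 context)
        return self.pointwise(sub)     # fourth feature map (channel relations adjusted)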
Further, referring to fig. 5, when the input and output channel numbers are both C, the parameter count of each of the first and second branch networks is C×1×C×2 = 2C², and the parameter count of the third branch network is C×5×5×1 + C×1×1×C = 25C + C². That is, the total parameter count of the three branch networks is 5C² + 25C, whereas a 3×3 convolution has C×3×3×C = 9C² parameters. When C > 6, the three branch networks together therefore use fewer parameters than a single 3×3 convolution, and because the channel number in a convolutional neural network increases layer by layer and is usually much larger than 6, the parameter gap can be very large.
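A quick arithmetic check of this comparison, taking the parameter counts exactly as stated above (biases ignored):

def branch_params(c: int) -> int:
    # total parameter count of the three branches: 5*C**2 + 25*C
    first_and_second = 2 * (c * 1 * c * 2)    # two one-dimensional convolution branches
    third = c * 5 * 5 * 1 + c * 1 * 1 * c     # 5x5 depthwise + 1x1 pointwise convolution
    return first_and_second + third

def conv3x3_params(c: int) -> int:
    return c * 3 * 3 * c                      # a plain 3x3 convolution: 9*C**2

for c in (4, 8, 64, 256):
    print(c, branch_params(c), conv3x3_params(c), branch_params(c) < conv3x3_params(c))
# prints False for C = 4 and True from C = 8 upward: for C > 6 the three branches are cheaper.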
Optionally, the encoder further includes a multi-attention module connected to the first branch network, the second branch network, and the third branch network, respectively, and generating a feature map for input to the encoder of the next level based on the feature map, the second feature map, the third feature map, and the fourth feature map output by the encoder of the previous level, including: inputting the second feature map, the third feature map and the fourth feature map to the multi-attention module respectively, and outputting a fifth feature map from the multi-attention module; and splicing the feature map output by the encoder of the previous level with the fifth feature map, and outputting the feature map for input to the encoder of the next level. Further alternatively, the operations of inputting the second feature map, the third feature map, and the fourth feature map to the multi-attention module, respectively, and outputting the fifth feature map from the multi-attention module, include: updating the second feature map, the third feature map and the fourth feature map respectively based on a spatial attention mechanism; and generating a fifth feature map based on the self-attention mechanism and using the updated second feature map, the updated third feature map, and the updated fourth feature map.
Specifically, fig. 7 is a schematic structural diagram of the multi-attention module according to embodiment 1 of the present application. Referring to fig. 7, a multi-attention module (i.e., the MAM module shown in fig. 6) connected to the first branch network, the second branch network, and the third branch network, respectively, is further provided in the encoder. The multi-attention module includes a spatial attention unit and a self-attention unit. The spatial attention unit is used for processing the received characteristic diagram based on a spatial attention mechanism. The self-attention unit is used for processing the received characteristic diagram based on a self-attention mechanism.
First, the first branch network sends the generated second feature map to the multi-attention module, the second branch network sends the generated third feature map to the multi-attention module, and the third branch network sends the generated fourth feature map to the multi-attention module.
The spatial attention unit in the multi-attention module then performs channel compression on the received fourth feature map based on the spatial attention mechanism, i.e., it compresses the channels of the fourth feature map to 1 to obtain a weight map for the fourth feature map.
Further, the spatial attention unit performs element-wise multiplication of the obtained weight map with the second feature map, thereby obtaining an updated second feature map (i.e., the multi-attention module updates the second feature map based on a cross spatial attention mechanism). The spatial attention unit multiplies the obtained weight map element-wise with the third feature map to obtain an updated third feature map (i.e., the multi-attention module updates the third feature map based on the cross spatial attention mechanism). The spatial attention unit also performs element-wise multiplication of the obtained weight map with the fourth feature map, thereby obtaining an updated fourth feature map (i.e., the multi-attention module updates the fourth feature map based on the spatial attention mechanism). In this way, the updated second, third and fourth feature maps contain more context information.
Then, the self-attention unit in the multi-attention module generates a fifth feature map based on the self-attention mechanism using the updated second, third and fourth feature maps. Specifically, the self-attention unit takes the updated second feature map as q of the self-attention mechanism, the updated third feature map as k, and the updated fourth feature map as v. The self-attention unit then transposes the updated third feature map (i.e., transposes it with the Trans module shown in FIG. 7) and performs element-wise multiplication with the updated second feature map. Finally, the self-attention unit performs element-wise multiplication of the resulting feature map with the updated fourth feature map, thereby generating the fifth feature map. In this way, the feature information of interest can be highlighted in the fifth feature map.
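The multi-attention module described above can be sketched as follows. The channel compression is assumed to be a 1×1 convolution with a sigmoid gate (both assumptions), and the self-attention step follows the description literally (transpose followed by element-wise products), which presupposes square feature maps (H equal to W).

import torch
import torch.nn as nn

class MultiAttentionModule(nn.Module):
    # Sketch of the MAM: spatial attention derived from the fourth feature map, followed by a
    # literal element-wise "self-attention" over the updated q, k, v feature maps.
    def __init__(self, channels):
        super().__init__()
        self.compress = nn.Conv2d(channels, 1, kernel_size=1)   # compress channels to 1

    def forward(self, f2, f3, f4):            # second, third, fourth feature maps: (B, C, H, W)
        w = torch.sigmoid(self.compress(f4))  # spatial weight map: (B, 1, H, W)
        q = f2 * w                            # updated second feature map (cross spatial attention)
        k = f3 * w                            # updated third feature map
        v = f4 * w                            # updated fourth feature map
        attn = q * k.transpose(-1, -2)        # element-wise product with the transposed k (needs H == W)
        return attn * v                       # fifth feature map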
Optionally, the encoder module includes a first level encoder, a second level encoder, a third level encoder, and a fourth level encoder, where an output of the first level encoder, an output of the second level encoder, an output of the third level encoder, and an output of the fourth level encoder are further respectively connected to an input of the context awareness module; wherein outputting the first feature map from the context-aware module coupled to the encoder module comprises outputting the first feature map from the context-aware module coupled to the first, second, third, and fourth level encoders; wherein outputting the first feature map from a context awareness module coupled to the first, second, third, and fourth level encoders comprises: performing bicubic interpolation on the feature map output by the first-level encoder, thereby generating a fourth sub-feature map; performing bicubic interpolation on the feature map output by the second-level encoder, thereby generating a fifth sub-feature map; reducing the channel number of the feature map output by the third-level encoder to half, and performing bicubic interpolation on the generated feature map, thereby generating a sixth sub-feature map; reducing the number of channels of the feature map output by the fourth-level encoder to half, thereby generating a seventh sub-feature map; and splicing the fourth sub-feature map, the fifth sub-feature map, the sixth sub-feature map and the seventh sub-feature map, and processing the spliced feature map based on a channel attention mechanism, thereby generating a first feature map.
Specifically, fig. 8 is a schematic structural diagram of a context awareness module according to embodiment 1 of the present application. Referring to fig. 5 and 8, the first-level encoder transmits its output feature map to the context-aware module, which performs bicubic interpolation on the received feature map, thereby generating a fourth sub-feature map. The second-level encoder transmits its output feature map to the context-aware module, which performs bicubic interpolation on the received feature map, thereby generating a fifth sub-feature map. The third-level encoder transmits its output feature map to the context-aware module, which reduces the channels of the received feature map to half and then performs bicubic interpolation, thereby generating a sixth sub-feature map. The fourth-level encoder transmits its output feature map to the context-aware module, which reduces the channels of the received feature map to half, thereby generating a seventh sub-feature map.
It should be noted that, since the feature maps output by the first-level and second-level encoders have fewer channels, the fourth and fifth sub-feature maps can be retained in full; their channel numbers therefore do not need to be reduced, and they only need to be scaled down to the same size as the seventh sub-feature map.
Because the feature maps output by the third-level and fourth-level encoders have more channels, the context-aware module halves their channel numbers. By scaling the fourth, fifth, sixth and seventh sub-feature maps to the same size, richer context information can be obtained.
Further, after the fourth, fifth, sixth and seventh sub-feature maps are obtained, the context awareness module performs a stitching operation on them and adjusts the importance of the channels of the stitched feature map based on a channel attention mechanism, so that the first feature map generated after fusion is more flexible and accurate.
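A sketch of the context-aware module under the description above is given below. The channel-halving 1×1 convolutions and the squeeze-and-excitation style channel attention are assumptions about how the described steps are realised; the bicubic resizing to the size of the fourth-level (deepest) feature map follows the text.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextAwareModule(nn.Module):
    def __init__(self, channels):              # channels = [c1, c2, c3, c4] of the four encoder outputs
        super().__init__()
        c1, c2, c3, c4 = channels
        self.halve3 = nn.Conv2d(c3, c3 // 2, kernel_size=1)   # halve channels of the third-level output
        self.halve4 = nn.Conv2d(c4, c4 // 2, kernel_size=1)   # halve channels of the fourth-level output
        fused = c1 + c2 + c3 // 2 + c4 // 2
        self.channel_attn = nn.Sequential(                    # simple channel attention (assumed form)
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(fused, fused, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, feats):                  # feats = [e1, e2, e3, e4] from the four encoder levels
        e1, e2, e3, e4 = feats
        size = e4.shape[-2:]                   # target spatial size (that of the seventh sub-feature map)
        s4 = F.interpolate(e1, size=size, mode="bicubic", align_corners=False)  # fourth sub-feature map
        s5 = F.interpolate(e2, size=size, mode="bicubic", align_corners=False)  # fifth sub-feature map
        s6 = F.interpolate(self.halve3(e3), size=size, mode="bicubic", align_corners=False)  # sixth
        s7 = self.halve4(e4)                                                    # seventh sub-feature map
        fused = torch.cat([s4, s5, s6, s7], dim=1)             # stitching operation
        return fused * self.channel_attn(fused)                # first feature map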
Optionally, the decoder module comprises a plurality of sequentially cascaded decoders, the input of the previous-level decoder being connected to the output of the current-level decoder, and the decoder is further configured to perform operations comprising: receiving the feature map input by the next-level decoder, performing a convolution operation on the received feature map with a 3×3 convolution kernel, and generating an eighth sub-feature map; and, based on bicubic interpolation, reducing the channels of the eighth sub-feature map to half, thereby generating a feature map for input to the previous-level decoder.
Specifically, fig. 9 is a schematic structural diagram of a decoder module according to embodiment 1 of the present application. Referring to fig. 5 and 9, after the current-level decoder receives the feature map input by the next-level decoder, the received feature map is up-sampled with a 3×3 convolution kernel to generate an eighth sub-feature map. The decoder then reduces the channels of the eighth sub-feature map to half based on bicubic interpolation, thereby generating a feature map for input to the previous-level decoder.
Because the decoder module performs the up-sampling operation on the received feature map with a 3×3 convolution kernel, the up-sampling is more effective than up-sampling the received feature map with a 1×1 convolution kernel.
Further, because the decoder module uses bicubic interpolation rather than bilinear interpolation to reduce the channels of the eighth sub-feature map, the details and edge information of the eighth sub-feature map are better preserved, and more parameters and computation are saved.
In addition, after the 3×3 convolution is applied to the received feature map, a 1×1 convolution kernel can be used for a preliminary prediction, so that a loss can be computed on the scaled eighth sub-feature map; back-propagating this loss during training helps the model converge.
It should be noted that if the current level decoder is the fourth level decoder, the first feature map output by the context awareness module is received.
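One decoder level can be sketched as below. The 3×3 convolution, the bicubic interpolation and the optional 1×1 preliminary prediction follow the description; realising the channel halving as a 1×1 convolution, doubling the spatial size, and returning the auxiliary prediction for a deep-supervision loss are assumptions.

import torch.nn as nn
import torch.nn.functional as F

class DecoderBlock(nn.Module):
    def __init__(self, in_channels, num_classes=None):
        super().__init__()
        self.conv3x3 = nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1)
        self.halve = nn.Conv2d(in_channels, in_channels // 2, kernel_size=1)   # assumed channel halving
        # optional 1x1 preliminary prediction used only for an auxiliary training loss
        self.aux_head = nn.Conv2d(in_channels, num_classes, kernel_size=1) if num_classes else None

    def forward(self, x):                       # x: feature map from the next-level decoder (or the CAM)
        eighth = self.conv3x3(x)                # eighth sub-feature map
        aux = self.aux_head(eighth) if self.aux_head is not None else None
        out = F.interpolate(self.halve(eighth), scale_factor=2.0,
                            mode="bicubic", align_corners=False)   # towards the previous-level decoder
        return (out, aux) if aux is not None else out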
Thus, according to the first aspect of the present embodiment, the technical effect that fine feature and edge information can be captured while reducing the amount of calculation and the number of parameters is achieved.
Further, referring to fig. 1, according to a second aspect of the present embodiment, there is provided a storage medium. The storage medium includes a stored computer program, wherein, when the computer program is run by a processor, any one of the methods described above is performed.
Thus, according to the present embodiment, the technical effect of capturing fine feature and edge information while reducing the amount of calculation and the number of parameters is achieved.
It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present invention is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present invention. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present invention.
From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, but of course also by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method according to the embodiments of the present invention.
Example 2
Fig. 10 shows an apparatus 1000 for multi-attention based lightweight segmentation of images according to the first aspect of the present embodiment, which apparatus 1000 corresponds to the method according to the first aspect of embodiment 1. Referring to fig. 10, the apparatus 1000 includes: a first input module 1010, configured to input an image to be processed to an encoder module, and output a first feature map from a context awareness module connected to the encoder module, where the encoder module includes a plurality of encoders cascaded sequentially in sequence, and where an output end of a previous-level encoder is connected to an input end of a current-level encoder; a second input module 1020 for inputting the first feature map to a decoder module connected to the context awareness module and outputting a segmented image corresponding to the image to be processed, wherein the encoder further comprises: a third input module 1030, configured to input the feature map output by the previous-level encoder to a first branch network, a second branch network, and a third branch network, where the first branch network and the second branch network are one-dimensional convolution networks, and the third branch network is a combination network including a convolution kernel of 5×5 and a convolution kernel of 1×1; a first generating module 1040, configured to perform one-dimensional convolution on the received feature map using the first branch network, and generate a second feature map; a second generating module 1050, configured to perform one-dimensional convolution on the received feature map using a second branch network, and generate a third feature map; a third generating module 1060, configured to convolve the received feature map with a third branch network, and generate a fourth feature map; and a fourth generation module 1070 for generating a feature map for input to the encoder of the next level based on the feature map, the second feature map, the third feature map, and the fourth feature map output by the encoder of the previous level.
Optionally, the first generating module 1040 includes: the first generating sub-module is used for carrying out row convolution operation on the received characteristic diagram by utilizing a first convolution unit so as to generate a first sub-characteristic diagram; and a second generating sub-module for performing column convolution operation on the first sub-feature map by using a second convolution unit, thereby generating a second feature map.
Optionally, the second generating module 1050 includes: a third generating sub-module, configured to perform a column convolution operation on the received feature map by using a third convolution unit, so as to generate a second sub-feature map; and a fourth generating sub-module, configured to perform a row convolution operation on the second sub-feature map by using a fourth convolution unit, so as to generate a third feature map.
Optionally, the third generating module 1060 includes: a fifth generating sub-module for performing convolution operation on the received feature map by using the 5×5 convolution kernel in the fifth convolution unit, thereby generating a third sub-feature map; and a sixth generating sub-module for performing a convolution operation with the third sub-feature map using the 1×1 convolution kernel in the sixth convolution unit, thereby generating a fourth feature map.
Optionally, the fourth generating module 1070 includes: a first output module for inputting the second feature map, the third feature map and the fourth feature map into the multi-attention module respectively and outputting a fifth feature map from the multi-attention module; and a second output module for splicing the feature map output by the previous-level encoder with the fifth feature map and outputting the feature map to be input to the next-level encoder.
Optionally, the first output module includes: the updating module is used for updating the second feature map, the third feature map and the fourth feature map based on a spatial attention mechanism; and a fifth generation module, configured to generate a fifth feature map based on the self-attention mechanism and using the updated second feature map, the updated third feature map, and the updated fourth feature map.
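Since the disclosure leaves the exact attention formulations open, the following Python sketch shows only one plausible reading of the multi-attention module: a CBAM-style spatial attention that updates each branch output, followed by scaled dot-product self-attention applied across the three branches at every spatial position. The class and variable names, the 7×7 spatial-attention kernel, and the final 1×1 projection are assumptions, not the patented design.

import torch
import torch.nn as nn


class SpatialAttention(nn.Module):
    """CBAM-style spatial attention (an assumed, not stated, formulation)."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)            # channel-average map
        mx, _ = x.max(dim=1, keepdim=True)           # channel-max map
        weight = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * weight                            # spatially re-weighted map


class MultiAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.spatial = nn.ModuleList(SpatialAttention() for _ in range(3))
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, f2, f3, f4):
        # Step 1: update each branch output with spatial attention.
        updated = [sa(f) for sa, f in zip(self.spatial, (f2, f3, f4))]
        # Step 2: self-attention across the three branches; every spatial
        # position holds three "tokens", one per branch.
        n, c, h, w = updated[0].shape
        tokens = torch.stack([u.flatten(2).transpose(1, 2) for u in updated], dim=2)
        attn = torch.softmax(tokens @ tokens.transpose(-1, -2) / c ** 0.5, dim=-1)
        fused = (attn @ tokens).mean(dim=2)          # (N, H*W, C)
        fifth = fused.transpose(1, 2).reshape(n, c, h, w)
        return self.proj(fifth)                      # fifth feature map


x2, x3, x4 = (torch.randn(1, 64, 56, 56) for _ in range(3))
print(MultiAttention(64)(x2, x3, x4).shape)          # torch.Size([1, 64, 56, 56])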
Optionally, the first input module 1010 includes: the first feature map output module, wherein the first feature map output module includes: a first feature map output sub-module, and wherein the first feature map output sub-module comprises: the first interpolation module is used for performing bicubic interpolation on the feature map output by the first level encoder so as to generate a fourth sub-feature map; the second interpolation module is used for performing bicubic interpolation on the feature map output by the second-level encoder so as to generate a fifth sub-feature map; the third interpolation module is used for reducing the channel number of the feature map output by the third-level encoder to half, and performing bicubic interpolation on the generated feature map so as to generate a sixth sub-feature map; the channel reduction module is used for reducing the channel number of the feature map output by the fourth-level encoder to half so as to generate a seventh sub-feature map; and the splicing module is used for splicing the fourth sub-feature diagram, the fifth sub-feature diagram, the sixth sub-feature diagram and the seventh sub-feature diagram, and processing the spliced feature diagram based on a channel attention mechanism so as to generate a first feature diagram.
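A hedged Python sketch of the context awareness module described by the first input module is given below. The text does not state the target resolution of the bicubic interpolation, how the channel halving is realized, or which channel attention variant is used; the sketch therefore resamples every level to the spatial size of the fourth-level output, halves channels with 1×1 convolutions, uses a squeeze-and-excitation block for the channel attention, and picks illustrative channel counts.

import torch
import torch.nn as nn
import torch.nn.functional as F


class ChannelAttention(nn.Module):
    """Squeeze-and-excitation block standing in for the channel attention mechanism."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))              # per-channel weights, (N, C)
        return x * w[:, :, None, None]


class ContextAwareModule(nn.Module):
    def __init__(self, c1, c2, c3, c4):
        super().__init__()
        self.halve3 = nn.Conv2d(c3, c3 // 2, kernel_size=1)   # halve level-3 channels
        self.halve4 = nn.Conv2d(c4, c4 // 2, kernel_size=1)   # halve level-4 channels
        self.attn = ChannelAttention(c1 + c2 + c3 // 2 + c4 // 2)

    def forward(self, e1, e2, e3, e4):
        size = e4.shape[-2:]                         # assumed common target resolution

        def resample(f):
            return F.interpolate(f, size=size, mode="bicubic", align_corners=False)

        sub4 = resample(e1)                          # fourth sub-feature map
        sub5 = resample(e2)                          # fifth sub-feature map
        sub6 = resample(self.halve3(e3))             # sixth sub-feature map
        sub7 = self.halve4(e4)                       # seventh sub-feature map
        spliced = torch.cat([sub4, sub5, sub6, sub7], dim=1)
        return self.attn(spliced)                    # first feature map


e1, e2 = torch.randn(1, 32, 112, 112), torch.randn(1, 64, 56, 56)
e3, e4 = torch.randn(1, 128, 28, 28), torch.randn(1, 256, 14, 14)
print(ContextAwareModule(32, 64, 128, 256)(e1, e2, e3, e4).shape)  # (1, 288, 14, 14)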
Optionally, the apparatus 1000 further comprises: a sixth generation module for receiving the feature map input by the next-level decoder and performing a convolution operation on the received feature map using a 3×3 convolution kernel, thereby generating an eighth sub-feature map; and a seventh generation module for reducing the number of channels of the eighth sub-feature map to half based on bicubic interpolation, thereby generating a feature map for input to the previous-level decoder.
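One decoder block can likewise be sketched in a few lines of Python. The upsampling factor of the bicubic interpolation and the mechanism that halves the channels are not specified, so the factor of 2 and the 1×1 convolution below are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F


class DecoderBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv3x3 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.halve = nn.Conv2d(channels, channels // 2, kernel_size=1)  # assumed halving mechanism

    def forward(self, x):
        eighth = self.conv3x3(x)                     # eighth sub-feature map
        up = F.interpolate(eighth, scale_factor=2,   # assumed factor of 2
                           mode="bicubic", align_corners=False)
        return self.halve(up)                        # feature map for the previous-level decoder


print(DecoderBlock(256)(torch.randn(1, 256, 14, 14)).shape)  # torch.Size([1, 128, 28, 28])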
Thus, according to the present embodiment, the technical effect of capturing fine features and edge information while reducing the amount of computation and the number of parameters is achieved.
Example 3
Fig. 11 shows an apparatus 1100 for multi-attention based lightweight segmentation of images according to the first aspect of the present embodiment, which apparatus 1100 corresponds to the method according to the first aspect of embodiment 1. Referring to fig. 11, the apparatus 1100 includes: a processor 1110; and a memory 1120 coupled to the processor 1110 and configured to provide the processor 1110 with instructions for performing the following processing steps: inputting an image to be processed into an encoder module and outputting a first feature map from a context awareness module connected with the encoder module, wherein the encoder module comprises a plurality of encoders cascaded in sequence, and wherein the output end of the encoder of the previous level is connected with the input end of the encoder of the current level; inputting the first feature map to a decoder module connected to the context awareness module and outputting a segmented image corresponding to the image to be processed, wherein the encoder is further configured to: respectively inputting the feature map output by the encoder of the previous level into a first branch network, a second branch network and a third branch network, wherein the first branch network and the second branch network are one-dimensional convolution networks, and the third branch network is a combined network comprising a convolution kernel of 5×5 and a convolution kernel of 1×1; carrying out one-dimensional convolution on the received feature map by using the first branch network, and generating a second feature map; carrying out one-dimensional convolution on the received feature map by using the second branch network, and generating a third feature map; convolving the received feature map with the third branch network and generating a fourth feature map; and generating a feature map for input to a next-level encoder based on the feature map output by the previous-level encoder, the second feature map, the third feature map, and the fourth feature map.
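To make the data flow of these processing steps concrete, the following self-contained Python sketch wires simplified stand-ins for the cascaded encoders, the context awareness module, and the decoder path into one forward pass. Every component is deliberately reduced to plain convolutions, and the channel widths, downsampling factors, and class count are assumptions for illustration only; the sketch shows the wiring, not the patented modules.

import torch
import torch.nn as nn
import torch.nn.functional as F


def conv_block(cin, cout, stride=1):
    """Plain conv-BN-ReLU stand-in; not the patented encoder/decoder blocks."""
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=stride, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))


class SegmentationSketch(nn.Module):
    def __init__(self, num_classes=2, widths=(32, 64, 128, 256)):
        super().__init__()
        cin = 3
        self.encoders = nn.ModuleList()
        for w in widths:                              # each stand-in encoder halves the resolution
            self.encoders.append(conv_block(cin, w, stride=2))
            cin = w
        self.context = conv_block(sum(widths), widths[-1])   # stand-in context awareness module
        self.decoder = conv_block(widths[-1], widths[0])      # stand-in decoder path
        self.head = nn.Conv2d(widths[0], num_classes, 1)

    def forward(self, x):
        skips = []
        for enc in self.encoders:                     # cascaded encoders
            x = enc(x)
            skips.append(x)                           # every level also feeds the context module
        size = skips[-1].shape[-2:]
        fused = torch.cat([F.interpolate(s, size=size, mode="bicubic",
                                         align_corners=False) for s in skips], dim=1)
        y = self.context(fused)                       # stand-in for the first feature map
        y = F.interpolate(y, scale_factor=16, mode="bicubic", align_corners=False)
        return self.head(self.decoder(y))             # segmentation logits


print(SegmentationSketch()(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 2, 224, 224])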
Optionally, the first branch network includes a first convolution unit and a second convolution unit, and the operation of one-dimensionally convolving the received feature map with the first branch network and generating the second feature map includes: performing a row convolution operation on the received feature map by using a first convolution unit, thereby generating a first sub-feature map; and performing column convolution operation on the first sub-feature map by using a second convolution unit so as to generate a second feature map.
Optionally, the second branch network includes a third convolution unit and a fourth convolution unit, and the operation of one-dimensionally convolving the received feature map with the second branch network and generating a third feature map includes: performing column convolution operation on the received feature map by using a third convolution unit, so as to generate a second sub-feature map; and performing convolution operation on the second sub-feature map by using a fourth convolution unit so as to generate a third feature map.
Optionally, the third branch network includes a fifth convolution unit and a sixth convolution unit, and the operations of convolving the received feature map with the third branch network and generating a fourth feature map include: performing a convolution operation on the received feature map by using the 5×5 convolution kernel in the fifth convolution unit, thereby generating a third sub-feature map; and performing a convolution operation on the third sub-feature map by using the 1×1 convolution kernel in the sixth convolution unit, thereby generating a fourth feature map.
Optionally, the encoder further includes a multi-attention module connected to the first branch network, the second branch network, and the third branch network, respectively, and generating a feature map for input to the encoder of the next level based on the feature map, the second feature map, the third feature map, and the fourth feature map output by the encoder of the previous level, including: inputting the second feature map, the third feature map and the fourth feature map to the multi-attention module respectively, and outputting a fifth feature map from the multi-attention module; and splicing the feature map output by the encoder of the previous level with the fifth feature map, and outputting the feature map for input to the encoder of the next level.
Optionally, the operations of inputting the second feature map, the third feature map, and the fourth feature map to the multi-attention module, and outputting the fifth feature map from the multi-attention module, respectively, include: updating the second feature map, the third feature map and the fourth feature map respectively based on a spatial attention mechanism; and generating a fifth feature map based on the self-attention mechanism and using the updated second feature map, the updated third feature map, and the updated fourth feature map.
Optionally, the encoder module includes a first level encoder, a second level encoder, a third level encoder, and a fourth level encoder, where an output of the first level encoder, an output of the second level encoder, an output of the third level encoder, and an output of the fourth level encoder are further respectively connected to an input of the context awareness module; wherein outputting the first feature map from the context-aware module coupled to the encoder module comprises outputting the first feature map from the context-aware module coupled to the first, second, third, and fourth level encoders; wherein outputting the first feature map from a context awareness module coupled to the first, second, third, and fourth level encoders comprises: performing bicubic interpolation on the feature map output by the first-level encoder, thereby generating a fourth sub-feature map; performing bicubic interpolation on the feature map output by the second-level encoder, thereby generating a fifth sub-feature map; reducing the channel number of the feature map output by the third-level encoder to half, and performing bicubic interpolation on the generated feature map, thereby generating a sixth sub-feature map; reducing the number of channels of the feature map output by the fourth-level encoder to half, thereby generating a seventh sub-feature map; and splicing the fourth sub-feature map, the fifth sub-feature map, the sixth sub-feature map and the seventh sub-feature map, and processing the spliced feature map based on a channel attention mechanism, thereby generating a first feature map.
Optionally, the decoder module comprises a plurality of decoders cascaded in sequence, wherein an input end of the decoder of the previous level is connected to an output end of the decoder of the current level, and the decoder is further configured to perform operations comprising: receiving the feature map input by the next-level decoder, and performing a convolution operation on the received feature map using a 3×3 convolution kernel, thereby generating an eighth sub-feature map; and reducing the number of channels of the eighth sub-feature map to half based on bicubic interpolation, thereby generating a feature map for input to the previous-level decoder.
Thus, according to the present embodiment, the technical effect of capturing fine features and edge information while reducing the amount of computation and the number of parameters is achieved.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
In the foregoing embodiments of the present invention, the description of each embodiment has its own emphasis; for any portion that is not described in detail in one embodiment, reference may be made to the related descriptions of the other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed technical content may be implemented in other manners. The apparatus embodiments described above are merely exemplary; for example, the division into units is merely a logical function division, and other divisions may be used in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be implemented through certain interfaces, units, or modules, and may be electrical or take other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
The foregoing is merely a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make several modifications and refinements without departing from the principles of the present invention, and such modifications and refinements shall also fall within the scope of protection of the present invention.

Claims (8)

1. A method of multi-attention based lightweight segmentation of images, comprising:
inputting an image to be processed into an encoder module and outputting a first feature map from a context awareness module connected with the encoder module, wherein the encoder module comprises a plurality of encoders cascaded in sequence, and wherein the output end of a previous-level encoder is connected with the input end of a current-level encoder;
inputting the first feature map to a decoder module connected to the context awareness module, thereby outputting a segmented image corresponding to the image to be processed, wherein the encoder is further configured to:
respectively inputting the feature map output by the encoder of the previous level into a first branch network, a second branch network and a third branch network, wherein the first branch network and the second branch network are one-dimensional convolution networks, and the third branch network is a combined network comprising a convolution kernel of 5×5 and a convolution kernel of 1×1;
Carrying out one-dimensional convolution on the received feature map by utilizing the first branch network, and generating a second feature map;
carrying out one-dimensional convolution on the received feature map by utilizing the second branch network, and generating a third feature map;
convolving the received feature map with the third branch network and generating a fourth feature map; and
generating a feature map for input to a next-level encoder based on the feature map output by the previous-level encoder, the second feature map, the third feature map, and the fourth feature map, wherein
The plurality of encoders includes a first level encoder, a second level encoder, a third level encoder, and a fourth level encoder, wherein an output of the first level encoder, an output of the second level encoder, an output of the third level encoder, and an output of the fourth level encoder are also respectively connected to an input of the context aware module, and an operation of outputting a first feature map from the context aware module connected to the first level encoder, the second level encoder, the third level encoder, and the fourth level encoder, includes:
Performing bicubic interpolation on the feature map output by the first-level encoder, thereby generating a fourth sub-feature map;
performing bicubic interpolation on the feature map output by the second-level encoder, thereby generating a fifth sub-feature map;
reducing the channel number of the feature map output by the third-level encoder to half, and performing bicubic interpolation on the generated feature map, thereby generating a sixth sub-feature map;
reducing the number of channels of the feature map output by the fourth-level encoder to half, thereby generating a seventh sub-feature map; and
splicing the fourth sub-feature map, the fifth sub-feature map, the sixth sub-feature map and the seventh sub-feature map, and processing the spliced feature map based on a channel attention mechanism to generate the first feature map, wherein
The decoder module comprises a plurality of decoders which are sequentially cascaded, wherein the input end of the decoder of the previous level is connected with the output end of the decoder of the current level, and the decoder is further configured to execute the following operations, comprising:
receiving the feature map input by the next-level decoder, and performing a convolution operation on the received feature map using a 3×3 convolution kernel, thereby generating an eighth sub-feature map; and
Based on bicubic interpolation, the channels of the eighth sub-feature map are reduced to half, thereby generating a feature map for input to a previous-level decoder.
2. The method of claim 1, wherein the first branch network includes a first convolution unit and a second convolution unit, and wherein the operation of one-dimensionally convolving the received feature map with the first branch network and generating a second feature map includes:
performing a row convolution operation on the received feature map by using the first convolution unit, so as to generate a first sub-feature map; and
and performing column convolution operation on the first sub-feature map by using the second convolution unit so as to generate a second feature map.
3. The method of claim 2, wherein the second branch network includes a third convolution unit and a fourth convolution unit, and wherein the operation of one-dimensionally convolving the received feature map with the second branch network and generating a third feature map includes:
performing column convolution operation on the received feature map by using the third convolution unit, so as to generate a second sub-feature map; and
and carrying out convolution operation on the second sub-feature map by using the fourth convolution unit so as to generate a third feature map.
4. A method according to claim 3, wherein the third branch network comprises a fifth convolution unit and a sixth convolution unit, and wherein the operations of convolving the received feature map with the third branch network and generating a fourth feature map comprise:
performing convolution operation on the received feature map by using a 5×5 convolution kernel in the fifth convolution unit, thereby generating a third sub-feature map; and
and performing convolution operation on the third sub-feature map by using a 1×1 convolution kernel in the sixth convolution unit, thereby generating a fourth feature map.
5. The method of claim 1, wherein the encoder further comprises a multi-attention module connected to the first branch network, the second branch network, and the third branch network, respectively, and wherein generating a profile for input to a next level encoder based on the profile of a previous level encoder output, the second profile, the third profile, and the fourth profile comprises:
inputting the second feature map, the third feature map and the fourth feature map to the multi-attention module, respectively, and outputting a fifth feature map from the multi-attention module; and
splicing the feature map output by the encoder of the previous level with the fifth feature map, and outputting the feature map for input to the encoder of the next level.
6. The method of claim 5, wherein the operations of inputting the second, third, and fourth feature maps to the multi-attention module, respectively, and outputting a fifth feature map from the multi-attention module, comprise:
updating the second feature map, the third feature map and the fourth feature map respectively based on a spatial attention mechanism; and
and generating the fifth feature map based on a self-attention mechanism and by using the updated second feature map, the updated third feature map and the updated fourth feature map.
7. A storage medium comprising a stored program, wherein the method of any one of claims 1 to 6 is performed by a processor when the program is run.
8. An apparatus for multi-attention based lightweight segmentation of images, comprising:
the first input module is used for inputting an image to be processed into the encoder module and outputting a first feature map from the context awareness module connected with the encoder module, wherein the encoder module comprises a plurality of encoders cascaded in sequence, and the output end of the encoder of the previous level is connected with the input end of the encoder of the current level;
The second input module is configured to input the first feature map to a decoder module connected to the context awareness module, and output a segmented image corresponding to the image to be processed, where the encoder further includes:
the third input module is used for inputting the feature map output by the encoder of the previous level into a first branch network, a second branch network and a third branch network respectively, wherein the first branch network and the second branch network are one-dimensional convolution networks, and the third branch network is a combined network comprising a convolution kernel of 5×5 and a convolution kernel of 1×1;
the first generation module is used for carrying out one-dimensional convolution on the received feature map by utilizing the first branch network and generating a second feature map;
the second generating module is used for carrying out one-dimensional convolution on the received feature images by utilizing the second branch network and generating a third feature image;
the third generating module is used for convolving the received feature map by utilizing the third branch network and generating a fourth feature map; and
a fourth generation module for generating a feature map for input to a next-level encoder based on the feature map output by the previous-level encoder, the second feature map, the third feature map, and the fourth feature map, wherein
The first input module includes:
the first interpolation module is used for performing bicubic interpolation on the feature map output by the first level encoder so as to generate a fourth sub-feature map;
the second interpolation module is used for performing bicubic interpolation on the feature map output by the second-level encoder so as to generate a fifth sub-feature map;
the third interpolation module is used for reducing the channel number of the feature map output by the third-level encoder to half, and performing bicubic interpolation on the generated feature map so as to generate a sixth sub-feature map;
the channel reduction module is used for reducing the channel number of the feature map output by the fourth-level encoder to half so as to generate a seventh sub-feature map; and
a stitching module, configured to stitch the fourth sub-feature map, the fifth sub-feature map, the sixth sub-feature map, and the seventh sub-feature map, and process the stitched feature map based on a channel attention mechanism, thereby generating a first feature map, where
the apparatus further comprises: a sixth generation module for receiving the feature map input by the next-level decoder and performing a convolution operation on the received feature map using a 3×3 convolution kernel, thereby generating an eighth sub-feature map; and
a seventh generating module, configured to reduce the number of channels of the eighth sub-feature map to half based on bicubic interpolation, so as to generate a feature map for input to the decoder of the previous level.
CN202311466721.7A 2023-11-07 2023-11-07 Multi-attention-based lightweight image segmentation method, device and storage medium Active CN117237394B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311466721.7A CN117237394B (en) 2023-11-07 2023-11-07 Multi-attention-based lightweight image segmentation method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311466721.7A CN117237394B (en) 2023-11-07 2023-11-07 Multi-attention-based lightweight image segmentation method, device and storage medium

Publications (2)

Publication Number Publication Date
CN117237394A CN117237394A (en) 2023-12-15
CN117237394B true CN117237394B (en) 2024-02-27

Family

ID=89088293

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311466721.7A Active CN117237394B (en) 2023-11-07 2023-11-07 Multi-attention-based lightweight image segmentation method, device and storage medium

Country Status (1)

Country Link
CN (1) CN117237394B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112767407A (en) * 2021-02-02 2021-05-07 南京信息工程大学 CT image kidney tumor segmentation method based on cascade gating 3DUnet model
CN114219943A (en) * 2021-11-24 2022-03-22 华南理工大学 CT image organ-at-risk segmentation system based on deep learning
CN115546570A (en) * 2022-08-25 2022-12-30 西安交通大学医学院第二附属医院 Blood vessel image segmentation method and system based on three-dimensional depth network
US11580641B1 (en) * 2021-12-24 2023-02-14 GeneSense Technology Inc. Deep learning based methods and systems for nucleic acid sequencing
CN115861612A (en) * 2022-11-23 2023-03-28 杭州脉流科技有限公司 Method for automatically segmenting coronary vessels based on DSA (digital Signal amplification) images
CN116342584A (en) * 2023-05-15 2023-06-27 四川吉利学院 AAMC-Net-based image crack segmentation detection method and system

Also Published As

Publication number Publication date
CN117237394A (en) 2023-12-15

Similar Documents

Publication Publication Date Title
JP7093886B2 (en) Image processing methods and devices, electronic devices and storage media
CN108681743B (en) Image object recognition method and device and storage medium
CN108022212B (en) High-resolution picture generation method, generation device and storage medium
EP3754591A1 (en) Image processing method and device, storage medium and electronic device
US10846836B2 (en) View synthesis using deep convolutional neural networks
CN109858333B (en) Image processing method, image processing device, electronic equipment and computer readable medium
CN110349082B (en) Image area clipping method and device, storage medium and electronic device
CN111401318B (en) Action recognition method and device
CN111259841B (en) Image processing method and related equipment
CN113256529B (en) Image processing method, image processing device, computer equipment and storage medium
CN107341787A (en) Method, apparatus, server and the automobile that monocular panorama is parked
CN114298900A (en) Image super-resolution method and electronic equipment
CN117237394B (en) Multi-attention-based lightweight image segmentation method, device and storage medium
CN110084742A (en) A kind of disparity map prediction technique, device and electronic equipment
CN109615620A (en) The recognition methods of compression of images degree, device, equipment and computer readable storage medium
CN110827341A (en) Picture depth estimation method and device and storage medium
CN114004750A (en) Image processing method, device and system
CN114092337B (en) Method and device for super-resolution amplification of image at any scale
CN115829836A (en) Image processing method and device, electronic equipment and model training method
CN113610856A (en) Method and device for training image segmentation model and image segmentation
CN110929866B (en) Training method, device and system of neural network model
CN111553961A (en) Line draft corresponding color chart acquisition method and device, storage medium and electronic device
US20230060988A1 (en) Image processing device and method
CN113724393B (en) Three-dimensional reconstruction method, device, equipment and storage medium
CN116757933A (en) Image processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant